Files
DS-for-LA/notebooks/06_modelling_101.ipynb
2026-04-14 18:22:46 -04:00

1634 lines
64 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
{
"cells": [
{
"cell_type": "markdown",
"id": "4f44b2e6",
"metadata": {},
"source": [
"# Modelling 101: Train/Test Splits & Beyond Linear Regression\n",
"\n",
"## Introduction\n",
"\n",
"So far we have seen how linear regression (ordinary least squares) solves $\\tilde{X}\\tilde{\\beta} = y$ by minimizing $\\|y - \\tilde{X}\\tilde{\\beta}\\|_2^2$. This is a powerful tool, but real data often breaks the assumptions that make linear regression the best choice. We address several of the points made in [notebook 03](03_what_goes_wrong.ipynb).\n",
"\n",
"> **Why linear regression might not cut it:**\n",
"> - **Nonlinear relationships** The true dependency may be curved, periodic, or otherwise not linear.\n",
"> - **High dimensionality** When the number of features $p$ is close to or larger than the number of observations $n$, the matrix $\\tilde{X}^T\\tilde{X}$ becomes singular or nearly singular.\n",
"> - **Multicollinearity** Features are correlated, leading to large condition numbers and unstable coefficients.\n",
"> - **Overfitting** A complex model fits noise instead of signal, especially when $p$ is large.\n",
"> - **Outliers** The $L^2$ norm magnifies large errors, pulling the fit away from the bulk of the data.\n",
"\n",
"In this notebook we will:\n",
"- Work with a real, moderately sized dataset.\n",
"- Learn how to properly split data into training, validation, and test sets.\n",
"- Apply linear and polynomial regression, then diagnose their limitations.\n",
"- Introduce **regularisation** methods (Ridge and Lasso) from a linear algebra perspective.\n",
"- Explore **gradient descent** as a numerical optimisation alternative to the normal equations.\n",
"- Look at **decision trees and random forests** nonlinear models that can capture complex interactions without feature engineering.\n",
"- Cover **logistic regression** for classification.\n",
"- Discuss **feature scaling**, **crossvalidation**, **model interpretation**, and **hyperparameter tuning**.\n",
"\n",
"The goal is to equip the linear algebraist with practical modelling tools while maintaining a geometric / algebraic intuition.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "40f2a9ea",
"metadata": {},
"source": [
"## A Real Dataset: California Housing\n",
"\n",
"A natural next step from our toy housing example is the **California housing dataset** from `sklearn.datasets`. It contains 20,640 observations of 8 features (median income, house age, average rooms, etc.) and the target is the median house value for blocks in California. This dataset is large enough to illustrate interesting effects but small enough to run quickly.\n",
"\n",
"> **Linear algebra view**: Each observation is a row vector $x_i \\in \\mathbb{R}^8$. The features form the columns of the design matrix $X \\in \\mathbb{R}^{20640 \\times 8}$. We will add an intercept column $\\mathbb{1}$ to obtain $\\tilde{X} \\in \\mathbb{R}^{20640 \\times 9}$."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ecbbc640",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.datasets import fetch_california_housing\n",
"\n",
"# Load the data\n",
"housing = fetch_california_housing()\n",
"X = housing.data # shape (20640, 8)\n",
"y = housing.target # shape (20640,)\n",
"feature_names = housing.feature_names\n",
"\n",
"# Convert to DataFrame for convenience\n",
"df = pd.DataFrame(X, columns=feature_names)\n",
"df['MedHouseVal'] = y\n",
"\n",
"print(f\"Data shape: {df.shape}\")\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f60af719",
"metadata": {},
"outputs": [],
"source": [
"# Basic statistics\n",
"df.describe()"
]
},
{
"cell_type": "markdown",
"id": "3f3e7ab3",
"metadata": {},
"source": [
"Let's see the relationships between these features and the price."
]
},
{
"cell_type": "markdown",
"id": "3459bcb7",
"metadata": {},
"source": [
"## Train / Test Split (and Validation)\n",
"\n",
"When it comes to real world modelling, we must split our data into training and tests sets.\n",
"\n",
"> **Why split?** If we evaluate a model on the same data we used to train it, we get an overly optimistic estimate of performance. The model may have memorised the training set (overfitting). Splitting mimics a realworld scenario: we test on unseen data.\n",
"\n",
"A common workflow:\n",
"1. **Training set** (e.g., 6080%): used to fit the model parameters.\n",
"2. **Validation set** (e.g., 1020%): used to tune hyperparameters (e.g., degree of polynomial, regularisation strength).\n",
"3. **Test set** (e.g., 1020%): used only once at the end to report final performance."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f998bdb3",
"metadata": {},
"outputs": [],
"source": [
"# Illustrate the three-way split\n",
"fig, ax = plt.subplots(figsize=(12, 3))\n",
"\n",
"# Create rectangles for each split\n",
"ax.barh(0, 60, left=0, height=0.5, color='blue', alpha=0.7, label='Training (60%)')\n",
"ax.barh(0, 20, left=60, height=0.5, color='orange', alpha=0.7, label='Validation (20%)')\n",
"ax.barh(0, 20, left=80, height=0.5, color='red', alpha=0.7, label='Test (20%)')\n",
"\n",
"# Add labels\n",
"ax.text(30, 0, 'Train Model\\nParameters', ha='center', va='center', fontsize=10, fontweight='bold')\n",
"ax.text(70, 0, 'Tune\\nHyperparams', ha='center', va='center', fontsize=10, fontweight='bold')\n",
"ax.text(90, 0, 'Final\\nEvaluation', ha='center', va='center', fontsize=10, fontweight='bold')\n",
"\n",
"ax.set_xlim(0, 100)\n",
"ax.set_ylim(-0.5, 0.5)\n",
"ax.set_xlabel('Percentage of Data')\n",
"ax.set_yticks([])\n",
"ax.set_title('Train/Validation/Test Split')\n",
"ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=3)\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig('../images/train_validation_test_split.png')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "022d6da5",
"metadata": {},
"source": [
"\n",
"Let us first fix a random state."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "79e9ce46",
"metadata": {},
"outputs": [],
"source": [
"RANDOM_STATE=3"
]
},
{
"cell_type": "markdown",
"id": "3da87dd2",
"metadata": {},
"source": [
"Let's visualize this."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27eece10",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Generate synthetic data to illustrate the concept\n",
"np.random.seed(3)\n",
"n = 50\n",
"X = np.random.uniform(-5,5,n) # synthetic, wider range\n",
"\n",
"# True relationship\n",
"a_true = 2.0\n",
"c_true = 5.0\n",
"noise = np.random.normal(0,3,n)\n",
"\n",
"y = a_true * X**2 + c_true + noise\n",
"\n",
"# Perform train/test split\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X,y, test_size=0.3, random_state=3\n",
")\n",
"\n",
"# Sort for plotting\n",
"X_curve = np.linspace(X.min(), X. max())\n",
"y_true = a_true * X_curve**2 + c_true\n",
"\n",
"\n",
"\n",
"# Plot\n",
"fig, ax = plt.subplots(figsize=(10,6))\n",
"ax.scatter(X_train, y_train, color='blue', s=50, label='Training data', zorder=3)\n",
"ax.scatter(X_test, y_test, color='red', s=50, label='Test data', zorder=3)\n",
"ax.plot(X_curve, y_true, linewidth=2, label='True relationship', alpha=0.7)\n",
"\n",
"ax.set_xlabel('X')\n",
"ax.set_ylabel('y')\n",
"ax.set_title('Train/Test Split')\n",
"ax.legend()\n",
"ax.grid(True, alpha=0.3)\n",
"plt.tight_layout()\n",
"plt.savefig('../images/train_test_split_illustration.png')\n",
"plt.show()\n",
"\n",
"print(f\"Total samples: {n}\")\n",
"print(f\"Training samples: {len(X_train)} ({len(X_train)/n*100:.0f}%)\")\n",
"print(f\"Test samples: {len(X_test)} ({len(X_test)/n*100:.0f}%)\")"
]
},
{
"cell_type": "markdown",
"id": "0913400b",
"metadata": {},
"source": [
"\n",
"Back to the housing data. We will use `sklearn.model_selection.train_test_split` to create two splits (train+validation vs. test), then further split the train+validation part if needed. For simplicity we will first do a single train/test split and use crossvalidation later."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70a2ca47",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Separate features and target\n",
"X = df[feature_names].values\n",
"y = df['MedHouseVal'].values\n",
"\n",
"# Split: 80% train, 20% test\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)\n",
"\n",
"print(f\"Training set size: {X_train.shape[0]}\")\n",
"print(f\"Test set size: {X_test.shape[0]}\")"
]
},
{
"cell_type": "markdown",
"id": "05795576",
"metadata": {},
"source": [
"With that, let's see the relationship between the features and the price."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "195267d8",
"metadata": {},
"outputs": [],
"source": [
"# Visualize relationships between features and target\n",
"fig, axes = plt.subplots(2, 4, figsize=(16, 8))\n",
"axes = axes.flatten()\n",
"\n",
"for i, (name, ax) in enumerate(zip(feature_names, axes)):\n",
" ax.scatter(X_train[:, i], y_train, alpha=0.1, s=1)\n",
" ax.set_xlabel(name)\n",
" ax.set_ylabel('MedHouseVal')\n",
" ax.set_title(f'{name} vs Price')\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig('../images/california_housing_scatter.png')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "edfe258e",
"metadata": {},
"source": [
"## Linear Regression in Practice\n",
"\n",
"We can use `sklearn.linear_model.LinearRegression`, which internally solves the normal equations using either a direct solver or an SVDbased approach (the `lstsq` method we saw earlier).\n",
"\n",
"> **Linear algebra reminder**: The leastsquares solution minimises $\\|y - X\\beta\\|_2^2$. The closed form is $\\beta = (X^T X)^{-1} X^T y$ when $X$ has full column rank."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f22373f2",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"# Fit linear regression\n",
"lin_reg = LinearRegression()\n",
"lin_reg.fit(X_train, y_train)\n",
"\n",
"# Predict on train and test\n",
"y_train_pred = lin_reg.predict(X_train)\n",
"y_test_pred = lin_reg.predict(X_test)\n",
"\n",
"# Evaluate\n",
"train_mse = mean_squared_error(y_train, y_train_pred)\n",
"test_mse = mean_squared_error(y_test, y_test_pred)\n",
"train_r2 = r2_score(y_train, y_train_pred)\n",
"test_r2 = r2_score(y_test, y_test_pred)\n",
"\n",
"print(f\"Train MSE: {train_mse:.4f}, Train R²: {train_r2:.4f}\")\n",
"print(f\"Test MSE: {test_mse:.4f}, Test R²: {test_r2:.4f}\")"
]
},
{
"cell_type": "markdown",
"id": "62e8cec4",
"metadata": {},
"source": [
"The test $R^2$ is respectable (~0.6), but perhaps we can do better with a more flexible model. However, simply adding polynomial features might lead to overfitting. Let's examine that.\n",
"\n",
"## Polynomial Regression and the Danger of Overfitting\n",
"\n",
"Polynomial regression creates new features by taking powers of the original features. For example, with one feature $x$, a degree2 model uses $[1, x, x^2]$. For multiple features, we can include interaction terms.\n",
"\n",
"> **Linear algebra view**: The Vandermonde matrix (for one feature) or its multivariate generalisation becomes the new design matrix. As degree increases, the condition number often explodes, leading to numerical instability and wild coefficients a sign of overfitting.\n",
"\n",
"Let's illustrate underfitting and overfitting on synthetic data before moving to the housing dataset."
]
},
{
"cell_type": "markdown",
"id": "2e6f26c6",
"metadata": {},
"source": [
"### Illustration: Underfitting vs Overfitting\n",
"\n",
"We generate data from a quadratic function with noise, then fit polynomials of different degrees."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14a7a408",
"metadata": {},
"outputs": [],
"source": [
"# Generate quadratic data (similar to notebook 02) using distinct names\n",
"np.random.seed(3)\n",
"n_synth = 50\n",
"x_synth = np.random.uniform(-5, 5, n_synth)\n",
"y_true_synth = 2.0 * x_synth**2 + 5.0\n",
"noise_synth = np.random.normal(0, 3, n_synth)\n",
"y_synth = y_true_synth + noise_synth\n",
"\n",
"# Fit polynomials of degree 1 (underfit), 2 (good), 11 (overfit)\n",
"degrees = [1, 2, 11]\n",
"x_plot = np.linspace(-5, 5, 200)\n",
"\n",
"fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
"for idx, d in enumerate(degrees):\n",
" coeff = np.polyfit(x_synth, y_synth, d)\n",
" p = np.poly1d(coeff)\n",
" axes[idx].scatter(x_synth, y_synth, alpha=0.7, label='Data')\n",
" axes[idx].plot(x_plot, p(x_plot), 'r-', linewidth=2, label=f'Degree {d}')\n",
" axes[idx].set_title(f'Degree {d} fit')\n",
" axes[idx].set_xlabel('x')\n",
" axes[idx].set_ylabel('y')\n",
" axes[idx].legend()\n",
" axes[idx].grid(True)\n",
"plt.tight_layout()\n",
"plt.savefig('../images/underfitting_vs_overfitting.png')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "1f79cb7d",
"metadata": {},
"source": [
"- **Degree 1 (underfitting)**: The linear model cannot capture the curvature, resulting in high bias.\n",
"- **Degree 2 (good)**: The quadratic model matches the true underlying structure.\n",
"- **Degree 11 (overfitting)**: The polynomial oscillates wildly to fit the noise, leading to poor generalisation.\n",
"\n",
"Now back to the housing dataset. Let's create polynomial features and see the effect on condition number and test error."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a1c23b1",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"# Create polynomial features of degree 2 (includes interactions)\n",
"poly = PolynomialFeatures(degree=2, include_bias=False)\n",
"X_train_poly = poly.fit_transform(X_train)\n",
"X_test_poly = poly.transform(X_test)\n",
"\n",
"print(f\"Original training features: {X_train.shape[1]}\")\n",
"print(f\"Polynomial training features: {X_train_poly.shape[1]}\")\n",
"\n",
"# Condition number of the augmented polynomial design matrix (with intercept added later)\n",
"from numpy.linalg import cond\n",
"\n",
"X_train_poly_with_intercept = np.hstack([np.ones((X_train_poly.shape[0], 1)), X_train_poly])\n",
"print(f\"Condition number of polynomial design matrix: {cond(X_train_poly_with_intercept):.2e}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70ce5172",
"metadata": {},
"outputs": [],
"source": [
"# Fit linear regression on polynomial features\n",
"poly_reg = LinearRegression()\n",
"poly_reg.fit(X_train_poly, y_train)\n",
"\n",
"y_train_pred_poly = poly_reg.predict(X_train_poly)\n",
"y_test_pred_poly = poly_reg.predict(X_test_poly)\n",
"\n",
"train_mse_poly = mean_squared_error(y_train, y_train_pred_poly)\n",
"test_mse_poly = mean_squared_error(y_test, y_test_pred_poly)\n",
"\n",
"print(f\"Polynomial (deg=2) Train MSE: {train_mse_poly:.4f}\")\n",
"print(f\"Polynomial (deg=2) Test MSE: {test_mse_poly:.4f}\")"
]
},
{
"cell_type": "markdown",
"id": "b8e00866",
"metadata": {},
"source": [
"The test error is **worse** than the linear model this is a clear sign of overfitting. The model is too flexible and fits noise in the training data. We need **regularisation**.\n",
"\n",
"## Ridge Regression ($L^2$ Regularisation)\n",
"\n",
"Ridge regression adds a penalty on the squared $L^2$ norm of the coefficient vector:\n",
"\n",
"$$\n",
"\\min_{\\beta} \\|y - X\\beta\\|_2^2 + \\lambda \\|\\beta\\|_2^2\n",
"$$\n",
"\n",
"where $\\lambda \\ge 0$ is the regularisation strength.\n",
"\n",
"> **Linear algebra interpretation**: The normal equations become $(X^T X + \\lambda I)\\beta = X^T y$. Adding $\\lambda I$ to $X^T X$ increases all eigenvalues by $\\lambda$, thereby improving the condition number and making the problem wellposed even when $X^T X$ is singular. This is a form of **Tikhonov regularisation**. \n",
"> This directly shifts the eigenvalues (and singular values) of $X^TX$. \n"
]
},
{
"cell_type": "markdown",
"id": "b41e9e16",
"metadata": {},
"source": [
"\n",
"\n",
"Ridge regression shrinks coefficients towards zero but rarely makes them exactly zero. It is especially useful when features are correlated (multicollinearity)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9fa1c509",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import Ridge\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# We'll use the polynomial features because ridge can help with overfitting\n",
"# Choose lambda via cross-validation on the training set\n",
"alphas = np.logspace(-3, 3, 20)\n",
"cv_scores = []\n",
"\n",
"for alpha in alphas:\n",
" ridge = Ridge(alpha=alpha)\n",
" # 5-fold cross-validation, negative MSE (scoring expects higher = better)\n",
" scores = cross_val_score(ridge, X_train_poly, y_train, cv=5, scoring='neg_mean_squared_error')\n",
" cv_scores.append(-scores.mean())\n",
"\n",
"best_alpha = alphas[np.argmin(cv_scores)]\n",
"print(f\"Best alpha from CV: {best_alpha:.4f}\")\n",
"\n",
"# Plot CV error vs alpha\n",
"plt.figure(figsize=(8,4))\n",
"plt.semilogx(alphas, cv_scores)\n",
"plt.xlabel('alpha (λ)')\n",
"plt.ylabel('Cross-validated MSE')\n",
"plt.title('Ridge Regularisation on Polynomial Features')\n",
"plt.grid(True)\n",
"plt.savefig('../images/ridge_regularization_polynomial_features_unscaled.png')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "b0c512e8",
"metadata": {},
"source": [
"You'll notice we are getting a bunch of errors about ill-conditioned matrices. This happens because the polynomial features are on wildly different scales. Let's standardize our features first. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97aaaeba",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import Ridge\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Add scaler\n",
"scaler = StandardScaler()\n",
"X_train_poly_scaled = scaler.fit_transform(X_train_poly)\n",
"X_test_poly_scaled = scaler.transform(X_test_poly)\n",
"\n",
"# We'll use the polynomial features because ridge can help with overfitting\n",
"# Choose lambda via cross-validation on the training set\n",
"alphas = np.logspace(-3, 3, 20)\n",
"cv_scores = []\n",
"\n",
"for alpha in alphas:\n",
" ridge = Ridge(alpha=alpha)\n",
" scores = cross_val_score(ridge, X_train_poly_scaled, y_train, cv=5, scoring='neg_mean_squared_error')\n",
" cv_scores.append(-scores.mean())\n",
"\n",
"best_alpha = alphas[np.argmin(cv_scores)]\n",
"print(f\"Best alpha from CV: {best_alpha:.4f}\")\n",
"\n",
"# Plot CV error vs alpha\n",
"plt.figure(figsize=(8,4))\n",
"plt.semilogx(alphas, cv_scores)\n",
"plt.xlabel('alpha (λ)')\n",
"plt.ylabel('Cross-validated MSE')\n",
"plt.title('Ridge Regularisation on Polynomial Features')\n",
"plt.grid(True)\n",
"plt.savefig('../images/ridge_regularization_polynomial_features_scaled.png')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "99c974ca",
"metadata": {},
"outputs": [],
"source": [
"# Fit ridge with best alpha on SCALED polynomial features\n",
"ridge_best = Ridge(alpha=best_alpha)\n",
"ridge_best.fit(X_train_poly_scaled, y_train)\n",
"\n",
"y_test_pred_ridge = ridge_best.predict(X_test_poly_scaled)\n",
"test_mse_ridge = mean_squared_error(y_test, y_test_pred_ridge)\n",
"print(f\"Ridge (poly deg=2) Test MSE: {test_mse_ridge:.4f}\")\n",
"print(f\"Ridge improved over plain polynomial (MSE {test_mse_poly:.4f} -> {test_mse_ridge:.4f})\")"
]
},
{
"cell_type": "markdown",
"id": "100ef094",
"metadata": {},
"source": [
"### Ridge from the SVD Perspective\n",
"\n",
"The Ridge solution has a beautiful interpretation in terms of singular values. Recall from Notebook 2 that if $X = U\\Sigma V^T$ is the SVD of the (centered) design matrix, then the OLS solution is\n",
"\n",
"$$\n",
"\\hat{\\beta}_{OLS} = V\\Sigma^{-1}U^T y = \\sum_{i=1}^{p} \\frac{1}{\\sigma_i} (u_i^T y) v_i.\n",
"$$\n",
"\n",
"When $\\sigma_i$ is small, the coefficient $\\frac{1}{\\sigma_i}$ explodes — this is the condition number problem.\n",
"\n",
"For Ridge regression, one can show that\n",
"\n",
"$$\n",
"\\hat{\\beta}_{Ridge} = \\sum_{i=1}^{p} \\frac{\\sigma_i}{\\sigma_i^2 + \\lambda} (u_i^T y) v_i.\n",
"$$\n",
"\n",
"Notice what happens:\n",
"- When $\\sigma_i \\gg \\sqrt{\\lambda}$, the coefficient is approximately $\\frac{1}{\\sigma_i}$ (same as OLS).\n",
"- When $\\sigma_i \\ll \\sqrt{\\lambda}$, the coefficient is approximately $\\frac{\\sigma_i}{\\lambda}$ — **shrunk towards zero**.\n",
"- The effective condition number becomes $\\frac{\\sigma_1^2 + \\lambda}{\\sigma_p^2 + \\lambda}$, which is much better than $\\frac{\\sigma_1^2}{\\sigma_p^2}$.\n",
"\n",
"This is why Ridge helps with multicollinearity: it dampens precisely those directions that were poorly determined."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c438ebc5",
"metadata": {},
"outputs": [],
"source": [
"# Visualize how Ridge shrinks coefficients relative to singular values\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Use scaled data for clean SVD interpretation\n",
"scaler = StandardScaler()\n",
"X_train_scaled = scaler.fit_transform(X_train)\n",
"\n",
"# Compute SVD of centered design matrix\n",
"U, s, Vt = np.linalg.svd(X_train_scaled, full_matrices=False)\n",
"\n",
"# For different lambda values, compute the \"shrinkage factor\" for each singular direction\n",
"lambdas = [0, 0.1, 1, 10, 100]\n",
"\n",
"plt.figure(figsize=(10, 5))\n",
"\n",
"for lam in lambdas:\n",
" if lam == 0:\n",
" # OLS: no shrinkage\n",
" shrinkage = np.ones_like(s)\n",
" label = 'OLS (λ=0)'\n",
" else:\n",
" # Ridge shrinkage factor: sigma / (sigma^2 + lambda)\n",
" shrinkage = s / (s**2 + lam)\n",
" # Normalize so we can compare shapes\n",
" shrinkage = shrinkage / shrinkage[0] # normalize to first component\n",
" label = f'Ridge (λ={lam})'\n",
" \n",
" plt.plot(range(1, len(s)+1), shrinkage, 'o-', label=label, markersize=8)\n",
"\n",
"plt.xlabel('Singular value index (decreasing)')\n",
"plt.ylabel('Shrinkage factor (normalized)')\n",
"plt.title('Ridge Shrinkage: How λ Dampens Small Singular Directions')\n",
"plt.legend()\n",
"plt.grid(True, alpha=0.3)\n",
"plt.xticks(range(1, len(s)+1))\n",
"plt.tight_layout()\n",
"plt.savefig('../images/ridge_svd_shrinkage.png')\n",
"plt.show()\n",
"\n",
"# Show condition number improvement\n",
"print(\"Singular values:\", s.round(2))\n",
"print(f\"\\nCondition number (OLS): {s[0]/s[-1]:.2f}\")\n",
"for lam in [0.1, 1, 10]:\n",
" effective_cond = (s[0]**2 + lam) / (s[-1]**2 + lam)\n",
" print(f\"Effective condition number (λ={lam}): {effective_cond:.2f}\")"
]
},
{
"cell_type": "markdown",
"id": "50490dca",
"metadata": {},
"source": [
"## Lasso Regression ($L^1$ Regularisation)\n",
"\n",
"Lasso replaces the $L^2$ penalty with an $L^1$ penalty:\n",
"\n",
"$$\n",
"\\min_{\\beta} \\|y - X\\beta\\|_2^2 + \\lambda \\|\\beta\\|_1\n",
"$$\n",
"\n",
"> **Geometric intuition**: The $L^1$ ball is a diamond (in $\\mathbb{R}^2$). The intersection of the quadratic loss contours with this diamond often occurs at a corner, forcing some coefficients to be **exactly zero**. Thus Lasso performs **feature selection**."
]
},
{
"cell_type": "markdown",
"id": "c8638a17",
"metadata": {},
"source": [
"\n",
"\n",
"Lasso is useful when we suspect that only a few features are truly relevant, especially in highdimensional settings. However, it does not have a closedform solution; it is typically solved via coordinate descent or other optimisation algorithms."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4ebe0612",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import Lasso\n",
"\n",
"# Lasso also requires tuning of alpha\n",
"lasso = Lasso(alpha=0.01, max_iter=10000) # start with a small alpha\n",
"lasso.fit(X_train_poly, y_train)\n",
"\n",
"# Count non-zero coefficients\n",
"n_nonzero = np.sum(np.abs(lasso.coef_) > 1e-10)\n",
"print(f\"Number of non-zero coefficients: {n_nonzero} out of {len(lasso.coef_)}\")\n",
"\n",
"y_test_pred_lasso = lasso.predict(X_test_poly)\n",
"test_mse_lasso = mean_squared_error(y_test, y_test_pred_lasso)\n",
"print(f\"Lasso (poly deg=2) Test MSE: {test_mse_lasso:.4f}\")\n",
"\n",
"# Cross-validation for Lasso alpha\n",
"from sklearn.linear_model import LassoCV\n",
"\n",
"lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 30), cv=5, max_iter=10000, random_state=RANDOM_STATE)\n",
"lasso_cv.fit(X_train_poly, y_train)\n",
"print(f\"Best alpha from LassoCV: {lasso_cv.alpha_:.4f}\")\n",
"print(f\"Number of non-zero coefficients (CV best): {np.sum(np.abs(lasso_cv.coef_) > 1e-10)}\")\n",
"\n",
"y_test_pred_lasso_cv = lasso_cv.predict(X_test_poly)\n",
"test_mse_lasso_cv = mean_squared_error(y_test, y_test_pred_lasso_cv)\n",
"print(f\"LassoCV Test MSE: {test_mse_lasso_cv:.4f}\")"
]
},
{
"cell_type": "markdown",
"id": "48be5c6e",
"metadata": {},
"source": [
"Again, there are unscaled polynomial features, so we get convergence warnings. LASSO is sensivitve to scalling becuase the penalty treats all coefficients equally. We also get a suggestion to increase the number of iterations. \n",
"\n",
"Let's fix this. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "02c2cdfa",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import Lasso, LassoCV\n",
"\n",
"# Lasso with more iterations\n",
"lasso = Lasso(alpha=0.01, max_iter=100000, tol=1e-4)\n",
"lasso.fit(X_train_poly_scaled, y_train)\n",
"\n",
"# Count non-zero coefficients\n",
"n_nonzero = np.sum(np.abs(lasso.coef_) > 1e-10)\n",
"print(f\"Number of non-zero coefficients: {n_nonzero} out of {len(lasso.coef_)}\")\n",
"\n",
"y_test_pred_lasso = lasso.predict(X_test_poly_scaled)\n",
"test_mse_lasso = mean_squared_error(y_test, y_test_pred_lasso)\n",
"print(f\"Lasso Test MSE: {test_mse_lasso:.4f}\")\n",
"\n",
"# Cross-validation for Lasso alpha\n",
"lasso_cv = LassoCV(\n",
" alphas=np.logspace(-3, 1, 30), \n",
" cv=5, \n",
" max_iter=100000, \n",
" tol=1e-4,\n",
" random_state=RANDOM_STATE\n",
")\n",
"lasso_cv.fit(X_train_poly_scaled, y_train)\n",
"print(f\"Best alpha from LassoCV: {lasso_cv.alpha_:.4f}\")\n",
"print(f\"Number of non-zero coefficients (CV best): {np.sum(np.abs(lasso_cv.coef_) > 1e-10)}\")\n",
"\n",
"y_test_pred_lasso_cv = lasso_cv.predict(X_test_poly_scaled)\n",
"test_mse_lasso_cv = mean_squared_error(y_test, y_test_pred_lasso_cv)\n",
"print(f\"LassoCV Test MSE: {test_mse_lasso_cv:.4f}\")"
]
},
{
"cell_type": "markdown",
"id": "b6fcd6ba",
"metadata": {},
"source": [
"### Why Lasso Produces Sparse Solutions: The L¹ Geometry\n",
"\n",
"Recall from Notebook 3 that the $L^1$ unit ball is a diamond (a rotated square in $\\mathbb{R}^1$). This geometric fact is precisely why Lasso tends to produce coefficients that are **exactly zero**.\n",
"\n",
"Consider the constrained form of the problem:\n",
"\n",
"$$\n",
"\\min_{\\beta} \\|y - X\\beta\\|_2^2 \\quad \\text{subject to} \\quad \\|\\beta\\|_1 \\leq t.\n",
"$$\n",
"\n",
"The constraint region is the $L^1$ ball — a diamond with corners on the axes. The contours of the loss function $\\|y - X\\beta\\|_2^2$ are ellipses centered at the OLS solution.\n",
"\n",
"**Key insight**: When an elliptical contour expands and first touches the diamond, it often hits a **corner**. Corners lie on the axes, meaning some coefficients are exactly zero.\n",
"\n",
"This is in contrast to Ridge, where the constraint region is a ball (circle in $\\mathbb{R}^1$), and the first contact is typically at a smooth point — coefficients are shrunk but rarely zero."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca02a5f4",
"metadata": {},
"outputs": [],
"source": [
"# Visualize L1 vs L2 constraint regions and why Lasso gives sparsity\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n",
"\n",
"# L1 ball (diamond)\n",
"theta = np.linspace(0, 2*np.pi, 100)\n",
"r = 1\n",
"\n",
"# L1 ball vertices\n",
"l1_x = [r, 0, -r, 0, r]\n",
"l1_y = [0, r, 0, -r, 0]\n",
"\n",
"# L2 ball (circle)\n",
"l2_x = r * np.cos(theta)\n",
"l2_y = r * np.sin(theta)\n",
"\n",
"# Simulated loss contours (ellipses centered away from origin)\n",
"# The OLS solution is at some point (beta1_ols, beta2_ols)\n",
"beta_ols = np.array([0.7, 0.3])\n",
"\n",
"for idx, (ax, ball_type) in enumerate(zip(axes, ['Lasso (L¹)', 'Ridge (L²)'])):\n",
" # Draw constraint region\n",
" if idx == 0: # Lasso - L1 ball\n",
" ax.fill(l1_x, l1_y, alpha=0.3, color='blue', label='L¹ constraint region')\n",
" ax.plot(l1_x, l1_y, 'b-', linewidth=2)\n",
" else: # Ridge - L2 ball\n",
" ax.fill(l2_x, l2_y, alpha=0.3, color='green', label='L² constraint region')\n",
" ax.plot(l2_x, l2_y, 'g-', linewidth=2)\n",
" \n",
" # Draw loss contours (ellipses)\n",
" # Simplified: concentric ellipses around OLS solution\n",
" for scale in [0.3, 0.5, 0.7, 1.0]:\n",
" ellipse_x = beta_ols[0] + scale * 0.4 * np.cos(theta)\n",
" ellipse_y = beta_ols[1] + scale * 0.2 * np.sin(theta)\n",
" ax.plot(ellipse_x, ellipse_y, 'r--', alpha=0.5, linewidth=1)\n",
" \n",
" # Mark OLS solution\n",
" ax.scatter(*beta_ols, color='red', s=100, zorder=5, label='OLS solution')\n",
" \n",
" # Mark the \"first contact\" point (approximate)\n",
" if idx == 0: # Lasso hits corner\n",
" contact = np.array([1.0, 0.0]) # on the axis!\n",
" ax.scatter(*contact, color='purple', s=150, marker='*', zorder=6, label='Lasso solution (sparse!)')\n",
" else: # Ridge hits smooth part\n",
" contact = np.array([0.85, 0.35]) # not on axis\n",
" ax.scatter(*contact, color='purple', s=150, marker='*', zorder=6, label='Ridge solution')\n",
" \n",
" ax.set_xlim(-1.5, 1.5)\n",
" ax.set_ylim(-1.5, 1.5)\n",
" ax.set_xlabel(r'$\\beta_1$')\n",
" ax.set_ylabel(r'$\\beta_2$')\n",
" ax.set_title(f'{ball_type} Constraint')\n",
" ax.legend(loc='upper right', fontsize=9)\n",
" ax.set_aspect('equal')\n",
" ax.grid(True, alpha=0.3)\n",
" ax.axhline(0, color='k', linewidth=0.5)\n",
" ax.axvline(0, color='k', linewidth=0.5)\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig('../images/lasso_vs_ridge_geometry.png')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "6ce05c85",
"metadata": {},
"source": [
"## Principal Component Regression (PCR)\n",
"\n",
"Principal Component Regression combines the dimensionality reduction from Notebook 4 with linear regression. The idea is simple:\n",
"\n",
"1. Compute the principal components of $X$ (via SVD on centered data).\n",
"2. Keep only the top $k$ components (those with largest singular values).\n",
"3. Regress $y$ on these $k$ components.\n",
"\n",
"**Linear algebra perspective**: We project $X$ onto its best rank-$k$ approximation (in Frobenius norm) and then solve a least-squares problem in the reduced space. This is different from Ridge:\n",
"- Ridge **shrinks** all directions but keeps them.\n",
"- PCR **discards** the smallest singular directions entirely.\n",
"\n",
"PCR is particularly useful when:\n",
"- Features are highly correlated (multicollinearity).\n",
"- You want interpretable, low-dimensional representations.\n",
"- The signal lives in the top principal components while noise dominates the rest.\n",
"\n",
"The tradeoff: if the target $y$ is correlated with a small singular direction, PCR will discard useful information."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46adb22c",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# Compare PCR with varying number of components\n",
"n_components_range = range(1, X_train_scaled.shape[1] + 1)\n",
"pcr_scores = []\n",
"\n",
"for n_comp in n_components_range:\n",
" pcr = make_pipeline(\n",
" PCA(n_components=n_comp),\n",
" LinearRegression()\n",
" )\n",
" # Negative MSE (sklearn convention: higher is better)\n",
" scores = cross_val_score(pcr, X_train_scaled, y_train, cv=5, scoring='neg_mean_squared_error')\n",
" pcr_scores.append(-scores.mean())\n",
"\n",
"# Also compute variance explained\n",
"pca_full = PCA()\n",
"pca_full.fit(X_train_scaled)\n",
"var_explained = np.cumsum(pca_full.explained_variance_ratio_)\n",
"\n",
"# Plot\n",
"fig, ax1 = plt.subplots(figsize=(10, 5))\n",
"\n",
"ax1.plot(n_components_range, pcr_scores, 'b-o', label='CV MSE')\n",
"ax1.set_xlabel('Number of Principal Components')\n",
"ax1.set_ylabel('Cross-Validated MSE', color='b')\n",
"ax1.tick_params(axis='y', labelcolor='b')\n",
"\n",
"ax2 = ax1.twinx()\n",
"ax2.plot(n_components_range, var_explained, 'r--s', label='Variance Explained')\n",
"ax2.set_ylabel('Cumulative Variance Explained', color='r')\n",
"ax2.tick_params(axis='y', labelcolor='r')\n",
"ax2.set_ylim(0, 1.05)\n",
"\n",
"plt.title('Principal Component Regression: Choosing k')\n",
"fig.legend(loc='center right', bbox_to_anchor=(0.85, 0.5))\n",
"plt.grid(True, alpha=0.3)\n",
"plt.tight_layout()\n",
"plt.savefig('../images/pcr_components_selection.png')\n",
"plt.show()\n",
"\n",
"# Best number of components\n",
"best_n_comp = n_components_range[np.argmin(pcr_scores)]\n",
"print(f\"Best number of components: {best_n_comp}\")\n",
"print(f\"Variance explained: {var_explained[best_n_comp-1]:.2%}\")\n",
"\n",
"# Compare with OLS and Ridge\n",
"print(f\"\\nModel Comparison (Test MSE):\")\n",
"print(f\" OLS (all features): {test_mse:.4f}\")\n",
"print(f\" PCR (k={best_n_comp}): {pcr_scores[best_n_comp-1]:.4f}\")\n",
"print(f\" Ridge (λ={best_alpha:.2f}): {test_mse_ridge:.4f}\")"
]
},
{
"cell_type": "markdown",
"id": "f67d60bc",
"metadata": {},
"source": [
"## Gradient Descent: When the Normal Equations Are Not Enough\n",
"\n",
"For very large datasets, computing $(X^T X)^{-1}$ or even forming $X^T X$ becomes prohibitive. **Gradient descent** is an iterative optimisation method that uses only firstorder derivatives.\n",
"\n",
"### The Linear Algebra of Convergence\n",
"\n",
"The loss function and its gradient:\n",
"\n",
"$$\n",
"L(\\beta) = \\frac{1}{2n}\\|y - X\\beta\\|_2^2, \\qquad \\nabla L(\\beta) = -\\frac{1}{n} X^T (y - X\\beta).\n",
"$$\n",
"\n",
"Starting from $\\beta^{(0)}$, we update:\n",
"\n",
"$$\n",
"\\beta^{(t+1)} = \\beta^{(t)} - \\eta \\nabla L(\\beta^{(t)}).\n",
"$$\n",
"\n",
"**Convergence depends on the eigenvalues of $X^T X$.** Let $\\lambda_{\\max}$ and $\\lambda_{\\min}$ be the largest and smallest eigenvalues. Then:\n",
"- The learning rate must satisfy $\\eta < \\frac{2}{\\lambda_{\\max}}$ for convergence.\n",
"- The convergence rate is governed by the **condition number** $\\kappa = \\frac{\\lambda_{\\max}}{\\lambda_{\\min}}$.\n",
"- When $\\kappa$ is large, gradients point in \"wrong\" directions — the loss surface is a narrow valley.\n",
"\n",
"This is why feature scaling matters: it reduces $\\kappa$, making the loss surface more spherical and convergence faster."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4135ebb7",
"metadata": {},
"outputs": [],
"source": [
"# Demonstrate how condition number affects gradient descent convergence\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"def gradient_descent_linear(X, y, learning_rate=0.01, n_iter=1000, verbose=False):\n",
" \"\"\"Batch gradient descent for linear regression.\"\"\"\n",
" n, p = X.shape\n",
" beta = np.zeros(p)\n",
" losses = []\n",
" for i in range(n_iter):\n",
" residual = y - X @ beta\n",
" grad = - (1/n) * X.T @ residual\n",
" beta -= learning_rate * grad\n",
" loss = (1/(2*n)) * np.linalg.norm(residual)**2\n",
" losses.append(loss)\n",
" return beta, losses\n",
"\n",
"# Use a subset for illustration\n",
"X_subset = X_train[:1000]\n",
"y_subset = y_train[:1000]\n",
"\n",
"# Add intercept\n",
"X_subset_aug = np.hstack([np.ones((X_subset.shape[0], 1)), X_subset])\n",
"\n",
"# Compute eigenvalues of X^T X\n",
"eigenvalues = np.linalg.eigvalsh(X_subset_aug.T @ X_subset_aug)\n",
"lambda_max, lambda_min = eigenvalues.max(), eigenvalues[eigenvalues > 1e-10].min()\n",
"cond_num = lambda_max / lambda_min\n",
"\n",
"print(f\"Eigenvalue range: [{lambda_min:.2e}, {lambda_max:.2e}]\")\n",
"print(f\"Condition number: {cond_num:.2e}\")\n",
"print(f\"Max stable learning rate: {2/lambda_max:.2e}\")\n",
"\n",
"# Try gradient descent with different learning rates on UNSCALED data\n",
"fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
"\n",
"# UNSCALED\n",
"learning_rates = [1e-10, 1e-9, 1e-8]\n",
"for lr in learning_rates:\n",
" _, losses = gradient_descent_linear(X_subset_aug, y_subset, learning_rate=lr, n_iter=200)\n",
" axes[0].plot(losses, label=f'η = {lr:.0e}')\n",
"axes[0].set_xlabel('Iteration')\n",
"axes[0].set_ylabel('Loss (MSE)')\n",
"axes[0].set_title(f'Unscaled Data (κ = {cond_num:.1e})')\n",
"axes[0].legend()\n",
"axes[0].grid(True, alpha=0.3)\n",
"axes[0].set_yscale('log')\n",
"\n",
"# SCALED\n",
"scaler = StandardScaler()\n",
"X_subset_scaled = scaler.fit_transform(X_subset)\n",
"X_subset_scaled_aug = np.hstack([np.ones((X_subset_scaled.shape[0], 1)), X_subset_scaled])\n",
"\n",
"eigenvalues_scaled = np.linalg.eigvalsh(X_subset_scaled_aug.T @ X_subset_scaled_aug)\n",
"lambda_max_s, lambda_min_s = eigenvalues_scaled.max(), eigenvalues_scaled[eigenvalues_scaled > 1e-10].min()\n",
"cond_num_scaled = lambda_max_s / lambda_min_s\n",
"\n",
"learning_rates_scaled = [0.001, 0.01, 0.1]\n",
"for lr in learning_rates_scaled:\n",
" _, losses = gradient_descent_linear(X_subset_scaled_aug, y_subset, learning_rate=lr, n_iter=200)\n",
" axes[1].plot(losses, label=f'η = {lr}')\n",
"axes[1].set_xlabel('Iteration')\n",
"axes[1].set_ylabel('Loss (MSE)')\n",
"axes[1].set_title(f'Scaled Data (κ = {cond_num_scaled:.1f})')\n",
"axes[1].legend()\n",
"axes[1].grid(True, alpha=0.3)\n",
"axes[1].set_yscale('log')\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig('../images/gd_condition_number_effect.png')\n",
"plt.show()\n",
"\n",
"print(f\"\\nScaling reduced condition number from {cond_num:.1e} to {cond_num_scaled:.1f}\")\n",
"print(\"This allows much larger learning rates and faster convergence.\")"
]
},
{
"cell_type": "markdown",
"id": "e1197d1a",
"metadata": {},
"source": [
"Let's apply gradient descent to our housing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cdc96859",
"metadata": {},
"outputs": [],
"source": [
"# Implement batch gradient descent for linear regression on a small subset for illustration\n",
"def gradient_descent_linear(X, y, learning_rate=0.01, n_iter=1000, verbose=False):\n",
" n, p = X.shape\n",
" beta = np.zeros(p)\n",
" losses = []\n",
" for i in range(n_iter):\n",
" grad = - (1/n) * X.T @ (y - X @ beta)\n",
" beta -= learning_rate * grad\n",
" loss = (1/(2*n)) * np.linalg.norm(y - X @ beta)**2\n",
" losses.append(loss)\n",
" if verbose and i % 200 == 0:\n",
" print(f\"Iter {i}: loss = {loss:.6f}\")\n",
" return beta, losses\n",
"\n",
"# Use a small subset for speed\n",
"X_small = X_train[:1000]\n",
"y_small = y_train[:1000]\n",
"\n",
"# Add intercept column\n",
"X_small_aug = np.hstack([np.ones((X_small.shape[0], 1)), X_small])\n",
"\n",
"beta_gd, losses = gradient_descent_linear(X_small_aug, y_small, learning_rate=0.01, n_iter=500)\n",
"\n",
"plt.figure(figsize=(8,4))\n",
"plt.plot(losses)\n",
"plt.xlabel('Iteration')\n",
"plt.ylabel('Loss')\n",
"plt.title('Gradient Descent Convergence')\n",
"plt.grid(True)\n",
"plt.savefig('../images/gradient_descent_convergence_unscaled')\n",
"plt.show()\n",
"\n",
"# Compare with closed-form solution on the same subset\n",
"beta_closed = np.linalg.lstsq(X_small_aug, y_small, rcond=None)[0]\n",
"print(f\"Difference between GD and closed-form: {np.linalg.norm(beta_gd - beta_closed):.2e}\")"
]
},
{
"cell_type": "markdown",
"id": "b3a02ebc",
"metadata": {},
"source": [
"Again, we have scaling issues causing some errors. Large values will dominate gradients giving rise to instability."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b39817b2",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# 1. Prepare Data\n",
"# Use a small subset for speed\n",
"X_small = X_train[:1000].copy() # Use .copy() to avoid SettingWithCopyWarning\n",
"y_small = y_train[:1000].copy()\n",
"\n",
"# 2. SCALE THE FEATURES (Critical for Gradient Descent!)\n",
"scaler = StandardScaler()\n",
"X_small_scaled = scaler.fit_transform(X_small)\n",
"\n",
"# Add intercept column AFTER scaling\n",
"# (We don't scale the intercept column, it stays as 1s)\n",
"X_small_aug = np.hstack([np.ones((X_small_scaled.shape[0], 1)), X_small_scaled])\n",
"\n",
"# 3. Run Gradient Descent\n",
"def gradient_descent_linear(X, y, learning_rate=0.01, n_iter=1000, verbose=False):\n",
" n, p = X.shape\n",
" beta = np.zeros(p)\n",
" losses = []\n",
" for i in range(n_iter):\n",
" # Predict\n",
" prediction = X @ beta\n",
" # Residual\n",
" residual = y - prediction\n",
" # Gradient\n",
" grad = - (1/n) * X.T @ residual\n",
" # Update\n",
" beta -= learning_rate * grad\n",
" \n",
" # Calculate Loss (MSE)\n",
" loss = (1/(2*n)) * np.linalg.norm(residual)**2\n",
" losses.append(loss)\n",
" \n",
" if verbose and i % 200 == 0:\n",
" print(f\"Iter {i}: loss = {loss:.6f}\")\n",
" return beta, losses\n",
"\n",
"# With scaled data, learning_rate=0.01 or even 0.1 is usually safe\n",
"beta_gd, losses = gradient_descent_linear(X_small_aug, y_small, learning_rate=0.1, n_iter=500, verbose=True)\n",
"\n",
"# Plot convergence\n",
"plt.figure(figsize=(8,4))\n",
"plt.plot(losses)\n",
"plt.xlabel('Iteration')\n",
"plt.ylabel('Loss (MSE)')\n",
"plt.title('Gradient Descent Convergence (Scaled Data)')\n",
"plt.grid(True)\n",
"plt.savefig('../images/gradient_descent_convergence_scaled')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "f74f1fe8",
"metadata": {},
"source": [
"In practice, we use **stochastic** or **minibatch** gradient descent for large data. `sklearn`'s `SGDRegressor` implements these with various loss functions and penalties.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ffb519e1",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import SGDRegressor\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"# SGDRegressor is sensitive to feature scaling, so we use a pipeline\n",
"# penalty=None means no regularization (standard Linear Regression)\n",
"sgd_reg = make_pipeline(\n",
" StandardScaler(),\n",
" SGDRegressor(penalty=None, learning_rate='constant', eta0=0.01, max_iter=1000, random_state=42)\n",
")\n",
"\n",
"sgd_reg.fit(X_train, y_train)\n",
"\n",
"# Note: SGDRegressor optimizes a different loss function formulation by default,\n",
"# so coefficients might differ slightly from closed-form, but the prediction quality is similar.\n",
"print(f\"Coefficients: {sgd_reg.named_steps['sgdregressor'].coef_}\")"
]
},
{
"cell_type": "markdown",
"id": "b204fd46",
"metadata": {},
"source": [
"\n",
"## Decision Trees and Random Forests\n",
"\n",
"Linear models assume a linear relationship. Decision trees are **nonparametric** models that partition the feature space into rectangular regions and assign a constant prediction (or a simple model) in each region. The prediction function is piecewise constant. The basis functions are indicator functions of the leaves. While not linear in the original features, the model is linear in the (highdimensional) leafindicator basis.\n",
"\n",
"**Random forests** combine many decision trees, each trained on a bootstrapped sample and a random subset of features. They reduce variance (overfitting) and often outperform single trees.\n",
"\n",
"When to use trees / forests:\n",
"- Nonlinear relationships with interactions.\n",
"- When interpretability is desired (a single tree can be visualised).\n",
"- When you have mixed categorical and continuous features.\n",
"- As a strong baseline before trying deep learning."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ac05f78",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.tree import DecisionTreeRegressor\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"# Single decision tree (max depth 10)\n",
"tree = DecisionTreeRegressor(max_depth=10, random_state=RANDOM_STATE)\n",
"tree.fit(X_train, y_train)\n",
"y_test_pred_tree = tree.predict(X_test)\n",
"test_mse_tree = mean_squared_error(y_test, y_test_pred_tree)\n",
"\n",
"# Random forest (100 trees)\n",
"rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=RANDOM_STATE, n_jobs=-1)\n",
"rf.fit(X_train, y_train)\n",
"y_test_pred_rf = rf.predict(X_test)\n",
"test_mse_rf = mean_squared_error(y_test, y_test_pred_rf)\n",
"\n",
"print(f\"Decision Tree Test MSE: {test_mse_tree:.4f}\")\n",
"print(f\"Random Forest Test MSE: {test_mse_rf:.4f}\")\n",
"\n",
"# Compare with best linear model\n",
"print(f\"Ridge (poly) Test MSE: {test_mse_ridge:.4f}\")\n",
"print(f\"LassoCV Test MSE: {test_mse_lasso_cv:.4f}\")"
]
},
{
"cell_type": "markdown",
"id": "4923e4e3",
"metadata": {},
"source": [
"Random forests often outperform linear models on complex realworld data without requiring feature engineering or scaling.\n",
"\n",
"## Logistic Regression for Classification\n",
"\n",
"So far we have focused on regression (continuous targets). For binary classification (e.g., spam vs. not spam), **logistic regression** is a natural extension. It models the probability that an observation belongs to a class using the logistic (sigmoid) function:\n",
"\n",
"$$\n",
"P(y=1 \\mid x) = \\frac{1}{1 + e^{-x^T\\beta}}.\n",
"$$\n",
"\n",
"The decision boundary is linear in the features: $x^T\\beta = 0$. The model is fitted by **maximum likelihood estimation**, which is equivalent to minimising the **logloss** (crossentropy). There is no closedform solution; we typically use gradient descent or Newton's method.\n",
"\n",
"We will illustrate logistic regression on a subset of the California housing data by creating a binary target (e.g., whether the median house value is above the median)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c7802e5",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n",
"\n",
"# Create binary target: 1 if house value > median, else 0\n",
"# Use the ORIGINAL dataframe to avoid confusion with scaled/transformed versions\n",
"y_binary = (df['MedHouseVal'] > df['MedHouseVal'].median()).astype(int).values\n",
"X_original = df[feature_names].values # original features, not overwritten\n",
"\n",
"# Split\n",
"X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(\n",
" X_original, y_binary, test_size=0.2, random_state=RANDOM_STATE\n",
")\n",
"\n",
"# Scale features (important for logistic regression with regularization)\n",
"scaler_bin = StandardScaler()\n",
"X_train_bin_scaled = scaler_bin.fit_transform(X_train_bin)\n",
"X_test_bin_scaled = scaler_bin.transform(X_test_bin)\n",
"\n",
"# Train logistic regression\n",
"log_reg = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)\n",
"log_reg.fit(X_train_bin_scaled, y_train_bin)\n",
"\n",
"# Predict\n",
"y_pred_bin = log_reg.predict(X_test_bin_scaled)\n",
"accuracy = accuracy_score(y_test_bin, y_pred_bin)\n",
"\n",
"print(f\"Logistic Regression Accuracy: {accuracy:.4f}\")\n",
"print(\"\\nClassification Report:\")\n",
"print(classification_report(y_test_bin, y_pred_bin))\n",
"\n",
"# Coefficients (on scaled features)\n",
"coef_df = pd.DataFrame({\n",
" 'Feature': feature_names,\n",
" 'Coefficient': log_reg.coef_[0]\n",
"}).sort_values('Coefficient', key=abs, ascending=False)\n",
"print(\"\\nLogistic Regression Coefficients (scaled features):\")\n",
"print(coef_df)"
]
},
{
"cell_type": "markdown",
"id": "cb23e281",
"metadata": {},
"source": [
"## CrossValidation: A Deeper Look\n",
"\n",
"Crossvalidation (CV) is a technique for assessing how well a model generalises to unseen data. Instead of a single train/validation split, we partition the training data into $k$ folds (typically 5 or 10). For each fold $i$, we train on the other $k-1$ folds and validate on fold $i$. The performance is averaged over the $k$ folds."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c445e5c9",
"metadata": {},
"outputs": [],
"source": [
"# Illustrate 5-fold cross-validation\n",
"from sklearn.model_selection import KFold\n",
"\n",
"n_points = 20\n",
"X_cv = np.arange(n_points).reshape(-1, 1)\n",
"colors = plt.cm.tab10(np.linspace(0, 1, 5))\n",
"\n",
"kf = KFold(n_splits=5, shuffle=True, random_state=3)\n",
"\n",
"fig, axes = plt.subplots(5, 1, figsize=(12, 8))\n",
"\n",
"for i, (train_idx, test_idx) in enumerate(kf.split(X_cv)):\n",
" ax = axes[i]\n",
" \n",
" # Plot all points\n",
" for j in range(n_points):\n",
" if j in test_idx:\n",
" ax.scatter(j, 0, s=200, c='red', marker='s', label='Test' if j == test_idx[0] else '')\n",
" else:\n",
" ax.scatter(j, 0, s=200, c='blue', marker='o', label='Train' if j == train_idx[0] else '')\n",
" \n",
" ax.set_xlim(-1, n_points)\n",
" ax.set_ylim(-0.5, 0.5)\n",
" ax.set_yticks([])\n",
" ax.set_ylabel(f'Fold {i+1}', rotation=0, labelpad=30)\n",
" \n",
" if i == 0:\n",
" ax.legend(loc='upper right', ncol=2)\n",
" if i < 4:\n",
" ax.set_xticks([])\n",
"\n",
"axes[-1].set_xlabel('Sample Index')\n",
"axes[2].set_title('5-Fold Cross-Validation', pad=20)\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig('../images/cross_validation_illustration.png')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "17211749",
"metadata": {},
"source": [
"\n",
"> **Why crossvalidate?** It reduces the variance of the performance estimate and makes better use of limited data. It is also essential for hyperparameter tuning (as we did with Ridge and Lasso).\n",
"\n",
"We already used `cross_val_score` above. Here's an explicit example with a linear model on the housing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "99778878",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score, KFold\n",
"\n",
"# 5-fold CV on linear regression\n",
"lin_reg_cv = LinearRegression()\n",
"scores = cross_val_score(lin_reg_cv, X_train, y_train, cv=5, scoring='r2')\n",
"print(f\"5-fold CV R² scores: {scores}\")\n",
"print(f\"Mean R²: {scores.mean():.4f} (+/- {scores.std()*2:.4f})\")\n",
"\n",
"# We can also use a custom cross-validator\n",
"kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)\n",
"scores_shuffled = cross_val_score(lin_reg_cv, X_train, y_train, cv=kf, scoring='r2')\n",
"print(f\"Shuffled CV R² scores: {scores_shuffled}\")\n",
"print(f\"Mean R² (shuffled): {scores_shuffled.mean():.4f}\")"
]
},
{
"cell_type": "markdown",
"id": "f0f5b2dc",
"metadata": {},
"source": [
"## Feature Scaling\n",
"\n",
"Many machine learning algorithms are sensitive to the scale of features. For example:\n",
"- Gradient descent converges faster when features are on similar scales.\n",
"- Regularisation (Ridge, Lasso) penalises coefficients equally; if features have different scales, the penalty is not meaningful.\n",
"- Distancebased methods (knearest neighbours, SVM with RBF kernel) assume all features are comparable.\n",
"\n",
"> **Linear algebra view**: Scaling corresponds to multiplying each column of $X$ by a positive scalar. This changes the condition number and the geometry of the optimisation landscape.\n",
"\n",
"Common scaling techniques:\n",
"- **Standardisation** (Zscore): $x' = \\frac{x - \\mu}{\\sigma}$ (mean 0, variance 1).\n",
"- **Minmax scaling**: $x' = \\frac{x - \\min}{\\max - \\min}$ (range [0,1]).\n",
"\n",
"We should always fit the scaler on the training set and then transform both train and test sets to avoid data leakage."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b89e4b1",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Create scaler\n",
"scaler = StandardScaler()\n",
"\n",
"# Fit on training data only\n",
"X_train_scaled = scaler.fit_transform(X_train)\n",
"X_test_scaled = scaler.transform(X_test)\n",
"\n",
"# Compare condition number before and after scaling\n",
"X_train_aug = np.hstack([np.ones((X_train.shape[0], 1)), X_train])\n",
"X_train_scaled_aug = np.hstack([np.ones((X_train_scaled.shape[0], 1)), X_train_scaled])\n",
"\n",
"print(f\"Condition number (original): {cond(X_train_aug):.2e}\")\n",
"print(f\"Condition number (scaled): {cond(X_train_scaled_aug):.2e}\")\n",
"\n",
"# Fit linear regression on scaled data\n",
"lin_reg_scaled = LinearRegression()\n",
"lin_reg_scaled.fit(X_train_scaled, y_train)\n",
"y_test_pred_scaled = lin_reg_scaled.predict(X_test_scaled)\n",
"test_mse_scaled = mean_squared_error(y_test, y_test_pred_scaled)\n",
"print(f\"Linear regression (scaled) Test MSE: {test_mse_scaled:.4f}\")\n",
"print(f\"Linear regression (original) Test MSE: {test_mse:.4f}\")"
]
},
{
"cell_type": "markdown",
"id": "65b450d1",
"metadata": {},
"source": [
"Scaling did not change the linear regression performance because OLS is scaleinvariant (the coefficients adjust accordingly). However, it improves numerical stability and is crucial for regularised models and gradient descent.\n",
"\n",
"## Model Interpretation\n",
"\n",
"Interpretability is important in many applications. Different models offer different levels of insight.\n",
"\n",
"### Linear Models (Ridge, Lasso)\n",
"- Coefficients directly indicate the effect of each feature (assuming features are scaled).\n",
"- Sign and magnitude tell us direction and importance.\n",
"\n",
"### Decision Trees\n",
"- We can visualise the tree structure.\n",
"- Feature importance based on how much each feature reduces impurity (e.g., variance for regression, Gini for classification).\n",
"\n",
"### Random Forests\n",
"- Aggregate feature importance across all trees.\n",
"- Can also use SHAP or LIME for local explanations.\n",
"\n",
"Let's examine coefficients from a scaled linear model and feature importance from a random forest."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a7c2009",
"metadata": {},
"outputs": [],
"source": [
"# Train Ridge on scaled data (with default alpha)\n",
"ridge_scaled = Ridge(alpha=1.0)\n",
"ridge_scaled.fit(X_train_scaled, y_train)\n",
"\n",
"# Display coefficients\n",
"coef_df = pd.DataFrame({\n",
" 'Feature': feature_names,\n",
" 'Coefficient': ridge_scaled.coef_\n",
"})\n",
"print(\"Ridge coefficients (scaled features):\")\n",
"print(coef_df.sort_values('Coefficient', key=abs, ascending=False))\n",
"\n",
"# Random forest feature importance\n",
"rf.fit(X_train, y_train) # already fitted earlier, but ensure\n",
"importances = rf.feature_importances_\n",
"importance_df = pd.DataFrame({\n",
" 'Feature': feature_names,\n",
" 'Importance': importances\n",
"}).sort_values('Importance', ascending=False)\n",
"\n",
"print(\"\\nRandom Forest Feature Importances:\")\n",
"print(importance_df)\n",
"\n",
"# Plot\n",
"plt.figure(figsize=(8,4))\n",
"plt.barh(importance_df['Feature'], importance_df['Importance'])\n",
"plt.xlabel('Importance')\n",
"plt.title('Random Forest Feature Importance')\n",
"plt.gca().invert_yaxis()\n",
"plt.savefig('../images/RF_feature_importance.png')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "a470114c",
"metadata": {},
"source": [
"## Hyperparameter Tuning with Grid Search\n",
"\n",
"Most models have hyperparameters that are not learned from data (e.g., `alpha` in Ridge, `max_depth` in trees, `n_estimators` in random forests). Tuning them properly is essential for good performance. Choosing hyperparameters is like selecting the optimal basis or regularisation parameter it changes the solution space.\n",
"\n",
"**Grid search** exhaustively tries a predefined set of hyperparameter combinations using crossvalidation. `sklearn.model_selection.GridSearchCV` does this efficiently.\n",
"\n",
"Let's tune a random forest regressor on the housing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8a04531",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# Define parameter grid\n",
"param_grid = {\n",
" 'n_estimators': [50, 100],\n",
" 'max_depth': [5, 10, None],\n",
" 'min_samples_split': [2, 5]\n",
"}\n",
"\n",
"# Create random forest\n",
"rf_tune = RandomForestRegressor(random_state=42, n_jobs=-1)\n",
"\n",
"# Grid search with 3-fold CV (use a subset of training data for speed)\n",
"X_train_subset = X_train[:5000]\n",
"y_train_subset = y_train[:5000]\n",
"\n",
"grid_search = GridSearchCV(rf_tune, param_grid, cv=3, scoring='neg_mean_squared_error', verbose=1)\n",
"grid_search.fit(X_train_subset, y_train_subset)\n",
"\n",
"print(\"Best parameters:\", grid_search.best_params_)\n",
"print(\"Best CV MSE:\", -grid_search.best_score_)\n",
"\n",
"# Evaluate on test set\n",
"best_rf = grid_search.best_estimator_\n",
"y_test_pred_best_rf = best_rf.predict(X_test)\n",
"test_mse_best_rf = mean_squared_error(y_test, y_test_pred_best_rf)\n",
"print(f\"Tuned Random Forest Test MSE: {test_mse_best_rf:.4f}\")"
]
},
{
"cell_type": "markdown",
"id": "f8528fe3",
"metadata": {},
"source": [
"## Summary and Additional Considerations\n",
"\n",
"We have covered a progression of modelling techniques and essential practices:\n",
"\n",
"| Method | Linearity | Regularisation | Feature Selection | Scalability |\n",
"|--------|-----------|----------------|-------------------|-------------|\n",
"| Linear regression | Yes | No | No | Good (closedform) |\n",
"| Polynomial regression | In features | No | No | Poor (exploding dimension) |\n",
"| Ridge | Yes | $L^2$ | No (shrinks only) | Good |\n",
"| Lasso | Yes | $L^1$ | Yes | Good (via coordinate descent) |\n",
"| Logistic regression | Decision boundary linear | Optional | With L1/L2 | Good |\n",
"| Gradient descent | Yes (or any differentiable) | Optional | Optional | Excellent (very large data) |\n",
"| Decision trees | No | No (but depth limits) | Implicitly | Moderate |\n",
"| Random forests | No | No (ensemble reduces variance) | Implicitly | Moderate (parallelisable) |\n",
"\n",
"> **Biasvariance tradeoff**: Simple models (linear) have high bias but low variance. Complex models (deep trees) have low bias but high variance. Regularisation and ensembles (random forests) try to balance this.\n",
"\n",
"**What else could be added?**\n",
"- **Support vector machines** (SVM) geometric margin classifiers.\n",
"- **Neural networks** highly flexible nonlinear models.\n",
"- **Time series models** (ARIMA, etc.).\n",
"- **Model selection criteria** (AIC, BIC)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}