Modelling 101: Train/Test Splits & Beyond Linear Regression¶

Introduction¶

So far we have seen how linear regression (ordinary least squares) finds the best approximate solution of the overdetermined system $\tilde{X}\tilde{\beta} \approx y$ by minimising $\|y - \tilde{X}\tilde{\beta}\|_2^2$. This is a powerful tool, but real data often violates the assumptions that make linear regression the best choice. Here we address several of the points raised in notebook 03.

Why linear regression might not cut it:

  • Nonlinear relationships – The true dependency may be curved, periodic, or otherwise not linear.
  • High dimensionality – When the number of features $p$ is close to or larger than the number of observations $n$, the matrix $\tilde{X}^T\tilde{X}$ becomes singular or nearly singular.
  • Multicollinearity – Features are correlated, leading to large condition numbers and unstable coefficients.
  • Overfitting – A complex model fits noise instead of signal, especially when $p$ is large.
  • Outliers – The $L^2$ norm magnifies large errors, pulling the fit away from the bulk of the data.

In this notebook we will:

  • Work with a real, moderately sized dataset.
  • Learn how to properly split data into training, validation, and test sets.
  • Apply linear and polynomial regression, then diagnose their limitations.
  • Introduce regularisation methods (Ridge and Lasso) from a linear algebra perspective.
  • Explore gradient descent as a numerical optimisation alternative to the normal equations.
  • Look at decision trees and random forests – nonlinear models that can capture complex interactions without feature engineering.
  • Cover logistic regression for classification.
  • Discuss feature scaling, cross‑validation, model interpretation, and hyperparameter tuning.

The goal is to equip the linear algebraist with practical modelling tools while maintaining a geometric / algebraic intuition.


A Real Dataset: California Housing¶

A natural next step from our toy housing example is the California housing dataset from sklearn.datasets. It contains 20,640 observations of 8 features (median income, house age, average rooms, etc.), and the target is the median house value of each block group in California, expressed in hundreds of thousands of US dollars. This dataset is large enough to illustrate interesting effects but small enough to run quickly.

Linear algebra view: Each observation is a row vector $x_i \in \mathbb{R}^8$. The features form the columns of the design matrix $X \in \mathbb{R}^{20640 \times 8}$. We will add an intercept column $\mathbb{1}$ to obtain $\tilde{X} \in \mathbb{R}^{20640 \times 9}$.
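The augmentation with an intercept column can be sketched as follows. This is a minimal illustration on a small synthetic matrix rather than the housing data; the names `X_demo`, `X_tilde`, and `beta_true` are placeholders chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic stand-in for the design matrix: n observations, p features
n, p = 200, 8
X_demo = rng.normal(size=(n, p))

# Prepend the intercept column of ones to obtain the augmented matrix
X_tilde = np.hstack([np.ones((n, 1)), X_demo])

# Build a noiseless target from known coefficients (intercept first)
beta_true = np.arange(1.0, p + 2)          # shape (p + 1,)
y_demo = X_tilde @ beta_true

# Least squares recovers the coefficients because y lies in the column space
beta_hat, *_ = np.linalg.lstsq(X_tilde, y_demo, rcond=None)

print(X_tilde.shape)                       # (200, 9)
print(np.allclose(beta_hat, beta_true))    # True
```

The first entry of `beta_hat` plays the role of the intercept; the remaining entries are the feature coefficients.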

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the data
housing = fetch_california_housing()
X = housing.data          # shape (20640, 8)
y = housing.target       # shape (20640,)
feature_names = housing.feature_names

# Convert to DataFrame for convenience
df = pd.DataFrame(X, columns=feature_names)
df['MedHouseVal'] = y

print(f"Data shape: {df.shape}")
df.head()
Data shape: (20640, 9)
Out[1]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422
In [2]:
# Basic statistics
df.describe()
Out[2]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean 3.870671 28.639486 5.429000 1.096675 1425.476744 3.070655 35.631861 -119.569704 2.068558
std 1.899822 12.585558 2.474173 0.473911 1132.462122 10.386050 2.135952 2.003532 1.153956
min 0.499900 1.000000 0.846154 0.333333 3.000000 0.692308 32.540000 -124.350000 0.149990
25% 2.563400 18.000000 4.440716 1.006079 787.000000 2.429741 33.930000 -121.800000 1.196000
50% 3.534800 29.000000 5.229129 1.048780 1166.000000 2.818116 34.260000 -118.490000 1.797000
75% 4.743250 37.000000 6.052381 1.099526 1725.000000 3.282261 37.710000 -118.010000 2.647250
max 15.000100 52.000000 141.909091 34.066667 35682.000000 1243.333333 41.950000 -114.310000 5.000010

We will look at the relationships between these features and the price once the data has been split.

Train / Test Split (and Validation)¶

When it comes to real-world modelling, we must split our data into training and test sets.

Why split? If we evaluate a model on the same data we used to train it, we get an overly optimistic estimate of performance. The model may have memorised the training set (overfitting). Splitting mimics a real‑world scenario: we test on unseen data.

A common workflow:

  1. Training set (e.g., 60‑80%): used to fit the model parameters.
  2. Validation set (e.g., 10‑20%): used to tune hyperparameters (e.g., degree of polynomial, regularisation strength).
  3. Test set (e.g., 10‑20%): used only once at the end to report final performance.
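One way to obtain the three sets is to call train_test_split twice. The sketch below assumes synthetic data and a 60/20/20 split; the variable names (`X_all`, `X_trainval`, `X_tr`, `X_val`, `X_te`) are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 observations, 8 features
rng = np.random.default_rng(0)
X_all = rng.normal(size=(1000, 8))
y_all = rng.normal(size=1000)

# Step 1: hold out 20% of the data as the final test set
X_trainval, X_te, y_trainval, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% is 20% of the original data.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

print(len(X_tr), len(X_val), len(X_te))  # 600 200 200
```

The second `test_size` is a fraction of what is left after the first split, which is why 0.25 rather than 0.2 is needed to end up with equal validation and test sets.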
In [3]:
# Illustrate the three-way split
fig, ax = plt.subplots(figsize=(12, 3))

# Create rectangles for each split
ax.barh(0, 60, left=0, height=0.5, color='blue', alpha=0.7, label='Training (60%)')
ax.barh(0, 20, left=60, height=0.5, color='orange', alpha=0.7, label='Validation (20%)')
ax.barh(0, 20, left=80, height=0.5, color='red', alpha=0.7, label='Test (20%)')

# Add labels
ax.text(30, 0, 'Train Model\nParameters', ha='center', va='center', fontsize=10, fontweight='bold')
ax.text(70, 0, 'Tune\nHyperparams', ha='center', va='center', fontsize=10, fontweight='bold')
ax.text(90, 0, 'Final\nEvaluation', ha='center', va='center', fontsize=10, fontweight='bold')

ax.set_xlim(0, 100)
ax.set_ylim(-0.5, 0.5)
ax.set_xlabel('Percentage of Data')
ax.set_yticks([])
ax.set_title('Train/Validation/Test Split')
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=3)

plt.tight_layout()
plt.savefig('../images/train_validation_test_split.png')
plt.show()
[Figure: Train/Validation/Test Split, a horizontal bar divided into training (60%), validation (20%), and test (20%)]

Let us first fix a random state so that the splits are reproducible.

In [4]:
RANDOM_STATE = 3

Let's visualize a train/test split on synthetic data.

In [5]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Generate synthetic data to illustrate the concept
np.random.seed(3)
n = 50
X = np.random.uniform(-5,5,n) # synthetic, wider range

# True relationship
a_true = 2.0
c_true = 5.0
noise = np.random.normal(0,3,n)

y = a_true * X**2 + c_true + noise

# Perform train/test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

# Dense grid over the data range for plotting the true curve
X_curve = np.linspace(X.min(), X.max())
y_true = a_true * X_curve**2 + c_true


# Plot
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(X_train, y_train, color='blue', s=50, label='Training data', zorder=3)
ax.scatter(X_test, y_test, color='red', s=50, label='Test data', zorder=3)
ax.plot(X_curve, y_true, linewidth=2, label='True relationship', alpha=0.7)

ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title('Train/Test Split')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../images/train_test_split_illustration.png')
plt.show()

print(f"Total samples: {n}")
print(f"Training samples: {len(X_train)} ({len(X_train)/n*100:.0f}%)")
print(f"Test samples: {len(X_test)} ({len(X_test)/n*100:.0f}%)")
[Figure: Train/Test Split, scatter of training and test points with the true quadratic relationship]
Total samples: 50
Training samples: 35 (70%)
Test samples: 15 (30%)

Back to the housing data. In general, sklearn.model_selection.train_test_split can be applied twice: once to separate the test set from the train+validation part, and again to split off a validation set. For simplicity we do a single train/test split here and rely on cross-validation on the training set later for tuning.

In [6]:
from sklearn.model_selection import train_test_split

# Separate features and target
X = df[feature_names].values
y = df['MedHouseVal'].values

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size:     {X_test.shape[0]}")
Training set size: 16512
Test set size:     4128

With that, let's see the relationship between the features and the price.

In [7]:
# Visualize relationships between features and target
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

for i, (name, ax) in enumerate(zip(feature_names, axes)):
    ax.scatter(X_train[:, i], y_train, alpha=0.1, s=1)
    ax.set_xlabel(name)
    ax.set_ylabel('MedHouseVal')
    ax.set_title(f'{name} vs Price')

plt.tight_layout()
plt.savefig('../images/california_housing_scatter.png')
plt.show()
[Figure: scatter plots of each of the 8 features against MedHouseVal for the training set]

Linear Regression in Practice¶

We can use sklearn.linear_model.LinearRegression. For dense inputs it does not form the normal equations explicitly; it calls an SVD-based least-squares solver (scipy.linalg.lstsq, the same kind of lstsq routine we saw earlier), which is numerically more robust.

Linear algebra reminder: The least‑squares solution minimises $\|y - X\beta\|_2^2$. The closed form is $\beta = (X^T X)^{-1} X^T y$ when $X$ has full column rank.
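To make the closed form concrete, the following sketch compares the normal equations with an SVD-based solve on synthetic data. The names `A`, `b`, `beta_normal`, and `beta_lstsq` are illustrative, and the data are assumed to be well conditioned so that both routes agree.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic full-column-rank design matrix with an intercept column
n, p = 500, 4
A = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
b = A @ rng.normal(size=p + 1) + 0.1 * rng.normal(size=n)

# Route 1: normal equations, solved without forming an explicit inverse
beta_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Route 2: SVD-based least squares
beta_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))   # True

# Forming A^T A squares the 2-norm condition number
print(np.linalg.cond(A), np.linalg.cond(A.T @ A))
```

Because the condition number is squared when $X^T X$ is formed, the normal equations lose accuracy faster than an SVD or QR approach as the design matrix becomes ill-conditioned.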

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit linear regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predict on train and test
y_train_pred = lin_reg.predict(X_train)
y_test_pred = lin_reg.predict(X_test)

# Evaluate
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print(f"Train MSE: {train_mse:.4f},  Train R²: {train_r2:.4f}")
print(f"Test  MSE: {test_mse:.4f},  Test  R²: {test_r2:.4f}")
Train MSE: 0.5229,  Train R²: 0.6079
Test  MSE: 0.5381,  Test  R²: 0.5931

The test $R^2$ is respectable (~0.6), but perhaps we can do better with a more flexible model. However, simply adding polynomial features might lead to overfitting. Let's examine that.
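For reference, the two metrics reported above are $\mathrm{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ and $R^2 = 1 - \mathrm{SS}_{\text{res}} / \mathrm{SS}_{\text{tot}}$. A minimal sketch computing both by hand on a toy example (the helper names `mse` and `r2` are illustrative, not part of sklearn):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.0])

print(mse(y_obs, y_hat))   # 0.125
print(r2(y_obs, y_hat))    # 0.9
```

A model that always predicts the mean of `y_obs` has $R^2 = 0$, so a test $R^2$ near 0.6 means the linear model removes roughly 60% of the variance that the mean-only baseline leaves unexplained.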

Polynomial Regression and the Danger of Overfitting¶

Polynomial regression creates new features by taking powers of the original features. For example, with one feature $x$, a degree‑2 model uses $[1, x, x^2]$. For multiple features, we can include interaction terms.

Linear algebra view: The Vandermonde matrix (for one feature) or its multivariate generalisation becomes the new design matrix. As degree increases, the condition number often explodes, leading to numerical instability and wild coefficients – a sign of overfitting.
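The growth of the condition number can be checked directly. The sketch below assumes 50 points drawn uniformly from $[-5, 5]$ and builds the Vandermonde matrix with np.vander; the degrees chosen are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-5, 5, 50)

conds = {}
for degree in [1, 2, 5, 11]:
    # Columns are 1, x, x^2, ..., x^degree
    V = np.vander(x, N=degree + 1, increasing=True)
    conds[degree] = np.linalg.cond(V)
    print(f"degree {degree:2d}: cond(V) = {conds[degree]:.2e}")
```

Each higher degree appends columns to the same matrix, so the largest singular value cannot shrink and the smallest cannot grow; the condition number therefore never decreases as the degree goes up.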

Let's illustrate underfitting and overfitting on synthetic data before moving to the housing dataset.

Illustration: Underfitting vs Overfitting¶

We generate data from a quadratic function with noise, then fit polynomials of different degrees.

In [9]:
# Generate quadratic data (similar to notebook 02) – using distinct names
np.random.seed(3)
n_synth = 50
x_synth = np.random.uniform(-5, 5, n_synth)
y_true_synth = 2.0 * x_synth**2 + 5.0
noise_synth = np.random.normal(0, 3, n_synth)
y_synth = y_true_synth + noise_synth

# Fit polynomials of degree 1 (underfit), 2 (good), 11 (overfit)
degrees = [1, 2, 11]
x_plot = np.linspace(-5, 5, 200)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for idx, d in enumerate(degrees):
    coeff = np.polyfit(x_synth, y_synth, d)
    p = np.poly1d(coeff)
    axes[idx].scatter(x_synth, y_synth, alpha=0.7, label='Data')
    axes[idx].plot(x_plot, p(x_plot), 'r-', linewidth=2, label=f'Degree {d}')
    axes[idx].set_title(f'Degree {d} fit')
    axes[idx].set_xlabel('x')
    axes[idx].set_ylabel('y')
    axes[idx].legend()
    axes[idx].grid(True)
plt.tight_layout()
plt.savefig('../images/underfitting_vs_overfitting.png')
plt.show()
[Figure: degree 1, degree 2, and degree 11 polynomial fits to the synthetic quadratic data]
  • Degree 1 (underfitting): The linear model cannot capture the curvature, resulting in high bias.
  • Degree 2 (good): The quadratic model matches the true underlying structure.
  • Degree 11 (overfitting): The polynomial oscillates wildly to fit the noise, leading to poor generalisation.
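The same comparison can be made quantitative by holding out part of the synthetic data. This is a sketch under assumptions: it uses its own random generator and a simple 35/15 split by position, so the numbers will not match the figure above exactly.

```python
import numpy as np

rng = np.random.default_rng(3)

# Same generating process as above: y = 2 x^2 + 5 + noise
x_all = rng.uniform(-5, 5, 50)
y_all = 2.0 * x_all**2 + 5.0 + rng.normal(0, 3, 50)

# Hold out the last 15 points as a test set (the data are already in random order)
x_tr, y_tr = x_all[:35], y_all[:35]
x_te, y_te = x_all[35:], y_all[35:]

results = {}
for d in [1, 2, 11]:
    coeffs = np.polyfit(x_tr, y_tr, d)
    train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    results[d] = (train_mse, test_mse)
    print(f"degree {d:2d}: train MSE = {train_mse:8.2f}, test MSE = {test_mse:8.2f}")
```

Training error can only go down as the degree increases, because each model contains the previous one. Test error typically falls sharply from degree 1 to degree 2 and then stops improving, or gets worse, once the extra terms start fitting noise.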

Now back to the housing dataset. Let's create polynomial features and see the effect on condition number and test error.

In [10]:
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features of degree 2 (includes interactions)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

print(f"Original training features: {X_train.shape[1]}")
print(f"Polynomial training features: {X_train_poly.shape[1]}")

# Condition number of the augmented polynomial design matrix (with intercept added later)
from numpy.linalg import cond

X_train_poly_with_intercept = np.hstack([np.ones((X_train_poly.shape[0], 1)), X_train_poly])
print(f"Condition number of polynomial design matrix: {cond(X_train_poly_with_intercept):.2e}")
Original training features: 8
Polynomial training features: 44
Condition number of polynomial design matrix: 1.55e+11
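The count of 44 features follows from counting monomials: there are $\binom{p + d}{d}$ monomials of total degree at most $d$ in $p$ variables, including the constant term. A small sketch of that arithmetic (the helper `n_poly_features` is illustrative and assumes interaction terms are included, as in the cell above):

```python
from math import comb

def n_poly_features(n_features: int, degree: int, include_bias: bool = False) -> int:
    """Number of monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)   # includes the constant term
    return total if include_bias else total - 1

print(n_poly_features(8, 2))   # 44
print(n_poly_features(8, 3))   # 164
```

The count grows quickly with the degree, which is one reason high-degree polynomial expansions of many features become hard to fit reliably.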
In [11]:
# Fit linear regression on polynomial features
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)

y_train_pred_poly = poly_reg.predict(X_train_poly)
y_test_pred_poly = poly_reg.predict(X_test_poly)

train_mse_poly = mean_squared_error(y_train, y_train_pred_poly)
test_mse_poly = mean_squared_error(y_test, y_test_pred_poly)

print(f"Polynomial (deg=2) Train MSE: {train_mse_poly:.4f}")
print(f"Polynomial (deg=2) Test  MSE: {test_mse_poly:.4f}")
Polynomial (deg=2) Train MSE: 0.4217
Polynomial (deg=2) Test  MSE: 0.4669

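The test error (0.4669) is lower than that of the linear model (0.5381), so the extra flexibility helps. However, the gap between training and test error has widened (0.4217 vs. 0.4669, compared with 0.5229 vs. 0.5381 for the linear model), which suggests the model is starting to fit noise in the training data, and the design matrix is badly conditioned (condition number about $1.55 \times 10^{11}$). Both problems motivate regularisation.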

Ridge Regression ($L^2$ Regularisation)¶

Ridge regression adds a penalty on the squared $L^2$ norm of the coefficient vector:

$$ \min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 $$

where $\lambda \ge 0$ is the regularisation strength.

Linear algebra interpretation: The normal equations become $(X^T X + \lambda I)\beta = X^T y$. If $X^T X$ has eigenvalues $\sigma_i^2$ (the squared singular values of $X$), then $X^T X + \lambda I$ has eigenvalues $\sigma_i^2 + \lambda$. For $\lambda > 0$ every eigenvalue is strictly positive, so the system has a unique solution even when $X^T X$ is singular, and the condition number drops from $\sigma_{\max}^2 / \sigma_{\min}^2$ to $(\sigma_{\max}^2 + \lambda)/(\sigma_{\min}^2 + \lambda)$. This is a form of Tikhonov regularisation.

Ridge regression shrinks coefficients towards zero but rarely makes them exactly zero. It is especially useful when features are correlated (multicollinearity).
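The eigenvalue shift can be verified on a deliberately singular example. The sketch below solves the ridge normal equations directly with NumPy, assuming no intercept and a design matrix with two identical columns; the names `X_dup`, `gram`, and `lam` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Design matrix with two identical columns, so X^T X is exactly singular
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X_dup = np.column_stack([x1, x1, x2])
y_dup = 3.0 * x1 - 2.0 * x2 + 0.1 * rng.normal(size=n)

gram = X_dup.T @ X_dup
lam = 1.0

# Ridge normal equations: (X^T X + lam * I) beta = X^T y
beta_ridge = np.linalg.solve(gram + lam * np.eye(3), X_dup.T @ y_dup)

# Eigenvalues are shifted up by exactly lam
eig_before = np.linalg.eigvalsh(gram)
eig_after = np.linalg.eigvalsh(gram + lam * np.eye(3))

print("eigenvalues before:", eig_before)
print("eigenvalues after: ", eig_after)
print("ridge coefficients:", beta_ridge)
```

Ordinary least squares has no unique solution here, whereas ridge splits the weight evenly between the two identical columns. Note that sklearn's Ridge additionally fits an unpenalised intercept by default, which this sketch leaves out.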

In [12]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# We'll use the polynomial features because ridge can help with overfitting
# Choose lambda via cross-validation on the training set
alphas = np.logspace(-3, 3, 20)
cv_scores = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    # 5-fold cross-validation, negative MSE (scoring expects higher = better)
    scores = cross_val_score(ridge, X_train_poly, y_train, cv=5, scoring='neg_mean_squared_error')
    cv_scores.append(-scores.mean())

best_alpha = alphas[np.argmin(cv_scores)]
print(f"Best alpha from CV: {best_alpha:.4f}")

# Plot CV error vs alpha
plt.figure(figsize=(8,4))
plt.semilogx(alphas, cv_scores)
plt.xlabel('alpha (λ)')
plt.ylabel('Cross-validated MSE')
plt.title('Ridge Regularisation on Polynomial Features')
plt.grid(True)
plt.savefig('../images/ridge_regularization_polynomial_features_unscaled.png')
plt.show()
/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=5.61091e-21): result may not be accurate.
  return f(*arrays, *other_args, **kwargs)
[... the same LinAlgWarning repeats many more times with different rcond values ...]
Best alpha from CV: 1000.0000

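You'll notice a string of warnings (not errors) about ill-conditioned matrices. They arise because the polynomial features live on wildly different scales, so the linear system Ridge has to solve is numerically close to singular (note the reciprocal condition estimates around $10^{-17}$). Let's standardize the features first.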
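To make the scale argument concrete, here is a minimal sketch on synthetic data. The two features (`income`, `population`) and their distributions are made up for illustration and are not the California housing columns. It builds degree-2 polynomial features by hand and compares the condition number before and after standardisation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two hypothetical features on very different scales (illustrative only)
income = rng.normal(4.0, 2.0, size=n)             # order of magnitude 1
population = rng.normal(1500.0, 1000.0, size=n)   # order of magnitude 1000

# Degree-2 polynomial features built by hand (no bias column)
X_poly = np.column_stack([
    income,
    population,
    income**2,
    income * population,
    population**2,
])

def standardize(X):
    """Centre each column and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = np.linalg.cond(X_poly)
cond_scaled = np.linalg.cond(standardize(X_poly))

print(f"Condition number, raw polynomial features:    {cond_raw:.3e}")
print(f"Condition number, standardised features:      {cond_scaled:.3e}")
```

Standardising does not remove the correlation between a feature and its square, so the scaled matrix is still not perfectly conditioned, but the part of the ill-conditioning that comes purely from mismatched units disappears.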

In [13]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Add scaler
scaler = StandardScaler()
X_train_poly_scaled = scaler.fit_transform(X_train_poly)
X_test_poly_scaled = scaler.transform(X_test_poly)

# We'll use the polynomial features because ridge can help with overfitting
# Choose lambda via cross-validation on the training set
alphas = np.logspace(-3, 3, 20)
cv_scores = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    scores = cross_val_score(ridge, X_train_poly_scaled, y_train, cv=5, scoring='neg_mean_squared_error')
    cv_scores.append(-scores.mean())

best_alpha = alphas[np.argmin(cv_scores)]
print(f"Best alpha from CV: {best_alpha:.4f}")

# Plot CV error vs alpha
plt.figure(figsize=(8,4))
plt.semilogx(alphas, cv_scores)
plt.xlabel('alpha (λ)')
plt.ylabel('Cross-validated MSE')
plt.title('Ridge Regularisation on Polynomial Features')
plt.grid(True)
plt.savefig('../images/ridge_regularization_polynomial_features_scaled.png')
plt.show()
Best alpha from CV: 233.5721
In [14]:
# Fit ridge with best alpha on SCALED polynomial features
ridge_best = Ridge(alpha=best_alpha)
ridge_best.fit(X_train_poly_scaled, y_train)

y_test_pred_ridge = ridge_best.predict(X_test_poly_scaled)
test_mse_ridge = mean_squared_error(y_test, y_test_pred_ridge)
print(f"Ridge (poly deg=2) Test MSE: {test_mse_ridge:.4f}")
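print(f"Plain polynomial Test MSE: {test_mse_poly:.4f}, Ridge Test MSE: {test_mse_ridge:.4f}")
Ridge (poly deg=2) Test MSE: 0.4791
Plain polynomial Test MSE: 0.4669, Ridge Test MSE: 0.4791

Note that Ridge does not improve the test MSE here: 0.4791 is slightly worse than the 0.4669 of the plain degree-2 fit. This suggests the unregularised polynomial model was not overfitting much on this dataset, so the extra shrinkage mostly adds bias.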

Ridge from the SVD Perspective¶

The Ridge solution has a beautiful interpretation in terms of singular values. Recall from Notebook 2 that if $X = U\Sigma V^T$ is the SVD of the (centered) design matrix, then the OLS solution is

$$ \hat{\beta}_{OLS} = V\Sigma^{-1}U^T y = \sum_{i=1}^{p} \frac{1}{\sigma_i} (u_i^T y) v_i. $$

When $\sigma_i$ is small, the coefficient $\frac{1}{\sigma_i}$ explodes — this is the condition number problem.

For Ridge regression, one can show that

$$ \hat{\beta}_{Ridge} = \sum_{i=1}^{p} \frac{\sigma_i}{\sigma_i^2 + \lambda} (u_i^T y) v_i. $$
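
One way to see this, assuming the penalised objective $\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$ and a thin SVD with $n \ge p$ (so that $V$ is a square orthogonal matrix and $VV^T = I$), is to start from the Ridge normal equations:

$$ \hat{\beta}_{Ridge} = (X^TX + \lambda I)^{-1}X^Ty = (V\Sigma^2V^T + \lambda VV^T)^{-1}V\Sigma U^Ty = V(\Sigma^2 + \lambda I)^{-1}\Sigma\,U^Ty. $$

The matrix $(\Sigma^2 + \lambda I)^{-1}\Sigma$ is diagonal with entries $\frac{\sigma_i}{\sigma_i^2 + \lambda}$, so expanding the product column by column gives the sum above.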

Notice what happens:

  • When $\sigma_i \gg \sqrt{\lambda}$, the coefficient is approximately $\frac{1}{\sigma_i}$ (same as OLS).
  • When $\sigma_i \ll \sqrt{\lambda}$, the coefficient is approximately $\frac{\sigma_i}{\lambda}$ — shrunk towards zero.
  • The effective condition number becomes $\frac{\sigma_1^2 + \lambda}{\sigma_p^2 + \lambda}$, which is much better than $\frac{\sigma_1^2}{\sigma_p^2}$.

This is why Ridge helps with multicollinearity: it dampens precisely those directions that were poorly determined.
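
The SVD formula can be checked numerically. The sketch below uses a small synthetic design matrix with two nearly collinear columns (all names and sizes are illustrative assumptions, not the housing data) and compares the SVD expression with a direct solve of the Ridge normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 5, 10.0

# Synthetic design matrix; column 4 is almost a copy of column 3
X = rng.normal(size=(n, p))
X[:, 4] = X[:, 3] + 0.01 * rng.normal(size=n)
X = X - X.mean(axis=0)                      # centre, as in the text

y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(size=n)
y = y - y.mean()

# Thin SVD: X = U diag(s) Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Ridge via the SVD: sum_i sigma_i / (sigma_i^2 + lam) * (u_i^T y) * v_i
beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

# Ridge via the normal equations: (X^T X + lam I) beta = X^T y
beta_normal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Filter factors: how much each OLS direction is kept (1 = untouched)
filter_factors = s**2 / (s**2 + lam)

print("max |beta_svd - beta_normal| =", np.abs(beta_svd - beta_normal).max())
print("filter factors:", filter_factors.round(3))
```

The filter factor for the smallest singular value (the nearly collinear direction) is the one furthest below 1, which is the damping described in the bullet points.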

In [15]:
# Visualize how Ridge shrinks coefficients relative to singular values
from sklearn.preprocessing import StandardScaler

# Use scaled data for clean SVD interpretation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Compute SVD of centered design matrix
U, s, Vt = np.linalg.svd(X_train_scaled, full_matrices=False)

# For different lambda values, compute the "shrinkage factor" for each singular direction
lambdas = [0, 0.1, 1, 10, 100]

plt.figure(figsize=(10, 5))

for lam in lambdas:
    # Ridge filter factor relative to OLS: sigma^2 / (sigma^2 + lambda).
    # It multiplies the OLS coefficient 1/sigma along each singular direction
    # and equals 1 everywhere when lambda = 0 (no shrinkage).
    shrinkage = s**2 / (s**2 + lam)
    label = 'OLS (λ=0)' if lam == 0 else f'Ridge (λ={lam})'

    plt.plot(range(1, len(s)+1), shrinkage, 'o-', label=label, markersize=8)

plt.xlabel('Singular value index (decreasing)')
plt.ylabel('Shrinkage factor σ²/(σ²+λ)')
plt.title('Ridge Shrinkage: How λ Dampens Small Singular Directions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(range(1, len(s)+1))
plt.tight_layout()
plt.savefig('../images/ridge_svd_shrinkage.png')
plt.show()

# Show condition number improvement
print("Singular values:", s.round(2))
print(f"\nCondition number (OLS): {s[0]/s[-1]:.2f}")
for lam in [0.1, 1, 10]:
    effective_cond = (s[0]**2 + lam) / (s[-1]**2 + lam)
    print(f"Effective condition number (λ={lam}): {effective_cond:.2f}")
Singular values: [182.41 176.4  144.85 130.91 128.68 104.03  37.56  28.16]

Condition number (OLS): 6.48
Effective condition number (λ=0.1): 41.96
Effective condition number (λ=1): 41.91
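Effective condition number (λ=10): 41.45

The two kinds of numbers above are not directly comparable: 6.48 is $\kappa(X) = \sigma_1/\sigma_p$, whereas the effective values are condition numbers of $X^TX + \lambda I$ and should be compared with $\kappa(X^TX) = \sigma_1^2/\sigma_p^2 \approx 6.48^2 \approx 42$. On that footing Ridge does lower the condition number, but only marginally here, because these values of $\lambda$ are tiny next to $\sigma_p^2 \approx 793$.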
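
As a small sketch of a like-for-like comparison, the snippet below reuses the singular values printed above (copied by hand, so rounded to two decimals) and evaluates the condition number of $X^TX + \lambda I$ over a wider range of $\lambda$. The helper name `ridge_condition_number` is introduced here for illustration only.

```python
import numpy as np

# Singular values copied from the printed output above (rounded)
s = np.array([182.41, 176.4, 144.85, 130.91, 128.68, 104.03, 37.56, 28.16])

cond_X = s[0] / s[-1]               # kappa(X)
cond_gram = (s[0] / s[-1]) ** 2     # kappa(X^T X), the quantity Ridge modifies

def ridge_condition_number(s, lam):
    """Condition number of X^T X + lam * I, given the singular values of X."""
    return (s[0] ** 2 + lam) / (s[-1] ** 2 + lam)

lambdas = [0.0, 1.0, 1e2, 1e3, 1e4]
effective = [ridge_condition_number(s, lam) for lam in lambdas]

print(f"kappa(X)     = {cond_X:.2f}")
print(f"kappa(X^T X) = {cond_gram:.2f}")
for lam, kappa in zip(lambdas, effective):
    print(f"lambda = {lam:>8.1f}: kappa(X^T X + lambda I) = {kappa:.2f}")
```

The improvement only becomes substantial once $\lambda$ is comparable to $\sigma_p^2$, which on standardised data with many rows means values in the hundreds or thousands.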

Lasso Regression ($L^1$ Regularisation)¶

Lasso replaces the $L^2$ penalty with an $L^1$ penalty:

$$ \min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 $$

Geometric intuition: The $L^1$ ball is a diamond (in $\mathbb{R}^2$). The intersection of the quadratic loss contours with this diamond often occurs at a corner, forcing some coefficients to be exactly zero. Thus Lasso performs feature selection.

Lasso is useful when we suspect that only a few features are truly relevant, especially in high‑dimensional settings. However, it does not have a closed‑form solution; it is typically solved via coordinate descent or other optimisation algorithms.
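
To illustrate what coordinate descent does, here is a minimal NumPy sketch for the objective written above, $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$. It is not scikit-learn's implementation (which, among other differences, scales the squared loss by $\frac{1}{2n}$, so its `alpha` is not the same number as this $\lambda$). The function names and the synthetic data are assumptions made for the example. Each coordinate update is a one-dimensional problem whose solution is a soft-thresholding step, which is where the exact zeros come from.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimise ||y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)        # x_j^T x_j for every column
    residual = y - X @ beta            # current residual (a fresh array)
    for _ in range(n_sweeps):
        for j in range(p):
            # Put feature j's contribution back into the residual
            residual += X[:, j] * beta[j]
            rho = X[:, j] @ residual
            # Exact minimiser of the 1-D problem in beta_j
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            residual -= X[:, j] * beta[j]
    return beta

# Synthetic sparse problem: only the first 3 of 10 features matter
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise columns
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.normal(size=n)
y = y - y.mean()

beta_hat = lasso_coordinate_descent(X, y, lam=100.0)
print("estimated coefficients:", beta_hat.round(3))
print("non-zero coefficients :", int(np.count_nonzero(beta_hat)))
```

The surviving coefficients are biased towards zero relative to `beta_true`; that shrinkage is the price paid for the sparsity.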

In [16]:
from sklearn.linear_model import Lasso

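# Lasso also requires tuning of alpha
# Note: this cell fits on the unscaled polynomial features, which is what triggers
# the ConvergenceWarning messages below; their own suggestion is to check feature scaling.
lasso = Lasso(alpha=0.01, max_iter=10000)  # start with a small alpha
lasso.fit(X_train_poly, y_train)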

# Count non-zero coefficients
n_nonzero = np.sum(np.abs(lasso.coef_) > 1e-10)
print(f"Number of non-zero coefficients: {n_nonzero} out of {len(lasso.coef_)}")

y_test_pred_lasso = lasso.predict(X_test_poly)
test_mse_lasso = mean_squared_error(y_test, y_test_pred_lasso)
print(f"Lasso (poly deg=2) Test MSE: {test_mse_lasso:.4f}")

# Cross-validation for Lasso alpha
from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 30), cv=5, max_iter=10000, random_state=RANDOM_STATE)
lasso_cv.fit(X_train_poly, y_train)
print(f"Best alpha from LassoCV: {lasso_cv.alpha_:.4f}")
print(f"Number of non-zero coefficients (CV best): {np.sum(np.abs(lasso_cv.coef_) > 1e-10)}")

y_test_pred_lasso_cv = lasso_cv.predict(X_test_poly)
test_mse_lasso_cv = mean_squared_error(y_test, y_test_pred_lasso_cv)
print(f"LassoCV Test MSE: {test_mse_lasso_cv:.4f}")
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:716: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.725e+03, tolerance: 2.202e+00
  model = cd_fast.enet_coordinate_descent(
Number of non-zero coefficients: 33 out of 44
Lasso (poly deg=2) Test MSE: 0.4538
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.991e+03, tolerance: 1.771e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.981e+03, tolerance: 1.771e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.972e+03, tolerance: 1.771e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.965e+03, tolerance: 1.771e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.153e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.201e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.168e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.116e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.067e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.026e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.987e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.057e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.040e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.016e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.995e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.979e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.965e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.952e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.940e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.930e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.922e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.913e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.904e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.896e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.889e+03, tolerance: 1.752e+00
  model = cd_fast.enet_coordinate_descent_gram(
Best alpha from LassoCV: 0.0067
Number of non-zero coefficients (CV best): 34
LassoCV Test MSE: 0.4587
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:716: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.709e+03, tolerance: 2.202e+00
  model = cd_fast.enet_coordinate_descent(

Again, the polynomial features are unscaled, so we get convergence warnings. Lasso is sensitive to scaling because the penalty treats all coefficients equally, regardless of the units of the corresponding features. The warning also suggests increasing the number of iterations.
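To see why the scale of a feature matters for the $L^1$ penalty, here is a minimal sketch on synthetic data (the data, `alpha=0.1`, and the variable names are assumptions made for illustration, not part of the housing example). Two equally informative features are used, but one is recorded in units that make its values 1000 times smaller, so its coefficient would have to be 1000 times larger and is penalised accordingly.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y_demo = x1 + x2 + 0.1 * rng.normal(size=n)   # both features matter equally

# Same information, but the first feature is stored on a much smaller scale
X_raw = np.column_stack([x1 / 1000.0, x2])

# Lasso on the raw features: the tiny-scale feature needs a huge coefficient,
# which the L1 penalty makes too expensive
coef_raw = Lasso(alpha=0.1).fit(X_raw, y_demo).coef_

# Lasso after standardising: both features are penalised on an equal footing
X_std = StandardScaler().fit_transform(X_raw)
coef_std = Lasso(alpha=0.1).fit(X_std, y_demo).coef_

print("Raw features:         ", coef_raw)
print("Standardised features:", coef_std)
```

In this sketch the raw fit drops the first feature entirely, while the standardised fit keeps both.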

Let's fix this.

In [17]:
from sklearn.linear_model import Lasso, LassoCV

# Lasso with more iterations
lasso = Lasso(alpha=0.01, max_iter=100000, tol=1e-4)
lasso.fit(X_train_poly_scaled, y_train)

# Count non-zero coefficients
n_nonzero = np.sum(np.abs(lasso.coef_) > 1e-10)
print(f"Number of non-zero coefficients: {n_nonzero} out of {len(lasso.coef_)}")

y_test_pred_lasso = lasso.predict(X_test_poly_scaled)
test_mse_lasso = mean_squared_error(y_test, y_test_pred_lasso)
print(f"Lasso Test MSE: {test_mse_lasso:.4f}")

# Cross-validation for Lasso alpha
lasso_cv = LassoCV(
    alphas=np.logspace(-3, 1, 30), 
    cv=5, 
    max_iter=100000, 
    tol=1e-4,
    random_state=RANDOM_STATE
)
lasso_cv.fit(X_train_poly_scaled, y_train)
print(f"Best alpha from LassoCV: {lasso_cv.alpha_:.4f}")
print(f"Number of non-zero coefficients (CV best): {np.sum(np.abs(lasso_cv.coef_) > 1e-10)}")

y_test_pred_lasso_cv = lasso_cv.predict(X_test_poly_scaled)
test_mse_lasso_cv = mean_squared_error(y_test, y_test_pred_lasso_cv)
print(f"LassoCV Test MSE: {test_mse_lasso_cv:.4f}")
Number of non-zero coefficients: 15 out of 44
Lasso Test MSE: 0.5347
Best alpha from LassoCV: 0.0067
Number of non-zero coefficients (CV best): 16
LassoCV Test MSE: 0.5305

Why Lasso Produces Sparse Solutions: The L¹ Geometry¶

Recall from Notebook 3 that the $L^1$ unit ball is a diamond (a rotated square in $\mathbb{R}^2$). This geometric fact is precisely why Lasso tends to produce coefficients that are exactly zero.

Consider the constrained form of the problem:

$$ \min_{\beta} \|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_1 \leq t. $$

The constraint region is the $L^1$ ball — a diamond with corners on the axes. The contours of the loss function $\|y - X\beta\|_2^2$ are ellipses centered at the OLS solution.

Key insight: When an elliptical contour expands and first touches the diamond, it often hits a corner. Corners lie on the axes, meaning some coefficients are exactly zero.

This is in contrast to Ridge, where the constraint region is a ball (a disc in $\mathbb{R}^2$) and the first contact is typically at a smooth boundary point, so coefficients are shrunk but rarely exactly zero.

In [18]:
# Visualize L1 vs L2 constraint regions and why Lasso gives sparsity
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# L1 ball (diamond)
theta = np.linspace(0, 2*np.pi, 100)
r = 1

# L1 ball vertices
l1_x = [r, 0, -r, 0, r]
l1_y = [0, r, 0, -r, 0]

# L2 ball (circle)
l2_x = r * np.cos(theta)
l2_y = r * np.sin(theta)

# Simulated loss contours (ellipses centered away from origin)
# The OLS solution is at some point (beta1_ols, beta2_ols)
beta_ols = np.array([0.7, 0.3])

for idx, (ax, ball_type) in enumerate(zip(axes, ['Lasso (L¹)', 'Ridge (L²)'])):
    # Draw constraint region
    if idx == 0:  # Lasso - L1 ball
        ax.fill(l1_x, l1_y, alpha=0.3, color='blue', label='L¹ constraint region')
        ax.plot(l1_x, l1_y, 'b-', linewidth=2)
    else:  # Ridge - L2 ball
        ax.fill(l2_x, l2_y, alpha=0.3, color='green', label='L² constraint region')
        ax.plot(l2_x, l2_y, 'g-', linewidth=2)
    
    # Draw loss contours (ellipses)
    # Simplified: concentric ellipses around OLS solution
    for scale in [0.3, 0.5, 0.7, 1.0]:
        ellipse_x = beta_ols[0] + scale * 0.4 * np.cos(theta)
        ellipse_y = beta_ols[1] + scale * 0.2 * np.sin(theta)
        ax.plot(ellipse_x, ellipse_y, 'r--', alpha=0.5, linewidth=1)
    
    # Mark OLS solution
    ax.scatter(*beta_ols, color='red', s=100, zorder=5, label='OLS solution')
    
    # Mark the "first contact" point (approximate)
    if idx == 0:  # Lasso hits corner
        contact = np.array([1.0, 0.0])  # on the axis!
        ax.scatter(*contact, color='purple', s=150, marker='*', zorder=6, label='Lasso solution (sparse!)')
    else:  # Ridge hits smooth part
        contact = np.array([0.85, 0.35])  # not on axis
        ax.scatter(*contact, color='purple', s=150, marker='*', zorder=6, label='Ridge solution')
    
    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.set_xlabel(r'$\beta_1$')
    ax.set_ylabel(r'$\beta_2$')
    ax.set_title(f'{ball_type} Constraint')
    ax.legend(loc='upper right', fontsize=9)
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)
    ax.axhline(0, color='k', linewidth=0.5)
    ax.axvline(0, color='k', linewidth=0.5)

plt.tight_layout()
plt.savefig('../images/lasso_vs_ridge_geometry.png')
plt.show()

Key insight: Lasso's $L^1$ constraint has corners on the axes. When the loss contour touches a corner, that coefficient becomes exactly zero.
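The same effect can be checked numerically in a special case. The sketch below assumes an orthonormal design ($X^T X = I$) and the penalised objectives $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ for Lasso and $\frac{1}{2}\|y - X\beta\|_2^2 + \frac{\lambda}{2}\|\beta\|_2^2$ for Ridge. Under those assumptions both solutions are simple functions of the OLS coefficients: Lasso soft-thresholds them and Ridge rescales them. The helper names and the value of `lam` are hypothetical.

```python
import numpy as np

def soft_threshold(b, lam):
    """Lasso solution for an orthonormal design: shift towards zero, clip at zero."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def ridge_shrink(b, lam):
    """Ridge solution for an orthonormal design: uniform rescaling."""
    return b / (1.0 + lam)

# Illustrative OLS coefficients (the same point used in the plot above)
beta_ols = np.array([0.7, 0.3])
lam = 0.4  # assumed penalty strength

beta_lasso = soft_threshold(beta_ols, lam)
beta_ridge = ridge_shrink(beta_ols, lam)

print("OLS:  ", beta_ols)
print("Lasso:", beta_lasso)   # the smaller coefficient is set exactly to zero
print("Ridge:", beta_ridge)   # both coefficients shrink, neither is zero
```

Any coefficient whose OLS magnitude is below $\lambda$ is removed by soft-thresholding, whereas the Ridge rescaling never produces an exact zero from a non-zero input.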

Principal Component Regression (PCR)¶

Principal Component Regression combines the dimensionality reduction from Notebook 4 with linear regression. The idea is simple:

  1. Compute the principal components of $X$ (via SVD on centered data).
  2. Keep only the top $k$ components (those with largest singular values).
  3. Regress $y$ on these $k$ components.

Linear algebra perspective: We project $X$ onto its best rank-$k$ approximation (in Frobenius norm) and then solve a least-squares problem in the reduced space. This is different from Ridge:

  • Ridge shrinks all directions but keeps them.
  • PCR discards the smallest singular directions entirely.

PCR is particularly useful when:

  • Features are highly correlated (multicollinearity).
  • You want interpretable, low-dimensional representations.
  • The signal lives in the top principal components while noise dominates the rest.

The tradeoff: if the target $y$ is correlated with a small singular direction, PCR will discard useful information.
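The three steps above can be sketched directly with the SVD. This is a minimal illustration on synthetic, correlated features (the data and the helper `pcr_coefficients` are assumptions for the example, not the notebook's implementation). Because the principal component scores $U_k \Sigma_k$ have orthogonal columns, the least-squares step reduces to a division by the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 5, 2

# Synthetic, strongly correlated features: rank-2 signal plus small noise
Z = rng.normal(size=(n, 2))
X = Z @ rng.normal(size=(2, p)) + 0.05 * rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + 0.1 * rng.normal(size=n)

# Step 1: centre the data and take the thin SVD, Xc = U diag(s) Vt
Xc, yc = X - X.mean(axis=0), y - y.mean()
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

def pcr_coefficients(k):
    """Regress on the top-k principal components and map back to feature space."""
    # Steps 2-3: scores are U_k diag(s_k); their columns are orthogonal,
    # so the least-squares coefficients are (U_k^T yc) / s_k
    gamma = (U[:, :k].T @ yc) / s[:k]
    return Vt[:k].T @ gamma

beta_pcr = pcr_coefficients(k)
beta_full = pcr_coefficients(p)
beta_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]

print("PCR coefficients (k=2):", np.round(beta_pcr, 3))
print("PCR with k=p equals OLS:", np.allclose(beta_full, beta_ols))
```

With $k = p$ nothing is discarded, so PCR coincides with OLS on the centred data; any benefit from PCR has to come from choosing $k < p$.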

In [19]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Compare PCR with varying number of components
n_components_range = range(1, X_train_scaled.shape[1] + 1)
pcr_scores = []

for n_comp in n_components_range:
    pcr = make_pipeline(
        PCA(n_components=n_comp),
        LinearRegression()
    )
    # Negative MSE (sklearn convention: higher is better)
    scores = cross_val_score(pcr, X_train_scaled, y_train, cv=5, scoring='neg_mean_squared_error')
    pcr_scores.append(-scores.mean())

# Also compute variance explained
pca_full = PCA()
pca_full.fit(X_train_scaled)
var_explained = np.cumsum(pca_full.explained_variance_ratio_)

# Plot
fig, ax1 = plt.subplots(figsize=(10, 5))

ax1.plot(n_components_range, pcr_scores, 'b-o', label='CV MSE')
ax1.set_xlabel('Number of Principal Components')
ax1.set_ylabel('Cross-Validated MSE', color='b')
ax1.tick_params(axis='y', labelcolor='b')

ax2 = ax1.twinx()
ax2.plot(n_components_range, var_explained, 'r--s', label='Variance Explained')
ax2.set_ylabel('Cumulative Variance Explained', color='r')
ax2.tick_params(axis='y', labelcolor='r')
ax2.set_ylim(0, 1.05)

plt.title('Principal Component Regression: Choosing k')
fig.legend(loc='center right', bbox_to_anchor=(0.85, 0.5))
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../images/pcr_components_selection.png')
plt.show()

# Best number of components
best_n_comp = n_components_range[np.argmin(pcr_scores)]
print(f"Best number of components: {best_n_comp}")
print(f"Variance explained: {var_explained[best_n_comp-1]:.2%}")

# Compare with OLS and Ridge
# Note: the PCR value is the 5-fold cross-validated MSE on the training data, not a test-set MSE
print(f"\nModel Comparison (MSE):")
print(f"  OLS (all features, test):  {test_mse:.4f}")
print(f"  PCR (k={best_n_comp}, CV):         {pcr_scores[best_n_comp-1]:.4f}")
print(f"  Ridge (λ={best_alpha:.2f}, test):   {test_mse_ridge:.4f}")
Best number of components: 8
Variance explained: 100.00%

Model Comparison (MSE):
  OLS (all features, test):  0.5381
  PCR (k=8, CV):         0.5272
  Ridge (λ=233.57, test):   0.4791

Gradient Descent: When the Normal Equations Are Not Enough¶

For very large datasets, computing $(X^T X)^{-1}$ or even forming $X^T X$ becomes prohibitive. Gradient descent is an iterative optimisation method that uses only first‑order derivatives.

The Linear Algebra of Convergence¶

The loss function and its gradient:

$$ L(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2, \qquad \nabla L(\beta) = -\frac{1}{n} X^T (y - X\beta). $$

Starting from $\beta^{(0)}$, we update:

$$ \beta^{(t+1)} = \beta^{(t)} - \eta \nabla L(\beta^{(t)}). $$

Convergence depends on the eigenvalues of the Hessian of $L$, which is $H = \frac{1}{n} X^T X$. Let $\lambda_{\max}$ and $\lambda_{\min}$ be its largest and smallest eigenvalues. Then:

  • The learning rate must satisfy $\eta < \frac{2}{\lambda_{\max}}$ for convergence.
  • The convergence rate is governed by the condition number $\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$, which is the same for $H$ and for $X^T X$.
  • When $\kappa$ is large, the loss surface is a narrow valley: the gradient points mostly across the valley rather than along it towards the minimum.

This is why feature scaling matters: it reduces $\kappa$, making the loss surface more spherical and convergence faster.
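The step-size threshold can be checked on a small problem. The sketch below uses synthetic data with two features on different scales (an assumption for illustration) and takes $\lambda_{\max}$ from the Hessian $\frac{1}{n} X^T X$ of the loss defined above. It runs batch gradient descent slightly below and slightly above the critical step $2/\lambda_{\max}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two synthetic features on very different scales
X = np.column_stack([rng.normal(size=n), 10.0 * rng.normal(size=n)])
y = X @ np.array([1.0, 0.2]) + 0.1 * rng.normal(size=n)

# Hessian of L(beta) = (1/2n) ||y - X beta||^2
H = X.T @ X / n
eigs = np.linalg.eigvalsh(H)          # returned in ascending order
lam_min, lam_max = eigs[0], eigs[-1]
eta_crit = 2.0 / lam_max

def final_loss(eta, n_iter=300):
    """Run batch gradient descent from zero and return the final loss."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -(X.T @ (y - X @ beta)) / n
        beta -= eta * grad
    return np.sum((y - X @ beta) ** 2) / (2 * n)

loss_below = final_loss(0.9 * eta_crit)
loss_above = final_loss(1.1 * eta_crit)

print(f"kappa = {lam_max / lam_min:.1f}, critical step = {eta_crit:.4f}")
print(f"eta = 0.9 * critical: final loss = {loss_below:.4f}")
print(f"eta = 1.1 * critical: final loss = {loss_above:.3e}")
```

Just below the threshold the loss settles near its minimum; just above it the component along the largest eigen-direction grows at every step and the loss blows up.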

In [20]:
# Demonstrate how condition number affects gradient descent convergence
from sklearn.preprocessing import StandardScaler

def gradient_descent_linear(X, y, learning_rate=0.01, n_iter=1000, verbose=False):
    """Batch gradient descent for linear regression."""
    n, p = X.shape
    beta = np.zeros(p)
    losses = []
    for i in range(n_iter):
        residual = y - X @ beta
        grad = - (1/n) * X.T @ residual
        beta -= learning_rate * grad
        loss = (1/(2*n)) * np.linalg.norm(residual)**2
        losses.append(loss)
    return beta, losses

# Use a subset for illustration
X_subset = X_train[:1000]
y_subset = y_train[:1000]

# Add intercept
X_subset_aug = np.hstack([np.ones((X_subset.shape[0], 1)), X_subset])

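# Compute eigenvalues of X^T X
# Note: the loss carries a 1/n factor, so its Hessian is (X^T X) / n. The condition
# number is unaffected, but the step-size bound printed below, 2 / lambda_max(X^T X),
# is conservative by a factor of n compared with 2 / lambda_max(Hessian).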
eigenvalues = np.linalg.eigvalsh(X_subset_aug.T @ X_subset_aug)
lambda_max, lambda_min = eigenvalues.max(), eigenvalues[eigenvalues > 1e-10].min()
cond_num = lambda_max / lambda_min

print(f"Eigenvalue range: [{lambda_min:.2e}, {lambda_max:.2e}]")
print(f"Condition number: {cond_num:.2e}")
print(f"Max stable learning rate: {2/lambda_max:.2e}")

# Try gradient descent with different learning rates on UNSCALED data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# UNSCALED
learning_rates = [1e-10, 1e-9, 1e-8]
for lr in learning_rates:
    _, losses = gradient_descent_linear(X_subset_aug, y_subset, learning_rate=lr, n_iter=200)
    axes[0].plot(losses, label=f'η = {lr:.0e}')
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Loss (MSE)')
axes[0].set_title(f'Unscaled Data (κ = {cond_num:.1e})')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_yscale('log')

# SCALED
scaler = StandardScaler()
X_subset_scaled = scaler.fit_transform(X_subset)
X_subset_scaled_aug = np.hstack([np.ones((X_subset_scaled.shape[0], 1)), X_subset_scaled])

eigenvalues_scaled = np.linalg.eigvalsh(X_subset_scaled_aug.T @ X_subset_scaled_aug)
lambda_max_s, lambda_min_s = eigenvalues_scaled.max(), eigenvalues_scaled[eigenvalues_scaled > 1e-10].min()
cond_num_scaled = lambda_max_s / lambda_min_s

learning_rates_scaled = [0.001, 0.01, 0.1]
for lr in learning_rates_scaled:
    _, losses = gradient_descent_linear(X_subset_scaled_aug, y_subset, learning_rate=lr, n_iter=200)
    axes[1].plot(losses, label=f'η = {lr}')
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Loss (MSE)')
axes[1].set_title(f'Scaled Data (κ = {cond_num_scaled:.1f})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_yscale('log')

plt.tight_layout()
plt.savefig('../images/gd_condition_number_effect.png')
plt.show()

print(f"\nScaling reduced condition number from {cond_num:.1e} to {cond_num_scaled:.1f}")
print("This allows much larger learning rates and faster convergence.")
Eigenvalue range: [5.59e-02, 3.07e+09]
Condition number: 5.49e+10
Max stable learning rate: 6.52e-10
Scaling reduced condition number from 5.5e+10 to 42.2
This allows much larger learning rates and faster convergence.

Let's apply gradient descent to our housing data.

In [21]:
# Implement batch gradient descent for linear regression on a small subset for illustration
def gradient_descent_linear(X, y, learning_rate=0.01, n_iter=1000, verbose=False):
    n, p = X.shape
    beta = np.zeros(p)
    losses = []
    for i in range(n_iter):
        grad = - (1/n) * X.T @ (y - X @ beta)
        beta -= learning_rate * grad
        loss = (1/(2*n)) * np.linalg.norm(y - X @ beta)**2
        losses.append(loss)
        if verbose and i % 200 == 0:
            print(f"Iter {i}: loss = {loss:.6f}")
    return beta, losses

# Use a small subset for speed
X_small = X_train[:1000]
y_small = y_train[:1000]

# Add intercept column
X_small_aug = np.hstack([np.ones((X_small.shape[0], 1)), X_small])

beta_gd, losses = gradient_descent_linear(X_small_aug, y_small, learning_rate=0.01, n_iter=500)

plt.figure(figsize=(8,4))
plt.plot(losses)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Gradient Descent Convergence')
plt.grid(True)
plt.savefig('../images/gradient_descent_convergence_unscaled')
plt.show()

# Compare with closed-form solution on the same subset
beta_closed = np.linalg.lstsq(X_small_aug, y_small, rcond=None)[0]
print(f"Difference between GD and closed-form: {np.linalg.norm(beta_gd - beta_closed):.2e}")
/usr/lib64/python3.14/site-packages/numpy/linalg/_linalg.py:2792: RuntimeWarning: overflow encountered in dot
  sqnorm = x.dot(x)
/tmp/ipykernel_66479/3132444904.py:7: RuntimeWarning: overflow encountered in matmul
  grad = - (1/n) * X.T @ (y - X @ beta)
/tmp/ipykernel_66479/3132444904.py:8: RuntimeWarning: invalid value encountered in subtract
  beta -= learning_rate * grad
Difference between GD and closed-form: nan

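Again, scaling is the cause of the failure. On the unscaled features the largest eigenvalue of $X^T X$ is of order $10^9$, so a learning rate of 0.01 is far outside the stable range: the iterates grow at every step until they overflow, and the final result is NaN.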

In [22]:
from sklearn.preprocessing import StandardScaler

# 1. Prepare Data
# Use a small subset for speed
X_small = X_train[:1000].copy() # Use .copy() to avoid SettingWithCopyWarning
y_small = y_train[:1000].copy()

# 2. SCALE THE FEATURES (Critical for Gradient Descent!)
scaler = StandardScaler()
X_small_scaled = scaler.fit_transform(X_small)

# Add intercept column AFTER scaling
# (We don't scale the intercept column, it stays as 1s)
X_small_aug = np.hstack([np.ones((X_small_scaled.shape[0], 1)), X_small_scaled])

# 3. Run Gradient Descent
def gradient_descent_linear(X, y, learning_rate=0.01, n_iter=1000, verbose=False):
    n, p = X.shape
    beta = np.zeros(p)
    losses = []
    for i in range(n_iter):
        # Predict
        prediction = X @ beta
        # Residual
        residual = y - prediction
        # Gradient
        grad = - (1/n) * X.T @ residual
        # Update
        beta -= learning_rate * grad
        
        # Calculate Loss (MSE)
        loss = (1/(2*n)) * np.linalg.norm(residual)**2
        losses.append(loss)
        
        if verbose and i % 200 == 0:
            print(f"Iter {i}: loss = {loss:.6f}")
    return beta, losses

# With scaled data, learning_rate=0.01 or even 0.1 is usually safe
beta_gd, losses = gradient_descent_linear(X_small_aug, y_small, learning_rate=0.1, n_iter=500, verbose=True)

# Plot convergence
plt.figure(figsize=(8,4))
plt.plot(losses)
plt.xlabel('Iteration')
plt.ylabel('Loss (MSE)')
plt.title('Gradient Descent Convergence (Scaled Data)')
plt.grid(True)
plt.savefig('../images/gradient_descent_convergence_scaled')
plt.show()
Iter 0: loss = 2.758919
Iter 200: loss = 0.234124
Iter 400: loss = 0.230774

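For large datasets, we usually switch to stochastic or mini‑batch gradient descent, which estimate the gradient from a single sample or a small batch at each update. sklearn's SGDRegressor implements the stochastic variant with various loss functions and penalties.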

In [23]:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# SGDRegressor is sensitive to feature scaling, so we use a pipeline
# penalty=None means no regularization (standard Linear Regression)
sgd_reg = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty=None, learning_rate='constant', eta0=0.01, max_iter=1000, random_state=42)
)

sgd_reg.fit(X_train, y_train)

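# Note: with a constant step size, SGD can diverge when the scaled features still
# contain extreme values (outliers). Coefficients of order 1e10, as printed below,
# indicate that this run diverged; a smaller eta0 or the default
# learning_rate='invscaling' is a safer choice.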
print(f"Coefficients: {sgd_reg.named_steps['sgdregressor'].coef_}")
Coefficients: [ 3.97676073e+09 -1.14418633e+10 -1.78357850e+10  1.01065426e+11
 -1.80378121e+10 -3.02815983e+09 -5.43520408e+10 -4.51215845e+10]
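The mini‑batch variant mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic, already standardised data; the function `minibatch_gd`, its default hyperparameters, and the data are assumptions for the example rather than what SGDRegressor does internally.

```python
import numpy as np

def minibatch_gd(X, y, learning_rate=0.05, batch_size=32, n_epochs=50, seed=0):
    """Mini-batch gradient descent for least squares; X should already be scaled."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - X[idx] @ beta
            # Gradient estimated from the current batch only
            grad = -(X[idx].T @ residual) / len(idx)
            beta -= learning_rate * grad
    return beta

# Synthetic data with known coefficients
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(2000, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ beta_true + 0.1 * rng.normal(size=2000)

beta_mb = minibatch_gd(X_demo, y_demo)
beta_ls = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print("Mini-batch GD:", np.round(beta_mb, 3))
print("Least squares:", np.round(beta_ls, 3))
```

With a constant step size the iterates do not converge exactly; they fluctuate in a small neighbourhood of the least-squares solution, which is why decaying learning-rate schedules are common.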

Decision Trees and Random Forests¶

Linear models assume a linear relationship. Decision trees are non‑parametric models that partition the feature space into rectangular regions and assign a constant prediction (or a simple model) in each region. The prediction function is piecewise constant. The basis functions are indicator functions of the leaves. While not linear in the original features, the model is linear in the (high‑dimensional) leaf‑indicator basis.
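The claim that a tree is linear in the leaf‑indicator basis can be verified directly. The sketch below, on synthetic data (an assumption for illustration), builds the indicator matrix $\Phi$ from the leaf index of each training sample and solves an ordinary least-squares problem in that basis. For a regression tree trained with squared error, the resulting fit should reproduce the tree's own predictions on the training data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))                 # synthetic features
y_demo = np.sin(X_demo[:, 0]) + 0.5 * (X_demo[:, 1] > 0)   # nonlinear target

tree_demo = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# apply() returns the index of the leaf each sample falls into
leaf_ids = tree_demo.apply(X_demo)
leaves = np.unique(leaf_ids)

# Indicator basis: Phi[i, j] = 1 if sample i lies in leaf j, else 0
Phi = (leaf_ids[:, None] == leaves[None, :]).astype(float)

# Least squares in this basis yields the mean of y within each leaf
w = np.linalg.lstsq(Phi, y_demo, rcond=None)[0]
same = np.allclose(Phi @ w, tree_demo.predict(X_demo))

print("Number of leaves:", len(leaves))
print("Tree prediction equals Phi @ w:", same)
```

The columns of $\Phi$ are mutually orthogonal, so the least-squares coefficients are just per-leaf averages. The nonlinearity of the tree lies entirely in how the partition (and hence $\Phi$) is chosen.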

Random forests combine many decision trees, each trained on a bootstrapped sample and a random subset of features. They reduce variance (overfitting) and often outperform single trees.

When to use trees / forests:

  • Nonlinear relationships with interactions.
  • When interpretability is desired (a single tree can be visualised).
  • When you have mixed categorical and continuous features.
  • As a strong baseline before trying deep learning.

In [24]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Single decision tree (max depth 10)
tree = DecisionTreeRegressor(max_depth=10, random_state=RANDOM_STATE)
tree.fit(X_train, y_train)
y_test_pred_tree = tree.predict(X_test)
test_mse_tree = mean_squared_error(y_test, y_test_pred_tree)

# Random forest (100 trees)
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=RANDOM_STATE, n_jobs=-1)
rf.fit(X_train, y_train)
y_test_pred_rf = rf.predict(X_test)
test_mse_rf = mean_squared_error(y_test, y_test_pred_rf)

print(f"Decision Tree Test MSE: {test_mse_tree:.4f}")
print(f"Random Forest Test MSE: {test_mse_rf:.4f}")

# Compare with best linear model
print(f"Ridge (poly) Test MSE: {test_mse_ridge:.4f}")
print(f"LassoCV Test MSE:      {test_mse_lasso_cv:.4f}")
Decision Tree Test MSE: 0.3961
Random Forest Test MSE: 0.2752
Ridge (poly) Test MSE: 0.4791
LassoCV Test MSE:      0.5305

Random forests often outperform linear models on complex real‑world data without requiring feature engineering or scaling.

Logistic Regression for Classification¶

So far we have focused on regression (continuous targets). For binary classification (e.g., spam vs. not spam), logistic regression is a natural extension. It models the probability that an observation belongs to a class using the logistic (sigmoid) function:

$$ P(y=1 \mid x) = \frac{1}{1 + e^{-x^T\beta}}. $$

The decision boundary is linear in the features: $x^T\beta = 0$. The model is fitted by maximum likelihood estimation, which is equivalent to minimising the log‑loss (cross‑entropy). There is no closed‑form solution; we typically use gradient descent or Newton's method.
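Since there is no closed form, the fit is iterative. As a minimal sketch, the mean log‑loss has gradient $\frac{1}{n} X^T(\sigma(X\beta) - y)$, which has the same shape as the least-squares gradient with the prediction $X\beta$ replaced by $\sigma(X\beta)$. The code below applies plain batch gradient descent to synthetic data drawn from a logistic model; the function names, learning rate, and coefficients are assumptions for the example, and sklearn's LogisticRegression uses different solvers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, learning_rate=0.5, n_iter=2000):
    """Minimise the mean log-loss by batch gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        p_hat = sigmoid(X @ beta)
        grad = X.T @ (p_hat - y) / n      # gradient of the mean log-loss
        beta -= learning_rate * grad
    return beta

# Synthetic data: intercept column plus two standard-normal features
rng = np.random.default_rng(0)
n = 2000
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 2.0, -1.0])   # hypothetical true coefficients
y_demo = (rng.uniform(size=n) < sigmoid(X_demo @ beta_true)).astype(float)

beta_hat = fit_logistic_gd(X_demo, y_demo)
accuracy = np.mean((sigmoid(X_demo @ beta_hat) > 0.5) == (y_demo == 1))

print("Estimated beta:", np.round(beta_hat, 2))
print(f"Training accuracy: {accuracy:.3f}")
```

The estimate approaches the generating coefficients as $n$ grows, but the accuracy stays below 100% because the labels are random draws from the modelled probabilities.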

We will illustrate logistic regression on a subset of the California housing data by creating a binary target (e.g., whether the median house value is above the median).

In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create binary target: 1 if house value > median, else 0
# Use the ORIGINAL dataframe to avoid confusion with scaled/transformed versions
y_binary = (df['MedHouseVal'] > df['MedHouseVal'].median()).astype(int).values
X_original = df[feature_names].values  # original features, not overwritten

# Split
X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(
    X_original, y_binary, test_size=0.2, random_state=RANDOM_STATE
)

# Scale features (important for logistic regression with regularization)
scaler_bin = StandardScaler()
X_train_bin_scaled = scaler_bin.fit_transform(X_train_bin)
X_test_bin_scaled = scaler_bin.transform(X_test_bin)

# Train logistic regression
log_reg = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)
log_reg.fit(X_train_bin_scaled, y_train_bin)

# Predict
y_pred_bin = log_reg.predict(X_test_bin_scaled)
accuracy = accuracy_score(y_test_bin, y_pred_bin)

print(f"Logistic Regression Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test_bin, y_pred_bin))

# Coefficients (on scaled features)
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': log_reg.coef_[0]
}).sort_values('Coefficient', key=abs, ascending=False)
print("\nLogistic Regression Coefficients (scaled features):")
print(coef_df)
Logistic Regression Accuracy: 0.8324

Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.84      0.83      2083
           1       0.83      0.83      0.83      2045

    accuracy                           0.83      4128
   macro avg       0.83      0.83      0.83      4128
weighted avg       0.83      0.83      0.83      4128


Logistic Regression Coefficients (scaled features):
      Feature  Coefficient
6    Latitude    -3.532385
7   Longitude    -3.328767
5    AveOccup    -3.094635
0      MedInc     2.512414
3   AveBedrms     0.899888
2    AveRooms    -0.786384
1    HouseAge     0.276838
4  Population     0.062450

Cross‑Validation: A Deeper Look¶

Cross‑validation (CV) is a technique for assessing how well a model generalises to unseen data. Instead of a single train/validation split, we partition the training data into $k$ folds (typically 5 or 10). For each fold $i$, we train on the other $k-1$ folds and validate on fold $i$. The performance is averaged over the $k$ folds.
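The procedure just described can be written out by hand in a few lines. The sketch below uses a hypothetical helper `manual_kfold_r2` on synthetic data; it mirrors what `cross_val_score` automates and is meant only to show the mechanics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def manual_kfold_r2(X, y, k=5, seed=0):
    """Return the k validation R^2 scores of a linear model (illustrative helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # shuffle the sample indices once
    folds = np.array_split(idx, k)         # k roughly equal index blocks
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))
    return np.array(scores)

# Synthetic data (hypothetical): linear signal plus noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

fold_scores = manual_kfold_r2(X_demo, y_demo, k=5)
print("fold R^2:", np.round(fold_scores, 3))
print("mean R^2:", round(float(fold_scores.mean()), 3))
```

Each observation appears in exactly one validation fold, so every data point contributes to the performance estimate once.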

In [26]:
# Illustrate 5-fold cross-validation
from sklearn.model_selection import KFold

n_points = 20
X_cv = np.arange(n_points).reshape(-1, 1)
colors = plt.cm.tab10(np.linspace(0, 1, 5))

kf = KFold(n_splits=5, shuffle=True, random_state=3)

fig, axes = plt.subplots(5, 1, figsize=(12, 8))

for i, (train_idx, test_idx) in enumerate(kf.split(X_cv)):
    ax = axes[i]
    
    # Plot all points
    for j in range(n_points):
        if j in test_idx:
            ax.scatter(j, 0, s=200, c='red', marker='s', label='Test' if j == test_idx[0] else '')
        else:
            ax.scatter(j, 0, s=200, c='blue', marker='o', label='Train' if j == train_idx[0] else '')
    
    ax.set_xlim(-1, n_points)
    ax.set_ylim(-0.5, 0.5)
    ax.set_yticks([])
    ax.set_ylabel(f'Fold {i+1}', rotation=0, labelpad=30)
    
    if i == 0:
        ax.legend(loc='upper right', ncol=2)
    if i < 4:
        ax.set_xticks([])

axes[-1].set_xlabel('Sample Index')
axes[0].set_title('5-Fold Cross-Validation', pad=20)  # title above the top panel

plt.tight_layout()
plt.savefig('../images/cross_validation_illustration.png')
plt.show()
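Figure: 5-fold cross-validation illustration. Each row is one fold; red squares mark the test samples and blue circles the training samples (x-axis: Sample Index).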

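Why cross‑validate? Averaging over $k$ folds gives a less noisy performance estimate than a single train/validation split and makes better use of limited data, since every observation is used for both training and validation. It is also essential for hyperparameter tuning (as we did with Ridge and Lasso).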

We already used cross_val_score above. Here's an explicit example with a linear model on the housing data.

In [27]:
from sklearn.model_selection import cross_val_score, KFold

# 5-fold CV on linear regression
lin_reg_cv = LinearRegression()
scores = cross_val_score(lin_reg_cv, X_train, y_train, cv=5, scoring='r2')
print(f"5-fold CV R² scores: {scores}")
print(f"Mean R²: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

# We can also use a custom cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scores_shuffled = cross_val_score(lin_reg_cv, X_train, y_train, cv=kf, scoring='r2')
print(f"Shuffled CV R² scores: {scores_shuffled}")
print(f"Mean R² (shuffled): {scores_shuffled.mean():.4f}")
5-fold CV R² scores: [0.60709214 0.59544452 0.58112984 0.63060861 0.61005689]
Mean R²: 0.6049 (+/- 0.0328)
Shuffled CV R² scores: [0.60563739 0.59602593 0.5917264  0.61941109 0.62268184]
Mean R² (shuffled): 0.6071

Feature Scaling¶

Many machine learning algorithms are sensitive to the scale of features. For example:

  • Gradient descent converges faster when features are on similar scales.
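  • Regularisation (Ridge, Lasso) applies the same penalty to every coefficient; if features are on different scales, the penalty falls unevenly, shrinking the coefficients of small‑scale features (which need large coefficients) far more than those of large‑scale features.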
  • Distance‑based methods (k‑nearest neighbours, SVM with RBF kernel) assume all features are comparable.

Linear algebra view: Rescaling multiplies each column of $X$ by a positive scalar, i.e. $X \mapsto XD$ for a positive diagonal matrix $D$ (standardisation also subtracts the column mean first). This changes the condition number and the geometry of the optimisation landscape.

Common scaling techniques:

  • Standardisation (Z‑score): $x' = \frac{x - \mu}{\sigma}$ (mean 0, variance 1).
  • Min‑max scaling: $x' = \frac{x - \min}{\max - \min}$ (range [0,1]).

We should always fit the scaler on the training set and then transform both train and test sets to avoid data leakage.
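One way to enforce this rule inside cross-validation is to bundle the scaler and the model in a scikit-learn pipeline, so that the scaler is refit on the training part of every fold. The sketch below uses synthetic data and an arbitrary `alpha`; both are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical) with features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([1.0, 0.02, 50.0]) + rng.normal(scale=0.5, size=300)

# The scaler is fitted only on the training folds inside each CV split,
# so no statistics from the validation fold leak into the preprocessing
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
print("CV R^2 per fold:", np.round(pipe_scores, 3))
```

Calling `fit_transform` on the full training set before cross-validating would instead let each validation fold influence the mean and standard deviation used to scale it.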

In [28]:
from sklearn.preprocessing import StandardScaler

# Create scaler
scaler = StandardScaler()

# Fit on training data only
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare condition number before and after scaling
X_train_aug = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
X_train_scaled_aug = np.hstack([np.ones((X_train_scaled.shape[0], 1)), X_train_scaled])

print(f"Condition number (original): {cond(X_train_aug):.2e}")
print(f"Condition number (scaled):   {cond(X_train_scaled_aug):.2e}")

# Fit linear regression on scaled data
lin_reg_scaled = LinearRegression()
lin_reg_scaled.fit(X_train_scaled, y_train)
y_test_pred_scaled = lin_reg_scaled.predict(X_test_scaled)
test_mse_scaled = mean_squared_error(y_test, y_test_pred_scaled)
print(f"Linear regression (scaled) Test MSE: {test_mse_scaled:.4f}")
print(f"Linear regression (original) Test MSE: {test_mse:.4f}")
Condition number (original): 2.40e+05
Condition number (scaled):   6.48e+00
Linear regression (scaled) Test MSE: 0.5381
Linear regression (original) Test MSE: 0.5381

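Scaling did not change the linear regression performance: the predictions of OLS with an intercept are invariant under shifting and rescaling of the features (the coefficients adjust accordingly). However, scaling improves numerical conditioning, as the drop in condition number from $2.40 \times 10^5$ to about $6.5$ shows, and it is crucial for regularised models and gradient descent.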
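The invariance can be checked with a short calculation. Standardising the features replaces the augmented matrix $\tilde{X}$ by $\tilde{X}' = \tilde{X}A$ for an invertible matrix $A$ (the symbol $A$ is introduced here for illustration; it encodes the column shifts and rescalings). Assuming $\tilde{X}$ has full column rank, the normal equations give

$$ \tilde{\beta}' = (A^T\tilde{X}^T\tilde{X}A)^{-1}A^T\tilde{X}^Ty = A^{-1}(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^Ty = A^{-1}\tilde{\beta}, \qquad \tilde{X}'\tilde{\beta}' = \tilde{X}AA^{-1}\tilde{\beta} = \tilde{X}\tilde{\beta}. $$

The fitted values are therefore identical, which is why the two test MSE values agree. The same argument does not carry over to Ridge or Lasso, because the penalty on $\tilde{\beta}'$ is not the same as the penalty on $\tilde{\beta}$.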

Model Interpretation¶

Interpretability is important in many applications. Different models offer different levels of insight.

Linear Models (Ridge, Lasso)¶

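  • Coefficients give the change in the prediction per unit change in a feature, holding the others fixed; magnitudes are comparable across features only when the features are scaled.
  • Sign and magnitude indicate the direction and strength of the association, but should be read with care when features are correlated, since the coefficients then become unstable.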

Decision Trees¶

  • We can visualise the tree structure.
  • Feature importance based on how much each feature reduces impurity (e.g., variance for regression, Gini for classification).

Random Forests¶

  • Aggregate feature importance across all trees.
  • Can also use SHAP or LIME for local explanations.

Let's examine coefficients from a scaled linear model and feature importance from a random forest.

In [29]:
# Train Ridge on scaled data (with default alpha)
ridge_scaled = Ridge(alpha=1.0)
ridge_scaled.fit(X_train_scaled, y_train)

# Display coefficients
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': ridge_scaled.coef_
})
print("Ridge coefficients (scaled features):")
print(coef_df.sort_values('Coefficient', key=abs, ascending=False))

# Random forest feature importance
rf.fit(X_train, y_train)  # rf was fitted earlier; refit here so this cell runs on its own
importances = rf.feature_importances_
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values('Importance', ascending=False)

print("\nRandom Forest Feature Importances:")
print(importance_df)

# Plot
plt.figure(figsize=(8,4))
plt.barh(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Importance')
plt.title('Random Forest Feature Importance')
plt.gca().invert_yaxis()
plt.savefig('../images/RF_feature_importance.png')
plt.show()
Ridge coefficients (scaled features):
      Feature  Coefficient
6    Latitude    -0.896656
7   Longitude    -0.870257
0      MedInc     0.848402
3   AveBedrms     0.332536
2    AveRooms    -0.287161
1    HouseAge     0.125807
5    AveOccup    -0.040522
4  Population    -0.002190

Random Forest Feature Importances:
      Feature  Importance
0      MedInc    0.589486
5    AveOccup    0.137379
6    Latitude    0.078123
7   Longitude    0.077486
1    HouseAge    0.047525
2    AveRooms    0.034377
4  Population    0.018634
3   AveBedrms    0.016990
Figure: horizontal bar chart of the random forest feature importances listed above (x-axis: Importance), with MedInc at the top.
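Impurity-based importances are computed from the training data, so a complementary check is permutation importance on held-out data: shuffle one feature at a time and measure how much the score drops. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data; the dataset and model settings are assumptions for illustration and are separate from the notebook's fitted `rf`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): only the first two features carry signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 4))
y_demo = 3.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.1, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
rf_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Mean drop in R^2 on the test set when each feature is shuffled
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=0)
for j in np.argsort(perm.importances_mean)[::-1]:
    print(f"feature {j}: {perm.importances_mean[j]:.3f} +/- {perm.importances_std[j]:.3f}")
```

Features that carry no signal should show a drop close to zero, which makes this a useful sanity check alongside `feature_importances_`.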

Hyperparameter Tuning with Grid Search¶

Most models have hyperparameters that are set before training rather than learned from the data (e.g., alpha in Ridge, max_depth in trees, n_estimators in random forests). Tuning them properly is essential for good performance. From the linear algebra viewpoint, choosing a hyperparameter is like choosing a basis or a regularisation parameter: it changes the optimisation problem itself, and hence the set of solutions the model can reach.

Grid search exhaustively tries a predefined set of hyperparameter combinations using cross‑validation. sklearn.model_selection.GridSearchCV does this efficiently.
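Conceptually, grid search is a loop over the Cartesian product of the candidate values, scoring each combination by cross-validation. The sketch below spells that loop out on synthetic data with a decision tree; the grid values and names such as `grid_demo` are hypothetical, and `GridSearchCV` adds refitting, parallelism, and bookkeeping on top of this.

```python
import itertools
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): a smooth nonlinear target
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.2, size=400)

grid_demo = {"max_depth": [2, 5, None], "min_samples_split": [2, 10]}

results = []
for values in itertools.product(*grid_demo.values()):
    params = dict(zip(grid_demo.keys(), values))
    model = DecisionTreeRegressor(random_state=0, **params)
    # Mean negative MSE over 3 folds; higher (closer to zero) is better
    score = cross_val_score(model, X_demo, y_demo, cv=3,
                            scoring="neg_mean_squared_error").mean()
    results.append((score, params))

best_score, best_params = max(results, key=lambda r: r[0])
print("best params:", best_params)
print("best CV MSE:", -best_score)
```

The number of fits is the product of the grid sizes times the number of folds (here 3 × 2 × 3 = 18), which is why grids are usually kept small.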

Let's tune a random forest regressor on the housing data.

In [30]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5]
}

# Create random forest
rf_tune = RandomForestRegressor(random_state=42, n_jobs=-1)

# Grid search with 3-fold CV (use a subset of training data for speed)
X_train_subset = X_train[:5000]
y_train_subset = y_train[:5000]

grid_search = GridSearchCV(rf_tune, param_grid, cv=3, scoring='neg_mean_squared_error', verbose=1)
grid_search.fit(X_train_subset, y_train_subset)

print("Best parameters:", grid_search.best_params_)
print("Best CV MSE:", -grid_search.best_score_)

# Evaluate on test set (best_estimator_ was refit on the 5,000-row subset only)
best_rf = grid_search.best_estimator_
y_test_pred_best_rf = best_rf.predict(X_test)
test_mse_best_rf = mean_squared_error(y_test, y_test_pred_best_rf)
print(f"Tuned Random Forest Test MSE: {test_mse_best_rf:.4f}")
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}
Best CV MSE: 0.3210370034255883
Tuned Random Forest Test MSE: 0.2928

Summary and Additional Considerations¶

We have covered a progression of modelling techniques and essential practices:

| Method | Linearity | Regularisation | Feature selection | Scalability |
|---|---|---|---|---|
| Linear regression | Yes | No | No | Good (closed‑form) |
| Polynomial regression | In features | No | No | Poor (exploding dimension) |
| Ridge | Yes | $L^2$ | No (shrinks only) | Good |
| Lasso | Yes | $L^1$ | Yes | Good (via coordinate descent) |
| Logistic regression | Decision boundary linear | Optional | With $L^1$ penalty | Good |
| Gradient descent | Yes (or any differentiable) | Optional | Optional | Excellent (very large data) |
| Decision trees | No | No (but depth limits) | Implicitly | Moderate |
| Random forests | No | No (ensemble reduces variance) | Implicitly | Moderate (parallelisable) |

Bias–variance tradeoff: Simple models (linear) have high bias but low variance. Complex models (deep trees) have low bias but high variance. Regularisation and ensembles (random forests) try to balance this.
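A quick way to see the tradeoff is to vary model complexity and compare training and test error. The sketch below does this for decision trees of increasing depth on synthetic data; the data-generating function and the depth values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

train_mse_by_depth, test_mse_by_depth = {}, {}
for depth in [1, 3, 6, None]:               # None lets the tree grow fully
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_mse_by_depth[depth] = mean_squared_error(y_tr, tree.predict(X_tr))
    test_mse_by_depth[depth] = mean_squared_error(y_te, tree.predict(X_te))
    print(f"max_depth={depth}: train MSE={train_mse_by_depth[depth]:.3f}, "
          f"test MSE={test_mse_by_depth[depth]:.3f}")
```

Training error keeps falling as depth grows, while a fully grown tree reproduces the training noise and pays for it on the test set; that gap is the variance side of the tradeoff.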

What else could be added?

  • Support vector machines (SVM) – geometric margin classifiers.
  • Neural networks – highly flexible nonlinear models.
  • Time series models (ARIMA, etc.).
  • Model selection criteria (AIC, BIC).