{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "833dad45",
   "metadata": {},
   "source": [
    "# Kilter Board: Predictive Modelling\n",
    "\n",
    "With the feature matrix built in notebook 04, we now turn to the central modelling question: how accurately can we predict climb difficulty from engineered features?\n",
    "\n",
    "## Modelling Approach\n",
    "\n",
    "We fit and compare several regression models on the hold-out test set:\n",
    "\n",
    "1. **Linear models** \n",
    "   Linear Regression, Ridge, and Lasso serve as interpretable baselines. Coefficients reveal which features the model relies on most.\n",
    "\n",
    "2. **Tree-based models** \n",
    "   Random Forest is the primary workhorse. It handles nonlinear relationships naturally and provides feature importance scores. A tuned variant with deeper trees and more estimators serves as the final classical model.\n",
    "\n",
    "3. **Gradient Boosting** \n",
    "   We compare Gradient Boosting against Random Forest to assess whether boosting yields improved predictive performance over bagging.\n",
    "\n",
    "## Evaluation Framework\n",
    "\n",
    "We evaluate models on two levels:\n",
    "\n",
    "- **Fine-grained difficulty scores** \n",
    "  The raw `display_difficulty` values, scored by regression metrics and by accuracy within ±1 or ±2 points.\n",
    "\n",
    "- **Grouped V-grades** \n",
    "  Fine-grained scores are mapped to V-grade buckets. This is the more practical metric: being off by half a grade is usually acceptable, while being off by two full grades is not.\n",
    "\n",
    "## Output\n",
    "\n",
    "The final products are trained models saved as joblib files, test set predictions for ensemble comparison in notebook 06, and a summary of model performance across all metrics.\n",
    "\n",
    "## Notebook Structure\n",
    "\n",
    "1. [Setup and Imports](#setup-and-imports)\n",
    "2. [Load Feature Data](#load-feature-data)\n",
    "3. [Training/Test Split](#trainingtest-split)\n",
    "4. [Regression](#regression)\n",
    "5. [Random Forest](#random-forest)\n",
    "6. [Comparing Models](#comparing-models)\n",
    "7. [Saving Models](#saving-models)\n",
    "8. [Conclusion](#conclusion)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33fdcba8",
   "metadata": {},
   "source": [
    "# Setup and Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8364a1c",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Setup and Imports\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "# Imports\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import numpy as np\n",
    "import matplotlib.patches as mpatches\n",
    "\n",
    "from scipy.spatial import ConvexHull\n",
    "from scipy.spatial.distance import pdist, squareform\n",
    "\n",
    "import sqlite3\n",
    "\n",
    "import re\n",
    "import os\n",
    "from collections import defaultdict\n",
    "\n",
    "import ast\n",
    "\n",
    "from sklearn.model_selection import train_test_split, cross_val_score, KFold\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet\n",
    "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n",
    "from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "from PIL import Image\n",
    "\n",
    "# Set some display options\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.max_rows', 100)\n",
    "\n",
    "# Set style\n",
    "palette = ['steelblue', 'coral', 'seagreen']  # (for multi-bar graphs)\n",
    "\n",
    "# Set board image for some visual analysis\n",
    "board_img = Image.open('../images/kilter-original-16x12_compose.png')\n",
    "\n",
    "# Connect to the database\n",
    "DB_PATH = \"../data/kilter.db\"\n",
    "conn = sqlite3.connect(DB_PATH)\n",
    "\n",
    "# Set random state\n",
    "RANDOM_STATE = 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2830cfab",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Query our data from the DB\n",
    "==================================\n",
    "\n",
    "This time we restrict to `layout_id=1`, the Kilter Original.\n",
    "We also restrict ourselves to an angle of at most 55, since according to our grade vs angle distribution in notebook 01, things start to look a bit weird past 50\n",
    "(probably a bias towards climbers who can actually climb that steep). We encode both restrictions directly into our query.\n",
    "\"\"\"\n",
    "\n",
    "# Query climbs data\n",
    "climbs_query = \"\"\"\n",
    "SELECT\n",
    "    c.uuid,\n",
    "    c.name AS climb_name,\n",
    "    c.setter_username,\n",
    "    c.layout_id AS layout_id,\n",
    "    c.description,\n",
    "    c.is_nomatch,\n",
    "    c.is_listed,\n",
    "    l.name AS layout_name,\n",
    "    p.name AS board_name,\n",
    "    c.frames,\n",
    "    cs.angle,\n",
    "    cs.display_difficulty,\n",
    "    dg.boulder_name AS boulder_grade,\n",
    "    cs.ascensionist_count,\n",
    "    cs.quality_average,\n",
    "    cs.fa_at\n",
    "FROM climbs c\n",
    "JOIN layouts l ON c.layout_id = l.id\n",
    "JOIN products p ON l.product_id = p.id\n",
    "JOIN climb_stats cs ON c.uuid = cs.climb_uuid\n",
    "JOIN difficulty_grades dg ON ROUND(cs.display_difficulty) = dg.difficulty\n",
    "WHERE cs.display_difficulty IS NOT NULL AND c.is_listed=1 AND c.layout_id=1 AND cs.angle <= 55 AND cs.fa_at > '2016-01-01'\n",
    "\"\"\"\n",
    "\n",
    "# Query information about placements (and their mirrors)\n",
    "placements_query = \"\"\"\n",
    "SELECT\n",
    "    p.id AS placement_id,\n",
    "    h.x,\n",
    "    h.y,\n",
    "    p.default_placement_role_id AS default_role_id,\n",
    "    p.set_id AS set_id,\n",
    "    s.name AS set_name\n",
    "FROM placements p\n",
    "JOIN holes h ON p.hole_id = h.id\n",
    "JOIN sets s ON p.set_id = s.id\n",
    "WHERE p.layout_id = 1 AND h.y <= 156\n",
    "\"\"\"\n",
    "\n",
    "# Load it into a DataFrame\n",
    "df_climbs = pd.read_sql_query(climbs_query, conn)\n",
    "df_placements = pd.read_sql_query(placements_query, conn)\n",
    "\n",
    "df_hold_difficulty = pd.read_csv('../data/03_hold_difficulty/hold_difficulty_scores.csv', index_col='placement_id')\n",
    "df_features = pd.read_csv('../data/04_climb_features/climb_features.csv', index_col='climb_uuid')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "020aadb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Separate features and target\n",
    "X = df_features.drop(columns=['display_difficulty'])\n",
    "y = df_features['display_difficulty']\n",
    "\n",
    "print(f\"\\nFeatures shape: {X.shape}\")\n",
    "print(f\"Target range: {y.min():.1f} to {y.max():.1f}\")\n",
    "print(f\"Target mean: {y.mean():.2f}\")\n",
    "print(f\"Target std: {y.std():.2f}\")\n",
    "\n",
    "# Check for any remaining missing values\n",
    "missing = X.isna().sum().sum()\n",
    "print(f\"\\nMissing values in features: {missing}\")\n",
    "\n",
    "if missing > 0:\n",
    "    print(\"Filling remaining missing values with column means...\")\n",
    "    X = X.fillna(X.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd8b3d3b",
   "metadata": {},
   "source": [
    "# Training/Test Split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81b32e9e",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "========================\n",
    "Train/Test split\n",
    "========================\n",
    "\"\"\"\n",
    "\n",
    "# 80/20 split\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, random_state=RANDOM_STATE\n",
    ")\n",
    "\n",
    "print(f\"Training set: {len(X_train)} samples\")\n",
    "print(f\"Test set: {len(X_test)} samples\")\n",
    "\n",
    "# Also create a scaled version for linear models\n",
    "scaler = StandardScaler()\n",
    "X_train_scaled = scaler.fit_transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "\n",
    "print(\"\\nFeatures scaled for linear models\")"
   ]
  },
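The cell above fits the `StandardScaler` on the training fold only and merely applies the learned transform to the test fold. A minimal standalone sketch of that pattern on synthetic data (all names and values here are illustrative, not the notebook's climb features):

```python
# Sketch: split first, then scale -- the scaler learns mean/std from the
# training fold only, so no test-set statistics leak into the transform.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(500, 4))   # synthetic features
y = X @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit: learns train mean/std
X_test_scaled = scaler.transform(X_test)        # transform only: reuses them

# Train columns are standardised exactly; test columns only approximately.
print(np.allclose(X_train_scaled.mean(axis=0), 0.0, atol=1e-8))  # True
```

Fitting the scaler before the split (or on the full matrix) would quietly inflate test-set performance, which is why the order matters here.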
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf091bec",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "===================================\n",
    "Define evaluation functions\n",
    "===================================\n",
    "\"\"\"\n",
    "\n",
    "grade_to_v = {\n",
    "    10: 0, 11: 0, 12: 0,\n",
    "    13: 1, 14: 1,\n",
    "    15: 2,\n",
    "    16: 3, 17: 3,\n",
    "    18: 4, 19: 4,\n",
    "    20: 5, 21: 5,\n",
    "    22: 6,\n",
    "    23: 7,\n",
    "    24: 8, 25: 8,\n",
    "    26: 9,\n",
    "    27: 10,\n",
    "    28: 11,\n",
    "    29: 12,\n",
    "    30: 13,\n",
    "    31: 14,\n",
    "    32: 15,\n",
    "    33: 16,\n",
    "}\n",
    "\n",
    "def to_grouped_v(x):\n",
    "    rounded = int(round(x))\n",
    "    rounded = max(min(rounded, max(grade_to_v)), min(grade_to_v))\n",
    "    return grade_to_v[rounded]\n",
    "\n",
    "def grouped_v_metrics(y_true, y_pred):\n",
    "    true_v = np.array([to_grouped_v(x) for x in y_true])\n",
    "    pred_v = np.array([to_grouped_v(x) for x in y_pred])\n",
    "\n",
    "    exact_grouped_v = np.mean(true_v == pred_v) * 100\n",
    "    within_1_vgrade = np.mean(np.abs(true_v - pred_v) <= 1) * 100\n",
    "    within_2_vgrades = np.mean(np.abs(true_v - pred_v) <= 2) * 100\n",
    "\n",
    "    return {\n",
    "        'exact_grouped_v': exact_grouped_v,\n",
    "        'within_1_vgrade': within_1_vgrade,\n",
    "        'within_2_vgrades': within_2_vgrades\n",
    "    }\n",
    "\n",
    "def evaluate_model(y_true, y_pred, model_name=\"Model\"):\n",
    "    \"\"\"\n",
    "    Compute comprehensive evaluation metrics.\n",
    "    \"\"\"\n",
    "    mae = mean_absolute_error(y_true, y_pred)\n",
    "    rmse = np.sqrt(mean_squared_error(y_true, y_pred))\n",
    "    r2 = r2_score(y_true, y_pred)\n",
    "\n",
    "    # Fine-grained difficulty accuracy\n",
    "    within_1 = np.mean(np.abs(y_true - y_pred) <= 1) * 100\n",
    "    within_2 = np.mean(np.abs(y_true - y_pred) <= 2) * 100\n",
    "\n",
    "    # Grouped V-grade accuracy\n",
    "    v_metrics = grouped_v_metrics(y_true, y_pred)\n",
    "\n",
    "    # Print results\n",
    "    print(f\"### {model_name} Evaluation\\n\")\n",
    "    print(f\"MAE: {mae:.3f}\")\n",
    "    print(f\"RMSE: {rmse:.3f}\")\n",
    "    print(f\"R²: {r2:.3f}\")\n",
    "    print(f\"\\nAccuracy within ±1 grade: {within_1:.1f}%\")\n",
    "    print(f\"Accuracy within ±2 grades: {within_2:.1f}%\")\n",
    "    print(f\"\\nExact grouped V-grade accuracy: {v_metrics['exact_grouped_v']:.1f}%\")\n",
    "    print(f\"Accuracy within ±1 V-grade: {v_metrics['within_1_vgrade']:.1f}%\")\n",
    "    print(f\"Accuracy within ±2 V-grades: {v_metrics['within_2_vgrades']:.1f}%\")\n",
    "\n",
    "    return {\n",
    "        'model': model_name,\n",
    "        'mae': mae,\n",
    "        'rmse': rmse,\n",
    "        'r2': r2,\n",
    "        'within_1': within_1,\n",
    "        'within_2': within_2,\n",
    "        'exact_grouped_v': v_metrics['exact_grouped_v'],\n",
    "        'within_1_vgrade': v_metrics['within_1_vgrade'],\n",
    "        'within_2_vgrades': v_metrics['within_2_vgrades']\n",
    "    }\n",
    "\n",
    "\n",
    "def plot_predictions(y_true, y_pred, model_name=\"Model\"):\n",
    "    \"\"\"\n",
    "    Plot predicted vs actual values.\n",
    "    \"\"\"\n",
    "    fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "    # Scatter plot\n",
    "    ax = axes[0]\n",
    "    ax.scatter(y_true, y_pred, alpha=0.3, s=20)\n",
    "    ax.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--', lw=2)\n",
    "    ax.set_xlabel('Actual Grade', fontsize=12)\n",
    "    ax.set_ylabel('Predicted Grade', fontsize=12)\n",
    "    ax.set_title(f'{model_name}: Predicted vs Actual', fontsize=14)\n",
    "\n",
    "    # Residuals\n",
    "    ax = axes[1]\n",
    "    residuals = y_pred - y_true\n",
    "    ax.scatter(y_pred, residuals, alpha=0.3, s=20)\n",
    "    ax.axhline(y=0, color='r', linestyle='--', lw=2)\n",
    "    ax.set_xlabel('Predicted Grade', fontsize=12)\n",
    "    ax.set_ylabel('Residual', fontsize=12)\n",
    "    ax.set_title(f'{model_name}: Residual Plot', fontsize=14)\n",
    "\n",
    "    plt.tight_layout()\n",
    "    plt.savefig(f'../images/05_predictive_modelling/{model_name.lower().replace(\" \", \"_\")}_predictions.png', dpi=150, bbox_inches='tight')\n",
    "    plt.show()\n",
    "\n",
    "\n",
    "def plot_error_distribution(y_true, y_pred, model_name=\"Model\"):\n",
    "    \"\"\"\n",
    "    Plot error distribution.\n",
    "    \"\"\"\n",
    "    errors = y_pred - y_true\n",
    "\n",
    "    fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "    # Histogram\n",
    "    ax = axes[0]\n",
    "    ax.hist(errors, bins=30, edgecolor='black', alpha=0.7)\n",
    "    ax.axvline(x=0, color='r', linestyle='--', lw=2)\n",
    "    ax.axvline(x=1, color='g', linestyle=':', lw=1)\n",
    "    ax.axvline(x=-1, color='g', linestyle=':', lw=1)\n",
    "    ax.set_xlabel('Prediction Error', fontsize=12)\n",
    "    ax.set_ylabel('Count', fontsize=12)\n",
    "    ax.set_title(f'{model_name}: Error Distribution', fontsize=14)\n",
    "\n",
    "    # Box plot by actual grade\n",
    "    ax = axes[1]\n",
    "    df_plot = pd.DataFrame({'actual': y_true, 'error': errors})\n",
    "    df_plot.boxplot(column='error', by='actual', ax=ax)\n",
    "    ax.set_xlabel('Actual Difficulty', fontsize=12)\n",
    "    ax.set_ylabel('Prediction Error', fontsize=12)\n",
    "    ax.set_title(f'{model_name}: Error by Grade', fontsize=14)\n",
    "    plt.suptitle('')  # Remove automatic title\n",
    "\n",
    "    plt.tight_layout()\n",
    "    plt.savefig(f'../images/05_predictive_modelling/{model_name.lower().replace(\" \", \"_\")}_errors.png', dpi=150, bbox_inches='tight')\n",
    "    plt.show()\n"
   ]
  },
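The `to_grouped_v` helper above rounds a fractional difficulty, clamps it into the table's key range, and looks up the V-grade bucket. A standalone restatement with example inputs (the mapping is copied from the notebook; the sample values are illustrative):

```python
# Sketch: the notebook's difficulty -> grouped V-grade bucketing, standalone.
grade_to_v = {
    10: 0, 11: 0, 12: 0, 13: 1, 14: 1, 15: 2, 16: 3, 17: 3,
    18: 4, 19: 4, 20: 5, 21: 5, 22: 6, 23: 7, 24: 8, 25: 8,
    26: 9, 27: 10, 28: 11, 29: 12, 30: 13, 31: 14, 32: 15, 33: 16,
}

def to_grouped_v(x: float) -> int:
    # Round the continuous prediction, then clamp to the table's key range
    # so out-of-range predictions map to the nearest defined bucket.
    rounded = int(round(x))
    rounded = max(min(rounded, max(grade_to_v)), min(grade_to_v))
    return grade_to_v[rounded]

print(to_grouped_v(20.4))  # 20 -> V5
print(to_grouped_v(8.0))   # clamped up to 10 -> V0
print(to_grouped_v(40.0))  # clamped down to 33 -> V16
```

The clamp is what makes the grouped metrics robust to a model that occasionally predicts outside the observed difficulty range.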
  {
   "cell_type": "markdown",
   "id": "4935cac0",
   "metadata": {},
   "source": [
    "# Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38cdacab",
   "metadata": {},
   "source": [
    "## Linear Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "806cd7ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "Linear Regression (baseline)\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "print(\"=\" * 60)\n",
    "print(\"LINEAR REGRESSION\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "# Fit linear regression\n",
    "lr = LinearRegression()\n",
    "lr.fit(X_train_scaled, y_train)\n",
    "\n",
    "# Predict\n",
    "y_pred_lr_train = lr.predict(X_train_scaled)\n",
    "y_pred_lr_test = lr.predict(X_test_scaled)\n",
    "\n",
    "# Evaluate\n",
    "results_lr_train = evaluate_model(y_train, y_pred_lr_train, \"Linear Regression (Train)\")\n",
    "print()\n",
    "results_lr_test = evaluate_model(y_test, y_pred_lr_test, \"Linear Regression (Test)\")\n",
    "\n",
    "# Store results\n",
    "all_results = []\n",
    "all_results.append({**results_lr_test, 'set': 'test'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28460ebf",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "Linear regression - visualization\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "plot_predictions(y_test, y_pred_lr_test, \"Linear Regression\")\n",
    "plot_error_distribution(y_test, y_pred_lr_test, \"Linear Regression\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "949a8b7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "Linear regression - coefficient analysis\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "# Get coefficients\n",
    "coef_df = pd.DataFrame({\n",
    "    'feature': X.columns,\n",
    "    'coefficient': lr.coef_\n",
    "}).sort_values('coefficient', key=abs, ascending=False)\n",
    "\n",
    "print(\"### Top 20 Most Important Coefficients (Linear Regression)\\n\")\n",
    "display(coef_df.head(20))\n",
    "\n",
    "# Plot top coefficients\n",
    "fig, ax = plt.subplots(figsize=(10, 8))\n",
    "\n",
    "top_coef = coef_df.head(20)\n",
    "colors = ['#2ecc71' if c > 0 else '#e74c3c' for c in top_coef['coefficient']]\n",
    "\n",
    "ax.barh(range(len(top_coef)), top_coef['coefficient'], color=colors)\n",
    "ax.set_yticks(range(len(top_coef)))\n",
    "ax.set_yticklabels(top_coef['feature'])\n",
    "ax.set_xlabel('Coefficient', fontsize=12)\n",
    "ax.set_title('Linear Regression: Top 20 Coefficients', fontsize=14)\n",
    "ax.axvline(x=0, color='black', linestyle='-', lw=0.5)\n",
    "ax.invert_yaxis()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('../images/05_predictive_modelling/linear_regression_coefficients.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e15d4fb",
   "metadata": {},
   "source": [
    "## Ridge Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ba333faf",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "Ridge Regression\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "print(\"=\" * 60)\n",
    "print(\"RIDGE REGRESSION\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "from sklearn.linear_model import RidgeCV\n",
    "\n",
    "# Cross-validate for best alpha\n",
    "alphas = [0.01, 0.1, 1, 10, 100, 1000]\n",
    "ridge = RidgeCV(alphas=alphas, cv=5)\n",
    "ridge.fit(X_train_scaled, y_train)\n",
    "\n",
    "print(f\"Best alpha: {ridge.alpha_}\")\n",
    "\n",
    "# Predict\n",
    "y_pred_ridge = ridge.predict(X_test_scaled)\n",
    "\n",
    "# Evaluate\n",
    "results_ridge = evaluate_model(y_test, y_pred_ridge, \"Ridge Regression\")\n",
    "all_results.append({**results_ridge, 'set': 'test'})"
   ]
  },
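`RidgeCV` above does the alpha search internally: it cross-validates each candidate and exposes the winner as `alpha_`, so the final fit needs no separate grid-search loop. A minimal sketch on synthetic data (the data and alpha grid here are illustrative):

```python
# Sketch: RidgeCV selects a regularisation strength by 5-fold CV over a
# fixed candidate grid, then refits on the full data with the winner.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))                       # synthetic features
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.5, size=300)

alphas = [0.01, 0.1, 1, 10, 100, 1000]
ridge = RidgeCV(alphas=alphas, cv=5)
ridge.fit(X, y)

# The selected alpha is always one of the supplied candidates.
print(ridge.alpha_ in alphas)  # True
```

This is also why the later cross-validation comparison can rebuild a plain `Ridge(alpha=ridge.alpha_)`: the CV search has already been paid for here.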
  {
   "cell_type": "markdown",
   "id": "6863eb88",
   "metadata": {},
   "source": [
    "## Lasso Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f07cba2",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "Lasso Regression\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "print(\"=\" * 60)\n",
    "print(\"LASSO REGRESSION\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "from sklearn.linear_model import LassoCV\n",
    "\n",
    "# Cross-validate for best alpha\n",
    "lasso = LassoCV(alphas=None, cv=5, max_iter=10000)\n",
    "lasso.fit(X_train_scaled, y_train)\n",
    "\n",
    "print(f\"Best alpha: {lasso.alpha_:.4f}\")\n",
    "\n",
    "# Count non-zero coefficients\n",
    "non_zero = np.sum(lasso.coef_ != 0)\n",
    "print(f\"Non-zero coefficients: {non_zero} / {len(lasso.coef_)}\")\n",
    "\n",
    "# Predict\n",
    "y_pred_lasso = lasso.predict(X_test_scaled)\n",
    "\n",
    "# Evaluate\n",
    "results_lasso = evaluate_model(y_test, y_pred_lasso, \"Lasso Regression\")\n",
    "all_results.append({**results_lasso, 'set': 'test'})\n",
    "\n",
    "# Show features selected by Lasso\n",
    "lasso_features = pd.DataFrame({\n",
    "    'feature': X.columns,\n",
    "    'coefficient': lasso.coef_\n",
    "})\n",
    "lasso_features = lasso_features[lasso_features['coefficient'] != 0].sort_values('coefficient', key=abs, ascending=False)\n",
    "\n",
    "print(f\"\\n### Features Selected by Lasso ({len(lasso_features)})\\n\")\n",
    "display(lasso_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7a6152a",
   "metadata": {},
   "source": [
    "# Random Forest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b0b08b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "Random Forest - Base Model\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "print(\"=\" * 60)\n",
    "print(\"RANDOM FOREST\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "# Base random forest\n",
    "rf = RandomForestRegressor(\n",
    "    n_estimators=100,\n",
    "    max_depth=None,\n",
    "    min_samples_split=2,\n",
    "    min_samples_leaf=1,\n",
    "    random_state=RANDOM_STATE,\n",
    "    n_jobs=-1\n",
    ")\n",
    "\n",
    "rf.fit(X_train, y_train)\n",
    "\n",
    "# Predict\n",
    "y_pred_rf_train = rf.predict(X_train)\n",
    "y_pred_rf_test = rf.predict(X_test)\n",
    "\n",
    "# Evaluate\n",
    "results_rf_train = evaluate_model(y_train, y_pred_rf_train, \"Random Forest (Train)\")\n",
    "print()\n",
    "results_rf_test = evaluate_model(y_test, y_pred_rf_test, \"Random Forest (Test)\")\n",
    "all_results.append({**results_rf_test, 'set': 'test'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f473fef",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "Random Forest - Visualization\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "plot_predictions(y_test, y_pred_rf_test, \"Random Forest\")\n",
    "plot_error_distribution(y_test, y_pred_rf_test, \"Random Forest\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a810d7fb",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "RF - Feature Importance\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "# Get feature importance\n",
    "importance_df = pd.DataFrame({\n",
    "    'feature': X.columns,\n",
    "    'importance': rf.feature_importances_\n",
    "}).sort_values('importance', ascending=False)\n",
    "\n",
    "print(\"### Top 20 Most Important Features (Random Forest)\\n\")\n",
    "display(importance_df.head(20))\n",
    "\n",
    "# Plot\n",
    "fig, ax = plt.subplots(figsize=(10, 8))\n",
    "\n",
    "top_features = importance_df.head(20)\n",
    "ax.barh(range(len(top_features)), top_features['importance'], color='#3498db')\n",
    "ax.set_yticks(range(len(top_features)))\n",
    "ax.set_yticklabels(top_features['feature'])\n",
    "ax.set_xlabel('Feature Importance', fontsize=12)\n",
    "ax.set_title('Random Forest: Top 20 Features', fontsize=14)\n",
    "ax.invert_yaxis()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('../images/05_predictive_modelling/random_forest_feature_importance.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35b8ca0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "RF - Skip tuning, use good defaults\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "print(\"Using pre-tuned Random Forest parameters...\\n\")\n",
    "\n",
    "rf_best = RandomForestRegressor(\n",
    "    n_estimators=200,\n",
    "    max_depth=20,\n",
    "    min_samples_split=2,\n",
    "    min_samples_leaf=1,\n",
    "    random_state=RANDOM_STATE,\n",
    "    n_jobs=-1\n",
    ")\n",
    "\n",
    "rf_best.fit(X_train, y_train)\n",
    "y_pred_rf_best = rf_best.predict(X_test)\n",
    "\n",
    "results_rf_tuned = evaluate_model(y_test, y_pred_rf_best, \"Random Forest (Tuned)\")\n",
    "all_results.append({**results_rf_tuned, 'set': 'test'})\n",
    "\n",
    "# Save tuned Random Forest test predictions for downstream comparison\n",
    "os.makedirs('../data/06_deep_learning', exist_ok=True)\n",
    "os.makedirs('../models', exist_ok=True)\n",
    "\n",
    "np.save('../data/06_deep_learning/rf_test_predictions.npy', y_pred_rf_best)\n",
    "np.save('../data/06_deep_learning/rf_test_actuals.npy', y_test.values)\n",
    "\n",
    "rf_eval_df = pd.DataFrame({\n",
    "    'y_true': y_test.values,\n",
    "    'y_pred': y_pred_rf_best\n",
    "})\n",
    "rf_eval_df.to_csv('../models/random_forest_test_eval.csv', index=False)\n",
    "\n",
    "print(\"\\nSaved Random Forest test predictions for Notebook 06 comparison.\")\n"
   ]
  },
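The `.npy` handoff above relies on `np.save`/`np.load` round-tripping the prediction arrays unchanged, so notebook 06 compares against exactly the values written here. A quick sketch of that round trip (paths here are a temp directory, not the project's `../data` layout):

```python
# Sketch: arrays written with np.save reload identically with np.load.
import os
import tempfile
import numpy as np

preds = np.array([18.2, 21.7, 15.0, 24.9])  # illustrative predictions

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rf_test_predictions.npy")
    np.save(path, preds)        # writes dtype and shape alongside the data
    loaded = np.load(path)      # restores the exact float64 values

print(np.array_equal(preds, loaded))  # True
```

Because `.npy` stores dtype and shape in its header, no precision is lost the way it can be with a text format like CSV.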
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec1160f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "RF Tuned - Visualization\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "plot_predictions(y_test, y_pred_rf_best, \"Random Forest (Tuned)\")\n",
    "plot_error_distribution(y_test, y_pred_rf_best, \"Random Forest (Tuned)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ad4cfbf",
   "metadata": {},
   "source": [
    "# Comparing Models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddaa9bcd",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "Cross-Validation comparison\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "print(\"=\" * 60)\n",
    "print(\"CROSS-VALIDATION COMPARISON\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "models = {\n",
    "    'Linear Regression': LinearRegression(),\n",
    "    'Ridge Regression': Ridge(alpha=ridge.alpha_),\n",
    "    'Lasso Regression': Lasso(alpha=lasso.alpha_),\n",
    "    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1),\n",
    "    'RF (Tuned)': rf_best\n",
    "}\n",
    "\n",
    "cv_results = []\n",
    "kfold = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)\n",
    "\n",
    "for name, model in models.items():\n",
    "    print(f\"\\nCross-validating {name}...\")\n",
    "\n",
    "    if 'Linear' in name or 'Ridge' in name or 'Lasso' in name:\n",
    "        cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=kfold, scoring='neg_mean_absolute_error')\n",
    "    else:\n",
    "        cv_scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='neg_mean_absolute_error')\n",
    "\n",
    "    cv_results.append({\n",
    "        'model': name,\n",
    "        'cv_mae_mean': -cv_scores.mean(),\n",
    "        'cv_mae_std': cv_scores.std()\n",
    "    })\n",
    "\n",
    "cv_df = pd.DataFrame(cv_results).sort_values('cv_mae_mean')\n",
    "\n",
    "print(\"\\n### Cross-Validation Results (5-Fold)\\n\")\n",
    "display(cv_df)\n",
    "\n",
    "# Plot\n",
    "fig, ax = plt.subplots(figsize=(10, 6))\n",
    "\n",
    "ax.barh(range(len(cv_df)), cv_df['cv_mae_mean'], xerr=cv_df['cv_mae_std'],\n",
    "        color=['#e74c3c', '#e67e22', '#f1c40f', '#2ecc71', '#3498db'], alpha=0.8)\n",
    "ax.set_yticks(range(len(cv_df)))\n",
    "ax.set_yticklabels(cv_df['model'])\n",
    "ax.set_xlabel('Mean Absolute Error (MAE)', fontsize=12)\n",
    "ax.set_title('Cross-Validation MAE Comparison', fontsize=14)\n",
    "ax.invert_yaxis()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('../images/05_predictive_modelling/cv_comparison.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
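The CV loop above negates `cv_scores.mean()` because sklearn scorers follow a "higher is better" convention: MAE is exposed as `neg_mean_absolute_error` and comes back as non-positive values. A minimal sketch of that sign convention on synthetic data (the data and model here are illustrative, not the climb features):

```python
# Sketch: sklearn returns MAE negated under the scorer convention, so the
# scores are <= 0 and must be negated back to recover the actual error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))                     # synthetic features
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

kfold = KFold(n_splits=5, shuffle=True, random_state=3)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold,
                         scoring='neg_mean_absolute_error')

print((scores <= 0).all())   # True: one negated MAE per fold
print(-scores.mean() > 0)    # True: negate to recover the mean MAE
```

The same convention is why the comparison table stores `-cv_scores.mean()` as `cv_mae_mean` while the per-fold standard deviation needs no sign flip.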
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9d15b22",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "Model Comparison Summary\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "print(\"=\" * 60)\n",
    "print(\"MODEL COMPARISON SUMMARY\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "results_df = pd.DataFrame(all_results)\n",
    "results_df = results_df.sort_values('mae')\n",
    "\n",
    "print(\"\\n### Test Set Performance\\n\")\n",
    "display(results_df[['model', 'mae', 'rmse', 'r2', 'within_1', 'within_2']])\n",
    "\n",
    "# Visual comparison\n",
    "fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n",
    "\n",
    "# MAE comparison\n",
    "ax = axes[0]\n",
    "ax.barh(results_df['model'], results_df['mae'], color='#3498db', alpha=0.8)\n",
    "ax.set_xlabel('MAE (lower is better)', fontsize=12)\n",
    "ax.set_title('Mean Absolute Error', fontsize=14)\n",
    "ax.invert_yaxis()\n",
    "\n",
    "# R² comparison\n",
    "ax = axes[1]\n",
    "ax.barh(results_df['model'], results_df['r2'], color='#2ecc71', alpha=0.8)\n",
    "ax.set_xlabel('R² (higher is better)', fontsize=12)\n",
    "ax.set_title('R² Score', fontsize=14)\n",
    "ax.invert_yaxis()\n",
    "\n",
    "# Accuracy within ±1\n",
    "ax = axes[2]\n",
    "ax.barh(results_df['model'], results_df['within_1'], color='#e74c3c', alpha=0.8)\n",
    "ax.set_xlabel('Accuracy (%)', fontsize=12)\n",
    "ax.set_title('Accuracy within ±1 Grade', fontsize=14)\n",
    "ax.invert_yaxis()\n",
    "ax.set_xlim(0, 100)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('../images/05_predictive_modelling/model_comparison.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e2c723d",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "====================================\n",
    "Prediction Examples\n",
    "====================================\n",
    "\"\"\"\n",
    "\n",
    "print(\"### Sample Predictions\\n\")\n",
    "\n",
    "# Show some example predictions (seeded so the sample is reproducible)\n",
    "rng = np.random.default_rng(RANDOM_STATE)\n",
    "sample_indices = rng.choice(len(X_test), 10, replace=False)\n",
    "\n",
    "examples = pd.DataFrame({\n",
    "    'Actual': y_test.iloc[sample_indices].values,\n",
    "    'Linear Reg': y_pred_lr_test[sample_indices],\n",
    "    'Ridge': y_pred_ridge[sample_indices],\n",
    "    'Random Forest': y_pred_rf_test[sample_indices],\n",
    "    'RF (Tuned)': y_pred_rf_best[sample_indices]\n",
    "}).round(2)\n",
    "\n",
    "examples['Actual V'] = [to_grouped_v(x) for x in examples['Actual']]\n",
    "examples['RF (Tuned) V'] = [to_grouped_v(x) for x in examples['RF (Tuned)']]\n",
    "examples['Linear Error'] = (examples['Actual'] - examples['Linear Reg']).abs().round(2)\n",
    "examples['RF Error'] = (examples['Actual'] - examples['RF (Tuned)']).abs().round(2)\n",
    "examples['RF V-Miss'] = (examples['Actual V'] - examples['RF (Tuned) V']).abs()\n",
    "\n",
    "display(examples)\n"
   ]
  },
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "0bcb80eb",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"\"\"\"\n",
|
||
"====================================\n",
|
||
"Error analysis by grade\n",
|
||
"====================================\n",
|
||
"\"\"\"\n",
|
||
"\n",
|
||
"print(\"### Error Analysis by Grade\\n\")\n",
|
||
"\n",
|
||
"# Group by actual grade\n",
"df_analysis = pd.DataFrame({\n",
"    'actual': y_test,\n",
"    'predicted_rf': y_pred_rf_best,\n",
"    'error_rf': y_test - y_pred_rf_best\n",
"})\n",
"\n",
"grade_analysis = df_analysis.groupby('actual').agg(\n",
"    count=('actual', 'count'),\n",
"    mae=('error_rf', lambda x: np.abs(x).mean()),\n",
"    bias=('error_rf', 'mean'),\n",
"    within_1=('error_rf', lambda x: (np.abs(x) <= 1).mean() * 100)\n",
").round(3)\n",
"\n",
"display(grade_analysis)\n",
"\n",
"# Plot\n",
"fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n",
"\n",
"# Count by grade\n",
"ax = axes[0]\n",
"ax.bar(grade_analysis.index, grade_analysis['count'], color='#3498db', alpha=0.8)\n",
"ax.set_xlabel('Grade')\n",
"ax.set_ylabel('Count')\n",
"ax.set_title('Test Set Distribution by Grade')\n",
"\n",
"# MAE by grade\n",
"ax = axes[1]\n",
"ax.bar(grade_analysis.index, grade_analysis['mae'], color='#e74c3c', alpha=0.8)\n",
"ax.set_xlabel('Grade')\n",
"ax.set_ylabel('MAE')\n",
"ax.set_title('MAE by Grade')\n",
"\n",
"# Bias by grade\n",
"ax = axes[2]\n",
"colors = ['#2ecc71' if b >= 0 else '#e74c3c' for b in grade_analysis['bias']]\n",
"ax.bar(grade_analysis.index, grade_analysis['bias'], color=colors, alpha=0.8)\n",
"ax.set_xlabel('Grade')\n",
"ax.set_ylabel('Bias (Actual - Predicted)')\n",
"ax.set_title('Prediction Bias by Grade')\n",
"ax.axhline(y=0, color='black', linestyle='--', lw=1)\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig('../images/05_predictive_modelling/error_by_grade.png', dpi=150, bbox_inches='tight')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8664c606",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"====================================\n",
"Prediction Confidence Intervals\n",
"====================================\n",
"\"\"\"\n",
"\n",
"print(\"### Prediction Confidence Analysis\\n\")\n",
"\n",
"# For Random Forest, we can use the individual tree predictions\n",
"# to estimate prediction uncertainty\n",
"\n",
"# Get predictions from individual trees\n",
"predictions = np.array([tree.predict(X_test) for tree in rf_best.estimators_])\n",
"\n",
"# Calculate statistics\n",
"pred_mean = predictions.mean(axis=0)\n",
"pred_std = predictions.std(axis=0)\n",
"\n",
"# Correlation between prediction std and absolute error\n",
"abs_errors = np.abs(y_test - pred_mean)\n",
"correlation = np.corrcoef(pred_std, abs_errors)[0, 1]\n",
"\n",
"print(f\"Correlation between prediction std and absolute error: {correlation:.3f}\")\n",
"\n",
"# Plot\n",
"fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
"\n",
"# Prediction std vs absolute error\n",
"ax = axes[0]\n",
"ax.scatter(pred_std, abs_errors, alpha=0.3, s=20)\n",
"ax.set_xlabel('Prediction Std Dev (Uncertainty)', fontsize=12)\n",
"ax.set_ylabel('Absolute Error', fontsize=12)\n",
"ax.set_title('Prediction Uncertainty vs Error', fontsize=14)\n",
"\n",
"# Error by uncertainty quartile\n",
"ax = axes[1]\n",
"quartiles = pd.qcut(pred_std, 4, labels=['Q1 (Low)', 'Q2', 'Q3', 'Q4 (High)'])\n",
"uncertainty_analysis = pd.DataFrame({\n",
"    'quartile': quartiles,\n",
"    'abs_error': abs_errors\n",
"}).groupby('quartile')['abs_error'].mean()\n",
"\n",
"ax.bar(range(4), uncertainty_analysis.values, color=['#2ecc71', '#f1c40f', '#e67e22', '#e74c3c'])\n",
"ax.set_xticks(range(4))\n",
"ax.set_xticklabels(uncertainty_analysis.index)\n",
"ax.set_ylabel('Mean Absolute Error', fontsize=12)\n",
"ax.set_title('MAE by Prediction Uncertainty Quartile', fontsize=14)\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig('../images/05_predictive_modelling/prediction_uncertainty.png', dpi=150, bbox_inches='tight')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "38a6dd05",
"metadata": {},
"source": [
"# Saving Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fd7c4b38",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"====================================\n",
"Save Models\n",
"====================================\n",
"\"\"\"\n",
"\n",
"import joblib\n",
"\n",
"os.makedirs('../models', exist_ok=True)\n",
"\n",
"# Save models\n",
"joblib.dump(lr, '../models/linear_regression.pkl')\n",
"joblib.dump(ridge, '../models/ridge_regression.pkl')\n",
"joblib.dump(lasso, '../models/lasso_regression.pkl')\n",
"joblib.dump(rf_best, '../models/random_forest_tuned.pkl')\n",
"joblib.dump(scaler, '../models/feature_scaler.pkl')\n",
"\n",
"# Save feature names\n",
"with open('../models/feature_names.txt', 'w') as f:\n",
"    for feat in X.columns:\n",
"        f.write(f\"{feat}\\n\")\n",
"\n",
"print(\"Models saved to ../models/\")\n",
"print(\"  - linear_regression.pkl\")\n",
"print(\"  - ridge_regression.pkl\")\n",
"print(\"  - lasso_regression.pkl\")\n",
"print(\"  - random_forest_tuned.pkl\")\n",
"print(\"  - feature_scaler.pkl\")"
]
},
{
"cell_type": "markdown",
"id": "8ce25f6e",
"metadata": {},
"source": [
"# Conclusion"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "47f5e38f",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"====================================\n",
"Final Summary\n",
"====================================\n",
"\"\"\"\n",
"\n",
"print(\"=\" * 60)\n",
"print(\"FINAL SUMMARY\")\n",
"print(\"=\" * 60)\n",
"\n",
"summary = f\"\"\"\n",
"### Model Performance Summary\n",
"\n",
"| Model | MAE | RMSE | R² | Within ±1 | Within ±2 | Exact V | Within ±1 V |\n",
"|-------|-----|------|----|-----------|-----------|---------|-------------|\n",
"| Linear Regression | {results_lr_test['mae']:.3f} | {results_lr_test['rmse']:.3f} | {results_lr_test['r2']:.3f} | {results_lr_test['within_1']:.1f}% | {results_lr_test['within_2']:.1f}% | {results_lr_test['exact_grouped_v']:.1f}% | {results_lr_test['within_1_vgrade']:.1f}% |\n",
"| Ridge Regression | {results_ridge['mae']:.3f} | {results_ridge['rmse']:.3f} | {results_ridge['r2']:.3f} | {results_ridge['within_1']:.1f}% | {results_ridge['within_2']:.1f}% | {results_ridge['exact_grouped_v']:.1f}% | {results_ridge['within_1_vgrade']:.1f}% |\n",
"| Lasso Regression | {results_lasso['mae']:.3f} | {results_lasso['rmse']:.3f} | {results_lasso['r2']:.3f} | {results_lasso['within_1']:.1f}% | {results_lasso['within_2']:.1f}% | {results_lasso['exact_grouped_v']:.1f}% | {results_lasso['within_1_vgrade']:.1f}% |\n",
"| Random Forest (Tuned) | {results_rf_tuned['mae']:.3f} | {results_rf_tuned['rmse']:.3f} | {results_rf_tuned['r2']:.3f} | {results_rf_tuned['within_1']:.1f}% | {results_rf_tuned['within_2']:.1f}% | {results_rf_tuned['exact_grouped_v']:.1f}% | {results_rf_tuned['within_1_vgrade']:.1f}% |\n",
"\n",
"### Key Findings\n",
"\n",
"1. **Tree-based models remain strongest on this structured feature set.**\n",
"   - Random Forest (Tuned) achieves the best overall balance of MAE, RMSE, and grouped V-grade performance.\n",
"   - Linear models remain useful baselines but leave clear nonlinear signal unexplained.\n",
"\n",
"2. **Fine-grained difficulty prediction is meaningfully harder than grouped grade prediction.**\n",
"   - On the held-out test set, the best model is within ±1 fine-grained difficulty score {results_rf_tuned['within_1']:.1f}% of the time.\n",
"   - The same model is within ±1 grouped V-grade {results_rf_tuned['within_1_vgrade']:.1f}% of the time.\n",
"\n",
"3. **This gap is expected and informative.**\n",
"   - Small numeric errors often stay inside the same or adjacent V-grade buckets.\n",
"   - The model captures broad difficulty bands more reliably than exact score distinctions.\n",
"\n",
"4. **The project’s main predictive takeaway is practical rather than perfect.**\n",
"   - The models are not exact grade replicators.\n",
"   - They are reasonably strong at placing climbs into the correct neighborhood of difficulty.\n",
"\n",
"### Portfolio Interpretation\n",
"\n",
"From a modelling perspective, this project shows:\n",
"- feature engineering grounded in domain structure,\n",
"- comparison of linear and nonlinear models,\n",
"- honest evaluation on a held-out test set,\n",
"- and the ability to translate raw regression performance into climbing-relevant grouped V-grade metrics.\n",
"\"\"\"\n",
"\n",
"print(summary)\n",
"\n",
"# Save summary\n",
"os.makedirs('../data/05_predictive_modelling', exist_ok=True)\n",
"with open('../data/05_predictive_modelling/model_summary.txt', 'w') as f:\n",
"    f.write(summary)\n",
"\n",
"print(\"\\nSummary saved to ../data/05_predictive_modelling/model_summary.txt\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ee8621d7",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"====================================\n",
"Bonus: Gradient Boosting Comparison\n",
"====================================\n",
"\"\"\"\n",
"\n",
"print(\"=\" * 60)\n",
"print(\"GRADIENT BOOSTING\")\n",
"print(\"=\" * 60)\n",
"\n",
"from sklearn.ensemble import GradientBoostingRegressor\n",
"\n",
"# Train gradient boosting\n",
"gb = GradientBoostingRegressor(\n",
"    n_estimators=200,\n",
"    max_depth=5,\n",
"    learning_rate=0.1,\n",
"    min_samples_split=5,\n",
"    random_state=RANDOM_STATE\n",
")\n",
"\n",
"gb.fit(X_train, y_train)\n",
"\n",
"# Predict\n",
"y_pred_gb = gb.predict(X_test)\n",
"\n",
"# Evaluate\n",
"results_gb = evaluate_model(y_test, y_pred_gb, \"Gradient Boosting\")\n",
"all_results.append({**results_gb, 'set': 'test'})\n",
"\n",
"# Compare with Random Forest\n",
"plot_predictions(y_test, y_pred_gb, \"Gradient Boosting\")"
]
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.14.3"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|