{ "cells": [ { "cell_type": "markdown", "id": "833dad45", "metadata": {}, "source": [ "# Tension Board 2 Mirror: Predictive Modelling\n", "\n", "With the feature matrix built in notebook 04, we now turn to the central modelling question: how accurately can we predict climb difficulty from engineered features?\n", "\n", "## Modelling Approach\n", "\n", "We fit and compare several regression models on the hold-out test set:\n", "\n", "1. **Linear models** \n", " Linear Regression, Ridge, and Lasso serve as interpretable baselines. Coefficients reveal which features the model relies on most.\n", "\n", "2. **Tree-based models** \n", " Random Forest is the primary workhorse. It handles nonlinear relationships naturally and provides feature importance scores. A tuned variant with deeper trees and more estimators serves as the final classical model.\n", "\n", "3. **Gradient Boosting** \n", " We compare Gradient Boosting against Random Forest to assess whether boosting yields improved predictive performance over bagging.\n", "\n", "## Evaluation Framework\n", "\n", "We evaluate models on two levels:\n", "\n", "- **Fine-grained difficulty scores** \n", " Predictions are scored against the raw `display_difficulty` values, with accuracy reported within ±1 and ±2 points.\n", "\n", "- **Grouped V-grades** \n", " Fine-grained scores are mapped to V-grade buckets. This is the more practical metric: being off by half a grade is usually acceptable, while being off by two full grades is not.\n", "\n", "## Output\n", "\n", "The final products are trained models saved as joblib files, test set predictions for ensemble comparison in notebook 06, and a summary of model performance across all metrics.\n", "\n", "## Notebook Structure\n", "\n", "1. [Setup and Imports](#setup-and-imports)\n", "2. [Load Feature Data](#load-feature-data)\n", "3. [Training/Test Split](#trainingtest-split)\n", "4. [Regression](#regression)\n", "5. [Random Forest](#random-forest)\n", "6. [Comparing Models](#comparing-models)\n", "7. 
[Saving Models](#saving-models)\n", "8. [Conclusion](#conclusion)" ] }, { "cell_type": "markdown", "id": "33fdcba8", "metadata": {}, "source": [ "# Setup and Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "e8364a1c", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "==================================\n", "Setup and Imports\n", "==================================\n", "\"\"\"\n", "\n", "# Imports\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import numpy as np\n", "import matplotlib.patches as mpatches\n", "\n", "from scipy.spatial import ConvexHull\n", "from scipy.spatial.distance import pdist, squareform\n", "\n", "import sqlite3\n", "\n", "import re\n", "import os\n", "from collections import defaultdict\n", "\n", "import ast\n", "\n", "from sklearn.model_selection import train_test_split, cross_val_score, KFold\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet\n", "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n", "from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "from PIL import Image\n", "\n", "# Set some display options\n", "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.max_rows', 100)\n", "\n", "# Set style\n", "palette = ['steelblue', 'coral', 'seagreen']  # for multi-bar graphs\n", "\n", "# Set board image for some visual analysis\n", "board_img = Image.open('../images/tb2_board_12x12_composite.png')\n", "\n", "# Connect to the database\n", "DB_PATH = \"../data/tb2.db\"\n", "conn = sqlite3.connect(DB_PATH)\n", "\n", "# Set random state\n", "RANDOM_STATE = 3" ] }, { "cell_type": "code", "execution_count": null, "id": "2830cfab", 
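"metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "==================================\n", "Sanity check: inspect DB tables\n", "==================================\n", "\n", "A quick, hedged check before querying: list the tables that the queries\n", "below join against. The expected names simply mirror the joins used in\n", "this notebook; adjust if the schema differs.\n", "\"\"\"\n", "\n", "tables = pd.read_sql_query(\n", "    \"SELECT name FROM sqlite_master WHERE type='table' ORDER BY name\", conn\n", ")\n", "expected = {'climbs', 'layouts', 'products', 'climb_stats', 'difficulty_grades', 'placements', 'holes', 'sets'}\n", "\n", "print(tables['name'].tolist())\n", "print(f\"Missing expected tables: {expected - set(tables['name']) or 'none'}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "5a1c9e77",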
"metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "==================================\n", "Query our data from the DB\n", "==================================\n", "\n", "This time we restrict to `layout_id=10`, the layout of the TB2 Mirror.\n", "We also restrict to angles of at most 50 degrees: per the grade vs angle distribution in notebook 01, the data becomes unreliable past 50\n", "(likely a selection bias towards the climbers who can actually climb that steep). We encode both restrictions directly into the query.\n", "\"\"\"\n", "\n", "# Query climbs data\n", "climbs_query = \"\"\"\n", "SELECT\n", " c.uuid,\n", " c.name AS climb_name,\n", " c.setter_username,\n", " c.layout_id AS layout_id,\n", " c.description,\n", " c.is_nomatch,\n", " c.is_listed,\n", " l.name AS layout_name,\n", " p.name AS board_name,\n", " c.frames,\n", " cs.angle,\n", " cs.display_difficulty,\n", " dg.boulder_name AS boulder_grade,\n", " cs.ascensionist_count,\n", " cs.quality_average,\n", " cs.fa_at\n", " \n", "FROM climbs c\n", "JOIN layouts l ON c.layout_id = l.id\n", "JOIN products p ON l.product_id = p.id\n", "JOIN climb_stats cs ON c.uuid = cs.climb_uuid\n", "JOIN difficulty_grades dg ON ROUND(cs.display_difficulty) = dg.difficulty\n", "WHERE cs.display_difficulty IS NOT NULL AND c.is_listed=1 AND c.layout_id=10 AND cs.angle <= 50\n", "\"\"\"\n", "\n", "# Query information about placements (and their mirrors)\n", "placements_query = \"\"\"\n", "SELECT\n", " p.id AS placement_id,\n", " h.x,\n", " h.y,\n", " p.default_placement_role_id AS default_role_id,\n", " p.set_id AS set_id,\n", " s.name AS set_name,\n", " p_mirror.id AS mirror_placement_id\n", "FROM placements p\n", "JOIN holes h ON p.hole_id = h.id\n", "JOIN sets s ON p.set_id = s.id\n", "LEFT JOIN holes h_mirror ON h.mirrored_hole_id = h_mirror.id\n", "LEFT JOIN placements p_mirror ON p_mirror.hole_id = h_mirror.id AND p_mirror.layout_id = p.layout_id\n", "WHERE p.layout_id = 10\n", "\"\"\"\n", "\n", "# Load it into 
a DataFrame\n", "df_climbs = pd.read_sql_query(climbs_query, conn)\n", "df_placements = pd.read_sql_query(placements_query, conn)\n", "\n", "df_hold_difficulty = pd.read_csv('../data/03_hold_difficulty/hold_difficulty_scores.csv', index_col='placement_id')\n", "df_features = pd.read_csv('../data/04_climb_features/climb_features.csv', index_col='climb_uuid')" ] }, { "cell_type": "code", "execution_count": null, "id": "020aadb9", "metadata": {}, "outputs": [], "source": [ "# Separate features and target\n", "X = df_features.drop(columns=['display_difficulty'])\n", "y = df_features['display_difficulty']\n", "\n", "print(f\"\\nFeatures shape: {X.shape}\")\n", "print(f\"Target range: {y.min():.1f} to {y.max():.1f}\")\n", "print(f\"Target mean: {y.mean():.2f}\")\n", "print(f\"Target std: {y.std():.2f}\")\n", "\n", "# Check for any remaining missing values\n", "missing = X.isna().sum().sum()\n", "print(f\"\\nMissing values in features: {missing}\")\n", "\n", "if missing > 0:\n", " print(\"Filling remaining missing values with column means...\")\n", " X = X.fillna(X.mean())" ] }, { "cell_type": "markdown", "id": "bd8b3d3b", "metadata": {}, "source": [ "# Training/Test Split" ] }, { "cell_type": "code", "execution_count": null, "id": "81b32e9e", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "========================\n", "Train/Test split\n", "========================\n", "\"\"\"\n", "\n", "# 80/20 split\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.2, random_state=RANDOM_STATE\n", ")\n", "\n", "print(f\"Training set: {len(X_train)} samples\")\n", "print(f\"Test set: {len(X_test)} samples\")\n", "\n", "# Also create a scaled version for linear models\n", "scaler = StandardScaler()\n", "X_train_scaled = scaler.fit_transform(X_train)\n", "X_test_scaled = scaler.transform(X_test)\n", "\n", "print(f\"\\nFeatures scaled for linear models\")" ] }, { "cell_type": "code", "execution_count": null, "id": "cf091bec", "metadata": {}, 
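"outputs": [], "source": [ "\"\"\"\n", "===================================\n", "Sanity check: scaled features\n", "===================================\n", "\n", "Illustrative check: the scaler is fit on the training set only, so the\n", "scaled training features should have (close to) zero mean and unit\n", "variance per column, while the test set may deviate slightly.\n", "\"\"\"\n", "\n", "print(f\"Train scaled mean (max abs): {np.abs(X_train_scaled.mean(axis=0)).max():.2e}\")\n", "print(f\"Train scaled std (mean): {X_train_scaled.std(axis=0).mean():.3f}\")\n", "print(f\"Test scaled mean (max abs): {np.abs(X_test_scaled.mean(axis=0)).max():.3f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "9f8e7d6c", "metadata": {},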
"outputs": [], "source": [ "\"\"\"\n", "===================================\n", "Define evaluation functions\n", "===================================\n", "\"\"\"\n", "\n", "grade_to_v = {\n", " 10: 0, 11: 0, 12: 0,\n", " 13: 1, 14: 1,\n", " 15: 2,\n", " 16: 3, 17: 3,\n", " 18: 4, 19: 4,\n", " 20: 5, 21: 5,\n", " 22: 6,\n", " 23: 7,\n", " 24: 8, 25: 8,\n", " 26: 9,\n", " 27: 10,\n", " 28: 11,\n", " 29: 12,\n", " 30: 13,\n", " 31: 14,\n", " 32: 15,\n", " 33: 16,\n", "}\n", "\n", "def to_grouped_v(x):\n", " rounded = int(round(x))\n", " rounded = max(min(rounded, max(grade_to_v)), min(grade_to_v))\n", " return grade_to_v[rounded]\n", "\n", "def grouped_v_metrics(y_true, y_pred):\n", " true_v = np.array([to_grouped_v(x) for x in y_true])\n", " pred_v = np.array([to_grouped_v(x) for x in y_pred])\n", "\n", " exact_grouped_v = np.mean(true_v == pred_v) * 100\n", " within_1_vgrade = np.mean(np.abs(true_v - pred_v) <= 1) * 100\n", " within_2_vgrades = np.mean(np.abs(true_v - pred_v) <= 2) * 100\n", "\n", " return {\n", " 'exact_grouped_v': exact_grouped_v,\n", " 'within_1_vgrade': within_1_vgrade,\n", " 'within_2_vgrades': within_2_vgrades\n", " }\n", "\n", "def evaluate_model(y_true, y_pred, model_name=\"Model\"):\n", " \"\"\"\n", " Compute comprehensive evaluation metrics.\n", " \"\"\"\n", " mae = mean_absolute_error(y_true, y_pred)\n", " rmse = np.sqrt(mean_squared_error(y_true, y_pred))\n", " r2 = r2_score(y_true, y_pred)\n", "\n", " # Fine-grained difficulty accuracy\n", " within_1 = np.mean(np.abs(y_true - y_pred) <= 1) * 100\n", " within_2 = np.mean(np.abs(y_true - y_pred) <= 2) * 100\n", "\n", " # Grouped V-grade accuracy\n", " v_metrics = grouped_v_metrics(y_true, y_pred)\n", "\n", " # Print results\n", " print(f\"### {model_name} Evaluation\\n\")\n", " print(f\"MAE: {mae:.3f}\")\n", " print(f\"RMSE: {rmse:.3f}\")\n", " print(f\"R²: {r2:.3f}\")\n", " print(f\"\\nAccuracy within ±1 grade: {within_1:.1f}%\")\n", " print(f\"Accuracy within ±2 grades: 
{within_2:.1f}%\")\n", " print(f\"\\nExact grouped V-grade accuracy: {v_metrics['exact_grouped_v']:.1f}%\")\n", " print(f\"Accuracy within ±1 V-grade: {v_metrics['within_1_vgrade']:.1f}%\")\n", " print(f\"Accuracy within ±2 V-grades: {v_metrics['within_2_vgrades']:.1f}%\")\n", "\n", " return {\n", " 'model': model_name,\n", " 'mae': mae,\n", " 'rmse': rmse,\n", " 'r2': r2,\n", " 'within_1': within_1,\n", " 'within_2': within_2,\n", " 'exact_grouped_v': v_metrics['exact_grouped_v'],\n", " 'within_1_vgrade': v_metrics['within_1_vgrade'],\n", " 'within_2_vgrades': v_metrics['within_2_vgrades']\n", " }\n", "\n", "\n", "def plot_predictions(y_true, y_pred, model_name=\"Model\"):\n", " \"\"\"\n", " Plot predicted vs actual values.\n", " \"\"\"\n", " fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n", "\n", " # Scatter plot\n", " ax = axes[0]\n", " ax.scatter(y_true, y_pred, alpha=0.3, s=20)\n", " ax.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--', lw=2)\n", " ax.set_xlabel('Actual Grade', fontsize=12)\n", " ax.set_ylabel('Predicted Grade', fontsize=12)\n", " ax.set_title(f'{model_name}: Predicted vs Actual', fontsize=14)\n", "\n", " # Residuals\n", " ax = axes[1]\n", " residuals = y_pred - y_true\n", " ax.scatter(y_pred, residuals, alpha=0.3, s=20)\n", " ax.axhline(y=0, color='r', linestyle='--', lw=2)\n", " ax.set_xlabel('Predicted Grade', fontsize=12)\n", " ax.set_ylabel('Residual', fontsize=12)\n", " ax.set_title(f'{model_name}: Residual Plot', fontsize=14)\n", "\n", " plt.tight_layout()\n", " plt.show()\n", "\n", "\n", "def plot_error_distribution(y_true, y_pred, model_name=\"Model\"):\n", " \"\"\"\n", " Plot error distribution.\n", " \"\"\"\n", " errors = y_pred - y_true\n", "\n", " fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n", "\n", " # Histogram\n", " ax = axes[0]\n", " ax.hist(errors, bins=30, edgecolor='black', alpha=0.7)\n", " ax.axvline(x=0, color='r', linestyle='--', lw=2)\n", " ax.axvline(x=1, color='g', linestyle=':', lw=1)\n", 
" ax.axvline(x=-1, color='g', linestyle=':', lw=1)\n", " ax.set_xlabel('Prediction Error', fontsize=12)\n", " ax.set_ylabel('Count', fontsize=12)\n", " ax.set_title(f'{model_name}: Error Distribution', fontsize=14)\n", "\n", " # Box plot by actual grade\n", " ax = axes[1]\n", " df_plot = pd.DataFrame({'actual': y_true, 'error': errors})\n", " df_plot.boxplot(column='error', by='actual', ax=ax)\n", " ax.set_xlabel('Actual Difficulty', fontsize=12)\n", " ax.set_ylabel('Prediction Error', fontsize=12)\n", " ax.set_title(f'{model_name}: Error by Grade', fontsize=14)\n", " plt.suptitle('') # Remove automatic title\n", "\n", " plt.tight_layout()\n", " plt.show()\n" ] }, { "cell_type": "markdown", "id": "4935cac0", "metadata": {}, "source": [ "# Regression" ] }, { "cell_type": "markdown", "id": "38cdacab", "metadata": {}, "source": [ "## Linear Regression" ] }, { "cell_type": "code", "execution_count": null, "id": "806cd7ec", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Linear Regression (baseline)\n", "====================================\n", "\"\"\"\n", "\n", "print(\"=\" * 60)\n", "print(\"LINEAR REGRESSION\")\n", "print(\"=\" * 60)\n", "\n", "# Fit linear regression\n", "lr = LinearRegression()\n", "lr.fit(X_train_scaled, y_train)\n", "\n", "# Predict\n", "y_pred_lr_train = lr.predict(X_train_scaled)\n", "y_pred_lr_test = lr.predict(X_test_scaled)\n", "\n", "# Evaluate\n", "results_lr_train = evaluate_model(y_train, y_pred_lr_train, \"Linear Regression (Train)\")\n", "print()\n", "results_lr_test = evaluate_model(y_test, y_pred_lr_test, \"Linear Regression (Test)\")\n", "\n", "# Store results\n", "all_results = []\n", "all_results.append({**results_lr_test, 'set': 'test'})" ] }, { "cell_type": "code", "execution_count": null, "id": "28460ebf", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Linear regression - visualization\n", 
"====================================\n", "\"\"\"\n", "\n", "plot_predictions(y_test, y_pred_lr_test, \"Linear Regression\")\n", "plot_error_distribution(y_test, y_pred_lr_test, \"Linear Regression\")" ] }, { "cell_type": "code", "execution_count": null, "id": "949a8b7d", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Linear regression - coefficient analysis\n", "====================================\n", "\"\"\"\n", "\n", "# Get coefficients\n", "coef_df = pd.DataFrame({\n", " 'feature': X.columns,\n", " 'coefficient': lr.coef_\n", "}).sort_values('coefficient', key=abs, ascending=False)\n", "\n", "print(\"### Top 20 Most Important Coefficients (Linear Regression)\\n\")\n", "display(coef_df.head(20))\n", "\n", "# Plot top coefficients\n", "fig, ax = plt.subplots(figsize=(10, 8))\n", "\n", "top_coef = coef_df.head(20)\n", "colors = ['#2ecc71' if c > 0 else '#e74c3c' for c in top_coef['coefficient']]\n", "\n", "ax.barh(range(len(top_coef)), top_coef['coefficient'], color=colors)\n", "ax.set_yticks(range(len(top_coef)))\n", "ax.set_yticklabels(top_coef['feature'])\n", "ax.set_xlabel('Coefficient', fontsize=12)\n", "ax.set_title('Linear Regression: Top 20 Coefficients', fontsize=14)\n", "ax.axvline(x=0, color='black', linestyle='-', lw=0.5)\n", "ax.invert_yaxis()\n", "\n", "plt.tight_layout()\n", "plt.savefig('../images/05_predictive_modelling/linear_regression_coefficients.png', dpi=150, bbox_inches='tight')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "4e15d4fb", "metadata": {}, "source": [ "## Ridge Regression" ] }, { "cell_type": "code", "execution_count": null, "id": "ba333faf", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Ridge Regression\n", "====================================\n", "\"\"\"\n", "\n", "print(\"=\" * 60)\n", "print(\"RIDGE REGRESSION\")\n", "print(\"=\" * 60)\n", "\n", "from sklearn.linear_model import RidgeCV\n", "\n", "# 
Cross-validate for best alpha\n", "alphas = [0.01, 0.1, 1, 10, 100, 1000]\n", "ridge = RidgeCV(alphas=alphas, cv=5)\n", "ridge.fit(X_train_scaled, y_train)\n", "\n", "print(f\"Best alpha: {ridge.alpha_}\")\n", "\n", "# Predict\n", "y_pred_ridge = ridge.predict(X_test_scaled)\n", "\n", "# Evaluate\n", "results_ridge = evaluate_model(y_test, y_pred_ridge, \"Ridge Regression\")\n", "all_results.append({**results_ridge, 'set': 'test'})" ] }, { "cell_type": "markdown", "id": "6863eb88", "metadata": {}, "source": [ "## Lasso Regression" ] }, { "cell_type": "code", "execution_count": null, "id": "0f07cba2", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Lasso Regression\n", "====================================\n", "\"\"\"\n", "\n", "print(\"=\" * 60)\n", "print(\"LASSO REGRESSION\")\n", "print(\"=\" * 60)\n", "\n", "from sklearn.linear_model import LassoCV\n", "\n", "# Cross-validate for best alpha\n", "lasso = LassoCV(alphas=None, cv=5, max_iter=10000)\n", "lasso.fit(X_train_scaled, y_train)\n", "\n", "print(f\"Best alpha: {lasso.alpha_:.4f}\")\n", "\n", "# Count non-zero coefficients\n", "non_zero = np.sum(lasso.coef_ != 0)\n", "print(f\"Non-zero coefficients: {non_zero} / {len(lasso.coef_)}\")\n", "\n", "# Predict\n", "y_pred_lasso = lasso.predict(X_test_scaled)\n", "\n", "# Evaluate\n", "results_lasso = evaluate_model(y_test, y_pred_lasso, \"Lasso Regression\")\n", "all_results.append({**results_lasso, 'set': 'test'})\n", "\n", "# Show features selected by Lasso\n", "lasso_features = pd.DataFrame({\n", " 'feature': X.columns,\n", " 'coefficient': lasso.coef_\n", "})\n", "lasso_features = lasso_features[lasso_features['coefficient'] != 0].sort_values('coefficient', key=abs, ascending=False)\n", "\n", "print(f\"\\n### Features Selected by Lasso ({len(lasso_features)})\\n\")\n", "display(lasso_features)" ] }, { "cell_type": "markdown", "id": "b7a6152a", "metadata": {}, "source": [ "# Random Forest" ] }, { "cell_type": 
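"code", "execution_count": null, "id": "7e4d1b22", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Aside: why try a tree ensemble?\n", "====================================\n", "\n", "A small illustrative sketch on synthetic data (not the climb features):\n", "a pure interaction target carries no linear signal, so a linear model\n", "fails on it while a random forest fits it easily. This motivates the\n", "tree-based models in this section.\n", "\"\"\"\n", "\n", "rng = np.random.default_rng(RANDOM_STATE)\n", "X_demo = rng.uniform(-2, 2, size=(500, 2))\n", "y_demo = X_demo[:, 0] * X_demo[:, 1]  # pure interaction, zero linear signal\n", "\n", "lin_demo_r2 = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)\n", "rf_demo_r2 = RandomForestRegressor(n_estimators=50, random_state=RANDOM_STATE).fit(X_demo, y_demo).score(X_demo, y_demo)\n", "\n", "print(f\"Linear Regression R² (train): {lin_demo_r2:.3f}\")\n", "print(f\"Random Forest R² (train): {rf_demo_r2:.3f}\")" ] }, { "cell_type": 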
"code", "execution_count": null, "id": "9b0b08b1", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Random Forest - Base Model\n", "====================================\n", "\"\"\"\n", "\n", "print(\"=\" * 60)\n", "print(\"RANDOM FOREST\")\n", "print(\"=\" * 60)\n", "\n", "# Base random forest\n", "rf = RandomForestRegressor(\n", " n_estimators=100,\n", " max_depth=None,\n", " min_samples_split=2,\n", " min_samples_leaf=1,\n", " random_state=RANDOM_STATE,\n", " n_jobs=-1\n", ")\n", "\n", "rf.fit(X_train, y_train)\n", "\n", "# Predict\n", "y_pred_rf_train = rf.predict(X_train)\n", "y_pred_rf_test = rf.predict(X_test)\n", "\n", "# Evaluate\n", "results_rf_train = evaluate_model(y_train, y_pred_rf_train, \"Random Forest (Train)\")\n", "print()\n", "results_rf_test = evaluate_model(y_test, y_pred_rf_test, \"Random Forest (Test)\")\n", "all_results.append({**results_rf_test, 'set': 'test'})" ] }, { "cell_type": "code", "execution_count": null, "id": "2f473fef", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Random Forest - Visualization\n", "====================================\n", "\"\"\"\n", "\n", "plot_predictions(y_test, y_pred_rf_test, \"Random Forest\")\n", "plot_error_distribution(y_test, y_pred_rf_test, \"Random Forest\")" ] }, { "cell_type": "code", "execution_count": null, "id": "a810d7fb", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "RF - Feature Importance\n", "====================================\n", "\"\"\"\n", "\n", "# Get feature importance\n", "importance_df = pd.DataFrame({\n", " 'feature': X.columns,\n", " 'importance': rf.feature_importances_\n", "}).sort_values('importance', ascending=False)\n", "\n", "print(\"### Top 20 Most Important Features (Random Forest)\\n\")\n", "display(importance_df.head(20))\n", "\n", "# Plot\n", "fig, ax = plt.subplots(figsize=(10, 8))\n", "\n", "top_features = 
importance_df.head(20)\n", "ax.barh(range(len(top_features)), top_features['importance'], color='#3498db')\n", "ax.set_yticks(range(len(top_features)))\n", "ax.set_yticklabels(top_features['feature'])\n", "ax.set_xlabel('Feature Importance', fontsize=12)\n", "ax.set_title('Random Forest: Top 20 Features', fontsize=14)\n", "ax.invert_yaxis()\n", "\n", "plt.tight_layout()\n", "plt.savefig('../images/05_predictive_modelling/random_forest_importance.png', dpi=150, bbox_inches='tight')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "35b8ca0e", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "RF - Tuned model (pre-selected hyperparameters)\n", "====================================\n", "\"\"\"\n", "\n", "print(\"Using pre-tuned Random Forest hyperparameters (no search is run in this notebook)...\\n\")\n", "\n", "rf_best = RandomForestRegressor(\n", " n_estimators=200,\n", " max_depth=20,\n", " min_samples_split=2,\n", " min_samples_leaf=1,\n", " random_state=RANDOM_STATE,\n", " n_jobs=-1\n", ")\n", "\n", "rf_best.fit(X_train, y_train)\n", "y_pred_rf_best = rf_best.predict(X_test)\n", "\n", "results_rf_tuned = evaluate_model(y_test, y_pred_rf_best, \"Random Forest (Tuned)\")\n", "all_results.append({**results_rf_tuned, 'set': 'test'})\n", "\n", "# Save tuned Random Forest test predictions for downstream comparison\n", "os.makedirs('../data/06_deep_learning', exist_ok=True)\n", "os.makedirs('../models', exist_ok=True)\n", "\n", "np.save('../data/06_deep_learning/rf_test_predictions.npy', y_pred_rf_best)\n", "np.save('../data/06_deep_learning/rf_test_actuals.npy', y_test.values)\n", "\n", "rf_eval_df = pd.DataFrame({\n", " 'y_true': y_test.values,\n", " 'y_pred': y_pred_rf_best\n", "})\n", "rf_eval_df.to_csv('../models/random_forest_test_eval.csv', index=False)\n", "\n", "print(\"\\nSaved Random Forest test predictions for Notebook 06 comparison.\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ec1160f8", "metadata": {}, "outputs": 
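[], "source": [ "\"\"\"\n", "====================================\n", "Sanity check: reload saved arrays\n", "====================================\n", "\n", "Hedged round-trip check on the files just written above, so notebook 06\n", "can rely on them: shapes must match and values must survive the save.\n", "\"\"\"\n", "\n", "reloaded_pred = np.load('../data/06_deep_learning/rf_test_predictions.npy')\n", "reloaded_true = np.load('../data/06_deep_learning/rf_test_actuals.npy')\n", "\n", "assert reloaded_pred.shape == reloaded_true.shape\n", "assert np.allclose(reloaded_pred, y_pred_rf_best)\n", "\n", "print(f\"Round-trip OK: {len(reloaded_pred)} test predictions available for notebook 06\")" ] }, { "cell_type": "code", "execution_count": null, "id": "c3d4e5f6", "metadata": {}, "outputs": 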
[], "source": [ "\"\"\"\n", "====================================\n", "RF Tuned - Visualization\n", "====================================\n", "\"\"\"\n", "\n", "plot_predictions(y_test, y_pred_rf_best, \"Random Forest (Tuned)\")\n", "plot_error_distribution(y_test, y_pred_rf_best, \"Random Forest (Tuned)\")" ] }, { "cell_type": "markdown", "id": "0ad4cfbf", "metadata": {}, "source": [ "# Comparing models" ] }, { "cell_type": "code", "execution_count": null, "id": "ddaa9bcd", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Cross-Validation comparison\n", "====================================\n", "\"\"\"\n", "\n", "print(\"=\" * 60)\n", "print(\"CROSS-VALIDATION COMPARISON\")\n", "print(\"=\" * 60)\n", "\n", "models = {\n", " 'Linear Regression': LinearRegression(),\n", " 'Ridge Regression': Ridge(alpha=ridge.alpha_),\n", " 'Lasso Regression': Lasso(alpha=lasso.alpha_),\n", " 'Random Forest': RandomForestRegressor(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1),\n", " 'RF (Tuned)': rf_best\n", "}\n", "\n", "cv_results = []\n", "kfold = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)\n", "\n", "for name, model in models.items():\n", " print(f\"\\nCross-validating {name}...\")\n", " \n", " if 'Linear' in name or 'Ridge' in name or 'Lasso' in name:\n", " cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=kfold, scoring='neg_mean_absolute_error')\n", " else:\n", " cv_scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='neg_mean_absolute_error')\n", " \n", " cv_results.append({\n", " 'model': name,\n", " 'cv_mae_mean': -cv_scores.mean(),\n", " 'cv_mae_std': cv_scores.std()\n", " })\n", "\n", "cv_df = pd.DataFrame(cv_results).sort_values('cv_mae_mean')\n", "\n", "print(\"\\n### Cross-Validation Results (5-Fold)\\n\")\n", "display(cv_df)\n", "\n", "# Plot\n", "fig, ax = plt.subplots(figsize=(10, 6))\n", "\n", "ax.barh(range(len(cv_df)), cv_df['cv_mae_mean'], 
xerr=cv_df['cv_mae_std'], \n", " color=['#e74c3c', '#e67e22', '#f1c40f', '#2ecc71', '#3498db'], alpha=0.8)\n", "ax.set_yticks(range(len(cv_df)))\n", "ax.set_yticklabels(cv_df['model'])\n", "ax.set_xlabel('Mean Absolute Error (MAE)', fontsize=12)\n", "ax.set_title('Cross-Validation MAE Comparison', fontsize=14)\n", "ax.invert_yaxis()\n", "\n", "plt.tight_layout()\n", "plt.savefig('../images/05_predictive_modelling/cv_comparison.png', dpi=150, bbox_inches='tight')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "e9d15b22", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Model Comparison Summary\n", "====================================\n", "\"\"\"\n", "\n", "print(\"=\" * 60)\n", "print(\"MODEL COMPARISON SUMMARY\")\n", "print(\"=\" * 60)\n", "\n", "results_df = pd.DataFrame(all_results)\n", "results_df = results_df.sort_values('mae')\n", "\n", "print(\"\\n### Test Set Performance\\n\")\n", "display(results_df[['model', 'mae', 'rmse', 'r2', 'within_1', 'within_2']])\n", "\n", "# Visual comparison\n", "fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n", "\n", "# MAE comparison\n", "ax = axes[0]\n", "ax.barh(results_df['model'], results_df['mae'], color='#3498db', alpha=0.8)\n", "ax.set_xlabel('MAE (lower is better)', fontsize=12)\n", "ax.set_title('Mean Absolute Error', fontsize=14)\n", "ax.invert_yaxis()\n", "\n", "# R² comparison\n", "ax = axes[1]\n", "ax.barh(results_df['model'], results_df['r2'], color='#2ecc71', alpha=0.8)\n", "ax.set_xlabel('R² (higher is better)', fontsize=12)\n", "ax.set_title('R² Score', fontsize=14)\n", "ax.invert_yaxis()\n", "\n", "# Accuracy within ±1\n", "ax = axes[2]\n", "ax.barh(results_df['model'], results_df['within_1'], color='#e74c3c', alpha=0.8)\n", "ax.set_xlabel('Accuracy (%)', fontsize=12)\n", "ax.set_title('Accuracy within ±1 Grade', fontsize=14)\n", "ax.invert_yaxis()\n", "ax.set_xlim(0, 100)\n", "\n", "plt.tight_layout()\n", 
"plt.savefig('../images/05_predictive_modelling/model_comparison.png', dpi=150, bbox_inches='tight')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "1e2c723d", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Prediction Examples\n", "====================================\n", "\"\"\"\n", "\n", "print(\"### Sample Predictions\\n\")\n", "\n", "# Show some example predictions (seeded so the sample is reproducible)\n", "rng = np.random.default_rng(RANDOM_STATE)\n", "sample_indices = rng.choice(len(X_test), 10, replace=False)\n", "\n", "examples = pd.DataFrame({\n", " 'Actual': y_test.iloc[sample_indices].values,\n", " 'Linear Reg': y_pred_lr_test[sample_indices],\n", " 'Ridge': y_pred_ridge[sample_indices],\n", " 'Random Forest': y_pred_rf_test[sample_indices],\n", " 'RF (Tuned)': y_pred_rf_best[sample_indices]\n", "}).round(2)\n", "\n", "examples['Actual V'] = [to_grouped_v(x) for x in examples['Actual']]\n", "examples['RF (Tuned) V'] = [to_grouped_v(x) for x in examples['RF (Tuned)']]\n", "examples['Linear Error'] = (examples['Actual'] - examples['Linear Reg']).abs().round(2)\n", "examples['RF Error'] = (examples['Actual'] - examples['RF (Tuned)']).abs().round(2)\n", "examples['RF V-Miss'] = (examples['Actual V'] - examples['RF (Tuned) V']).abs()\n", "\n", "display(examples)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0bcb80eb", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Error analysis by grade\n", "====================================\n", "\"\"\"\n", "\n", "print(\"### Error Analysis by Grade\\n\")\n", "\n", "# Group by actual grade\n", "df_analysis = pd.DataFrame({\n", " 'actual': y_test,\n", " 'predicted_rf': y_pred_rf_best,\n", " 'error_rf': y_test - y_pred_rf_best\n", "})\n", "\n", "grade_analysis = df_analysis.groupby('actual').agg(\n", " count=('actual', 'count'),\n", " mae=('error_rf', lambda x: np.abs(x).mean()),\n", " bias=('error_rf', 'mean'),\n", " within_1=('error_rf', lambda x: 
(np.abs(x) <= 1).mean() * 100)\n", ").round(3)\n", "\n", "display(grade_analysis)\n", "\n", "# Plot\n", "fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n", "\n", "# Count by grade\n", "ax = axes[0]\n", "ax.bar(grade_analysis.index, grade_analysis['count'], color='#3498db', alpha=0.8)\n", "ax.set_xlabel('Grade')\n", "ax.set_ylabel('Count')\n", "ax.set_title('Test Set Distribution by Grade')\n", "\n", "# MAE by grade\n", "ax = axes[1]\n", "ax.bar(grade_analysis.index, grade_analysis['mae'], color='#e74c3c', alpha=0.8)\n", "ax.set_xlabel('Grade')\n", "ax.set_ylabel('MAE')\n", "ax.set_title('MAE by Grade')\n", "\n", "# Bias by grade\n", "ax = axes[2]\n", "colors = ['#2ecc71' if b >= 0 else '#e74c3c' for b in grade_analysis['bias']]\n", "ax.bar(grade_analysis.index, grade_analysis['bias'], color=colors, alpha=0.8)\n", "ax.set_xlabel('Grade')\n", "ax.set_ylabel('Bias (Actual - Predicted)')\n", "ax.set_title('Prediction Bias by Grade')\n", "ax.axhline(y=0, color='black', linestyle='--', lw=1)\n", "\n", "plt.tight_layout()\n", "plt.savefig('../images/05_predictive_modelling/error_by_grade.png', dpi=150, bbox_inches='tight')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "8664c606", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Prediction Confidence Intervals\n", "====================================\n", "\"\"\"\n", "\n", "print(\"### Prediction Confidence Analysis\\n\")\n", "\n", "# For Random Forest, we can use the individual tree predictions\n", "# to estimate prediction uncertainty\n", "\n", "# Get predictions from individual trees\n", "predictions = np.array([tree.predict(X_test) for tree in rf_best.estimators_])\n", "\n", "# Calculate statistics\n", "pred_mean = predictions.mean(axis=0)\n", "pred_std = predictions.std(axis=0)\n", "\n", "# Correlation between prediction std and absolute error\n", "abs_errors = np.abs(y_test - pred_mean)\n", "correlation = np.corrcoef(pred_std, 
abs_errors)[0, 1]\n", "\n", "print(f\"Correlation between prediction std and absolute error: {correlation:.3f}\")\n", "\n", "# Plot\n", "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n", "\n", "# Prediction std vs absolute error\n", "ax = axes[0]\n", "ax.scatter(pred_std, abs_errors, alpha=0.3, s=20)\n", "ax.set_xlabel('Prediction Std Dev (Uncertainty)', fontsize=12)\n", "ax.set_ylabel('Absolute Error', fontsize=12)\n", "ax.set_title('Prediction Uncertainty vs Error', fontsize=14)\n", "\n", "# Error by uncertainty quartile\n", "ax = axes[1]\n", "quartiles = pd.qcut(pred_std, 4, labels=['Q1 (Low)', 'Q2', 'Q3', 'Q4 (High)'])\n", "uncertainty_analysis = pd.DataFrame({\n", " 'quartile': quartiles,\n", " 'abs_error': abs_errors\n", "}).groupby('quartile')['abs_error'].mean()\n", "\n", "ax.bar(range(4), uncertainty_analysis.values, color=['#2ecc71', '#f1c40f', '#e67e22', '#e74c3c'])\n", "ax.set_xticks(range(4))\n", "ax.set_xticklabels(uncertainty_analysis.index)\n", "ax.set_ylabel('Mean Absolute Error', fontsize=12)\n", "ax.set_title('MAE by Prediction Uncertainty Quartile', fontsize=14)\n", "\n", "plt.tight_layout()\n", "plt.savefig('../images/05_predictive_modelling/prediction_uncertainty.png', dpi=150, bbox_inches='tight')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "38a6dd05", "metadata": {}, "source": [ "# Saving Models" ] }, { "cell_type": "code", "execution_count": null, "id": "fd7c4b38", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Save Models\n", "====================================\n", "\"\"\"\n", "\n", "import joblib\n", "\n", "os.makedirs('../models', exist_ok=True)\n", "\n", "# Save models\n", "joblib.dump(lr, '../models/linear_regression.pkl')\n", "joblib.dump(ridge, '../models/ridge_regression.pkl')\n", "joblib.dump(lasso, '../models/lasso_regression.pkl')\n", "joblib.dump(rf_best, '../models/random_forest_tuned.pkl')\n", "joblib.dump(scaler, '../models/feature_scaler.pkl')\n", "\n", "# 
Save feature names\n", "with open('../models/feature_names.txt', 'w') as f:\n", " for feat in X.columns:\n", " f.write(f\"{feat}\\n\")\n", "\n", "print(\"Models saved to ../models/\")\n", "print(\" - linear_regression.pkl\")\n", "print(\" - ridge_regression.pkl\")\n", "print(\" - lasso_regression.pkl\")\n", "print(\" - random_forest_tuned.pkl\")\n", "print(\" - feature_scaler.pkl\")" ] }, { "cell_type": "markdown", "id": "8ce25f6e", "metadata": {}, "source": [ "# Conclusion" ] }, { "cell_type": "code", "execution_count": null, "id": "47f5e38f", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Final Summary\n", "====================================\n", "\"\"\"\n", "\n", "print(\"=\" * 60)\n", "print(\"FINAL SUMMARY\")\n", "print(\"=\" * 60)\n", "\n", "summary = f\"\"\"\n", "### Model Performance Summary\n", "\n", "| Model | MAE | RMSE | R² | Within ±1 | Within ±2 | Exact V | Within ±1 V |\n", "|-------|-----|------|----|-----------|-----------|---------|-------------|\n", "| Linear Regression | {results_lr_test['mae']:.3f} | {results_lr_test['rmse']:.3f} | {results_lr_test['r2']:.3f} | {results_lr_test['within_1']:.1f}% | {results_lr_test['within_2']:.1f}% | {results_lr_test['exact_grouped_v']:.1f}% | {results_lr_test['within_1_vgrade']:.1f}% |\n", "| Ridge Regression | {results_ridge['mae']:.3f} | {results_ridge['rmse']:.3f} | {results_ridge['r2']:.3f} | {results_ridge['within_1']:.1f}% | {results_ridge['within_2']:.1f}% | {results_ridge['exact_grouped_v']:.1f}% | {results_ridge['within_1_vgrade']:.1f}% |\n", "| Lasso Regression | {results_lasso['mae']:.3f} | {results_lasso['rmse']:.3f} | {results_lasso['r2']:.3f} | {results_lasso['within_1']:.1f}% | {results_lasso['within_2']:.1f}% | {results_lasso['exact_grouped_v']:.1f}% | {results_lasso['within_1_vgrade']:.1f}% |\n", "| Random Forest (Tuned) | {results_rf_tuned['mae']:.3f} | {results_rf_tuned['rmse']:.3f} | {results_rf_tuned['r2']:.3f} | 
{results_rf_tuned['within_1']:.1f}% | {results_rf_tuned['within_2']:.1f}% | {results_rf_tuned['exact_grouped_v']:.1f}% | {results_rf_tuned['within_1_vgrade']:.1f}% |\n", "\n", "### Key Findings\n", "\n", "1. **Tree-based models remain strongest on this structured feature set.**\n", " - Random Forest (Tuned) achieves the best overall balance of MAE, RMSE, and grouped V-grade performance.\n", " - Linear models remain useful baselines but leave clear nonlinear signal unexplained.\n", "\n", "2. **Fine-grained difficulty prediction is meaningfully harder than grouped grade prediction.**\n", " - On the held-out test set, the best model is within ±1 fine-grained difficulty score {results_rf_tuned['within_1']:.1f}% of the time.\n", " - The same model is within ±1 grouped V-grade {results_rf_tuned['within_1_vgrade']:.1f}% of the time.\n", "\n", "3. **This gap is expected and informative.**\n", " - Small numeric errors often stay inside the same or adjacent V-grade buckets.\n", " - The model captures broad difficulty bands more reliably than exact score distinctions.\n", "\n", "4. 
**The project’s main predictive takeaway is practical rather than perfect.**\n", " - The models are not exact grade replicators.\n", " - They are reasonably strong at placing climbs into the correct neighborhood of difficulty.\n", "\n", "### Portfolio Interpretation\n", "\n", "From a modelling perspective, this project shows:\n", "- feature engineering grounded in domain structure,\n", "- comparison of linear and nonlinear models,\n", "- honest evaluation on a held-out test set,\n", "- and the ability to translate raw regression performance into climbing-relevant grouped V-grade metrics.\n", "\"\"\"\n", "\n", "print(summary)\n", "\n", "# Save summary\n", "os.makedirs('../data/05_predictive_modelling', exist_ok=True)\n", "with open('../data/05_predictive_modelling/model_summary.txt', 'w') as f:\n", "    f.write(summary)\n", "\n", "print(\"\\nSummary saved to ../data/05_predictive_modelling/model_summary.txt\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ee8621d7", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "====================================\n", "Bonus: Gradient Boosting Comparison\n", "====================================\n", "\"\"\"\n", "\n", "print(\"=\" * 60)\n", "print(\"GRADIENT BOOSTING\")\n", "print(\"=\" * 60)\n", "\n", "from sklearn.ensemble import GradientBoostingRegressor\n", "\n", "# Train gradient boosting\n", "gb = GradientBoostingRegressor(\n", "    n_estimators=200,\n", "    max_depth=5,\n", "    learning_rate=0.1,\n", "    min_samples_split=5,\n", "    random_state=RANDOM_STATE\n", ")\n", "\n", "gb.fit(X_train, y_train)\n", "\n", "# Predict on the test set\n", "y_pred_gb = gb.predict(X_test)\n", "\n", "# Evaluate\n", "results_gb = evaluate_model(y_test, y_pred_gb, \"Gradient Boosting\")\n", "all_results.append({**results_gb, 'set': 'test'})\n", "\n", "# Compare against the tuned Random Forest, then plot GB predictions\n", "print(f\"Gradient Boosting MAE: {results_gb['mae']:.3f} vs Random Forest (Tuned) MAE: {results_rf_tuned['mae']:.3f}\")\n", "\n", "plot_predictions(y_test, y_pred_gb, \"Gradient Boosting\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, 
"language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.14.3" } }, "nbformat": 4, "nbformat_minor": 5 }