Tension-Board-2-Analysis/notebooks/04_feature_engineering.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ed7f1a86",
   "metadata": {},
   "source": [
    "# Tension Board 2 Mirror: Feature Engineering\n",
    "\n",
    "The goal of this notebook is to convert raw climb descriptions into a clean modelling table. Each row of the final table corresponds to a single climb-angle observation, and each column is a numeric feature that may help predict grade.\n",
    "\n",
    "## Modelling idea\n",
    "\n",
    "A climb's grade should depend on more than just angle. It should also depend on the geometry and sequencing of the holds used. To capture that, this notebook builds features from three sources:\n",
    "\n",
    "1. **Wall configuration**  \n",
    "   Examples: angle, board geometry, mirrored placements.\n",
    "\n",
    "2. **Route structure**  \n",
    "   Examples: number of holds, spatial spread, height gained, move lengths, left/right balance, and other frame-derived quantities.\n",
    "\n",
    "When this was initially done, we added:\n",
    "\n",
    "3. **Hold difficulty priors** \n",
    "\n",
    "However, that makes it quite circular -- we'd be using the difficulty data to create difficulty scores to then predict difficulty data. The difficulty is already baked in there, so it is not a very good independent model. Heuristically, I don't think this is a big deal if we **just** want to predict V-grades, but we'll leave it out of our analysis in order to see what sorts of features actually help determine the difficulty of a climb. We'll add it back in notebook 07 and create a *leaky model*. \n",
    "\n",
    "## Output\n",
    "\n",
    "The final product is a saved feature matrix that is reused in the predictive modelling and deep learning notebooks.\n",
    "\n",
    "## Notebook Structure\n",
    "\n",
    "1. [Setup and Imports](#setup-and-imports)\n",
    "2. [Feature Extraction](#feature-extraction)\n",
    "3. [Visualizing Key Features](#visualizing-key-features)\n",
    "4. [Conclusion](#conclusion)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef5d85ef",
   "metadata": {},
   "source": [
    "# Setup and Imports\n",
    "\n",
    "This section loads the database, auxiliary tables, and the hold-difficulty table produced in notebook 03.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "513d5c42",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Setup and Imports\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "# Imports\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import numpy as np\n",
    "import matplotlib.patches as mpatches\n",
    "\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "from scipy.spatial import ConvexHull\n",
    "from scipy.spatial.distance import pdist, squareform\n",
    "\n",
    "import sqlite3\n",
    "\n",
    "import re\n",
    "import os\n",
    "from collections import defaultdict\n",
    "\n",
    "import ast\n",
    "\n",
    "from PIL import Image\n",
    "\n",
    "# Set some display options\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.max_rows', 100)\n",
    "\n",
    "# Set style\n",
    "palette=['steelblue', 'coral', 'seagreen']  #(for multi-bar graphs)\n",
    "\n",
    "# Set board image for some visual analysis\n",
    "board_img = Image.open('../images/tb2_board_12x12_composite.png')\n",
    "\n",
    "# Connect to the database\n",
    "DB_PATH=\"../data/tb2.db\"\n",
    "conn = sqlite3.connect(DB_PATH)\n",
    "\n",
    "# Create output directories\n",
    "os.makedirs('../data/04_climb_features', exist_ok=True)\n",
    "os.makedirs('../images/04_climb_features', exist_ok=True)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04f9ccb8",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Query our data from the DB\n",
    "==================================\n",
    "\n",
    "This time we restrict to where `layout_id=10` for the TB2 Mirror.\n",
    "We will also restrict ourselves to an angle of at most 50, since according to our grade vs angle distribution in notebook 01, things start to look a bit weird past 50.\n",
    "(Probably a bias towards climbers who can actually climb that steep). We will encode this directly into our query.\n",
    "\"\"\"\n",
    "\n",
    "# Query climbs data\n",
    "climbs_query = \"\"\"\n",
    "SELECT\n",
    "    c.uuid,\n",
    "    c.name AS climb_name,\n",
    "    c.setter_username,\n",
    "    c.layout_id AS layout_id,\n",
    "    c.description,\n",
    "    c.is_nomatch,\n",
    "    c.is_listed,\n",
    "    l.name AS layout_name,\n",
    "    p.name AS board_name,\n",
    "    c.frames,\n",
    "    cs.angle,\n",
    "    cs.display_difficulty,\n",
    "    dg.boulder_name AS boulder_grade,\n",
    "    cs.ascensionist_count,\n",
    "    cs.quality_average,\n",
    "    cs.fa_at\n",
    "FROM climbs c\n",
    "JOIN layouts l ON c.layout_id = l.id\n",
    "JOIN products p ON l.product_id = p.id\n",
    "JOIN climb_stats cs ON c.uuid = cs.climb_uuid\n",
    "JOIN difficulty_grades dg ON ROUND(cs.display_difficulty) = dg.difficulty\n",
    "WHERE cs.display_difficulty IS NOT NULL AND c.is_listed=1 AND c.layout_id=10 AND cs.angle <= 50\n",
    "\"\"\"\n",
    "\n",
    "# Query information about placements (and their mirrors)\n",
    "placements_query = \"\"\"\n",
    "SELECT\n",
    "    p.id AS placement_id,\n",
    "    h.x,\n",
    "    h.y,\n",
    "    p.default_placement_role_id AS default_role_id,\n",
    "    p.set_id AS set_id,\n",
    "    s.name AS set_name,\n",
    "    p_mirror.id AS mirror_placement_id\n",
    "FROM placements p\n",
    "JOIN holes h ON p.hole_id = h.id\n",
    "JOIN sets s ON p.set_id = s.id\n",
    "LEFT JOIN holes h_mirror ON h.mirrored_hole_id = h_mirror.id\n",
    "LEFT JOIN placements p_mirror ON p_mirror.hole_id = h_mirror.id AND p_mirror.layout_id = p.layout_id\n",
    "WHERE p.layout_id = 10\n",
    "\"\"\"\n",
    "\n",
    "# Load it into a DataFrame\n",
    "df_climbs = pd.read_sql_query(climbs_query, conn)\n",
    "df_placements = pd.read_sql_query(placements_query, conn)\n",
    "\n",
    "# Load the hold-level difficulty table created in notebook 03\n",
    "df_hold_difficulty = pd.read_csv('../data/03_hold_difficulty/hold_difficulty_scores.csv', index_col='placement_id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e5d93f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Difficulty-related columns loaded from Notebook 03:\")\n",
    "print([c for c in df_hold_difficulty.columns if 'difficulty' in c.lower()])\n",
    "\n",
    "assert 'overall_difficulty' in df_hold_difficulty.columns, \"Missing overall_difficulty\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0c28b51",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_hold_difficulty"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f54f7f6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "placement_coords = {\n",
    "    row['placement_id']: (row['x'], row['y'])\n",
    "    for _, row in df_placements.iterrows()\n",
    "}\n",
    "\n",
    "board_width = 144\n",
    "board_height = 144\n",
    "\n",
    "x_min, x_max = -68, 68\n",
    "y_min, y_max = 0, 144\n",
    "\n",
    "# Role definitions (TB2)\n",
    "ROLE_DEFINITIONS = {\n",
    "    'start': 5,\n",
    "    'middle': 6,\n",
    "    'finish': 7,\n",
    "    'foot': 8\n",
    "}\n",
    "\n",
    "HAND_ROLE_IDS = [5, 6, 7]\n",
    "FOOT_ROLE_IDS = [8]\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38e865a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Parse Frame function\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "def parse_frames(frames_str):\n",
    "    \"\"\"\n",
    "    Parse frames string into list of (placement_id, role_id) tuples.\n",
    "    \n",
    "    Parameters:\n",
    "    -----------\n",
    "    frames_str : str\n",
    "        Frame string like \"p1r5p2r6p3r8\"\n",
    "    \n",
    "    Returns:\n",
    "    --------\n",
    "    list of tuples: [(placement_id, role_id), ...]\n",
    "    \"\"\"\n",
    "    if not isinstance(frames_str, str):\n",
    "        return []\n",
    "    \n",
    "    matches = re.findall(r'p(\\d+)r(\\d+)', frames_str)\n",
    "    return [(int(p), int(r)) for p, r in matches]\n",
    "\n",
    "\n",
    "def get_role_type(role_id):\n",
    "    \"\"\"Map role_id to role type string.\"\"\"\n",
    "    for role_type, rid in ROLE_DEFINITIONS.items():\n",
    "        if role_id == rid:\n",
    "            return role_type\n",
    "    return 'unknown'\n",
    "\n",
    "\n",
    "# Test\n",
    "test_frames = \"p1r5p2r6p3r8p4r5\"\n",
    "parsed = parse_frames(test_frames)\n",
    "print(f\"Test parse: {parsed}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2f2138a",
   "metadata": {},
   "source": [
    "# Feature Extraction\n",
    "\n",
    "This is the core notebook section. The aim is to translate the raw `frames` string into a route-level numerical representation suitable for regression or classification models.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eeba545e",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Feature Exraction Function\n",
    "==================================\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61ec584a",
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_features(row, placement_coords):\n",
    "    \"\"\"\n",
    "    Extract a trimmed set of clean geometric/spatial features.\n",
    "    No hold-difficulty-derived features are used.\n",
    "    \"\"\"\n",
    "    features = {}\n",
    "\n",
    "    holds = parse_frames(row['frames'])\n",
    "    angle = row['angle']\n",
    "\n",
    "    if not holds:\n",
    "        return None\n",
    "\n",
    "    hold_data = []\n",
    "    for placement_id, role_id in holds:\n",
    "        coords = placement_coords.get(placement_id, (None, None))\n",
    "        if coords[0] is None:\n",
    "            continue\n",
    "\n",
    "        role_type = get_role_type(role_id)\n",
    "        is_hand = role_id in HAND_ROLE_IDS\n",
    "        is_foot = role_id in FOOT_ROLE_IDS\n",
    "\n",
    "        hold_data.append({\n",
    "            'placement_id': placement_id,\n",
    "            'x': coords[0],\n",
    "            'y': coords[1],\n",
    "            'role_type': role_type,\n",
    "            'is_hand': is_hand,\n",
    "            'is_foot': is_foot,\n",
    "        })\n",
    "\n",
    "    if not hold_data:\n",
    "        return None\n",
    "\n",
    "    df_holds = pd.DataFrame(hold_data)\n",
    "\n",
    "    hand_holds = df_holds[df_holds['is_hand']]\n",
    "    foot_holds = df_holds[df_holds['is_foot']]\n",
    "    start_holds = df_holds[df_holds['role_type'] == 'start']\n",
    "    finish_holds = df_holds[df_holds['role_type'] == 'finish']\n",
    "    middle_holds = df_holds[df_holds['role_type'] == 'middle']\n",
    "\n",
    "    xs = df_holds['x'].to_numpy()\n",
    "    ys = df_holds['y'].to_numpy()\n",
    "\n",
    "    description = row.get('description', '')\n",
    "    if pd.isna(description):\n",
    "        description = ''\n",
    "\n",
    "    center_x = (x_min + x_max) / 2\n",
    "\n",
    "    # Basic\n",
    "    features['angle'] = angle\n",
    "    features['angle_squared'] = angle ** 2\n",
    "\n",
    "    features['total_holds'] = len(df_holds)\n",
    "    features['hand_holds'] = len(hand_holds)\n",
    "    features['foot_holds'] = len(foot_holds)\n",
    "    features['start_holds'] = len(start_holds)\n",
    "    features['finish_holds'] = len(finish_holds)\n",
    "    features['middle_holds'] = len(middle_holds)\n",
    "\n",
    "    features['is_nomatch'] = int(\n",
    "        (row['is_nomatch'] == 1) or\n",
    "        bool(re.search(r'\\bno\\s*match(ing)?\\b', description, flags=re.IGNORECASE))\n",
    "    )\n",
    "\n",
    "    # Spatial\n",
    "    features['mean_y'] = np.mean(ys)\n",
    "    features['std_x'] = np.std(xs) if len(xs) > 1 else 0.0\n",
    "    features['std_y'] = np.std(ys) if len(ys) > 1 else 0.0\n",
    "    features['range_x'] = np.max(xs) - np.min(xs)\n",
    "    features['range_y'] = np.max(ys) - np.min(ys)\n",
    "    features['min_y'] = np.min(ys)\n",
    "    features['max_y'] = np.max(ys)\n",
    "    features['height_gained'] = features['max_y'] - features['min_y']\n",
    "\n",
    "    # Start / finish heights\n",
    "    start_height = start_holds['y'].mean() if len(start_holds) > 0 else np.nan\n",
    "    finish_height = finish_holds['y'].mean() if len(finish_holds) > 0 else np.nan\n",
    "\n",
    "    features['height_gained_start_finish'] = (\n",
    "        finish_height - start_height\n",
    "        if pd.notna(start_height) and pd.notna(finish_height)\n",
    "        else np.nan\n",
    "    )\n",
    "\n",
    "    # Density / symmetry\n",
    "    bbox_area = features['range_x'] * features['range_y']\n",
    "    features['bbox_area'] = bbox_area\n",
    "    features['hold_density'] = features['total_holds'] / bbox_area if bbox_area > 0 else 0.0\n",
    "    features['holds_per_vertical_foot'] = features['total_holds'] / max(features['range_y'], 1)\n",
    "\n",
    "    left_holds = (df_holds['x'] < center_x).sum()\n",
    "    features['left_ratio'] = left_holds / features['total_holds'] if features['total_holds'] > 0 else 0.5\n",
    "    features['symmetry_score'] = 1 - abs(features['left_ratio'] - 0.5) * 2\n",
    "\n",
    "    y_median = np.median(ys)\n",
    "    upper_holds = (df_holds['y'] > y_median).sum()\n",
    "    features['upper_ratio'] = upper_holds / features['total_holds']\n",
    "\n",
    "    # Hand reach\n",
    "    if len(hand_holds) >= 2:\n",
    "        hand_points = hand_holds[['x', 'y']].to_numpy()\n",
    "        hand_distances = pdist(hand_points)\n",
    "\n",
    "        hand_xs = hand_holds['x'].to_numpy()\n",
    "        hand_ys = hand_holds['y'].to_numpy()\n",
    "\n",
    "        features['mean_hand_reach'] = float(np.mean(hand_distances))\n",
    "        features['max_hand_reach'] = float(np.max(hand_distances))\n",
    "        features['std_hand_reach'] = float(np.std(hand_distances))\n",
    "        features['hand_spread_x'] = float(hand_xs.max() - hand_xs.min())\n",
    "        features['hand_spread_y'] = float(hand_ys.max() - hand_ys.min())\n",
    "    else:\n",
    "        features['mean_hand_reach'] = 0.0\n",
    "        features['max_hand_reach'] = 0.0\n",
    "        features['std_hand_reach'] = 0.0\n",
    "        features['hand_spread_x'] = 0.0\n",
    "        features['hand_spread_y'] = 0.0\n",
    "\n",
    "    # Hand-foot distances\n",
    "    if len(hand_holds) > 0 and len(foot_holds) > 0:\n",
    "        hand_points = hand_holds[['x', 'y']].to_numpy()\n",
    "        foot_points = foot_holds[['x', 'y']].to_numpy()\n",
    "\n",
    "        dists = []\n",
    "        for hx, hy in hand_points:\n",
    "            for fx, fy in foot_points:\n",
    "                dists.append(np.sqrt((hx - fx)**2 + (hy - fy)**2))\n",
    "        dists = np.asarray(dists)\n",
    "\n",
    "        features['min_hand_to_foot'] = float(np.min(dists))\n",
    "        features['mean_hand_to_foot'] = float(np.mean(dists))\n",
    "        features['std_hand_to_foot'] = float(np.std(dists))\n",
    "    else:\n",
    "        features['min_hand_to_foot'] = 0.0\n",
    "        features['mean_hand_to_foot'] = 0.0\n",
    "        features['std_hand_to_foot'] = 0.0\n",
    "\n",
    "    # Global geometry\n",
    "    points = np.column_stack([xs, ys])\n",
    "\n",
    "    if len(df_holds) >= 3:\n",
    "        try:\n",
    "            hull = ConvexHull(points)\n",
    "            features['convex_hull_area'] = float(hull.volume)\n",
    "            features['hull_area_to_bbox_ratio'] = features['convex_hull_area'] / max(bbox_area, 1)\n",
    "        except Exception:\n",
    "            features['convex_hull_area'] = np.nan\n",
    "            features['hull_area_to_bbox_ratio'] = np.nan\n",
    "    else:\n",
    "        features['convex_hull_area'] = 0.0\n",
    "        features['hull_area_to_bbox_ratio'] = 0.0\n",
    "\n",
    "    if len(df_holds) >= 2:\n",
    "        pairwise = pdist(points)\n",
    "        features['mean_pairwise_distance'] = float(np.mean(pairwise))\n",
    "        features['std_pairwise_distance'] = float(np.std(pairwise))\n",
    "    else:\n",
    "        features['mean_pairwise_distance'] = 0.0\n",
    "        features['std_pairwise_distance'] = 0.0\n",
    "\n",
    "    if len(df_holds) >= 2:\n",
    "        sorted_idx = np.argsort(ys)\n",
    "        sorted_points = points[sorted_idx]\n",
    "\n",
    "        path_length = 0.0\n",
    "        for i in range(len(sorted_points) - 1):\n",
    "            dx = sorted_points[i + 1, 0] - sorted_points[i, 0]\n",
    "            dy = sorted_points[i + 1, 1] - sorted_points[i, 1]\n",
    "            path_length += np.sqrt(dx**2 + dy**2)\n",
    "\n",
    "        features['path_length_vertical'] = path_length\n",
    "        features['path_efficiency'] = features['height_gained'] / max(path_length, 1)\n",
    "    else:\n",
    "        features['path_length_vertical'] = 0.0\n",
    "        features['path_efficiency'] = 0.0\n",
    "\n",
    "    # Normalized / relative\n",
    "    features['mean_y_normalized'] = (features['mean_y'] - y_min) / board_height\n",
    "    features['start_height_normalized'] = (\n",
    "        (start_height - y_min) / board_height if pd.notna(start_height) else np.nan\n",
    "    )\n",
    "    features['finish_height_normalized'] = (\n",
    "        (finish_height - y_min) / board_height if pd.notna(finish_height) else np.nan\n",
    "    )\n",
    "    features['mean_y_relative_to_start'] = (\n",
    "        features['mean_y'] - start_height if pd.notna(start_height) else np.nan\n",
    "    )\n",
    "    features['spread_x_normalized'] = features['range_x'] / board_width\n",
    "    features['spread_y_normalized'] = features['range_y'] / board_height\n",
    "\n",
    "    y_q75 = np.percentile(ys, 75)\n",
    "    y_q25 = np.percentile(ys, 25)\n",
    "    features['y_q75'] = y_q75\n",
    "    features['y_iqr'] = y_q75 - y_q25\n",
    "\n",
    "    # Optional engineered clean feature\n",
    "    features['complexity_score'] = (\n",
    "        features['mean_hand_reach']\n",
    "        * np.log1p(features['total_holds'])\n",
    "        * (1 + features['hold_density'])\n",
    "    )\n",
    "\n",
    "    return features"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e800c18b",
   "metadata": {},
   "source": [
    "## Sanity Check on One Example\n",
    "\n",
    "Before extracting features for the entire dataset, we inspect one representative climb to confirm that the parsing logic and the computed geometric summaries behave as expected. Let's do the climb \"Ooo La La\" from notebook two.\n",
    "\n",
    "![Ooo La La](../images/02_hold_stats/Ooo_La_La.png)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "573182a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "extract_features(df_climbs.iloc[10000], placement_coords)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "551b47ed",
   "metadata": {},
   "source": [
    "The printed example above is an important checkpoint. If the parsed placements, role counts, or geometric summaries look unreasonable here, then the full feature matrix will inherit those mistakes.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6df7451e",
   "metadata": {},
   "source": [
    "## Extract Features or all climbs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ee9856b",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Extract features for all climbs\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "from tqdm import tqdm # Progess bar. This will take a while.\n",
    "\n",
    "print(f\"Extracting features for {len(df_climbs)} climbs...\")\n",
    "\n",
    "feature_list = []\n",
    "\n",
    "for idx, row in tqdm(df_climbs.iterrows(), total=len(df_climbs)):\n",
    "    features = extract_features(row, placement_coords)\n",
    "    if features:\n",
    "        features['climb_uuid'] = row['uuid']\n",
    "        features['display_difficulty'] = row['display_difficulty']\n",
    "        feature_list.append(features)\n",
    "\n",
    "df_features = pd.DataFrame(feature_list)\n",
    "df_features = df_features.set_index('climb_uuid')\n",
    "\n",
    "print(f\"\\nExtracted features for {len(df_features)} climbs\")\n",
    "print(f\"Feature columns: {len(df_features.columns)}\")\n",
    "\n",
    "print(\"\\n### Feature Table Sample\\n\")\n",
    "display(df_features.head(10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dcbb5de5",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Feature Summary Statistics\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "print(\"### Feature Summary\\n\")\n",
    "\n",
    "summary = df_features.describe().T\n",
    "summary['missing'] = df_features.isna().sum()\n",
    "summary['missing_pct'] = (df_features.isna().sum() / len(df_features) * 100).round(2)\n",
    "\n",
    "display(summary[['count', 'mean', 'std', 'min', 'max', 'missing', 'missing_pct']])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb2eb615",
   "metadata": {},
   "source": [
    "## Correlation with Difficulty"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "668a506e",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Correlation with Difficulty\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "correlations = df_features.corr()['display_difficulty'].drop('display_difficulty').sort_values(key=abs, ascending=False)\n",
    "\n",
    "print(\"### Top 30 Features Correlated with Difficulty\\n\")\n",
    "display(correlations.head(30).to_frame('correlation'))\n",
    "\n",
    "print(\"\\n### Bottom 10 Features (Least Correlated)\\n\")\n",
    "display(correlations.tail(10).to_frame('correlation'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95ef9547",
   "metadata": {},
   "source": [
    "# Visualizing Key Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25a55e53",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Visualize Key Features\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "fig, axes = plt.subplots(4, 4, figsize=(16, 16))\n",
    "\n",
    "key_features = [\n",
    "    # Core driver\n",
    "    'angle',\n",
    "\n",
    "    # Basic structure\n",
    "    'total_holds',\n",
    "    'height_gained',\n",
    "\n",
    "    # Density / compactness\n",
    "    'hold_density',\n",
    "    'holds_per_vertical_foot',\n",
    "\n",
    "    # Hand geometry (very important)\n",
    "    'mean_hand_reach',\n",
    "    'max_hand_reach',\n",
    "    'std_hand_reach',\n",
    "\n",
    "    # Hand-foot interaction\n",
    "    'mean_hand_to_foot',\n",
    "    'std_hand_to_foot',\n",
    "\n",
    "    # Spatial layout\n",
    "    'symmetry_score',\n",
    "    'upper_ratio',\n",
    "\n",
    "    # Global geometry\n",
    "    'convex_hull_area',\n",
    "    'hull_area_to_bbox_ratio',\n",
    "\n",
    "    # Path / flow\n",
    "    'path_length_vertical',\n",
    "    'path_efficiency'\n",
    "]\n",
    "\n",
    "for ax, feature in zip(axes.flat, key_features):\n",
    "    if feature in df_features.columns:\n",
    "        ax.scatter(df_features[feature], df_features['display_difficulty'], alpha=0.3, s=10)\n",
    "        ax.set_xlabel(feature)\n",
    "        ax.set_ylabel('Difficulty')\n",
    "        ax.set_title(f'{feature} vs Difficulty')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('../images/04_climb_features/feature_correlations.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d27cfcf7",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Add Interaction Features\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "# Angle interactions\n",
    "df_features['angle_x_holds'] = df_features['angle'] * df_features['total_holds']\n",
    "df_features['angle_squared'] = df_features['angle'] ** 2\n",
    "\n",
    "\n",
    "# Complexity features\n",
    "df_features['complexity_score'] = (\n",
    "    df_features['total_holds'] * \n",
    "    df_features['mean_hand_reach'].fillna(0) * \n",
    "    df_features['hold_density']\n",
    ")\n",
    "\n",
    "print(f\"Added interaction features. Total columns: {len(df_features.columns)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f87892fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Handle Missing Values\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "missing = df_features.isna().sum()\n",
    "missing_cols = missing[missing > 0]\n",
    "\n",
    "print(\"### Columns with Missing Values\\n\")\n",
    "display(missing_cols.to_frame('missing'))\n",
    "\n",
    "\n",
    "# Fill other NaNs with column means\n",
    "for col in df_features.columns:\n",
    "    if df_features[col].isna().any():\n",
    "        if df_features[col].dtype in ['float64', 'int64']:\n",
    "            df_features[col] = df_features[col].fillna(df_features[col].mean())\n",
    "\n",
    "# Check remaining missing\n",
    "remaining = df_features.isna().sum().sum()\n",
    "print(f\"\\nRemaining missing values: {remaining}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed904eb3",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "===================================\n",
    "Feature Importance Review\n",
    "===================================\n",
    "\"\"\"\n",
    "\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "X = df_features.drop(columns=['display_difficulty'])\n",
    "y = df_features['display_difficulty']\n",
    "\n",
    "rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=3, n_jobs=-1)\n",
    "rf.fit(X, y)\n",
    "\n",
    "importance = pd.DataFrame({\n",
    "    'feature': X.columns,\n",
    "    'importance': rf.feature_importances_\n",
    "}).sort_values('importance', ascending=False)\n",
    "\n",
    "print(\"### Top 30 Most Important Features (Random Forest)\\n\")\n",
    "display(importance.head(30))\n",
    "\n",
    "# Cross-validation score\n",
    "scores = cross_val_score(rf, X, y, cv=5, scoring='neg_mean_absolute_error')\n",
    "print(f\"\\nCross-validated MAE: {-scores.mean():.2f} (+/- {scores.std():.2f})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "547f7eb1",
   "metadata": {},
   "source": [
    "# Conclusion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f5f95c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "============================\n",
    "Save Feature Matrix\n",
    "============================\n",
    "\"\"\"\n",
    "raw_cols = [c for c in df_features.columns if c.endswith('_raw')]\n",
    "if raw_cols:\n",
    "    print(\"Dropping raw columns from final climb feature matrix:\")\n",
    "    print(raw_cols)\n",
    "    df_features = df_features.drop(columns=raw_cols)\n",
    "\n",
    "# `climb_features.csv` is the canonical name used by later notebooks.\n",
    "df_features.to_csv('../data/04_climb_features/climb_features.csv')\n",
    "\n",
    "print(\"Saved feature matrix to:\")\n",
    "print(\"  - ../data/04_climb_features/climb_features.csv\")\n",
    "\n",
    "with open('../data/04_climb_features/feature_list.txt', 'w') as f:\n",
    "    for col in df_features.columns:\n",
    "        f.write(f\"{col}\\n\")\n",
    "\n",
    "print(\"\\nFeature list saved to ../data/04_climb_features/feature_list.txt\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "07d3e1dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Final Feature Summary\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "print(\"### Feature Engineering Complete\\n\")\n",
    "print(f\"Total climbs: {len(df_features)}\")\n",
    "print(f\"Total features: {df_features.shape[1] - 1}\")  # Exclude target\n",
    "print(f\"Target: display_difficulty\")\n",
    "print(f\"Feature matrix shape: {df_features.shape}\")\n",
    "\n",
    "print(\"\"\"\\nInterpretation:\n",
    "- Each row is a climb-angle observation.\n",
    "- The target is `display_difficulty`.\n",
    "- The predictors combine geometry and structure\n",
    "- The next notebook tests how much predictive signal these engineered features actually contain.\n",
    "\"\"\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}