fixed leakage

This commit is contained in:
Pawel Sarkowicz
2026-03-28 16:03:04 -04:00
parent 880272aaf5
commit 3ab9b77bb7
36 changed files with 2296 additions and 681 deletions

README.md

@@ -1,6 +1,6 @@
# Kilter Board: Predicting Climbing Route Difficulty from Board Data
I recently got into *board climbing*, and I enjoy climbing on the TB2 and the Kilter Board. I've been climbing on the 12ftx12ft (mirrored) that is available at my local gym, and I've never felt that the phrase "*it hurts so good*" would be so apt. As such, I've done an in depth analysis of TB2 data <a href="https://gitlab.com/psark/Tension-Board-2-Analysis">here</a>, and have decided to mimic that analysis with available Kilter Board data.
I recently got into *board climbing*, and I enjoy climbing on the TB2 and the Kilter Board. I've been climbing on 12x12ft boards that are available at my local gym, and I've never felt that the phrase "*it hurts so good*" would be so apt. As such, I've done an in-depth analysis of TB2 data <a href="https://gitlab.com/psark/Tension-Board-2-Analysis">here</a>, and have decided to mimic that analysis with available Kilter Board data.
![Hold Usage Heatmap](images/02_hold_stats/all_holds_all_grades_heatmap.png)
@@ -11,12 +11,12 @@ I recently got into *board climbing*, and I enjoy climbing on the TB2 and the Ki
## Overview
This project analyzes ~300,000 climbs from the Kilter Board in order to do the following.
This project analyzes ~300,000 climbs on the Kilter Board in order to do the following.
> 1. **Understand** hold usage patterns and difficulty distributions
> 2. **Quantify** empirical hold difficulty scores
> 3. **Predict** climb grades from hold positions and board angle
Climbing grades are inherently subjective. Different climbers use different beta, setters have different grading standards, and difficulty depends on factors not always captured in data. What makes it harder in the case of the board climbing is that the grade is displayed almost democratically -- it is determined by user input.
Climbing grades are inherently subjective. Different climbers use different beta, setters have different grading standards, and difficulty depends on factors not always captured in data. Moreover, on the boards, the displayed grade for any specific climb is based on user input.
Using a Kilter Board dataset, this project combines:
@@ -153,8 +153,7 @@ Beyond structural analysis, we can also study how board-climbers behave over tim
![Hold Difficulty](images/03_hold_difficulty/difficulty_hand_40deg.png)
* Hold difficulty is estimated from climb data
* We averaged (pre-role/per-angle) difficulty for each hold (with Bayesian smoothing)
* Took advantage of the mirrored layout to increase the amount of data per hold
* We averaged (per-role/per-angle) difficulty for each hold (with Bayesian smoothing)
### Key technique: Bayesian smoothing
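The smoothing step can be sketched as shrinking each hold's empirical mean grade toward a global prior. This is a minimal illustration, not the notebook's exact implementation; the prior strength `k` (a pseudo-count) is an assumed free parameter.

```python
import numpy as np

def smoothed_difficulty(hold_grades, global_mean, k=20):
    """Shrink a hold's mean grade toward the global mean.

    Holds with few ascents are pulled strongly toward the global
    mean; well-sampled holds keep their empirical mean. `k` acts
    as a pseudo-count controlling the strength of the prior.
    """
    n = len(hold_grades)
    if n == 0:
        return global_mean
    return (np.sum(hold_grades) + k * global_mean) / (n + k)

# A hold seen only twice stays close to the global mean:
print(smoothed_difficulty([9.0, 9.5], global_mean=5.0, k=20))
```

With only two observations at grade ~9, the estimate lands near 5.4 rather than 9.25, which is exactly the behavior that stabilizes estimates for rarely used holds.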
@@ -167,7 +166,7 @@ This significantly improves downstream feature quality.
---
## 6. Many more!
## 6. Many more
There are many other statistics; see notebooks [`01`](notebooks/01_data_overview_and_climbing_statistics.ipynb) (climbing statistics), [`02`](notebooks/02_hold_analysis_and_board_heatmaps.ipynb) (climbing hold statistics), and [`03`](notebooks/03_hold_difficulty.ipynb) (hold difficulty). Included are:
@@ -188,35 +187,58 @@ This section focuses on **building predictive models and evaluating performance*
---
## 7. Feature Engineering
Features are constructed at the climb level using only **structural and geometric information** derived from the climb definition (`angle` and `frames`).
Features are constructed at the climb level using:
We explicitly avoid using hold-difficulty-derived features in the predictive models to prevent target leakage.
* geometry (height, spread, convex hull)
* structure (number of moves, clustering)
* hold difficulty (smoothed estimates)
* interaction features
Feature categories include:
* **Geometry** — spatial footprint of the climb (height, spread, convex hull)
* **Movement** — reach distances and spatial relationships between holds
* **Density** — how tightly or sparsely holds are arranged
* **Symmetry** — left/right balance and distribution
* **Path structure** — approximations of movement flow and efficiency
* **Normalized position** — relative positioning on the board
* **Interaction features** — simple nonlinear combinations (e.g., angle × hold count)
This results in a **leakage-free feature set** that better reflects the physical structure of climbing.
| Category | Description | Examples |
| ------------- | --------------------------------- | ------------------------------------------- |
| Geometry | Shape and size of climb | bbox_area, range_x, range_y |
| Movement | Reach and movement complexity | max_hand_reach, path_efficiency |
| Difficulty | Hold-based difficulty metrics | mean_hold_difficulty, max_hold_difficulty |
| Progression | How difficulty changes over climb | difficulty_gradient, difficulty_progression |
| Symmetry | Left/right balance | symmetry_score, hand_symmetry |
| Clustering | Local density of holds | mean_neighbors_12in |
| Normalization | Relative board positioning | mean_y_normalized |
| Distribution | Vertical distribution of holds | y_q25, y_q75 |
| Category | Description | Examples |
| ------------- | ---------------------------------------- | ----------------------------------------- |
| Geometry | Shape and size of climb | bbox_area, range_x, range_y |
| Movement | Reach and movement structure | mean_hand_reach, path_efficiency |
| Density | Hold spacing and compactness | hold_density, holds_per_vertical_foot |
| Symmetry | Left/right balance | symmetry_score, left_ratio |
| Path | Approximate movement trajectory | path_length_vertical |
| Position | Relative board positioning | mean_y_normalized, start_height_normalized|
| Distribution | Vertical distribution of holds | y_q75, y_iqr |
| Interaction | Nonlinear feature combinations | angle_squared, angle_x_holds |
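A few of the geometry features above can be sketched directly from hold coordinates. This is a toy version under assumed definitions; the names match the table, but the exact formulas live in the notebooks and may differ.

```python
import numpy as np

def geometry_features(xs, ys):
    """Toy versions of a few geometric climb features.

    xs, ys: hold coordinates for one climb (e.g., in inches).
    """
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    range_x = xs.max() - xs.min()          # horizontal extent
    range_y = ys.max() - ys.min()          # vertical extent
    bbox_area = range_x * range_y          # bounding-box area
    return {
        "range_x": range_x,
        "range_y": range_y,
        "bbox_area": bbox_area,
        # holds per unit of bounding-box area
        "hold_density": len(xs) / bbox_area if bbox_area > 0 else 0.0,
    }

print(geometry_features([0, 4, 8], [0, 6, 12]))
```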
### Important design decision
The dataset is restricted to:
> **climbs with angle ≤ 50°**
> **climbs with angle ≤ 55°**
to reduce variability and improve consistency. (see [Angle vs Difficulty](#3-angle-vs-difficulty), where average climb grade seems to stabilize or get lower over 50°)
### Important: Leakage and Feature Design
Earlier iterations of this project included features derived from hold difficulty scores (computed from climb grades). While these features slightly improved predictive performance, they introduce a form of **target leakage** if computed globally.
In this version of the project:
* Hold difficulty scores are still computed in Notebook 03 for **exploratory analysis**
* Predictive models (Notebooks 04–06) use only **leakage-free features**
* No feature is derived from the target variable (`display_difficulty`)
This allows the model to learn from the **structure of climbs themselves**, rather than from aggregated statistics of the labels.
Note: Hold-difficulty-based features can still be valid in a production setting if computed strictly from historical (training) data, similar to target encoding techniques.
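The target-encoding analogy can be made concrete: statistics derived from the label are fit on the training split only, and unseen categories fall back to a global mean. This is a minimal sketch with made-up data, not the project's pipeline.

```python
import pandas as pd

# Hypothetical ascent data: each row is (hold_id, grade).
df = pd.DataFrame({
    "hold_id": [1, 1, 2, 2, 3],
    "grade":   [4.0, 6.0, 7.0, 9.0, 5.0],
})
train = df.iloc[:4]   # fit label statistics on training rows only
test = df.iloc[4:]

global_mean = train["grade"].mean()
hold_means = train.groupby("hold_id")["grade"].mean()

# Holds unseen in training (like id 3 here) fall back to the
# global mean, so no test-set labels leak into the encoding.
encoded = test["hold_id"].map(hold_means).fillna(global_mean)
print(encoded.tolist())
```

Computing `hold_means` over the full dataset instead of `train` is precisely the leakage the README describes.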
---
## 8. Feature Relationships
@@ -226,10 +248,10 @@ Here are some relationships between features and difficulty
![Correlation Heatmap](images/04_climb_features/feature_correlations.png)
* higher angles allow for harder difficulties
* hold difficulty features seem to correlate the most to difficulty
* engineered features capture non-trivial structure
* distance between holds seems to correlate with difficulty
* geometric and structural features capture non-trivial climbing patterns
We have a full feature list in [`data/04_climb_features/feature_list.txt`](data/04_climb_features/feature_list.txt). Explanations are available in [`data/04_climb_features/feature_list_explanations.txt`](data/04_climb_features/feature_explanations.txt).
We have a full feature list in [`data/04_climb_features/feature_list.txt`](data/04_climb_features/feature_list.txt). Explanations are available in [`data/04_climb_features/feature_explanations.txt`](data/04_climb_features/feature_explanations.txt).
---
@@ -253,22 +275,28 @@ Models tested:
Key drivers:
* hold difficulty
* wall angle
* structural features
* reach-based features (e.g., mean/max hand reach)
* spatial density and distribution
* geometric structure of the climb
This confirms that **difficulty is strongly tied to spatial arrangement and movement constraints**, rather than just individual hold properties.
---
## 10. Model Performance
![RF redicted vs Actual](images/05_predictive_modelling/random_forest_predictions.png)
![RF Predicted vs Actual](images/05_predictive_modelling/random_forest_predictions.png)
![NN Predicted vs Actual](images/06_deep_learning/neural_network_predictions.png)
### Results (in terms of difficulty score)
### Results (in terms of V-grade)
Both the RF and NN models performed similarly.
* **~83% within ±1 V-grade (~45% within ±1 difficulty score)**
* **~96% within ±2 V-grade (~80% within ±2 difficulty scores)**
* **~70% within ±1 V-grade (~36% within ±1 difficulty score)**
* **~90% within ±2 V-grade (~65% within ±2 difficulty scores)**
In earlier experiments, we achieved ~83% within ±1 V-grade and ~96% within ±2. However, that setup used hold difficulties from Notebook 03, which are derived from climb grades, creating leakage. This result is more realistic and more independent: the model relies purely on spatial and structural information, without access to hold-based information or beta.
This demonstrates that a substantial portion of climbing difficulty can be attributed to geometry and movement constraints.
### Interpretation
@@ -285,15 +313,15 @@ Both the RF and NN models performed similarly.
| Metric | Performance |
| ------------------ | ----------- |
| Within ±1 V-grade | ~83% |
| Within ±2 V-grades | ~96% |
| Within ±1 V-grade | ~70% |
| Within ±2 V-grades | ~90% |
The model can still predict subgrades (e.g., V3 contains 6a and 6a+), but it is not as accurate.
| Metric | Performance |
| ------------------ | ----------- |
| Within ±1 difficulty-grade | ~45% |
| Within ±2 difficulty-grades | ~80% |
| Within ±1 difficulty-grade | ~36% |
| Within ±2 difficulty-grades | ~65% |
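The "within ±k" metrics above can be computed with a small helper. This sketch assumes predictions are rounded to the nearest integer score before comparison, which is one reasonable choice, not necessarily the notebooks' exact convention.

```python
import numpy as np

def within_k(y_true, y_pred, k=1):
    """Fraction of predictions within +/- k of the true score."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    # Round continuous predictions to integer difficulty scores.
    return float(np.mean(np.abs(np.round(y_pred) - y_true) <= k))

y_true = [10, 12, 15, 18]
y_pred = [10.4, 13.2, 14.9, 20.6]
print(within_k(y_true, y_pred, k=1))  # → 0.75
```

The same function applied to V-grade buckets (after mapping scores to V-grades) yields the grouped metrics reported above.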
---
@@ -308,6 +336,7 @@ The model can still predict subgrades (e.g., V3 contains 6a and 6a+), but it is
# Future Work
* Unified grade prediction across boards
* Combined board analysis
* Test other models
* Better spatial features


@@ -1,4 +1,5 @@
angle
angle_squared
total_holds
hand_holds
foot_holds
@@ -6,7 +7,6 @@ start_holds
finish_holds
middle_holds
is_nomatch
mean_x
mean_y
std_x
std_y
@@ -14,107 +14,36 @@ range_x
range_y
min_y
max_y
start_height
start_height_min
start_height_max
finish_height
finish_height_min
finish_height_max
height_gained
height_gained_start_finish
bbox_area
bbox_aspect_ratio
bbox_normalized_area
hold_density
holds_per_vertical_foot
left_holds
right_holds
left_ratio
symmetry_score
hand_left_ratio
hand_symmetry
upper_holds
lower_holds
upper_ratio
max_hand_reach
min_hand_reach
mean_hand_reach
max_hand_reach
std_hand_reach
hand_spread_x
hand_spread_y
max_foot_spread
mean_foot_spread
foot_spread_x
foot_spread_y
max_hand_to_foot
min_hand_to_foot
mean_hand_to_foot
std_hand_to_foot
mean_hold_difficulty
max_hold_difficulty
min_hold_difficulty
std_hold_difficulty
median_hold_difficulty
difficulty_range
mean_hand_difficulty
max_hand_difficulty
std_hand_difficulty
mean_foot_difficulty
max_foot_difficulty
std_foot_difficulty
start_difficulty
finish_difficulty
hand_foot_ratio
movement_density
hold_com_x
hold_com_y
weighted_difficulty
convex_hull_area
convex_hull_perimeter
hull_area_to_bbox_ratio
min_nn_distance
mean_nn_distance
max_nn_distance
std_nn_distance
mean_neighbors_12in
max_neighbors_12in
clustering_ratio
mean_pairwise_distance
std_pairwise_distance
path_length_vertical
path_efficiency
difficulty_gradient
lower_region_difficulty
middle_region_difficulty
upper_region_difficulty
difficulty_progression
max_difficulty_jump
mean_difficulty_jump
difficulty_weighted_reach
max_weighted_reach
mean_x_normalized
mean_y_normalized
std_x_normalized
std_y_normalized
start_height_normalized
finish_height_normalized
start_offset_from_typical
finish_offset_from_typical
mean_y_relative_to_start
max_y_relative_to_start
spread_x_normalized
spread_y_normalized
bbox_coverage_x
bbox_coverage_y
y_q25
y_q50
y_q75
y_iqr
holds_bottom_quartile
holds_top_quartile
complexity_score
display_difficulty
angle_x_holds
angle_x_difficulty
angle_squared
difficulty_x_height
difficulty_x_density
complexity_score
hull_area_x_difficulty


@@ -0,0 +1,35 @@
### Model Performance Summary
| Model | MAE | RMSE | R² | Within ±1 | Within ±2 | Exact V | Within ±1 V |
|-------|-----|------|----|-----------|-----------|---------|-------------|
| Linear Regression | 2.088 | 2.670 | 0.560 | 30.1% | 55.9% | 25.9% | 64.8% |
| Ridge Regression | 2.088 | 2.670 | 0.560 | 30.0% | 55.9% | 25.9% | 64.8% |
| Lasso Regression | 2.089 | 2.672 | 0.559 | 29.9% | 55.9% | 25.9% | 64.8% |
| Random Forest (Tuned) | 1.846 | 2.375 | 0.652 | 34.8% | 62.4% | 29.6% | 69.7% |
### Key Findings
1. **Tree-based models remain strongest on this structured feature set.**
- Random Forest (Tuned) achieves the best overall balance of MAE, RMSE, and grouped V-grade performance.
- Linear models remain useful baselines but leave clear nonlinear signal unexplained.
2. **Fine-grained difficulty prediction is meaningfully harder than grouped grade prediction.**
- On the held-out test set, the best model is within ±1 fine-grained difficulty score 34.8% of the time.
- The same model is within ±1 grouped V-grade 69.7% of the time.
3. **This gap is expected and informative.**
- Small numeric errors often stay inside the same or adjacent V-grade buckets.
- The model captures broad difficulty bands more reliably than exact score distinctions.
4. **The project's main predictive takeaway is practical rather than perfect.**
- The models are not exact grade replicators.
- They are reasonably strong at placing climbs into the correct neighborhood of difficulty.
### Portfolio Interpretation
From a modelling perspective, this project shows:
- feature engineering grounded in domain structure,
- comparison of linear and nonlinear models,
- honest evaluation on a held-out test set,
- and the ability to translate raw regression performance into climbing-relevant grouped V-grade metrics.

models/feature_names.txt (new file)

@@ -0,0 +1,48 @@
angle
angle_squared
total_holds
hand_holds
foot_holds
start_holds
finish_holds
middle_holds
is_nomatch
mean_y
std_x
std_y
range_x
range_y
min_y
max_y
height_gained
height_gained_start_finish
bbox_area
hold_density
holds_per_vertical_foot
left_ratio
symmetry_score
upper_ratio
mean_hand_reach
max_hand_reach
std_hand_reach
hand_spread_x
hand_spread_y
min_hand_to_foot
mean_hand_to_foot
std_hand_to_foot
convex_hull_area
hull_area_to_bbox_ratio
mean_pairwise_distance
std_pairwise_distance
path_length_vertical
path_efficiency
mean_y_normalized
start_height_normalized
finish_height_normalized
mean_y_relative_to_start
spread_x_normalized
spread_y_normalized
y_q75
y_iqr
complexity_score
angle_x_holds

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long


@@ -19,8 +19,11 @@
"2. **Route structure** \n",
" Examples: number of holds, spatial spread, height gained, move lengths, left/right balance, and other frame-derived quantities.\n",
"\n",
"3. **Hold difficulty priors** \n",
" Examples: average, maximum, and distributional summaries of the empirical hold scores built in notebook 03.\n",
"When this was initially done, we added:\n",
"\n",
"3. **Hold difficulty priors** \n",
"\n",
"However, that makes it quite circular -- we'd be using the difficulty data to create difficulty scores to then predict difficulty data. The difficulty is already baked in there, so it is not a very good independent model. Heuristically, I don't think this is a big deal if we **just** want to predict V-grades, but we'll leave it out of our analysis in order to see what sorts of features actually help determine the difficulty of a climb.\n",
"\n",
"## Output\n",
"\n",
@@ -111,8 +114,8 @@
"Query our data from the DB\n",
"==================================\n",
"\n",
"We restrict to `layout_id=1` for the Kilter Board Original\n",
"\n",
"We restrict to `layout_id=1` for the Kilter Board Original.\n",
"Again, we set the date to be past 2016 for simplicity (dates start in 2018, with the exception of one in 2006).\n",
"\"\"\"\n",
"\n",
"# Query climbs data\n",
@@ -261,7 +264,7 @@
"\n",
"\n",
"# Test\n",
"test_frames = \"p1r5p2r6p3r8p4r5\"\n",
"test_frames = \"p1r12p2r13p3r14p4r15\"\n",
"parsed = parse_frames(test_frames)\n",
"print(f\"Test parse: {parsed}\")"
]
@@ -283,564 +286,212 @@
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"==================================\n",
"Feature Extraction Function\n",
"==================================\n",
"\"\"\"\n",
"\n",
"def extract_features(row, placement_coords, df_hold_difficulty):\n",
"def extract_features(row, placement_coords):\n",
" \"\"\"\n",
" Extract all features from a single climb row.\n",
" Extract a trimmed set of clean geometric/spatial features.\n",
" No hold-difficulty-derived features are used.\n",
" \"\"\"\n",
" features = {}\n",
" \n",
" # Parse frames\n",
"\n",
" holds = parse_frames(row['frames'])\n",
" angle = row['angle']\n",
" \n",
"\n",
" if not holds:\n",
" return None\n",
" \n",
" # =====================\n",
" # BASIC HOLD EXTRACTION\n",
" # =====================\n",
" \n",
"\n",
" hold_data = []\n",
" for placement_id, role_id in holds:\n",
" coords = placement_coords.get(placement_id, (None, None))\n",
" if coords[0] is None:\n",
" continue\n",
" \n",
"\n",
" role_type = get_role_type(role_id)\n",
" is_hand = role_id in HAND_ROLE_IDS\n",
" is_foot = role_id in FOOT_ROLE_IDS\n",
" \n",
" # Get difficulty scores for this hold at this angle\n",
" diff_key = f\"{role_type}_diff_{int(angle)}deg\"\n",
" hand_diff_key = f\"hand_diff_{int(angle)}deg\"\n",
" foot_diff_key = f\"foot_diff_{int(angle)}deg\"\n",
" \n",
" difficulty = None\n",
" if placement_id in df_hold_difficulty.index:\n",
" # Try role-specific first, then aggregate\n",
" if diff_key in df_hold_difficulty.columns:\n",
" difficulty = df_hold_difficulty.loc[placement_id, diff_key]\n",
" if pd.isna(difficulty):\n",
" if is_hand and hand_diff_key in df_hold_difficulty.columns:\n",
" difficulty = df_hold_difficulty.loc[placement_id, hand_diff_key]\n",
" elif is_foot and foot_diff_key in df_hold_difficulty.columns:\n",
" difficulty = df_hold_difficulty.loc[placement_id, foot_diff_key]\n",
" \n",
" # Fallback to overall\n",
" if pd.isna(difficulty) and 'overall_difficulty' in df_hold_difficulty.columns:\n",
" difficulty = df_hold_difficulty.loc[placement_id, 'overall_difficulty']\n",
" \n",
"\n",
" hold_data.append({\n",
" 'placement_id': placement_id,\n",
" 'x': coords[0],\n",
" 'y': coords[1],\n",
" 'role_id': role_id,\n",
" 'role_type': role_type,\n",
" 'is_hand': is_hand,\n",
" 'is_foot': is_foot,\n",
" 'difficulty': difficulty\n",
" })\n",
" \n",
"\n",
" if not hold_data:\n",
" return None\n",
" \n",
"\n",
" df_holds = pd.DataFrame(hold_data)\n",
" \n",
" # Separate by role\n",
"\n",
" hand_holds = df_holds[df_holds['is_hand']]\n",
" foot_holds = df_holds[df_holds['is_foot']]\n",
" start_holds = df_holds[df_holds['role_type'] == 'start']\n",
" finish_holds = df_holds[df_holds['role_type'] == 'finish']\n",
" middle_holds = df_holds[df_holds['role_type'] == 'middle']\n",
" \n",
" # =====================\n",
" # 1. ANGLE\n",
" # =====================\n",
"\n",
" xs = df_holds['x'].to_numpy()\n",
" ys = df_holds['y'].to_numpy()\n",
"\n",
" description = row.get('description', '')\n",
" if pd.isna(description):\n",
" description = ''\n",
"\n",
" center_x = (x_min + x_max) / 2\n",
"\n",
" # Basic\n",
" features['angle'] = angle\n",
" \n",
" # =====================\n",
" # 2. BASIC COUNTS\n",
" # =====================\n",
" features['angle_squared'] = angle ** 2\n",
"\n",
" features['total_holds'] = len(df_holds)\n",
" features['hand_holds'] = len(hand_holds)\n",
" features['foot_holds'] = len(foot_holds)\n",
" features['start_holds'] = len(start_holds)\n",
" features['finish_holds'] = len(finish_holds)\n",
" features['middle_holds'] = len(middle_holds)\n",
" \n",
" # =====================\n",
" # 3. MATCHING FEATURE\n",
" # =====================\n",
" # A climb is \"matching\" if you are allowed to match your hands at any hold.\n",
"    # There are slight differences in difficulty between matching and no-matching climbs, as per our analysis in 01.\n",
" features['is_nomatch'] = int((row['is_nomatch'] == 1) or bool(re.search(r'\\bno\\s*match(ing)?\\b', row['description'], flags=re.IGNORECASE)))\n",
" \n",
" # =====================\n",
" # 4. SPATIAL/POSITION\n",
" # =====================\n",
" xs = df_holds['x'].values\n",
" ys = df_holds['y'].values\n",
" \n",
" features['mean_x'] = np.mean(xs)\n",
"\n",
" features['is_nomatch'] = int(\n",
" (row['is_nomatch'] == 1) or\n",
" bool(re.search(r'\\bno\\s*match(ing)?\\b', description, flags=re.IGNORECASE))\n",
" )\n",
"\n",
" # Spatial\n",
" features['mean_y'] = np.mean(ys)\n",
" features['std_x'] = np.std(xs) if len(xs) > 1 else 0\n",
" features['std_y'] = np.std(ys) if len(ys) > 1 else 0\n",
" features['std_x'] = np.std(xs) if len(xs) > 1 else 0.0\n",
" features['std_y'] = np.std(ys) if len(ys) > 1 else 0.0\n",
" features['range_x'] = np.max(xs) - np.min(xs)\n",
" features['range_y'] = np.max(ys) - np.min(ys)\n",
" features['min_y'] = np.min(ys)\n",
" features['max_y'] = np.max(ys)\n",
" \n",
" # =====================\n",
" # 5. HEIGHT FEATURES\n",
" # =====================\n",
" if len(start_holds) > 0:\n",
" features['start_height'] = start_holds['y'].mean()\n",
" features['start_height_min'] = start_holds['y'].min()\n",
" features['start_height_max'] = start_holds['y'].max()\n",
" else:\n",
" features['start_height'] = np.nan\n",
" features['start_height_min'] = np.nan\n",
" features['start_height_max'] = np.nan\n",
" \n",
" if len(finish_holds) > 0:\n",
" features['finish_height'] = finish_holds['y'].mean()\n",
" features['finish_height_min'] = finish_holds['y'].min()\n",
" features['finish_height_max'] = finish_holds['y'].max()\n",
" else:\n",
" features['finish_height'] = np.nan\n",
" features['finish_height_min'] = np.nan\n",
" features['finish_height_max'] = np.nan\n",
" \n",
" features['height_gained'] = features['max_y'] - features['min_y']\n",
" \n",
" if pd.notna(features.get('finish_height')) and pd.notna(features.get('start_height')):\n",
" features['height_gained_start_finish'] = features['finish_height'] - features['start_height']\n",
" else:\n",
" features['height_gained_start_finish'] = np.nan\n",
" \n",
" # =====================\n",
" # 6. BBOX FEATURES\n",
" # =====================\n",
" bbox_width = features['range_x']\n",
" bbox_height = features['range_y']\n",
" features['bbox_area'] = bbox_width * bbox_height\n",
" features['bbox_aspect_ratio'] = bbox_width / bbox_height if bbox_height > 0 else 0\n",
" features['bbox_normalized_area'] = features['bbox_area'] / (board_width * board_height)\n",
" \n",
" # =====================\n",
" # 7. HOLD DENSITY\n",
" # =====================\n",
" if features['bbox_area'] > 0:\n",
" features['hold_density'] = features['total_holds'] / features['bbox_area']\n",
" else:\n",
" features['hold_density'] = 0\n",
" \n",
"\n",
" # Start / finish heights\n",
" start_height = start_holds['y'].mean() if len(start_holds) > 0 else np.nan\n",
" finish_height = finish_holds['y'].mean() if len(finish_holds) > 0 else np.nan\n",
"\n",
" features['height_gained_start_finish'] = (\n",
" finish_height - start_height\n",
" if pd.notna(start_height) and pd.notna(finish_height)\n",
" else np.nan\n",
" )\n",
"\n",
" # Density / symmetry\n",
" bbox_area = features['range_x'] * features['range_y']\n",
" features['bbox_area'] = bbox_area\n",
" features['hold_density'] = features['total_holds'] / bbox_area if bbox_area > 0 else 0.0\n",
" features['holds_per_vertical_foot'] = features['total_holds'] / max(features['range_y'], 1)\n",
" \n",
" # =====================\n",
" # 8. SYMMETRY/BALANCE\n",
" # =====================\n",
" center_x = (x_min + x_max) / 2\n",
" features['left_holds'] = (df_holds['x'] < center_x).sum()\n",
" features['right_holds'] = (df_holds['x'] >= center_x).sum()\n",
" features['left_ratio'] = features['left_holds'] / features['total_holds'] if features['total_holds'] > 0 else 0.5\n",
" \n",
" # Symmetry score (how balanced left/right)\n",
"\n",
" left_holds = (df_holds['x'] < center_x).sum()\n",
" features['left_ratio'] = left_holds / features['total_holds'] if features['total_holds'] > 0 else 0.5\n",
" features['symmetry_score'] = 1 - abs(features['left_ratio'] - 0.5) * 2\n",
" \n",
" # Hand symmetry\n",
" if len(hand_holds) > 0:\n",
" hand_left = (hand_holds['x'] < center_x).sum()\n",
" hand_right = (hand_holds['x'] >= center_x).sum()\n",
" features['hand_left_ratio'] = hand_left / len(hand_holds)\n",
" features['hand_symmetry'] = 1 - abs(features['hand_left_ratio'] - 0.5) * 2\n",
" else:\n",
" features['hand_left_ratio'] = np.nan\n",
" features['hand_symmetry'] = np.nan\n",
" \n",
" # =====================\n",
" # 9. VERTICAL DISTRIBUTION\n",
" # =====================\n",
"\n",
" y_median = np.median(ys)\n",
" features['upper_holds'] = (df_holds['y'] > y_median).sum()\n",
" features['lower_holds'] = (df_holds['y'] <= y_median).sum()\n",
" features['upper_ratio'] = features['upper_holds'] / features['total_holds']\n",
" \n",
" # =====================\n",
" # 10. HAND REACH / SPREAD\n",
" # =====================\n",
" upper_holds = (df_holds['y'] > y_median).sum()\n",
" features['upper_ratio'] = upper_holds / features['total_holds']\n",
"\n",
" # Hand reach\n",
" if len(hand_holds) >= 2:\n",
" hand_xs = hand_holds['x'].values\n",
" hand_ys = hand_holds['y'].values\n",
" \n",
" hand_distances = []\n",
" for i in range(len(hand_holds)):\n",
" for j in range(i + 1, len(hand_holds)):\n",
" dx = hand_xs[i] - hand_xs[j]\n",
" dy = hand_ys[i] - hand_ys[j]\n",
" hand_distances.append(np.sqrt(dx**2 + dy**2))\n",
" \n",
" features['max_hand_reach'] = max(hand_distances)\n",
" features['min_hand_reach'] = min(hand_distances)\n",
" features['mean_hand_reach'] = np.mean(hand_distances)\n",
" features['std_hand_reach'] = np.std(hand_distances)\n",
" features['hand_spread_x'] = hand_xs.max() - hand_xs.min()\n",
" features['hand_spread_y'] = hand_ys.max() - hand_ys.min()\n",
" hand_points = hand_holds[['x', 'y']].to_numpy()\n",
" hand_distances = pdist(hand_points)\n",
"\n",
" hand_xs = hand_holds['x'].to_numpy()\n",
" hand_ys = hand_holds['y'].to_numpy()\n",
"\n",
" features['mean_hand_reach'] = float(np.mean(hand_distances))\n",
" features['max_hand_reach'] = float(np.max(hand_distances))\n",
" features['std_hand_reach'] = float(np.std(hand_distances))\n",
" features['hand_spread_x'] = float(hand_xs.max() - hand_xs.min())\n",
" features['hand_spread_y'] = float(hand_ys.max() - hand_ys.min())\n",
" else:\n",
" features['max_hand_reach'] = 0\n",
" features['min_hand_reach'] = 0\n",
" features['mean_hand_reach'] = 0\n",
" features['std_hand_reach'] = 0\n",
" features['hand_spread_x'] = 0\n",
" features['hand_spread_y'] = 0\n",
" \n",
" # =====================\n",
" # 11. FOOT SPREAD\n",
" # =====================\n",
" if len(foot_holds) >= 2:\n",
" foot_xs = foot_holds['x'].values\n",
" foot_ys = foot_holds['y'].values\n",
" \n",
" foot_distances = []\n",
" for i in range(len(foot_holds)):\n",
" for j in range(i + 1, len(foot_holds)):\n",
" dx = foot_xs[i] - foot_xs[j]\n",
" dy = foot_ys[i] - foot_ys[j]\n",
" foot_distances.append(np.sqrt(dx**2 + dy**2))\n",
" \n",
" features['max_foot_spread'] = max(foot_distances)\n",
" features['mean_foot_spread'] = np.mean(foot_distances)\n",
" features['foot_spread_x'] = foot_xs.max() - foot_xs.min()\n",
" features['foot_spread_y'] = foot_ys.max() - foot_ys.min()\n",
" else:\n",
" features['max_foot_spread'] = 0\n",
" features['mean_foot_spread'] = 0\n",
" features['foot_spread_x'] = 0\n",
" features['foot_spread_y'] = 0\n",
" \n",
" # =====================\n",
" # 12. HAND-TO-FOOT DISTANCES\n",
" # =====================\n",
" if len(hand_holds) > 0 and len(foot_holds) > 0:\n",
" hand_points = hand_holds[['x', 'y']].to_numpy()\n",
" foot_points = foot_holds[['x', 'y']].to_numpy()\n",
"\n",
" dists = []\n",
" for hx, hy in hand_points:\n",
" for fx, fy in foot_points:\n",
" dists.append(np.sqrt((hx - fx)**2 + (hy - fy)**2))\n",
" dists = np.asarray(dists)\n",
"\n",
" features['min_hand_to_foot'] = float(np.min(dists))\n",
" features['mean_hand_to_foot'] = float(np.mean(dists))\n",
" features['std_hand_to_foot'] = float(np.std(dists))\n",
" else:\n",
" features['min_hand_to_foot'] = 0.0\n",
" features['mean_hand_to_foot'] = 0.0\n",
" features['std_hand_to_foot'] = 0.0\n",
" \n",
" # =====================================================\n",
" # 13. GEOMETRIC FEATURES\n",
" # =====================================================\n",
"\n",
" # Global geometry\n",
" points = np.column_stack([xs, ys])\n",
"\n",
" # Convex hull (2D polygon enclosing all holds)\n",
" if len(df_holds) >= 3:\n",
" try:\n",
" hull = ConvexHull(points)\n",
" features['convex_hull_area'] = float(hull.volume) # In 2D, volume = area\n",
" features['hull_area_to_bbox_ratio'] = features['convex_hull_area'] / max(bbox_area, 1)\n",
" except Exception:\n",
" features['convex_hull_area'] = np.nan\n",
" features['hull_area_to_bbox_ratio'] = np.nan\n",
" else:\n",
" features['convex_hull_area'] = 0.0\n",
" features['hull_area_to_bbox_ratio'] = 0.0\n",
"\n",
" # Pairwise hold distances\n",
" if len(df_holds) >= 2:\n",
" pairwise = pdist(points)\n",
" features['mean_pairwise_distance'] = float(np.mean(pairwise))\n",
" features['std_pairwise_distance'] = float(np.std(pairwise))\n",
" else:\n",
" features['mean_pairwise_distance'] = 0.0\n",
" features['std_pairwise_distance'] = 0.0\n",
"\n",
" # Path length (bottom-to-top tour through the holds)\n",
" if len(df_holds) >= 2:\n",
" sorted_idx = np.argsort(ys)\n",
" sorted_points = points[sorted_idx]\n",
"\n",
" path_length = 0.0\n",
" for i in range(len(sorted_points) - 1):\n",
" dx = sorted_points[i + 1, 0] - sorted_points[i, 0]\n",
" dy = sorted_points[i + 1, 1] - sorted_points[i, 1]\n",
" path_length += np.sqrt(dx**2 + dy**2)\n",
"\n",
" features['path_length_vertical'] = path_length\n",
" features['path_efficiency'] = features['height_gained'] / max(path_length, 1)\n",
" else:\n",
" features['path_length_vertical'] = 0.0\n",
" features['path_efficiency'] = 0.0\n",
"\n",
" # =====================================================\n",
" # 14. POSITION-NORMALIZED FEATURES\n",
" # =====================================================\n",
"\n",
" # Normalized / relative\n",
" features['mean_y_normalized'] = (features['mean_y'] - y_min) / board_height\n",
" features['std_x_normalized'] = features['std_x'] / board_width\n",
" features['std_y_normalized'] = features['std_y'] / board_height\n",
" \n",
" # Start/finish normalized\n",
" features['start_height_normalized'] = (\n",
" (start_height - y_min) / board_height if pd.notna(start_height) else np.nan\n",
" )\n",
" features['finish_height_normalized'] = (\n",
" (finish_height - y_min) / board_height if pd.notna(finish_height) else np.nan\n",
" )\n",
" features['mean_y_relative_to_start'] = (\n",
" features['mean_y'] - start_height if pd.notna(start_height) else np.nan\n",
" )\n",
"\n",
" # Spread normalized by board\n",
" features['spread_x_normalized'] = features['range_x'] / board_width\n",
" features['spread_y_normalized'] = features['range_y'] / board_height\n",
" \n",
" # Position quartile features\n",
" y_q75 = np.percentile(ys, 75)\n",
" y_q25 = np.percentile(ys, 25)\n",
" features['y_q75'] = y_q75\n",
" features['y_iqr'] = y_q75 - y_q25\n",
"\n",
" # Optional engineered clean feature\n",
" features['complexity_score'] = (\n",
" features['mean_hand_reach']\n",
" * np.log1p(features['total_holds'])\n",
" * (1 + features['hold_density'])\n",
" )\n",
"\n",
" return features"
]
},
@@ -851,7 +502,7 @@
"source": [
"## Sanity Check on One Example\n",
"\n",
"Before extracting features for the entire dataset, we inspect one representative climb to confirm that the parsing logic and the computed geometric summaries behave as expected. Let's do the climb \"Anna Got Me Clickin'\" from notebook two.\n",
"\n",
"![Anna Got Me Clickin](../images/02_hold_stats/Anna_Got_Me_Clickin.png)\n"
]
@@ -863,7 +514,7 @@
"metadata": {},
"outputs": [],
"source": [
"extract_features(df_climbs.iloc[10000], placement_coords)"
]
},
{
@@ -902,7 +553,7 @@
"feature_list = []\n",
"\n",
"for idx, row in tqdm(df_climbs.iterrows(), total=len(df_climbs)):\n",
" features = extract_features(row, placement_coords)\n",
" if features:\n",
" features['climb_uuid'] = row['uuid']\n",
" features['display_difficulty'] = row['display_difficulty']\n",
@@ -997,22 +648,37 @@
"fig, axes = plt.subplots(4, 4, figsize=(16, 16))\n",
"\n",
"key_features = [\n",
" # Core driver\n",
" 'angle',\n",
"\n",
" # Basic structure\n",
" 'total_holds',\n",
" 'height_gained',\n",
"\n",
" # Density / compactness\n",
" 'hold_density',\n",
" 'holds_per_vertical_foot',\n",
"\n",
" # Hand geometry (very important)\n",
" 'mean_hand_reach',\n",
" 'max_hand_reach',\n",
" 'std_hand_reach',\n",
"\n",
" # Hand-foot interaction\n",
" 'mean_hand_to_foot',\n",
" 'std_hand_to_foot',\n",
"\n",
" # Spatial layout\n",
" 'symmetry_score',\n",
" 'is_nomatch',\n",
" 'upper_ratio',\n",
"\n",
" # Global geometry\n",
" 'convex_hull_area',\n",
" 'hull_area_to_bbox_ratio',\n",
"\n",
" # Path / flow\n",
" 'path_length_vertical',\n",
" 'path_efficiency'\n",
"]\n",
"\n",
"for ax, feature in zip(axes.flat, key_features):\n",
@@ -1042,12 +708,8 @@
"\n",
"# Angle interactions\n",
"df_features['angle_x_holds'] = df_features['angle'] * df_features['total_holds']\n",
"df_features['angle_squared'] = df_features['angle'] ** 2\n",
"\n",
"# Complexity features\n",
"df_features['complexity_score'] = (\n",
@@ -1056,9 +718,6 @@
" df_features['hold_density']\n",
")\n",
"\n",
"print(f\"Added interaction features. Total columns: {len(df_features.columns)}\")"
]
},
@@ -1081,23 +740,6 @@
"print(\"### Columns with Missing Values\\n\")\n",
"display(missing_cols.to_frame('missing'))\n",
"\n",
"# Fill other NaNs with column means\n",
"for col in df_features.columns:\n",
@@ -1206,8 +848,7 @@
"print(\"\"\"\\nInterpretation:\n",
"- Each row is a climb-angle observation.\n",
"- The target is `display_difficulty`.\n",
"- The predictors combine geometry and structure.\n",
"- The next notebook tests how much predictive signal these engineered features actually contain.\n",
"\"\"\")\n"
]

File diff suppressed because it is too large

scripts/predict.py (new file, 700 lines)

@@ -0,0 +1,700 @@
import re
from pathlib import Path
import joblib
import numpy as np
import pandas as pd
from scipy.spatial import ConvexHull
from scipy.spatial.distance import pdist, squareform
try:
import torch
import torch.nn as nn
TORCH_AVAILABLE = True
except ImportError:
TORCH_AVAILABLE = False
# ============================================================
# Paths
# ============================================================
ROOT = Path(__file__).resolve().parents[1]
SCALER_PATH = ROOT / "models" / "feature_scaler.pkl"
FEATURE_NAMES_PATH = ROOT / "models" / "feature_names.txt"
PLACEMENTS_PATH = ROOT / "data" / "placements.csv" # adjust if needed
# ============================================================
# Model registry
# ============================================================
MODEL_REGISTRY = {
"linear": {
"path": ROOT / "models" / "linear_regression.pkl",
"kind": "sklearn",
"needs_scaling": True,
},
"ridge": {
"path": ROOT / "models" / "ridge_regression.pkl",
"kind": "sklearn",
"needs_scaling": True,
},
"lasso": {
"path": ROOT / "models" / "lasso_regression.pkl",
"kind": "sklearn",
"needs_scaling": True,
},
"random_forest": {
"path": ROOT / "models" / "random_forest_tuned.pkl",
"kind": "sklearn",
"needs_scaling": False,
},
"nn_best": {
"path": ROOT / "models" / "neural_network_best.pth",
"kind": "torch_checkpoint",
"needs_scaling": True,
},
}
DEFAULT_MODEL = "random_forest"
# ============================================================
# Board constants
# Adjust if your board coordinate system differs
# ============================================================
x_min, x_max = -24, 168
y_min, y_max = 0, 156
board_width = x_max - x_min
board_height = y_max - y_min
# ============================================================
# Role mappings
# ============================================================
HAND_ROLE_IDS = {12, 13, 14}
FOOT_ROLE_IDS = {15}
def get_role_type(role_id: int) -> str:
mapping = {
12: "start",
13: "middle",
14: "finish",
15: "foot",
}
return mapping.get(role_id, "middle")
# ============================================================
# Grade map
# ============================================================
grade_map = {
10: '4a/V0',
11: '4b/V0',
12: '4c/V0',
13: '5a/V1',
14: '5b/V1',
15: '5c/V2',
16: '6a/V3',
17: '6a+/V3',
18: '6b/V4',
19: '6b+/V4',
20: '6c/V5',
21: '6c+/V5',
22: '7a/V6',
23: '7a+/V7',
24: '7b/V8',
25: '7b+/V8',
26: '7c/V9',
27: '7c+/V10',
28: '8a/V11',
29: '8a+/V12',
30: '8b/V13',
31: '8b+/V14',
32: '8c/V15',
33: '8c+/V16'
}
MIN_GRADE = min(grade_map)
MAX_GRADE = max(grade_map)
# ============================================================
# Neural network architecture from Notebook 06
# ============================================================
if TORCH_AVAILABLE:
class ClimbGradePredictor(nn.Module):
def __init__(self, input_dim, hidden_layers=None, dropout_rate=0.2):
super().__init__()
if hidden_layers is None:
hidden_layers = [256, 128, 64]
layers = []
prev_dim = input_dim
for hidden_dim in hidden_layers:
layers.append(nn.Linear(prev_dim, hidden_dim))
layers.append(nn.BatchNorm1d(hidden_dim))
layers.append(nn.ReLU())
layers.append(nn.Dropout(dropout_rate))
prev_dim = hidden_dim
layers.append(nn.Linear(prev_dim, 1))
self.network = nn.Sequential(*layers)
def forward(self, x):
return self.network(x)
# ============================================================
# Load shared artifacts
# ============================================================
scaler = joblib.load(SCALER_PATH)
with open(FEATURE_NAMES_PATH, "r") as f:
FEATURE_NAMES = [line.strip() for line in f if line.strip()]
df_placements = pd.read_csv(PLACEMENTS_PATH)
placement_coords = {
int(row["placement_id"]): (row["x"], row["y"])
for _, row in df_placements.iterrows()
}
# ============================================================
# Model loading
# ============================================================
_MODEL_CACHE = {}
def normalize_model_name(model_name: str) -> str:
if model_name == "nn":
return "nn_best"
return model_name
def load_model(model_name=DEFAULT_MODEL):
model_name = normalize_model_name(model_name)
if model_name not in MODEL_REGISTRY:
raise ValueError(
f"Unknown model '{model_name}'. Choose from: {list(MODEL_REGISTRY.keys()) + ['nn']}"
)
if model_name in _MODEL_CACHE:
return _MODEL_CACHE[model_name]
info = MODEL_REGISTRY[model_name]
path = info["path"]
if info["kind"] == "sklearn":
model = joblib.load(path)
elif info["kind"] == "torch_checkpoint":
if not TORCH_AVAILABLE:
raise ImportError("PyTorch is not installed, so the neural network model cannot be used.")
checkpoint = torch.load(path, map_location="cpu")
if hasattr(checkpoint, "eval"):
model = checkpoint
model.eval()
elif isinstance(checkpoint, dict):
input_dim = checkpoint.get("input_dim", len(FEATURE_NAMES))
hidden_layers = checkpoint.get("hidden_layers", [256, 128, 64])
dropout_rate = checkpoint.get("dropout_rate", 0.2)
model = ClimbGradePredictor(
input_dim=input_dim,
hidden_layers=hidden_layers,
dropout_rate=dropout_rate,
)
if "model_state_dict" in checkpoint:
model.load_state_dict(checkpoint["model_state_dict"])
else:
model.load_state_dict(checkpoint)
model.eval()
else:
raise RuntimeError(
f"Unsupported checkpoint type for {model_name}: {type(checkpoint)}"
)
else:
raise ValueError(f"Unsupported model kind: {info['kind']}")
_MODEL_CACHE[model_name] = model
return model
# ============================================================
# Helpers
# ============================================================
def parse_frames(frames: str):
"""
Parse strings like:
p304r8p378r6p552r6
into:
[(304, 8), (378, 6), (552, 6)]
"""
if not isinstance(frames, str) or not frames.strip():
return []
matches = re.findall(r"p(\d+)r(\d+)", frames)
return [(int(p), int(r)) for p, r in matches]
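As a quick sanity check, the parser can be exercised standalone (same regex as in `parse_frames` above):

```python
import re

def parse_frames(frames):
    # Each hold is encoded as p<placement_id>r<role_id>, concatenated with no separator.
    if not isinstance(frames, str) or not frames.strip():
        return []
    return [(int(p), int(r)) for p, r in re.findall(r"p(\d+)r(\d+)", frames)]

print(parse_frames("p304r8p378r6p552r6"))  # [(304, 8), (378, 6), (552, 6)]
print(parse_frames(""))                    # []
```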
# ============================================================
# Feature extraction
# ============================================================
def extract_features_from_raw(angle, frames, is_nomatch=0, description=""):
"""
Extract the clean, leakage-free feature set used by the updated models.
"""
holds = parse_frames(frames)
if not holds:
raise ValueError("Could not parse any holds from frames.")
hold_data = []
for placement_id, role_id in holds:
coords = placement_coords.get(placement_id, (None, None))
if coords[0] is None:
continue
role_type = get_role_type(role_id)
is_hand_role = role_id in HAND_ROLE_IDS
is_foot_role = role_id in FOOT_ROLE_IDS
hold_data.append({
"placement_id": placement_id,
"x": coords[0],
"y": coords[1],
"role_type": role_type,
"is_hand": is_hand_role,
"is_foot": is_foot_role,
})
if not hold_data:
raise ValueError("No valid holds found after parsing frames.")
df_holds = pd.DataFrame(hold_data)
hand_holds = df_holds[df_holds["is_hand"]]
foot_holds = df_holds[df_holds["is_foot"]]
start_holds = df_holds[df_holds["role_type"] == "start"]
finish_holds = df_holds[df_holds["role_type"] == "finish"]
middle_holds = df_holds[df_holds["role_type"] == "middle"]
xs = df_holds["x"].to_numpy()
ys = df_holds["y"].to_numpy()
desc = "" if pd.isna(description) else str(description)

center_x = (x_min + x_max) / 2
features = {}
# Core / counts
features["angle"] = float(angle)
features["angle_squared"] = float(angle) ** 2
features["total_holds"] = int(len(df_holds))
features["hand_holds"] = int(len(hand_holds))
features["foot_holds"] = int(len(foot_holds))
features["start_holds"] = int(len(start_holds))
features["finish_holds"] = int(len(finish_holds))
features["middle_holds"] = int(len(middle_holds))
features["is_nomatch"] = int(
(is_nomatch == 1) or
bool(re.search(r"\bno\s*match(ing)?\b", desc, flags=re.IGNORECASE))
)
# Spatial
features["mean_y"] = float(np.mean(ys))
features["std_x"] = float(np.std(xs)) if len(xs) > 1 else 0.0
features["std_y"] = float(np.std(ys)) if len(ys) > 1 else 0.0
features["range_x"] = float(np.max(xs) - np.min(xs))
features["range_y"] = float(np.max(ys) - np.min(ys))
features["min_y"] = float(np.min(ys))
features["max_y"] = float(np.max(ys))
features["height_gained"] = features["max_y"] - features["min_y"]
start_height = float(start_holds["y"].mean()) if len(start_holds) > 0 else np.nan
finish_height = float(finish_holds["y"].mean()) if len(finish_holds) > 0 else np.nan
features["height_gained_start_finish"] = (
finish_height - start_height
if pd.notna(start_height) and pd.notna(finish_height)
else np.nan
)
# Density / symmetry
bbox_area = features["range_x"] * features["range_y"]
features["bbox_area"] = float(bbox_area)
features["hold_density"] = float(features["total_holds"] / bbox_area) if bbox_area > 0 else 0.0
features["holds_per_vertical_foot"] = float(features["total_holds"] / max(features["range_y"], 1))
left_holds = int((df_holds["x"] < center_x).sum())
features["left_ratio"] = left_holds / features["total_holds"] if features["total_holds"] > 0 else 0.5
features["symmetry_score"] = 1 - abs(features["left_ratio"] - 0.5) * 2
y_median = np.median(ys)
upper_holds = int((df_holds["y"] > y_median).sum())
features["upper_ratio"] = upper_holds / features["total_holds"]
# Hand reach
if len(hand_holds) >= 2:
hand_points = hand_holds[["x", "y"]].to_numpy()
hand_distances = pdist(hand_points)
hand_xs = hand_holds["x"].to_numpy()
hand_ys = hand_holds["y"].to_numpy()
features["mean_hand_reach"] = float(np.mean(hand_distances))
features["max_hand_reach"] = float(np.max(hand_distances))
features["std_hand_reach"] = float(np.std(hand_distances))
features["hand_spread_x"] = float(hand_xs.max() - hand_xs.min())
features["hand_spread_y"] = float(hand_ys.max() - hand_ys.min())
else:
features["mean_hand_reach"] = 0.0
features["max_hand_reach"] = 0.0
features["std_hand_reach"] = 0.0
features["hand_spread_x"] = 0.0
features["hand_spread_y"] = 0.0
# Hand-foot distances
if len(hand_holds) > 0 and len(foot_holds) > 0:
hand_points = hand_holds[["x", "y"]].to_numpy()
foot_points = foot_holds[["x", "y"]].to_numpy()
dists = []
for hx, hy in hand_points:
for fx, fy in foot_points:
dists.append(np.sqrt((hx - fx) ** 2 + (hy - fy) ** 2))
dists = np.asarray(dists, dtype=float)
features["min_hand_to_foot"] = float(np.min(dists))
features["mean_hand_to_foot"] = float(np.mean(dists))
features["std_hand_to_foot"] = float(np.std(dists))
else:
features["min_hand_to_foot"] = 0.0
features["mean_hand_to_foot"] = 0.0
features["std_hand_to_foot"] = 0.0
# Global geometry
points = np.column_stack([xs, ys])
if len(df_holds) >= 3:
try:
hull = ConvexHull(points)
features["convex_hull_area"] = float(hull.volume)
features["hull_area_to_bbox_ratio"] = float(features["convex_hull_area"] / max(bbox_area, 1))
except Exception:
features["convex_hull_area"] = np.nan
features["hull_area_to_bbox_ratio"] = np.nan
else:
features["convex_hull_area"] = 0.0
features["hull_area_to_bbox_ratio"] = 0.0
if len(df_holds) >= 2:
pairwise = pdist(points)
features["mean_pairwise_distance"] = float(np.mean(pairwise))
features["std_pairwise_distance"] = float(np.std(pairwise))
else:
features["mean_pairwise_distance"] = 0.0
features["std_pairwise_distance"] = 0.0
if len(df_holds) >= 2:
sorted_idx = np.argsort(ys)
sorted_points = points[sorted_idx]
path_length = 0.0
for i in range(len(sorted_points) - 1):
dx = sorted_points[i + 1, 0] - sorted_points[i, 0]
dy = sorted_points[i + 1, 1] - sorted_points[i, 1]
path_length += np.sqrt(dx ** 2 + dy ** 2)
features["path_length_vertical"] = float(path_length)
features["path_efficiency"] = float(features["height_gained"] / max(path_length, 1))
else:
features["path_length_vertical"] = 0.0
features["path_efficiency"] = 0.0
# Normalized / relative
features["mean_y_normalized"] = float((features["mean_y"] - y_min) / board_height)
features["start_height_normalized"] = float((start_height - y_min) / board_height) if pd.notna(start_height) else np.nan
features["finish_height_normalized"] = float((finish_height - y_min) / board_height) if pd.notna(finish_height) else np.nan
features["mean_y_relative_to_start"] = float(features["mean_y"] - start_height) if pd.notna(start_height) else np.nan
features["spread_x_normalized"] = float(features["range_x"] / board_width)
features["spread_y_normalized"] = float(features["range_y"] / board_height)
y_q75 = np.percentile(ys, 75)
y_q25 = np.percentile(ys, 25)
features["y_q75"] = float(y_q75)
features["y_iqr"] = float(y_q75 - y_q25)
# Engineered clean features
features["complexity_score"] = float(
features["mean_hand_reach"]
* np.log1p(features["total_holds"])
* (1 + features["hold_density"])
)
features["angle_x_holds"] = float(features["angle"] * features["total_holds"])
return features
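# The engineered complexity_score above multiplies mean hand reach by a
# log-damped hold count and a density factor. A tiny worked example with
# made-up numbers (not real board data) illustrates the scale of the result:
def _complexity_score_example():
    # Hypothetical climb: mean_hand_reach=10.0, total_holds=12, hold_density=0.05
    mean_hand_reach = 10.0
    total_holds = 12
    hold_density = 0.05
    # 10.0 * log1p(12) * 1.05 ~= 26.9
    return float(mean_hand_reach * np.log1p(total_holds) * (1 + hold_density))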
# ============================================================
# Model input preparation
# ============================================================
def prepare_feature_vector(features: dict) -> pd.DataFrame:
row = {}
for col in FEATURE_NAMES:
value = features.get(col, 0.0)
row[col] = 0.0 if pd.isna(value) else value
return pd.DataFrame([row], columns=FEATURE_NAMES)
# ============================================================
# Prediction helpers
# ============================================================
def format_prediction(pred: float):
    """Round a raw model output and clamp it to the supported grade range."""
    rounded = int(round(pred))
rounded = max(min(rounded, MAX_GRADE), MIN_GRADE)
return {
"predicted_numeric": float(pred),
"predicted_display_difficulty": rounded,
"predicted_boulder_grade": grade_map[rounded],
}
def predict_with_model(model, X: pd.DataFrame, model_name: str):
    """Run a single-row prediction with a scikit-learn or PyTorch model."""
model_name = normalize_model_name(model_name)
info = MODEL_REGISTRY[model_name]
if info["kind"] == "sklearn":
X_input = scaler.transform(X) if info["needs_scaling"] else X
pred = model.predict(X_input)[0]
return float(pred)
if info["kind"] == "torch_checkpoint":
if not TORCH_AVAILABLE:
raise ImportError("PyTorch is not installed.")
X_input = scaler.transform(X) if info["needs_scaling"] else X
X_tensor = torch.tensor(np.asarray(X_input), dtype=torch.float32)
with torch.no_grad():
out = model(X_tensor)
if isinstance(out, tuple):
out = out[0]
pred = np.asarray(out).reshape(-1)[0]
return float(pred)
raise ValueError(f"Unsupported model kind: {info['kind']}")
# ============================================================
# Public API
# ============================================================
def predict(
angle,
frames,
is_nomatch=0,
description="",
model_name=DEFAULT_MODEL,
return_numeric=False,
debug=False,
):
    """
    Predict the difficulty of a single climb from its board angle and frames
    (hold placement) string.

    Returns a dict with the numeric prediction, clamped display difficulty,
    and boulder grade, or a bare float when return_numeric=True.
    """
model_name = normalize_model_name(model_name)
model = load_model(model_name)
features = extract_features_from_raw(
angle=angle,
frames=frames,
is_nomatch=is_nomatch,
description=description,
)
X = prepare_feature_vector(features)
if debug:
print("\nNonzero / non-null feature values:")
for col, val in X.iloc[0].items():
if pd.notna(val) and val != 0:
print(f"{col}: {val}")
pred = predict_with_model(model, X, model_name=model_name)
if return_numeric:
return float(pred)
result = format_prediction(pred)
result["model"] = model_name
return result
def predict_csv(
input_csv,
output_csv=None,
model_name=DEFAULT_MODEL,
angle_col="angle",
frames_col="frames",
is_nomatch_col="is_nomatch",
description_col="description",
):
"""
Batch prediction over a CSV file.
Required columns:
- angle
- frames
Optional columns:
- is_nomatch
- description
"""
model_name = normalize_model_name(model_name)
df = pd.read_csv(input_csv)
if angle_col not in df.columns:
raise ValueError(f"Missing required column: '{angle_col}'")
if frames_col not in df.columns:
raise ValueError(f"Missing required column: '{frames_col}'")
results = []
for _, row in df.iterrows():
angle = row[angle_col]
frames = row[frames_col]
is_nomatch = row[is_nomatch_col] if is_nomatch_col in df.columns and pd.notna(row[is_nomatch_col]) else 0
description = row[description_col] if description_col in df.columns and pd.notna(row[description_col]) else ""
pred = predict(
angle=angle,
frames=frames,
is_nomatch=is_nomatch,
description=description,
model_name=model_name,
return_numeric=False,
debug=False,
)
results.append(pred)
pred_df = pd.DataFrame(results)
out = pd.concat([df.reset_index(drop=True), pred_df.reset_index(drop=True)], axis=1)
if output_csv is not None:
out.to_csv(output_csv, index=False)
return out
def evaluate_predictions(df, true_col="display_difficulty", pred_col="predicted_numeric"):
"""
Simple evaluation summary for labeled batch predictions.
"""
if true_col not in df.columns:
raise ValueError(f"Missing true target column: '{true_col}'")
if pred_col not in df.columns:
raise ValueError(f"Missing prediction column: '{pred_col}'")
y_true = df[true_col].astype(float)
y_pred = df[pred_col].astype(float)
mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
within_1 = np.mean(np.abs(y_true - y_pred) <= 1)
within_2 = np.mean(np.abs(y_true - y_pred) <= 2)
return {
"mae": float(mae),
"rmse": float(rmse),
"within_1": float(within_1),
"within_2": float(within_2),
}
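# Worked example with made-up numbers (not project data), showing how the
# four metrics above behave on a tiny labeled sample:
def _metrics_walkthrough():
    y_true = np.asarray([10.0, 12.0, 15.0])
    y_pred = np.asarray([10.5, 14.0, 15.0])
    err = np.abs(y_true - y_pred)  # [0.5, 2.0, 0.0]
    return {
        "mae": float(err.mean()),                   # 2.5 / 3 ~= 0.833
        "rmse": float(np.sqrt((err ** 2).mean())),  # sqrt(4.25 / 3) ~= 1.190
        "within_1": float((err <= 1).mean()),       # 2 of 3 errors <= 1 grade
        "within_2": float((err <= 2).mean()),       # all 3 errors <= 2 grades
    }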
# ============================================================
# CLI
# ============================================================
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
# Single prediction mode
parser.add_argument("--angle", type=int)
parser.add_argument("--frames", type=str)
parser.add_argument("--is_nomatch", type=int, default=0)
parser.add_argument("--description", type=str, default="")
# Batch mode
parser.add_argument("--input_csv", type=str)
parser.add_argument("--output_csv", type=str)
parser.add_argument(
"--model",
type=str,
default=DEFAULT_MODEL,
choices=list(MODEL_REGISTRY.keys()) + ["nn"],
help="Which trained model to use",
)
parser.add_argument("--numeric", action="store_true")
parser.add_argument("--debug", action="store_true")
parser.add_argument("--evaluate", action="store_true")
args = parser.parse_args()
if args.input_csv:
df_out = predict_csv(
input_csv=args.input_csv,
output_csv=args.output_csv,
model_name=args.model,
)
print(df_out.head())
if args.evaluate:
try:
metrics = evaluate_predictions(df_out)
print("\nEvaluation:")
for k, v in metrics.items():
print(f"{k}: {v:.4f}")
except Exception as e:
print(f"\nCould not evaluate predictions: {e}")
else:
if args.angle is None or args.frames is None:
raise ValueError("For single prediction, you must provide --angle and --frames")
pred = predict(
angle=args.angle,
frames=args.frames,
is_nomatch=args.is_nomatch,
description=args.description,
model_name=args.model,
return_numeric=args.numeric,
debug=args.debug,
)
print(pred)