fixed leakage

2026-03-28 12:19:09 -04:00
parent 321fe78105
commit 1530c02961
24 changed files with 8224 additions and 1086 deletions
--- a/README.md
+++ b/README.md
@@ -13,8 +13,8 @@ I recently got into *board climbing*, and have been enjoying using the <a href="

 This project analyzes ~130,000 climbs from the Tension Boards in order to do the following.
 > 1. **Understand** hold usage patterns and difficulty distributions
-> 2. **Quantify** empircal hold difficulty scores
-> 3. **Predict** climb grades from hold positions and board angle
+> 2. **Quantify** empirical hold difficulty scores (for analysis)
+> 3. **Predict** climb grades from spatial and structural features of climbs

 Climbing grades are inherently subjective. Different climbers use different beta, setters have different grading standards, and difficulty depends on factors not always captured in data. What makes it harder in the case of board climbing is that the grade is set almost democratically -- it is determined by user input.

@@ -92,7 +92,7 @@ Go to your working directory and run notebooks in order:
 Note:

 * Notebooks 01-03 are uploaded with all of their cells run, so that one can see the data analysis. Notebooks 04-06 are uploaded without having been run.
-* Notebook 03 generates hold difficulty tables
+* Notebook 03 generates global hold difficulty tables
 * Notebook 04 generates feature matrix
 * Notebook 05 trains models
 * Notebook 06 trains neural network
@@ -167,9 +167,9 @@ This significantly improves downstream feature quality.
 
 ---

-## 6. Many more!
+## 6. Many more

-There are many other statistics, see notebooks [`01`](notebooks/01_data_overview_and_climbing_statistics.ipynb) (climbing statistics), [`02`](notebooks/02_hold_analysis_and_board_heatmaps.ipynb) (climbing hold statistics), and [`03`](notebooks/03_hold_difficulty.ipynb). Included are:
+There are many other statistics; see notebooks [`01`](notebooks/01_data_overview_and_climbing_statistics.ipynb) (climbing statistics), [`02`](notebooks/02_hold_analysis_and_board_heatmaps.ipynb) (climbing hold statistics), and [`03`](notebooks/03_hold_difficulty.ipynb) (climbing hold difficulty). Included are:

 * **Time-Date analysis** based on `fa_at`. We include month, day of week, and time analysis based on first ascent log data. Winter months are the most popular, and Tuesday and Wednesday are the most popular days of the week.
 * **Distribution of climbs per angle**, with 40 degrees being the most common.
@@ -178,6 +178,7 @@ There are many other statistics, see notebooks [`01`](notebooks/01_data_overview
 * **Prolific statistics**: most popular routes & setters
 * **Plastic vs wood** hold analysis
 * **Per-Angle, Per-Grade** hold frequency & difficulty analyses
+* more!

 ---

@@ -189,24 +190,33 @@ This section focuses on **building predictive models and evaluating performance*

 ## 7. Feature Engineering

-Features are constructed at the climb level using:
+Features are constructed at the climb level using only **structural and geometric information** derived from the climb definition (`angle` and `frames`).

-* geometry (height, spread, convex hull)
-* structure (number of moves, clustering)
-* hold difficulty (smoothed estimates)
-* interaction features
+We explicitly avoid using hold-difficulty-derived features in the predictive models to prevent target leakage.
+
+Feature categories include:
+
+* **Geometry** — spatial footprint of the climb (height, spread, convex hull)
+* **Movement** — reach distances and spatial relationships between holds
+* **Density** — how tightly or sparsely holds are arranged
+* **Symmetry** — left/right balance and distribution
+* **Path structure** — approximations of movement flow and efficiency
+* **Normalized position** — relative positioning on the board
+* **Interaction features** — simple nonlinear combinations (e.g., angle × hold count)
+
+This results in a **leakage-free feature set** that better reflects the physical structure of climbing.


-| Category      | Description                       | Examples                                    |
-| ------------- | --------------------------------- | ------------------------------------------- |
-| Geometry      | Shape and size of climb           | bbox_area, range_x, range_y                 |
-| Movement      | Reach and movement complexity     | max_hand_reach, path_efficiency             |
-| Difficulty    | Hold-based difficulty metrics     | mean_hold_difficulty, max_hold_difficulty   |
-| Progression   | How difficulty changes over climb | difficulty_gradient, difficulty_progression |
-| Symmetry      | Left/right balance                | symmetry_score, hand_symmetry               |
-| Clustering    | Local density of holds            | mean_neighbors_12in                         |
-| Normalization | Relative board positioning        | mean_y_normalized                           |
-| Distribution  | Vertical distribution of holds    | y_q25, y_q75                                |
+| Category      | Description                              | Examples                                  |
+| ------------- | ---------------------------------------- | ----------------------------------------- |
+| Geometry      | Shape and size of climb                  | bbox_area, range_x, range_y               |
+| Movement      | Reach and movement structure             | mean_hand_reach, path_efficiency          |
+| Density       | Hold spacing and compactness             | hold_density, holds_per_vertical_foot     |
+| Symmetry      | Left/right balance                       | symmetry_score, left_ratio                |
+| Path          | Approximate movement trajectory          | path_length_vertical                      |
+| Position      | Relative board positioning               | mean_y_normalized, start_height_normalized|
+| Distribution  | Vertical distribution of holds           | y_q75, y_iqr                              |
+| Interaction   | Nonlinear feature combinations           | angle_squared, angle_x_holds              |

 ### Important design decision

@@ -216,6 +226,22 @@ The dataset is restricted to:

 to reduce variability and improve consistency. (see [Angle vs Difficulty](#3-angle-vs-difficulty), where average climb grade seems to stabilize or get lower over 50°)

+### Important: Leakage and Feature Design
+
+Earlier iterations of this project included features derived from hold difficulty scores (computed from climb grades). While these features slightly improved predictive performance, they introduce a form of **target leakage** if computed globally.
+
+In this version of the project:
+
+* Hold difficulty scores are still computed in Notebook 03 for **exploratory analysis**
+* Predictive models (Notebooks 04–06) use only **leakage-free features**
+* No feature is derived from the target variable (`display_difficulty`)
+
+This allows the model to learn from the **structure of climbs themselves**, rather than from aggregated statistics of the labels.
+
+Note: Hold-difficulty-based features can still be valid in a production setting if computed strictly from historical (training) data, similar to target encoding techniques.
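
The note above can be illustrated with a minimal sketch, assuming climbs are available as `(hold_ids, grade)` pairs. The function names and the `prior_weight` smoothing parameter are hypothetical and are not taken from the notebooks; the point is only that the per-hold table is fitted on training climbs and then looked up for unseen climbs.

```python
from collections import defaultdict


def fit_hold_difficulty(train_climbs, prior_weight=5.0):
    """Smoothed mean grade per hold, fitted on training climbs only.

    train_climbs: iterable of (hold_ids, grade) pairs (hypothetical layout).
    Returns (table, global_mean), where table maps hold id -> score.
    """
    train_climbs = list(train_climbs)
    global_mean = sum(grade for _, grade in train_climbs) / len(train_climbs)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hold_ids, grade in train_climbs:
        for hold in set(hold_ids):  # count each hold once per climb
            sums[hold] += grade
            counts[hold] += 1
    # Shrink rarely used holds toward the global training mean.
    table = {
        hold: (sums[hold] + prior_weight * global_mean) / (counts[hold] + prior_weight)
        for hold in sums
    }
    return table, global_mean


def mean_hold_difficulty(hold_ids, table, global_mean):
    """Average hold score for one climb; unseen holds fall back to the mean."""
    scores = [table.get(hold, global_mean) for hold in hold_ids]
    return sum(scores) / len(scores)


train = [([1, 2], 10.0), ([2, 3], 20.0)]
table, global_mean = fit_hold_difficulty(train, prior_weight=0.0)
test_feature = mean_hold_difficulty([3, 99], table, global_mean)
```

For the training rows themselves, the same idea would need out-of-fold fitting (each fold scored with a table built from the other folds), since otherwise a climb's own grade still contributes to its hold scores.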
+
 ---

 ## 8. Feature Relationships
@@ -225,10 +251,10 @@ Here are some relationships between features and difficulty
 ![Correlation Heatmap](images/04_climb_features/feature_correlations.png)

 * higher angles allow for harder difficulties
-* hold difficulty features seem to correlate the most to difficulty
-* engineered features capture non-trivial structure
+* distance between holds seems to correlate with difficulty
+* geometric and structural features capture non-trivial climbing patterns

-We have a full feature list in [`data/04_climb_features/feature_list.txt`](data/04_climb_features/feature_list.txt). Explanations are available in [`data/04_climb_features/feature_list_explanations.txt`](data/04_climb_features/feature_explanations.txt).
+We have a full feature list in [`data/04_climb_features/feature_list.txt`](data/04_climb_features/feature_list.txt). Explanations are available in [`data/04_climb_features/feature_explanations.txt`](data/04_climb_features/feature_explanations.txt).

 ---

@@ -252,9 +278,12 @@ Models tested:

 Key drivers:

-* hold difficulty
 * wall angle
-* structural features
+* reach-based features (e.g., mean/max hand reach)
+* spatial density and distribution
+* geometric structure of the climb
+
+This suggests that **difficulty is strongly tied to spatial arrangement and movement constraints**, rather than just individual hold properties.

 ---

@@ -263,11 +292,14 @@ Key drivers:
 ![RF Predicted vs Actual](images/05_predictive_modelling/random_forest_predictions.png)
 ![NN Predicted vs Actual](images/06_deep_learning/neural_network_predictions.png)

-### Results (in terms of difficulty score)
+### Results (in terms of V-grade)
 Both the RF and NN models performed similarly.
-* **~83% within ±1 V-grade (~45% within ±1 difficulty score)**
-* **~96% within ±2 V-grade (~80% within ±2 difficulty scores)**
+* **~70% within ±1 V-grade (~36% within ±1 difficulty score)**
+* **~90% within ±2 V-grade (~65% within ±2 difficulty scores)**

+In earlier experiments, we achieved ~83% within one V-grade and ~96% within two. However, that setup used the hold difficulty scores from Notebook 03, which are derived from climb grades, creating leakage. The current result is more realistic: the model relies purely on spatial and structural information, without access to hold-based information or beta.
+
+This demonstrates that a substantial portion of climbing difficulty can be attributed to geometry and movement constraints. 

 ### Interpretation

@@ -284,15 +316,15 @@ Both the RF and NN models performed similarly.

 | Metric             | Performance |
 | ------------------ | ----------- |
-| Within ±1 V-grade  | ~83%        |
-| Within ±2 V-grades | ~96%        |
+| Within ±1 V-grade  | ~70%        |
+| Within ±2 V-grades | ~90%        |

 The model can still predict subgrades (e.g., V3 contains 6a and 6a+), but it is not as accurate.

 | Metric             | Performance |
 | ------------------ | ----------- |
-| Within ±1 difficulty-grade  | ~45%        |
-| Within ±2 difficulty-grades | ~80%        |
+| Within ±1 difficulty-grade  | ~36%        |
+| Within ±2 difficulty-grades | ~65%        |

 ---

@@ -312,7 +344,8 @@ The model can still predict subgrades (e.g., V3 contains 6a and 6a+), but it is

 # Future Work

-* Kilter Board analysis
+* <a href="https://gitlab.com/psark/Kilter-Board-Analysis">~~Kilter Board analysis~~</a>
+* A unified approach to grade prediction across boards
 * Test other models
 * Better spatial features
 * GUI to create climb and instantly tell you a predicted difficulty
--- a/data/04_climb_features/feature_explanations.txt
+++ b/data/04_climb_features/feature_explanations.txt
@@ -1,324 +1,201 @@
-Tension Board 2 – Feature Engineering Explanation
+# Feature Explanations

-This document explains the engineered features used for climb difficulty prediction.
+## Core Climb Attributes

--------------------------------------------------
-1. BASIC STRUCTURE FEATURES
--------------------------------------------------
+**angle**
+Board angle in degrees. Higher angles correspond to steeper (more overhanging) climbs, which generally increase difficulty.

-angle
-Wall angle in degrees.
+**angle_squared**
+Square of the board angle. Captures nonlinear effects of steepness (difficulty often increases faster at higher angles).

-total_holds
-Total number of holds in the climb.
+**display_difficulty**
+Target variable: the climb’s difficulty rating provided by the dataset.

-hand_holds / foot_holds
-Number of hand vs foot holds.
+**angle_x_holds**
+Interaction between angle and number of holds. Captures how steepness and hold count jointly affect difficulty (e.g., many holds on steep terrain vs few holds on slab).

-start_holds / finish_holds / middle_holds
-Counts of hold roles.
+---

+## Hold Counts / Composition

--------------------------------------------------
-2. MATCHING FEATURE
--------------------------------------------------
+**total_holds**
+Total number of holds used in the climb.

-is_nomatch
-Binary feature indicating whether matching is disallowed.
-Derived from:
- explicit flag OR
- description text (e.g. “no match”, “no matching”)
+**hand_holds**
+Number of holds intended for hands.

+**foot_holds**
+Number of holds intended for feet.

--------------------------------------------------
-3. SPATIAL FEATURES
--------------------------------------------------
+**start_holds**
+Number of starting holds.

-mean_x, mean_y
-Center of mass of all holds.
+**finish_holds**
+Number of finishing holds.

-std_x, std_y
-Spread of holds.
+**middle_holds**
+Number of intermediate (non-start, non-finish) holds.

-range_x, range_y
-Width and height of climb.
+**is_nomatch**
+Binary indicator for whether matching hands on holds is disallowed. No-match climbs tend to be more difficult due to restricted movement options.

-min_y, max_y
-Lowest and highest holds.
+---

-height_gained
-Total vertical gain.
+## Spatial Position (Raw Coordinates)

-height_gained_start_finish
-Vertical gain from start to finish.
+**mean_y**
+Average vertical position of all holds. Higher values indicate climbs concentrated toward the top of the board.

+**std_x**
+Standard deviation of horizontal hold positions. Measures left-right spread.

--------------------------------------------------
-4. START / FINISH FEATURES
--------------------------------------------------
+**std_y**
+Standard deviation of vertical hold positions. Measures vertical dispersion.

-start_height, finish_height
-Average height of start/finish holds.
+**range_x**
+Horizontal range (max − min x). Indicates how wide the climb is.

-start_height_min/max, finish_height_min/max
-Range of start/finish positions.
+**range_y**
+Vertical range (max − min y). Indicates how tall the climb is.

+**min_y**
+Lowest hold position.

--------------------------------------------------
-5. BOUNDING BOX FEATURES
--------------------------------------------------
+**max_y**
+Highest hold position.

-bbox_area
-Area covered by climb.
+**height_gained**
+Total vertical distance covered by the climb.

-bbox_aspect_ratio
-Horizontal vs vertical shape.
+**height_gained_start_finish**
+Vertical distance between average start and finish holds.

-bbox_normalized_area
-Relative coverage of board.
+---

-hold_density
-Holds per unit area.
+## Density / Coverage

-holds_per_vertical_foot
-Vertical density.
+**bbox_area**
+Area of the bounding box containing all holds.

+**hold_density**
+Number of holds per unit area. Higher density often means more options and potentially easier climbing.

--------------------------------------------------
-6. SYMMETRY FEATURES
--------------------------------------------------
+**holds_per_vertical_foot**
+Number of holds per unit vertical distance. Captures how “ladder-like” a climb is.

-left_holds, right_holds
-Distribution across board center.
+---

-left_ratio
-Fraction of holds on left.
+## Symmetry / Balance

-symmetry_score
-Symmetry measure (1 = perfectly balanced).
+**left_ratio**
+Proportion of holds on the left side of the board.

-hand_left_ratio, hand_symmetry
-Same but for hand holds.
+**symmetry_score**
+How balanced the climb is left-to-right (1 = perfectly balanced, 0 = fully one-sided).

+**upper_ratio**
+Fraction of holds located above the median vertical position. Indicates whether the climb is top-heavy.

--------------------------------------------------
-7. VERTICAL DISTRIBUTION
--------------------------------------------------
+---

-upper_holds, lower_holds
-Split around median height.
+## Hand Geometry (Reach / Movement)

-upper_ratio
-Proportion of upper holds.
+**mean_hand_reach**
+Average distance between pairs of hand holds. Proxy for typical move size.

+**max_hand_reach**
+Maximum distance between hand holds. Captures hardest reach or span.

--------------------------------------------------
-8. REACH / DISTANCE FEATURES
--------------------------------------------------
+**std_hand_reach**
+Variation in hand distances. Measures consistency vs variability of moves.

-max_hand_reach, mean_hand_reach, std_hand_reach
-Distances between hand holds.
+**hand_spread_x**
+Horizontal spread of hand holds.

-hand_spread_x, hand_spread_y
-Spatial extent of hand holds.
+**hand_spread_y**
+Vertical spread of hand holds.

-max_foot_spread, mean_foot_spread
-Foot hold spacing.
+---

-max_hand_to_foot, mean_hand_to_foot
-Hand-foot distances.
+## Hand–Foot Interaction

+**min_hand_to_foot**
+Minimum distance between any hand and foot hold. Indicates tight body positioning.

--------------------------------------------------
-9. HOLD DIFFICULTY FEATURES
--------------------------------------------------
+**mean_hand_to_foot**
+Average distance between hands and feet. Proxy for body extension requirements.

-mean_hold_difficulty
-Average difficulty of holds.
+**std_hand_to_foot**
+Variation in hand-foot distances. Measures consistency of body positioning.

-max_hold_difficulty / min_hold_difficulty
-Extremes.
+---

-std_hold_difficulty
-Variation.
+## Global Geometry

-median_hold_difficulty
-Central tendency.
+**convex_hull_area**
+Area of the convex hull enclosing all holds. Measures overall spatial footprint.

-difficulty_range
-Spread.
+**hull_area_to_bbox_ratio**
+Ratio of convex hull area to bounding box area. Indicates how “filled” or “sparse” the hold distribution is.

-mean_hand_difficulty / mean_foot_difficulty
-Role-specific difficulty.
+**mean_pairwise_distance**
+Average distance between all pairs of holds. Global spacing measure.

-start_difficulty / finish_difficulty
-Entry and exit difficulty.
+**std_pairwise_distance**
+Variation in distances between holds. Captures clustering vs spread.

+---

--------------------------------------------------
-10. COMBINED / INTERACTION FEATURES
--------------------------------------------------
+## Path / Flow

-hand_foot_ratio
-Balance of hands vs feet.
+**path_length_vertical**
+Approximate total path length when moving from bottom to top (based on sorted vertical positions).

-movement_density
-Holds per vertical distance.
+**path_efficiency**
+Ratio of vertical gain to path length. Higher values indicate more direct movement; lower values indicate more traversing or inefficiency.

-weighted_difficulty
-Height-weighted difficulty.
+---

-difficulty_gradient
-Difference between start and finish difficulty.
+## Normalized / Relative Position

+**mean_y_normalized**
+Average vertical position scaled to board height (0–1).

--------------------------------------------------
-11. SHAPE / GEOMETRY FEATURES
--------------------------------------------------
+**start_height_normalized**
+Start hold height relative to board height.

-convex_hull_area
-Area of convex hull around holds.
+**finish_height_normalized**
+Finish hold height relative to board height.

-convex_hull_perimeter
-Perimeter.
+**mean_y_relative_to_start**
+Average hold height relative to starting position.

-hull_area_to_bbox_ratio
-Compactness.
+**spread_x_normalized**
+Horizontal spread normalized by board width.

+**spread_y_normalized**
+Vertical spread normalized by board height.

--------------------------------------------------
-12. NEAREST-NEIGHBOR FEATURES
--------------------------------------------------
+---

-min_nn_distance / mean_nn_distance
-Spacing between holds.
+## Distribution Features

-max_nn_distance
-Maximum separation.
+**y_q75**
+75th percentile of hold heights. Indicates where upper holds are concentrated.

-std_nn_distance
-Spread.
+**y_iqr**
+Interquartile range (75th − 25th percentile) of hold heights. Measures vertical spread excluding extremes.

+---

--------------------------------------------------
-13. CLUSTERING FEATURES
--------------------------------------------------
+## Engineered Feature

-mean_neighbors_12in
-Average nearby holds within 12 inches.
+**complexity_score**
+Composite feature combining:

-max_neighbors_12in
-Max clustering.
+* hand reach (movement difficulty),
+* number of holds (sequence length),
+* hold density (spacing).

-clustering_ratio
-Normalized clustering.
+Designed to capture overall climb complexity in a single metric.

-
--------------------------------------------------
-14. PATH FEATURES
--------------------------------------------------
-
-path_length_vertical
-Estimated movement path length.
-
-path_efficiency
-Vertical gain vs path length.
-
-
--------------------------------------------------
-15. REGIONAL DIFFICULTY FEATURES
--------------------------------------------------
-
-lower_region_difficulty
-Bottom third difficulty.
-
-middle_region_difficulty
-Middle third difficulty.
-
-upper_region_difficulty
-Top third difficulty.
-
-difficulty_progression
-Change in difficulty from bottom to top.
-
-
--------------------------------------------------
-16. DIFFICULTY TRANSITIONS
--------------------------------------------------
-
-max_difficulty_jump
-Largest jump between moves.
-
-mean_difficulty_jump
-Average jump.
-
-difficulty_weighted_reach
-Distance weighted by difficulty.
-
-
--------------------------------------------------
-17. NORMALIZED FEATURES
--------------------------------------------------
-
-mean_x_normalized, mean_y_normalized
-Relative board position.
-
-std_x_normalized, std_y_normalized
-Normalized spread.
-
-start_height_normalized, finish_height_normalized
-Relative heights.
-
-spread_x_normalized, spread_y_normalized
-Coverage.
-
-
--------------------------------------------------
-18. RELATIVE POSITION FEATURES
--------------------------------------------------
-
-start_offset_from_typical
-Deviation from typical start height.
-
-finish_offset_from_typical
-Deviation from typical finish height.
-
-mean_y_relative_to_start
-Average height relative to start.
-
-max_y_relative_to_start
-Highest point relative to start.
-
-
--------------------------------------------------
-19. DISTRIBUTION FEATURES
--------------------------------------------------
-
-y_q25, y_q50, y_q75
-Height quartiles.
-
-y_iqr
-Spread.
-
-holds_bottom_quartile
-Lower density.
-
-holds_top_quartile
-Upper density.
-
-
--------------------------------------------------
-SUMMARY
--------------------------------------------------
-
-These features capture:
-
- Geometry (shape, spread)
- Movement (reach, density, path)
- Difficulty (hold-based + progression)
- Symmetry and balance
- Spatial distribution
-
-Together they allow the model to approximate both:
- physical movement complexity
- and hold difficulty structure of a climb.
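
A few of the geometric features described in `feature_explanations.txt` can be sketched as follows. This is an illustration under stated assumptions, not the code from Notebook 04: holds are taken to be `(x, y)` tuples, the function names are hypothetical, and the vertical board bounds are the `y_min, y_max = 0, 144` values that appear in Notebook 02.

```python
import math
from itertools import combinations

# Vertical board bounds as given in notebook 02 (y_min, y_max = 0, 144).
Y_MIN, Y_MAX = 0, 144


def path_features(holds):
    """path_length_vertical and path_efficiency for a list of (x, y) holds.

    The path visits holds in order of increasing height, as the
    explanation file describes ("based on sorted vertical positions").
    """
    ordered = sorted(holds, key=lambda p: p[1])
    path_length = sum(math.dist(a, b) for a, b in zip(ordered, ordered[1:]))
    height_gained = ordered[-1][1] - ordered[0][1]
    efficiency = height_gained / path_length if path_length > 0 else 0.0
    return path_length, efficiency


def hand_reach(hand_holds):
    """Mean and max distance over all pairs of hand holds."""
    distances = [math.dist(a, b) for a, b in combinations(hand_holds, 2)]
    if not distances:  # fewer than two hand holds
        return 0.0, 0.0
    return sum(distances) / len(distances), max(distances)


def mean_y_normalized(holds):
    """Average hold height scaled to the 0-1 range of the board."""
    mean_y = sum(y for _, y in holds) / len(holds)
    return (mean_y - Y_MIN) / (Y_MAX - Y_MIN)


climb = [(0, 0), (3, 4), (0, 8)]
length, efficiency = path_features(climb)
mean_reach, max_reach = hand_reach(climb)
```

None of these quantities touch `display_difficulty`, which is what keeps them usable as leakage-free inputs.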
--- a/data/04_climb_features/feature_list.txt
+++ b/data/04_climb_features/feature_list.txt
@@ -1,4 +1,5 @@
 angle
+angle_squared
 total_holds
 hand_holds
 foot_holds
@@ -6,7 +7,6 @@ start_holds
 finish_holds
 middle_holds
 is_nomatch
-mean_x
 mean_y
 std_x
 std_y
@@ -14,107 +14,36 @@ range_x
 range_y
 min_y
 max_y
-start_height
-start_height_min
-start_height_max
-finish_height
-finish_height_min
-finish_height_max
 height_gained
 height_gained_start_finish
 bbox_area
-bbox_aspect_ratio
-bbox_normalized_area
 hold_density
 holds_per_vertical_foot
-left_holds
-right_holds
 left_ratio
 symmetry_score
-hand_left_ratio
-hand_symmetry
-upper_holds
-lower_holds
 upper_ratio
-max_hand_reach
-min_hand_reach
 mean_hand_reach
+max_hand_reach
 std_hand_reach
 hand_spread_x
 hand_spread_y
-max_foot_spread
-mean_foot_spread
-foot_spread_x
-foot_spread_y
-max_hand_to_foot
 min_hand_to_foot
 mean_hand_to_foot
 std_hand_to_foot
-mean_hold_difficulty
-max_hold_difficulty
-min_hold_difficulty
-std_hold_difficulty
-median_hold_difficulty
-difficulty_range
-mean_hand_difficulty
-max_hand_difficulty
-std_hand_difficulty
-mean_foot_difficulty
-max_foot_difficulty
-std_foot_difficulty
-start_difficulty
-finish_difficulty
-hand_foot_ratio
-movement_density
-hold_com_x
-hold_com_y
-weighted_difficulty
 convex_hull_area
-convex_hull_perimeter
 hull_area_to_bbox_ratio
-min_nn_distance
-mean_nn_distance
-max_nn_distance
-std_nn_distance
-mean_neighbors_12in
-max_neighbors_12in
-clustering_ratio
+mean_pairwise_distance
+std_pairwise_distance
 path_length_vertical
 path_efficiency
-difficulty_gradient
-lower_region_difficulty
-middle_region_difficulty
-upper_region_difficulty
-difficulty_progression
-max_difficulty_jump
-mean_difficulty_jump
-difficulty_weighted_reach
-max_weighted_reach
-mean_x_normalized
 mean_y_normalized
-std_x_normalized
-std_y_normalized
 start_height_normalized
 finish_height_normalized
-start_offset_from_typical
-finish_offset_from_typical
 mean_y_relative_to_start
-max_y_relative_to_start
 spread_x_normalized
 spread_y_normalized
-bbox_coverage_x
-bbox_coverage_y
-y_q25
-y_q50
 y_q75
 y_iqr
-holds_bottom_quartile
-holds_top_quartile
+complexity_score
 display_difficulty
 angle_x_holds
-angle_x_difficulty
-angle_squared
-difficulty_x_height
-difficulty_x_density
-complexity_score
-hull_area_x_difficulty
--- a/data/05_predictive_modelling/model_summary.txt
+++ b/data/05_predictive_modelling/model_summary.txt
@@ -3,10 +3,10 @@

 | Model | MAE | RMSE | R² | Within ±1 | Within ±2 | Exact V | Within ±1 V |
 |-------|-----|------|----|-----------|-----------|---------|-------------|
-| Linear Regression | 1.467 | 1.882 | 0.782 | 42.6% | 73.3% | 34.9% | 79.4% |
-| Ridge Regression | 1.467 | 1.882 | 0.782 | 42.6% | 73.3% | 34.9% | 79.4% |
-| Lasso Regression | 1.475 | 1.891 | 0.780 | 42.2% | 73.0% | 34.6% | 79.3% |
-| Random Forest (Tuned) | 1.325 | 1.718 | 0.818 | 47.0% | 77.7% | 38.6% | 83.0% |
+| Linear Regression | 2.191 | 2.742 | 0.537 | 28.1% | 53.1% | 23.9% | 61.3% |
+| Ridge Regression | 2.191 | 2.742 | 0.537 | 28.1% | 53.1% | 23.9% | 61.3% |
+| Lasso Regression | 2.192 | 2.741 | 0.538 | 27.9% | 53.1% | 23.8% | 61.3% |
+| Random Forest (Tuned) | 1.788 | 2.293 | 0.676 | 36.1% | 64.3% | 30.2% | 70.8% |

 ### Key Findings

@@ -15,8 +15,8 @@
   - Linear models remain useful baselines but leave clear nonlinear signal unexplained.

 2. **Fine-grained difficulty prediction is meaningfully harder than grouped grade prediction.**
-   - On the held-out test set, the best model is within ±1 fine-grained difficulty score 47.0% of the time.
-   - The same model is within ±1 grouped V-grade 83.0% of the time.
+   - On the held-out test set, the best model is within ±1 fine-grained difficulty score 36.1% of the time.
+   - The same model is within ±1 grouped V-grade 70.8% of the time.

 3. **This gap is expected and informative.**
   - Small numeric errors often stay inside the same or adjacent V-grade buckets.
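
The "Within ±1" and "Within ±1 V" columns can be read as tolerance accuracies. The sketch below shows one plausible way to compute them; it is an assumption about the metric, not the notebook's code. In particular, `toy_group` is a made-up stand-in for the real difficulty-to-V-grade mapping, which is not reproduced here, and rounding predictions before grouping is also an assumption.

```python
def within_k(y_true, y_pred, k):
    """Fraction of predictions whose absolute error is at most k."""
    hits = sum(abs(p - t) <= k for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def grouped_within_k(y_true, y_pred, k, to_group):
    """Same metric after mapping fine-grained scores to coarser buckets.

    to_group: callable mapping an integer difficulty score to a bucket
    index. Predictions are rounded first (an assumption).
    """
    g_true = [to_group(t) for t in y_true]
    g_pred = [to_group(round(p)) for p in y_pred]
    return within_k(g_true, g_pred, k)


def toy_group(score):
    # Hypothetical mapping: two fine-grained scores per bucket.
    return int(score) // 2


y_true = [10, 12, 14, 16]
y_pred = [10.4, 14.6, 13.0, 19.9]
fine = within_k(y_true, y_pred, 1)
grouped = grouped_within_k(y_true, y_pred, 1, toy_group)
```

In this toy data the grouped accuracy is higher than the fine-grained one, which mirrors the gap described in finding 3: a moderate numeric error often lands in the same or an adjacent bucket.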
--- a/data/06_deep_learning/neural_network_summary.txt
+++ b/data/06_deep_learning/neural_network_summary.txt
@@ -2,25 +2,25 @@
 ### Neural Network Model Summary

 **Architecture:**
- Input: 119 features
+- Input: 48 features
 - Hidden layers: [256, 128, 64]
 - Dropout rate: 0.2
- Total parameters: 72,833
+- Total parameters: 54,657
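
The parameter totals can be checked by hand. The sketch below assumes a single output unit and one batch-normalization layer (two trainable parameters per unit) after each hidden layer. Both are inferences from the arithmetic, not details stated in this summary, but under those assumptions the counts for 48 and 119 inputs match the reported values.

```python
def dense_params(n_in, hidden, n_out=1, batch_norm=True):
    """Trainable parameters of a fully connected network.

    Each dense layer contributes weights plus biases. With batch_norm=True
    (an assumption), each hidden layer adds a scale and a shift per unit.
    """
    total = 0
    prev = n_in
    for width in hidden:
        total += prev * width + width  # weights + biases
        if batch_norm:
            total += 2 * width  # gamma and beta
        prev = width
    total += prev * n_out + n_out  # output layer
    return total


new_total = dense_params(48, [256, 128, 64])
old_total = dense_params(119, [256, 128, 64])
```

Only the first layer depends on the input width, so dropping 71 features removes 71 × 256 = 18,176 weights, which is exactly the difference between the two totals.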

 **Training:**
 - Optimizer: Adam (lr=0.001)
 - Early stopping: 25 epochs patience
- Best epoch: 121
+- Best epoch: 153

 **Test Set Performance:**
- MAE: 1.270
- RMSE: 1.643
- R²: 0.834
- Accuracy within ±1 grade: 49.0%
- Accuracy within ±2 grades: 80.2%
- Exact grouped V-grade accuracy: 39.2%
- Accuracy within ±1 V-grade: 84.3%
- Accuracy within ±2 V-grades: 96.8%
+- MAE: 1.893
+- RMSE: 2.398
+- R²: 0.646
+- Accuracy within ±1 grade: 33.8%
+- Accuracy within ±2 grades: 60.5%
+- Exact grouped V-grade accuracy: 27.8%
+- Accuracy within ±1 V-grade: 67.9%
+- Accuracy within ±2 V-grades: 88.4%

 **Key Findings:**
 1. The neural network is competitive, but not clearly stronger than the best tree-based baseline.
--- a/images/04_climb_features/feature_correlations.png
+++ b/images/04_climb_features/feature_correlations.png
--- a/images/05_predictive_modelling/cv_comparison.png
+++ b/images/05_predictive_modelling/cv_comparison.png
--- a/images/05_predictive_modelling/error_by_grade.png
+++ b/images/05_predictive_modelling/error_by_grade.png
--- a/images/05_predictive_modelling/linear_regression_coefficients.png
+++ b/images/05_predictive_modelling/linear_regression_coefficients.png
--- a/images/05_predictive_modelling/model_comparison.png
+++ b/images/05_predictive_modelling/model_comparison.png
--- a/images/05_predictive_modelling/prediction_uncertainty.png
+++ b/images/05_predictive_modelling/prediction_uncertainty.png
--- a/images/05_predictive_modelling/random_forest_importance.png
+++ b/images/05_predictive_modelling/random_forest_importance.png
--- a/images/06_deep_learning/neural_network_by_grade.png
+++ b/images/06_deep_learning/neural_network_by_grade.png
--- a/images/06_deep_learning/neural_network_errors.png
+++ b/images/06_deep_learning/neural_network_errors.png
--- a/images/06_deep_learning/neural_network_feature_importance.png
+++ b/images/06_deep_learning/neural_network_feature_importance.png
--- a/images/06_deep_learning/neural_network_predictions.png
+++ b/images/06_deep_learning/neural_network_predictions.png
--- a/images/06_deep_learning/neural_network_training.png
+++ b/images/06_deep_learning/neural_network_training.png
--- a/images/06_deep_learning/rf_vs_nn_comparison.png
+++ b/images/06_deep_learning/rf_vs_nn_comparison.png
--- a/models/feature_names.txt
+++ b/models/feature_names.txt
@@ -1,4 +1,5 @@
 angle
+angle_squared
 total_holds
 hand_holds
 foot_holds
@@ -6,7 +7,6 @@ start_holds
 finish_holds
 middle_holds
 is_nomatch
-mean_x
 mean_y
 std_x
 std_y
@@ -14,106 +14,35 @@ range_x
 range_y
 min_y
 max_y
-start_height
-start_height_min
-start_height_max
-finish_height
-finish_height_min
-finish_height_max
 height_gained
 height_gained_start_finish
 bbox_area
-bbox_aspect_ratio
-bbox_normalized_area
 hold_density
 holds_per_vertical_foot
-left_holds
-right_holds
 left_ratio
 symmetry_score
-hand_left_ratio
-hand_symmetry
-upper_holds
-lower_holds
 upper_ratio
-max_hand_reach
-min_hand_reach
 mean_hand_reach
+max_hand_reach
 std_hand_reach
 hand_spread_x
 hand_spread_y
-max_foot_spread
-mean_foot_spread
-foot_spread_x
-foot_spread_y
-max_hand_to_foot
 min_hand_to_foot
 mean_hand_to_foot
 std_hand_to_foot
-mean_hold_difficulty
-max_hold_difficulty
-min_hold_difficulty
-std_hold_difficulty
-median_hold_difficulty
-difficulty_range
-mean_hand_difficulty
-max_hand_difficulty
-std_hand_difficulty
-mean_foot_difficulty
-max_foot_difficulty
-std_foot_difficulty
-start_difficulty
-finish_difficulty
-hand_foot_ratio
-movement_density
-hold_com_x
-hold_com_y
-weighted_difficulty
 convex_hull_area
-convex_hull_perimeter
 hull_area_to_bbox_ratio
-min_nn_distance
-mean_nn_distance
-max_nn_distance
-std_nn_distance
-mean_neighbors_12in
-max_neighbors_12in
-clustering_ratio
+mean_pairwise_distance
+std_pairwise_distance
 path_length_vertical
 path_efficiency
-difficulty_gradient
-lower_region_difficulty
-middle_region_difficulty
-upper_region_difficulty
-difficulty_progression
-max_difficulty_jump
-mean_difficulty_jump
-difficulty_weighted_reach
-max_weighted_reach
-mean_x_normalized
 mean_y_normalized
-std_x_normalized
-std_y_normalized
 start_height_normalized
 finish_height_normalized
-start_offset_from_typical
-finish_offset_from_typical
 mean_y_relative_to_start
-max_y_relative_to_start
 spread_x_normalized
 spread_y_normalized
-bbox_coverage_x
-bbox_coverage_y
-y_q25
-y_q50
 y_q75
 y_iqr
-holds_bottom_quartile
-holds_top_quartile
-angle_x_holds
-angle_x_difficulty
-angle_squared
-difficulty_x_height
-difficulty_x_density
 complexity_score
-hull_area_x_difficulty
+angle_x_holds
--- a/notebooks/02_hold_analysis_and_board_heatmaps.ipynb
+++ b/notebooks/02_hold_analysis_and_board_heatmaps.ipynb
@@ -520,7 +520,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": null,
   "id": "9d3eb97b",
   "metadata": {},
   "outputs": [],
@@ -553,6 +553,7 @@
    "role_type_map = {5: 'hand', 6: 'hand', 7: 'hand', 8: 'foot'}\n",
    "\n",
    "## Boundary conditions\n",
+    "# These bounds come from the product_sizes table (edge_left, edge_right, edge_bottom, edge_top).\n",
    "x_min, x_max = -68, 68\n",
    "y_min, y_max = 0, 144"
   ]
--- a/notebooks/04_feature_engineering.ipynb
+++ b/notebooks/04_feature_engineering.ipynb
--- a/notebooks/05_predictive_modelling.ipynb
+++ b/notebooks/05_predictive_modelling.ipynb
--- a/notebooks/06_deep_learning.ipynb
+++ b/notebooks/06_deep_learning.ipynb
--- a/sql/01_data_exploration.sql
+++ b/sql/01_data_exploration.sql
@@ -402,6 +402,20 @@ id|product_id|name|x |y  |mirrored_hole_id|mirror_group|
 * With the TB1 and TB2 Mirror, you can mirror climbs. So the mirrored_hole_id must be where the associated mirror hole is.
 * Not sure about the mirror_group.
 *
+ * Let's also take a look at our range of holes.
+ */
+
+SELECT
+    MIN(x) AS x_min,
+    MAX(x) AS x_max,
+    MIN(y) AS y_min,
+    MAX(y) AS y_max
+FROM holes
+WHERE product_id = 5;
+/*
+x_min|x_max|y_min|y_max|
+-----+-----+-----+-----+
+  -64|   64|    4|  140|
+ *
 * Let's look at sets next.
 */