ClimbingBoardGPT/notebooks/04_generated_route_evaluation.ipynb
2026-05-21 07:21:13 -04:00

{
"cells": [
{
"cell_type": "markdown",
"id": "84d328e0",
"metadata": {},
"source": [
"# 04 — Generated Route Evaluation\n",
"\n",
"## Why evaluate generated routes?\n",
"\n",
"In language modeling, we evaluate generated text using metrics like BLEU, ROUGE, or perplexity. For climbing routes, we need domain-specific evaluation:\n",
"\n",
"1. **Validity**: Does the route follow the rules of climbing boards?\n",
"2. **Novelty**: Is the route different from existing climbs, or just a copy?\n",
"3. **Geometric plausibility**: Are the holds in reasonable positions?\n",
"4. **Grade consistency**: Does the route's predicted grade match the requested grade?\n",
"\n",
"### Validity checks\n",
"\n",
"A \"basic valid\" route must have:\n",
"- At least 3 holds (you need at least 2 hands + 1 foot to climb)\n",
"- No duplicate placements (you can't use the same hold twice)\n",
"- At least one start hold and one finish hold\n",
"- All holds from the same board (no mixing TB2 and Kilter holds)\n",
"\n",
"A \"strict valid\" route additionally has:\n",
"- At least one middle hold (most real climbs have more than just start + finish)\n",
"- At least 4 holds total\n",
"\n",
"### Novelty metrics\n",
"\n",
"We measure novelty using **Jaccard distance**: 1 minus the Jaccard similarity between the generated route's hold set and the most similar real route's hold set.\n",
"\n",
"- Jaccard similarity = |A intersection B| / |A union B|\n",
"- Novelty distance = 1 - Jaccard similarity\n",
"\n",
"A novelty distance of 1.0 means the generated route shares no holds with any real route. A distance of 0.0 means it's identical to an existing route."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "726b846f",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import sys\n",
"import numpy as np\n",
"import pandas as pd\n",
"import torch\n",
"\n",
"ROOT = Path.cwd().resolve()\n",
"if ROOT.name == \"notebooks\":\n",
" ROOT = ROOT.parent\n",
"sys.path.insert(0, str(ROOT / \"src\"))\n",
"\n",
"from climbingboardgpt.evaluation import (\n",
" build_placement_coords,\n",
" frames_to_holds,\n",
" holds_to_placement_set,\n",
" nearest_real_route_same_board,\n",
" parse_token_list,\n",
" simple_route_features,\n",
" tokens_to_hold_records,\n",
" validity_from_records,\n",
")\n",
"from climbingboardgpt.grades import to_grouped_v\n",
"from climbingboardgpt.models import JointRouteTransformerRegressor"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f8bb61f",
"metadata": {},
"outputs": [],
"source": [
"# Load generated routes and real routes for comparison\n",
"# NOTE: This notebook requires that you've run notebook 03 first to\n",
"# generate and save the routes.\n",
"\n",
"TOKENIZED = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
"GENERATED = ROOT / \"data\" / \"processed\" / \"generation\"\n",
"\n",
"# Check if required files exist\n",
"generated_path = GENERATED / \"generated_routes.csv\"\n",
"routes_path = TOKENIZED / \"route_sequences.csv\"\n",
"token_meta_path = TOKENIZED / \"token_metadata.csv\"\n",
"\n",
"if not generated_path.exists():\n",
" raise FileNotFoundError(\n",
" f\"Generated routes not found at: {generated_path}\\n\"\n",
" f\"Please run notebook 03 first to generate and save routes,\\n\"\n",
" f\"or run: python scripts/03_train_route_generator.py\"\n",
" )\n",
"\n",
"if not routes_path.exists() or not token_meta_path.exists():\n",
" raise FileNotFoundError(\n",
" f\"Tokenized data not found at: {TOKENIZED}\\n\"\n",
" f\"Please run notebook 01 first to tokenize routes,\\n\"\n",
" f\"or run: python scripts/01_tokenize_routes.py\"\n",
" )\n",
"\n",
"df_generated = pd.read_csv(generated_path)\n",
"df_real = pd.read_csv(routes_path)\n",
"df_token_meta = pd.read_csv(token_meta_path)\n",
"\n",
"print(f\"Generated routes: {len(df_generated):,}\")\n",
"print(f\"Real routes: {len(df_real):,}\")"
]
},
{
"cell_type": "markdown",
"id": "0091bafb",
"metadata": {},
"source": [
"## Parse generated tokens and check validity\n",
"\n",
"We parse the generated token sequences and check each route for validity."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5c2b25a",
"metadata": {},
"outputs": [],
"source": [
"# Parse the token strings into structured records\n",
"df_generated[\"tokens_parsed\"] = df_generated[\"tokens\"].apply(parse_token_list)\n",
"\n",
"# Extract hold information from tokens\n",
"df_generated[\"hold_records\"] = df_generated[\"tokens_parsed\"].apply(tokens_to_hold_records)\n",
"\n",
"# Check validity for each generated route\n",
"validity = pd.DataFrame(df_generated[\"hold_records\"].apply(validity_from_records).tolist())\n",
"df_eval = pd.concat([df_generated.reset_index(drop=True), validity], axis=1)\n",
"\n",
"print(\"Validity rates by board:\")\n",
"print(\"=\" * 50)\n",
"validity_summary = df_eval.groupby(\"board_key\").agg(\n",
" total=(\"basic_valid_eval\", \"count\"),\n",
" basic_valid_rate=(\"basic_valid_eval\", \"mean\"),\n",
" strict_valid_rate=(\"strict_valid_eval\", \"mean\"),\n",
" avg_holds=(\"n_holds_eval\", \"mean\"),\n",
").round(3)\n",
"print(validity_summary)"
]
},
{
"cell_type": "markdown",
"id": "0cf2170e",
"metadata": {},
"source": [
"## Novelty against real climbs\n",
"\n",
"For each generated route, we find the most similar real route from the same board (by Jaccard similarity of hold sets). A good generator should produce routes that are novel (low Jaccard similarity to existing routes) while still being valid."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7f34524",
"metadata": {},
"outputs": [],
"source": [
"# Convert hold sets to frozensets for fast comparison\n",
"df_eval[\"hold_set\"] = df_eval[\"hold_records\"].apply(\n",
" lambda records: frozenset(int(record[\"placement_id\"]) for record in records)\n",
")\n",
"\n",
"# Parse real routes' frames strings into hold sets\n",
"df_real[\"real_holds\"] = df_real[\"frames\"].apply(frames_to_holds)\n",
"df_real[\"hold_set\"] = df_real[\"real_holds\"].apply(holds_to_placement_set)\n",
"\n",
"# Find nearest real route for each generated route\n",
"print(\"Computing novelty (finding nearest real route for each generated route)...\")\n",
"print(\"This may take a few minutes...\")\n",
"\n",
"nearest = pd.DataFrame(\n",
" df_eval.apply(\n",
" lambda row: nearest_real_route_same_board(\n",
" generated_set=row[\"hold_set\"],\n",
" generated_board_key=row[\"board_key\"],\n",
" real_df=df_real,\n",
" ),\n",
" axis=1,\n",
" ).tolist()\n",
")\n",
"df_eval = pd.concat([df_eval, nearest], axis=1)\n",
"\n",
"print(\"\\nNovelty statistics by board:\")\n",
"print(\"=\" * 50)\n",
"novelty_summary = df_eval.groupby(\"board_key\").agg(\n",
" mean_jaccard=(\"nearest_real_jaccard\", \"mean\"),\n",
" mean_novelty=(\"novelty_distance\", \"mean\"),\n",
" median_novelty=(\"novelty_distance\", \"median\"),\n",
").round(3)\n",
"print(novelty_summary)"
]
},
{
"cell_type": "markdown",
"id": "b581705d",
"metadata": {},
"source": [
"## Geometric descriptors\n",
"\n",
"We compute simple geometric features for each generated route:\n",
"\n",
"- `geom_n_holds`: Number of holds\n",
"- `geom_height`: Vertical extent of the route\n",
"- `geom_width`: Horizontal extent\n",
"- `geom_mean_hand_reach`: Average distance between hand holds\n",
"\n",
"These features help us understand whether generated routes have reasonable spatial properties."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d74d4cad",
"metadata": {},
"outputs": [],
"source": [
"# Build coordinate lookup from token metadata\n",
"coords = build_placement_coords(df_token_meta)\n",
"\n",
"# Compute geometric features for each generated route\n",
"geom = pd.DataFrame(\n",
" df_eval.apply(\n",
" lambda row: simple_route_features(\n",
" board_key=row[\"board_key\"],\n",
" records=row[\"hold_records\"],\n",
" placement_coords=coords,\n",
" ),\n",
" axis=1,\n",
" ).tolist()\n",
")\n",
"df_eval = pd.concat([df_eval, geom], axis=1)\n",
"\n",
"print(\"Geometric feature statistics by board:\")\n",
"print(\"=\" * 50)\n",
"geom_summary = df_eval.groupby(\"board_key\").agg(\n",
" mean_holds=(\"geom_n_holds\", \"mean\"),\n",
" mean_height=(\"geom_height\", \"mean\"),\n",
" mean_width=(\"geom_width\", \"mean\"),\n",
" mean_hand_reach=(\"geom_mean_hand_reach\", \"mean\"),\n",
").round(3)\n",
"print(geom_summary)"
]
},
{
"cell_type": "markdown",
"id": "4455557a",
"metadata": {},
"source": [
"## Grade consistency (using the trained critic)\n",
"\n",
"If we have a trained grade predictor (from notebook 02), we can use it as a **critic** to check whether generated routes have grades consistent with what was requested.\n",
"\n",
"This is similar to how GANs use a discriminator to evaluate generated samples, except our critic is a regression model rather than a binary classifier."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88747d6e",
"metadata": {},
"outputs": [],
"source": [
"# Try to load the grade critic from notebook 02\n",
"GRADE_MODEL_PATH = ROOT / \"models\" / \"joint_transformer_grade_predictor.pth\"\n",
"\n",
"def load_grade_critic(model_path, device):\n",
" \"\"\"Load the trained grade predictor model.\"\"\"\n",
" if not model_path.exists():\n",
" return None\n",
" try:\n",
" checkpoint = torch.load(model_path, map_location=device, weights_only=False)\n",
" except TypeError:\n",
" checkpoint = torch.load(model_path, map_location=device)\n",
"\n",
" cfg = checkpoint[\"config\"]\n",
" stoi = {str(k): int(v) for k, v in checkpoint[\"stoi\"].items()}\n",
" coord_features = checkpoint[\"coord_features\"]\n",
" if not isinstance(coord_features, torch.Tensor):\n",
" coord_features = torch.tensor(coord_features, dtype=torch.float32)\n",
"\n",
" model = JointRouteTransformerRegressor(\n",
" vocab_size=cfg[\"vocab_size\"],\n",
" max_len=cfg[\"max_len\"],\n",
" coord_features=coord_features,\n",
" d_model=cfg.get(\"d_model\", 128),\n",
" nhead=cfg.get(\"nhead\", 4),\n",
" num_layers=cfg.get(\"num_layers\", 4),\n",
" dim_feedforward=cfg.get(\"dim_feedforward\", 256),\n",
" dropout=cfg.get(\"dropout\", 0.10),\n",
" pad_id=cfg.get(\"pad_id\", stoi[\"<PAD>\"]),\n",
" ).to(device)\n",
" model.load_state_dict(checkpoint[\"model_state_dict\"])\n",
" model.eval()\n",
"\n",
" return {\n",
" \"model\": model,\n",
" \"stoi\": stoi,\n",
" \"pad_id\": stoi[\"<PAD>\"],\n",
" \"unk_id\": stoi[\"<UNK>\"],\n",
" \"max_len\": cfg[\"max_len\"],\n",
" }\n",
"\n",
"\n",
"def predict_generated_grade(tokens, critic, device):\n",
" \"\"\"Predict the difficulty of a generated route using the critic.\"\"\"\n",
" model = critic[\"model\"]\n",
" stoi = critic[\"stoi\"]\n",
" pad_id = critic[\"pad_id\"]\n",
" unk_id = critic[\"unk_id\"]\n",
" max_len = critic[\"max_len\"]\n",
"\n",
" # Remove grade tokens and replace BOS with CLS\n",
" tokens = [t for t in tokens if not t.startswith(\"<GRADE_\")]\n",
" if tokens and tokens[0] == \"<BOS>\":\n",
" tokens = [\"<CLS>\"] + tokens[1:]\n",
" else:\n",
" tokens = [\"<CLS>\"] + tokens\n",
"\n",
" ids = [stoi.get(t, unk_id) for t in tokens][:max_len]\n",
" mask = [1] * len(ids)\n",
" if len(ids) < max_len:\n",
" pad_n = max_len - len(ids)\n",
" ids += [pad_id] * pad_n\n",
" mask += [0] * pad_n\n",
"\n",
" with torch.no_grad():\n",
" input_ids = torch.tensor([ids], dtype=torch.long, device=device)\n",
" attention_mask = torch.tensor([mask], dtype=torch.bool, device=device)\n",
" return float(model(input_ids, attention_mask).cpu().item())\n",
"\n",
"\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"critic = load_grade_critic(GRADE_MODEL_PATH, device)\n",
"\n",
"if critic is not None:\n",
" print(\"Grade critic loaded successfully!\")\n",
" print(f\"Device: {device}\")\n",
"else:\n",
" print(\"No trained grade critic found. Skipping critic-based scoring.\")\n",
" print(\"Run notebook 02 first to train the grade predictor.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "critic_eval",
"metadata": {},
"outputs": [],
"source": [
"# Apply the critic to evaluate grade consistency\n",
"if critic is not None:\n",
" df_eval[\"critic_pred_display_difficulty\"] = df_eval[\"tokens_parsed\"].apply(\n",
" lambda tokens: predict_generated_grade(tokens, critic, device)\n",
" )\n",
" df_eval[\"critic_pred_grouped_v\"] = df_eval[\"critic_pred_display_difficulty\"].apply(to_grouped_v)\n",
" df_eval[\"critic_v_error\"] = df_eval[\"critic_pred_grouped_v\"] - df_eval[\"requested_grouped_v\"]\n",
"\n",
" print(\"Grade consistency by board:\")\n",
" print(\"=\" * 50)\n",
" critic_summary = df_eval.groupby(\"board_key\").agg(\n",
" exact_v=(\"critic_v_error\", lambda s: float((s == 0).mean() * 100)),\n",
" within_1_v=(\"critic_v_error\", lambda s: float((s.abs() <= 1).mean() * 100)),\n",
" within_2_v=(\"critic_v_error\", lambda s: float((s.abs() <= 2).mean() * 100)),\n",
" mean_error=(\"critic_v_error\", \"mean\"),\n",
" ).round(2)\n",
" print(critic_summary)\n",
"else:\n",
" print(\"Skipping critic evaluation (no model loaded).\")"
]
},
{
"cell_type": "markdown",
"id": "ranking",
"metadata": {},
"source": [
"## Ranking generated routes\n",
"\n",
"We rank candidates by a composite score that rewards:\n",
"- **Basic validity** (required): At least 3 holds, start/finish, no duplicates, one board\n",
"- **Strict validity** (bonus): Also has middle holds and 4+ holds\n",
"- **Novelty** (higher is better): Distance from nearest real route\n",
"- **Grade consistency** (if critic available): Predicted grade close to requested grade"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88747d6e2",
"metadata": {},
"outputs": [],
"source": [
"# Rank candidates by composite score\n",
"ranked = df_eval.copy()\n",
"ranked[\"score\"] = 0.0\n",
"ranked[\"score\"] += ranked[\"basic_valid_eval\"].astype(float) * 2.0\n",
"ranked[\"score\"] += ranked[\"strict_valid_eval\"].astype(float) * 1.0\n",
"ranked[\"score\"] += ranked[\"novelty_distance\"].fillna(0.0)\n",
"\n",
"if \"critic_v_error\" in ranked.columns:\n",
" ranked[\"score\"] += (ranked[\"critic_v_error\"].abs() <= 1).astype(float)\n",
" ranked[\"score\"] -= 0.25 * ranked[\"critic_v_error\"].abs()\n",
"\n",
"print(\"Top 10 generated routes by composite score:\")\n",
"print(\"=\" * 70)\n",
"top_routes = ranked.sort_values(\"score\", ascending=False).head(10)\n",
"display_cols = [\"board_key\", \"score\", \"basic_valid_eval\", \"strict_valid_eval\", \"novelty_distance\"]\n",
"if \"critic_v_error\" in top_routes.columns:\n",
" display_cols.append(\"critic_v_error\")\n",
"print(top_routes[display_cols].to_string(index=False))"
]
},
{
"cell_type": "markdown",
"id": "evaluation_summary",
"metadata": {},
"source": [
"## Save evaluation results\n",
"\n",
"We save the full evaluation DataFrame and the top candidates for further analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "save_results",
"metadata": {},
"outputs": [],
"source": [
"# Save evaluation results\n",
"OUT_DIR = ROOT / \"data\" / \"processed\" / \"evaluation\"\n",
"OUT_DIR.mkdir(parents=True, exist_ok=True)\n",
"\n",
"df_eval.to_csv(OUT_DIR / \"generated_route_evaluation.csv\", index=False)\n",
"top_candidates = ranked.sort_values(\"score\", ascending=False).head(100)\n",
"top_candidates.to_csv(OUT_DIR / \"top_generated_candidates.csv\", index=False)\n",
"\n",
"print(f\"Saved evaluation results to: {OUT_DIR}\")\n",
"print(f\" - generated_route_evaluation.csv ({len(df_eval)} rows)\")\n",
"print(f\" - top_generated_candidates.csv (100 rows)\")\n",
"\n",
"print(\"\\n\" + \"=\" * 50)\n",
"print(\"EVALUATION SUMMARY\")\n",
"print(\"=\" * 50)\n",
"print(f\"\\nTotal generated routes: {len(df_eval):,}\")\n",
"print(f\"\\nBasic validity rate: {df_eval['basic_valid_eval'].mean():.1%}\")\n",
"print(f\"Strict validity rate: {df_eval['strict_valid_eval'].mean():.1%}\")\n",
"print(f\"Mean novelty distance: {df_eval['novelty_distance'].mean():.3f}\")\n",
"\n",
"if 'critic_v_error' in df_eval.columns:\n",
" print(f\"\\nGrade consistency:\")\n",
" print(f\" Exact V-grade: {(df_eval['critic_v_error'] == 0).mean():.1%}\")\n",
" print(f\" Within 1 V-grade: {(df_eval['critic_v_error'].abs() <= 1).mean():.1%}\")\n",
" print(f\" Within 2 V-grades: {(df_eval['critic_v_error'].abs() <= 2).mean():.1%}\")\n",
"else:\n",
" print(\"\\n(Grade consistency not available - no critic model loaded)\")"
]
},
{
"cell_type": "markdown",
"id": "takeaways",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Validity**: The generator produces routes that mostly satisfy structural constraints (start/finish holds, no duplicates, single board).\n",
"\n",
"2. **Novelty**: Generated routes are meaningfully different from existing routes, as measured by Jaccard distance.\n",
"\n",
"3. **Geometric plausibility**: The geometric features (height, width, hand reach) should be in reasonable ranges compared to real routes.\n",
"\n",
"4. **Grade consistency**: If the critic is available, we can check whether routes generated at a requested grade actually feel like that grade.\n",
"\n",
"### Limitations\n",
"\n",
"- Validity checks are structural, not semantic. A route might have valid start/finish holds but still be impossible.\n",
"- Geometric features are simple. More sophisticated analysis could check reachability and move sequences.\n",
"- The critic model was trained on real data, so it may not generalize well to novel route structures."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}