{ "cells": [ { "cell_type": "markdown", "id": "84d328e0", "metadata": {}, "source": [ "# 04 — Generated Route Evaluation\n", "\n", "## Why evaluate generated routes?\n", "\n", "In language modeling, we evaluate generated text using metrics like BLEU, ROUGE, or perplexity. For climbing routes, we need domain-specific evaluation:\n", "\n", "1. **Validity**: Does the route follow the rules of climbing boards?\n", "2. **Novelty**: Is the route different from existing climbs, or just a copy?\n", "3. **Geometric plausibility**: Are the holds in reasonable positions?\n", "4. **Grade consistency**: Does the route's predicted grade match the requested grade?\n", "\n", "### Validity checks\n", "\n", "A \"basic valid\" route must have:\n", "- At least 3 holds (you need at least 2 hands + 1 foot to climb)\n", "- No duplicate placements (you can't use the same hold twice)\n", "- At least one start hold and one finish hold\n", "- All holds from the same board (no mixing TB2 and Kilter holds)\n", "\n", "A \"strict valid\" route additionally has:\n", "- At least one middle hold (most real climbs have more than just start + finish)\n", "- At least 4 holds total\n", "\n", "### Novelty metrics\n", "\n", "We measure novelty using **Jaccard distance**: 1 minus the Jaccard similarity between the generated route's hold set and the most similar real route's hold set.\n", "\n", "- Jaccard similarity = |A intersection B| / |A union B|\n", "- Novelty distance = 1 - Jaccard similarity\n", "\n", "A novelty distance of 1.0 means the generated route shares no holds with any real route. A distance of 0.0 means it's identical to an existing route." 
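, "\n", "\n", "As a small worked example (the placement IDs are made up for illustration), take a generated route with hold set $A = \\lbrace 12, 40, 77, 103 \\rbrace$ and a real route on the same board with hold set $B = \\lbrace 12, 40, 88 \\rbrace$:\n", "\n", "$$J(A, B) = \\frac{|A \\cap B|}{|A \\cup B|} = \\frac{|\\lbrace 12, 40 \\rbrace|}{|\\lbrace 12, 40, 77, 88, 103 \\rbrace|} = \\frac{2}{5} = 0.4, \\qquad d(A, B) = 1 - J(A, B) = 0.6$$\n", "\n", "The novelty distance reported for a generated route is this $d$ taken against its most similar real route, that is, the smallest $d$ over the real routes it is compared with."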
] }, { "cell_type": "code", "execution_count": null, "id": "726b846f", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import sys\n", "import numpy as np\n", "import pandas as pd\n", "import torch\n", "\n", "ROOT = Path.cwd().resolve()\n", "if ROOT.name == \"notebooks\":\n", "    ROOT = ROOT.parent\n", "sys.path.insert(0, str(ROOT / \"src\"))\n", "\n", "from climbingboardgpt.evaluation import (\n", "    build_placement_coords,\n", "    frames_to_holds,\n", "    holds_to_placement_set,\n", "    nearest_real_route_same_board,\n", "    parse_token_list,\n", "    simple_route_features,\n", "    tokens_to_hold_records,\n", "    validity_from_records,\n", ")\n", "from climbingboardgpt.grades import to_grouped_v\n", "from climbingboardgpt.models import JointRouteTransformerRegressor" ] },
{ "cell_type": "code", "execution_count": null, "id": "7f8bb61f", "metadata": {}, "outputs": [], "source": [ "# Load generated routes and real routes for comparison\n", "# NOTE: This notebook requires that you've run notebook 03 first to\n", "# generate and save the routes.\n", "\n", "TOKENIZED = ROOT / \"data\" / \"processed\" / \"tokenized\"\n", "GENERATED = ROOT / \"data\" / \"processed\" / \"generation\"\n", "\n", "# Check if required files exist\n", "generated_path = GENERATED / \"generated_routes.csv\"\n", "routes_path = TOKENIZED / \"route_sequences.csv\"\n", "token_meta_path = TOKENIZED / \"token_metadata.csv\"\n", "\n", "if not generated_path.exists():\n", "    raise FileNotFoundError(\n", "        f\"Generated routes not found at: {generated_path}\\n\"\n", "        f\"Please run notebook 03 first to generate and save routes,\\n\"\n", "        f\"or run: python scripts/03_train_route_generator.py\"\n", "    )\n", "\n", "if not routes_path.exists() or not token_meta_path.exists():\n", "    raise FileNotFoundError(\n", "        f\"Tokenized data not found at: {TOKENIZED}\\n\"\n", "        f\"Please run notebook 01 first to tokenize routes,\\n\"\n", "        f\"or run: python scripts/01_tokenize_routes.py\"\n", "    )\n", "\n", "df_generated = pd.read_csv(generated_path)\n", "df_real = pd.read_csv(routes_path)\n", "df_token_meta = pd.read_csv(token_meta_path)\n", "\n", "print(f\"Generated routes: {len(df_generated):,}\")\n", "print(f\"Real routes: {len(df_real):,}\")" ] },
{ "cell_type": "markdown", "id": "0091bafb", "metadata": {}, "source": [ "## Parse generated tokens and check validity\n", "\n", "We parse the generated token sequences and check each route for validity." ] },
{ "cell_type": "code", "execution_count": null, "id": "f5c2b25a", "metadata": {}, "outputs": [], "source": [ "# Parse the token strings into structured records\n", "df_generated[\"tokens_parsed\"] = df_generated[\"tokens\"].apply(parse_token_list)\n", "\n", "# Extract hold information from tokens\n", "df_generated[\"hold_records\"] = df_generated[\"tokens_parsed\"].apply(tokens_to_hold_records)\n", "\n", "# Check validity for each generated route\n", "validity = pd.DataFrame(df_generated[\"hold_records\"].apply(validity_from_records).tolist())\n", "df_eval = pd.concat([df_generated.reset_index(drop=True), validity], axis=1)\n", "\n", "print(\"Validity rates by board:\")\n", "print(\"=\" * 50)\n", "validity_summary = df_eval.groupby(\"board_key\").agg(\n", "    total=(\"basic_valid_eval\", \"count\"),\n", "    basic_valid_rate=(\"basic_valid_eval\", \"mean\"),\n", "    strict_valid_rate=(\"strict_valid_eval\", \"mean\"),\n", "    avg_holds=(\"n_holds_eval\", \"mean\"),\n", ").round(3)\n", "print(validity_summary)" ] },
{ "cell_type": "markdown", "id": "0cf2170e", "metadata": {}, "source": [ "## Novelty against real climbs\n", "\n", "For each generated route, we find the most similar real route from the same board (by Jaccard similarity of hold sets). A good generator should produce routes that are novel (low Jaccard similarity to existing routes) while still being valid." 
] }, { "cell_type": "code", "execution_count": null, "id": "e7f34524", "metadata": {}, "outputs": [], "source": [ "# Convert hold sets to frozensets for fast comparison\n", "df_eval[\"hold_set\"] = df_eval[\"hold_records\"].apply(\n", "    lambda records: frozenset(int(record[\"placement_id\"]) for record in records)\n", ")\n", "\n", "# Parse real routes' frames strings into hold sets\n", "df_real[\"real_holds\"] = df_real[\"frames\"].apply(frames_to_holds)\n", "df_real[\"hold_set\"] = df_real[\"real_holds\"].apply(holds_to_placement_set)\n", "\n", "# Find nearest real route for each generated route\n", "print(\"Computing novelty (finding nearest real route for each generated route)...\")\n", "print(\"This may take a few minutes...\")\n", "\n", "nearest = pd.DataFrame(\n", "    df_eval.apply(\n", "        lambda row: nearest_real_route_same_board(\n", "            generated_set=row[\"hold_set\"],\n", "            generated_board_key=row[\"board_key\"],\n", "            real_df=df_real,\n", "        ),\n", "        axis=1,\n", "    ).tolist()\n", ")\n", "df_eval = pd.concat([df_eval, nearest], axis=1)\n", "\n", "print(\"\\nNovelty statistics by board:\")\n", "print(\"=\" * 50)\n", "novelty_summary = df_eval.groupby(\"board_key\").agg(\n", "    mean_jaccard=(\"nearest_real_jaccard\", \"mean\"),\n", "    mean_novelty=(\"novelty_distance\", \"mean\"),\n", "    median_novelty=(\"novelty_distance\", \"median\"),\n", ").round(3)\n", "print(novelty_summary)" ] },
{ "cell_type": "markdown", "id": "b581705d", "metadata": {}, "source": [ "## Geometric descriptors\n", "\n", "We compute simple geometric features for each generated route:\n", "\n", "- `geom_n_holds`: Number of holds\n", "- `geom_height`: Vertical extent of the route\n", "- `geom_width`: Horizontal extent\n", "- `geom_mean_hand_reach`: Average distance between hand holds\n", "\n", "These features help us understand whether generated routes have reasonable spatial properties." 
] }, { "cell_type": "code", "execution_count": null, "id": "d74d4cad", "metadata": {}, "outputs": [], "source": [ "# Build coordinate lookup from token metadata\n", "coords = build_placement_coords(df_token_meta)\n", "\n", "# Compute geometric features for each generated route\n", "geom = pd.DataFrame(\n", "    df_eval.apply(\n", "        lambda row: simple_route_features(\n", "            board_key=row[\"board_key\"],\n", "            records=row[\"hold_records\"],\n", "            placement_coords=coords,\n", "        ),\n", "        axis=1,\n", "    ).tolist()\n", ")\n", "df_eval = pd.concat([df_eval, geom], axis=1)\n", "\n", "print(\"Geometric feature statistics by board:\")\n", "print(\"=\" * 50)\n", "geom_summary = df_eval.groupby(\"board_key\").agg(\n", "    mean_holds=(\"geom_n_holds\", \"mean\"),\n", "    mean_height=(\"geom_height\", \"mean\"),\n", "    mean_width=(\"geom_width\", \"mean\"),\n", "    mean_hand_reach=(\"geom_mean_hand_reach\", \"mean\"),\n", ").round(3)\n", "print(geom_summary)" ] },
{ "cell_type": "markdown", "id": "4455557a", "metadata": {}, "source": [ "## Grade consistency (using the trained critic)\n", "\n", "If we have a trained grade predictor (from notebook 02), we can use it as a **critic** to check whether generated routes have grades consistent with what was requested.\n", "\n", "This is loosely analogous to the discriminator in a GAN, which also scores generated samples. Two differences matter: our critic is a regression model that predicts a difficulty value rather than a binary real/fake classifier, and in this notebook it is only used to score routes after they have been generated." 
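, "\n", "\n", "One compact way to write the consistency metrics reported in the cells below is the following (the symbols are introduced here for illustration and do not appear in the code). Let $\\hat{v}_i$ be the critic's predicted grouped V-grade for generated route $i$, $v_i$ the requested grouped V-grade, and $N$ the number of generated routes on a board:\n", "\n", "$$e_i = \\hat{v}_i - v_i, \\qquad \\text{within-}k = \\frac{100}{N} \\sum_{i=1}^{N} \\mathbf{1}\\left[ |e_i| \\le k \\right], \\quad k \\in \\lbrace 0, 1, 2 \\rbrace$$\n", "\n", "Here $k = 0$ is the exact-match rate. A positive mean error means the critic rates the generated routes as harder than requested on average, and a negative one means easier. For example, errors of $(0, 1, -2, 3)$ give an exact-match rate of 25%, within-1 of 50%, within-2 of 75%, and a mean error of 0.5."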
] }, { "cell_type": "code", "execution_count": null, "id": "88747d6e", "metadata": {}, "outputs": [], "source": [ "# Try to load the grade critic from notebook 02\n", "GRADE_MODEL_PATH = ROOT / \"models\" / \"joint_transformer_grade_predictor.pth\"\n", "\n", "# ASSUMPTION: special-token spellings. They are not confirmed by this notebook\n", "# and must match the tokenizer vocabulary built in notebook 01.\n", "PAD_TOKEN = \"<PAD>\"\n", "UNK_TOKEN = \"<UNK>\"\n", "BOS_TOKEN = \"<BOS>\"\n", "CLS_TOKEN = \"<CLS>\"\n", "GRADE_TOKEN_PREFIX = \"<GRADE\"  # assumed prefix shared by all grade tokens\n", "\n", "def load_grade_critic(model_path, device):\n", "    \"\"\"Load the trained grade predictor model.\"\"\"\n", "    if not model_path.exists():\n", "        return None\n", "    try:\n", "        checkpoint = torch.load(model_path, map_location=device, weights_only=False)\n", "    except TypeError:\n", "        # Older PyTorch versions do not accept the weights_only argument\n", "        checkpoint = torch.load(model_path, map_location=device)\n", "\n", "    cfg = checkpoint[\"config\"]\n", "    stoi = {str(k): int(v) for k, v in checkpoint[\"stoi\"].items()}\n", "    coord_features = checkpoint[\"coord_features\"]\n", "    if not isinstance(coord_features, torch.Tensor):\n", "        coord_features = torch.tensor(coord_features, dtype=torch.float32)\n", "\n", "    model = JointRouteTransformerRegressor(\n", "        vocab_size=cfg[\"vocab_size\"],\n", "        max_len=cfg[\"max_len\"],\n", "        coord_features=coord_features,\n", "        d_model=cfg.get(\"d_model\", 128),\n", "        nhead=cfg.get(\"nhead\", 4),\n", "        num_layers=cfg.get(\"num_layers\", 4),\n", "        dim_feedforward=cfg.get(\"dim_feedforward\", 256),\n", "        dropout=cfg.get(\"dropout\", 0.10),\n", "        pad_id=cfg.get(\"pad_id\", stoi[PAD_TOKEN]),\n", "    ).to(device)\n", "    model.load_state_dict(checkpoint[\"model_state_dict\"])\n", "    model.eval()\n", "\n", "    return {\n", "        \"model\": model,\n", "        \"stoi\": stoi,\n", "        \"pad_id\": stoi[PAD_TOKEN],\n", "        \"unk_id\": stoi[UNK_TOKEN],\n", "        \"max_len\": cfg[\"max_len\"],\n", "    }\n", "\n", "\n", "def predict_generated_grade(tokens, critic, device):\n", "    \"\"\"Predict the difficulty of a generated route using the critic.\"\"\"\n", "    model = critic[\"model\"]\n", "    stoi = critic[\"stoi\"]\n", "    pad_id = critic[\"pad_id\"]\n", "    unk_id = critic[\"unk_id\"]\n", "    max_len = critic[\"max_len\"]\n", "\n", "    # Remove grade tokens and replace BOS with CLS\n", "    tokens = [t for t in tokens if not t.startswith(GRADE_TOKEN_PREFIX)]\n", "    if tokens and tokens[0] == BOS_TOKEN:\n", "        tokens = [CLS_TOKEN] + tokens[1:]\n", "    else:\n", "        tokens = [CLS_TOKEN] + tokens\n", "\n", "    # Map tokens to ids, truncate to max_len, then right-pad with a matching mask\n", "    ids = [stoi.get(t, unk_id) for t in tokens][:max_len]\n", "    mask = [1] * len(ids)\n", "    if len(ids) < max_len:\n", "        pad_n = max_len - len(ids)\n", "        ids += [pad_id] * pad_n\n", "        mask += [0] * pad_n\n", "\n", "    with torch.no_grad():\n", "        input_ids = torch.tensor([ids], dtype=torch.long, device=device)\n", "        attention_mask = torch.tensor([mask], dtype=torch.bool, device=device)\n", "        return float(model(input_ids, attention_mask).cpu().item())\n", "\n", "\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "critic = load_grade_critic(GRADE_MODEL_PATH, device)\n", "\n", "if critic is not None:\n", "    print(\"Grade critic loaded successfully!\")\n", "    print(f\"Device: {device}\")\n", "else:\n", "    print(\"No trained grade critic found. Skipping critic-based scoring.\")\n", "    print(\"Run notebook 02 first to train the grade predictor.\")" ] },
{ "cell_type": "code", "execution_count": null, "id": "critic_eval", "metadata": {}, "outputs": [], "source": [ "# Apply the critic to evaluate grade consistency\n", "if critic is not None:\n", "    df_eval[\"critic_pred_display_difficulty\"] = df_eval[\"tokens_parsed\"].apply(\n", "        lambda tokens: predict_generated_grade(tokens, critic, device)\n", "    )\n", "    df_eval[\"critic_pred_grouped_v\"] = df_eval[\"critic_pred_display_difficulty\"].apply(to_grouped_v)\n", "    df_eval[\"critic_v_error\"] = df_eval[\"critic_pred_grouped_v\"] - df_eval[\"requested_grouped_v\"]\n", "\n", "    print(\"Grade consistency by board:\")\n", "    print(\"=\" * 50)\n", "    critic_summary = df_eval.groupby(\"board_key\").agg(\n", "        exact_v=(\"critic_v_error\", lambda s: float((s == 0).mean() * 100)),\n", "        within_1_v=(\"critic_v_error\", lambda s: float((s.abs() <= 1).mean() * 100)),\n", "        within_2_v=(\"critic_v_error\", lambda s: float((s.abs() <= 2).mean() * 100)),\n", "        mean_error=(\"critic_v_error\", \"mean\"),\n", "    ).round(2)\n", "    print(critic_summary)\n", "else:\n", "    print(\"Skipping critic evaluation (no model loaded).\")" ] },
{ "cell_type": "markdown", "id": "ranking", "metadata": {}, "source": [ "## Ranking generated routes\n", "\n", "We rank candidates by a composite score. Invalid routes are not filtered out; they simply score lower. The score adds up:\n", "- **Basic validity** (+2.0): At least 3 holds, start/finish, no duplicates, one board\n", "- **Strict validity** (+1.0 bonus): Also has middle holds and 4+ holds\n", "- **Novelty** (+0 to 1, higher is better): Jaccard distance from the nearest real route\n", "- **Grade consistency** (if critic available): +1.0 when the predicted grade is within one V-grade of the requested grade, minus 0.25 per V-grade of absolute error" ] },
{ "cell_type": "code", "execution_count": null, "id": "88747d6e2", "metadata": {}, "outputs": [], "source": [ "# Rank candidates by composite score\n", "ranked = df_eval.copy()\n", "ranked[\"score\"] = 0.0\n", "ranked[\"score\"] += ranked[\"basic_valid_eval\"].astype(float) * 2.0\n", "ranked[\"score\"] += ranked[\"strict_valid_eval\"].astype(float) * 1.0\n", "ranked[\"score\"] += ranked[\"novelty_distance\"].fillna(0.0)\n", "\n", "if \"critic_v_error\" in ranked.columns:\n", "    ranked[\"score\"] += (ranked[\"critic_v_error\"].abs() <= 1).astype(float)\n", "    ranked[\"score\"] -= 0.25 * ranked[\"critic_v_error\"].abs()\n", "\n", "print(\"Top 10 generated routes by composite score:\")\n", "print(\"=\" * 70)\n", "top_routes = ranked.sort_values(\"score\", ascending=False).head(10)\n", "display_cols = [\"board_key\", \"score\", \"basic_valid_eval\", \"strict_valid_eval\", \"novelty_distance\"]\n", "if \"critic_v_error\" in top_routes.columns:\n", "    display_cols.append(\"critic_v_error\")\n", "print(top_routes[display_cols].to_string(index=False))" ] },
{ "cell_type": "markdown", "id": "evaluation_summary", "metadata": {}, "source": [ "## Save evaluation results\n", "\n", "We save the full evaluation DataFrame and the top 100 candidates by composite score for further analysis." 
] }, { "cell_type": "code", "execution_count": null, "id": "save_results", "metadata": {}, "outputs": [], "source": [ "# Save evaluation results\n", "OUT_DIR = ROOT / \"data\" / \"processed\" / \"evaluation\"\n", "OUT_DIR.mkdir(parents=True, exist_ok=True)\n", "\n", "df_eval.to_csv(OUT_DIR / \"generated_route_evaluation.csv\", index=False)\n", "top_candidates = ranked.sort_values(\"score\", ascending=False).head(100)\n", "top_candidates.to_csv(OUT_DIR / \"top_generated_candidates.csv\", index=False)\n", "\n", "print(f\"Saved evaluation results to: {OUT_DIR}\")\n", "print(f\"  - generated_route_evaluation.csv ({len(df_eval)} rows)\")\n", "# Report the actual row count: fewer than 100 routes may have been generated\n", "print(f\"  - top_generated_candidates.csv ({len(top_candidates)} rows)\")\n", "\n", "print(\"\\n\" + \"=\" * 50)\n", "print(\"EVALUATION SUMMARY\")\n", "print(\"=\" * 50)\n", "print(f\"\\nTotal generated routes: {len(df_eval):,}\")\n", "print(f\"\\nBasic validity rate: {df_eval['basic_valid_eval'].mean():.1%}\")\n", "print(f\"Strict validity rate: {df_eval['strict_valid_eval'].mean():.1%}\")\n", "print(f\"Mean novelty distance: {df_eval['novelty_distance'].mean():.3f}\")\n", "\n", "if 'critic_v_error' in df_eval.columns:\n", "    print(\"\\nGrade consistency:\")\n", "    print(f\"  Exact V-grade: {(df_eval['critic_v_error'] == 0).mean():.1%}\")\n", "    print(f\"  Within 1 V-grade: {(df_eval['critic_v_error'].abs() <= 1).mean():.1%}\")\n", "    print(f\"  Within 2 V-grades: {(df_eval['critic_v_error'].abs() <= 2).mean():.1%}\")\n", "else:\n", "    print(\"\\n(Grade consistency not available - no critic model loaded)\")" ] },
{ "cell_type": "markdown", "id": "takeaways", "metadata": {}, "source": [ "## Key Takeaways\n", "\n", "1. **Validity**: The basic and strict validity rates show how often the generator satisfies the structural constraints (start/finish holds, no duplicates, single board).\n", "\n", "2. **Novelty**: The Jaccard distance to the nearest real route on the same board shows whether generated routes are meaningfully different from existing routes or close to copies.\n", "\n", "3. **Geometric plausibility**: The geometric features (height, width, hand reach) should be in reasonable ranges compared to real routes.\n", "\n", "4. **Grade consistency**: If the critic is available, we can check whether routes generated at a requested grade are predicted to be at that grade.\n", "\n", "### Limitations\n", "\n", "- Validity checks are structural, not semantic. A route might have valid start/finish holds but still be impossible to climb.\n", "- Geometric features are simple. More sophisticated analysis could check reachability and move sequences.\n", "- The critic model was trained on real data, so it may not generalize well to novel route structures, and its grade predictions for generated routes should be read with that in mind." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.14.4" } }, "nbformat": 4, "nbformat_minor": 5 }