ClimbingBoardGPT/notebooks/04_generated_route_evaluation.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "84d328e0",
   "metadata": {},
   "source": [
    "# 04 — Generated Route Evaluation\n",
    "\n",
    "## Why evaluate generated routes?\n",
    "\n",
    "In language modeling, we evaluate generated text using metrics like BLEU, ROUGE, or perplexity. For climbing routes, we need domain-specific evaluation:\n",
    "\n",
    "1. **Validity**: Does the route follow the rules of climbing boards?\n",
    "2. **Novelty**: Is the route different from existing climbs, or just a copy?\n",
    "3. **Geometric plausibility**: Are the holds in reasonable positions?\n",
    "4. **Grade consistency**: Does the route's predicted grade match the requested grade?\n",
    "\n",
    "### Validity checks\n",
    "\n",
    "A \"basic valid\" route must have:\n",
    "- At least 3 holds (you need at least 2 hands + 1 foot to climb)\n",
    "- No duplicate placements (you can't use the same hold twice)\n",
    "- At least one start hold and one finish hold\n",
    "- All holds from the same board (no mixing TB2 and Kilter holds)\n",
    "\n",
    "A \"strict valid\" route additionally has:\n",
    "- At least one middle hold (most real climbs have more than just start + finish)\n",
    "- At least 4 holds total\n",
    "\n",
    "### Novelty metrics\n",
    "\n",
    "We measure novelty using **Jaccard distance**: 1 minus the Jaccard similarity between the generated route's hold set and the most similar real route's hold set.\n",
    "\n",
    "- Jaccard similarity = |A intersection B| / |A union B|\n",
    "- Novelty distance = 1 - Jaccard similarity\n",
    "\n",
    "A novelty distance of 1.0 means the generated route shares no holds with any real route. A distance of 0.0 means it's identical to an existing route."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "726b846f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import sys\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import torch\n",
    "\n",
    "ROOT = Path.cwd().resolve()\n",
    "if ROOT.name == \"notebooks\":\n",
    "    ROOT = ROOT.parent\n",
    "sys.path.insert(0, str(ROOT / \"src\"))\n",
    "\n",
    "from climbingboardgpt.evaluation import (\n",
    "    build_placement_coords,\n",
    "    frames_to_holds,\n",
    "    holds_to_placement_set,\n",
    "    nearest_real_route_same_board,\n",
    "    parse_token_list,\n",
    "    simple_route_features,\n",
    "    tokens_to_hold_records,\n",
    "    validity_from_records,\n",
    ")\n",
    "from climbingboardgpt.grades import to_grouped_v\n",
    "from climbingboardgpt.models import JointRouteTransformerRegressor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f8bb61f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load generated routes and real routes for comparison\n",
    "# NOTE: This notebook requires that you've run notebook 03 first to\n",
    "# generate and save the routes.\n",
    "\n",
    "TOKENIZED = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
    "GENERATED = ROOT / \"data\" / \"processed\" / \"generation\"\n",
    "\n",
    "# Check if required files exist\n",
    "generated_path = GENERATED / \"generated_routes.csv\"\n",
    "routes_path = TOKENIZED / \"route_sequences.csv\"\n",
    "token_meta_path = TOKENIZED / \"token_metadata.csv\"\n",
    "\n",
    "if not generated_path.exists():\n",
    "    raise FileNotFoundError(\n",
    "        f\"Generated routes not found at: {generated_path}\\n\"\n",
    "        f\"Please run notebook 03 first to generate and save routes,\\n\"\n",
    "        f\"or run: python scripts/03_train_route_generator.py\"\n",
    "    )\n",
    "\n",
    "if not routes_path.exists() or not token_meta_path.exists():\n",
    "    raise FileNotFoundError(\n",
    "        f\"Tokenized data not found at: {TOKENIZED}\\n\"\n",
    "        f\"Please run notebook 01 first to tokenize routes,\\n\"\n",
    "        f\"or run: python scripts/01_tokenize_routes.py\"\n",
    "    )\n",
    "\n",
    "df_generated = pd.read_csv(generated_path)\n",
    "df_real = pd.read_csv(routes_path)\n",
    "df_token_meta = pd.read_csv(token_meta_path)\n",
    "\n",
    "print(f\"Generated routes: {len(df_generated):,}\")\n",
    "print(f\"Real routes: {len(df_real):,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0091bafb",
   "metadata": {},
   "source": [
    "## Parse generated tokens and check validity\n",
    "\n",
    "We parse the generated token sequences and check each route for validity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5c2b25a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parse the token strings into structured records\n",
    "df_generated[\"tokens_parsed\"] = df_generated[\"tokens\"].apply(parse_token_list)\n",
    "\n",
    "# Extract hold information from tokens\n",
    "df_generated[\"hold_records\"] = df_generated[\"tokens_parsed\"].apply(tokens_to_hold_records)\n",
    "\n",
    "# Check validity for each generated route\n",
    "validity = pd.DataFrame(df_generated[\"hold_records\"].apply(validity_from_records).tolist())\n",
    "df_eval = pd.concat([df_generated.reset_index(drop=True), validity], axis=1)\n",
    "\n",
    "print(\"Validity rates by board:\")\n",
    "print(\"=\" * 50)\n",
    "validity_summary = df_eval.groupby(\"board_key\").agg(\n",
    "    total=(\"basic_valid_eval\", \"count\"),\n",
    "    basic_valid_rate=(\"basic_valid_eval\", \"mean\"),\n",
    "    strict_valid_rate=(\"strict_valid_eval\", \"mean\"),\n",
    "    avg_holds=(\"n_holds_eval\", \"mean\"),\n",
    ").round(3)\n",
    "print(validity_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cf2170e",
   "metadata": {},
   "source": [
    "## Novelty against real climbs\n",
    "\n",
    "For each generated route, we find the most similar real route from the same board (by Jaccard similarity of hold sets). A good generator should produce routes that are novel (low Jaccard similarity to existing routes) while still being valid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7f34524",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert hold sets to frozensets for fast comparison\n",
    "df_eval[\"hold_set\"] = df_eval[\"hold_records\"].apply(\n",
    "    lambda records: frozenset(int(record[\"placement_id\"]) for record in records)\n",
    ")\n",
    "\n",
    "# Parse real routes' frames strings into hold sets\n",
    "df_real[\"real_holds\"] = df_real[\"frames\"].apply(frames_to_holds)\n",
    "df_real[\"hold_set\"] = df_real[\"real_holds\"].apply(holds_to_placement_set)\n",
    "\n",
    "# Find nearest real route for each generated route\n",
    "print(\"Computing novelty (finding nearest real route for each generated route)...\")\n",
    "print(\"This may take a few minutes...\")\n",
    "\n",
    "nearest = pd.DataFrame(\n",
    "    df_eval.apply(\n",
    "        lambda row: nearest_real_route_same_board(\n",
    "            generated_set=row[\"hold_set\"],\n",
    "            generated_board_key=row[\"board_key\"],\n",
    "            real_df=df_real,\n",
    "        ),\n",
    "        axis=1,\n",
    "    ).tolist()\n",
    ")\n",
    "df_eval = pd.concat([df_eval, nearest], axis=1)\n",
    "\n",
    "print(\"\\nNovelty statistics by board:\")\n",
    "print(\"=\" * 50)\n",
    "novelty_summary = df_eval.groupby(\"board_key\").agg(\n",
    "    mean_jaccard=(\"nearest_real_jaccard\", \"mean\"),\n",
    "    mean_novelty=(\"novelty_distance\", \"mean\"),\n",
    "    median_novelty=(\"novelty_distance\", \"median\"),\n",
    ").round(3)\n",
    "print(novelty_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b581705d",
   "metadata": {},
   "source": [
    "## Geometric descriptors\n",
    "\n",
    "We compute simple geometric features for each generated route:\n",
    "\n",
    "- `geom_n_holds`: Number of holds\n",
    "- `geom_height`: Vertical extent of the route\n",
    "- `geom_width`: Horizontal extent\n",
    "- `geom_mean_hand_reach`: Average distance between hand holds\n",
    "\n",
    "These features help us understand whether generated routes have reasonable spatial properties."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d74d4cad",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build coordinate lookup from token metadata\n",
    "coords = build_placement_coords(df_token_meta)\n",
    "\n",
    "# Compute geometric features for each generated route\n",
    "geom = pd.DataFrame(\n",
    "    df_eval.apply(\n",
    "        lambda row: simple_route_features(\n",
    "            board_key=row[\"board_key\"],\n",
    "            records=row[\"hold_records\"],\n",
    "            placement_coords=coords,\n",
    "        ),\n",
    "        axis=1,\n",
    "    ).tolist()\n",
    ")\n",
    "df_eval = pd.concat([df_eval, geom], axis=1)\n",
    "\n",
    "print(\"Geometric feature statistics by board:\")\n",
    "print(\"=\" * 50)\n",
    "geom_summary = df_eval.groupby(\"board_key\").agg(\n",
    "    mean_holds=(\"geom_n_holds\", \"mean\"),\n",
    "    mean_height=(\"geom_height\", \"mean\"),\n",
    "    mean_width=(\"geom_width\", \"mean\"),\n",
    "    mean_hand_reach=(\"geom_mean_hand_reach\", \"mean\"),\n",
    ").round(3)\n",
    "print(geom_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4455557a",
   "metadata": {},
   "source": [
    "## Grade consistency (using the trained critic)\n",
    "\n",
    "If we have a trained grade predictor (from notebook 02), we can use it as a **critic** to check whether generated routes have grades consistent with what was requested.\n",
    "\n",
    "This is similar to how GANs use a discriminator to evaluate generated samples, except our critic is a regression model rather than a binary classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88747d6e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Try to load the grade critic from notebook 02\n",
    "GRADE_MODEL_PATH = ROOT / \"models\" / \"joint_transformer_grade_predictor.pth\"\n",
    "\n",
    "def load_grade_critic(model_path, device):\n",
    "    \"\"\"Load the trained grade predictor model.\"\"\"\n",
    "    if not model_path.exists():\n",
    "        return None\n",
    "    try:\n",
    "        checkpoint = torch.load(model_path, map_location=device, weights_only=False)\n",
    "    except TypeError:\n",
    "        checkpoint = torch.load(model_path, map_location=device)\n",
    "\n",
    "    cfg = checkpoint[\"config\"]\n",
    "    stoi = {str(k): int(v) for k, v in checkpoint[\"stoi\"].items()}\n",
    "    coord_features = checkpoint[\"coord_features\"]\n",
    "    if not isinstance(coord_features, torch.Tensor):\n",
    "        coord_features = torch.tensor(coord_features, dtype=torch.float32)\n",
    "\n",
    "    model = JointRouteTransformerRegressor(\n",
    "        vocab_size=cfg[\"vocab_size\"],\n",
    "        max_len=cfg[\"max_len\"],\n",
    "        coord_features=coord_features,\n",
    "        d_model=cfg.get(\"d_model\", 128),\n",
    "        nhead=cfg.get(\"nhead\", 4),\n",
    "        num_layers=cfg.get(\"num_layers\", 4),\n",
    "        dim_feedforward=cfg.get(\"dim_feedforward\", 256),\n",
    "        dropout=cfg.get(\"dropout\", 0.10),\n",
    "        pad_id=cfg.get(\"pad_id\", stoi[\"<PAD>\"]),\n",
    "    ).to(device)\n",
    "    model.load_state_dict(checkpoint[\"model_state_dict\"])\n",
    "    model.eval()\n",
    "\n",
    "    return {\n",
    "        \"model\": model,\n",
    "        \"stoi\": stoi,\n",
    "        \"pad_id\": stoi[\"<PAD>\"],\n",
    "        \"unk_id\": stoi[\"<UNK>\"],\n",
    "        \"max_len\": cfg[\"max_len\"],\n",
    "    }\n",
    "\n",
    "\n",
    "def predict_generated_grade(tokens, critic, device):\n",
    "    \"\"\"Predict the difficulty of a generated route using the critic.\"\"\"\n",
    "    model = critic[\"model\"]\n",
    "    stoi = critic[\"stoi\"]\n",
    "    pad_id = critic[\"pad_id\"]\n",
    "    unk_id = critic[\"unk_id\"]\n",
    "    max_len = critic[\"max_len\"]\n",
    "\n",
    "    # Remove grade tokens and replace BOS with CLS\n",
    "    tokens = [t for t in tokens if not t.startswith(\"<GRADE_\")]\n",
    "    if tokens and tokens[0] == \"<BOS>\":\n",
    "        tokens = [\"<CLS>\"] + tokens[1:]\n",
    "    else:\n",
    "        tokens = [\"<CLS>\"] + tokens\n",
    "\n",
    "    ids = [stoi.get(t, unk_id) for t in tokens][:max_len]\n",
    "    mask = [1] * len(ids)\n",
    "    if len(ids) < max_len:\n",
    "        pad_n = max_len - len(ids)\n",
    "        ids += [pad_id] * pad_n\n",
    "        mask += [0] * pad_n\n",
    "\n",
    "    with torch.no_grad():\n",
    "        input_ids = torch.tensor([ids], dtype=torch.long, device=device)\n",
    "        attention_mask = torch.tensor([mask], dtype=torch.bool, device=device)\n",
    "        return float(model(input_ids, attention_mask).cpu().item())\n",
    "\n",
    "\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "critic = load_grade_critic(GRADE_MODEL_PATH, device)\n",
    "\n",
    "if critic is not None:\n",
    "    print(\"Grade critic loaded successfully!\")\n",
    "    print(f\"Device: {device}\")\n",
    "else:\n",
    "    print(\"No trained grade critic found. Skipping critic-based scoring.\")\n",
    "    print(\"Run notebook 02 first to train the grade predictor.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "critic_eval",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply the critic to evaluate grade consistency\n",
    "if critic is not None:\n",
    "    df_eval[\"critic_pred_display_difficulty\"] = df_eval[\"tokens_parsed\"].apply(\n",
    "        lambda tokens: predict_generated_grade(tokens, critic, device)\n",
    "    )\n",
    "    df_eval[\"critic_pred_grouped_v\"] = df_eval[\"critic_pred_display_difficulty\"].apply(to_grouped_v)\n",
    "    df_eval[\"critic_v_error\"] = df_eval[\"critic_pred_grouped_v\"] - df_eval[\"requested_grouped_v\"]\n",
    "\n",
    "    print(\"Grade consistency by board:\")\n",
    "    print(\"=\" * 50)\n",
    "    critic_summary = df_eval.groupby(\"board_key\").agg(\n",
    "        exact_v=(\"critic_v_error\", lambda s: float((s == 0).mean() * 100)),\n",
    "        within_1_v=(\"critic_v_error\", lambda s: float((s.abs() <= 1).mean() * 100)),\n",
    "        within_2_v=(\"critic_v_error\", lambda s: float((s.abs() <= 2).mean() * 100)),\n",
    "        mean_error=(\"critic_v_error\", \"mean\"),\n",
    "    ).round(2)\n",
    "    print(critic_summary)\n",
    "else:\n",
    "    print(\"Skipping critic evaluation (no model loaded).\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ranking",
   "metadata": {},
   "source": [
    "## Ranking generated routes\n",
    "\n",
    "We rank candidates by a composite score that rewards:\n",
    "- **Basic validity** (required): At least 3 holds, start/finish, no duplicates, one board\n",
    "- **Strict validity** (bonus): Also has middle holds and 4+ holds\n",
    "- **Novelty** (higher is better): Distance from nearest real route\n",
    "- **Grade consistency** (if critic available): Predicted grade close to requested grade"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88747d6e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rank candidates by composite score\n",
    "ranked = df_eval.copy()\n",
    "ranked[\"score\"] = 0.0\n",
    "ranked[\"score\"] += ranked[\"basic_valid_eval\"].astype(float) * 2.0\n",
    "ranked[\"score\"] += ranked[\"strict_valid_eval\"].astype(float) * 1.0\n",
    "ranked[\"score\"] += ranked[\"novelty_distance\"].fillna(0.0)\n",
    "\n",
    "if \"critic_v_error\" in ranked.columns:\n",
    "    ranked[\"score\"] += (ranked[\"critic_v_error\"].abs() <= 1).astype(float)\n",
    "    ranked[\"score\"] -= 0.25 * ranked[\"critic_v_error\"].abs()\n",
    "\n",
    "print(\"Top 10 generated routes by composite score:\")\n",
    "print(\"=\" * 70)\n",
    "top_routes = ranked.sort_values(\"score\", ascending=False).head(10)\n",
    "display_cols = [\"board_key\", \"score\", \"basic_valid_eval\", \"strict_valid_eval\", \"novelty_distance\"]\n",
    "if \"critic_v_error\" in top_routes.columns:\n",
    "    display_cols.append(\"critic_v_error\")\n",
    "print(top_routes[display_cols].to_string(index=False))"
   ]
  },
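  {
   "cell_type": "markdown",
   "id": "ranking_worked_example",
   "metadata": {},
   "source": [
    "As a worked check of the weights in the cell above, the sketch below recomputes the score for two made-up candidates. The function name `composite_score` and the input values are illustrative assumptions; only the weights are taken from the ranking cell.\n",
    "\n",
    "```python\n",
    "def composite_score(basic_valid, strict_valid, novelty, v_error=None):\n",
    "    # Validity terms: 2.0 for basic, 1.0 extra for strict\n",
    "    score = 2.0 * float(basic_valid) + 1.0 * float(strict_valid)\n",
    "    # Novelty distance is added directly (0 to 1)\n",
    "    score += novelty\n",
    "    if v_error is not None:\n",
    "        # Bonus for landing within one V-grade, penalty per grade of error\n",
    "        score += float(abs(v_error) <= 1)\n",
    "        score -= 0.25 * abs(v_error)\n",
    "    return score\n",
    "\n",
    "\n",
    "print(composite_score(True, True, 0.5, v_error=1))   # 2 + 1 + 0.5 + 1 - 0.25 = 4.25\n",
    "print(composite_score(True, False, 0.5, v_error=3))  # 2 + 0 + 0.5 + 0 - 0.75 = 1.75\n",
    "```\n",
    "\n",
    "Because the error penalty is unbounded, a route whose predicted grade is far from the request can score below a route with no critic bonus at all."
   ]
  },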
  {
   "cell_type": "markdown",
   "id": "evaluation_summary",
   "metadata": {},
   "source": [
    "## Save evaluation results\n",
    "\n",
    "We save the full evaluation DataFrame and the top candidates for further analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "save_results",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save evaluation results\n",
    "OUT_DIR = ROOT / \"data\" / \"processed\" / \"evaluation\"\n",
    "OUT_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "df_eval.to_csv(OUT_DIR / \"generated_route_evaluation.csv\", index=False)\n",
    "top_candidates = ranked.sort_values(\"score\", ascending=False).head(100)\n",
    "top_candidates.to_csv(OUT_DIR / \"top_generated_candidates.csv\", index=False)\n",
    "\n",
    "print(f\"Saved evaluation results to: {OUT_DIR}\")\n",
    "print(f\"  - generated_route_evaluation.csv ({len(df_eval)} rows)\")\n",
    "print(f\"  - top_generated_candidates.csv (100 rows)\")\n",
    "\n",
    "print(\"\\n\" + \"=\" * 50)\n",
    "print(\"EVALUATION SUMMARY\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"\\nTotal generated routes: {len(df_eval):,}\")\n",
    "print(f\"\\nBasic validity rate: {df_eval['basic_valid_eval'].mean():.1%}\")\n",
    "print(f\"Strict validity rate: {df_eval['strict_valid_eval'].mean():.1%}\")\n",
    "print(f\"Mean novelty distance: {df_eval['novelty_distance'].mean():.3f}\")\n",
    "\n",
    "if 'critic_v_error' in df_eval.columns:\n",
    "    print(f\"\\nGrade consistency:\")\n",
    "    print(f\"  Exact V-grade: {(df_eval['critic_v_error'] == 0).mean():.1%}\")\n",
    "    print(f\"  Within 1 V-grade: {(df_eval['critic_v_error'].abs() <= 1).mean():.1%}\")\n",
    "    print(f\"  Within 2 V-grades: {(df_eval['critic_v_error'].abs() <= 2).mean():.1%}\")\n",
    "else:\n",
    "    print(\"\\n(Grade consistency not available - no critic model loaded)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "takeaways",
   "metadata": {},
   "source": [
    "## Key Takeaways\n",
    "\n",
    "1. **Validity**: The generator produces routes that mostly satisfy structural constraints (start/finish holds, no duplicates, single board).\n",
    "\n",
    "2. **Novelty**: Generated routes are meaningfully different from existing routes, as measured by Jaccard distance.\n",
    "\n",
    "3. **Geometric plausibility**: The geometric features (height, width, hand reach) should be in reasonable ranges compared to real routes.\n",
    "\n",
    "4. **Grade consistency**: If the critic is available, we can check whether routes generated at a requested grade actually feel like that grade.\n",
    "\n",
    "### Limitations\n",
    "\n",
    "- Validity checks are structural, not semantic. A route might have valid start/finish holds but still be impossible.\n",
    "- Geometric features are simple. More sophisticated analysis could check reachability and move sequences.\n",
    "- The critic model was trained on real data, so it may not generalize well to novel route structures."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}