ClimbingBoardGPT/notebooks/04_generated_route_evaluation.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "84d328e0",
   "metadata": {},
   "source": [
    "# 04 — Generated Route Evaluation\n",
    "\n",
    "## Why evaluate generated routes?\n",
    "\n",
    "In language modeling, we evaluate generated text using metrics like BLEU, ROUGE, or perplexity. For climbing routes, we need domain-specific evaluation:\n",
    "\n",
    "1. **Validity**: Does the route follow the rules of climbing boards?\n",
    "2. **Novelty**: Is the route different from existing climbs, or just a copy?\n",
    "3. **Geometric plausibility**: Are the holds in reasonable positions?\n",
    "4. **Grade consistency**: Does the route's predicted grade match the requested grade?\n",
    "\n",
    "### Validity checks\n",
    "\n",
    "A \"basic valid\" route must have:\n",
    "- At least 3 holds\n",
    "- No duplicate placements\n",
    "- At least one start hold and one finish hold\n",
    "- All holds from the same board (no mixing TB2 and Kilter holds)\n",
    "\n",
    "A \"strict valid\" route additionally has:\n",
    "- At least one middle hold (most real climbs have more than just start + finish)\n",
    "- At least 4 holds total\n",
    "\n",
    "### Novelty metrics\n",
    "\n",
    "We measure novelty using **Jaccard distance**: 1 minus the Jaccard similarity between the generated route's hold set and the most similar real route's hold set.\n",
    "\n",
    "- Jaccard similarity = |A intersection B| / |A union B|\n",
    "- Novelty distance = 1 - Jaccard similarity\n",
    "\n",
    "A novelty distance of 1.0 means the generated route shares no holds with any real route. A distance of 0.0 means it's identical to an existing route.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "726b846f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:44:02.200057Z",
     "iopub.status.busy": "2026-06-07T23:44:02.199717Z",
     "iopub.status.idle": "2026-06-07T23:44:04.626359Z",
     "shell.execute_reply": "2026-06-07T23:44:04.625624Z"
    }
   },
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import ast\n",
    "import re\n",
    "from pathlib import Path\n",
    "from typing import Iterable\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "from scipy.spatial.distance import pdist\n",
    "\n",
    "ROOT = Path.cwd().resolve()\n",
    "if ROOT.name == \"notebooks\":\n",
    "    ROOT = ROOT.parent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f8bb61f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:44:04.629832Z",
     "iopub.status.busy": "2026-06-07T23:44:04.629390Z",
     "iopub.status.idle": "2026-06-07T23:44:10.364160Z",
     "shell.execute_reply": "2026-06-07T23:44:10.363335Z"
    }
   },
   "outputs": [],
   "source": [
    "# Load generated routes and real routes for comparison\n",
    "# NOTE: This notebook requires that you've run notebook 03 first to\n",
    "# generate and save the routes.\n",
    "\n",
    "TOKENIZED = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
    "GENERATED = ROOT / \"data\" / \"processed\" / \"generation\"\n",
    "\n",
    "# Check if required files exist\n",
    "generated_path = GENERATED / \"generated_routes.csv\"\n",
    "routes_path = TOKENIZED / \"route_sequences.csv\"\n",
    "token_meta_path = TOKENIZED / \"token_metadata.csv\"\n",
    "\n",
    "if not generated_path.exists():\n",
    "    raise FileNotFoundError(\n",
    "        f\"Generated routes not found at: {generated_path}\\n\"\n",
    "        f\"Please run notebook 03 first to generate and save routes,\\n\"\n",
    "        f\"or run: python scripts/03_train_route_generator.py\"\n",
    "    )\n",
    "\n",
    "if not routes_path.exists() or not token_meta_path.exists():\n",
    "    raise FileNotFoundError(\n",
    "        f\"Tokenized data not found at: {TOKENIZED}\\n\"\n",
    "        f\"Please run notebook 01 first to tokenize routes,\\n\"\n",
    "        f\"or run: python scripts/01_tokenize_routes.py\"\n",
    "    )\n",
    "\n",
    "df_generated = pd.read_csv(generated_path)\n",
    "df_real = pd.read_csv(routes_path)\n",
    "df_token_meta = pd.read_csv(token_meta_path)\n",
    "\n",
    "print(f\"Generated routes: {len(df_generated):,}\")\n",
    "print(f\"Real routes: {len(df_real):,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6dc0ac67",
   "metadata": {},
   "source": [
    "### Token parsing and validity helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c32f7ced",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:44:10.368243Z",
     "iopub.status.busy": "2026-06-07T23:44:10.367603Z",
     "iopub.status.idle": "2026-06-07T23:44:10.380028Z",
     "shell.execute_reply": "2026-06-07T23:44:10.379312Z"
    }
   },
   "outputs": [],
   "source": [
    "# Parse generated token strings and compute basic route-validity flags.\n",
    "HOLD_TOKEN_PATTERN = re.compile(r\"^<([A-Z0-9_]+)_p(\\d+)_(start|middle|finish|foot|unknown)>$\")\n",
    "\n",
    "def parse_tokens(value) -> list[str]:\n",
    "    \"\"\"Parse tokens from a list, repr-style list string, or whitespace sequence.\"\"\"\n",
    "    if isinstance(value, list):\n",
    "        return [str(v) for v in value]\n",
    "    if not isinstance(value, str):\n",
    "        return []\n",
    "\n",
    "    try:\n",
    "        parsed = ast.literal_eval(value)\n",
    "        if isinstance(parsed, list):\n",
    "            return [str(v) for v in parsed]\n",
    "    except (SyntaxError, ValueError):\n",
    "        pass\n",
    "\n",
    "    return value.split()\n",
    "\n",
    "def tokens_to_hold_records(tokens: Iterable[str]) -> list[dict[str, object]]:\n",
    "    \"\"\"Extract hold records from model tokens using the shared hold-token grammar.\"\"\"\n",
    "    rows: list[dict[str, object]] = []\n",
    "    for token in tokens:\n",
    "        match = HOLD_TOKEN_PATTERN.match(str(token))\n",
    "        if match is None:\n",
    "            continue\n",
    "        board_prefix = match.group(1)\n",
    "        rows.append(\n",
    "            {\n",
    "                \"token\": str(token),\n",
    "                \"board_token_prefix\": board_prefix,\n",
    "                \"board_prefix\": board_prefix,\n",
    "                \"placement_id\": int(match.group(2)),\n",
    "                \"role\": match.group(3),\n",
    "            }\n",
    "        )\n",
    "    return rows\n",
    "\n",
    "def parse_token_list(value) -> list[str]:\n",
    "    \"\"\"Compatibility wrapper around the shared token parser.\"\"\"\n",
    "    return parse_tokens(value)\n",
    "\n",
    "def validity_from_records(records: list[dict[str, object]], requested_board_prefix: str | None = None) -> dict[str, object]:\n",
    "    \"\"\"Compute evaluation-specific route-validity flags from hold records.\"\"\"\n",
    "    placements = [int(record[\"placement_id\"]) for record in records]\n",
    "    roles = [str(record[\"role\"]) for record in records]\n",
    "    prefixes = [str(record[\"board_token_prefix\"]) for record in records]\n",
    "    one_board_only = len(set(prefixes)) <= 1\n",
    "    matches_requested_board = requested_board_prefix is None or all(prefix == requested_board_prefix for prefix in prefixes)\n",
    "\n",
    "    out = {\n",
    "        \"n_holds_eval\": len(records),\n",
    "        \"n_unique_placements_eval\": len(set(placements)),\n",
    "        \"has_duplicate_placements_eval\": len(records) != len(set(placements)),\n",
    "        \"one_board_only_eval\": one_board_only,\n",
    "        \"matches_requested_board_eval\": matches_requested_board,\n",
    "        \"n_start_eval\": roles.count(\"start\"),\n",
    "        \"n_middle_eval\": roles.count(\"middle\"),\n",
    "        \"n_foot_eval\": roles.count(\"foot\"),\n",
    "        \"n_finish_eval\": roles.count(\"finish\"),\n",
    "        \"has_start_eval\": \"start\" in roles,\n",
    "        \"has_middle_eval\": \"middle\" in roles,\n",
    "        \"has_finish_eval\": \"finish\" in roles,\n",
    "    }\n",
    "    out[\"basic_valid_eval\"] = (\n",
    "        one_board_only\n",
    "        and out[\"n_holds_eval\"] >= 3\n",
    "        and out[\"n_holds_eval\"] == out[\"n_unique_placements_eval\"]\n",
    "        and out[\"has_start_eval\"]\n",
    "        and out[\"has_finish_eval\"]\n",
    "    )\n",
    "    out[\"strict_valid_eval\"] = (\n",
    "        out[\"basic_valid_eval\"]\n",
    "        and out[\"has_middle_eval\"]\n",
    "        and out[\"n_holds_eval\"] >= 4\n",
    "    )\n",
    "    return out"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0091bafb",
   "metadata": {},
   "source": [
    "## Parse generated tokens and check validity\n",
    "\n",
    "We parse the generated token sequences and check each route for validity.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5c2b25a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:44:10.383121Z",
     "iopub.status.busy": "2026-06-07T23:44:10.382759Z",
     "iopub.status.idle": "2026-06-07T23:44:10.430410Z",
     "shell.execute_reply": "2026-06-07T23:44:10.429741Z"
    }
   },
   "outputs": [],
   "source": [
    "# Parse the token strings into structured records\n",
    "df_generated[\"tokens_parsed\"] = df_generated[\"tokens\"].apply(parse_token_list)\n",
    "\n",
    "# Extract hold information from tokens\n",
    "df_generated[\"hold_records\"] = df_generated[\"tokens_parsed\"].apply(tokens_to_hold_records)\n",
    "\n",
    "# Check validity for each generated route\n",
    "validity = pd.DataFrame(df_generated[\"hold_records\"].apply(validity_from_records).tolist())\n",
    "df_eval = pd.concat([df_generated.reset_index(drop=True), validity], axis=1)\n",
    "\n",
    "print(\"Validity rates by board:\")\n",
    "print(\"=\" * 50)\n",
    "validity_summary = df_eval.groupby(\"board_key\").agg(\n",
    "    total=(\"basic_valid_eval\", \"count\"),\n",
    "    basic_valid_rate=(\"basic_valid_eval\", \"mean\"),\n",
    "    strict_valid_rate=(\"strict_valid_eval\", \"mean\"),\n",
    "    avg_holds=(\"n_holds_eval\", \"mean\"),\n",
    ").round(3)\n",
    "print(validity_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ff31b72",
   "metadata": {},
   "source": [
    "### Novelty helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10a40f48",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:44:10.434117Z",
     "iopub.status.busy": "2026-06-07T23:44:10.433720Z",
     "iopub.status.idle": "2026-06-07T23:44:10.443045Z",
     "shell.execute_reply": "2026-06-07T23:44:10.442319Z"
    }
   },
   "outputs": [],
   "source": [
    "# Compare generated hold sets to real routes using Jaccard similarity.\n",
    "def frames_to_holds(frames: str | None) -> list[tuple[int, int]]:\n",
    "    \"\"\"Parse a frames string into ``(placement_id, role_id)`` pairs.\"\"\"\n",
    "    if not isinstance(frames, str):\n",
    "        return []\n",
    "    return [(int(p), int(r)) for p, r in re.findall(r\"p(\\d+)r(\\d+)\", frames)]\n",
    "\n",
    "def holds_to_placement_set(holds: Iterable[tuple[int, int]]) -> frozenset[int]:\n",
    "    \"\"\"Drop role IDs and represent a route by its unique placement IDs.\"\"\"\n",
    "    return frozenset(int(placement_id) for placement_id, _ in holds)\n",
    "\n",
    "def jaccard(a: frozenset[int], b: frozenset[int]) -> float:\n",
    "    \"\"\"Return Jaccard similarity between two placement sets.\"\"\"\n",
    "    if not a and not b:\n",
    "        return 1.0\n",
    "    if not a or not b:\n",
    "        return 0.0\n",
    "    return len(a & b) / len(a | b)\n",
    "\n",
    "def nearest_real_route_same_board(\n",
    "    generated_set: frozenset[int],\n",
    "    generated_board_key: str,\n",
    "    real_df: pd.DataFrame,\n",
    ") -> dict[str, object]:\n",
    "    \"\"\"Find the most similar real route on the same board by Jaccard score.\n",
    "\n",
    "    .. note::\n",
    "\n",
    "       This function performs an O(n) linear scan over all real routes for\n",
    "       the matching board, computing a Jaccard similarity for each one. With\n",
    "       ~256K training examples, evaluating 400 generated routes costs roughly\n",
    "       O(100M) Jaccard comparisons. This is acceptable for evaluation scripts\n",
    "       but would not scale to a real-time or high-throughput setting without\n",
    "       an approximate nearest-neighbour index.\n",
    "    \"\"\"\n",
    "    board_frame = real_df[real_df[\"board_key\"] == generated_board_key]\n",
    "    if board_frame.empty:\n",
    "        return {\n",
    "            \"nearest_real_jaccard\": np.nan,\n",
    "            \"nearest_real_uuid\": None,\n",
    "            \"nearest_real_name\": None,\n",
    "            \"nearest_real_grouped_v\": None,\n",
    "            \"nearest_real_angle\": None,\n",
    "            \"novelty_distance\": np.nan,\n",
    "        }\n",
    "\n",
    "    similarities = board_frame[\"hold_set\"].map(lambda hold_set: jaccard(generated_set, hold_set))\n",
    "    best_idx = similarities.idxmax()\n",
    "    row = board_frame.loc[best_idx]\n",
    "\n",
    "    nearest_real_jaccard = float(similarities.loc[best_idx])\n",
    "    return {\n",
    "        \"nearest_real_jaccard\": nearest_real_jaccard,\n",
    "        \"nearest_real_uuid\": row[\"uuid\"],\n",
    "        \"nearest_real_name\": row[\"climb_name\"],\n",
    "        \"nearest_real_grouped_v\": row[\"grouped_v\"],\n",
    "        \"nearest_real_angle\": row[\"angle\"],\n",
    "        \"novelty_distance\": 1.0 - nearest_real_jaccard,\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cf2170e",
   "metadata": {},
   "source": [
    "## Novelty against real climbs\n",
    "\n",
    "For each generated route, we find the most similar real route from the same board (by Jaccard similarity of hold sets). A good generator should produce routes that are novel (low Jaccard similarity to existing routes) while still being valid.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7f34524",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:44:10.446422Z",
     "iopub.status.busy": "2026-06-07T23:44:10.445998Z",
     "iopub.status.idle": "2026-06-07T23:46:41.914124Z",
     "shell.execute_reply": "2026-06-07T23:46:41.913292Z"
    }
   },
   "outputs": [],
   "source": [
    "# Convert hold sets to frozensets for fast comparison\n",
    "df_eval[\"hold_set\"] = df_eval[\"hold_records\"].apply(\n",
    "    lambda records: frozenset(int(record[\"placement_id\"]) for record in records)\n",
    ")\n",
    "\n",
    "# Parse real routes' frames strings into hold sets\n",
    "df_real[\"real_holds\"] = df_real[\"frames\"].apply(frames_to_holds)\n",
    "df_real[\"hold_set\"] = df_real[\"real_holds\"].apply(holds_to_placement_set)\n",
    "\n",
    "# Find nearest real route for each generated route\n",
    "print(\"Computing novelty (finding nearest real route for each generated route)...\")\n",
    "print(\"This may take a few minutes...\")\n",
    "\n",
    "nearest = pd.DataFrame(\n",
    "    df_eval.apply(\n",
    "        lambda row: nearest_real_route_same_board(\n",
    "            generated_set=row[\"hold_set\"],\n",
    "            generated_board_key=row[\"board_key\"],\n",
    "            real_df=df_real,\n",
    "        ),\n",
    "        axis=1,\n",
    "    ).tolist()\n",
    ")\n",
    "df_eval = pd.concat([df_eval, nearest], axis=1)\n",
    "\n",
    "print(\"\\nNovelty statistics by board:\")\n",
    "print(\"=\" * 50)\n",
    "novelty_summary = df_eval.groupby(\"board_key\").agg(\n",
    "    mean_jaccard=(\"nearest_real_jaccard\", \"mean\"),\n",
    "    mean_novelty=(\"novelty_distance\", \"mean\"),\n",
    "    median_novelty=(\"novelty_distance\", \"median\"),\n",
    ").round(3)\n",
    "print(novelty_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad70ff4c",
   "metadata": {},
   "source": [
    "### Geometry helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85ddaf53",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:46:41.918125Z",
     "iopub.status.busy": "2026-06-07T23:46:41.917658Z",
     "iopub.status.idle": "2026-06-07T23:46:41.929570Z",
     "shell.execute_reply": "2026-06-07T23:46:41.928790Z"
    }
   },
   "outputs": [],
   "source": [
    "# Compute simple geometric descriptors from placement coordinates.\n",
    "def build_placement_coords(df_token_meta: pd.DataFrame) -> dict[tuple[str, int], dict[str, float]]:\n",
    "    \"\"\"Build a placement-coordinate lookup from token metadata.\"\"\"\n",
    "    hold_meta = df_token_meta[df_token_meta[\"kind\"] == \"hold\"].dropna(subset=[\"placement_id\"]).copy()\n",
    "    coords = {}\n",
    "    for _, row in hold_meta.drop_duplicates([\"board_key\", \"placement_id\"]).iterrows():\n",
    "        key = (str(row[\"board_key\"]), int(row[\"placement_id\"]))\n",
    "        coords[key] = {\n",
    "            \"x\": float(row[\"x\"]),\n",
    "            \"y\": float(row[\"y\"]),\n",
    "        }\n",
    "    return coords\n",
    "\n",
    "def simple_route_features(\n",
    "    board_key: str,\n",
    "    records: list[dict[str, object]],\n",
    "    placement_coords: dict[tuple[str, int], dict[str, float]],\n",
    ") -> dict[str, float]:\n",
    "    \"\"\"Compute simple geometric route features from hold coordinates.\n",
    "\n",
    "    These features are descriptive rather than a full climbing-physics model:\n",
    "    height/width describe route spread, and hand-reach distances summarize the\n",
    "    pairwise spacing among start/middle/finish holds.\n",
    "    \"\"\"\n",
    "    rows = []\n",
    "    for record in records:\n",
    "        key = (str(board_key), int(record[\"placement_id\"]))\n",
    "        coord = placement_coords.get(key)\n",
    "        if coord is None:\n",
    "            continue\n",
    "        x = float(coord[\"x\"])\n",
    "        y = float(coord[\"y\"])\n",
    "        if np.isnan(x) or np.isnan(y):\n",
    "            continue\n",
    "        role = str(record[\"role\"])\n",
    "        rows.append(\n",
    "            {\n",
    "                \"x\": x,\n",
    "                \"y\": y,\n",
    "                \"role\": role,\n",
    "                \"is_hand\": role in {\"start\", \"middle\", \"finish\"},\n",
    "                \"is_foot\": role == \"foot\",\n",
    "            }\n",
    "        )\n",
    "\n",
    "    if not rows:\n",
    "        return {\n",
    "            \"geom_n_holds\": 0.0,\n",
    "            \"geom_height\": np.nan,\n",
    "            \"geom_width\": np.nan,\n",
    "            \"geom_mean_y\": np.nan,\n",
    "            \"geom_mean_x_abs\": np.nan,\n",
    "            \"geom_mean_hand_reach\": np.nan,\n",
    "            \"geom_max_hand_reach\": np.nan,\n",
    "        }\n",
    "\n",
    "    d = pd.DataFrame(rows)\n",
    "    out = {\n",
    "        \"geom_n_holds\": float(len(d)),\n",
    "        \"geom_height\": float(d[\"y\"].max() - d[\"y\"].min()),\n",
    "        \"geom_width\": float(d[\"x\"].max() - d[\"x\"].min()),\n",
    "        \"geom_mean_y\": float(d[\"y\"].mean()),\n",
    "        \"geom_mean_x_abs\": float(d[\"x\"].abs().mean()),\n",
    "    }\n",
    "\n",
    "    hands = d[d[\"is_hand\"]].sort_values([\"y\", \"x\"])\n",
    "    if len(hands) >= 2:\n",
    "        distances = pdist(hands[[\"x\", \"y\"]].values)\n",
    "        out[\"geom_mean_hand_reach\"] = float(distances.mean())\n",
    "        out[\"geom_max_hand_reach\"] = float(distances.max())\n",
    "    else:\n",
    "        out[\"geom_mean_hand_reach\"] = np.nan\n",
    "        out[\"geom_max_hand_reach\"] = np.nan\n",
    "\n",
    "    return out"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b581705d",
   "metadata": {},
   "source": [
    "## Geometric descriptors\n",
    "\n",
    "We compute simple geometric features for each generated route:\n",
    "\n",
    "- `geom_n_holds`: Number of holds\n",
    "- `geom_height`: Vertical extent of the route\n",
    "- `geom_width`: Horizontal extent\n",
    "- `geom_mean_hand_reach`: Average distance between hand holds\n",
    "\n",
    "These features help us understand whether generated routes have reasonable spatial properties.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d74d4cad",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:46:41.932520Z",
     "iopub.status.busy": "2026-06-07T23:46:41.932262Z",
     "iopub.status.idle": "2026-06-07T23:46:42.775565Z",
     "shell.execute_reply": "2026-06-07T23:46:42.774476Z"
    }
   },
   "outputs": [],
   "source": [
    "# Build coordinate lookup from token metadata\n",
    "coords = build_placement_coords(df_token_meta)\n",
    "\n",
    "# Compute geometric features for each generated route\n",
    "geom = pd.DataFrame(\n",
    "    df_eval.apply(\n",
    "        lambda row: simple_route_features(\n",
    "            board_key=row[\"board_key\"],\n",
    "            records=row[\"hold_records\"],\n",
    "            placement_coords=coords,\n",
    "        ),\n",
    "        axis=1,\n",
    "    ).tolist()\n",
    ")\n",
    "df_eval = pd.concat([df_eval, geom], axis=1)\n",
    "\n",
    "print(\"Geometric feature statistics by board:\")\n",
    "print(\"=\" * 50)\n",
    "geom_summary = df_eval.groupby(\"board_key\").agg(\n",
    "    mean_holds=(\"geom_n_holds\", \"mean\"),\n",
    "    mean_height=(\"geom_height\", \"mean\"),\n",
    "    mean_width=(\"geom_width\", \"mean\"),\n",
    "    mean_hand_reach=(\"geom_mean_hand_reach\", \"mean\"),\n",
    ").round(3)\n",
    "print(geom_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44036a1e",
   "metadata": {},
   "source": [
    "### Critic model and grade helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9cfff1f4",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:46:42.779195Z",
     "iopub.status.busy": "2026-06-07T23:46:42.778895Z",
     "iopub.status.idle": "2026-06-07T23:46:42.791727Z",
     "shell.execute_reply": "2026-06-07T23:46:42.790706Z"
    }
   },
   "outputs": [],
   "source": [
    "# Map BoardLib display difficulties into grouped V-grade tokens.\n",
    "GRADE_TO_V = {\n",
    "    10: 0, 11: 0, 12: 0,\n",
    "    13: 1, 14: 1,\n",
    "    15: 2,\n",
    "    16: 3, 17: 3,\n",
    "    18: 4, 19: 4,\n",
    "    20: 5, 21: 5,\n",
    "    22: 6,\n",
    "    23: 7,\n",
    "    24: 8, 25: 8,\n",
    "    26: 9,\n",
    "    27: 10,\n",
    "    28: 11,\n",
    "    29: 12,\n",
    "    30: 13,\n",
    "    31: 14,\n",
    "    32: 15,\n",
    "    33: 16,\n",
    "}\n",
    "\n",
    "def to_grouped_v(display_difficulty: float) -> int:\n",
    "    \"\"\"Map a continuous display difficulty to the nearest grouped V grade.\"\"\"\n",
    "    rounded = int(round(float(display_difficulty)))\n",
    "    rounded = max(min(rounded, max(GRADE_TO_V)), min(GRADE_TO_V))\n",
    "    return GRADE_TO_V[rounded]\n",
    "\n",
    "def grade_token(display_difficulty: float) -> str:\n",
    "    \"\"\"Return the grade-conditioning token for a display difficulty value.\"\"\"\n",
    "    return f\"<GRADE_V{to_grouped_v(display_difficulty)}>\"\n",
    "\n",
    "# Transformer encoder used as a continuous grade regressor.\n",
    "class JointRouteTransformerRegressor(nn.Module):\n",
    "    \"\"\"Transformer encoder for joint TB2/Kilter route difficulty prediction.\n",
    "\n",
    "    Inputs are token IDs plus an attention mask. Token, position, and learned\n",
    "    projections of coordinate metadata are added before the encoder. The first\n",
    "    ``<CLS>`` position is then used as a pooled route representation for scalar\n",
    "    difficulty regression.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        vocab_size: int,\n",
    "        max_len: int,\n",
    "        coord_features: torch.Tensor,\n",
    "        d_model: int = 128,\n",
    "        nhead: int = 4,\n",
    "        num_layers: int = 4,\n",
    "        dim_feedforward: int = 256,\n",
    "        dropout: float = 0.10,\n",
    "        pad_id: int = 0,\n",
    "    ):\n",
    "        \"\"\"Create the encoder, coordinate projection, and regression head.\"\"\"\n",
    "        super().__init__()\n",
    "        self.vocab_size = vocab_size\n",
    "        self.max_len = max_len\n",
    "        self.d_model = d_model\n",
    "        self.pad_id = pad_id\n",
    "\n",
    "        self.token_emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)\n",
    "        self.pos_emb = nn.Embedding(max_len, d_model)\n",
    "\n",
    "        self.register_buffer(\"coord_features\", coord_features.clone().float())\n",
    "        self.coord_proj = nn.Linear(coord_features.shape[1], d_model)\n",
    "\n",
    "        encoder_layer = nn.TransformerEncoderLayer(\n",
    "            d_model=d_model,\n",
    "            nhead=nhead,\n",
    "            dim_feedforward=dim_feedforward,\n",
    "            dropout=dropout,\n",
    "            activation=\"gelu\",\n",
    "            batch_first=True,\n",
    "            norm_first=True,\n",
    "        )\n",
    "        self.encoder = nn.TransformerEncoder(\n",
    "            encoder_layer,\n",
    "            num_layers=num_layers,\n",
    "            enable_nested_tensor=False,\n",
    "        )\n",
    "        self.norm = nn.LayerNorm(d_model)\n",
    "        self.head = nn.Sequential(\n",
    "            nn.Linear(d_model, d_model),\n",
    "            nn.GELU(),\n",
    "            nn.Dropout(dropout),\n",
    "            nn.Linear(d_model, 1),\n",
    "        )\n",
    "\n",
    "    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:\n",
    "        \"\"\"Return one continuous difficulty prediction per input sequence.\"\"\"\n",
    "        batch_size, seq_len = input_ids.shape\n",
    "        positions = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand(batch_size, seq_len)\n",
    "\n",
    "        # Coordinate features are indexed by token ID, so every occurrence of a\n",
    "        # hold token gets the same physical x/y hint wherever it appears.\n",
    "        x = self.token_emb(input_ids) + self.pos_emb(positions)\n",
    "        x = x + self.coord_proj(self.coord_features[input_ids])\n",
    "\n",
    "        key_padding_mask = ~attention_mask.bool()\n",
    "        h = self.encoder(x, src_key_padding_mask=key_padding_mask)\n",
    "        h = self.norm(h)\n",
    "\n",
    "        cls_state = h[:, 0, :]\n",
    "        return self.head(cls_state).squeeze(-1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4455557a",
   "metadata": {},
   "source": [
    "## Grade consistency (using the trained critic)\n",
    "\n",
    "If we have a trained grade predictor (from notebook 02), we can use it as a **critic** to check whether generated routes have grades consistent with what was requested.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88747d6e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:46:42.795099Z",
     "iopub.status.busy": "2026-06-07T23:46:42.794788Z",
     "iopub.status.idle": "2026-06-07T23:46:43.323348Z",
     "shell.execute_reply": "2026-06-07T23:46:43.321923Z"
    }
   },
   "outputs": [],
   "source": [
    "# Try to load the grade critic from notebook 02\n",
    "GRADE_MODEL_PATH = ROOT / \"models\" / \"joint_transformer_grade_predictor.pth\"\n",
    "\n",
    "def load_grade_critic(model_path, device):\n",
    "    \"\"\"Load the trained grade predictor model.\"\"\"\n",
    "    if not model_path.exists():\n",
    "        return None\n",
    "    try:\n",
    "        checkpoint = torch.load(model_path, map_location=device, weights_only=False)\n",
    "    except TypeError:\n",
    "        checkpoint = torch.load(model_path, map_location=device)\n",
    "\n",
    "    cfg = checkpoint[\"config\"]\n",
    "    stoi = {str(k): int(v) for k, v in checkpoint[\"stoi\"].items()}\n",
    "    coord_features = checkpoint[\"coord_features\"]\n",
    "    if not isinstance(coord_features, torch.Tensor):\n",
    "        coord_features = torch.tensor(coord_features, dtype=torch.float32)\n",
    "\n",
    "    model = JointRouteTransformerRegressor(\n",
    "        vocab_size=cfg[\"vocab_size\"],\n",
    "        max_len=cfg[\"max_len\"],\n",
    "        coord_features=coord_features,\n",
    "        d_model=cfg.get(\"d_model\", 128),\n",
    "        nhead=cfg.get(\"nhead\", 4),\n",
    "        num_layers=cfg.get(\"num_layers\", 4),\n",
    "        dim_feedforward=cfg.get(\"dim_feedforward\", 256),\n",
    "        dropout=cfg.get(\"dropout\", 0.10),\n",
    "        pad_id=cfg.get(\"pad_id\", stoi[\"<PAD>\"]),\n",
    "    ).to(device)\n",
    "    model.load_state_dict(checkpoint[\"model_state_dict\"])\n",
    "    model.eval()\n",
    "\n",
    "    return {\n",
    "        \"model\": model,\n",
    "        \"stoi\": stoi,\n",
    "        \"pad_id\": stoi[\"<PAD>\"],\n",
    "        \"unk_id\": stoi[\"<UNK>\"],\n",
    "        \"max_len\": cfg[\"max_len\"],\n",
    "    }\n",
    "\n",
    "\n",
    "def predict_generated_grade(tokens, critic, device):\n",
    "    \"\"\"Predict the difficulty of a generated route using the critic.\"\"\"\n",
    "    model = critic[\"model\"]\n",
    "    stoi = critic[\"stoi\"]\n",
    "    pad_id = critic[\"pad_id\"]\n",
    "    unk_id = critic[\"unk_id\"]\n",
    "    max_len = critic[\"max_len\"]\n",
    "\n",
    "    # Remove grade tokens and replace BOS with CLS\n",
    "    tokens = [t for t in tokens if not t.startswith(\"<GRADE_\")]\n",
    "    if tokens and tokens[0] == \"<BOS>\":\n",
    "        tokens = [\"<CLS>\"] + tokens[1:]\n",
    "    else:\n",
    "        tokens = [\"<CLS>\"] + tokens\n",
    "\n",
    "    ids = [stoi.get(t, unk_id) for t in tokens][:max_len]\n",
    "    mask = [1] * len(ids)\n",
    "    if len(ids) < max_len:\n",
    "        pad_n = max_len - len(ids)\n",
    "        ids += [pad_id] * pad_n\n",
    "        mask += [0] * pad_n\n",
    "\n",
    "    with torch.no_grad():\n",
    "        input_ids = torch.tensor([ids], dtype=torch.long, device=device)\n",
    "        attention_mask = torch.tensor([mask], dtype=torch.bool, device=device)\n",
    "        return float(model(input_ids, attention_mask).cpu().item())\n",
    "\n",
    "\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "critic = load_grade_critic(GRADE_MODEL_PATH, device)\n",
    "\n",
    "if critic is not None:\n",
    "    print(\"Grade critic loaded successfully!\")\n",
    "    print(f\"Device: {device}\")\n",
    "else:\n",
    "    print(\"No trained grade critic found. Skipping critic-based scoring.\")\n",
    "    print(\"Run notebook 02 first to train the grade predictor.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "critic_eval",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:46:43.327454Z",
     "iopub.status.busy": "2026-06-07T23:46:43.326834Z",
     "iopub.status.idle": "2026-06-07T23:46:44.473105Z",
     "shell.execute_reply": "2026-06-07T23:46:44.472309Z"
    }
   },
   "outputs": [],
   "source": [
    "# Apply the critic to evaluate grade consistency\n",
    "if critic is not None:\n",
    "    df_eval[\"critic_pred_display_difficulty\"] = df_eval[\"tokens_parsed\"].apply(\n",
    "        lambda tokens: predict_generated_grade(tokens, critic, device)\n",
    "    )\n",
    "    df_eval[\"critic_pred_grouped_v\"] = df_eval[\"critic_pred_display_difficulty\"].apply(to_grouped_v)\n",
    "    df_eval[\"critic_v_error\"] = df_eval[\"critic_pred_grouped_v\"] - df_eval[\"requested_grouped_v\"]\n",
    "\n",
    "    print(\"Grade consistency by board:\")\n",
    "    print(\"=\" * 50)\n",
    "    critic_summary = df_eval.groupby(\"board_key\").agg(\n",
    "        exact_v=(\"critic_v_error\", lambda s: float((s == 0).mean() * 100)),\n",
    "        within_1_v=(\"critic_v_error\", lambda s: float((s.abs() <= 1).mean() * 100)),\n",
    "        within_2_v=(\"critic_v_error\", lambda s: float((s.abs() <= 2).mean() * 100)),\n",
    "        mean_error=(\"critic_v_error\", \"mean\"),\n",
    "    ).round(2)\n",
    "    print(critic_summary)\n",
    "else:\n",
    "    print(\"Skipping critic evaluation (no model loaded).\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ranking",
   "metadata": {},
   "source": [
    "## Ranking generated routes\n",
    "\n",
    "We rank candidates by a composite score that rewards:\n",
    "- **Basic validity** (required): At least 3 holds, start/finish, no duplicates, one board\n",
    "- **Strict validity** (bonus): Also has middle holds and 4+ holds\n",
    "- **Novelty** (higher is better): Distance from nearest real route\n",
    "- **Grade consistency** (if critic available): Predicted grade close to requested grade\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88747d6e2",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:46:44.476183Z",
     "iopub.status.busy": "2026-06-07T23:46:44.475814Z",
     "iopub.status.idle": "2026-06-07T23:46:44.489525Z",
     "shell.execute_reply": "2026-06-07T23:46:44.488845Z"
    }
   },
   "outputs": [],
   "source": [
    "# Rank candidates by composite score\n",
    "ranked = df_eval.copy()\n",
    "ranked[\"score\"] = 0.0\n",
    "ranked[\"score\"] += ranked[\"basic_valid_eval\"].astype(float) * 2.0\n",
    "ranked[\"score\"] += ranked[\"strict_valid_eval\"].astype(float) * 1.0\n",
    "ranked[\"score\"] += ranked[\"novelty_distance\"].fillna(0.0)\n",
    "\n",
    "if \"critic_v_error\" in ranked.columns:\n",
    "    ranked[\"score\"] += (ranked[\"critic_v_error\"].abs() <= 1).astype(float)\n",
    "    ranked[\"score\"] -= 0.25 * ranked[\"critic_v_error\"].abs()\n",
    "\n",
    "print(\"Top 10 generated routes by composite score:\")\n",
    "print(\"=\" * 70)\n",
    "top_routes = ranked.sort_values(\"score\", ascending=False).head(10)\n",
    "display_cols = [\"board_key\", \"score\", \"basic_valid_eval\", \"strict_valid_eval\", \"novelty_distance\"]\n",
    "if \"critic_v_error\" in top_routes.columns:\n",
    "    display_cols.append(\"critic_v_error\")\n",
    "print(top_routes[display_cols].to_string(index=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "evaluation_summary",
   "metadata": {},
   "source": [
    "## Save evaluation results\n",
    "\n",
    "We save the full evaluation DataFrame and the top candidates for further analysis.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "save_results",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T23:46:44.492676Z",
     "iopub.status.busy": "2026-06-07T23:46:44.492218Z",
     "iopub.status.idle": "2026-06-07T23:46:44.561651Z",
     "shell.execute_reply": "2026-06-07T23:46:44.560880Z"
    }
   },
   "outputs": [],
   "source": [
    "# Save evaluation results\n",
    "OUT_DIR = ROOT / \"data\" / \"processed\" / \"evaluation\"\n",
    "OUT_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "df_eval.to_csv(OUT_DIR / \"generated_route_evaluation.csv\", index=False)\n",
    "top_candidates = ranked.sort_values(\"score\", ascending=False).head(100)\n",
    "top_candidates.to_csv(OUT_DIR / \"top_generated_candidates.csv\", index=False)\n",
    "\n",
    "print(f\"Saved evaluation results to: {OUT_DIR}\")\n",
    "print(f\"  - generated_route_evaluation.csv ({len(df_eval)} rows)\")\n",
    "print(f\"  - top_generated_candidates.csv (100 rows)\")\n",
    "\n",
    "print(\"\\n\" + \"=\" * 50)\n",
    "print(\"EVALUATION SUMMARY\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"\\nTotal generated routes: {len(df_eval):,}\")\n",
    "print(f\"\\nBasic validity rate: {df_eval['basic_valid_eval'].mean():.1%}\")\n",
    "print(f\"Strict validity rate: {df_eval['strict_valid_eval'].mean():.1%}\")\n",
    "print(f\"Mean novelty distance: {df_eval['novelty_distance'].mean():.3f}\")\n",
    "\n",
    "if 'critic_v_error' in df_eval.columns:\n",
    "    print(f\"\\nGrade consistency:\")\n",
    "    print(f\"  Exact V-grade: {(df_eval['critic_v_error'] == 0).mean():.1%}\")\n",
    "    print(f\"  Within 1 V-grade: {(df_eval['critic_v_error'].abs() <= 1).mean():.1%}\")\n",
    "    print(f\"  Within 2 V-grades: {(df_eval['critic_v_error'].abs() <= 2).mean():.1%}\")\n",
    "else:\n",
    "    print(\"\\n(Grade consistency not available - no critic model loaded)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "takeaways",
   "metadata": {},
   "source": [
    "## Key Takeaways\n",
    "\n",
    "1. **Validity**: The generator produces routes that mostly satisfy structural constraints (start/finish holds, no duplicates, single board).\n",
    "\n",
    "2. **Novelty**: Generated routes are meaningfully different from existing routes, as measured by Jaccard distance.\n",
    "\n",
    "3. **Geometric plausibility**: The geometric features (height, width, hand reach) should be in reasonable ranges compared to real routes.\n",
    "\n",
    "4. **Grade consistency**: If the critic is available, we can check whether routes generated at a requested grade actually feel like that grade.\n",
    "\n",
    "### Limitations\n",
    "\n",
    "- Validity checks are structural, not semantic. A route might have valid start/finish holds but still be impossible.\n",
    "- Geometric features are simple. More sophisticated analysis could check reachability and move sequences.\n",
    "- The critic model was trained on real data, so it may not generalize well to novel route structures.\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}