initial commit

2026-05-21 07:21:13 -04:00
commit d510d07ed9
50 changed files with 5359 additions and 0 deletions
@@ -0,0 +1,466 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "94a0ae33",
+   "metadata": {},
+   "source": [
+    "# 01 — Unified Route Tokenization for TB2 and Kilter\n",
+    "\n",
+    "## What is tokenization and why does it matter?\n",
+    "\n",
+    "In natural language processing, **tokenization** is the process of converting raw text into a sequence of discrete symbols (tokens) that a model can process. For example, the sentence \"I love climbing\" might be tokenized as `[\"I\", \" love\", \" climbing\"]` using a subword tokenizer like BPE.\n",
+    "\n",
+    "For climbing board routes, we face an analogous problem: how do we convert a climb — which is fundamentally a *set of holds at specific positions with specific roles* — into a sequence of tokens that a transformer can learn from?\n",
+    "\n",
+    "### Key design decisions in this notebook\n",
+    "\n",
+    "1. **Board namespacing**: Each hold token includes the board prefix (e.g., `TB2_p344_start` vs `KILTER_p1084_start`). This prevents placement ID collisions between boards: a numeric placement ID is only meaningful within its own board, so placement 344 on TB2 has nothing to do with a placement 344 on Kilter (which, in fact, does not exist).\n",
+    "\n",
+    "2. **Semantic role mapping**: Different boards use different role IDs (TB2 uses 5/6/7/8, Kilter uses 12/13/14/15), but they all map to the same semantic roles: `start`, `middle`, `finish`, `foot`. This shared vocabulary lets the model learn transferable patterns.\n",
+    "\n",
+    "3. **Canonical ordering**: Holds within a route are sorted by (role priority, y-position, x-position). This gives the model a consistent input ordering, similar to how LLMs expect text in left-to-right order.\n",
+    "\n",
+    "4. **Special tokens**: Like BERT and GPT, we use special tokens:\n",
+    "   - `<BOS>` (beginning of sequence) — marks the start, like `[CLS]` in BERT\n",
+    "   - `<EOS>` (end of sequence) — marks the end, like `[SEP]` or the end-of-text token in GPT\n",
+    "   - `<PAD>` — for batching sequences of different lengths\n",
+    "   - `<UNK>` — for unknown tokens (safety net)\n",
+    "   - `<CLS>` — used by the grade predictor to pool sequence information\n",
+    "   - `<MASK>` — reserved for future masked language modeling experiments\n",
+    "\n",
+    "5. **Conditioning tokens**: Routes are prefixed with board, angle, and grade tokens. This is analogous to how modern LLMs use system prompts or task prefixes to condition generation.\n",
+    "\n",
+    "### The analogy to NLP\n",
+    "\n",
+    "| NLP Concept | Climbing Board Analog |\n",
+    "|---|---|\n",
+    "| Word / Subword | Hold token (placement + role) |\n",
+    "| Sentence | Route (sequence of holds) |\n",
+    "| Document language | Board type (TB2 vs Kilter) |\n",
+    "| Sentence length | Number of holds in route |\n",
+    "| POS tag | Semantic role (start/middle/finish/foot) |\n",
+    "| Genre / Domain | Angle + Grade conditioning |\n",
+    "\n",
+    "This notebook tokenizes climbing routes from **both** supported boards:\n",
+    "\n",
+    "- Tension Board 2 Mirror\n",
+    "- Kilter Board Original\n",
+    "\n",
+    "The board-specific details are stored in `configs/tb2.json` and `configs/kilter.json`.\n",
+    "The shared tokenization code lives in `src/climbingboardgpt/`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6ee2907f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "import sys\n",
+    "import json\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Set up the project root so we can import our custom package\n",
+    "ROOT = Path.cwd().resolve()\n",
+    "if ROOT.name == \"notebooks\":\n",
+    "    ROOT = ROOT.parent\n",
+    "sys.path.insert(0, str(ROOT / \"src\"))\n",
+    "\n",
+    "# Import our custom modules\n",
+    "from climbingboardgpt.config import load_board_configs\n",
+    "from climbingboardgpt.data import load_multi_board_data\n",
+    "from climbingboardgpt.tokenization import (\n",
+    "    build_route_records,\n",
+    "    build_token_metadata,\n",
+    "    build_vocab,\n",
+    "    encode,\n",
+    "    make_placement_lookup,\n",
+    "    vocab_payload,\n",
+    ")\n",
+    "from climbingboardgpt.utils import safe_train_test_split, write_json, json_safe"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b3066e2b",
+   "metadata": {},
+   "source": [
+    "## Load board configurations\n",
+    "\n",
+    "Each board has its own configuration file (`configs/tb2.json`, `configs/kilter.json`) that specifies:\n",
+    "\n",
+    "- **`layout_id`**: Which board layout to use (TB2 Mirror = 10, Kilter Original = 1)\n",
+    "- **`role_definitions`**: Maps semantic role names to board-specific role IDs\n",
+    "  - TB2: start=5, middle=6, finish=7, foot=8\n",
+    "  - Kilter: start=12, middle=13, finish=14, foot=15\n",
+    "- **`max_angle`**: We filter out climbs at extreme angles (>50° for TB2, >55° for Kilter) because those grades are biased toward elite climbers\n",
+    "- **`token_prefix`**: The namespace prefix for hold tokens (\"TB2\" vs \"KILTER\")\n",
+    "- **`include_mirror_placement_id`**: Whether to include mirror information (TB2 has symmetric left/right holds)\n",
+    "\n",
+    "This configuration-driven approach means we can add new boards by creating a new JSON config file, without changing any code."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4f04dcea",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "configs = load_board_configs([\"tb2\", \"kilter\"])\n",
+    "configs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2a5c9a9b",
+   "metadata": {},
+   "source": [
+    "## Load raw climbs and placement metadata\n",
+    "\n",
+    "The data loading step reads from SQLite databases downloaded using BoardLib:\n",
+    "\n",
+    "```bash\n",
+    "boardlib database tension data/raw/tb2.db\n",
+    "boardlib database kilter data/raw/kilter.db\n",
+    "```\n",
+    "\n",
+    "### What we're loading\n",
+    "\n",
+    "**Climbs data** (`df_climbs`): Each row is a climb-angle observation. Key columns:\n",
+    "- `uuid`: Unique climb identifier\n",
+    "- `frames`: The raw string encoding holds and roles, e.g., `p344r5p369r6p603r7`\n",
+    "- `angle`: Wall angle in degrees\n",
+    "- `display_difficulty`: Numeric difficulty score (maps to V-grades)\n",
+    "- `boulder_grade`: Human-readable grade like \"6b/V4\"\n",
+    "\n",
+    "**Placements data** (`df_placements`): Each row is a physical hold position on the board. Key columns:\n",
+    "- `placement_id`: The hold's unique ID within its board\n",
+    "- `x`, `y`: Physical coordinates on the board (in inches)\n",
+    "- `default_role_id`: What role this hold typically plays (hand vs foot)\n",
+    "- `set_name`: Material type (\"Wood\" or \"Plastic\")\n",
+    "- `mirror_placement_id`: For TB2, the ID of the symmetric hold on the other side"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "53c1951a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_climbs, df_placements = load_multi_board_data(configs, project_root=ROOT)\n",
+    "print(f\"Total climbs loaded: {len(df_climbs):,}\")\n",
+    "print(f\"Total placements loaded: {len(df_placements):,}\")\n",
+    "print()\n",
+    "print(\"Climbs per board:\")\n",
+    "print(df_climbs.groupby(\"board_key\").size())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6198c24e",
+   "metadata": {},
+   "source": [
+    "## Build unified route records\n",
+    "\n",
+    "This is the core tokenization step. For each climb, we:\n",
+    "\n",
+    "1. **Parse the frames string**: Convert `p344r5p369r6p603r7` into a list of `(placement_id, role_id)` tuples\n",
+    "\n",
+    "2. **Map role IDs to semantic roles**: Convert board-specific role IDs (5→start, 6→middle, etc.) to shared semantic names\n",
+    "\n",
+    "3. **Canonicalize hold order**: Sort holds by (role priority, y-position, x-position). This is important because:\n",
+    "   - The same climb can be represented with holds in any order in the database\n",
+    "   - Transformers need consistent input ordering to learn patterns\n",
+    "   - This is analogous to how NLP tokenizers normalize text (lowercasing, etc.)\n",
+    "\n",
+    "4. **Generate token sequences**: Create two versions of each route:\n",
+    "   - `sequence_with_grade`: `<BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6> <TB2_p344_start> ... <EOS>`\n",
+    "   - `sequence_no_grade`: `<BOS> <BOARD_TB2> <ANGLE_40> <TB2_p344_start> ... <EOS>` (grade removed)\n",
+    "\n",
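+    "The four steps above can be sketched in a few lines. This is a minimal illustration, not the implementation in `src/climbingboardgpt/`: the helper names `parse_frames` and `tokenize_route`, the role priority order, and the toy coordinates are assumptions made for the example.\n",
+    "\n",
+    "```python\n",
+    "import re\n",
+    "\n",
+    "# TB2 role IDs as described in this notebook\n",
+    "TB2_ROLES = {5: 'start', 6: 'middle', 7: 'finish', 8: 'foot'}\n",
+    "# Assumed priority for canonical ordering (hypothetical)\n",
+    "ROLE_PRIORITY = {'start': 0, 'middle': 1, 'finish': 2, 'foot': 3}\n",
+    "\n",
+    "def parse_frames(frames):\n",
+    "    # 'p344r5p369r6' -> [(344, 5), (369, 6)]\n",
+    "    pairs = re.findall(r'p([0-9]+)r([0-9]+)', frames)\n",
+    "    return [(int(p), int(r)) for p, r in pairs]\n",
+    "\n",
+    "def tokenize_route(frames, coords, prefix='TB2', angle=40):\n",
+    "    # coords maps placement_id -> (x, y); returns the grade-free token list\n",
+    "    holds = [(pid, TB2_ROLES[rid]) for pid, rid in parse_frames(frames)]\n",
+    "    # Canonical order: role priority, then y-position, then x-position\n",
+    "    holds.sort(key=lambda h: (ROLE_PRIORITY[h[1]], coords[h[0]][1], coords[h[0]][0]))\n",
+    "    hold_tokens = [f'<{prefix}_p{pid}_{role}>' for pid, role in holds]\n",
+    "    return ['<BOS>', f'<BOARD_{prefix}>', f'<ANGLE_{angle}>'] + hold_tokens + ['<EOS>']\n",
+    "\n",
+    "# Toy coordinates, made up for this example\n",
+    "coords = {344: (10.0, 4.0), 369: (14.0, 20.0), 603: (12.0, 44.0)}\n",
+    "tokens = tokenize_route('p603r7p344r5p369r6', coords)\n",
+    "print(' '.join(tokens))\n",
+    "```\n",
+    "\n",
+    "Although the frames string lists the finish hold first, the sorted output always begins with the start hold, so two differently ordered database rows for the same climb produce the same token sequence.\n",
+    "\n",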
+    "The grade-included version is used for the GPT generator (which predicts the next token, including grade). The grade-excluded version is used for the grade predictor (which receives the route without knowing the grade and must predict it)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "20bed1da",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "configs_by_key = {config.board_key: config for config in configs}\n",
+    "configs_by_prefix = {config.token_prefix: config for config in configs}\n",
+    "placement_lookup = make_placement_lookup(df_placements)\n",
+    "\n",
+    "df_routes = build_route_records(\n",
+    "    df_climbs=df_climbs,\n",
+    "    configs_by_key=configs_by_key,\n",
+    "    placement_lookup=placement_lookup,\n",
+    ")\n",
+    "print(f\"Tokenized routes: {len(df_routes):,}\")\n",
+    "print()\n",
+    "df_routes[[\"board_key\", \"angle\", \"display_difficulty\", \"sequence_with_grade\"]].head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4d007fc0",
+   "metadata": {},
+   "source": [
+    "## Example tokenized routes\n",
+    "\n",
+    "Let's look at what the tokenized routes actually look like. This is the \"text\" that our transformer models will read."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f5b7391b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for _, row in df_routes.groupby(\"board_key\").head(2).iterrows():\n",
+    "    print(f\"Board: {row['board_key']}\")\n",
+    "    print(f\"  Angle: {row['angle']}°\")\n",
+    "    print(f\"  Grade: {row['boulder_grade']} (V{row['grouped_v']})\")\n",
+    "    print(f\"  Tokens: {row['sequence_with_grade']}\")\n",
+    "    print()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0393d191",
+   "metadata": {},
+   "source": [
+    "## Build the shared vocabulary\n",
+    "\n",
+    "### What is a vocabulary?\n",
+    "\n",
+    "In NLP, the **vocabulary** (or \"vocab\") is the set of all possible tokens the model can produce or understand. For GPT-3, this is ~50,000 BPE tokens. For our climbing model, it includes:\n",
+    "\n",
+    "1. **Special tokens** (6): `<PAD>`, `<UNK>`, `<BOS>`, `<EOS>`, `<CLS>`, `<MASK>`\n",
+    "2. **Board tokens** (2): `<BOARD_TB2>`, `<BOARD_KILTER>`\n",
+    "3. **Angle tokens** (~6): `<ANGLE_30>`, `<ANGLE_35>`, `<ANGLE_40>`, etc.\n",
+    "4. **Grade tokens** (~17): `<GRADE_V0>` through `<GRADE_V16>`\n",
+    "5. **Hold tokens** (~1000+): One per placement per board per role\n",
+    "\n",
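+    "A vocabulary with this layout can be sketched as below, where `stoi` maps token strings to integer IDs and `itos` is the inverse. The function `build_vocab_sketch` is hypothetical and is not the project's `build_vocab`; in particular, giving special tokens the lowest IDs and sorting the rest alphabetically is an assumption.\n",
+    "\n",
+    "```python\n",
+    "SPECIAL_TOKENS = ['<PAD>', '<UNK>', '<BOS>', '<EOS>', '<CLS>', '<MASK>']\n",
+    "\n",
+    "def build_vocab_sketch(token_sequences):\n",
+    "    # Collect every distinct non-special token seen in the routes\n",
+    "    seen = {tok for seq in token_sequences for tok in seq} - set(SPECIAL_TOKENS)\n",
+    "    # Assumption: special tokens come first, the rest in sorted order\n",
+    "    vocab_tokens = SPECIAL_TOKENS + sorted(seen)\n",
+    "    stoi = {tok: i for i, tok in enumerate(vocab_tokens)}\n",
+    "    itos = {i: tok for tok, i in stoi.items()}\n",
+    "    return vocab_tokens, stoi, itos\n",
+    "\n",
+    "routes = [\n",
+    "    ['<BOS>', '<BOARD_TB2>', '<ANGLE_40>', '<TB2_p344_start>', '<EOS>'],\n",
+    "    ['<BOS>', '<BOARD_KILTER>', '<ANGLE_40>', '<KILTER_p1084_start>', '<EOS>'],\n",
+    "]\n",
+    "vocab_tokens, stoi, itos = build_vocab_sketch(routes)\n",
+    "\n",
+    "# A token that never appeared in the data falls back to <UNK>\n",
+    "ids = [stoi.get(tok, stoi['<UNK>']) for tok in routes[0] + ['<TB2_p999_foot>']]\n",
+    "print(len(vocab_tokens), ids)\n",
+    "```\n",
+    "\n",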
+    "### Why board-namespaced hold tokens?\n",
+    "\n",
+    "A placement ID is only meaningful within its own board: if the same number appears on both TB2 and Kilter, it refers to two unrelated physical holds. By prefixing with the board name (`TB2_p344_start` vs `KILTER_p1084_start`), we ensure the model never conflates holds from different boards.\n",
+    "\n",
+    "This is analogous to how multilingual LLMs use language-specific subword tokens — the same byte sequence can mean different things in different languages.\n",
+    "\n",
+    "### String-to-integer mapping\n",
+    "\n",
+    "Transformers operate on integer indices, not strings. The `stoi` (string-to-integer) and `itos` (integer-to-string) dictionaries provide this mapping, similar to how HuggingFace tokenizers work."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1ba5b78d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vocab_tokens, stoi, itos = build_vocab(df_routes)\n",
+    "\n",
+    "print(f\"Vocabulary size: {len(stoi):,}\")\n",
+    "print()\n",
+    "print(\"First 20 tokens:\")\n",
+    "print(vocab_tokens[:20])\n",
+    "print()\n",
+    "hold_tokens = [t for t in vocab_tokens if t.startswith('<') and '_p' in t]\n",
+    "angle_tokens = [t for t in vocab_tokens if t.startswith('<ANGLE_')]\n",
+    "grade_tokens = [t for t in vocab_tokens if t.startswith('<GRADE_')]\n",
+    "board_tokens = [t for t in vocab_tokens if t.startswith('<BOARD_')]\n",
+    "special_tokens = [t for t in vocab_tokens if t in ['<PAD>', '<UNK>', '<BOS>', '<EOS>', '<CLS>', '<MASK>']]\n",
+    "\n",
+    "print(f\"Special tokens: {len(special_tokens)}\")\n",
+    "print(f\"Board tokens: {len(board_tokens)}\")\n",
+    "print(f\"Angle tokens: {len(angle_tokens)}\")\n",
+    "print(f\"Grade tokens: {len(grade_tokens)}\")\n",
+    "print(f\"Hold tokens: {len(hold_tokens)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "93176cff",
+   "metadata": {},
+   "source": [
+    "## Train/validation/test splits\n",
+    "\n",
+    "### Why stratified splitting?\n",
+    "\n",
+    "We split data into train (80%), validation (10%), and test (10%) sets. Crucially, we **stratify by `board_key × grouped_v`** — this ensures that:\n",
+    "\n",
+    "1. Both boards (TB2 and Kilter) are represented in all splits\n",
+    "2. Every difficulty level with enough routes is represented in all splits (a grade with only one or two routes cannot appear in all three)\n",
+    "\n",
+    "Without stratification, we might end up with all V14 climbs in the test set and none in training, which would make evaluation meaningless.\n",
+    "\n",
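+    "The idea can be illustrated with a small standard-library sketch. It is not the project's `safe_train_test_split`, whose internals are not shown in this notebook; the function `stratified_split` and its rounding rule are assumptions made for the example.\n",
+    "\n",
+    "```python\n",
+    "import random\n",
+    "from collections import defaultdict\n",
+    "\n",
+    "def stratified_split(rows, key, fractions=(0.8, 0.1, 0.1), seed=42):\n",
+    "    # Group rows by stratum, e.g. 'tb2__V4'\n",
+    "    groups = defaultdict(list)\n",
+    "    for row in rows:\n",
+    "        groups[key(row)].append(row)\n",
+    "    rng = random.Random(seed)\n",
+    "    train, val, test = [], [], []\n",
+    "    # Split each stratum separately so every split sees every stratum\n",
+    "    for stratum in sorted(groups):\n",
+    "        members = groups[stratum]\n",
+    "        rng.shuffle(members)\n",
+    "        n_train = round(len(members) * fractions[0])\n",
+    "        n_val = round(len(members) * fractions[1])\n",
+    "        train += members[:n_train]\n",
+    "        val += members[n_train:n_train + n_val]\n",
+    "        test += members[n_train + n_val:]\n",
+    "    return train, val, test\n",
+    "\n",
+    "# Toy data: 2 boards x 2 grades x 10 routes each\n",
+    "rows = [{'board_key': b, 'grouped_v': v, 'uuid': f'{b}-{v}-{i}'}\n",
+    "        for b in ('tb2', 'kilter') for v in (3, 4) for i in range(10)]\n",
+    "stratum_of = lambda r: r['board_key'] + '__V' + str(r['grouped_v'])\n",
+    "train, val, test = stratified_split(rows, key=stratum_of)\n",
+    "print(len(train), len(val), len(test))\n",
+    "```\n",
+    "\n",
+    "With ten routes per stratum, each stratum contributes eight routes to train and one each to validation and test, so all four board-grade combinations appear in every split.\n",
+    "\n",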
+    "This is the same principle as stratified splitting in NLP, where you ensure all languages or domains are represented in each split."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ff18298e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_routes[\"ids_with_grade\"] = df_routes[\"tokens_with_grade\"].apply(lambda tokens: encode(tokens, stoi))\n",
+    "df_routes[\"ids_no_grade\"] = df_routes[\"tokens_no_grade\"].apply(lambda tokens: encode(tokens, stoi))\n",
+    "\n",
+    "df_routes[\"split_stratum\"] = df_routes[\"board_key\"].astype(str) + \"__V\" + df_routes[\"grouped_v\"].astype(str)\n",
+    "\n",
+    "train_df, temp_df = safe_train_test_split(\n",
+    "    df_routes,\n",
+    "    test_size=0.20,\n",
+    "    random_state=42,\n",
+    "    stratify_col=\"split_stratum\",\n",
+    ")\n",
+    "val_df, test_df = safe_train_test_split(\n",
+    "    temp_df,\n",
+    "    test_size=0.50,\n",
+    "    random_state=42,\n",
+    "    stratify_col=\"split_stratum\",\n",
+    ")\n",
+    "\n",
+    "split_map = {}\n",
+    "split_map.update({uuid: \"train\" for uuid in train_df[\"uuid\"]})\n",
+    "split_map.update({uuid: \"val\" for uuid in val_df[\"uuid\"]})\n",
+    "split_map.update({uuid: \"test\" for uuid in test_df[\"uuid\"]})\n",
+    "df_routes[\"split\"] = df_routes[\"uuid\"].map(split_map)\n",
+    "\n",
+    "print(\"Split counts by board:\")\n",
+    "print(df_routes.groupby([\"board_key\", \"split\"]).size().unstack(fill_value=0))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "414dca92",
+   "metadata": {},
+   "source": [
+    "## Token metadata\n",
+    "\n",
+    "### Why metadata matters\n",
+    "\n",
+    "Each hold token carries rich metadata that the model can use:\n",
+    "\n",
+    "- **Physical coordinates** (`x`, `y`): Where the hold is on the board\n",
+    "- **Normalized coordinates** (`x_norm`, `y_norm`): Scaled to [-1, 1] per board, so the model doesn't need to learn different coordinate scales\n",
+    "- **Semantic role** (`start`, `middle`, `finish`, `foot`): What the hold is used for\n",
+    "- **Board identity** (`board_key`): Which board this hold belongs to\n",
+    "\n",
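+    "One common way to obtain coordinates in [-1, 1] is per-board min-max scaling, sketched below. Whether `build_token_metadata` uses exactly this formula is an assumption; `normalize_per_board` is a hypothetical helper and the placements are toy values.\n",
+    "\n",
+    "```python\n",
+    "def normalize_per_board(placements):\n",
+    "    # placements: list of dicts with 'board_key', 'x', 'y'\n",
+    "    out = []\n",
+    "    for board in sorted({p['board_key'] for p in placements}):\n",
+    "        rows = [dict(p) for p in placements if p['board_key'] == board]\n",
+    "        for axis in ('x', 'y'):\n",
+    "            lo = min(r[axis] for r in rows)\n",
+    "            hi = max(r[axis] for r in rows)\n",
+    "            span = (hi - lo) or 1.0  # guard against a degenerate axis\n",
+    "            for r in rows:\n",
+    "                # Map [lo, hi] linearly onto [-1, 1]\n",
+    "                r[axis + '_norm'] = 2.0 * (r[axis] - lo) / span - 1.0\n",
+    "        out.extend(rows)\n",
+    "    return out\n",
+    "\n",
+    "placements = [\n",
+    "    {'board_key': 'tb2', 'x': 0.0, 'y': 0.0},\n",
+    "    {'board_key': 'tb2', 'x': 8.0, 'y': 48.0},\n",
+    "    {'board_key': 'tb2', 'x': 4.0, 'y': 12.0},\n",
+    "]\n",
+    "normalized = normalize_per_board(placements)\n",
+    "print([(r['x_norm'], r['y_norm']) for r in normalized])\n",
+    "```\n",
+    "\n",
+    "Because the minimum and maximum are computed per board, a hold at the top edge of either board maps to `y_norm = 1` regardless of the board's physical size.\n",
+    "\n",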
+    "The grade predictor uses these coordinate features as additional embeddings alongside the token embeddings. This is similar to how some LLMs inject positional embeddings or segment embeddings — we're giving the model extra structured information about each token."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "48c3692e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_token_meta = build_token_metadata(\n",
+    "    vocab_tokens=vocab_tokens,\n",
+    "    stoi=stoi,\n",
+    "    df_placements=df_placements,\n",
+    "    placement_lookup=placement_lookup,\n",
+    "    configs_by_prefix=configs_by_prefix,\n",
+    ")\n",
+    "\n",
+    "print(\"Token metadata columns:\")\n",
+    "print(df_token_meta.columns.tolist())\n",
+    "print()\n",
+    "print(\"Example hold token metadata:\")\n",
+    "df_token_meta[df_token_meta[\"kind\"] == \"hold\"].head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "414dca93",
+   "metadata": {},
+   "source": [
+    "## Save artifacts\n",
+    "\n",
+    "We save several files that will be consumed by later notebooks:\n",
+    "\n",
+    "1. **`route_sequences.csv`**: The main tokenized dataset with train/val/test splits\n",
+    "2. **`routes_tokenized.jsonl`**: Same data in JSON Lines format (one JSON object per route)\n",
+    "3. **`token_vocab.json`**: The vocabulary mapping (stoi and itos)\n",
+    "4. **`token_metadata.csv`**: Metadata for each token (coordinates, roles, etc.)\n",
+    "5. **`placement_metadata.csv`**: Physical placement information\n",
+    "6. **`board_summary.csv`**: Aggregate statistics per board"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "50e81878",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "OUT = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
+    "OUT.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "csv_cols = [\n",
+    "    \"uuid\", \"board_key\", \"board_display_name\", \"board_token_prefix\", \"board_token\",\n",
+    "    \"climb_name\", \"setter_username\", \"layout_id\", \"layout_name\", \"board_name\",\n",
+    "    \"frames\", \"angle\", \"display_difficulty\", \"grouped_v\", \"boulder_grade\",\n",
+    "    \"ascensionist_count\", \"quality_average\", \"fa_at\",\n",
+    "    \"n_holds\", \"n_start\", \"n_middle\", \"n_foot\", \"n_finish\",\n",
+    "    \"sequence_with_grade\", \"sequence_no_grade\", \"split\",\n",
+    "]\n",
+    "df_routes[csv_cols].to_csv(OUT / \"route_sequences.csv\", index=False)\n",
+    "\n",
+    "df_placements.to_csv(OUT / \"placement_metadata.csv\", index=False)\n",
+    "\n",
+    "df_token_meta.to_csv(OUT / \"token_metadata.csv\", index=False)\n",
+    "\n",
+    "write_json(OUT / \"token_vocab.json\", vocab_payload(stoi, itos, configs_by_key))\n",
+    "\n",
+    "with (OUT / \"routes_tokenized.jsonl\").open(\"w\", encoding=\"utf-8\") as handle:\n",
+    "    for record in df_routes.to_dict(orient=\"records\"):\n",
+    "        handle.write(json.dumps(json_safe(record)) + \"\\n\")\n",
+    "\n",
+    "board_summary = (\n",
+    "    df_routes.groupby(\"board_key\")\n",
+    "    .agg(\n",
+    "        n_routes=(\"uuid\", \"count\"),\n",
+    "        mean_angle=(\"angle\", \"mean\"),\n",
+    "        mean_display_difficulty=(\"display_difficulty\", \"mean\"),\n",
+    "        mean_holds=(\"n_holds\", \"mean\"),\n",
+    "    )\n",
+    "    .reset_index()\n",
+    ")\n",
+    "board_summary.to_csv(OUT / \"board_summary.csv\", index=False)\n",
+    "\n",
+    "print(\"Saved artifacts to:\", OUT)\n",
+    "print(f\"  - route_sequences.csv ({len(df_routes):,} rows)\")\n",
+    "print(\"  - routes_tokenized.jsonl\")\n",
+    "print(f\"  - token_vocab.json ({len(stoi):,} tokens)\")\n",
+    "print(f\"  - token_metadata.csv ({len(df_token_meta):,} rows)\")\n",
+    "print(f\"  - placement_metadata.csv ({len(df_placements):,} rows)\")\n",
+    "print(\"  - board_summary.csv\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,562 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "92b83b1d",
+   "metadata": {},
+   "source": [
+    "# 02 — Joint Transformer Grade Prediction\n",
+    "\n",
+    "## From Language Modeling to Grade Prediction\n",
+    "\n",
+    "In NLP, **BERT-style models** are encoder-only transformers: they take a sequence of tokens, process it through multiple self-attention layers, and, for sequence-level tasks, reduce the result to a single output (like a classification label). The pipeline has three stages:\n",
+    "\n",
+    "- **Input**: A sequence of tokens (words, subwords, or in our case, holds)\n",
+    "- **Processing**: Multiple layers of self-attention that let each token \"look at\" every other token\n",
+    "- **Output**: A pooled representation (typically from a `[CLS]` token) that summarizes the entire sequence\n",
+    "\n",
+    "### Our architecture\n",
+    "\n",
+    "We use a **Transformer Encoder** (similar to BERT) with these components:\n",
+    "\n",
+    "1. **Token embeddings**: Convert integer token IDs to dense vectors\n",
+    "2. **Positional embeddings**: Tell the model where each token is in the sequence\n",
+    "3. **Coordinate features**: Inject physical (x, y) position of each hold into the embedding\n",
+    "4. **Transformer encoder layers**: Multiple layers of self-attention + feedforward\n",
+    "5. **Regression head**: Take the `<CLS>` token's output and predict a single difficulty score\n",
+    "\n",
+    "### Why this works for climbing\n",
+    "\n",
+    "A climb's difficulty depends on the *relationships between holds*, not just individual holds. Self-attention naturally captures these relationships:\n",
+    "\n",
+    "- A start hold far from the first middle hold suggests a big opening move\n",
+    "- Two hand holds close together with a foot hold far away suggests a dyno\n",
+    "- The overall spatial distribution determines the \"flow\" of the climb\n",
+    "\n",
+    "The transformer can learn these spatial relationships through attention, without us having to manually engineer features like \"mean hand reach\" or \"height gained\" (though those features were useful in the classical model).\n",
+    "\n",
+    "### Input format\n",
+    "\n",
+    "```text\n",
+    "<CLS> <BOARD_TB2> <ANGLE_40> <TB2_p344_start> <TB2_p369_middle> ... <TB2_p603_finish> <EOS>\n",
+    "```\n",
+    "\n",
+    "Note: We use `<CLS>` instead of `<BOS>` and we **exclude the grade token** — the model must predict the grade, not see it!\n",
+    "\n",
+    "### Target\n",
+    "\n",
+    "```text\n",
+    "display_difficulty (continuous value, e.g., 20.5)\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3dfd6081",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "import sys\n",
+    "import json\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "from torch.utils.data import DataLoader\n",
+    "\n",
+    "ROOT = Path.cwd().resolve()\n",
+    "if ROOT.name == \"notebooks\":\n",
+    "    ROOT = ROOT.parent\n",
+    "sys.path.insert(0, str(ROOT / \"src\"))\n",
+    "\n",
+    "from climbingboardgpt.datasets import RouteGradeDataset\n",
+    "from climbingboardgpt.metrics import regression_metrics, metrics_by_board\n",
+    "from climbingboardgpt.models import JointRouteTransformerRegressor"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8a9e2443",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "TOKENIZED = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
+    "df_routes = pd.read_csv(TOKENIZED / \"route_sequences.csv\")\n",
+    "vocab = json.loads((TOKENIZED / \"token_vocab.json\").read_text(encoding=\"utf-8\"))\n",
+    "\n",
+    "stoi = {str(k): int(v) for k, v in vocab[\"stoi\"].items()}\n",
+    "itos = {int(k): str(v) for k, v in vocab[\"itos\"].items()}\n",
+    "df_token_meta = pd.read_csv(TOKENIZED / \"token_metadata.csv\")\n",
+    "\n",
+    "pad_id = stoi[\"<PAD>\"]\n",
+    "unk_id = stoi[\"<UNK>\"]\n",
+    "\n",
+    "print(f\"Vocabulary size: {len(stoi):,}\")\n",
+    "print(f\"Total routes: {len(df_routes):,}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d4abfd9b",
+   "metadata": {},
+   "source": [
+    "## Build model IDs and coordinate features\n",
+    "\n",
+    "### Coordinate features: Why inject physical position?\n",
+    "\n",
+    "In standard NLP, positional embeddings tell the model *which position in the sequence* a token occupies. But for climbing, the **physical position on the wall** matters more than the sequence position.\n",
+    "\n",
+    "We create a 3-dimensional feature vector for each token:\n",
+    "1. `x_norm`: Normalized horizontal position on the board (-1 to 1)\n",
+    "2. `y_norm`: Normalized vertical position on the board (-1 to 1)\n",
+    "3. `is_hold`: 1 if this token represents a hold, 0 otherwise\n",
+    "\n",
+    "These features are projected through a linear layer and added to the token embeddings. This is similar to how some vision-language models inject spatial features from images alongside text tokens."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "95bb745f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def encode(tokens):\n",
+    "    \"\"\"Convert a list of token strings to integer IDs.\"\"\"\n",
+    "    return [stoi.get(token, unk_id) for token in tokens]\n",
+    "\n",
+    "# Prepare input sequences for the grade predictor\n",
+    "# We use the \"no grade\" version because the model should predict the grade,\n",
+    "# not see it in the input!\n",
+    "# We also prepend <CLS> which will be used for pooling the sequence representation\n",
+    "df_routes[\"tokens_no_grade\"] = df_routes[\"sequence_no_grade\"].fillna(\"\").str.split()\n",
+    "df_routes[\"model_tokens\"] = df_routes[\"tokens_no_grade\"].apply(\n",
+    "    lambda tokens: [\"<CLS>\"] + tokens[1:] if tokens else [\"<CLS>\"]\n",
+    ")\n",
+    "df_routes[\"model_ids\"] = df_routes[\"model_tokens\"].apply(encode)\n",
+    "df_routes[\"seq_len\"] = df_routes[\"model_ids\"].apply(len)\n",
+    "max_len = int(df_routes[\"seq_len\"].max())\n",
+    "\n",
+    "# Build coordinate features matrix: (vocab_size, 3)\n",
+    "# Each row corresponds to a token ID and contains [x_norm, y_norm, is_hold]\n",
+    "# This will be used as additional input to the model alongside token embeddings\n",
+    "coord_features = np.zeros((len(stoi), 3), dtype=np.float32)\n",
+    "for _, row in df_token_meta.iterrows():\n",
+    "    token_id = int(row[\"token_id\"])\n",
+    "    coord_features[token_id, 0] = 0.0 if pd.isna(row.get(\"x_norm\", 0.0)) else float(row.get(\"x_norm\", 0.0))\n",
+    "    coord_features[token_id, 1] = 0.0 if pd.isna(row.get(\"y_norm\", 0.0)) else float(row.get(\"y_norm\", 0.0))\n",
+    "    coord_features[token_id, 2] = 0.0 if pd.isna(row.get(\"is_hold\", 0.0)) else float(row.get(\"is_hold\", 0.0))\n",
+    "coord_features = torch.tensor(coord_features, dtype=torch.float32)\n",
+    "\n",
+    "print(f\"Max sequence length: {max_len}\")\n",
+    "print(f\"Coordinate features shape: {coord_features.shape}\")\n",
+    "print(f\"Vocabulary size: {len(stoi)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d5da4ca8",
+   "metadata": {},
+   "source": [
+    "## Data loaders\n",
+    "\n",
+    "### Batching and padding\n",
+    "\n",
+    "Transformers process data in **batches** for efficiency. But routes have different lengths (different numbers of holds). We handle this by:\n",
+    "\n",
+    "1. **Padding**: Shorter sequences are padded with `<PAD>` tokens up to a fixed length, `max_len` (the longest route in the dataset)\n",
+    "2. **Attention masking**: The model receives a binary mask that tells it which positions are real data vs padding\n",
+    "\n",
+    "This is the same padding-plus-mask approach used to batch variable-length text for BERT and GPT.\n",
+    "\n",
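+    "Padding and masking for a single route can be sketched as follows. The helper `pad_and_mask` is hypothetical; the internals of `RouteGradeDataset` are not shown in this notebook, and truncating sequences longer than `max_len` is an assumption.\n",
+    "\n",
+    "```python\n",
+    "def pad_and_mask(ids, max_len, pad_id):\n",
+    "    ids = ids[:max_len]  # assumption: over-long sequences are truncated\n",
+    "    n_pad = max_len - len(ids)\n",
+    "    input_ids = ids + [pad_id] * n_pad\n",
+    "    # 1 marks a real token, 0 marks padding the model should ignore\n",
+    "    attention_mask = [1] * len(ids) + [0] * n_pad\n",
+    "    return input_ids, attention_mask\n",
+    "\n",
+    "input_ids, attention_mask = pad_and_mask([4, 7, 9], max_len=5, pad_id=0)\n",
+    "print(input_ids)       # [4, 7, 9, 0, 0]\n",
+    "print(attention_mask)  # [1, 1, 1, 0, 0]\n",
+    "```\n",
+    "\n",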
+    "### The RouteGradeDataset class\n",
+    "\n",
+    "For each route, this dataset produces:\n",
+    "- `input_ids`: Integer token IDs, padded to `max_len`\n",
+    "- `attention_mask`: 1 for real tokens, 0 for padding\n",
+    "- `target`: The difficulty score we want to predict\n",
+    "- `uuid`, `board_key`: Metadata for evaluation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2c9e5543",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_df = df_routes[df_routes[\"split\"] == \"train\"].reset_index(drop=True)\n",
+    "val_df = df_routes[df_routes[\"split\"] == \"val\"].reset_index(drop=True)\n",
+    "test_df = df_routes[df_routes[\"split\"] == \"test\"].reset_index(drop=True)\n",
+    "\n",
+    "train_ds = RouteGradeDataset(train_df, max_len=max_len, pad_id=pad_id)\n",
+    "val_ds = RouteGradeDataset(val_df, max_len=max_len, pad_id=pad_id)\n",
+    "test_ds = RouteGradeDataset(test_df, max_len=max_len, pad_id=pad_id)\n",
+    "\n",
+    "train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)\n",
+    "val_loader = DataLoader(val_ds, batch_size=128, shuffle=False)\n",
+    "test_loader = DataLoader(test_ds, batch_size=128, shuffle=False)\n",
+    "\n",
+    "print(f\"Training samples: {len(train_ds):,}\")\n",
+    "print(f\"Validation samples: {len(val_ds):,}\")\n",
+    "print(f\"Test samples: {len(test_ds):,}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "90fa8ae5",
+   "metadata": {},
+   "source": [
+    "## Model Architecture\n",
+    "\n",
+    "### The JointRouteTransformerRegressor\n",
+    "\n",
+    "This model is a **transformer encoder** with a regression head. Here's what each component does:\n",
+    "\n",
+    "1. **Token embedding** (`nn.Embedding`): Converts integer token IDs to dense vectors of dimension `d_model`. This is the same as word embeddings in NLP.\n",
+    "\n",
+    "2. **Positional embedding** (`nn.Embedding`): Adds position information so the model knows which position each token occupies. Unlike sinusoidal positional encodings in the original Transformer paper, we use learned embeddings.\n",
+    "\n",
+    "3. **Coordinate projection** (`nn.Linear`): Projects the 3-dimensional coordinate features (x_norm, y_norm, is_hold) to `d_model` dimensions and adds them to the token embeddings. This injects physical position information.\n",
+    "\n",
+    "4. **Transformer encoder** (`nn.TransformerEncoder`): Multiple layers of self-attention and feedforward networks. Each layer:\n",
+    "   - Computes self-attention: every hold \"looks at\" every other hold\n",
+    "   - Applies feedforward transformation\n",
+    "   - Uses residual connections and layer normalization\n",
+    "\n",
+    "5. **Regression head**: Takes the `<CLS>` token's output (which has aggregated information from the entire sequence) and predicts a single difficulty score.\n",
+    "\n",
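+    "The five components can be wired together roughly as below. This is a sketch under stated assumptions, not the source of `JointRouteTransformerRegressor`: the class name `RouteRegressorSketch`, the single linear regression head, and deriving the padding mask from `pad_id` inside `forward` are all choices made for illustration.\n",
+    "\n",
+    "```python\n",
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "\n",
+    "class RouteRegressorSketch(nn.Module):\n",
+    "    def __init__(self, vocab_size, max_len, coord_features, d_model=128, nhead=4,\n",
+    "                 num_layers=4, dim_feedforward=256, dropout=0.10, pad_id=0):\n",
+    "        super().__init__()\n",
+    "        self.pad_id = pad_id\n",
+    "        self.token_emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)\n",
+    "        self.pos_emb = nn.Embedding(max_len, d_model)\n",
+    "        # Fixed lookup table: one [x_norm, y_norm, is_hold] row per token ID\n",
+    "        self.register_buffer('coord_features', coord_features)\n",
+    "        self.coord_proj = nn.Linear(coord_features.shape[1], d_model)\n",
+    "        layer = nn.TransformerEncoderLayer(\n",
+    "            d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,\n",
+    "            dropout=dropout, batch_first=True,\n",
+    "        )\n",
+    "        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)\n",
+    "        self.head = nn.Linear(d_model, 1)\n",
+    "\n",
+    "    def forward(self, input_ids):\n",
+    "        positions = torch.arange(input_ids.size(1), device=input_ids.device)\n",
+    "        x = (self.token_emb(input_ids)\n",
+    "             + self.pos_emb(positions)\n",
+    "             + self.coord_proj(self.coord_features[input_ids]))\n",
+    "        pad_mask = input_ids == self.pad_id  # True where attention is blocked\n",
+    "        h = self.encoder(x, src_key_padding_mask=pad_mask)\n",
+    "        return self.head(h[:, 0]).squeeze(-1)  # position 0 holds <CLS>\n",
+    "\n",
+    "# Toy usage: 50-token vocabulary, two padded routes\n",
+    "coord = torch.zeros(50, 3)\n",
+    "sketch = RouteRegressorSketch(vocab_size=50, max_len=12, coord_features=coord)\n",
+    "batch = torch.tensor([[4, 6, 7, 9, 3, 0, 0], [4, 6, 8, 3, 0, 0, 0]])\n",
+    "predictions = sketch(batch)\n",
+    "print(predictions.shape)\n",
+    "```\n",
+    "\n",
+    "The output has one scalar per route, which is what the MSE loss compares against `display_difficulty`.\n",
+    "\n",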
+    "### Hyperparameters\n",
+    "\n",
+    "- `d_model=128`: The dimensionality of embeddings and hidden states\n",
+    "- `nhead=4`: Number of attention heads (multi-head attention)\n",
+    "- `num_layers=4`: Number of transformer layers\n",
+    "- `dim_feedforward=256`: Dimension of the feedforward network inside each layer\n",
+    "- `dropout=0.10`: Dropout probability for regularization"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "62c2db48",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+    "\n",
+    "model = JointRouteTransformerRegressor(\n",
+    "    vocab_size=len(stoi),\n",
+    "    max_len=max_len,\n",
+    "    coord_features=coord_features,\n",
+    "    d_model=128,\n",
+    "    nhead=4,\n",
+    "    num_layers=4,\n",
+    "    dim_feedforward=256,\n",
+    "    dropout=0.10,\n",
+    "    pad_id=pad_id,\n",
+    ").to(device)\n",
+    "\n",
+    "optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)\n",
+    "\n",
+    "print(f\"Device: {device}\")\n",
+    "print(f\"Parameters: {sum(p.numel() for p in model.parameters()):,}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ded8d846",
+   "metadata": {},
+   "source": [
+    "## Training Configuration\n",
+    "\n",
+    "### Loss function: MSE (Mean Squared Error)\n",
+    "\n",
+    "We use MSE loss because the target is a continuous difficulty score. Squaring the residual makes the penalty grow quadratically, so a prediction that is off by two points costs four times as much as one that is off by one point.\n",
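+    "\n",
+    "As a reference for this loss, here is a sketch of the standard definition, where $y_i$ is the true difficulty score of route $i$, $\\hat{y}_i$ is the model's prediction, and $N$ is the number of routes in the batch (these symbols are introduced here for illustration):\n",
+    "\n",
+    "$$\\mathrm{MSE} = \\frac{1}{N} \\sum_{i=1}^{N} \\left( \\hat{y}_i - y_i \\right)^2$$\n",
+    "\n",
+    "For example, residuals of 1 and 3 points give $(1^2 + 3^2)/2 = 5$, while two residuals of 2 points give $(2^2 + 2^2)/2 = 4$: the same total absolute error costs more when it is concentrated in one route.\n",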
+    "\n",
+    "### Optimizer: AdamW\n",
+    "\n",
+    "AdamW is the standard optimizer for transformer models. It combines:\n",
+    "- **Adam**: Adaptive learning rates per parameter\n",
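+    "- **Decoupled weight decay**: Shrinks each weight toward zero directly in the update step, instead of adding an L2 penalty to the gradient; this regularizes the model against overfitting\n",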
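+    "\n",
+    "A sketch of the decoupled update, assuming the usual AdamW formulation: $\\theta$ is a parameter, $\\eta$ the learning rate, $\\lambda$ the weight decay coefficient, and $\\hat{m}_t$, $\\hat{v}_t$ Adam's bias-corrected moment estimates (notation introduced here for illustration):\n",
+    "\n",
+    "$$\\theta_t = \\theta_{t-1} - \\eta \\left( \\frac{\\hat{m}_t}{\\sqrt{\\hat{v}_t} + \\epsilon} + \\lambda \\, \\theta_{t-1} \\right)$$\n",
+    "\n",
+    "With the settings used in this notebook (`lr=3e-4`, `weight_decay=1e-2`), the decay term alone multiplies each weight by $1 - \\eta\\lambda = 1 - 3 \\times 10^{-6}$ per step, independent of the gradient statistics.\n",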
+    "\n",
+    "### Early stopping\n",
+    "\n",
+    "We stop training if validation MAE doesn't improve for `patience` consecutive epochs, then restore the weights from the best epoch. This limits overfitting and saves compute."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "665deadb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def run_epoch(model, loader, device, optimizer=None):\n",
+    "    \"\"\"Run one epoch of training or evaluation.\n",
+    "    \n",
+    "    The RouteGradeDataset returns a dictionary with keys:\n",
+    "    - input_ids: token IDs, shape (batch_size, seq_len)\n",
+    "    - attention_mask: binary mask, shape (batch_size, seq_len)\n",
+    "    - target: difficulty score, shape (batch_size,)\n",
+    "    - uuid: route identifiers (for logging)\n",
+    "    - board_key: board identifiers (for logging)\n",
+    "    \"\"\"\n",
+    "    is_train = optimizer is not None\n",
+    "    model.train(is_train)\n",
+    "    criterion = nn.MSELoss()\n",
+    "\n",
+    "    losses, preds, targets, uuids, boards = [], [], [], [], []\n",
+    "\n",
+    "    for batch in loader:\n",
+    "        input_ids = batch[\"input_ids\"].to(device)\n",
+    "        attention_mask = batch[\"attention_mask\"].to(device)\n",
+    "        target = batch[\"target\"].to(device)\n",
+    "\n",
+    "        if is_train:\n",
+    "            optimizer.zero_grad(set_to_none=True)\n",
+    "\n",
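+    "        # Skip building the autograd graph during evaluation\n",
+    "        with torch.set_grad_enabled(is_train):\n",
+    "            pred = model(input_ids, attention_mask)\n",
+    "            loss = criterion(pred, target)\n",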
+    "\n",
+    "        if is_train:\n",
+    "            loss.backward()\n",
+    "            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
+    "            optimizer.step()\n",
+    "\n",
+    "        losses.append(loss.item() * input_ids.size(0))\n",
+    "        preds.extend(pred.detach().cpu().numpy().tolist())\n",
+    "        targets.extend(target.detach().cpu().numpy().tolist())\n",
+    "        uuids.extend(batch[\"uuid\"])\n",
+    "        boards.extend(batch[\"board_key\"])\n",
+    "\n",
+    "    avg_loss = sum(losses) / max(1, len(loader.dataset))\n",
+    "    return avg_loss, np.asarray(preds), np.asarray(targets), uuids, boards\n",
+    "\n",
+    "\n",
+    "# Training configuration\n",
+    "num_epochs = 30\n",
+    "patience = 12\n",
+    "\n",
+    "print(f\"Max epochs: {num_epochs}\")\n",
+    "print(f\"Early stopping patience: {patience}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "35d4bd8b",
+   "metadata": {},
+   "source": [
+    "## Training Loop\n",
+    "\n",
+    "The training loop follows the standard deep learning workflow:\n",
+    "\n",
+    "1. **Forward pass**: Feed input through the model to get predictions\n",
+    "2. **Compute loss**: Compare predictions to actual grades using MSE\n",
+    "3. **Backward pass**: Compute gradients via backpropagation\n",
+    "4. **Update weights**: Adjust model parameters using the optimizer\n",
+    "5. **Validate**: Check performance on held-out validation data\n",
+    "6. **Early stopping**: Stop if validation MAE has not improved for `patience` epochs\n",
+    "\n",
+    "Each epoch we log fine-grained metrics (MAE, R²) and a practical one (the percentage of predictions within ±1 V-grade)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "476b158d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "history = []\n",
+    "best_val_mae = float(\"inf\")\n",
+    "best_state = None\n",
+    "best_epoch = 0\n",
+    "epochs_without_improvement = 0\n",
+    "\n",
+    "print(\"Starting training...\\n\")\n",
+    "\n",
+    "for epoch in range(1, num_epochs + 1):\n",
+    "    train_loss, train_pred, train_true, _, _ = run_epoch(model, train_loader, device, optimizer)\n",
+    "    val_loss, val_pred, val_true, _, _ = run_epoch(model, val_loader, device, optimizer=None)\n",
+    "\n",
+    "    train_metrics = regression_metrics(train_true, train_pred)\n",
+    "    val_metrics = regression_metrics(val_true, val_pred)\n",
+    "\n",
+    "    history.append({\n",
+    "        \"epoch\": epoch,\n",
+    "        \"train_loss\": train_loss,\n",
+    "        \"val_loss\": val_loss,\n",
+    "        \"train_mae\": train_metrics[\"mae\"],\n",
+    "        \"val_mae\": val_metrics[\"mae\"],\n",
+    "        \"train_r2\": train_metrics[\"r2\"],\n",
+    "        \"val_r2\": val_metrics[\"r2\"],\n",
+    "        \"val_within_1_vgrade\": val_metrics[\"within_1_vgrade\"],\n",
+    "    })\n",
+    "\n",
+    "    if val_metrics[\"mae\"] < best_val_mae:\n",
+    "        best_val_mae = val_metrics[\"mae\"]\n",
+    "        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}\n",
+    "        best_epoch = epoch\n",
+    "        epochs_without_improvement = 0\n",
+    "    else:\n",
+    "        epochs_without_improvement += 1\n",
+    "\n",
+    "    if epoch == 1 or epoch % 5 == 0 or epoch == best_epoch:\n",
+    "        print(\n",
+    "            f\"Epoch {epoch:03d} | \"\n",
+    "            f\"train MAE {train_metrics['mae']:.3f} | \"\n",
+    "            f\"val MAE {val_metrics['mae']:.3f} | \"\n",
+    "            f\"val R² {val_metrics['r2']:.3f} | \"\n",
+    "            f\"val ±1V {val_metrics['within_1_vgrade']:.1f}%\"\n",
+    "        )\n",
+    "\n",
+    "    if epochs_without_improvement >= patience:\n",
+    "        print(f\"Early stopping at epoch {epoch}; best epoch was {best_epoch}.\")\n",
+    "        break\n",
+    "\n",
+    "if best_state is not None:\n",
+    "    model.load_state_dict(best_state)\n",
+    "\n",
+    "print(f\"\\nTraining complete. Best epoch: {best_epoch}, Best val MAE: {best_val_mae:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "589ad448",
+   "metadata": {},
+   "source": [
+    "## Test Set Evaluation\n",
+    "\n",
+    "After training, we load the best model (based on validation MAE) and evaluate on the held-out test set. We report:\n",
+    "\n",
+    "- **MAE** (Mean Absolute Error): Average error in difficulty score points\n",
+    "- **RMSE** (Root Mean Squared Error): Penalizes large errors more\n",
+    "- **R²** (R-squared): How much variance in grades the model explains\n",
+    "- **Within ±1 difficulty**: Percentage of predictions within 1 point\n",
+    "- **Within ±1 V-grade**: Percentage of predictions within 1 V-grade\n",
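+    "\n",
+    "For reference, the standard definitions of the first three metrics, with $y_i$ the true score, $\\hat{y}_i$ the prediction, and $\\bar{y}$ the mean true score over $N$ test routes. This assumes `regression_metrics` follows the usual formulas, which this notebook does not show:\n",
+    "\n",
+    "$$\\mathrm{MAE} = \\frac{1}{N} \\sum_{i=1}^{N} \\lvert \\hat{y}_i - y_i \\rvert, \\qquad \\mathrm{RMSE} = \\sqrt{\\frac{1}{N} \\sum_{i=1}^{N} (\\hat{y}_i - y_i)^2}, \\qquad R^2 = 1 - \\frac{\\sum_{i=1}^{N} (\\hat{y}_i - y_i)^2}{\\sum_{i=1}^{N} (y_i - \\bar{y})^2}$$\n",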
+    "\n",
+    "We also break down performance by board (TB2 vs Kilter) to check whether the joint model predicts one board noticeably worse than the other."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9abc3a72",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "test_loss, test_pred, test_true, test_uuid, test_board = run_epoch(model, test_loader, device, optimizer=None)\n",
+    "overall_metrics = regression_metrics(test_true, test_pred)\n",
+    "\n",
+    "pred_df = pd.DataFrame({\n",
+    "    \"uuid\": test_uuid,\n",
+    "    \"board_key\": test_board,\n",
+    "    \"y_true\": test_true,\n",
+    "    \"y_pred\": test_pred,\n",
+    "})\n",
+    "board_metrics_df = metrics_by_board(pred_df)\n",
+    "\n",
+    "print(\"=\" * 50)\n",
+    "print(\"Overall joint test performance\")\n",
+    "print(\"=\" * 50)\n",
+    "for key, value in overall_metrics.items():\n",
+    "    suffix = \"%\" if \"within\" in key or \"exact\" in key else \"\"\n",
+    "    print(f\"{key:24s}: {value:8.4f}{suffix}\")\n",
+    "\n",
+    "print(\"\\nBoard-specific test performance:\")\n",
+    "print(board_metrics_df.to_string(index=False))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "556be142",
+   "metadata": {},
+   "source": [
+    "## Save Model and Artifacts\n",
+    "\n",
+    "We save the trained model checkpoint and evaluation metrics for use in notebook 04 (route evaluation) and for future inference."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "save_model",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save model checkpoint\n",
+    "MODEL_DIR = ROOT / \"models\"\n",
+    "MODEL_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "OUT_DIR = ROOT / \"data\" / \"processed\" / \"grade_prediction\"\n",
+    "OUT_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "# Save the full model checkpoint (needed by notebook 04)\n",
+    "checkpoint = {\n",
+    "    \"model_state_dict\": model.state_dict(),\n",
+    "    \"config\": {\n",
+    "        \"vocab_size\": len(stoi),\n",
+    "        \"max_len\": max_len,\n",
+    "        \"d_model\": 128,\n",
+    "        \"nhead\": 4,\n",
+    "        \"num_layers\": 4,\n",
+    "        \"dim_feedforward\": 256,\n",
+    "        \"dropout\": 0.10,\n",
+    "        \"pad_id\": pad_id,\n",
+    "    },\n",
+    "    \"stoi\": stoi,\n",
+    "    \"itos\": {str(k): v for k, v in itos.items()},\n",
+    "    \"coord_features\": coord_features.cpu(),\n",
+    "    \"overall_metrics\": overall_metrics,\n",
+    "}\n",
+    "model_path = MODEL_DIR / \"joint_transformer_grade_predictor.pth\"\n",
+    "torch.save(checkpoint, model_path)\n",
+    "\n",
+    "# Save training history and metrics\n",
+    "pd.DataFrame(history).to_csv(OUT_DIR / \"training_history.csv\", index=False)\n",
+    "pred_df.to_csv(OUT_DIR / \"test_predictions.csv\", index=False)\n",
+    "board_metrics_df.to_csv(OUT_DIR / \"board_metrics.csv\", index=False)\n",
+    "\n",
+    "from climbingboardgpt.utils import write_json\n",
+    "write_json(OUT_DIR / \"overall_metrics.json\", overall_metrics)\n",
+    "\n",
+    "print(f\"Saved model checkpoint to: {model_path}\")\n",
+    "print(f\"Saved training history to: {OUT_DIR / 'training_history.csv'}\")\n",
+    "print(f\"Saved test predictions to: {OUT_DIR / 'test_predictions.csv'}\")\n",
+    "print(f\"Saved board metrics to: {OUT_DIR / 'board_metrics.csv'}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "key_takeaways",
+   "metadata": {},
+   "source": [
+    "## Key Takeaways\n",
+    "\n",
+    "1. **The transformer can learn from raw token sequences** without hand-engineered features like \"mean hand reach\" or \"height gained\". The self-attention mechanism lets it discover these patterns.\n",
+    "\n",
+    "2. **Coordinate features help**: Injecting physical (x, y) position information gives the model a strong prior about spatial relationships, similar to how positional embeddings help language models.\n",
+    "\n",
+    "3. **Joint training across boards**: By training on both TB2 and Kilter data simultaneously, the model can share statistical strength. The board token (`<BOARD_TB2>` vs `<BOARD_KILTER>`) tells it which \"language\" it's operating in.\n",
+    "\n",
+    "4. **The gap between fine-grained and grouped metrics**: Being off by 1 difficulty point often stays within the same V-grade bucket. This is why the ±1 V-grade accuracy is much higher than the ±1 difficulty accuracy."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.14.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,518 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "27197e7d",
+   "metadata": {},
+   "source": [
+    "# 03 — Joint nanoGPT-style Route Generation\n",
+    "\n",
+    "## From Understanding to Generation\n",
+    "\n",
+    "Notebook 02 used a **transformer encoder** (BERT-style) to *understand* routes and predict their grade. This notebook uses a **transformer decoder** (GPT-style) to *generate* new routes.\n",
+    "\n",
+    "### The key difference: Encoder vs Decoder\n",
+    "\n",
+    "| Aspect | BERT-style (Encoder) | GPT-style (Decoder) |\n",
+    "|---|---|---|\n",
+    "| Attention | Bidirectional (sees all tokens) | Causal (only sees past tokens) |\n",
+    "| Training | Masked language modeling | Next-token prediction |\n",
+    "| Use case | Classification, regression | Text generation |\n",
+    "| Output | Single prediction per sequence | One prediction per position |\n",
+    "\n",
+    "### How GPT-style generation works\n",
+    "\n",
+    "The model is trained to predict the **next token** given all previous tokens:\n",
+    "\n",
+    "```text\n",
+    "Input:  <BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6>\n",
+    "Target: <BOARD_TB2> <ANGLE_40> <GRADE_V6> TB2_p344_start\n",
+    "```\n",
+    "\n",
+    "At generation time, we:\n",
+    "1. Start with a prompt like `<BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6>`\n",
+    "2. Ask the model to predict the next token\n",
+    "3. Sample from the predicted probability distribution\n",
+    "4. Append the sampled token to the sequence\n",
+    "5. Repeat until we generate `<EOS>` or hit a max length\n",
+    "\n",
+    "### Conditioning on board, angle, and grade\n",
+    "\n",
+    "The prompt tokens tell the model *what kind of route to generate*:\n",
+    "- `<BOARD_TB2>`: Generate a route for the Tension Board 2\n",
+    "- `<ANGLE_40>`: At 40 degrees\n",
+    "- `<GRADE_V6>`: At V6 difficulty\n",
+    "\n",
+    "This is analogous to how ChatGPT uses a system prompt to condition its responses."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b6590822",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "import sys\n",
+    "import json\n",
+    "import math\n",
+    "import pandas as pd\n",
+    "import torch\n",
+    "from torch.utils.data import DataLoader\n",
+    "\n",
+    "ROOT = Path.cwd().resolve()\n",
+    "if ROOT.name == \"notebooks\":\n",
+    "    ROOT = ROOT.parent\n",
+    "sys.path.insert(0, str(ROOT / \"src\"))\n",
+    "\n",
+    "from climbingboardgpt.config import load_board_configs\n",
+    "from climbingboardgpt.datasets import RouteGPTDataset\n",
+    "from climbingboardgpt.generation import generate_one\n",
+    "from climbingboardgpt.models import JointRouteGPT"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f09fdf54",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "TOKENIZED = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
+    "df_routes = pd.read_csv(TOKENIZED / \"route_sequences.csv\")\n",
+    "vocab = json.loads((TOKENIZED / \"token_vocab.json\").read_text(encoding=\"utf-8\"))\n",
+    "stoi = {str(k): int(v) for k, v in vocab[\"stoi\"].items()}\n",
+    "itos = {int(k): str(v) for k, v in vocab[\"itos\"].items()}\n",
+    "\n",
+    "pad_id = stoi[\"<PAD>\"]\n",
+    "unk_id = stoi[\"<UNK>\"]\n",
+    "\n",
+    "print(f\"Vocabulary size: {len(stoi):,}\")\n",
+    "print(f\"Total routes: {len(df_routes):,}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe4b0faf",
+   "metadata": {},
+   "source": [
+    "## Sequence encoding for causal language modeling\n",
+    "\n",
+    "### The autoregressive setup\n",
+    "\n",
+    "For GPT-style training, each route becomes a sequence where the model learns to predict each token given all previous tokens:\n",
+    "\n",
+    "```text\n",
+    "Input:   <BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6> TB2_p344_start TB2_p369_middle\n",
+    "Target:  <BOARD_TB2> <ANGLE_40> <GRADE_V6> TB2_p344_start TB2_p369_middle TB2_p603_finish\n",
+    "```\n",
+    "\n",
+    "The target is the input shifted left by one position: at every index, the target holds the token that comes next in the route. This is the standard causal language modeling setup.\n",
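+    "\n",
+    "In symbols (introduced here for illustration, and assuming `RouteGPTDataset` implements the usual shift), a route of $T$ tokens $x_1, \\dots, x_T$ yields\n",
+    "\n",
+    "$$\\text{input} = (x_1, \\dots, x_{T-1}), \\qquad \\text{target} = (x_2, \\dots, x_T)$$\n",
+    "\n",
+    "so both have length $T - 1$, which is why the code below sets `block_size = max_len - 1`.\n",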
+    "\n",
+    "### Why include the grade in the training sequence?\n",
+    "\n",
+    "For the grade predictor (notebook 02), we excluded the grade because the model needed to predict it. But for the generator, we **include** the grade (`<GRADE_V6>`) in the training data so the model learns the relationship between grade and hold selection.\n",
+    "\n",
+    "At generation time, we provide the grade as part of the prompt, and the model generates holds that are appropriate for that grade."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ad61dbd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def encode(tokens):\n",
+    "    \"\"\"Convert token strings to integer IDs.\"\"\"\n",
+    "    return [stoi.get(token, unk_id) for token in tokens]\n",
+    "\n",
+    "# Use the \"with grade\" version for GPT training\n",
+    "# The model needs to see the grade to learn grade-hold relationships\n",
+    "df_routes[\"gpt_tokens\"] = df_routes[\"sequence_with_grade\"].fillna(\"\").str.split()\n",
+    "df_routes[\"gpt_ids\"] = df_routes[\"gpt_tokens\"].apply(encode)\n",
+    "df_routes[\"seq_len\"] = df_routes[\"gpt_ids\"].apply(len)\n",
+    "max_len = int(df_routes[\"seq_len\"].max())\n",
+    "block_size = max_len - 1  # Input length (one less than full sequence)\n",
+    "\n",
+    "# Create train/val splits\n",
+    "train_df = df_routes[df_routes[\"split\"] == \"train\"].reset_index(drop=True)\n",
+    "val_df = df_routes[df_routes[\"split\"] == \"val\"].reset_index(drop=True)\n",
+    "\n",
+    "# Create datasets and data loaders\n",
+    "# RouteGPTDataset handles the input/target shift for causal modeling\n",
+    "train_ds = RouteGPTDataset(train_df, max_len=max_len, pad_id=pad_id)\n",
+    "val_ds = RouteGPTDataset(val_df, max_len=max_len, pad_id=pad_id)\n",
+    "train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)\n",
+    "val_loader = DataLoader(val_ds, batch_size=128, shuffle=False)\n",
+    "\n",
+    "print(f\"Max sequence length: {max_len}\")\n",
+    "print(f\"Block size (input length): {block_size}\")\n",
+    "print(f\"Training samples: {len(train_ds):,}\")\n",
+    "print(f\"Validation samples: {len(val_ds):,}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "66d98641",
+   "metadata": {},
+   "source": [
+    "## The GPT Model Architecture\n",
+    "\n",
+    "### JointRouteGPT\n",
+    "\n",
+    "This is a **causal transformer decoder** — the same architecture used in GPT-2, GPT-3, etc., but much smaller:\n",
+    "\n",
+    "1. **Token embeddings**: Convert integer token IDs to dense vectors\n",
+    "2. **Positional embeddings**: Learned position vectors (not sinusoidal)\n",
+    "3. **Causal self-attention**: Each position can only attend to previous positions (via a causal mask)\n",
+    "4. **Transformer layers**: Multiple layers of attention + feedforward\n",
+    "5. **Language modeling head**: Projects hidden states to vocabulary logits\n",
+    "\n",
+    "### Key hyperparameters\n",
+    "\n",
+    "- `n_embd=128`: Embedding dimension (GPT-2 small uses 768)\n",
+    "- `n_head=4`: Number of attention heads\n",
+    "- `n_layer=4`: Number of transformer layers (GPT-2 small uses 12)\n",
+    "- `dropout=0.10`: Dropout probability\n",
+    "\n",
+    "This is intentionally small — we're training on ~40K short sequences, not billions of long documents.\n",
+    "\n",
+    "### Weight tying\n",
+    "\n",
+    "The output projection layer shares weights with the token embedding layer (`self.lm_head.weight = self.token_emb.weight`). This is a common technique that:\n",
+    "- Reduces parameter count\n",
+    "- Acts as a regularizer\n",
+    "- Is used in GPT-2 and many other language models"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3eec6f35",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+    "\n",
+    "model = JointRouteGPT(\n",
+    "    vocab_size=len(stoi),\n",
+    "    block_size=block_size,\n",
+    "    n_embd=128,\n",
+    "    n_head=4,\n",
+    "    n_layer=4,\n",
+    "    dropout=0.10,\n",
+    "    pad_id=pad_id,\n",
+    ").to(device)\n",
+    "\n",
+    "optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)\n",
+    "\n",
+    "print(f\"Device: {device}\")\n",
+    "print(f\"Total parameters: {sum(p.numel() for p in model.parameters()):,}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f999cf05",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def train_epoch():\n",
+    "    \"\"\"Train for one epoch.\"\"\"\n",
+    "    model.train()\n",
+    "    losses = []\n",
+    "    n = 0\n",
+    "    for batch in train_loader:\n",
+    "        x = batch[\"input_ids\"].to(device)\n",
+    "        y = batch[\"target_ids\"].to(device)\n",
+    "        \n",
+    "        optimizer.zero_grad(set_to_none=True)\n",
+    "        _, loss = model(x, y)\n",
+    "        loss.backward()\n",
+    "        \n",
+    "        # Gradient clipping prevents exploding gradients\n",
+    "        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
+    "        \n",
+    "        optimizer.step()\n",
+    "        losses.append(loss.item() * x.size(0))\n",
+    "        n += x.size(0)\n",
+    "    return sum(losses) / max(1, n)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def eval_loss(loader):\n",
+    "    \"\"\"Evaluate loss on a data loader.\"\"\"\n",
+    "    model.eval()\n",
+    "    losses = []\n",
+    "    n = 0\n",
+    "    for batch in loader:\n",
+    "        x = batch[\"input_ids\"].to(device)\n",
+    "        y = batch[\"target_ids\"].to(device)\n",
+    "        _, loss = model(x, y)\n",
+    "        losses.append(loss.item() * x.size(0))\n",
+    "        n += x.size(0)\n",
+    "    return sum(losses) / max(1, n)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "51fb8b6e",
+   "metadata": {},
+   "source": [
+    "## Training\n",
+    "\n",
+    "### What we're optimizing\n",
+    "\n",
+    "The model minimizes **cross-entropy loss** — the standard loss function for language modeling. At each position, the model outputs a probability distribution over the entire vocabulary, and the loss measures how surprised it is by the actual next token.\n",
+    "\n",
+    "### Perplexity\n",
+    "\n",
+    "We also track **perplexity**, which is `exp(loss)`. Perplexity answers the question: \"On average, how many tokens was the model choosing between at each step?\" Lower perplexity = better model.\n",
+    "\n",
+    "For reference:\n",
+    "- A model that always predicts the right token has perplexity = 1\n",
+    "- A model that picks uniformly from a 1000-token vocab has perplexity = 1000\n",
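+    "- Published English language models typically report perplexities in the tens, but the figure depends heavily on the tokenizer and benchmark, so it is not directly comparable to ours\n",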
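+    "\n",
+    "The first two reference values follow from the definition. Written out, with $x_t$ the token at position $t$ and $T$ the number of predicted positions (symbols introduced here for illustration), the mean cross-entropy and the perplexity are:\n",
+    "\n",
+    "$$\\mathcal{L} = -\\frac{1}{T} \\sum_{t=1}^{T} \\log p(x_t \\mid x_1, \\dots, x_{t-1}), \\qquad \\mathrm{PPL} = \\exp(\\mathcal{L})$$\n",
+    "\n",
+    "If the model assigns probability 1 to every correct token, each term is $\\log 1 = 0$, so $\\mathrm{PPL} = e^0 = 1$. If it spreads probability uniformly over $V$ tokens, each term contributes $\\log V$ to the loss, so $\\mathcal{L} = \\log V$ and $\\mathrm{PPL} = V$; for $V = 1000$ that is a loss of $\\ln 1000 \\approx 6.91$.\n",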
+    "\n",
+    "Our vocabulary has roughly 4,000 tokens, so uniform guessing would give a perplexity of about 4,000. A validation perplexity far below that indicates the model is learning meaningful patterns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "70b38b02",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "history = []\n",
+    "best_val_loss = float(\"inf\")\n",
+    "best_state = None\n",
+    "patience = 10\n",
+    "stagnant = 0\n",
+    "\n",
+    "print(\"Starting GPT training...\\n\")\n",
+    "\n",
+    "for epoch in range(1, 21):\n",
+    "    train_loss = train_epoch()\n",
+    "    val_loss = eval_loss(val_loader)\n",
+    "    \n",
+    "    # Track perplexity (exponentiated loss)\n",
+    "    train_ppl = math.exp(min(train_loss, 20))\n",
+    "    val_ppl = math.exp(min(val_loss, 20))\n",
+    "    \n",
+    "    history.append({\n",
+    "        \"epoch\": epoch,\n",
+    "        \"train_loss\": train_loss,\n",
+    "        \"val_loss\": val_loss,\n",
+    "        \"train_perplexity\": train_ppl,\n",
+    "        \"val_perplexity\": val_ppl,\n",
+    "    })\n",
+    "    \n",
+    "    # Early stopping\n",
+    "    if val_loss < best_val_loss:\n",
+    "        best_val_loss = val_loss\n",
+    "        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}\n",
+    "        stagnant = 0\n",
+    "    else:\n",
+    "        stagnant += 1\n",
+    "    \n",
+    "    if epoch == 1 or epoch % 5 == 0:\n",
+    "        print(f\"Epoch {epoch:3d} | \"\n",
+    "              f\"Train Loss: {train_loss:.4f} | \"\n",
+    "              f\"Val Loss: {val_loss:.4f} | \"\n",
+    "              f\"Val PPL: {val_ppl:.1f}\")\n",
+    "    \n",
+    "    if stagnant >= patience:\n",
+    "        print(f\"\\nEarly stopping at epoch {epoch}\")\n",
+    "        break\n",
+    "\n",
+    "# Load best model\n",
+    "if best_state is not None:\n",
+    "    model.load_state_dict(best_state)\n",
+    "\n",
+    "print(f\"\\nBest validation loss: {best_val_loss:.4f}\")\n",
+    "print(f\"Best validation perplexity: {math.exp(min(best_val_loss, 20)):.1f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "69926180",
+   "metadata": {},
+   "source": [
+    "## Generating Routes\n",
+    "\n",
+    "### The generation process\n",
+    "\n",
+    "To generate a route, we:\n",
+    "\n",
+    "1. **Create a prompt**: `<BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6>`\n",
+    "2. **Feed it to the model**: Get a probability distribution over the vocabulary for the next token\n",
+    "3. **Sample a token**: Use temperature and top-k filtering to control randomness\n",
+    "4. **Append and repeat**: Add the sampled token to the sequence and repeat until `<EOS>` or max length\n",
+    "\n",
+    "### Temperature and top-k sampling\n",
+    "\n",
+    "- **Temperature** (0.9 here): Rescales the logits before sampling. Values below 1 sharpen the distribution (more deterministic); values above 1 flatten it (more random)\n",
+    "- **Top-k** (50 here): Only the k most likely tokens remain candidates, so the model cannot sample very unlikely tokens.\n",
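+    "\n",
+    "As a sketch of how the two settings interact, assuming the common nanoGPT-style order of operations (the internals of `generate_one` are not shown in this notebook): let $z_i$ be the logit of token $i$, $\\tau$ the temperature, and $K$ the set of the $k$ tokens with the largest logits. The sampling distribution is then\n",
+    "\n",
+    "$$p_i = \\frac{\\exp(z_i / \\tau)}{\\sum_{j \\in K} \\exp(z_j / \\tau)} \\quad \\text{for } i \\in K, \\qquad p_i = 0 \\quad \\text{otherwise}$$\n",
+    "\n",
+    "As $\\tau \\to 0$ this approaches greedy decoding (always the top token), and $\\tau = 1$ with $k$ equal to the vocabulary size recovers the model's unmodified distribution.\n",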
+    "\n",
+    "These are the same techniques used in language models like GPT-3 to control output diversity."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "029eb911",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Generate sample routes for both boards\n",
+    "configs = load_board_configs([\"tb2\", \"kilter\"])\n",
+    "configs_by_key = {config.board_key: config for config in configs}\n",
+    "\n",
+    "samples = []\n",
+    "for board_key, config in configs_by_key.items():\n",
+    "    for grouped_v in [3, 5, 7]:  # V3, V5, V7\n",
+    "        sample = generate_one(\n",
+    "            model=model,\n",
+    "            stoi=stoi,\n",
+    "            itos=itos,\n",
+    "            device=device,\n",
+    "            board_prefix=config.token_prefix,\n",
+    "            angle=40,\n",
+    "            grouped_v=grouped_v,\n",
+    "            role_name_to_id=config.role_definitions,\n",
+    "            temperature=0.9,\n",
+    "            top_k=50,\n",
+    "            max_new_tokens=40,\n",
+    "        )\n",
+    "        samples.append({\"board_key\": board_key, **sample})\n",
+    "\n",
+    "samples_df = pd.DataFrame(samples)\n",
+    "print(\"Generated route samples:\")\n",
+    "print(samples_df[[\"board_key\", \"requested_grouped_v\", \"basic_valid\", \"sequence\", \"frames\"]])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "generate_more",
+   "metadata": {},
+   "source": [
+    "## Generate More Routes for Evaluation\n",
+    "\n",
+    "Notebook 04 needs a larger set of generated routes for meaningful evaluation. Let's generate routes across multiple angles and grades for both boards."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "generate_bulk",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Generate routes across multiple angles and grades for evaluation\n",
+    "all_samples = []\n",
+    "\n",
+    "for board_key, config in configs_by_key.items():\n",
+    "    # Get common angles and grades for this board\n",
+    "    board_df = df_routes[df_routes[\"board_key\"] == board_key]\n",
+    "    common_angles = sorted(board_df[\"angle\"].astype(int).value_counts().head(5).index.tolist())\n",
+    "    common_grades = sorted(board_df[\"grouped_v\"].astype(int).value_counts().head(8).index.tolist())\n",
+    "    \n",
+    "    print(f\"\\nGenerating for {config.display_name}:\")\n",
+    "    print(f\"  Angles: {common_angles}\")\n",
+    "    print(f\"  Grades: V{min(common_grades)}-V{max(common_grades)}\")\n",
+    "    \n",
+    "    for angle in common_angles:\n",
+    "        for grade in common_grades:\n",
+    "            for _ in range(5):  # 5 samples per condition\n",
+    "                sample = generate_one(\n",
+    "                    model=model,\n",
+    "                    stoi=stoi,\n",
+    "                    itos=itos,\n",
+    "                    device=device,\n",
+    "                    board_prefix=config.token_prefix,\n",
+    "                    angle=int(angle),\n",
+    "                    grouped_v=int(grade),\n",
+    "                    role_name_to_id=config.role_definitions,\n",
+    "                    temperature=0.9,\n",
+    "                    top_k=50,\n",
+    "                    max_new_tokens=40,\n",
+    "                )\n",
+    "                all_samples.append({\"board_key\": board_key, **sample})\n",
+    "\n",
+    "all_samples_df = pd.DataFrame(all_samples)\n",
+    "print(f\"\\nTotal generated routes: {len(all_samples_df):,}\")\n",
+    "print(\"\\nBasic validity by board:\")\n",
+    "print(all_samples_df.groupby(\"board_key\")[\"basic_valid\"].mean())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "save_artifacts",
+   "metadata": {},
+   "source": [
+    "## Save Model and Generated Routes\n",
+    "\n",
+    "We save the trained model checkpoint and generated routes for use in notebook 04 (evaluation)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "save_outputs",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save model checkpoint\n",
+    "MODEL_DIR = ROOT / \"models\"\n",
+    "MODEL_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "checkpoint = {\n",
+    "    \"model_state_dict\": model.state_dict(),\n",
+    "    \"config\": {\n",
+    "        \"vocab_size\": len(stoi),\n",
+    "        \"block_size\": block_size,\n",
+    "        \"n_embd\": 128,\n",
+    "        \"n_head\": 4,\n",
+    "        \"n_layer\": 4,\n",
+    "        \"dropout\": 0.10,\n",
+    "        \"pad_id\": pad_id,\n",
+    "    },\n",
+    "    \"stoi\": stoi,\n",
+    "    \"itos\": {str(k): v for k, v in itos.items()},\n",
+    "    \"best_val_loss\": best_val_loss,\n",
+    "}\n",
+    "model_path = MODEL_DIR / \"joint_route_gpt_generator.pth\"\n",
+    "torch.save(checkpoint, model_path)\n",
+    "print(f\"Saved model checkpoint to: {model_path}\")\n",
+    "\n",
+    "# Save training history\n",
+    "GEN_DIR = ROOT / \"data\" / \"processed\" / \"generation\"\n",
+    "GEN_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "pd.DataFrame(history).to_csv(GEN_DIR / \"training_history.csv\", index=False)\n",
+    "print(f\"Saved training history to: {GEN_DIR / 'training_history.csv'}\")\n",
+    "\n",
+    "# Save generated routes (this is what notebook 04 needs)\n",
+    "all_samples_df.to_csv(GEN_DIR / \"generated_routes.csv\", index=False)\n",
+    "print(f\"Saved {len(all_samples_df)} generated routes to: {GEN_DIR / 'generated_routes.csv'}\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,513 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "84d328e0",
+   "metadata": {},
+   "source": [
+    "# 04 — Generated Route Evaluation\n",
+    "\n",
+    "## Why evaluate generated routes?\n",
+    "\n",
+    "In language modeling, we evaluate generated text using metrics like BLEU, ROUGE, or perplexity. For climbing routes, we need domain-specific evaluation:\n",
+    "\n",
+    "1. **Validity**: Does the route follow the rules of climbing boards?\n",
+    "2. **Novelty**: Is the route different from existing climbs, or just a copy?\n",
+    "3. **Geometric plausibility**: Are the holds in reasonable positions?\n",
+    "4. **Grade consistency**: Does the route's predicted grade match the requested grade?\n",
+    "\n",
+    "### Validity checks\n",
+    "\n",
+    "A \"basic valid\" route must have:\n",
+    "- At least 3 holds (you need at least 2 hands + 1 foot to climb)\n",
+    "- No duplicate placements (you can't use the same hold twice)\n",
+    "- At least one start hold and one finish hold\n",
+    "- All holds from the same board (no mixing TB2 and Kilter holds)\n",
+    "\n",
+    "A \"strict valid\" route additionally has:\n",
+    "- At least one middle hold (most real climbs have more than just start + finish)\n",
+    "- At least 4 holds total\n",
+    "\n",
+    "### Novelty metrics\n",
+    "\n",
+    "We measure novelty using **Jaccard distance**: 1 minus the Jaccard similarity between the generated route's hold set and the most similar real route's hold set.\n",
+    "\n",
+    "- Jaccard similarity = $|A \\cap B| / |A \\cup B|$, where $A$ is the generated route's set of placements and $B$ is a real route's set of placements\n",
+    "- Novelty distance = 1 - Jaccard similarity\n",
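+    "\n",
+    "As a worked illustration with made-up placement IDs (an assumption for the example, not values from the dataset), suppose the generated route uses $A = \\{10, 11, 12, 13\\}$ and its nearest real route uses $B = \\{12, 13, 14\\}$:\n",
+    "\n",
+    "$$J(A, B) = \\frac{|A \\cap B|}{|A \\cup B|} = \\frac{|\\{12, 13\\}|}{|\\{10, 11, 12, 13, 14\\}|} = \\frac{2}{5} = 0.4, \\qquad \\text{novelty distance} = 1 - 0.4 = 0.6$$\n",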
+    "\n",
+    "A novelty distance of 1.0 means the generated route shares no holds with any real route on the same board. A distance of 0.0 means it uses exactly the same set of placements as an existing route; hold roles are not part of the comparison."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "726b846f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "import sys\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import torch\n",
+    "\n",
+    "ROOT = Path.cwd().resolve()\n",
+    "if ROOT.name == \"notebooks\":\n",
+    "    ROOT = ROOT.parent\n",
+    "sys.path.insert(0, str(ROOT / \"src\"))\n",
+    "\n",
+    "from climbingboardgpt.evaluation import (\n",
+    "    build_placement_coords,\n",
+    "    frames_to_holds,\n",
+    "    holds_to_placement_set,\n",
+    "    nearest_real_route_same_board,\n",
+    "    parse_token_list,\n",
+    "    simple_route_features,\n",
+    "    tokens_to_hold_records,\n",
+    "    validity_from_records,\n",
+    ")\n",
+    "from climbingboardgpt.grades import to_grouped_v\n",
+    "from climbingboardgpt.models import JointRouteTransformerRegressor"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7f8bb61f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load generated routes and real routes for comparison\n",
+    "# NOTE: This notebook requires that you've run notebook 03 first to\n",
+    "# generate and save the routes.\n",
+    "\n",
+    "TOKENIZED = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
+    "GENERATED = ROOT / \"data\" / \"processed\" / \"generation\"\n",
+    "\n",
+    "# Check if required files exist\n",
+    "generated_path = GENERATED / \"generated_routes.csv\"\n",
+    "routes_path = TOKENIZED / \"route_sequences.csv\"\n",
+    "token_meta_path = TOKENIZED / \"token_metadata.csv\"\n",
+    "\n",
+    "if not generated_path.exists():\n",
+    "    raise FileNotFoundError(\n",
+    "        f\"Generated routes not found at: {generated_path}\\n\"\n",
+    "        f\"Please run notebook 03 first to generate and save routes,\\n\"\n",
+    "        f\"or run: python scripts/03_train_route_generator.py\"\n",
+    "    )\n",
+    "\n",
+    "if not routes_path.exists() or not token_meta_path.exists():\n",
+    "    raise FileNotFoundError(\n",
+    "        f\"Tokenized data not found at: {TOKENIZED}\\n\"\n",
+    "        f\"Please run notebook 01 first to tokenize routes,\\n\"\n",
+    "        f\"or run: python scripts/01_tokenize_routes.py\"\n",
+    "    )\n",
+    "\n",
+    "df_generated = pd.read_csv(generated_path)\n",
+    "df_real = pd.read_csv(routes_path)\n",
+    "df_token_meta = pd.read_csv(token_meta_path)\n",
+    "\n",
+    "print(f\"Generated routes: {len(df_generated):,}\")\n",
+    "print(f\"Real routes: {len(df_real):,}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0091bafb",
+   "metadata": {},
+   "source": [
+    "## Parse generated tokens and check validity\n",
+    "\n",
+    "We parse the generated token sequences and check each route for validity."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f5c2b25a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Parse the token strings into structured records\n",
+    "df_generated[\"tokens_parsed\"] = df_generated[\"tokens\"].apply(parse_token_list)\n",
+    "\n",
+    "# Extract hold information from tokens\n",
+    "df_generated[\"hold_records\"] = df_generated[\"tokens_parsed\"].apply(tokens_to_hold_records)\n",
+    "\n",
+    "# Check validity for each generated route\n",
+    "validity = pd.DataFrame(df_generated[\"hold_records\"].apply(validity_from_records).tolist())\n",
+    "df_eval = pd.concat([df_generated.reset_index(drop=True), validity], axis=1)\n",
+    "\n",
+    "print(\"Validity rates by board:\")\n",
+    "print(\"=\" * 50)\n",
+    "validity_summary = df_eval.groupby(\"board_key\").agg(\n",
+    "    total=(\"basic_valid_eval\", \"count\"),\n",
+    "    basic_valid_rate=(\"basic_valid_eval\", \"mean\"),\n",
+    "    strict_valid_rate=(\"strict_valid_eval\", \"mean\"),\n",
+    "    avg_holds=(\"n_holds_eval\", \"mean\"),\n",
+    ").round(3)\n",
+    "print(validity_summary)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0cf2170e",
+   "metadata": {},
+   "source": [
+    "## Novelty against real climbs\n",
+    "\n",
+    "For each generated route, we find the most similar real route from the same board (by Jaccard similarity of hold sets). A good generator should produce routes that are novel (low Jaccard similarity to existing routes) while still being valid."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e7f34524",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Reduce each generated route to a frozenset of placement IDs (roles are dropped)\n",
+    "df_eval[\"hold_set\"] = df_eval[\"hold_records\"].apply(\n",
+    "    lambda records: frozenset(int(record[\"placement_id\"]) for record in records)\n",
+    ")\n",
+    "\n",
+    "# Parse real routes' frames strings into hold sets\n",
+    "df_real[\"real_holds\"] = df_real[\"frames\"].apply(frames_to_holds)\n",
+    "df_real[\"hold_set\"] = df_real[\"real_holds\"].apply(holds_to_placement_set)\n",
+    "\n",
+    "# Find nearest real route for each generated route\n",
+    "print(\"Computing novelty (finding nearest real route for each generated route)...\")\n",
+    "print(\"This may take a few minutes...\")\n",
+    "\n",
+    "nearest = pd.DataFrame(\n",
+    "    df_eval.apply(\n",
+    "        lambda row: nearest_real_route_same_board(\n",
+    "            generated_set=row[\"hold_set\"],\n",
+    "            generated_board_key=row[\"board_key\"],\n",
+    "            real_df=df_real,\n",
+    "        ),\n",
+    "        axis=1,\n",
+    "    ).tolist()\n",
+    ")\n",
+    "df_eval = pd.concat([df_eval, nearest], axis=1)\n",
+    "\n",
+    "print(\"\\nNovelty statistics by board:\")\n",
+    "print(\"=\" * 50)\n",
+    "novelty_summary = df_eval.groupby(\"board_key\").agg(\n",
+    "    mean_jaccard=(\"nearest_real_jaccard\", \"mean\"),\n",
+    "    mean_novelty=(\"novelty_distance\", \"mean\"),\n",
+    "    median_novelty=(\"novelty_distance\", \"median\"),\n",
+    ").round(3)\n",
+    "print(novelty_summary)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b581705d",
+   "metadata": {},
+   "source": [
+    "## Geometric descriptors\n",
+    "\n",
+    "We compute simple geometric features for each generated route:\n",
+    "\n",
+    "- `geom_n_holds`: Number of holds\n",
+    "- `geom_height`: Vertical extent of the route\n",
+    "- `geom_width`: Horizontal extent\n",
+    "- `geom_mean_hand_reach`: Average distance between hand holds\n",
+    "\n",
+    "These features help us understand whether generated routes have reasonable spatial properties."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d74d4cad",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Build coordinate lookup from token metadata\n",
+    "coords = build_placement_coords(df_token_meta)\n",
+    "\n",
+    "# Compute geometric features for each generated route\n",
+    "geom = pd.DataFrame(\n",
+    "    df_eval.apply(\n",
+    "        lambda row: simple_route_features(\n",
+    "            board_key=row[\"board_key\"],\n",
+    "            records=row[\"hold_records\"],\n",
+    "            placement_coords=coords,\n",
+    "        ),\n",
+    "        axis=1,\n",
+    "    ).tolist()\n",
+    ")\n",
+    "df_eval = pd.concat([df_eval, geom], axis=1)\n",
+    "\n",
+    "print(\"Geometric feature statistics by board:\")\n",
+    "print(\"=\" * 50)\n",
+    "geom_summary = df_eval.groupby(\"board_key\").agg(\n",
+    "    mean_holds=(\"geom_n_holds\", \"mean\"),\n",
+    "    mean_height=(\"geom_height\", \"mean\"),\n",
+    "    mean_width=(\"geom_width\", \"mean\"),\n",
+    "    mean_hand_reach=(\"geom_mean_hand_reach\", \"mean\"),\n",
+    ").round(3)\n",
+    "print(geom_summary)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4455557a",
+   "metadata": {},
+   "source": [
+    "## Grade consistency (using the trained critic)\n",
+    "\n",
+    "If we have a trained grade predictor (from notebook 02), we can use it as a **critic** to check whether generated routes have grades consistent with what was requested.\n",
+    "\n",
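+    "This is loosely analogous to how GANs use a discriminator to judge generated samples, with two differences: our critic is a regression model rather than a binary classifier, and here it only scores routes after generation instead of providing a training signal to the generator."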
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "88747d6e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Try to load the grade critic from notebook 02\n",
+    "GRADE_MODEL_PATH = ROOT / \"models\" / \"joint_transformer_grade_predictor.pth\"\n",
+    "\n",
+    "def load_grade_critic(model_path, device):\n",
+    "    \"\"\"Load the trained grade predictor model.\"\"\"\n",
+    "    if not model_path.exists():\n",
+    "        return None\n",
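+    "    # weights_only=False performs a full unpickle, so only load checkpoints you trust.\n",
+    "    # PyTorch releases that predate the weights_only argument raise TypeError,\n",
+    "    # hence the fallback call without it.\n",
+    "    try:\n",
+    "        checkpoint = torch.load(model_path, map_location=device, weights_only=False)\n",
+    "    except TypeError:\n",
+    "        checkpoint = torch.load(model_path, map_location=device)\n",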
+    "\n",
+    "    cfg = checkpoint[\"config\"]\n",
+    "    stoi = {str(k): int(v) for k, v in checkpoint[\"stoi\"].items()}\n",
+    "    coord_features = checkpoint[\"coord_features\"]\n",
+    "    if not isinstance(coord_features, torch.Tensor):\n",
+    "        coord_features = torch.tensor(coord_features, dtype=torch.float32)\n",
+    "\n",
+    "    model = JointRouteTransformerRegressor(\n",
+    "        vocab_size=cfg[\"vocab_size\"],\n",
+    "        max_len=cfg[\"max_len\"],\n",
+    "        coord_features=coord_features,\n",
+    "        d_model=cfg.get(\"d_model\", 128),\n",
+    "        nhead=cfg.get(\"nhead\", 4),\n",
+    "        num_layers=cfg.get(\"num_layers\", 4),\n",
+    "        dim_feedforward=cfg.get(\"dim_feedforward\", 256),\n",
+    "        dropout=cfg.get(\"dropout\", 0.10),\n",
+    "        pad_id=cfg.get(\"pad_id\", stoi[\"<PAD>\"]),\n",
+    "    ).to(device)\n",
+    "    model.load_state_dict(checkpoint[\"model_state_dict\"])\n",
+    "    model.eval()\n",
+    "\n",
+    "    return {\n",
+    "        \"model\": model,\n",
+    "        \"stoi\": stoi,\n",
+    "        \"pad_id\": stoi[\"<PAD>\"],\n",
+    "        \"unk_id\": stoi[\"<UNK>\"],\n",
+    "        \"max_len\": cfg[\"max_len\"],\n",
+    "    }\n",
+    "\n",
+    "\n",
+    "def predict_generated_grade(tokens, critic, device):\n",
+    "    \"\"\"Predict the difficulty of a generated route using the critic.\"\"\"\n",
+    "    model = critic[\"model\"]\n",
+    "    stoi = critic[\"stoi\"]\n",
+    "    pad_id = critic[\"pad_id\"]\n",
+    "    unk_id = critic[\"unk_id\"]\n",
+    "    max_len = critic[\"max_len\"]\n",
+    "\n",
+    "    # Remove grade tokens and replace BOS with CLS\n",
+    "    tokens = [t for t in tokens if not t.startswith(\"<GRADE_\")]\n",
+    "    if tokens and tokens[0] == \"<BOS>\":\n",
+    "        tokens = [\"<CLS>\"] + tokens[1:]\n",
+    "    else:\n",
+    "        tokens = [\"<CLS>\"] + tokens\n",
+    "\n",
+    "    ids = [stoi.get(t, unk_id) for t in tokens][:max_len]\n",
+    "    mask = [1] * len(ids)\n",
+    "    if len(ids) < max_len:\n",
+    "        pad_n = max_len - len(ids)\n",
+    "        ids += [pad_id] * pad_n\n",
+    "        mask += [0] * pad_n\n",
+    "\n",
+    "    with torch.no_grad():\n",
+    "        input_ids = torch.tensor([ids], dtype=torch.long, device=device)\n",
+    "        attention_mask = torch.tensor([mask], dtype=torch.bool, device=device)\n",
+    "        return float(model(input_ids, attention_mask).cpu().item())\n",
+    "\n",
+    "\n",
+    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+    "critic = load_grade_critic(GRADE_MODEL_PATH, device)\n",
+    "\n",
+    "if critic is not None:\n",
+    "    print(\"Grade critic loaded successfully!\")\n",
+    "    print(f\"Device: {device}\")\n",
+    "else:\n",
+    "    print(\"No trained grade critic found. Skipping critic-based scoring.\")\n",
+    "    print(\"Run notebook 02 first to train the grade predictor.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "critic_eval",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Apply the critic to evaluate grade consistency\n",
+    "if critic is not None:\n",
+    "    df_eval[\"critic_pred_display_difficulty\"] = df_eval[\"tokens_parsed\"].apply(\n",
+    "        lambda tokens: predict_generated_grade(tokens, critic, device)\n",
+    "    )\n",
+    "    df_eval[\"critic_pred_grouped_v\"] = df_eval[\"critic_pred_display_difficulty\"].apply(to_grouped_v)\n",
+    "    df_eval[\"critic_v_error\"] = df_eval[\"critic_pred_grouped_v\"] - df_eval[\"requested_grouped_v\"]\n",
+    "\n",
+    "    print(\"Grade consistency by board:\")\n",
+    "    print(\"=\" * 50)\n",
+    "    critic_summary = df_eval.groupby(\"board_key\").agg(\n",
+    "        exact_v=(\"critic_v_error\", lambda s: float((s == 0).mean() * 100)),\n",
+    "        within_1_v=(\"critic_v_error\", lambda s: float((s.abs() <= 1).mean() * 100)),\n",
+    "        within_2_v=(\"critic_v_error\", lambda s: float((s.abs() <= 2).mean() * 100)),\n",
+    "        mean_error=(\"critic_v_error\", \"mean\"),\n",
+    "    ).round(2)\n",
+    "    print(critic_summary)\n",
+    "else:\n",
+    "    print(\"Skipping critic evaluation (no model loaded).\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ranking",
+   "metadata": {},
+   "source": [
+    "## Ranking generated routes\n",
+    "\n",
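+    "We rank candidates by a composite score, the sum of the following terms:\n",
+    "- **Basic validity** (+2.0): At least 3 holds, start/finish, no duplicates, one board\n",
+    "- **Strict validity** (+1.0): Also has middle holds and 4+ holds\n",
+    "- **Novelty** (+0 to +1, higher is better): Distance from the nearest real route, with missing values counted as 0\n",
+    "- **Grade consistency** (if critic available): +1.0 when the predicted grade is within one V-grade of the requested grade, minus 0.25 per V-grade of absolute error\n",
+    "\n",
+    "Basic validity is a weighted term, not a hard filter, so invalid routes still appear in the ranking with lower scores."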
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "88747d6e2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Rank candidates by composite score\n",
+    "ranked = df_eval.copy()\n",
+    "ranked[\"score\"] = 0.0\n",
+    "ranked[\"score\"] += ranked[\"basic_valid_eval\"].astype(float) * 2.0\n",
+    "ranked[\"score\"] += ranked[\"strict_valid_eval\"].astype(float) * 1.0\n",
+    "ranked[\"score\"] += ranked[\"novelty_distance\"].fillna(0.0)\n",
+    "\n",
+    "if \"critic_v_error\" in ranked.columns:\n",
+    "    ranked[\"score\"] += (ranked[\"critic_v_error\"].abs() <= 1).astype(float)\n",
+    "    ranked[\"score\"] -= 0.25 * ranked[\"critic_v_error\"].abs()\n",
+    "\n",
+    "print(\"Top 10 generated routes by composite score:\")\n",
+    "print(\"=\" * 70)\n",
+    "top_routes = ranked.sort_values(\"score\", ascending=False).head(10)\n",
+    "display_cols = [\"board_key\", \"score\", \"basic_valid_eval\", \"strict_valid_eval\", \"novelty_distance\"]\n",
+    "if \"critic_v_error\" in top_routes.columns:\n",
+    "    display_cols.append(\"critic_v_error\")\n",
+    "print(top_routes[display_cols].to_string(index=False))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "evaluation_summary",
+   "metadata": {},
+   "source": [
+    "## Save evaluation results\n",
+    "\n",
+    "We save the full evaluation DataFrame and the top candidates for further analysis."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "save_results",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save evaluation results\n",
+    "OUT_DIR = ROOT / \"data\" / \"processed\" / \"evaluation\"\n",
+    "OUT_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "df_eval.to_csv(OUT_DIR / \"generated_route_evaluation.csv\", index=False)\n",
+    "top_candidates = ranked.sort_values(\"score\", ascending=False).head(100)\n",
+    "top_candidates.to_csv(OUT_DIR / \"top_generated_candidates.csv\", index=False)\n",
+    "\n",
+    "print(f\"Saved evaluation results to: {OUT_DIR}\")\n",
+    "print(f\"  - generated_route_evaluation.csv ({len(df_eval)} rows)\")\n",
+    "print(f\"  - top_generated_candidates.csv ({len(top_candidates)} rows)\")\n",
+    "\n",
+    "print(\"\\n\" + \"=\" * 50)\n",
+    "print(\"EVALUATION SUMMARY\")\n",
+    "print(\"=\" * 50)\n",
+    "print(f\"\\nTotal generated routes: {len(df_eval):,}\")\n",
+    "print(f\"\\nBasic validity rate: {df_eval['basic_valid_eval'].mean():.1%}\")\n",
+    "print(f\"Strict validity rate: {df_eval['strict_valid_eval'].mean():.1%}\")\n",
+    "print(f\"Mean novelty distance: {df_eval['novelty_distance'].mean():.3f}\")\n",
+    "\n",
+    "if 'critic_v_error' in df_eval.columns:\n",
+    "    print(f\"\\nGrade consistency:\")\n",
+    "    print(f\"  Exact V-grade: {(df_eval['critic_v_error'] == 0).mean():.1%}\")\n",
+    "    print(f\"  Within 1 V-grade: {(df_eval['critic_v_error'].abs() <= 1).mean():.1%}\")\n",
+    "    print(f\"  Within 2 V-grades: {(df_eval['critic_v_error'].abs() <= 2).mean():.1%}\")\n",
+    "else:\n",
+    "    print(\"\\n(Grade consistency not available - no critic model loaded)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "takeaways",
+   "metadata": {},
+   "source": [
+    "## Key Takeaways\n",
+    "\n",
+    "1. **Validity**: The basic and strict validity rates show how often the generator satisfies the structural constraints (start/finish holds, no duplicates, single board).\n",
+    "\n",
+    "2. **Novelty**: The Jaccard-based novelty distance shows how far each generated route is from its nearest real route on the same board; values near 0 indicate memorized routes.\n",
+    "\n",
+    "3. **Geometric plausibility**: The geometric features (height, width, hand reach) should fall in ranges comparable to real routes. This notebook computes them for generated routes only, so that comparison still has to be made separately.\n",
+    "\n",
+    "4. **Grade consistency**: If the critic is available, we can check whether routes generated at a requested grade are predicted by the critic to be at or near that grade.\n",
+    "\n",
+    "### Limitations\n",
+    "\n",
+    "- Validity checks are structural, not semantic. A route might have valid start/finish holds but still be impossible.\n",
+    "- Geometric features are simple. More sophisticated analysis could check reachability and move sequences.\n",
+    "- The critic model was trained on real data, so it may not generalize well to novel route structures."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.14.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}