ClimbingBoardGPT/notebooks/01_unified_route_tokenization.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "94a0ae33",
   "metadata": {},
   "source": [
    "# 01 — Unified Route Tokenization for TB2 and Kilter\n",
    "\n",
    "## What is tokenization and why does it matter?\n",
    "\n",
    "In natural language processing, **tokenization** is the process of converting raw text into a sequence of discrete symbols (tokens) that a model can process. For example, the sentence \"I climb rocks\" might be tokenized as `[\"I\", \" climb\", \" rocks\"]` using a subword tokenizer like BPE.\n",
    "\n",
    "For climbing board routes, we face an analogous problem: how do we convert a climb — which is fundamentally a *set of holds at specific positions with specific roles* — into a sequence of tokens that a transformer can learn from?\n",
    "\n",
    "### Key design decisions in this notebook\n",
    "\n",
    "1. **Board namespacing**: Each hold token includes the board prefix (e.g., `TB2_p344_start` vs `KILTER_p1084_start`). This prevents placement ID collisions between boards — placement 344 on TB2 is a completely different physical hold than placement 344 on Kilter (in fact, the latter does not exist).\n",
    "\n",
    "2. **Semantic role mapping**: Different boards use different role IDs (TB2 uses 5/6/7/8, Kilter uses 12/13/14/15), but they all map to the same semantic roles: `start`, `middle`, `finish`, `foot`. This shared vocabulary lets the model learn transferable patterns.\n",
    "\n",
    "3. **Canonical ordering**: Holds within a route are sorted by (role priority, y-position, x-position). This gives the model a consistent input ordering, similar to how LLMs expect text in left-to-right order.\n",
    "\n",
    "4. **Special tokens**: Like BERT and GPT, we use special tokens:\n",
    "   - `<BOS>` (beginning of sequence) — marks the start, like `[CLS]` in BERT\n",
    "   - `<EOS>` (end of sequence) — marks the end, like `[SEP]` or the end-of-text token in GPT\n",
    "   - `<PAD>` — for batching sequences of different lengths\n",
    "   - `<UNK>` — for unknown tokens (safety net)\n",
    "   - `<CLS>` — used by the grade predictor to pool sequence information\n",
    "   - `<MASK>` — reserved for future masked language modeling experiments\n",
    "\n",
    "5. **Conditioning tokens**: Routes are prefixed with board, angle, and grade tokens. This is analogous to how modern LLMs use system prompts or task prefixes to condition generation.\n",
    "\n",
    "### The analogy to NLP\n",
    "\n",
    "| NLP Concept | Climbing Board Analog |\n",
    "|---|---|\n",
    "| Word / Subword | Hold token (placement + role) |\n",
    "| Sentence | Route (sequence of holds) |\n",
    "| Document language | Board type (TB2 vs Kilter) |\n",
    "| Sentence length | Number of holds in route |\n",
    "| POS tag | Semantic role (start/middle/finish/foot) |\n",
    "| Genre / Domain | Angle + Grade conditioning |\n",
    "\n",
    "This notebook tokenizes climbing routes from **both** supported boards:\n",
    "\n",
    "- Tension Board 2 Mirror\n",
    "- Kilter Board Original\n",
    "\n",
    "The board-specific details are stored in `configs/tb2.json` and `configs/kilter.json`.\n",
    "This version defines the tokenization helpers inline as the notebook needs them.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ee2907f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:45:23.269153Z",
     "iopub.status.busy": "2026-06-07T15:45:23.268660Z",
     "iopub.status.idle": "2026-06-07T15:45:25.138003Z",
     "shell.execute_reply": "2026-06-07T15:45:25.137054Z"
    }
   },
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import ast\n",
    "import json\n",
    "import random\n",
    "import re\n",
    "import sqlite3\n",
    "from dataclasses import dataclass\n",
    "from pathlib import Path\n",
    "from typing import Any, Iterable\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "ROOT = Path.cwd().resolve()\n",
    "if ROOT.name == \"notebooks\":\n",
    "    ROOT = ROOT.parent"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fec9ef3b",
   "metadata": {},
   "source": [
    "### Board configuration helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c110801",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:45:25.142164Z",
     "iopub.status.busy": "2026-06-07T15:45:25.141718Z",
     "iopub.status.idle": "2026-06-07T15:45:25.156480Z",
     "shell.execute_reply": "2026-06-07T15:45:25.155743Z"
    }
   },
   "outputs": [],
   "source": [
    "# Find the project root and load board configuration JSON files.\n",
    "def find_project_root(start: str | Path | None = None) -> Path:\n",
    "    \"\"\"Walk upward until the repository root markers are found.\n",
    "\n",
    "    The project root is identified by both ``pyproject.toml`` and ``configs``.\n",
    "    If neither marker pair is found, the resolved starting directory is returned\n",
    "    so callers still have a deterministic base path.\n",
    "    \"\"\"\n",
    "    current = Path(start).resolve() if start is not None else Path.cwd().resolve()\n",
    "    for candidate in [current, *current.parents]:\n",
    "        if (candidate / \"pyproject.toml\").exists() and (candidate / \"configs\").exists():\n",
    "            return candidate\n",
    "    return current\n",
    "\n",
    "@dataclass(frozen=True)\n",
    "class BoardConfig:\n",
    "    \"\"\"Configuration for a single climbing board.\n",
    "    \n",
    "    This dataclass stores all board-specific settings needed for\n",
    "    data loading, tokenization, and model training.\n",
    "    \n",
    "    Attributes:\n",
    "        board_key: Short identifier (e.g., \"tb2\", \"kilter\")\n",
    "        display_name: Human-readable name (e.g., \"Tension Board 2 Mirror\")\n",
    "        token_prefix: Namespace for hold tokens (e.g., \"TB2\", \"KILTER\")\n",
    "        db_path: Path to the SQLite database\n",
    "        layout_id: Which layout in the database to use\n",
    "        max_angle: Filter out routes steeper than this (None = no filter)\n",
    "        min_fa_date: Filter out routes first ascended before this date\n",
    "        placement_y_max: Filter out placements above this Y coordinate\n",
    "        include_mirror_placement_id: Whether to include mirror info (TB2 only)\n",
    "        role_definitions: Maps semantic role names to numeric IDs\n",
    "        boardlib_database_command: Command to download the database\n",
    "        boardlib_images_command: Command to download board images\n",
    "        notes: Additional notes about the configuration\n",
    "    \"\"\"\n",
    "    board_key: str\n",
    "    display_name: str\n",
    "    token_prefix: str\n",
    "    db_path: Path\n",
    "    layout_id: int\n",
    "    max_angle: float | None\n",
    "    min_fa_date: str | None\n",
    "    placement_y_max: float | None\n",
    "    include_mirror_placement_id: bool\n",
    "    role_definitions: dict[str, int]\n",
    "    boardlib_database_command: str | None = None\n",
    "    boardlib_images_command: str | None = None\n",
    "    notes: tuple[str, ...] = ()\n",
    "\n",
    "    @property\n",
    "    def role_id_to_name(self) -> dict[int, str]:\n",
    "        \"\"\"Reverse mapping from numeric role IDs to semantic role names.\n",
    "        \n",
    "        Example: {5: 'start', 6: 'middle', 7: 'finish', 8: 'foot'} for TB2\n",
    "        \"\"\"\n",
    "        return {int(role_id): name for name, role_id in self.role_definitions.items()}\n",
    "\n",
    "    @property\n",
    "    def board_token(self) -> str:\n",
    "        \"\"\"The special token representing this board.\n",
    "        \n",
    "        Example: \"<BOARD_TB2>\" or \"<BOARD_KILTER>\"\n",
    "        \"\"\"\n",
    "        return f\"<BOARD_{self.token_prefix}>\"\n",
    "\n",
    "    def resolve_db_path(self, project_root: Path | None = None) -> Path:\n",
    "        \"\"\"Resolve the database path relative to the project root.\n",
    "        \n",
    "        If db_path is absolute, return it as-is.\n",
    "        Otherwise, resolve it relative to the project root.\n",
    "        \"\"\"\n",
    "        project_root = project_root or find_project_root()\n",
    "        return self.db_path if self.db_path.is_absolute() else project_root / self.db_path\n",
    "\n",
    "def load_board_config(board_key: str, config_dir: str | Path | None = None) -> BoardConfig:\n",
    "    \"\"\"Load a single board configuration from a JSON file.\n",
    "    \n",
    "    Args:\n",
    "        board_key: Board identifier (e.g., \"tb2\", \"kilter\")\n",
    "        config_dir: Directory containing config JSON files\n",
    "        \n",
    "    Returns:\n",
    "        BoardConfig dataclass with all board settings\n",
    "        \n",
    "    Raises:\n",
    "        FileNotFoundError: If the config file doesn't exist\n",
    "    \"\"\"\n",
    "    project_root = find_project_root()\n",
    "    config_dir = Path(config_dir) if config_dir is not None else project_root / \"configs\"\n",
    "    path = config_dir / f\"{board_key}.json\"\n",
    "    if not path.exists():\n",
    "        available = sorted(p.stem for p in config_dir.glob(\"*.json\"))\n",
    "        raise FileNotFoundError(\n",
    "            f\"Unknown board config '{board_key}'. Available: {available}\"\n",
    "        )\n",
    "\n",
    "    payload = json.loads(path.read_text(encoding=\"utf-8\"))\n",
    "    return BoardConfig(\n",
    "        board_key=str(payload[\"board_key\"]),\n",
    "        display_name=str(payload[\"display_name\"]),\n",
    "        token_prefix=str(payload[\"token_prefix\"]),\n",
    "        db_path=Path(payload[\"db_path\"]),\n",
    "        layout_id=int(payload[\"layout_id\"]),\n",
    "        max_angle=None if payload.get(\"max_angle\") is None else float(payload[\"max_angle\"]),\n",
    "        min_fa_date=payload.get(\"min_fa_date\"),\n",
    "        placement_y_max=None if payload.get(\"placement_y_max\") is None else float(payload[\"placement_y_max\"]),\n",
    "        include_mirror_placement_id=bool(payload.get(\"include_mirror_placement_id\", False)),\n",
    "        role_definitions={str(k): int(v) for k, v in payload[\"role_definitions\"].items()},\n",
    "        boardlib_database_command=payload.get(\"boardlib_database_command\"),\n",
    "        boardlib_images_command=payload.get(\"boardlib_images_command\"),\n",
    "        notes=tuple(payload.get(\"notes\", [])),\n",
    "    )\n",
    "\n",
    "def load_board_configs(board_keys: list[str] | tuple[str, ...]) -> list[BoardConfig]:\n",
    "    \"\"\"Load multiple board configurations.\n",
    "    \n",
    "    Args:\n",
    "        board_keys: List of board identifiers\n",
    "        \n",
    "    Returns:\n",
    "        List of BoardConfig dataclasses\n",
    "    \"\"\"\n",
    "    return [load_board_config(board_key) for board_key in board_keys]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3066e2b",
   "metadata": {},
   "source": [
    "## Load board configurations\n",
    "\n",
    "Each board has its own configuration file (`configs/tb2.json`, `configs/kilter.json`) that specifies:\n",
    "\n",
    "- **`layout_id`**: Which board layout to use (TB2 Mirror = 10, Kilter Original = 1)\n",
    "- **`role_definitions`**: Maps semantic role names to board-specific role IDs\n",
    "  - TB2: start=5, middle=6, finish=7, foot=8\n",
    "  - Kilter: start=12, middle=13, finish=14, foot=15\n",
    "- **`max_angle`**: We filter out climbs at extreme angles (>50° for TB2, >55° for Kilter) because those grades are biased toward elite climbers\n",
    "- **`token_prefix`**: The namespace prefix for hold tokens (\"TB2\" vs \"KILTER\")\n",
    "- **`include_mirror_placement_id`**: Whether to include mirror information (TB2 has symmetric left/right holds)\n",
    "\n",
    "This configuration-driven approach means we can add new boards by creating a new JSON config file, without changing any code.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f04dcea",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:45:25.159465Z",
     "iopub.status.busy": "2026-06-07T15:45:25.159209Z",
     "iopub.status.idle": "2026-06-07T15:45:25.166377Z",
     "shell.execute_reply": "2026-06-07T15:45:25.165663Z"
    }
   },
   "outputs": [],
   "source": [
    "configs = load_board_configs([\"tb2\", \"kilter\"])\n",
    "configs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25242855",
   "metadata": {},
   "source": [
    "### Database loading helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a076d997",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:45:25.170098Z",
     "iopub.status.busy": "2026-06-07T15:45:25.169626Z",
     "iopub.status.idle": "2026-06-07T15:45:25.182438Z",
     "shell.execute_reply": "2026-06-07T15:45:25.181766Z"
    }
   },
   "outputs": [],
   "source": [
    "# Query each BoardLib SQLite database and attach board identity columns.\n",
    "def build_climbs_query(config: BoardConfig) -> tuple[str, list]:\n",
    "    \"\"\"Build a SQL query for climbs data with board-specific filters.\n",
    "    \n",
    "    The query joins climbs, layouts, products, climb_stats, and difficulty_grades\n",
    "    tables, applying filters for:\n",
    "    - layout_id: Which board layout to use\n",
    "    - max_angle: Exclude routes steeper than this\n",
    "    - min_fa_date: Exclude routes first ascended before this date\n",
    "    - display_difficulty IS NOT NULL: Only routes with difficulty ratings\n",
    "    - is_listed = 1: Only publicly listed routes\n",
    "    \n",
    "    Args:\n",
    "        config: Board configuration\n",
    "        \n",
    "    Returns:\n",
    "        Tuple of (SQL query string, list of query parameters)\n",
    "    \"\"\"\n",
    "    conditions = [\n",
    "        \"cs.display_difficulty IS NOT NULL\",\n",
    "        \"c.is_listed = 1\",\n",
    "        \"c.layout_id = ?\",\n",
    "    ]\n",
    "    params: list = [config.layout_id]\n",
    "\n",
    "    if config.max_angle is not None:\n",
    "        conditions.append(\"cs.angle <= ?\")\n",
    "        params.append(config.max_angle)\n",
    "\n",
    "    if config.min_fa_date is not None:\n",
    "        conditions.append(\"cs.fa_at > ?\")\n",
    "        params.append(config.min_fa_date)\n",
    "\n",
    "    query = f\"\"\"\n",
    "    SELECT\n",
    "        c.uuid,\n",
    "        c.name AS climb_name,\n",
    "        c.setter_username,\n",
    "        c.layout_id AS layout_id,\n",
    "        c.description,\n",
    "        c.is_nomatch,\n",
    "        c.is_listed,\n",
    "        l.name AS layout_name,\n",
    "        p.name AS board_name,\n",
    "        c.frames,\n",
    "        cs.angle,\n",
    "        cs.display_difficulty,\n",
    "        dg.boulder_name AS boulder_grade,\n",
    "        cs.ascensionist_count,\n",
    "        cs.quality_average,\n",
    "        cs.fa_at\n",
    "    FROM climbs c\n",
    "    JOIN layouts l ON c.layout_id = l.id\n",
    "    JOIN products p ON l.product_id = p.id\n",
    "    JOIN climb_stats cs ON c.uuid = cs.climb_uuid\n",
    "    JOIN difficulty_grades dg ON ROUND(cs.display_difficulty) = dg.difficulty\n",
    "    WHERE {' AND '.join(conditions)}\n",
    "    \"\"\"\n",
    "    return query, params\n",
    "\n",
    "def build_placements_query(config: BoardConfig) -> tuple[str, list]:\n",
    "    \"\"\"Build a SQL query for placement data with board-specific filters.\n",
    "    \n",
    "    The query retrieves hold positions, default roles, material types,\n",
    "    and (optionally) mirror placement IDs for symmetric holds.\n",
    "    \n",
    "    Args:\n",
    "        config: Board configuration\n",
    "        \n",
    "    Returns:\n",
    "        Tuple of (SQL query string, list of query parameters)\n",
    "    \"\"\"\n",
    "    params: list = [config.layout_id]\n",
    "    y_condition = \"\"\n",
    "    if config.placement_y_max is not None:\n",
    "        y_condition = \" AND h.y <= ?\"\n",
    "        params.append(config.placement_y_max)\n",
    "\n",
    "    if config.include_mirror_placement_id:\n",
    "        # TB2 has mirrored holds — include the mirror placement ID\n",
    "        query = f\"\"\"\n",
    "        SELECT\n",
    "            p.id AS placement_id,\n",
    "            h.x,\n",
    "            h.y,\n",
    "            p.default_placement_role_id AS default_role_id,\n",
    "            p.set_id AS set_id,\n",
    "            s.name AS set_name,\n",
    "            p_mirror.id AS mirror_placement_id\n",
    "        FROM placements p\n",
    "        JOIN holes h ON p.hole_id = h.id\n",
    "        JOIN sets s ON p.set_id = s.id\n",
    "        LEFT JOIN holes h_mirror ON h.mirrored_hole_id = h_mirror.id\n",
    "        LEFT JOIN placements p_mirror\n",
    "            ON p_mirror.hole_id = h_mirror.id\n",
    "           AND p_mirror.layout_id = p.layout_id\n",
    "        WHERE p.layout_id = ?{y_condition}\n",
    "        \"\"\"\n",
    "    else:\n",
    "        # Kilter doesn't have mirrored holds\n",
    "        query = f\"\"\"\n",
    "        SELECT\n",
    "            p.id AS placement_id,\n",
    "            h.x,\n",
    "            h.y,\n",
    "            p.default_placement_role_id AS default_role_id,\n",
    "            p.set_id AS set_id,\n",
    "            s.name AS set_name,\n",
    "            NULL AS mirror_placement_id\n",
    "        FROM placements p\n",
    "        JOIN holes h ON p.hole_id = h.id\n",
    "        JOIN sets s ON p.set_id = s.id\n",
    "        WHERE p.layout_id = ?{y_condition}\n",
    "        \"\"\"\n",
    "    return query, params\n",
    "\n",
    "def load_board_data(\n",
    "    config: BoardConfig,\n",
    "    project_root: str | Path | None = None,\n",
    "    max_climbs: int | None = None,\n",
    ") -> tuple[pd.DataFrame, pd.DataFrame]:\n",
    "    \"\"\"Load climbs and placements data for a single board.\n",
    "    \n",
    "    Args:\n",
    "        config: Board configuration\n",
    "        project_root: Path to project root (for resolving db_path)\n",
    "        max_climbs: Optional row limit for fast smoke-test loads.\n",
    "        \n",
    "    Returns:\n",
    "        Tuple of (climbs DataFrame, placements DataFrame)\n",
    "    \"\"\"\n",
    "    project_root = Path(project_root) if project_root is not None else find_project_root()\n",
    "    db_path = config.resolve_db_path(project_root)\n",
    "    if not db_path.exists():\n",
    "        raise FileNotFoundError(\n",
    "            f\"Could not find database for board '{config.board_key}': {db_path}\"\n",
    "        )\n",
    "\n",
    "    climbs_query, climbs_params = build_climbs_query(config)\n",
    "    placements_query, placements_params = build_placements_query(config)\n",
    "    if max_climbs is not None:\n",
    "        if max_climbs < 1:\n",
    "            raise ValueError(\"max_climbs must be at least 1.\")\n",
    "        climbs_query = f\"{climbs_query}\\nORDER BY c.uuid, cs.angle\\nLIMIT ?\"\n",
    "        climbs_params = [*climbs_params, int(max_climbs)]\n",
    "\n",
    "    with sqlite3.connect(db_path) as conn:\n",
    "        df_climbs = pd.read_sql_query(climbs_query, conn, params=climbs_params)\n",
    "        df_placements = pd.read_sql_query(placements_query, conn, params=placements_params)\n",
    "\n",
    "    # Add board identifiers for multi-board processing\n",
    "    df_climbs[\"board_key\"] = config.board_key\n",
    "    df_climbs[\"board_token_prefix\"] = config.token_prefix\n",
    "    df_climbs[\"board_display_name\"] = config.display_name\n",
    "\n",
    "    df_placements[\"board_key\"] = config.board_key\n",
    "    df_placements[\"board_token_prefix\"] = config.token_prefix\n",
    "    df_placements[\"board_display_name\"] = config.display_name\n",
    "\n",
    "    return df_climbs, df_placements\n",
    "\n",
    "def load_multi_board_data(\n",
    "    configs: list[BoardConfig],\n",
    "    project_root: str | Path | None = None,\n",
    "    max_climbs_per_board: int | None = None,\n",
    ") -> tuple[pd.DataFrame, pd.DataFrame]:\n",
    "    \"\"\"Load and concatenate data from multiple boards.\n",
    "    \n",
    "    This function loads data from each board's database and concatenates\n",
    "    them into unified DataFrames. Board identifiers are preserved in\n",
    "    the board_key column.\n",
    "    \n",
    "    Args:\n",
    "        configs: List of board configurations\n",
    "        project_root: Path to project root\n",
    "        max_climbs_per_board: Optional row limit per board for smoke tests.\n",
    "        \n",
    "    Returns:\n",
    "        Tuple of (combined climbs DataFrame, combined placements DataFrame)\n",
    "    \"\"\"\n",
    "    climb_frames = []\n",
    "    placement_frames = []\n",
    "\n",
    "    for config in configs:\n",
    "        climbs, placements = load_board_data(\n",
    "            config,\n",
    "            project_root=project_root,\n",
    "            max_climbs=max_climbs_per_board,\n",
    "        )\n",
    "        climb_frames.append(climbs)\n",
    "        placement_frames.append(placements)\n",
    "\n",
    "    return (\n",
    "        pd.concat(climb_frames, ignore_index=True),\n",
    "        pd.concat(placement_frames, ignore_index=True),\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a5c9a9b",
   "metadata": {},
   "source": [
    "## Load raw climbs and placement metadata\n",
    "\n",
    "The data loading step reads from SQLite databases downloaded using BoardLib:\n",
    "\n",
    "```bash\n",
    "boardlib database tension data/raw/tb2.db\n",
    "boardlib database kilter data/raw/kilter.db\n",
    "```\n",
    "\n",
    "### What we're loading\n",
    "\n",
    "**Climbs data** (`df_climbs`): Each row is a climb-angle observation. Key columns:\n",
    "- `uuid`: Unique climb identifier\n",
    "- `frames`: The raw string encoding holds and roles, e.g., `p344r5p369r6p603r7`\n",
    "- `angle`: Wall angle in degrees\n",
    "- `display_difficulty`: Numeric difficulty score (maps to V-grades)\n",
    "- `boulder_grade`: Human-readable grade like \"6b/V4\"\n",
    "\n",
    "**Placements data** (`df_placements`): Each row is a physical hold position on the board. Key columns:\n",
    "- `placement_id`: The hold's unique ID within its board\n",
    "- `x`, `y`: Physical coordinates on the board (in inches)\n",
    "- `default_role_id`: What role this hold typically plays (hand vs foot)\n",
    "- `set_name`: Material type (\"Wood\" or \"Plastic\")\n",
    "- `mirror_placement_id`: For TB2, the ID of the symmetric hold on the other side\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53c1951a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:45:25.185319Z",
     "iopub.status.busy": "2026-06-07T15:45:25.184989Z",
     "iopub.status.idle": "2026-06-07T15:45:29.117312Z",
     "shell.execute_reply": "2026-06-07T15:45:29.116566Z"
    }
   },
   "outputs": [],
   "source": [
    "df_climbs, df_placements = load_multi_board_data(configs, project_root=ROOT)\n",
    "print(f\"Total climbs loaded: {len(df_climbs):,}\")\n",
    "print(f\"Total placements loaded: {len(df_placements):,}\")\n",
    "print()\n",
    "print(\"Climbs per board:\")\n",
    "print(df_climbs.groupby(\"board_key\").size())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6a93063",
   "metadata": {},
   "source": [
    "### Grade and route-tokenization helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7597dfc3",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:45:29.120837Z",
     "iopub.status.busy": "2026-06-07T15:45:29.120438Z",
     "iopub.status.idle": "2026-06-07T15:45:29.144382Z",
     "shell.execute_reply": "2026-06-07T15:45:29.143567Z"
    }
   },
   "outputs": [],
   "source": [
    "# Map BoardLib display difficulties into grouped V-grade tokens.\n",
    "GRADE_TO_V = {\n",
    "    10: 0, 11: 0, 12: 0,\n",
    "    13: 1, 14: 1,\n",
    "    15: 2,\n",
    "    16: 3, 17: 3,\n",
    "    18: 4, 19: 4,\n",
    "    20: 5, 21: 5,\n",
    "    22: 6,\n",
    "    23: 7,\n",
    "    24: 8, 25: 8,\n",
    "    26: 9,\n",
    "    27: 10,\n",
    "    28: 11,\n",
    "    29: 12,\n",
    "    30: 13,\n",
    "    31: 14,\n",
    "    32: 15,\n",
    "    33: 16,\n",
    "}\n",
    "\n",
    "def to_grouped_v(display_difficulty: float) -> int:\n",
    "    \"\"\"Map a continuous display difficulty to the nearest grouped V grade.\"\"\"\n",
    "    rounded = int(round(float(display_difficulty)))\n",
    "    rounded = max(min(rounded, max(GRADE_TO_V)), min(GRADE_TO_V))\n",
    "    return GRADE_TO_V[rounded]\n",
    "\n",
    "def grade_token(display_difficulty: float) -> str:\n",
    "    \"\"\"Return the grade-conditioning token for a display difficulty value.\"\"\"\n",
    "    return f\"<GRADE_V{to_grouped_v(display_difficulty)}>\"\n",
    "\n",
    "# Parse frames, canonicalize holds, and build route-level token sequences.\n",
    "SPECIAL_TOKENS = [\n",
    "    \"<PAD>\",\n",
    "    \"<UNK>\",\n",
    "    \"<BOS>\",\n",
    "    \"<EOS>\",\n",
    "    \"<CLS>\",\n",
    "    \"<MASK>\",\n",
    "]\n",
    "\n",
    "ANGLE_TOKEN_PATTERN = re.compile(r\"^<ANGLE_(-?\\d+)>$\")\n",
    "\n",
    "GRADE_TOKEN_PATTERN = re.compile(r\"^<GRADE_V(\\d+)>$\")\n",
    "\n",
    "BOARD_TOKEN_PATTERN = re.compile(r\"^<BOARD_([A-Z0-9_]+)>$\")\n",
    "\n",
    "HOLD_TOKEN_PATTERN = re.compile(r\"^<([A-Z0-9_]+)_p(\\d+)_(start|middle|finish|foot|unknown)>$\")\n",
    "\n",
    "ROLE_SORT_ORDER = {\n",
    "    \"start\": 0,\n",
    "    \"middle\": 1,\n",
    "    \"foot\": 2,\n",
    "    \"finish\": 3,\n",
    "    \"unknown\": 9,\n",
    "}\n",
    "\n",
    "def parse_frames(frames_str: str | None) -> list[tuple[int, int]]:\n",
    "    \"\"\"Parse a frames string into ``(placement_id, role_id)`` pairs.\n",
    "\n",
    "    Frames strings are compact concatenations such as ``p344r5p369r6``. Invalid\n",
    "    or missing input returns an empty list so callers can skip unusable climbs\n",
    "    without special-case exception handling.\n",
    "    \"\"\"\n",
    "    if not isinstance(frames_str, str):\n",
    "        return []\n",
    "    matches = re.findall(r\"p(\\d+)r(\\d+)\", frames_str)\n",
    "    return [(int(placement_id), int(role_id)) for placement_id, role_id in matches]\n",
    "\n",
    "def make_placement_lookup(df_placements: pd.DataFrame) -> dict[tuple[str, int], dict]:\n",
    "    \"\"\"Build a coordinate/metadata lookup keyed by ``(board_key, placement_id)``.\"\"\"\n",
    "    rows = {}\n",
    "    for _, row in df_placements.iterrows():\n",
    "        key = (str(row[\"board_key\"]), int(row[\"placement_id\"]))\n",
    "        rows[key] = row.to_dict()\n",
    "    return rows\n",
    "\n",
    "def role_name(role_id: int, config: BoardConfig) -> str:\n",
    "    \"\"\"Map a board-specific numeric role ID to a shared semantic role name.\"\"\"\n",
    "    return config.role_id_to_name.get(int(role_id), \"unknown\")\n",
    "\n",
    "def placement_xy(\n",
    "    board_key: str,\n",
    "    placement_id: int,\n",
    "    placement_lookup: dict[tuple[str, int], dict],\n",
    ") -> tuple[float, float]:\n",
    "    \"\"\"Return raw board coordinates for a placement, or NaNs if unknown.\"\"\"\n",
    "    row = placement_lookup.get((str(board_key), int(placement_id)))\n",
    "    if row is None:\n",
    "        return (float(\"nan\"), float(\"nan\"))\n",
    "    return (float(row[\"x\"]), float(row[\"y\"]))\n",
    "\n",
    "def canonicalize_holds(\n",
    "    holds: Iterable[tuple[int, int]],\n",
    "    config: BoardConfig,\n",
    "    placement_lookup: dict[tuple[str, int], dict],\n",
    ") -> list[tuple[int, int]]:\n",
    "    \"\"\"Sort holds into the canonical route order used by all model inputs.\n",
    "\n",
    "    Frames preserve setter/storage order, which is not always stable\n",
    "    across routes or boards. Canonical ordering keeps starts first, hand/foot\n",
    "    holds in a bottom-to-top scan, and finishes last, giving the models a more\n",
    "    consistent sequence grammar.\n",
    "    \"\"\"\n",
    "    def key(pair: tuple[int, int]):\n",
    "        \"\"\"Sort by semantic role, then board position, then placement ID.\"\"\"\n",
    "        placement_id, role_id = pair\n",
    "        x, y = placement_xy(config.board_key, placement_id, placement_lookup)\n",
    "        name = role_name(role_id, config)\n",
    "        return (\n",
    "            ROLE_SORT_ORDER.get(name, 9),\n",
    "            y if not np.isnan(y) else 9999.0,\n",
    "            x if not np.isnan(x) else 9999.0,\n",
    "            placement_id,\n",
    "        )\n",
    "\n",
    "    return sorted(list(holds), key=key)\n",
    "\n",
    "def board_token(config: BoardConfig) -> str:\n",
    "    \"\"\"Return the special conditioning token for a board config.\"\"\"\n",
    "    return f\"<BOARD_{config.token_prefix}>\"\n",
    "\n",
    "def angle_token(angle: float) -> str:\n",
    "    \"\"\"Round a wall angle into the shared angle-token format.\"\"\"\n",
    "    return f\"<ANGLE_{int(round(float(angle)))}>\"\n",
    "\n",
    "def hold_token(\n",
    "    placement_id: int,\n",
    "    role_id: int,\n",
    "    config: BoardConfig,\n",
    ") -> str:\n",
    "    \"\"\"Return a board-namespaced hold token for a placement and role.\"\"\"\n",
    "    semantic_role = role_name(role_id, config)\n",
    "    return f\"<{config.token_prefix}_p{int(placement_id)}_{semantic_role}>\"\n",
    "\n",
    "def tokenize_route(\n",
    "    row,\n",
    "    config: BoardConfig,\n",
    "    placement_lookup: dict[tuple[str, int], dict],\n",
    "    include_grade: bool = True,\n",
    "    canonical: bool = True,\n",
    ") -> list[str]:\n",
    "    \"\"\"Tokenize one climb row into the sequence consumed by the models.\n",
    "\n",
    "    ``include_grade=True`` is used for GPT-style generation, where the target\n",
    "    grade is a conditioning token. ``include_grade=False`` is used for grade\n",
    "    prediction so the model cannot read the answer from its input.\n",
    "    \"\"\"\n",
    "    holds = parse_frames(row[\"frames\"])\n",
    "    if canonical:\n",
    "        holds = canonicalize_holds(holds, config, placement_lookup)\n",
    "\n",
    "    tokens = [\n",
    "        \"<BOS>\",\n",
    "        board_token(config),\n",
    "        angle_token(row[\"angle\"]),\n",
    "    ]\n",
    "    if include_grade:\n",
    "        tokens.append(grade_token(row[\"display_difficulty\"]))\n",
    "\n",
    "    tokens.extend(hold_token(placement_id, role_id, config) for placement_id, role_id in holds)\n",
    "    tokens.append(\"<EOS>\")\n",
    "    return tokens\n",
    "\n",
    "def build_route_records(\n",
    "    df_climbs: pd.DataFrame,\n",
    "    configs_by_key: dict[str, BoardConfig],\n",
    "    placement_lookup: dict[tuple[str, int], dict],\n",
    ") -> pd.DataFrame:\n",
    "    \"\"\"Create one training/evaluation record per climb-angle row.\n",
    "\n",
    "    The returned frame keeps both human-readable route metadata and model-ready\n",
    "    token sequences, which lets downstream scripts save compact CSV summaries\n",
    "    while still retaining the richer JSONL training artifacts.\n",
    "    \"\"\"\n",
    "    records: list[dict] = []\n",
    "\n",
    "    for _, row in df_climbs.iterrows():\n",
    "        board_key = str(row[\"board_key\"])\n",
    "        config = configs_by_key[board_key]\n",
    "        holds = canonicalize_holds(parse_frames(row[\"frames\"]), config, placement_lookup)\n",
    "        if not holds:\n",
    "            continue\n",
    "\n",
    "        hold_tokens = [hold_token(p, r, config) for p, r in holds]\n",
    "        semantic_roles = [role_name(r, config) for _, r in holds]\n",
    "\n",
    "        tokens_with_grade = tokenize_route(\n",
    "            row,\n",
    "            config=config,\n",
    "            placement_lookup=placement_lookup,\n",
    "            include_grade=True,\n",
    "            canonical=True,\n",
    "        )\n",
    "        tokens_no_grade = tokenize_route(\n",
    "            row,\n",
    "            config=config,\n",
    "            placement_lookup=placement_lookup,\n",
    "            include_grade=False,\n",
    "            canonical=True,\n",
    "        )\n",
    "\n",
    "        records.append(\n",
    "            {\n",
    "                \"uuid\": row[\"uuid\"],\n",
    "                \"board_key\": board_key,\n",
    "                \"board_display_name\": row[\"board_display_name\"],\n",
    "                \"board_token_prefix\": row[\"board_token_prefix\"],\n",
    "                \"board_token\": board_token(config),\n",
    "                \"climb_name\": row[\"climb_name\"],\n",
    "                \"setter_username\": row.get(\"setter_username\"),\n",
    "                \"layout_id\": int(row[\"layout_id\"]),\n",
    "                \"layout_name\": row.get(\"layout_name\"),\n",
    "                \"board_name\": row.get(\"board_name\"),\n",
    "                \"frames\": row[\"frames\"],\n",
    "                \"angle\": float(row[\"angle\"]),\n",
    "                \"display_difficulty\": float(row[\"display_difficulty\"]),\n",
    "                \"grouped_v\": int(to_grouped_v(row[\"display_difficulty\"])),\n",
    "                \"boulder_grade\": row.get(\"boulder_grade\"),\n",
    "                \"ascensionist_count\": row.get(\"ascensionist_count\"),\n",
    "                \"quality_average\": row.get(\"quality_average\"),\n",
    "                \"fa_at\": row.get(\"fa_at\"),\n",
    "                \"n_holds\": len(holds),\n",
    "                \"n_start\": semantic_roles.count(\"start\"),\n",
    "                \"n_middle\": semantic_roles.count(\"middle\"),\n",
    "                \"n_foot\": semantic_roles.count(\"foot\"),\n",
    "                \"n_finish\": semantic_roles.count(\"finish\"),\n",
    "                \"holds\": holds,\n",
    "                \"hold_tokens\": hold_tokens,\n",
    "                \"tokens_with_grade\": tokens_with_grade,\n",
    "                \"tokens_no_grade\": tokens_no_grade,\n",
    "                \"sequence_with_grade\": \" \".join(tokens_with_grade),\n",
    "                \"sequence_no_grade\": \" \".join(tokens_no_grade),\n",
    "            }\n",
    "        )\n",
    "\n",
    "    return pd.DataFrame(records)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6198c24e",
   "metadata": {},
   "source": [
    "## Build unified route records\n",
    "\n",
    "This is the core tokenization step. For each climb, we:\n",
    "\n",
    "1. **Parse the frames string**: Convert `p344r5p369r6p603r7` into a list of `(placement_id, role_id)` tuples\n",
    "\n",
    "2. **Map role IDs to semantic roles**: Convert board-specific role IDs (5→start, 6→middle, etc.) to shared semantic names\n",
    "\n",
    "3. **Canonicalize hold order**: Sort holds by (role priority, y-position, x-position). This is important because:\n",
    "   - The same climb can be represented with holds in any order in the database\n",
    "   - Transformers need consistent input ordering to learn patterns\n",
    "   - This is analogous to how NLP tokenizers normalize text (lowercasing, etc.)\n",
    "\n",
    "4. **Generate token sequences**: Create two versions of each route:\n",
    "   - `sequence_with_grade`: `<BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6> <TB2_p344_start> ... <EOS>`\n",
    "   - `sequence_no_grade`: `<BOS> <BOARD_TB2> <ANGLE_40> <TB2_p344_start> ... <EOS>` (grade removed)\n",
    "\n",
    "The grade-included version is used for the GPT generator (which predicts the next token, including grade). The grade-excluded version is used for the grade predictor (which receives the route without knowing the grade and must predict it).\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20bed1da",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:45:29.147179Z",
     "iopub.status.busy": "2026-06-07T15:45:29.146784Z",
     "iopub.status.idle": "2026-06-07T15:47:22.798391Z",
     "shell.execute_reply": "2026-06-07T15:47:22.797739Z"
    }
   },
   "outputs": [],
   "source": [
    "configs_by_key = {config.board_key: config for config in configs}\n",
    "configs_by_prefix = {config.token_prefix: config for config in configs}\n",
    "placement_lookup = make_placement_lookup(df_placements)\n",
    "\n",
    "df_routes = build_route_records(\n",
    "    df_climbs=df_climbs,\n",
    "    configs_by_key=configs_by_key,\n",
    "    placement_lookup=placement_lookup,\n",
    ")\n",
    "print(f\"Tokenized routes: {len(df_routes):,}\")\n",
    "print()\n",
    "df_routes[[\"board_key\", \"angle\", \"display_difficulty\", \"sequence_with_grade\"]].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d007fc0",
   "metadata": {},
   "source": [
    "## Example tokenized routes\n",
    "\n",
    "Let's look at what the tokenized routes actually look like. This is the \"text\" that our transformer models will read.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5b7391b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:47:22.801970Z",
     "iopub.status.busy": "2026-06-07T15:47:22.801298Z",
     "iopub.status.idle": "2026-06-07T15:47:22.833306Z",
     "shell.execute_reply": "2026-06-07T15:47:22.832513Z"
    }
   },
   "outputs": [],
   "source": [
    "for _, row in df_routes.groupby(\"board_key\").head(2).iterrows():\n",
    "    print(f\"Board: {row['board_key']}\")\n",
    "    print(f\"  Angle: {row['angle']}°\")\n",
    "    print(f\"  Grade: {row['boulder_grade']} (V{row['grouped_v']})\")\n",
    "    print(f\"  Tokens: {row['sequence_with_grade']}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f46bb08",
   "metadata": {},
   "source": [
    "### Vocabulary helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d52041b7",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:47:22.836570Z",
     "iopub.status.busy": "2026-06-07T15:47:22.836109Z",
     "iopub.status.idle": "2026-06-07T15:47:22.843523Z",
     "shell.execute_reply": "2026-06-07T15:47:22.842719Z"
    }
   },
   "outputs": [],
   "source": [
    "# Build the shared vocabulary and encode/decode token strings.\n",
    "def build_vocab(df_routes: pd.DataFrame) -> tuple[list[str], dict[str, int], dict[int, str]]:\n",
    "    \"\"\"Build the shared token vocabulary from grade-conditioned sequences.\"\"\"\n",
    "    all_tokens: list[str] = []\n",
    "    for tokens in df_routes[\"tokens_with_grade\"]:\n",
    "        all_tokens.extend(tokens)\n",
    "\n",
    "    vocab_tokens = list(SPECIAL_TOKENS)\n",
    "    for token in sorted(set(all_tokens)):\n",
    "        if token not in vocab_tokens:\n",
    "            vocab_tokens.append(token)\n",
    "\n",
    "    stoi = {token: idx for idx, token in enumerate(vocab_tokens)}\n",
    "    itos = {idx: token for token, idx in stoi.items()}\n",
    "    return vocab_tokens, stoi, itos\n",
    "\n",
    "def encode(tokens: Iterable[str], stoi: dict[str, int]) -> list[int]:\n",
    "    \"\"\"Convert tokens to integer IDs, using ``<UNK>`` for unseen tokens.\"\"\"\n",
    "    unk_id = stoi[\"<UNK>\"]\n",
    "    return [stoi.get(token, unk_id) for token in tokens]\n",
    "\n",
    "def decode(ids: Iterable[int], itos: dict[int, str]) -> list[str]:\n",
    "    \"\"\"Convert integer IDs back to token strings.\"\"\"\n",
    "    return [itos.get(int(idx), \"<UNK>\") for idx in ids]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0393d191",
   "metadata": {},
   "source": [
    "## Build the shared vocabulary\n",
    "\n",
    "### What is a vocabulary?\n",
    "\n",
    "In NLP, the **vocabulary** (or \"vocab\") is the set of all possible tokens the model can produce or understand. For GPT-3, this is ~50,000 BPE tokens. For our climbing model, it includes:\n",
    "\n",
    "1. **Special tokens** (6): `<PAD>`, `<UNK>`, `<BOS>`, `<EOS>`, `<CLS>`, `<MASK>`\n",
    "2. **Board tokens** (2): `<BOARD_TB2>`, `<BOARD_KILTER>`\n",
    "3. **Angle tokens** (~6): `<ANGLE_30>`, `<ANGLE_35>`, `<ANGLE_40>`, etc.\n",
    "4. **Grade tokens** (~17): `<GRADE_V0>` through `<GRADE_V16>`\n",
    "5. **Hold tokens** (~1000+): One per placement per board per role\n",
    "\n",
    "### Why board-namespaced hold tokens?\n",
    "\n",
    "Placement ID 344 on TB2 refers to a completely different physical hold than placement ID 344 on Kilter (the latter doesn't exist). By prefixing with the board name (`TB2_p344_start` vs `KILTER_p344_start`), we ensure the model treats these as distinct tokens.\n",
    "\n",
    "This is analogous to how multilingual LLMs use language-specific subword tokens — the same byte sequence can mean different things in different languages.\n",
    "\n",
    "### String-to-integer mapping\n",
    "\n",
    "Transformers operate on integer indices, not strings. The `stoi` (string-to-integer) and `itos` (integer-to-string) dictionaries provide this mapping, similar to how HuggingFace tokenizers work.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ba5b78d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:47:22.846657Z",
     "iopub.status.busy": "2026-06-07T15:47:22.846214Z",
     "iopub.status.idle": "2026-06-07T15:47:23.506016Z",
     "shell.execute_reply": "2026-06-07T15:47:23.505338Z"
    }
   },
   "outputs": [],
   "source": [
    "vocab_tokens, stoi, itos = build_vocab(df_routes)\n",
    "\n",
    "print(f\"Vocabulary size: {len(stoi):,}\")\n",
    "print()\n",
    "print(\"First 20 tokens (special + board tokens):\")\n",
    "print(vocab_tokens[:20])\n",
    "print()\n",
    "hold_tokens = [t for t in vocab_tokens if t.startswith('<') and '_p' in t]\n",
    "angle_tokens = [t for t in vocab_tokens if t.startswith('<ANGLE_')]\n",
    "grade_tokens = [t for t in vocab_tokens if t.startswith('<GRADE_')]\n",
    "board_tokens = [t for t in vocab_tokens if t.startswith('<BOARD_')]\n",
    "special_tokens = [t for t in vocab_tokens if t in ['<PAD>', '<UNK>', '<BOS>', '<EOS>', '<CLS>', '<MASK>']]\n",
    "\n",
    "print(f\"Special tokens: {len(special_tokens)}\")\n",
    "print(f\"Board tokens: {len(board_tokens)}\")\n",
    "print(f\"Angle tokens: {len(angle_tokens)}\")\n",
    "print(f\"Grade tokens: {len(grade_tokens)}\")\n",
    "print(f\"Hold tokens: {len(hold_tokens)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8408e4ab",
   "metadata": {},
   "source": [
    "### Split helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "366d46b5",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:47:23.509090Z",
     "iopub.status.busy": "2026-06-07T15:47:23.508606Z",
     "iopub.status.idle": "2026-06-07T15:47:23.518536Z",
     "shell.execute_reply": "2026-06-07T15:47:23.517995Z"
    }
   },
   "outputs": [],
   "source": [
    "# Assign train/validation/test splits at the logical-climb group level.\n",
    "def safe_train_test_split(\n",
    "    df: pd.DataFrame,\n",
    "    test_size: float,\n",
    "    random_state: int,\n",
    "    stratify_col: str | None = None,\n",
    "):\n",
    "    \"\"\"Split a DataFrame with optional stratification and graceful fallback.\n",
    "\n",
    "    scikit-learn raises when a requested stratum is too small. The tokenization\n",
    "    pipeline prefers stratified splits when possible, but falls back to an\n",
    "    unstratified split rather than failing on tiny smoke-test subsets.\n",
    "    \"\"\"\n",
    "    stratify = None\n",
    "    if stratify_col is not None and stratify_col in df.columns:\n",
    "        counts = df[stratify_col].value_counts()\n",
    "        if len(counts) > 1 and counts.min() >= 2:\n",
    "            stratify = df[stratify_col]\n",
    "\n",
    "    try:\n",
    "        return train_test_split(\n",
    "            df,\n",
    "            test_size=test_size,\n",
    "            random_state=random_state,\n",
    "            stratify=stratify,\n",
    "        )\n",
    "    except ValueError:\n",
    "        return train_test_split(\n",
    "            df,\n",
    "            test_size=test_size,\n",
    "            random_state=random_state,\n",
    "            stratify=None,\n",
    "        )\n",
    "\n",
    "def assign_group_splits(\n",
    "    df: pd.DataFrame,\n",
    "    group_cols: list[str],\n",
    "    test_size: float,\n",
    "    val_size_within_temp: float,\n",
    "    random_state: int,\n",
    "    stratify_col: str | None = None,\n",
    ") -> pd.Series:\n",
    "    \"\"\"Assign train/val/test splits at group level.\n",
    "\n",
    "    This prevents multiple rows for the same logical climb, for example the\n",
    "    same UUID at several angles, from being distributed across different\n",
    "    splits. The returned Series is indexed like ``df`` and contains\n",
    "    ``train``, ``val``, or ``test``.\n",
    "    \"\"\"\n",
    "    group_df = df[group_cols + ([stratify_col] if stratify_col else [])].copy()\n",
    "    group_df = group_df.drop_duplicates(group_cols).reset_index(drop=True)\n",
    "\n",
    "    train_groups, temp_groups = safe_train_test_split(\n",
    "        group_df,\n",
    "        test_size=test_size,\n",
    "        random_state=random_state,\n",
    "        stratify_col=stratify_col,\n",
    "    )\n",
    "    val_groups, test_groups = safe_train_test_split(\n",
    "        temp_groups,\n",
    "        test_size=val_size_within_temp,\n",
    "        random_state=random_state,\n",
    "        stratify_col=stratify_col,\n",
    "    )\n",
    "\n",
    "    def key_frame(frame: pd.DataFrame) -> set[tuple]:\n",
    "        \"\"\"Return stringified group keys so pandas dtypes cannot affect joins.\"\"\"\n",
    "        return set(map(tuple, frame[group_cols].astype(str).values.tolist()))\n",
    "\n",
    "    train_keys = key_frame(train_groups)\n",
    "    val_keys = key_frame(val_groups)\n",
    "    test_keys = key_frame(test_groups)\n",
    "\n",
    "    def split_for_row(row) -> str:\n",
    "        \"\"\"Map one original row back to its group-level split assignment.\"\"\"\n",
    "        key = tuple(str(row[col]) for col in group_cols)\n",
    "        if key in train_keys:\n",
    "            return \"train\"\n",
    "        if key in val_keys:\n",
    "            return \"val\"\n",
    "        if key in test_keys:\n",
    "            return \"test\"\n",
    "        raise KeyError(f\"Could not assign split for group key {key}\")\n",
    "\n",
    "    return df.apply(split_for_row, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93176cff",
   "metadata": {},
   "source": [
    "## Train/validation/test splits\n",
    "\n",
    "### Why stratified splitting?\n",
    "\n",
    "We split data into train (80%), validation (10%), and test (10%) sets. Crucially, we **stratify by `board_key × grouped_v`** — this ensures that:\n",
    "\n",
    "1. Both boards (TB2 and Kilter) are represented in all splits\n",
    "2. All difficulty levels (V0 through V16) are represented in all splits\n",
    "\n",
    "Without stratification, we might end up with all V14 climbs in the test set and none in training, which would make evaluation meaningless.\n",
    "\n",
    "This is the same principle as stratified splitting in NLP, where you ensure all languages or domains are represented in each split.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff18298e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:47:23.521543Z",
     "iopub.status.busy": "2026-06-07T15:47:23.521153Z",
     "iopub.status.idle": "2026-06-07T15:47:31.054015Z",
     "shell.execute_reply": "2026-06-07T15:47:31.053128Z"
    }
   },
   "outputs": [],
   "source": [
    "df_routes[\"ids_with_grade\"] = df_routes[\"tokens_with_grade\"].apply(lambda tokens: encode(tokens, stoi))\n",
    "df_routes[\"ids_no_grade\"] = df_routes[\"tokens_no_grade\"].apply(lambda tokens: encode(tokens, stoi))\n",
    "df_routes[\"split_stratum\"] = df_routes[\"board_key\"].astype(str) + \"__V\" + df_routes[\"grouped_v\"].astype(str)\n",
    "df_routes[\"split\"] = assign_group_splits(\n",
    "    df_routes,\n",
    "    group_cols=[\"board_key\", \"uuid\"],\n",
    "    test_size=0.20,\n",
    "    val_size_within_temp=0.50,\n",
    "    random_state=3,\n",
    "    stratify_col=\"split_stratum\",\n",
    ")\n",
    "\n",
    "df_routes.groupby([\"board_key\", \"split\"]).size().unstack(fill_value=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae789b11",
   "metadata": {},
   "source": [
    "### Token metadata helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43e9dd8b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:47:31.057561Z",
     "iopub.status.busy": "2026-06-07T15:47:31.056935Z",
     "iopub.status.idle": "2026-06-07T15:47:31.072487Z",
     "shell.execute_reply": "2026-06-07T15:47:31.071641Z"
    }
   },
   "outputs": [],
   "source": [
    "# Attach board, role, placement, and coordinate metadata to each token.\n",
    "def build_token_metadata(\n",
    "    vocab_tokens: list[str],\n",
    "    stoi: dict[str, int],\n",
    "    df_placements: pd.DataFrame,\n",
    "    placement_lookup: dict[tuple[str, int], dict],\n",
    "    configs_by_prefix: dict[str, BoardConfig],\n",
    ") -> pd.DataFrame:\n",
    "    \"\"\"Build per-token metadata used for coordinate features and plotting.\n",
    "\n",
    "    Hold tokens receive raw coordinates, normalized coordinates in ``[-1, 1]``,\n",
    "    role labels, and board identity. Non-hold tokens keep neutral coordinate\n",
    "    features so the grade predictor can safely index every token ID.\n",
    "    \"\"\"\n",
    "    bounds = {}\n",
    "    for board_key, frame in df_placements.groupby(\"board_key\"):\n",
    "        xs = frame[\"x\"].astype(float)\n",
    "        ys = frame[\"y\"].astype(float)\n",
    "        bounds[str(board_key)] = {\n",
    "            \"x_min\": float(xs.min()),\n",
    "            \"x_max\": float(xs.max()),\n",
    "            \"y_min\": float(ys.min()),\n",
    "            \"y_max\": float(ys.max()),\n",
    "        }\n",
    "\n",
    "    def normalize(value: float, lo: float, hi: float) -> float:\n",
    "        \"\"\"Scale one coordinate into ``[-1, 1]`` with safe missing-value handling.\"\"\"\n",
    "        if pd.isna(value) or hi == lo:\n",
    "            return 0.0\n",
    "        return 2 * ((float(value) - lo) / (hi - lo)) - 1\n",
    "\n",
    "    rows: list[dict] = []\n",
    "\n",
    "    for token in vocab_tokens:\n",
    "        meta = {\n",
    "            \"token\": token,\n",
    "            \"token_id\": stoi[token],\n",
    "            \"kind\": \"special\",\n",
    "            \"board_key\": None,\n",
    "            \"board_token_prefix\": None,\n",
    "            \"placement_id\": np.nan,\n",
    "            \"role\": None,\n",
    "            \"x\": np.nan,\n",
    "            \"y\": np.nan,\n",
    "            \"x_norm\": 0.0,\n",
    "            \"y_norm\": 0.0,\n",
    "            \"is_hold\": 0,\n",
    "            \"angle\": np.nan,\n",
    "            \"grouped_v\": np.nan,\n",
    "        }\n",
    "\n",
    "        hold_match = HOLD_TOKEN_PATTERN.match(token)\n",
    "        if hold_match:\n",
    "            prefix = hold_match.group(1)\n",
    "            placement_id = int(hold_match.group(2))\n",
    "            role = hold_match.group(3)\n",
    "            config = configs_by_prefix[prefix]\n",
    "            board_key = config.board_key\n",
    "            row = placement_lookup.get((board_key, placement_id), {})\n",
    "            x = float(row.get(\"x\", np.nan))\n",
    "            y = float(row.get(\"y\", np.nan))\n",
    "            board_bounds = bounds.get(board_key, {\"x_min\": 0, \"x_max\": 1, \"y_min\": 0, \"y_max\": 1})\n",
    "\n",
    "            meta.update(\n",
    "                {\n",
    "                    \"kind\": \"hold\",\n",
    "                    \"board_key\": board_key,\n",
    "                    \"board_token_prefix\": prefix,\n",
    "                    \"placement_id\": placement_id,\n",
    "                    \"role\": role,\n",
    "                    \"x\": x,\n",
    "                    \"y\": y,\n",
    "                    \"x_norm\": normalize(x, board_bounds[\"x_min\"], board_bounds[\"x_max\"]),\n",
    "                    \"y_norm\": normalize(y, board_bounds[\"y_min\"], board_bounds[\"y_max\"]),\n",
    "                    \"is_hold\": 1,\n",
    "                }\n",
    "            )\n",
    "\n",
    "        angle_match = ANGLE_TOKEN_PATTERN.match(token)\n",
    "        if angle_match:\n",
    "            meta.update({\"kind\": \"angle\", \"angle\": int(angle_match.group(1))})\n",
    "\n",
    "        grade_match = GRADE_TOKEN_PATTERN.match(token)\n",
    "        if grade_match:\n",
    "            meta.update({\"kind\": \"grade\", \"grouped_v\": int(grade_match.group(1))})\n",
    "\n",
    "        board_match = BOARD_TOKEN_PATTERN.match(token)\n",
    "        if board_match:\n",
    "            prefix = board_match.group(1)\n",
    "            config = configs_by_prefix.get(prefix)\n",
    "            meta.update(\n",
    "                {\n",
    "                    \"kind\": \"board\",\n",
    "                    \"board_key\": None if config is None else config.board_key,\n",
    "                    \"board_token_prefix\": prefix,\n",
    "                }\n",
    "            )\n",
    "\n",
    "        rows.append(meta)\n",
    "\n",
    "    return pd.DataFrame(rows)\n",
    "\n",
    "def vocab_payload(\n",
    "    stoi: dict[str, int],\n",
    "    itos: dict[int, str],\n",
    "    configs_by_key: dict[str, BoardConfig],\n",
    ") -> dict:\n",
    "    \"\"\"Package vocabulary and board metadata for JSON serialization.\"\"\"\n",
    "    return {\n",
    "        \"stoi\": stoi,\n",
    "        \"itos\": {str(k): v for k, v in itos.items()},\n",
    "        \"special_tokens\": SPECIAL_TOKENS,\n",
    "        \"boards\": {\n",
    "            board_key: {\n",
    "                \"token_prefix\": config.token_prefix,\n",
    "                \"board_token\": board_token(config),\n",
    "                \"role_definitions\": config.role_definitions,\n",
    "            }\n",
    "            for board_key, config in configs_by_key.items()\n",
    "        },\n",
    "        \"grade_to_v\": {str(k): v for k, v in GRADE_TO_V.items()},\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "414dca92",
   "metadata": {},
   "source": [
    "## Token metadata\n",
    "\n",
    "### Why metadata matters\n",
    "\n",
    "Each hold token carries rich metadata that the model can use:\n",
    "\n",
    "- **Physical coordinates** (`x`, `y`): Where the hold is on the board\n",
    "- **Normalized coordinates** (`x_norm`, `y_norm`): Scaled to [-1, 1] per board, so the model doesn't need to learn different coordinate scales\n",
    "- **Semantic role** (`start`, `middle`, `finish`, `foot`): What the hold is used for\n",
    "- **Board identity** (`board_key`): Which board this hold belongs to\n",
    "\n",
    "The grade predictor uses these coordinate features as additional embeddings alongside the token embeddings. This is similar to how some LLMs inject positional embeddings or segment embeddings — we're giving the model extra structured information about each token.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48c3692e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:47:31.075613Z",
     "iopub.status.busy": "2026-06-07T15:47:31.075057Z",
     "iopub.status.idle": "2026-06-07T15:47:31.130838Z",
     "shell.execute_reply": "2026-06-07T15:47:31.129975Z"
    }
   },
   "outputs": [],
   "source": [
    "df_token_meta = build_token_metadata(\n",
    "    vocab_tokens=vocab_tokens,\n",
    "    stoi=stoi,\n",
    "    df_placements=df_placements,\n",
    "    placement_lookup=placement_lookup,\n",
    "    configs_by_prefix=configs_by_prefix,\n",
    ")\n",
    "\n",
    "print(\"Token metadata columns:\")\n",
    "print(df_token_meta.columns.tolist())\n",
    "print()\n",
    "print(\"Example hold token metadata:\")\n",
    "df_token_meta[df_token_meta[\"kind\"] == \"hold\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79774dc1",
   "metadata": {},
   "source": [
    "### JSON output helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a53bf23d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:47:31.134369Z",
     "iopub.status.busy": "2026-06-07T15:47:31.133692Z",
     "iopub.status.idle": "2026-06-07T15:47:31.141767Z",
     "shell.execute_reply": "2026-06-07T15:47:31.140911Z"
    }
   },
   "outputs": [],
   "source": [
    "# Write JSON artifacts after converting NumPy/pandas values to plain Python values.\n",
    "def json_safe(obj: Any) -> Any:\n",
    "    \"\"\"Convert NumPy/pandas values into JSON-serializable Python objects.\"\"\"\n",
    "    if isinstance(obj, dict):\n",
    "        return {str(k): json_safe(v) for k, v in obj.items()}\n",
    "    if isinstance(obj, (list, tuple)):\n",
    "        return [json_safe(v) for v in obj]\n",
    "    if isinstance(obj, np.integer):\n",
    "        return int(obj)\n",
    "    if isinstance(obj, np.floating):\n",
    "        if np.isnan(obj):\n",
    "            return None\n",
    "        return float(obj)\n",
    "    if isinstance(obj, np.ndarray):\n",
    "        return json_safe(obj.tolist())\n",
    "    try:\n",
    "        if pd.isna(obj):\n",
    "            return None\n",
    "    except Exception:\n",
    "        pass\n",
    "    return obj\n",
    "\n",
    "def write_json(path: str | Path, payload: Any) -> None:\n",
    "    \"\"\"Write an object as indented UTF-8 JSON after ``json_safe`` cleanup.\"\"\"\n",
    "    path = Path(path)\n",
    "    path.parent.mkdir(parents=True, exist_ok=True)\n",
    "    path.write_text(json.dumps(json_safe(payload), indent=2), encoding=\"utf-8\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "414dca93",
   "metadata": {},
   "source": [
    "## Save artifacts\n",
    "\n",
    "We save several files that will be consumed by later notebooks:\n",
    "\n",
    "1. **`route_sequences.csv`**: The main tokenized dataset with train/val/test splits\n",
    "2. **`routes_tokenized.jsonl`**: Same data in JSON Lines format (one JSON object per route)\n",
    "3. **`token_vocab.json`**: The vocabulary mapping (stoi and itos)\n",
    "4. **`token_metadata.csv`**: Metadata for each token (coordinates, roles, etc.)\n",
    "5. **`placement_metadata.csv`**: Physical placement information\n",
    "6. **`board_summary.csv`**: Aggregate statistics per board\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50e81878",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:47:31.144897Z",
     "iopub.status.busy": "2026-06-07T15:47:31.144460Z",
     "iopub.status.idle": "2026-06-07T15:48:29.973473Z",
     "shell.execute_reply": "2026-06-07T15:48:29.972779Z"
    }
   },
   "outputs": [],
   "source": [
    "OUT = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
    "OUT.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "csv_cols = [\n",
    "    \"uuid\", \"board_key\", \"board_display_name\", \"board_token_prefix\", \"board_token\",\n",
    "    \"climb_name\", \"setter_username\", \"layout_id\", \"layout_name\", \"board_name\",\n",
    "    \"frames\", \"angle\", \"display_difficulty\", \"grouped_v\", \"boulder_grade\",\n",
    "    \"ascensionist_count\", \"quality_average\", \"fa_at\",\n",
    "    \"n_holds\", \"n_start\", \"n_middle\", \"n_foot\", \"n_finish\",\n",
    "    \"sequence_with_grade\", \"sequence_no_grade\", \"split\",\n",
    "]\n",
    "df_routes[csv_cols].to_csv(OUT / \"route_sequences.csv\", index=False)\n",
    "\n",
    "df_placements.to_csv(OUT / \"placement_metadata.csv\", index=False)\n",
    "\n",
    "df_token_meta.to_csv(OUT / \"token_metadata.csv\", index=False)\n",
    "\n",
    "write_json(OUT / \"token_vocab.json\", vocab_payload(stoi, itos, configs_by_key))\n",
    "\n",
    "with (OUT / \"routes_tokenized.jsonl\").open(\"w\", encoding=\"utf-8\") as handle:\n",
    "    for record in df_routes.to_dict(orient=\"records\"):\n",
    "        handle.write(json.dumps(json_safe(record)) + \"\\n\")\n",
    "\n",
    "board_summary = (\n",
    "    df_routes.groupby(\"board_key\")\n",
    "    .agg(\n",
    "        n_routes=(\"uuid\", \"count\"),\n",
    "        mean_angle=(\"angle\", \"mean\"),\n",
    "        mean_display_difficulty=(\"display_difficulty\", \"mean\"),\n",
    "        mean_holds=(\"n_holds\", \"mean\"),\n",
    "    )\n",
    "    .reset_index()\n",
    ")\n",
    "board_summary.to_csv(OUT / \"board_summary.csv\", index=False)\n",
    "\n",
    "print(\"Saved artifacts to:\", OUT)\n",
    "print(f\"  - route_sequences.csv ({len(df_routes):,} rows)\")\n",
    "print(f\"  - routes_tokenized.jsonl\")\n",
    "print(f\"  - token_vocab.json ({len(stoi):,} tokens)\")\n",
    "print(f\"  - token_metadata.csv ({len(df_token_meta):,} rows)\")\n",
    "print(f\"  - placement_metadata.csv ({len(df_placements):,} rows)\")\n",
    "print(f\"  - board_summary.csv\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}