ClimbingBoardGPT/notebooks/01_unified_route_tokenization.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "94a0ae33",
   "metadata": {},
   "source": [
    "# 01 — Unified Route Tokenization for TB2 and Kilter\n",
    "\n",
    "## What is tokenization and why does it matter?\n",
    "\n",
    "In natural language processing, **tokenization** is the process of converting raw text into a sequence of discrete symbols (tokens) that a model can process. For example, the sentence \"I climb rocks\" might be tokenized as `[\"I\", \" climb\", \" rocks\"]` using a subword tokenizer like BPE.\n",
    "\n",
    "For climbing board routes, we face an analogous problem: how do we convert a climb — which is fundamentally a *set of holds at specific positions with specific roles* — into a sequence of tokens that a transformer can learn from?\n",
    "\n",
    "### Key design decisions in this notebook\n",
    "\n",
    "1. **Board namespacing**: Each hold token includes the board prefix (e.g., `TB2_p344_start` vs `KILTER_p1084_start`). This prevents placement ID collisions between boards — placement 344 on TB2 is a completely different physical hold than placement 344 on Kilter (in fact, the latter does not exist).\n",
    "\n",
    "2. **Semantic role mapping**: Different boards use different role IDs (TB2 uses 5/6/7/8, Kilter uses 12/13/14/15), but they all map to the same semantic roles: `start`, `middle`, `finish`, `foot`. This shared vocabulary lets the model learn transferable patterns.\n",
    "\n",
    "3. **Canonical ordering**: Holds within a route are sorted by (role priority, y-position, x-position). This gives the model a consistent input ordering, similar to how LLMs expect text in left-to-right order.\n",
    "\n",
    "4. **Special tokens**: Like BERT and GPT, we use special tokens:\n",
    "   - `<BOS>` (beginning of sequence) — marks the start, like `[CLS]` in BERT\n",
    "   - `<EOS>` (end of sequence) — marks the end, like `[SEP]` or the end-of-text token in GPT\n",
    "   - `<PAD>` — for batching sequences of different lengths\n",
    "   - `<UNK>` — for unknown tokens (safety net)\n",
    "   - `<CLS>` — used by the grade predictor to pool sequence information\n",
    "   - `<MASK>` — reserved for future masked language modeling experiments\n",
    "\n",
    "5. **Conditioning tokens**: Routes are prefixed with board, angle, and grade tokens. This is analogous to how modern LLMs use system prompts or task prefixes to condition generation.\n",
    "\n",
    "### The analogy to NLP\n",
    "\n",
    "| NLP Concept | Climbing Board Analog |\n",
    "|---|---|\n",
    "| Word / Subword | Hold token (placement + role) |\n",
    "| Sentence | Route (sequence of holds) |\n",
    "| Document language | Board type (TB2 vs Kilter) |\n",
    "| Sentence length | Number of holds in route |\n",
    "| POS tag | Semantic role (start/middle/finish/foot) |\n",
    "| Genre / Domain | Angle + Grade conditioning |\n",
    "\n",
    "This notebook tokenizes climbing routes from **both** supported boards:\n",
    "\n",
    "- Tension Board 2 Mirror\n",
    "- Kilter Board Original\n",
    "\n",
    "The board-specific details are stored in `configs/tb2.json` and `configs/kilter.json`.\n",
    "The shared tokenization code lives in `src/climbingboardgpt/`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ee2907f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import sys\n",
    "import json\n",
    "import pandas as pd\n",
    "\n",
    "# Set up the project root so we can import our custom package\n",
    "ROOT = Path.cwd().resolve()\n",
    "if ROOT.name == \"notebooks\":\n",
    "    ROOT = ROOT.parent\n",
    "sys.path.insert(0, str(ROOT / \"src\"))\n",
    "\n",
    "# Import our custom modules\n",
    "from climbingboardgpt.config import load_board_configs\n",
    "from climbingboardgpt.data import load_multi_board_data\n",
    "from climbingboardgpt.tokenization import (\n",
    "    build_route_records,\n",
    "    build_token_metadata,\n",
    "    build_vocab,\n",
    "    encode,\n",
    "    make_placement_lookup,\n",
    "    vocab_payload,\n",
    ")\n",
    "from climbingboardgpt.utils import assign_group_splits, write_json, json_safe"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3066e2b",
   "metadata": {},
   "source": [
    "## Load board configurations\n",
    "\n",
    "Each board has its own configuration file (`configs/tb2.json`, `configs/kilter.json`) that specifies:\n",
    "\n",
    "- **`layout_id`**: Which board layout to use (TB2 Mirror = 10, Kilter Original = 1)\n",
    "- **`role_definitions`**: Maps semantic role names to board-specific role IDs\n",
    "  - TB2: start=5, middle=6, finish=7, foot=8\n",
    "  - Kilter: start=12, middle=13, finish=14, foot=15\n",
    "- **`max_angle`**: We filter out climbs at extreme angles (>50° for TB2, >55° for Kilter) because those grades are biased toward elite climbers\n",
    "- **`token_prefix`**: The namespace prefix for hold tokens (\"TB2\" vs \"KILTER\")\n",
    "- **`include_mirror_placement_id`**: Whether to include mirror information (TB2 has symmetric left/right holds)\n",
    "\n",
    "This configuration-driven approach means we can add new boards by creating a new JSON config file, without changing any code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f04dcea",
   "metadata": {},
   "outputs": [],
   "source": [
    "configs = load_board_configs([\"tb2\", \"kilter\"])\n",
    "configs"
   ]
  },
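  {
   "cell_type": "markdown",
   "id": "c7a1e0b0",
   "metadata": {},
   "source": [
    "### Sketch: what a board config carries\n",
    "\n",
    "The exact schema of `configs/tb2.json` is not shown in this notebook, and `load_board_configs` returns config objects rather than plain dicts. As a rough illustration only, the dict below assembles the fields listed above using the TB2 values quoted in this notebook; the `board_key` value and the `True` for `include_mirror_placement_id` are assumptions.\n",
    "\n",
    "```python\n",
    "# Hypothetical stand-in for one board config (not the real file contents).\n",
    "tb2_config_sketch = {\n",
    "    'board_key': 'tb2',                   # assumed key name and value\n",
    "    'layout_id': 10,                      # TB2 Mirror\n",
    "    'role_definitions': {'start': 5, 'middle': 6, 'finish': 7, 'foot': 8},\n",
    "    'max_angle': 50,                      # climbs steeper than this are dropped\n",
    "    'token_prefix': 'TB2',\n",
    "    'include_mirror_placement_id': True,  # assumed value\n",
    "}\n",
    "\n",
    "# Tokenization needs the reverse direction: board-specific role ID -> semantic role.\n",
    "role_id_to_name = {\n",
    "    role_id: name\n",
    "    for name, role_id in tb2_config_sketch['role_definitions'].items()\n",
    "}\n",
    "print(role_id_to_name)  # {5: 'start', 6: 'middle', 7: 'finish', 8: 'foot'}\n",
    "```\n",
    "\n",
    "A Kilter config would have the same shape with its own role IDs (12 to 15), which is what lets both boards share one set of semantic roles."
   ]
  },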
  {
   "cell_type": "markdown",
   "id": "2a5c9a9b",
   "metadata": {},
   "source": [
    "## Load raw climbs and placement metadata\n",
    "\n",
    "The data loading step reads from SQLite databases downloaded using BoardLib:\n",
    "\n",
    "```bash\n",
    "boardlib database tension data/raw/tb2.db\n",
    "boardlib database kilter data/raw/kilter.db\n",
    "```\n",
    "\n",
    "### What we're loading\n",
    "\n",
    "**Climbs data** (`df_climbs`): Each row is a climb-angle observation. Key columns:\n",
    "- `uuid`: Unique climb identifier\n",
    "- `frames`: The raw string encoding holds and roles, e.g., `p344r5p369r6p603r7`\n",
    "- `angle`: Wall angle in degrees\n",
    "- `display_difficulty`: Numeric difficulty score (maps to V-grades)\n",
    "- `boulder_grade`: Human-readable grade like \"6b/V4\"\n",
    "\n",
    "**Placements data** (`df_placements`): Each row is a physical hold position on the board. Key columns:\n",
    "- `placement_id`: The hold's unique ID within its board\n",
    "- `x`, `y`: Physical coordinates on the board (in inches)\n",
    "- `default_role_id`: What role this hold typically plays (hand vs foot)\n",
    "- `set_name`: Material type (\"Wood\" or \"Plastic\")\n",
    "- `mirror_placement_id`: For TB2, the ID of the symmetric hold on the other side"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53c1951a",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_climbs, df_placements = load_multi_board_data(configs, project_root=ROOT)\n",
    "print(f\"Total climbs loaded: {len(df_climbs):,}\")\n",
    "print(f\"Total placements loaded: {len(df_placements):,}\")\n",
    "print()\n",
    "print(\"Climbs per board:\")\n",
    "print(df_climbs.groupby(\"board_key\").size())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6198c24e",
   "metadata": {},
   "source": [
    "## Build unified route records\n",
    "\n",
    "This is the core tokenization step. For each climb, we:\n",
    "\n",
    "1. **Parse the frames string**: Convert `p344r5p369r6p603r7` into a list of `(placement_id, role_id)` tuples\n",
    "\n",
    "2. **Map role IDs to semantic roles**: Convert board-specific role IDs (5→start, 6→middle, etc.) to shared semantic names\n",
    "\n",
    "3. **Canonicalize hold order**: Sort holds by (role priority, y-position, x-position). This is important because:\n",
    "   - The same climb can be represented with holds in any order in the database\n",
    "   - Transformers need consistent input ordering to learn patterns\n",
    "   - This is analogous to how NLP tokenizers normalize text (lowercasing, etc.)\n",
    "\n",
    "4. **Generate token sequences**: Create two versions of each route:\n",
    "   - `sequence_with_grade`: `<BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6> <TB2_p344_start> ... <EOS>`\n",
    "   - `sequence_no_grade`: `<BOS> <BOARD_TB2> <ANGLE_40> <TB2_p344_start> ... <EOS>` (grade removed)\n",
    "\n",
    "The grade-included version is used for the GPT generator (which predicts the next token, including grade). The grade-excluded version is used for the grade predictor (which receives the route without knowing the grade and must predict it)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20bed1da",
   "metadata": {},
   "outputs": [],
   "source": [
    "configs_by_key = {config.board_key: config for config in configs}\n",
    "configs_by_prefix = {config.token_prefix: config for config in configs}\n",
    "placement_lookup = make_placement_lookup(df_placements)\n",
    "\n",
    "df_routes = build_route_records(\n",
    "    df_climbs=df_climbs,\n",
    "    configs_by_key=configs_by_key,\n",
    "    placement_lookup=placement_lookup,\n",
    ")\n",
    "print(f\"Tokenized routes: {len(df_routes):,}\")\n",
    "print()\n",
    "df_routes[[\"board_key\", \"angle\", \"display_difficulty\", \"sequence_with_grade\"]].head()"
   ]
  },
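  {
   "cell_type": "markdown",
   "id": "c7a1e0b1",
   "metadata": {},
   "source": [
    "### Sketch: parsing and ordering a `frames` string\n",
    "\n",
    "The parsing, role-mapping, and ordering steps described in this section can be illustrated with a small standalone sketch. This is not the code in `src/climbingboardgpt/`: the helper names, the role priority (`start`, `middle`, `finish`, `foot`), the ascending sort direction, and the coordinates are assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# TB2 role IDs as listed in this notebook; the priority order is an assumption.\n",
    "TB2_ROLES = {5: 'start', 6: 'middle', 7: 'finish', 8: 'foot'}\n",
    "ROLE_PRIORITY = {'start': 0, 'middle': 1, 'finish': 2, 'foot': 3}\n",
    "FRAME_RE = re.compile(r'p(\\d+)r(\\d+)')\n",
    "\n",
    "def parse_frames(frames):\n",
    "    # 'p344r5p369r6' -> [(344, 5), (369, 6)]\n",
    "    return [(int(p), int(r)) for p, r in FRAME_RE.findall(frames)]\n",
    "\n",
    "def canonical_tokens(frames, xy_by_placement, prefix='TB2', roles=TB2_ROLES):\n",
    "    # Map role IDs to semantic names, then sort by (role priority, y, x).\n",
    "    holds = [(pid, roles[rid]) for pid, rid in parse_frames(frames)]\n",
    "    holds.sort(\n",
    "        key=lambda h: (\n",
    "            ROLE_PRIORITY[h[1]],\n",
    "            xy_by_placement[h[0]][1],\n",
    "            xy_by_placement[h[0]][0],\n",
    "        )\n",
    "    )\n",
    "    return [f'<{prefix}_p{pid}_{role}>' for pid, role in holds]\n",
    "\n",
    "# Hypothetical (x, y) coordinates, made up for this example.\n",
    "xy_demo = {344: (10, 5), 369: (20, 30), 603: (15, 60)}\n",
    "\n",
    "# The holds are listed finish-first here; the output is in canonical order.\n",
    "print(canonical_tokens('p603r7p369r6p344r5', xy_demo))\n",
    "# ['<TB2_p344_start>', '<TB2_p369_middle>', '<TB2_p603_finish>']\n",
    "```\n",
    "\n",
    "Because the sort key is fixed, two database rows that list the same holds in a different order produce the same token sequence."
   ]
  },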
  {
   "cell_type": "markdown",
   "id": "4d007fc0",
   "metadata": {},
   "source": [
    "## Example tokenized routes\n",
    "\n",
    "Let's look at what the tokenized routes actually look like. This is the \"text\" that our transformer models will read."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5b7391b",
   "metadata": {},
   "outputs": [],
   "source": [
    "for _, row in df_routes.groupby(\"board_key\").head(2).iterrows():\n",
    "    print(f\"Board: {row['board_key']}\")\n",
    "    print(f\"  Angle: {row['angle']}°\")\n",
    "    print(f\"  Grade: {row['boulder_grade']} (V{row['grouped_v']})\")\n",
    "    print(f\"  Tokens: {row['sequence_with_grade']}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0393d191",
   "metadata": {},
   "source": [
    "## Build the shared vocabulary\n",
    "\n",
    "### What is a vocabulary?\n",
    "\n",
    "In NLP, the **vocabulary** (or \"vocab\") is the set of all possible tokens the model can produce or understand. For GPT-3, this is ~50,000 BPE tokens. For our climbing model, it includes:\n",
    "\n",
    "1. **Special tokens** (6): `<PAD>`, `<UNK>`, `<BOS>`, `<EOS>`, `<CLS>`, `<MASK>`\n",
    "2. **Board tokens** (2): `<BOARD_TB2>`, `<BOARD_KILTER>`\n",
    "3. **Angle tokens** (~6): `<ANGLE_30>`, `<ANGLE_35>`, `<ANGLE_40>`, etc.\n",
    "4. **Grade tokens** (~17): `<GRADE_V0>` through `<GRADE_V16>`\n",
    "5. **Hold tokens** (~1000+): One per placement per board per role\n",
    "\n",
    "### Why board-namespaced hold tokens?\n",
    "\n",
    "Placement ID 344 on TB2 refers to a completely different physical hold than placement ID 344 on Kilter. By prefixing with the board name (`TB2_p344_start` vs `KILTER_p344_start`), we ensure the model treats these as distinct tokens.\n",
    "\n",
    "This is analogous to how multilingual LLMs use language-specific subword tokens — the same byte sequence can mean different things in different languages.\n",
    "\n",
    "### String-to-integer mapping\n",
    "\n",
    "Transformers operate on integer indices, not strings. The `stoi` (string-to-integer) and `itos` (integer-to-string) dictionaries provide this mapping, similar to how HuggingFace tokenizers work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ba5b78d",
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab_tokens, stoi, itos = build_vocab(df_routes)\n",
    "\n",
    "print(f\"Vocabulary size: {len(stoi):,}\")\n",
    "print()\n",
    "print(\"First 20 tokens (special + board tokens):\")\n",
    "print(vocab_tokens[:20])\n",
    "print()\n",
    "hold_tokens = [t for t in vocab_tokens if t.startswith('<') and '_p' in t]\n",
    "angle_tokens = [t for t in vocab_tokens if t.startswith('<ANGLE_')]\n",
    "grade_tokens = [t for t in vocab_tokens if t.startswith('<GRADE_')]\n",
    "board_tokens = [t for t in vocab_tokens if t.startswith('<BOARD_')]\n",
    "special_tokens = [t for t in vocab_tokens if t in ['<PAD>', '<UNK>', '<BOS>', '<EOS>', '<CLS>', '<MASK>']]\n",
    "\n",
    "print(f\"Special tokens: {len(special_tokens)}\")\n",
    "print(f\"Board tokens: {len(board_tokens)}\")\n",
    "print(f\"Angle tokens: {len(angle_tokens)}\")\n",
    "print(f\"Grade tokens: {len(grade_tokens)}\")\n",
    "print(f\"Hold tokens: {len(hold_tokens)}\")"
   ]
  },
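  {
   "cell_type": "markdown",
   "id": "c7a1e0b2",
   "metadata": {},
   "source": [
    "### Sketch: how `stoi`, `itos`, and encoding fit together\n",
    "\n",
    "`build_vocab` and `encode` are imported from `climbingboardgpt.tokenization`, and their internals are not shown in this notebook. The sketch below shows one common way such a mapping can work. The helper names (`build_vocab_sketch`, `encode_sketch`) are hypothetical, and the ordering (special tokens first, everything else sorted) and the `<UNK>` fallback are assumptions.\n",
    "\n",
    "```python\n",
    "SPECIALS = ['<PAD>', '<UNK>', '<BOS>', '<EOS>', '<CLS>', '<MASK>']\n",
    "\n",
    "def build_vocab_sketch(token_lists):\n",
    "    # Special tokens first (so <PAD> is index 0), then all other tokens sorted.\n",
    "    seen = {tok for toks in token_lists for tok in toks}\n",
    "    vocab = SPECIALS + sorted(seen - set(SPECIALS))\n",
    "    stoi_demo = {tok: i for i, tok in enumerate(vocab)}\n",
    "    itos_demo = {i: tok for tok, i in stoi_demo.items()}\n",
    "    return vocab, stoi_demo, itos_demo\n",
    "\n",
    "def encode_sketch(tokens, stoi_demo):\n",
    "    # Tokens missing from the vocabulary fall back to <UNK>.\n",
    "    unk_id = stoi_demo['<UNK>']\n",
    "    return [stoi_demo.get(tok, unk_id) for tok in tokens]\n",
    "\n",
    "demo_routes = [['<BOS>', '<BOARD_TB2>', '<ANGLE_40>', '<TB2_p344_start>', '<EOS>']]\n",
    "vocab_demo, stoi_demo, itos_demo = build_vocab_sketch(demo_routes)\n",
    "\n",
    "# Append a hold token that was never seen when the vocabulary was built.\n",
    "ids_demo = encode_sketch(demo_routes[0] + ['<TB2_p999_foot>'], stoi_demo)\n",
    "print(ids_demo)                          # [2, 7, 6, 8, 3, 1]\n",
    "print([itos_demo[i] for i in ids_demo])  # last token decodes to '<UNK>'\n",
    "```\n",
    "\n",
    "Decoding through `itos` is lossy only for tokens that were mapped to `<UNK>`, which is why that token is described above as a safety net."
   ]
  },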
  {
   "cell_type": "markdown",
   "id": "93176cff",
   "metadata": {},
   "source": [
    "## Train/validation/test splits\n",
    "\n",
    "### Why stratified splitting?\n",
    "\n",
    "We split data into train (80%), validation (10%), and test (10%) sets. Crucially, we **stratify by `board_key × grouped_v`** — this ensures that:\n",
    "\n",
    "1. Both boards (TB2 and Kilter) are represented in all splits\n",
    "2. All difficulty levels (V0 through V16) are represented in all splits\n",
    "\n",
    "Without stratification, we might end up with all V14 climbs in the test set and none in training, which would make evaluation meaningless.\n",
    "\n",
    "This is the same principle as stratified splitting in NLP, where you ensure all languages or domains are represented in each split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff18298e",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_routes[\"ids_with_grade\"] = df_routes[\"tokens_with_grade\"].apply(lambda tokens: encode(tokens, stoi))\n",
    "df_routes[\"ids_no_grade\"] = df_routes[\"tokens_no_grade\"].apply(lambda tokens: encode(tokens, stoi))\n",
    "df_routes[\"split_stratum\"] = df_routes[\"board_key\"].astype(str) + \"__V\" + df_routes[\"grouped_v\"].astype(str)\n",
    "df_routes[\"split\"] = assign_group_splits(\n",
    "    df_routes,\n",
    "    group_cols=[\"board_key\", \"uuid\"],\n",
    "    test_size=0.20,\n",
    "    val_size_within_temp=0.50,\n",
    "    random_state=3,\n",
    "    stratify_col=\"split_stratum\",\n",
    ")\n",
    "\n",
    "df_routes.groupby([\"board_key\", \"split\"]).size().unstack(fill_value=0)"
   ]
  },
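  {
   "cell_type": "markdown",
   "id": "c7a1e0b3",
   "metadata": {},
   "source": [
    "### Sketch: a grouped, stratified split\n",
    "\n",
    "`assign_group_splits` lives in `climbingboardgpt.utils` and its implementation is not shown here. The sketch below is one simple way to get the same kind of behavior with pandas and NumPy, under stated assumptions: the function name is hypothetical, each group takes the stratum of its first row, and the held-out share is split evenly between validation and test.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def grouped_stratified_split(df, group_col, stratum_col, test_size=0.2, seed=3):\n",
    "    # One row per group, so a climb logged at several angles is assigned once.\n",
    "    groups = df.drop_duplicates(group_col)[[group_col, stratum_col]]\n",
    "    rng = np.random.default_rng(seed)\n",
    "    assignment = {}\n",
    "    for _, block in groups.groupby(stratum_col):\n",
    "        ids = rng.permutation(block[group_col].to_numpy())\n",
    "        n_temp = int(round(len(ids) * test_size))  # held-out groups in this stratum\n",
    "        n_val = n_temp // 2\n",
    "        labels = (\n",
    "            ['val'] * n_val\n",
    "            + ['test'] * (n_temp - n_val)\n",
    "            + ['train'] * (len(ids) - n_temp)\n",
    "        )\n",
    "        assignment.update(zip(ids, labels))\n",
    "    # Broadcast each group's label back to all of its rows.\n",
    "    return df[group_col].map(assignment)\n",
    "\n",
    "# Toy data: 20 climbs, each logged at two angles, in two strata of 10 climbs.\n",
    "toy = pd.DataFrame({\n",
    "    'uuid': [f'c{i}' for i in range(20) for _ in range(2)],\n",
    "    'angle': [30, 40] * 20,\n",
    "    'stratum': ['tb2__V3'] * 20 + ['kilter__V5'] * 20,\n",
    "})\n",
    "toy['split'] = grouped_stratified_split(toy, 'uuid', 'stratum')\n",
    "\n",
    "# Each uuid lands in exactly one split, so this prints 1.\n",
    "print(toy.groupby('uuid')['split'].nunique().max())\n",
    "# Each stratum contributes 16 train, 2 val, and 2 test rows.\n",
    "print(toy.groupby(['stratum', 'split']).size().unstack(fill_value=0))\n",
    "```\n",
    "\n",
    "Very small strata are the weak point of any such scheme: with fewer than a handful of groups, a stratum may end up with no validation or test rows at all."
   ]
  },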
  {
   "cell_type": "markdown",
   "id": "414dca92",
   "metadata": {},
   "source": [
    "## Token metadata\n",
    "\n",
    "### Why metadata matters\n",
    "\n",
    "Each hold token carries rich metadata that the model can use:\n",
    "\n",
    "- **Physical coordinates** (`x`, `y`): Where the hold is on the board\n",
    "- **Normalized coordinates** (`x_norm`, `y_norm`): Scaled to [-1, 1] per board, so the model doesn't need to learn different coordinate scales\n",
    "- **Semantic role** (`start`, `middle`, `finish`, `foot`): What the hold is used for\n",
    "- **Board identity** (`board_key`): Which board this hold belongs to\n",
    "\n",
    "The grade predictor uses these coordinate features as additional embeddings alongside the token embeddings. This is similar to how some LLMs inject positional embeddings or segment embeddings — we're giving the model extra structured information about each token."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48c3692e",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_token_meta = build_token_metadata(\n",
    "    vocab_tokens=vocab_tokens,\n",
    "    stoi=stoi,\n",
    "    df_placements=df_placements,\n",
    "    placement_lookup=placement_lookup,\n",
    "    configs_by_prefix=configs_by_prefix,\n",
    ")\n",
    "\n",
    "print(\"Token metadata columns:\")\n",
    "print(df_token_meta.columns.tolist())\n",
    "print()\n",
    "print(\"Example hold token metadata:\")\n",
    "df_token_meta[df_token_meta[\"kind\"] == \"hold\"].head()"
   ]
  },
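  {
   "cell_type": "markdown",
   "id": "c7a1e0b4",
   "metadata": {},
   "source": [
    "### Sketch: per-board coordinate normalization\n",
    "\n",
    "The `x_norm` and `y_norm` columns are computed inside `build_token_metadata`, whose code is not shown in this notebook. One plausible way to scale coordinates to [-1, 1] per board is min-max scaling within each board, sketched below. The function name, the choice of min-max scaling, and the coordinates are assumptions for illustration.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "def normalize_per_board(df, cols=('x', 'y')):\n",
    "    # Min-max scale each coordinate to [-1, 1] separately for every board.\n",
    "    # Assumes each board has at least two distinct values per coordinate;\n",
    "    # otherwise hi - lo is zero and the result is NaN.\n",
    "    out = df.copy()\n",
    "    for col in cols:\n",
    "        grouped = out.groupby('board_key')[col]\n",
    "        lo = grouped.transform('min')\n",
    "        hi = grouped.transform('max')\n",
    "        out[f'{col}_norm'] = 2 * (out[col] - lo) / (hi - lo) - 1\n",
    "    return out\n",
    "\n",
    "# Made-up coordinates on two boards with different physical extents.\n",
    "toy_placements = pd.DataFrame({\n",
    "    'board_key': ['tb2', 'tb2', 'tb2', 'kilter', 'kilter'],\n",
    "    'x': [-64.0, 0.0, 64.0, 0.0, 144.0],\n",
    "    'y': [4.0, 72.0, 140.0, 0.0, 156.0],\n",
    "})\n",
    "print(normalize_per_board(toy_placements)[['board_key', 'x_norm', 'y_norm']])\n",
    "# tb2 rows map to -1, 0, 1 on both axes; kilter rows map to -1 and 1.\n",
    "```\n",
    "\n",
    "After this step, a hold at the top edge of either board has `y_norm` equal to 1, regardless of the board's physical size."
   ]
  },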
  {
   "cell_type": "markdown",
   "id": "414dca93",
   "metadata": {},
   "source": [
    "## Save artifacts\n",
    "\n",
    "We save several files that will be consumed by later notebooks:\n",
    "\n",
    "1. **`route_sequences.csv`**: The main tokenized dataset with train/val/test splits\n",
    "2. **`routes_tokenized.jsonl`**: Same data in JSON Lines format (one JSON object per route)\n",
    "3. **`token_vocab.json`**: The vocabulary mapping (stoi and itos)\n",
    "4. **`token_metadata.csv`**: Metadata for each token (coordinates, roles, etc.)\n",
    "5. **`placement_metadata.csv`**: Physical placement information\n",
    "6. **`board_summary.csv`**: Aggregate statistics per board"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50e81878",
   "metadata": {},
   "outputs": [],
   "source": [
    "OUT = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
    "OUT.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "csv_cols = [\n",
    "    \"uuid\", \"board_key\", \"board_display_name\", \"board_token_prefix\", \"board_token\",\n",
    "    \"climb_name\", \"setter_username\", \"layout_id\", \"layout_name\", \"board_name\",\n",
    "    \"frames\", \"angle\", \"display_difficulty\", \"grouped_v\", \"boulder_grade\",\n",
    "    \"ascensionist_count\", \"quality_average\", \"fa_at\",\n",
    "    \"n_holds\", \"n_start\", \"n_middle\", \"n_foot\", \"n_finish\",\n",
    "    \"sequence_with_grade\", \"sequence_no_grade\", \"split\",\n",
    "]\n",
    "df_routes[csv_cols].to_csv(OUT / \"route_sequences.csv\", index=False)\n",
    "\n",
    "df_placements.to_csv(OUT / \"placement_metadata.csv\", index=False)\n",
    "\n",
    "df_token_meta.to_csv(OUT / \"token_metadata.csv\", index=False)\n",
    "\n",
    "write_json(OUT / \"token_vocab.json\", vocab_payload(stoi, itos, configs_by_key))\n",
    "\n",
    "with (OUT / \"routes_tokenized.jsonl\").open(\"w\", encoding=\"utf-8\") as handle:\n",
    "    for record in df_routes.to_dict(orient=\"records\"):\n",
    "        handle.write(json.dumps(json_safe(record)) + \"\\n\")\n",
    "\n",
    "board_summary = (\n",
    "    df_routes.groupby(\"board_key\")\n",
    "    .agg(\n",
    "        n_routes=(\"uuid\", \"count\"),\n",
    "        mean_angle=(\"angle\", \"mean\"),\n",
    "        mean_display_difficulty=(\"display_difficulty\", \"mean\"),\n",
    "        mean_holds=(\"n_holds\", \"mean\"),\n",
    "    )\n",
    "    .reset_index()\n",
    ")\n",
    "board_summary.to_csv(OUT / \"board_summary.csv\", index=False)\n",
    "\n",
    "print(\"Saved artifacts to:\", OUT)\n",
    "print(f\"  - route_sequences.csv ({len(df_routes):,} rows)\")\n",
    "print(f\"  - routes_tokenized.jsonl\")\n",
    "print(f\"  - token_vocab.json ({len(stoi):,} tokens)\")\n",
    "print(f\"  - token_metadata.csv ({len(df_token_meta):,} rows)\")\n",
    "print(f\"  - placement_metadata.csv ({len(df_placements):,} rows)\")\n",
    "print(f\"  - board_summary.csv\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}