{
"cells": [
{
"cell_type": "markdown",
"id": "94a0ae33",
"metadata": {},
"source": [
"# 01 — Unified Route Tokenization for TB2 and Kilter\n",
"\n",
"## What is tokenization and why does it matter?\n",
"\n",
"In natural language processing, **tokenization** is the process of converting raw text into a sequence of discrete symbols (tokens) that a model can process. For example, the sentence \"I climb rocks\" might be tokenized as `[\"I\", \" climb\", \" rocks\"]` using a subword tokenizer like BPE.\n",
"\n",
"For climbing board routes, we face an analogous problem: how do we convert a climb — which is fundamentally a *set of holds at specific positions with specific roles* — into a sequence of tokens that a transformer can learn from?\n",
"\n",
"### Key design decisions in this notebook\n",
"\n",
"1. **Board namespacing**: Each hold token includes the board prefix (e.g., `TB2_p344_start` vs `KILTER_p1084_start`). This prevents placement ID collisions between boards — placement 344 on TB2 is a completely different physical hold than placement 344 on Kilter (in fact, the latter does not exist).\n",
"\n",
"2. **Semantic role mapping**: Different boards use different role IDs (TB2 uses 5/6/7/8, Kilter uses 12/13/14/15), but they all map to the same semantic roles: `start`, `middle`, `finish`, `foot`. This shared vocabulary lets the model learn transferable patterns.\n",
"\n",
"3. **Canonical ordering**: Holds within a route are sorted by (role priority, y-position, x-position). This gives the model a consistent input ordering, similar to how LLMs expect text in left-to-right order.\n",
"\n",
"4. **Special tokens**: Like BERT and GPT, we use special tokens:\n",
"   - `<BOS>` (beginning of sequence) — marks the start, like `[CLS]` in BERT\n",
"   - `<EOS>` (end of sequence) — marks the end, like `[SEP]` or the end-of-text token in GPT\n",
"   - `<PAD>` — for batching sequences of different lengths\n",
"   - `<UNK>` — for unknown tokens (safety net)\n",
"   - `<CLS>` — used by the grade predictor to pool sequence information\n",
"   - `<MASK>` — reserved for future masked language modeling experiments\n",
"\n",
"5. **Conditioning tokens**: Routes are prefixed with board, angle, and grade tokens. This is analogous to how modern LLMs use system prompts or task prefixes to condition generation.\n",
"\n",
"### The analogy to NLP\n",
"\n",
"| NLP Concept | Climbing Board Analog |\n",
"|---|---|\n",
"| Word / Subword | Hold token (placement + role) |\n",
"| Sentence | Route (sequence of holds) |\n",
"| Document language | Board type (TB2 vs Kilter) |\n",
"| Sentence length | Number of holds in route |\n",
"| POS tag | Semantic role (start/middle/finish/foot) |\n",
"| Genre / Domain | Angle + Grade conditioning |\n",
"\n",
"This notebook tokenizes climbing routes from **both** supported boards:\n",
"\n",
"- Tension Board 2 Mirror\n",
"- Kilter Board Original\n",
"\n",
"The board-specific details are stored in `configs/tb2.json` and `configs/kilter.json`.\n",
"The shared tokenization code lives in `src/climbingboardgpt/`."
]
},
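{
"cell_type": "markdown",
"id": "sketch-hold-token",
"metadata": {},
"source": [
"As a rough illustration of design decisions 1 and 2, the sketch below maps board-specific role IDs to the shared role names and builds a namespaced hold token. The role IDs and prefixes are the ones quoted in this notebook; `ROLE_NAMES` and `hold_token` are hypothetical names used only for illustration and are not the API of `climbingboardgpt.tokenization`.\n",
"\n",
"```python\n",
"# Hypothetical helper names; role IDs and prefixes are the ones listed in this notebook.\n",
"ROLE_NAMES = {\n",
"    'TB2': {5: 'start', 6: 'middle', 7: 'finish', 8: 'foot'},\n",
"    'KILTER': {12: 'start', 13: 'middle', 14: 'finish', 15: 'foot'},\n",
"}\n",
"\n",
"\n",
"def hold_token(prefix, placement_id, role_id):\n",
"    # Translate the board-specific role ID into the shared role name.\n",
"    role = ROLE_NAMES[prefix][role_id]\n",
"    # The board prefix keeps equal placement IDs on different boards apart.\n",
"    return f'<{prefix}_p{placement_id}_{role}>'\n",
"\n",
"\n",
"print(hold_token('TB2', 344, 5))        # <TB2_p344_start>\n",
"print(hold_token('KILTER', 1084, 12))   # <KILTER_p1084_start>\n",
"```\n",
"\n",
"Both tokens end in the same role name, so `start` carries one meaning across boards, while the prefix keeps the placement IDs from colliding."
]
},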
{
"cell_type": "code",
"execution_count": null,
"id": "6ee2907f",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import sys\n",
"import json\n",
"import pandas as pd\n",
"\n",
"# Set up the project root so we can import our custom package\n",
"ROOT = Path.cwd().resolve()\n",
"if ROOT.name == \"notebooks\":\n",
"    ROOT = ROOT.parent\n",
"sys.path.insert(0, str(ROOT / \"src\"))\n",
"\n",
"# Import our custom modules\n",
"from climbingboardgpt.config import load_board_configs\n",
"from climbingboardgpt.data import load_multi_board_data\n",
"from climbingboardgpt.tokenization import (\n",
"    build_route_records,\n",
"    build_token_metadata,\n",
"    build_vocab,\n",
"    encode,\n",
"    make_placement_lookup,\n",
"    vocab_payload,\n",
")\n",
"from climbingboardgpt.utils import assign_group_splits, write_json, json_safe"
]
},
{
"cell_type": "markdown",
"id": "b3066e2b",
"metadata": {},
"source": [
"## Load board configurations\n",
"\n",
"Each board has its own configuration file (`configs/tb2.json`, `configs/kilter.json`) that specifies:\n",
"\n",
"- **`layout_id`**: Which board layout to use (TB2 Mirror = 10, Kilter Original = 1)\n",
"- **`role_definitions`**: Maps semantic role names to board-specific role IDs\n",
"  - TB2: start=5, middle=6, finish=7, foot=8\n",
"  - Kilter: start=12, middle=13, finish=14, foot=15\n",
"- **`max_angle`**: We filter out climbs at extreme angles (>50° for TB2, >55° for Kilter) because those grades are biased toward elite climbers\n",
"- **`token_prefix`**: The namespace prefix for hold tokens (\"TB2\" vs \"KILTER\")\n",
"- **`include_mirror_placement_id`**: Whether to include mirror information (TB2 has symmetric left/right holds)\n",
"\n",
"This configuration-driven approach means we can add new boards by creating a new JSON config file, without changing any code."
]
},
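{
"cell_type": "markdown",
"id": "sketch-board-config",
"metadata": {},
"source": [
"To make the fields listed for the board configs concrete, here is a minimal sketch of a typed container for one config. The values are the TB2 ones quoted in this notebook, but the exact schema of `configs/tb2.json` is an assumption (including the `board_key` value), and `BoardConfig` with its `role_name` method is a hypothetical stand-in, not the object returned by `load_board_configs`.\n",
"\n",
"```python\n",
"from dataclasses import dataclass\n",
"\n",
"\n",
"@dataclass\n",
"class BoardConfig:\n",
"    # Hypothetical container; field names follow the config fields described in this notebook.\n",
"    board_key: str\n",
"    layout_id: int\n",
"    role_definitions: dict\n",
"    max_angle: int\n",
"    token_prefix: str\n",
"    include_mirror_placement_id: bool\n",
"\n",
"    def role_name(self, role_id):\n",
"        # Invert role_definitions: board-specific role ID -> semantic role name.\n",
"        by_id = {rid: name for name, rid in self.role_definitions.items()}\n",
"        return by_id[role_id]\n",
"\n",
"\n",
"# Assumed contents; in the project these values would be read from configs/tb2.json.\n",
"raw = {\n",
"    'board_key': 'tb2',\n",
"    'layout_id': 10,\n",
"    'role_definitions': {'start': 5, 'middle': 6, 'finish': 7, 'foot': 8},\n",
"    'max_angle': 50,\n",
"    'token_prefix': 'TB2',\n",
"    'include_mirror_placement_id': True,\n",
"}\n",
"tb2 = BoardConfig(**raw)\n",
"print(tb2.role_name(6))       # middle\n",
"print(45 <= tb2.max_angle)    # True: a 45 degree climb passes the angle filter\n",
"```\n",
"\n",
"Under this layout, adding a board means supplying one more dictionary of the same shape, with no change to the code that consumes it."
]
},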
{
"cell_type": "code",
"execution_count": null,
"id": "4f04dcea",
"metadata": {},
"outputs": [],
"source": [
"configs = load_board_configs([\"tb2\", \"kilter\"])\n",
"configs"
]
},
{
"cell_type": "markdown",
"id": "2a5c9a9b",
"metadata": {},
"source": [
"## Load raw climbs and placement metadata\n",
"\n",
"The data loading step reads from SQLite databases downloaded using BoardLib:\n",
"\n",
"```bash\n",
"boardlib database tension data/raw/tb2.db\n",
"boardlib database kilter data/raw/kilter.db\n",
"```\n",
"\n",
"### What we're loading\n",
"\n",
"**Climbs data** (`df_climbs`): Each row is a climb-angle observation. Key columns:\n",
"- `uuid`: Unique climb identifier\n",
"- `frames`: The raw string encoding holds and roles, e.g., `p344r5p369r6p603r7`\n",
"- `angle`: Wall angle in degrees\n",
"- `display_difficulty`: Numeric difficulty score (maps to V-grades)\n",
"- `boulder_grade`: Human-readable grade like \"6b/V4\"\n",
"\n",
"**Placements data** (`df_placements`): Each row is a physical hold position on the board. Key columns:\n",
"- `placement_id`: The hold's unique ID within its board\n",
"- `x`, `y`: Physical coordinates on the board (in inches)\n",
"- `default_role_id`: What role this hold typically plays (hand vs foot)\n",
"- `set_name`: Material type (\"Wood\" or \"Plastic\")\n",
"- `mirror_placement_id`: For TB2, the ID of the symmetric hold on the other side"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53c1951a",
"metadata": {},
"outputs": [],
"source": [
"df_climbs, df_placements = load_multi_board_data(configs, project_root=ROOT)\n",
"print(f\"Total climbs loaded: {len(df_climbs):,}\")\n",
"print(f\"Total placements loaded: {len(df_placements):,}\")\n",
"print()\n",
"print(\"Climbs per board:\")\n",
"print(df_climbs.groupby(\"board_key\").size())"
]
},
{
"cell_type": "markdown",
"id": "6198c24e",
"metadata": {},
"source": [
"## Build unified route records\n",
"\n",
"This is the core tokenization step. For each climb, we:\n",
"\n",
"1. **Parse the frames string**: Convert `p344r5p369r6p603r7` into a list of `(placement_id, role_id)` tuples\n",
"\n",
"2. **Map role IDs to semantic roles**: Convert board-specific role IDs (5→start, 6→middle, etc.) to shared semantic names\n",
"\n",
"3. **Canonicalize hold order**: Sort holds by (role priority, y-position, x-position). This is important because:\n",
"   - The same climb can be represented with holds in any order in the database\n",
"   - Transformers need consistent input ordering to learn patterns\n",
"   - This is analogous to how NLP tokenizers normalize text (lowercasing, etc.)\n",
"\n",
"4. **Generate token sequences**: Create two versions of each route:\n",
"   - `sequence_with_grade`: `<BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6> <TB2_p344_start> ... <EOS>`\n",
"   - `sequence_no_grade`: `<BOS> <BOARD_TB2> <ANGLE_40> <TB2_p344_start> ... <EOS>` (grade removed)\n",
"\n",
"The grade-included version is used for the GPT generator (which predicts the next token, including grade). The grade-excluded version is used for the grade predictor (which receives the route without knowing the grade and must predict it)."
]
},
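{
"cell_type": "markdown",
"id": "sketch-route-tokens",
"metadata": {},
"source": [
"The four steps listed for building route records can be sketched in a few lines of plain Python. This is a simplified illustration, not the body of `build_route_records`: the role priority order (start, middle, finish, foot), the ascending sort on y and x, the function names, and the coordinates are all assumptions made for the example.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Assumed priority order; the real tokenizer may rank roles differently.\n",
"ROLE_PRIORITY = {'start': 0, 'middle': 1, 'finish': 2, 'foot': 3}\n",
"TB2_ROLES = {5: 'start', 6: 'middle', 7: 'finish', 8: 'foot'}\n",
"\n",
"\n",
"def parse_frames(frames):\n",
"    # 'p344r5p369r6' -> [(344, 5), (369, 6)]\n",
"    return [(int(p), int(r)) for p, r in re.findall('p([0-9]+)r([0-9]+)', frames)]\n",
"\n",
"\n",
"def tokenize_route(frames, coords, angle, grade, prefix='TB2', roles=TB2_ROLES):\n",
"    # Steps 1 and 2: parse the frames string and map role IDs to role names.\n",
"    holds = [(pid, roles[rid]) for pid, rid in parse_frames(frames)]\n",
"    # Step 3: canonical order by role priority, then y, then x.\n",
"    # coords maps placement_id -> (x, y).\n",
"    holds.sort(key=lambda h: (ROLE_PRIORITY[h[1]], coords[h[0]][1], coords[h[0]][0]))\n",
"    hold_tokens = [f'<{prefix}_p{pid}_{role}>' for pid, role in holds]\n",
"    # Step 4: conditioning prefix, with and without the grade token.\n",
"    head = ['<BOS>', f'<BOARD_{prefix}>', f'<ANGLE_{angle}>']\n",
"    with_grade = head + [f'<GRADE_V{grade}>'] + hold_tokens + ['<EOS>']\n",
"    no_grade = head + hold_tokens + ['<EOS>']\n",
"    return with_grade, no_grade\n",
"\n",
"\n",
"# Made-up coordinates, for illustration only.\n",
"coords = {344: (10, 4), 369: (14, 20), 603: (12, 40)}\n",
"with_grade, no_grade = tokenize_route('p603r7p344r5p369r6', coords, angle=40, grade=6)\n",
"print(' '.join(with_grade))\n",
"print(' '.join(no_grade))\n",
"```\n",
"\n",
"The frames string in the example lists the finish hold first, yet both sequences come out in start, middle, finish order. That is the point of canonicalization: two database rows describing the same set of holds produce the same token sequence."
]
},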
{
"cell_type": "code",
"execution_count": null,
"id": "20bed1da",
"metadata": {},
"outputs": [],
"source": [
"configs_by_key = {config.board_key: config for config in configs}\n",
"configs_by_prefix = {config.token_prefix: config for config in configs}\n",
"placement_lookup = make_placement_lookup(df_placements)\n",
"\n",
"df_routes = build_route_records(\n",
"    df_climbs=df_climbs,\n",
"    configs_by_key=configs_by_key,\n",
"    placement_lookup=placement_lookup,\n",
")\n",
"print(f\"Tokenized routes: {len(df_routes):,}\")\n",
"print()\n",
"df_routes[[\"board_key\", \"angle\", \"display_difficulty\", \"sequence_with_grade\"]].head()"
]
},
{
"cell_type": "markdown",
"id": "4d007fc0",
"metadata": {},
"source": [
"## Example tokenized routes\n",
"\n",
"Let's look at what the tokenized routes actually look like. This is the \"text\" that our transformer models will read."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5b7391b",
"metadata": {},
"outputs": [],
"source": [
"for _, row in df_routes.groupby(\"board_key\").head(2).iterrows():\n",
"    print(f\"Board: {row['board_key']}\")\n",
"    print(f\" Angle: {row['angle']}°\")\n",
"    print(f\" Grade: {row['boulder_grade']} (V{row['grouped_v']})\")\n",
"    print(f\" Tokens: {row['sequence_with_grade']}\")\n",
"    print()"
]
},
{
"cell_type": "markdown",
"id": "0393d191",
"metadata": {},
"source": [
"## Build the shared vocabulary\n",
"\n",
"### What is a vocabulary?\n",
"\n",
"In NLP, the **vocabulary** (or \"vocab\") is the set of all possible tokens the model can produce or understand. For GPT-3, this is ~50,000 BPE tokens. For our climbing model, it includes:\n",
"\n",
"1. **Special tokens** (6): `<PAD>`, `<UNK>`, `<BOS>`, `<EOS>`, `<CLS>`, `<MASK>`\n",
"2. **Board tokens** (2): `<BOARD_TB2>`, `<BOARD_KILTER>`\n",
"3. **Angle tokens** (~6): `<ANGLE_30>`, `<ANGLE_35>`, `<ANGLE_40>`, etc.\n",
"4. **Grade tokens** (~17): `<GRADE_V0>` through `<GRADE_V16>`\n",
"5. **Hold tokens** (~1000+): One per placement per board per role\n",
"\n",
"### Why board-namespaced hold tokens?\n",
"\n",
"Placement IDs are only unique within a board: the same number can refer to a different physical hold on another board, or not exist there at all (placement 344 exists on TB2 but not on Kilter). By prefixing with the board name (`TB2_p344_start` vs `KILTER_p1084_start`), we ensure the model never conflates holds from different boards.\n",
"\n",
"This is analogous to how multilingual LLMs use language-specific subword tokens — the same byte sequence can mean different things in different languages.\n",
"\n",
"### String-to-integer mapping\n",
"\n",
"Transformers operate on integer indices, not strings. The `stoi` (string-to-integer) and `itos` (integer-to-string) dictionaries provide this mapping, similar to how HuggingFace tokenizers work."
]
},
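{
"cell_type": "markdown",
"id": "sketch-vocab",
"metadata": {},
"source": [
"A minimal sketch of the `stoi` / `itos` idea follows. It is not the implementation of `build_vocab` or `encode`: placing the special tokens first, sorting the remaining tokens, and falling back to `<UNK>` for unseen tokens are assumptions chosen to keep the example deterministic, and the `_sketch` function names are hypothetical.\n",
"\n",
"```python\n",
"SPECIAL_TOKENS = ['<PAD>', '<UNK>', '<BOS>', '<EOS>', '<CLS>', '<MASK>']\n",
"\n",
"\n",
"def build_vocab_sketch(token_sequences):\n",
"    # Special tokens first so their IDs stay stable, then all other tokens sorted.\n",
"    seen = {tok for seq in token_sequences for tok in seq}\n",
"    vocab = SPECIAL_TOKENS + sorted(seen - set(SPECIAL_TOKENS))\n",
"    stoi = {tok: i for i, tok in enumerate(vocab)}\n",
"    itos = {i: tok for tok, i in stoi.items()}\n",
"    return vocab, stoi, itos\n",
"\n",
"\n",
"def encode_sketch(tokens, stoi):\n",
"    # Tokens missing from the vocabulary map to the <UNK> ID instead of raising KeyError.\n",
"    return [stoi.get(tok, stoi['<UNK>']) for tok in tokens]\n",
"\n",
"\n",
"routes = [\n",
"    ['<BOS>', '<BOARD_TB2>', '<ANGLE_40>', '<TB2_p344_start>', '<EOS>'],\n",
"    ['<BOS>', '<BOARD_KILTER>', '<ANGLE_40>', '<KILTER_p1084_start>', '<EOS>'],\n",
"]\n",
"vocab, stoi, itos = build_vocab_sketch(routes)\n",
"ids = encode_sketch(routes[0], stoi)\n",
"print(ids)\n",
"print([itos[i] for i in ids] == routes[0])        # True: decoding inverts encoding\n",
"print(encode_sketch(['<TB2_p999_foot>'], stoi))   # [1], the <UNK> ID\n",
"```\n",
"\n",
"Keeping `<PAD>` at index 0 is a common convention because padding masks and embedding layers often treat index 0 specially; whether the project follows it is not confirmed here."
]
},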
{
"cell_type": "code",
"execution_count": null,
"id": "1ba5b78d",
"metadata": {},
"outputs": [],
"source": [
"vocab_tokens, stoi, itos = build_vocab(df_routes)\n",
"\n",
"print(f\"Vocabulary size: {len(stoi):,}\")\n",
"print()\n",
"print(\"First 20 tokens:\")\n",
"print(vocab_tokens[:20])\n",
"print()\n",
"# Count tokens by kind; hold tokens look like <TB2_p344_start>\n",
"hold_tokens = [t for t in vocab_tokens if t.startswith('<') and '_p' in t]\n",
"angle_tokens = [t for t in vocab_tokens if t.startswith('<ANGLE_')]\n",
"grade_tokens = [t for t in vocab_tokens if t.startswith('<GRADE_')]\n",
"board_tokens = [t for t in vocab_tokens if t.startswith('<BOARD_')]\n",
"special_tokens = [t for t in vocab_tokens if t in ['<PAD>', '<UNK>', '<BOS>', '<EOS>', '<CLS>', '<MASK>']]\n",
"\n",
"print(f\"Special tokens: {len(special_tokens)}\")\n",
"print(f\"Board tokens: {len(board_tokens)}\")\n",
"print(f\"Angle tokens: {len(angle_tokens)}\")\n",
"print(f\"Grade tokens: {len(grade_tokens)}\")\n",
"print(f\"Hold tokens: {len(hold_tokens)}\")"
]
},
{
"cell_type": "markdown",
"id": "93176cff",
"metadata": {},
"source": [
"## Train/validation/test splits\n",
"\n",
"### Why stratified splitting?\n",
"\n",
"We split data into train (80%), validation (10%), and test (10%) sets. Crucially, we **stratify by `board_key × grouped_v`** — this ensures that:\n",
"\n",
"1. Both boards (TB2 and Kilter) are represented in all splits\n",
"2. All difficulty levels (V0 through V16) are represented in all splits\n",
"\n",
"Without stratification, we might end up with all V14 climbs in the test set and none in training, which would make evaluation meaningless.\n",
"\n",
"Splits are also assigned per climb: the next cell groups rows by (`board_key`, `uuid`), so that all angle observations of one climb are kept in the same split rather than leaking across training and evaluation data.\n",
"\n",
"This is the same principle as stratified splitting in NLP, where you ensure all languages or domains are represented in each split."
]
},
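{
"cell_type": "markdown",
"id": "sketch-group-split",
"metadata": {},
"source": [
"The idea of an 80/10/10 split that is stratified by `board_key × grouped_v` and assigned per climb can be sketched with the standard library alone. This is an assumption-laden illustration, not the implementation of `assign_group_splits`: the function name, the rounding rule, and the choice to use the first grade seen for a climb as its stratum are all made up for the example.\n",
"\n",
"```python\n",
"import random\n",
"from collections import defaultdict\n",
"\n",
"\n",
"def assign_splits_sketch(rows, seed=3, test_size=0.20, val_within_temp=0.50):\n",
"    # rows: list of dicts with 'board_key', 'uuid' and 'grouped_v'.\n",
"    # One group per climb, so every angle of a climb shares a split.\n",
"    # A climb can have several grades across angles; this sketch keeps the first one seen.\n",
"    stratum_of = {}\n",
"    for row in rows:\n",
"        group = (row['board_key'], row['uuid'])\n",
"        stratum_of.setdefault(group, (row['board_key'], row['grouped_v']))\n",
"\n",
"    groups_by_stratum = defaultdict(list)\n",
"    for group, stratum in stratum_of.items():\n",
"        groups_by_stratum[stratum].append(group)\n",
"\n",
"    rng = random.Random(seed)\n",
"    split_of = {}\n",
"    for stratum in sorted(groups_by_stratum):\n",
"        groups = sorted(groups_by_stratum[stratum])\n",
"        rng.shuffle(groups)\n",
"        n_temp = round(len(groups) * test_size)   # held-out pool (val + test)\n",
"        n_val = round(n_temp * val_within_temp)   # part of the pool used for validation\n",
"        for i, group in enumerate(groups):\n",
"            if i < n_val:\n",
"                split_of[group] = 'val'\n",
"            elif i < n_temp:\n",
"                split_of[group] = 'test'\n",
"            else:\n",
"                split_of[group] = 'train'\n",
"    return [split_of[(r['board_key'], r['uuid'])] for r in rows]\n",
"\n",
"\n",
"# Synthetic data: 2 boards x 2 grades x 10 climbs, each climb logged at 2 angles.\n",
"rows = [\n",
"    {'board_key': b, 'uuid': f'{b}-{v}-{i}', 'grouped_v': v, 'angle': a}\n",
"    for b in ('tb2', 'kilter') for v in (3, 4) for i in range(10) for a in (30, 40)\n",
"]\n",
"splits = assign_splits_sketch(rows)\n",
"print({s: splits.count(s) for s in ('train', 'val', 'test')})\n",
"```\n",
"\n",
"With ten climbs per stratum, each stratum contributes eight climbs to train and one each to validation and test, so the synthetic data splits into 64, 8, and 8 rows. Very small strata would round to zero held-out climbs, which is one reason rare grades can still be missing from a split."
]
},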
{
"cell_type": "code",
"execution_count": null,
"id": "ff18298e",
"metadata": {},
"outputs": [],
"source": [
"df_routes[\"ids_with_grade\"] = df_routes[\"tokens_with_grade\"].apply(lambda tokens: encode(tokens, stoi))\n",
"df_routes[\"ids_no_grade\"] = df_routes[\"tokens_no_grade\"].apply(lambda tokens: encode(tokens, stoi))\n",
"df_routes[\"split_stratum\"] = df_routes[\"board_key\"].astype(str) + \"__V\" + df_routes[\"grouped_v\"].astype(str)\n",
"df_routes[\"split\"] = assign_group_splits(\n",
"    df_routes,\n",
"    group_cols=[\"board_key\", \"uuid\"],\n",
"    test_size=0.20,\n",
"    val_size_within_temp=0.50,\n",
"    random_state=3,\n",
"    stratify_col=\"split_stratum\",\n",
")\n",
"\n",
"df_routes.groupby([\"board_key\", \"split\"]).size().unstack(fill_value=0)"
]
},
{
"cell_type": "markdown",
"id": "414dca92",
"metadata": {},
"source": [
"## Token metadata\n",
"\n",
"### Why metadata matters\n",
"\n",
"Each hold token carries rich metadata that the model can use:\n",
"\n",
"- **Physical coordinates** (`x`, `y`): Where the hold is on the board\n",
"- **Normalized coordinates** (`x_norm`, `y_norm`): Scaled to [-1, 1] per board, so the model doesn't need to learn different coordinate scales\n",
"- **Semantic role** (`start`, `middle`, `finish`, `foot`): What the hold is used for\n",
"- **Board identity** (`board_key`): Which board this hold belongs to\n",
"\n",
"The grade predictor uses these coordinate features as additional embeddings alongside the token embeddings. This is similar to how some LLMs inject positional embeddings or segment embeddings — we're giving the model extra structured information about each token."
]
},
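{
"cell_type": "markdown",
"id": "sketch-coord-norm",
"metadata": {},
"source": [
"One plausible way to obtain `x_norm` and `y_norm` is min-max scaling within each board, sketched here. Whether `build_token_metadata` uses exactly this formula is an assumption; the function name and the coordinates are invented for the example.\n",
"\n",
"```python\n",
"def add_normalized_coords(placements):\n",
"    # placements: list of dicts with 'board_key', 'x', 'y'.\n",
"    # First pass: track [min, max] per board and axis.\n",
"    bounds = {}\n",
"    for p in placements:\n",
"        b = bounds.setdefault(p['board_key'], {'x': [p['x'], p['x']], 'y': [p['y'], p['y']]})\n",
"        for axis in ('x', 'y'):\n",
"            b[axis][0] = min(b[axis][0], p[axis])\n",
"            b[axis][1] = max(b[axis][1], p[axis])\n",
"\n",
"    def scale(value, lo, hi):\n",
"        # Map [lo, hi] linearly onto [-1, 1]; a degenerate range maps to 0.\n",
"        return 0.0 if hi == lo else 2.0 * (value - lo) / (hi - lo) - 1.0\n",
"\n",
"    # Second pass: return copies of the rows with the scaled coordinates added.\n",
"    out = []\n",
"    for p in placements:\n",
"        b = bounds[p['board_key']]\n",
"        out.append({**p, 'x_norm': scale(p['x'], *b['x']), 'y_norm': scale(p['y'], *b['y'])})\n",
"    return out\n",
"\n",
"\n",
"# Made-up coordinates on two boards with different extents.\n",
"placements = [\n",
"    {'board_key': 'tb2', 'x': -64, 'y': 4},\n",
"    {'board_key': 'tb2', 'x': 0, 'y': 70},\n",
"    {'board_key': 'tb2', 'x': 64, 'y': 136},\n",
"    {'board_key': 'kilter', 'x': 4, 'y': 4},\n",
"    {'board_key': 'kilter', 'x': 140, 'y': 152},\n",
"]\n",
"for row in add_normalized_coords(placements):\n",
"    print(row['board_key'], row['x_norm'], row['y_norm'])\n",
"```\n",
"\n",
"After scaling, the bottom-left corner of every board sits at (-1, -1) and the top-right corner at (1, 1), regardless of the board's physical size."
]
},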
{
"cell_type": "code",
"execution_count": null,
"id": "48c3692e",
"metadata": {},
"outputs": [],
"source": [
"df_token_meta = build_token_metadata(\n",
"    vocab_tokens=vocab_tokens,\n",
"    stoi=stoi,\n",
"    df_placements=df_placements,\n",
"    placement_lookup=placement_lookup,\n",
"    configs_by_prefix=configs_by_prefix,\n",
")\n",
"\n",
"print(\"Token metadata columns:\")\n",
"print(df_token_meta.columns.tolist())\n",
"print()\n",
"print(\"Example hold token metadata:\")\n",
"df_token_meta[df_token_meta[\"kind\"] == \"hold\"].head()"
]
},
{
"cell_type": "markdown",
"id": "414dca93",
"metadata": {},
"source": [
"## Save artifacts\n",
"\n",
"We save several files that will be consumed by later notebooks:\n",
"\n",
"1. **`route_sequences.csv`**: The main tokenized dataset with train/val/test splits\n",
"2. **`routes_tokenized.jsonl`**: Same data in JSON Lines format (one JSON object per route)\n",
"3. **`token_vocab.json`**: The vocabulary mapping (stoi and itos)\n",
"4. **`token_metadata.csv`**: Metadata for each token (coordinates, roles, etc.)\n",
"5. **`placement_metadata.csv`**: Physical placement information\n",
"6. **`board_summary.csv`**: Aggregate statistics per board"
]
},
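{
"cell_type": "markdown",
"id": "sketch-jsonl",
"metadata": {},
"source": [
"JSON Lines stores one JSON object per line, which is why each record has to be made serializable before it is written. The sketch below shows the general pattern with an in-memory buffer. `json_safe_sketch` is a hypothetical stand-in and makes no claim about what the project's `json_safe` helper actually converts; the records are invented.\n",
"\n",
"```python\n",
"import io\n",
"import json\n",
"import math\n",
"\n",
"\n",
"def json_safe_sketch(value):\n",
"    # Recursively replace values that json.dumps rejects or encodes as invalid JSON.\n",
"    if isinstance(value, dict):\n",
"        return {str(k): json_safe_sketch(v) for k, v in value.items()}\n",
"    if isinstance(value, (list, tuple, set)):\n",
"        return [json_safe_sketch(v) for v in value]\n",
"    if isinstance(value, float) and math.isnan(value):\n",
"        return None   # NaN is not valid JSON, so store null instead\n",
"    return value\n",
"\n",
"\n",
"records = [\n",
"    {'uuid': 'abc', 'ids_with_grade': (2, 8, 6, 3), 'quality_average': float('nan')},\n",
"    {'uuid': 'def', 'ids_with_grade': (2, 7, 6, 3), 'quality_average': 2.5},\n",
"]\n",
"\n",
"handle = io.StringIO()   # stands in for an open text file\n",
"for record in records:\n",
"    # print appends the newline that separates JSON Lines records.\n",
"    print(json.dumps(json_safe_sketch(record)), file=handle)\n",
"\n",
"# Reading back takes one json.loads call per line.\n",
"loaded = [json.loads(line) for line in handle.getvalue().splitlines()]\n",
"print(loaded[0])\n",
"```\n",
"\n",
"Because every line is an independent JSON document, later notebooks can stream the file route by route instead of loading it all at once."
]
},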
{
"cell_type": "code",
"execution_count": null,
"id": "50e81878",
"metadata": {},
"outputs": [],
"source": [
"OUT = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
"OUT.mkdir(parents=True, exist_ok=True)\n",
"\n",
"csv_cols = [\n",
"    \"uuid\", \"board_key\", \"board_display_name\", \"board_token_prefix\", \"board_token\",\n",
"    \"climb_name\", \"setter_username\", \"layout_id\", \"layout_name\", \"board_name\",\n",
"    \"frames\", \"angle\", \"display_difficulty\", \"grouped_v\", \"boulder_grade\",\n",
"    \"ascensionist_count\", \"quality_average\", \"fa_at\",\n",
"    \"n_holds\", \"n_start\", \"n_middle\", \"n_foot\", \"n_finish\",\n",
"    \"sequence_with_grade\", \"sequence_no_grade\", \"split\",\n",
"]\n",
"df_routes[csv_cols].to_csv(OUT / \"route_sequences.csv\", index=False)\n",
"\n",
"df_placements.to_csv(OUT / \"placement_metadata.csv\", index=False)\n",
"\n",
"df_token_meta.to_csv(OUT / \"token_metadata.csv\", index=False)\n",
"\n",
"write_json(OUT / \"token_vocab.json\", vocab_payload(stoi, itos, configs_by_key))\n",
"\n",
"# One JSON object per line; json_safe makes each record serializable\n",
"with (OUT / \"routes_tokenized.jsonl\").open(\"w\", encoding=\"utf-8\") as handle:\n",
"    for record in df_routes.to_dict(orient=\"records\"):\n",
"        handle.write(json.dumps(json_safe(record)) + \"\\n\")\n",
"\n",
"board_summary = (\n",
"    df_routes.groupby(\"board_key\")\n",
"    .agg(\n",
"        n_routes=(\"uuid\", \"count\"),\n",
"        mean_angle=(\"angle\", \"mean\"),\n",
"        mean_display_difficulty=(\"display_difficulty\", \"mean\"),\n",
"        mean_holds=(\"n_holds\", \"mean\"),\n",
"    )\n",
"    .reset_index()\n",
")\n",
"board_summary.to_csv(OUT / \"board_summary.csv\", index=False)\n",
"\n",
"print(\"Saved artifacts to:\", OUT)\n",
"print(f\" - route_sequences.csv ({len(df_routes):,} rows)\")\n",
"print(\" - routes_tokenized.jsonl\")\n",
"print(f\" - token_vocab.json ({len(stoi):,} tokens)\")\n",
"print(f\" - token_metadata.csv ({len(df_token_meta):,} rows)\")\n",
"print(f\" - placement_metadata.csv ({len(df_placements):,} rows)\")\n",
"print(\" - board_summary.csv\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}