{
"cells": [
{
"cell_type": "markdown",
"id": "94a0ae33",
"metadata": {},
"source": [
"# 01: Unified Route Tokenization for TB2 and Kilter\n",
"\n",
"## What is tokenization and why does it matter?\n",
"\n",
"In natural language processing, **tokenization** is the process of converting raw text into a sequence of discrete symbols (tokens) that a model can process. For example, the sentence \"I climb rocks\" might be tokenized as `[\"I\", \" climb\", \" rocks\"]` using a subword tokenizer like BPE.\n",
"\n",
"For climbing board routes, we face an analogous problem: how do we convert a climb, which is fundamentally a *set of holds at specific positions with specific roles*, into a sequence of tokens that a transformer can learn from?\n",
"\n",
"### Key design decisions in this notebook\n",
"\n",
"1. **Board namespacing**: Each hold token includes the board prefix (e.g., `<TB2_p344_start>` vs `<KILTER_p1084_start>`). This prevents placement ID collisions between boards: placement 344 on TB2 is a completely different physical hold than placement 344 on Kilter (in fact, the latter does not exist).\n",
"\n",
"2. **Semantic role mapping**: Different boards use different role IDs (TB2 uses 5/6/7/8, Kilter uses 12/13/14/15), but they all map to the same semantic roles: `start`, `middle`, `finish`, `foot`. This shared vocabulary lets the model learn transferable patterns.\n",
"\n",
"3. **Canonical ordering**: Holds within a route are sorted by (role priority, y-position, x-position), with the placement ID as a final tie-breaker. The role priority is `start`, `middle`, `foot`, `finish`. This gives the model a consistent input ordering, similar to how LLMs expect text in left-to-right order.\n",
"\n",
"4. **Special tokens**: Like BERT and GPT, we use special tokens:\n",
"   - `<BOS>` (beginning of sequence): marks the start, like `[CLS]` in BERT\n",
"   - `<EOS>` (end of sequence): marks the end, like `[SEP]` or the end-of-text token in GPT\n",
"   - `<PAD>`: for batching sequences of different lengths\n",
"   - `<UNK>`: for unknown tokens (safety net)\n",
"   - `<CLS>`: used by the grade predictor to pool sequence information\n",
"   - `<MASK>`: reserved for future masked language modeling experiments\n",
"\n",
"5. **Conditioning tokens**: Routes are prefixed with board, angle, and grade tokens (the grade token is left out when the sequence is used for grade prediction). This is analogous to how modern LLMs use system prompts or task prefixes to condition generation.\n",
"\n",
"### The analogy to NLP\n",
"\n",
"| NLP Concept | Climbing Board Analog |\n",
"|---|---|\n",
"| Word / Subword | Hold token (placement + role) |\n",
"| Sentence | Route (sequence of holds) |\n",
"| Document language | Board type (TB2 vs Kilter) |\n",
"| Sentence length | Number of holds in route |\n",
"| POS tag | Semantic role (start/middle/finish/foot) |\n",
"| Genre / Domain | Angle + Grade conditioning |\n",
"\n",
"This notebook tokenizes climbing routes from **both** supported boards:\n",
"\n",
"- Tension Board 2 Mirror\n",
"- Kilter Board Original\n",
"\n",
"The board-specific details are stored in `configs/tb2.json` and `configs/kilter.json`.\n",
"This version defines the tokenization helpers inline as the notebook needs them.\n",
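"\n",
"As a rough illustration of design decisions 1, 2, 4, and 5, the sketch below turns one frames string into a token sequence. It is a simplified stand-in for the helpers defined later in this notebook, not their implementation: `sketch_tokens` and `ROLE_NAMES` are hypothetical names, the angle and grade values are made up, and canonical sorting is skipped because it needs hold coordinates from the database.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Role IDs as quoted in this notebook; the dict name is hypothetical.\n",
"ROLE_NAMES = {\n",
"    \"TB2\": {5: \"start\", 6: \"middle\", 7: \"finish\", 8: \"foot\"},\n",
"    \"KILTER\": {12: \"start\", 13: \"middle\", 14: \"finish\", 15: \"foot\"},\n",
"}\n",
"\n",
"def sketch_tokens(prefix, frames, angle, grade_v=None):\n",
"    \"\"\"Toy tokenizer: conditioning prefix, then one token per hold.\"\"\"\n",
"    tokens = [\"<BOS>\", f\"<BOARD_{prefix}>\", f\"<ANGLE_{angle}>\"]\n",
"    if grade_v is not None:\n",
"        tokens.append(f\"<GRADE_V{grade_v}>\")\n",
"    for placement_id, role_id in re.findall(r\"p(\\d+)r(\\d+)\", frames):\n",
"        role = ROLE_NAMES[prefix].get(int(role_id), \"unknown\")\n",
"        tokens.append(f\"<{prefix}_p{placement_id}_{role}>\")\n",
"    tokens.append(\"<EOS>\")\n",
"    return tokens\n",
"\n",
"print(sketch_tokens(\"TB2\", \"p344r5p369r6p603r7\", angle=40, grade_v=4))\n",
"```\n",
"\n",
"Calling the same function with the `KILTER` prefix and role IDs 12 to 15 produces tokens in a separate namespace, so the two boards never share a hold token even when placement IDs coincide.\n",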
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ee2907f",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-07T15:45:23.269153Z",
"iopub.status.busy": "2026-06-07T15:45:23.268660Z",
"iopub.status.idle": "2026-06-07T15:45:25.138003Z",
"shell.execute_reply": "2026-06-07T15:45:25.137054Z"
}
},
"outputs": [],
"source": [
"from __future__ import annotations\n",
"\n",
"import ast\n",
"import json\n",
"import random\n",
"import re\n",
"import sqlite3\n",
"from dataclasses import dataclass\n",
"from pathlib import Path\n",
"from typing import Any, Iterable\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"ROOT = Path.cwd().resolve()\n",
"if ROOT.name == \"notebooks\":\n",
"    ROOT = ROOT.parent"
]
},
{
"cell_type": "markdown",
"id": "fec9ef3b",
"metadata": {},
"source": [
"### Board configuration helpers"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c110801",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-07T15:45:25.142164Z",
"iopub.status.busy": "2026-06-07T15:45:25.141718Z",
"iopub.status.idle": "2026-06-07T15:45:25.156480Z",
"shell.execute_reply": "2026-06-07T15:45:25.155743Z"
}
},
"outputs": [],
"source": [
"# Find the project root and load board configuration JSON files.\n",
"def find_project_root(start: str | Path | None = None) -> Path:\n",
"    \"\"\"Walk upward until the repository root markers are found.\n",
"\n",
"    The project root is the first directory that contains both\n",
"    ``pyproject.toml`` and ``configs``. If no ancestor contains both markers,\n",
"    the resolved starting directory is returned so callers still have a\n",
"    deterministic base path.\n",
"    \"\"\"\n",
"    current = Path(start).resolve() if start is not None else Path.cwd().resolve()\n",
"    for candidate in [current, *current.parents]:\n",
"        if (candidate / \"pyproject.toml\").exists() and (candidate / \"configs\").exists():\n",
"            return candidate\n",
"    return current\n",
"\n",
"@dataclass(frozen=True)\n",
"class BoardConfig:\n",
"    \"\"\"Configuration for a single climbing board.\n",
"\n",
"    This dataclass stores all board-specific settings needed for\n",
"    data loading, tokenization, and model training.\n",
"\n",
"    Attributes:\n",
"        board_key: Short identifier (e.g., \"tb2\", \"kilter\")\n",
"        display_name: Human-readable name (e.g., \"Tension Board 2 Mirror\")\n",
"        token_prefix: Namespace for hold tokens (e.g., \"TB2\", \"KILTER\")\n",
"        db_path: Path to the SQLite database\n",
"        layout_id: Which layout in the database to use\n",
"        max_angle: Filter out routes steeper than this (None = no filter)\n",
"        min_fa_date: Filter out routes first ascended on or before this date\n",
"        placement_y_max: Filter out placements above this Y coordinate\n",
"        include_mirror_placement_id: Whether to include mirror info (TB2 only)\n",
"        role_definitions: Maps semantic role names to numeric IDs\n",
"        boardlib_database_command: Command to download the database\n",
"        boardlib_images_command: Command to download board images\n",
"        notes: Additional notes about the configuration\n",
"    \"\"\"\n",
"    board_key: str\n",
"    display_name: str\n",
"    token_prefix: str\n",
"    db_path: Path\n",
"    layout_id: int\n",
"    max_angle: float | None\n",
"    min_fa_date: str | None\n",
"    placement_y_max: float | None\n",
"    include_mirror_placement_id: bool\n",
"    role_definitions: dict[str, int]\n",
"    boardlib_database_command: str | None = None\n",
"    boardlib_images_command: str | None = None\n",
"    notes: tuple[str, ...] = ()\n",
"\n",
"    @property\n",
"    def role_id_to_name(self) -> dict[int, str]:\n",
"        \"\"\"Reverse mapping from numeric role IDs to semantic role names.\n",
"\n",
"        Example: {5: 'start', 6: 'middle', 7: 'finish', 8: 'foot'} for TB2\n",
"        \"\"\"\n",
"        return {int(role_id): name for name, role_id in self.role_definitions.items()}\n",
"\n",
"    @property\n",
"    def board_token(self) -> str:\n",
"        \"\"\"The special token representing this board.\n",
"\n",
"        Example: \"<BOARD_TB2>\" or \"<BOARD_KILTER>\"\n",
"        \"\"\"\n",
"        return f\"<BOARD_{self.token_prefix}>\"\n",
"\n",
"    def resolve_db_path(self, project_root: Path | None = None) -> Path:\n",
"        \"\"\"Resolve the database path relative to the project root.\n",
"\n",
"        If db_path is absolute, return it as-is.\n",
"        Otherwise, resolve it relative to the project root.\n",
"        \"\"\"\n",
"        project_root = project_root or find_project_root()\n",
"        return self.db_path if self.db_path.is_absolute() else project_root / self.db_path\n",
"\n",
"def load_board_config(board_key: str, config_dir: str | Path | None = None) -> BoardConfig:\n",
"    \"\"\"Load a single board configuration from a JSON file.\n",
"\n",
"    Args:\n",
"        board_key: Board identifier (e.g., \"tb2\", \"kilter\")\n",
"        config_dir: Directory containing config JSON files\n",
"\n",
"    Returns:\n",
"        BoardConfig dataclass with all board settings\n",
"\n",
"    Raises:\n",
"        FileNotFoundError: If the config file doesn't exist\n",
"    \"\"\"\n",
"    project_root = find_project_root()\n",
"    config_dir = Path(config_dir) if config_dir is not None else project_root / \"configs\"\n",
"    path = config_dir / f\"{board_key}.json\"\n",
"    if not path.exists():\n",
"        available = sorted(p.stem for p in config_dir.glob(\"*.json\"))\n",
"        raise FileNotFoundError(\n",
"            f\"Unknown board config '{board_key}'. Available: {available}\"\n",
"        )\n",
"\n",
"    payload = json.loads(path.read_text(encoding=\"utf-8\"))\n",
"    return BoardConfig(\n",
"        board_key=str(payload[\"board_key\"]),\n",
"        display_name=str(payload[\"display_name\"]),\n",
"        token_prefix=str(payload[\"token_prefix\"]),\n",
"        db_path=Path(payload[\"db_path\"]),\n",
"        layout_id=int(payload[\"layout_id\"]),\n",
"        max_angle=None if payload.get(\"max_angle\") is None else float(payload[\"max_angle\"]),\n",
"        min_fa_date=payload.get(\"min_fa_date\"),\n",
"        placement_y_max=None if payload.get(\"placement_y_max\") is None else float(payload[\"placement_y_max\"]),\n",
"        include_mirror_placement_id=bool(payload.get(\"include_mirror_placement_id\", False)),\n",
"        role_definitions={str(k): int(v) for k, v in payload[\"role_definitions\"].items()},\n",
"        boardlib_database_command=payload.get(\"boardlib_database_command\"),\n",
"        boardlib_images_command=payload.get(\"boardlib_images_command\"),\n",
"        notes=tuple(payload.get(\"notes\", [])),\n",
"    )\n",
"\n",
"def load_board_configs(board_keys: list[str] | tuple[str, ...]) -> list[BoardConfig]:\n",
"    \"\"\"Load multiple board configurations.\n",
"\n",
"    Args:\n",
"        board_keys: List of board identifiers\n",
"\n",
"    Returns:\n",
"        List of BoardConfig dataclasses\n",
"    \"\"\"\n",
"    return [load_board_config(board_key) for board_key in board_keys]"
]
},
{
"cell_type": "markdown",
"id": "b3066e2b",
"metadata": {},
"source": [
"## Load board configurations\n",
"\n",
"Each board has its own configuration file (`configs/tb2.json`, `configs/kilter.json`) that specifies:\n",
"\n",
"- **`layout_id`**: Which board layout to use (TB2 Mirror = 10, Kilter Original = 1)\n",
"- **`role_definitions`**: Maps semantic role names to board-specific role IDs\n",
"  - TB2: start=5, middle=6, finish=7, foot=8\n",
"  - Kilter: start=12, middle=13, finish=14, foot=15\n",
"- **`max_angle`**: We filter out climbs at extreme angles (>50° for TB2, >55° for Kilter) because those grades are biased toward elite climbers\n",
"- **`token_prefix`**: The namespace prefix for hold tokens (\"TB2\" vs \"KILTER\")\n",
"- **`include_mirror_placement_id`**: Whether to include mirror information (TB2 has symmetric left/right holds)\n",
"\n",
"This configuration-driven approach means we can add new boards by creating a new JSON config file, without changing any code.\n",
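"\n",
"To make the configuration-driven idea concrete, here is a minimal sketch using a made-up config for an imaginary third board. The board name, prefix, path, and role IDs are all hypothetical; only the key names follow the fields that `load_board_config` reads.\n",
"\n",
"```python\n",
"# Hypothetical config for an imaginary board; every value is invented.\n",
"example_config = {\n",
"    \"board_key\": \"demo\",\n",
"    \"display_name\": \"Demo Board\",\n",
"    \"token_prefix\": \"DEMO\",\n",
"    \"db_path\": \"data/raw/demo.db\",\n",
"    \"layout_id\": 1,\n",
"    \"max_angle\": 50,\n",
"    \"include_mirror_placement_id\": False,\n",
"    \"role_definitions\": {\"start\": 1, \"middle\": 2, \"finish\": 3, \"foot\": 4},\n",
"}\n",
"\n",
"# Derive the two things tokenization needs from the config alone:\n",
"# the board conditioning token and the role-ID-to-name lookup.\n",
"demo_board_token = f\"<BOARD_{example_config['token_prefix']}>\"\n",
"demo_role_names = {\n",
"    int(role_id): name\n",
"    for name, role_id in example_config[\"role_definitions\"].items()\n",
"}\n",
"\n",
"print(demo_board_token)\n",
"print(demo_role_names)\n",
"```\n",
"\n",
"Because both values come straight from the config, a new board only has to supply its own prefix and role IDs to reuse the shared `start`/`middle`/`finish`/`foot` vocabulary.\n",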
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f04dcea",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-07T15:45:25.159465Z",
"iopub.status.busy": "2026-06-07T15:45:25.159209Z",
"iopub.status.idle": "2026-06-07T15:45:25.166377Z",
"shell.execute_reply": "2026-06-07T15:45:25.165663Z"
}
},
"outputs": [],
"source": [
"configs = load_board_configs([\"tb2\", \"kilter\"])\n",
"configs"
]
},
{
"cell_type": "markdown",
"id": "25242855",
"metadata": {},
"source": [
"### Database loading helpers"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a076d997",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-07T15:45:25.170098Z",
"iopub.status.busy": "2026-06-07T15:45:25.169626Z",
"iopub.status.idle": "2026-06-07T15:45:25.182438Z",
"shell.execute_reply": "2026-06-07T15:45:25.181766Z"
}
},
"outputs": [],
"source": [
"# Query each BoardLib SQLite database and attach board identity columns.\n",
"def build_climbs_query(config: BoardConfig) -> tuple[str, list]:\n",
"    \"\"\"Build a SQL query for climbs data with board-specific filters.\n",
"\n",
"    The query joins climbs, layouts, products, climb_stats, and difficulty_grades\n",
"    tables, applying filters for:\n",
"    - layout_id: Which board layout to use\n",
"    - max_angle: Exclude routes steeper than this\n",
"    - min_fa_date: Exclude routes first ascended on or before this date\n",
"    - display_difficulty IS NOT NULL: Only routes with difficulty ratings\n",
"    - is_listed = 1: Only publicly listed routes\n",
"\n",
"    Args:\n",
"        config: Board configuration\n",
"\n",
"    Returns:\n",
"        Tuple of (SQL query string, list of query parameters)\n",
"    \"\"\"\n",
"    conditions = [\n",
"        \"cs.display_difficulty IS NOT NULL\",\n",
"        \"c.is_listed = 1\",\n",
"        \"c.layout_id = ?\",\n",
"    ]\n",
"    params: list = [config.layout_id]\n",
"\n",
"    if config.max_angle is not None:\n",
"        conditions.append(\"cs.angle <= ?\")\n",
"        params.append(config.max_angle)\n",
"\n",
"    if config.min_fa_date is not None:\n",
"        conditions.append(\"cs.fa_at > ?\")\n",
"        params.append(config.min_fa_date)\n",
"\n",
"    query = f\"\"\"\n",
"    SELECT\n",
"        c.uuid,\n",
"        c.name AS climb_name,\n",
"        c.setter_username,\n",
"        c.layout_id AS layout_id,\n",
"        c.description,\n",
"        c.is_nomatch,\n",
"        c.is_listed,\n",
"        l.name AS layout_name,\n",
"        p.name AS board_name,\n",
"        c.frames,\n",
"        cs.angle,\n",
"        cs.display_difficulty,\n",
"        dg.boulder_name AS boulder_grade,\n",
"        cs.ascensionist_count,\n",
"        cs.quality_average,\n",
"        cs.fa_at\n",
"    FROM climbs c\n",
"    JOIN layouts l ON c.layout_id = l.id\n",
"    JOIN products p ON l.product_id = p.id\n",
"    JOIN climb_stats cs ON c.uuid = cs.climb_uuid\n",
"    JOIN difficulty_grades dg ON ROUND(cs.display_difficulty) = dg.difficulty\n",
"    WHERE {' AND '.join(conditions)}\n",
"    \"\"\"\n",
"    return query, params\n",
"\n",
"def build_placements_query(config: BoardConfig) -> tuple[str, list]:\n",
"    \"\"\"Build a SQL query for placement data with board-specific filters.\n",
"\n",
"    The query retrieves hold positions, default roles, material types,\n",
"    and (optionally) mirror placement IDs for symmetric holds.\n",
"\n",
"    Args:\n",
"        config: Board configuration\n",
"\n",
"    Returns:\n",
"        Tuple of (SQL query string, list of query parameters)\n",
"    \"\"\"\n",
"    params: list = [config.layout_id]\n",
"    y_condition = \"\"\n",
"    if config.placement_y_max is not None:\n",
"        y_condition = \" AND h.y <= ?\"\n",
"        params.append(config.placement_y_max)\n",
"\n",
"    if config.include_mirror_placement_id:\n",
"        # TB2 has mirrored holds, so include the mirror placement ID.\n",
"        query = f\"\"\"\n",
"        SELECT\n",
"            p.id AS placement_id,\n",
"            h.x,\n",
"            h.y,\n",
"            p.default_placement_role_id AS default_role_id,\n",
"            p.set_id AS set_id,\n",
"            s.name AS set_name,\n",
"            p_mirror.id AS mirror_placement_id\n",
"        FROM placements p\n",
"        JOIN holes h ON p.hole_id = h.id\n",
"        JOIN sets s ON p.set_id = s.id\n",
"        LEFT JOIN holes h_mirror ON h.mirrored_hole_id = h_mirror.id\n",
"        LEFT JOIN placements p_mirror\n",
"            ON p_mirror.hole_id = h_mirror.id\n",
"            AND p_mirror.layout_id = p.layout_id\n",
"        WHERE p.layout_id = ?{y_condition}\n",
"        \"\"\"\n",
"    else:\n",
"        # Boards configured without mirror info (e.g., Kilter) get a NULL column\n",
"        # so both branches return the same schema.\n",
"        query = f\"\"\"\n",
"        SELECT\n",
"            p.id AS placement_id,\n",
"            h.x,\n",
"            h.y,\n",
"            p.default_placement_role_id AS default_role_id,\n",
"            p.set_id AS set_id,\n",
"            s.name AS set_name,\n",
"            NULL AS mirror_placement_id\n",
"        FROM placements p\n",
"        JOIN holes h ON p.hole_id = h.id\n",
"        JOIN sets s ON p.set_id = s.id\n",
"        WHERE p.layout_id = ?{y_condition}\n",
"        \"\"\"\n",
"    return query, params\n",
"\n",
"def load_board_data(\n",
"    config: BoardConfig,\n",
"    project_root: str | Path | None = None,\n",
"    max_climbs: int | None = None,\n",
") -> tuple[pd.DataFrame, pd.DataFrame]:\n",
"    \"\"\"Load climbs and placements data for a single board.\n",
"\n",
"    Args:\n",
"        config: Board configuration\n",
"        project_root: Path to project root (for resolving db_path)\n",
"        max_climbs: Optional row limit for fast smoke-test loads.\n",
"\n",
"    Returns:\n",
"        Tuple of (climbs DataFrame, placements DataFrame)\n",
"    \"\"\"\n",
"    project_root = Path(project_root) if project_root is not None else find_project_root()\n",
"    db_path = config.resolve_db_path(project_root)\n",
"    if not db_path.exists():\n",
"        raise FileNotFoundError(\n",
"            f\"Could not find database for board '{config.board_key}': {db_path}\"\n",
"        )\n",
"\n",
"    climbs_query, climbs_params = build_climbs_query(config)\n",
"    placements_query, placements_params = build_placements_query(config)\n",
"    if max_climbs is not None:\n",
"        if max_climbs < 1:\n",
"            raise ValueError(\"max_climbs must be at least 1.\")\n",
"        climbs_query = f\"{climbs_query}\\nORDER BY c.uuid, cs.angle\\nLIMIT ?\"\n",
"        climbs_params = [*climbs_params, int(max_climbs)]\n",
"\n",
"    # sqlite3's connection context manager only commits or rolls back a\n",
"    # transaction; it does not close the connection, so close it explicitly.\n",
"    conn = sqlite3.connect(db_path)\n",
"    try:\n",
"        df_climbs = pd.read_sql_query(climbs_query, conn, params=climbs_params)\n",
"        df_placements = pd.read_sql_query(placements_query, conn, params=placements_params)\n",
"    finally:\n",
"        conn.close()\n",
"\n",
"    # Add board identifiers for multi-board processing\n",
"    df_climbs[\"board_key\"] = config.board_key\n",
"    df_climbs[\"board_token_prefix\"] = config.token_prefix\n",
"    df_climbs[\"board_display_name\"] = config.display_name\n",
"\n",
"    df_placements[\"board_key\"] = config.board_key\n",
"    df_placements[\"board_token_prefix\"] = config.token_prefix\n",
"    df_placements[\"board_display_name\"] = config.display_name\n",
"\n",
"    return df_climbs, df_placements\n",
"\n",
"def load_multi_board_data(\n",
"    configs: list[BoardConfig],\n",
"    project_root: str | Path | None = None,\n",
"    max_climbs_per_board: int | None = None,\n",
") -> tuple[pd.DataFrame, pd.DataFrame]:\n",
"    \"\"\"Load and concatenate data from multiple boards.\n",
"\n",
"    This function loads data from each board's database and concatenates\n",
"    them into unified DataFrames. Board identifiers are preserved in\n",
"    the board_key column.\n",
"\n",
"    Args:\n",
"        configs: List of board configurations\n",
"        project_root: Path to project root\n",
"        max_climbs_per_board: Optional row limit per board for smoke tests.\n",
"\n",
"    Returns:\n",
"        Tuple of (combined climbs DataFrame, combined placements DataFrame)\n",
"    \"\"\"\n",
"    climb_frames = []\n",
"    placement_frames = []\n",
"\n",
"    for config in configs:\n",
"        climbs, placements = load_board_data(\n",
"            config,\n",
"            project_root=project_root,\n",
"            max_climbs=max_climbs_per_board,\n",
"        )\n",
"        climb_frames.append(climbs)\n",
"        placement_frames.append(placements)\n",
"\n",
"    return (\n",
"        pd.concat(climb_frames, ignore_index=True),\n",
"        pd.concat(placement_frames, ignore_index=True),\n",
"    )"
]
},
{
"cell_type": "markdown",
"id": "2a5c9a9b",
"metadata": {},
"source": [
"## Load raw climbs and placement metadata\n",
"\n",
"The data loading step reads from SQLite databases downloaded using BoardLib:\n",
"\n",
"```bash\n",
"boardlib database tension data/raw/tb2.db\n",
"boardlib database kilter data/raw/kilter.db\n",
"```\n",
"\n",
"### What we're loading\n",
"\n",
"**Climbs data** (`df_climbs`): Each row is a climb-angle observation. Key columns:\n",
"- `uuid`: Unique climb identifier\n",
"- `frames`: The raw string encoding holds and roles, e.g., `p344r5p369r6p603r7`\n",
"- `angle`: Wall angle in degrees\n",
"- `display_difficulty`: Numeric difficulty score (maps to V-grades)\n",
"- `boulder_grade`: Human-readable grade like \"6b/V4\"\n",
"\n",
"**Placements data** (`df_placements`): Each row is a physical hold position on the board. Key columns:\n",
"- `placement_id`: The hold's unique ID within its board\n",
"- `x`, `y`: Physical coordinates on the board (in inches)\n",
"- `default_role_id`: What role this hold typically plays (hand vs foot)\n",
"- `set_name`: Material type (\"Wood\" or \"Plastic\")\n",
"- `mirror_placement_id`: For TB2, the ID of the symmetric hold on the other side\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53c1951a",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-07T15:45:25.185319Z",
"iopub.status.busy": "2026-06-07T15:45:25.184989Z",
"iopub.status.idle": "2026-06-07T15:45:29.117312Z",
"shell.execute_reply": "2026-06-07T15:45:29.116566Z"
}
},
"outputs": [],
"source": [
"df_climbs, df_placements = load_multi_board_data(configs, project_root=ROOT)\n",
"print(f\"Total climbs loaded: {len(df_climbs):,}\")\n",
"print(f\"Total placements loaded: {len(df_placements):,}\")\n",
"print()\n",
"print(\"Climbs per board:\")\n",
"print(df_climbs.groupby(\"board_key\").size())"
]
},
{
"cell_type": "markdown",
"id": "f6a93063",
"metadata": {},
"source": [
"### Grade and route-tokenization helpers"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7597dfc3",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-07T15:45:29.120837Z",
"iopub.status.busy": "2026-06-07T15:45:29.120438Z",
"iopub.status.idle": "2026-06-07T15:45:29.144382Z",
"shell.execute_reply": "2026-06-07T15:45:29.143567Z"
}
},
"outputs": [],
"source": [
"# Map BoardLib display difficulties into grouped V-grade tokens.\n",
"GRADE_TO_V = {\n",
"    10: 0, 11: 0, 12: 0,\n",
"    13: 1, 14: 1,\n",
"    15: 2,\n",
"    16: 3, 17: 3,\n",
"    18: 4, 19: 4,\n",
"    20: 5, 21: 5,\n",
"    22: 6,\n",
"    23: 7,\n",
"    24: 8, 25: 8,\n",
"    26: 9,\n",
"    27: 10,\n",
"    28: 11,\n",
"    29: 12,\n",
"    30: 13,\n",
"    31: 14,\n",
"    32: 15,\n",
"    33: 16,\n",
"}\n",
"\n",
"def to_grouped_v(display_difficulty: float) -> int:\n",
"    \"\"\"Map a continuous display difficulty to the nearest grouped V grade.\n",
"\n",
"    Values outside the table are clamped to its lowest or highest key.\n",
"    \"\"\"\n",
"    rounded = int(round(float(display_difficulty)))\n",
"    rounded = max(min(rounded, max(GRADE_TO_V)), min(GRADE_TO_V))\n",
"    return GRADE_TO_V[rounded]\n",
"\n",
"def grade_token(display_difficulty: float) -> str:\n",
"    \"\"\"Return the grade-conditioning token for a display difficulty value.\"\"\"\n",
"    return f\"<GRADE_V{to_grouped_v(display_difficulty)}>\"\n",
"\n",
"# Parse frames, canonicalize holds, and build route-level token sequences.\n",
"SPECIAL_TOKENS = [\n",
"    \"<PAD>\",\n",
"    \"<UNK>\",\n",
"    \"<BOS>\",\n",
"    \"<EOS>\",\n",
"    \"<CLS>\",\n",
"    \"<MASK>\",\n",
"]\n",
"\n",
"ANGLE_TOKEN_PATTERN = re.compile(r\"^<ANGLE_(-?\\d+)>$\")\n",
"\n",
"GRADE_TOKEN_PATTERN = re.compile(r\"^<GRADE_V(\\d+)>$\")\n",
"\n",
"BOARD_TOKEN_PATTERN = re.compile(r\"^<BOARD_([A-Z0-9_]+)>$\")\n",
"\n",
"HOLD_TOKEN_PATTERN = re.compile(r\"^<([A-Z0-9_]+)_p(\\d+)_(start|middle|finish|foot|unknown)>$\")\n",
"\n",
"ROLE_SORT_ORDER = {\n",
"    \"start\": 0,\n",
"    \"middle\": 1,\n",
"    \"foot\": 2,\n",
"    \"finish\": 3,\n",
"    \"unknown\": 9,\n",
"}\n",
"\n",
"def parse_frames(frames_str: str | None) -> list[tuple[int, int]]:\n",
"    \"\"\"Parse a frames string into ``(placement_id, role_id)`` pairs.\n",
"\n",
"    Frames strings are compact concatenations such as ``p344r5p369r6``. Invalid\n",
"    or missing input returns an empty list so callers can skip unusable climbs\n",
"    without special-case exception handling.\n",
"    \"\"\"\n",
"    if not isinstance(frames_str, str):\n",
"        return []\n",
"    matches = re.findall(r\"p(\\d+)r(\\d+)\", frames_str)\n",
"    return [(int(placement_id), int(role_id)) for placement_id, role_id in matches]\n",
"\n",
"def make_placement_lookup(df_placements: pd.DataFrame) -> dict[tuple[str, int], dict]:\n",
"    \"\"\"Build a coordinate/metadata lookup keyed by ``(board_key, placement_id)``.\"\"\"\n",
"    rows = {}\n",
"    for _, row in df_placements.iterrows():\n",
"        key = (str(row[\"board_key\"]), int(row[\"placement_id\"]))\n",
"        rows[key] = row.to_dict()\n",
"    return rows\n",
"\n",
"def role_name(role_id: int, config: BoardConfig) -> str:\n",
"    \"\"\"Map a board-specific numeric role ID to a shared semantic role name.\"\"\"\n",
"    return config.role_id_to_name.get(int(role_id), \"unknown\")\n",
"\n",
"def placement_xy(\n",
"    board_key: str,\n",
"    placement_id: int,\n",
"    placement_lookup: dict[tuple[str, int], dict],\n",
") -> tuple[float, float]:\n",
"    \"\"\"Return raw board coordinates for a placement, or NaNs if unknown.\"\"\"\n",
"    row = placement_lookup.get((str(board_key), int(placement_id)))\n",
"    if row is None:\n",
"        return (float(\"nan\"), float(\"nan\"))\n",
"    return (float(row[\"x\"]), float(row[\"y\"]))\n",
"\n",
"def canonicalize_holds(\n",
"    holds: Iterable[tuple[int, int]],\n",
"    config: BoardConfig,\n",
"    placement_lookup: dict[tuple[str, int], dict],\n",
") -> list[tuple[int, int]]:\n",
"    \"\"\"Sort holds into the canonical route order used by all model inputs.\n",
"\n",
"    Frames preserve setter/storage order, which is not always stable\n",
"    across routes or boards. Canonical ordering groups holds by role\n",
"    (starts, then middle holds, then feet, then finishes) and scans each\n",
"    group by ascending y, then x, giving the models a more consistent\n",
"    sequence grammar. Holds with unknown coordinates sort last in their group.\n",
"    \"\"\"\n",
"    def key(pair: tuple[int, int]):\n",
"        \"\"\"Sort by semantic role, then board position, then placement ID.\"\"\"\n",
"        placement_id, role_id = pair\n",
"        x, y = placement_xy(config.board_key, placement_id, placement_lookup)\n",
"        name = role_name(role_id, config)\n",
"        return (\n",
"            ROLE_SORT_ORDER.get(name, 9),\n",
"            y if not np.isnan(y) else 9999.0,\n",
"            x if not np.isnan(x) else 9999.0,\n",
"            placement_id,\n",
"        )\n",
"\n",
"    return sorted(list(holds), key=key)\n",
"\n",
"def board_token(config: BoardConfig) -> str:\n",
"    \"\"\"Return the special conditioning token for a board config.\"\"\"\n",
"    return f\"<BOARD_{config.token_prefix}>\"\n",
"\n",
"def angle_token(angle: float) -> str:\n",
"    \"\"\"Round a wall angle into the shared angle-token format.\"\"\"\n",
"    return f\"<ANGLE_{int(round(float(angle)))}>\"\n",
"\n",
"def hold_token(\n",
"    placement_id: int,\n",
"    role_id: int,\n",
"    config: BoardConfig,\n",
") -> str:\n",
"    \"\"\"Return a board-namespaced hold token for a placement and role.\"\"\"\n",
"    semantic_role = role_name(role_id, config)\n",
"    return f\"<{config.token_prefix}_p{int(placement_id)}_{semantic_role}>\"\n",
"\n",
"def tokenize_route(\n",
"    row,\n",
"    config: BoardConfig,\n",
"    placement_lookup: dict[tuple[str, int], dict],\n",
"    include_grade: bool = True,\n",
"    canonical: bool = True,\n",
") -> list[str]:\n",
"    \"\"\"Tokenize one climb row into the sequence consumed by the models.\n",
"\n",
"    ``include_grade=True`` is used for GPT-style generation, where the target\n",
"    grade is a conditioning token. ``include_grade=False`` is used for grade\n",
"    prediction so the model cannot read the answer from its input.\n",
"    \"\"\"\n",
"    holds = parse_frames(row[\"frames\"])\n",
"    if canonical:\n",
"        holds = canonicalize_holds(holds, config, placement_lookup)\n",
"\n",
"    tokens = [\n",
"        \"<BOS>\",\n",
"        board_token(config),\n",
"        angle_token(row[\"angle\"]),\n",
"    ]\n",
"    if include_grade:\n",
"        tokens.append(grade_token(row[\"display_difficulty\"]))\n",
"\n",
"    tokens.extend(hold_token(placement_id, role_id, config) for placement_id, role_id in holds)\n",
"    tokens.append(\"<EOS>\")\n",
"    return tokens\n",
"\n",
"def build_route_records(\n",
"    df_climbs: pd.DataFrame,\n",
"    configs_by_key: dict[str, BoardConfig],\n",
"    placement_lookup: dict[tuple[str, int], dict],\n",
") -> pd.DataFrame:\n",
"    \"\"\"Create one training/evaluation record per climb-angle row.\n",
"\n",
"    The returned frame keeps both human-readable route metadata and model-ready\n",
"    token sequences, which lets downstream scripts save compact CSV summaries\n",
"    while still retaining the richer JSONL training artifacts. Rows whose\n",
"    frames string yields no holds are skipped.\n",
"    \"\"\"\n",
"    records: list[dict] = []\n",
"\n",
"    for _, row in df_climbs.iterrows():\n",
"        board_key = str(row[\"board_key\"])\n",
"        config = configs_by_key[board_key]\n",
"        holds = canonicalize_holds(parse_frames(row[\"frames\"]), config, placement_lookup)\n",
"        if not holds:\n",
"            continue\n",
"\n",
"        hold_tokens = [hold_token(p, r, config) for p, r in holds]\n",
"        semantic_roles = [role_name(r, config) for _, r in holds]\n",
"\n",
"        tokens_with_grade = tokenize_route(\n",
"            row,\n",
"            config=config,\n",
"            placement_lookup=placement_lookup,\n",
"            include_grade=True,\n",
"            canonical=True,\n",
"        )\n",
"        tokens_no_grade = tokenize_route(\n",
"            row,\n",
"            config=config,\n",
"            placement_lookup=placement_lookup,\n",
"            include_grade=False,\n",
"            canonical=True,\n",
"        )\n",
"\n",
"        records.append(\n",
"            {\n",
"                \"uuid\": row[\"uuid\"],\n",
"                \"board_key\": board_key,\n",
"                \"board_display_name\": row[\"board_display_name\"],\n",
"                \"board_token_prefix\": row[\"board_token_prefix\"],\n",
"                \"board_token\": board_token(config),\n",
"                \"climb_name\": row[\"climb_name\"],\n",
"                \"setter_username\": row.get(\"setter_username\"),\n",
"                \"layout_id\": int(row[\"layout_id\"]),\n",
"                \"layout_name\": row.get(\"layout_name\"),\n",
"                \"board_name\": row.get(\"board_name\"),\n",
"                \"frames\": row[\"frames\"],\n",
"                \"angle\": float(row[\"angle\"]),\n",
"                \"display_difficulty\": float(row[\"display_difficulty\"]),\n",
"                \"grouped_v\": int(to_grouped_v(row[\"display_difficulty\"])),\n",
"                \"boulder_grade\": row.get(\"boulder_grade\"),\n",
"                \"ascensionist_count\": row.get(\"ascensionist_count\"),\n",
"                \"quality_average\": row.get(\"quality_average\"),\n",
"                \"fa_at\": row.get(\"fa_at\"),\n",
"                \"n_holds\": len(holds),\n",
"                \"n_start\": semantic_roles.count(\"start\"),\n",
"                \"n_middle\": semantic_roles.count(\"middle\"),\n",
"                \"n_foot\": semantic_roles.count(\"foot\"),\n",
"                \"n_finish\": semantic_roles.count(\"finish\"),\n",
"                \"holds\": holds,\n",
"                \"hold_tokens\": hold_tokens,\n",
"                \"tokens_with_grade\": tokens_with_grade,\n",
"                \"tokens_no_grade\": tokens_no_grade,\n",
"                \"sequence_with_grade\": \" \".join(tokens_with_grade),\n",
"                \"sequence_no_grade\": \" \".join(tokens_no_grade),\n",
"            }\n",
"        )\n",
"\n",
"    return pd.DataFrame(records)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "6198c24e",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Build unified route records\n",
|
||
"\n",
|
||
"This is the core tokenization step. For each climb, we:\n",
|
||
"\n",
|
||
"1. **Parse the frames string**: Convert `p344r5p369r6p603r7` into a list of `(placement_id, role_id)` tuples\n",
|
||
"\n",
|
||
"2. **Map role IDs to semantic roles**: Convert board-specific role IDs (5→start, 6→middle, etc.) to shared semantic names\n",
|
||
"\n",
|
||
"3. **Canonicalize hold order**: Sort holds by (role priority, y-position, x-position). This is important because:\n",
|
||
" - The same climb can be represented with holds in any order in the database\n",
|
||
" - Transformers need consistent input ordering to learn patterns\n",
|
||
" - This is analogous to how NLP tokenizers normalize text (lowercasing, etc.)\n",
|
||
"\n",
|
||
"4. **Generate token sequences**: Create two versions of each route:\n",
|
||
" - `sequence_with_grade`: `<BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6> <TB2_p344_start> ... <EOS>`\n",
|
||
" - `sequence_no_grade`: `<BOS> <BOARD_TB2> <ANGLE_40> <TB2_p344_start> ... <EOS>` (grade removed)\n",
|
||
"\n",
|
||
"The grade-included version is used for the GPT generator (which predicts the next token, including grade). The grade-excluded version is used for the grade predictor (which receives the route without knowing the grade and must predict it).\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "20bed1da",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2026-06-07T15:45:29.147179Z",
|
||
"iopub.status.busy": "2026-06-07T15:45:29.146784Z",
|
||
"iopub.status.idle": "2026-06-07T15:47:22.798391Z",
|
||
"shell.execute_reply": "2026-06-07T15:47:22.797739Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"configs_by_key = {config.board_key: config for config in configs}\n",
|
||
"configs_by_prefix = {config.token_prefix: config for config in configs}\n",
|
||
"placement_lookup = make_placement_lookup(df_placements)\n",
|
||
"\n",
|
||
"df_routes = build_route_records(\n",
|
||
" df_climbs=df_climbs,\n",
|
||
" configs_by_key=configs_by_key,\n",
|
||
" placement_lookup=placement_lookup,\n",
|
||
")\n",
|
||
"print(f\"Tokenized routes: {len(df_routes):,}\")\n",
|
||
"print()\n",
|
||
"df_routes[[\"board_key\", \"angle\", \"display_difficulty\", \"sequence_with_grade\"]].head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4d007fc0",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Example tokenized routes\n",
|
||
"\n",
|
||
"Let's look at what the tokenized routes actually look like. This is the \"text\" that our transformer models will read.\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "f5b7391b",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2026-06-07T15:47:22.801970Z",
|
||
"iopub.status.busy": "2026-06-07T15:47:22.801298Z",
|
||
"iopub.status.idle": "2026-06-07T15:47:22.833306Z",
|
||
"shell.execute_reply": "2026-06-07T15:47:22.832513Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"for _, row in df_routes.groupby(\"board_key\").head(2).iterrows():\n",
|
||
" print(f\"Board: {row['board_key']}\")\n",
|
||
" print(f\" Angle: {row['angle']}°\")\n",
|
||
" print(f\" Grade: {row['boulder_grade']} (V{row['grouped_v']})\")\n",
|
||
" print(f\" Tokens: {row['sequence_with_grade']}\")\n",
|
||
" print()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8f46bb08",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Vocabulary helpers"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "d52041b7",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2026-06-07T15:47:22.836570Z",
|
||
"iopub.status.busy": "2026-06-07T15:47:22.836109Z",
|
||
"iopub.status.idle": "2026-06-07T15:47:22.843523Z",
|
||
"shell.execute_reply": "2026-06-07T15:47:22.842719Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Build the shared vocabulary and encode/decode token strings.\n",
|
||
"def build_vocab(df_routes: pd.DataFrame) -> tuple[list[str], dict[str, int], dict[int, str]]:\n",
|
||
" \"\"\"Build the shared token vocabulary from grade-conditioned sequences.\"\"\"\n",
|
||
" all_tokens: list[str] = []\n",
|
||
" for tokens in df_routes[\"tokens_with_grade\"]:\n",
|
||
" all_tokens.extend(tokens)\n",
|
||
"\n",
|
||
" vocab_tokens = list(SPECIAL_TOKENS)\n",
|
||
" for token in sorted(set(all_tokens)):\n",
|
||
" if token not in vocab_tokens:\n",
|
||
" vocab_tokens.append(token)\n",
|
||
"\n",
|
||
" stoi = {token: idx for idx, token in enumerate(vocab_tokens)}\n",
|
||
" itos = {idx: token for token, idx in stoi.items()}\n",
|
||
" return vocab_tokens, stoi, itos\n",
|
||
"\n",
|
||
"def encode(tokens: Iterable[str], stoi: dict[str, int]) -> list[int]:\n",
|
||
" \"\"\"Convert tokens to integer IDs, using ``<UNK>`` for unseen tokens.\"\"\"\n",
|
||
" unk_id = stoi[\"<UNK>\"]\n",
|
||
" return [stoi.get(token, unk_id) for token in tokens]\n",
|
||
"\n",
|
||
"def decode(ids: Iterable[int], itos: dict[int, str]) -> list[str]:\n",
|
||
" \"\"\"Convert integer IDs back to token strings.\"\"\"\n",
|
||
" return [itos.get(int(idx), \"<UNK>\") for idx in ids]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0393d191",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Build the shared vocabulary\n",
|
||
"\n",
|
||
"### What is a vocabulary?\n",
|
||
"\n",
|
||
"In NLP, the **vocabulary** (or \"vocab\") is the set of all possible tokens the model can produce or understand. For GPT-3, this is ~50,000 BPE tokens. For our climbing model, it includes:\n",
|
||
"\n",
|
||
"1. **Special tokens** (6): `<PAD>`, `<UNK>`, `<BOS>`, `<EOS>`, `<CLS>`, `<MASK>`\n",
|
||
"2. **Board tokens** (2): `<BOARD_TB2>`, `<BOARD_KILTER>`\n",
|
||
"3. **Angle tokens** (~6): `<ANGLE_30>`, `<ANGLE_35>`, `<ANGLE_40>`, etc.\n",
|
||
"4. **Grade tokens** (~17): `<GRADE_V0>` through `<GRADE_V16>`\n",
|
||
"5. **Hold tokens** (~1000+): One per placement per board per role\n",
|
||
"\n",
|
||
"### Why board-namespaced hold tokens?\n",
|
||
"\n",
|
||
"Placement ID 344 on TB2 refers to a completely different physical hold than placement ID 344 on Kilter (the latter doesn't exist). By prefixing with the board name (`TB2_p344_start` vs `KILTER_p344_start`), we ensure the model treats these as distinct tokens.\n",
|
||
"\n",
|
||
"This is analogous to how multilingual LLMs use language-specific subword tokens — the same byte sequence can mean different things in different languages.\n",
|
||
"\n",
|
||
"### String-to-integer mapping\n",
|
||
"\n",
|
||
"Transformers operate on integer indices, not strings. The `stoi` (string-to-integer) and `itos` (integer-to-string) dictionaries provide this mapping, similar to how HuggingFace tokenizers work.\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "1ba5b78d",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2026-06-07T15:47:22.846657Z",
|
||
"iopub.status.busy": "2026-06-07T15:47:22.846214Z",
|
||
"iopub.status.idle": "2026-06-07T15:47:23.506016Z",
|
||
"shell.execute_reply": "2026-06-07T15:47:23.505338Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"vocab_tokens, stoi, itos = build_vocab(df_routes)\n",
|
||
"\n",
|
||
"print(f\"Vocabulary size: {len(stoi):,}\")\n",
|
||
"print()\n",
|
||
"print(\"First 20 tokens (special + board tokens):\")\n",
|
||
"print(vocab_tokens[:20])\n",
|
||
"print()\n",
|
||
"hold_tokens = [t for t in vocab_tokens if t.startswith('<') and '_p' in t]\n",
|
||
"angle_tokens = [t for t in vocab_tokens if t.startswith('<ANGLE_')]\n",
|
||
"grade_tokens = [t for t in vocab_tokens if t.startswith('<GRADE_')]\n",
|
||
"board_tokens = [t for t in vocab_tokens if t.startswith('<BOARD_')]\n",
|
||
"special_tokens = [t for t in vocab_tokens if t in ['<PAD>', '<UNK>', '<BOS>', '<EOS>', '<CLS>', '<MASK>']]\n",
|
||
"\n",
|
||
"print(f\"Special tokens: {len(special_tokens)}\")\n",
|
||
"print(f\"Board tokens: {len(board_tokens)}\")\n",
|
||
"print(f\"Angle tokens: {len(angle_tokens)}\")\n",
|
||
"print(f\"Grade tokens: {len(grade_tokens)}\")\n",
|
||
"print(f\"Hold tokens: {len(hold_tokens)}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8408e4ab",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Split helpers"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "366d46b5",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2026-06-07T15:47:23.509090Z",
|
||
"iopub.status.busy": "2026-06-07T15:47:23.508606Z",
|
||
"iopub.status.idle": "2026-06-07T15:47:23.518536Z",
|
||
"shell.execute_reply": "2026-06-07T15:47:23.517995Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Assign train/validation/test splits at the logical-climb group level.\n",
|
||
"def safe_train_test_split(\n",
|
||
" df: pd.DataFrame,\n",
|
||
" test_size: float,\n",
|
||
" random_state: int,\n",
|
||
" stratify_col: str | None = None,\n",
|
||
"):\n",
|
||
" \"\"\"Split a DataFrame with optional stratification and graceful fallback.\n",
|
||
"\n",
|
||
" scikit-learn raises when a requested stratum is too small. The tokenization\n",
|
||
" pipeline prefers stratified splits when possible, but falls back to an\n",
|
||
" unstratified split rather than failing on tiny smoke-test subsets.\n",
|
||
" \"\"\"\n",
|
||
" stratify = None\n",
|
||
" if stratify_col is not None and stratify_col in df.columns:\n",
|
||
" counts = df[stratify_col].value_counts()\n",
|
||
" if len(counts) > 1 and counts.min() >= 2:\n",
|
||
" stratify = df[stratify_col]\n",
|
||
"\n",
|
||
" try:\n",
|
||
" return train_test_split(\n",
|
||
" df,\n",
|
||
" test_size=test_size,\n",
|
||
" random_state=random_state,\n",
|
||
" stratify=stratify,\n",
|
||
" )\n",
|
||
" except ValueError:\n",
|
||
" return train_test_split(\n",
|
||
" df,\n",
|
||
" test_size=test_size,\n",
|
||
" random_state=random_state,\n",
|
||
" stratify=None,\n",
|
||
" )\n",
|
||
"\n",
|
||
"def assign_group_splits(\n",
|
||
" df: pd.DataFrame,\n",
|
||
" group_cols: list[str],\n",
|
||
" test_size: float,\n",
|
||
" val_size_within_temp: float,\n",
|
||
" random_state: int,\n",
|
||
" stratify_col: str | None = None,\n",
|
||
") -> pd.Series:\n",
|
||
" \"\"\"Assign train/val/test splits at group level.\n",
|
||
"\n",
|
||
" This prevents multiple rows for the same logical climb, for example the\n",
|
||
" same UUID at several angles, from being distributed across different\n",
|
||
" splits. The returned Series is indexed like ``df`` and contains\n",
|
||
" ``train``, ``val``, or ``test``.\n",
|
||
" \"\"\"\n",
|
||
" group_df = df[group_cols + ([stratify_col] if stratify_col else [])].copy()\n",
|
||
" group_df = group_df.drop_duplicates(group_cols).reset_index(drop=True)\n",
|
||
"\n",
|
||
" train_groups, temp_groups = safe_train_test_split(\n",
|
||
" group_df,\n",
|
||
" test_size=test_size,\n",
|
||
" random_state=random_state,\n",
|
||
" stratify_col=stratify_col,\n",
|
||
" )\n",
|
||
" val_groups, test_groups = safe_train_test_split(\n",
|
||
" temp_groups,\n",
|
||
" test_size=val_size_within_temp,\n",
|
||
" random_state=random_state,\n",
|
||
" stratify_col=stratify_col,\n",
|
||
" )\n",
|
||
"\n",
|
||
" def key_frame(frame: pd.DataFrame) -> set[tuple]:\n",
|
||
" \"\"\"Return stringified group keys so pandas dtypes cannot affect joins.\"\"\"\n",
|
||
" return set(map(tuple, frame[group_cols].astype(str).values.tolist()))\n",
|
||
"\n",
|
||
" train_keys = key_frame(train_groups)\n",
|
||
" val_keys = key_frame(val_groups)\n",
|
||
" test_keys = key_frame(test_groups)\n",
|
||
"\n",
|
||
" def split_for_row(row) -> str:\n",
|
||
" \"\"\"Map one original row back to its group-level split assignment.\"\"\"\n",
|
||
" key = tuple(str(row[col]) for col in group_cols)\n",
|
||
" if key in train_keys:\n",
|
||
" return \"train\"\n",
|
||
" if key in val_keys:\n",
|
||
" return \"val\"\n",
|
||
" if key in test_keys:\n",
|
||
" return \"test\"\n",
|
||
" raise KeyError(f\"Could not assign split for group key {key}\")\n",
|
||
"\n",
|
||
" return df.apply(split_for_row, axis=1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "93176cff",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Train/validation/test splits\n",
|
||
"\n",
|
||
"### Why stratified splitting?\n",
|
||
"\n",
|
||
"We split data into train (80%), validation (10%), and test (10%) sets. Crucially, we **stratify by `board_key × grouped_v`** — this ensures that:\n",
|
||
"\n",
|
||
"1. Both boards (TB2 and Kilter) are represented in all splits\n",
|
||
"2. All difficulty levels (V0 through V16) are represented in all splits\n",
|
||
"\n",
|
||
"Without stratification, we might end up with all V14 climbs in the test set and none in training, which would make evaluation meaningless.\n",
|
||
"\n",
|
||
"This is the same principle as stratified splitting in NLP, where you ensure all languages or domains are represented in each split.\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "ff18298e",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2026-06-07T15:47:23.521543Z",
|
||
"iopub.status.busy": "2026-06-07T15:47:23.521153Z",
|
||
"iopub.status.idle": "2026-06-07T15:47:31.054015Z",
|
||
"shell.execute_reply": "2026-06-07T15:47:31.053128Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"df_routes[\"ids_with_grade\"] = df_routes[\"tokens_with_grade\"].apply(lambda tokens: encode(tokens, stoi))\n",
|
||
"df_routes[\"ids_no_grade\"] = df_routes[\"tokens_no_grade\"].apply(lambda tokens: encode(tokens, stoi))\n",
|
||
"df_routes[\"split_stratum\"] = df_routes[\"board_key\"].astype(str) + \"__V\" + df_routes[\"grouped_v\"].astype(str)\n",
|
||
"df_routes[\"split\"] = assign_group_splits(\n",
|
||
" df_routes,\n",
|
||
" group_cols=[\"board_key\", \"uuid\"],\n",
|
||
" test_size=0.20,\n",
|
||
" val_size_within_temp=0.50,\n",
|
||
" random_state=3,\n",
|
||
" stratify_col=\"split_stratum\",\n",
|
||
")\n",
|
||
"\n",
|
||
"df_routes.groupby([\"board_key\", \"split\"]).size().unstack(fill_value=0)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ae789b11",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Token metadata helpers"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "43e9dd8b",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2026-06-07T15:47:31.057561Z",
|
||
"iopub.status.busy": "2026-06-07T15:47:31.056935Z",
|
||
"iopub.status.idle": "2026-06-07T15:47:31.072487Z",
|
||
"shell.execute_reply": "2026-06-07T15:47:31.071641Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Attach board, role, placement, and coordinate metadata to each token.\n",
|
||
"def build_token_metadata(\n",
|
||
" vocab_tokens: list[str],\n",
|
||
" stoi: dict[str, int],\n",
|
||
" df_placements: pd.DataFrame,\n",
|
||
" placement_lookup: dict[tuple[str, int], dict],\n",
|
||
" configs_by_prefix: dict[str, BoardConfig],\n",
|
||
") -> pd.DataFrame:\n",
|
||
" \"\"\"Build per-token metadata used for coordinate features and plotting.\n",
|
||
"\n",
|
||
" Hold tokens receive raw coordinates, normalized coordinates in ``[-1, 1]``,\n",
|
||
" role labels, and board identity. Non-hold tokens keep neutral coordinate\n",
|
||
" features so the grade predictor can safely index every token ID.\n",
|
||
" \"\"\"\n",
|
||
" bounds = {}\n",
|
||
" for board_key, frame in df_placements.groupby(\"board_key\"):\n",
|
||
" xs = frame[\"x\"].astype(float)\n",
|
||
" ys = frame[\"y\"].astype(float)\n",
|
||
" bounds[str(board_key)] = {\n",
|
||
" \"x_min\": float(xs.min()),\n",
|
||
" \"x_max\": float(xs.max()),\n",
|
||
" \"y_min\": float(ys.min()),\n",
|
||
" \"y_max\": float(ys.max()),\n",
|
||
" }\n",
|
||
"\n",
|
||
" def normalize(value: float, lo: float, hi: float) -> float:\n",
|
||
" \"\"\"Scale one coordinate into ``[-1, 1]`` with safe missing-value handling.\"\"\"\n",
|
||
" if pd.isna(value) or hi == lo:\n",
|
||
" return 0.0\n",
|
||
" return 2 * ((float(value) - lo) / (hi - lo)) - 1\n",
|
||
"\n",
|
||
" rows: list[dict] = []\n",
|
||
"\n",
|
||
" for token in vocab_tokens:\n",
|
||
" meta = {\n",
|
||
" \"token\": token,\n",
|
||
" \"token_id\": stoi[token],\n",
|
||
" \"kind\": \"special\",\n",
|
||
" \"board_key\": None,\n",
|
||
" \"board_token_prefix\": None,\n",
|
||
" \"placement_id\": np.nan,\n",
|
||
" \"role\": None,\n",
|
||
" \"x\": np.nan,\n",
|
||
" \"y\": np.nan,\n",
|
||
" \"x_norm\": 0.0,\n",
|
||
" \"y_norm\": 0.0,\n",
|
||
" \"is_hold\": 0,\n",
|
||
" \"angle\": np.nan,\n",
|
||
" \"grouped_v\": np.nan,\n",
|
||
" }\n",
|
||
"\n",
|
||
" hold_match = HOLD_TOKEN_PATTERN.match(token)\n",
|
||
" if hold_match:\n",
|
||
" prefix = hold_match.group(1)\n",
|
||
" placement_id = int(hold_match.group(2))\n",
|
||
" role = hold_match.group(3)\n",
|
||
" config = configs_by_prefix[prefix]\n",
|
||
" board_key = config.board_key\n",
|
||
" row = placement_lookup.get((board_key, placement_id), {})\n",
|
||
" x = float(row.get(\"x\", np.nan))\n",
|
||
" y = float(row.get(\"y\", np.nan))\n",
|
||
" board_bounds = bounds.get(board_key, {\"x_min\": 0, \"x_max\": 1, \"y_min\": 0, \"y_max\": 1})\n",
|
||
"\n",
|
||
" meta.update(\n",
|
||
" {\n",
|
||
" \"kind\": \"hold\",\n",
|
||
" \"board_key\": board_key,\n",
|
||
" \"board_token_prefix\": prefix,\n",
|
||
" \"placement_id\": placement_id,\n",
|
||
" \"role\": role,\n",
|
||
" \"x\": x,\n",
|
||
" \"y\": y,\n",
|
||
" \"x_norm\": normalize(x, board_bounds[\"x_min\"], board_bounds[\"x_max\"]),\n",
|
||
" \"y_norm\": normalize(y, board_bounds[\"y_min\"], board_bounds[\"y_max\"]),\n",
|
||
" \"is_hold\": 1,\n",
|
||
" }\n",
|
||
" )\n",
|
||
"\n",
|
||
" angle_match = ANGLE_TOKEN_PATTERN.match(token)\n",
|
||
" if angle_match:\n",
|
||
" meta.update({\"kind\": \"angle\", \"angle\": int(angle_match.group(1))})\n",
|
||
"\n",
|
||
" grade_match = GRADE_TOKEN_PATTERN.match(token)\n",
|
||
" if grade_match:\n",
|
||
" meta.update({\"kind\": \"grade\", \"grouped_v\": int(grade_match.group(1))})\n",
|
||
"\n",
|
||
" board_match = BOARD_TOKEN_PATTERN.match(token)\n",
|
||
" if board_match:\n",
|
||
" prefix = board_match.group(1)\n",
|
||
" config = configs_by_prefix.get(prefix)\n",
|
||
" meta.update(\n",
|
||
" {\n",
|
||
" \"kind\": \"board\",\n",
|
||
" \"board_key\": None if config is None else config.board_key,\n",
|
||
" \"board_token_prefix\": prefix,\n",
|
||
" }\n",
|
||
" )\n",
|
||
"\n",
|
||
" rows.append(meta)\n",
|
||
"\n",
|
||
" return pd.DataFrame(rows)\n",
|
||
"\n",
|
||
"def vocab_payload(\n",
|
||
" stoi: dict[str, int],\n",
|
||
" itos: dict[int, str],\n",
|
||
" configs_by_key: dict[str, BoardConfig],\n",
|
||
") -> dict:\n",
|
||
" \"\"\"Package vocabulary and board metadata for JSON serialization.\"\"\"\n",
|
||
" return {\n",
|
||
" \"stoi\": stoi,\n",
|
||
" \"itos\": {str(k): v for k, v in itos.items()},\n",
|
||
" \"special_tokens\": SPECIAL_TOKENS,\n",
|
||
" \"boards\": {\n",
|
||
" board_key: {\n",
|
||
" \"token_prefix\": config.token_prefix,\n",
|
||
" \"board_token\": board_token(config),\n",
|
||
" \"role_definitions\": config.role_definitions,\n",
|
||
" }\n",
|
||
" for board_key, config in configs_by_key.items()\n",
|
||
" },\n",
|
||
" \"grade_to_v\": {str(k): v for k, v in GRADE_TO_V.items()},\n",
|
||
" }"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "414dca92",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Token metadata\n",
|
||
"\n",
|
||
"### Why metadata matters\n",
|
||
"\n",
|
||
"Each hold token carries rich metadata that the model can use:\n",
|
||
"\n",
|
||
"- **Physical coordinates** (`x`, `y`): Where the hold is on the board\n",
|
||
"- **Normalized coordinates** (`x_norm`, `y_norm`): Scaled to [-1, 1] per board, so the model doesn't need to learn different coordinate scales\n",
|
||
"- **Semantic role** (`start`, `middle`, `finish`, `foot`): What the hold is used for\n",
|
||
"- **Board identity** (`board_key`): Which board this hold belongs to\n",
|
||
"\n",
|
||
"The grade predictor uses these coordinate features as additional embeddings alongside the token embeddings. This is similar to how some LLMs inject positional embeddings or segment embeddings — we're giving the model extra structured information about each token.\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "48c3692e",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2026-06-07T15:47:31.075613Z",
|
||
"iopub.status.busy": "2026-06-07T15:47:31.075057Z",
|
||
"iopub.status.idle": "2026-06-07T15:47:31.130838Z",
|
||
"shell.execute_reply": "2026-06-07T15:47:31.129975Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"df_token_meta = build_token_metadata(\n",
|
||
" vocab_tokens=vocab_tokens,\n",
|
||
" stoi=stoi,\n",
|
||
" df_placements=df_placements,\n",
|
||
" placement_lookup=placement_lookup,\n",
|
||
" configs_by_prefix=configs_by_prefix,\n",
|
||
")\n",
|
||
"\n",
|
||
"print(\"Token metadata columns:\")\n",
|
||
"print(df_token_meta.columns.tolist())\n",
|
||
"print()\n",
|
||
"print(\"Example hold token metadata:\")\n",
|
||
"df_token_meta[df_token_meta[\"kind\"] == \"hold\"].head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "79774dc1",
|
||
"metadata": {},
|
||
"source": [
|
||
"### JSON output helpers"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "a53bf23d",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2026-06-07T15:47:31.134369Z",
|
||
"iopub.status.busy": "2026-06-07T15:47:31.133692Z",
|
||
"iopub.status.idle": "2026-06-07T15:47:31.141767Z",
|
||
"shell.execute_reply": "2026-06-07T15:47:31.140911Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Write JSON artifacts after converting NumPy/pandas values to plain Python values.\n",
|
||
"def json_safe(obj: Any) -> Any:\n",
|
||
" \"\"\"Convert NumPy/pandas values into JSON-serializable Python objects.\"\"\"\n",
|
||
" if isinstance(obj, dict):\n",
|
||
" return {str(k): json_safe(v) for k, v in obj.items()}\n",
|
||
" if isinstance(obj, (list, tuple)):\n",
|
||
" return [json_safe(v) for v in obj]\n",
|
||
" if isinstance(obj, np.integer):\n",
|
||
" return int(obj)\n",
|
||
" if isinstance(obj, np.floating):\n",
|
||
" if np.isnan(obj):\n",
|
||
" return None\n",
|
||
" return float(obj)\n",
|
||
" if isinstance(obj, np.ndarray):\n",
|
||
" return json_safe(obj.tolist())\n",
|
||
" try:\n",
|
||
" if pd.isna(obj):\n",
|
||
" return None\n",
|
||
" except Exception:\n",
|
||
" pass\n",
|
||
" return obj\n",
|
||
"\n",
|
||
"def write_json(path: str | Path, payload: Any) -> None:\n",
|
||
" \"\"\"Write an object as indented UTF-8 JSON after ``json_safe`` cleanup.\"\"\"\n",
|
||
" path = Path(path)\n",
|
||
" path.parent.mkdir(parents=True, exist_ok=True)\n",
|
||
" path.write_text(json.dumps(json_safe(payload), indent=2), encoding=\"utf-8\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "414dca93",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Save artifacts\n",
|
||
"\n",
|
||
"We save several files that will be consumed by later notebooks:\n",
|
||
"\n",
|
||
"1. **`route_sequences.csv`**: The main tokenized dataset with train/val/test splits\n",
|
||
"2. **`routes_tokenized.jsonl`**: Same data in JSON Lines format (one JSON object per route)\n",
|
||
"3. **`token_vocab.json`**: The vocabulary mapping (stoi and itos)\n",
|
||
"4. **`token_metadata.csv`**: Metadata for each token (coordinates, roles, etc.)\n",
|
||
"5. **`placement_metadata.csv`**: Physical placement information\n",
|
||
"6. **`board_summary.csv`**: Aggregate statistics per board\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "50e81878",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2026-06-07T15:47:31.144897Z",
|
||
"iopub.status.busy": "2026-06-07T15:47:31.144460Z",
|
||
"iopub.status.idle": "2026-06-07T15:48:29.973473Z",
|
||
"shell.execute_reply": "2026-06-07T15:48:29.972779Z"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"OUT = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
|
||
"OUT.mkdir(parents=True, exist_ok=True)\n",
|
||
"\n",
|
||
"csv_cols = [\n",
|
||
" \"uuid\", \"board_key\", \"board_display_name\", \"board_token_prefix\", \"board_token\",\n",
|
||
" \"climb_name\", \"setter_username\", \"layout_id\", \"layout_name\", \"board_name\",\n",
|
||
" \"frames\", \"angle\", \"display_difficulty\", \"grouped_v\", \"boulder_grade\",\n",
|
||
" \"ascensionist_count\", \"quality_average\", \"fa_at\",\n",
|
||
" \"n_holds\", \"n_start\", \"n_middle\", \"n_foot\", \"n_finish\",\n",
|
||
" \"sequence_with_grade\", \"sequence_no_grade\", \"split\",\n",
|
||
"]\n",
|
||
"df_routes[csv_cols].to_csv(OUT / \"route_sequences.csv\", index=False)\n",
|
||
"\n",
|
||
"df_placements.to_csv(OUT / \"placement_metadata.csv\", index=False)\n",
|
||
"\n",
|
||
"df_token_meta.to_csv(OUT / \"token_metadata.csv\", index=False)\n",
|
||
"\n",
|
||
"write_json(OUT / \"token_vocab.json\", vocab_payload(stoi, itos, configs_by_key))\n",
|
||
"\n",
|
||
"with (OUT / \"routes_tokenized.jsonl\").open(\"w\", encoding=\"utf-8\") as handle:\n",
|
||
" for record in df_routes.to_dict(orient=\"records\"):\n",
|
||
" handle.write(json.dumps(json_safe(record)) + \"\\n\")\n",
|
||
"\n",
|
||
"board_summary = (\n",
|
||
" df_routes.groupby(\"board_key\")\n",
|
||
" .agg(\n",
|
||
" n_routes=(\"uuid\", \"count\"),\n",
|
||
" mean_angle=(\"angle\", \"mean\"),\n",
|
||
" mean_display_difficulty=(\"display_difficulty\", \"mean\"),\n",
|
||
" mean_holds=(\"n_holds\", \"mean\"),\n",
|
||
" )\n",
|
||
" .reset_index()\n",
|
||
")\n",
|
||
"board_summary.to_csv(OUT / \"board_summary.csv\", index=False)\n",
|
||
"\n",
|
||
"print(\"Saved artifacts to:\", OUT)\n",
|
||
"print(f\" - route_sequences.csv ({len(df_routes):,} rows)\")\n",
|
||
"print(f\" - routes_tokenized.jsonl\")\n",
|
||
"print(f\" - token_vocab.json ({len(stoi):,} tokens)\")\n",
|
||
"print(f\" - token_metadata.csv ({len(df_token_meta):,} rows)\")\n",
|
||
"print(f\" - placement_metadata.csv ({len(df_placements):,} rows)\")\n",
|
||
"print(f\" - board_summary.csv\")"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.12.12"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|