ClimbingBoardGPT/notebooks/02_joint_transformer_grade_prediction.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "92b83b1d",
   "metadata": {},
   "source": [
    "# 02 — Joint Transformer Grade Prediction\n",
    "\n",
    "## From Language Modeling to Grade Prediction\n",
    "\n",
    "In NLP, **BERT-style models** are encoder-only transformers that take a sequence of tokens, process them through multiple self-attention layers, and produce a single output (like a classification label). The key insight is:\n",
    "\n",
    "- **Input**: A sequence of tokens (words, subwords, or in our case, holds)\n",
    "- **Processing**: Multiple layers of self-attention that let each token \"look at\" every other token\n",
    "- **Output**: A pooled representation (typically from a `[CLS]` token) that summarizes the entire sequence\n",
    "\n",
    "### Our architecture\n",
    "\n",
    "We use a **Transformer Encoder** (similar to BERT) with these components:\n",
    "\n",
    "1. **Token embeddings**: Convert integer token IDs to dense vectors\n",
    "2. **Positional embeddings**: Tell the model where each token is in the sequence\n",
    "3. **Coordinate features**: Inject physical (x, y) position of each hold into the embedding\n",
    "4. **Transformer encoder layers**: Multiple layers of self-attention + feedforward\n",
    "5. **Regression head**: Take the `<CLS>` token's output and predict a single difficulty score\n",
    "\n",
    "### Why this works for climbing\n",
    "\n",
    "A climb's difficulty depends on the *relationships between holds*, not just individual holds. Self-attention naturally captures these relationships:\n",
    "\n",
    "- A start hold far from the first middle hold suggests a big opening move\n",
    "- Two holds that are very far apart suggest a dyno\n",
    "- The overall spatial distribution determines the \"flow\" of the climb\n",
    "\n",
    "The transformer can learn these spatial relationships through attention, without us having to manually engineer features like \"mean hand reach\" or \"height gained\" (though those features were useful in the classical model).\n",
    "\n",
    "### Input format\n",
    "\n",
    "```text\n",
    "<CLS> <BOARD_TB2> <ANGLE_40> <TB2_p344_start> <TB2_p369_middle> ... <TB2_p603_finish>\n",
    "```\n",
    "\n",
    "Note: We use `<CLS>` instead of `<BOS>` and we **exclude the grade token** — the model must predict the grade, not see it!\n",
    "\n",
    "### Target\n",
    "\n",
    "```text\n",
    "display_difficulty (continuous value, e.g., 20.5)\n",
    "```\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3dfd6081",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:48:37.490884Z",
     "iopub.status.busy": "2026-06-07T15:48:37.490209Z",
     "iopub.status.idle": "2026-06-07T15:48:42.972689Z",
     "shell.execute_reply": "2026-06-07T15:48:42.971662Z"
    }
   },
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import json\n",
    "import math\n",
    "from pathlib import Path\n",
    "from typing import Any\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n",
    "from torch.utils.data import DataLoader, Dataset\n",
    "\n",
    "ROOT = Path.cwd().resolve()\n",
    "if ROOT.name == \"notebooks\":\n",
    "    ROOT = ROOT.parent\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a9e2443",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:48:42.976137Z",
     "iopub.status.busy": "2026-06-07T15:48:42.975792Z",
     "iopub.status.idle": "2026-06-07T15:48:48.768984Z",
     "shell.execute_reply": "2026-06-07T15:48:48.768115Z"
    }
   },
   "outputs": [],
   "source": [
    "TOKENIZED = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
    "df_routes = pd.read_csv(TOKENIZED / \"route_sequences.csv\")\n",
    "vocab = json.loads((TOKENIZED / \"token_vocab.json\").read_text(encoding=\"utf-8\"))\n",
    "\n",
    "stoi = {str(k): int(v) for k, v in vocab[\"stoi\"].items()}\n",
    "itos = {int(k): str(v) for k, v in vocab[\"itos\"].items()}\n",
    "df_token_meta = pd.read_csv(TOKENIZED / \"token_metadata.csv\")\n",
    "\n",
    "pad_id = stoi[\"<PAD>\"]\n",
    "unk_id = stoi[\"<UNK>\"]\n",
    "\n",
    "print(f\"Vocabulary size: {len(stoi):,}\")\n",
    "print(f\"Total routes: {len(df_routes):,}\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4abfd9b",
   "metadata": {},
   "source": [
    "## Build model IDs and coordinate features\n",
    "\n",
    "### Coordinate features: Why inject physical position?\n",
    "\n",
    "In standard NLP, positional embeddings tell the model *which position in the sequence* a token occupies. But for climbing, the **physical position on the wall** matters more than the sequence position.\n",
    "\n",
    "We create a 3-dimensional feature vector for each token:\n",
    "1. `x_norm`: Normalized horizontal position on the board (-1 to 1)\n",
    "2. `y_norm`: Normalized vertical position on the board (-1 to 1)\n",
    "3. `is_hold`: 1 if this token represents a hold, 0 otherwise\n",
    "\n",
    "These features are projected through a linear layer and added to the token embeddings. This is similar to how some vision-language models inject spatial features from images alongside text tokens.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95bb745f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:48:48.772384Z",
     "iopub.status.busy": "2026-06-07T15:48:48.771749Z",
     "iopub.status.idle": "2026-06-07T15:48:52.916642Z",
     "shell.execute_reply": "2026-06-07T15:48:52.915616Z"
    }
   },
   "outputs": [],
   "source": [
    "def encode(tokens):\n",
    "    \"\"\"Convert a list of token strings to integer IDs.\"\"\"\n",
    "    return [stoi.get(token, unk_id) for token in tokens]\n",
    "\n",
    "# Prepare input sequences for the grade predictor\n",
    "# We use the \"no grade\" version because the model should predict the grade,\n",
    "# not see it in the input!\n",
    "# We also prepend <CLS> which will be used for pooling the sequence representation\n",
    "df_routes[\"tokens_no_grade\"] = df_routes[\"sequence_no_grade\"].fillna(\"\").str.split()\n",
    "df_routes[\"model_tokens\"] = df_routes[\"tokens_no_grade\"].apply(\n",
    "    lambda tokens: [\"<CLS>\"] + tokens[1:] if tokens else [\"<CLS>\"]\n",
    ")\n",
    "df_routes[\"model_ids\"] = df_routes[\"model_tokens\"].apply(encode)\n",
    "df_routes[\"seq_len\"] = df_routes[\"model_ids\"].apply(len)\n",
    "max_len = int(df_routes[\"seq_len\"].max())\n",
    "\n",
    "# Build coordinate features matrix: (vocab_size, 3)\n",
    "# Each row corresponds to a token ID and contains [x_norm, y_norm, is_hold]\n",
    "# This will be used as additional input to the model alongside token embeddings\n",
    "coord_features = np.zeros((len(stoi), 3), dtype=np.float32)\n",
    "for _, row in df_token_meta.iterrows():\n",
    "    token_id = int(row[\"token_id\"])\n",
    "    coord_features[token_id, 0] = 0.0 if pd.isna(row.get(\"x_norm\", 0.0)) else float(row.get(\"x_norm\", 0.0))\n",
    "    coord_features[token_id, 1] = 0.0 if pd.isna(row.get(\"y_norm\", 0.0)) else float(row.get(\"y_norm\", 0.0))\n",
    "    coord_features[token_id, 2] = 0.0 if pd.isna(row.get(\"is_hold\", 0.0)) else float(row.get(\"is_hold\", 0.0))\n",
    "coord_features = torch.tensor(coord_features, dtype=torch.float32)\n",
    "\n",
    "print(f\"Max sequence length: {max_len}\")\n",
    "print(f\"Coordinate features shape: {coord_features.shape}\")\n",
    "print(f\"Vocabulary size: {len(stoi)}\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9033f9e8",
   "metadata": {},
   "source": [
    "### Dataset helper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c55c1d26",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:48:52.920221Z",
     "iopub.status.busy": "2026-06-07T15:48:52.919793Z",
     "iopub.status.idle": "2026-06-07T15:48:52.927627Z",
     "shell.execute_reply": "2026-06-07T15:48:52.926737Z"
    }
   },
   "outputs": [],
   "source": [
    "# Pad route-token sequences for transformer grade prediction.\n",
    "class RouteGradeDataset(Dataset):\n",
    "    \"\"\"Dataset for transformer encoder grade prediction.\n",
    "\n",
    "    Each item returns a padded token sequence, a boolean attention mask, the\n",
    "    continuous display-difficulty target, and a small amount of route identity\n",
    "    metadata used when writing prediction CSVs.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, df, max_len: int, pad_id: int):\n",
    "        \"\"\"Store model IDs and labels from a tokenized route DataFrame.\"\"\"\n",
    "        self.row_ids = df[\"row_id\"].tolist() if \"row_id\" in df.columns else df.index.tolist()\n",
    "        self.ids = df[\"model_ids\"].tolist()\n",
    "        self.targets = df[\"display_difficulty\"].astype(float).values\n",
    "        self.uuids = df[\"uuid\"].tolist()\n",
    "        self.boards = df[\"board_key\"].astype(str).tolist()\n",
    "        self.max_len = int(max_len)\n",
    "        self.pad_id = int(pad_id)\n",
    "\n",
    "    def __len__(self) -> int:\n",
    "        \"\"\"Return the number of route examples.\"\"\"\n",
    "        return len(self.ids)\n",
    "\n",
    "    def __getitem__(self, idx: int):\n",
    "        \"\"\"Return one padded encoder example and its regression target.\"\"\"\n",
    "        ids = list(self.ids[idx])[: self.max_len]\n",
    "        mask = [1] * len(ids)\n",
    "        if len(ids) < self.max_len:\n",
    "            pad_n = self.max_len - len(ids)\n",
    "            ids += [self.pad_id] * pad_n\n",
    "            mask += [0] * pad_n\n",
    "\n",
    "        return {\n",
    "            \"input_ids\": torch.tensor(ids, dtype=torch.long),\n",
    "            \"attention_mask\": torch.tensor(mask, dtype=torch.bool),\n",
    "            \"target\": torch.tensor(self.targets[idx], dtype=torch.float32),\n",
    "            \"row_id\": int(self.row_ids[idx]),\n",
    "            \"uuid\": self.uuids[idx],\n",
    "            \"board_key\": self.boards[idx],\n",
    "        }\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5da4ca8",
   "metadata": {},
   "source": [
    "## Data loaders\n",
    "\n",
    "### Batching and padding\n",
    "\n",
    "Transformers process data in **batches** for efficiency. But routes have different lengths (different numbers of holds). We handle this by:\n",
    "\n",
    "1. **Padding**: Shorter sequences are padded with `<PAD>` tokens to match the longest sequence in the batch\n",
    "2. **Attention masking**: The model receives a binary mask that tells it which positions are real data vs padding\n",
    "\n",
    "This is exactly how BERT and GPT handle variable-length text sequences.\n",
    "\n",
    "### The RouteGradeDataset class\n",
    "\n",
    "For each route, this dataset produces:\n",
    "- `input_ids`: Integer token IDs, padded to `max_len`\n",
    "- `attention_mask`: 1 for real tokens, 0 for padding\n",
    "- `target`: The difficulty score we want to predict\n",
    "- `uuid`, `board_key`: Metadata for evaluation\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c9e5543",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:48:52.930809Z",
     "iopub.status.busy": "2026-06-07T15:48:52.930299Z",
     "iopub.status.idle": "2026-06-07T15:48:53.612170Z",
     "shell.execute_reply": "2026-06-07T15:48:53.611156Z"
    }
   },
   "outputs": [],
   "source": [
    "train_df = df_routes[df_routes[\"split\"] == \"train\"].reset_index(drop=True)\n",
    "val_df = df_routes[df_routes[\"split\"] == \"val\"].reset_index(drop=True)\n",
    "test_df = df_routes[df_routes[\"split\"] == \"test\"].reset_index(drop=True)\n",
    "\n",
    "train_ds = RouteGradeDataset(train_df, max_len=max_len, pad_id=pad_id)\n",
    "val_ds = RouteGradeDataset(val_df, max_len=max_len, pad_id=pad_id)\n",
    "test_ds = RouteGradeDataset(test_df, max_len=max_len, pad_id=pad_id)\n",
    "\n",
    "train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)\n",
    "val_loader = DataLoader(val_ds, batch_size=128, shuffle=False)\n",
    "test_loader = DataLoader(test_ds, batch_size=128, shuffle=False)\n",
    "\n",
    "print(f\"Training samples: {len(train_ds):,}\")\n",
    "print(f\"Validation samples: {len(val_ds):,}\")\n",
    "print(f\"Test samples: {len(test_ds):,}\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03091a62",
   "metadata": {},
   "source": [
    "### Transformer regressor model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78612fe7",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:48:53.616012Z",
     "iopub.status.busy": "2026-06-07T15:48:53.615396Z",
     "iopub.status.idle": "2026-06-07T15:48:53.640842Z",
     "shell.execute_reply": "2026-06-07T15:48:53.639849Z"
    }
   },
   "outputs": [],
   "source": [
    "# Transformer encoder used as a continuous grade regressor.\n",
    "class JointRouteTransformerRegressor(nn.Module):\n",
    "    \"\"\"Transformer encoder for joint TB2/Kilter route difficulty prediction.\n",
    "\n",
    "    Inputs are token IDs plus an attention mask. Token, position, and learned\n",
    "    projections of coordinate metadata are added before the encoder. The first\n",
    "    ``<CLS>`` position is then used as a pooled route representation for scalar\n",
    "    difficulty regression.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        vocab_size: int,\n",
    "        max_len: int,\n",
    "        coord_features: torch.Tensor,\n",
    "        d_model: int = 128,\n",
    "        nhead: int = 4,\n",
    "        num_layers: int = 4,\n",
    "        dim_feedforward: int = 256,\n",
    "        dropout: float = 0.10,\n",
    "        pad_id: int = 0,\n",
    "    ):\n",
    "        \"\"\"Create the encoder, coordinate projection, and regression head.\"\"\"\n",
    "        super().__init__()\n",
    "        self.vocab_size = vocab_size\n",
    "        self.max_len = max_len\n",
    "        self.d_model = d_model\n",
    "        self.pad_id = pad_id\n",
    "\n",
    "        self.token_emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)\n",
    "        self.pos_emb = nn.Embedding(max_len, d_model)\n",
    "\n",
    "        self.register_buffer(\"coord_features\", coord_features.clone().float())\n",
    "        self.coord_proj = nn.Linear(coord_features.shape[1], d_model)\n",
    "\n",
    "        encoder_layer = nn.TransformerEncoderLayer(\n",
    "            d_model=d_model,\n",
    "            nhead=nhead,\n",
    "            dim_feedforward=dim_feedforward,\n",
    "            dropout=dropout,\n",
    "            activation=\"gelu\",\n",
    "            batch_first=True,\n",
    "            norm_first=True,\n",
    "        )\n",
    "        self.encoder = nn.TransformerEncoder(\n",
    "            encoder_layer,\n",
    "            num_layers=num_layers,\n",
    "            enable_nested_tensor=False,\n",
    "        )\n",
    "        self.norm = nn.LayerNorm(d_model)\n",
    "        self.head = nn.Sequential(\n",
    "            nn.Linear(d_model, d_model),\n",
    "            nn.GELU(),\n",
    "            nn.Dropout(dropout),\n",
    "            nn.Linear(d_model, 1),\n",
    "        )\n",
    "\n",
    "    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:\n",
    "        \"\"\"Return one continuous difficulty prediction per input sequence.\"\"\"\n",
    "        batch_size, seq_len = input_ids.shape\n",
    "        positions = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand(batch_size, seq_len)\n",
    "\n",
    "        # Coordinate features are indexed by token ID, so every occurrence of a\n",
    "        # hold token gets the same physical x/y hint wherever it appears.\n",
    "        x = self.token_emb(input_ids) + self.pos_emb(positions)\n",
    "        x = x + self.coord_proj(self.coord_features[input_ids])\n",
    "\n",
    "        key_padding_mask = ~attention_mask.bool()\n",
    "        h = self.encoder(x, src_key_padding_mask=key_padding_mask)\n",
    "        h = self.norm(h)\n",
    "\n",
    "        cls_state = h[:, 0, :]\n",
    "        return self.head(cls_state).squeeze(-1)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90fa8ae5",
   "metadata": {},
   "source": [
    "## Model Architecture\n",
    "\n",
    "### The JointRouteTransformerRegressor\n",
    "\n",
    "This model is a **transformer encoder** with a regression head. Here's what each component does:\n",
    "\n",
    "1. **Token embedding** (`nn.Embedding`): Converts integer token IDs to dense vectors of dimension `d_model`. This is the same as word embeddings in NLP.\n",
    "\n",
    "2. **Positional embedding** (`nn.Embedding`): Adds position information so the model knows which position each token occupies. Unlike sinusoidal positional encodings in the original Transformer paper, we use learned embeddings.\n",
    "\n",
    "3. **Coordinate projection** (`nn.Linear`): Projects the 3-dimensional coordinate features (x_norm, y_norm, is_hold) to `d_model` dimensions and adds them to the token embeddings. This injects physical position information.\n",
    "\n",
    "4. **Transformer encoder** (`nn.TransformerEncoder`): Multiple layers of self-attention and feedforward networks. Each layer:\n",
    "   - Computes self-attention: every hold \"looks at\" every other hold\n",
    "   - Applies feedforward transformation\n",
    "   - Uses residual connections and layer normalization\n",
    "\n",
    "5. **Regression head**: Takes the `<CLS>` token's output (which has aggregated information from the entire sequence) and predicts a single difficulty score.\n",
    "\n",
    "### Hyperparameters\n",
    "\n",
    "- `d_model=128`: The dimensionality of embeddings and hidden states\n",
    "- `nhead=4`: Number of attention heads (multi-head attention)\n",
    "- `num_layers=4`: Number of transformer layers\n",
    "- `dim_feedforward=256`: Dimension of the feedforward network inside each layer\n",
    "- `dropout=0.10`: Dropout probability for regularization\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62c2db48",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:48:53.644453Z",
     "iopub.status.busy": "2026-06-07T15:48:53.643654Z",
     "iopub.status.idle": "2026-06-07T15:48:59.327913Z",
     "shell.execute_reply": "2026-06-07T15:48:59.326972Z"
    }
   },
   "outputs": [],
   "source": [
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "\n",
    "model = JointRouteTransformerRegressor(\n",
    "    vocab_size=len(stoi),\n",
    "    max_len=max_len,\n",
    "    coord_features=coord_features,\n",
    "    d_model=128,\n",
    "    nhead=4,\n",
    "    num_layers=4,\n",
    "    dim_feedforward=256,\n",
    "    dropout=0.10,\n",
    "    pad_id=pad_id,\n",
    ").to(device)\n",
    "\n",
    "optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)\n",
    "\n",
    "print(f\"Device: {device}\")\n",
    "print(f\"Parameters: {sum(p.numel() for p in model.parameters()):,}\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ded8d846",
   "metadata": {},
   "source": [
    "## Training Configuration\n",
    "\n",
    "### Loss function: MSE (Mean Squared Error)\n",
    "\n",
    "We use MSE loss because we're predicting a continuous value (difficulty score). This penalizes large errors more than small ones, which is appropriate for grade prediction.\n",
    "\n",
    "### Optimizer: AdamW\n",
    "\n",
    "AdamW is the standard optimizer for transformer models. It combines:\n",
    "- **Adam**: Adaptive learning rates per parameter\n",
    "- **Weight decay**: L2 regularization to prevent overfitting\n",
    "\n",
    "### Early stopping\n",
    "\n",
    "We stop training if validation loss doesn't improve for `patience` epochs. This prevents overfitting and saves compute.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "665deadb",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:48:59.331996Z",
     "iopub.status.busy": "2026-06-07T15:48:59.331485Z",
     "iopub.status.idle": "2026-06-07T15:48:59.340181Z",
     "shell.execute_reply": "2026-06-07T15:48:59.339495Z"
    }
   },
   "outputs": [],
   "source": [
    "def run_epoch(model, loader, device, optimizer=None):\n",
    "    \"\"\"Run one epoch of training or evaluation.\n",
    "    \n",
    "    The RouteGradeDataset returns a dictionary with keys:\n",
    "    - input_ids: token IDs, shape (batch_size, seq_len)\n",
    "    - attention_mask: binary mask, shape (batch_size, seq_len)\n",
    "    - target: difficulty score, shape (batch_size,)\n",
    "    - uuid: route identifiers (for logging)\n",
    "    - board_key: board identifiers (for logging)\n",
    "    \"\"\"\n",
    "    is_train = optimizer is not None\n",
    "    model.train(is_train)\n",
    "    criterion = nn.MSELoss()\n",
    "\n",
    "    losses, preds, targets, uuids, boards = [], [], [], [], []\n",
    "\n",
    "    for batch in loader:\n",
    "        input_ids = batch[\"input_ids\"].to(device)\n",
    "        attention_mask = batch[\"attention_mask\"].to(device)\n",
    "        target = batch[\"target\"].to(device)\n",
    "\n",
    "        if is_train:\n",
    "            optimizer.zero_grad(set_to_none=True)\n",
    "\n",
    "        pred = model(input_ids, attention_mask)\n",
    "        loss = criterion(pred, target)\n",
    "\n",
    "        if is_train:\n",
    "            loss.backward()\n",
    "            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
    "            optimizer.step()\n",
    "\n",
    "        losses.append(loss.item() * input_ids.size(0))\n",
    "        preds.extend(pred.detach().cpu().numpy().tolist())\n",
    "        targets.extend(target.detach().cpu().numpy().tolist())\n",
    "        uuids.extend(batch[\"uuid\"])\n",
    "        boards.extend(batch[\"board_key\"])\n",
    "\n",
    "    avg_loss = sum(losses) / max(1, len(loader.dataset))\n",
    "    return avg_loss, np.asarray(preds), np.asarray(targets), uuids, boards\n",
    "\n",
    "\n",
    "# Training configuration\n",
    "num_epochs = 30\n",
    "patience = 12\n",
    "\n",
    "print(f\"Max epochs: {num_epochs}\")\n",
    "print(f\"Early stopping patience: {patience}\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e5bb77f",
   "metadata": {},
   "source": [
    "### Grade metrics helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aeeb2294",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:48:59.343447Z",
     "iopub.status.busy": "2026-06-07T15:48:59.342978Z",
     "iopub.status.idle": "2026-06-07T15:48:59.353066Z",
     "shell.execute_reply": "2026-06-07T15:48:59.352152Z"
    }
   },
   "outputs": [],
   "source": [
    "# Map BoardLib display difficulties into grouped V-grade tokens.\n",
    "GRADE_TO_V = {\n",
    "    10: 0, 11: 0, 12: 0,\n",
    "    13: 1, 14: 1,\n",
    "    15: 2,\n",
    "    16: 3, 17: 3,\n",
    "    18: 4, 19: 4,\n",
    "    20: 5, 21: 5,\n",
    "    22: 6,\n",
    "    23: 7,\n",
    "    24: 8, 25: 8,\n",
    "    26: 9,\n",
    "    27: 10,\n",
    "    28: 11,\n",
    "    29: 12,\n",
    "    30: 13,\n",
    "    31: 14,\n",
    "    32: 15,\n",
    "    33: 16,\n",
    "}\n",
    "\n",
    "def to_grouped_v(display_difficulty: float) -> int:\n",
    "    \"\"\"Map a continuous display difficulty to the nearest grouped V grade.\"\"\"\n",
    "    rounded = int(round(float(display_difficulty)))\n",
    "    rounded = max(min(rounded, max(GRADE_TO_V)), min(GRADE_TO_V))\n",
    "    return GRADE_TO_V[rounded]\n",
    "\n",
    "def grade_token(display_difficulty: float) -> str:\n",
    "    \"\"\"Return the grade-conditioning token for a display difficulty value.\"\"\"\n",
    "    return f\"<GRADE_V{to_grouped_v(display_difficulty)}>\"\n",
    "\n",
    "# Evaluate difficulty regression and grouped V-grade accuracy.\n",
    "def regression_metrics(y_true, y_pred) -> dict[str, float]:\n",
    "    \"\"\"Compute difficulty-scale and grouped-V-grade prediction metrics.\"\"\"\n",
    "    y_true = np.asarray(y_true)\n",
    "    y_pred = np.asarray(y_pred)\n",
    "    true_v = np.asarray([to_grouped_v(x) for x in y_true])\n",
    "    pred_v = np.asarray([to_grouped_v(x) for x in y_pred])\n",
    "\n",
    "    return {\n",
    "        \"mae\": float(mean_absolute_error(y_true, y_pred)),\n",
    "        \"rmse\": float(math.sqrt(mean_squared_error(y_true, y_pred))),\n",
    "        \"r2\": float(r2_score(y_true, y_pred)),\n",
    "        \"within_1_difficulty\": float(np.mean(np.abs(y_true - y_pred) <= 1) * 100),\n",
    "        \"within_2_difficulty\": float(np.mean(np.abs(y_true - y_pred) <= 2) * 100),\n",
    "        \"exact_grouped_v\": float(np.mean(true_v == pred_v) * 100),\n",
    "        \"within_1_vgrade\": float(np.mean(np.abs(true_v - pred_v) <= 1) * 100),\n",
    "        \"within_2_vgrades\": float(np.mean(np.abs(true_v - pred_v) <= 2) * 100),\n",
    "    }\n",
    "\n",
    "def metrics_by_board(pred_df: pd.DataFrame) -> pd.DataFrame:\n",
    "    \"\"\"Compute regression metrics separately for each board in a prediction table.\"\"\"\n",
    "    rows = []\n",
    "    for board_key, frame in pred_df.groupby(\"board_key\"):\n",
    "        metrics = regression_metrics(frame[\"y_true\"].values, frame[\"y_pred\"].values)\n",
    "        rows.append({\"board_key\": board_key, **metrics})\n",
    "    return pd.DataFrame(rows)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35d4bd8b",
   "metadata": {},
   "source": [
    "## Training Loop\n",
    "\n",
    "The training loop follows the standard deep learning workflow:\n",
    "\n",
    "1. **Forward pass**: Feed input through the model to get predictions\n",
    "2. **Compute loss**: Compare predictions to actual grades using MSE\n",
    "3. **Backward pass**: Compute gradients via backpropagation\n",
    "4. **Update weights**: Adjust model parameters using the optimizer\n",
    "5. **Validate**: Check performance on held-out validation data\n",
    "6. **Early stopping**: Stop if validation loss stops improving\n",
    "\n",
    "We track both fine-grained metrics (MAE, RMSE) and practical metrics (V-grade accuracy within ±1 grade).\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "476b158d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T15:48:59.356313Z",
     "iopub.status.busy": "2026-06-07T15:48:59.355799Z",
     "iopub.status.idle": "2026-06-07T19:11:46.644946Z",
     "shell.execute_reply": "2026-06-07T19:11:46.644060Z"
    }
   },
   "outputs": [],
   "source": [
    "history = []\n",
    "best_val_mae = float(\"inf\")\n",
    "best_state = None\n",
    "best_epoch = 0\n",
    "epochs_without_improvement = 0\n",
    "\n",
    "print(\"Starting training...\\n\")\n",
    "\n",
    "for epoch in range(1, num_epochs + 1):\n",
    "    train_loss, train_pred, train_true, _, _ = run_epoch(model, train_loader, device, optimizer)\n",
    "    val_loss, val_pred, val_true, _, _ = run_epoch(model, val_loader, device, optimizer=None)\n",
    "\n",
    "    train_metrics = regression_metrics(train_true, train_pred)\n",
    "    val_metrics = regression_metrics(val_true, val_pred)\n",
    "\n",
    "    history.append({\n",
    "        \"epoch\": epoch,\n",
    "        \"train_loss\": train_loss,\n",
    "        \"val_loss\": val_loss,\n",
    "        \"train_mae\": train_metrics[\"mae\"],\n",
    "        \"val_mae\": val_metrics[\"mae\"],\n",
    "        \"train_r2\": train_metrics[\"r2\"],\n",
    "        \"val_r2\": val_metrics[\"r2\"],\n",
    "        \"val_within_1_vgrade\": val_metrics[\"within_1_vgrade\"],\n",
    "    })\n",
    "\n",
    "    if val_metrics[\"mae\"] < best_val_mae:\n",
    "        best_val_mae = val_metrics[\"mae\"]\n",
    "        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}\n",
    "        best_epoch = epoch\n",
    "        epochs_without_improvement = 0\n",
    "    else:\n",
    "        epochs_without_improvement += 1\n",
    "\n",
    "    if epoch == 1 or epoch % 5 == 0 or epoch == best_epoch:\n",
    "        print(\n",
    "            f\"Epoch {epoch:03d} | \"\n",
    "            f\"train MAE {train_metrics['mae']:.3f} | \"\n",
    "            f\"val MAE {val_metrics['mae']:.3f} | \"\n",
    "            f\"val R² {val_metrics['r2']:.3f} | \"\n",
    "            f\"val ±1V {val_metrics['within_1_vgrade']:.1f}%\"\n",
    "        )\n",
    "\n",
    "    if epochs_without_improvement >= patience:\n",
    "        print(f\"Early stopping at epoch {epoch}; best epoch was {best_epoch}.\")\n",
    "        break\n",
    "\n",
    "if best_state is not None:\n",
    "    model.load_state_dict(best_state)\n",
    "\n",
    "print(f\"\\nTraining complete. Best epoch: {best_epoch}, Best val MAE: {best_val_mae:.4f}\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "589ad448",
   "metadata": {},
   "source": [
    "## Test Set Evaluation\n",
    "\n",
    "After training, we load the best model (based on validation MAE) and evaluate on the held-out test set. We report:\n",
    "\n",
    "- **MAE** (Mean Absolute Error): Average error in difficulty score points\n",
    "- **RMSE** (Root Mean Squared Error): Penalizes large errors more\n",
    "- **R²** (R-squared): How much variance in grades the model explains\n",
    "- **Within ±1 difficulty**: Percentage of predictions within 1 point\n",
    "- **Within ±1 V-grade**: Percentage of predictions within 1 V-grade\n",
    "\n",
    "We also break down performance by board (TB2 vs Kilter) to check for bias.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9abc3a72",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T19:11:46.648067Z",
     "iopub.status.busy": "2026-06-07T19:11:46.647798Z",
     "iopub.status.idle": "2026-06-07T19:12:05.427217Z",
     "shell.execute_reply": "2026-06-07T19:12:05.426288Z"
    }
   },
   "outputs": [],
   "source": [
    "test_loss, test_pred, test_true, test_uuid, test_board = run_epoch(model, test_loader, device, optimizer=None)\n",
    "overall_metrics = regression_metrics(test_true, test_pred)\n",
    "\n",
    "pred_df = pd.DataFrame({\n",
    "    \"uuid\": test_uuid,\n",
    "    \"board_key\": test_board,\n",
    "    \"y_true\": test_true,\n",
    "    \"y_pred\": test_pred,\n",
    "})\n",
    "board_metrics_df = metrics_by_board(pred_df)\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"Overall joint test performance\")\n",
    "print(\"=\" * 50)\n",
    "for key, value in overall_metrics.items():\n",
    "    suffix = \"%\" if \"within\" in key or \"exact\" in key else \"\"\n",
    "    print(f\"{key:24s}: {value:8.4f}{suffix}\")\n",
    "\n",
    "print(\"\\nBoard-specific test performance:\")\n",
    "print(board_metrics_df.to_string(index=False))\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01c90e93",
   "metadata": {},
   "source": [
    "### JSON output helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3027d982",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T19:12:05.430611Z",
     "iopub.status.busy": "2026-06-07T19:12:05.430084Z",
     "iopub.status.idle": "2026-06-07T19:12:05.436838Z",
     "shell.execute_reply": "2026-06-07T19:12:05.436135Z"
    }
   },
   "outputs": [],
   "source": [
    "# Write JSON artifacts after converting NumPy/pandas values to plain Python values.\n",
    "def json_safe(obj: Any) -> Any:\n",
    "    \"\"\"Convert NumPy/pandas values into JSON-serializable Python objects.\"\"\"\n",
    "    if isinstance(obj, dict):\n",
    "        return {str(k): json_safe(v) for k, v in obj.items()}\n",
    "    if isinstance(obj, (list, tuple)):\n",
    "        return [json_safe(v) for v in obj]\n",
    "    if isinstance(obj, np.integer):\n",
    "        return int(obj)\n",
    "    if isinstance(obj, np.floating):\n",
    "        if np.isnan(obj):\n",
    "            return None\n",
    "        return float(obj)\n",
    "    if isinstance(obj, np.ndarray):\n",
    "        return json_safe(obj.tolist())\n",
    "    try:\n",
    "        if pd.isna(obj):\n",
    "            return None\n",
    "    except Exception:\n",
    "        pass\n",
    "    return obj\n",
    "\n",
    "def write_json(path: str | Path, payload: Any) -> None:\n",
    "    \"\"\"Write an object as indented UTF-8 JSON after ``json_safe`` cleanup.\"\"\"\n",
    "    path = Path(path)\n",
    "    path.parent.mkdir(parents=True, exist_ok=True)\n",
    "    path.write_text(json.dumps(json_safe(payload), indent=2), encoding=\"utf-8\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "556be142",
   "metadata": {},
   "source": [
    "## Save Model and Artifacts\n",
    "\n",
    "We save the trained model checkpoint and evaluation metrics for use in notebook 04 (route evaluation) and for future inference.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "save_model",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-07T19:12:05.439746Z",
     "iopub.status.busy": "2026-06-07T19:12:05.439205Z",
     "iopub.status.idle": "2026-06-07T19:12:05.604325Z",
     "shell.execute_reply": "2026-06-07T19:12:05.603607Z"
    }
   },
   "outputs": [],
   "source": [
    "# Save model checkpoint\n",
    "MODEL_DIR = ROOT / \"models\"\n",
    "MODEL_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "OUT_DIR = ROOT / \"data\" / \"processed\" / \"grade_prediction\"\n",
    "OUT_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "# Save the full model checkpoint (needed by notebook 04)\n",
    "checkpoint = {\n",
    "    \"model_state_dict\": model.state_dict(),\n",
    "    \"config\": {\n",
    "        \"vocab_size\": len(stoi),\n",
    "        \"max_len\": max_len,\n",
    "        \"d_model\": 128,\n",
    "        \"nhead\": 4,\n",
    "        \"num_layers\": 4,\n",
    "        \"dim_feedforward\": 256,\n",
    "        \"dropout\": 0.10,\n",
    "        \"pad_id\": pad_id,\n",
    "    },\n",
    "    \"stoi\": stoi,\n",
    "    \"itos\": {str(k): v for k, v in itos.items()},\n",
    "    \"coord_features\": coord_features.cpu(),\n",
    "    \"overall_metrics\": overall_metrics,\n",
    "}\n",
    "model_path = MODEL_DIR / \"joint_transformer_grade_predictor.pth\"\n",
    "torch.save(checkpoint, model_path)\n",
    "\n",
    "# Save training history and metrics\n",
    "pd.DataFrame(history).to_csv(OUT_DIR / \"training_history.csv\", index=False)\n",
    "pred_df.to_csv(OUT_DIR / \"test_predictions.csv\", index=False)\n",
    "board_metrics_df.to_csv(OUT_DIR / \"board_metrics.csv\", index=False)\n",
    "\n",
    "# write_json is defined in the JSON output helper cell above.\n",
    "write_json(OUT_DIR / \"overall_metrics.json\", overall_metrics)\n",
    "\n",
    "print(f\"Saved model checkpoint to: {model_path}\")\n",
    "print(f\"Saved training history to: {OUT_DIR / 'training_history.csv'}\")\n",
    "print(f\"Saved test predictions to: {OUT_DIR / 'test_predictions.csv'}\")\n",
    "print(f\"Saved board metrics to: {OUT_DIR / 'board_metrics.csv'}\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "key_takeaways",
   "metadata": {},
   "source": [
    "## Key Takeaways\n",
    "\n",
    "1. **The transformer can learn from raw token sequences** without hand-engineered features like \"mean hand reach\" or \"height gained\". The self-attention mechanism lets it discover these patterns.\n",
    "\n",
    "2. **Coordinate features help**: Injecting physical (x, y) position information gives the model a strong prior about spatial relationships, similar to how positional embeddings help language models.\n",
    "\n",
    "3. **Joint training across boards**: By training on both TB2 and Kilter data simultaneously, the model can share statistical strength. The board token (`<BOARD_TB2>` vs `<BOARD_KILTER>`) tells it which \"language\" it's operating in.\n",
    "\n",
    "4. **The gap between fine-grained and grouped metrics**: Being off by 1 difficulty point often stays within the same V-grade bucket. This is why the ±1 V-grade accuracy is much higher than the ±1 difficulty accuracy.\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}