ClimbingBoardGPT/notebooks/02_joint_transformer_grade_prediction.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "92b83b1d",
   "metadata": {},
   "source": [
    "# 02 — Joint Transformer Grade Prediction\n",
    "\n",
    "## From Language Modeling to Grade Prediction\n",
    "\n",
    "In NLP, **BERT-style models** are encoder-only transformers: they take a sequence of tokens, process it through multiple self-attention layers, and, for sequence-level tasks, pool the result into a single output (such as a classification label). The key ingredients are:\n",
    "\n",
    "- **Input**: A sequence of tokens (words, subwords, or in our case, holds)\n",
    "- **Processing**: Multiple layers of self-attention that let each token \"look at\" every other token\n",
    "- **Output**: A pooled representation (typically from a `[CLS]` token) that summarizes the entire sequence\n",
    "\n",
    "### Our architecture\n",
    "\n",
    "We use a **Transformer Encoder** (similar to BERT) with these components:\n",
    "\n",
    "1. **Token embeddings**: Convert integer token IDs to dense vectors\n",
    "2. **Positional embeddings**: Tell the model where each token is in the sequence\n",
    "3. **Coordinate features**: Inject physical (x, y) position of each hold into the embedding\n",
    "4. **Transformer encoder layers**: Multiple layers of self-attention + feedforward\n",
    "5. **Regression head**: Take the `<CLS>` token's output and predict a single difficulty score\n",
    "\n",
    "### Why this works for climbing\n",
    "\n",
    "A climb's difficulty depends on the *relationships between holds*, not just individual holds. Self-attention naturally captures these relationships:\n",
    "\n",
    "- A start hold far from the first middle hold suggests a big opening move\n",
    "- Two hand holds close together with a foot hold far away suggests a dyno\n",
    "- The overall spatial distribution determines the \"flow\" of the climb\n",
    "\n",
    "The transformer can learn these spatial relationships through attention, without us having to manually engineer features like \"mean hand reach\" or \"height gained\" (though those features were useful in the classical model).\n",
    "\n",
    "### Input format\n",
    "\n",
    "```text\n",
    "<CLS> <BOARD_TB2> <ANGLE_40> <TB2_p344_start> <TB2_p369_middle> ... <TB2_p603_finish>\n",
    "```\n",
    "\n",
    "Note: We use `<CLS>` instead of `<BOS>` and we **exclude the grade token** — the model must predict the grade, not see it!\n",
    "\n",
    "### Target\n",
    "\n",
    "```text\n",
    "display_difficulty (continuous value, e.g., 20.5)\n",
    "```"
   ]
  },
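  {
   "cell_type": "markdown",
   "id": "sketch_self_attention",
   "metadata": {},
   "source": [
    "To make the *look at* idea concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It assumes identity query/key/value projections and hypothetical 2-dimensional embeddings; the real model uses learned projections, several heads, and `d_model`-sized vectors.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def softmax(scores):\n",
    "    # Subtract the max for numerical stability before exponentiating\n",
    "    peak = max(scores)\n",
    "    exps = [math.exp(s - peak) for s in scores]\n",
    "    total = sum(exps)\n",
    "    return [e / total for e in exps]\n",
    "\n",
    "def self_attention(vectors):\n",
    "    # Identity projections (an assumption): queries, keys, and values are the inputs themselves\n",
    "    dim = len(vectors[0])\n",
    "    outputs, weights = [], []\n",
    "    for query in vectors:\n",
    "        # Scaled dot product between this token and every token in the sequence\n",
    "        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]\n",
    "        row = softmax(scores)\n",
    "        weights.append(row)\n",
    "        # Output is the attention-weighted average of all value vectors\n",
    "        outputs.append([sum(w * v[i] for w, v in zip(row, vectors)) for i in range(dim)])\n",
    "    return outputs, weights\n",
    "\n",
    "# Hypothetical embeddings for three tokens\n",
    "tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]\n",
    "mixed, attn = self_attention(tokens)\n",
    "```\n",
    "\n",
    "Each row of `attn` sums to 1, and the first token puts more weight on the second token (a similar vector) than on the third. Stacking layers lets the model combine such pairwise signals into route-level patterns."
   ]
  },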
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3dfd6081",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import sys\n",
    "import json\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "from torch.utils.data import DataLoader\n",
    "\n",
    "ROOT = Path.cwd().resolve()\n",
    "if ROOT.name == \"notebooks\":\n",
    "    ROOT = ROOT.parent\n",
    "sys.path.insert(0, str(ROOT / \"src\"))\n",
    "\n",
    "from climbingboardgpt.datasets import RouteGradeDataset\n",
    "from climbingboardgpt.metrics import regression_metrics, metrics_by_board\n",
    "from climbingboardgpt.models import JointRouteTransformerRegressor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a9e2443",
   "metadata": {},
   "outputs": [],
   "source": [
    "TOKENIZED = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
    "df_routes = pd.read_csv(TOKENIZED / \"route_sequences.csv\")\n",
    "vocab = json.loads((TOKENIZED / \"token_vocab.json\").read_text(encoding=\"utf-8\"))\n",
    "\n",
    "stoi = {str(k): int(v) for k, v in vocab[\"stoi\"].items()}\n",
    "itos = {int(k): str(v) for k, v in vocab[\"itos\"].items()}\n",
    "df_token_meta = pd.read_csv(TOKENIZED / \"token_metadata.csv\")\n",
    "\n",
    "pad_id = stoi[\"<PAD>\"]\n",
    "unk_id = stoi[\"<UNK>\"]\n",
    "\n",
    "print(f\"Vocabulary size: {len(stoi):,}\")\n",
    "print(f\"Total routes: {len(df_routes):,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4abfd9b",
   "metadata": {},
   "source": [
    "## Build model IDs and coordinate features\n",
    "\n",
    "### Coordinate features: Why inject physical position?\n",
    "\n",
    "In standard NLP, positional embeddings tell the model *which position in the sequence* a token occupies. But for climbing, the **physical position on the wall** matters more than the sequence position.\n",
    "\n",
    "We create a 3-dimensional feature vector for each token:\n",
    "1. `x_norm`: Normalized horizontal position on the board (-1 to 1)\n",
    "2. `y_norm`: Normalized vertical position on the board (-1 to 1)\n",
    "3. `is_hold`: 1 if this token represents a hold, 0 otherwise\n",
    "\n",
    "These features are projected through a linear layer and added to the token embeddings. This is similar to how some vision-language models inject spatial features from images alongside text tokens."
   ]
  },
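  {
   "cell_type": "markdown",
   "id": "sketch_coord_projection",
   "metadata": {},
   "source": [
    "The projection-and-add step can be sketched in plain Python. This is an illustration under stated assumptions, not the code inside `JointRouteTransformerRegressor`: the helper names, the tiny `d_model = 2`, and the weight values are all hypothetical, and the real weights are learned.\n",
    "\n",
    "```python\n",
    "def project_coords(features, weight, bias):\n",
    "    # Linear layer: out[j] = sum_i features[i] * weight[j][i] + bias[j]\n",
    "    return [sum(f * w for f, w in zip(features, row)) + b for row, b in zip(weight, bias)]\n",
    "\n",
    "def embed_token(token_vec, coord_feats, weight, bias):\n",
    "    # Add the projected coordinate features to the token embedding element-wise\n",
    "    projected = project_coords(coord_feats, weight, bias)\n",
    "    return [t + p for t, p in zip(token_vec, projected)]\n",
    "\n",
    "# Hypothetical values; weight has shape (d_model, 3) for [x_norm, y_norm, is_hold]\n",
    "weight = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]\n",
    "bias = [0.0, 0.0]\n",
    "token_vec = [0.25, -0.5]\n",
    "\n",
    "hold = embed_token(token_vec, [0.5, -0.5, 1.0], weight, bias)     # a hold with board coordinates\n",
    "special = embed_token(token_vec, [0.0, 0.0, 0.0], weight, bias)   # a non-hold token: all-zero features\n",
    "```\n",
    "\n",
    "With a zero bias, a non-hold token keeps its plain embedding, while a hold's embedding is shifted by an amount that depends on where it sits on the board."
   ]
  },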
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95bb745f",
   "metadata": {},
   "outputs": [],
   "source": [
    "def encode(tokens):\n",
    "    \"\"\"Convert a list of token strings to integer IDs.\"\"\"\n",
    "    return [stoi.get(token, unk_id) for token in tokens]\n",
    "\n",
    "# Prepare input sequences for the grade predictor\n",
    "# We use the \"no grade\" version because the model should predict the grade,\n",
    "# not see it in the input!\n",
    "# We also prepend <CLS> which will be used for pooling the sequence representation\n",
    "df_routes[\"tokens_no_grade\"] = df_routes[\"sequence_no_grade\"].fillna(\"\").str.split()\n",
    "df_routes[\"model_tokens\"] = df_routes[\"tokens_no_grade\"].apply(\n",
    "    lambda tokens: [\"<CLS>\"] + tokens[1:] if tokens else [\"<CLS>\"]\n",
    ")\n",
    "df_routes[\"model_ids\"] = df_routes[\"model_tokens\"].apply(encode)\n",
    "df_routes[\"seq_len\"] = df_routes[\"model_ids\"].apply(len)\n",
    "max_len = int(df_routes[\"seq_len\"].max())\n",
    "\n",
    "# Build coordinate features matrix: (vocab_size, 3)\n",
    "# Each row corresponds to a token ID and contains [x_norm, y_norm, is_hold]\n",
    "# This will be used as additional input to the model alongside token embeddings\n",
    "coord_features = np.zeros((len(stoi), 3), dtype=np.float32)\n",
    "for _, row in df_token_meta.iterrows():\n",
    "    token_id = int(row[\"token_id\"])\n",
    "    coord_features[token_id, 0] = 0.0 if pd.isna(row.get(\"x_norm\", 0.0)) else float(row.get(\"x_norm\", 0.0))\n",
    "    coord_features[token_id, 1] = 0.0 if pd.isna(row.get(\"y_norm\", 0.0)) else float(row.get(\"y_norm\", 0.0))\n",
    "    coord_features[token_id, 2] = 0.0 if pd.isna(row.get(\"is_hold\", 0.0)) else float(row.get(\"is_hold\", 0.0))\n",
    "coord_features = torch.tensor(coord_features, dtype=torch.float32)\n",
    "\n",
    "print(f\"Max sequence length: {max_len}\")\n",
    "print(f\"Coordinate features shape: {coord_features.shape}\")\n",
    "print(f\"Vocabulary size: {len(stoi)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5da4ca8",
   "metadata": {},
   "source": [
    "## Data loaders\n",
    "\n",
    "### Batching and padding\n",
    "\n",
    "Transformers process data in **batches** for efficiency. But routes have different lengths (different numbers of holds). We handle this by:\n",
    "\n",
    "1. **Padding**: Shorter sequences are padded with `<PAD>` tokens up to a fixed length, `max_len` (the longest sequence in the dataset)\n",
    "2. **Attention masking**: The model receives a binary mask that tells it which positions are real data vs padding\n",
    "\n",
    "Padding plus an attention mask is the standard way to batch variable-length text for models such as BERT and GPT.\n",
    "\n",
    "### The RouteGradeDataset class\n",
    "\n",
    "For each route, this dataset produces:\n",
    "- `input_ids`: Integer token IDs, padded to `max_len`\n",
    "- `attention_mask`: 1 for real tokens, 0 for padding\n",
    "- `target`: The difficulty score we want to predict\n",
    "- `uuid`, `board_key`: Metadata for evaluation"
   ]
  },
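  {
   "cell_type": "markdown",
   "id": "sketch_padding_mask",
   "metadata": {},
   "source": [
    "As a rough illustration of what one item looks like after padding, here is a minimal sketch. The helper `pad_and_mask` is hypothetical, and right-padding is an assumption; the actual logic lives in `RouteGradeDataset`, which is not shown in this notebook.\n",
    "\n",
    "```python\n",
    "def pad_and_mask(ids, max_len, pad_id):\n",
    "    # Truncate to max_len, then right-pad with pad_id\n",
    "    ids = list(ids)[:max_len]\n",
    "    n_pad = max_len - len(ids)\n",
    "    input_ids = ids + [pad_id] * n_pad\n",
    "    # 1 marks a real token, 0 marks padding the model should ignore\n",
    "    attention_mask = [1] * len(ids) + [0] * n_pad\n",
    "    return input_ids, attention_mask\n",
    "\n",
    "# Hypothetical token IDs for a three-token route\n",
    "input_ids, attention_mask = pad_and_mask([5, 17, 42], max_len=6, pad_id=0)\n",
    "```\n",
    "\n",
    "Because every item has the same length, the default `DataLoader` collation can stack them into a `(batch_size, max_len)` tensor without a custom `collate_fn`."
   ]
  },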
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c9e5543",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df = df_routes[df_routes[\"split\"] == \"train\"].reset_index(drop=True)\n",
    "val_df = df_routes[df_routes[\"split\"] == \"val\"].reset_index(drop=True)\n",
    "test_df = df_routes[df_routes[\"split\"] == \"test\"].reset_index(drop=True)\n",
    "\n",
    "train_ds = RouteGradeDataset(train_df, max_len=max_len, pad_id=pad_id)\n",
    "val_ds = RouteGradeDataset(val_df, max_len=max_len, pad_id=pad_id)\n",
    "test_ds = RouteGradeDataset(test_df, max_len=max_len, pad_id=pad_id)\n",
    "\n",
    "train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)\n",
    "val_loader = DataLoader(val_ds, batch_size=128, shuffle=False)\n",
    "test_loader = DataLoader(test_ds, batch_size=128, shuffle=False)\n",
    "\n",
    "print(f\"Training samples: {len(train_ds):,}\")\n",
    "print(f\"Validation samples: {len(val_ds):,}\")\n",
    "print(f\"Test samples: {len(test_ds):,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90fa8ae5",
   "metadata": {},
   "source": [
    "## Model Architecture\n",
    "\n",
    "### The JointRouteTransformerRegressor\n",
    "\n",
    "This model is a **transformer encoder** with a regression head. Here's what each component does:\n",
    "\n",
    "1. **Token embedding** (`nn.Embedding`): Converts integer token IDs to dense vectors of dimension `d_model`. This is the same as word embeddings in NLP.\n",
    "\n",
    "2. **Positional embedding** (`nn.Embedding`): Adds position information so the model knows which position each token occupies. Unlike sinusoidal positional encodings in the original Transformer paper, we use learned embeddings.\n",
    "\n",
    "3. **Coordinate projection** (`nn.Linear`): Projects the 3-dimensional coordinate features (x_norm, y_norm, is_hold) to `d_model` dimensions and adds them to the token embeddings. This injects physical position information.\n",
    "\n",
    "4. **Transformer encoder** (`nn.TransformerEncoder`): Multiple layers of self-attention and feedforward networks. Each layer:\n",
    "   - Computes self-attention: every hold \"looks at\" every other hold\n",
    "   - Applies feedforward transformation\n",
    "   - Uses residual connections and layer normalization\n",
    "\n",
    "5. **Regression head**: Takes the `<CLS>` token's output (which has aggregated information from the entire sequence) and predicts a single difficulty score.\n",
    "\n",
    "### Hyperparameters\n",
    "\n",
    "- `d_model=128`: The dimensionality of embeddings and hidden states\n",
    "- `nhead=4`: Number of attention heads (multi-head attention)\n",
    "- `num_layers=4`: Number of transformer layers\n",
    "- `dim_feedforward=256`: Dimension of the feedforward network inside each layer\n",
    "- `dropout=0.10`: Dropout probability for regularization"
   ]
  },
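  {
   "cell_type": "markdown",
   "id": "sketch_param_count",
   "metadata": {},
   "source": [
    "To get a feel for what these hyperparameters cost, the sketch below estimates the parameter count of the encoder stack. It assumes each layer has the same layout as PyTorch's `nn.TransformerEncoderLayer` with biases enabled; the helper name is hypothetical. Embeddings, the coordinate projection, and the regression head are not included, so the total printed by the next cell will be larger.\n",
    "\n",
    "```python\n",
    "def encoder_layer_params(d_model, dim_feedforward):\n",
    "    # Self-attention: fused Q/K/V projection plus output projection, each with a bias\n",
    "    attention = 3 * d_model * d_model + 3 * d_model + d_model * d_model + d_model\n",
    "    # Feedforward: d_model -> dim_feedforward -> d_model, each linear layer with a bias\n",
    "    feedforward = d_model * dim_feedforward + dim_feedforward + dim_feedforward * d_model + d_model\n",
    "    # Two layer norms, each with a scale vector and a shift vector\n",
    "    layer_norms = 2 * 2 * d_model\n",
    "    return attention + feedforward + layer_norms\n",
    "\n",
    "per_layer = encoder_layer_params(d_model=128, dim_feedforward=256)\n",
    "encoder_total = 4 * per_layer   # num_layers=4\n",
    "head_dim = 128 // 4             # d_model must be divisible by nhead\n",
    "```\n",
    "\n",
    "Under these assumptions each layer has 132,480 parameters and the four-layer stack has 529,920, with each of the four heads attending over 32 dimensions."
   ]
  },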
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62c2db48",
   "metadata": {},
   "outputs": [],
   "source": [
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "\n",
    "model = JointRouteTransformerRegressor(\n",
    "    vocab_size=len(stoi),\n",
    "    max_len=max_len,\n",
    "    coord_features=coord_features,\n",
    "    d_model=128,\n",
    "    nhead=4,\n",
    "    num_layers=4,\n",
    "    dim_feedforward=256,\n",
    "    dropout=0.10,\n",
    "    pad_id=pad_id,\n",
    ").to(device)\n",
    "\n",
    "optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)\n",
    "\n",
    "print(f\"Device: {device}\")\n",
    "print(f\"Parameters: {sum(p.numel() for p in model.parameters()):,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ded8d846",
   "metadata": {},
   "source": [
    "## Training Configuration\n",
    "\n",
    "### Loss function: MSE (Mean Squared Error)\n",
    "\n",
    "We use MSE loss because we're predicting a continuous value (difficulty score). This penalizes large errors more than small ones, which is appropriate for grade prediction.\n",
    "\n",
    "### Optimizer: AdamW\n",
    "\n",
    "AdamW is a common default optimizer for transformer models. It combines:\n",
    "- **Adam**: Adaptive learning rates per parameter\n",
    "- **Decoupled weight decay**: Each weight is shrunk toward zero directly at every step, separately from the gradient update. This differs from adding an L2 penalty to the loss, which Adam would rescale through its adaptive learning rates\n",
    "\n",
    "### Early stopping\n",
    "\n",
    "We stop training if validation MAE doesn't improve for `patience` consecutive epochs, then restore the weights from the best epoch. This limits overfitting and saves compute."
   ]
  },
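  {
   "cell_type": "markdown",
   "id": "sketch_adamw_step",
   "metadata": {},
   "source": [
    "For intuition about how AdamW treats weight decay, here is a sketch of a single update on one scalar parameter. It follows the update rule documented for `torch.optim.AdamW` (decay first, then the Adam step), but the function itself is a hypothetical illustration, not the library's code. The defaults for `lr` and `weight_decay` match the values used in this notebook.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def adamw_step(param, grad, m, v, step, lr=3e-4, weight_decay=1e-2,\n",
    "               beta1=0.9, beta2=0.999, eps=1e-8):\n",
    "    # Decoupled weight decay: shrink the parameter directly, independent of the gradient\n",
    "    param = param * (1.0 - lr * weight_decay)\n",
    "    # Adam moment estimates with bias correction\n",
    "    m = beta1 * m + (1.0 - beta1) * grad\n",
    "    v = beta2 * v + (1.0 - beta2) * grad * grad\n",
    "    m_hat = m / (1.0 - beta1 ** step)\n",
    "    v_hat = v / (1.0 - beta2 ** step)\n",
    "    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)\n",
    "    return param, m, v\n",
    "\n",
    "# With a zero gradient, only the decay term changes the parameter\n",
    "decayed, _, _ = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0, step=1)\n",
    "# With a non-zero gradient, the first Adam step has magnitude close to lr\n",
    "stepped, _, _ = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)\n",
    "```\n",
    "\n",
    "The decay shrinks the weight by a factor of `1 - lr * weight_decay` per step regardless of the gradient's scale, which is what distinguishes it from an L2 penalty folded into the loss."
   ]
  },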
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "665deadb",
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_epoch(model, loader, device, optimizer=None):\n",
    "    \"\"\"Run one epoch of training or evaluation.\n",
    "\n",
    "    Pass an optimizer to train; omit it to evaluate without gradient tracking.\n",
    "\n",
    "    The RouteGradeDataset returns a dictionary with keys:\n",
    "    - input_ids: token IDs, shape (batch_size, seq_len)\n",
    "    - attention_mask: binary mask, shape (batch_size, seq_len)\n",
    "    - target: difficulty score, shape (batch_size,)\n",
    "    - uuid: route identifiers (for logging)\n",
    "    - board_key: board identifiers (for logging)\n",
    "    \"\"\"\n",
    "    is_train = optimizer is not None\n",
    "    model.train(is_train)\n",
    "    criterion = nn.MSELoss()\n",
    "\n",
    "    losses, preds, targets, uuids, boards = [], [], [], [], []\n",
    "\n",
    "    for batch in loader:\n",
    "        input_ids = batch[\"input_ids\"].to(device)\n",
    "        attention_mask = batch[\"attention_mask\"].to(device)\n",
    "        target = batch[\"target\"].to(device)\n",
    "\n",
    "        if is_train:\n",
    "            optimizer.zero_grad(set_to_none=True)\n",
    "\n",
    "        # Build the autograd graph only when training; evaluation skips it to save memory\n",
    "        with torch.set_grad_enabled(is_train):\n",
    "            pred = model(input_ids, attention_mask)\n",
    "            loss = criterion(pred, target)\n",
    "\n",
    "        if is_train:\n",
    "            loss.backward()\n",
    "            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
    "            optimizer.step()\n",
    "\n",
    "        losses.append(loss.item() * input_ids.size(0))\n",
    "        preds.extend(pred.detach().cpu().numpy().tolist())\n",
    "        targets.extend(target.detach().cpu().numpy().tolist())\n",
    "        uuids.extend(batch[\"uuid\"])\n",
    "        boards.extend(batch[\"board_key\"])\n",
    "\n",
    "    avg_loss = sum(losses) / max(1, len(loader.dataset))\n",
    "    return avg_loss, np.asarray(preds), np.asarray(targets), uuids, boards\n",
    "\n",
    "\n",
    "# Training configuration\n",
    "num_epochs = 30\n",
    "patience = 12\n",
    "\n",
    "print(f\"Max epochs: {num_epochs}\")\n",
    "print(f\"Early stopping patience: {patience}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35d4bd8b",
   "metadata": {},
   "source": [
    "## Training Loop\n",
    "\n",
    "The training loop follows the standard deep learning workflow:\n",
    "\n",
    "1. **Forward pass**: Feed input through the model to get predictions\n",
    "2. **Compute loss**: Compare predictions to actual grades using MSE\n",
    "3. **Backward pass**: Compute gradients via backpropagation\n",
    "4. **Update weights**: Adjust model parameters using the optimizer\n",
    "5. **Validate**: Check performance on held-out validation data\n",
    "6. **Early stopping**: Stop if validation MAE stops improving\n",
    "\n",
    "During training we log both fine-grained metrics (MAE, R²) and a practical one (the percentage of predictions within ±1 V-grade)."
   ]
  },
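  {
   "cell_type": "markdown",
   "id": "sketch_early_stopping",
   "metadata": {},
   "source": [
    "The patience rule is easy to trace on a short list of numbers. The sketch below uses a hypothetical helper and made-up validation MAE values; it follows the same counter logic as the loop in the next cell but is not a replacement for it.\n",
    "\n",
    "```python\n",
    "def best_epoch_with_patience(val_maes, patience):\n",
    "    # Returns (best_epoch, stop_epoch), both 1-indexed\n",
    "    best_mae = float('inf')\n",
    "    best_epoch = 0\n",
    "    stale = 0\n",
    "    for epoch, mae in enumerate(val_maes, start=1):\n",
    "        if mae < best_mae:\n",
    "            # New best: remember it and reset the stale counter\n",
    "            best_mae, best_epoch, stale = mae, epoch, 0\n",
    "        else:\n",
    "            stale += 1\n",
    "        if stale >= patience:\n",
    "            return best_epoch, epoch\n",
    "    return best_epoch, len(val_maes)\n",
    "\n",
    "# Hypothetical validation MAE per epoch\n",
    "best, stopped = best_epoch_with_patience([2.0, 1.6, 1.5, 1.55, 1.52, 1.51, 1.4], patience=3)\n",
    "```\n",
    "\n",
    "Here training stops at epoch 6 with epoch 3 as the best, so the lower value at epoch 7 is never seen. A small `patience` can stop too early on a noisy validation curve, which is one reason to keep it fairly generous."
   ]
  },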
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "476b158d",
   "metadata": {},
   "outputs": [],
   "source": [
    "history = []\n",
    "best_val_mae = float(\"inf\")\n",
    "best_state = None\n",
    "best_epoch = 0\n",
    "epochs_without_improvement = 0\n",
    "\n",
    "print(\"Starting training...\\n\")\n",
    "\n",
    "for epoch in range(1, num_epochs + 1):\n",
    "    train_loss, train_pred, train_true, _, _ = run_epoch(model, train_loader, device, optimizer)\n",
    "    val_loss, val_pred, val_true, _, _ = run_epoch(model, val_loader, device, optimizer=None)\n",
    "\n",
    "    train_metrics = regression_metrics(train_true, train_pred)\n",
    "    val_metrics = regression_metrics(val_true, val_pred)\n",
    "\n",
    "    history.append({\n",
    "        \"epoch\": epoch,\n",
    "        \"train_loss\": train_loss,\n",
    "        \"val_loss\": val_loss,\n",
    "        \"train_mae\": train_metrics[\"mae\"],\n",
    "        \"val_mae\": val_metrics[\"mae\"],\n",
    "        \"train_r2\": train_metrics[\"r2\"],\n",
    "        \"val_r2\": val_metrics[\"r2\"],\n",
    "        \"val_within_1_vgrade\": val_metrics[\"within_1_vgrade\"],\n",
    "    })\n",
    "\n",
    "    if val_metrics[\"mae\"] < best_val_mae:\n",
    "        best_val_mae = val_metrics[\"mae\"]\n",
    "        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}\n",
    "        best_epoch = epoch\n",
    "        epochs_without_improvement = 0\n",
    "    else:\n",
    "        epochs_without_improvement += 1\n",
    "\n",
    "    if epoch == 1 or epoch % 5 == 0 or epoch == best_epoch:\n",
    "        print(\n",
    "            f\"Epoch {epoch:03d} | \"\n",
    "            f\"train MAE {train_metrics['mae']:.3f} | \"\n",
    "            f\"val MAE {val_metrics['mae']:.3f} | \"\n",
    "            f\"val R² {val_metrics['r2']:.3f} | \"\n",
    "            f\"val ±1V {val_metrics['within_1_vgrade']:.1f}%\"\n",
    "        )\n",
    "\n",
    "    if epochs_without_improvement >= patience:\n",
    "        print(f\"Early stopping at epoch {epoch}; best epoch was {best_epoch}.\")\n",
    "        break\n",
    "\n",
    "if best_state is not None:\n",
    "    model.load_state_dict(best_state)\n",
    "\n",
    "print(f\"\\nTraining complete. Best epoch: {best_epoch}, Best val MAE: {best_val_mae:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "589ad448",
   "metadata": {},
   "source": [
    "## Test Set Evaluation\n",
    "\n",
    "After training, we load the best model (based on validation MAE) and evaluate on the held-out test set. We report:\n",
    "\n",
    "- **MAE** (Mean Absolute Error): Average error in difficulty score points\n",
    "- **RMSE** (Root Mean Squared Error): Penalizes large errors more\n",
    "- **R²** (R-squared): How much variance in grades the model explains\n",
    "- **Within ±1 difficulty**: Percentage of predictions within 1 point\n",
    "- **Within ±1 V-grade**: Percentage of predictions within 1 V-grade\n",
    "\n",
    "We also break down performance by board (TB2 vs Kilter) to check for bias."
   ]
  },
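  {
   "cell_type": "markdown",
   "id": "sketch_regression_metrics",
   "metadata": {},
   "source": [
    "For reference, the first four metrics can be written out in a few lines of plain Python. This is a sketch with a hypothetical function name and output keys, not the implementation of `regression_metrics` in `climbingboardgpt.metrics`. The V-grade metric is left out because it depends on a difficulty-to-V-grade mapping that this notebook does not show.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def regression_summary(y_true, y_pred):\n",
    "    n = len(y_true)\n",
    "    errors = [p - t for t, p in zip(y_true, y_pred)]\n",
    "    mae = sum(abs(e) for e in errors) / n\n",
    "    rmse = math.sqrt(sum(e * e for e in errors) / n)\n",
    "    # R^2 compares the squared error to that of always predicting the mean\n",
    "    mean_true = sum(y_true) / n\n",
    "    ss_res = sum(e * e for e in errors)\n",
    "    ss_tot = sum((t - mean_true) ** 2 for t in y_true)\n",
    "    r2 = 1.0 - ss_res / ss_tot\n",
    "    # Percentage of predictions within one difficulty point of the target\n",
    "    within_1 = 100.0 * sum(abs(e) <= 1.0 for e in errors) / n\n",
    "    return {'mae': mae, 'rmse': rmse, 'r2': r2, 'within_1_difficulty': within_1}\n",
    "\n",
    "# Hypothetical targets and predictions\n",
    "summary = regression_summary(y_true=[10.0, 12.0, 14.0, 16.0], y_pred=[11.0, 12.0, 12.0, 16.0])\n",
    "```\n",
    "\n",
    "In this toy case the single 2-point miss raises RMSE above MAE, which is the sense in which RMSE penalizes large errors more."
   ]
  },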
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9abc3a72",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_loss, test_pred, test_true, test_uuid, test_board = run_epoch(model, test_loader, device, optimizer=None)\n",
    "overall_metrics = regression_metrics(test_true, test_pred)\n",
    "\n",
    "pred_df = pd.DataFrame({\n",
    "    \"uuid\": test_uuid,\n",
    "    \"board_key\": test_board,\n",
    "    \"y_true\": test_true,\n",
    "    \"y_pred\": test_pred,\n",
    "})\n",
    "board_metrics_df = metrics_by_board(pred_df)\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"Overall joint test performance\")\n",
    "print(\"=\" * 50)\n",
    "for key, value in overall_metrics.items():\n",
    "    suffix = \"%\" if \"within\" in key or \"exact\" in key else \"\"\n",
    "    print(f\"{key:24s}: {value:8.4f}{suffix}\")\n",
    "\n",
    "print(\"\\nBoard-specific test performance:\")\n",
    "print(board_metrics_df.to_string(index=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "556be142",
   "metadata": {},
   "source": [
    "## Save Model and Artifacts\n",
    "\n",
    "We save the trained model checkpoint and evaluation metrics for use in notebook 04 (route evaluation) and for future inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "save_model",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save model checkpoint\n",
    "MODEL_DIR = ROOT / \"models\"\n",
    "MODEL_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "OUT_DIR = ROOT / \"data\" / \"processed\" / \"grade_prediction\"\n",
    "OUT_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "# Save the full model checkpoint (needed by notebook 04)\n",
    "checkpoint = {\n",
    "    \"model_state_dict\": model.state_dict(),\n",
    "    \"config\": {\n",
    "        \"vocab_size\": len(stoi),\n",
    "        \"max_len\": max_len,\n",
    "        \"d_model\": 128,\n",
    "        \"nhead\": 4,\n",
    "        \"num_layers\": 4,\n",
    "        \"dim_feedforward\": 256,\n",
    "        \"dropout\": 0.10,\n",
    "        \"pad_id\": pad_id,\n",
    "    },\n",
    "    \"stoi\": stoi,\n",
    "    \"itos\": {str(k): v for k, v in itos.items()},\n",
    "    \"coord_features\": coord_features.cpu(),\n",
    "    \"overall_metrics\": overall_metrics,\n",
    "}\n",
    "model_path = MODEL_DIR / \"joint_transformer_grade_predictor.pth\"\n",
    "torch.save(checkpoint, model_path)\n",
    "\n",
    "# Save training history and metrics\n",
    "pd.DataFrame(history).to_csv(OUT_DIR / \"training_history.csv\", index=False)\n",
    "pred_df.to_csv(OUT_DIR / \"test_predictions.csv\", index=False)\n",
    "board_metrics_df.to_csv(OUT_DIR / \"board_metrics.csv\", index=False)\n",
    "\n",
    "from climbingboardgpt.utils import write_json\n",
    "write_json(OUT_DIR / \"overall_metrics.json\", overall_metrics)\n",
    "\n",
    "print(f\"Saved model checkpoint to: {model_path}\")\n",
    "print(f\"Saved training history to: {OUT_DIR / 'training_history.csv'}\")\n",
    "print(f\"Saved test predictions to: {OUT_DIR / 'test_predictions.csv'}\")\n",
    "print(f\"Saved board metrics to: {OUT_DIR / 'board_metrics.csv'}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "key_takeaways",
   "metadata": {},
   "source": [
    "## Key Takeaways\n",
    "\n",
    "1. **The transformer can learn from raw token sequences** without hand-engineered features like \"mean hand reach\" or \"height gained\". The self-attention mechanism lets it discover these patterns.\n",
    "\n",
    "2. **Coordinate features add spatial context**: Injecting physical (x, y) position information gives the model a direct prior about spatial relationships, much as positional embeddings give language models word order. Quantifying the benefit would need an ablation run without them.\n",
    "\n",
    "3. **Joint training across boards**: By training on both TB2 and Kilter data simultaneously, the model can share statistical strength. The board token (`<BOARD_TB2>` vs `<BOARD_KILTER>`) tells it which \"language\" it's operating in.\n",
    "\n",
    "4. **The gap between fine-grained and grouped metrics**: A V-grade bucket covers more than one difficulty point, so a prediction that misses by a point or two often still lands within one V-grade of the truth. The ±1 V-grade accuracy is therefore expected to be higher than the ±1 difficulty accuracy."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.14.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}