Next version. Models + scripts updated. 2

2026-05-21 22:21:26 -04:00
parent 0002ef1545
commit 86d582a572
23 changed files with 1768 additions and 293 deletions
@@ -0,0 +1,518 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "27197e7d",
+   "metadata": {},
+   "source": [
+    "# 03 — Joint nanoGPT-style Route Generation\n",
+    "\n",
+    "## From Understanding to Generation\n",
+    "\n",
+    "Notebook 02 used a **transformer encoder** (BERT-style) to *understand* routes and predict their grade. This notebook uses a **transformer decoder** (GPT-style) to *generate* new routes.\n",
+    "\n",
+    "### The key difference: Encoder vs Decoder\n",
+    "\n",
+    "| Aspect | BERT-style (Encoder) | GPT-style (Decoder) |\n",
+    "|---|---|---|\n",
+    "| Attention | Bidirectional (sees all tokens) | Causal (sees only the current and earlier tokens) |\n",
+    "| Training | Masked language modeling | Next-token prediction |\n",
+    "| Use case | Classification, regression | Text generation |\n",
+    "| Output | Single prediction per sequence | One prediction per position |\n",
+    "\n",
+    "### How GPT-style generation works\n",
+    "\n",
+    "The model is trained to predict the **next token** given all previous tokens:\n",
+    "\n",
+    "```text\n",
+    "Input:  <BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6>\n",
+    "Target: <BOARD_TB2> <ANGLE_40> <GRADE_V6> <TB2_p344_start>\n",
+    "```\n",
+    "\n",
+    "At generation time, we:\n",
+    "1. Start with a prompt like `<BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6>`\n",
+    "2. Ask the model to predict the next token\n",
+    "3. Sample from the predicted probability distribution\n",
+    "4. Append the sampled token to the sequence\n",
+    "5. Repeat until we generate `<EOS>` or hit a max length\n",
+    "\n",
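+    "The loop above can be sketched in plain Python. This is only an illustration, not the notebook's `generate_one`: `toy_next_token_probs` is a hypothetical stand-in for the trained model, and `<HOLD_A>` / `<HOLD_B>` are made-up tokens.\n",
+    "\n",
+    "```python\n",
+    "import random\n",
+    "\n",
+    "def toy_next_token_probs(tokens):\n",
+    "    # Hypothetical stand-in for the model: after two holds, always emit <EOS>.\n",
+    "    n_holds = sum(t.startswith('<HOLD_') for t in tokens)\n",
+    "    if n_holds >= 2:\n",
+    "        return {'<EOS>': 1.0}\n",
+    "    return {'<HOLD_A>': 0.5, '<HOLD_B>': 0.5}\n",
+    "\n",
+    "def generate(prompt, max_new_tokens=10, seed=0):\n",
+    "    rng = random.Random(seed)\n",
+    "    tokens = list(prompt)                              # step 1: start from the prompt\n",
+    "    for _ in range(max_new_tokens):\n",
+    "        probs = toy_next_token_probs(tokens)           # step 2: predict a distribution\n",
+    "        choices, weights = zip(*probs.items())\n",
+    "        next_token = rng.choices(choices, weights=weights, k=1)[0]  # step 3: sample\n",
+    "        tokens.append(next_token)                      # step 4: append\n",
+    "        if next_token == '<EOS>':                      # step 5: stop at <EOS>\n",
+    "            break\n",
+    "    return tokens\n",
+    "\n",
+    "prompt = ['<BOS>', '<BOARD_TB2>', '<ANGLE_40>', '<GRADE_V6>']\n",
+    "route = generate(prompt)\n",
+    "print(route)\n",
+    "```\n",
+    "\n",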
+    "### Conditioning on board, angle, and grade\n",
+    "\n",
+    "The prompt tokens tell the model *what kind of route to generate*:\n",
+    "- `<BOARD_TB2>`: Generate a route for the Tension Board 2\n",
+    "- `<ANGLE_40>`: At 40 degrees\n",
+    "- `<GRADE_V6>`: At V6 difficulty\n",
+    "\n",
+    "This is loosely analogous to how a system prompt conditions a chat model such as ChatGPT: the prefix sets the context, and every token generated afterward is conditioned on it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b6590822",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "import sys\n",
+    "import json\n",
+    "import math\n",
+    "import pandas as pd\n",
+    "import torch\n",
+    "from torch.utils.data import DataLoader\n",
+    "\n",
+    "ROOT = Path.cwd().resolve()\n",
+    "if ROOT.name == \"notebooks\":\n",
+    "    ROOT = ROOT.parent\n",
+    "sys.path.insert(0, str(ROOT / \"src\"))\n",
+    "\n",
+    "from climbingboardgpt.config import load_board_configs\n",
+    "from climbingboardgpt.datasets import RouteGPTDataset\n",
+    "from climbingboardgpt.generation import generate_one\n",
+    "from climbingboardgpt.models import JointRouteGPT"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f09fdf54",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "TOKENIZED = ROOT / \"data\" / \"processed\" / \"tokenized\"\n",
+    "df_routes = pd.read_csv(TOKENIZED / \"route_sequences.csv\")\n",
+    "vocab = json.loads((TOKENIZED / \"token_vocab.json\").read_text(encoding=\"utf-8\"))\n",
+    "stoi = {str(k): int(v) for k, v in vocab[\"stoi\"].items()}\n",
+    "itos = {int(k): str(v) for k, v in vocab[\"itos\"].items()}\n",
+    "\n",
+    "pad_id = stoi[\"<PAD>\"]\n",
+    "unk_id = stoi[\"<UNK>\"]\n",
+    "\n",
+    "print(f\"Vocabulary size: {len(stoi):,}\")\n",
+    "print(f\"Total routes: {len(df_routes):,}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe4b0faf",
+   "metadata": {},
+   "source": [
+    "## Sequence encoding for causal language modeling\n",
+    "\n",
+    "### The autoregressive setup\n",
+    "\n",
+    "For GPT-style training, each route becomes a sequence where the model learns to predict each token given all previous tokens:\n",
+    "\n",
+    "```text\n",
+    "Input:   <BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6> <TB2_p344_start> <TB2_p369_middle>\n",
+    "Target:  <BOARD_TB2> <ANGLE_40> <GRADE_V6> <TB2_p344_start> <TB2_p369_middle> <TB2_p603_finish>\n",
+    "```\n",
+    "\n",
+    "The target is the input shifted left by one position, so at every position the model is trained to predict the token that comes next. This is the standard causal language modeling setup.\n",
+    "\n",
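+    "A minimal sketch of that shift, assuming the dataset class does something equivalent internally (its implementation is not shown here, and padding is left out). `make_input_target` is a hypothetical helper, not part of `climbingboardgpt`.\n",
+    "\n",
+    "```python\n",
+    "def make_input_target(ids):\n",
+    "    # Hypothetical helper: the input drops the last token, the target drops the first.\n",
+    "    return ids[:-1], ids[1:]\n",
+    "\n",
+    "tokens = ['<BOS>', '<BOARD_TB2>', '<ANGLE_40>', '<GRADE_V6>', '<TB2_p344_start>', '<EOS>']\n",
+    "inputs, targets = make_input_target(tokens)\n",
+    "\n",
+    "# Each input position is paired with the token that follows it.\n",
+    "for x, y in zip(inputs, targets):\n",
+    "    print(f'{x:>18} -> {y}')\n",
+    "```\n",
+    "\n",
+    "Both lists are one element shorter than the full sequence, which is why the block size is the maximum sequence length minus one.\n",
+    "\n",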
+    "### Why include the grade in the training sequence?\n",
+    "\n",
+    "For the grade predictor (notebook 02), we excluded the grade because the model needed to predict it. But for the generator, we **include** the grade (`<GRADE_V6>`) in the training data so the model learns the relationship between grade and hold selection.\n",
+    "\n",
+    "At generation time, we provide the grade as part of the prompt, and the model generates holds that are appropriate for that grade."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ad61dbd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def encode(tokens):\n",
+    "    \"\"\"Convert token strings to integer IDs.\"\"\"\n",
+    "    return [stoi.get(token, unk_id) for token in tokens]\n",
+    "\n",
+    "# Use the \"with grade\" version for GPT training\n",
+    "# The model needs to see the grade to learn grade-hold relationships\n",
+    "df_routes[\"gpt_tokens\"] = df_routes[\"sequence_with_grade\"].fillna(\"\").str.split()\n",
+    "df_routes[\"gpt_ids\"] = df_routes[\"gpt_tokens\"].apply(encode)\n",
+    "df_routes[\"seq_len\"] = df_routes[\"gpt_ids\"].apply(len)\n",
+    "max_len = int(df_routes[\"seq_len\"].max())\n",
+    "block_size = max_len - 1  # Input length (one less than full sequence)\n",
+    "\n",
+    "# Create train/val splits\n",
+    "train_df = df_routes[df_routes[\"split\"] == \"train\"].reset_index(drop=True)\n",
+    "val_df = df_routes[df_routes[\"split\"] == \"val\"].reset_index(drop=True)\n",
+    "\n",
+    "# Create datasets and data loaders\n",
+    "# RouteGPTDataset handles the input/target shift for causal modeling\n",
+    "train_ds = RouteGPTDataset(train_df, max_len=max_len, pad_id=pad_id)\n",
+    "val_ds = RouteGPTDataset(val_df, max_len=max_len, pad_id=pad_id)\n",
+    "train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)\n",
+    "val_loader = DataLoader(val_ds, batch_size=128, shuffle=False)\n",
+    "\n",
+    "print(f\"Max sequence length: {max_len}\")\n",
+    "print(f\"Block size (input length): {block_size}\")\n",
+    "print(f\"Training samples: {len(train_ds):,}\")\n",
+    "print(f\"Validation samples: {len(val_ds):,}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "66d98641",
+   "metadata": {},
+   "source": [
+    "## The GPT Model Architecture\n",
+    "\n",
+    "### JointRouteGPT\n",
+    "\n",
+    "This is a **causal transformer decoder** — the same architecture used in GPT-2, GPT-3, etc., but much smaller:\n",
+    "\n",
+    "1. **Token embeddings**: Convert integer token IDs to dense vectors\n",
+    "2. **Positional embeddings**: Learned position vectors (not sinusoidal)\n",
+    "3. **Causal self-attention**: Each position can attend only to itself and earlier positions (enforced by a causal mask)\n",
+    "4. **Transformer layers**: Multiple layers of attention + feedforward\n",
+    "5. **Language modeling head**: Projects hidden states to vocabulary logits\n",
+    "\n",
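+    "To make the causal mask in step 3 concrete, here is a small pure-Python sketch. How `JointRouteGPT` builds its mask is not shown in this notebook, so `causal_mask` is a hypothetical illustration of the idea rather than the model's code.\n",
+    "\n",
+    "```python\n",
+    "def causal_mask(seq_len):\n",
+    "    # mask[i][j] is True when position i may attend to position j (j <= i).\n",
+    "    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]\n",
+    "\n",
+    "mask = causal_mask(4)\n",
+    "for row in mask:\n",
+    "    # 'x' marks an allowed attention link, '.' a blocked one.\n",
+    "    print(' '.join('x' if allowed else '.' for allowed in row))\n",
+    "```\n",
+    "\n",
+    "The printed pattern is lower triangular: the first position sees only itself, and the last position sees the whole prefix.\n",
+    "\n",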
+    "### Key hyperparameters\n",
+    "\n",
+    "- `n_embd=128`: Embedding dimension (GPT-2 small uses 768)\n",
+    "- `n_head=4`: Number of attention heads\n",
+    "- `n_layer=4`: Number of transformer layers (GPT-2 small uses 12)\n",
+    "- `dropout=0.10`: Dropout probability\n",
+    "\n",
+    "This is intentionally small — we're training on ~40K short sequences, not billions of long documents.\n",
+    "\n",
+    "### Weight tying\n",
+    "\n",
+    "The output projection layer shares weights with the token embedding layer (`self.lm_head.weight = self.token_emb.weight`). This is a common technique that:\n",
+    "- Reduces parameter count\n",
+    "- Acts as a regularizer\n",
+    "- Is used in GPT-2 and many other language models"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3eec6f35",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+    "\n",
+    "model = JointRouteGPT(\n",
+    "    vocab_size=len(stoi),\n",
+    "    block_size=block_size,\n",
+    "    n_embd=128,\n",
+    "    n_head=4,\n",
+    "    n_layer=4,\n",
+    "    dropout=0.10,\n",
+    "    pad_id=pad_id,\n",
+    ").to(device)\n",
+    "\n",
+    "optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)\n",
+    "\n",
+    "print(f\"Device: {device}\")\n",
+    "print(f\"Total parameters: {sum(p.numel() for p in model.parameters()):,}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f999cf05",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def train_epoch():\n",
+    "    \"\"\"Train for one epoch.\"\"\"\n",
+    "    model.train()\n",
+    "    losses = []\n",
+    "    n = 0\n",
+    "    for batch in train_loader:\n",
+    "        x = batch[\"input_ids\"].to(device)\n",
+    "        y = batch[\"target_ids\"].to(device)\n",
+    "        \n",
+    "        optimizer.zero_grad(set_to_none=True)\n",
+    "        _, loss = model(x, y)\n",
+    "        loss.backward()\n",
+    "        \n",
+    "        # Gradient clipping prevents exploding gradients\n",
+    "        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
+    "        \n",
+    "        optimizer.step()\n",
+    "        losses.append(loss.item() * x.size(0))\n",
+    "        n += x.size(0)\n",
+    "    return sum(losses) / max(1, n)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def eval_loss(loader):\n",
+    "    \"\"\"Evaluate loss on a data loader.\"\"\"\n",
+    "    model.eval()\n",
+    "    losses = []\n",
+    "    n = 0\n",
+    "    for batch in loader:\n",
+    "        x = batch[\"input_ids\"].to(device)\n",
+    "        y = batch[\"target_ids\"].to(device)\n",
+    "        _, loss = model(x, y)\n",
+    "        losses.append(loss.item() * x.size(0))\n",
+    "        n += x.size(0)\n",
+    "    return sum(losses) / max(1, n)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "51fb8b6e",
+   "metadata": {},
+   "source": [
+    "## Training\n",
+    "\n",
+    "### What we're optimizing\n",
+    "\n",
+    "The model minimizes **cross-entropy loss** — the standard loss function for language modeling. At each position, the model outputs a probability distribution over the entire vocabulary, and the loss measures how surprised it is by the actual next token.\n",
+    "\n",
+    "### Perplexity\n",
+    "\n",
+    "We also track **perplexity**, which is `exp(loss)`. Perplexity can be read as the effective number of equally likely tokens the model is choosing between at each step. Lower perplexity means a better model.\n",
+    "\n",
+    "For reference:\n",
+    "- A model that always predicts the right token has perplexity = 1\n",
+    "- A model that picks uniformly from a 1000-token vocab has perplexity = 1000\n",
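+    "- Strong language models on English text benchmarks report perplexities of roughly 15-20, though the figure depends heavily on the tokenizer and dataset and is not directly comparable to ours\n",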
+    "\n",
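+    "The first two reference points can be checked numerically. This sketch assumes the loss is the mean negative log-probability of the correct next tokens; `perplexity` is a hypothetical helper written for the example.\n",
+    "\n",
+    "```python\n",
+    "import math\n",
+    "\n",
+    "def perplexity(token_probs):\n",
+    "    # token_probs: probability the model assigned to each actual next token.\n",
+    "    loss = -sum(math.log(p) for p in token_probs) / len(token_probs)\n",
+    "    return math.exp(loss)\n",
+    "\n",
+    "always_right = perplexity([1.0, 1.0, 1.0])   # the model is certain and correct\n",
+    "uniform_1000 = perplexity([1 / 1000] * 5)    # uniform guess over 1000 tokens\n",
+    "print(always_right, uniform_1000)\n",
+    "```\n",
+    "\n",
+    "The two values should come out at 1 and approximately 1000, up to floating-point rounding.\n",
+    "\n",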
+    "Our vocabulary has roughly 4,000 or more tokens, so uniform guessing would give a perplexity of about the same size. A validation perplexity far below that indicates the model is learning meaningful patterns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "70b38b02",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "history = []\n",
+    "best_val_loss = float(\"inf\")\n",
+    "best_state = None\n",
+    "patience = 10\n",
+    "stagnant = 0\n",
+    "\n",
+    "print(\"Starting GPT training...\\n\")\n",
+    "\n",
+    "for epoch in range(1, 21):\n",
+    "    train_loss = train_epoch()\n",
+    "    val_loss = eval_loss(val_loader)\n",
+    "    \n",
+    "    # Track perplexity (exponentiated loss); the loss is capped at 20 to keep the reported value bounded\n",
+    "    train_ppl = math.exp(min(train_loss, 20))\n",
+    "    val_ppl = math.exp(min(val_loss, 20))\n",
+    "    \n",
+    "    history.append({\n",
+    "        \"epoch\": epoch,\n",
+    "        \"train_loss\": train_loss,\n",
+    "        \"val_loss\": val_loss,\n",
+    "        \"train_perplexity\": train_ppl,\n",
+    "        \"val_perplexity\": val_ppl,\n",
+    "    })\n",
+    "    \n",
+    "    # Early stopping\n",
+    "    if val_loss < best_val_loss:\n",
+    "        best_val_loss = val_loss\n",
+    "        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}\n",
+    "        stagnant = 0\n",
+    "    else:\n",
+    "        stagnant += 1\n",
+    "    \n",
+    "    if epoch == 1 or epoch % 5 == 0:\n",
+    "        print(f\"Epoch {epoch:3d} | \"\n",
+    "              f\"Train Loss: {train_loss:.4f} | \"\n",
+    "              f\"Val Loss: {val_loss:.4f} | \"\n",
+    "              f\"Val PPL: {val_ppl:.1f}\")\n",
+    "    \n",
+    "    if stagnant >= patience:\n",
+    "        print(f\"\\nEarly stopping at epoch {epoch}\")\n",
+    "        break\n",
+    "\n",
+    "# Load best model\n",
+    "if best_state is not None:\n",
+    "    model.load_state_dict(best_state)\n",
+    "\n",
+    "print(f\"\\nBest validation loss: {best_val_loss:.4f}\")\n",
+    "print(f\"Best validation perplexity: {math.exp(min(best_val_loss, 20)):.1f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "69926180",
+   "metadata": {},
+   "source": [
+    "## Generating Routes\n",
+    "\n",
+    "### The generation process\n",
+    "\n",
+    "To generate a route, we:\n",
+    "\n",
+    "1. **Create a prompt**: `<BOS> <BOARD_TB2> <ANGLE_40> <GRADE_V6>`\n",
+    "2. **Feed it to the model**: Get a probability distribution over the vocabulary for the next token\n",
+    "3. **Sample a token**: Use temperature and top-k filtering to control randomness\n",
+    "4. **Append and repeat**: Add the sampled token to the sequence and repeat until `<EOS>` or max length\n",
+    "\n",
+    "### Temperature and top-k sampling\n",
+    "\n",
+    "- **Temperature** (default 0.9): Divides the logits before the softmax. Values below 1 sharpen the distribution (more deterministic); values above 1 flatten it (more random)\n",
+    "- **Top-k** (default 50): Only the k most likely tokens are kept as candidates, which prevents the model from sampling very unlikely tokens.\n",
+    "\n",
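+    "Both controls fit in a few lines of plain Python. This is a sketch of the general technique, not the code inside `generate_one` (which is not shown here); `sample_top_k` and the example logits are hypothetical.\n",
+    "\n",
+    "```python\n",
+    "import math\n",
+    "import random\n",
+    "\n",
+    "def sample_top_k(logits, temperature=0.9, top_k=50, rng=random):\n",
+    "    # Keep the indices of the top_k largest logits.\n",
+    "    k = min(top_k, len(logits))\n",
+    "    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]\n",
+    "    # Temperature-scaled softmax weights over the kept logits,\n",
+    "    # shifted by the maximum for numerical stability.\n",
+    "    scaled = [logits[i] / temperature for i in kept]\n",
+    "    peak = max(scaled)\n",
+    "    weights = [math.exp(s - peak) for s in scaled]\n",
+    "    # random.choices normalizes the weights, so no explicit division is needed.\n",
+    "    return rng.choices(kept, weights=weights, k=1)[0]\n",
+    "\n",
+    "logits = [2.0, 1.0, 0.5, -1.0, -3.0]\n",
+    "rng = random.Random(0)\n",
+    "draws = [sample_top_k(logits, temperature=0.9, top_k=2, rng=rng) for _ in range(1000)]\n",
+    "print({i: draws.count(i) for i in sorted(set(draws))})\n",
+    "```\n",
+    "\n",
+    "With `top_k=2` only token indices 0 and 1 can ever be drawn, and pushing the temperature toward zero makes the choice effectively greedy.\n",
+    "\n",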
+    "These are the same techniques used in language models like GPT-3 to control output diversity."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "029eb911",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Generate sample routes for both boards\n",
+    "configs = load_board_configs([\"tb2\", \"kilter\"])\n",
+    "configs_by_key = {config.board_key: config for config in configs}\n",
+    "\n",
+    "samples = []\n",
+    "for board_key, config in configs_by_key.items():\n",
+    "    for grouped_v in [3, 5, 7]:  # V3, V5, V7\n",
+    "        sample = generate_one(\n",
+    "            model=model,\n",
+    "            stoi=stoi,\n",
+    "            itos=itos,\n",
+    "            device=device,\n",
+    "            board_prefix=config.token_prefix,\n",
+    "            angle=40,\n",
+    "            grouped_v=grouped_v,\n",
+    "            role_name_to_id=config.role_definitions,\n",
+    "            temperature=0.9,\n",
+    "            top_k=50,\n",
+    "            max_new_tokens=40,\n",
+    "        )\n",
+    "        samples.append({\"board_key\": board_key, **sample})\n",
+    "\n",
+    "samples_df = pd.DataFrame(samples)\n",
+    "print(\"Generated route samples:\")\n",
+    "print(samples_df[[\"board_key\", \"requested_grouped_v\", \"basic_valid\", \"sequence\", \"frames\"]])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "generate_more",
+   "metadata": {},
+   "source": [
+    "## Generate More Routes for Evaluation\n",
+    "\n",
+    "Notebook 04 needs a larger set of generated routes for meaningful evaluation. Let's generate routes across multiple angles and grades for both boards."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "generate_bulk",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Generate routes across multiple angles and grades for evaluation\n",
+    "all_samples = []\n",
+    "\n",
+    "for board_key, config in configs_by_key.items():\n",
+    "    # Get common angles and grades for this board\n",
+    "    board_df = df_routes[df_routes[\"board_key\"] == board_key]\n",
+    "    common_angles = sorted(board_df[\"angle\"].astype(int).value_counts().head(5).index.tolist())\n",
+    "    common_grades = sorted(board_df[\"grouped_v\"].astype(int).value_counts().head(8).index.tolist())\n",
+    "    \n",
+    "    print(f\"\\nGenerating for {config.display_name}:\")\n",
+    "    print(f\"  Angles: {common_angles}\")\n",
+    "    print(f\"  Grades: V{min(common_grades)}-V{max(common_grades)}\")\n",
+    "    \n",
+    "    for angle in common_angles:\n",
+    "        for grade in common_grades:\n",
+    "            for _ in range(5):  # 5 samples per condition\n",
+    "                sample = generate_one(\n",
+    "                    model=model,\n",
+    "                    stoi=stoi,\n",
+    "                    itos=itos,\n",
+    "                    device=device,\n",
+    "                    board_prefix=config.token_prefix,\n",
+    "                    angle=int(angle),\n",
+    "                    grouped_v=int(grade),\n",
+    "                    role_name_to_id=config.role_definitions,\n",
+    "                    temperature=0.9,\n",
+    "                    top_k=50,\n",
+    "                    max_new_tokens=40,\n",
+    "                )\n",
+    "                all_samples.append({\"board_key\": board_key, **sample})\n",
+    "\n",
+    "all_samples_df = pd.DataFrame(all_samples)\n",
+    "print(f\"\\nTotal generated routes: {len(all_samples_df):,}\")\n",
+    "print(\"\\nBasic validity by board:\")\n",
+    "print(all_samples_df.groupby(\"board_key\")[\"basic_valid\"].mean())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "save_artifacts",
+   "metadata": {},
+   "source": [
+    "## Save Model and Generated Routes\n",
+    "\n",
+    "We save the trained model checkpoint and generated routes for use in notebook 04 (evaluation)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "save_outputs",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save model checkpoint\n",
+    "MODEL_DIR = ROOT / \"models\"\n",
+    "MODEL_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "checkpoint = {\n",
+    "    \"model_state_dict\": model.state_dict(),\n",
+    "    \"config\": {\n",
+    "        \"vocab_size\": len(stoi),\n",
+    "        \"block_size\": block_size,\n",
+    "        \"n_embd\": 128,\n",
+    "        \"n_head\": 4,\n",
+    "        \"n_layer\": 4,\n",
+    "        \"dropout\": 0.10,\n",
+    "        \"pad_id\": pad_id,\n",
+    "    },\n",
+    "    \"stoi\": stoi,\n",
+    "    \"itos\": {str(k): v for k, v in itos.items()},\n",
+    "    \"best_val_loss\": best_val_loss,\n",
+    "}\n",
+    "model_path = MODEL_DIR / \"joint_route_gpt_generator.pth\"\n",
+    "torch.save(checkpoint, model_path)\n",
+    "print(f\"Saved model checkpoint to: {model_path}\")\n",
+    "\n",
+    "# Save training history\n",
+    "GEN_DIR = ROOT / \"data\" / \"processed\" / \"generation\"\n",
+    "GEN_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "pd.DataFrame(history).to_csv(GEN_DIR / \"training_history.csv\", index=False)\n",
+    "print(f\"Saved training history to: {GEN_DIR / 'training_history.csv'}\")\n",
+    "\n",
+    "# Save generated routes (this is what notebook 04 needs)\n",
+    "all_samples_df.to_csv(GEN_DIR / \"generated_routes.csv\", index=False)\n",
+    "print(f\"Saved {len(all_samples_df)} generated routes to: {GEN_DIR / 'generated_routes.csv'}\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}