Kilter-Board-Analysis/notebooks/01_data_overview_and_climbing_statistics.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "37e8cfe9",
   "metadata": {},
   "source": [
    "Kilter Board: Data Overview and Climbing Statistics\n",
    "\n",
    "## Purpose\n",
    "\n",
    "This notebook establishes the basic statistical landscape of the dataset before we move into hold-level analysis and predictive modelling. The main goals are:\n",
    "\n",
    "1. to understand the size and scope of the data,\n",
    "2. to compare layouts, boards, and angles at a high level,\n",
    "3. to identify broad trends in grade, popularity, and quality,\n",
    "4. to create a clean descriptive baseline for the later modelling notebooks.\n",
    "\n",
    "Throughout, I treat each climb-angle entry as a separate observation unless explicitly noted otherwise. That matters because some climbs appear at multiple angles, so a unique climb count and a climb-angle count are not always the same thing.\n",
    "\n",
    "## Outputs\n",
    "\n",
    "This notebook produces summary tables and exploratory plots that motivate the later notebooks on:\n",
    "- hold usage,\n",
    "- hold difficulty,\n",
    "- feature engineering,\n",
    "- predictive modelling,\n",
    "- and deep learning.\n",
    "\n",
    "## Notebook Structure\n",
    "1. [Setup and Imports](#setup-and-imports)\n",
    "2. [Popularity and Temporal Trends](#popularity-and-temporal-trends)\n",
    "3. [Climbing Statistics](#climbing-statistics-grades-angles-quality-and-matching)\n",
    "4. [Prolific Statistics](#prolific-statistics)\n",
    "5. [Conclusion](#conclusion)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "898cad20",
   "metadata": {},
   "source": [
    "## Setup and Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e48e2d25",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Setup and imports\n",
    "==================================\n",
    "\"\"\"\n",
    "# Imports\n",
    "import pandas as pd\n",
    "import sqlite3\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import numpy as np\n",
    "\n",
    "import matplotlib.patches as mpatches\n",
    "import sqlite3\n",
    "\n",
    "\n",
    "# Set some display options\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.max_rows', 100)\n",
    "\n",
    "# Set style\n",
    "palette=['steelblue', 'coral', 'seagreen']  #(for multi-bar graphs)\n",
    "\n",
    "# Connect to the database\n",
    "DB_PATH=\"../data/kilter.db\"\n",
    "conn = sqlite3.connect(DB_PATH)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e7b5862",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "### Query our data from the DB\n",
    "==================================\n",
    "\n",
    "We restrict ourselves to the Kilter Original, i.e., layoud_id=1.\n",
    "For some reason there is also a first ascent logged in 2006 (\"'Preliminary first draft of Val David Sans Nom Practice. Not yet tested\").\n",
    "Since the first logged ascents start in 2018, we'll just insist that the date is larger than January 1, 2016.\n",
    "\"\"\"\n",
    "\n",
    "# Query climb data\n",
    "climbs_query = \"\"\"\n",
    "SELECT\n",
    "    c.uuid,\n",
    "    c.name AS climb_name,\n",
    "    c.setter_username,\n",
    "    c.layout_id AS layout_id,\n",
    "    c.description,\n",
    "    c.is_nomatch,\n",
    "    c.is_listed,\n",
    "    l.name AS layout_name,\n",
    "    p.name AS board_name,\n",
    "    c.frames,\n",
    "    cs.angle,\n",
    "    cs.display_difficulty,\n",
    "    dg.boulder_name AS boulder_grade,\n",
    "    cs.ascensionist_count,\n",
    "    cs.quality_average,\n",
    "    cs.fa_at\n",
    "FROM climbs c\n",
    "JOIN layouts l ON c.layout_id = l.id\n",
    "JOIN products p ON l.product_id = p.id\n",
    "JOIN climb_stats cs ON c.uuid = cs.climb_uuid\n",
    "JOIN difficulty_grades dg ON ROUND(cs.display_difficulty) = dg.difficulty\n",
    "WHERE cs.display_difficulty IS NOT NULL AND c.layout_id=1 AND cs.fa_at > '2016-01-01';\n",
    "\"\"\"\n",
    "\n",
    "# Load it into a DataFrame\n",
    "df = pd.read_sql_query(climbs_query, conn)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70a88be4",
   "metadata": {},
   "source": [
    "The above query will allow us to gather basically anything we need to in order to analyze climbing statistics. We leave out information about climging holds and things like this, because they will be analyzed in a different notebook. Let's see what our DataFrame looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b0057f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7eb46cc",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba4fc956",
   "metadata": {},
   "source": [
    "# Popularity and Temporal Trends\n",
    "\n",
    "## Popularity of Tension Board\n",
    "\n",
    "Since we do not have access to user data, we will examine the popular of the Tension Boards by counting first ascents and unique setters by year. Often it's the case that the first ascensionist is the also the setter of the climb, but not always. None the less, we group up first ascensionists by year, with an extra tidbit about how many unique setters there were. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3dff1bd7",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Popularity of Kilter board by year. \n",
    "First ascents by year + unique setters by year\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "# Convert df['fa_at'] to datetime format. (For some reason, it does not register as such)\n",
    "df['fa_at'] = pd.to_datetime(df['fa_at'])\n",
    "\n",
    "# Add a new column for the year\n",
    "df['fa_year'] = df['fa_at'].dt.year\n",
    "\n",
    "# Make a new DataFrame with year, first_ascents, and unique_setters\n",
    "df_growth = df.groupby('fa_year').agg(\n",
    "    first_ascents=('uuid', 'count'),\n",
    "    unique_setters=('setter_username', 'nunique')\n",
    ").reset_index()\n",
    "\n",
    "# Disregard the year 2026 since the data only goes one month in. \n",
    "df_growth = df_growth[df_growth['fa_year'] < 2026]\n",
    "\n",
    "# Convert year to string so that matplotlib doesn't think the year is continuous\n",
    "df_growth['fa_year'] = df_growth['fa_year'].astype(str)\n",
    "\n",
    "## Plot\n",
    "# Dual index plotting\n",
    "\n",
    "fig, ax1 = plt.subplots(figsize=(12,6))\n",
    "\n",
    "# Bar chart for first ascents\n",
    "ax1.bar(df_growth['fa_year'], df_growth['first_ascents'], label='First Ascents', color='coral')\n",
    "ax1.set_xlabel('Year')\n",
    "ax1.set_ylabel('First Ascents')\n",
    "ax1.set_title('TB First Ascents & Unique Setters over Time')\n",
    "#ax1.tick_params(axis='y')\n",
    "\n",
    "# Line chart for unique setters (secondary axis)\n",
    "ax2 = ax1.twinx()\n",
    "ax2.plot(df_growth['fa_year'], df_growth['unique_setters'], color='steelblue', marker='o', label='Unique Setters')\n",
    "ax2.set_ylabel('Unique Setters', color='steelblue')\n",
    "ax2.tick_params(axis='y', labelcolor='steelblue')\n",
    "\n",
    "# Other stuff\n",
    "fig.legend(loc='upper left', bbox_to_anchor=(0.15,0.85))\n",
    "\n",
    "plt.xticks()\n",
    "plt.savefig('../images/01_climb_stats/first_ascents_by_year.png')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2bca15a",
   "metadata": {},
   "source": [
    "## Seasonal analysis\n",
    "\n",
    "Next, we examine when the Tension board is most popular. Again, we will work with what we have and use first ascent data. We will plot first ascents by month, combing all years. We exclude the year 2026 because this can skew the analysis as some of the month of January has data (and clearly, 2026 is when the TB2 is the most popular, so this can actually add quite a bit bias). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8912306b",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Season analysis: first ascents by month\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "# First let us add a column for the month to our data\n",
    "df['fa_month'] = df['fa_at'].dt.month\n",
    "\n",
    "# Filter to years < 2026 since the data only goes one month in\n",
    "df_filter = df[df['fa_year'] < 2026]\n",
    "\n",
    "# Make a new DataFrame with month and first ascents\n",
    "df_season = df_filter.groupby('fa_month').agg(\n",
    "    first_ascents=('uuid', 'count'),\n",
    ").reset_index()\n",
    "\n",
    "# We also add a column for the month name. \n",
    "month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', \n",
    "               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']\n",
    "df_season['fa_month_name'] = df_season['fa_month'].apply(lambda x: month_names[x-1])\n",
    "\n",
    "# Plot the data\n",
    "fig,ax = plt.subplots(figsize=(12,6))\n",
    "ax.bar(df_season['fa_month_name'], df_season['first_ascents'], color='coral')\n",
    "ax.set_title('First Ascents by Month (All Years Combined)')\n",
    "ax.set_xlabel('Month')\n",
    "ax.set_ylabel('Total First Ascents')\n",
    "\n",
    "# Save the file\n",
    "plt.savefig('../images/01_climb_stats/first_ascents_by_month.png')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb802d74",
   "metadata": {},
   "source": [
    "This should be what we expect: that the winter months (Dec) see the most traffic. This is probably when the outdoor climbers are hitting the boards because they're stuck inside. The warmer months see the least number of first ascents since the strong climbers are probably outdoors."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "def962ba",
   "metadata": {},
   "source": [
    "## Day of Week Analysis\n",
    "\n",
    "We can plot the number of first ascents by day of week. Removing the 2026 data shouldn't make a difference here, so we opt to keep it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "071cfd5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Day of Week analysis\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "# Let us add a column in our DataFrame for the day of the week.\n",
    "# Note that df.dt.day_of_week will have Monday be 0 and Sunday be 6. \n",
    "\n",
    "df['fa_day_of_week'] = df['fa_at'].dt.day_of_week\n",
    "\n",
    "\n",
    "# Make a new DataFrame with month and first ascents\n",
    "df_days = df.groupby('fa_day_of_week').agg(\n",
    "    first_ascents=('uuid', 'count'),\n",
    ").reset_index()\n",
    "\n",
    "# We also add a column for the month name. \n",
    "day_names = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']\n",
    "df_days['fa_day_name'] = df_days['fa_day_of_week'].apply(lambda x: day_names[x])\n",
    "\n",
    "# Plot the data\n",
    "fig,ax = plt.subplots(figsize=(12,6))\n",
    "ax.bar(df_days['fa_day_name'], df_days['first_ascents'], color='coral')\n",
    "ax.set_title('First Ascents by Day of Week (All Years Combined)')\n",
    "ax.set_xlabel('Day')\n",
    "ax.set_ylabel('Total First Ascents')\n",
    "\n",
    "# Save the file\n",
    "plt.savefig('../images/01_climb_stats/first_ascents_by_day_of_week.png')\n",
    "plt.show()\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9961e92e",
   "metadata": {},
   "source": [
    "Interesting, Tuesday and Wednesday have the most traffic, while Monday is the least popular."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "440a0b28",
   "metadata": {},
   "source": [
    "## Time of Day Analysis\n",
    "\n",
    "We can even do a time of day analysis. Again, we will keep the 2026 data since it shouldn't affect much. It is not entirely clear that makes sense to look at this, as we don't know if the time of first ascent is recorded in local time of the climber or local time of the server. These boards are all over the world, so this may add quite a bit of variance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b8e3c4a",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Time of Day analysis\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "# Let us add a column in our DataFrame for the day of the week.\n",
    "# Note that df.dt.day_of_week will have Monday be 0 and Sunday be 6. \n",
    "\n",
    "df['fa_hour'] = df['fa_at'].dt.hour\n",
    "\n",
    "\n",
    "# Make a new DataFrame with month and first ascents\n",
    "df_hour = df.groupby('fa_hour').agg(\n",
    "    first_ascents=('uuid', 'count'),\n",
    ").reset_index()\n",
    "\n",
    "\n",
    "# Plot the data\n",
    "fig,ax = plt.subplots(figsize=(12,6))\n",
    "ax.bar(df_hour['fa_hour'], df_hour['first_ascents'], color='coral')\n",
    "ax.set_title('First Ascents by Hour (All Years Combined)')\n",
    "ax.set_xlabel('Hour')\n",
    "ax.set_ylabel('Total First Ascents')\n",
    "\n",
    "# Save the file\n",
    "plt.savefig('../images/01_climb_stats/first_ascents_by_hour.png')\n",
    "plt.show()\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ddc49b3",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "533dc4a2",
   "metadata": {},
   "source": [
    "# Climbing Statistics: Grades, Angles, Quality, and Matching\n",
    "\n",
    "We will visualize the climbing grade distribution. Recall that we have the following table of grades (with some other unlisted grades).\n",
    "\n",
    "|difficulty|boulder_name|route_name|\n",
    "|----------|------------|----------|\n",
    "|        10|4a/V0       |5b/5.9    |\n",
    "|        11|4b/V0       |5c/5.10a  |\n",
    "|        12|4c/V0       |6a/5.10b  |\n",
    "|        13|5a/V1       |6a+/5.10c |\n",
    "|        14|5b/V1       |6b/5.10d  |\n",
    "|        15|5c/V2       |6b+/5.11a |\n",
    "|        16|6a/V3       |6c/5.11b  |\n",
    "|        17|6a+/V3      |6c+/5.11c |\n",
    "|        18|6b/V4       |7a/5.11d  |\n",
    "|        19|6b+/V4      |7a+/5.12a |\n",
    "|        20|6c/V5       |7b/5.12b  |\n",
    "|        21|6c+/V5      |7b+/5.12c |\n",
    "|        22|7a/V6       |7c/5.12d  |\n",
    "|        23|7a+/V7      |7c+/5.13a |\n",
    "|        24|7b/V8       |8a/5.13b  |\n",
    "|        25|7b+/V8      |8a+/5.13c |\n",
    "|        26|7c/V9       |8b/5.13d  |\n",
    "|        27|7c+/V10     |8b+/5.14a |\n",
    "|        28|8a/V11      |8c/5.14b  |\n",
    "|        29|8a+/V12     |8c+/5.14c |\n",
    "|        30|8b/V13      |9a/5.14d  |\n",
    "|        31|8b+/V14     |9a+/5.15a |\n",
    "|        32|8c/V15      |9b/5.15b  |\n",
    "|        33|8c+/V16     |9b+/5.15c |\n",
    "\n",
    "We will use the actual difficulty in our work, and then unpack translations into boulder_name as we see fit."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f986836",
   "metadata": {},
   "source": [
    "## Grade distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5fa594ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Difficulty distribution\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "grade_counts = df['boulder_grade'].value_counts()\n",
    "grade_order = df.groupby('boulder_grade')['display_difficulty'].mean().sort_values().index.tolist()\n",
    "grade_counts = grade_counts.reindex(grade_order)\n",
    "\n",
    "\n",
    "df_grades = df.groupby('boulder_grade').size().reset_index(name='count')\n",
    "\n",
    "\n",
    "# Plot\n",
    "fig, ax = plt.subplots(figsize=(16, 8))\n",
    "\n",
    "sns.barplot(\n",
    "    data=df_grades,\n",
    "    x='boulder_grade',\n",
    "    y='count',\n",
    "    color='steelblue',\n",
    "    ax=ax,\n",
    "    width=0.6,\n",
    "    order=grade_order\n",
    ")\n",
    "\n",
    "\n",
    "\n",
    "ax.set_xlabel('Grade', fontsize=11)\n",
    "ax.set_ylabel('Number of Climbs', fontsize=11)\n",
    "ax.set_title('Difficulty Distribution by Board Layout', fontsize=14)\n",
    "ax.tick_params(axis='x', rotation=45)\n",
    "ax.grid(axis='y', alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('../images/01_climb_stats/difficulty_distribution.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "871bb45d",
   "metadata": {},
   "source": [
    "As a climber in North America, I tend to just use the V-grade and not look at the French grade. So let us group the V-grades together and show the distribution like that. We'll usually just stick the boulder_grade (e.g., 5c/V2) instead of grouping the V-grades though. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7f0b911",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "V-Grade distribution\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "grade_to_v = {\n",
    "10: 0, 11: 0, 12: 0,\n",
    "13: 1, 14: 1,\n",
    "15: 2,\n",
    "16: 3, 17: 3,\n",
    "18: 4, 19: 4,\n",
    "20: 5, 21: 5,\n",
    "22: 6,\n",
    "23: 7,\n",
    "24: 8, 25: 8,\n",
    "26: 9,\n",
    "27: 10,\n",
    "28: 11,\n",
    "29: 12,\n",
    "30: 13,\n",
    "31: 14,\n",
    "32: 15,\n",
    "33: 16,\n",
    "}\n",
    "\n",
    "# Let's add a v_grade column and v_grade_counts\n",
    "df['v_grade'] = df['display_difficulty'].round().map(grade_to_v)\n",
    "df_v_grades = df.groupby('v_grade').size().reset_index(name='count')\n",
    "df_v_grades['v_label'] = 'V' + df_v_grades['v_grade'].astype(str)\n",
    "\n",
    "\n",
    "# Plot\n",
    "fig, ax = plt.subplots(figsize=(16, 8))\n",
    "\n",
    "sns.barplot(\n",
    "    data=df_v_grades,\n",
    "    x='v_label',\n",
    "    y='count',\n",
    "    color='steelblue',\n",
    "    ax=ax,\n",
    "    width=0.6,\n",
    ")\n",
    "\n",
    "\n",
    "ax.set_xlabel('V-Grade', fontsize=11)\n",
    "ax.set_ylabel('Number of Climbs', fontsize=11)\n",
    "ax.set_title('V-Grade Distribution by Board Layout', fontsize=14)\n",
    "ax.tick_params(axis='x')\n",
    "ax.grid(axis='y', alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('../images/01_climb_stats/v_grade_distribution.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e27f4277",
   "metadata": {},
   "source": [
    "Some key differences in grades are the angle at which the climb is. Note that climbs can be done at different angles."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6a34766",
   "metadata": {},
   "source": [
    "## Angle Distribution\n",
    "\n",
    "What about the angle distribution? Since the TB1 goes from 0 to 50 and the TB2 goes from 0 to 65 (although my local board only goes to 60?), let's do an analysis on each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d65b6cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Angle distribution\n",
    "==================================\n",
    "\n",
    "The Kilter Board original goes to 70 degrees.\n",
    "\"\"\"\n",
    "\n",
    "\n",
    "df_angle = df.groupby('angle').size().reset_index(name='count')\n",
    "\n",
    "# Reindex to correct order\n",
    "angle_order = sorted(df['angle'].unique())\n",
    "\n",
    "# Plot\n",
    "fix, ax = plt.subplots(figsize=(16,8))\n",
    "\n",
    "# Plot All Layouts\n",
    "sns.barplot(\n",
    "    data=df_angle,\n",
    "    x='angle',\n",
    "    y='count',\n",
    "    color='seagreen',\n",
    "    ax=ax,\n",
    "    width=0.6,\n",
    "    order=angle_order\n",
    ")\n",
    "\n",
    "\n",
    "ax.set_xlabel('Angle')\n",
    "ax.set_ylabel('Number of Climbs')\n",
    "ax.set_title('Angle Distribution by Board Layout')\n",
    "ax.grid(axis='y', alpha=0.3)\n",
    "\n",
    "\n",
    "plt.suptitle('Angle Distribution by Board Layout')\n",
    "plt.savefig('../images/01_climb_stats/angle_distribution.png')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a00ee20",
   "metadata": {},
   "source": [
    "Just like with Tension Boards, 40 is the most common angle. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc164cee",
   "metadata": {},
   "source": [
    "## Angle vs grade\n",
    "\n",
    "How is the distribution between angles and grades? Let's do this with a heatmap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2023cb49",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Angle vs grade\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(16, 8))\n",
    "\n",
    "# Create mapping from difficulty to boulder_grade\n",
    "grade_mapping = df.groupby('display_difficulty')['boulder_grade'].first().to_dict()\n",
    "\n",
    "# Plot \"All Layouts\" as faint background boxes\n",
    "sns.boxplot(\n",
    "    data=df,\n",
    "    x='angle',\n",
    "    y='display_difficulty',\n",
    "    color='seagreen',\n",
    "    order=angle_order,\n",
    "    showfliers=False,\n",
    "    width=0.6,\n",
    "    ax=ax,\n",
    ")\n",
    "\n",
    "# Relabel y-axis with boulder_grades\n",
    "yticks_rounded = sorted(set(int(round(t)) for t in df['display_difficulty'].unique() if not pd.isna(t)))\n",
    "ylabels = [grade_mapping.get(t, '') for t in yticks_rounded]\n",
    "ax.set_yticks(yticks_rounded)\n",
    "ax.set_yticklabels(ylabels)\n",
    "\n",
    "\n",
    "ax.set_xlabel('Angle (degrees)', fontsize=11)\n",
    "ax.set_ylabel('Boulder Grade', fontsize=11)\n",
    "ax.set_title('Difficulty Distribution by Angle', fontsize=14)\n",
    "ax.grid(axis='y', alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('../images/01_climb_stats/difficulty_by_angle_boxplot.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5477862",
   "metadata": {},
   "source": [
    "We see that angle is directly correlated with how difficult climbs are on average, right up untill 55 degrees. Then it tapers off."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7a5973a",
   "metadata": {},
   "source": [
    "## The Quality of a climb\n",
    "\n",
    "Next we examine the quality of a climb. First we look at how quality relates to the number of ascents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf5bd013",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Climb quality vs popularity\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "# Filter to climbs with quality ratings\n",
    "df_quality = df[(df['quality_average'].notna()) & (df['quality_average'] > 0)]\n",
    "\n",
    "# Sample for performance\n",
    "#df_sample = df_quality.sample(min(2000, len(df_quality)))\n",
    "\n",
    "g = sns.jointplot(\n",
    "    data=df_quality,\n",
    "    x='quality_average',\n",
    "    y='ascensionist_count',\n",
    "    kind='scatter',\n",
    "    color='teal',\n",
    "    height=5\n",
    ")\n",
    "\n",
    "g.ax_joint.set_xlabel('Quality Rating')\n",
    "g.ax_joint.set_ylabel('Ascensionist Count')\n",
    "g.fig.suptitle('Quality vs Popularity')\n",
    "\n",
    "plt.savefig('../images/01_climb_stats/quality_popularity.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf0ca290",
   "metadata": {},
   "source": [
    "Next we visualize the average quality vs the angle and grade, by means of a heatmap. Keep in mind that the harder the climb and steeper the angle, the less people will be doing it. So harder climbs are skewed towards people who can actually do it. The point is that, on boards, the climb quality isn't always the best metric. As such, we won't spend too much time on the quality and will only do a heatmap which takes into account all layouts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af7282b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Average quality by angle and grade\n",
    "\n",
    "# Filter to climbs with quality ratings\n",
    "df_quality = df[(df['quality_average'].notna()) & (df['quality_average'] > 0)]\n",
    "\n",
    "\n",
    "# Create pivot table\n",
    "quality_pivot = df_quality.pivot_table(\n",
    "    index='boulder_grade',\n",
    "    columns='angle',\n",
    "    values='quality_average',\n",
    "    aggfunc='mean'\n",
    ")\n",
    "quality_pivot = quality_pivot.reindex(grade_order)\n",
    "quality_pivot = quality_pivot.reindex(columns=[a for a in angle_order if a in quality_pivot.columns])\n",
    "\n",
    "# Plot\n",
    "fig, ax = plt.subplots(figsize=(16, 8))\n",
    "\n",
    "sns.heatmap(\n",
    "    quality_pivot,\n",
    "    cmap='RdYlGn',\n",
    "    cbar_kws={'label': 'Avg Quality Rating'},\n",
    "    ax=ax\n",
    ")\n",
    "\n",
    "ax.set_xlabel('Angle (°)')\n",
    "ax.set_ylabel('Grade')\n",
    "ax.invert_yaxis()\n",
    "ax.set_title('Average Quality Rating by Grade and Angle (All Layouts)')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('../images/01_climb_stats/quality_heatmap.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3630b077",
   "metadata": {},
   "source": [
    "## \"Match\" vs. \"No Match\"\n",
    "\n",
    "Some setters opt to put the \"no match\" tag onto their climbs. This means that the climber is not allowed to match their hands on any hold. Let's do an analysis of the differences with regular climbs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed1f0ec2",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Match vs No Match analysis\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "# Create status column\n",
    "df['status'] = df.apply(\n",
    "    lambda x: 'No Match' if (\n",
    "        pd.notna(x['description']) and 'No matching' in str(x['description'])\n",
    "    ) or x.get('is_nomatch', 0) == 1 else 'Matched',\n",
    "    axis=1\n",
    ")\n",
    "\n",
    "# Aggregate by status only\n",
    "df_agg = df.groupby('status').agg(\n",
    "    count=('uuid', 'count'),\n",
    "    avg_ascensionists=('ascensionist_count', 'mean'),\n",
    "    avg_difficulty=('display_difficulty', 'mean')\n",
    ").reset_index()\n",
    "\n",
    "status_order = ['Matched', 'No Match']\n",
    "status_colors = {'Matched': 'teal', 'No Match': 'coral'}\n",
    "\n",
    "# Plot\n",
    "fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n",
    "\n",
    "for ax, metric, title in zip(axes, ['count', 'avg_difficulty', 'avg_ascensionists'], \n",
    "                               ['Total Climbs', 'Average Difficulty', 'Avg Ascensionists']):\n",
    "    \n",
    "    sns.barplot(\n",
    "        data=df_agg,\n",
    "        x='status',\n",
    "        y=metric,\n",
    "        hue='status',\n",
    "        legend=False,\n",
    "        order=status_order,\n",
    "        palette=status_colors,\n",
    "        ax=ax,\n",
    "        width=0.5\n",
    "    )\n",
    "    \n",
    "    ax.set_title(title, fontsize=12)\n",
    "    ax.set_xlabel('')\n",
    "    ax.grid(axis='y', alpha=0.3)\n",
    "\n",
    "# Y-axis labels for difficulty plot\n",
    "yticks = [11, 13, 15, 17, 19, 21, 23]\n",
    "ylabels = [grade_mapping.get(t, f\"V{t-10}\") for t in yticks]\n",
    "axes[1].set_yticks(yticks)\n",
    "axes[1].set_yticklabels(ylabels)\n",
    "axes[1].set_ylim(bottom=10)\n",
    "\n",
    "# Add value labels on bars\n",
    "for ax in axes:\n",
    "    for p in ax.patches:\n",
    "        if ax == axes[1]:  # Difficulty plot - show boulder_grade\n",
    "            height = p.get_height()\n",
    "            rounded_diff = round(height)\n",
    "            boulder_grade = grade_mapping.get(rounded_diff, f\"V{rounded_diff - 10}\")\n",
    "            ax.annotate(\n",
    "                boulder_grade,\n",
    "                (p.get_x() + p.get_width() / 2, height),\n",
    "                ha='center',\n",
    "                va='bottom',\n",
    "                fontsize=10,\n",
    "                fontweight='bold'\n",
    "            )\n",
    "        else:  # Other plots - show numeric values\n",
    "            fmt = f'{p.get_height():,.0f}' if ax == axes[0] else f'{p.get_height():.1f}'\n",
    "            ax.annotate(\n",
    "                fmt,\n",
    "                (p.get_x() + p.get_width() / 2, p.get_height()),\n",
    "                ha='center',\n",
    "                va='bottom',\n",
    "                fontsize=10\n",
    "            )\n",
    "\n",
    "plt.suptitle('Match vs No Match Climbs (All Layouts)', fontsize=14, y=1.02)\n",
    "plt.tight_layout()\n",
    "plt.savefig('../images/01_climb_stats/match_vs_nomatch.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f51156a",
   "metadata": {},
   "source": [
    "So we gather the following about \"no match\" climbs:\n",
    "\n",
    "- they are far fewer than \"match\" climbs,\n",
    "- they are on average harder than \"match\" climbs,\n",
    "- and that they have quite a bit less ascensionists on average."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21e12faa",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Match vs No Match Summary\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "\n",
    "summary = df_agg.pivot_table(\n",
    "    columns='status',\n",
    "    values=['count', 'avg_difficulty', 'avg_ascensionists']\n",
    ").round(2)\n",
    "\n",
    "summary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de8eb20e",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "754abf39",
   "metadata": {},
   "source": [
    "# Prolific statistics\n",
    "\n",
    "Here we will take note of some prolific statistics: what are the most popular climbs and who are the most popular setters?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "935c0ea3",
   "metadata": {},
   "source": [
    "## Most popular climbs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42f93d59",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Most popular climbs\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "# The ascensionist_count column will allow us to easily deduce the top 15 climbs. \n",
    "\n",
    "# Create a DataFrame with the top 15 climbs\n",
    "df_popular_climbs = df.sort_values(by='ascensionist_count', ascending=False).head(15).reset_index(drop=True)\n",
    "\n",
    "# Appropriate index\n",
    "df_popular_climbs.index = df_popular_climbs.index + 1\n",
    "\n",
    "\n",
    "display(df_popular_climbs[['climb_name', 'setter_username', 'angle', 'boulder_grade', 'ascensionist_count']])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d70aa42",
   "metadata": {},
   "source": [
    "It's unsuprising that every one of these climbs is at 40° given that 40° is the most popular angle, by a long shot.\n",
    "\n",
    "What about an angle-agnostic analysis? What are the top climbs amonst all angles?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "404c67aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Top 15 most popular climbs (angle agnostic)\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "# Aggregate by climb_name (sum counts across all angles)\n",
    "df_agg = df.groupby(['climb_name']).agg(\n",
    "    total_ascensionists=('ascensionist_count', 'sum'),\n",
    "    avg_difficulty=('display_difficulty', 'mean')\n",
    ").reset_index()\n",
    "\n",
    "\n",
    "df_agg['avg_boulder_grade'] = df_agg['avg_difficulty'].round().astype(int).map(grade_mapping)\n",
    "\n",
    "# Sort and select top 15\n",
    "df_popular_climbs_aa = df_agg.sort_values(by='total_ascensionists', ascending=False).head(15).reset_index(drop=True)\n",
    "\n",
    "df_popular_climbs_aa.index = df_popular_climbs_aa.index + 1\n",
    "\n",
    "display(df_popular_climbs_aa)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cab87caa",
   "metadata": {},
   "source": [
    "## Prolific setters\n",
    "\n",
    "Next, we will make a simple table of the most prolific setters by board."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3053ddc3",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "==================================\n",
    "Top 10 setters\n",
    "==================================\n",
    "\"\"\"\n",
    "\n",
    "# Make a DataFrame for the setters\n",
    "df_agg = df.groupby(['setter_username']).agg(\n",
    "    climb_count=('uuid', 'nunique')\n",
    ").reset_index()\n",
    "\n",
    "df_setters = df_agg.sort_values(by='climb_count', ascending=False).head(10).reset_index(drop=True)\n",
    "\n",
    "df_setters.index = df_setters.index + 1\n",
    "\n",
    "display(df_setters)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19dd4fad",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "085c99ea",
   "metadata": {},
   "source": [
    "# Conclusion\n",
    "\n",
    "At this point we have a board-level and climb-level picture of the dataset. In particular, we now know:\n",
    "\n",
    "- how large the dataset is,\n",
    "- how the grade and angle distributions vary across layouts,\n",
    "- which climbs and setters appear most often,\n",
    "- and where simple descriptive trends begin to show up.\n",
    "\n",
    "That gives us enough context to move from *global statistics* to *hold-level structure*. The next notebook focuses on hold usage patterns and board heatmaps, where we stop asking only **how many climbs there are** and start asking **which physical parts of the board are driving those climbs**."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}