{
"cells": [
{
"cell_type": "markdown",
"id": "f385cc40",
"metadata": {},
"source": [
"\n",
"# A note on other norms\n",
"\n",
"There are other canonical choices of norms for vectors and matrices. While $L^2$ leads naturally to least-squares problems with closed-form solutions, other norms induce different geometries and different optimal solutions. From the linear algebra perspective, changing the norm affects:\n",
"- the shape of the unit ball,\n",
"- the geometry of approximation,\n",
"- the numerical behaviour of optimization problems. "
]
},
{
"cell_type": "markdown",
"id": "ca50a202",
"metadata": {},
"source": [
"\n",
"## $L^1$ norm (Manhattan distance)\n",
"The $L^1$ norm of a vector $x = (x_1,\\dots,x_p) \\in \\mathbb{R}^p$ is defined as\n",
"$$ \\|x\\|_1 = \\sum_{i=1}^p |x_i|. $$\n",
"Minimizing an $L^1$ loss is less sensitive to outliers than minimizing an $L^2$ loss. Geometrically, the $L^1$ unit ball in $\\mathbb{R}^2$ is a diamond (a rotated square), rather than a circle.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77e7c0b3",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Grid\n",
"xx = np.linspace(-1.2, 1.2, 400)\n",
"yy = np.linspace(-1.2, 1.2, 400)\n",
"X, Y = np.meshgrid(xx, yy)\n",
"\n",
"# Take the $L^1$ norm\n",
"Z = np.abs(X) + np.abs(Y)\n",
"\n",
"plt.figure(figsize=(6,6))\n",
"plt.contour(X, Y, Z, levels=[1])\n",
"plt.contourf(X, Y, Z, levels=[0,1], alpha=0.3)\n",
"\n",
"plt.axhline(0)\n",
"plt.axvline(0)\n",
"plt.gca().set_aspect(\"equal\", adjustable=\"box\")\n",
"plt.title(r\"$L^1$ unit ball: $|x|+|y|\\leq 1$\")\n",
"plt.tight_layout()\n",
"plt.savefig('../images/L1_unit_ball.png')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "ce59565a",
"metadata": {},
"source": [
"\n",
"Consequently, optimization problems involving $L^1$ tend to produce solutions that lie on the corners of this polytope.\n",
"\n",
"$L^1$-based methods (such as LASSO) tend to set coefficients to be exactly zero. Unlike with $L^2$, the minimization problem for $L^1$ does not admit a closed-form solution. Common algorithms include:\n",
"- linear programming formulations,\n",
"- iterative reweighted least squares,\n",
"- coordinate descent methods.\n"
]
},
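{
"cell_type": "markdown",
"id": "a1f3c2d0",
"metadata": {},
"source": [
"To see the outlier-robustness concretely, consider the one-dimensional case: minimizing $\\sum_i (x_i - c)^2$ over $c$ gives the mean, while minimizing $\\sum_i |x_i - c|$ gives the median. A small sketch:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2e4d3c1",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Data with one large outlier\n",
"data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])\n",
"\n",
"# The L2 loss is minimized by the mean, the L1 loss by the median\n",
"c_l2 = data.mean()      # dragged toward the outlier: 22.0\n",
"c_l1 = np.median(data)  # barely affected: 3.0\n",
"\n",
"print(c_l2, c_l1)"
]
},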
{
"cell_type": "markdown",
"id": "d9c50d8e",
"metadata": {},
"source": [
"## $L^{\\infty}$ norm (max/supremum norm)\n",
"The supremum norm, defined as\n",
"$$ \\|x\\|_{\\infty} = \\max_i |x_i|, $$\n",
"seeks to control the worst-case error rather than the average error. Minimizing this norm is related to Chebyshev approximation by polynomials. \n",
"\n",
"Geometrically, the unit ball of $\\mathbb{R}^2$ with respect to the $L^{\\infty}$ norm looks like a square.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2724a3bc",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Grid\n",
"xx = np.linspace(-1.2, 1.2, 400)\n",
"yy = np.linspace(-1.2, 1.2, 400)\n",
"X, Y = np.meshgrid(xx, yy)\n",
"\n",
"# Take the $L^{\\infty}$ norm\n",
"Z = np.maximum(np.abs(X), np.abs(Y))\n",
"\n",
"plt.figure(figsize=(6,6))\n",
"plt.contour(X, Y, Z, levels=[1])\n",
"plt.contourf(X, Y, Z, levels=[0,1], alpha=0.3)\n",
"\n",
"plt.axhline(0)\n",
"plt.axvline(0)\n",
"plt.gca().set_aspect(\"equal\", adjustable=\"box\")\n",
"plt.title(r\"$L^{\\infty}$ unit ball: $\\max\\{|x|,|y|\\} \\leq 1$\")\n",
"plt.tight_layout()\n",
"plt.savefig('../images/Linf_unit_ball.png')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "55c4ce17",
"metadata": {},
"source": [
"\n",
"Problems involving the $L^{\\infty}$ norm are often formulated as linear programs, and are useful when worst-case guarantees are more important than optimizing average performance. "
]
},
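{
"cell_type": "markdown",
"id": "9f2a6b1c",
"metadata": {},
"source": [
"As a sketch of such a linear program (assuming SciPy is available), the Chebyshev line fit $\\min_{a,b} \\max_i |a x_i + b - y_i|$ can be solved by introducing a slack variable $t$ that bounds every residual:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e6c4a9d",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.optimize import linprog\n",
"\n",
"# Minimax (L-infinity) line fit: minimize t subject to |a*x_i + b - y_i| <= t\n",
"rng = np.random.default_rng(0)\n",
"x = np.linspace(0, 1, 20)\n",
"y = 2 * x + 1 + rng.normal(0, 0.1, x.size)\n",
"\n",
"ones = np.ones((x.size, 1))\n",
"c = np.array([0.0, 0.0, 1.0])  # variables (a, b, t); objective: minimize t\n",
"A_ub = np.block([\n",
"    [ x[:, None],  ones, -ones],  #   a*x + b - y <= t\n",
"    [-x[:, None], -ones, -ones],  # -(a*x + b - y) <= t\n",
"])\n",
"b_ub = np.concatenate([y, -y])\n",
"res = linprog(c, A_ub=A_ub, b_ub=b_ub,\n",
"              bounds=[(None, None), (None, None), (0, None)])\n",
"a, b, t = res.x\n",
"print(a, b, t)  # slope, intercept, worst-case residual"
]
},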
{
"cell_type": "markdown",
"id": "d393c069",
"metadata": {},
"source": [
"\n",
"## Matrix norms\n",
"\n",
"There are also various norms on matrices, each highlighting a different aspect of the associated linear transformation.\n",
"- **Frobenius norm**. This is an important norm, essentially the analogue of the $L^2$ norm for matrices. It is the Euclidean norm if you think of your matrix as a vector, forgetting its rectangular shape. For $A = (a_{ij})$ a matrix, the Frobenius norm \n",
" $$ \\|A\\|_F = \\sqrt{\\sum_{i,j} a_{ij}^2} $$\n",
" is the square root of the sum of the squares of all the entries. It is invariant under orthogonal transformations on either side. As we'll see, it plays a central role in:\n",
" - least-squares problems,\n",
" - low-rank approximation,\n",
" - principal component analysis.\n",
"\n",
" In particular, the truncated SVD yields a best low-rank approximation of a matrix with respect to the Frobenius norm.\n",
"\n",
" We also note that the Frobenius norm can be written in terms of tracial data. We have that\n",
" $$ \\|A\\|_F^2 = \\text{Tr}(A^TA) = \\text{Tr}(AA^T). $$\n",
"- **Operator norm** (spectral norm). This is just the norm as an operator $A: \\mathbb{R}^p \\to \\mathbb{R}^n$, where $\\mathbb{R}^p$ and $\\mathbb{R}^n$ are thought of as Hilbert spaces:\n",
" $$ \\|A\\| = \\max_{\\|x\\|_2 = 1}\\|Ax\\|_2. $$\n",
" This norm measures how big of an amplification $A$ can apply, and is equal to the largest singular value of $A$. This norm is related to stability properties, and is the analogue of the $L^{\\infty}$ norm.\n",
"- **Nuclear norm**. The nuclear norm, defined as\n",
" $$ \\|A\\|_* = \\sum \\sigma_i, $$\n",
" is the sum of the singular values. When $A$ is square, this is precisely the trace-class norm, and is the analogue of the $L^1$ norm. Minimizing the nuclear norm acts as a convex surrogate for minimizing rank, just as minimizing $L^1$ promotes sparsity. "
]
},
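{
"cell_type": "markdown",
"id": "7a1d5c3b",
"metadata": {},
"source": [
"All three of these norms are available through `numpy.linalg.norm`, and each can be checked against the singular values directly:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8b4f620",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"A = rng.normal(size=(5, 3))\n",
"sigma = np.linalg.svd(A, compute_uv=False)  # singular values, descending\n",
"\n",
"fro = np.linalg.norm(A, 'fro')  # sqrt of the sum of squared entries\n",
"op  = np.linalg.norm(A, 2)      # largest singular value\n",
"nuc = np.linalg.norm(A, 'nuc')  # sum of the singular values\n",
"\n",
"# Each norm in terms of the singular values\n",
"print(np.isclose(fro, np.sqrt(np.sum(sigma**2))))  # True\n",
"print(np.isclose(op, sigma[0]))                    # True\n",
"print(np.isclose(nuc, sigma.sum()))                # True\n",
"\n",
"# Frobenius norm from tracial data: ||A||_F^2 = Tr(A^T A)\n",
"print(np.isclose(fro**2, np.trace(A.T @ A)))       # True"
]
},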
{
"cell_type": "markdown",
"id": "4ee62ee0",
"metadata": {},
"source": [
"\n",
"# A note on regularization\n",
"\n",
"Regularization introduces additional constraints or penalties to stabilize ill-posed problems. From the linear algebra point of view, regularization modifies the singular value structure of a problem. \n",
"- **Ridge regression**: add a positive multiple $\\lambda\\cdot I$ of the identity to $X^TX$, which artificially inflates small singular values. This dampens unstable directions while leaving well-conditioned directions largely unaffected.\n",
" \n",
"Geometrically, regularization reshapes the solution space to suppress directions that are poorly supported by the data."
]
},
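{
"cell_type": "markdown",
"id": "5d4e8a7b",
"metadata": {},
"source": [
"As a small sketch of this shrinkage: in the SVD basis, ordinary least squares scales components by $1/\\sigma_i$, while ridge scales them by $\\sigma_i/(\\sigma_i^2 + \\lambda)$, so directions with tiny $\\sigma_i$ no longer blow up.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0c3a9d7e",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(1)\n",
"X = rng.normal(size=(50, 3))\n",
"X[:, 2] = X[:, 0] + 1e-6 * rng.normal(size=50)  # nearly dependent column\n",
"y = rng.normal(size=50)\n",
"\n",
"lam = 1.0\n",
"# Ridge: solve (X^T X + lam*I) beta = X^T y\n",
"beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)\n",
"beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]\n",
"\n",
"# OLS blows up along the ill-conditioned direction; ridge stays tame\n",
"print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))"
]
},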
{
"cell_type": "markdown",
"id": "be3b8c1e",
"metadata": {},
"source": [
"\n",
"# A note on solving multiple targets concurrently\n",
"\n",
"Suppose now that we were interested in solving several problems concurrently; that is, given some data points, we would like to make $k$ predictions. Say we have our $n \\times p$ data matrix $X$, and we want to make $k$ predictions $y_1,\\dots,y_k$. We can then set the problem up as finding a best solution to the matrix equation\n",
"$$ XB = Y $$\n",
"where $B$ will be a $p \\times k$ matrix of parameters and $Y$ will be the $n \\times k$ matrix whose columns are $y_1,\\dots,y_k$. "
]
},
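{
"cell_type": "markdown",
"id": "1c9b3f6d",
"metadata": {},
"source": [
"NumPy's `lstsq` accepts a matrix right-hand side, so all $k$ least-squares problems are solved in one call. A minimal sketch with synthetic data:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5e2b1a8",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"n, p, k = 100, 4, 3\n",
"X = rng.normal(size=(n, p))\n",
"B_true = rng.normal(size=(p, k))\n",
"Y = X @ B_true + 0.01 * rng.normal(size=(n, k))\n",
"\n",
"# One call solves all k least-squares problems, column by column\n",
"B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]\n",
"print(B_hat.shape)  # (4, 3): one column of coefficients per target"
]
},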
{
"cell_type": "markdown",
"id": "908cd528",
"metadata": {},
"source": [
"\n",
"# What can go wrong?\n",
"\n",
"We are often dealing with imperfect data, so there is plenty that could go wrong. Here are some basic cases of where things can break down.\n",
"\n",
"- **Perfect multicollinearity**: non-invertible $\\tilde{X}^T\\tilde{X}$. This happens when one feature is a perfect linear combination of the others. It means that the columns of the matrix $\\tilde{X}$ are linearly dependent, and so infinitely many solutions will exist to the least-squares problem. \n",
" - For example, if you are looking at characteristics of people and you have height in both inches and centimeters.\n"
]
},
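{
"cell_type": "markdown",
"id": "8e2d7a4c",
"metadata": {},
"source": [
"The inches/centimeters example can be reproduced directly (the feature names here are illustrative):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9d6e3c0",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"height_in = rng.uniform(60, 75, size=20)\n",
"height_cm = 2.54 * height_in  # exact multiple of the first feature\n",
"\n",
"X = np.column_stack([np.ones(20), height_in, height_cm])\n",
"print(np.linalg.matrix_rank(X))  # 2, not 3: the columns are dependent\n",
"\n",
"# X^T X is singular, so inverting it fails (or returns garbage);\n",
"# lstsq still returns one of the infinitely many minimizers\n",
"y = rng.normal(size=20)\n",
"beta = np.linalg.lstsq(X, y, rcond=None)[0]"
]
},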
{
"cell_type": "markdown",
"id": "a1d9e0ed",
"metadata": {},
"source": [
"- **Almost multicollinearity**: this happens when one feature is **almost** a perfect linear combination of the others. From the linear algebra perspective, the columns of $\\tilde{X}$ might not be dependent, but they will be **almost** linearly dependent. This causes numerical problems: the condition number becomes large and amplifies numerical errors, and the inverse blows up small spectral components. \n"
]
},
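{
"cell_type": "markdown",
"id": "6a5f1d9e",
"metadata": {},
"source": [
"A sketch of the instability: with two nearly identical columns, the condition number is enormous, and a tiny perturbation of the targets moves the least-squares solution by a comparatively large amount.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a2c8f471",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"x1 = rng.normal(size=100)\n",
"x2 = x1 + 1e-8 * rng.normal(size=100)  # almost a copy of x1\n",
"X = np.column_stack([x1, x2])\n",
"\n",
"print(np.linalg.cond(X))  # huge: the columns are nearly dependent\n",
"\n",
"y = x1 + x2  # exact solution is (1, 1)\n",
"b1 = np.linalg.lstsq(X, y, rcond=None)[0]\n",
"b2 = np.linalg.lstsq(X, y + 1e-6 * rng.normal(size=100), rcond=None)[0]\n",
"print(np.linalg.norm(b1 - b2))  # large relative to the 1e-6 perturbation"
]
},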
{
"cell_type": "markdown",
"id": "fa57c3a4",
"metadata": {},
"source": [
"\n",
"- **More features than observations**: this means that our matrix $\\tilde{X}$ will be wider than it is tall. Necessarily, the columns are then linearly dependent, so the least-squares solution is not unique. Regularization or dimensionality reduction becomes essential.\n"
]
},
{
"cell_type": "markdown",
"id": "f5fff0b5",
"metadata": {},
"source": [
"\n",
"- **Redundant or constant features**: this is when there is a characteristic that is satisfied by each observation. In terms of the linear algebraic data, this means that one of the columns of $X$ is constant.\n",
" - e.g., if you are looking at characteristics of penguins, and you have \"# of legs\". This will always be two, and doesn't add anything to the analysis.\n"
]
},
{
"cell_type": "markdown",
"id": "ed7d745d",
"metadata": {},
"source": [
"\n",
"- **Underfitting**: the model lacks sufficient expressivity to capture the underlying structure. For example, see the section on polynomial regression -- sometimes one might want a curve vs. a straight line."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2de2ed0c",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# 1) Generate quadratic data\n",
"np.random.seed(3)\n",
"\n",
"n = 50\n",
"x = np.random.uniform(-5, 5, n) # symmetric, wider range\n",
"\n",
"# True relationship: y = ax^2 + c + noise\n",
"a_true = 2.0\n",
"c_true = 5.0\n",
"noise = np.random.normal(0, 3, n)\n",
"\n",
"y = a_true * x**2 + c_true + noise\n",
"\n",
"# 2) Fit a straight line (degree 1) to the quadratic data: it underfits\n",
"a, b = np.polyfit(x, y, 1)\n",
"\n",
"# 3) Scatter the data and overlay the (underfitting) linear fit\n",
"plt.scatter(x, y, label=\"Data\")\n",
"xs = np.sort(x)\n",
"plt.plot(xs, a * xs + b, 'r', linewidth=1, label=\"Linear fit\")\n",
"plt.legend()\n",
"plt.title(\"Underfitting: linear fit to quadratic data\")\n",
"plt.show()\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "dbb01960",
"metadata": {},
"source": [
"- **Overfitting**: the model captures noise rather than structure, often because the model is too complex relative to the data size. Polynomial regression gives a nice visualization of overfitting. For example, if we work with the same generated quadratic data from the polynomial regression section and try to approximate it by a degree 11 polynomial, we get the following.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "43ab6a3f",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# 1) Generate quadratic data\n",
"np.random.seed(3)\n",
"\n",
"n = 50\n",
"x = np.random.uniform(-5, 5, n)\n",
"\n",
"a_true = 2.0\n",
"c_true = 5.0\n",
"noise = np.random.normal(0, 3, n)\n",
"\n",
"y = a_true * x**2 + c_true + noise\n",
"\n",
"# 2) Fit degree 11 polynomial\n",
"coeffs = np.polyfit(x, y, 11)\n",
"\n",
"# Create polynomial function\n",
"p = np.poly1d(coeffs)\n",
"\n",
"# 3) Sort x for smooth plotting\n",
"x_sorted = np.linspace(min(x), max(x), 500)\n",
"\n",
"# 4) Plot\n",
"plt.scatter(x, y, label=\"Data\")\n",
"plt.plot(x_sorted, p(x_sorted), 'r', linewidth=2, label=\"Degree 11 fit\")\n",
"\n",
"plt.legend()\n",
"plt.title(\"Degree 11 Polynomial Fit\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "3aa62d78",
"metadata": {},
"source": [
"\n",
"- **Outliers**: large deviations can dominate the $L^2$ norm, so a few extreme points can drag the fit. Robust alternatives, such as $L^1$-based losses, can help.\n"
]
},
{
"cell_type": "markdown",
"id": "86606942",
"metadata": {},
"source": [
"\n",
"- **Heteroscedasticity**: this is when the variance of the noise changes across observations. Ordinary least squares remains unbiased, but it is no longer the most efficient linear estimator; weighted least squares can be preferable."
]
},
{
"cell_type": "markdown",
"id": "04e122fb",
"metadata": {},
"source": [
"\n",
"- **Condition number**: a large condition number indicates numerical instability and sensitivity to perturbation, even when formal solutions exist."
]
},
{
"cell_type": "markdown",
"id": "b5e424fd",
"metadata": {},
"source": [
"\n",
"- **Insufficient tolerance**: in numerical algorithms, thresholds used to determine rank or invertibility must be chosen carefully. Poor choices can lead to misleading results."
]
},
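{
"cell_type": "markdown",
"id": "4b7c3e8a",
"metadata": {},
"source": [
"For instance, NumPy's `matrix_rank` answers differently depending on the tolerance it is given:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0f2b8a1",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"A = np.array([[1.0, 1.0],\n",
"              [1.0, 1.0 + 1e-10]])\n",
"\n",
"# The reported rank depends on the threshold below which a\n",
"# singular value is treated as zero\n",
"print(np.linalg.matrix_rank(A))            # 2 with the default tolerance\n",
"print(np.linalg.matrix_rank(A, tol=1e-8))  # 1: the tiny singular value is dropped"
]
},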
{
"cell_type": "markdown",
"id": "8d1dd798",
"metadata": {},
"source": [
"\n",
"\n",
"The point is that many failures in data science are not conceptual but geometric and numerical. Poor choices lead to poor results. \n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}