{
"cells": [
{
"cell_type": "markdown",
"id": "f385cc40",
"metadata": {},
"source": [
"\n",
"# A note on other norms\n",
"\n",
"There are other canonical choices of norms for vectors and matrices. While $L^2$ leads naturally to least-squares problems with closed-form solutions, other norms induce different geometries and different optimal solutions. From the linear algebra perspective, changing the norm affects:\n",
"- the shape of the unit ball,\n",
"- the geometry of approximation,\n",
"- the numerical behaviour of optimization problems."
]
},
{
"cell_type": "markdown",
"id": "ca50a202",
"metadata": {},
"source": [
"\n",
"## $L^1$ norm (Manhattan distance)\n",
"The $L^1$ norm of a vector $x = (x_1,\\dots,x_p) \\in \\mathbb{R}^p$ is defined as\n",
"$$ \\|x\\|_1 = \\sum_{i=1}^p |x_i|. $$\n",
"Objectives based on the $L^1$ norm are less sensitive to outliers than their $L^2$ counterparts, since large residuals are not squared. Geometrically, the $L^1$ unit ball in $\\mathbb{R}^2$ is a diamond (a rotated square), rather than a circle.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77e7c0b3",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Grid\n",
"xx = np.linspace(-1.2, 1.2, 400)\n",
"yy = np.linspace(-1.2, 1.2, 400)\n",
"X, Y = np.meshgrid(xx, yy)\n",
"\n",
"# Take the $L^1$ norm\n",
"Z = np.abs(X) + np.abs(Y)\n",
"\n",
"plt.figure(figsize=(6,6))\n",
"plt.contour(X, Y, Z, levels=[1])\n",
"plt.contourf(X, Y, Z, levels=[0,1], alpha=0.3)\n",
"\n",
"plt.axhline(0)\n",
"plt.axvline(0)\n",
"plt.gca().set_aspect(\"equal\", adjustable=\"box\")\n",
"plt.title(r\"$L^1$ unit ball: $|x|+|y|\\leq 1$\")\n",
"plt.tight_layout()\n",
"plt.savefig('../images/L1_unit_ball.png')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "ce59565a",
"metadata": {},
"source": [
"\n",
"Consequently, optimization problems involving $L^1$ tend to produce solutions that live on the corners of this polytope.\n",
"\n",
"$L^1$-based methods (such as LASSO) tend to set coefficients to exactly zero. Unlike with $L^2$, the minimization problem for $L^1$ does not admit a closed-form solution. Algorithms include:\n",
"- linear programming formulations,\n",
"- iterative reweighted least squares,\n",
"- coordinate descent methods.\n"
]
},
{
"cell_type": "markdown",
"id": "d9c50d8e",
"metadata": {},
"source": [
"## $L^{\\infty}$ norm (max/supremum norm)\n",
"The supremum norm, defined as\n",
"$$ \\|x\\|_{\\infty} = \\max_i |x_i|, $$\n",
"seeks to control the worst-case error rather than the average error. Minimizing this norm is related to Chebyshev approximation by polynomials. \n",
"\n",
"Geometrically, the unit ball of $\\mathbb{R}^2$ with respect to the $L^{\\infty}$ norm is a square.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2724a3bc",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Grid\n",
"xx = np.linspace(-1.2, 1.2, 400)\n",
"yy = np.linspace(-1.2, 1.2, 400)\n",
"X, Y = np.meshgrid(xx, yy)\n",
"\n",
"# Take the $L^{\\infty}$ norm\n",
"Z = np.maximum(np.abs(X), np.abs(Y))\n",
"\n",
"plt.figure(figsize=(6,6))\n",
"plt.contour(X, Y, Z, levels=[1])\n",
"plt.contourf(X, Y, Z, levels=[0,1], alpha=0.3)\n",
"\n",
"plt.axhline(0)\n",
"plt.axvline(0)\n",
"plt.gca().set_aspect(\"equal\", adjustable=\"box\")\n",
"plt.title(r\"$L^{\\infty}$ unit ball: $\\max\\{|x|,|y|\\} \\leq 1$\")\n",
"plt.tight_layout()\n",
"plt.savefig('../images/Linf_unit_ball.png')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "55c4ce17",
"metadata": {},
"source": [
"\n",
"Problems involving the $L^{\\infty}$ norm are often formulated as linear programs, and are useful when worst-case guarantees are more important than optimizing average performance."
]
},
{
"cell_type": "markdown",
"id": "d393c069",
"metadata": {},
"source": [
"\n",
"## Matrix norms\n",
"\n",
"There are also various norms on matrices, each highlighting a different aspect of the associated linear transformation.\n",
"- **Frobenius norm**. This is an important norm, essentially the analogue of the $L^2$ norm for matrices. It is the Euclidean norm if you think of your matrix as a vector, forgetting its rectangular shape. For a matrix $A = (a_{ij})$, the Frobenius norm \n",
"  $$ \\|A\\|_F = \\sqrt{\\sum_{i,j} a_{ij}^2} $$\n",
"  is the square root of the sum of squares of all the entries. This treats a matrix as a long vector and is invariant under orthogonal transformations. As we'll see, it plays a central role in:\n",
"  - least-squares problems,\n",
"  - low-rank approximation,\n",
"  - principal component analysis.\n",
"\n",
"  In particular, the truncated SVD yields a best low-rank approximation of a matrix with respect to the Frobenius norm.\n",
"\n",
"  We also note that the Frobenius norm can be written in terms of the trace:\n",
"  $$ \\|A\\|_F^2 = \\text{Tr}(A^TA) = \\text{Tr}(AA^T). $$\n",
"- **Operator norm** (spectral norm). This is just the norm of $A$ as an operator $A: \\mathbb{R}^p \\to \\mathbb{R}^n$, where $\\mathbb{R}^p$ and $\\mathbb{R}^n$ are thought of as Hilbert spaces:\n",
"  $$ \\|A\\| = \\max_{\\|x\\|_2 = 1}\\|Ax\\|_2. $$\n",
"  This norm measures how big of an amplification $A$ can apply, and is equal to the largest singular value of $A$. This norm is related to stability properties, and is the analogue of the $L^{\\infty}$ norm.\n",
"- **Nuclear norm**. The nuclear norm, defined as\n",
"  $$ \\|A\\|_* = \\sum_i \\sigma_i, $$\n",
"  is the sum of the singular values. When $A$ is square, this is precisely the trace-class norm, and it is the analogue of the $L^1$ norm. This norm serves as a convex surrogate for the rank.\n"
]
},
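{
"cell_type": "markdown",
"id": "7f3a9c01",
"metadata": {},
"source": [
"\n",
"As a quick numerical sanity check (the matrix below is an arbitrary example), NumPy computes all three norms directly: `ord='fro'`, `ord=2`, and `ord='nuc'` in `np.linalg.norm` give the Frobenius, operator, and nuclear norms, and each matches the $L^2$, $L^{\\infty}$, or $L^1$ norm of the vector of singular values.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f3a9c02",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"A = np.array([[3.0, 0.0], [4.0, 5.0]])\n",
"\n",
"# Singular values of A\n",
"sigma = np.linalg.svd(A, compute_uv=False)\n",
"\n",
"# Frobenius = L2 of singular values, operator = max, nuclear = sum\n",
"print(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(sigma**2)))\n",
"print(np.linalg.norm(A, 2), sigma.max())\n",
"print(np.linalg.norm(A, 'nuc'), sigma.sum())"
]
},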
{
"cell_type": "markdown",
"id": "4ee62ee0",
"metadata": {},
"source": [
"\n",
"# A note on regularization\n",
"\n",
"Regularization introduces additional constraints or penalties to stabilize ill-posed problems. From the linear algebra point of view, regularization modifies the singular value structure of a problem. \n",
"- **Ridge regression**: add a positive multiple $\\lambda\\cdot I$ of the identity to $X^TX$. Since the eigenvalues of $X^TX + \\lambda I$ are $\\sigma_i^2 + \\lambda$, this lifts the small singular values away from zero.\n",
"- This dampens unstable directions while leaving well-conditioned directions largely unaffected.\n",
" \n",
"Geometrically, regularization reshapes the solution space to suppress directions that are poorly supported by the data."
]
},
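{
"cell_type": "markdown",
"id": "7f3a9c03",
"metadata": {},
"source": [
"\n",
"A minimal sketch of the ridge idea (synthetic data; $\\lambda = 1$ is an arbitrary choice): adding $\\lambda I$ to $X^TX$ leaves the large singular directions essentially unchanged but keeps the system well conditioned, even when two columns are nearly dependent.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f3a9c04",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"n, p = 50, 3\n",
"X = rng.normal(size=(n, p))\n",
"X[:, 2] = X[:, 0] + 1e-6 * rng.normal(size=n)  # nearly dependent column\n",
"y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=n)\n",
"\n",
"lam = 1.0\n",
"gram = X.T @ X\n",
"beta_ridge = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)\n",
"\n",
"# Regularization drastically improves the conditioning\n",
"print(np.linalg.cond(gram), np.linalg.cond(gram + lam * np.eye(p)))\n",
"print(beta_ridge)"
]
},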
{
"cell_type": "markdown",
"id": "be3b8c1e",
"metadata": {},
"source": [
"\n",
"# A note on solving multiple targets concurrently\n",
"\n",
"Suppose now that we are interested in solving several problems concurrently; that is, given some data points, we would like to make $k$ predictions. Say we have our $n \\times p$ data matrix $X$, and we want to make $k$ predictions $y_1,\\dots,y_k$. We can then set the problem up as finding a best solution to the matrix equation\n",
"$$ XB = Y $$\n",
"where $B$ is a $p \\times k$ matrix of parameters and $Y$ is the $n \\times k$ matrix whose columns are $y_1,\\dots,y_k$. "
]
},
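{
"cell_type": "markdown",
"id": "7f3a9c05",
"metadata": {},
"source": [
"\n",
"In code, no new machinery is needed (a sketch with synthetic data): `np.linalg.lstsq` accepts a matrix right-hand side, solving all $k$ least-squares problems at once and returning the $p \\times k$ matrix $B$.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f3a9c06",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(1)\n",
"n, p, k = 40, 3, 2\n",
"X = rng.normal(size=(n, p))\n",
"B_true = rng.normal(size=(p, k))\n",
"Y = X @ B_true + 0.01 * rng.normal(size=(n, k))\n",
"\n",
"# One call solves all k regression problems, column by column\n",
"B, residuals, rank, sigma = np.linalg.lstsq(X, Y, rcond=None)\n",
"print(B.shape)  # (3, 2): one column of coefficients per target\n"
]
},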
{
"cell_type": "markdown",
"id": "908cd528",
"metadata": {},
"source": [
"\n",
"# What can go wrong?\n",
"\n",
"We are often dealing with imperfect data, so there is plenty that could go wrong. Here are some basic cases of where things can break down.\n",
"\n",
"- **Perfect multicollinearity**: non-invertible $\\tilde{X}^T\\tilde{X}$. This happens when one feature is a perfect linear combination of the others. This means that the columns of the matrix $\\tilde{X}$ are linearly dependent, and so infinitely many solutions exist to the least-squares problem. \n",
"  - For example, if you are looking at characteristics of people and you record height in both inches and centimeters.\n"
]
},
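{
"cell_type": "markdown",
"id": "7f3a9c07",
"metadata": {},
"source": [
"\n",
"The height example can be made concrete (synthetic heights): since centimeters are an exact multiple of inches, the third column below is a scalar multiple of the second, and the rank drops.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f3a9c08",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(2)\n",
"height_in = rng.uniform(60, 75, size=20)\n",
"\n",
"# Intercept, height in inches, height in centimeters (an exact multiple)\n",
"X = np.column_stack([np.ones(20), height_in, 2.54 * height_in])\n",
"\n",
"print(np.linalg.matrix_rank(X))  # 2, not 3: the columns are dependent\n",
"print(np.linalg.cond(X.T @ X))   # enormous: X^T X is (numerically) singular\n"
]
},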
{
"cell_type": "markdown",
"id": "a1d9e0ed",
"metadata": {},
"source": [
"- **Almost multicollinearity**: this happens when one feature is **almost** a perfect linear combination of the others. From the linear algebra perspective, the columns of $\\tilde{X}$ might not be dependent, but they will be **almost** linearly dependent. This causes problems in computation: the condition number becomes large and amplifies numerical errors, and the inverse blows up small spectral components. \n"
]
},
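{
"cell_type": "markdown",
"id": "7f3a9c09",
"metadata": {},
"source": [
"\n",
"A small sketch of this amplification (synthetic data; the perturbation sizes are arbitrary): with two nearly identical columns, a tiny change in $y$ produces a disproportionately large change in the normal-equations solution.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f3a9c10",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(5)\n",
"n = 30\n",
"x1 = rng.normal(size=n)\n",
"x2 = x1 + 1e-4 * rng.normal(size=n)  # almost a copy of x1\n",
"X = np.column_stack([x1, x2])\n",
"y = x1 + x2 + 0.01 * rng.normal(size=n)\n",
"\n",
"beta = np.linalg.solve(X.T @ X, X.T @ y)\n",
"y_pert = y + 0.001 * rng.normal(size=n)  # tiny perturbation of the targets\n",
"beta_pert = np.linalg.solve(X.T @ X, X.T @ y_pert)\n",
"\n",
"print(np.linalg.cond(X.T @ X))         # very large condition number\n",
"print(np.abs(beta - beta_pert).max())  # far larger than the 0.001 perturbation\n"
]
},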
{
"cell_type": "markdown",
"id": "fa57c3a4",
"metadata": {},
"source": [
"\n",
"- **More features than observations**: this means that our matrix $\\tilde{X}$ is wider than it is tall. Necessarily, the columns are then linearly dependent. Regularization or dimensionality reduction becomes essential.\n"
]
},
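{
"cell_type": "markdown",
"id": "7f3a9c11",
"metadata": {},
"source": [
"\n",
"A one-line check (a random matrix stands in for data): with $5$ observations and $10$ features, the rank can be at most $5$, so the $10$ columns cannot be linearly independent.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f3a9c12",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(4)\n",
"X = rng.normal(size=(5, 10))  # more features (columns) than observations (rows)\n",
"\n",
"print(np.linalg.matrix_rank(X))  # at most 5, so the 10 columns must be dependent\n"
]
},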
{
"cell_type": "markdown",
"id": "f5fff0b5",
"metadata": {},
"source": [
"\n",
"- **Redundant or constant features**: this is when some feature takes the same value for every observation. In terms of the linear algebraic data, this means that one of the columns of $X$ is constant.\n",
"  - e.g., if you are looking at characteristics of penguins, and you have \"# of legs\". This will always be two, and doesn't add anything to the analysis.\n"
]
},
{
"cell_type": "markdown",
"id": "ed7d745d",
"metadata": {},
"source": [
"\n",
"- **Underfitting**: the model lacks sufficient expressivity to capture the underlying structure. For example, see the section on polynomial regression -- sometimes one might want a curve vs. a straight line."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2de2ed0c",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# 1) Generate quadratic data\n",
"np.random.seed(3)\n",
"\n",
"n = 50\n",
"x = np.random.uniform(-5, 5, n)  # symmetric, wider range\n",
"\n",
"# True relationship: y = ax^2 + c + noise\n",
"a_true = 2.0\n",
"c_true = 5.0\n",
"noise = np.random.normal(0, 3, n)\n",
"\n",
"y = a_true * x**2 + c_true + noise\n",
"\n",
"# 2) Find the line of best fit (degree 1)\n",
"a, b = np.polyfit(x, y, 1)\n",
"\n",
"# 3) Plot the data and the underfitting linear model\n",
"plt.scatter(x, y)\n",
"plt.plot(x, a * x + b, 'r', linewidth=1)\n",
"plt.title(\"Linear fit to quadratic data (underfitting)\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "dbb01960",
"metadata": {},
"source": [
"- **Overfitting**: the model captures noise rather than structure, often due to model complexity relative to data size. Polynomial regression gives a nice visualization of overfitting. For example, if we work with the same generated quadratic data from the polynomial regression section and try to approximate it by a degree 11 polynomial, we get the following.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "43ab6a3f",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# 1) Generate quadratic data\n",
"np.random.seed(3)\n",
"\n",
"n = 50\n",
"x = np.random.uniform(-5, 5, n)\n",
"\n",
"a_true = 2.0\n",
"c_true = 5.0\n",
"noise = np.random.normal(0, 3, n)\n",
"\n",
"y = a_true * x**2 + c_true + noise\n",
"\n",
"# 2) Fit degree 11 polynomial\n",
"coeffs = np.polyfit(x, y, 11)\n",
"\n",
"# Create polynomial function\n",
"p = np.poly1d(coeffs)\n",
"\n",
"# 3) Sort x for smooth plotting\n",
"x_sorted = np.linspace(min(x), max(x), 500)\n",
"\n",
"# 4) Plot\n",
"plt.scatter(x, y, label=\"Data\")\n",
"plt.plot(x_sorted, p(x_sorted), 'r', linewidth=2, label=\"Degree 11 fit\")\n",
"\n",
"plt.legend()\n",
"plt.title(\"Degree 11 Polynomial Fit\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "3aa62d78",
"metadata": {},
"source": [
"\n",
"- **Outliers**: large deviations can dominate the $L^2$ norm and skew the fit; robust alternatives such as $L^1$-based losses are less affected.\n"
]
},
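{
"cell_type": "markdown",
"id": "7f3a9c13",
"metadata": {},
"source": [
"\n",
"A tiny illustration (toy numbers): the mean minimizes the sum of squared deviations and the median minimizes the sum of absolute deviations, so a single outlier drags the $L^2$ summary far more than the $L^1$ one.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f3a9c14",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one large outlier\n",
"\n",
"print(data.mean())      # 22.0: the L2 minimizer is dragged toward the outlier\n",
"print(np.median(data))  # 3.0: the L1 minimizer barely moves\n"
]
},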
{
"cell_type": "markdown",
"id": "86606942",
"metadata": {},
"source": [
"\n",
"- **Heteroscedasticity**: this is when the variance of the noise changes across observations. The constant-variance assumption behind ordinary least squares breaks down."
]
},
{
"cell_type": "markdown",
"id": "04e122fb",
"metadata": {},
"source": [
"\n",
"- **Condition number**: a large condition number indicates numerical instability and sensitivity to perturbation, even when formal solutions exist."
]
},
{
"cell_type": "markdown",
"id": "b5e424fd",
"metadata": {},
"source": [
"\n",
"- **Insufficient tolerance**: in numerical algorithms, thresholds used to determine rank or invertibility (such as `rcond` in NumPy's `lstsq`) must be chosen carefully. Poor choices can lead to misleading results."
]
},
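{
"cell_type": "markdown",
"id": "7f3a9c15",
"metadata": {},
"source": [
"\n",
"For instance (toy matrix; the tolerance value is arbitrary), NumPy's `matrix_rank` counts singular values above a threshold, and the answer depends on that threshold:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f3a9c16",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"A = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-10]])\n",
"\n",
"print(np.linalg.matrix_rank(A))            # 2 with the default tolerance\n",
"print(np.linalg.matrix_rank(A, tol=1e-8))  # 1: the tiny direction is treated as noise\n"
]
},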
{
"cell_type": "markdown",
"id": "8d1dd798",
"metadata": {},
"source": [
"\n",
"\n",
"The point is that many failures in data science are not conceptual; they are geometric and numerical. Poor choices of norm, model, or tolerance lead to poor results. \n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}