{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Least Squares Regression: A Linear Algebra Perspective\n", "\n", "## Introduction\n", "\n", "This is meant to be a not entirely comprehensive introduction to Data Science for the Linear Algebraist. There are of course many other complicated topics, but this is just to get the essence of data science (and the tools involved) from the perspective of someone with a strong linear algebra background.\n", "\n", "One of the most fundamental questions of data science is the following. \n", "\n", "> **Question**: Given observed data, how can we predict certain targets?\n", "\n", "The answer of course boils down to linear algebra, and we will begin by translating data science terms and concepts into linear algebraic ones. But first, as should be common practice for the linear algebraist, an example.\n", "\n", "> **Example**. Suppose that we observe $n=3$ houses, and for each house we record\n", "> - the square footage,\n", "> - the number of bedrooms,\n", "> - and additionally the sale price.\n", "> \n", "> So we have a table as follows.\n", ">\n", "> |House | Square ft | Bedrooms | Price (in $1000s) |\n", "> | --- | --- | --- | --- |\n", "> | 0 | 1600 | 3 | 500 |\n", "> | 1 | 2100 | 4 | 650 |\n", "> | 2 | 1550 | 2 | 475 |\n", ">\n", "> So, for example, the first house is 1600 square feet, has 3 bedrooms, and costs $500,000, and so on. 
Our goal will be to understand the cost of a house in terms of the number of bedrooms as well as the square footage.\n", "> Concretely this gives us a matrix and a vector:\n", "> $$ X = \\begin{bmatrix} 1600 & 3 \\\\ 2100 & 4 \\\\ 1550 & 2 \\end{bmatrix} \\text{ and } y =\\begin{bmatrix} 500 \\\\ 650 \\\\ 475 \\end{bmatrix} $$\n", "> So translating to linear algebra, the goal is to understand how $y$ depends on the columns of $X$.\n", "\n", "\n", "## Translation from Data Science to Linear Algebra\n", "\n", "| Data Science (DS) Term | Linear Algebra (LA) Equivalent | Explanation |\n", "| --- | --- | --- |\n", "| Dataset (with n observations and p features) | A matrix $X \\in \\mathbb{R}^{n \\times p}$ | The dataset is just a matrix. Each row is an observation (a vector of features). Each column is a feature (a vector of its values across all observations). |\n", "| Features | Columns of $X$ | Each feature is a column in your data matrix. |\n", "| Observation | Rows of $X$ | Each data point corresponds to a row. |\n", "| Targets | A vector $y \\in \\mathbb{R}^{n \\times 1}$ | The list of all target values is a column vector. |\n", "| Model parameters | A vector $\\beta \\in \\mathbb{R}^{p \\times 1}$ | These are the unknown coefficients. |\n", "| Model | Matrix–vector equation | The relationship becomes an equation involving matrices and vectors. |\n", "| Prediction Error / Residuals | A residual vector $e \\in \\mathbb{R}^{n \\times 1}$ | Difference between actual targets and predictions. |\n", "| Training / \"best fit\" | Optimization: minimizing the norm of the residual vector | To find the \"best\" model by finding a model which makes the norm of the residual vector as small as possible. |\n", "\n", "So our matrix $X$ will represent our data set, our vector $y$ is the target, and $\\beta$ is our vector of parameters. We will often be interested in understanding data with \"intercepts\", i.e., when there is a base value given in our data. 
So we will augment a column of 1's (denoted by $\\mathbb{1}$) to $X$ and append a parameter $\\beta_0$ to the top of $\\beta$, yielding\n", "\n", "$$ \\tilde{X} = \\begin{bmatrix} \\mathbb{1} & X \\end{bmatrix} \\text{ and } \\tilde{\\beta} = \\begin{bmatrix} \\beta_0 \\\\ \\beta_1 \\\\ \\beta_2 \\\\ \\vdots \\\\ \\beta_p \\end{bmatrix}. $$\n", "\n", "So the answer to the Data Science problem becomes\n", "\n", "> **Answer**: Solve, or best approximate a solution to, the matrix equation $\\tilde{X}\\tilde{\\beta} = y$.\n", "\n", "To be explicit, given $\\tilde{X}$ and $y$, we want to find a $\\tilde{\\beta}$ that does a good job of roughly giving $\\tilde{X}\\tilde{\\beta} = y$. There are of course ways to solve (or approximate) such small systems by hand. However, one will often be dealing with enormous data sets that leave plenty to be desired. One view to take is that modern data science is applying numerical linear algebra techniques to imperfect information, all to get as good a solution as possible.\n", "\n", "# Solving the problem: Least Squares Regression and Matrix Decompositions\n", "\n", "If the system $\\tilde{X}\\tilde{\\beta} = y$ is consistent, then we can find a solution. However, we are often dealing with overdetermined systems, in the sense that there are often more observations than features (i.e., more rows than columns in $\\tilde{X}$, or more equations than unknowns), and therefore inconsistent systems. Nevertheless, it is possible to find a **best fit** solution, in the sense that the difference\n", "\n", "$$ e = y - \\tilde{X}\\tilde{\\beta} $$\n", "\n", "is small. By small, we often mean that $e$ is small in $L^2$ norm; i.e., we are minimizing the sum of the squares of the differences between the components of $y$ and the components of $\\tilde{X}\\tilde{\\beta}$. This is known as a **least squares solution**. 
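As a quick numeric sketch of this objective, we can compare the squared residual norms of two candidate parameter vectors on a small data set (the four-point example revisited later in this section); a least-squares solution is precisely a choice of parameters making this number as small as possible.

```python
import numpy as np

# A sketch of the least-squares objective ||y - X~ beta||_2^2.
# The data is the four-point example used later in this section.
X_aug = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])  # first column of 1's for the intercept
y = np.array([1.0, 2.0, 2.0, 4.0])

def squared_residual_norm(beta):
    # The quantity a least-squares solution minimizes
    e = y - X_aug @ beta
    return float(e @ e)

print(squared_residual_norm(np.array([1.0, 0.5])))  # a guessed line y = 1 + x/2
print(squared_residual_norm(np.array([0.0, 0.9])))  # the least-squares line y = 9x/10
```

The second value (about 0.7) is strictly smaller than the first (1.5), and no other parameter vector beats it; that is exactly what the normal equations below certify.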
Assuming that our data points live in the Euclidean plane, this precisely describes finding a line of best fit.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "bdee8009", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# 1. Generate some synthetic data\n", "# We set a random seed for reproducibility\n", "np.random.seed(3)\n", "\n", "# Create 50 random x values between 0 and 10\n", "x = np.random.uniform(0, 10, 50)\n", "\n", "# Create y values with a linear relationship plus some random noise\n", "# True relationship: y = 2.5x + 5 + noise\n", "noise = np.random.normal(0, 2, 50)\n", "y = 2.5 * x + 5 + noise\n", "\n", "# 2. Calculate the line of best fit\n", "# np.polyfit(x, y, deg) returns the coefficients for the polynomial\n", "# deg=1 specifies a linear fit (first degree polynomial)\n", "slope, intercept = np.polyfit(x, y, 1)\n", "\n", "# Create a polynomial function from the coefficients\n", "# This allows us to pass x values directly to get predicted y values\n", "fit_function = np.poly1d((slope, intercept))\n", "\n", "# Generate x values for plotting the line (smoothly across the range)\n", "x_line = np.linspace(x.min(), x.max(), 100)\n", "y_line = fit_function(x_line)\n", "\n", "# 3. 
Plot the data and the line of best fit\n", "plt.figure(figsize=(10, 6))\n", "\n", "# Plot the scatter points\n", "plt.scatter(x, y, color='purple', label='Data Points', alpha=0.7)\n", "\n", "# Plot the line of best fit\n", "plt.plot(x_line, y_line, color='steelblue', linestyle='--', linewidth=2, label='Line of Best Fit')\n", "\n", "# Add labels and title\n", "plt.xlabel('X Axis')\n", "plt.ylabel('Y Axis')\n", "plt.title('Scatter Plot with Line of Best Fit')\n", "\n", "# Add the equation to the plot\n", "# The f-string formats the slope and intercept to 2 decimal places\n", "plt.text(1, 25, f'y = {slope:.2f}x + {intercept:.2f}', fontsize=12, bbox=dict(facecolor='white', alpha=0.8))\n", "\n", "# Display legend and grid\n", "plt.legend()\n", "plt.grid(True, linestyle=':', alpha=0.6)\n", "\n", "# Save and show the plot\n", "plt.savefig('../images/line_of_best_fit_generated_1.png')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "1c25ccb0", "metadata": {}, "source": [ "The structure of this section is as follows.\n", "- [Least Squares Solution](#least-squares-solution)\n", "- [QR Decompositions](#qr-decompositions)\n", "- [Singular Value Decomposition](#singular-value-decomposition)\n", "- [A note on other norms](#a-note-on-other-norms)\n", "- [A note on regularization](#a-note-on-regularization)\n", "- [A note on solving multiple targets concurrently](#a-note-on-solving-multiple-targets-concurrently)\n", "- [Polynomial regression](#polynomial-regression)\n", "- [What can go wrong?](#what-can-go-wrong)\n", "\n", "## Least Squares Solution\n", "\n", "Recall that the Euclidean distance between two vectors $x = (x_1,\\dots,x_n), y = (y_1,\\dots,y_n) \\in \\mathbb{R}^n$ is given by\n", "\n", "$$ \\|x - y\\|_2 = \\sqrt{\\sum_{i=1}^n |x_i - y_i|^2}. 
$$\n", "\n", "We will often work with the square of the $L^2$ norm to simplify things (the square function is increasing, so minimizing the square of a non-negative function will also minimize the function itself).\n", "\n", "> **Definition**: Let $A$ be an $m \\times n$ matrix and $b \\in \\mathbb{R}^m$. A **least-squares solution** of $Ax = b$ is a vector $x_0 \\in \\mathbb{R}^n$ such that\n", "> \n", "> $$ \\|b - Ax_0\\|_2 \\leq \\|b - Ax\\|_2 \\text{ for all } x \\in \\mathbb{R}^n. $$\n", "\n", "So a least-squares solution to the equation $Ax = b$ is trying to find a vector $x_0 \\in \\mathbb{R}^n$ which realizes the smallest distance between the vector $b$ and the column space\n", "$$ \\text{Col}(A) = \\{Ax \\mid x \\in \\mathbb{R}^n\\} $$\n", "of $A$. We know this to be the projection of the vector $b$ onto the column space. " ] }, { "cell_type": "code", "execution_count": null, "id": "f44a6feb", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Linear algebra helper functions\n", "def proj_onto_subspace(A, v):\n", " \"\"\"\n", " Project vector v onto Col(A) where A is (3 x k) with columns spanning the subspace.\n", " Uses the formula: P = A (A^T A)^(-1) A^T (for full column rank A).\n", " \"\"\"\n", " AtA = A.T @ A\n", " return A @ np.linalg.solve(AtA, A.T @ v)\n", "\n", "def make_plane_grid(a, b, u_range=(-1.5, 1.5), v_range=(-1.5, 1.5), n=15):\n", " \"\"\"\n", " Plane through origin spanned by vectors a and b.\n", " Returns meshgrid points X,Y,Z for surface plotting.\n", " \"\"\"\n", " uu = np.linspace(*u_range, n)\n", " vv = np.linspace(*v_range, n)\n", " U, V = np.meshgrid(uu, vv)\n", " P = U[..., None] * a + V[..., None] * b # shape (n,n,3)\n", " return P[..., 0], P[..., 1], P[..., 2]\n", "\n", "# Choose a plane and a vector\n", "# Plane basis vectors (span a 2D subspace in R^3)\n", "a = np.array([1.0, 0.2, 0.0])\n", "b = np.array([0.2, 1.0, 0.3])\n", "# Create the associated matrix\n", "# 
3x2 matrix of full column rank\n", "# the column space will be a plane\n", "A = np.column_stack([a, b]) \n", "\n", "# Vector to project\n", "v = np.array([0.8, 0.6, 1.2])\n", "\n", "# Projection and residual\n", "p = proj_onto_subspace(A, v)\n", "r = v - p\n", "\n", "# Plot\n", "fig = plt.figure(figsize=(9, 7))\n", "# 1 row, 1 column, 1 subplot\n", "# axis lives in R^3\n", "ax = fig.add_subplot(111, projection=\"3d\")\n", "\n", "# Plane surface\n", "X, Y, Z = make_plane_grid(a, b)\n", "# Here is a rectangular grid of points in 3D; draw a surface through them.\n", "ax.plot_surface(X, Y, Z, alpha=0.25)\n", "\n", "origin = np.zeros(3)\n", "\n", "# v, p, and residual r\n", "ax.quiver(*origin, *v, arrow_length_ratio=0.08, linewidth=2)\n", "ax.quiver(*origin, *p, arrow_length_ratio=0.08, linewidth=2)\n", "ax.quiver(*p, *r, arrow_length_ratio=0.08, linewidth=2)\n", "\n", "# Drop line from v to its projection on the plane\n", "ax.plot([v[0], p[0]],\n", "\t\t[v[1], p[1]],\n", "\t\t[v[2], p[2]],\n", "\t\tlinestyle=\"--\", linewidth=2)\n", "\n", "# Points for emphasis\n", "ax.scatter(*v, s=60)\n", "ax.scatter(*p, s=60)\n", "\n", "# Labels (simple text)\n", "ax.text(*v, \" v\")\n", "ax.text(*p, \" Proj(v)\")\n", "\n", "# Make axes look nice\n", "ax.set_xlabel(\"x\")\n", "ax.set_ylabel(\"y\")\n", "ax.set_zlabel(\"z\")\n", "ax.set_title(\"Projection of a vector onto a plane\")\n", "\n", "# Set symmetric limits so the picture isn't squished\n", "all_pts = np.vstack([origin, v, p])\n", "m = np.max(np.abs(all_pts)) * 1.3 + 0.2\n", "ax.set_xlim(-m, m)\n", "ax.set_ylim(-m, m)\n", "ax.set_zlim(-m, m)\n", "\n", "# Adjust spacing so labels, titles, and axes don’t overlap or get cut off.\n", "plt.tight_layout()\n", "plt.savefig('../images/projection_of_vector_onto_plane.png')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "c8a1fe20", "metadata": {}, "source": [ "> **Theorem**: The set of least-squares solutions of $Ax = b$ coincides with solutions of the **normal equations** 
$A^TAx = A^Tb$. Moreover, the normal equations always have a solution.\n", "\n", "Let us first see why we get a line of best fit. \n", "\n", "> **Example**. Let us show why this describes a line of best fit when we are working with one feature and one target. Suppose that we observe four data points\n", "> $$ X = \\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\\\ 4 \\end{bmatrix} \\text{ and } y = \\begin{bmatrix} 1 \\\\ 2 \\\\ 2 \\\\ 4 \\end{bmatrix}. $$\n", "> We want to fit a line $y = \\beta_0 + \\beta_1x$ to these data points. We will have our augmented matrix be\n", "> $$ \\tilde{X} = \\begin{bmatrix} 1 & 1 \\\\ 1 & 2 \\\\ 1 & 3 \\\\ 1 & 4 \\end{bmatrix}, $$\n", "> and our parameter be\n", "> $$ \\tilde{\\beta} = \\begin{bmatrix} \\beta_0 \\\\ \\beta_1 \\end{bmatrix}. $$\n", "> We have that\n", "> $$ \\tilde{X}^T\\tilde{X} = \\begin{bmatrix} 4 & 10 \\\\ 10 & 30 \\end{bmatrix} \\text{ and } \\tilde{X}^Ty = \\begin{bmatrix} 9 \\\\ 27 \\end{bmatrix}. $$\n", "> The 2x2 matrix $\\tilde{X}^T\\tilde{X}$ is easy to invert, and so we get that\n", "> $$ \\tilde{\\beta} = (\\tilde{X}^T\\tilde{X})^{-1}\\tilde{X}^Ty = \\frac{1}{10}\\begin{bmatrix} 15 & -5 \\\\ -5 & 2 \\end{bmatrix}\\begin{bmatrix} 9 \\\\ 27 \\end{bmatrix} = \\begin{bmatrix} 0 \\\\ \\frac{9}{10} \\end{bmatrix}. $$\n", "> So our line of best fit is of the form $y = \\frac{9}{10}x$.\n", "\n", "Although the above system was small and we could solve the system of equations explicitly, this isn't always feasible. We will generally use python in order to solve large systems. \n", "- One can find a least-squares solution using `numpy.linalg.lstsq`.\n", "- One can set up the normal equations and solve the resulting system using `numpy.linalg.solve`.\n", "\n", "Although the first approach simplifies things greatly, and is more or less what we are doing anyway, we will generally set up our problems as we would by hand, and then use `numpy.linalg.solve` to help us find a solution. 
However, computing $X^TX$ can introduce significant numerical error, so later we'll see how to get linear systems from QR decompositions and the SVD, and then apply `numpy.linalg.solve`. \n", "\n", "Let's see how to use these for the above example, and see the code to generate the scatter plot and line of best fit. \n", "Again, our system is the following.\n", "$$ X = \\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\\\ 4 \\end{bmatrix} \\text{ and } y = \\begin{bmatrix} 1 \\\\ 2 \\\\ 2 \\\\ 4 \\end{bmatrix}. $$\n", "We will do what we did above, but use python instead.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Define the matrix X and vector y\n", "X = np.array([[1], [2], [3], [4]])\n", "y = np.array([[1], [2], [2], [4]])\n", "\n", "# Augment X with a column of 1's (intercept)\n", "X_aug = np.hstack((np.ones((X.shape[0], 1)), X))\n", "\n", "# Solve the normal equations\n", "beta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "And what is the result?\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1c42a900", "metadata": {}, "outputs": [], "source": [ "beta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This agrees with our by-hand computation: the intercept is tiny, so it is virtually zero, and we get 9/10 as our slope. Let's plot it. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "b, m = beta  # beta[0] will be the intercept and beta[1] will be the slope\n", "_ = plt.plot(X, y, 'o', label='Original data', markersize=10)\n", "_ = plt.plot(X, m*X + b, 'r', label='Line of best fit')\n", "_ = plt.legend()\n", "plt.savefig('../images/line_of_best_fit_easy_example.png')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "What about `numpy.linalg.lstsq`? 
Is it any different?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Define the matrix X and vector y\n", "X = np.array([[1], [2], [3], [4]])\n", "y = np.array([[1], [2], [2], [4]])\n", "\n", "# Augment X with a column of 1's (intercept)\n", "X_aug = np.hstack((np.ones((X.shape[0], 1)), X))\n", "\n", "# Solve the least squares equation with matrix X_aug and target y\n", "beta = np.linalg.lstsq(X_aug, y, rcond=None)[0]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We then get\n" ] }, { "cell_type": "code", "execution_count": null, "id": "35e8c88d", "metadata": {}, "outputs": [], "source": [ "beta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "So it is a little different -- and, in fact, closer to our exact answer (the intercept is zero). This makes sense -- `numpy.linalg.lstsq` won't directly compute $X^TX$, which, again, can cause quite a few issues. \n", "\n", "---\n", "\n", "Now let us return to our initial example. \n", "\n", "> **Example**: Let us work with the example from above. We augment the matrix with a column of 1's to include an intercept term:\n", "> $$ \\tilde{X} = \\begin{bmatrix} 1 & 1600 & 3 \\\\ 1 & 2100 & 4 \\\\ 1 & 1550 & 2 \\end{bmatrix}. $$\n", "> Let us solve the normal equations\n", "> $$ \\tilde{X}^T\\tilde{X}\\tilde{\\beta} = \\tilde{X}^Ty. $$\n", "> We have\n", "> $$ \\tilde{X}^T\\tilde{X} = \\begin{bmatrix} 3 & 5250 & 9 \\\\ 5250 & 9372500 & 16300 \\\\ 9 & 16300 & 29\\end{bmatrix} \\text{ and } \\tilde{X}^Ty = \\begin{bmatrix} 1625 \\\\ 2901250 \\\\ 5050 \\end{bmatrix} $$\n", "> Solving this system of equations yields the parameter vector $\\tilde{\\beta}$. In this case, we have\n", "> $$ \\tilde{\\beta} = \\begin{bmatrix} \\frac{200}{9} \\\\ \\frac{5}{18} \\\\ \\frac{100}{9} \\end{bmatrix}. 
$$\n", "> When we apply $\\tilde{X}$ to $\\tilde{\\beta}$, we get\n", "> $$ \\tilde{X}\\tilde{\\beta} = \\begin{bmatrix} 500 \\\\ 650 \\\\ 475 \\end{bmatrix}, $$\n", "> which is our target on the nose. This means that we can expect, based on our data, that the cost of a house will be\n", "> $$ \\frac{200}{9} + \\frac{5}{18}(\\text{square footage}) + \\frac{100}{9}(\\text{\\# of bedrooms})$$\n", "\n", "In the above, we actually had a consistent system to begin with, so our least-squares solution gave our prediction honestly. What happens if we have an inconsistent system?\n", "\n", "> **Example**: Let us add two more observations, say our data is now the following. \n", "> |House | Square ft | Bedrooms | Price (in $1000s) |\n", "> | --- | --- | --- | --- |\n", "> | 0 | 1600 | 3 | 500 |\n", "> | 1 | 2100 | 4 | 650 |\n", "> | 2 | 1550 | 2 | 475 |\n", "> | 3 | 1600 | 3 | 490 |\n", "> | 4 | 2000 | 4 | 620 |\n", "> \n", "> So setting up our system, we want a least-squares solution to the matrix equation\n", "> $$ \\begin{bmatrix} 1 & 1600 & 3 \\\\ 1 & 2100 & 4 \\\\ 1 & 1550 & 2 \\\\ 1 & 1600 & 3 \\\\ 1 & 2000 & 4 \\end{bmatrix}\\tilde{\\beta} = \\begin{bmatrix} 500 \\\\ 650 \\\\ 475 \\\\ 490 \\\\ 620 \\end{bmatrix}. $$\n", "> Note that the system is inconsistent (the 1st and 4th rows agree in $\\tilde{X}$, but they have different costs). Writing the normal equations we have\n", "> $$ \\tilde{X}^T\\tilde{X} = \\begin{bmatrix} 5 & 8850 & 16 \\\\ 8850 & 15932500 & 29100 \\\\ 16 & 29100 & 54 \\end{bmatrix} \\text{ and } \\tilde{X}^Ty = \\begin{bmatrix} 2735 \\\\ 4925250 \\\\ 9000 \\end{bmatrix}. $$\n", "> Solving this linear system yields\n", "> $$ \\tilde{\\beta} = \\begin{bmatrix} 0 \\\\ \\frac{3}{10} \\\\ 5 \\end{bmatrix}. $$\n", "> This is a vastly different answer! Applying $\\tilde{X}$ to it yields\n", "> $$ \\tilde{X}\\tilde{\\beta} = \\begin{bmatrix} 495 \\\\ 650 \\\\ 475 \\\\ 495 \\\\ 620 \\end{bmatrix}. 
$$\n", "> Note that the error here is\n", "> $$ y - \\tilde{X}\\tilde{\\beta} = \\begin{bmatrix} 5 \\\\ 0 \\\\ 0 \\\\ -5 \\\\ 0 \\end{bmatrix}, $$\n", "> which has squared $L^2$ norm\n", "> $$ \\|y - \\tilde{X}\\tilde{\\beta}\\|_2^2 = 25 + 25 = 50. $$\n", "> So this says that, given our data, we can roughly estimate the cost of a house, to within about \\$5,000 on the observed houses, to be\n", "> $$ \\approx \\frac{3}{10}(\\text{square footage}) + 5(\\text{\\# of bedrooms}). $$\n", "\n", "In practice, our data sets can be gigantic, and so there is absolutely no hope of doing computations by hand. It is nice to know that theoretically we can do things like this though. \n", "\n", "> **Theorem**: Let $A$ be an $m \\times n$ matrix and $b \\in \\mathbb{R}^m$. The following are equivalent.\n", "> \n", "> 1. The equation $Ax = b$ has a unique least-squares solution for each $b \\in \\mathbb{R}^m$.\n", "> 2. The columns of $A$ are linearly independent. \n", "> 3. The matrix $A^TA$ is invertible.\n", "\n", "In this case, the unique solution to the normal equations $A^TAx = A^Tb$ is\n", "\n", "$$ x_0 = (A^TA)^{-1}A^Tb. $$\n", "\n", "Computing $\\tilde{X}^T\\tilde{X}$ or taking inverses is computationally intensive, and it is best to avoid doing so. Moreover, as we'll see in an example later, a numerical calculation can produce values close to zero that we then divide by, blowing up the final result. One way to get around this is to use QR decompositions of matrices. \n", "\n", "Now let's use python to visualize the above data and then solve for the least-squares solution. We'll use `pandas` in order to think about this data. 
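Before doing so, here is a minimal sketch checking the closed form $x_0 = (A^TA)^{-1}A^Tb$ above against `numpy.linalg.lstsq` on the five-house system from the last example.

```python
import numpy as np

# Sketch: the closed form (A^T A)^{-1} A^T b versus lstsq
# on the five-house system from the example above.
A = np.array([[1.0, 1600.0, 3.0],
              [1.0, 2100.0, 4.0],
              [1.0, 1550.0, 2.0],
              [1.0, 1600.0, 3.0],
              [1.0, 2000.0, 4.0]])
b = np.array([500.0, 650.0, 475.0, 490.0, 620.0])

closed_form = np.linalg.inv(A.T @ A) @ (A.T @ b)
via_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

# Both are close to the exact answer (0, 3/10, 5); any tiny discrepancy
# between them is numerical error from forming A^T A and inverting it.
print(closed_form)
print(via_lstsq)
```

The two agree to many digits here, but the inverse-based route is the one to avoid on large problems, for exactly the conditioning reasons above.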
We note that `pandas` uses `matplotlib` under the hood for its plotting, so some simplifications can be made.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# First let us make a dictionary incorporating our data.\n", "# Each entry corresponds to a column (feature of our data)\n", "data = {\n", "\t'Square ft': [1600, 2100, 1550, 1600, 2000],\n", "\t'Bedrooms': [3, 4, 2, 3, 4],\n", "\t'Price': [500, 650, 475, 490, 620]\n", "}\n", "\n", "# Create a pandas DataFrame\n", "df = pd.DataFrame(data)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Let's see how python formats this `DataFrame`. It will turn it into essentially the table we had at the beginning. \n" ] }, { "cell_type": "code", "execution_count": null, "id": "9dd3046d", "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "So what can we do with DataFrames? First let's use `pandas.DataFrame.describe` to see some basic statistics about our data.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a28d2558", "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This gives us the mean, the standard deviation, the min, the max, as well as some other things. We get an immediate sense of scale from our data. We can also examine the pairwise correlation of all the columns by using `pandas.DataFrame.corr`.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7850890b", "metadata": {}, "outputs": [], "source": [ "df[[\"Square ft\", \"Bedrooms\", \"Price\"]].corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "It is clear that all three are correlated with one another. This makes sense, as the number of bedrooms should be increasing with the square feet. Same with the price. 
We'll discuss this further in the next section, when we look at Principal Component Analysis. \n", "\n", "We can also graph our data; for example, we could create some scatter plots, one for `Square ft` vs `Price` and one for `Bedrooms` vs `Price`. We can also do a grouped bar chart. Let's start with the scatter plots. \n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Scatter plot for Price vs Square ft\n", "df.plot(\n", "\tkind=\"scatter\",\n", "\tx=\"Square ft\",\n", "\ty=\"Price\",\n", "\ttitle=\"House Price vs Square footage\"\n", ")\n", "plt.savefig('../images/house_price_vs_square_ft.png')\n", "plt.show()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Scatter plot for Price vs Bedrooms\n", "df.plot(\n", "\tkind=\"scatter\",\n", "\tx=\"Bedrooms\",\n", "\ty=\"Price\",\n", "\ttitle=\"House Price vs Bedrooms\"\n", ")\n", "plt.savefig('../images/house_price_vs_bedrooms.png')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can even do square footage vs bedrooms. 
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Scatter plot for Bedrooms vs Square ft\n", "df.plot(\n", "\tkind=\"scatter\",\n", "\tx=\"Square ft\",\n", "\ty=\"Bedrooms\",\n", "\ttitle=\"Bedrooms vs Square footage\"\n", ")\n", "plt.savefig('../images/bedrooms_vs_square_ft.png')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Of course, these figures are somewhat meaningless given how few observations we have.\n", "\n", "Now let's get our matrices and linear systems set up with `pandas.DataFrame.to_numpy`.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create our matrix X and our target y\n", "X = df[[\"Square ft\", \"Bedrooms\"]].to_numpy()\n", "y = df[[\"Price\"]].to_numpy()\n", "\n", "# Augment X with a column of 1's (intercept)\n", "X_aug = np.hstack((np.ones((X.shape[0], 1)), X))\n", "\n", "# Solve the least-squares problem\n", "beta = np.linalg.lstsq(X_aug, y, rcond=None)[0]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This yields\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0d08c091", "metadata": {}, "outputs": [], "source": [ "beta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As the first parameter is basically 0, we are left with the second being 3/10 and the third being 5, just like our exact solution. Next, we will look at matrix decompositions and how they can help us find least-squares solutions. " ] }, { "cell_type": "markdown", "id": "a1f82fae", "metadata": {}, "source": [ "# Polynomial Regression\n", "\n", "Sometimes fitting a line to a set of $n$ data points clearly isn't the right thing to do. To emphasize the limitations of linear models, we generate data from a purely quadratic relationship. 
In this setting, the space of linear functions is not rich enough to capture the underlying structure, and the linear least-squares solution exhibits systematic error. Expanding the feature space to include quadratic terms resolves this issue.\n", "\n", "For example, suppose our data looked like the following. \n" ] }, { "cell_type": "code", "execution_count": null, "id": "52a5c824", "metadata": {}, "outputs": [], "source": [ "## Generate data\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# 1) Generate quadratic data\n", "np.random.seed(3)\n", "\n", "n = 50\n", "x = np.random.uniform(-5, 5, n) # symmetric, wider range\n", "\n", "# True relationship: y = ax^2 + c + noise\n", "a_true = 2.0\n", "c_true = 5.0\n", "noise = np.random.normal(0, 3, n)\n", "\n", "y = a_true * x**2 + c_true + noise\n", "\n", "## Generate scatter plot\n", "plt.scatter(x,y)\n", "\n", "# plot it\n", "plt.savefig('../images/quadratic_data_generated_1.png')\n", "plt.show()\n", "\n" ] }, { "cell_type": "markdown", "id": "1221e35e", "metadata": {}, "source": [ "\n", "If we try to find a line of best fit, we get something that doesn't really describe or approximate our data at all...\n" ] }, { "cell_type": "code", "execution_count": null, "id": "bb6cd90d", "metadata": {}, "outputs": [], "source": [ "# find a line of best fit\n", "a,b = np.polyfit(x, y, 1)\n", "\n", "# add scatter points to plot\n", "plt.scatter(x,y)\n", "\n", "# add line of best fit to plot\n", "plt.plot(x, a*x + b, 'r', linewidth=1)\n", "\n", "# plot it\n", "plt.savefig('../images/quadratic_data_line_of_best_fit.png')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "a2d5bb71", "metadata": {}, "source": [ "\n", "This is an example of **underfitting** data, and we can do better. The same linear regression ideas work for fitting a degree $d$ polynomial model to a set of $n$ data points. 
Before, when trying to fit a line to points $(x_1,y_1),\\dots,(x_n,y_n)$, we had the following matrices\n", "$$ \\tilde{X} = \\begin{bmatrix} 1 & x_1 \\\\ \\vdots & \\vdots \\\\ 1 & x_n \\end{bmatrix}, y = \\begin{bmatrix} y_1 \\\\ \\vdots \\\\ y_n \\end{bmatrix}, \\tilde{\\beta} = \\begin{bmatrix} \\beta_0 \\\\ \\beta_1 \\end{bmatrix} $$\n", "in the matrix equation\n", "$$ \\tilde{X}\\tilde{\\beta} = y, $$\n", "and we were trying to find a vector $\\tilde{\\beta}$ which gave a best possible solution. This would give us a line $y = \\beta_0 + \\beta_1x$ which best approximates the data. To fit a polynomial $y = \\beta_0 + \\beta_1x + \\beta_2x^2 + \\cdots + \\beta_dx^d$ to the data, we have a similar set up.\n", "\n", "> **Definition**. The **Vandermonde matrix** is the $n \\times (d+1)$ matrix\n", "> $$ V = \\begin{bmatrix} 1 & x_1 & x_1^2 & \\cdots & x_1^d \\\\ 1 & x_2 & x_2^2 & \\cdots & x_2^d \\\\ \\vdots & \\vdots & \\vdots & \\ddots & \\vdots \\\\ 1 & x_n & x_n^2 & \\cdots & x_n^d \\end{bmatrix}. $$\n", "\n", "With the Vandermonde matrix, to find a polynomial function of best fit, one just needs to find a least-squares solution to the matrix equation\n", "$$ V\\tilde{\\beta} = y. $$\n", "\n", "With the generated data above, we get the following curve. " ] }, { "cell_type": "code", "execution_count": null, "id": "8c569b38", "metadata": {}, "outputs": [], "source": [ "# polynomial fit with degree = 2\n", "poly = np.polyfit(x,y,2)\n", "model = np.poly1d(poly)\n", "\n", "# add scatter points to plot\n", "plt.scatter(x,y)\n", "\n", "# add the quadratic to the plot\n", "polyline = np.linspace(x.min(), x.max())\n", "plt.plot(polyline, model(polyline), 'r', linewidth=1)\n", "\n", "\n", "# plot it\n", "plt.savefig('../images/quadratic_data_quadratic_of_best_fit.png')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "id": "b567fe95", "metadata": {}, "source": [ "\n", "Solving these problems can be done with python. One can use `numpy.polyfit` and `numpy.poly1d`. 
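As a sketch of the claim above, one can also build the Vandermonde matrix explicitly with `numpy.vander` and hand it to `numpy.linalg.lstsq`; on synthetic quadratic data (generated similarly to the example above), this recovers the same coefficients as `numpy.polyfit`.

```python
import numpy as np

# Sketch: solve the Vandermonde least-squares problem V beta = y directly.
# Synthetic quadratic data, generated similarly to the example above.
rng = np.random.default_rng(3)
x = rng.uniform(-5, 5, 50)
y = 2.0 * x**2 + 5.0 + rng.normal(0, 3, 50)

d = 2
V = np.vander(x, d + 1, increasing=True)  # columns: 1, x, x^2
beta = np.linalg.lstsq(V, y, rcond=None)[0]

# np.polyfit lists coefficients from highest degree down; reverse to compare
assert np.allclose(beta, np.polyfit(x, y, d)[::-1])
print(beta)  # close to the true coefficients (5, 0, 2), up to noise
```

This is essentially what `numpy.polyfit` does internally, up to some column scaling for numerical stability.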
\n", "\n", "> **Example**. Consider the following data.\n", "> |House | Square ft | Bedrooms | Price (in $1000s) |\n", "> | --- | --- | --- | --- |\n", "> | 0 | 1600 | 3 | 500 |\n", "> | 1 | 2100 | 4 | 650 |\n", "> | 2 | 1550 | 2 | 475 |\n", "> | 3 | 1600 | 3 | 490 |\n", "> | 4 | 2000 | 4 | 620 |\n", ">\n", "> Suppose we wanted to predict the price of a house based on the square footage and we thought the relationship was cubic (it clearly isn't, but hey, for the sake of argument). So really we are looking at the subset of data\n", "> |House | Square ft | Price (in $1000s) |\n", "> | --- | --- | --- |\n", "> | 0 | 1600 | 500 |\n", "> | 1 | 2100 | 650 |\n", "> | 2 | 1550 | 475 |\n", "> | 3 | 1600 | 490 |\n", "> | 4 | 2000 | 620 |\n", ">\n", "> Our Vandermonde matrix will be\n", "> $$ V = \\begin{bmatrix} 1 & 1600 & 1600^2 & 1600^3 \\\\ 1 & 2100 & 2100^2 & 2100^3 \\\\ 1 & 1550 & 1550^2 & 1550^3 \\\\ 1 & 1600 & 1600^2 & 1600^3 \\\\ 1 & 2000 & 2000^2 & 2000^3 \\end{bmatrix} $$\n", "> and our target vector will be\n", "> $$ y = \\begin{bmatrix} 500 \\\\ 650 \\\\ 475 \\\\ 490 \\\\ 620 \\end{bmatrix}. $$\n", "> As we can see, the entries of the Vandermonde matrix get very very large very fast. One can, if they are so inclined, compute a least-squares solution to $V\\tilde{\\beta} = y$ by hand. 
Let's not, but let us find, using python, a \"best\" cubic approximation of the relationship between the square footage and price.\n", "\n", "We will use `numpy.polyfit`, `numpy.poly1d` and `numpy.linspace`.\n", "\n", "Let's get a cubic of best fit.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9bdf7560", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# First let us make a dictionary incorporating our data.\n", "# Each entry corresponds to a column (feature of our data)\n", "data = {\n", " 'Square ft': [1600, 2100, 1550, 1600, 2000],\n", " 'Bedrooms': [3, 4, 2, 3, 4],\n", " 'Price': [500, 650, 475, 490, 620]\n", "}\n", "\n", "# Create a pandas DataFrame\n", "df = pd.DataFrame(data)\n", "\n", "# Extract x (square footage) and y (price)\n", "x = df[\"Square ft\"].to_numpy(dtype=float)\n", "y = df[\"Price\"].to_numpy(dtype=float)\n", "\n", "# Degree of polynomial\n", "degree = 3 # cubic\n", "\n", "# Polyfit directly on x\n", "# (numpy may warn that this fit is poorly conditioned -- the Vandermonde entries are huge)\n", "cubic = np.poly1d(np.polyfit(x, y, degree))\n", "\n", "# Add fitted polynomial line and scatter plot\n", "polyline = np.linspace(x.min(), x.max())\n", "plt.scatter(x, y, label=\"Observed data\")\n", "plt.plot(polyline, cubic(polyline), 'r', label=\"Cubic best fit\")\n", "plt.xlabel(\"Square ft\")\n", "plt.ylabel(\"Price (in $1000s)\")\n", "plt.title(\"Cubic polynomial regression: Price vs Square Footage\")\n", "plt.show()\n" ] }, { "cell_type": "markdown", "id": "67efb4de", "metadata": {}, "source": [ "Here `numpy.polyfit` computes the least-squares solution in the polynomial basis $1, x, x^2, x^3$, i.e., it solves the Vandermonde least-squares problem. So what is our cubic polynomial?" 
] }, { "cell_type": "code", "execution_count": null, "id": "eaba0d42", "metadata": {}, "outputs": [], "source": [ "cubic" ] }, { "cell_type": "markdown", "id": "db109cf4", "metadata": {}, "source": [ "The coefficients are listed from highest degree to lowest: the first is the degree-3 coefficient, then the degree-2, then the degree-1, and the last is the constant term." ] }, { "cell_type": "markdown", "id": "3f40d45e", "metadata": {}, "source": [ "\n", "\n", "# Additional visualization: line of best fit\n", "## Generated scatter plot example\n", "Below we generate random scattered points around a line and plot them together with their line of best fit. A degree-1 `numpy.polyfit` gives the slope and intercept, and `matplotlib.pyplot.axline` draws the resulting line." ] }, { "cell_type": "code", "execution_count": null, "id": "a4bb276b", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Generate synthetic data: a noisy version of the line y = 2.5x + 5\n", "np.random.seed(3)\n", "x = np.random.uniform(0, 10, 50)\n", "y = 2.5 * x + 5 + np.random.normal(0, 2, 50)\n", "\n", "# Degree-1 polyfit gives the least-squares slope and intercept\n", "slope, intercept = np.polyfit(x, y, 1)\n", "\n", "plt.figure(figsize=(10, 6))\n", "plt.scatter(x, y, color='purple', label='Data Points', alpha=0.7)\n", "\n", "# Plot the line with axline: xy1=(0, intercept) is a point on the line\n", "# (the y-intercept) and slope determines its direction\n", "plt.axline(xy1=(0, intercept), slope=slope, color='steelblue', linestyle='--', linewidth=2, label='Line of Best Fit')\n", "\n", "# Add the equation to the plot; the f-string formats the slope and\n", "# intercept to 2 decimal places\n", "plt.text(1, 25, f'y = {slope:.2f}x + {intercept:.2f}', fontsize=12, bbox=dict(facecolor='white', alpha=0.8))\n", "\n", "plt.xlabel('X Axis')\n", "plt.ylabel('Y Axis')\n", "plt.title('Scatter Plot with Line of Best Fit')\n", "plt.legend()\n", "plt.grid(True, linestyle=':', alpha=0.6)\n", "plt.show()\n" ] }, { "cell_type": "markdown", "id": "c17fbdaf", "metadata": {}, "source": [ "\n", "\n", "See also:\n", "\n", "- https://stackoverflow.com/questions/37234163/how-to-add-a-line-of-best-fit-to-scatter-plot\n", "- https://www.statology.org/line-of-best-fit-python/\n", "- https://stackoverflow.com/questions/6148207/linear-regression-with-matplotlib-numpy\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.14.3" } }, "nbformat": 4, "nbformat_minor": 5 }