Least Squares Regression: A Linear Algebra Perspective¶
Introduction¶
This is meant to be a brief (and not entirely comprehensive) introduction to data science for the linear algebraist. There are of course many other, more complicated topics, but this is just to get the essence of data science (and the tools involved) from the perspective of someone with a strong linear algebra background.
One of the most fundamental questions of data science is the following.
Question: Given observed data, how can we predict certain targets?
The answer of course boils down to linear algebra, and we will begin by translating data science terms and concepts into linear algebraic ones. But first, as should be common practice for the linear algebraist, an example.
Example. Suppose that we observe $n=3$ houses, and for each house we record
- the square footage,
- the number of bedrooms,
- and additionally the sale price.
So we have a table as follows.
| House | Square ft | Bedrooms | Price (in $1000s) |
| --- | --- | --- | --- |
| 0 | 1600 | 3 | 500 |
| 1 | 2100 | 4 | 650 |
| 2 | 1550 | 2 | 475 |
So, for example, the first house is 1600 square feet, has 3 bedrooms, and costs $500,000, and so on. Our goal will be to understand the cost of a house in terms of the number of bedrooms as well as the square footage. Concretely this gives us a matrix and a vector: $$ X = \begin{bmatrix} 1600 & 3 \\ 2100 & 4 \\ 1550 & 2 \end{bmatrix} \text{ and } y =\begin{bmatrix} 500 \\ 650 \\ 475 \end{bmatrix} $$ So translating to linear algebra, the goal is to understand how $y$ depends on the columns of $X$.
Translation from Data Science to Linear Algebra¶
| Data Science (DS) Term | Linear Algebra (LA) Equivalent | Explanation |
|---|---|---|
| Dataset (with n observations and p features) | A matrix $X \in \mathbb{R}^{n \times p}$ | The dataset is just a matrix. Each row is an observation (a vector of features). Each column is a feature (a vector of its values across all observations). |
| Features | Columns of $X$ | Each feature is a column in your data matrix. |
| Observation | Rows of $X$ | Each data point corresponds to a row. |
| Targets | A vector $y \in \mathbb{R}^{n \times 1}$ | The list of all target values is a column vector. |
| Model parameters | A vector $\beta \in \mathbb{R}^{p \times 1}$ | These are the unknown coefficients. |
| Model | Matrix–vector equation | The relationship becomes an equation involving matrices and vectors. |
| Prediction Error / Residuals | A residual vector $e \in \mathbb{R}^{n \times 1}$ | Difference between actual targets and predictions. |
| Training / "best fit" | Optimization: minimizing the norm of the residual vector | To find the "best" model by finding a model which makes the norm of the residual vector as small as possible. |
So our matrix $X$ will represent our data set, our vector $y$ is the target, and $\beta$ is our vector of parameters. We will often be interested in understanding data with "intercepts", i.e., when there is a base value given in our data. So we will augment a column of 1's (denoted by $\mathbb{1}$) to $X$ and append a parameter $\beta_0$ to the top of $\beta$, yielding
$$ \tilde{X} = \begin{bmatrix} \mathbb{1} & X \end{bmatrix} \text{ and } \tilde{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}. $$
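As a quick sketch of this augmentation in numpy (using the house data from the opening example), a single column stack does the job:

```python
import numpy as np

# Data matrix from the house example: square footage and bedrooms
X = np.array([[1600, 3],
              [2100, 4],
              [1550, 2]], dtype=float)

# Augment with a column of 1's for the intercept term
X_tilde = np.column_stack([np.ones(X.shape[0]), X])
print(X_tilde)
```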
So the answer to the Data Science problem becomes
Answer: Solve, or best approximate a solution to, the matrix equation $\tilde{X}\tilde{\beta} = y$.
To be explicit, given $\tilde{X}$ and $y$, we want to find a $\tilde{\beta}$ that does a good job of approximately satisfying $\tilde{X}\tilde{\beta} = y$. There are of course ways to solve (or approximate solutions to) such small systems by hand. However, one will often be dealing with enormous, imperfect data sets. One view to take is that modern data science is applying numerical linear algebra techniques to imperfect information, all to get as good a solution as possible.
Solving the problem: Least Squares Regression and Matrix Decompositions¶
If the system $\tilde{X}\tilde{\beta} = y$ is consistent, then we can find a solution. However, we are often dealing with overdetermined systems, in the sense that there are often more observations than features (i.e., more rows than columns in $\tilde{X}$, or more equations than unknowns), and therefore inconsistent systems. However, it is possible to find a best fit solution, in the sense that the difference
$$ e = y - \tilde{X}\tilde{\beta} $$
is small. By small, we often mean that $e$ is small in $L^2$ norm; i.e., we are minimizing the sum of the squares of the differences between the components of $y$ and the components of $\tilde{X}\tilde{\beta}$. This is known as a least-squares solution. When our data points live in the Euclidean plane, this precisely describes finding a line of best fit.
import numpy as np
import matplotlib.pyplot as plt
# 1. Generate some synthetic data
# We set a random seed for reproducibility
np.random.seed(3)
# Create 50 random x values between 0 and 10
x = np.random.uniform(0, 10, 50)
# Create y values with a linear relationship plus some random noise
# True relationship: y = 2.5x + 5 + noise
noise = np.random.normal(0, 2, 50)
y = 2.5 * x + 5 + noise
# 2. Calculate the line of best fit
# np.polyfit(x, y, deg) returns the coefficients for the polynomial
# deg=1 specifies a linear fit (first degree polynomial)
slope, intercept = np.polyfit(x, y, 1)
# Create a polynomial function from the coefficients
# This allows us to pass x values directly to get predicted y values
fit_function = np.poly1d((slope, intercept))
# Generate x values for plotting the line (smoothly across the range)
x_line = np.linspace(x.min(), x.max(), 100)
y_line = fit_function(x_line)
# 3. Plot the data and the line of best fit
plt.figure(figsize=(10, 6))
# Plot the scatter points
plt.scatter(x, y, color='purple', label='Data Points', alpha=0.7)
# Plot the line of best fit
plt.plot(x_line, y_line, color='steelblue', linestyle='--', linewidth=2, label='Line of Best Fit')
# Add labels and title
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('Scatter Plot with Line of Best Fit')
# Add the equation to the plot
# The f-string formats the slope and intercept to 2 decimal places
plt.text(1, 25, f'y = {slope:.2f}x + {intercept:.2f}', fontsize=12, bbox=dict(facecolor='white', alpha=0.8))
# Display legend and grid
plt.legend()
plt.grid(True, linestyle=':', alpha=0.6)
# Show the plot
plt.savefig('../images/line_of_best_fit_generated_1.png')
plt.show()
The structure of this section is as follows.
- Least Squares Solution
- QR Decompositions
- Singular Value Decomposition
- A note on other norms
- A note on regularization
- A note on solving multiple targets concurrently
- Polynomial regression
- What can go wrong?
Least Squares Solution¶
Recall that the Euclidean distance between two vectors $x = (x_1,\dots,x_n), y = (y_1,\dots,y_n) \in \mathbb{R}^n$ is given by
$$ ||x - y||_2 = \sqrt{\sum_{i=1}^n |x_i - y_i|^2}. $$
We will often work with the square of the $L^2$ norm to simplify things (the square function is increasing, so minimizing the square of a non-negative function will also minimize the function itself).
Definition: Let $A$ be an $m \times n$ matrix and $b \in \mathbb{R}^m$. A least-squares solution of $Ax = b$ is a vector $x_0 \in \mathbb{R}^n$ such that
$$ \|b - Ax_0\|_2 \leq \|b - Ax\|_2 \text{ for all } x \in \mathbb{R}^n. $$
So a least-squares solution to the equation $Ax = b$ is trying to find a vector $x_0 \in \mathbb{R}^n$ which realizes the smallest distance between the vector $b$ and the column space $$ \text{Col}(A) = \{Ax \mid x \in \mathbb{R}^n\} $$ of $A$. We know this to be the projection of the vector $b$ onto the column space.
import numpy as np
import matplotlib.pyplot as plt
# Linear algebra helper functions
def proj_onto_subspace(A, v):
    """
    Project vector v onto Col(A), where A is (3 x k) with columns spanning the subspace.
    Uses the formula: P = A (A^T A)^(-1) A^T (for full column rank A).
    """
    AtA = A.T @ A
    return A @ np.linalg.solve(AtA, A.T @ v)
def make_plane_grid(a, b, u_range=(-1.5, 1.5), v_range=(-1.5, 1.5), n=15):
    """
    Plane through the origin spanned by vectors a and b.
    Returns meshgrid points X, Y, Z for surface plotting.
    """
    uu = np.linspace(*u_range, n)
    vv = np.linspace(*v_range, n)
    U, V = np.meshgrid(uu, vv)
    P = U[..., None] * a + V[..., None] * b  # shape (n, n, 3)
    return P[..., 0], P[..., 1], P[..., 2]
# Choose a plane and a vector
# Plane basis vectors (span a 2D subspace in R^3)
a = np.array([1.0, 0.2, 0.0])
b = np.array([0.2, 1.0, 0.3])
# Create the associated matrix
# 3x2 matrix of full column rank
# the column space will be a plane
A = np.column_stack([a, b])
# Vector to project
v = np.array([0.8, 0.6, 1.2])
# Projection and residual
p = proj_onto_subspace(A, v)
r = v - p
# Plot
fig = plt.figure(figsize=(9, 7))
# 1 row, 1 column, 1 subplot
# axis lives in R^3
ax = fig.add_subplot(111, projection="3d")
# Plane surface
X, Y, Z = make_plane_grid(a, b)
# Here is a rectangular grid of points in 3D; draw a surface through them.
ax.plot_surface(X, Y, Z, alpha=0.25)
origin = np.zeros(3)
# v, p, and residual r
ax.quiver(*origin, *v, arrow_length_ratio=0.08, linewidth=2)
ax.quiver(*origin, *p, arrow_length_ratio=0.08, linewidth=2)
ax.quiver(*p, *r, arrow_length_ratio=0.08, linewidth=2)
# Drop line from v to its projection on the plane
ax.plot([v[0], p[0]],
        [v[1], p[1]],
        [v[2], p[2]],
        linestyle="--", linewidth=2)
# Points for emphasis
ax.scatter(*v, s=60)
ax.scatter(*p, s=60)
# Labels (simple text)
ax.text(*v, " v")
ax.text(*p, " Proj(v)")
# Make axes look nice
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
ax.set_title("Projection of a vector onto a plane")
# Set symmetric limits so the picture isn't squished
all_pts = np.vstack([origin, v, p])
m = np.max(np.abs(all_pts)) * 1.3 + 0.2
ax.set_xlim(-m, m)
ax.set_ylim(-m, m)
ax.set_zlim(-m, m)
# Adjust spacing so labels, titles, and axes don’t overlap or get cut off.
plt.tight_layout()
plt.savefig('../images/projection_of_vector_onto_plane.png')
plt.show()
Theorem: The set of least-squares solutions of $Ax = b$ coincides with solutions of the normal equations $A^TAx = A^Tb$. Moreover, the normal equations always have a solution.
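We can sanity-check the theorem numerically. A minimal sketch with a small overdetermined system:

```python
import numpy as np

# A small overdetermined system (4 equations, 2 unknowns)
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

# Least-squares solution from numpy's solver
x0 = np.linalg.lstsq(A, b, rcond=None)[0]

# It satisfies the normal equations A^T A x = A^T b
print(np.allclose(A.T @ A @ x0, A.T @ b))  # True
```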
Let us first see why we get a line of best fit.
Example. Let us show why this describes a line of best fit when we are working with one feature and one target. Suppose that we observe four data points $$ X = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \text{ and } y = \begin{bmatrix} 1 \\ 2 \\ 2 \\ 4 \end{bmatrix}. $$ We want to fit a line $y = \beta_0 + \beta_1x$ to these data points. We will have our augmented matrix be $$ \tilde{X} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, $$ and our parameter be $$ \tilde{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}. $$ We have that $$ \tilde{X}^T\tilde{X} = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix} \text{ and } \tilde{X}^Ty = \begin{bmatrix} 9 \\ 27 \end{bmatrix}. $$ The $2 \times 2$ matrix $\tilde{X}^T\tilde{X}$ is easy to invert, and so we get that $$ \tilde{\beta} = (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^Ty = \frac{1}{10}\begin{bmatrix} 15 & -5 \\ -5 & 2 \end{bmatrix}\begin{bmatrix} 9 \\ 27 \end{bmatrix} = \begin{bmatrix} 0 \\ \frac{9}{10} \end{bmatrix}. $$ So our line of best fit is of the form $y = \frac{9}{10}x$.
Although the above system was small and we could solve the system of equations explicitly, this isn't always feasible. We will generally use python in order to solve large systems.
- One can find a least-squares solution directly using numpy.linalg.lstsq.
- One can set up the normal equations and solve the resulting system using numpy.linalg.solve.

Although the first approach simplifies things greatly, and is more or less what we are doing anyway, we will generally set up our problems as we would by hand, and then use numpy.linalg.solve to help us find a solution. However, computing $X^TX$ can cause numerical errors, so later we'll see how to get linear systems from QR decompositions and the SVD, and then apply numpy.linalg.solve.
Let's see how to use these for the above example, and see the code to generate the scatter plot and line of best fit. Again, our system is the following. $$ X = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \text{ and } y = \begin{bmatrix} 1 \\ 2\\ 2 \\ 4 \end{bmatrix}. $$ We will do what we did above, but use python instead.
import numpy as np
# Define the matrix X and vector y
X = np.array([[1], [2], [3], [4]])
y = np.array([[1], [2], [2], [4]])
# Augment X with a column of 1's (intercept)
X_aug = np.hstack((np.ones((X.shape[0], 1)), X))
# Solve the normal equations
beta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
And what is the result?
beta
array([[-1.0658141e-15],
[ 9.0000000e-01]])
This agrees with our by-hand computation: the intercept is tiny, so it is virtually zero, and we get 9/10 as our slope. Let's plot it.
import matplotlib.pyplot as plt
b, m = beta #beta[0] will be the intercept and beta[1] will be the slope
_ = plt.plot(X, y, 'o', label='Original data', markersize=10)
_ = plt.plot(X, m*X + b, 'r', label='Line of best fit')
_ = plt.legend()
plt.savefig('../images/line_of_best_fit_easy_example.png')
plt.show()
What about numpy.linalg.lstsq? Is it any different?
import numpy as np
# Define the matrix X and vector y
X = np.array([[1], [2], [3], [4]])
y = np.array([[1], [2], [2], [4]])
# Augment X with a column of 1's (intercept)
X_aug = np.hstack((np.ones((X.shape[0], 1)), X))
# Solve the least squares equation with matrix X_aug and target y
beta = np.linalg.lstsq(X_aug, y, rcond=None)[0]
We then get
beta
array([[6.16291085e-16],
[9.00000000e-01]])
So it is a little different -- and, in fact, closer to our exact answer (the intercept is zero). This makes sense -- numpy.linalg.lstsq won't directly compute $X^TX$, which, again, can cause quite a few issues.
Now let us return to our initial example.
Example: Let us work with the example from above. We augment the matrix with a column of 1's to include an intercept term: $$ \tilde{X} = \begin{bmatrix} 1 & 1600 & 3 \\ 1 & 2100 & 4 \\ 1 & 1550 & 2 \end{bmatrix}. $$ Let us solve the normal equations $$ \tilde{X}^T\tilde{X}\tilde{\beta} = \tilde{X}^Ty. $$ We have $$ \tilde{X}^T\tilde{X} = \begin{bmatrix} 3 & 5250 & 9 \\ 5250 & 9372500 & 16300 \\ 9 & 16300 & 29\end{bmatrix} \text{ and } \tilde{X}^Ty = \begin{bmatrix} 1625 \\ 2901250 \\ 5050 \end{bmatrix} $$ Solving this system of equations yields the parameter vector $\tilde{\beta}$. In this case, we have $$ \tilde{\beta} = \begin{bmatrix} \frac{200}{9} \\ \frac{5}{18} \\ \frac{100}{9} \end{bmatrix}. $$ When we apply $\tilde{X}$ to $\tilde{\beta}$, we get $$ \tilde{X}\tilde{\beta} = \begin{bmatrix} 500 \\ 650 \\ 475 \end{bmatrix}, $$ which is our target on the nose. This means that we can expect, based on our data, that the cost of a house (in $1000s) will be $$ \frac{200}{9} + \frac{5}{18}(\text{square footage}) + \frac{100}{9}(\text{\# of bedrooms}).$$
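As a quick numerical check of this computation (a sketch, solving the normal equations with np.linalg.solve):

```python
import numpy as np

X_tilde = np.array([[1, 1600, 3],
                    [1, 2100, 4],
                    [1, 1550, 2]], dtype=float)
y = np.array([500, 650, 475], dtype=float)

# Solve the normal equations directly
beta = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
print(beta)            # approximately [200/9, 5/18, 100/9]
print(X_tilde @ beta)  # recovers [500, 650, 475] up to rounding
```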
In the above, we actually had a consistent system to begin with, so our least-squares solution gave our prediction honestly. What happens if we have an inconsistent system?
Example: Let us add two more observations, so our data is now the following.

| House | Square ft | Bedrooms | Price (in $1000s) |
| --- | --- | --- | --- |
| 0 | 1600 | 3 | 500 |
| 1 | 2100 | 4 | 650 |
| 2 | 1550 | 2 | 475 |
| 3 | 1600 | 3 | 490 |
| 4 | 2000 | 4 | 620 |
So setting up our system, we want a least-squares solution to the matrix equation $$ \begin{bmatrix} 1 & 1600 & 3 \\ 1 & 2100 & 4 \\ 1 & 1550 & 2 \\ 1 & 1600 & 3 \\ 1 & 2000 & 4 \end{bmatrix}\tilde{\beta} = \begin{bmatrix} 500 \\ 650 \\ 475 \\ 490 \\ 620 \end{bmatrix}. $$ Note that the system is inconsistent (the 1st and 4th rows of $\tilde{X}$ agree, but they have different prices). Writing the normal equations we have $$ \tilde{X}^T\tilde{X} = \begin{bmatrix} 5 & 8850 & 16 \\ 8850 & 15932500 & 29100 \\ 16 & 29100 & 54 \end{bmatrix} \text{ and } \tilde{X}^Ty = \begin{bmatrix} 2735 \\ 4925250 \\ 9000 \end{bmatrix}. $$ Solving this linear system yields $$ \tilde{\beta} = \begin{bmatrix} 0 \\ \frac{3}{10} \\ 5 \end{bmatrix}. $$ This is a vastly different answer! Applying $\tilde{X}$ to it yields $$ \tilde{X}\tilde{\beta} = \begin{bmatrix} 495 \\ 650 \\ 475 \\ 495 \\ 620 \end{bmatrix}. $$ Note that the error here is $$ y - \tilde{X}\tilde{\beta} = \begin{bmatrix} 5 \\ 0 \\ 0 \\ -5 \\ 0 \end{bmatrix}, $$ which has squared $L^2$ norm $$ \|y - \tilde{X}\tilde{\beta}\|_2^2 = 25 + 25 = 50. $$ So this says that, given our data, we can estimate the cost of a house, to within about \$5000 at worst here, to be $$ \approx \frac{3}{10}(\text{square footage}) + 5(\text{\# of bedrooms}). $$ In practice, our data sets can be gigantic, and so there is absolutely no hope of doing computations by hand. It is nice to know that theoretically we can do things like this though.
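To double-check the residual computation, a short numpy sketch:

```python
import numpy as np

X_tilde = np.array([[1, 1600, 3],
                    [1, 2100, 4],
                    [1, 1550, 2],
                    [1, 1600, 3],
                    [1, 2000, 4]], dtype=float)
y = np.array([500, 650, 475, 490, 620], dtype=float)

# Least-squares solution and its residual
beta = np.linalg.lstsq(X_tilde, y, rcond=None)[0]
e = y - X_tilde @ beta
print(beta)          # approximately [0, 3/10, 5]
print(np.sum(e**2))  # approximately 50
```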
Theorem: Let $A$ be an $m \times n$ matrix. The following are equivalent.
- The equation $Ax = b$ has a unique least-squares solution for each $b \in \mathbb{R}^m$.
- The columns of $A$ are linearly independent.
- The matrix $A^TA$ is invertible.
In this case, the unique solution to the normal equations $A^TAx = A^Tb$ is
$$ x_0 = (A^TA)^{-1}A^Tb. $$
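For illustration only (the next paragraph explains why this formula should usually be avoided in practice), here is a sketch checking that the closed-form expression agrees with numpy's least-squares solver on a well-conditioned random problem:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 3))  # random tall matrix; full column rank with probability 1
b = rng.normal(size=20)

# Closed-form solution x_0 = (A^T A)^(-1) A^T b
x_formula = np.linalg.inv(A.T @ A) @ (A.T @ b)

# Library least-squares solver
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_formula, x_lstsq))  # True
```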
Computing $\tilde{X}^T\tilde{X}$ or taking inverses is computationally intensive, and it is best to avoid doing so. Moreover, as we'll see in an example later, a numerical calculation can produce values close to zero and then divide by them where it shouldn't, blowing up the final result. One way to get around this is to use QR decompositions of matrices.
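As a preview of the QR approach covered later, here is a minimal sketch: with a thin QR decomposition $A = QR$, the normal equations reduce to the triangular system $Rx = Q^Tb$, and $A^TA$ is never formed explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
b = rng.normal(size=50)

# Thin QR decomposition: A = QR, with Q having orthonormal columns
Q, R = np.linalg.qr(A)

# Solve the triangular system R x = Q^T b
x_qr = np.linalg.solve(R, Q.T @ b)

# Agrees with the direct least-squares solver
print(np.allclose(x_qr, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```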
Now let's use python to visualize the above data and then solve for the least-squares solution. We'll use pandas in order to think about this data. We note that pandas incorporates matplotlib under the hood already, so there are some simplifications that can be made.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# First let us make a dictionary incorporating our data.
# Each entry corresponds to a column (feature of our data)
data = {
'Square ft': [1600, 2100, 1550, 1600, 2000],
'Bedrooms': [3, 4, 2, 3, 4],
'Price': [500, 650, 475, 490, 620]
}
# Create a pandas DataFrame
df = pd.DataFrame(data)
Let's see how python formats this DataFrame. It will turn it into essentially the table we had at the beginning.
df
| | Square ft | Bedrooms | Price |
|---|---|---|---|
| 0 | 1600 | 3 | 500 |
| 1 | 2100 | 4 | 650 |
| 2 | 1550 | 2 | 475 |
| 3 | 1600 | 3 | 490 |
| 4 | 2000 | 4 | 620 |
So what can we do with DataFrames? First let's use pandas.DataFrame.describe to see some basic statistics about our data.
df.describe()
| | Square ft | Bedrooms | Price |
|---|---|---|---|
| count | 5.000000 | 5.00000 | 5.000000 |
| mean | 1770.000000 | 3.20000 | 547.000000 |
| std | 258.843582 | 0.83666 | 81.516869 |
| min | 1550.000000 | 2.00000 | 475.000000 |
| 25% | 1600.000000 | 3.00000 | 490.000000 |
| 50% | 1600.000000 | 3.00000 | 500.000000 |
| 75% | 2000.000000 | 4.00000 | 620.000000 |
| max | 2100.000000 | 4.00000 | 650.000000 |
This gives us the mean, the standard deviation, the min, and the max, as well as some other things. We get an immediate sense of the scale of our data. We can also examine the pairwise correlations of the columns by using pandas.DataFrame.corr.
df[["Square ft", "Bedrooms", "Price"]].corr()
| | Square ft | Bedrooms | Price |
|---|---|---|---|
| Square ft | 1.000000 | 0.900426 | 0.998810 |
| Bedrooms | 0.900426 | 1.000000 | 0.909066 |
| Price | 0.998810 | 0.909066 | 1.000000 |
It is clear that all three are correlated. This makes sense, as the number of bedrooms should increase with the square footage, and likewise the price. We'll discuss this further in the next section, when we look at Principal Component Analysis.
We can also graph our data; for example, we could create some scatter plots, one for Square ft vs Price and one for Bedrooms vs Price. We can also do a grouped bar chart. Let's start with the scatter plots.
# Scatter plot for Price vs Square ft
df.plot(
kind="scatter",
x="Square ft",
y="Price",
title="House Price vs Square footage"
)
plt.savefig('../images/house_price_vs_square_ft.png')
plt.show()
# Scatter plot for Price vs Bedrooms
df.plot(
kind="scatter",
x="Bedrooms",
y="Price",
title="House Price vs Bedrooms"
)
plt.savefig('../images/house_price_vs_bedrooms.png')
plt.show()
We can even do square footage vs bedrooms.
# Scatter plot for Bedrooms vs Square ft
df.plot(
kind="scatter",
x="Square ft",
y="Bedrooms",
title="Bedrooms vs Square footage"
)
plt.savefig('../images/bedrooms_vs_square_ft.png')
plt.show()
Of course, these figures are of limited use because our data set is so sparsely populated.
Now let's get our matrices and linear systems set up with pandas.DataFrame.to_numpy.
# Create our matrix X and our target y
X = df[["Square ft", "Bedrooms"]].to_numpy()
y = df[["Price"]].to_numpy()
# Augment X with a column of 1's (intercept)
X_aug = np.hstack((np.ones((X.shape[0], 1)), X))
# Solve the least-squares problem
beta = np.linalg.lstsq(X_aug, y, rcond=None)[0]
This yields
beta
array([[4.0098513e-13],
[3.0000000e-01],
[5.0000000e+00]])
As the first parameter is basically 0, we are left with the second being 3/10 and the third being 5, just like our exact solution. Next, we will look at matrix decompositions and how they can help us find least-squares solutions.
Polynomial Regression¶
Sometimes fitting a line to a set of $n$ data points clearly isn't the right thing to do. To emphasize the limitations of linear models, we generate data from a purely quadratic relationship. In this setting, the space of linear functions is not rich enough to capture the underlying structure, and the linear least-squares solution exhibits systematic error. Expanding the feature space to include quadratic terms resolves this issue.
For example, suppose our data looked like the following.
## Generate data
import numpy as np
import matplotlib.pyplot as plt
# 1) Generate quadratic data
np.random.seed(3)
n = 50
x = np.random.uniform(-5, 5, n) # symmetric, wider range
# True relationship: y = ax^2 + c + noise
a_true = 2.0
c_true = 5.0
noise = np.random.normal(0, 3, n)
y = a_true * x**2 + c_true + noise
## Generate scatter plot
plt.scatter(x,y)
# plot it
plt.savefig('../images/quadratic_data_generated_1.png')
plt.show()
If we try to find a line of best fit, we get something that doesn't really describe or approximate our data at all...
# find a line of best fit
a,b = np.polyfit(x, y, 1)
# add scatter points to plot
plt.scatter(x,y)
# add line of best fit to plot
plt.plot(x, a*x + b, 'r', linewidth=1)
# plot it
plt.savefig('../images/quadratic_data_line_of_best_fit.png')
plt.show()
This is an example of underfitting data, and we can do better. The same linear regression ideas work for fitting a degree $d$ polynomial model to a set of $n$ data points. Before, when trying to fit a line to points $(x_1,y_1),\dots,(x_n,y_n)$, we had the matrices $$ \tilde{X} = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad \tilde{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} $$ in the matrix equation $$ \tilde{X}\tilde{\beta} = y, $$ and we were trying to find a vector $\tilde{\beta}$ which gave a best possible solution. This gave us a line $y = \beta_0 + \beta_1x$ which best approximates the data. To fit a polynomial $y = \beta_0 + \beta_1x + \beta_2x^2 + \cdots + \beta_dx^d$ to the data, we have a similar setup.
Definition. The Vandermonde matrix is the $n \times (d+1)$ matrix $$ V = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix}. $$
With the Vandermonde matrix, to find a polynomial function of best fit, one just needs to find a least-squares solution to the matrix equation $$ V\tilde{\beta} = y. $$
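As a sketch, numpy's np.vander builds this matrix directly (with increasing=True giving the column order $1, x, \dots, x^d$), and the resulting least-squares solution agrees with np.polyfit:

```python
import numpy as np

# Synthetic quadratic data, as in the example above
rng = np.random.default_rng(3)
x = rng.uniform(-5, 5, 50)
y = 2.0 * x**2 + 5.0 + rng.normal(0, 3, 50)

d = 2
# Vandermonde matrix with columns 1, x, x^2
V = np.vander(x, d + 1, increasing=True)

# Least-squares solution of V beta = y
beta = np.linalg.lstsq(V, y, rcond=None)[0]

# np.polyfit solves the same problem (coefficients in decreasing degree order)
coeffs = np.polyfit(x, y, d)
print(np.allclose(beta, coeffs[::-1]))  # True
```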
With the generated data above, we get the following curve.
# polynomial fit with degree = 2
poly = np.polyfit(x,y,2)
model = np.poly1d(poly)
# add scatter points to plot
plt.scatter(x,y)
# add the quadratic to the plot
polyline=np.linspace(x.min(), x.max())
plt.plot(polyline, model(polyline), 'r', linewidth=1)
# plot it
plt.savefig('../images/quadratic_data_quadratic_of_best_fit.png')
plt.show()
Solving these problems can be done with python. One can use numpy.polyfit and numpy.poly1d.
Example. Consider the following data.

| House | Square ft | Bedrooms | Price (in $1000s) |
| --- | --- | --- | --- |
| 0 | 1600 | 3 | 500 |
| 1 | 2100 | 4 | 650 |
| 2 | 1550 | 2 | 475 |
| 3 | 1600 | 3 | 490 |
| 4 | 2000 | 4 | 620 |
Suppose we wanted to predict the price of a house based on the square footage and we thought the relationship was cubic (it clearly isn't, but hey, for the sake of argument). So really we are looking at the subset of data

| House | Square ft | Price (in $1000s) |
| --- | --- | --- |
| 0 | 1600 | 500 |
| 1 | 2100 | 650 |
| 2 | 1550 | 475 |
| 3 | 1600 | 490 |
| 4 | 2000 | 620 |
Our Vandermonde matrix will be $$ V = \begin{bmatrix} 1 & 1600 & 1600^2 & 1600^3 \\ 1 & 2100 & 2100^2 & 2100^3 \\ 1 & 1550 & 1550^2 & 1550^3 \\ 1 & 1600 & 1600^2 & 1600^3 \\ 1 & 2000 & 2000^2 & 2000^3 \end{bmatrix} $$ and our target vector will be $$ y = \begin{bmatrix} 500 \\ 650 \\ 475 \\ 490 \\ 620 \end{bmatrix}. $$ As we can see, the entries of the Vandermonde matrix get very very large very fast. One can, if they are so inclined, compute a least-squares solution to $V\tilde{\beta} = y$ by hand. Let's not, but let us find, using python, a "best" cubic approximation of the relationship between the square footage and price.
We will use numpy.polyfit, numpy.poly1d and numpy.linspace.
Let's get a cubic of best fit.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# First let us make a dictionary incorporating our data.
# Each entry corresponds to a column (feature of our data)
data = {
'Square ft': [1600, 2100, 1550, 1600, 2000],
'Bedrooms': [3, 4, 2, 3, 4],
'Price': [500, 650, 475, 490, 620]
}
# Create a pandas DataFrame
df = pd.DataFrame(data)
# Extract x (square footage) and y (price)
x = df["Square ft"].to_numpy(dtype=float)
y = df["Price"].to_numpy(dtype=float)
# Degree of polynomial
degree = 3 # cubic
# Polyfit directly on x
cubic = np.poly1d(np.polyfit(x,y, degree))
# Add fitted polynomial line and scatter plot
polyline = np.linspace(x.min(),x.max())
plt.scatter(x,y, label="Observed data")
plt.plot(polyline, cubic(polyline), 'r', label="Cubic best fit")
plt.xlabel("Square ft")
plt.ylabel("Price (in $1000s)")
plt.title("Cubic polynomial regression: Price vs Square Footage")
plt.show()
Here numpy.polyfit computes the least-squares solution in the polynomial basis $1, x, x^2, x^3$, i.e., it solves the Vandermonde least-squares problem. So what is our cubic polynomial?
cubic
poly1d([ 3.08080808e-07, -1.78106061e-03, 3.71744949e+00, -2.15530303e+03])
The first term is the degree 3 coefficient, the second the degree 2 coefficient, the third the degree 1 coefficient, and the fourth is the constant term. As an aside, the line-of-best-fit plot from the introduction can also be drawn with matplotlib's axline, which takes a point and a slope rather than precomputed line coordinates.
import numpy as np
import matplotlib.pyplot as plt
# Generate data (same as above)
np.random.seed(3)
x = np.random.uniform(0, 10, 50)
y = 2.5 * x + 5 + np.random.normal(0, 2, 50)
# Calculate slope and intercept
slope, intercept = np.polyfit(x, y, 1)
plt.figure(figsize=(10, 6))
plt.scatter(x, y, color='purple', label='Data Points', alpha=0.7)
# Plot the line using axline
# xy1=(0, intercept) is the y-intercept point
# slope=slope defines the steepness
plt.axline(xy1=(0, intercept), slope=slope, color='steelblue', linestyle='--', linewidth=2, label='Line of Best Fit')
# Add the equation to the plot
# The f-string formats the slope and intercept to 2 decimal places
plt.text(1, 25, f'y = {slope:.2f}x + {intercept:.2f}', fontsize=12, bbox=dict(facecolor='white', alpha=0.8))
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('Scatter Plot with Line of Best Fit')
plt.legend()
plt.grid(True, linestyle=':', alpha=0.6)
plt.show()