diff --git a/README.md b/README.md
index ee26b5f..fe8e7df 100644
--- a/README.md
+++ b/README.md
@@ -1,2037 +1,167 @@
-The LaTeX does not seem to render properly on GitLab/Github. This repo is mirrored on my gitea server here, where the LaTeX seems to render properly.
+# Data Science for the Linear Algebraist
-# Data Science for the Linear Algebraist.
-
-## Project Overview
-A not-so comprehensive guide bridging linear algebra theory with practical data science implementation. Meant for someone to learn data science by using their strong linear algebra background.
-This project demonstrates how fundamental linear algebra concepts power modern machine learning algorithms, with hands-on Python implementations.
+A practical, linear-algebra-first introduction to data science.
-## Main Dependencies
-- **Python 3**
-- **NumPy** - Matrix operations and linear algebra
-- **Pandas** - Data manipulation
-- **Matplotlib** - Visualization
+This repository demonstrates how core linear algebra concepts -- least squares, matrix decompositions, and spectral methods -- directly power modern data science and machine learning workflows. We finish off with a mini-project involving image denoising using the truncated SVD.
-## Key Demonstrations
-1. **Least Squares Regression** - From theory to implementation
-2. **QR Decomposition and SVD** - Numerical stability in solving systems
-3. **PCA** - Dimensionality reduction
-4. **Project** - Applying low-rank approximation (via truncated SVD) to an image of my beautiful dog
+Rather than treating data science as a collection of tools, this project builds everything from first principles and connects theory to implementation through jupyter notebooks.
-## To do
-- [ ] Upload Jupyter notebook
+The compiled notebooks in this project can be viewed as a single webpage on my [website](https://pawelsarkowicz.xyz/posts/ds_for_la). Note that if you view in the notebooks in Gitlab/Github, they have a tendency to not render the latex properly.
-## Contact
-Pawel Sarkowicz
-š¼ LinkedIn
+## Structure
-š website
+This project is organized as a collection of focused notebooks:
-š» git
+```text
+images/ # saved images/visualizations
+notebooks/ # jupyter notebooks containing theory, code, visuals
+bibliography.md # references for essentially everything
+requirements.txt # python requirements
+LICENSE # project license
+```
-š§ email
+Each notebook is self-contained and moves from theory to implementation to visualization.
-January 2026
+## Dependencies
+
+* **Python 3**
+* **NumPy** -- linear algebra
+* **Pandas** -- data handling
+* **Matplotlib** -- visualization
+* **Pillow** -- imaging library
+* **scikit-learn** -- machine learning utilities
+
+## How to Run
+
+```bash
+git clone https://gitlab.com/psark/ds-for-la.git
+cd ds-for-la
+
+pip install requirements.txt
+
+jupyter notebook
+```
+
+Open any notebook inside the `notebooks/` folder.
---
-## Table of Contents
+## Topics
-1. [Introduction: The Basic Data Science Problem](#introduction)
-2. [Solving the Problem: Least Squares and Matrix Decompositions](#solving-the-problem-least-squares-regression-and-matrix-decompositions)
-3. [Principal Component Analysis](#principal-component-analysis)
-4. [Project: Spectral Image Denoising via Truncated SVD](#project-spectral-image-denoising-via-truncated-svd)
-5. [Appendix](#appendix)
-6. [Bibliography](#bibliography)
+### 1. Least Squares Regression
-## Introduction
+* Overdetermined systems
+* Normal equations
+* Geometric interpretation (projection onto column space)
+* Implementation using NumPy
-This is meant to be a not entirely comprehensive introduction to Data Science for the Linear Algebraist. There are of course many other complicated topics, but this is just to get the essence of data science (and the tools involved) from the perspective of someone with a strong linear algebra background.
+### 2. QR Decomposition & SVD
-One of the most fundamental questions of data science is the following.
+* Numerical stability vs. normal equations
+* Orthogonal bases and conditioning
+* Solving linear systems without forming $X^T X$
-> **Question**: Given observed data, how can we predict certain targets?
+### 3. Some Notes & What Can Go Wrong
-The answer of course boils down to linear algebra, and we will begin by translating data science terms and concepts into linear algebraic ones. But first, as should be common practice for the linear algebraist, an example.
+* Other vector norms ($L^1, L^\infty$), as well as matrix norms (Frobenius, Operator)
+* What can go wrong?
-> **Example**. Suppose that we observe $n=3$ houses, and for each house we record
-> - the square footage,
-> - the number of bedrooms,
-> - and additionally the sale price.
->
-> So we have a table as follows.
->
-> |House | Square ft | Bedrooms | Price (in $1000s) |
-> | --- | --- | --- | --- |
-> | 0 | 1600 | 3 | 500 |
-> | 1 | 2100 | 4 | 650 |
-> | 2 | 1550 | 2 | 475 |
->
-> So, for example, the first house is 1600 square feet, has 3 bedrooms, and costs $500,000, and so on. Our goal will be to understand the cost of a house in terms of the number of bedrooms as well as the square footage.
-> Concretely this gives us a matrix and a vector:
-> $$ X = \begin{bmatrix} 1600 & 3 \\ 2100 & 4 \\ 1550 & 2 \end{bmatrix} \text{ and } y =\begin{bmatrix} 500 \\ 650 \\ 475 \end{bmatrix} $$
-> So translating to linear algebra, the goal is to understand how $y$ depends on the columns of $X$.
+### 4. Principal Component Analysis (PCA)
+* Dimensionality reduction via spectral methods
+* Relationship between covariance matrices and eigenvectors
+* Handling correlated features
-## Translation from Data Science to Linear Algebra
+### 5. Project: Spectral Image Denoising via Truncated SVD
-| Data Science (DS) Term | Linear Algebra (LA) Equivalent | Explanation |
-| --- | --- | --- |
-| Dataset (with (n) observations and (p) features) | A matrix $X \in \mathbb{R}^{n \times p}$ | The dataset is just a matrix. Each row is an observation (a vector of features). Each column is a feature (a vector of its values across all observations). |
-| Features | Columns of $X$ | Each feature is a column in your data matrix. |
-| Observation | Rows of $X$ | Each data point corresponds to a row. |
-| Targets | A vector $y \in \mathbb{R}^{n \times 1}$ | The list of all target values is a column vector. |
-| Model parameters | A vector $\beta \in \mathbb{R}^{p \times 1}$ | These are the unknown coefficients. |
-| Model | Matrixāvector equation | The relationship becomes an equation involving matrices and vectors. |
-| Prediction Error / Residuals | A residual vector $e \in \mathbb{R}^{n \times 1}$ | Difference between actual targets and predictions. |
-| Training / "best fit" | Optimization: minimizing the norm of the residual vector | To find the "best" model by finding a model which makes the norm of the residual vector as small as possible. |
+* Low-rank approximation of images
+* Noise removal using singular value truncation
+* RGB images (channel-wise SVD)
+* Quantitative evaluation (MSE, PSNR)
-So our matrix $X$ will represent our data set, our vector $y$ is the target, and $\beta$ is our vector of parameters. We will often be interested in understanding data with "intercepts", i.e., when there is a base value given in our data. So we will augment a column of 1's (denoted by $\mathbb{1}$) to $X$ and append a parameter $\beta_0$ to the top of $\beta$, yielding
+### 6. Modelling 101
-$$ \tilde{X} = \begin{bmatrix} \mathbb{1} & X \end{bmatrix} \text{ and } \tilde{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}. $$
-
-So the answer to the Data Science problem becomes
-
-> **Answer**: Solve, or best approximate a solution to, the matrix equation $\tilde{X}\tilde{\beta} = y$.
-
-To be explicit, given $\tilde{X}$ and $y$, we want to find a $\tilde{\beta}$ that does a good job of roughly giving $\tilde{X}\tilde{\beta} = y$. There of course ways to solve (or approximate) such small systems by hand. However, one will often be dealing with enormous data sets with plenty to be desired. One view to take is that modern data science is applying numerical linear algebra techniques to imperfect information, all to get as good a solution as possible.
-
-# Solving the problem: Least Squares Regression and Matrix Decompositions
-
-If the system $\tilde{X}\tilde{\beta} = y$ is consistent, then we can find a solution. However, we are often dealing with overdetermined systems, in the sense that there are often more observations than features (i.e., more rows than columns in $\tilde{X}$, or more equations than unknowns), and therefore inconsistent systems. However, it is possible to find a **best fit** solution, in the sense that the difference
-
-$$ e = y - \tilde{X}\tilde{\beta} $$
-
-is small. By small, we often mean that $e$ is small in $L^2$ norm; i.e., we are minimizing the the sums of the squares of the differences between the components of $y$ and the components of $\tilde{X}\tilde{\beta}$. This is known as a **least squares solution**. Assuming that our data points live in the Euclidean plane, this precisely describes finding a line of best fit.
-
-
-
-The structure of this sections is as follows.
-- [Least Squares Solution](#least-squares-solution)
-- [QR Decompositions](#qr-decompositions)
-- [Singular Value Decomposition](#singular-value-decomposition)
-- [A note on other norms](#a-note-on-other-norms)
-- [A note on regularization](#a-note-on-regularization)
-- [A note on solving multiple targets concurrently](#a-note-on-solving-multiple-targets-concurrently)
-- [Polynomial regression](#polynomial-regression)
-- [What can go wrong?](#what-can-go-wrong)
-
-## Least Squares Solution
-
-Recall that the Euclidean distance between two vectors $x = (x_1,\dots,x_n) ,y = (y_1,\dots,y_n) \in \mathbb{R}^n$ is given by
-
-$$ ||x - y||_2 = \sqrt{\sum_{i=1}^n |x_i - y_i|^2}. $$
-
-We will often work with the square of the $L^2$ norm to simplify things (the square function is increasing, so minimizing the square of a non-negative function will also minimize the function itself).
-
-> **Definition**: Let $A$ be an $m \times n$ matrix and $b \in \mathbb{R}^n$. A **least-squares solution** of $Ax = b$ is a vector $x_0 \in \mathbb{R}^n$ such that
->
-> $$ \|b - Ax_0\|_2 \leq \|b - Ax\|_2 \text{ for all } x \in \mathbb{R}^n. $$
-
-So a least-squares solution to the equation $Ax = b$ is trying to find a vector $x_0 \in \mathbb{R}^n$ which realizes the smallest distance between the vector $b$ and the column space
-$$ \text{Col}(A) = \{Ax \mid x \in \mathbb{R}^n\} $$
-of $A$. We know this to be the projection of the vector $b$ onto the column space.
-
-
-
-> **Theorem**: The set of least-squares solutions of $Ax = b$ coincides with solutions of the **normal equations** $A^TAx = A^Tb$. Moreover, the normal equations always have a solution.
-
-Let us first see why we get a line of best fit.
-
-> **Example**. Let us show why this describes a line of best fit when we are working with one feature and one target. Suppose that we observe four data points
-> $$ X = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \text{ and } y = \begin{bmatrix} 1 \\ 2\\ 2 \\ 4 \end{bmatrix}. $$
-> We want to fit a line $y = \beta_0 + \beta_1x$ to these data points. We will have our augmented matrix be
-> $$ \tilde{X} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, $$
-> and our parameter be
-> $$ \tilde{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}. $$
-> We have that
-> $$ \tilde{X}^T\tilde{X} = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix} \text{ and } \tilde{X}^Ty = \begin{bmatrix} 9 \\ 27 \end{bmatrix}. $$
-> The 2x2 matrix $\tilde{X}^T\tilde{X}$ is easy to invert, and so we get that
-> $$ \tilde{\beta} = (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^Ty = \frac{1}{10}\begin{bmatrix} 15 & -5 \\ -5 & 2 \end{bmatrix}\begin{bmatrix} 9 \\ 27 \end{bmatrix} = \begin{bmatrix} 0 \\ \frac{9}{10} \end{bmatrix}. $$
-> So our line of best fit is of them form $y = \frac{9}{10}x$.
-
-Although the above system was small and we could solve the system of equations explicitly, this isn't always feasible. We will generally use python in order to solve large systems.
-- One can find a least-squares solution using `numpy.linalg.lstsq`.
-- We can set up the normal equations and solve the system by using `numpy.linalg.solve`
-Although the first approach simplifies things greatly, and is more or less what we are doing anyway, we will generally set up our problems as we would by hand, and then use `numpy.linalg.solve` to help us find a solution. However, computing $X^TX$ can cause lots of errors, so later we'll see how to get linear systems from QR decompositions and the SVD, and then apply `numpy.lingalg.solve`.
-
-Let's see how to use these for the above example, and see the code to generate the scatter plot and line of best fit.
-Again, our system is the following.
-$$ X = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \text{ and } y = \begin{bmatrix} 1 \\ 2\\ 2 \\ 4 \end{bmatrix}. $$
-We will do what we did above, but use python instead.
-```python
-import numpy as np
-
-# Define the matrix X and vector y
-X = np.array([[1], [2], [3], [4]])
-y = np.array([[1], [2], [2], [4]])
-
-# Augment X with a column of 1's (intercept)
-X_aug = np.hstack((np.ones((X.shape[0], 1)), X))
-
-# Solve the normal equations
-beta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
-```
-And what is the result?
-```python
->>> beta
-array([[-1.0658141e-15],
- [ 9.0000000e-01]])
-```
-This agrees with our by-hand computation: the intercept is tiny, so it is virtually zero, and we get 9/10 as our slope. Let's plot it.
-```python
-import matplotlib.pyplot as plt
-b, m = beta #beta[0] will be the intercept and beta[1] will be the slope
-_ = plt.plot(X, y, 'o', label='Original data', markersize=10)
-_ = plt.plot(X, m*X + b, 'r', label='Line of best fit')
-_ = plt.legend()
-plt.show()
-```
-
-
-
-What about `numpy.linalg.lstsq`? Is it any different?
-```python
-import numpy as np
-
-# Define the matrix X and vector y
-X = np.array([[1], [2], [3], [4]])
-y = np.array([[1], [2], [2], [4]])
-
-# Augment X with a column of 1's (intercept)
-X_aug = np.hstack((np.ones((X.shape[0], 1)), X))
-
-# Solve the least squares equation with matrix X_aug and target y
-beta = np.linalg.lstsq(X_aug,y)[0]
-```
-We then get
-```python
->>> beta
-array([[6.16291085e-16],
- [9.00000000e-01]])
-
-```
-So it is a little different -- and, in fact, closer to our exact answer (the intercept is zero). This makes sense -- `numpy.linalg.lstsq` won't directly compute $X^TX$, which, again, can cause quite a few issues.
+* Train/test splits
+* Regression (Linear, Ridge, LASSO, PCR)
+* Gradient descent
+* Decision trees & random forests
+* Logistic regression
+* Cross-validation
+* Feature scaling
+* Hyperparameter tuning
---
-Now going to our initial example.
+## Example: Image Denoising via SVD
-> **Example**: Let us work with the example from above. We augment the matrix with a column of 1's to include an intercept term:
-> $$ \tilde{X} = \begin{bmatrix} 1 & 1600 & 3 \\ 1 & 2100 & 4 \\ 1 & 1550 & 2 \end{bmatrix}. $$
-> Let us solve the normal equations
-> $$ \tilde{X}^T\tilde{X}\tilde{\beta} = \tilde{X}^Ty. $$
-> We have
-> $$ \tilde{X}^T\tilde{X} = \begin{bmatrix} 3 & 5250 & 9 \\ 5250 & 9372500 & 16300 \\ 9 & 16300 & 29\end{bmatrix} \text{ and } \tilde{X}^Ty = \begin{bmatrix} 1625 \\ 2901500 \\ 5050 \end{bmatrix} $$
-> Solving this system of equations yields the parameter vector $\tilde{\beta}$. In this case, we have
-> $$ \tilde{\beta} = \begin{bmatrix} \frac{200}{9} \\ \frac{5}{18} \\ \frac{100}{9} \end{bmatrix}. $$
-> When we apply $\tilde{X}$ to $\tilde{\beta}$, we get
-> $$ \tilde{X}\tilde{\beta} = \begin{bmatrix} 500 \\ 650 \\ 475 \end{bmatrix}, $$
-> which is our target on the nose. This means that we can expect, based on our data, that the cost of a house will be
-> $$ \frac{200}{9} + \frac{5}{18}(\text{square footage}) + \frac{100}{9}(\text{\# of bedrooms})$$
+Given an image matrix $A$ (for simplicity, let's go with greyscale), we compute its singular value decomposition:
-In the above, we actually had a consistent system to begin with, so our least-squares solution gave our prediction honestly. What happens if we have an inconsistent system?
+$$
+A = U \Sigma V^T
+$$
-> **Example**: Let us add two more observations, say our data is now the following.
-> |House | Square ft | Bedrooms | Price (in $1000s) |
-> | --- | --- | --- | --- |
-> | 0 | 1600 | 3 | 500 |
-> | 1 | 2100 | 4 | 650 |
-> | 2 | 1550 | 2 | 475 |
-> | 3 | 1600 | 3 | 490 |
-> | 4 | 2000 | 4 | 620 |
->
-> So setting up our system, we want a least-square solution to the matrix equation
-> $$ \begin{bmatrix} 1 & 1600 & 3 \\ 1 & 2100 & 4 \\ 1 & 1550 & 2 \\ 1 & 1600 & 3 \\ 1 & 2000 & 4 \end{bmatrix}\tilde{\beta} = \begin{bmatrix} 500 \\ 650 \\ 475 \\ 490 \\ 620 \end{bmatrix}. $$
-> Note that the system is inconsistent (the 1st and 4th rows agree in $\tilde{X}$, but they have different costs). Writing the normal equations we have
-> $$ \tilde{X}^T\tilde{X} = \begin{bmatrix} 5 & 8850 & 16 \\ 8850 & 15932500 & 29100 \\ 16 & 29100 & 54 \end{bmatrix} \text{ and } \tilde{X}y = \begin{bmatrix} 2735 \\ 4 925 250 \\ 9000 \end{bmatrix}. $$
-> Solving this linear system yields
-> $$ \tilde{\beta} = \begin{bmatrix} 0 \\ \frac{3}{10} \\ 5 \end{bmatrix}. $$
-> This is a vastly different answer! Applying $\tilde{X}$ to it yields
-> $$ \tilde{X}\tilde{\beta} = \begin{bmatrix} 495 \\ 650 \\ 475 \\ 495 \\ 620 \end{bmatrix}. $$
-> Note that the error here is
-> $$ y - \tilde{X}\tilde{\beta} = \begin{bmatrix} 5 \\ 0 \\ 0 \\ -5 \\ 0 \end{bmatrix}, $$
-> which has squared $L^2$ norm
-> $$ \|y - \tilde{X}\tilde{\beta}\|_2^2 = 25 + 25 = 50. $$
-> So this says that, given our data, we can roughly estimate the cost of a house, within 50k or so, to be
-> $$ \approx \frac{3}{10}(\text{square footage}) + 5(\text{\# of bedrooms}). $$
-In practice, our data sets can be gigantic, and so there is absolutely no hope of doing computations by hand. It is nice to know that theoretically we can do things like this though.
+We approximate the image using only the top $k$ singular values:
-> **Theorem**: Let $A$ be an $m \times n$ matrix and $b \in \mathbb{R}^n$. The following are equivalent.
->
-> 1. The equation $Ax = b$ has a unique least-squares solution for each $b \in \mathbb{R}^n$.
-> 2. The columns of $A$ are linearly independent.
-> 3. The matrix $A^TA$ is invertible.
+$$
+A_k = U_k \Sigma_k V_k^T
+$$
-In this case, the unique solution to the normal equations $A^TAx = A^Tb$ is
+This produces:
-$$ x_0 = (A^TA)^{-1}A^Tb. $$
+* **Noise reduction**
+* **Compression**
+* A direct application of the **EckartāYoungāMirsky theorem**
-Computing $\tilde{X}^T\tilde{X}$ or taking inverses are very computationally intensive tasks, and it is best to avoid doing these. Moreover, as we'll see in an example later, if we do a numerical calculation we can get close to zero and then divide where we shouldn't be, blowing up our final result. One way to get around this is to use QR decompositions of matrices.
-
-Now let's use python to visualize the above data and then solve for the least-squares solution. We'll use `pandas` in order to think about this data. We note that `pandas` incorporates `matplotlib` under the hood already, so there are some simplifications that can be made.
-```python
-import numpy as np
-import pandas as pd
-import matplotlib.pyplot as plt
-
-# First let us make a dictionary incorporating our data.
-# Each entry corresponds to a column (feature of our data)
-data = {
- 'Square ft': [1600, 2100, 1550, 1600, 2000],
- 'Bedrooms': [3, 4, 2, 3, 4],
- 'Price': [500, 650, 475, 490, 620]
-}
-
-# Create a pandas DataFrame
-df = pd.DataFrame(data)
-```
-Let's see how python formats this `DataFrame`. It will turn it into essentially the table we had at the beginning.
-```python
->>> df
- Square ft Bedrooms Price
-0 1600 3 500
-1 2100 4 650
-2 1550 2 475
-3 1600 3 490
-4 2000 4 620
-```
-So what can we do with DataFrames? First let's use `pandas.DataFrame.describe` to see some basic statistics about our data.
-```python
->>> df.describe()
- Square ft Bedrooms Price
-count 5.000000 5.00000 5.000000
-mean 1770.000000 3.20000 547.000000
-std 258.843582 0.83666 81.516869
-min 1550.000000 2.00000 475.000000
-25% 1600.000000 3.00000 490.000000
-50% 1600.000000 3.00000 500.000000
-75% 2000.000000 4.00000 620.000000
-max 2100.000000 4.00000 650.000000
-```
-This gives use the mean, the standard deviation, the min, the max, as well as some other things. We get an immediate sense of scale from our data. We can also examine the pairwise correlation of all the columns by using `pandas.DataFrame.corr`.
-```python
->>> df[["Square ft", "Bedrooms", "Price"]].corr()
- Square ft Bedrooms Price
-Square ft 1.000000 0.900426 0.998810
-Bedrooms 0.900426 1.000000 0.909066
-Price 0.998810 0.909066 1.000000
-```
-It is clear that each of the three are correlated. This makes sense, as the number of bedrooms should be increasing with the square feet. Same with the price. We'll discuss in the next section when we look at Principal Component Analysis.
-
-We can also graph our data; for example, we could create some scatter plots, one for `Square ft` vs `Price` and on for `Bedrooms` vs `Price`. We can also do a grouped bar chart. Let's start with the scatter plots.
-
-```python
-# Scatter plot for Price vs Square ft
-df.plot(
- kind="scatter",
- x="Square ft",
- y="Price",
- title="House Price vs Square footage"
-)
-plt.show()
-```
-```python
-# Scatter plot for Price vs Bedrooms
-df.plot(
- kind="scatter",
- x="Bedrooms",
- y="Price",
- title="House Price vs Bedrooms"
-)
-plt.show()
-```
-
-
-
-
-
-We can even do square footage vs bedrooms.
-```python
-# Scatter plot for Bedrooms vs Square ft
-df.plot(
- kind="scatter",
- x="Square ft",
- y="Bedrooms",
- title="Bedrooms vs Square footage"
-)
-plt.show()
-```
-
-
-
-Of course, these figures are somewhat meaningless due to how unpopulated our data is.
-
-Now let's get our matrices and linear systems set up with `pandas.DataFrame.to_numpy`.
-
-```python
-# Create our matrix X and our target y
-X = df[["Square ft", "Bedrooms"]].to_numpy()
-y = df[["Price"]].to_numpy()
-
-# Augment X with a column of 1's (intercept)
-X_aug = np.hstack((np.ones((X.shape[0], 1)), X))
-
-# Solve the least-squares problem
-beta = np.linalg.lstsq(X_aug,y)[0]
-```
-This yields
-```python
->>> beta
-array([[4.0098513e-13],
- [3.0000000e-01],
- [5.0000000e+00]])
-
-```
-As the first parameter is basically 0, we are left with the second being 3/10 and the third being 5, just like our exact solution. Next, we will look at matrix decompositions and how they can help us find least-squares solutions.
-
-## QR Decompositions
-
-QR decompositions are a powerful tool in linear algebra and data science for several reasons. They provide a way to decompose a matrix into an orthogonal matrix $Q$ aand an upper triangular matrix $R$, which can simplify many computations and analyses.
-
-> **Theorem**: Let $A$ is an $m \times n$ matrix with linearly independent columns ($m \geq n$ in this case), then $A$ can be decomposed as $A = QR$ where $Q$ is an $m \times n$ matrix whose columns form an orthonormal basis for Col($A$) and $R$ is an $n \times n$ upper-triangular invertible matrix with positive entries on the diagonal.
-
-In the literature, sometimes the QR decomposition is phrased as follows: any $m \times n$ matrix $A$ can also be written as $A = QR$ where $Q$ is an $m \times m$ orthogonal matrix ($Q^T = Q^{-1}$), and $R$ is an $m \times n$ upper-triangular matrix. One follows from the other by playing around with some matrix equations. Indeed, suppose that $A = Q_1R_1$ is a decomposition as above (that is, $Q_1$ is $m \times n$ and $R_1$ is $n \times n$). Use can use the Gram-Schmidt procedure to extend the columns of $Q_1$ to an orthonormal basis for all of $\mathbb{R}^m$, and put the remaining vectors in a $(m - n) \times n$ matrix $Q_2$. Then
-
-$$ A = Q_1R_1 = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}\begin{bmatrix} R_1 \\ 0 \end{bmatrix}. $$
-
-The left matrix is an $m \times m$ orthogonal matrix and the right matrix is $m \times n$ upper triangular. Moreover, the decomposition provides orthonormal bases for both the column space of $A$ and the perp of the column space of $A$; $Q_1$ will consist of an orthonormal basis for the column space of $A$ and $Q_2$ will consist of an orthonormal basis for the perp of the column space of $A$.
-
-However, we will often want to use the decomposition when $Q$ is $m \times n$, $R$ is $n \times n$, and the columns of $Q$ form an orthonormal basis for the column space of $A$. For example, the python function `numpy.linalg.qr` give QR decompositions this way (again, assuming that the columns of $A$ are linearly independent, so $m \geq n$).
-
-> **Key take-away**. The QR decomposition provides an orthonormal basis for the column space of $A$. If $A$ has rank $k$, then the first $k$ columns of $Q$ will form a basis for the column space of $A$.
-
-For small matrices, one can find $Q$ and $R$ by hand, assuming that $A = [ a_1\ \cdots\ a_n ]$ has full column rank. Let $e_1,\dots,e_n$ be the unnormalized vectors we get when we apply Gram-Schmidt to $c_1,\dots,c_n$, and let $u_1,\dots,u_n$ be their normalizations. Let
-$$ r_j = \begin{bmatrix} \langle e_1,c_j \rangle \\ \vdots \\ \langle e_n, c_j \rangle \end{bmatrix}, $$
-and note that $\langle e_i,c_j \rangle = 0$ whenever $i > j$. Thus
-$$ Q = \begin{bmatrix} u_1 & \cdots & u_n \end{bmatrix} \text{ and } R = \begin{bmatrix} r_1 & \cdots & r_n \end{bmatrix} $$
-give rise to a $A = QR$, where the columns of $Q$ form an orthonormal basis for $\text{Col}(A)$ and $R$ is upper-triangular. We can also compute $R$ directly from $Q$ and $Q$. Indeed, note that $Q^TQ = I$, so
-$$ Q^TA = Q^T(QR) = IR = R. $$
-
-> **Example**. Find a QR decomposition for the matrix
-> $$ A = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}. $$
-> Note that one trivially see (or by applying the Gram-Schmidt procedure) that
-> $$ \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} $$
-> forms an orthonormal basis for the column space of $A$. So with
-> $$ Q = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} \text{ and }R = \begin{bmatrix} 1 & 1 & 1\\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}, $$
-> we have $A = QR$.
-
-Let's do a more involved example.
-> **Example**. Consider the matrix
-> $$ A = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}. $$
-> One can apply the Gram-Schmidt procedure to the columns of $A$ to find that
-> $$ \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} -3 \\ 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 0 \\ -\frac{2}{3} \\ \frac{1}{3} \\ \frac{1}{3}\end{bmatrix} $$
-> forms an orthogonal basis for the column space of $A$. Normalizing, we get that
-> $$ Q = \begin{bmatrix} \frac{1}{2} & -\frac{3}{\sqrt{12}} & 0 \\ \frac{1}{2} & \frac{1}{\sqrt{12}} & -\frac{2}{\sqrt{6}} \\ \frac{1}{2} & \frac{1}{\sqrt{12}} & \frac{1}{\sqrt{6}} \\ \frac{1}{2} & \frac{1}{\sqrt{12}} & \frac{1}{\sqrt{6}} \end{bmatrix} $$
-> is an appropriate $Q$. Thus
-> $$ \begin{split} R = Q^TA &= \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & \frac{1}{2} & \frac{1}{2} \\ -\frac{3}{\sqrt{12}} & \frac{1}{\sqrt{12}} & \frac{1}{\sqrt{12}} & \frac{1}{\sqrt{12}} \\ 0 & -\frac{2}{\sqrt{6}} & \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} \end{bmatrix}\begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \\ &= \begin{bmatrix} 2 & \frac{3}{2} & 1 \\ 0 & \frac{3}{\sqrt{12}} & \frac{2}{\sqrt{12}} \\ 0 & 0 & \frac{2}{\sqrt{6}} \end{bmatrix}. \end{split} $$
-> So all together,
-> $$A = \begin{bmatrix} \frac{1}{2} & -\frac{3}{\sqrt{12}} & 0 \\ \frac{1}{2} & \frac{1}{\sqrt{12}} & -\frac{2}{\sqrt{6}} \\ \frac{1}{2} & \frac{1}{\sqrt{12}} & \frac{1}{\sqrt{6}} \\ \frac{1}{2} & \frac{1}{\sqrt{12}} & \frac{1}{\sqrt{6}} \end{bmatrix}\begin{bmatrix} 2 & \frac{3}{2} & 1 \\ 0 & \frac{3}{\sqrt{12}} & \frac{2}{\sqrt{12}} \\ 0 & 0 & \frac{2}{\sqrt{6}} \end{bmatrix}. $$
-
-To do this numerically, we can use `numpy.linalg.qr`.
-
-```python
-# Define our matrices
-A = np.array([[1,1,1],[0,1,1],[0,0,1],[0,0,0]])
-B = np.array([[1,0,0],[1,1,0],[1,1,1],[1,1,1]])
-
-# Take QR decompositions
-QA, RA = np.linalg.qr(A)
-QB, RB = np.linalg.qr(B)
-```
-Our resulting matrices are:
-```python
->>> QA
-array([[ 1., 0., 0.],
- [-0., 1., 0.],
- [-0., -0., 1.],
- [-0., -0., -0.]])
->>> RA
-array([[1., 1., 1.],
- [0., 1., 1.],
- [0., 0., 1.]])
->>> QB
-array([[-0.5 , 0.8660254 , 0. ],
- [-0.5 , -0.28867513, 0.81649658],
- [-0.5 , -0.28867513, -0.40824829],
- [-0.5 , -0.28867513, -0.40824829]])
->>> RB
-array([[-2. , -1.5 , -1. ],
- [ 0. , -0.8660254 , -0.57735027],
- [ 0. , 0. , -0.81649658]])
-
-```
-
-### How to use QR decompositions
-
-One of the primary uses of QR decompositions is to solve least squares problems, as introduced above. Assuming that $A$ has full column rank, we can write $A = QR$ as a QR decomposition, and then we can find a least-squares solution to $Ax = b$ by solving the upper-triangular system.
-
-> **Theorem**. Let $A$ be an $m \times n$ matrix with full column rank, and let $A = QR$ be a QR factorization of $A$. Then, for each $b \in \mathbb{R}^m$, the equation $Ax = b$ has a unique least-squares solution, arising from the system
-> $$ Rx = Q^Tb. $$
-
-Normal equations can be *ill-conditioned*, i.e., small errors in calculating $A^TA$ give large errors when trying to solve the least-squares problem. When $A$ has full column rank, a QR factorization will allow one to compute a solution to the least-squares problem more reliably.
-
-> **Example**. Let
-> $$ A = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \text{ and } b = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 0 \end{bmatrix}. $$
-> We can find the least-squares solution $Ax = b$ by using the QR decomposition. Let us use the QR decomposition from above, and solve the system
-> $$ Rx = Q^Tb. $$
-> As
-> $$ \begin{bmatrix} \frac{1}{2} & -\frac{3}{\sqrt{12}} & 0 \\ \frac{1}{2} & \frac{1}{\sqrt{12}} & -\frac{2}{\sqrt{6}} \\ \frac{1}{2} & \frac{1}{\sqrt{12}} & \frac{1}{\sqrt{6}} \\ \frac{1}{2} & \frac{1}{\sqrt{12}} & \frac{1}{\sqrt{6}} \end{bmatrix}^T\begin{bmatrix} 1 \\ 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} \frac{3}{2} \\ -\frac{1}{2\sqrt{3}} \\ -\frac{1}{\sqrt{6}}, \end{bmatrix} $$
-> we are looking at the system
-> $$ \begin{bmatrix} 2 & \frac{3}{2} & 1 \\ 0 & \frac{3}{\sqrt{12}} & \frac{2}{\sqrt{12}} \\ 0 & 0 & \frac{2}{\sqrt{6}} \end{bmatrix}x =\begin{bmatrix} \frac{3}{2} \\ -\frac{1}{2\sqrt{3}} \\ -\frac{1}{\sqrt{6}} \end{bmatrix}. $$
-> Solving this system yields that
-> $$ x_0 = \begin{bmatrix} 1 \\ 0 \\ -\frac{1}{2} \end{bmatrix} $$
-> is a least-squares solution to $Ax = b$.
-
-Let us set this system up in python and use `numpy.linalg.solve`.
-
-```python
-import numpy as np
-
-# Define matrix and vector
-A = np.array([[1,0,0],[1,1,0],[1,1,1],[1,1,1]])
-b = np.array([[1],[1],[1],[0]])
-
-# Take the QR decomposition of A
-Q, R = np.linalg.qr(A)
-
-# Solve the linear system Rx = Q.T b
-beta = np.linalg.solve(R,Q.T @ b)
-```
-This yields
-```python
->>> beta
-array([[ 1.00000000e+00],
- [ 6.40987562e-17],
- [-5.00000000e-01]])
-
-```
-which agrees with our exact least-squares solution.
-Note that `numpy.linalg.lstsq` still gives a **ever so slightly** different result.
-```python
->>> np.linalg.lstsq(A,b)[0]
-array([[ 1.00000000e+00],
- [ 2.22044605e-16],
- [-5.00000000e-01]])
-```
+For color images, this is applied independently to each channel (R, G, B).
---
-Let's go back to the house example. While we're at it, let's get used to using pandas to make a dataframe.
-```python
-import numpy as np
-import pandas as pd
+## Key Takeaways
-# First let us make a dictionary incorporating our data.
-# Each entry corresponds to a column (feature of our data)
-data = {
- 'Square ft': [1600, 2100, 1550, 1600, 2000],
- 'Bedrooms': [3, 4, 2, 3, 4],
- 'Price': [500, 650, 475, 490, 620]
-}
+* Data science problems can be framed as:
-# Create a pandas DataFrame
-df = pd.DataFrame(data)
+ > *approximate solutions to linear systems*
-# Create our matrix X and our target y
-X = df[["Square ft", "Bedrooms"]].to_numpy()
-y = df[["Price"]].to_numpy()
+* Numerical linear algebra is necessary; it determines:
-# Augment X with a column of 1's (intercept)
-X_aug = np.hstack((np.ones((X.shape[0], 1)), X))
+ * stability
+ * performance
+ * model reliability
-# Perform QR decomposition
-Q, R = np.linalg.qr(X_aug)
+* Spectral methods (SVD, PCA) provide:
+
+ * structure
+ * compression
+ * interpretability
+
+* Regularization connects directly to linear algebra:
+ * Ridge shifts singular values, improving condition number
+ * Lasso exploits $L^1$ geometry to product sparse solutions
+
+* Gradient descent convergence is governed by singular value structure
+ * Condition number determines learning rate stability
+ * Feature scaling reshapes the optimization landscape
-# Solve the upper triangular system Rx = Q^Ty
-beta = np.linalg.solve(R, Q.T @ y)
-```
-Let's look at the output.
-```python
->>> Q
-array([[-0.4472136 , 0.32838365, 0.40496317],
- [-0.4472136 , -0.63745061, -0.22042299],
- [-0.4472136 , 0.42496708, -0.7689174 ],
- [-0.4472136 , 0.32838365, 0.40496317],
- [-0.4472136 , -0.44428376, 0.17941406]])
->>> R
-array([[-2.23606798e+00, -3.95784032e+03, -7.15541753e+00],
- [ 0.00000000e+00, -5.17687164e+02, -1.50670145e+00],
- [ 0.00000000e+00, 0.00000000e+00, 7.27908474e-01]])
->>> beta
-array([-3.05053797e-13, 3.00000000e-01, 5.00000000e+00])
-```
-As we can see, the least-squares solution agrees with what we got by hand and by other python methods (if we agree that the tiny first component is essentially zero).
---
-The QR decomposition of a matrix is also useful for computing orthogonal projections.
-> **Theorem**. Let $A$ be an $m \times n$ matrix with full column rank. If $A = QR$ is a QR decomposition, then $QQ^T$ is the projection onto the column space of $A$, i.e., $QQ^Tb = \text{Proj}_{\text{Col}(A)}b$ for all $b \in \mathbb{R}^m$.
+## Purpose
-Let's see what our range projections are for the matrices above. Note that the first example above will have the orthogonal projection just being
-$$ \begin{bmatrix} 1 \\ & 1 \\ & & 1\\ & & & 0 \end{bmatrix}. $$
-Let's look at the other matrix.
+This project is part of a broader effort to translate a background in pure mathematics into practical data science and machine learning skills.
-> **Example**. Working with the matrix
-> $$ A = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}, $$
-> the projection onto the column space if given by
-> $$ QQ^T = \begin{bmatrix} 1 \\ & 1 \\ & & \frac{1}{2} & \frac{1}{2} \\ & & \frac{1}{2} & \frac{1}{2} \end{bmatrix}. $$
-> This is a well-understood projection: it is the direct sum of the identity on $\mathbb{R}^2$ and the projection onto the line $y = x$ in $\mathbb{R}^2$.
-Now let's use python to implement the projection.
+## Future Work
-```python
-import numpy as np
-
-# Create our matrix A
-A = np.array([[1,0,0],[1,1,0],[1,1,1],[1,1,1]])
-
-# Take the QR decomposition
-Q, R = np.linalg.qr(A)
-
-# Create the range projection
-P = Q @ Q.T
-```
-The output gives
-```python
-array([[1.00000000e+00, 2.89687929e-17, 2.89687929e-17, 2.89687929e-17],
- [2.89687929e-17, 1.00000000e+00, 7.07349921e-17, 7.07349921e-17],
- [2.89687929e-17, 7.07349921e-17, 5.00000000e-01, 5.00000000e-01],
- [2.89687929e-17, 7.07349921e-17, 5.00000000e-01, 5.00000000e-01]])
-
-```
-As we can see, the two off-diagonal blocks are all tiny, hence we treat them as zero. Note that if they were not actually zero, then this wouldn't actually be a projection. This can cause some problems. So let's fix this by introducing some tolerances.
-
-Let's write a function to implement this, assuming that columns of A are linearly independent.
-
-```python
-import numpy as np
-
-def proj_onto_col_space(A):
- # Take the QR decomposition
- Q,R = np.linalg.qr(A)
- # The projection is just Q @ Q.T
- P = Q @ Q.T
-
- return P
-```
-We'll come back to this later. We should really be incorporating some sort of error tolerance so that things are **super super** tiny can actually just be sent to zero.
-
-> **Remark**. Another way to get the projection onto the column space of an $n \times p$ matrix $A$ of full column rank is to take
-> $$ P = A(A^TA)^{-1}A^T. $$
-> Indeed, let $b \in \mathbb{R}^n$ and let $x_0 \in \mathbb{R}^p$ be a solution to the normal equations
-> $$ A^TAx_0 = A^Tb. $$
-> Then $x_0 = (A^TA)^{-1}A^Tb$ and so $Ax_0 = A(A^TA^{-1})A^Tb$ is the (unique!) vector in the column space of $A$ which is closest to $b$, i.e., the projection of $b$ onto the column space of $A$.
-> However, taking transposes, multiplying, and inverting is not what we would like to do numerically.
-
-## Singular Value Decomposition
-
-The SVD is a very important matrix decomposition in both data science and linear algebra.
-
-> **Theorem**. For any matrix $n \times p$ matrix $X$, there exist an orthogonal $n \times n$ matrix $U$, an orthogonal $p \times p$ matrix $V$, and a diagonal $n \times p$ matrix $\Sigma$ with non-negative entries such that
-> $$ X = U\Sigma V^T. $$
-> - The columns of $U$ are left **left singular vectors**.
-> - The columns of $V$ are the **right singular vectors**.
-> - $\Sigma$ has **singular values** $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$ on its diagonal, where $r$ is the rank of $X$.
-
-> **Remark**. The SVD is clearly a generalization of matrix diagonalization, but it also generalizes the **polar decomposition** of a matrix. Recall that every $n \times n$ matrix $A$ can be written as $A = UP$ where $U$ is orthogonal (or unitary) and $P$ is a positive matrix. This is because if
-> $$ A = U_0\Sigma V^T $$
-> is the SVD for $A$, then $\Sigma$ is an $n \times n$ diagonal matrix with non-negative entries, hence any orthogonal conjugate of it is positive, and so
-> $$ A = (U_0V^T)(V\Sigma V^T). $$
-> Take $U = U_0V^T$ and $P = V\Sigma V^T$.
-
-By hand, the algorithm for computing an SVD is as follows.
-1. Both $AA^T$ and $A^TA$ are symmetric (they are positive in fact), and so they can be orthogonally diagonalized; one can form an orthogonal basis of eigenvectors. Let $v_1,\dots,v_p$ be an orthonormal basis of eigenvectors for $\mathbb{R}^p$ which correspond to eigenvectors of $A^TA$ in decreasing order. Suppose that $A^TA$ has $r$ non-zero eigenvalues. Let $V$ be the matrix whose columns contain the $v_i$'s. This gives our right singular vectors and our singular values.
-2. Let $u_i = \frac{1}{\sigma_i}Av_i$ for $i = 1,\dots,r$, and extend this collection of vectors to an orthonormal basis for $\mathbb{R}^n$ if necessary. Let $U$ be the corresponding matrix.
-3. Let $\Sigma$ be the $n \times p$ matrix whose diagonal entries are $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r$, and then zeroes if necessary.
-
-> **Example**. Let us compute the SVD of
-> $$ A = \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & -2 \end{bmatrix}. $$
-> First we note that
-> $$ A^TA = \begin{bmatrix} 13 & 12 & 2 \\ 12 & 13 & -2 \\ 2 & -2 & 8 \end{bmatrix}, $$
-> which has eigenvalues $25,9,0$ with corresponding eigenvectors
-> $$ \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ -1 \\ 4 \end{bmatrix}, \begin{bmatrix} -2 \\ 2 \\ 1 \end{bmatrix}. $$
-> Normalizing, we get
-> $$ V = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{3\sqrt{2}} & -\frac{2}{3} \\ \frac{1}{\sqrt{2}} & -\frac{1}{3\sqrt{2}} & \frac{2}{3} \\ 0 & \frac{4}{3\sqrt{2}} & \frac{1}{3} \end{bmatrix}. $$
-> Now we set $u_1 = \frac{1}{5}Av_1$ and $u_2 = \frac{1}{3}Av_2$ to get
-> $$ U = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{bmatrix}. $$
-> So
-> $$ A = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{bmatrix} \begin{bmatrix} 5 & 0 & 0 \\ 0 & 3 & 0 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{3\sqrt{2}} & -\frac{2}{3} \\ \frac{1}{\sqrt{2}} & -\frac{1}{3\sqrt{2}} & \frac{2}{3} \\ 0 & \frac{4}{3\sqrt{2}} & \frac{1}{3} \end{bmatrix}^T $$
-> is our SVD decomposition.
-
-We note that in practice, we avoid the computation of $X^TX$ because if the entries of $X$ have errors, then these errors will be squared in $X^TX$. There are better computational tools to get singular values and singular vectors which are more accurate. This is what our python tools will use.
-
-Let's use `numpy.linalg.svd` for the above matrix.
-
-```python
-import numpy as np
-
-#Define our matrix
-A = np.array([[3,2,2],[2,3,-2]])
-
-# Take the SVD
-U, S, Vh = np.linalg.svd(A)
-```
-Our SVD matrices are
-```python
->>> U
-array([[-0.70710678, -0.70710678],
- [-0.70710678, 0.70710678]])
->>> S
-array([5., 3.])
-# Note that Vh already gives the transpose of the matrix V we get
-# in our SVD. So we'll take the transpose again to get
-# the appropriate rows
->>> Vh.T
-array([[-7.07106781e-01, -2.35702260e-01, -6.66666667e-01],
- [-7.07106781e-01, 2.35702260e-01, 6.66666667e-01],
- [-6.47932334e-17, -9.42809042e-01, 3.33333333e-01]])
-```
-
-
-Because the eigenvalues of the hermitian squares of
-$$ \begin{bmatrix} 1 & 1 & 1\\ 0 & 1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} \text{ and } \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} $$
-are quite atrocious, an exact SVD decomposition is difficult to compute by hand. However, we can of course use python.
-
-```python
-import numpy as np
-
-# Define our matrices
-A = np.array([[1,1,1],[0,1,1],[0,0,1],[0,0,0]])
-B = np.array([[1,0,0],[1,1,0],[1,1,1],[1,1,1]])
-
-# SVD decomposition
-U_A, S_A, Vh_A = np.linalg.svd(A)
-U_B, S_B, Vh_B = np.linalg.svd(B)
-```
-The resulting matrices are
-```python
->>> U_A
-array([[ 0.73697623, 0.59100905, 0.32798528, 0. ],
- [ 0.59100905, -0.32798528, -0.73697623, 0. ],
- [ 0.32798528, -0.73697623, 0.59100905, 0. ],
- [ 0. , 0. , 0. , 1. ]])
->>> S_A
-array([2.2469796 , 0.80193774, 0.55495813])
->>> Vh_A.T
-array([[ 0.32798528, 0.73697623, 0.59100905],
- [ 0.59100905, 0.32798528, -0.73697623],
- [ 0.73697623, -0.59100905, 0.32798528]])
->>> U_B
-array([[-2.41816250e-01, 7.12015746e-01, -6.59210496e-01,
- 0.00000000e+00],
- [-4.52990541e-01, 5.17957311e-01, 7.25616837e-01,
- 6.71536163e-17],
- [-6.06763739e-01, -3.35226641e-01, -1.39502200e-01,
- -7.07106781e-01],
- [-6.06763739e-01, -3.35226641e-01, -1.39502200e-01,
- 7.07106781e-01]])
->>> S_B
-array([2.8092118 , 0.88646771, 0.56789441])
->>> Vh_B.T
-array([[-0.67931306, 0.63117897, -0.37436195],
- [-0.59323331, -0.17202654, 0.7864357 ],
- [-0.43198148, -0.75632002, -0.49129626]])
-```
-
-Another final note is that the **operator norm** of a matrix $A$ agrees with its largest singular value.
-
-### Pseudoinverses and using the SVD
-The SVD can be used to determine a least-squares solution for a given system. Recall that if $v_1,\dots,v_p$ is an orthonormal basis for $\mathbb{R}^p$ consisting of eigenvectors of $A^TA$, arranged so that they correspond to eigenvalues $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r$, then $\{Av_1,\dots,Av_r\}$ is an orthogonal basis for the column space of $A$. In essence, this means that when we have our left singular vectors $u_1,\dots,u_n$ (constructed based on our algorithm as above), we have that the first $r$ vectors form an orthonormal basis for the column space of $A$, and that the remaining $n - r$ vectors form an orthonormal basis for the perp of the column space of $A$ (which is also equal to the nullspace of $A^T$).
-
-> **Definition**. Let $A$ be an $n \times p$ matrix and suppose that the rank of $A$ is $r \leq \min\{n,p\}$. Suppose that $A = U\Sigma V^T$ is the SVD, where the singular values are decreasing. Partition
-> $$ U = \begin{bmatrix} U_r & U_{n-r} \end{bmatrix} \text{ and } V = \begin{bmatrix} V_r & V_{p-r} \end{bmatrix} $$
-> into submatrices, where $U_r$ and $V_r$ are the matrices whose columns are the first $r$ columns of $U$ and $V$ respectively. So $U_r$ is $n \times r$ and $V_r$ is $p \times r$. Let $D$ be the diagonal $r \times r$ matrices whose diagonal entries are $\sigma_1,\dots, \sigma_r$, so that
->$$ \Sigma = \begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix} $$
-> and note that
-> $$ A = U_rDV_r^T. $$
-> We call this the reduced singular value decomposition of $A$. Note that $D$ is invertible, and its inverse is simply
-> $$ D = \begin{bmatrix} \sigma_1^{-1} \\ & \sigma_2^{-1} \\ & & \ddots \\ & & & \sigma_r^{-1} \end{bmatrix}. $$
-> The **pseudoinverse** (or **Moore-Penrose inverse**) of $A$ is the matrix
-> $$ A^+ = V_rD^{-1}U_r^T. $$
-
-We note that the pseudoinverse $A^+$ is a $p \times n$ matrix.
-
-With the pseudoinverse, we can actually find least-squares solutions quite easily. Indeed, if we are looking for the least-squares solution to the system $Ax = b$, define
-$$ x_0 = A^+b. $$
-Then
-$$ \begin{split} Ax_0 &= (U_rDV_r^T)(V_rD^{-1}U_r^Tb) \\ &= U_rDD^{-1}U_r^Tb \\ &= U_rU_r^Tb \end{split} $$
-As mentioned before, the columns of $U_r$ form an orthonormal basis for the column space of $A$ and so $U_rU_r^T$ is the orthogonal projection onto the range of $A$. That is, $Ax_0$ is precisely the projection of $b$ onto the column space of $A$, meaning that this yields a least-squares solution. This gives the following.
-
-> **Theorem**. Let $A$ be an $n \times p$ matrix and $b \in \mathbb{R}^n$. Then
-> $$ x_0 = A^+b$$
-> is a least-squares solution to $Ax = b$.
-
-Taking pseudoinverses is quite involved. We'll do one example by hand, and then use python -- and we'll see something go wrong! There is a function `numpy.linalg.pinv` in numpy that will take a pseudoinverse. We can also just use `numpy.linalg.svd` and do the process above.
-
-> **Example**. We have the following SVD $A = U\Sigma V^T$.
-> $$ \begin{bmatrix} 1 & 1 & 2\\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} \sqrt{\frac{2}{3}} & 0 & 0 & -\frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{6}} & -\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{3}} \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 3 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} \frac{1}{\sqrt{6}} & -\frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{3}} \\ \sqrt{\frac{2}{3}} & 0 & \frac{1}{\sqrt{3}} \end{bmatrix}^T. $$
-> Can we find a least-squares solution to $Ax = b$, where
-> $$ b = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}? $$
-> The pseudoinverse of $A$ is
-> $$ \begin{split} A^+ &= \begin{bmatrix} \frac{1}{\sqrt{6}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{2}} \\ \sqrt{\frac{2}{3}} & 0 \end{bmatrix} \begin{bmatrix} 3 \\ & 1 \end{bmatrix} \begin{bmatrix} \sqrt{\frac{2}{3}} & 0 \\ \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{6}} & -\frac{1}{\sqrt{2}} \\ 0 & 0 \end{bmatrix}^T \\ &= \begin{bmatrix} \frac{1}{9} & -\frac{4}{9} & \frac{5}{9} & 0 \\ \frac{1}{9} & \frac{5}{9} & -\frac{4}{9} & 0 \\ \frac{2}{9} & \frac{1}{9} & \frac{1}{9} & 0\end{bmatrix}, \end{split} $$
-> and so a least-squares solution is given by
-> $$ \begin{split} x_0 &= A^+b \\ &= \begin{bmatrix} \frac{1}{9} & -\frac{4}{9} & \frac{5}{9} & 0 \\ \frac{1}{9} & \frac{5}{9} & -\frac{4}{9} & 0 \\ \frac{2}{9} & \frac{1}{9} & \frac{1}{9} & 0\end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \\ &= \begin{bmatrix} \frac{2}{9} \\ \frac{2}{9} \\ \frac{4}{9} \end{bmatrix}. \end{split} $$
-
-Now let's do this with python, and see an example of how things can go wrong. We'll try to take the pseudoinverse manually first.
-
-```python
-import numpy as np
-
-# Create our matrix A and our target b
-A = np.array([[1,1,2],[0,1,1],[1,0,1],[0,0,0]])
-b = np.array([[1],[1],[1],[1]])
-
-# Take the SVD decomposition
-U, S, Vh = np.linalg.svd(A)
-
-# Prepare the pseudoinverse
-# Recall that we invert the non-zero diagonal entries of the diagonal matrix.
-# So we first build S_inv to be the appropriate size
-S_inv = np.zeros((Vh.shape[0], U.shape[0]))
-# We then fill in the appropriate values on the diagonal
-S_inv[:len(S), :len(S)] = np.diag(1/S)
-
-# Build the pseudoinverse
-A_pinv = Vh.T @ S_inv @ U.T
-
-# Compute the least-squares solution
-beta = A_pinv @ b
-```
-What is the result?
-```python
->>> beta
-array([[ 2.74080345e+15],
- [ 2.74080345e+15],
- [-2.74080345e+15]])
-
-```
-This is **WAY** off the mark. So what happened? Well, when we look at our singular values, we have
-```python
->>> S
-array([3.00000000e+00, 1.00000000e+00, 1.21618839e-16])
-```
-As we got this matrix numerically, the last entry is actually non-zero, but tiny. This isn't exactly what's going on since we know that the rank of A is 2. So when we invert the singular values and throw them on the diagonal, have `1/1.21618839e-16` which is a very large value. This value then messes up the rest of the computation. So how do we fix this? One can set tolerances in numpy, but we'll get to that later. Let's just note that `numpy.linalg.pinv` will already incorporate this. Let's see what we get.
-
-```python
-import numpy as np
-
-# Create our matrix A and our target b
-A = np.array([[1,1,2],[0,1,1],[1,0,1],[0,0,0]])
-b = np.array([[1],[1],[1],[1]])
-
-# Build the pseudoinverse
-A_pinv = np.linalg.pinv(A)
-
-# Compute the least-squares solution
-beta = A_pinv @ b
-```
-```python
->>> A_pinv
-array([[ 0.11111111, -0.44444444, 0.55555556, 0. ],
- [ 0.11111111, 0.55555556, -0.44444444, 0. ],
- [ 0.22222222, 0.11111111, 0.11111111, 0. ]])
->>> beta
-array([[0.22222222],
- [0.22222222],
- [0.44444444]])
-```
-
-### The Condition Number
-Numerical calculations involving matrix equations are quite reliable if we use the SVD. This is because the orthogonal matrices $U$ and $V$ preserve lengths and angles, leaving the stability of the problem to be governed by the singular values of the matrix $X$. Recall that if $X = U\Sigma V^T$, then solving the least-squares problem involves dividing by the non-zero singular values $\sigma_i$ of $X$. If these values are very small, their inverses become very large, and this will amplify any numerical errors.
-
-> **Definition**. Let $X$ be an $n \times p$ matrix and let $\sigma_1 \geq \cdots \geq \sigma_r$ be the non-zero singular values of $X$. The **condition number** of $X$ is the quotient
-> $$ \kappa(X) = \frac{\sigma_1}{\sigma_r} $$
-> of the largest and smallest non-zero singular values.
-
-A condition number close to 1 indicates a well-conditioned problem, while a large condition number indicates that small perturbations in data may lead to large changes in computation. Geometrically, $\kappa(X)$ measures how much $X$ distorts space.
-
-> **Example**. Consider the matrices
-> $$ A = \begin{bmatrix} 1 \\ & 1 \end{bmatrix} \text{ and } B = \begin{bmatrix} 1 \\ & \frac{1}{10^6} \end{bmatrix}. $$
-> The condition numbers are
-> $$ \kappa(A) = 1 \text{ and } \kappa(B) = 10^6. $$
-> Inverting $X_2$ includes dividing by $\frac{1}{10^6}$, which will amplify errors by $10^6$.
-
-Let's look our main example in python by using `numpy.linalg.cond`.
-
-```python
-import numpy as np
-import pandas as pd
-
-# First let us make a dictionary incorporating our data.
-# Each entry corresponds to a column (feature of our data)
-data = {
- 'Square ft': [1600, 2100, 1550, 1600, 2000],
- 'Bedrooms': [3, 4, 2, 3, 4],
- 'Price': [500, 650, 475, 490, 620]
-}
-
-# Create a pandas DataFrame
-df = pd.DataFrame(data)
-
-# Create out matrix X
-X = df[['Square ft', 'Bedrooms']].to_numpy()
-
-# Check the condition number
-cond_X = np.linalg.cond(X)
-```
-Let's see what we got.
-```python
->>> cond_X
-np.float64(4329.082589067693)
-```
-so this is quite a high condition number! This should be unsurprising, as clearly the number of bedrooms is correlated to the size of a house (especially so in our small toy example).
-## A note on other norms
-
-There are other canonical choices of norms for vectors and matrices. While $L^2$ leads naturally to least-squares problems with closed-form solutions, other norms induce different geometries and different optimal solutions. From the linear algebra perspective, changing the norm affects:
-- the shape of the unit ball,
-- the geometry of approximation,
-- the numerical behaviour of optimization problems.
-
-### $L^1$ norm (Manhattan distance)
-The $L^1$ norm of a vector $x = (x_1,\dots,x_p) \in \mathbb{R}^p$ is defined as
-$$ \|x\|_1 = \sum |x_i|. $$
-Minimizing the $L^1$ norm is less sensitive to outliers. Geometrically, the $L^1$ unit ball in $\mathbb{R}^2$ is a diamond (a rotated square), rather than a circle.
-
-)
-
-Consequently, optimization problems involving $L^1$ tend to produce solutions which live on the corners of this polytope.
-Solutions often require linear programming or iterative reweighted least squares.
-
-$L^1$ based methods (such as LASSO) tend to set coefficients to be exactly zero. Unlike with $L^2$, the minimization problem for $L^1$ does not admit a closed form solution. Algorithms include:
-- linear programming formulations,
-- iterative reweighted least squares,
-- coordinate descent methods.
-
-### $L^{\infty}$ norm (max/supremum norm)
-The supremum norm defined as
-$$ \|x\|_{\infty} = \max |x_i| $$
-seeks to control the worst-case error rather than the average error. Minimizing this norm is related to Chebyshev approximation by polynomials.
-
-Geometrically, the unit ball of $\mathbb{R}^2$ with respect to the $L^{\infty}$ norm looks like a square.
-
-
-
-Problems involving the $L^{\infty}$ norm are often formulated as linear programs, and are useful when worst-case guarantees are more important than optimizing average performance.
-
-### Matrix norms
-
-There are also various norms on matrices, each highlighting a different aspect of the associated linear transformation.
-- **Frobenius norm**. This is an important norm, essentially the analogue of the $L^2$ norm for matrices. It is the Euclidean norm if you think of your matrix as a vector, forgetting its rectangular shape. For $A = (a_{ij})$ a matrix, the Frobenius norm
- $$ \|A\|_F = \sqrt{\sum a_{ij}^2} $$
- is the square root of the sum of squares of all the entries. This treats a matrix as a long vector and is invariant under orthogonal transformations. As we'll see, it plays a central role in:
- - least-squares problems,
- - low-rank approximation,
- - principal component analysis.
-
- In particular, the truncated SVD yields a best low-rank approximation of a matrix with respect to the Frobenius norm.
-
- We also that that the Frobenius norm can be written in terms of tracial data. We have that
- $$ \|A\|_F^2 = \text{Tr}(A^TA) = \text{Tr}(AA^T). $$
-- **Operator norm** (spectral norm). This is just the norm as an operator $A: \mathbb{R}^p \to \mathbb{R}^n$, where $\mathbb{R}^p$ and $\mathbb{R}^n$ are thought of as Hilbert spaces:
- $$ \|A\| = \max_{\|x\|_2 = 1}\|Ax\|_2. $$
- This norm measures how big of an amplification $A$ can apply, and is equal to the largest singular value of $A$. This norm is related to stability properties, and is the analogue of the $L^{\infty}$ norm.
-- **Nuclear norm**. The nuclear norm, defined as
- $$ \|A\|_* = \sum \sigma_i, $$
- is the sum of the singular values. When $A$ is square, this is precisely the trace-class norm, and is the analogue of the $L^1$ norm. This norm acts as a generalization of the concept of rank.
-
-## A note on regularization
-
-Regularization introduces additional constraints or penalties to stabilize ill-posed problems. From the linear algebra point of view, regularization modifies the singular value structure of a problem.
-- **Ridge regression**: add a positive multiple $\lambda\cdot I$ of the identity to $X^TX$ which will artificially inflate small singular values.
-- This dampens unstable directions while leaving well-conditioned directions largely unaffected.
-
-Geometrically, regularization reshapes the solution space to suppress directions that are poorly supported by the data.
-## A note on solving multiple targets concurrently
-
-Suppose now that we were interested in solving several problems concurrently; that is, given some data points, we would like to make $k$ predictions. Say we have our $n \times p$ data matrix $X$, and we want to make $k$ predictions $y_1,\dots,y_k$. We can then set the problem up as finding a best solution to the matrix equation
-$$ XB = Y $$
-where $B$ will be a $p \times k$ matrix of parameters and $Y$ will be the $p \times k$ matrix whose columns are $y_1,\dots,y_k$.
-
-## Polynomial Regression
-
-Sometimes fitting a line to a set of $n$ data points clearly isn't the right thing to do. To emphasize the limitations of linear models, we generate data from a purely quadratic relationship. In this setting, the space of linear functions is not rich enough to capture the underlying structure, and the linear least-squares solution exhibits systematic error. Expanding the feature space to include quadratic terms resolves this issue.
-
-For example, suppose our data looked like the following.
-
-
-
-If we try to find a line of best fit, we get something that doesn't really describe or approximate our data at all...
-
-
-
-This is an example of **underfitting** data, and we can do better. The same linear regression ideas work for fitting a degree $d$ polynomial model to a set of $n$ data points. Before, when trying to fit a line to points $(x_1,y_1),\dots,(x_n,y_n)$, we had the following matrices
-$$ \tilde{X} = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \tilde{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} $$
-in the matrix equation
-$$ \tilde{X}\tilde{\beta} = y, $$
-and we were trying to find a vector $\tilde{\beta}$ which gave a best possible solution. This would give us a line $y = \beta_0 + \beta_1x$ which best approximates the data. To fit a polynomial $y = \beta_0 + \beta_1x + \beta_2x^2 + \cdots + \beta_d^dx^d$ to the data, we have a similar set up.
-
-> **Definition**. The **Vandermonde matrix** is the $n \times (d+1)$ matrix
-> $$ V = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix}. $$
-
-With the Vandermonde matrix, to find a polynomial function of best fit, one just needs to find a least-squares solution to the matrix equation
-$$ V\tilde{\beta} = y. $$
-
-With the generated data above, we get the following curve.
-
-
-
-Solving these problems can be done with python. One can use `numpy.polyfit` and `numpy.poly1d`.
-
-> **Example**. Consider the following data.
-> |House | Square ft | Bedrooms | Price (in $1000s) |
-> | --- | --- | --- | --- |
-> | 0 | 1600 | 3 | 500 |
-> | 1 | 2100 | 4 | 650 |
-> | 2 | 1550 | 2 | 475 |
-> | 3 | 1600 | 3 | 490 |
-> | 4 | 2000 | 4 | 620 |
->
-> Suppose we wanted to predict the price of a house based on the square footage and we thought the relationship was cubic (it clearly isn't, but hey, for the sake of argument). So really we are looking at the subset of data
-> |House | Square ft | Price (in $1000s) |
-> | --- | --- | --- |
-> | 0 | 1600 | 500 |
-> | 1 | 2100 | 650 |
-> | 2 | 1550 | 475 |
-> | 3 | 1600 | 490 |
-> | 4 | 2000 | 620 |
->
-> Our Vandermonde matrix will be
-> $$ V = \begin{bmatrix} 1 & 1600 & 1600^2 & 1600^3 \\ 1 & 2100 & 2100^2 & 2100^3 \\ 1 & 1550 & 1550^2 & 1550^3 \\ 1 & 1600 & 1600^2 & 1600^3 \\ 1 & 2000 & 2000^2 & 2000^3 \end{bmatrix} $$
-> and our target vector will be
-> $$ y = \begin{bmatrix} 500 \\ 650 \\ 475 \\ 490 \\ 620 \end{bmatrix}. $$
-> As we can see, the entries of the Vandermonde matrix get very very large very fast. One can, if they are so inclined, compute a least-squares solution to $V\tilde{\beta} = y$ by hand. Let's not, but let us find, using python, a "best" cubic approximation of the relationship between the square footage and price.
-
-We will use `numpy.polyfit`, `numpy.pold1d` and `numpy.linspace`.
-```python
-import numpy as np
-import pandas as pd
-import matplotlib as plt
-
-# First let us make a dictionary incorporating our data.
-# Each entry corresponds to a column (feature of our data)
-data = {
- 'Square ft': [1600, 2100, 1550, 1600, 2000],
- 'Bedrooms': [3, 4, 2, 3, 4],
- 'Price': [500, 650, 475, 490, 620]
-}
-
-# Create a pandas DataFrame
-df = pd.DataFrame(data)
-
-# Extract x (square footage) and y (price)
-x = df["Square ft"].to_numpy(dtype=float)
-y = df["Price"].to_numpy(dtype=float)
-
-# Degree of polynomial
-degree = 3 # cubic
-
-# Polyfit directly on x
-cubic = np.poly1d(np.polyfit(x,y, degree))
-
-# Add fitted polynomial line and scatter plot
-polyline = np.linspace(x.min(),x.max())
-plt.scatter(x,y, label="Observed data")
-plt.plot(polyline, cubic(polyline), 'r', label="Cubic best fit")
-plt.xlabel("Square ft")
-plt.ylabel("Price (in $1000s)")
-plt.title("Cubic polynomial regression: Price vs Square Footage")
-plt.show()
-```
-
-So we get a cubic of best fit.
-
-
-
-Here `numpy.polyfit` computes the least-squares solution in the polynomial basis $1, x, x^2, x^3$, i.e., it solves the Vandermonde least-squares problem. So what is our cubic polynomial?
-
-```python
->>> cubic
-poly1d([ 3.08080808e-07, -1.78106061e-03, 3.71744949e+00, -2.15530303e+03])
-```
-The first term is the degree 3 term, the second the degree 2 term, the third the degree 1 term, and the fourth is the constant term.
-
-## What can go wrong?
-
-We are often dealing with imperfect data, so there is plenty that could go wrong. Here are some basic cases of where things can break down.
-
-- **Perfect multicolinearity**: non-invertible $\tilde{X}^T\tilde{X}$. This happens when one feature is a perfect combination of the others. This means that the columns of the matrix $\tilde{X}$ are linearly dependent, and so infinitely many solutions will exist to the least-squares problem.
- - For example, if you are looking at characteristics of people and you have height in both inches and centimeters.
-- **Almost multicolinearity**: this happens when one features is **almost** a perfect combination of the others. From the linear algebra perspective, the columns of $\tilde{X}$ might not be dependent, but they will be be **almost** linearly dependent. This will cause problems in calculation, as the condition number will become large and amplify numerical errors. The inverse will blow up small spectral components.
-- **More features than observations**: this means that our matrix $\tilde{X}$ will be wider than it is high. Necessarily, this means that the columns are linearly dependent. Regularization or dimensionality reduction becomes essential.
-- **Redundant or constant features**: this is when there is a characteristic that is satisfied by each observation. In terms of the linear algebraic data, this means that one of the columns of $X$ is constant.
- - e.g., if you are looking at characteristics of penguins, and you have "# of legs". This will always be two, and doesn't add anything to the analysis.
-- **Underfitting**: the model lacks sufficient expressivity to capture the underlying structure. For example, see the section on polynomial regression -- sometimes one might want a curve vs. a straight line.
-
-
-- **Overfitting**: the model captures noise rather than structure. Often due to model complexity relative to data size. Polynomial regression can give a nice visualization of overfitting. For example, if we worked with the same generated quadratic data from the polynomial regression section, and we tried to approximation it by a degree 11 polynomial, we get the following.
-
-
-- **Outliers**: large deviations can dominate the $L^2$ norm. This is where normalization might be key.
-- **Heteroscedasticity**: this is when the variance of noise changes across observations. Certain least-squares assumptions will break down.
-- **Condition number**: a large condition number indicates numerical instability and sensitivity to perturbation, even when formal solutions exist.
-- **Insufficient tolerance**: in numerical algorithms, thresholds used to determine rank or invertibility must be chosen carefully. Poor choices can lead to misleading results.
-
-The point is that many failures in data science are not conceptual, but they happen geometrically and numerically. Poor choices lead to poor results.
-
-# Principal Component Analysis
-
-Principal Component Analysis (PCA) addresses the issues of multicollinearity and dimensionality mentioned at the end of the previous section by transforming the data into a new coordinate system. The new axes -- called principal components -- are chosen to capture the maximum variance in the data. In linear algebra terms, we are finding a subspace of potentially smaller dimension that best approximates our data.
-
-> **Example**: Let us return to our house example. Suppose we decide to list the square footage in both square feet and square meters. Let's add this feature to our dataset.
-> |House | Square ft | Square m | Bedrooms | Price (in $1000s) |
-> | --- | --- | --- | --- | --- |
-> | 0 | 1600 | 148 | 3 | 500 |
-> | 1 | 2100 | 195 | 4 | 650 |
-> | 2 | 1550 | 144 | 2 | 475 |
-> | 3 | 1600 | 148 | 3 | 490 |
-> | 4 | 2000 | 185 | 4 | 620 |
->
-> In this case, our associated matrix is:
-> $$ X = \begin{bmatrix} 1600 & 148 & 3 & 500 \\ 2100 & 195 & 4 & 650 \\ 1550 & 144 & 2 & 475 \\ 1600 & 148 & 3 & 490 \\ 2000 & 185 & 4 & 620 \end{bmatrix} $$
-
-There are a few problems with the above data and the associated matrix $X$ (this time, we're not looking to make predictions, so we don't omit the last column).
-- **Redundancy**: Square feet and square meters give the same information. It's just a matter of if you're from a civilized country or from an uncivilized country.
-- **Numerical instability**: The columns of $X$ are nearly linearly dependent. Indeed, the second column is almost a multiple of the first. Moreover, one can make a safe bet that the number of bedrooms increases as the square footage does, so that the first and third columns are correlated.
-- **Interpretation difficulty**: We used the square footage and bedrooms *together* in the previous section to predict the price of a house. However, because of their correlation, this obfuscates the true relationship, say, between the square footage and the price of a house, or the number of bedrooms and the price of a house.
-
-So the question becomes: what do we do about this? We will try to get a smaller matrix (less columns) that contains the same, or a close enough, amount of information. The point is that the data is *effectively* lower-dimensional.
-
-Let's do a little analysis on our dataset before progressing. Let's use `pandas.DataFrame.describe`, `pandas.DataFrame.corr` and `numpy.linalg.cond`. First, let's set up our data.
-
-```python
-import numpy as np
-import pandas as pd
-
-# First let us make a dictionary incorporating our data.
-# Each entry corresponds to a column (feature of our data)
-data = {
- 'Square ft': [1600, 2100, 1550, 1600, 2000],
- 'Square m': [148, 195, 144, 148, 185],
- 'Bedrooms': [3, 4, 2, 3, 4],
- 'Price': [500, 650, 475, 490, 620]
-}
-
-# Create a pandas DataFrame
-df = pd.DataFrame(data)
-
-# Create out matrix X
-X = df.to_numpy()
-```
-
-Now let's see what it has to offer.
-
-```python
-# Describe the data
->>> df.describe()
- Square ft Square m Bedrooms Price
-count 5.000000 5.000000 5.00000 5.000000
-mean 1770.000000 164.000000 3.20000 547.000000
-std 258.843582 24.052027 0.83666 81.516869
-min 1550.000000 144.000000 2.00000 475.000000
-25% 1600.000000 148.000000 3.00000 490.000000
-50% 1600.000000 148.000000 3.00000 500.000000
-75% 2000.000000 185.000000 4.00000 620.000000
-max 2100.000000 195.000000 4.00000 650.000000
-# View correlations
->>> df.corr()
- Square ft Square m Bedrooms Price
-Square ft 1.000000 0.999886 0.900426 0.998810
-Square m 0.999886 1.000000 0.894482 0.998395
-Bedrooms 0.900426 0.894482 1.000000 0.909066
-Price 0.998810 0.998395 0.909066 1.000000
-# Check the condition number
->>> np.linalg.cond(X)
-np.float64(8222.19067218415)
-
-```
-
-As we can see, everything is basically correlated, and we clearly have some redundancies.
-
-This section is structured as follows.
-- [Low-rank approximation via SVD](#low-rank-approximation-via-svd)
-- [Centering data](#centering-data)
-
-
-## Low-rank approximation via SVD
-
-Let $A$ be an $n \times p$ matrix and let $A = U\Sigma V^T$ be a SVD. Let $u_1,\dots,u_n$ be the columns of $U$, $v_1,\dots,v_p$ be the column of $V$, and $\sigma_1 \geq \cdots \sigma_r > 0$ be the singular values, where $r \leq \min\{n,p\}$ is the rank of $A$. Then we have the **reduced singular value decomposition** (see [Pseudoinverses and using the svd](#pseudoinverses-and-using-the-svd))
-$$ A = \sum_{i=1}^r \sigma_i u_iv_i^T $$
-(note that $u_i$ is a $n \times 1$ matrix and $v_i$ is a $p \times 1$ matrix, so $u_iv_i^T$ is some $n \times p$ matrix).
-The key idea is that if the rank of $A$ is higher, say $s$, but the latter singular values are small, then we should still have an approximation like this. Say $\sigma_{r+1},\dots,\sigma_{s}$ are tiny. Then
-$$ \begin{split} A &= \sum_{i=1}^s \sigma_i u_i v_i^T \\ &= \sum_{i=1}^r \sigma_i u_iv_i^T + \sum_{i=r+1}^{s} \sigma_i u_iv_i^T \\ &\approx \sum_{i=1}^r \sigma_iu_i v_i^T \end{split}. $$
-So defining $A_r := \sum_{i=1}^r \sigma_i u_iv_i^T$, we are approximating $A$ by $A_r$.
-
-In what sense is this a good approximation though? Recall that the Frobenius norm of a matrix $A$ is defined as the sqrt root of the sum of the squares of all the entries:
-$$ \|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}. $$
-The Frobenius norm acts as a very nice generalization of the $L^2$ norm for vectors, and is an indispensable tool in both linear algebra and data science. The point is that this "approximation" above actually works in the Frobenius norm, and this reduced singular value decomposition in fact minimizes the error.
-
-> **Theorem** (EckartāYoungāMirsky). Let $A$ be an $n \times p$ matrix of rank $r$. For $k \leq r$,
-> $$ \min_{B \text{ such that rank}(B) \leq k} \|A - B\|_F = \|A - A_k\|_F. $$
-> The (at most) rank $k$ matrix $A_k$ also realizes the minimum when optimizing for the operator norm.
-
-> **Example**. Recall that we have the following SVD:
-> $$ \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & -2 \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{bmatrix} \begin{bmatrix} 5 & 0 & 0 \\ 0 & 3 & 0 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{3\sqrt{2}} & -\frac{2}{3} \\ \frac{1}{\sqrt{2}} & -\frac{1}{3\sqrt{2}} & \frac{2}{3} \\ 0 & \frac{4}{3\sqrt{2}} & \frac{1}{3} \end{bmatrix}^T. $$
-> So if we want a rank-one approximation for the matrix, we'll do the reduced SVD. We have
-> $$ \begin{split} A_1 &= \sigma_1u_1v_1^T \\ &= 5\begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{bmatrix}\begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \end{bmatrix} \\ &= \begin{bmatrix} \frac{5}{2} & \frac{5}{2} & 0 \\ \frac{5}{2} & \frac{5}{2} & 0 \end{bmatrix} \end{split}$$
-> Now let's compute the (square of the) Frobenius norm of the difference $A - A_1$. We have
-> $$ \begin{split} \|A - A_1\|_F^2 &= \left\| \begin{bmatrix} \frac{1}{2} & -\frac{1}{2} & 2 \\ -\frac{1}{2} & \frac{1}{2} & -2 \end{bmatrix}\right\|_F^2 \\ &= 4(\frac{1}{2})^2 + 2(2^2) = 9. \end{split} $$
-> So the Frobenius distance between $A$ and $A_1$ is 3, and we know by Eckart-Young-Mirsky that this is the smallest we can get when looking at the difference between $A$ and a (at most) rank one $2 \times 3$ matrix. As mentioned, the operator norm $\|A - A_1\|$ also minimizes the distance (in operator norm). We know this to be the largest singular value. As $A - A_1$ has SVD
-> $$ \begin{bmatrix} \frac{1}{2} & -\frac{1}{2} & 2 \\ -\frac{1}{2} & \frac{1}{2} & -2 \end{bmatrix} = \begin{bmatrix} -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix}\begin{bmatrix} 3 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} -\frac{1}{3\sqrt{2}} & -\frac{4}{\sqrt{17}} & \frac{1}{3\sqrt{34}} \\ \frac{1}{3\sqrt{2}} & 0 & \frac{1}{3}\sqrt{\frac{17}{2}} \\ -\frac{2\sqrt{2}}{3} & \frac{1}{\sqrt{17}} & \frac{2}{3}\sqrt{\frac{2}{17}} \end{bmatrix}, $$
-> the operator norm is also 3.
-
-Now let's do this in python. We'll set up our matrix as usual, take the SVD, do the truncated construction of $A_1$, and use `numpy.linalg.norm` to look at the norms.
-```python
-import numpy as np
-
-# Create our matrix A
-A = np.array([[3,2,2],[2,3,-2]])
-
-# Take the SVD
-U, S, Vh = np.linalg.svd(A)
-
-# Create our rank-1 approximation
-sigma1 = S[0]
-u1 = U[:, [0]] #shape (2,2)
-v1T = Vh[[0], :] #shape (3,3)
-A1 = sigma1 * (u1 @ v1T)
-
-# Take norms and view errors
-frobenius_error = np.linalg.norm(A - A1, ord="fro") #Frobenius norm
-operator_error = np.linalg.norm(A - A1, ord=2) #operator norm
-```
-Let's see if we get what we expect.
-```python
->>> sigma1
-np.float64(4.999999999999999)
->>> u1
-array([[-0.70710678],
- [-0.70710678]])
->>> v1T
-array([[-7.07106781e-01, -7.07106781e-01, -6.47932334e-17]])
->>> A1
-array([[2.50000000e+00, 2.50000000e+00, 2.29078674e-16],
- [2.50000000e+00, 2.50000000e+00, 2.29078674e-16]])
->>> frobenius_error
-np.float64(3.0)
->>> operator_error
-np.float64(3.0)
-```
-So this numerically confirms the EYM theorem.
-
-## Centering data
-In data science, we rarely apply low-rank approximation to raw values directly, because translation and units can dominate the geometry. Instead, we apply these methods to centered (and often standardized) data so that low-rank structure reflects relationships among features rather than the absolute location or measurement scale. Centering converts the problem from approximating an affine cloud to approximating a linear one, in direct analogy with including an intercept term in linear regression. Therefore, before we can analyze the variance structure, we must ensure our data is centered, i.e., that each feature has a mean of 0. We achieve this by subtracting the mean of each column from every entry in that column.
-Suppose $X$ is our $n \times p$ data matrix, and let
-$$ \mu = \frac{1}{n}\mathbb{1}^T X. $$
-Then
-$$ \hat{X} = X - \mu \mathbb{1} $$
-will be centered data matrix.
-
-> **Example**. Going back to our housing example, the means of the columns are 1770, 164, 3.2, and 547, respectively. So our centered matrix is
-> $$ \hat{X} = \begin{bmatrix} -170 & -16 & -0.2 & -47 \\ 330 & 31 & 0.8 & 103 \\ -220 & -20 & -1.2 & -72 \\ -170 & -16 & -0.2 & -57 \\ 230 & 21 & 0.8 & 73 \end{bmatrix}. $$
-
-Let's do this in python.
-
-```python
-import numpy as np
-import pandas as pd
-
-# First let us make a dictionary incorporating our data.
-# Each entry corresponds to a column (feature of our data)
-data = {
- 'Square ft': [1600, 2100, 1550, 1600, 2000],
- 'Square m': [148, 195, 144, 148, 185],
- 'Bedrooms': [3, 4, 2, 3, 4],
- 'Price': [500, 650, 475, 490, 620]
-}
-
-# Create a pandas DataFrame
-df = pd.DataFrame(data)
-
-# Create out matrix X
-X = df.to_numpy()
-
-# Get our vector of means
-X_means = np.mean(X, axis=0)
-
-# Create our centered matrix
-X_centered = X - X_means
-
-# Get the SVD for X_centered
-U, S, Vh = np.linalg.svd(X_centered)
-```
-This returns the following.
-```python
->>> X_means
-array([1770. , 164. , 3.2, 547. ])
->>> X_centered
-array([[-1.70e+02, -1.60e+01, -2.00e-01, -4.70e+01],
- [ 3.30e+02, 3.10e+01, 8.00e-01, 1.03e+02],
- [-2.20e+02, -2.00e+01, -1.20e+00, -7.20e+01],
- [-1.70e+02, -1.60e+01, -2.00e-01, -5.70e+01],
- [ 2.30e+02, 2.10e+01, 8.00e-01, 7.30e+01]])
-
-```
-
-We will apply the low-rank approximations from the previous sections. First let's see what our SVD looks like, and what the condition number is.
-```python
->>> U
-array([[-0.32486018, -0.81524197, -0.01735449, -0.17188722, 0.4472136 ],
- [ 0.63705869, 0.10707263, -0.3450375 , -0.51345964, 0.4472136 ],
- [-0.42643013, 0.35553416, -0.61058318, 0.34487822, 0.4472136 ],
- [-0.33034709, 0.436448 , 0.61781883, -0.3445052 , 0.4472136 ],
- [ 0.44457871, -0.08381281, 0.35515633, 0.68497384, 0.4472136 ]])
->>> S
-array([5.44828440e+02, 7.61035608e+00, 8.91429037e-01, 2.41987799e-01])
->>> Vh.T
-array([[ 0.95017495, 0.29361033, 0.08182661, 0.06530651],
- [ 0.08827897, 0.06690917, -0.71081981, -0.69459714],
- [ 0.00276797, -0.04366082, 0.69629997, -0.71641638],
- [ 0.29894268, -0.95258064, -0.05662119, 0.00417714]])
->>> np.linalg.cond(X_centered)
-np.float64(2251.4707027583063)
-```
-Now let's approximate our centered matrix $\hat{X}$ by some lower-rank matrices. First, we'll define a function which will give us a rank $k$ truncated SVD.
-```python
-# Defining the truncated svd
-def reduced_svd_matrix_k(U, S, Vh, k):
- Uk = U[:, :k]
- Sk = np.diag(S[:k])
- Vhk = Vh[:k, :]
- return Uk @ Sk @ Vhk
-```
-Now, as $\hat{X}$ has rank 4, we can do a reduced matrix of rank 1,2,3. We will do this in a loop.
-
-> **Remark**. We'll divide the error by the (Frobenius) norm so that we have a relative error. E.g., if two houses are within 10k of each other, they are similarly priced. The magnitude of error being large doesn't say much if our quantities are large.
->
-```python
-for k in [1, 2, 3]:
- # Define our reduced matrix
- Xck = reduced_svd_matrix_k(U, S, Vh, k)
- # Compute the relative error
- rel_err = np.linalg.norm(X_centered - Xck, ord="fro") / np.linalg.norm(X_centered, ord="fro")
- # Print the information
- print(Xck, "\n", f"k={k}: relative Frobenius reconstruction error on centered data = {rel_err:.4f}", "\n")
-```
-And let's see what we get.
-```python
-[[-168.1743765 -15.62476472 -0.48991109 -52.91078079]
- [ 329.79403078 30.64054254 0.96072753 103.7593243 ]
- [-220.7553464 -20.50996365 -0.64308544 -69.45373002]
- [-171.01485494 -15.88866823 -0.49818573 -53.80444804]
- [ 230.15054706 21.38285405 0.67045472 72.40963456]]
- k=1: relative Frobenius reconstruction error on centered data = 0.0141
-
-[[-1.69996018e+02 -1.60398881e+01 -2.19027093e-01 -4.70007022e+01]
- [ 3.30033282e+02 3.06950642e+01 9.25150039e-01 1.02983104e+02]
- [-2.19960913e+02 -2.03289247e+01 -7.61220318e-01 -7.20311670e+01]
- [-1.70039621e+02 -1.56664278e+01 -6.43206200e-01 -5.69684681e+01]
- [ 2.29963269e+02 2.13401763e+01 6.98303572e-01 7.30172337e+01]]
- k=2: relative Frobenius reconstruction error on centered data = 0.0017
-
-[[-1.69997284e+02 -1.60288915e+01 -2.29799059e-01 -4.69998263e+01]
- [ 3.30008114e+02 3.09136956e+01 7.10984571e-01 1.03000519e+02]
- [-2.20005450e+02 -1.99420315e+01 -1.14021052e+00 -7.20003486e+01]
- [-1.69994556e+02 -1.60579058e+01 -2.59724807e-01 -5.69996518e+01]
- [ 2.29989175e+02 2.11151332e+01 9.18749820e-01 7.29993076e+01]]
- k=3: relative Frobenius reconstruction error on centered data = 0.0004
-```
-
-This seems to check out -- it says that one rank (or one feature) should be roughly enough to describe this data. This should make sense because clearly the square meterage, # of bedrooms, and price depend on the square footage.
-
-# Project: Spectral Image Denoising via Truncated SVD
-
-In this project, we will use Truncated Singular Value Decomposition (SVD) to denoise a grayscale image.
-The idea is based on the Eckart-Young-Mirsky theorem: the best low-rank approximation of a matrix (in Frobenius norm) is given by truncating its SVD.
-
-**Outline**.
-1. Convert an image of my sweet, sweet dog, Bella to a grayscale image.
-2. Load the grayscale image.
-3. Add synthetic Gaussian noise to the image.
-4. Treat the image as a matrix and compute its SVD.
-5. Truncate the SVD to keep only the top $k$ singular values.
-6. Reconstruct the image from the truncated SVD.
-7. Compare the original, noisy, and denoised images visually and quantitatively.
-
-### The Setup: Images as Matrices
-
-Suppose we have a digital image of my dog, Bella. For simplicity, let's assume it is a grayscale image. From the perspective of a computer, this image is simply a large $n \times p$ matrix $A$, where $n$ is the height in pixels and $p$ is the width. The entry $A_{ij}$ represents the brightness (intensity) of the pixel at row $i$ and column $j$, typically taking values between 0 (black) and 255 (white) (or 0 and 1 if we normalize). This matrix representation allows us to leverage linear algebra techniques for image manipulation and analysis.
-
-> **Remark**. Color images, by contrast, consist of multiple channels (e.g., RGB), and are therefore naturally represented as collections of matrices. To avoid introducing additional structure unrelated to the core linear algebraic ideas, we will restrict ourselves to grayscale images. That is, we will convert a chosen image into grayscale and apply the SVD directly.
-
-### Experimental Setup
-
-We will perform the following steps.
-1. **Load an preprocess the image**. Convert the image to grayscale to simplify the analysis.
-2. **Add artificial Gaussian noise**. Introduce synthetic Gaussian noise to simulate real-world noise.
-3. **Compute the SVD**. Decompose the noisy matrix into its singular values and vectors.
-4. **Truncating the SVD**. Retain only the top $k$ singular values to create a low-rank approximation.
-5. **Reconstructing the Image**. Use the truncated SVD to reconstruct the denoised image.
-6. **Comparing results**. Visually and quantitatively compare the original, noisy, and denoised images.
-
-### Loading and Preprocessing the Image
-Let's start with this picture of my beautiful dog Bella. Here it is!
-
-
-
-Let's first convert it to grayscale.
-
-```python
-import numpy as np
-import matplotlib.pyplot as plt
-from PIL import Image
-
-# Load image and convert to grayscale
-img = Image.open("bella.jpg").convert("L")
-A = np.array(img, dtype=float)
-
-plt.imshow(A, cmap="gray")
-plt.title("Original Grayscale Image")
-plt.axis("off")
-plt.show()
-```
-
-Here is the result.
-
-
-
-### Adding Noise
-
-Noise is added to the image to simulate real-world conditions. The noise level can be adjusted to see how the denoising algorithm performs under the difference noise conditions. The noisy image matrix $A_{\text{noisy}}$ is created by adding Gaussian noise to the original matrix $A$.
-
-```python
-rng = np.random.default_rng(0)
-
-noise_level = 25
-A_noisy = A + noise_level * rng.standard_normal(A.shape)
-
-plt.imshow(A_noisy, cmap="gray")
-plt.title("Noisy Image")
-plt.axis("off")
-```
-
-This gives the following image.
-
-
-
-
-
-### SVD
-
-Recall that the SVD of $A$ is given by $A = U\Sigma V^T$ where $U$ is an $n \times n$ orthogonal matrix, $V$ is a $p \times p$ orthogonal matrix, and $\Sigma$ is an $n \times p$ diagonal matrix with the singular values on the diagonal, in decreasing order. The left singular vectors correspond to the principal components of the image columns, while the right singular vectors correspond to the principal components of the image rows.
-
-The truncated SVD is given by $A_k = U_k\Sigma_kV_k^T$, where $U_k,V_k, \Sigma_k$ are the truncated versions of $U,V, \Sigma$, respectively. This truncated SVD gives a *best approximation* of our matrix by a lower rank matrix, in terms of the Frobenius norm. Truncating the SVD is equivalent to projecting the image onto the top $k$ principal components.
-
-The larger singular values correspond to the most important features of the image, while the smaller singular values often contain noise. By truncating the smaller singular values, we can remove the noise while preserving the essential information.
-
-```python
-# Take the SVD
-U, S, Vh = np.linalg.svd(A_noisy)
-
-# Define the truncated SVD
-def truncated_svd(U, S, Vh, k):
- return U[:, :k] @ np.diag(S[:k]) @ Vh[:k, :]
-```
-
-If you run the code, you'll see that it takes a bit. This is because computing the SVD of a large image is computationally expensive. There are other methods (e.g., randomized SVD) that exist for scalability.
-
-#### Singular Value Decay
-
-As mentioned, the singular values of an image typically decay rapidly, with the largest singular values capturing most of the important information. The smaller singular values often contain components with noise. We plot the singular values on a log scale, we can determine an appropriate truncation point $k$.
-
-```python
-plt.semilogy(S)
-plt.title("Singular Value Decay")
-plt.xlabel("Index")
-plt.ylabel("Singular value (log scale)")
-plt.show()
-```
-
-
-
-
-### Reconstructing the image
-
-We reconstruct our image precisely from the truncated SVD $A_k = U_k\Sigma_k V_k^T$.
-```python
-import math
-
-# Choose values of $k$
-ks = [5, 20, 50, 100]
-
-n_images = len(ks) # total number of reconstructions
-n_cols = 2 # number of columns in the grid
-n_rows = math.ceil(n_images / n_cols)
-
-# Create the grid of subplots
-fig, axes = plt.subplots(
- n_rows,
- n_cols,
- figsize=(4 * n_cols, 4 * n_rows)
-)
-
-# axes is a 2D grid; flatten it so we can iterate over it with a for loop
-axes = axes.ravel()
-
-# Generate and display reconstructed image for each k
-for ax, k in zip(axes, ks):
- # Reconstruct the rank-k approximation
- A_k = truncated_svd(U, S, Vh, k)
-
- # Display the image
- ax.imshow(A_k, cmap="gray")
-
- # Label each subplot with the truncation rank
- ax.set_title(f"k = {k}")
-
- # Remove axis ticks for a cleaner visualization
- ax.axis("off")
-
-# Hide the extras
-for ax in axes[n_images:]:
- ax.axis("off")
-
-# adjust space to not overlap
-plt.tight_layout()
-
-# show the plot
-plt.show()
-
-
-
-for ax, k in zip(axes, ks):
- # Reconstruct the rank-k approximation
- A_k = truncated_svd(U, S, Vh, k)
-
- # Display the image
- ax.imshow(A_k, cmap="gray")
-
- # Label each subplot with the truncation rank
- ax.set_title(f"k = {k}")
-
- # Remove axis ticks for a cleaner visualization
- ax.axis("off")
-
-
-
-fig, axes = plt.subplots(1, len(ks), figsize=(15,4))
-
-# Generate an image for each value of $k$
-for ax, k in zip(axes, ks):
- A_k = truncated_svd(U, S, Vh, k)
- ax.imshow(A_k, cmap="gray")
- ax.set_title(f"k = {k}")
- ax.axis("off")
-
-plt.show()
-```
-
-
-
-We can see that as $k$ increases, more image detail is recovered, but noise also begins to reappear.
-
-### Quantitative Evaluation
-We can quantify the quality of the denoised image using the **mean squared error (MSE)** and **peak signal-to-noise ratio (PSNR)**:
-$$ \text{MSE} = \frac{1}{np} \sum_{i,j} (A_{ij} - A_k^{ij})^2, \quad \text{PSNR} = 10 \log_{10} \left( \frac{255^2}{\text{MSE}} \right) $$
-MSE quantifies the difference between two images, while higher PSNR values indicate better image quality and less distortion.
-
-Let's compute the MSE and PSNR for $k=5,20,50,100$.
-
-```python
-def mse(A, B):
- return np.mean((A - B) ** 2)
-
-def psnr(A, B, max_val=255.0):
- error = mse(A, B)
- if error == 0:
- return np.inf
- return 10 * np.log10((max_val ** 2) / error)
-
-results = []
-
-for k in ks:
- A_k = truncated_svd(U, S, Vh, k)
- mse_val = mse(A, A_k)
- psnr_val = psnr(A, A_k)
- results.append((k, mse_val, psnr_val))
-
-# Display results
-for k, m, p in results:
- print(f"k = {k:3d} | MSE = {m:10.2f} | PSNR = {p:6.2f} dB")
-
-```
-
-We get
-```python
-k = 5 | MSE = 275.48 | PSNR = 23.73 dB
-k = 20 | MSE = 91.05 | PSNR = 28.54 dB
-k = 50 | MSE = 56.81 | PSNR = 30.59 dB
-k = 100 | MSE = 64.08 | PSNR = 30.06 dB
-
-```
-
-Let's put this into a table.
-| $k$ | MSE | PSNR (dB)| Visual Quality|
-|------|--------|----------|---------------|
-| 5 | 275.48 | 23.73 | Very blurry |
-| 20 | 91.05 | 28.54 | Some detail |
-| 50 | 56.81 | 30.59 | Good balance |
-| 100 | 64.08 | 30.06 | Noise returns |
-
-We can even see further values of MSE and PSNR. Although truncated SVD minimizes the reconstruction error relative to the noisy image, our quantitative evaluation measures error relative to the original clean image. As $k$ increases, the approximation increasingly fits noise-dominated singular components. Consequently, the mean squared error initially decreases as signal structure is recovered, but eventually increases once noise begins to dominate. This behavior reflects the biasāvariance trade off inherent in spectral denoising and explains why the MSE is not monotone in $k$.
-
-| $k$ | MSE | PSNR (dB)| Visual Quality |
-|-------|--------|----------|----------------------------|
-| 200 | 107.84 | 27.80 | more noise |
-| 500 | 229.03 | 24.54 | even more noise |
-| 1000 | 380.20 | 22.33 | even more noise |
-| 3000 | 616.74 | 20.23 | recovering our noisy image |
-
-
-
-Note that the MSE between the original matrix A and the noisy matrix is 624.67.
-```python
->>> mse(A,A_noisy)
-np.float64(624.6700890361011)
-```
-So as the $k$ goes to the maximum it can be (in this case, 3456, as the image is 5184 x 3456), we should expect the MSE to go towards this value -- i.e., as $k$ gets higher, we are just approximating our noisy image better and better.
-# Appendix
-
-## Figures
-
-### Line of best fit
-#### Line of best fit for generated scatter plot
-The first figure is a line of best fit for scattered points. Here is the python code that will produce the image.
-
-```python
-import numpy as np
-import matplotlib.pyplot as plt
-
-# 1. Generate some synthetic data
-# We set a random seed for reproducibility
-np.random.seed(3)
-
-# Create 50 random x values between 0 and 10
-x = np.random.uniform(0, 10, 50)
-
-# Create y values with a linear relationship plus some random noise
-# True relationship: y = 2.5x + 5 + noise
-noise = np.random.normal(0, 2, 50)
-y = 2.5 * x + 5 + noise
-
-# 2. Calculate the line of best fit
-# np.polyfit(x, y, deg) returns the coefficients for the polynomial
-# deg=1 specifies a linear fit (first degree polynomial)
-slope, intercept = np.polyfit(x, y, 1)
-
-# Create a polynomial function from the coefficients
-# This allows us to pass x values directly to get predicted y values
-fit_function = np.poly1d((slope, intercept))
-
-# Generate x values for plotting the line (smoothly across the range)
-x_line = np.linspace(x.min(), x.max(), 100)
-y_line = fit_function(x_line)
-
-# 3. Plot the data and the line of best fit
-plt.figure(figsize=(10, 6))
-
-# Plot the scatter points
-plt.scatter(x, y, color='purple', label='Data Points', alpha=0.7)
-
-# Plot the line of best fit
-plt.plot(x_line, y_line, color='steelblue', linestyle='--', linewidth=2, label='Line of Best Fit')
-
-# Add labels and title
-plt.xlabel('X Axis')
-plt.ylabel('Y Axis')
-plt.title('Scatter Plot with Line of Best Fit')
-
-# Add the equation to the plot
-# The f-string formats the slope and intercept to 2 decimal places
-plt.text(1, 25, f'y = {slope:.2f}x + {intercept:.2f}', fontsize=12, bbox=dict(facecolor='white', alpha=0.8))
-
-# Display legend and grid
-plt.legend()
-plt.grid(True, linestyle=':', alpha=0.6)
-
-# Show the plot
-plt.show()
-```
-
-Alternatively, we can do the following using `matplotlib.pyplot.axline`.
-
-```python
-import numpy as np
-import matplotlib.pyplot as plt
-
-# Generate data (same as above)
-np.random.seed(3)
-x = np.random.uniform(0, 10, 50)
-y = 2.5 * x + 5 + np.random.normal(0, 2, 50)
-
-# Calculate slope and intercept
-slope, intercept = np.polyfit(x, y, 1)
-
-plt.figure(figsize=(10, 6))
-plt.scatter(x, y, color='purple', label='Data Points', alpha=0.7)
-
-# Plot the line using axline
-# xy1=(0, intercept) is the y-intercept point
-# slope=slope defines the steepness
-plt.axline(xy1=(0, intercept), slope=slope, color='steelblue', linestyle='--', linewidth=2, label='Line of Best Fit')
-
-# Add the equation to the plot
-# The f-string formats the slope and intercept to 2 decimal places
-plt.text(1, 25, f'y = {slope:.2f}x + {intercept:.2f}', fontsize=12, bbox=dict(facecolor='white', alpha=0.8))
-
-
-plt.xlabel('X Axis')
-plt.ylabel('Y Axis')
-plt.title('Scatter Plot with Line of Best Fit')
-plt.legend()
-plt.grid(True, linestyle=':', alpha=0.6)
-plt.show()
-```
-
-See
-- https://stackoverflow.com/questions/37234163/how-to-add-a-line-of-best-fit-to-scatter-plot
-- https://www.statology.org/line-of-best-fit-python/
-- https://stackoverflow.com/questions/6148207/linear-regression-with-matplotlib-numpy
-
-### Projection of vector onto subspace
-
-Here is the code to generate the image of the projection of $\text{Proj}(v)$ of a vector $v$ onto a plane in $\mathbb{R}^3$.
-
-```python
-import numpy as np
-import matplotlib.pyplot as plt
-
-# Linear algebra helper functions
-def proj_onto_subspace(A, v):
- """
- Project vector v onto Col(A) where A is (3 x k) with columns spanning the subspace.
- Uses the formula: P = A (A^T A)^(-1) A^T (for full column rank A).
- """
- AtA = A.T @ A
- return A @ np.linalg.solve(AtA, A.T @ v)
-
-def make_plane_grid(a, b, u_range=(-1.5, 1.5), v_range=(-1.5, 1.5), n=15):
- """
- Plane through origin spanned by vectors a and b.
- Returns meshgrid points X,Y,Z for surface plotting.
- """
- uu = np.linspace(*u_range, n)
- vv = np.linspace(*v_range, n)
- U, V = np.meshgrid(uu, vv)
- P = U[..., None] * a + V[..., None] * b # shape (n,n,3)
- return P[..., 0], P[..., 1], P[..., 2]
-
-# Choose a plan and a vector
-# Plane basis vectors (span a 2D subspace in R^3)
-a = np.array([1.0, 0.2, 0.0])
-b = np.array([0.2, 1.0, 0.3])
-# Create the associated matrix
-# 3x2 matrix of full column rank
-# the column space will be a plane
-A = np.column_stack([a, b])
-
-# Vector to project
-v = np.array([0.8, 0.6, 1.2])
-
-# Projection and residual
-p = proj_onto_subspace(A, v)
-r = v - p
-
-# Plot
-fig = plt.figure(figsize=(9, 7))
-# 1 row, 1 column, 1 subplot
-# axis lives in R^3
-ax = fig.add_subplot(111, projection="3d")
-
-# Plane surface
-X, Y, Z = make_plane_grid(a, b)
-# Here is a rectangular grid of points in 3D; draw a surface through them.
-ax.plot_surface(X, Y, Z, alpha=0.25)
-
-origin = np.zeros(3)
-
-# v, p, and residual r
-ax.quiver(*origin, *v, arrow_length_ratio=0.08, linewidth=2)
-ax.quiver(*origin, *p, arrow_length_ratio=0.08, linewidth=2)
-ax.quiver(*p, *r, arrow_length_ratio=0.08, linewidth=2)
-
-# Drop line from v to its projection on the plane
-ax.plot([v[0], p[0]],
- [v[1], p[1]],
- [v[2], p[2]],
- linestyle="--", linewidth=2)
-
-# Points for emphasis
-ax.scatter(*v, s=60)
-ax.scatter(*p, s=60)
-
-# Labels (simple text)
-ax.text(*v, " v")
-ax.text(*p, " Proj(v)")
-
-# Make axes look nice
-ax.set_xlabel("x")
-ax.set_ylabel("y")
-ax.set_zlabel("z")
-ax.set_title("Projection of a vector onto a plane")
-
-# Set symmetric limits so the picture isn't squished
-all_pts = np.vstack([origin, v, p])
-m = np.max(np.abs(all_pts)) * 1.3 + 0.2
-ax.set_xlim(-m, m)
-ax.set_ylim(-m, m)
-ax.set_zlim(-m, m)
-
-# Adjust spacing so labels, titles, and axes donāt overlap or get cut off.
-plt.tight_layout()
-
-plt.show()
-```
-
-### $L^1$ and $L^{\infty}$ unit balls
-
-To generate the matplotlib image of the $L^1$ unit ball, let's use the following code.
-
-```python
-import numpy as np
-import matplotlib.pyplot as plt
-
-# Grid
-xx = np.linspace(-1.2, 1.2, 400)
-yy = np.linspace(-1.2, 1.2, 400)
-X, Y = np.meshgrid(xx, yy)
-
-# Take the $L^1$ norm
-Z = np.abs(X) + np.abs(Y)
-
-plt.figure(figsize=(6,6))
-plt.contour(X, Y, Z, levels=[1])
-plt.contourf(X, Y, Z, levels=[0,1], alpha=0.3)
-
-plt.axhline(0)
-plt.axvline(0)
-plt.gca().set_aspect("equal", adjustable="box")
-plt.title(r"$L^1$ unit ball: $|x|+|y|\leq 1$")
-plt.tight_layout()
-plt.show()
-```
-
-For the $L^{\infty}$ unit ball:
-
-```python
-import numpy as np
-import matplotlib.pyplot as plt
-
-# Grid
-xx = np.linspace(-1.2, 1.2, 400)
-yy = np.linspace(-1.2, 1.2, 400)
-X, Y = np.meshgrid(xx, yy)
-
-# Take the $L^{\infty}$ norm
-Z = np.maximum(np.abs(X), np.abs(Y))
-
-plt.figure(figsize=(6,6))
-plt.contour(X, Y, Z, levels=[1])
-plt.contourf(X, Y, Z, levels=[0,1], alpha=0.3)
-
-plt.axhline(0)
-plt.axvline(0)
-plt.gca().set_aspect("equal", adjustable="box")
-plt.title(r"$L^{\infty}$ unit ball: $\max\{|x|,|y|\} \leq 1$")
-plt.tight_layout()
-plt.show()
-```
-
-### Polynomial of best fit
-
-First let us generate the data and show it in a simple scatter plot.
-```python
-import numpy as np
-import matplotlib.pyplot as plt
-
-# 1) Generate quadratic data
-np.random.seed(3)
-
-n = 50
-x = np.random.uniform(-5, 5, n) # symmetric, wider range
-
-# True relationship: y = ax^2 + c + noise
-a_true = 2.0
-c_true = 5.0
-noise = np.random.normal(0, 3, n)
-
-y = a_true * x**2 + c_true + noise
-```
-
-Now to generate the scatter plot.
-
-```python
-# add the scatter points to the plot
-plt.scatter(x,y)
-
-# plot it
-plt.show()
-```
-
-As for a *line of best fit*, the following will generate the scatter plot vs. the line.
-
-```python
-# find a line of best fit
-a,b = np.polyfit(x, y, 1)
-
-# add scatter points to plot
-plt.scatter(x,y)
-
-# add line of best fit to plot
-plt.plot(x, a*x + b, 'r', linewidth=1)
-
-# plot it
-plt.show()
-```
-
-Now let us do the quadratic of best fit on top of the scatter plot.
-
-```python
-# polynomial fit with degree = 2
-poly = np.polyfit(x,y,2)
-model = np.poly1d(poly)
-
-# add scatter points to plot
-plt.scatter(x,y)
-
-# add the quadratic to the plot
-polyline=np.linspace(x.min(), x.max())
-plt.plot(polyline, model(polyline), 'r', linewidth=1)
-
-# plot it
-plt.show()
-```
-
-# Bibliography
-
-## Mathematics
-
-- Gene H. Golub and Charles F. Van Loan, *Matrix Computations*. John Hopkins University Press, 2013.
-- Mark H. Holmes, *Introduction to scientific computing and data analysis*. Vol. 13. Springer Nature, 2023.
-- David C. Lay, Steven R. Lay, and Judith J. McDonald,Ā *Linear Algebra and Its Applications*, Pearson, 2021. ISBN 013588280X.
-- https://ubcmath.github.io/MATH307/index.html
-- https://eecs16b.org/notes/fa23/note16.pdf
-- https://en.wikipedia.org/wiki/Low-rank_approximation
-- https://www-labs.iro.umontreal.ca/~grabus/courses/ift6760_W20_files/lecture-5.pdf
-- https://www.statology.org/polynomial-regression-python/
-- https://en.wikipedia.org/wiki/Mean_squared_error
-- https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio
-
-## Python
-### Numpy (https://numpy.org/doc/stable/index.html)
-- numpy basics: https://numpy.org/doc/stable/user/absolute_beginners.html
-- numpy.array: https://numpy.org/doc/stable/reference/generated/numpy.array.html
-- numpy.hstack: https://numpy.org/doc/stable/reference/generated/numpy.hstack.html (Stack arrays in sequence horizontally (column wise).)
-- numpy.column_stack: https://numpy.org/doc/stable/reference/generated/numpy.column_stack.html (Stack 1-D arrays as columns into a 2-D array.)
-- numpy.shape: https://numpy.org/doc/stable/reference/generated/numpy.shape.html (Return the shape of an array.)
-- numpy.polyfit: https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html (Least squares polynomial fit.)
-- numpy.mean: https://numpy.org/doc/stable/reference/generated/numpy.mean.html (Compute the arithmetic mean along the specified axis.)
-- numyp.poly1d: https://numpy.org/doc/stable/reference/generated/numpy.poly1d.html (A one-dimensional polynomial class.)
-- numpy.set_printoptions: https://numpy.org/doc/stable/reference/generated/numpy.set_printoptions.html (These options determine the way floating point numbers, arrays and other NumPy objects are displayed.)
-- numpy.finfo: https://numpy.org/doc/stable/reference/generated/numpy.finfo.html (Machine limits for floating point types.)
-- numpy.logspace: https://numpy.org/doc/stable/reference/generated/numpy.logspace.html (Return numbers spaced evenly on a log scale.)
-- numpy.sum: https://numpy.org/doc/stable/reference/generated/numpy.sum.html (Sum of array elements over a given axis.)
-- numpy.abs: https://numpy.org/doc/stable/reference/generated/numpy.absolute.html (Calculate the absolute value element-wise.)
-- numpy.ndarray.T: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.T.html (View of the transposed array.)
-- numpy.ones: https://numpy.org/doc/stable/reference/generated/numpy.ones.html (Return a new array of given shape and type, filled with ones.)
-- numpy.zeros: https://numpy.org/doc/stable/reference/generated/numpy.zeros.html (Return a new array of given shape and type, filled with zeros.)
-- numpy.diag: https://numpy.org/doc/stable/reference/generated/numpy.diag.html (Extract a diagonal or construct a diagonal array.)
-- numpy.cumsum: https://numpy.org/doc/stable/reference/generated/numpy.cumsum.html (Return the cumulative sum of the elements along a given axis.)
-- numpy.meshgrid: https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html (Return a tuple of coordinate matrices from coordinate vectors.)
-- numpy.linspace: https://numpy.org/doc/stable/reference/generated/numpy.linspace.html (Return evenly spaced numbers over a specified interval.)
-- numpy.ravel: https://numpy.org/doc/stable/reference/generated/numpy.ravel.html (Return a contiguous flattened array.)
-- numpy.vstack: https://numpy.org/doc/stable/reference/generated/numpy.vstack.html (Stack arrays in sequence vertically (row wise).)
-
-#### numpy.random (https://numpy.org/doc/stable/reference/random/index.html)
-- numpy.random.seed: https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html (Reseed the singleton RandomState instance.)
-- numpy.random.normal: https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html (Draw random samples from a normal (Gaussian) distribution.)
-- numpy.random.default_rng: https://numpy.org/doc/stable/reference/random/generator.html (Construct a new Generator with the default BitGenerator (PCG64).)
-- numpy.random.uniform: https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html (Draw samples from a uniform distribution.)
-
-#### numpy.linalg (https://numpy.org/doc/stable/reference/routines.linalg.html)
-- numpy.linalg.qr: https://numpy.org/doc/stable/reference/generated/numpy.linalg.qr.html (Compute the qr factorization of a matrix.
-- numpy.linalg.svd: https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html (Singular Value Decomposition.)
-- numpy.linalg.solve: https://numpy.org/doc/stable/reference/generated/numpy.linalg.solve.html (Solve a linear matrix equation, or system of linear scalar equations.)
-- numpy.linalg.lstsq: https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html (Return the least-squares solution to a linear matrix equation.)
-- numpy.linalg.norm: https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html (Matrix or vector norm.)
-- numpy.linalg.pinv: https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html (Compute the (Moore-Penrose) pseudo-inverse of a matrix.)
-- numpy.linalg.cond: https://numpy.org/doc/stable/reference/generated/numpy.linalg.cond.html (Compute the condition number of a matrix.)
-
-### Matplotlib (https://matplotlib.org/stable/users/getting_started/)
-- matplotlib.pyplot: https://matplotlib.org/stable/api/pyplot_summary.html
-- matplotlib.figure: https://matplotlib.org/stable/api/figure_api.html (Implements the following classes: `Figure` and `SubFigure`)
-- mpl_toolkits.mplot3d.axes3d.Axes3D.plot_surface: https://matplotlib.org/stable/api/_as_gen/mpl_toolkits.mplot3d.axes3d.Axes3D.plot_surface.html#mpl_toolkits.mplot3d.axes3d.Axes3D.plot_surface (Create a surface plot.)
-
-#### matplotlib.pyplot
-- matplotlib.pyplot.plot: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html (Plot y versus x as lines and/or markers.)
-- matplotlib.pyplot.quiver: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.quiver.html (Plot a 2D field of arrows.)
-- matplotlib.pyplot.tight_layout: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.tight_layout.html (Adjust the padding between and around subplots.)
-- matplotlib.pyplot.legend: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html (Place a legend on the Axes.)
-- matplotlib.pyplot.show: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html (Display all open figures.)
-- matplotlib.pyplot.xlabel: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xlabel.html (Set the label for the x-axis.)
-- matplotlib.pyplot.ylabel: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ylabel.html (Set the label for the y-axis.)
-- matplotlib.pyplot.title: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.title.html (Set a title for the Axes.)
-- matplotlib.pyplot.scatter: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html (A scatter plot of y vs. x with varying marker size and/or color.)
-- matplotlib.pyplot.imshow: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imshow.html (Display data as an image, i.e., on a 2D regular raster.)
-- matplotlib.pyplot.axis: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axis.html (Convenience method to get or set some axis properties.)
-- matplotlib.pyplot.semilogy: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.semilogy.html (Make a plot with log scaling on the y-axis.)
-- matplotlib.pyplot.subplots: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html (Create a figure and a set of subplots.)
-- matplotlib.pyplot.contour: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contour.html (Plot contour lines.)
-- matplotlib.pyplot.contourf: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contourf.html (Plot filled contours.)
-- matplotlib.pyplot.axhline: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axhline.html (Add a horizontal line spanning the whole or fraction of the Axes.)
-- matplotlib.pyplot.axvline: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axvline.html (Add a vertical line spanning the whole or fraction of the Axes.)
-- matplotlib.pyplot.gca: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.gca.html (Get the current Axes.)
-
-#### matplotlib.figure
-- matplotlib.figure.Figure.add_subplot : https://matplotlib.org/stable/api/_as_gen/matplotlib.figure.Figure.add_subplot.html (Add an `Axes` to the figure as part of a subplot arrangement.)]
-
-#### matplotlib.axes
-- matplotlib.axes.Axes.set_title: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_title.html (Set a title for the Axes.)
-- matplotlib.axes.Axes.imshow: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.imshow.html (Display data as an image, i.e., on a 2D regular raster.)
-- matplotlib.axes.Axes.axis: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.axis.html (Convenience method to get or set some axis properties.)
-- matplotlib.axes.Axes.text: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.text.html (Add text to the Axes.)
-- matplotlib.axes.Axes.set_xlabel: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xlabel.html (Set the label for the x-axis.)
-- matplotlib.axes.Axes.set_ylabel: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_ylabel.html (Set the label for the y-axis.)
-- matplotlib.axes.Axes.set_xlim: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xlim.html (Set the x-axis view limits.)
-- matplotlib.axes.Axes.set_aspect: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_aspect.html (Set the aspect ratio of the Axes scaling, i.e. y/x-scale.)
-
-#### Scatter plots with line of best fit
-- https://stackoverflow.com/questions/37234163/how-to-add-a-line-of-best-fit-to-scatter-plot
-- https://www.statology.org/line-of-best-fit-python/
-- https://stackoverflow.com/questions/6148207/linear-regression-with-matplotlib-numpy
-
-### Pandas
-- pandas basics: https://pandas.pydata.org/docs/user_guide/index.html
-- pandas.DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html (Two-dimensional, size-mutable, potentially heterogeneous tabular data.)
-#### pandas.DataFrame
-- pandas.DataFrame.describe: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html (Generate descriptive statistics.)
-- pandas.DataFrame.corr: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html (Compute pairwise correlation of columns, excluding NA/null values.)
-- pandas.DataFrame.to_numpy: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html (Convert the DataFrame to a NumPy array.)
-- pandas.DataFrame.plot: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html (Make plots of Series or DataFrame.)
-
-### Pillow
-- PIL basics: https://pillow.readthedocs.io/en/stable/
-- PIL.Image: https://pillow.readthedocs.io/en/stable/reference/Image.html
-
-### Math
-- Math basics: https://docs.python.org/3/library/math.html
-- math.ceil: https://docs.python.org/3/library/math.html#math.ceil (Return the ceiling of x, the smallest integer greater than or equal to x)
+- ~~Add regularization (Ridge, LASSO)~~
+- Extend PCA to real datasets
+- Compare SVD vs. autoencoders for compression
+- Add performance benchmarks (QR vs SVD vs normal equations)
+- Add neural networks
+---
# License
This project is licensed under the MIT License.
-See the [`LICENSE`](./LICENSE) file for details.
-
-
+See the [`LICENSE`](./LICENSE) file for details.
\ No newline at end of file
diff --git a/bibliography.md b/bibliography.md
new file mode 100644
index 0000000..23c2291
--- /dev/null
+++ b/bibliography.md
@@ -0,0 +1,152 @@
+# Bibliography
+
+## Mathematics
+
+- Gene H. Golub and Charles F. Van Loan, *Matrix Computations*. John Hopkins University Press, 2013.
+- Mark H. Holmes, *Introduction to scientific computing and data analysis*. Vol. 13. Springer Nature, 2023.
+- David C. Lay, Steven R. Lay, and Judith J. McDonald,Ā *Linear Algebra and Its Applications*, Pearson, 2021. ISBN 013588280X.
+- https://ubcmath.github.io/MATH307/index.html
+- https://eecs16b.org/notes/fa23/note16.pdf
+- https://en.wikipedia.org/wiki/Low-rank_approximation
+- https://www-labs.iro.umontreal.ca/~grabus/courses/ift6760_W20_files/lecture-5.pdf
+- https://www.statology.org/polynomial-regression-python/
+- https://en.wikipedia.org/wiki/Mean_squared_error
+- https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio
+
+## Modelling
+- https://www.geeksforgeeks.org/machine-learning/what-is-ridge-regression/
+- https://www.geeksforgeeks.org/machine-learning/what-is-lasso-regression/
+- https://www.geeksforgeeks.org/machine-learning/principal-component-regression-pcr/
+- https://www.geeksforgeeks.org/machine-learning/gradient-descent-algorithm-and-its-variants/
+- https://www.geeksforgeeks.org/machine-learning/decision-tree/
+- https://www.geeksforgeeks.org/machine-learning/random-forest-algorithm-in-machine-learning/
+
+## Python
+### Numpy (https://numpy.org/doc/stable/index.html)
+- numpy basics: https://numpy.org/doc/stable/user/absolute_beginners.html
+- numpy.array: https://numpy.org/doc/stable/reference/generated/numpy.array.html
+- numpy.hstack: https://numpy.org/doc/stable/reference/generated/numpy.hstack.html (Stack arrays in sequence horizontally (column wise).)
+- numpy.column_stack: https://numpy.org/doc/stable/reference/generated/numpy.column_stack.html (Stack 1-D arrays as columns into a 2-D array.)
+- numpy.shape: https://numpy.org/doc/stable/reference/generated/numpy.shape.html (Return the shape of an array.)
+- numpy.polyfit: https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html (Least squares polynomial fit.)
+- numpy.mean: https://numpy.org/doc/stable/reference/generated/numpy.mean.html (Compute the arithmetic mean along the specified axis.)
+- numyp.poly1d: https://numpy.org/doc/stable/reference/generated/numpy.poly1d.html (A one-dimensional polynomial class.)
+- numpy.set_printoptions: https://numpy.org/doc/stable/reference/generated/numpy.set_printoptions.html (These options determine the way floating point numbers, arrays and other NumPy objects are displayed.)
+- numpy.finfo: https://numpy.org/doc/stable/reference/generated/numpy.finfo.html (Machine limits for floating point types.)
+- numpy.logspace: https://numpy.org/doc/stable/reference/generated/numpy.logspace.html (Return numbers spaced evenly on a log scale.)
+- numpy.sum: https://numpy.org/doc/stable/reference/generated/numpy.sum.html (Sum of array elements over a given axis.)
+- numpy.abs: https://numpy.org/doc/stable/reference/generated/numpy.absolute.html (Calculate the absolute value element-wise.)
+- numpy.ndarray.T: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.T.html (View of the transposed array.)
+- numpy.ones: https://numpy.org/doc/stable/reference/generated/numpy.ones.html (Return a new array of given shape and type, filled with ones.)
+- numpy.zeros: https://numpy.org/doc/stable/reference/generated/numpy.zeros.html (Return a new array of given shape and type, filled with zeros.)
+- numpy.diag: https://numpy.org/doc/stable/reference/generated/numpy.diag.html (Extract a diagonal or construct a diagonal array.)
+- numpy.cumsum: https://numpy.org/doc/stable/reference/generated/numpy.cumsum.html (Return the cumulative sum of the elements along a given axis.)
+- numpy.meshgrid: https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html (Return a tuple of coordinate matrices from coordinate vectors.)
+- numpy.linspace: https://numpy.org/doc/stable/reference/generated/numpy.linspace.html (Return evenly spaced numbers over a specified interval.)
+- numpy.ravel: https://numpy.org/doc/stable/reference/generated/numpy.ravel.html (Return a contiguous flattened array.)
+- numpy.vstack: https://numpy.org/doc/stable/reference/generated/numpy.vstack.html (Stack arrays in sequence vertically (row wise).)
+
+#### numpy.random (https://numpy.org/doc/stable/reference/random/index.html)
+- numpy.random.seed: https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html (Reseed the singleton RandomState instance.)
+- numpy.random.normal: https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html (Draw random samples from a normal (Gaussian) distribution.)
+- numpy.random.default_rng: https://numpy.org/doc/stable/reference/random/generator.html (Construct a new Generator with the default BitGenerator (PCG64).)
+- numpy.random.uniform: https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html (Draw samples from a uniform distribution.)
+
+#### numpy.linalg (https://numpy.org/doc/stable/reference/routines.linalg.html)
+- numpy.linalg.qr: https://numpy.org/doc/stable/reference/generated/numpy.linalg.qr.html (Compute the qr factorization of a matrix.
+- numpy.linalg.svd: https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html (Singular Value Decomposition.)
+- numpy.linalg.solve: https://numpy.org/doc/stable/reference/generated/numpy.linalg.solve.html (Solve a linear matrix equation, or system of linear scalar equations.)
+- numpy.linalg.lstsq: https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html (Return the least-squares solution to a linear matrix equation.)
+- numpy.linalg.norm: https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html (Matrix or vector norm.)
+- numpy.linalg.pinv: https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html (Compute the (Moore-Penrose) pseudo-inverse of a matrix.)
+- numpy.linalg.cond: https://numpy.org/doc/stable/reference/generated/numpy.linalg.cond.html (Compute the condition number of a matrix.)
+
+### Matplotlib (https://matplotlib.org/stable/users/getting_started/)
+- matplotlib.pyplot: https://matplotlib.org/stable/api/pyplot_summary.html
+- matplotlib.figure: https://matplotlib.org/stable/api/figure_api.html (Implements the following classes: `Figure` and `SubFigure`)
+- mpl_toolkits.mplot3d.axes3d.Axes3D.plot_surface: https://matplotlib.org/stable/api/_as_gen/mpl_toolkits.mplot3d.axes3d.Axes3D.plot_surface.html#mpl_toolkits.mplot3d.axes3d.Axes3D.plot_surface (Create a surface plot.)
+- colormaps: https://matplotlib.org/stable/users/explain/colors/colormaps.html
+
+#### matplotlib.pyplot
+- matplotlib.pyplot.plot: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html (Plot y versus x as lines and/or markers.)
+- matplotlib.pyplot.quiver: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.quiver.html (Plot a 2D field of arrows.)
+- matplotlib.pyplot.tight_layout: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.tight_layout.html (Adjust the padding between and around subplots.)
+- matplotlib.pyplot.legend: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html (Place a legend on the Axes.)
+- matplotlib.pyplot.show: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html (Display all open figures.)
+- matplotlib.pyplot.xlabel: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xlabel.html (Set the label for the x-axis.)
+- matplotlib.pyplot.ylabel: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ylabel.html (Set the label for the y-axis.)
+- matplotlib.pyplot.title: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.title.html (Set a title for the Axes.)
+- matplotlib.pyplot.scatter: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html (A scatter plot of y vs. x with varying marker size and/or color.)
+- matplotlib.pyplot.imshow: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imshow.html (Display data as an image, i.e., on a 2D regular raster.)
+- matplotlib.pyplot.axis: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axis.html (Convenience method to get or set some axis properties.)
+- matplotlib.pyplot.semilogy: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.semilogy.html (Make a plot with log scaling on the y-axis.)
+- matplotlib.pyplot.subplots: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html (Create a figure and a set of subplots.)
+- matplotlib.pyplot.contour: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contour.html (Plot contour lines.)
+- matplotlib.pyplot.contourf: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contourf.html (Plot filled contours.)
+- matplotlib.pyplot.axhline: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axhline.html (Add a horizontal line spanning the whole or fraction of the Axes.)
+- matplotlib.pyplot.axvline: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axvline.html (Add a vertical line spanning the whole or fraction of the Axes.)
+- matplotlib.pyplot.gca: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.gca.html (Get the current Axes.)
+
+#### matplotlib.figure
+- matplotlib.figure.Figure.add_subplot : https://matplotlib.org/stable/api/_as_gen/matplotlib.figure.Figure.add_subplot.html (Add an `Axes` to the figure as part of a subplot arrangement.)]
+
+#### matplotlib.axes
+- matplotlib.axes.Axes.set_title: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_title.html (Set a title for the Axes.)
+- matplotlib.axes.Axes.imshow: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.imshow.html (Display data as an image, i.e., on a 2D regular raster.)
+- matplotlib.axes.Axes.axis: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.axis.html (Convenience method to get or set some axis properties.)
+- matplotlib.axes.Axes.text: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.text.html (Add text to the Axes.)
+- matplotlib.axes.Axes.set_xlabel: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xlabel.html (Set the label for the x-axis.)
+- matplotlib.axes.Axes.set_ylabel: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_ylabel.html (Set the label for the y-axis.)
+- matplotlib.axes.Axes.set_xlim: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xlim.html (Set the x-axis view limits.)
+- matplotlib.axes.Axes.set_aspect: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_aspect.html (Set the aspect ratio of the Axes scaling, i.e. y/x-scale.)
+
+#### Scatter plots with line of best fit
+- https://stackoverflow.com/questions/37234163/how-to-add-a-line-of-best-fit-to-scatter-plot
+- https://www.statology.org/line-of-best-fit-python/
+- https://stackoverflow.com/questions/6148207/linear-regression-with-matplotlib-numpy
+
+### Pandas
+- pandas basics: https://pandas.pydata.org/docs/user_guide/index.html
+- pandas.DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html (Two-dimensional, size-mutable, potentially heterogeneous tabular data.)
+#### pandas.DataFrame
+- pandas.DataFrame.describe: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html (Generate descriptive statistics.)
+- pandas.DataFrame.corr: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html (Compute pairwise correlation of columns, excluding NA/null values.)
+- pandas.DataFrame.to_numpy: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html (Convert the DataFrame to a NumPy array.)
+- pandas.DataFrame.plot: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html (Make plots of Series or DataFrame.)
+
+### scikit-learn
+- sklearn basis: https://scikit-learn.org/stable/getting_started.html#fitting-and-predicting-estimator-basics
+- sklearn.datasets.fetch_california_housing: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html (Load the California housing dataset (regression).)
+- sklearn.tree.DecisionTreeRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html (A decision tree regressor.)
+- sklearn.ensemble.RandomForestRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html (A random forest regressor.)
+- sklearn.pipeline.make_pipeline: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html (Construct a Pipeline from the given estimators.)
+- sklearn.decomposition.PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html (Principal component analysis (PCA).)
+
+#### sklearn.preprocessing
+- sklearn.preprocessing.PolynomialFeatures: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html (Generate polynomial and interaction features.)
+- sklearn.preprocessing: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html (Standardize features by removing the mean and scaling to unit variance.)
+
+#### sklearn.linear_model
+- sklearn.linear_model.LinearRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html (Ordinary least squares Linear Regression.)
+- sklearn.linear_model.Ridge: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html (Linear least squares with l2 regularization.)
+- sklearn.linear_model.Lasso: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html (Linear Model trained with L1 prior as regularizer (aka the Lasso).)
+- sklearn.linear_model.LassoCV: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html (Lasso linear model with iterative fitting along a regularization path.)
+- sklearn.linear_model.SGDRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html (Linear model fitted by minimizing a regularized empirical loss with SGD.)
+
+
+#### sklearn.model_selection
+- sklearn.model_selection.train_test_split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html (Split arrays or matrices into random train and test subsets.)
+- sklearn.model_selection.cross_val_score: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html (Evaluate a score by cross-validation.)
+- sklearn.model_selection.KFold: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html (K-Fold cross-validator.)
+
+#### sklearn.metrices
+- sklearn.metrics.mean_squared_error: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html (Mean squared error regression loss.)
+- sklearn.metrics.r2_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html (R^2 (coefficient of determination) regression score function.)
+
+### Pillow
+- PIL basics: https://pillow.readthedocs.io/en/stable/
+- PIL.Image: https://pillow.readthedocs.io/en/stable/reference/Image.html
+
+### Math
+- Math basics: https://docs.python.org/3/library/math.html
+- math.ceil: https://docs.python.org/3/library/math.html#math.ceil (Return the ceiling of x, the smallest integer greater than or equal to x)
diff --git a/figures/bella_reconstruction_error_vs_rank.jpeg b/figures/bella_reconstruction_error_vs_rank.jpeg
deleted file mode 100644
index a0a1987..0000000
Binary files a/figures/bella_reconstruction_error_vs_rank.jpeg and /dev/null differ
diff --git a/figures/bella_singular_value_decay.jpeg b/figures/bella_singular_value_decay.jpeg
deleted file mode 100644
index c7ed7fc..0000000
Binary files a/figures/bella_singular_value_decay.jpeg and /dev/null differ
diff --git a/figures/house_price_vs_sqft_cubic.png b/figures/house_price_vs_sqft_cubic.png
deleted file mode 100644
index e725a04..0000000
Binary files a/figures/house_price_vs_sqft_cubic.png and /dev/null differ
diff --git a/figures/projection_of_vector_onto_plane.png b/figures/projection_of_vector_onto_plane.png
deleted file mode 100644
index 567866f..0000000
Binary files a/figures/projection_of_vector_onto_plane.png and /dev/null differ
diff --git a/figures/quadratic_degree_11_best_fit.png b/figures/quadratic_degree_11_best_fit.png
deleted file mode 100644
index ccb1bc0..0000000
Binary files a/figures/quadratic_degree_11_best_fit.png and /dev/null differ
diff --git a/figures/quadratic_line_of_best_fit.png b/figures/quadratic_line_of_best_fit.png
deleted file mode 100644
index b4b7291..0000000
Binary files a/figures/quadratic_line_of_best_fit.png and /dev/null differ
diff --git a/figures/quadratic_poly_of_best_fit.png b/figures/quadratic_poly_of_best_fit.png
deleted file mode 100644
index e73ca08..0000000
Binary files a/figures/quadratic_poly_of_best_fit.png and /dev/null differ
diff --git a/figures/L1_unit_ball.png b/images/L1_unit_ball.png
similarity index 100%
rename from figures/L1_unit_ball.png
rename to images/L1_unit_ball.png
diff --git a/figures/Linf_unit_ball.png b/images/Linf_unit_ball.png
similarity index 100%
rename from figures/Linf_unit_ball.png
rename to images/Linf_unit_ball.png
diff --git a/images/RF_feature_importance.png b/images/RF_feature_importance.png
new file mode 100644
index 0000000..ca29b78
Binary files /dev/null and b/images/RF_feature_importance.png differ
diff --git a/figures/bedrooms_vs_square_footage.png b/images/bedrooms_vs_square_ft.png
similarity index 99%
rename from figures/bedrooms_vs_square_footage.png
rename to images/bedrooms_vs_square_ft.png
index 667d00e..0678c3d 100644
Binary files a/figures/bedrooms_vs_square_footage.png and b/images/bedrooms_vs_square_ft.png differ
diff --git a/pictures/bella.jpg b/images/bella.jpg
similarity index 100%
rename from pictures/bella.jpg
rename to images/bella.jpg
diff --git a/images/bella_best_truncated.png b/images/bella_best_truncated.png
new file mode 100644
index 0000000..3564f5c
Binary files /dev/null and b/images/bella_best_truncated.png differ
diff --git a/images/bella_noisy.png b/images/bella_noisy.png
new file mode 100644
index 0000000..70d5ba1
Binary files /dev/null and b/images/bella_noisy.png differ
diff --git a/images/bella_truncated_svd_multiple_ks.png b/images/bella_truncated_svd_multiple_ks.png
new file mode 100644
index 0000000..1c70be3
Binary files /dev/null and b/images/bella_truncated_svd_multiple_ks.png differ
diff --git a/images/bella_truncated_svd_multiplie_ks.png b/images/bella_truncated_svd_multiplie_ks.png
new file mode 100644
index 0000000..c45cdef
Binary files /dev/null and b/images/bella_truncated_svd_multiplie_ks.png differ
diff --git a/images/bella_truncated_svd_plain_vs_centered.png b/images/bella_truncated_svd_plain_vs_centered.png
new file mode 100644
index 0000000..0b1e046
Binary files /dev/null and b/images/bella_truncated_svd_plain_vs_centered.png differ
diff --git a/images/california_housing_scatter.png b/images/california_housing_scatter.png
new file mode 100644
index 0000000..05b5325
Binary files /dev/null and b/images/california_housing_scatter.png differ
diff --git a/images/cross_validation_illustration.png b/images/cross_validation_illustration.png
new file mode 100644
index 0000000..a542190
Binary files /dev/null and b/images/cross_validation_illustration.png differ
diff --git a/images/gd_condition_number_effect.png b/images/gd_condition_number_effect.png
new file mode 100644
index 0000000..7e58dcf
Binary files /dev/null and b/images/gd_condition_number_effect.png differ
diff --git a/images/gradient_descent_convergence_scaled.png b/images/gradient_descent_convergence_scaled.png
new file mode 100644
index 0000000..7d3af74
Binary files /dev/null and b/images/gradient_descent_convergence_scaled.png differ
diff --git a/images/gradient_descent_convergence_unscaled.png b/images/gradient_descent_convergence_unscaled.png
new file mode 100644
index 0000000..434d250
Binary files /dev/null and b/images/gradient_descent_convergence_unscaled.png differ
diff --git a/figures/house_price_vs_bedrooms.png b/images/house_price_vs_bedrooms.png
similarity index 99%
rename from figures/house_price_vs_bedrooms.png
rename to images/house_price_vs_bedrooms.png
index 8891d0a..b6b1cea 100644
Binary files a/figures/house_price_vs_bedrooms.png and b/images/house_price_vs_bedrooms.png differ
diff --git a/figures/house_price_vs_square_ft.png b/images/house_price_vs_square_ft.png
similarity index 99%
rename from figures/house_price_vs_square_ft.png
rename to images/house_price_vs_square_ft.png
index 6d60c78..235fb16 100644
Binary files a/figures/house_price_vs_square_ft.png and b/images/house_price_vs_square_ft.png differ
diff --git a/images/lasso_vs_ridge_geometry.png b/images/lasso_vs_ridge_geometry.png
new file mode 100644
index 0000000..ea0d9dd
Binary files /dev/null and b/images/lasso_vs_ridge_geometry.png differ
diff --git a/figures/line_of_best_fit_easy_example.png b/images/line_of_best_fit_easy_example.png
similarity index 99%
rename from figures/line_of_best_fit_easy_example.png
rename to images/line_of_best_fit_easy_example.png
index 267a269..0b53b85 100644
Binary files a/figures/line_of_best_fit_easy_example.png and b/images/line_of_best_fit_easy_example.png differ
diff --git a/figures/line_of_best_fit_generated1.png b/images/line_of_best_fit_generated_1.png
similarity index 100%
rename from figures/line_of_best_fit_generated1.png
rename to images/line_of_best_fit_generated_1.png
diff --git a/images/pcr_components_selection.png b/images/pcr_components_selection.png
new file mode 100644
index 0000000..20eed5b
Binary files /dev/null and b/images/pcr_components_selection.png differ
diff --git a/images/projection_of_vector_onto_plane.png b/images/projection_of_vector_onto_plane.png
new file mode 100644
index 0000000..8b20b84
Binary files /dev/null and b/images/projection_of_vector_onto_plane.png differ
diff --git a/figures/quadratic_data.png b/images/quadratic_data_generated_1.png
similarity index 100%
rename from figures/quadratic_data.png
rename to images/quadratic_data_generated_1.png
diff --git a/images/quadratic_data_line_of_best_fit.png b/images/quadratic_data_line_of_best_fit.png
new file mode 100644
index 0000000..321525e
Binary files /dev/null and b/images/quadratic_data_line_of_best_fit.png differ
diff --git a/images/quadratic_data_quadratic_of_best_fit.png b/images/quadratic_data_quadratic_of_best_fit.png
new file mode 100644
index 0000000..efa7add
Binary files /dev/null and b/images/quadratic_data_quadratic_of_best_fit.png differ
diff --git a/images/ridge_regularization_polynomial_features_scaled.png b/images/ridge_regularization_polynomial_features_scaled.png
new file mode 100644
index 0000000..4f19259
Binary files /dev/null and b/images/ridge_regularization_polynomial_features_scaled.png differ
diff --git a/images/ridge_regularization_polynomial_features_unscaled.png b/images/ridge_regularization_polynomial_features_unscaled.png
new file mode 100644
index 0000000..52cfd1b
Binary files /dev/null and b/images/ridge_regularization_polynomial_features_unscaled.png differ
diff --git a/images/ridge_svd_shrinkage.png b/images/ridge_svd_shrinkage.png
new file mode 100644
index 0000000..f2a6269
Binary files /dev/null and b/images/ridge_svd_shrinkage.png differ
diff --git a/images/train_test_split_illustration.png b/images/train_test_split_illustration.png
new file mode 100644
index 0000000..04c4028
Binary files /dev/null and b/images/train_test_split_illustration.png differ
diff --git a/images/train_validation_test_split.png b/images/train_validation_test_split.png
new file mode 100644
index 0000000..f5711a3
Binary files /dev/null and b/images/train_validation_test_split.png differ
diff --git a/images/underfitting_vs_overfitting.png b/images/underfitting_vs_overfitting.png
new file mode 100644
index 0000000..f214e38
Binary files /dev/null and b/images/underfitting_vs_overfitting.png differ
diff --git a/notebooks/01_least_squares.ipynb b/notebooks/01_least_squares.ipynb
new file mode 100644
index 0000000..7f9126b
--- /dev/null
+++ b/notebooks/01_least_squares.ipynb
@@ -0,0 +1,983 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Least Squares Regression: A Linear Algebra Perspective\n",
+ "\n",
+ "## Introduction\n",
+ "\n",
+ "This is meant to be a not entirely comprehensive introduction to Data Science for the Linear Algebraist. There are of course many other complicated topics, but this is just to get the essence of data science (and the tools involved) from the perspective of someone with a strong linear algebra background.\n",
+ "\n",
+ "One of the most fundamental questions of data science is the following. \n",
+ "\n",
+ "> **Question**: Given observed data, how can we predict certain targets?\n",
+ "\n",
+ "The answer of course boils down to linear algebra, and we will begin by translating data science terms and concepts into linear algebraic ones. But first, as should be common practice for the linear algebraist, an example.\n",
+ "\n",
+ "> **Example**. Suppose that we observe $n=3$ houses, and for each house we record\n",
+ "> - the square footage,\n",
+ "> - the number of bedrooms,\n",
+ "> - and additionally the sale price.\n",
+ "> \n",
+ "> So we have a table as follows.\n",
+ ">\n",
+ "> |House | Square ft | Bedrooms | Price (in $1000s) |\n",
+ "> | --- | --- | --- | --- |\n",
+ "> | 0 | 1600 | 3 | 500 |\n",
+ "> | 1 | 2100 | 4 | 650 |\n",
+ "> | 2 | 1550 | 2 | 475 |\n",
+ ">\n",
+ "> So, for example, the first house is 1600 square feet, has 3 bedrooms, and costs 500,000, and so on. Our goal will be to understand the cost of a house in terms of the number of bedrooms as well as the square footage.\n",
+ "> Concretely this gives us a matrix and a vector:\n",
+ "> $$ X = \\begin{bmatrix} 1600 & 3 \\\\ 2100 & 4 \\\\ 1550 & 2 \\end{bmatrix} \\text{ and } y =\\begin{bmatrix} 500 \\\\ 650 \\\\ 475 \\end{bmatrix} $$\n",
+ "> So translating to linear algebra, the goal is to understand how $y$ depends on the columns of $X$.\n",
+ "\n",
+ "\n",
+ "## Translation from Data Science to Linear Algebra\n",
+ "\n",
+ "| Data Science (DS) Term | Linear Algebra (LA) Equivalent | Explanation |\n",
+ "| --- | --- | --- |\n",
+ "| Dataset (with n observations and p features) | A matrix $X \\in \\mathbb{R}^{n \\times p}$ | The dataset is just a matrix. Each row is an observation (a vector of features). Each column is a feature (a vector of its values across all observations). |\n",
+ "| Features | Columns of $X$ | Each feature is a column in your data matrix. |\n",
+ "| Observation | Rows of $X$ | Each data point corresponds to a row. |\n",
+ "| Targets | A vector $y \\in \\mathbb{R}^{n \\times 1}$ | The list of all target values is a column vector. |\n",
+ "| Model parameters | A vector $\\beta \\in \\mathbb{R}^{p \\times 1}$ | These are the unknown coefficients. |\n",
+ "| Model | Matrixāvector equation | The relationship becomes an equation involving matrices and vectors. |\n",
+ "| Prediction Error / Residuals | A residual vector $e \\in \\mathbb{R}^{n \\times 1}$ | Difference between actual targets and predictions. |\n",
+ "| Training / \"best fit\" | Optimization: minimizing the norm of the residual vector | To find the \"best\" model by finding a model which makes the norm of the residual vector as small as possible. |\n",
+ "\n",
+ "So our matrix $X$ will represent our data set, our vector $y$ is the target, and $\\beta$ is our vector of parameters. We will often be interested in understanding data with \"intercepts\", i.e., when there is a base value given in our data. So we will augment a column of 1's (denoted by $\\mathbb{1}$) to $X$ and append a parameter $\\beta_0$ to the top of $\\beta$, yielding\n",
+ "\n",
+ "$$ \\tilde{X} = \\begin{bmatrix} \\mathbb{1} & X \\end{bmatrix} \\text{ and } \\tilde{\\beta} = \\begin{bmatrix} \\beta_0 \\\\ \\beta_1 \\\\ \\beta_2 \\\\ \\vdots \\\\ \\beta_p \\end{bmatrix}. $$\n",
+ "\n",
+ "So the answer to the Data Science problem becomes\n",
+ "\n",
+ "> **Answer**: Solve, or best approximate a solution to, the matrix equation $\\tilde{X}\\tilde{\\beta} = y$.\n",
+ "\n",
+ "To be explicit, given $\\tilde{X}$ and $y$, we want to find a $\\tilde{\\beta}$ that does a good job of roughly giving $\\tilde{X}\\tilde{\\beta} = y$. There of course ways to solve (or approximate) such small systems by hand. However, one will often be dealing with enormous data sets with plenty to be desired. One view to take is that modern data science is applying numerical linear algebra techniques to imperfect information, all to get as good a solution as possible."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "42453515",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# Solving the problem: Least Squares Regression\n",
+ "\n",
+ "If the system $\\tilde{X}\\tilde{\\beta} = y$ is consistent, then we can find a solution. However, we are often dealing with overdetermined systems, in the sense that there are often more observations than features (i.e., more rows than columns in $\\tilde{X}$, or more equations than unknowns), and therefore inconsistent systems. However, it is possible to find a **best fit** solution, in the sense that the difference\n",
+ "\n",
+ "$$ e = y - \\tilde{X}\\tilde{\\beta} $$\n",
+ "\n",
+ "is small. By small, we often mean that $e$ is small in $L^2$ norm; i.e., we are minimizing the the sums of the squares of the differences between the components of $y$ and the components of $\\tilde{X}\\tilde{\\beta}$. This is known as a **least squares solution**. Assuming that our data points live in the Euclidean plane, this precisely describes finding a line of best fit.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bdee8009",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# 1. Generate some synthetic data\n",
+ "# We set a random seed for reproducibility\n",
+ "np.random.seed(3)\n",
+ "\n",
+ "# Create 50 random x values between 0 and 10\n",
+ "x = np.random.uniform(0, 10, 50)\n",
+ "\n",
+ "# Create y values with a linear relationship plus some random noise\n",
+ "# True relationship: y = 2.5x + 5 + noise\n",
+ "noise = np.random.normal(0, 2, 50)\n",
+ "y = 2.5 * x + 5 + noise\n",
+ "\n",
+ "# 2. Calculate the line of best fit\n",
+ "# np.polyfit(x, y, deg) returns the coefficients for the polynomial\n",
+ "# deg=1 specifies a linear fit (first degree polynomial)\n",
+ "slope, intercept = np.polyfit(x, y, 1)\n",
+ "\n",
+ "# Create a polynomial function from the coefficients\n",
+ "# This allows us to pass x values directly to get predicted y values\n",
+ "fit_function = np.poly1d((slope, intercept))\n",
+ "\n",
+ "# Generate x values for plotting the line (smoothly across the range)\n",
+ "x_line = np.linspace(x.min(), x.max(), 100)\n",
+ "y_line = fit_function(x_line)\n",
+ "\n",
+ "# 3. Plot the data and the line of best fit\n",
+ "plt.figure(figsize=(10, 6))\n",
+ "\n",
+ "# Plot the scatter points\n",
+ "plt.scatter(x, y, color='purple', label='Data Points', alpha=0.7)\n",
+ "\n",
+ "# Plot the line of best fit\n",
+ "plt.plot(x_line, y_line, color='steelblue', linestyle='--', linewidth=2, label='Line of Best Fit')\n",
+ "\n",
+ "# Add labels and title\n",
+ "plt.xlabel('X Axis')\n",
+ "plt.ylabel('Y Axis')\n",
+ "plt.title('Scatter Plot with Line of Best Fit')\n",
+ "\n",
+ "# Add the equation to the plot\n",
+ "# The f-string formats the slope and intercept to 2 decimal places\n",
+ "plt.text(1, 25, f'y = {slope:.2f}x + {intercept:.2f}', fontsize=12, bbox=dict(facecolor='white', alpha=0.8))\n",
+ "\n",
+ "# Display legend and grid\n",
+ "plt.legend()\n",
+ "plt.grid(True, linestyle=':', alpha=0.6)\n",
+ "\n",
+ "# Show the plot\n",
+ "plt.savefig('../images/line_of_best_fit_generated_1.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1c25ccb0",
+ "metadata": {},
+ "source": [
+ "The structure of this sections is as follows.\n",
+ "- [Least Squares Solution](#least-squares-solution)\n",
+ "- [QR Decompositions](#qr-decompositions)\n",
+ "- [Singular Value Decomposition](#singular-value-decomposition)\n",
+ "- [A note on other norms](#a-note-on-other-norms)\n",
+ "- [A note on regularization](#a-note-on-regularization)\n",
+ "- [A note on solving multiple targets concurrently](#a-note-on-solving-multiple-targets-concurrently)\n",
+ "- [Polynomial regression](#polynomial-regression)\n",
+ "- [What can go wrong?](#what-can-go-wrong)\n",
+ "\n",
+ "## Least Squares Solution\n",
+ "\n",
+ "Recall that the Euclidean distance between two vectors $x = (x_1,\\dots,x_n) ,y = (y_1,\\dots,y_n) \\in \\mathbb{R}^n$ is given by\n",
+ "\n",
+ "$$ ||x - y||_2 = \\sqrt{\\sum_{i=1}^n |x_i - y_i|^2}. $$\n",
+ "\n",
+ "We will often work with the square of the $L^2$ norm to simplify things (the square function is increasing, so minimizing the square of a non-negative function will also minimize the function itself).\n",
+ "\n",
+ "> **Definition**: Let $A$ be an $m \\times n$ matrix and $b \\in \\mathbb{R}^n$. A **least-squares solution** of $Ax = b$ is a vector $x_0 \\in \\mathbb{R}^n$ such that\n",
+ "> \n",
+ "> $$ \\|b - Ax_0\\|_2 \\leq \\|b - Ax\\|_2 \\text{ for all } x \\in \\mathbb{R}^n. $$\n",
+ "\n",
+ "So a least-squares solution to the equation $Ax = b$ is trying to find a vector $x_0 \\in \\mathbb{R}^n$ which realizes the smallest distance between the vector $b$ and the column space\n",
+ "$$ \\text{Col}(A) = \\{Ax \\mid x \\in \\mathbb{R}^n\\} $$\n",
+ "of $A$. We know this to be the projection of the vector $b$ onto the column space. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f44a6feb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# Linear algebra helper functions\n",
+ "def proj_onto_subspace(A, v):\n",
+ " \"\"\"\n",
+ " Project vector v onto Col(A) where A is (3 x k) with columns spanning the subspace.\n",
+ " Uses the formula: P = A (A^T A)^(-1) A^T (for full column rank A).\n",
+ " \"\"\"\n",
+ " AtA = A.T @ A\n",
+ " return A @ np.linalg.solve(AtA, A.T @ v)\n",
+ "\n",
+ "def make_plane_grid(a, b, u_range=(-1.5, 1.5), v_range=(-1.5, 1.5), n=15):\n",
+ " \"\"\"\n",
+ " Plane through origin spanned by vectors a and b.\n",
+ " Returns meshgrid points X,Y,Z for surface plotting.\n",
+ " \"\"\"\n",
+ " uu = np.linspace(*u_range, n)\n",
+ " vv = np.linspace(*v_range, n)\n",
+ " U, V = np.meshgrid(uu, vv)\n",
+ " P = U[..., None] * a + V[..., None] * b # shape (n,n,3)\n",
+ " return P[..., 0], P[..., 1], P[..., 2]\n",
+ "\n",
+ "# Choose a plan and a vector\n",
+ "# Plane basis vectors (span a 2D subspace in R^3)\n",
+ "a = np.array([1.0, 0.2, 0.0])\n",
+ "b = np.array([0.2, 1.0, 0.3])\n",
+ "# Create the associated matrix\n",
+ "# 3x2 matrix of full column rank\n",
+ "# the column space will be a plane\n",
+ "A = np.column_stack([a, b]) \n",
+ "\n",
+ "# Vector to project\n",
+ "v = np.array([0.8, 0.6, 1.2])\n",
+ "\n",
+ "# Projection and residual\n",
+ "p = proj_onto_subspace(A, v)\n",
+ "r = v - p\n",
+ "\n",
+ "# Plot\n",
+ "fig = plt.figure(figsize=(9, 7))\n",
+ "# 1 row, 1 column, 1 subplot\n",
+ "# axis lives in R^3\n",
+ "ax = fig.add_subplot(111, projection=\"3d\")\n",
+ "\n",
+ "# Plane surface\n",
+ "X, Y, Z = make_plane_grid(a, b)\n",
+ "# Here is a rectangular grid of points in 3D; draw a surface through them.\n",
+ "ax.plot_surface(X, Y, Z, alpha=0.25)\n",
+ "\n",
+ "origin = np.zeros(3)\n",
+ "\n",
+ "# v, p, and residual r\n",
+ "ax.quiver(*origin, *v, arrow_length_ratio=0.08, linewidth=2)\n",
+ "ax.quiver(*origin, *p, arrow_length_ratio=0.08, linewidth=2)\n",
+ "ax.quiver(*p, *r, arrow_length_ratio=0.08, linewidth=2)\n",
+ "\n",
+ "# Drop line from v to its projection on the plane\n",
+ "ax.plot([v[0], p[0]],\n",
+ "\t\t[v[1], p[1]],\n",
+ "\t\t[v[2], p[2]],\n",
+ "\t\tlinestyle=\"--\", linewidth=2)\n",
+ "\n",
+ "# Points for emphasis\n",
+ "ax.scatter(*v, s=60)\n",
+ "ax.scatter(*p, s=60)\n",
+ "\n",
+ "# Labels (simple text)\n",
+ "ax.text(*v, \" v\")\n",
+ "ax.text(*p, \" Proj(v)\")\n",
+ "\n",
+ "# Make axes look nice\n",
+ "ax.set_xlabel(\"x\")\n",
+ "ax.set_ylabel(\"y\")\n",
+ "ax.set_zlabel(\"z\")\n",
+ "ax.set_title(\"Projection of a vector onto a plane\")\n",
+ "\n",
+ "# Set symmetric limits so the picture isn't squished\n",
+ "all_pts = np.vstack([origin, v, p])\n",
+ "m = np.max(np.abs(all_pts)) * 1.3 + 0.2\n",
+ "ax.set_xlim(-m, m)\n",
+ "ax.set_ylim(-m, m)\n",
+ "ax.set_zlim(-m, m)\n",
+ "\n",
+ "# Adjust spacing so labels, titles, and axes donāt overlap or get cut off.\n",
+ "plt.tight_layout()\n",
+ "plt.savefig('../images/projection_of_vector_onto_plane.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c8a1fe20",
+ "metadata": {},
+ "source": [
+ "> **Theorem**: The set of least-squares solutions of $Ax = b$ coincides with solutions of the **normal equations** $A^TAx = A^Tb$. Moreover, the normal equations always have a solution.\n",
+ "\n",
+ "Let us first see why we get a line of best fit. \n",
+ "\n",
+ "> **Example**. Let us show why this describes a line of best fit when we are working with one feature and one target. Suppose that we observe four data points\n",
+ "> $$ X = \\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\\\ 4 \\end{bmatrix} \\text{ and } y = \\begin{bmatrix} 1 \\\\ 2\\\\ 2 \\\\ 4 \\end{bmatrix}. $$\n",
+ "> We want to fit a line $y = \\beta_0 + \\beta_1x$ to these data points. We will have our augmented matrix be\n",
+ "> $$ \\tilde{X} = \\begin{bmatrix} 1 & 1 \\\\ 1 & 2 \\\\ 1 & 3 \\\\ 1 & 4 \\end{bmatrix}, $$\n",
+ "> and our parameter be\n",
+ "> $$ \\tilde{\\beta} = \\begin{bmatrix} \\beta_0 \\\\ \\beta_1 \\end{bmatrix}. $$\n",
+ "> We have that\n",
+ "> $$ \\tilde{X}^T\\tilde{X} = \\begin{bmatrix} 4 & 10 \\\\ 10 & 30 \\end{bmatrix} \\text{ and } \\tilde{X}^Ty = \\begin{bmatrix} 9 \\\\ 27 \\end{bmatrix}. $$\n",
+ "> The 2x2 matrix $\\tilde{X}^T\\tilde{X}$ is easy to invert, and so we get that\n",
+ "> $$ \\tilde{\\beta} = (\\tilde{X}^T\\tilde{X})^{-1}\\tilde{X}^Ty = \\frac{1}{10}\\begin{bmatrix} 15 & -5 \\\\ -5 & 2 \\end{bmatrix}\\begin{bmatrix} 9 \\\\ 27 \\end{bmatrix} = \\begin{bmatrix} 0 \\\\ \\frac{9}{10} \\end{bmatrix}. $$\n",
+ "> So our line of best fit is of them form $y = \\frac{9}{10}x$.\n",
+ "\n",
+ "Although the above system was small and we could solve the system of equations explicitly, this isn't always feasible. We will generally use python in order to solve large systems. \n",
+ "- One can find a least-squares solution using `numpy.linalg.lstsq`.\n",
+ "- We can set up the normal equations and solve the system by using `numpy.linalg.solve`\n",
+ "Although the first approach simplifies things greatly, and is more or less what we are doing anyway, we will generally set up our problems as we would by hand, and then use `numpy.linalg.solve` to help us find a solution. However, computing $X^TX$ can cause lots of errors, so later we'll see how to get linear systems from QR decompositions and the SVD, and then apply `numpy.lingalg.solve`. \n",
+ "\n",
+ "Let's see how to use these for the above example, and see the code to generate the scatter plot and line of best fit. \n",
+ "Again, our system is the following.\n",
+ "$$ X = \\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\\\ 4 \\end{bmatrix} \\text{ and } y = \\begin{bmatrix} 1 \\\\ 2\\\\ 2 \\\\ 4 \\end{bmatrix}. $$\n",
+ "We will do what we did above, but use python instead.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Define the matrix X and vector y\n",
+ "X = np.array([[1], [2], [3], [4]])\n",
+ "y = np.array([[1], [2], [2], [4]])\n",
+ "\n",
+ "# Augment X with a column of 1's (intercept)\n",
+ "X_aug = np.hstack((np.ones((X.shape[0], 1)), X))\n",
+ "\n",
+ "# Solve the normal equations\n",
+ "beta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "And what is the result?\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1c42a900",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "beta"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "This agrees with our by-hand computation: the intercept is tiny, so it is virtually zero, and we get 9/10 as our slope. Let's plot it. \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "b, m = beta #beta[0] will be the intercept and beta[1] will be the slope\n",
+ "_ = plt.plot(X, y, 'o', label='Original data', markersize=10)\n",
+ "_ = plt.plot(X, m*X + b, 'r', label='Line of best fit')\n",
+ "_ = plt.legend()\n",
+ "plt.savefig('../images/line_of_best_fit_easy_example.png')\n",
+ "plt.show()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "What about `numpy.linalg.lstsq`? Is it any different?\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Define the matrix X and vector y\n",
+ "X = np.array([[1], [2], [3], [4]])\n",
+ "y = np.array([[1], [2], [2], [4]])\n",
+ "\n",
+ "# Augment X with a column of 1's (intercept)\n",
+ "X_aug = np.hstack((np.ones((X.shape[0], 1)), X))\n",
+ "\n",
+ "# Solve the least squares equation with matrix X_aug and target y\n",
+ "beta = np.linalg.lstsq(X_aug,y)[0]\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "We then get\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "35e8c88d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "beta"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "So it is a little different -- and, in fact, closer to our exact answer (the intercept is zero). This makes sense -- `numpy.linalg.lstsq` won't directly compute $X^TX$, which, again, can cause quite a few issues. \n",
+ "\n",
+ "---\n",
+ "\n",
+ "Now going to our initial example. \n",
+ "\n",
+ "> **Example**: Let us work with the example from above. We augment the matrix with a column of 1's to include an intercept term:\n",
+ "> $$ \\tilde{X} = \\begin{bmatrix} 1 & 1600 & 3 \\\\ 1 & 2100 & 4 \\\\ 1 & 1550 & 2 \\end{bmatrix}. $$\n",
+ "> Let us solve the normal equations\n",
+ "> $$ \\tilde{X}^T\\tilde{X}\\tilde{\\beta} = \\tilde{X}^Ty. $$\n",
+ "> We have\n",
+ "> $$ \\tilde{X}^T\\tilde{X} = \\begin{bmatrix} 3 & 5250 & 9 \\\\ 5250 & 9372500 & 16300 \\\\ 9 & 16300 & 29\\end{bmatrix} \\text{ and } \\tilde{X}^Ty = \\begin{bmatrix} 1625 \\\\ 2901500 \\\\ 5050 \\end{bmatrix} $$\n",
+ "> Solving this system of equations yields the parameter vector $\\tilde{\\beta}$. In this case, we have\n",
+ "> $$ \\tilde{\\beta} = \\begin{bmatrix} \\frac{200}{9} \\\\ \\frac{5}{18} \\\\ \\frac{100}{9} \\end{bmatrix}. $$\n",
+ "> When we apply $\\tilde{X}$ to $\\tilde{\\beta}$, we get\n",
+ "> $$ \\tilde{X}\\tilde{\\beta} = \\begin{bmatrix} 500 \\\\ 650 \\\\ 475 \\end{bmatrix}, $$\n",
+ "> which is our target on the nose. This means that we can expect, based on our data, that the cost of a house will be\n",
+ "> $$ \\frac{200}{9} + \\frac{5}{18}(\\text{square footage}) + \\frac{100}{9}(\\text{\\# of bedrooms})$$\n",
+ "\n",
+ "In the above, we actually had a consistent system to begin with, so our least-squares solution gave our prediction honestly. What happens if we have an inconsistent system?\n",
+ "\n",
+ "> **Example**: Let us add two more observations, say our data is now the following. \n",
+ "> |House | Square ft | Bedrooms | Price (in $1000s) |\n",
+ "> | --- | --- | --- | --- |\n",
+ "> | 0 | 1600 | 3 | 500 |\n",
+ "> | 1 | 2100 | 4 | 650 |\n",
+ "> | 2 | 1550 | 2 | 475 |\n",
+ "> | 3 | 1600 | 3 | 490 |\n",
+ "> | 4 | 2000 | 4 | 620 |\n",
+ "> \n",
+ "> So setting up our system, we want a least-square solution to the matrix equation\n",
+ "> $$ \\begin{bmatrix} 1 & 1600 & 3 \\\\ 1 & 2100 & 4 \\\\ 1 & 1550 & 2 \\\\ 1 & 1600 & 3 \\\\ 1 & 2000 & 4 \\end{bmatrix}\\tilde{\\beta} = \\begin{bmatrix} 500 \\\\ 650 \\\\ 475 \\\\ 490 \\\\ 620 \\end{bmatrix}. $$\n",
+ "> Note that the system is inconsistent (the 1st and 4th rows agree in $\\tilde{X}$, but they have different costs). Writing the normal equations we have\n",
+ "> $$ \\tilde{X}^T\\tilde{X} = \\begin{bmatrix} 5 & 8850 & 16 \\\\ 8850 & 15932500 & 29100 \\\\ 16 & 29100 & 54 \\end{bmatrix} \\text{ and } \\tilde{X}y = \\begin{bmatrix} 2735 \\\\ 4 925 250 \\\\ 9000 \\end{bmatrix}. $$\n",
+ "> Solving this linear system yields\n",
+ "> $$ \\tilde{\\beta} = \\begin{bmatrix} 0 \\\\ \\frac{3}{10} \\\\ 5 \\end{bmatrix}. $$\n",
+ "> This is a vastly different answer! Applying $\\tilde{X}$ to it yields\n",
+ "> $$ \\tilde{X}\\tilde{\\beta} = \\begin{bmatrix} 495 \\\\ 650 \\\\ 475 \\\\ 495 \\\\ 620 \\end{bmatrix}. $$\n",
+ "> Note that the error here is\n",
+ "> $$ y - \\tilde{X}\\tilde{\\beta} = \\begin{bmatrix} 5 \\\\ 0 \\\\ 0 \\\\ -5 \\\\ 0 \\end{bmatrix}, $$\n",
+ "> which has squared $L^2$ norm\n",
+ "> $$ \\|y - \\tilde{X}\\tilde{\\beta}\\|_2^2 = 25 + 25 = 50. $$\n",
+ "> So this says that, given our data, we can roughly estimate the cost of a house, within 50k or so, to be\n",
+ "> $$ \\approx \\frac{3}{10}(\\text{square footage}) + 5(\\text{\\# of bedrooms}). $$\n",
+ "In practice, our data sets can be gigantic, and so there is absolutely no hope of doing computations by hand. It is nice to know that theoretically we can do things like this though. \n",
+ "\n",
+ "> **Theorem**: Let $A$ be an $m \\times n$ matrix and $b \\in \\mathbb{R}^n$. The following are equivalent.\n",
+ "> \n",
+ "> 1. The equation $Ax = b$ has a unique least-squares solution for each $b \\in \\mathbb{R}^n$.\n",
+ "> 2. The columns of $A$ are linearly independent. \n",
+ "> 3. The matrix $A^TA$ is invertible.\n",
+ "\n",
+ "In this case, the unique solution to the normal equations $A^TAx = A^Tb$ is\n",
+ "\n",
+ "$$ x_0 = (A^TA)^{-1}A^Tb. $$\n",
+ "\n",
+ "Computing $\\tilde{X}^T\\tilde{X}$ or taking inverses are very computationally intensive tasks, and it is best to avoid doing these. Moreover, as we'll see in an example later, if we do a numerical calculation we can get close to zero and then divide where we shouldn't be, blowing up our final result. One way to get around this is to use QR decompositions of matrices. \n",
+ "\n",
+ "Now let's use python to visualize the above data and then solve for the least-squares solution. We'll use `pandas` in order to think about this data. We note that `pandas` incorporates `matplotlib` under the hood already, so there are some simplifications that can be made.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# First let us make a dictionary incorporating our data.\n",
+ "# Each entry corresponds to a column (feature of our data)\n",
+ "data = {\n",
+ "\t'Square ft': [1600, 2100, 1550, 1600, 2000],\n",
+ "\t'Bedrooms': [3, 4, 2, 3, 4],\n",
+ "\t'Price': [500, 650, 475, 490, 620]\n",
+ "}\n",
+ "\n",
+ "# Create a pandas DataFrame\n",
+ "df = pd.DataFrame(data)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Let's see how python formats this `DataFrame`. It will turn it into essentially the table we had at the beginning. \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9dd3046d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "So what can we do with DataFrames? First let's use `pandas.DataFrame.describe` to see some basic statistics about our data.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a28d2558",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df.describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "This gives use the mean, the standard deviation, the min, the max, as well as some other things. We get an immediate sense of scale from our data. We can also examine the pairwise correlation of all the columns by using `pandas.DataFrame.corr`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7850890b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df[[\"Square ft\", \"Bedrooms\", \"Price\"]].corr()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "It is clear that each of the three are correlated. This makes sense, as the number of bedrooms should be increasing with the square feet. Same with the price. We'll discuss in the next section when we look at Principal Component Analysis. \n",
+ "\n",
+ "We can also graph our data; for example, we could create some scatter plots, one for `Square ft` vs `Price` and on for `Bedrooms` vs `Price`. We can also do a grouped bar chart. Let's start with the scatter plots. \n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Scatter plot for Price vs Square ft\n",
+ "df.plot(\n",
+ "\tkind=\"scatter\",\n",
+ "\tx=\"Square ft\",\n",
+ "\ty=\"Price\",\n",
+ "\ttitle=\"House Price vs Square footage\"\n",
+ ")\n",
+ "plt.savefig('../images/house_price_vs_square_ft.png')\n",
+ "plt.show()\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Scatter plot for Price vs Bedrooms\n",
+ "df.plot(\n",
+ "\tkind=\"scatter\",\n",
+ "\tx=\"Bedrooms\",\n",
+ "\ty=\"Price\",\n",
+ "\ttitle=\"House Price vs Bedrooms\"\n",
+ ")\n",
+ "plt.savefig('../images/house_price_vs_bedrooms.png')\n",
+ "plt.show()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can even do square footage vs bedrooms. \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Scatter plot for Bedrooms vs Square ft\n",
+ "df.plot(\n",
+ "\tkind=\"scatter\",\n",
+ "\tx=\"Square ft\",\n",
+ "\ty=\"Bedrooms\",\n",
+ "\ttitle=\"Bedrooms vs Square footage\"\n",
+ ")\n",
+ "plt.savefig('../images/bedrooms_vs_square_ft.png')\n",
+ "plt.show()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Of course, these figures are somewhat meaningless due to how unpopulated our data is.\n",
+ "\n",
+ "Now let's get our matrices and linear systems set up with `pandas.DataFrame.to_numpy`.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create our matrix X and our target y\n",
+ "X = df[[\"Square ft\", \"Bedrooms\"]].to_numpy()\n",
+ "y = df[[\"Price\"]].to_numpy()\n",
+ "\n",
+ "# Augment X with a column of 1's (intercept)\n",
+ "X_aug = np.hstack((np.ones((X.shape[0], 1)), X))\n",
+ "\n",
+ "# Solve the least-squares problem\n",
+ "beta = np.linalg.lstsq(X_aug,y)[0]\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "This yields\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0d08c091",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "beta"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "As the first parameter is basically 0, we are left with the second being 3/10 and the third being 5, just like our exact solution. Next, we will look at matrix decompositions and how they can help us find least-squares solutions. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a1f82fae",
+ "metadata": {},
+ "source": [
+ "# Polynomial Regression\n",
+ "\n",
+ "Sometimes fitting a line to a set of $n$ data points clearly isn't the right thing to do. To emphasize the limitations of linear models, we generate data from a purely quadratic relationship. In this setting, the space of linear functions is not rich enough to capture the underlying structure, and the linear least-squares solution exhibits systematic error. Expanding the feature space to include quadratic terms resolves this issue.\n",
+ "\n",
+ "For example, suppose our data looked like the following. \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "52a5c824",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "## Generate data\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# 1) Generate quadratic data\n",
+ "np.random.seed(3)\n",
+ "\n",
+ "n = 50\n",
+ "x = np.random.uniform(-5, 5, n) # symmetric, wider range\n",
+ "\n",
+ "# True relationship: y = ax^2 + c + noise\n",
+ "a_true = 2.0\n",
+ "c_true = 5.0\n",
+ "noise = np.random.normal(0, 3, n)\n",
+ "\n",
+ "y = a_true * x**2 + c_true + noise\n",
+ "\n",
+ "## Generate scatter plot\n",
+ "plt.scatter(x,y)\n",
+ "\n",
+ "# plot it\n",
+ "plt.savefig('../images/quadratic_data_generated_1.png')\n",
+ "plt.show()\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1221e35e",
+ "metadata": {},
+ "source": [
+ "\n",
+ "If we try to find a line of best fit, we get something that doesn't really describe or approximate our data at all...\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bb6cd90d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# find a line of best fit\n",
+ "a,b = np.polyfit(x, y, 1)\n",
+ "\n",
+ "# add scatter points to plot\n",
+ "plt.scatter(x,y)\n",
+ "\n",
+ "# add line of best fit to plot\n",
+ "plt.plot(x, a*x + b, 'r', linewidth=1)\n",
+ "\n",
+ "# plot it\n",
+ "plt.savefig('../images/quadratic_data_line_of_best_fit.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a2d5bb71",
+ "metadata": {},
+ "source": [
+ "\n",
+ "This is an example of **underfitting** data, and we can do better. The same linear regression ideas work for fitting a degree $d$ polynomial model to a set of $n$ data points. Before, when trying to fit a line to points $(x_1,y_1),\\dots,(x_n,y_n)$, we had the following matrices\n",
+ "$$ \\tilde{X} = \\begin{bmatrix} 1 & x_1 \\\\ \\vdots & \\vdots \\\\ 1 & x_n \\end{bmatrix}, y = \\begin{bmatrix} y_1 \\\\ \\vdots \\\\ y_n \\end{bmatrix}, \\tilde{\\beta} = \\begin{bmatrix} \\beta_0 \\\\ \\beta_1 \\end{bmatrix} $$\n",
+ "in the matrix equation\n",
+ "$$ \\tilde{X}\\tilde{\\beta} = y, $$\n",
+ "and we were trying to find a vector $\\tilde{\\beta}$ which gave a best possible solution. This would give us a line $y = \\beta_0 + \\beta_1x$ which best approximates the data. To fit a polynomial $y = \\beta_0 + \\beta_1x + \\beta_2x^2 + \\cdots + \\beta_d^dx^d$ to the data, we have a similar set up.\n",
+ "\n",
+ "> **Definition**. The **Vandermonde matrix** is the $n \\times (d+1)$ matrix\n",
+ "> $$ V = \\begin{bmatrix} 1 & x_1 & x_1^2 & \\cdots & x_1^d \\\\ 1 & x_2 & x_2^2 & \\cdots & x_2^d \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ 1 & x_n & x_n^2 & \\cdots & x_n^d \\end{bmatrix}. $$\n",
+ "\n",
+ "With the Vandermonde matrix, to find a polynomial function of best fit, one just needs to find a least-squares solution to the matrix equation\n",
+ "$$ V\\tilde{\\beta} = y. $$\n",
+ "\n",
+ "With the generated data above, we get the following curve. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8c569b38",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# polynomial fit with degree = 2\n",
+ "poly = np.polyfit(x,y,2)\n",
+ "model = np.poly1d(poly)\n",
+ "\n",
+ "# add scatter points to plot\n",
+ "plt.scatter(x,y)\n",
+ "\n",
+ "# add the quadratic to the plot\n",
+ "polyline=np.linspace(x.min(), x.max())\n",
+ "plt.plot(polyline, model(polyline), 'r', linewidth=1)\n",
+ "\n",
+ "\n",
+ "# plot it\n",
+ "plt.savefig('../images/quadratic_data_quadratic_of_best_fit')\n",
+ "plt.show()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b567fe95",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Solving these problems can be done with python. One can use `numpy.polyfit` and `numpy.poly1d`. \n",
+ "\n",
+ "> **Example**. Consider the following data.\n",
+ "> |House | Square ft | Bedrooms | Price (in $1000s) |\n",
+ "> | --- | --- | --- | --- |\n",
+ "> | 0 | 1600 | 3 | 500 |\n",
+ "> | 1 | 2100 | 4 | 650 |\n",
+ "> | 2 | 1550 | 2 | 475 |\n",
+ "> | 3 | 1600 | 3 | 490 |\n",
+ "> | 4 | 2000 | 4 | 620 |\n",
+ ">\n",
+ "> Suppose we wanted to predict the price of a house based on the square footage and we thought the relationship was cubic (it clearly isn't, but hey, for the sake of argument). So really we are looking at the subset of data\n",
+ "> |House | Square ft | Price (in $1000s) |\n",
+ "> | --- | --- | --- |\n",
+ "> | 0 | 1600 | 500 |\n",
+ "> | 1 | 2100 | 650 |\n",
+ "> | 2 | 1550 | 475 |\n",
+ "> | 3 | 1600 | 490 |\n",
+ "> | 4 | 2000 | 620 |\n",
+ ">\n",
+ "> Our Vandermonde matrix will be\n",
+ "> $$ V = \\begin{bmatrix} 1 & 1600 & 1600^2 & 1600^3 \\\\ 1 & 2100 & 2100^2 & 2100^3 \\\\ 1 & 1550 & 1550^2 & 1550^3 \\\\ 1 & 1600 & 1600^2 & 1600^3 \\\\ 1 & 2000 & 2000^2 & 2000^3 \\end{bmatrix} $$\n",
+ "> and our target vector will be\n",
+ "> $$ y = \\begin{bmatrix} 500 \\\\ 650 \\\\ 475 \\\\ 490 \\\\ 620 \\end{bmatrix}. $$\n",
+ "> As we can see, the entries of the Vandermonde matrix get very very large very fast. One can, if they are so inclined, compute a least-squares solution to $V\\tilde{\\beta} = y$ by hand. Let's not, but let us find, using python, a \"best\" cubic approximation of the relationship between the square footage and price.\n",
+ "\n",
+ "We will use `numpy.polyfit`, `numpy.pold1d` and `numpy.linspace`.\n",
+ "\n",
+ "Let's get a cubic of best fit.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9bdf7560",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# First let us make a dictionary incorporating our data.\n",
+ "# Each entry corresponds to a column (feature of our data)\n",
+ "data = {\n",
+ " 'Square ft': [1600, 2100, 1550, 1600, 2000],\n",
+ " 'Bedrooms': [3, 4, 2, 3, 4],\n",
+ " 'Price': [500, 650, 475, 490, 620]\n",
+ "}\n",
+ "\n",
+ "# Create a pandas DataFrame\n",
+ "df = pd.DataFrame(data)\n",
+ "\n",
+ "# Extract x (square footage) and y (price)\n",
+ "x = df[\"Square ft\"].to_numpy(dtype=float)\n",
+ "y = df[\"Price\"].to_numpy(dtype=float)\n",
+ "\n",
+ "# Degree of polynomial\n",
+ "degree = 3 # cubic\n",
+ "\n",
+ "# Polyfit directly on x\n",
+ "cubic = np.poly1d(np.polyfit(x,y, degree))\n",
+ "\n",
+ "# Add fitted polynomial line and scatter plot\n",
+ "polyline = np.linspace(x.min(),x.max())\n",
+ "plt.scatter(x,y, label=\"Observed data\")\n",
+ "plt.plot(polyline, cubic(polyline), 'r', label=\"Cubic best fit\")\n",
+ "plt.xlabel(\"Square ft\")\n",
+ "plt.ylabel(\"Price (in $1000s)\")\n",
+ "plt.title(\"Cubic polynomial regression: Price vs Square Footage\")\n",
+ "plt.show()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "67efb4de",
+ "metadata": {},
+ "source": [
+ "Here `numpy.polyfit` computes the least-squares solution in the polynomial basis $1, x, x^2, x^3$, i.e., it solves the Vandermonde least-squares problem. So what is our cubic polynomial?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "eaba0d42",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "cubic"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "db109cf4",
+ "metadata": {},
+ "source": [
+ "The first term is the degree 3 term, the second the degree 2 term, the third the degree 1 term, and the fourth is the constant term. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3f40d45e",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "# Additional visualization: line of best fit\n",
+ "## Generated scatter plot example\n",
+ "The first figure is a line of best fit for scattered points. Here is some alternate code that will produce an image. We can do the following using `matplotlib.pyplot.axline`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a4bb276b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# Generate data (same as above)\n",
+ "np.random.seed(3)\n",
+ "x = np.random.uniform(0, 10, 50)\n",
+ "y = 2.5 * x + 5 + np.random.normal(0, 2, 50)\n",
+ "\n",
+ "# Calculate slope and intercept\n",
+ "slope, intercept = np.polyfit(x, y, 1)\n",
+ "\n",
+ "plt.figure(figsize=(10, 6))\n",
+ "plt.scatter(x, y, color='purple', label='Data Points', alpha=0.7)\n",
+ "# Plot the line using axline\n",
+ "# xy1=(0, intercept) is the y-intercept point\n",
+ "\n",
+ "# slope=slope defines the steepness\n",
+ "plt.axline(xy1=(0, intercept), slope=slope, color='steelblue', linestyle='--', linewidth=2, label='Line of Best Fit')\n",
+ "\n",
+ "# Add the equation to the plot\n",
+ "# The f-string formats the slope and intercept to 2 decimal places\n",
+ "plt.text(1, 25, f'y = {slope:.2f}x + {intercept:.2f}', fontsize=12, bbox=dict(facecolor='white', alpha=0.8))\n",
+ "\n",
+ "\n",
+ "plt.xlabel('X Axis')\n",
+ "plt.ylabel('Y Axis')\n",
+ "plt.title('Scatter Plot with Line of Best Fit')\n",
+ "plt.legend()\n",
+ "plt.grid(True, linestyle=':', alpha=0.6)\n",
+ "plt.show()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c17fbdaf",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "See\n",
+ "- https://stackoverflow.com/questions/37234163/how-to-add-a-line-of-best-fit-to-scatter-plot\n",
+ "- https://www.statology.org/line-of-best-fit-python/\n",
+ "- https://stackoverflow.com/questions/6148207/linear-regression-with-matplotlib-numpy\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebooks/02_qr_svd.ipynb b/notebooks/02_qr_svd.ipynb
new file mode 100644
index 0000000..369b1bf
--- /dev/null
+++ b/notebooks/02_qr_svd.ipynb
@@ -0,0 +1,738 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# QR Decompositions\n",
+ "\n",
+ "QR decompositions are a powerful tool in linear algebra and data science for several reasons. They provide a way to decompose a matrix into an orthogonal matrix $Q$ aand an upper triangular matrix $R$, which can simplify many computations and analyses.\n",
+ "\n",
+ "> **Theorem**: Let $A$ is an $m \\times n$ matrix with linearly independent columns ($m \\geq n$ in this case), then $A$ can be decomposed as $A = QR$ where $Q$ is an $m \\times n$ matrix whose columns form an orthonormal basis for Col($A$) and $R$ is an $n \\times n$ upper-triangular invertible matrix with positive entries on the diagonal.\n",
+ "\n",
+ "In the literature, sometimes the QR decomposition is phrased as follows: any $m \\times n$ matrix $A$ can also be written as $A = QR$ where $Q$ is an $m \\times m$ orthogonal matrix ($Q^T = Q^{-1}$), and $R$ is an $m \\times n$ upper-triangular matrix. One follows from the other by playing around with some matrix equations. Indeed, suppose that $A = Q_1R_1$ is a decomposition as above (that is, $Q_1$ is $m \\times n$ and $R_1$ is $n \\times n$). Use can use the Gram-Schmidt procedure to extend the columns of $Q_1$ to an orthonormal basis for all of $\\mathbb{R}^m$, and put the remaining vectors in a $(m - n) \\times n$ matrix $Q_2$. Then\n",
+ "\n",
+ "$$ A = Q_1R_1 = \\begin{bmatrix} Q_1 & Q_2 \\end{bmatrix}\\begin{bmatrix} R_1 \\\\ 0 \\end{bmatrix}. $$\n",
+ "\n",
+ "The left matrix is an $m \\times m$ orthogonal matrix and the right matrix is $m \\times n$ upper triangular. Moreover, the decomposition provides orthonormal bases for both the column space of $A$ and the perp of the column space of $A$; $Q_1$ will consist of an orthonormal basis for the column space of $A$ and $Q_2$ will consist of an orthonormal basis for the perp of the column space of $A$. \n",
+ "\n",
+ "However, we will often want to use the decomposition when $Q$ is $m \\times n$, $R$ is $n \\times n$, and the columns of $Q$ form an orthonormal basis for the column space of $A$. For example, the python function `numpy.linalg.qr` give QR decompositions this way (again, assuming that the columns of $A$ are linearly independent, so $m \\geq n$).\n",
+ "\n",
+ "> **Key take-away**. The QR decomposition provides an orthonormal basis for the column space of $A$. If $A$ has rank $k$, then the first $k$ columns of $Q$ will form a basis for the column space of $A$. \n",
+ "\n",
+ "For small matrices, one can find $Q$ and $R$ by hand, assuming that $A = [ a_1\\ \\cdots\\ a_n ]$ has full column rank. Let $e_1,\\dots,e_n$ be the unnormalized vectors we get when we apply Gram-Schmidt to $c_1,\\dots,c_n$, and let $u_1,\\dots,u_n$ be their normalizations. Let\n",
+ "$$ r_j = \\begin{bmatrix} \\langle e_1,c_j \\rangle \\\\ \\vdots \\\\ \\langle e_n, c_j \\rangle \\end{bmatrix}, $$\n",
+ "and note that $\\langle e_i,c_j \\rangle = 0$ whenever $i > j$. Thus\n",
+ "$$ Q = \\begin{bmatrix} u_1 & \\cdots & u_n \\end{bmatrix} \\text{ and } R = \\begin{bmatrix} r_1 & \\cdots & r_n \\end{bmatrix} $$\n",
+ "give rise to a $A = QR$, where the columns of $Q$ form an orthonormal basis for $\\text{Col}(A)$ and $R$ is upper-triangular. We can also compute $R$ directly from $Q$ and $Q$. Indeed, note that $Q^TQ = I$, so\n",
+ "$$ Q^TA = Q^T(QR) = IR = R. $$\n",
+ "\n",
+ "> **Example**. Find a QR decomposition for the matrix\n",
+ "> $$ A = \\begin{bmatrix} 1 & 1 & 1 \\\\ 0 & 1 & 1 \\\\ 0 & 0 & 1 \\\\ 0 & 0 & 0 \\end{bmatrix}. $$\n",
+ "> Note that one trivially see (or by applying the Gram-Schmidt procedure) that\n",
+ "> $$ \\begin{bmatrix} 1 \\\\ 0 \\\\ 0 \\\\ 0 \\end{bmatrix}, \\begin{bmatrix} 0 \\\\ 1 \\\\ 0 \\\\ 0 \\end{bmatrix}, \\begin{bmatrix} 0 \\\\ 0 \\\\ 1 \\\\ 0 \\end{bmatrix} $$\n",
+ "> forms an orthonormal basis for the column space of $A$. So with\n",
+ "> $$ Q = \\begin{bmatrix} 1 & 0 & 0 \\\\ 0 & 1 & 0 \\\\ 0 & 0 & 1 \\\\ 0 & 0 & 0 \\end{bmatrix} \\text{ and }R = \\begin{bmatrix} 1 & 1 & 1\\\\ 0 & 1 & 1 \\\\ 0 & 0 & 1 \\end{bmatrix}, $$\n",
+ "> we have $A = QR$.\n",
+ "\n",
+ "Let's do a more involved example.\n",
+ "> **Example**. Consider the matrix\n",
+ "> $$ A = \\begin{bmatrix} 1 & 0 & 0 \\\\ 1 & 1 & 0 \\\\ 1 & 1 & 1 \\\\ 1 & 1 & 1 \\end{bmatrix}. $$\n",
+ "> One can apply the Gram-Schmidt procedure to the columns of $A$ to find that\n",
+ "> $$ \\begin{bmatrix} 1 \\\\ 1 \\\\ 1 \\\\ 1 \\end{bmatrix}, \\begin{bmatrix} -3 \\\\ 1 \\\\ 1 \\\\ 1 \\end{bmatrix}, \\begin{bmatrix} 0 \\\\ -\\frac{2}{3} \\\\ \\frac{1}{3} \\\\ \\frac{1}{3}\\end{bmatrix} $$\n",
+ "> forms an orthogonal basis for the column space of $A$. Normalizing, we get that\n",
+ "> $$ Q = \\begin{bmatrix} \\frac{1}{2} & -\\frac{3}{\\sqrt{12}} & 0 \\\\ \\frac{1}{2} & \\frac{1}{\\sqrt{12}} & -\\frac{2}{\\sqrt{6}} \\\\ \\frac{1}{2} & \\frac{1}{\\sqrt{12}} & \\frac{1}{\\sqrt{6}} \\\\ \\frac{1}{2} & \\frac{1}{\\sqrt{12}} & \\frac{1}{\\sqrt{6}} \\end{bmatrix} $$\n",
+ "> is an appropriate $Q$. Thus\n",
+ "> $$ \\begin{split} R = Q^TA &= \\begin{bmatrix} \\frac{1}{2} & \\frac{1}{2} & \\frac{1}{2} & \\frac{1}{2} \\\\ -\\frac{3}{\\sqrt{12}} & \\frac{1}{\\sqrt{12}} & \\frac{1}{\\sqrt{12}} & \\frac{1}{\\sqrt{12}} \\\\ 0 & -\\frac{2}{\\sqrt{6}} & \\frac{1}{\\sqrt{6}} & \\frac{1}{\\sqrt{6}} \\end{bmatrix}\\begin{bmatrix} 1 & 0 & 0 \\\\ 1 & 1 & 0 \\\\ 1 & 1 & 1 \\\\ 1 & 1 & 1 \\end{bmatrix} \\\\ &= \\begin{bmatrix} 2 & \\frac{3}{2} & 1 \\\\ 0 & \\frac{3}{\\sqrt{12}} & \\frac{2}{\\sqrt{12}} \\\\ 0 & 0 & \\frac{2}{\\sqrt{6}} \\end{bmatrix}. \\end{split} $$\n",
+ "> So all together,\n",
+ "> $$A = \\begin{bmatrix} \\frac{1}{2} & -\\frac{3}{\\sqrt{12}} & 0 \\\\ \\frac{1}{2} & \\frac{1}{\\sqrt{12}} & -\\frac{2}{\\sqrt{6}} \\\\ \\frac{1}{2} & \\frac{1}{\\sqrt{12}} & \\frac{1}{\\sqrt{6}} \\\\ \\frac{1}{2} & \\frac{1}{\\sqrt{12}} & \\frac{1}{\\sqrt{6}} \\end{bmatrix}\\begin{bmatrix} 2 & \\frac{3}{2} & 1 \\\\ 0 & \\frac{3}{\\sqrt{12}} & \\frac{2}{\\sqrt{12}} \\\\ 0 & 0 & \\frac{2}{\\sqrt{6}} \\end{bmatrix}. $$\n",
+ "\n",
+ "To do this numerically, we can use `numpy.linalg.qr`.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Define our matrices\n",
+ "A = np.array([[1,1,1],[0,1,1],[0,0,1],[0,0,0]])\n",
+ "B = np.array([[1,0,0],[1,1,0],[1,1,1],[1,1,1]])\n",
+ "\n",
+ "# Take QR decompositions\n",
+ "QA, RA = np.linalg.qr(A)\n",
+ "QB, RB = np.linalg.qr(B)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Our resulting matrices are:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b48e6d97",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(f\"QA = {QA}\\n\")\n",
+ "print(f\"RA = {RA}\\n\")\n",
+ "print(f\"QB = {QB}\\n\")\n",
+ "print(f\"RB = {RB}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## How to use QR decompositions\n",
+ "\n",
+ "One of the primary uses of QR decompositions is to solve least squares problems, as introduced above. Assuming that $A$ has full column rank, we can write $A = QR$ as a QR decomposition, and then we can find a least-squares solution to $Ax = b$ by solving the upper-triangular system.\n",
+ "\n",
+ "> **Theorem**. Let $A$ be an $m \\times n$ matrix with full column rank, and let $A = QR$ be a QR factorization of $A$. Then, for each $b \\in \\mathbb{R}^m$, the equation $Ax = b$ has a unique least-squares solution, arising from the system\n",
+ "> $$ Rx = Q^Tb. $$\n",
+ "\n",
+ "Normal equations can be *ill-conditioned*, i.e., small errors in calculating $A^TA$ give large errors when trying to solve the least-squares problem. When $A$ has full column rank, a QR factorization will allow one to compute a solution to the least-squares problem more reliably. \n",
+ "\n",
+ "> **Example**. Let\n",
+ "> $$ A = \\begin{bmatrix} 1 & 0 & 0 \\\\ 1 & 1 & 0 \\\\ 1 & 1 & 1 \\\\ 1 & 1 & 1 \\end{bmatrix} \\text{ and } b = \\begin{bmatrix} 1 \\\\ 1 \\\\ 1 \\\\ 0 \\end{bmatrix}. $$\n",
+ "> We can find the least-squares solution $Ax = b$ by using the QR decomposition. Let us use the QR decomposition from above, and solve the system\n",
+ "> $$ Rx = Q^Tb. $$\n",
+ "> As\n",
+ "> $$ \\begin{bmatrix} \\frac{1}{2} & -\\frac{3}{\\sqrt{12}} & 0 \\\\ \\frac{1}{2} & \\frac{1}{\\sqrt{12}} & -\\frac{2}{\\sqrt{6}} \\\\ \\frac{1}{2} & \\frac{1}{\\sqrt{12}} & \\frac{1}{\\sqrt{6}} \\\\ \\frac{1}{2} & \\frac{1}{\\sqrt{12}} & \\frac{1}{\\sqrt{6}} \\end{bmatrix}^T\\begin{bmatrix} 1 \\\\ 1 \\\\ 1 \\\\ 0 \\end{bmatrix} = \\begin{bmatrix} \\frac{3}{2} \\\\ -\\frac{1}{2\\sqrt{3}} \\\\ -\\frac{1}{\\sqrt{6}}, \\end{bmatrix} $$\n",
+ "> we are looking at the system\n",
+ "> $$ \\begin{bmatrix} 2 & \\frac{3}{2} & 1 \\\\ 0 & \\frac{3}{\\sqrt{12}} & \\frac{2}{\\sqrt{12}} \\\\ 0 & 0 & \\frac{2}{\\sqrt{6}} \\end{bmatrix}x =\\begin{bmatrix} \\frac{3}{2} \\\\ -\\frac{1}{2\\sqrt{3}} \\\\ -\\frac{1}{\\sqrt{6}} \\end{bmatrix}. $$\n",
+ "> Solving this system yields that\n",
+ "> $$ x_0 = \\begin{bmatrix} 1 \\\\ 0 \\\\ -\\frac{1}{2} \\end{bmatrix} $$\n",
+ "> is a least-squares solution to $Ax = b$.\n",
+ "\n",
+ "Let us set this system up in python and use `numpy.linalg.solve`. \n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Define matrix and vector\n",
+ "A = np.array([[1,0,0],[1,1,0],[1,1,1],[1,1,1]])\n",
+ "b = np.array([[1],[1],[1],[0]])\n",
+ "\n",
+ "# Take the QR decomposition of A\n",
+ "Q, R = np.linalg.qr(A)\n",
+ "\n",
+ "# Solve the linear system Rx = Q.T b\n",
+ "beta = np.linalg.solve(R,Q.T @ b)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "This yields\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3f71de8a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "beta"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "which (basically) agrees with our exact least-squares solution.\n",
+ "Note that `numpy.linalg.lstsq` still gives a **ever so slightly** different result. \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dcda7f8d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "np.linalg.lstsq(A,b)[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "---\n",
+ "\n",
+ "Let's go back to the house example. While we're at it, let's get used to using pandas to make a dataframe. \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "# First let us make a dictionary incorporating our data.\n",
+ "# Each entry corresponds to a column (feature of our data)\n",
+ "data = {\n",
+ " 'Square ft': [1600, 2100, 1550, 1600, 2000],\n",
+ " 'Bedrooms': [3, 4, 2, 3, 4],\n",
+ " 'Price': [500, 650, 475, 490, 620]\n",
+ "}\n",
+ "\n",
+ "# Create a pandas DataFrame\n",
+ "df = pd.DataFrame(data)\n",
+ "\n",
+ "# Create our matrix X and our target y\n",
+ "X = df[[\"Square ft\", \"Bedrooms\"]].to_numpy()\n",
+ "y = df[[\"Price\"]].to_numpy()\n",
+ "\n",
+ "# Augment X with a column of 1's (intercept)\n",
+ "X_aug = np.hstack((np.ones((X.shape[0], 1)), X))\n",
+ "\n",
+ "# Perform QR decomposition\n",
+ "Q, R = np.linalg.qr(X_aug)\n",
+ "\n",
+ "# Solve the upper triangular system Rx = Q^Ty\n",
+ "beta = np.linalg.solve(R, Q.T @ y)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Let's look at the output.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3d1e5bab",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(f\"Q = {Q} \\n\\nR = {R} \\n\\nbeta = {beta}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "As we can see, the least-squares solution agrees with what we got by hand and by other python methods (if we agree that the tiny first component is essentially zero).\n",
+ "\n",
+ "---\n",
+ "\n",
+ "The QR decomposition of a matrix is also useful for computing orthogonal projections.\n",
+ "> **Theorem**. Let $A$ be an $m \\times n$ matrix with full column rank. If $A = QR$ is a QR decomposition, then $QQ^T$ is the projection onto the column space of $A$, i.e., $QQ^Tb = \\text{Proj}_{\\text{Col}(A)}b$ for all $b \\in \\mathbb{R}^m$.\n",
+ "\n",
+ "Let's see what our range projections are for the matrices above. Note that the first example above will have the orthogonal projection just being\n",
+ "$$ \\begin{bmatrix} 1 \\\\ & 1 \\\\ & & 1\\\\ & & & 0 \\end{bmatrix}. $$\n",
+ "Let's look at the other matrix. \n",
+ "\n",
+ "> **Example**. Working with the matrix\n",
+ "> $$ A = \\begin{bmatrix} 1 & 0 & 0 \\\\ 1 & 1 & 0 \\\\ 1 & 1 & 1 \\\\ 1 & 1 & 1 \\end{bmatrix}, $$\n",
+ "> the projection onto the column space if given by\n",
+ "> $$ QQ^T = \\begin{bmatrix} 1 \\\\ & 1 \\\\ & & \\frac{1}{2} & \\frac{1}{2} \\\\ & & \\frac{1}{2} & \\frac{1}{2} \\end{bmatrix}. $$\n",
+ "> This is a well-understood projection: it is the direct sum of the identity on $\\mathbb{R}^2$ and the projection onto the line $y = x$ in $\\mathbb{R}^2$.\n",
+ "\n",
+ "Now let's use python to implement the projection.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Create our matrix A\n",
+ "A = np.array([[1,0,0],[1,1,0],[1,1,1],[1,1,1]])\n",
+ "\n",
+ "# Take the QR decomposition\n",
+ "Q, R = np.linalg.qr(A)\n",
+ "\n",
+ "# Create the range projection\n",
+ "P = Q @ Q.T\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5bfd7362",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "P"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "The output gives\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d26b49a6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "array([[1.00000000e+00, 2.89687929e-17, 2.89687929e-17, 2.89687929e-17],\n",
+ " [2.89687929e-17, 1.00000000e+00, 7.07349921e-17, 7.07349921e-17],\n",
+ " [2.89687929e-17, 7.07349921e-17, 5.00000000e-01, 5.00000000e-01],\n",
+ " [2.89687929e-17, 7.07349921e-17, 5.00000000e-01, 5.00000000e-01]])\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "As we can see, the two off-diagonal blocks are all tiny, hence we treat them as zero. Note that if they were not actually zero, then this wouldn't actually be a projection. This can cause some problems.\n",
+ "\n",
+ "Let's write a function to implement this, assuming that columns of A are linearly independent. \n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "def proj_onto_col_space(A):\n",
+ "\t# Take the QR decomposition\n",
+ "\tQ,R = np.linalg.qr(A)\n",
+ "\t# The projection is just Q @ Q.T\n",
+ "\tP = Q @ Q.T\n",
+ "\n",
+ "\treturn P\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "We'll come back to this later. We should really be incorporating some sort of error tolerance so that things are **super super** tiny can actually just be sent to zero. \n",
+ "\n",
+ "> **Remark**. Another way to get the projection onto the column space of an $n \\times p$ matrix $A$ of full column rank is to take\n",
+ "> $$ P = A(A^TA)^{-1}A^T. $$\n",
+ "> Indeed, let $b \\in \\mathbb{R}^n$ and let $x_0 \\in \\mathbb{R}^p$ be a solution to the normal equations\n",
+ "> $$ A^TAx_0 = A^Tb. $$\n",
+ "> Then $x_0 = (A^TA)^{-1}A^Tb$ and so $Ax_0 = A(A^TA^{-1})A^Tb$ is the (unique!) vector in the column space of $A$ which is closest to $b$, i.e., the projection of $b$ onto the column space of $A$.\n",
+ "> However, taking transposes, multiplying, and inverting is not what we would like to do numerically. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4ae21758",
+ "metadata": {},
+ "source": [
+ "# Singular Value Decomposition\n",
+ "\n",
+ "The SVD is a very important matrix decomposition in both data science and linear algebra.\n",
+ "\n",
+ "> **Theorem**. For any matrix $n \\times p$ matrix $X$, there exist an orthogonal $n \\times n$ matrix $U$, an orthogonal $p \\times p$ matrix $V$, and a diagonal $n \\times p$ matrix $\\Sigma$ with non-negative entries such that\n",
+ "> $$ X = U\\Sigma V^T. $$\n",
+ "> - The columns of $U$ are left **left singular vectors**.\n",
+ "> - The columns of $V$ are the **right singular vectors**.\n",
+ "> - $\\Sigma$ has **singular values** $\\sigma_1 \\geq \\sigma_2 \\geq \\cdots \\geq \\sigma_r > 0$ on its diagonal, where $r$ is the rank of $X$.\n",
+ "\n",
+ "> **Remark**. The SVD is clearly a generalization of matrix diagonalization, but it also generalizes the **polar decomposition** of a matrix. Recall that every $n \\times n$ matrix $A$ can be written as $A = UP$ where $U$ is orthogonal (or unitary) and $P$ is a positive matrix. This is because if\n",
+ "> $$ A = U_0\\Sigma V^T $$\n",
+ "> is the SVD for $A$, then $\\Sigma$ is an $n \\times n$ diagonal matrix with non-negative entries, hence any orthogonal conjugate of it is positive, and so\n",
+ "> $$ A = (U_0V^T)(V\\Sigma V^T). $$\n",
+ "> Take $U = U_0V^T$ and $P = V\\Sigma V^T$. \n",
+ "\n",
+ "By hand, the algorithm for computing an SVD is as follows.\n",
+ "1. Both $AA^T$ and $A^TA$ are symmetric (they are positive in fact), and so they can be orthogonally diagonalized; one can form an orthogonal basis of eigenvectors. Let $v_1,\\dots,v_p$ be an orthonormal basis of eigenvectors for $\\mathbb{R}^p$ which correspond to eigenvectors of $A^TA$ in decreasing order. Suppose that $A^TA$ has $r$ non-zero eigenvalues. Let $V$ be the matrix whose columns contain the $v_i$'s. This gives our right singular vectors and our singular values. \n",
+ "2. Let $u_i = \\frac{1}{\\sigma_i}Av_i$ for $i = 1,\\dots,r$, and extend this collection of vectors to an orthonormal basis for $\\mathbb{R}^n$ if necessary. Let $U$ be the corresponding matrix.\n",
+ "3. Let $\\Sigma$ be the $n \\times p$ matrix whose diagonal entries are $\\sigma_1 \\geq \\sigma_2 \\geq \\cdots \\geq \\sigma_r$, and then zeroes if necessary. \n",
+ "\n",
+ "> **Example**. Let us compute the SVD of\n",
+ "> $$ A = \\begin{bmatrix} 3 & 2 & 2 \\\\ 2 & 3 & -2 \\end{bmatrix}. $$\n",
+ "> First we note that\n",
+ "> $$ A^TA = \\begin{bmatrix} 13 & 12 & 2 \\\\ 12 & 13 & -2 \\\\ 2 & -2 & 8 \\end{bmatrix}, $$\n",
+ "> which has eigenvalues $25,9,0$ with corresponding eigenvectors\n",
+ "> $$ \\begin{bmatrix} 1 \\\\ 1 \\\\ 0 \\end{bmatrix}, \\begin{bmatrix} 1 \\\\ -1 \\\\ 4 \\end{bmatrix}, \\begin{bmatrix} -2 \\\\ 2 \\\\ 1 \\end{bmatrix}. $$\n",
+ "> Normalizing, we get\n",
+ "> $$ V = \\begin{bmatrix} \\frac{1}{\\sqrt{2}} & \\frac{1}{3\\sqrt{2}} & -\\frac{2}{3} \\\\ \\frac{1}{\\sqrt{2}} & -\\frac{1}{3\\sqrt{2}} & \\frac{2}{3} \\\\ 0 & \\frac{4}{3\\sqrt{2}} & \\frac{1}{3} \\end{bmatrix}. $$\n",
+ "> Now we set $u_1 = \\frac{1}{5}Av_1$ and $u_2 = \\frac{1}{3}Av_2$ to get\n",
+ "> $$ U = \\begin{bmatrix} \\frac{1}{\\sqrt{2}} & \\frac{1}{\\sqrt{2}} \\\\ \\frac{1}{\\sqrt{2}} & -\\frac{1}{\\sqrt{2}} \\end{bmatrix}. $$\n",
+ "> So\n",
+ "> $$ A = \\begin{bmatrix} \\frac{1}{\\sqrt{2}} & \\frac{1}{\\sqrt{2}} \\\\ \\frac{1}{\\sqrt{2}} & -\\frac{1}{\\sqrt{2}} \\end{bmatrix} \\begin{bmatrix} 5 & 0 & 0 \\\\ 0 & 3 & 0 \\end{bmatrix} \\begin{bmatrix} \\frac{1}{\\sqrt{2}} & \\frac{1}{3\\sqrt{2}} & -\\frac{2}{3} \\\\ \\frac{1}{\\sqrt{2}} & -\\frac{1}{3\\sqrt{2}} & \\frac{2}{3} \\\\ 0 & \\frac{4}{3\\sqrt{2}} & \\frac{1}{3} \\end{bmatrix}^T $$\n",
+ "> is our SVD decomposition.\n",
+ "\n",
+ "We note that in practice, we avoid the computation of $X^TX$ because if the entries of $X$ have errors, then these errors will be squared in $X^TX$. There are better computational tools to get singular values and singular vectors which are more accurate. This is what our python tools will use. \n",
+ "\n",
+ "Let's use `numpy.linalg.svd` for the above matrix."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "#Define our matrix\n",
+ "A = np.array([[3,2,2],[2,3,-2]])\n",
+ "\n",
+ "# Take the SVD\n",
+ "U, S, Vh = np.linalg.svd(A)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Our SVD matrices are\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5336313f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(f\"U = {U}\\n\\nS = {S}\\n\\nVh.T = {Vh.T}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "\n",
+ "Because the eigenvalues of the hermitian squares of\n",
+ "$$ \\begin{bmatrix} 1 & 1 & 1\\\\ 0 & 1 & 1 \\\\ 0 & 0 & 1 \\\\ 0 & 0 & 0 \\end{bmatrix} \\text{ and } \\begin{bmatrix} 1 & 0 & 0 \\\\ 1 & 1 & 0 \\\\ 1 & 1 & 1 \\\\ 1 & 1 & 1 \\end{bmatrix} $$\n",
+ "are quite atrocious, an exact SVD decomposition is difficult to compute by hand. However, we can of course use python.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Define our matrices\n",
+ "A = np.array([[1,1,1],[0,1,1],[0,0,1],[0,0,0]])\n",
+ "B = np.array([[1,0,0],[1,1,0],[1,1,1],[1,1,1]])\n",
+ "\n",
+ "# SVD decomposition\n",
+ "U_A, S_A, Vh_A = np.linalg.svd(A)\n",
+ "U_B, S_B, Vh_B = np.linalg.svd(B)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "The resulting matrices are\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a13a3391",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(f\"U_A = {U_A}\\n\\nS_A = {S_A}\\n\\nVh_A.T = {Vh_A.T}\\n\\nU_B = {U_B}\\n\\nS_B = {S_B}\\n\\nVh_B.T = {Vh_B.T}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Another final note is that the **operator norm** of a matrix $A$ agrees with its largest singular value. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7c8396ff",
+ "metadata": {},
+ "source": [
+ "## Pseudoinverses and using the SVD\n",
+ "The SVD can be used to determine a least-squares solution for a given system. Recall that if $v_1,\\dots,v_p$ is an orthonormal basis for $\\mathbb{R}^p$ consisting of eigenvectors of $A^TA$, arranged so that they correspond to eigenvalues $\\sigma_1 \\geq \\sigma_2 \\geq \\cdots \\geq \\sigma_r$, then $\\{Av_1,\\dots,Av_r\\}$ is an orthogonal basis for the column space of $A$. In essence, this means that when we have our left singular vectors $u_1,\\dots,u_n$ (constructed based on our algorithm as above), we have that the first $r$ vectors form an orthonormal basis for the column space of $A$, and that the remaining $n - r$ vectors form an orthonormal basis for the perp of the column space of $A$ (which is also equal to the nullspace of $A^T$). \n",
+ "\n",
+ "> **Definition**. Let $A$ be an $n \\times p$ matrix and suppose that the rank of $A$ is $r \\leq \\min\\{n,p\\}$. Suppose that $A = U\\Sigma V^T$ is the SVD, where the singular values are decreasing. Partition\n",
+ "> $$ U = \\begin{bmatrix} U_r & U_{n-r} \\end{bmatrix} \\text{ and } V = \\begin{bmatrix} V_r & V_{p-r} \\end{bmatrix} $$\n",
+ "> into submatrices, where $U_r$ and $V_r$ are the matrices whose columns are the first $r$ columns of $U$ and $V$ respectively. So $U_r$ is $n \\times r$ and $V_r$ is $p \\times r$. Let $D$ be the diagonal $r \\times r$ matrices whose diagonal entries are $\\sigma_1,\\dots, \\sigma_r$, so that\n",
+ ">$$ \\Sigma = \\begin{bmatrix} D & 0 \\\\ 0 & 0 \\end{bmatrix} $$\n",
+ "> and note that\n",
+ "> $$ A = U_rDV_r^T. $$\n",
+ "> We call this the reduced singular value decomposition of $A$. Note that $D$ is invertible, and its inverse is simply\n",
+ "> $$ D = \\begin{bmatrix} \\sigma_1^{-1} \\\\ & \\sigma_2^{-1} \\\\ & & \\ddots \\\\ & & & \\sigma_r^{-1} \\end{bmatrix}. $$\n",
+ "> The **pseudoinverse** (or **Moore-Penrose inverse**) of $A$ is the matrix\n",
+ "> $$ A^+ = V_rD^{-1}U_r^T. $$\n",
+ "\n",
+ "We note that the pseudoinverse $A^+$ is a $p \\times n$ matrix. \n",
+ "\n",
+ "With the pseudoinverse, we can actually find least-squares solutions quite easily. Indeed, if we are looking for the least-squares solution to the system $Ax = b$, define\n",
+ "$$ x_0 = A^+b. $$\n",
+ "Then \n",
+ "$$ \\begin{split} Ax_0 &= (U_rDV_r^T)(V_rD^{-1}U_r^Tb) \\\\ &= U_rDD^{-1}U_r^Tb \\\\ &= U_rU_r^Tb \\end{split} $$\n",
+ "As mentioned before, the columns of $U_r$ form an orthonormal basis for the column space of $A$ and so $U_rU_r^T$ is the orthogonal projection onto the range of $A$. That is, $Ax_0$ is precisely the projection of $b$ onto the column space of $A$, meaning that this yields a least-squares solution. This gives the following.\n",
+ "\n",
+ "> **Theorem**. Let $A$ be an $n \\times p$ matrix and $b \\in \\mathbb{R}^n$. Then\n",
+ "> $$ x_0 = A^+b$$\n",
+ "> is a least-squares solution to $Ax = b$. \n",
+ "\n",
+ "Taking pseudoinverses is quite involved. We'll do one example by hand, and then use python -- and we'll see something go wrong! There is a function `numpy.linalg.pinv` in numpy that will take a pseudoinverse. We can also just use `numpy.linalg.svd` and do the process above.\n",
+ "\n",
+ "> **Example**. We have the following SVD $A = U\\Sigma V^T$. \n",
+ "> $$ \\begin{bmatrix} 1 & 1 & 2\\\\ 0 & 1 & 1 \\\\ 1 & 0 & 1 \\\\ 0 & 0 & 0 \\end{bmatrix} = \\begin{bmatrix} \\sqrt{\\frac{2}{3}} & 0 & 0 & -\\frac{1}{\\sqrt{3}} \\\\ \\frac{1}{\\sqrt{6}} & \\frac{1}{\\sqrt{2}} & 0 & \\frac{1}{\\sqrt{3}} \\\\ \\frac{1}{\\sqrt{6}} & -\\frac{1}{\\sqrt{2}} & 0 & \\frac{1}{\\sqrt{3}} \\\\ 0 & 0 & 1 & 0 \\end{bmatrix} \\begin{bmatrix} 3 & 0 & 0 \\\\ 0 & 1 & 0 \\\\ 0 & 0 & 0 \\\\ 0 & 0 & 0 \\end{bmatrix}\\begin{bmatrix} \\frac{1}{\\sqrt{6}} & -\\frac{1}{\\sqrt{2}} & -\\frac{1}{\\sqrt{3}} \\\\ \\frac{1}{\\sqrt{6}} & \\frac{1}{\\sqrt{2}} & -\\frac{1}{\\sqrt{3}} \\\\ \\sqrt{\\frac{2}{3}} & 0 & \\frac{1}{\\sqrt{3}} \\end{bmatrix}^T. $$\n",
+ "> Can we find a least-squares solution to $Ax = b$, where\n",
+ "> $$ b = \\begin{bmatrix} 1 \\\\ 1 \\\\ 1 \\\\ 1 \\end{bmatrix}? $$\n",
+ "> The pseudoinverse of $A$ is\n",
+ "> $$ \\begin{split} A^+ &= \\begin{bmatrix} \\frac{1}{\\sqrt{6}} & -\\frac{1}{\\sqrt{2}} \\\\ \\frac{1}{\\sqrt{6}} & \\frac{1}{\\sqrt{2}} \\\\ \\sqrt{\\frac{2}{3}} & 0 \\end{bmatrix} \\begin{bmatrix} 3 \\\\ & 1 \\end{bmatrix} \\begin{bmatrix} \\sqrt{\\frac{2}{3}} & 0 \\\\ \\frac{1}{\\sqrt{6}} & \\frac{1}{\\sqrt{2}} \\\\ \\frac{1}{\\sqrt{6}} & -\\frac{1}{\\sqrt{2}} \\\\ 0 & 0 \\end{bmatrix}^T \\\\ &= \\begin{bmatrix} \\frac{1}{9} & -\\frac{4}{9} & \\frac{5}{9} & 0 \\\\ \\frac{1}{9} & \\frac{5}{9} & -\\frac{4}{9} & 0 \\\\ \\frac{2}{9} & \\frac{1}{9} & \\frac{1}{9} & 0\\end{bmatrix}, \\end{split} $$\n",
+ "> and so a least-squares solution is given by\n",
+ "> $$ \\begin{split} x_0 &= A^+b \\\\ &= \\begin{bmatrix} \\frac{1}{9} & -\\frac{4}{9} & \\frac{5}{9} & 0 \\\\ \\frac{1}{9} & \\frac{5}{9} & -\\frac{4}{9} & 0 \\\\ \\frac{2}{9} & \\frac{1}{9} & \\frac{1}{9} & 0\\end{bmatrix}\\begin{bmatrix} 1 \\\\ 1 \\\\ 1 \\\\ 1 \\end{bmatrix} \\\\ &= \\begin{bmatrix} \\frac{2}{9} \\\\ \\frac{2}{9} \\\\ \\frac{4}{9} \\end{bmatrix}. \\end{split} $$\n",
+ "\n",
+ "Now let's do this with python, and see an example of how things can go wrong. We'll try to take the pseudoinverse manually first.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Create our matrix A and our target b\n",
+ "A = np.array([[1,1,2],[0,1,1],[1,0,1],[0,0,0]])\n",
+ "b = np.array([[1],[1],[1],[1]])\n",
+ "\n",
+ "# Take the SVD decomposition\n",
+ "U, S, Vh = np.linalg.svd(A)\n",
+ "\n",
+ "# Prepare the pseudoinverse\n",
+ "# Recall that we invert the non-zero diagonal entries of the diagonal matrix.\n",
+ "# So we first build S_inv to be the appropriate size\n",
+ "S_inv = np.zeros((Vh.shape[0], U.shape[0])) \n",
+ "# We then fill in the appropriate values on the diagonal\n",
+ "S_inv[:len(S), :len(S)] = np.diag(1/S)\n",
+ "\n",
+ "# Build the pseudoinverse\n",
+ "A_pinv = Vh.T @ S_inv @ U.T\n",
+ "\n",
+ "# Compute the least-squares solution\n",
+ "beta = A_pinv @ b\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "What is the result?\t\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "862ed810",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "beta"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "This is **WAY** off the mark. So what happened? Well, when we look at our singular values, we have\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2d3df55d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "S"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "As we got this matrix numerically, the last entry is actually non-zero, but tiny. This isn't exactly what's going on since we know that the rank of A is 2. So when we invert the singular values and throw them on the diagonal, have `1/1.21618839e-16` which is a very large value. This value then messes up the rest of the computation. So how do we fix this? One can set tolerances in numpy, but we'll get to that later. Let's just note that `numpy.linalg.pinv` will already incorporate this. Let's see what we get.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Create our matrix A and our target b\n",
+ "A = np.array([[1,1,2],[0,1,1],[1,0,1],[0,0,0]])\n",
+ "b = np.array([[1],[1],[1],[1]])\n",
+ "\n",
+ "# Build the pseudoinverse\n",
+ "A_pinv = np.linalg.pinv(A)\n",
+ "\n",
+ "# Compute the least-squares solution\n",
+ "beta = A_pinv @ b\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2657ea4b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(f\"A_pinv={A_pinv}\\n\\nbeta={beta}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## The Condition Number\n",
+ "Numerical calculations involving matrix equations are quite reliable if we use the SVD. This is because the orthogonal matrices $U$ and $V$ preserve lengths and angles, leaving the stability of the problem to be governed by the singular values of the matrix $X$. Recall that if $X = U\\Sigma V^T$, then solving the least-squares problem involves dividing by the non-zero singular values $\\sigma_i$ of $X$. If these values are very small, their inverses become very large, and this will amplify any numerical errors.\n",
+ "\n",
+ "> **Definition**. Let $X$ be an $n \\times p$ matrix and let $\\sigma_1 \\geq \\cdots \\geq \\sigma_r$ be the non-zero singular values of $X$. The **condition number** of $X$ is the quotient\n",
+ "> $$ \\kappa(X) = \\frac{\\sigma_1}{\\sigma_r} $$\n",
+ "> of the largest and smallest non-zero singular values.\n",
+ "\n",
+ "A condition number close to 1 indicates a well-conditioned problem, while a large condition number indicates that small perturbations in data may lead to large changes in computation. Geometrically, $\\kappa(X)$ measures how much $X$ distorts space. \n",
+ "\n",
+ "> **Example**. Consider the matrices\n",
+ "> $$ A = \\begin{bmatrix} 1 \\\\ & 1 \\end{bmatrix} \\text{ and } B = \\begin{bmatrix} 1 \\\\ & \\frac{1}{10^6} \\end{bmatrix}. $$\n",
+ "> The condition numbers are\n",
+ "> $$ \\kappa(A) = 1 \\text{ and } \\kappa(B) = 10^6. $$\n",
+ "> Inverting $X_2$ includes dividing by $\\frac{1}{10^6}$, which will amplify errors by $10^6$.\n",
+ "\n",
+ "Let's look our main example in python by using `numpy.linalg.cond`. \n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "# First let us make a dictionary incorporating our data.\n",
+ "# Each entry corresponds to a column (feature of our data)\n",
+ "data = {\n",
+ " 'Square ft': [1600, 2100, 1550, 1600, 2000],\n",
+ " 'Bedrooms': [3, 4, 2, 3, 4],\n",
+ " 'Price': [500, 650, 475, 490, 620]\n",
+ "}\n",
+ "\n",
+ "# Create a pandas DataFrame\n",
+ "df = pd.DataFrame(data)\n",
+ "\n",
+ "# Create out matrix X\n",
+ "X = df[['Square ft', 'Bedrooms']].to_numpy()\n",
+ "\n",
+ "# Check the condition number\n",
+ "cond_X = np.linalg.cond(X)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Let's see what we got.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8aa6bca9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "cond_X"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "so this is quite a high condition number! This should be unsurprising, as clearly the number of bedrooms is correlated to the size of a house (especially so in our small toy example). "
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebooks/03_what_goes_wrong.ipynb b/notebooks/03_what_goes_wrong.ipynb
new file mode 100644
index 0000000..b461407
--- /dev/null
+++ b/notebooks/03_what_goes_wrong.ipynb
@@ -0,0 +1,389 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "f385cc40",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# A note on other norms\n",
+ "\n",
+ "There are other canonical choices of norms for vectors and matrices. While $L^2$ leads naturally to least-squares problems with closed-form solutions, other norms induce different geometries and different optimal solutions. From the linear algebra perspective, changing the norm affects:\n",
+ "- the shape of the unit ball,\n",
+ "- the geometry of approximation,\n",
+ "- the numerical behaviour of optimization problems. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ca50a202",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## $L^1$ norm (Manhattan distance)\n",
+ "The $L^1$ norm of a vector $x = (x_1,\\dots,x_p) \\in \\mathbb{R}^p$ is defined as\n",
+ "$$ \\|x\\|_1 = \\sum |x_i|. $$\n",
+ "Minimizing the $L^1$ norm is less sensitive to outliers. Geometrically, the $L^1$ unit ball in $\\mathbb{R}^2$ is a diamond (a rotated square), rather than a circle.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "77e7c0b3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# Grid\n",
+ "xx = np.linspace(-1.2, 1.2, 400)\n",
+ "yy = np.linspace(-1.2, 1.2, 400)\n",
+ "X, Y = np.meshgrid(xx, yy)\n",
+ "\n",
+ "# Take the $L^1$ norm\n",
+ "Z = np.abs(X) + np.abs(Y)\n",
+ "\n",
+ "plt.figure(figsize=(6,6))\n",
+ "plt.contour(X, Y, Z, levels=[1])\n",
+ "plt.contourf(X, Y, Z, levels=[0,1], alpha=0.3)\n",
+ "\n",
+ "plt.axhline(0)\n",
+ "plt.axvline(0)\n",
+ "plt.gca().set_aspect(\"equal\", adjustable=\"box\")\n",
+ "plt.title(r\"$L^1$ unit ball: $|x|+|y|\\leq 1$\")\n",
+ "plt.tight_layout()\n",
+ "plt.savefig('../images/L1_unit_ball.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ce59565a",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Consequently, optimization problems involving $L^1$ tend to produce solutions which live on the corners of this polytope.\n",
+ "Solutions often require linear programming or iterative reweighted least squares.\n",
+ "\n",
+ "$L^1$ based methods (such as LASSO) tend to set coefficients to be exactly zero. Unlike with $L^2$, the minimization problem for $L^1$ does not admit a closed form solution. Algorithms include:\n",
+ "- linear programming formulations,\n",
+ "- iterative reweighted least squares,\n",
+ "- coordinate descent methods.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d9c50d8e",
+ "metadata": {},
+ "source": [
+ "## $L^{\\infty}$ norm (max/supremum norm)\n",
+ "The supremum norm defined as\n",
+ "$$ \\|x\\|_{\\infty} = \\max |x_i| $$\n",
+ "seeks to control the worst-case error rather than the average error. Minimizing this norm is related to Chebyshev approximation by polynomials. \n",
+ "\n",
+ "Geometrically, the unit ball of $\\mathbb{R}^2$ with respect to the $L^{\\infty}$ norm looks like a square.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2724a3bc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# Grid\n",
+ "xx = np.linspace(-1.2, 1.2, 400)\n",
+ "yy = np.linspace(-1.2, 1.2, 400)\n",
+ "X, Y = np.meshgrid(xx, yy)\n",
+ "\n",
+ "# Take the $L^{\\infty}$ norm\n",
+ "Z = np.maximum(np.abs(X), np.abs(Y))\n",
+ "\n",
+ "plt.figure(figsize=(6,6))\n",
+ "plt.contour(X, Y, Z, levels=[1])\n",
+ "plt.contourf(X, Y, Z, levels=[0,1], alpha=0.3)\n",
+ "\n",
+ "plt.axhline(0)\n",
+ "plt.axvline(0)\n",
+ "plt.gca().set_aspect(\"equal\", adjustable=\"box\")\n",
+ "plt.title(r\"$L^{\\infty}$ unit ball: $\\max\\{|x|,|y|\\} \\leq 1$\")\n",
+ "plt.tight_layout()\n",
+ "plt.savefig('../images/Linf_unit_ball.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "55c4ce17",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Problems involving the $L^{\\infty}$ norm are often formulated as linear programs, and are useful when worst-case guarantees are more important than optimizing average performance. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d393c069",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Matrix norms\n",
+ "\n",
+ "There are also various norms on matrices, each highlighting a different aspect of the associated linear transformation.\n",
+ "- **Frobenius norm**. This is an important norm, essentially the analogue of the $L^2$ norm for matrices. It is the Euclidean norm if you think of your matrix as a vector, forgetting its rectangular shape. For $A = (a_{ij})$ a matrix, the Frobenius norm \n",
+ " $$ \\|A\\|_F = \\sqrt{\\sum a_{ij}^2} $$\n",
+ " is the square root of the sum of squares of all the entries. This treats a matrix as a long vector and is invariant under orthogonal transformations. As we'll see, it plays a central role in:\n",
+ " - least-squares problems,\n",
+ " - low-rank approximation,\n",
+ " - principal component analysis.\n",
+ "\n",
+ " In particular, the truncated SVD yields a best low-rank approximation of a matrix with respect to the Frobenius norm.\n",
+ "\n",
+ " We also that that the Frobenius norm can be written in terms of tracial data. We have that\n",
+ " $$ \\|A\\|_F^2 = \\text{Tr}(A^TA) = \\text{Tr}(AA^T). $$\n",
+ "- **Operator norm** (spectral norm). This is just the norm as an operator $A: \\mathbb{R}^p \\to \\mathbb{R}^n$, where $\\mathbb{R}^p$ and $\\mathbb{R}^n$ are thought of as Hilbert spaces:\n",
+ " $$ \\|A\\| = \\max_{\\|x\\|_2 = 1}\\|Ax\\|_2. $$\n",
+ " This norm measures how big of an amplification $A$ can apply, and is equal to the largest singular value of $A$. This norm is related to stability properties, and is the analogue of the $L^{\\infty}$ norm.\n",
+ "- **Nuclear norm**. The nuclear norm, defined as\n",
+ " $$ \\|A\\|_* = \\sum \\sigma_i, $$\n",
+ " is the sum of the singular values. When $A$ is square, this is precisely the trace-class norm, and is the analogue of the $L^1$ norm. This norm acts as a generalization of the concept of rank. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4ee62ee0",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# A note on regularization\n",
+ "\n",
+ "Regularization introduces additional constraints or penalties to stabilize ill-posed problems. From the linear algebra point of view, regularization modifies the singular value structure of a problem. \n",
+ "- **Ridge regression**: add a positive multiple $\\lambda\\cdot I$ of the identity to $X^TX$ which will artificially inflate small singular values.\n",
+ "- This dampens unstable directions while leaving well-conditioned directions largely unaffected.\n",
+ " \n",
+ "Geometrically, regularization reshapes the solution space to suppress directions that are poorly supported by the data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "be3b8c1e",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# A note on solving multiple targets concurrently\n",
+ "\n",
+ "Suppose now that we were interested in solving several problems concurrently; that is, given some data points, we would like to make $k$ predictions. Say we have our $n \\times p$ data matrix $X$, and we want to make $k$ predictions $y_1,\\dots,y_k$. We can then set the problem up as finding a best solution to the matrix equation\n",
+ "$$ XB = Y $$\n",
+ "where $B$ will be a $p \\times k$ matrix of parameters and $Y$ will be the $p \\times k$ matrix whose columns are $y_1,\\dots,y_k$. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "908cd528",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# What can go wrong?\n",
+ "\n",
+ "We are often dealing with imperfect data, so there is plenty that could go wrong. Here are some basic cases of where things can break down.\n",
+ "\n",
+ "- **Perfect multicolinearity**: non-invertible $\\tilde{X}^T\\tilde{X}$. This happens when one feature is a perfect combination of the others. This means that the columns of the matrix $\\tilde{X}$ are linearly dependent, and so infinitely many solutions will exist to the least-squares problem. \n",
+ " - For example, if you are looking at characteristics of people and you have height in both inches and centimeters.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a1d9e0ed",
+ "metadata": {},
+ "source": [
+ "- **Almost multicolinearity**: this happens when one features is **almost** a perfect combination of the others. From the linear algebra perspective, the columns of $\\tilde{X}$ might not be dependent, but they will be be **almost** linearly dependent. This will cause problems in calculation, as the condition number will become large and amplify numerical errors. The inverse will blow up small spectral components. \n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fa57c3a4",
+ "metadata": {},
+ "source": [
+ "\n",
+ "- **More features than observations**: this means that our matrix $\\tilde{X}$ will be wider than it is high. Necessarily, this means that the columns are linearly dependent. Regularization or dimensionality reduction becomes essential.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f5fff0b5",
+ "metadata": {},
+ "source": [
+ "\n",
+ "- **Redundant or constant features**: this is when there is a characteristic that is satisfied by each observation. In terms of the linear algebraic data, this means that one of the columns of $X$ is constant.\n",
+ " - e.g., if you are looking at characteristics of penguins, and you have \"# of legs\". This will always be two, and doesn't add anything to the analysis.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ed7d745d",
+ "metadata": {},
+ "source": [
+ "\n",
+ "- **Underfitting**: the model lacks sufficient expressivity to capture the underlying structure. For example, see the section on polynomial regression -- sometimes one might want a curve vs. a straight line."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2de2ed0c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "## Generate data\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# 1) Generate quadratic data\n",
+ "np.random.seed(3)\n",
+ "\n",
+ "n = 50\n",
+ "x = np.random.uniform(-5, 5, n) # symmetric, wider range\n",
+ "\n",
+ "# True relationship: y = ax^2 + c + noise\n",
+ "a_true = 2.0\n",
+ "c_true = 5.0\n",
+ "noise = np.random.normal(0, 3, n)\n",
+ "\n",
+ "y = a_true * x**2 + c_true + noise\n",
+ "\n",
+ "# find a line of best fit\n",
+ "a,b = np.polyfit(x, y, 1)\n",
+ "\n",
+ "# add scatter points to plot\n",
+ "plt.scatter(x,y)\n",
+ "\n",
+ "# add line of best fit to plot\n",
+ "plt.plot(x, a*x + b, 'r', linewidth=1)\n",
+ "\n",
+ "# plot it\n",
+ "plt.show()\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dbb01960",
+ "metadata": {},
+ "source": [
+ "- **Overfitting**: the model captures noise rather than structure. Often due to model complexity relative to data size. Polynomial regression can give a nice visualization of overfitting. For example, if we worked with the same generated quadratic data from the polynomial regression section, and we tried to approximation it by a degree 11 polynomial, we get the following.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "43ab6a3f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# 1) Generate quadratic data\n",
+ "np.random.seed(3)\n",
+ "\n",
+ "n = 50\n",
+ "x = np.random.uniform(-5, 5, n)\n",
+ "\n",
+ "a_true = 2.0\n",
+ "c_true = 5.0\n",
+ "noise = np.random.normal(0, 3, n)\n",
+ "\n",
+ "y = a_true * x**2 + c_true + noise\n",
+ "\n",
+ "# 2) Fit degree 11 polynomial\n",
+ "coeffs = np.polyfit(x, y, 11)\n",
+ "\n",
+ "# Create polynomial function\n",
+ "p = np.poly1d(coeffs)\n",
+ "\n",
+ "# 3) Sort x for smooth plotting\n",
+ "x_sorted = np.linspace(min(x), max(x), 500)\n",
+ "\n",
+ "# 4) Plot\n",
+ "plt.scatter(x, y, label=\"Data\")\n",
+ "plt.plot(x_sorted, p(x_sorted), 'r', linewidth=2, label=\"Degree 11 fit\")\n",
+ "\n",
+ "plt.legend()\n",
+ "plt.title(\"Degree 11 Polynomial Fit\")\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3aa62d78",
+ "metadata": {},
+ "source": [
+ "\n",
+ "- **Outliers**: large deviations can dominate the $L^2$ norm. This is where normalization might be key.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "86606942",
+ "metadata": {},
+ "source": [
+ "\n",
+ "- **Heteroscedasticity**: this is when the variance of noise changes across observations. Certain least-squares assumptions will break down."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "04e122fb",
+ "metadata": {},
+ "source": [
+ "\n",
+ "- **Condition number**: a large condition number indicates numerical instability and sensitivity to perturbation, even when formal solutions exist."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b5e424fd",
+ "metadata": {},
+ "source": [
+ "\n",
+ "- **Insufficient tolerance**: in numerical algorithms, thresholds used to determine rank or invertibility must be chosen carefully. Poor choices can lead to misleading results."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d1dd798",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "The point is that many failures in data science are not conceptual, but they happen geometrically and numerically. Poor choices lead to poor results. \n",
+ "\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebooks/04_pca.ipynb b/notebooks/04_pca.ipynb
new file mode 100644
index 0000000..e6fe488
--- /dev/null
+++ b/notebooks/04_pca.ipynb
@@ -0,0 +1,419 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Principal Component Analysis\n",
+ "\n",
+ "Principal Component Analysis (PCA) addresses the issues of multicollinearity and dimensionality mentioned at the end of the previous section by transforming the data into a new coordinate system. The new axes -- called principal components -- are chosen to capture the maximum variance in the data. In linear algebra terms, we are finding a subspace of potentially smaller dimension that best approximates our data.\n",
+ "\n",
+ "> **Example**: Let us return to our house example. Suppose we decide to list the square footage in both square feet and square meters. Let's add this feature to our dataset.\n",
+ "> |House | Square ft | Square m | Bedrooms | Price (in $1000s) |\n",
+ "> | --- | --- | --- | --- | --- |\n",
+ "> | 0 | 1600 | 148 | 3 | 500 |\n",
+ "> | 1 | 2100 | 195 | 4 | 650 |\n",
+ "> | 2 | 1550 | 144 | 2 | 475 |\n",
+ "> | 3 | 1600 | 148 | 3 | 490 |\n",
+ "> | 4 | 2000 | 185 | 4 | 620 |\n",
+ "> \n",
+ "> In this case, our associated matrix is:\n",
+ "> $$ X = \\begin{bmatrix} 1600 & 148 & 3 & 500 \\\\ 2100 & 195 & 4 & 650 \\\\ 1550 & 144 & 2 & 475 \\\\ 1600 & 148 & 3 & 490 \\\\ 2000 & 185 & 4 & 620 \\end{bmatrix} $$\n",
+ "\n",
+ "There are a few problems with the above data and the associated matrix $X$ (this time, we're not looking to make predictions, so we don't omit the last column).\n",
+ "- **Redundancy**: Square feet and square meters give the same information. It's just a matter of if you're from a civilized country or from an uncivilized country.\n",
+ "- **Numerical instability**: The columns of $X$ are nearly linearly dependent. Indeed, the second column is almost a multiple of the first. Moreover, one can make a safe bet that the number of bedrooms increases as the square footage does, so that the first and third columns are correlated.\n",
+ "- **Interpretation difficulty**: We used the square footage and bedrooms *together* in the previous section to predict the price of a house. However, because of their correlation, this obfuscates the true relationship, say, between the square footage and the price of a house, or the number of bedrooms and the price of a house. \n",
+ "\n",
+ "So the question becomes: what do we do about this? We will try to get a smaller matrix (less columns) that contains the same, or a close enough, amount of information. The point is that the data is *effectively* lower-dimensional. \n",
+ "\n",
+ "Let's do a little analysis on our dataset before progressing. Let's use `pandas.DataFrame.describe`, `pandas.DataFrame.corr` and `numpy.linalg.cond`. First, let's set up our data.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "# First let us make a dictionary incorporating our data.\n",
+ "# Each entry corresponds to a column (feature of our data)\n",
+ "data = {\n",
+ " 'Square ft': [1600, 2100, 1550, 1600, 2000],\n",
+ "\t'Square m': [148, 195, 144, 148, 185],\n",
+ " 'Bedrooms': [3, 4, 2, 3, 4],\n",
+ " 'Price': [500, 650, 475, 490, 620]\n",
+ "}\n",
+ "\n",
+ "# Create a pandas DataFrame\n",
+ "df = pd.DataFrame(data)\n",
+ "\n",
+ "# Create out matrix X\n",
+ "X = df.to_numpy()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "Now let's see what it has to offer. \n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8514ed8b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df.describe()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0eb032aa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df.corr()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6a166792",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "np.linalg.cond(X)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "As we can see, everything is basically correlated, and we clearly have some redundancies. \n",
+ "\n",
+ "This section is structured as follows. \n",
+ "- [Low-rank approximation via SVD](#low-rank-approximation-via-svd)\n",
+ "- [Centering data](#centering-data)\n",
+ "\n",
+ "\n",
+ "## Low-rank approximation via SVD\n",
+ "\n",
+ "Let $A$ be an $n \\times p$ matrix and let $A = U\\Sigma V^T$ be a SVD. Let $u_1,\\dots,u_n$ be the columns of $U$, $v_1,\\dots,v_p$ be the column of $V$, and $\\sigma_1 \\geq \\cdots \\sigma_r > 0$ be the singular values, where $r \\leq \\min\\{n,p\\}$ is the rank of $A$. Then we have the **reduced singular value decomposition** (see [Pseudoinverses and using the svd](#pseudoinverses-and-using-the-svd))\n",
+ "$$ A = \\sum_{i=1}^r \\sigma_i u_iv_i^T $$\n",
+ "(note that $u_i$ is a $n \\times 1$ matrix and $v_i$ is a $p \\times 1$ matrix, so $u_iv_i^T$ is some $n \\times p$ matrix).\n",
+ "The key idea is that if the rank of $A$ is higher, say $s$, but the latter singular values are small, then we should still have an approximation like this. Say $\\sigma_{r+1},\\dots,\\sigma_{s}$ are tiny. Then\n",
+ "$$ \\begin{split} A &= \\sum_{i=1}^s \\sigma_i u_i v_i^T \\\\ &= \\sum_{i=1}^r \\sigma_i u_iv_i^T + \\sum_{i=r+1}^{s} \\sigma_i u_iv_i^T \\\\ &\\approx \\sum_{i=1}^r \\sigma_iu_i v_i^T \\end{split}. $$\n",
+ "So defining $A_r := \\sum_{i=1}^r \\sigma_i u_iv_i^T$, we are approximating $A$ by $A_r$.\n",
+ "\n",
+ "In what sense is this a good approximation though? Recall that the Frobenius norm of a matrix $A$ is defined as the sqrt root of the sum of the squares of all the entries:\n",
+ "$$ \\|A\\|_F = \\sqrt{\\sum_{i,j} a_{ij}^2}. $$\n",
+ "The Frobenius norm acts as a very nice generalization of the $L^2$ norm for vectors, and is an indispensable tool in both linear algebra and data science. The point is that this \"approximation\" above actually works in the Frobenius norm, and this reduced singular value decomposition in fact minimizes the error.\n",
+ "\n",
+ "> **Theorem** (EckartāYoungāMirsky). Let $A$ be an $n \\times p$ matrix of rank $r$. For $k \\leq r$,\n",
+ "> $$ \\min_{B \\text{ such that rank}(B) \\leq k} \\|A - B\\|_F = \\|A - A_k\\|_F. $$\n",
+ "> The (at most) rank $k$ matrix $A_k$ also realizes the minimum when optimizing for the operator norm.\n",
+ "\n",
+ "> **Example**. Recall that we have the following SVD:\n",
+ "> $$ \\begin{bmatrix} 3 & 2 & 2 \\\\ 2 & 3 & -2 \\end{bmatrix} = \\begin{bmatrix} \\frac{1}{\\sqrt{2}} & \\frac{1}{\\sqrt{2}} \\\\ \\frac{1}{\\sqrt{2}} & -\\frac{1}{\\sqrt{2}} \\end{bmatrix} \\begin{bmatrix} 5 & 0 & 0 \\\\ 0 & 3 & 0 \\end{bmatrix} \\begin{bmatrix} \\frac{1}{\\sqrt{2}} & \\frac{1}{3\\sqrt{2}} & -\\frac{2}{3} \\\\ \\frac{1}{\\sqrt{2}} & -\\frac{1}{3\\sqrt{2}} & \\frac{2}{3} \\\\ 0 & \\frac{4}{3\\sqrt{2}} & \\frac{1}{3} \\end{bmatrix}^T. $$\n",
+ "> So if we want a rank-one approximation for the matrix, we'll do the reduced SVD. We have\n",
+ "> $$ \\begin{split} A_1 &= \\sigma_1u_1v_1^T \\\\ &= 5\\begin{bmatrix} \\frac{1}{\\sqrt{2}} \\\\ \\frac{1}{\\sqrt{2}} \\end{bmatrix}\\begin{bmatrix} \\frac{1}{\\sqrt{2}} & \\frac{1}{\\sqrt{2}} & 0 \\end{bmatrix} \\\\ &= \\begin{bmatrix} \\frac{5}{2} & \\frac{5}{2} & 0 \\\\ \\frac{5}{2} & \\frac{5}{2} & 0 \\end{bmatrix} \\end{split}$$\n",
+ "> Now let's compute the (square of the) Frobenius norm of the difference $A - A_1$. We have\n",
+ "> $$ \\begin{split} \\|A - A_1\\|_F^2 &= \\left\\| \\begin{bmatrix} \\frac{1}{2} & -\\frac{1}{2} & 2 \\\\ -\\frac{1}{2} & \\frac{1}{2} & -2 \\end{bmatrix}\\right\\|_F^2 \\\\ &= 4(\\frac{1}{2})^2 + 2(2^2) = 9. \\end{split} $$\n",
+ "> So the Frobenius distance between $A$ and $A_1$ is 3, and we know by Eckart-Young-Mirsky that this is the smallest we can get when looking at the difference between $A$ and a (at most) rank one $2 \\times 3$ matrix. As mentioned, the operator norm $\\|A - A_1\\|$ also minimizes the distance (in operator norm). We know this to be the largest singular value. As $A - A_1$ has SVD\n",
+ "> $$ \\begin{bmatrix} \\frac{1}{2} & -\\frac{1}{2} & 2 \\\\ -\\frac{1}{2} & \\frac{1}{2} & -2 \\end{bmatrix} = \\begin{bmatrix} -\\frac{1}{\\sqrt{2}} & \\frac{1}{\\sqrt{2}} \\\\ \\frac{1}{\\sqrt{2}} & \\frac{1}{\\sqrt{2}} \\end{bmatrix}\\begin{bmatrix} 3 & 0 & 0 \\\\ 0 & 0 & 0 \\end{bmatrix} \\begin{bmatrix} -\\frac{1}{3\\sqrt{2}} & -\\frac{4}{\\sqrt{17}} & \\frac{1}{3\\sqrt{34}} \\\\ \\frac{1}{3\\sqrt{2}} & 0 & \\frac{1}{3}\\sqrt{\\frac{17}{2}} \\\\ -\\frac{2\\sqrt{2}}{3} & \\frac{1}{\\sqrt{17}} & \\frac{2}{3}\\sqrt{\\frac{2}{17}} \\end{bmatrix}, $$\n",
+ "> the operator norm is also 3. \n",
+ "\n",
+ "Now let's do this in python. We'll set up our matrix as usual, take the SVD, do the truncated construction of $A_1$, and use `numpy.linalg.norm` to look at the norms. \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Create our matrix A\n",
+ "A = np.array([[3,2,2],[2,3,-2]])\n",
+ "\n",
+ "# Take the SVD\n",
+ "U, S, Vh = np.linalg.svd(A)\n",
+ "\n",
+ "# Create our rank-1 approximation\n",
+ "sigma1 = S[0]\n",
+ "u1 = U[:, [0]]\t\t#shape (2,2)\n",
+ "v1T = Vh[[0], :]\t\t#shape (3,3)\n",
+ "A1 = sigma1 * (u1 @ v1T)\n",
+ "\n",
+ "# Take norms and view errors\n",
+ "frobenius_error = np.linalg.norm(A - A1, ord=\"fro\")\t#Frobenius norm\n",
+ "operator_error = np.linalg.norm(A - A1, ord=2)\t\t#operator norm\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Let's see if we get what we expect.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "799ea5da",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sigma1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e17ad031",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "u1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b75d1b41",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "v1T"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cda3bc1a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "A1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5741dc92",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "frobenius_error"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b1171244",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "operator_error"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "So this numerically confirms the EYM theorem. \n",
+ "\n",
+ "## Centering data \n",
+ "In data science, we rarely apply low-rank approximation to raw values directly, because translation and units can dominate the geometry. Instead, we apply these methods to centered (and often standardized) data so that low-rank structure reflects relationships among features rather than the absolute location or measurement scale. Centering converts the problem from approximating an affine cloud to approximating a linear one, in direct analogy with including an intercept term in linear regression. Therefore, before we can analyze the variance structure, we must ensure our data is centered, i.e., that each feature has a mean of 0. We achieve this by subtracting the mean of each column from every entry in that column.\n",
+ "Suppose $X$ is our $n \\times p$ data matrix, and let\n",
+ "$$ \\mu = \\frac{1}{n}\\mathbb{1}^T X. $$\n",
+ "Then\n",
+ "$$ \\hat{X} = X - \\mu \\mathbb{1} $$\n",
+ "will be centered data matrix.\n",
+ "\n",
+ "> **Example**. Going back to our housing example, the means of the columns are 1770, 164, 3.2, and 547, respectively. So our centered matrix is\n",
+ "> $$ \\hat{X} = \\begin{bmatrix} -170 & -16 & -0.2 & -47 \\\\ 330 & 31 & 0.8 & 103 \\\\ -220 & -20 & -1.2 & -72 \\\\ -170 & -16 & -0.2 & -57 \\\\ 230 & 21 & 0.8 & 73 \\end{bmatrix}. $$\n",
+ "\n",
+ "Let's do this in python.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "# First let us make a dictionary incorporating our data.\n",
+ "# Each entry corresponds to a column (feature of our data)\n",
+ "data = {\n",
+ " 'Square ft': [1600, 2100, 1550, 1600, 2000],\n",
+ "\t'Square m': [148, 195, 144, 148, 185],\n",
+ " 'Bedrooms': [3, 4, 2, 3, 4],\n",
+ " 'Price': [500, 650, 475, 490, 620]\n",
+ "}\n",
+ "\n",
+ "# Create a pandas DataFrame\n",
+ "df = pd.DataFrame(data)\n",
+ "\n",
+ "# Create out matrix X\n",
+ "X = df.to_numpy()\n",
+ "\n",
+ "# Get our vector of means\n",
+ "X_means = np.mean(X, axis=0)\n",
+ "\n",
+ "# Create our centered matrix\n",
+ "X_centered = X - X_means\n",
+ "\n",
+ "# Get the SVD for X_centered\n",
+ "U, S, Vh = np.linalg.svd(X_centered)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "This returns the following.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4288abb2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_means"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "31c2ebf2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_centered"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "We will apply the low-rank approximations from the previous sections. First let's see what our SVD looks like, and what the condition number is.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d944d257",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(f\"U = {U}\\n\\nS = {S}\\n\\nVh.T = {Vh.T}\\n\")\n",
+ "print(\"Condition number of X_centered = \", np.linalg.cond(X_centered))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Now let's approximate our centered matrix $\\hat{X}$ by some lower-rank matrices. First, we'll define a function which will give us a rank $k$ truncated SVD. \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Defining the truncated svd\n",
+ "def reduced_svd_matrix_k(U, S, Vh, k):\n",
+ "\tUk = U[:, :k]\n",
+ "\tSk = np.diag(S[:k])\n",
+ "\tVhk = Vh[:k, :]\n",
+ "\treturn Uk @ Sk @ Vhk\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Now, as $\\hat{X}$ has rank 4, we can do a reduced matrix of rank 1,2,3. We will do this in a loop.\n",
+ "\n",
+ "> **Remark**. We'll divide the error by the (Frobenius) norm so that we have a relative error. E.g., if two houses are within 10k of each other, they are similarly priced. The magnitude of error being large doesn't say much if our quantities are large.\n",
+ "> \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for k in [1, 2, 3]:\n",
+ "\t# Define our reduced matrix\n",
+ " Xck = reduced_svd_matrix_k(U, S, Vh, k)\n",
+ "\t# Compute the relative error\n",
+ " rel_err = np.linalg.norm(X_centered - Xck, ord=\"fro\") / np.linalg.norm(X_centered, ord=\"fro\")\n",
+ "\t# Print the information\n",
+ " print(Xck, \"\\n\", f\"k={k}: relative Frobenius reconstruction error on centered data = {rel_err:.4f}\", \"\\n\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "This seems to check out -- it says that one rank (or one feature) should be roughly enough to describe this data. This should make sense because clearly the square meterage, # of bedrooms, and price depend on the square footage. \n",
+ "\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebooks/05_svd_image_denoising.ipynb b/notebooks/05_svd_image_denoising.ipynb
new file mode 100644
index 0000000..c9a2c93
--- /dev/null
+++ b/notebooks/05_svd_image_denoising.ipynb
@@ -0,0 +1,811 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "6ae6c7f8",
+ "metadata": {},
+ "source": [
+ "# Spectral Image Denoising via Truncated SVD\n",
+ "\n",
+ "This notebook extracts the image denoising project into a standalone workflow and extends it from **grayscale images** to **actual color images**.\n",
+ "\n",
+ "The core idea is the same as in the original write-up: if an image matrix has singular value decomposition\n",
+ "$$\n",
+ "A = U \\Sigma V^T,\n",
+ "$$\n",
+ "then the best rank-$k$ approximation to $A$ in Frobenius norm is obtained by truncating the SVD. This is the **EckartāYoungāMirsky theorem**.\n",
+ "\n",
+ "For a grayscale image, the image is a single matrix. For an RGB image, we treat the image as **three matrices**, one for each channel, and apply truncated SVD to each channel separately."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "31a665c9",
+ "metadata": {},
+ "source": [
+ "## Outline\n",
+ "\n",
+ "1. Load an image from disk\n",
+ "2. Convert it to grayscale or keep it in RGB\n",
+ "3. Add synthetic Gaussian noise\n",
+ "4. Compute a truncated SVD reconstruction\n",
+ "5. Compare the original, noisy, and denoised images\n",
+ "6. Measure quality using MSE and PSNR\n",
+ "\n",
+ "This notebook is written so that you can use **your own image files** directly."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "88584c56",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "from PIL import Image\n",
+ "from pathlib import Path\n",
+ "\n",
+ "try:\n",
+ " from skimage.metrics import structural_similarity as ssim\n",
+ " HAS_SKIMAGE = True\n",
+ "except ImportError:\n",
+ " ssim = None\n",
+ " HAS_SKIMAGE = False\n",
+ "\n",
+ "print(f\"scikit-image available: {HAS_SKIMAGE}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "30e96441",
+ "metadata": {},
+ "source": [
+ "## A note on color images\n",
+ "\n",
+ "For a grayscale image, SVD applies directly to a single matrix. For a color image $A \\in \\mathbb{R}^{n \\times p \\times 3}$, we write\n",
+ "$$\n",
+ "A = (A_R, A_G, A_B),\n",
+ "$$\n",
+ "where each channel is an $n \\times p$ matrix. We then compute a rank-$k$ approximation for each channel:\n",
+ "$$\n",
+ "A_R \\approx (A_R)_k,\\qquad\n",
+ "A_G \\approx (A_G)_k,\\qquad\n",
+ "A_B \\approx (A_B)_k,\n",
+ "$$\n",
+ "and stack them back together.\n",
+ "\n",
+ "This is the most direct extension of the grayscale method, and it works well as a first linear-algebraic treatment of color denoising."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f275cbc9",
+ "metadata": {},
+ "source": [
+ "## Helper functions\n",
+ "\n",
+ "We begin with some utilities for:\n",
+ "- loading images,\n",
+ "- adding Gaussian noise,\n",
+ "- reconstructing rank-$k$ approximations,\n",
+ "- computing image-quality metrics."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "21adfcaf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def load_image(path, mode=\"rgb\"):\n",
+ " \"\"\"\n",
+ " Load an image from disk.\n",
+ "\n",
+ " Parameters\n",
+ " ----------\n",
+ " path : str or Path\n",
+ " Path to the image file.\n",
+ " mode : {\"rgb\", \"gray\"}\n",
+ " Whether to load the image as RGB or grayscale.\n",
+ "\n",
+ " Returns\n",
+ " -------\n",
+ " np.ndarray\n",
+ " Float image array scaled to [0, 255].\n",
+ " Shape is (H, W, 3) for RGB and (H, W) for grayscale.\n",
+ " \"\"\"\n",
+ " path = Path(path)\n",
+ " if not path.exists():\n",
+ " raise FileNotFoundError(f\"Could not find image file: {path}\")\n",
+ "\n",
+ " img = Image.open(path)\n",
+ "\n",
+ " if mode.lower() in {\"gray\", \"grayscale\", \"l\"}:\n",
+ " img = img.convert(\"L\")\n",
+ " else:\n",
+ " img = img.convert(\"RGB\")\n",
+ "\n",
+ " return np.asarray(img, dtype=np.float64)\n",
+ "\n",
+ "\n",
+ "def show_image(img, title=None):\n",
+ " \"\"\"Display a grayscale or RGB image.\"\"\"\n",
+ " plt.figure(figsize=(6, 6))\n",
+ " if img.ndim == 2:\n",
+ " plt.imshow(np.clip(img, 0, 255), cmap=\"gray\", vmin=0, vmax=255)\n",
+ " else:\n",
+ " plt.imshow(np.clip(img, 0, 255).astype(np.uint8))\n",
+ " if title is not None:\n",
+ " plt.title(title)\n",
+ " plt.axis(\"off\")\n",
+ " plt.tight_layout()\n",
+ " plt.show()\n",
+ "\n",
+ "\n",
+ "def add_gaussian_noise(img, sigma=25, seed=0):\n",
+ " \"\"\"\n",
+ " Add Gaussian noise to an image.\n",
+ "\n",
+ " Parameters\n",
+ " ----------\n",
+ " img : np.ndarray\n",
+ " Image array in [0, 255].\n",
+ " sigma : float\n",
+ " Standard deviation of the noise.\n",
+ " seed : int\n",
+ " Random seed for reproducibility.\n",
+ "\n",
+ " Returns\n",
+ " -------\n",
+ " np.ndarray\n",
+ " Noisy image clipped to [0, 255].\n",
+ " \"\"\"\n",
+ " rng = np.random.default_rng(seed)\n",
+ " noisy = img + rng.normal(loc=0.0, scale=sigma, size=img.shape)\n",
+ " return np.clip(noisy, 0, 255)\n",
+ "\n",
+ "\n",
+ "def truncated_svd_matrix(A, k):\n",
+ " \"\"\"\n",
+ " Return the rank-k truncated SVD approximation of a 2D matrix A.\n",
+ " \"\"\"\n",
+ " U, s, Vt = np.linalg.svd(A, full_matrices=False)\n",
+ " k = min(k, len(s))\n",
+ " return (U[:, :k] * s[:k]) @ Vt[:k, :]\n",
+ "\n",
+ "\n",
+ "def truncated_svd_image(img, k):\n",
+ " \"\"\"\n",
+ " Apply truncated SVD to a grayscale or RGB image.\n",
+ "\n",
+ " For RGB images, truncated SVD is applied channel-by-channel.\n",
+ " \"\"\"\n",
+ " if img.ndim == 2:\n",
+ " recon = truncated_svd_matrix(img, k)\n",
+ " return np.clip(recon, 0, 255)\n",
+ "\n",
+ " if img.ndim == 3:\n",
+ " channels = []\n",
+ " for c in range(img.shape[2]):\n",
+ " channel_recon = truncated_svd_matrix(img[:, :, c], k)\n",
+ " channels.append(channel_recon)\n",
+ " recon = np.stack(channels, axis=2)\n",
+ " return np.clip(recon, 0, 255)\n",
+ "\n",
+ " raise ValueError(\"Image must be either 2D (grayscale) or 3D (RGB).\")\n",
+ "\n",
+ "\n",
+ "def mse(A, B):\n",
+ " \"\"\"Mean squared error between two images.\"\"\"\n",
+ " return np.mean((A.astype(np.float64) - B.astype(np.float64)) ** 2)\n",
+ "\n",
+ "\n",
+ "def psnr(A, B, max_val=255.0):\n",
+ " \"\"\"Peak signal-to-noise ratio in decibels.\"\"\"\n",
+ " err = mse(A, B)\n",
+ " if err == 0:\n",
+ " return np.inf\n",
+ " return 10 * np.log10((max_val ** 2) / err)\n",
+ "\n",
+ "\n",
+ "def image_ssim(A, B, max_val=255.0):\n",
+ " \"\"\"\n",
+ " Structural similarity index.\n",
+ "\n",
+ " For RGB images, compute SSIM channel-by-channel and average.\n",
+ " Returns None when scikit-image is unavailable.\n",
+ " \"\"\"\n",
+ " if not HAS_SKIMAGE:\n",
+ " return None\n",
+ "\n",
+ " A = A.astype(np.float64)\n",
+ " B = B.astype(np.float64)\n",
+ "\n",
+ " if A.ndim == 2:\n",
+ " return float(ssim(A, B, data_range=max_val))\n",
+ "\n",
+ " if A.ndim == 3:\n",
+ " vals = [ssim(A[:, :, c], B[:, :, c], data_range=max_val) for c in range(A.shape[2])]\n",
+ " return float(np.mean(vals))\n",
+ "\n",
+ " raise ValueError(\"Images must be either 2D (grayscale) or 3D (RGB).\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fe1d4932",
+ "metadata": {},
+ "source": [
+ "## Choose an image"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "42bafca5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pathlib import Path\n",
+ "\n",
+ "MODE = \"rgb\" # use \"gray\" for grayscale, \"rgb\" for color\n",
+ "\n",
+ "candidate_paths = [\n",
+ " Path(\"../images/bella.jpg\"),\n",
+ " Path(\"images/bella.jpg\"),\n",
+ " Path(\"bella.jpg\"),\n",
+ "]\n",
+ "\n",
+ "IMAGE_PATH = None\n",
+ "for p in candidate_paths:\n",
+ " if p.exists():\n",
+ " IMAGE_PATH = p\n",
+ " break\n",
+ "\n",
+ "if IMAGE_PATH is None:\n",
+ " raise FileNotFoundError(\n",
+ " \"Could not find bella.jpg. Put it in ../images/, images/, or the notebook folder.\"\n",
+ " )\n",
+ "\n",
+ "img = load_image(IMAGE_PATH, mode=MODE)\n",
+ "print(\"Using image:\", IMAGE_PATH)\n",
+ "print(\"Image shape:\", img.shape)\n",
+ "show_image(img, title=f\"Original image ({MODE})\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "05e52222",
+ "metadata": {},
+ "source": [
+ "## Add synthetic Gaussian noise\n",
+ "\n",
+ "We add noise so that the denoising effect is visible and measurable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "528e69b3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sigma = 40\n",
+ "seed = 0\n",
+ "\n",
+ "img_noisy = add_gaussian_noise(img, sigma=sigma, seed=seed)\n",
+ "\n",
+ "noisy_output_path = IMAGE_PATH.with_name(f\"{IMAGE_PATH.stem}_noisy.png\")\n",
+ "Image.fromarray(np.clip(img_noisy, 0, 255).astype(np.uint8)).save(noisy_output_path)\n",
+ "print(\"Saved noisy image to:\", noisy_output_path)\n",
+ "\n",
+ "show_image(img_noisy, title=f\"Noisy image (sigma={sigma})\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1bbcc1d8",
+ "metadata": {},
+ "source": [
+ "## Visualizing rank-$k$ reconstructions\n",
+ "\n",
+ "For small $k$, the reconstruction captures only coarse structure.\n",
+ "As $k$ increases, more detail returns. For denoising, there is often a useful middle ground:\n",
+ "enough singular values to preserve structure, but not so many that we reintroduce noise."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "563df53a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "import math\n",
+ "\n",
+ "ks = [5, 20, 50, 100]\n",
+ "\n",
+ "# Collect all images + titles\n",
+ "images = []\n",
+ "titles = []\n",
+ "\n",
+ "# Original\n",
+ "images.append(img)\n",
+ "titles.append(\"Original\")\n",
+ "\n",
+ "# Noisy\n",
+ "images.append(img_noisy)\n",
+ "titles.append(\"Noisy\")\n",
+ "\n",
+ "# Reconstructions\n",
+ "for k in ks:\n",
+ " recon = truncated_svd_image(img_noisy, k)\n",
+ " images.append(recon)\n",
+ " titles.append(f\"k = {k}\")\n",
+ "\n",
+ "# Grid setup\n",
+ "ncols = 2\n",
+ "nrows = math.ceil(len(images) / ncols)\n",
+ "\n",
+ "fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 4 * nrows))\n",
+ "axes = axes.flatten() # easier indexing\n",
+ "\n",
+ "# Plot everything\n",
+ "for i, (ax, im, title) in enumerate(zip(axes, images, titles)):\n",
+ " if im.ndim == 2:\n",
+ " ax.imshow(im, cmap=\"gray\", vmin=0, vmax=255)\n",
+ " else:\n",
+ " ax.imshow(np.clip(im, 0, 255).astype(np.uint8))\n",
+ " \n",
+ " ax.set_title(title)\n",
+ " ax.axis(\"off\")\n",
+ "\n",
+ "# Hide any unused axes\n",
+ "for j in range(len(images), len(axes)):\n",
+ " axes[j].axis(\"off\")\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "comparison_output_path = IMAGE_PATH.with_name(f\"{IMAGE_PATH.stem}_truncated_svd_multiple_ks.png\")\n",
+ "print(\"Saved comparison figure to:\", comparison_output_path)\n",
+ "plt.savefig(comparison_output_path, bbox_inches=\"tight\")\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "309579fa",
+ "metadata": {},
+ "source": [
+ "## Quantitative evaluation\n",
+ "\n",
+ "We compare each reconstruction against the **clean original image**, not against the noisy one.\n",
+ "A good denoising rank should typically:\n",
+ "- reduce MSE relative to the noisy image,\n",
+ "- increase PSNR relative to the noisy image."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "56ce07ee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "baseline_mse = mse(img, img_noisy)\n",
+ "baseline_psnr = psnr(img, img_noisy)\n",
+ "\n",
+ "print(f\"Noisy image baseline -> MSE: {baseline_mse:.2f}, PSNR: {baseline_psnr:.2f} dB\")\n",
+ "\n",
+ "results = []\n",
+ "for k in ks:\n",
+ " recon = truncated_svd_image(img_noisy, k)\n",
+ " results.append((k, mse(img, recon), psnr(img, recon)))\n",
+ "\n",
+ "print(\"\\nRank-k reconstructions:\")\n",
+ "for k, m, p in results:\n",
+ " print(f\"k = {k:3d} | MSE = {m:10.2f} | PSNR = {p:6.2f} dB\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f2fe6fe2",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Efficient search over many values of $k$\n",
+ "\n",
+ "A naive implementation would recompute the SVD from scratch for every candidate value of $k$.\n",
+ "That is extremely expensive: every reconstruction would require a fresh factorization of each\n",
+ "channel of the noisy image.\n",
+ "\n",
+ "A much better approach is:\n",
+ "\n",
+ "1. compute the SVD **once** for each channel;\n",
+ "2. reuse those factors for every candidate $k$;\n",
+ "3. compare reconstructions using MSE, PSNR, and optionally SSIM.\n",
+ "\n",
+ "This is also a nice numerical linear algebra point: all rank-$k$ truncated reconstructions come\n",
+ "from the **same** singular value decomposition.\n",
+ "\n",
+ "We compare two variants:\n",
+ "\n",
+ "- **plain truncated SVD**, applied directly to each channel;\n",
+ "- **centered truncated SVD**, where we subtract each channel's column mean before factorizing and\n",
+ " add it back after reconstruction.\n",
+ "\n",
+ "The centered version sometimes improves reconstruction slightly because the low-rank approximation\n",
+ "spends less effort representing the mean structure.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4277a913",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "def precompute_svd_image(img):\n",
+ " \"\"\"Precompute plain SVD factors for each channel.\"\"\"\n",
+ " if img.ndim == 2:\n",
+ " A = img.astype(np.float64)\n",
+ " U, s, Vt = np.linalg.svd(A, full_matrices=False)\n",
+ " return [(U, s, Vt)]\n",
+ "\n",
+ " cache = []\n",
+ " for c in range(img.shape[2]):\n",
+ " A = img[:, :, c].astype(np.float64)\n",
+ " U, s, Vt = np.linalg.svd(A, full_matrices=False)\n",
+ " cache.append((U, s, Vt))\n",
+ " return cache\n",
+ "\n",
+ "\n",
+ "def precompute_centered_svd_image(img):\n",
+ " \"\"\"Precompute centered SVD factors for each channel.\"\"\"\n",
+ " if img.ndim == 2:\n",
+ " A = img.astype(np.float64)\n",
+ " col_mean = A.mean(axis=0, keepdims=True)\n",
+ " A_centered = A - col_mean\n",
+ " U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)\n",
+ " return [(U, s, Vt, col_mean)]\n",
+ "\n",
+ " cache = []\n",
+ " for c in range(img.shape[2]):\n",
+ " A = img[:, :, c].astype(np.float64)\n",
+ " col_mean = A.mean(axis=0, keepdims=True)\n",
+ " A_centered = A - col_mean\n",
+ " U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)\n",
+ " cache.append((U, s, Vt, col_mean))\n",
+ " return cache\n",
+ "\n",
+ "\n",
+ "def reconstruct_from_svd_cache(cache, k):\n",
+ " \"\"\"Reconstruct from precomputed plain SVD factors.\"\"\"\n",
+ " channels = []\n",
+ " for U, s, Vt in cache:\n",
+ " kk = min(k, len(s))\n",
+ " recon = (U[:, :kk] * s[:kk]) @ Vt[:kk, :]\n",
+ " channels.append(np.clip(recon, 0, 255))\n",
+ "\n",
+ " if len(channels) == 1:\n",
+ " return channels[0]\n",
+ " return np.stack(channels, axis=2)\n",
+ "\n",
+ "\n",
+ "def reconstruct_from_centered_svd_cache(cache, k):\n",
+ " \"\"\"Reconstruct from precomputed centered SVD factors.\"\"\"\n",
+ " channels = []\n",
+ " for U, s, Vt, col_mean in cache:\n",
+ " kk = min(k, len(s))\n",
+ " recon = (U[:, :kk] * s[:kk]) @ Vt[:kk, :] + col_mean\n",
+ " channels.append(np.clip(recon, 0, 255))\n",
+ "\n",
+ " if len(channels) == 1:\n",
+ " return channels[0]\n",
+ " return np.stack(channels, axis=2)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d1dacbb",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Scoring reconstructions\n",
+ "\n",
+ "We first compute a **baseline** by comparing the noisy image to the clean one. Then we score\n",
+ "rank-$k$ reconstructions. A smaller MSE and a larger PSNR indicate better fidelity to the clean\n",
+ "image. If `scikit-image` is available, we also compute SSIM.\n",
+ "\n",
+ "A useful conceptual warning is important here:\n",
+ "\n",
+ "> The best low-rank approximation in a matrix norm does **not** necessarily produce the image that\n",
+ "> looks best to a human observer.\n",
+ "\n",
+ "Why? Because human perception cares about things like edges, texture, and local contrast, while\n",
+ "MSE and PSNR are purely pixelwise. A reconstruction can score well numerically and still look too\n",
+ "smooth, too blurry, or otherwise unnatural.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c48c94cc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "baseline_mse = mse(img, img_noisy)\n",
+ "baseline_psnr = psnr(img, img_noisy)\n",
+ "\n",
+ "print(f\"Baseline noisy vs clean:\")\n",
+ "print(f\" MSE : {baseline_mse:.2f}\")\n",
+ "print(f\" PSNR: {baseline_psnr:.2f}\")\n",
+ "\n",
+ "if HAS_SKIMAGE:\n",
+ " baseline_ssim = image_ssim(img, img_noisy)\n",
+ " print(f\" SSIM: {baseline_ssim:.4f}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "231330d1",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Automatic search over many values of $k$\n",
+ "\n",
+ "Because all rank-$k$ reconstructions come from the same SVD, we precompute the factorizations\n",
+ "once and then search efficiently over candidate values of $k$.\n",
+ "\n",
+ "For very large images this can still be somewhat expensive, so for exploratory work a coarser grid\n",
+ "such as `range(5, 151, 5)` is often sufficient. Once a promising region is found, one can refine\n",
+ "the search around that region.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8bae249d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "candidate_ks = list(range(1, 151, 5))\n",
+ "\n",
+ "plain_cache = precompute_svd_image(img_noisy)\n",
+ "centered_cache = precompute_centered_svd_image(img_noisy)\n",
+ "\n",
+ "plain_scores = []\n",
+ "centered_scores = []\n",
+ "\n",
+ "for k in candidate_ks:\n",
+ " plain = reconstruct_from_svd_cache(plain_cache, k)\n",
+ " centered = reconstruct_from_centered_svd_cache(centered_cache, k)\n",
+ "\n",
+ " plain_row = (k, mse(img, plain), psnr(img, plain))\n",
+ " centered_row = (k, mse(img, centered), psnr(img, centered))\n",
+ "\n",
+ " if HAS_SKIMAGE:\n",
+ " plain_row = plain_row + (image_ssim(img, plain),)\n",
+ " centered_row = centered_row + (image_ssim(img, centered),)\n",
+ "\n",
+ " plain_scores.append(plain_row)\n",
+ " centered_scores.append(centered_row)\n",
+ "\n",
+ "best_plain_by_mse = min(plain_scores, key=lambda x: x[1])\n",
+ "best_plain_by_psnr = max(plain_scores, key=lambda x: x[2])\n",
+ "best_centered_by_mse = min(centered_scores, key=lambda x: x[1])\n",
+ "best_centered_by_psnr = max(centered_scores, key=lambda x: x[2])\n",
+ "\n",
+ "print(\"Plain SVD:\")\n",
+ "print(\" Best by MSE :\", best_plain_by_mse)\n",
+ "print(\" Best by PSNR:\", best_plain_by_psnr)\n",
+ "\n",
+ "print(\"Centered SVD:\")\n",
+ "print(\" Best by MSE :\", best_centered_by_mse)\n",
+ "print(\" Best by PSNR:\", best_centered_by_psnr)\n",
+ "\n",
+ "if HAS_SKIMAGE:\n",
+ " best_plain_by_ssim = max(plain_scores, key=lambda x: x[3])\n",
+ " best_centered_by_ssim = max(centered_scores, key=lambda x: x[3])\n",
+ "\n",
+ " print(\"Plain SVD:\")\n",
+ " print(\" Best by SSIM:\", best_plain_by_ssim)\n",
+ " print(\"Centered SVD:\")\n",
+ " print(\" Best by SSIM:\", best_centered_by_ssim)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "166d0877",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Metric curves versus $k$\n",
+ "\n",
+ "Plotting the metrics as functions of $k$ is often more informative than looking only at the\n",
+ "single best value. Frequently the metric is nearly flat across a whole range of ranks, in which\n",
+ "case several nearby values of $k$ have very similar numerical performance.\n",
+ "\n",
+ "That is exactly the situation where visual inspection matters most: among a cluster of nearly tied\n",
+ "candidates, the one that looks nicest to the eye may not be the exact numerical winner.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0e1000de",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "plain_ks = [row[0] for row in plain_scores]\n",
+ "plain_mses = [row[1] for row in plain_scores]\n",
+ "plain_psnrs = [row[2] for row in plain_scores]\n",
+ "\n",
+ "centered_ks = [row[0] for row in centered_scores]\n",
+ "centered_mses = [row[1] for row in centered_scores]\n",
+ "centered_psnrs = [row[2] for row in centered_scores]\n",
+ "\n",
+ "plt.figure(figsize=(8, 4))\n",
+ "plt.plot(plain_ks, plain_mses, label=\"Plain SVD\")\n",
+ "plt.plot(centered_ks, centered_mses, label=\"Centered SVD\")\n",
+ "plt.xlabel(\"k\")\n",
+ "plt.ylabel(\"MSE\")\n",
+ "plt.title(\"MSE versus rank k\")\n",
+ "plt.legend()\n",
+ "plt.tight_layout()\n",
+ "plt.show()\n",
+ "\n",
+ "plt.figure(figsize=(8, 4))\n",
+ "plt.plot(plain_ks, plain_psnrs, label=\"Plain SVD\")\n",
+ "plt.plot(centered_ks, centered_psnrs, label=\"Centered SVD\")\n",
+ "plt.xlabel(\"k\")\n",
+ "plt.ylabel(\"PSNR\")\n",
+ "plt.title(\"PSNR versus rank k\")\n",
+ "plt.legend()\n",
+ "plt.tight_layout()\n",
+ "plt.show()\n",
+ "\n",
+ "if HAS_SKIMAGE:\n",
+ " plain_ssims = [row[3] for row in plain_scores]\n",
+ " centered_ssims = [row[3] for row in centered_scores]\n",
+ "\n",
+ " plt.figure(figsize=(8, 4))\n",
+ " plt.plot(plain_ks, plain_ssims, label=\"Plain SVD\")\n",
+ " plt.plot(centered_ks, centered_ssims, label=\"Centered SVD\")\n",
+ " plt.xlabel(\"k\")\n",
+ " plt.ylabel(\"SSIM\")\n",
+ " plt.title(\"SSIM versus rank k\")\n",
+ " plt.legend()\n",
+ " plt.tight_layout()\n",
+ " plt.show()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d008c548",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Visual comparison near the best ranks\n",
+ "\n",
+ "Finally, we inspect a few reconstructions around the automatically selected ranks. This is important\n",
+ "because the reconstruction that is optimal in Frobenius norm, MSE, or PSNR is not guaranteed to be\n",
+ "the reconstruction a human would actually prefer.\n",
+ "\n",
+ "Low-rank approximation is mathematically optimal for a precise matrix objective, but photographic\n",
+ "quality is influenced by far more than that. Fine textures, fur, sharp edges, and local contrast\n",
+ "can all matter a great deal perceptually, and some of those are exactly the kinds of features that\n",
+ "get smoothed away by aggressive truncation.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cb0c7d57",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "# Pick a few candidate ranks around the PSNR-optimal values\n",
+ "plain_best_k = best_plain_by_psnr[0]\n",
+ "centered_best_k = best_centered_by_psnr[0]\n",
+ "\n",
+ "plain_inspect_ks = sorted(set(k for k in [plain_best_k - 10, plain_best_k - 5, plain_best_k, plain_best_k + 5, plain_best_k + 10] if k >= 1))\n",
+ "centered_inspect_ks = sorted(set(k for k in [centered_best_k - 10, centered_best_k - 5, centered_best_k, centered_best_k + 5, centered_best_k + 10] if k >= 1))\n",
+ "\n",
+ "print(\"Plain SVD ranks to inspect :\", plain_inspect_ks)\n",
+ "print(\"Centered SVD ranks to inspect:\", centered_inspect_ks)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "aab1aff3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "import math\n",
+ "\n",
+ "# Build a gallery: original, noisy, then several plain and centered reconstructions\n",
+ "gallery_images = [img, img_noisy]\n",
+ "gallery_titles = [\"Original\", f\"Noisy (sigma={sigma})\"]\n",
+ "\n",
+ "for k in plain_inspect_ks:\n",
+ " gallery_images.append(reconstruct_from_svd_cache(plain_cache, k))\n",
+ " gallery_titles.append(f\"Plain SVD, k={k}\")\n",
+ "\n",
+ "for k in centered_inspect_ks:\n",
+ " gallery_images.append(reconstruct_from_centered_svd_cache(centered_cache, k))\n",
+ " gallery_titles.append(f\"Centered SVD, k={k}\")\n",
+ "\n",
+ "ncols = 2\n",
+ "nrows = math.ceil(len(gallery_images) / ncols)\n",
+ "\n",
+ "fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 4 * nrows))\n",
+ "axes = np.array(axes).reshape(-1)\n",
+ "\n",
+ "for ax, im, title in zip(axes, gallery_images, gallery_titles):\n",
+ " if im.ndim == 2:\n",
+ " ax.imshow(im, cmap=\"gray\", vmin=0, vmax=255)\n",
+ " else:\n",
+ " ax.imshow(np.clip(im, 0, 255).astype(np.uint8))\n",
+ " ax.set_title(title)\n",
+ " ax.axis(\"off\")\n",
+ "\n",
+ "for ax in axes[len(gallery_images):]:\n",
+ " ax.axis(\"off\")\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.show()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2433f279",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Remarks and possible extensions\n",
+ "\n",
+ "- Truncated SVD provides the best rank-$k$ approximation in Frobenius norm, but that does **not**\n",
+ " automatically mean it gives the most visually pleasing denoised image.\n",
+ "- For real photographs, low-rank methods often smooth away texture and local detail along with the\n",
+ " noise.\n",
+ "- The visually best image may lie near the metric optimum rather than exactly at it.\n",
+ "- One can compare this method with more perceptual denoisers such as wavelet methods, bilateral\n",
+ " filtering, non-local means, or modern learned denoisers.\n",
+ "- A useful next step would be to compare how the preferred $k$ changes as the noise level\n",
+ " $\\sigma$ increases.\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebooks/06_modelling_101.ipynb b/notebooks/06_modelling_101.ipynb
new file mode 100644
index 0000000..bd6f780
--- /dev/null
+++ b/notebooks/06_modelling_101.ipynb
@@ -0,0 +1,1642 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "4f44b2e6",
+ "metadata": {},
+ "source": [
+ "# Modelling 101: Train/Test Splits & Beyond Linear Regression\n",
+ "\n",
+ "## Introduction\n",
+ "\n",
+ "So far we have seen how linear regression (ordinary least squares) solves $\\tilde{X}\\tilde{\\beta} = y$ by minimizing $\\|y - \\tilde{X}\\tilde{\\beta}\\|_2^2$. This is a powerful tool, but real data often breaks the assumptions that make linear regression the best choice. We address several of the points made in [notebook 03](03_what_goes_wrong.ipynb).\n",
+ "\n",
+ "> **Why linear regression might not cut it:**\n",
+ "> - **Nonlinear relationships** ā The true dependency may be curved, periodic, or otherwise not linear.\n",
+ "> - **High dimensionality** ā When the number of features $p$ is close to or larger than the number of observations $n$, the matrix $\\tilde{X}^T\\tilde{X}$ becomes singular or nearly singular.\n",
+ "> - **Multicollinearity** ā Features are correlated, leading to large condition numbers and unstable coefficients.\n",
+ "> - **Overfitting** ā A complex model fits noise instead of signal, especially when $p$ is large.\n",
+ "> - **Outliers** ā The $L^2$ norm magnifies large errors, pulling the fit away from the bulk of the data.\n",
+ "\n",
+ "In this notebook we will:\n",
+ "- Work with a real, moderately sized dataset.\n",
+ "- Learn how to properly split data into training, validation, and test sets.\n",
+ "- Apply linear and polynomial regression, then diagnose their limitations.\n",
+ "- Introduce **regularisation** methods (Ridge and Lasso) from a linear algebra perspective.\n",
+ "- Explore **gradient descent** as a numerical optimisation alternative to the normal equations.\n",
+ "- Look at **decision trees and random forests** ā nonlinear models that can capture complex interactions without feature engineering.\n",
+ "- Cover **logistic regression** for classification.\n",
+ "- Discuss **feature scaling**, **crossāvalidation**, **model interpretation**, and **hyperparameter tuning**.\n",
+ "\n",
+ "The goal is to equip the linear algebraist with practical modelling tools while maintaining a geometric / algebraic intuition.\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "40f2a9ea",
+ "metadata": {},
+ "source": [
+ "## A Real Dataset: California Housing\n",
+ "\n",
+ "A natural next step from our toy housing example is the **California housing dataset** from `sklearn.datasets`. It contains 20,640 observations of 8 features (median income, house age, average rooms, etc.) and the target is the median house value for blocks in California. This dataset is large enough to illustrate interesting effects but small enough to run quickly.\n",
+ "\n",
+ "> **Linear algebra view**: Each observation is a row vector $x_i \\in \\mathbb{R}^8$. The features form the columns of the design matrix $X \\in \\mathbb{R}^{20640 \\times 8}$. We will add an intercept column $\\mathbb{1}$ to obtain $\\tilde{X} \\in \\mathbb{R}^{20640 \\times 9}$."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ecbbc640",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "from sklearn.datasets import fetch_california_housing\n",
+ "\n",
+ "# Load the data\n",
+ "housing = fetch_california_housing()\n",
+ "X = housing.data # shape (20640, 8)\n",
+ "y = housing.target # shape (20640,)\n",
+ "feature_names = housing.feature_names\n",
+ "\n",
+ "# Convert to DataFrame for convenience\n",
+ "df = pd.DataFrame(X, columns=feature_names)\n",
+ "df['MedHouseVal'] = y\n",
+ "\n",
+ "print(f\"Data shape: {df.shape}\")\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f60af719",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Basic statistics\n",
+ "df.describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3f3e7ab3",
+ "metadata": {},
+ "source": [
+ "Let's see the relationships between these features and the price."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3459bcb7",
+ "metadata": {},
+ "source": [
+ "## Train / Test Split (and Validation)\n",
+ "\n",
+ "When it comes to real world modelling, we must split our data into training and tests sets.\n",
+ "\n",
+ "> **Why split?** If we evaluate a model on the same data we used to train it, we get an overly optimistic estimate of performance. The model may have memorised the training set (overfitting). Splitting mimics a realāworld scenario: we test on unseen data.\n",
+ "\n",
+ "A common workflow:\n",
+ "1. **Training set** (e.g., 60ā80%): used to fit the model parameters.\n",
+ "2. **Validation set** (e.g., 10ā20%): used to tune hyperparameters (e.g., degree of polynomial, regularisation strength).\n",
+ "3. **Test set** (e.g., 10ā20%): used only once at the end to report final performance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f998bdb3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Illustrate the three-way split\n",
+ "fig, ax = plt.subplots(figsize=(12, 3))\n",
+ "\n",
+ "# Create rectangles for each split\n",
+ "ax.barh(0, 60, left=0, height=0.5, color='blue', alpha=0.7, label='Training (60%)')\n",
+ "ax.barh(0, 20, left=60, height=0.5, color='orange', alpha=0.7, label='Validation (20%)')\n",
+ "ax.barh(0, 20, left=80, height=0.5, color='red', alpha=0.7, label='Test (20%)')\n",
+ "\n",
+ "# Add labels\n",
+ "ax.text(30, 0, 'Train Model\\nParameters', ha='center', va='center', fontsize=10, fontweight='bold')\n",
+ "ax.text(70, 0, 'Tune\\nHyperparams', ha='center', va='center', fontsize=10, fontweight='bold')\n",
+ "ax.text(90, 0, 'Final\\nEvaluation', ha='center', va='center', fontsize=10, fontweight='bold')\n",
+ "\n",
+ "ax.set_xlim(0, 100)\n",
+ "ax.set_ylim(-0.5, 0.5)\n",
+ "ax.set_xlabel('Percentage of Data')\n",
+ "ax.set_yticks([])\n",
+ "ax.set_title('Train/Validation/Test Split')\n",
+ "ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=3)\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.savefig('../images/train_validation_test_split.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "022d6da5",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Let us first fix a random state."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "79e9ce46",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "RANDOM_STATE=3"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3da87dd2",
+ "metadata": {},
+ "source": [
+ "Let's visualize this."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "27eece10",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "# Generate synthetic data to illustrate the concept\n",
+ "np.random.seed(3)\n",
+ "n = 50\n",
+ "X = np.random.uniform(-5,5,n) # synthetic, wider range\n",
+ "\n",
+ "# True relationship\n",
+ "a_true = 2.0\n",
+ "c_true = 5.0\n",
+ "noise = np.random.normal(0,3,n)\n",
+ "\n",
+ "y = a_true * X**2 + c_true + noise\n",
+ "\n",
+ "# Perform train/test split\n",
+ "X_train, X_test, y_train, y_test = train_test_split(\n",
+ " X,y, test_size=0.3, random_state=3\n",
+ ")\n",
+ "\n",
+ "# Sort for plotting\n",
+ "X_curve = np.linspace(X.min(), X. max())\n",
+ "y_true = a_true * X_curve**2 + c_true\n",
+ "\n",
+ "\n",
+ "\n",
+ "# Plot\n",
+ "fig, ax = plt.subplots(figsize=(10,6))\n",
+ "ax.scatter(X_train, y_train, color='blue', s=50, label='Training data', zorder=3)\n",
+ "ax.scatter(X_test, y_test, color='red', s=50, label='Test data', zorder=3)\n",
+ "ax.plot(X_curve, y_true, linewidth=2, label='True relationship', alpha=0.7)\n",
+ "\n",
+ "ax.set_xlabel('X')\n",
+ "ax.set_ylabel('y')\n",
+ "ax.set_title('Train/Test Split')\n",
+ "ax.legend()\n",
+ "ax.grid(True, alpha=0.3)\n",
+ "plt.tight_layout()\n",
+ "plt.savefig('../images/train_test_split_illustration.png')\n",
+ "plt.show()\n",
+ "\n",
+ "print(f\"Total samples: {n}\")\n",
+ "print(f\"Training samples: {len(X_train)} ({len(X_train)/n*100:.0f}%)\")\n",
+ "print(f\"Test samples: {len(X_test)} ({len(X_test)/n*100:.0f}%)\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0913400b",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Back to the housing data. We will use `sklearn.model_selection.train_test_split` to create two splits (train+validation vs. test), then further split the train+validation part if needed. For simplicity we will first do a single train/test split and use crossāvalidation later."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "70a2ca47",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "# Separate features and target\n",
+ "X = df[feature_names].values\n",
+ "y = df['MedHouseVal'].values\n",
+ "\n",
+ "# Split: 80% train, 20% test\n",
+ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)\n",
+ "\n",
+ "print(f\"Training set size: {X_train.shape[0]}\")\n",
+ "print(f\"Test set size: {X_test.shape[0]}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "05795576",
+ "metadata": {},
+ "source": [
+ "With that, let's see the relationship between the features and the price."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "195267d8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Visualize relationships between features and target\n",
+ "fig, axes = plt.subplots(2, 4, figsize=(16, 8))\n",
+ "axes = axes.flatten()\n",
+ "\n",
+ "for i, (name, ax) in enumerate(zip(feature_names, axes)):\n",
+ " ax.scatter(X_train[:, i], y_train, alpha=0.1, s=1)\n",
+ " ax.set_xlabel(name)\n",
+ " ax.set_ylabel('MedHouseVal')\n",
+ " ax.set_title(f'{name} vs Price')\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.savefig('../images/california_housing_scatter.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "edfe258e",
+ "metadata": {},
+ "source": [
+ "## Linear Regression in Practice\n",
+ "\n",
+ "We can use `sklearn.linear_model.LinearRegression`, which internally solves the normal equations using either a direct solver or an SVDābased approach (the `lstsq` method we saw earlier).\n",
+ "\n",
+ "> **Linear algebra reminder**: The leastāsquares solution minimises $\\|y - X\\beta\\|_2^2$. The closed form is $\\beta = (X^T X)^{-1} X^T y$ when $X$ has full column rank."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f22373f2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.linear_model import LinearRegression\n",
+ "from sklearn.metrics import mean_squared_error, r2_score\n",
+ "\n",
+ "# Fit linear regression\n",
+ "lin_reg = LinearRegression()\n",
+ "lin_reg.fit(X_train, y_train)\n",
+ "\n",
+ "# Predict on train and test\n",
+ "y_train_pred = lin_reg.predict(X_train)\n",
+ "y_test_pred = lin_reg.predict(X_test)\n",
+ "\n",
+ "# Evaluate\n",
+ "train_mse = mean_squared_error(y_train, y_train_pred)\n",
+ "test_mse = mean_squared_error(y_test, y_test_pred)\n",
+ "train_r2 = r2_score(y_train, y_train_pred)\n",
+ "test_r2 = r2_score(y_test, y_test_pred)\n",
+ "\n",
+ "print(f\"Train MSE: {train_mse:.4f}, Train R²: {train_r2:.4f}\")\n",
+ "print(f\"Test MSE: {test_mse:.4f}, Test R²: {test_r2:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "62e8cec4",
+ "metadata": {},
+ "source": [
+ "The test $R^2$ is respectable (~0.6), but perhaps we can do better with a more flexible model. However, simply adding polynomial features might lead to overfitting. Let's examine that.\n",
+ "\n",
+ "## Polynomial Regression and the Danger of Overfitting\n",
+ "\n",
+ "Polynomial regression creates new features by taking powers of the original features. For example, with one feature $x$, a degreeā2 model uses $[1, x, x^2]$. For multiple features, we can include interaction terms.\n",
+ "\n",
+ "> **Linear algebra view**: The Vandermonde matrix (for one feature) or its multivariate generalisation becomes the new design matrix. As degree increases, the condition number often explodes, leading to numerical instability and wild coefficients ā a sign of overfitting.\n",
+ "\n",
+ "Let's illustrate underfitting and overfitting on synthetic data before moving to the housing dataset."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2e6f26c6",
+ "metadata": {},
+ "source": [
+ "### Illustration: Underfitting vs Overfitting\n",
+ "\n",
+ "We generate data from a quadratic function with noise, then fit polynomials of different degrees."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "14a7a408",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Generate quadratic data (similar to notebook 02) ā using distinct names\n",
+ "np.random.seed(3)\n",
+ "n_synth = 50\n",
+ "x_synth = np.random.uniform(-5, 5, n_synth)\n",
+ "y_true_synth = 2.0 * x_synth**2 + 5.0\n",
+ "noise_synth = np.random.normal(0, 3, n_synth)\n",
+ "y_synth = y_true_synth + noise_synth\n",
+ "\n",
+ "# Fit polynomials of degree 1 (underfit), 2 (good), 11 (overfit)\n",
+ "degrees = [1, 2, 11]\n",
+ "x_plot = np.linspace(-5, 5, 200)\n",
+ "\n",
+ "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
+ "for idx, d in enumerate(degrees):\n",
+ " coeff = np.polyfit(x_synth, y_synth, d)\n",
+ " p = np.poly1d(coeff)\n",
+ " axes[idx].scatter(x_synth, y_synth, alpha=0.7, label='Data')\n",
+ " axes[idx].plot(x_plot, p(x_plot), 'r-', linewidth=2, label=f'Degree {d}')\n",
+ " axes[idx].set_title(f'Degree {d} fit')\n",
+ " axes[idx].set_xlabel('x')\n",
+ " axes[idx].set_ylabel('y')\n",
+ " axes[idx].legend()\n",
+ " axes[idx].grid(True)\n",
+ "plt.tight_layout()\n",
+ "plt.savefig('../images/underfitting_vs_overfitting.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f79cb7d",
+ "metadata": {},
+ "source": [
+ "- **Degree 1 (underfitting)**: The linear model cannot capture the curvature, resulting in high bias.\n",
+ "- **Degree 2 (good)**: The quadratic model matches the true underlying structure.\n",
+ "- **Degree 11 (overfitting)**: The polynomial oscillates wildly to fit the noise, leading to poor generalisation.\n",
+ "\n",
+ "Now back to the housing dataset. Let's create polynomial features and see the effect on condition number and test error."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4a1c23b1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.preprocessing import PolynomialFeatures\n",
+ "\n",
+ "# Create polynomial features of degree 2 (includes interactions)\n",
+ "poly = PolynomialFeatures(degree=2, include_bias=False)\n",
+ "X_train_poly = poly.fit_transform(X_train)\n",
+ "X_test_poly = poly.transform(X_test)\n",
+ "\n",
+ "print(f\"Original training features: {X_train.shape[1]}\")\n",
+ "print(f\"Polynomial training features: {X_train_poly.shape[1]}\")\n",
+ "\n",
+ "# Condition number of the augmented polynomial design matrix (with intercept added later)\n",
+ "from numpy.linalg import cond\n",
+ "\n",
+ "X_train_poly_with_intercept = np.hstack([np.ones((X_train_poly.shape[0], 1)), X_train_poly])\n",
+ "print(f\"Condition number of polynomial design matrix: {cond(X_train_poly_with_intercept):.2e}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "70ce5172",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Fit linear regression on polynomial features\n",
+ "poly_reg = LinearRegression()\n",
+ "poly_reg.fit(X_train_poly, y_train)\n",
+ "\n",
+ "y_train_pred_poly = poly_reg.predict(X_train_poly)\n",
+ "y_test_pred_poly = poly_reg.predict(X_test_poly)\n",
+ "\n",
+ "train_mse_poly = mean_squared_error(y_train, y_train_pred_poly)\n",
+ "test_mse_poly = mean_squared_error(y_test, y_test_pred_poly)\n",
+ "\n",
+ "print(f\"Polynomial (deg=2) Train MSE: {train_mse_poly:.4f}\")\n",
+ "print(f\"Polynomial (deg=2) Test MSE: {test_mse_poly:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b8e00866",
+ "metadata": {},
+ "source": [
+ "The test error is **worse** than the linear model ā this is a clear sign of overfitting. The model is too flexible and fits noise in the training data. We need **regularisation**.\n",
+ "\n",
+ "## Ridge Regression ($L^2$ Regularisation)\n",
+ "\n",
+ "Ridge regression adds a penalty on the squared $L^2$ norm of the coefficient vector:\n",
+ "\n",
+ "$$\n",
+ "\\min_{\\beta} \\|y - X\\beta\\|_2^2 + \\lambda \\|\\beta\\|_2^2\n",
+ "$$\n",
+ "\n",
+ "where $\\lambda \\ge 0$ is the regularisation strength.\n",
+ "\n",
+ "> **Linear algebra interpretation**: The normal equations become $(X^T X + \\lambda I)\\beta = X^T y$. Adding $\\lambda I$ to $X^T X$ increases all eigenvalues by $\\lambda$, thereby improving the condition number and making the problem wellāposed even when $X^T X$ is singular. This is a form of **Tikhonov regularisation**. \n",
+ "> This directly shifts the eigenvalues (and singular values) of $X^TX$. \n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b41e9e16",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "Ridge regression shrinks coefficients towards zero but rarely makes them exactly zero. It is especially useful when features are correlated (multicollinearity)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9fa1c509",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.linear_model import Ridge\n",
+ "from sklearn.model_selection import cross_val_score\n",
+ "\n",
+ "# We'll use the polynomial features because ridge can help with overfitting\n",
+ "# Choose lambda via cross-validation on the training set\n",
+ "alphas = np.logspace(-3, 3, 20)\n",
+ "cv_scores = []\n",
+ "\n",
+ "for alpha in alphas:\n",
+ " ridge = Ridge(alpha=alpha)\n",
+ " # 5-fold cross-validation, negative MSE (scoring expects higher = better)\n",
+ " scores = cross_val_score(ridge, X_train_poly, y_train, cv=5, scoring='neg_mean_squared_error')\n",
+ " cv_scores.append(-scores.mean())\n",
+ "\n",
+ "best_alpha = alphas[np.argmin(cv_scores)]\n",
+ "print(f\"Best alpha from CV: {best_alpha:.4f}\")\n",
+ "\n",
+ "# Plot CV error vs alpha\n",
+ "plt.figure(figsize=(8,4))\n",
+ "plt.semilogx(alphas, cv_scores)\n",
+ "plt.xlabel('alpha (Ī»)')\n",
+ "plt.ylabel('Cross-validated MSE')\n",
+ "plt.title('Ridge Regularisation on Polynomial Features')\n",
+ "plt.grid(True)\n",
+ "plt.savefig('../images/ridge_regularization_polynomial_features_unscaled.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b0c512e8",
+ "metadata": {},
+ "source": [
+ "You'll notice we are getting a bunch of errors about ill-conditioned matrices. This happens because the polynomial features are on wildly different scales. Let's standardize our features first. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "97aaaeba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.linear_model import Ridge\n",
+ "from sklearn.model_selection import cross_val_score\n",
+ "from sklearn.preprocessing import StandardScaler\n",
+ "\n",
+ "# Add scaler\n",
+ "scaler = StandardScaler()\n",
+ "X_train_poly_scaled = scaler.fit_transform(X_train_poly)\n",
+ "X_test_poly_scaled = scaler.transform(X_test_poly)\n",
+ "\n",
+ "# We'll use the polynomial features because ridge can help with overfitting\n",
+ "# Choose lambda via cross-validation on the training set\n",
+ "alphas = np.logspace(-3, 3, 20)\n",
+ "cv_scores = []\n",
+ "\n",
+ "for alpha in alphas:\n",
+ " ridge = Ridge(alpha=alpha)\n",
+ " scores = cross_val_score(ridge, X_train_poly_scaled, y_train, cv=5, scoring='neg_mean_squared_error')\n",
+ " cv_scores.append(-scores.mean())\n",
+ "\n",
+ "best_alpha = alphas[np.argmin(cv_scores)]\n",
+ "print(f\"Best alpha from CV: {best_alpha:.4f}\")\n",
+ "\n",
+ "# Plot CV error vs alpha\n",
+ "plt.figure(figsize=(8,4))\n",
+ "plt.semilogx(alphas, cv_scores)\n",
+ "plt.xlabel('alpha (Ī»)')\n",
+ "plt.ylabel('Cross-validated MSE')\n",
+ "plt.title('Ridge Regularisation on Polynomial Features')\n",
+ "plt.grid(True)\n",
+ "plt.savefig('../images/ridge_regularization_polynomial_features_scaled.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "99c974ca",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Fit ridge with best alpha on SCALED polynomial features\n",
+ "ridge_best = Ridge(alpha=best_alpha)\n",
+ "ridge_best.fit(X_train_poly_scaled, y_train)\n",
+ "\n",
+ "y_test_pred_ridge = ridge_best.predict(X_test_poly_scaled)\n",
+ "test_mse_ridge = mean_squared_error(y_test, y_test_pred_ridge)\n",
+ "print(f\"Ridge (poly deg=2) Test MSE: {test_mse_ridge:.4f}\")\n",
+ "print(f\"Ridge improved over plain polynomial (MSE {test_mse_poly:.4f} -> {test_mse_ridge:.4f})\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "100ef094",
+ "metadata": {},
+ "source": [
+ "### Ridge from the SVD Perspective\n",
+ "\n",
+ "The Ridge solution has a beautiful interpretation in terms of singular values. Recall from Notebook 2 that if $X = U\\Sigma V^T$ is the SVD of the (centered) design matrix, then the OLS solution is\n",
+ "\n",
+ "$$\n",
+ "\\hat{\\beta}_{OLS} = V\\Sigma^{-1}U^T y = \\sum_{i=1}^{p} \\frac{1}{\\sigma_i} (u_i^T y) v_i.\n",
+ "$$\n",
+ "\n",
+ "When $\\sigma_i$ is small, the coefficient $\\frac{1}{\\sigma_i}$ explodes ā this is the condition number problem.\n",
+ "\n",
+ "For Ridge regression, one can show that\n",
+ "\n",
+ "$$\n",
+ "\\hat{\\beta}_{Ridge} = \\sum_{i=1}^{p} \\frac{\\sigma_i}{\\sigma_i^2 + \\lambda} (u_i^T y) v_i.\n",
+ "$$\n",
+ "\n",
+ "Notice what happens:\n",
+ "- When $\\sigma_i \\gg \\sqrt{\\lambda}$, the coefficient is approximately $\\frac{1}{\\sigma_i}$ (same as OLS).\n",
+ "- When $\\sigma_i \\ll \\sqrt{\\lambda}$, the coefficient is approximately $\\frac{\\sigma_i}{\\lambda}$ ā **shrunk towards zero**.\n",
+ "- The effective condition number becomes $\\frac{\\sigma_1^2 + \\lambda}{\\sigma_p^2 + \\lambda}$, which is much better than $\\frac{\\sigma_1^2}{\\sigma_p^2}$.\n",
+ "\n",
+ "This is why Ridge helps with multicollinearity: it dampens precisely those directions that were poorly determined."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c438ebc5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Visualize how Ridge shrinks coefficients relative to singular values\n",
+ "from sklearn.preprocessing import StandardScaler\n",
+ "\n",
+ "# Use scaled data for clean SVD interpretation\n",
+ "scaler = StandardScaler()\n",
+ "X_train_scaled = scaler.fit_transform(X_train)\n",
+ "\n",
+ "# Compute SVD of centered design matrix\n",
+ "U, s, Vt = np.linalg.svd(X_train_scaled, full_matrices=False)\n",
+ "\n",
+ "# For different lambda values, compute the \"shrinkage factor\" for each singular direction\n",
+ "lambdas = [0, 0.1, 1, 10, 100]\n",
+ "\n",
+ "plt.figure(figsize=(10, 5))\n",
+ "\n",
+ "for lam in lambdas:\n",
+ " if lam == 0:\n",
+ " # OLS: no shrinkage\n",
+ " shrinkage = np.ones_like(s)\n",
+ " label = 'OLS (Ī»=0)'\n",
+ " else:\n",
+ " # Ridge shrinkage factor: sigma / (sigma^2 + lambda)\n",
+ " shrinkage = s / (s**2 + lam)\n",
+ " # Normalize so we can compare shapes\n",
+ " shrinkage = shrinkage / shrinkage[0] # normalize to first component\n",
+ " label = f'Ridge (Ī»={lam})'\n",
+ " \n",
+ " plt.plot(range(1, len(s)+1), shrinkage, 'o-', label=label, markersize=8)\n",
+ "\n",
+ "plt.xlabel('Singular value index (decreasing)')\n",
+ "plt.ylabel('Shrinkage factor (normalized)')\n",
+ "plt.title('Ridge Shrinkage: How Ī» Dampens Small Singular Directions')\n",
+ "plt.legend()\n",
+ "plt.grid(True, alpha=0.3)\n",
+ "plt.xticks(range(1, len(s)+1))\n",
+ "plt.tight_layout()\n",
+ "plt.savefig('../images/ridge_svd_shrinkage.png')\n",
+ "plt.show()\n",
+ "\n",
+ "# Show condition number improvement\n",
+ "print(\"Singular values:\", s.round(2))\n",
+ "print(f\"\\nCondition number (OLS): {s[0]/s[-1]:.2f}\")\n",
+ "for lam in [0.1, 1, 10]:\n",
+ " effective_cond = (s[0]**2 + lam) / (s[-1]**2 + lam)\n",
+ " print(f\"Effective condition number (Ī»={lam}): {effective_cond:.2f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "50490dca",
+ "metadata": {},
+ "source": [
+ "## Lasso Regression ($L^1$ Regularisation)\n",
+ "\n",
+ "Lasso replaces the $L^2$ penalty with an $L^1$ penalty:\n",
+ "\n",
+ "$$\n",
+ "\\min_{\\beta} \\|y - X\\beta\\|_2^2 + \\lambda \\|\\beta\\|_1\n",
+ "$$\n",
+ "\n",
+ "> **Geometric intuition**: The $L^1$ ball is a diamond (in $\\mathbb{R}^2$). The intersection of the quadratic loss contours with this diamond often occurs at a corner, forcing some coefficients to be **exactly zero**. Thus Lasso performs **feature selection**."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c8638a17",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "Lasso is useful when we suspect that only a few features are truly relevant, especially in highādimensional settings. However, it does not have a closedāform solution; it is typically solved via coordinate descent or other optimisation algorithms."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4ebe0612",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.linear_model import Lasso\n",
+ "\n",
+ "# Lasso also requires tuning of alpha\n",
+ "lasso = Lasso(alpha=0.01, max_iter=10000) # start with a small alpha\n",
+ "lasso.fit(X_train_poly, y_train)\n",
+ "\n",
+ "# Count non-zero coefficients\n",
+ "n_nonzero = np.sum(np.abs(lasso.coef_) > 1e-10)\n",
+ "print(f\"Number of non-zero coefficients: {n_nonzero} out of {len(lasso.coef_)}\")\n",
+ "\n",
+ "y_test_pred_lasso = lasso.predict(X_test_poly)\n",
+ "test_mse_lasso = mean_squared_error(y_test, y_test_pred_lasso)\n",
+ "print(f\"Lasso (poly deg=2) Test MSE: {test_mse_lasso:.4f}\")\n",
+ "\n",
+ "# Cross-validation for Lasso alpha\n",
+ "from sklearn.linear_model import LassoCV\n",
+ "\n",
+ "lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 30), cv=5, max_iter=10000, random_state=RANDOM_STATE)\n",
+ "lasso_cv.fit(X_train_poly, y_train)\n",
+ "print(f\"Best alpha from LassoCV: {lasso_cv.alpha_:.4f}\")\n",
+ "print(f\"Number of non-zero coefficients (CV best): {np.sum(np.abs(lasso_cv.coef_) > 1e-10)}\")\n",
+ "\n",
+ "y_test_pred_lasso_cv = lasso_cv.predict(X_test_poly)\n",
+ "test_mse_lasso_cv = mean_squared_error(y_test, y_test_pred_lasso_cv)\n",
+ "print(f\"LassoCV Test MSE: {test_mse_lasso_cv:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "48be5c6e",
+ "metadata": {},
+ "source": [
+ "Again, there are unscaled polynomial features, so we get convergence warnings. LASSO is sensivitve to scalling becuase the penalty treats all coefficients equally. We also get a suggestion to increase the number of iterations. \n",
+ "\n",
+ "Let's fix this. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "02c2cdfa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.linear_model import Lasso, LassoCV\n",
+ "\n",
+ "# Lasso with more iterations\n",
+ "lasso = Lasso(alpha=0.01, max_iter=100000, tol=1e-4)\n",
+ "lasso.fit(X_train_poly_scaled, y_train)\n",
+ "\n",
+ "# Count non-zero coefficients\n",
+ "n_nonzero = np.sum(np.abs(lasso.coef_) > 1e-10)\n",
+ "print(f\"Number of non-zero coefficients: {n_nonzero} out of {len(lasso.coef_)}\")\n",
+ "\n",
+ "y_test_pred_lasso = lasso.predict(X_test_poly_scaled)\n",
+ "test_mse_lasso = mean_squared_error(y_test, y_test_pred_lasso)\n",
+ "print(f\"Lasso Test MSE: {test_mse_lasso:.4f}\")\n",
+ "\n",
+ "# Cross-validation for Lasso alpha\n",
+ "lasso_cv = LassoCV(\n",
+ " alphas=np.logspace(-3, 1, 30), \n",
+ " cv=5, \n",
+ " max_iter=100000, \n",
+ " tol=1e-4,\n",
+ " random_state=RANDOM_STATE\n",
+ ")\n",
+ "lasso_cv.fit(X_train_poly_scaled, y_train)\n",
+ "print(f\"Best alpha from LassoCV: {lasso_cv.alpha_:.4f}\")\n",
+ "print(f\"Number of non-zero coefficients (CV best): {np.sum(np.abs(lasso_cv.coef_) > 1e-10)}\")\n",
+ "\n",
+ "y_test_pred_lasso_cv = lasso_cv.predict(X_test_poly_scaled)\n",
+ "test_mse_lasso_cv = mean_squared_error(y_test, y_test_pred_lasso_cv)\n",
+ "print(f\"LassoCV Test MSE: {test_mse_lasso_cv:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b6fcd6ba",
+ "metadata": {},
+ "source": [
+ "### Why Lasso Produces Sparse Solutions: The L¹ Geometry\n",
+ "\n",
+ "Recall from Notebook 3 that the $L^1$ unit ball is a diamond (a rotated square in $\\mathbb{R}^1$). This geometric fact is precisely why Lasso tends to produce coefficients that are **exactly zero**.\n",
+ "\n",
+ "Consider the constrained form of the problem:\n",
+ "\n",
+ "$$\n",
+ "\\min_{\\beta} \\|y - X\\beta\\|_2^2 \\quad \\text{subject to} \\quad \\|\\beta\\|_1 \\leq t.\n",
+ "$$\n",
+ "\n",
+ "The constraint region is the $L^1$ ball ā a diamond with corners on the axes. The contours of the loss function $\\|y - X\\beta\\|_2^2$ are ellipses centered at the OLS solution.\n",
+ "\n",
+ "**Key insight**: When an elliptical contour expands and first touches the diamond, it often hits a **corner**. Corners lie on the axes, meaning some coefficients are exactly zero.\n",
+ "\n",
+ "This is in contrast to Ridge, where the constraint region is a ball (circle in $\\mathbb{R}^1$), and the first contact is typically at a smooth point ā coefficients are shrunk but rarely zero."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ca02a5f4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Visualize L1 vs L2 constraint regions and why Lasso gives sparsity\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n",
+ "\n",
+ "# L1 ball (diamond)\n",
+ "theta = np.linspace(0, 2*np.pi, 100)\n",
+ "r = 1\n",
+ "\n",
+ "# L1 ball vertices\n",
+ "l1_x = [r, 0, -r, 0, r]\n",
+ "l1_y = [0, r, 0, -r, 0]\n",
+ "\n",
+ "# L2 ball (circle)\n",
+ "l2_x = r * np.cos(theta)\n",
+ "l2_y = r * np.sin(theta)\n",
+ "\n",
+ "# Simulated loss contours (ellipses centered away from origin)\n",
+ "# The OLS solution is at some point (beta1_ols, beta2_ols)\n",
+ "beta_ols = np.array([0.7, 0.3])\n",
+ "\n",
+ "for idx, (ax, ball_type) in enumerate(zip(axes, ['Lasso (L¹)', 'Ridge (L²)'])):\n",
+ " # Draw constraint region\n",
+ " if idx == 0: # Lasso - L1 ball\n",
+ " ax.fill(l1_x, l1_y, alpha=0.3, color='blue', label='L¹ constraint region')\n",
+ " ax.plot(l1_x, l1_y, 'b-', linewidth=2)\n",
+ " else: # Ridge - L2 ball\n",
+ " ax.fill(l2_x, l2_y, alpha=0.3, color='green', label='L² constraint region')\n",
+ " ax.plot(l2_x, l2_y, 'g-', linewidth=2)\n",
+ " \n",
+ " # Draw loss contours (ellipses)\n",
+ " # Simplified: concentric ellipses around OLS solution\n",
+ " for scale in [0.3, 0.5, 0.7, 1.0]:\n",
+ " ellipse_x = beta_ols[0] + scale * 0.4 * np.cos(theta)\n",
+ " ellipse_y = beta_ols[1] + scale * 0.2 * np.sin(theta)\n",
+ " ax.plot(ellipse_x, ellipse_y, 'r--', alpha=0.5, linewidth=1)\n",
+ " \n",
+ " # Mark OLS solution\n",
+ " ax.scatter(*beta_ols, color='red', s=100, zorder=5, label='OLS solution')\n",
+ " \n",
+ " # Mark the \"first contact\" point (approximate)\n",
+ " if idx == 0: # Lasso hits corner\n",
+ " contact = np.array([1.0, 0.0]) # on the axis!\n",
+ " ax.scatter(*contact, color='purple', s=150, marker='*', zorder=6, label='Lasso solution (sparse!)')\n",
+ " else: # Ridge hits smooth part\n",
+ " contact = np.array([0.85, 0.35]) # not on axis\n",
+ " ax.scatter(*contact, color='purple', s=150, marker='*', zorder=6, label='Ridge solution')\n",
+ " \n",
+ " ax.set_xlim(-1.5, 1.5)\n",
+ " ax.set_ylim(-1.5, 1.5)\n",
+ " ax.set_xlabel(r'$\\beta_1$')\n",
+ " ax.set_ylabel(r'$\\beta_2$')\n",
+ " ax.set_title(f'{ball_type} Constraint')\n",
+ " ax.legend(loc='upper right', fontsize=9)\n",
+ " ax.set_aspect('equal')\n",
+ " ax.grid(True, alpha=0.3)\n",
+ " ax.axhline(0, color='k', linewidth=0.5)\n",
+ " ax.axvline(0, color='k', linewidth=0.5)\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.savefig('../images/lasso_vs_ridge_geometry.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "77898731",
+ "metadata": {},
+ "source": [
+ "Key insight: Lasso's $L^1$ constraint has corners on the axes.\n",
+ "When the loss contour touches a corner, that coefficient becomes exactly zero."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6ce05c85",
+ "metadata": {},
+ "source": [
+ "## Principal Component Regression (PCR)\n",
+ "\n",
+ "Principal Component Regression combines the dimensionality reduction from Notebook 4 with linear regression. The idea is simple:\n",
+ "\n",
+ "1. Compute the principal components of $X$ (via SVD on centered data).\n",
+ "2. Keep only the top $k$ components (those with largest singular values).\n",
+ "3. Regress $y$ on these $k$ components.\n",
+ "\n",
+ "**Linear algebra perspective**: We project $X$ onto its best rank-$k$ approximation (in Frobenius norm) and then solve a least-squares problem in the reduced space. This is different from Ridge:\n",
+ "- Ridge **shrinks** all directions but keeps them.\n",
+ "- PCR **discards** the smallest singular directions entirely.\n",
+ "\n",
+ "PCR is particularly useful when:\n",
+ "- Features are highly correlated (multicollinearity).\n",
+ "- You want interpretable, low-dimensional representations.\n",
+ "- The signal lives in the top principal components while noise dominates the rest.\n",
+ "\n",
+ "The tradeoff: if the target $y$ is correlated with a small singular direction, PCR will discard useful information."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "46adb22c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.decomposition import PCA\n",
+ "from sklearn.linear_model import LinearRegression\n",
+ "from sklearn.pipeline import make_pipeline\n",
+ "from sklearn.model_selection import cross_val_score\n",
+ "\n",
+ "# Compare PCR with varying number of components\n",
+ "n_components_range = range(1, X_train_scaled.shape[1] + 1)\n",
+ "pcr_scores = []\n",
+ "\n",
+ "for n_comp in n_components_range:\n",
+ " pcr = make_pipeline(\n",
+ " PCA(n_components=n_comp),\n",
+ " LinearRegression()\n",
+ " )\n",
+ " # Negative MSE (sklearn convention: higher is better)\n",
+ " scores = cross_val_score(pcr, X_train_scaled, y_train, cv=5, scoring='neg_mean_squared_error')\n",
+ " pcr_scores.append(-scores.mean())\n",
+ "\n",
+ "# Also compute variance explained\n",
+ "pca_full = PCA()\n",
+ "pca_full.fit(X_train_scaled)\n",
+ "var_explained = np.cumsum(pca_full.explained_variance_ratio_)\n",
+ "\n",
+ "# Plot\n",
+ "fig, ax1 = plt.subplots(figsize=(10, 5))\n",
+ "\n",
+ "ax1.plot(n_components_range, pcr_scores, 'b-o', label='CV MSE')\n",
+ "ax1.set_xlabel('Number of Principal Components')\n",
+ "ax1.set_ylabel('Cross-Validated MSE', color='b')\n",
+ "ax1.tick_params(axis='y', labelcolor='b')\n",
+ "\n",
+ "ax2 = ax1.twinx()\n",
+ "ax2.plot(n_components_range, var_explained, 'r--s', label='Variance Explained')\n",
+ "ax2.set_ylabel('Cumulative Variance Explained', color='r')\n",
+ "ax2.tick_params(axis='y', labelcolor='r')\n",
+ "ax2.set_ylim(0, 1.05)\n",
+ "\n",
+ "plt.title('Principal Component Regression: Choosing k')\n",
+ "fig.legend(loc='center right', bbox_to_anchor=(0.85, 0.5))\n",
+ "plt.grid(True, alpha=0.3)\n",
+ "plt.tight_layout()\n",
+ "plt.savefig('../images/pcr_components_selection.png')\n",
+ "plt.show()\n",
+ "\n",
+ "# Best number of components\n",
+ "best_n_comp = n_components_range[np.argmin(pcr_scores)]\n",
+ "print(f\"Best number of components: {best_n_comp}\")\n",
+ "print(f\"Variance explained: {var_explained[best_n_comp-1]:.2%}\")\n",
+ "\n",
+ "# Compare with OLS and Ridge\n",
+ "print(f\"\\nModel Comparison (Test MSE):\")\n",
+ "print(f\" OLS (all features): {test_mse:.4f}\")\n",
+ "print(f\" PCR (k={best_n_comp}): {pcr_scores[best_n_comp-1]:.4f}\")\n",
+ "print(f\" Ridge (Ī»={best_alpha:.2f}): {test_mse_ridge:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f67d60bc",
+ "metadata": {},
+ "source": [
+ "## Gradient Descent: When the Normal Equations Are Not Enough\n",
+ "\n",
+ "For very large datasets, computing $(X^T X)^{-1}$ or even forming $X^T X$ becomes prohibitive. **Gradient descent** is an iterative optimisation method that uses only firstāorder derivatives.\n",
+ "\n",
+ "### The Linear Algebra of Convergence\n",
+ "\n",
+ "The loss function and its gradient:\n",
+ "\n",
+ "$$\n",
+ "L(\\beta) = \\frac{1}{2n}\\|y - X\\beta\\|_2^2, \\qquad \\nabla L(\\beta) = -\\frac{1}{n} X^T (y - X\\beta).\n",
+ "$$\n",
+ "\n",
+ "Starting from $\\beta^{(0)}$, we update:\n",
+ "\n",
+ "$$\n",
+ "\\beta^{(t+1)} = \\beta^{(t)} - \\eta \\nabla L(\\beta^{(t)}).\n",
+ "$$\n",
+ "\n",
+ "**Convergence depends on the eigenvalues of $X^T X$.** Let $\\lambda_{\\max}$ and $\\lambda_{\\min}$ be the largest and smallest eigenvalues. Then:\n",
+ "- The learning rate must satisfy $\\eta < \\frac{2}{\\lambda_{\\max}}$ for convergence.\n",
+ "- The convergence rate is governed by the **condition number** $\\kappa = \\frac{\\lambda_{\\max}}{\\lambda_{\\min}}$.\n",
+ "- When $\\kappa$ is large, gradients point in \"wrong\" directions ā the loss surface is a narrow valley.\n",
+ "\n",
+ "This is why feature scaling matters: it reduces $\\kappa$, making the loss surface more spherical and convergence faster."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4135ebb7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Demonstrate how condition number affects gradient descent convergence\n",
+ "from sklearn.preprocessing import StandardScaler\n",
+ "\n",
+ "def gradient_descent_linear(X, y, learning_rate=0.01, n_iter=1000, verbose=False):\n",
+ " \"\"\"Batch gradient descent for linear regression.\"\"\"\n",
+ " n, p = X.shape\n",
+ " beta = np.zeros(p)\n",
+ " losses = []\n",
+ " for i in range(n_iter):\n",
+ " residual = y - X @ beta\n",
+ " grad = - (1/n) * X.T @ residual\n",
+ " beta -= learning_rate * grad\n",
+ " loss = (1/(2*n)) * np.linalg.norm(residual)**2\n",
+ " losses.append(loss)\n",
+ " return beta, losses\n",
+ "\n",
+ "# Use a subset for illustration\n",
+ "X_subset = X_train[:1000]\n",
+ "y_subset = y_train[:1000]\n",
+ "\n",
+ "# Add intercept\n",
+ "X_subset_aug = np.hstack([np.ones((X_subset.shape[0], 1)), X_subset])\n",
+ "\n",
+ "# Compute eigenvalues of X^T X\n",
+ "eigenvalues = np.linalg.eigvalsh(X_subset_aug.T @ X_subset_aug)\n",
+ "lambda_max, lambda_min = eigenvalues.max(), eigenvalues[eigenvalues > 1e-10].min()\n",
+ "cond_num = lambda_max / lambda_min\n",
+ "\n",
+ "print(f\"Eigenvalue range: [{lambda_min:.2e}, {lambda_max:.2e}]\")\n",
+ "print(f\"Condition number: {cond_num:.2e}\")\n",
+ "print(f\"Max stable learning rate: {2/lambda_max:.2e}\")\n",
+ "\n",
+ "# Try gradient descent with different learning rates on UNSCALED data\n",
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
+ "\n",
+ "# UNSCALED\n",
+ "learning_rates = [1e-10, 1e-9, 1e-8]\n",
+ "for lr in learning_rates:\n",
+ " _, losses = gradient_descent_linear(X_subset_aug, y_subset, learning_rate=lr, n_iter=200)\n",
+ " axes[0].plot(losses, label=f'Ī· = {lr:.0e}')\n",
+ "axes[0].set_xlabel('Iteration')\n",
+ "axes[0].set_ylabel('Loss (MSE)')\n",
+ "axes[0].set_title(f'Unscaled Data (Īŗ = {cond_num:.1e})')\n",
+ "axes[0].legend()\n",
+ "axes[0].grid(True, alpha=0.3)\n",
+ "axes[0].set_yscale('log')\n",
+ "\n",
+ "# SCALED\n",
+ "scaler = StandardScaler()\n",
+ "X_subset_scaled = scaler.fit_transform(X_subset)\n",
+ "X_subset_scaled_aug = np.hstack([np.ones((X_subset_scaled.shape[0], 1)), X_subset_scaled])\n",
+ "\n",
+ "eigenvalues_scaled = np.linalg.eigvalsh(X_subset_scaled_aug.T @ X_subset_scaled_aug)\n",
+ "lambda_max_s, lambda_min_s = eigenvalues_scaled.max(), eigenvalues_scaled[eigenvalues_scaled > 1e-10].min()\n",
+ "cond_num_scaled = lambda_max_s / lambda_min_s\n",
+ "\n",
+ "learning_rates_scaled = [0.001, 0.01, 0.1]\n",
+ "for lr in learning_rates_scaled:\n",
+ " _, losses = gradient_descent_linear(X_subset_scaled_aug, y_subset, learning_rate=lr, n_iter=200)\n",
+ " axes[1].plot(losses, label=f'Ī· = {lr}')\n",
+ "axes[1].set_xlabel('Iteration')\n",
+ "axes[1].set_ylabel('Loss (MSE)')\n",
+ "axes[1].set_title(f'Scaled Data (Īŗ = {cond_num_scaled:.1f})')\n",
+ "axes[1].legend()\n",
+ "axes[1].grid(True, alpha=0.3)\n",
+ "axes[1].set_yscale('log')\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.savefig('../images/gd_condition_number_effect.png')\n",
+ "plt.show()\n",
+ "\n",
+ "print(f\"\\nScaling reduced condition number from {cond_num:.1e} to {cond_num_scaled:.1f}\")\n",
+ "print(\"This allows much larger learning rates and faster convergence.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e1197d1a",
+ "metadata": {},
+ "source": [
+ "Let's apply gradient descent to our housing data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cdc96859",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Implement batch gradient descent for linear regression on a small subset for illustration\n",
+ "def gradient_descent_linear(X, y, learning_rate=0.01, n_iter=1000, verbose=False):\n",
+ " n, p = X.shape\n",
+ " beta = np.zeros(p)\n",
+ " losses = []\n",
+ " for i in range(n_iter):\n",
+ " grad = - (1/n) * X.T @ (y - X @ beta)\n",
+ " beta -= learning_rate * grad\n",
+ " loss = (1/(2*n)) * np.linalg.norm(y - X @ beta)**2\n",
+ " losses.append(loss)\n",
+ " if verbose and i % 200 == 0:\n",
+ " print(f\"Iter {i}: loss = {loss:.6f}\")\n",
+ " return beta, losses\n",
+ "\n",
+ "# Use a small subset for speed\n",
+ "X_small = X_train[:1000]\n",
+ "y_small = y_train[:1000]\n",
+ "\n",
+ "# Add intercept column\n",
+ "X_small_aug = np.hstack([np.ones((X_small.shape[0], 1)), X_small])\n",
+ "\n",
+ "beta_gd, losses = gradient_descent_linear(X_small_aug, y_small, learning_rate=0.01, n_iter=500)\n",
+ "\n",
+ "plt.figure(figsize=(8,4))\n",
+ "plt.plot(losses)\n",
+ "plt.xlabel('Iteration')\n",
+ "plt.ylabel('Loss')\n",
+ "plt.title('Gradient Descent Convergence')\n",
+ "plt.grid(True)\n",
+ "plt.savefig('../images/gradient_descent_convergence_unscaled')\n",
+ "plt.show()\n",
+ "\n",
+ "# Compare with closed-form solution on the same subset\n",
+ "beta_closed = np.linalg.lstsq(X_small_aug, y_small, rcond=None)[0]\n",
+ "print(f\"Difference between GD and closed-form: {np.linalg.norm(beta_gd - beta_closed):.2e}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b3a02ebc",
+ "metadata": {},
+ "source": [
+ "Again, we have scaling issues causing some errors. Large values will dominate gradients giving rise to instability."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b39817b2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.preprocessing import StandardScaler\n",
+ "\n",
+ "# 1. Prepare Data\n",
+ "# Use a small subset for speed\n",
+ "X_small = X_train[:1000].copy() # Use .copy() to avoid SettingWithCopyWarning\n",
+ "y_small = y_train[:1000].copy()\n",
+ "\n",
+ "# 2. SCALE THE FEATURES (Critical for Gradient Descent!)\n",
+ "scaler = StandardScaler()\n",
+ "X_small_scaled = scaler.fit_transform(X_small)\n",
+ "\n",
+ "# Add intercept column AFTER scaling\n",
+ "# (We don't scale the intercept column, it stays as 1s)\n",
+ "X_small_aug = np.hstack([np.ones((X_small_scaled.shape[0], 1)), X_small_scaled])\n",
+ "\n",
+ "# 3. Run Gradient Descent\n",
+ "def gradient_descent_linear(X, y, learning_rate=0.01, n_iter=1000, verbose=False):\n",
+ " n, p = X.shape\n",
+ " beta = np.zeros(p)\n",
+ " losses = []\n",
+ " for i in range(n_iter):\n",
+ " # Predict\n",
+ " prediction = X @ beta\n",
+ " # Residual\n",
+ " residual = y - prediction\n",
+ " # Gradient\n",
+ " grad = - (1/n) * X.T @ residual\n",
+ " # Update\n",
+ " beta -= learning_rate * grad\n",
+ " \n",
+ " # Calculate Loss (MSE)\n",
+ " loss = (1/(2*n)) * np.linalg.norm(residual)**2\n",
+ " losses.append(loss)\n",
+ " \n",
+ " if verbose and i % 200 == 0:\n",
+ " print(f\"Iter {i}: loss = {loss:.6f}\")\n",
+ " return beta, losses\n",
+ "\n",
+ "# With scaled data, learning_rate=0.01 or even 0.1 is usually safe\n",
+ "beta_gd, losses = gradient_descent_linear(X_small_aug, y_small, learning_rate=0.1, n_iter=500, verbose=True)\n",
+ "\n",
+ "# Plot convergence\n",
+ "plt.figure(figsize=(8,4))\n",
+ "plt.plot(losses)\n",
+ "plt.xlabel('Iteration')\n",
+ "plt.ylabel('Loss (MSE)')\n",
+ "plt.title('Gradient Descent Convergence (Scaled Data)')\n",
+ "plt.grid(True)\n",
+ "plt.savefig('../images/gradient_descent_convergence_scaled')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f74f1fe8",
+ "metadata": {},
+ "source": [
+ "In practice, we use **stochastic** or **miniābatch** gradient descent for large data. `sklearn`'s `SGDRegressor` implements these with various loss functions and penalties.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ffb519e1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.linear_model import SGDRegressor\n",
+ "from sklearn.preprocessing import StandardScaler\n",
+ "from sklearn.pipeline import make_pipeline\n",
+ "\n",
+ "# SGDRegressor is sensitive to feature scaling, so we use a pipeline\n",
+ "# penalty=None means no regularization (standard Linear Regression)\n",
+ "sgd_reg = make_pipeline(\n",
+ " StandardScaler(),\n",
+ " SGDRegressor(penalty=None, learning_rate='constant', eta0=0.01, max_iter=1000, random_state=42)\n",
+ ")\n",
+ "\n",
+ "sgd_reg.fit(X_train, y_train)\n",
+ "\n",
+ "# Note: SGDRegressor optimizes a different loss function formulation by default,\n",
+ "# so coefficients might differ slightly from closed-form, but the prediction quality is similar.\n",
+ "print(f\"Coefficients: {sgd_reg.named_steps['sgdregressor'].coef_}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b204fd46",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Decision Trees and Random Forests\n",
+ "\n",
+ "Linear models assume a linear relationship. Decision trees are **nonāparametric** models that partition the feature space into rectangular regions and assign a constant prediction (or a simple model) in each region. The prediction function is piecewise constant. The basis functions are indicator functions of the leaves. While not linear in the original features, the model is linear in the (highādimensional) leafāindicator basis.\n",
+ "\n",
+ "**Random forests** combine many decision trees, each trained on a bootstrapped sample and a random subset of features. They reduce variance (overfitting) and often outperform single trees.\n",
+ "\n",
+ "When to use trees / forests:\n",
+ "- Nonlinear relationships with interactions.\n",
+ "- When interpretability is desired (a single tree can be visualised).\n",
+ "- When you have mixed categorical and continuous features.\n",
+ "- As a strong baseline before trying deep learning."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7ac05f78",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.tree import DecisionTreeRegressor\n",
+ "from sklearn.ensemble import RandomForestRegressor\n",
+ "\n",
+ "# Single decision tree (max depth 10)\n",
+ "tree = DecisionTreeRegressor(max_depth=10, random_state=RANDOM_STATE)\n",
+ "tree.fit(X_train, y_train)\n",
+ "y_test_pred_tree = tree.predict(X_test)\n",
+ "test_mse_tree = mean_squared_error(y_test, y_test_pred_tree)\n",
+ "\n",
+ "# Random forest (100 trees)\n",
+ "rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=RANDOM_STATE, n_jobs=-1)\n",
+ "rf.fit(X_train, y_train)\n",
+ "y_test_pred_rf = rf.predict(X_test)\n",
+ "test_mse_rf = mean_squared_error(y_test, y_test_pred_rf)\n",
+ "\n",
+ "print(f\"Decision Tree Test MSE: {test_mse_tree:.4f}\")\n",
+ "print(f\"Random Forest Test MSE: {test_mse_rf:.4f}\")\n",
+ "\n",
+ "# Compare with best linear model\n",
+ "print(f\"Ridge (poly) Test MSE: {test_mse_ridge:.4f}\")\n",
+ "print(f\"LassoCV Test MSE: {test_mse_lasso_cv:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4923e4e3",
+ "metadata": {},
+ "source": [
+ "Random forests often outperform linear models on complex realāworld data without requiring feature engineering or scaling.\n",
+ "\n",
+ "## Logistic Regression for Classification\n",
+ "\n",
+ "So far we have focused on regression (continuous targets). For binary classification (e.g., spam vs. not spam), **logistic regression** is a natural extension. It models the probability that an observation belongs to a class using the logistic (sigmoid) function:\n",
+ "\n",
+ "$$\n",
+ "P(y=1 \\mid x) = \\frac{1}{1 + e^{-x^T\\beta}}.\n",
+ "$$\n",
+ "\n",
+ "The decision boundary is linear in the features: $x^T\\beta = 0$. The model is fitted by **maximum likelihood estimation**, which is equivalent to minimising the **logāloss** (crossāentropy). There is no closedāform solution; we typically use gradient descent or Newton's method.\n",
+ "\n",
+ "We will illustrate logistic regression on a subset of the California housing data by creating a binary target (e.g., whether the median house value is above the median)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6c7802e5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n",
+ "\n",
+ "# Create binary target: 1 if house value > median, else 0\n",
+ "# Use the ORIGINAL dataframe to avoid confusion with scaled/transformed versions\n",
+ "y_binary = (df['MedHouseVal'] > df['MedHouseVal'].median()).astype(int).values\n",
+ "X_original = df[feature_names].values # original features, not overwritten\n",
+ "\n",
+ "# Split\n",
+ "X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(\n",
+ " X_original, y_binary, test_size=0.2, random_state=RANDOM_STATE\n",
+ ")\n",
+ "\n",
+ "# Scale features (important for logistic regression with regularization)\n",
+ "scaler_bin = StandardScaler()\n",
+ "X_train_bin_scaled = scaler_bin.fit_transform(X_train_bin)\n",
+ "X_test_bin_scaled = scaler_bin.transform(X_test_bin)\n",
+ "\n",
+ "# Train logistic regression\n",
+ "log_reg = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)\n",
+ "log_reg.fit(X_train_bin_scaled, y_train_bin)\n",
+ "\n",
+ "# Predict\n",
+ "y_pred_bin = log_reg.predict(X_test_bin_scaled)\n",
+ "accuracy = accuracy_score(y_test_bin, y_pred_bin)\n",
+ "\n",
+ "print(f\"Logistic Regression Accuracy: {accuracy:.4f}\")\n",
+ "print(\"\\nClassification Report:\")\n",
+ "print(classification_report(y_test_bin, y_pred_bin))\n",
+ "\n",
+ "# Coefficients (on scaled features)\n",
+ "coef_df = pd.DataFrame({\n",
+ " 'Feature': feature_names,\n",
+ " 'Coefficient': log_reg.coef_[0]\n",
+ "}).sort_values('Coefficient', key=abs, ascending=False)\n",
+ "print(\"\\nLogistic Regression Coefficients (scaled features):\")\n",
+ "print(coef_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cb23e281",
+ "metadata": {},
+ "source": [
+ "## CrossāValidation: A Deeper Look\n",
+ "\n",
+ "Crossāvalidation (CV) is a technique for assessing how well a model generalises to unseen data. Instead of a single train/validation split, we partition the training data into $k$ folds (typically 5 or 10). For each fold $i$, we train on the other $k-1$ folds and validate on fold $i$. The performance is averaged over the $k$ folds."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c445e5c9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Illustrate 5-fold cross-validation\n",
+ "from sklearn.model_selection import KFold\n",
+ "\n",
+ "n_points = 20\n",
+ "X_cv = np.arange(n_points).reshape(-1, 1)\n",
+ "colors = plt.cm.tab10(np.linspace(0, 1, 5))\n",
+ "\n",
+ "kf = KFold(n_splits=5, shuffle=True, random_state=3)\n",
+ "\n",
+ "fig, axes = plt.subplots(5, 1, figsize=(12, 8))\n",
+ "\n",
+ "for i, (train_idx, test_idx) in enumerate(kf.split(X_cv)):\n",
+ " ax = axes[i]\n",
+ " \n",
+ " # Plot all points\n",
+ " for j in range(n_points):\n",
+ " if j in test_idx:\n",
+ " ax.scatter(j, 0, s=200, c='red', marker='s', label='Test' if j == test_idx[0] else '')\n",
+ " else:\n",
+ " ax.scatter(j, 0, s=200, c='blue', marker='o', label='Train' if j == train_idx[0] else '')\n",
+ " \n",
+ " ax.set_xlim(-1, n_points)\n",
+ " ax.set_ylim(-0.5, 0.5)\n",
+ " ax.set_yticks([])\n",
+ " ax.set_ylabel(f'Fold {i+1}', rotation=0, labelpad=30)\n",
+ " \n",
+ " if i == 0:\n",
+ " ax.legend(loc='upper right', ncol=2)\n",
+ " if i < 4:\n",
+ " ax.set_xticks([])\n",
+ "\n",
+ "axes[-1].set_xlabel('Sample Index')\n",
+ "axes[2].set_title('5-Fold Cross-Validation', pad=20)\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.savefig('../images/cross_validation_illustration.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "17211749",
+ "metadata": {},
+ "source": [
+ "\n",
+ "> **Why crossāvalidate?** It reduces the variance of the performance estimate and makes better use of limited data. It is also essential for hyperparameter tuning (as we did with Ridge and Lasso).\n",
+ "\n",
+ "We already used `cross_val_score` above. Here's an explicit example with a linear model on the housing data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "99778878",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import cross_val_score, KFold\n",
+ "\n",
+ "# 5-fold CV on linear regression\n",
+ "lin_reg_cv = LinearRegression()\n",
+ "scores = cross_val_score(lin_reg_cv, X_train, y_train, cv=5, scoring='r2')\n",
+ "print(f\"5-fold CV R² scores: {scores}\")\n",
+ "print(f\"Mean R²: {scores.mean():.4f} (+/- {scores.std()*2:.4f})\")\n",
+ "\n",
+ "# We can also use a custom cross-validator\n",
+ "kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)\n",
+ "scores_shuffled = cross_val_score(lin_reg_cv, X_train, y_train, cv=kf, scoring='r2')\n",
+ "print(f\"Shuffled CV R² scores: {scores_shuffled}\")\n",
+ "print(f\"Mean R² (shuffled): {scores_shuffled.mean():.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0f5b2dc",
+ "metadata": {},
+ "source": [
+ "## Feature Scaling\n",
+ "\n",
+ "Many machine learning algorithms are sensitive to the scale of features. For example:\n",
+ "- Gradient descent converges faster when features are on similar scales.\n",
+ "- Regularisation (Ridge, Lasso) penalises coefficients equally; if features have different scales, the penalty is not meaningful.\n",
+ "- Distanceābased methods (kānearest neighbours, SVM with RBF kernel) assume all features are comparable.\n",
+ "\n",
+ "> **Linear algebra view**: Scaling corresponds to multiplying each column of $X$ by a positive scalar. This changes the condition number and the geometry of the optimisation landscape.\n",
+ "\n",
+ "Common scaling techniques:\n",
+ "- **Standardisation** (Zāscore): $x' = \\frac{x - \\mu}{\\sigma}$ (mean 0, variance 1).\n",
+ "- **Mināmax scaling**: $x' = \\frac{x - \\min}{\\max - \\min}$ (range [0,1]).\n",
+ "\n",
+ "We should always fit the scaler on the training set and then transform both train and test sets to avoid data leakage."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0b89e4b1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.preprocessing import StandardScaler\n",
+ "\n",
+ "# Create scaler\n",
+ "scaler = StandardScaler()\n",
+ "\n",
+ "# Fit on training data only\n",
+ "X_train_scaled = scaler.fit_transform(X_train)\n",
+ "X_test_scaled = scaler.transform(X_test)\n",
+ "\n",
+ "# Compare condition number before and after scaling\n",
+ "X_train_aug = np.hstack([np.ones((X_train.shape[0], 1)), X_train])\n",
+ "X_train_scaled_aug = np.hstack([np.ones((X_train_scaled.shape[0], 1)), X_train_scaled])\n",
+ "\n",
+ "print(f\"Condition number (original): {cond(X_train_aug):.2e}\")\n",
+ "print(f\"Condition number (scaled): {cond(X_train_scaled_aug):.2e}\")\n",
+ "\n",
+ "# Fit linear regression on scaled data\n",
+ "lin_reg_scaled = LinearRegression()\n",
+ "lin_reg_scaled.fit(X_train_scaled, y_train)\n",
+ "y_test_pred_scaled = lin_reg_scaled.predict(X_test_scaled)\n",
+ "test_mse_scaled = mean_squared_error(y_test, y_test_pred_scaled)\n",
+ "print(f\"Linear regression (scaled) Test MSE: {test_mse_scaled:.4f}\")\n",
+ "print(f\"Linear regression (original) Test MSE: {test_mse:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "65b450d1",
+ "metadata": {},
+ "source": [
+ "Scaling did not change the linear regression performance because OLS is scaleāinvariant (the coefficients adjust accordingly). However, it improves numerical stability and is crucial for regularised models and gradient descent.\n",
+ "\n",
+ "## Model Interpretation\n",
+ "\n",
+ "Interpretability is important in many applications. Different models offer different levels of insight.\n",
+ "\n",
+ "### Linear Models (Ridge, Lasso)\n",
+ "- Coefficients directly indicate the effect of each feature (assuming features are scaled).\n",
+ "- Sign and magnitude tell us direction and importance.\n",
+ "\n",
+ "### Decision Trees\n",
+ "- We can visualise the tree structure.\n",
+ "- Feature importance based on how much each feature reduces impurity (e.g., variance for regression, Gini for classification).\n",
+ "\n",
+ "### Random Forests\n",
+ "- Aggregate feature importance across all trees.\n",
+ "- Can also use SHAP or LIME for local explanations.\n",
+ "\n",
+ "Let's examine coefficients from a scaled linear model and feature importance from a random forest."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9a7c2009",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Train Ridge on scaled data (with default alpha)\n",
+ "ridge_scaled = Ridge(alpha=1.0)\n",
+ "ridge_scaled.fit(X_train_scaled, y_train)\n",
+ "\n",
+ "# Display coefficients\n",
+ "coef_df = pd.DataFrame({\n",
+ " 'Feature': feature_names,\n",
+ " 'Coefficient': ridge_scaled.coef_\n",
+ "})\n",
+ "print(\"Ridge coefficients (scaled features):\")\n",
+ "print(coef_df.sort_values('Coefficient', key=abs, ascending=False))\n",
+ "\n",
+ "# Random forest feature importance\n",
+ "rf.fit(X_train, y_train) # already fitted earlier, but ensure\n",
+ "importances = rf.feature_importances_\n",
+ "importance_df = pd.DataFrame({\n",
+ " 'Feature': feature_names,\n",
+ " 'Importance': importances\n",
+ "}).sort_values('Importance', ascending=False)\n",
+ "\n",
+ "print(\"\\nRandom Forest Feature Importances:\")\n",
+ "print(importance_df)\n",
+ "\n",
+ "# Plot\n",
+ "plt.figure(figsize=(8,4))\n",
+ "plt.barh(importance_df['Feature'], importance_df['Importance'])\n",
+ "plt.xlabel('Importance')\n",
+ "plt.title('Random Forest Feature Importance')\n",
+ "plt.gca().invert_yaxis()\n",
+ "plt.savefig('../images/RF_feature_importance.png')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a470114c",
+ "metadata": {},
+ "source": [
+ "## Hyperparameter Tuning with Grid Search\n",
+ "\n",
+ "Most models have hyperparameters that are not learned from data (e.g., `alpha` in Ridge, `max_depth` in trees, `n_estimators` in random forests). Tuning them properly is essential for good performance. Choosing hyperparameters is like selecting the optimal basis or regularisation parameter ā it changes the solution space.\n",
+ "\n",
+ "**Grid search** exhaustively tries a predefined set of hyperparameter combinations using crossāvalidation. `sklearn.model_selection.GridSearchCV` does this efficiently.\n",
+ "\n",
+ "Let's tune a random forest regressor on the housing data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b8a04531",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import GridSearchCV\n",
+ "\n",
+ "# Define parameter grid\n",
+ "param_grid = {\n",
+ " 'n_estimators': [50, 100],\n",
+ " 'max_depth': [5, 10, None],\n",
+ " 'min_samples_split': [2, 5]\n",
+ "}\n",
+ "\n",
+ "# Create random forest\n",
+ "rf_tune = RandomForestRegressor(random_state=42, n_jobs=-1)\n",
+ "\n",
+ "# Grid search with 3-fold CV (use a subset of training data for speed)\n",
+ "X_train_subset = X_train[:5000]\n",
+ "y_train_subset = y_train[:5000]\n",
+ "\n",
+ "grid_search = GridSearchCV(rf_tune, param_grid, cv=3, scoring='neg_mean_squared_error', verbose=1)\n",
+ "grid_search.fit(X_train_subset, y_train_subset)\n",
+ "\n",
+ "print(\"Best parameters:\", grid_search.best_params_)\n",
+ "print(\"Best CV MSE:\", -grid_search.best_score_)\n",
+ "\n",
+ "# Evaluate on test set\n",
+ "best_rf = grid_search.best_estimator_\n",
+ "y_test_pred_best_rf = best_rf.predict(X_test)\n",
+ "test_mse_best_rf = mean_squared_error(y_test, y_test_pred_best_rf)\n",
+ "print(f\"Tuned Random Forest Test MSE: {test_mse_best_rf:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f8528fe3",
+ "metadata": {},
+ "source": [
+ "## Summary and Additional Considerations\n",
+ "\n",
+ "We have covered a progression of modelling techniques and essential practices:\n",
+ "\n",
+ "| Method | Linearity | Regularisation | Feature Selection | Scalability |\n",
+ "|--------|-----------|----------------|-------------------|-------------|\n",
+ "| Linear regression | Yes | No | No | Good (closedāform) |\n",
+ "| Polynomial regression | In features | No | No | Poor (exploding dimension) |\n",
+ "| Ridge | Yes | $L^2$ | No (shrinks only) | Good |\n",
+ "| Lasso | Yes | $L^1$ | Yes | Good (via coordinate descent) |\n",
+ "| Logistic regression | Decision boundary linear | Optional | With L1/L2 | Good |\n",
+ "| Gradient descent | Yes (or any differentiable) | Optional | Optional | Excellent (very large data) |\n",
+ "| Decision trees | No | No (but depth limits) | Implicitly | Moderate |\n",
+ "| Random forests | No | No (ensemble reduces variance) | Implicitly | Moderate (parallelisable) |\n",
+ "\n",
+ "> **Biasāvariance tradeoff**: Simple models (linear) have high bias but low variance. Complex models (deep trees) have low bias but high variance. Regularisation and ensembles (random forests) try to balance this.\n",
+ "\n",
+ "**What else could be added?**\n",
+ "- **Support vector machines** (SVM) ā geometric margin classifiers.\n",
+ "- **Neural networks** ā highly flexible nonlinear models.\n",
+ "- **Time series models** (ARIMA, etc.).\n",
+ "- **Model selection criteria** (AIC, BIC)."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.14.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebooks/html/01_least_squares.html b/notebooks/html/01_least_squares.html
new file mode 100644
index 0000000..cebce70
--- /dev/null
+++ b/notebooks/html/01_least_squares.html
@@ -0,0 +1,9027 @@
+
+
+
+
+
+01_least_squares
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[4]:
+
+
array([[-1.0658141e-15],
+ [ 9.0000000e-01]])
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[7]:
+
+
array([[6.16291085e-16],
+ [9.00000000e-01]])
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[9]:
+
+
+
+
+
+
+ |
+Square ft |
+Bedrooms |
+Price |
+
+
+
+
+| 0 |
+1600 |
+3 |
+500 |
+
+
+| 1 |
+2100 |
+4 |
+650 |
+
+
+| 2 |
+1550 |
+2 |
+475 |
+
+
+| 3 |
+1600 |
+3 |
+490 |
+
+
+| 4 |
+2000 |
+4 |
+620 |
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[10]:
+
+
+
+
+
+
+ |
+Square ft |
+Bedrooms |
+Price |
+
+
+
+
+| count |
+5.000000 |
+5.00000 |
+5.000000 |
+
+
+| mean |
+1770.000000 |
+3.20000 |
+547.000000 |
+
+
+| std |
+258.843582 |
+0.83666 |
+81.516869 |
+
+
+| min |
+1550.000000 |
+2.00000 |
+475.000000 |
+
+
+| 25% |
+1600.000000 |
+3.00000 |
+490.000000 |
+
+
+| 50% |
+1600.000000 |
+3.00000 |
+500.000000 |
+
+
+| 75% |
+2000.000000 |
+4.00000 |
+620.000000 |
+
+
+| max |
+2100.000000 |
+4.00000 |
+650.000000 |
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[11]:
+
+
+
+
+
+
+ |
+Square ft |
+Bedrooms |
+Price |
+
+
+
+
+| Square ft |
+1.000000 |
+0.900426 |
+0.998810 |
+
+
+| Bedrooms |
+0.900426 |
+1.000000 |
+0.909066 |
+
+
+| Price |
+0.998810 |
+0.909066 |
+1.000000 |
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[16]:
+
+
array([[4.0098513e-13],
+ [3.0000000e-01],
+ [5.0000000e+00]])
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
Out[21]:
+
+
poly1d([ 3.08080808e-07, -1.78106061e-03, 3.71744949e+00, -2.15530303e+03])
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
diff --git a/notebooks/html/02_qr_svd.html b/notebooks/html/02_qr_svd.html
new file mode 100644
index 0000000..a7bf886
--- /dev/null
+++ b/notebooks/html/02_qr_svd.html
@@ -0,0 +1,8586 @@
+
+
+
+
+
+02_qr_svd
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
QA = [[ 1. 0. 0.]
+ [-0. 1. 0.]
+ [-0. -0. 1.]
+ [-0. -0. -0.]]
+
+RA = [[1. 1. 1.]
+ [0. 1. 1.]
+ [0. 0. 1.]]
+
+QB = [[-0.5 0.8660254 0. ]
+ [-0.5 -0.28867513 0.81649658]
+ [-0.5 -0.28867513 -0.40824829]
+ [-0.5 -0.28867513 -0.40824829]]
+
+RB = [[-2. -1.5 -1. ]
+ [ 0. -0.8660254 -0.57735027]
+ [ 0. 0. -0.81649658]]
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[13]:
+
+
array([[ 1.00000000e+00],
+ [ 6.40987562e-17],
+ [-5.00000000e-01]])
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[14]:
+
+
array([[ 1.00000000e+00],
+ [ 2.22044605e-16],
+ [-5.00000000e-01]])
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Q = [[-0.4472136 0.32838365 0.40496317]
+ [-0.4472136 -0.63745061 -0.22042299]
+ [-0.4472136 0.42496708 -0.7689174 ]
+ [-0.4472136 0.32838365 0.40496317]
+ [-0.4472136 -0.44428376 0.17941406]]
+
+R = [[-2.23606798e+00 -3.95784032e+03 -7.15541753e+00]
+ [ 0.00000000e+00 -5.17687164e+02 -1.50670145e+00]
+ [ 0.00000000e+00 0.00000000e+00 7.27908474e-01]]
+
+beta = [[-3.05053797e-13]
+ [ 3.00000000e-01]
+ [ 5.00000000e+00]]
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[20]:
+
+
array([[1.00000000e+00, 2.89687929e-17, 2.89687929e-17, 2.89687929e-17],
+ [2.89687929e-17, 1.00000000e+00, 7.07349921e-17, 7.07349921e-17],
+ [2.89687929e-17, 7.07349921e-17, 5.00000000e-01, 5.00000000e-01],
+ [2.89687929e-17, 7.07349921e-17, 5.00000000e-01, 5.00000000e-01]])
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
U = [[-0.70710678 -0.70710678]
+ [-0.70710678 0.70710678]]
+
+S = [5. 3.]
+
+Vh.T = [[-7.07106781e-01 -2.35702260e-01 -6.66666667e-01]
+ [-7.07106781e-01 2.35702260e-01 6.66666667e-01]
+ [-6.47932334e-17 -9.42809042e-01 3.33333333e-01]]
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
U_A = [[ 0.73697623 0.59100905 0.32798528 0. ]
+ [ 0.59100905 -0.32798528 -0.73697623 0. ]
+ [ 0.32798528 -0.73697623 0.59100905 0. ]
+ [ 0. 0. 0. 1. ]]
+
+S_A = [2.2469796 0.80193774 0.55495813]
+
+Vh_A.T = [[ 0.32798528 0.73697623 0.59100905]
+ [ 0.59100905 0.32798528 -0.73697623]
+ [ 0.73697623 -0.59100905 0.32798528]]
+
+U_B = [[-2.41816250e-01 7.12015746e-01 -6.59210496e-01 0.00000000e+00]
+ [-4.52990541e-01 5.17957311e-01 7.25616837e-01 6.71536163e-17]
+ [-6.06763739e-01 -3.35226641e-01 -1.39502200e-01 -7.07106781e-01]
+ [-6.06763739e-01 -3.35226641e-01 -1.39502200e-01 7.07106781e-01]]
+
+S_B = [2.8092118 0.88646771 0.56789441]
+
+Vh_B.T = [[-0.67931306 0.63117897 -0.37436195]
+ [-0.59323331 -0.17202654 0.7864357 ]
+ [-0.43198148 -0.75632002 -0.49129626]]
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[27]:
+
+
array([[ 2.74080345e+15],
+ [ 2.74080345e+15],
+ [-2.74080345e+15]])
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[28]:
+
+
array([3.00000000e+00, 1.00000000e+00, 1.21618839e-16])
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
A_pinv=[[ 0.11111111 -0.44444444 0.55555556 0. ]
+ [ 0.11111111 0.55555556 -0.44444444 0. ]
+ [ 0.22222222 0.11111111 0.11111111 0. ]]
+
+beta=[[0.22222222]
+ [0.22222222]
+ [0.44444444]]
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[33]:
+
+
np.float64(4329.082589067693)
+
+
+
+
+
+
+
+
+
diff --git a/notebooks/html/03_some_notes_and_what_goes_wrong.html b/notebooks/html/03_some_notes_and_what_goes_wrong.html
new file mode 100644
index 0000000..43603dc
--- /dev/null
+++ b/notebooks/html/03_some_notes_and_what_goes_wrong.html
@@ -0,0 +1,8000 @@
+
+
+
+
+
+03_some_notes_and_what_goes_wrong
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/notebooks/html/04_pca.html b/notebooks/html/04_pca.html
new file mode 100644
index 0000000..db36e7c
--- /dev/null
+++ b/notebooks/html/04_pca.html
@@ -0,0 +1,8332 @@
+
+
+
+
+
+04_pca
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[3]:
+
+
+
+
+
+
+ |
+Square ft |
+Square m |
+Bedrooms |
+Price |
+
+
+
+
+| count |
+5.000000 |
+5.000000 |
+5.00000 |
+5.000000 |
+
+
+| mean |
+1770.000000 |
+164.000000 |
+3.20000 |
+547.000000 |
+
+
+| std |
+258.843582 |
+24.052027 |
+0.83666 |
+81.516869 |
+
+
+| min |
+1550.000000 |
+144.000000 |
+2.00000 |
+475.000000 |
+
+
+| 25% |
+1600.000000 |
+148.000000 |
+3.00000 |
+490.000000 |
+
+
+| 50% |
+1600.000000 |
+148.000000 |
+3.00000 |
+500.000000 |
+
+
+| 75% |
+2000.000000 |
+185.000000 |
+4.00000 |
+620.000000 |
+
+
+| max |
+2100.000000 |
+195.000000 |
+4.00000 |
+650.000000 |
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[4]:
+
+
+
+
+
+
+ |
+Square ft |
+Square m |
+Bedrooms |
+Price |
+
+
+
+
+| Square ft |
+1.000000 |
+0.999886 |
+0.900426 |
+0.998810 |
+
+
+| Square m |
+0.999886 |
+1.000000 |
+0.894482 |
+0.998395 |
+
+
+| Bedrooms |
+0.900426 |
+0.894482 |
+1.000000 |
+0.909066 |
+
+
+| Price |
+0.998810 |
+0.998395 |
+0.909066 |
+1.000000 |
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[5]:
+
+
np.float64(8222.19067218415)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[7]:
+
+
np.float64(4.999999999999999)
+
+
+
+
+
+
+
+
+
+
+
+
Out[8]:
+
+
array([[-0.70710678],
+ [-0.70710678]])
+
+
+
+
+
+
+
+
+
+
+
+
Out[9]:
+
+
array([[-7.07106781e-01, -7.07106781e-01, -6.47932334e-17]])
+
+
+
+
+
+
+
+
+
+
+
+
Out[10]:
+
+
array([[2.50000000e+00, 2.50000000e+00, 2.29078674e-16],
+ [2.50000000e+00, 2.50000000e+00, 2.29078674e-16]])
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[14]:
+
+
array([1770. , 164. , 3.2, 547. ])
+
+
+
+
+
+
+
+
+
+
+
+
Out[15]:
+
+
array([[-1.70e+02, -1.60e+01, -2.00e-01, -4.70e+01],
+ [ 3.30e+02, 3.10e+01, 8.00e-01, 1.03e+02],
+ [-2.20e+02, -2.00e+01, -1.20e+00, -7.20e+01],
+ [-1.70e+02, -1.60e+01, -2.00e-01, -5.70e+01],
+ [ 2.30e+02, 2.10e+01, 8.00e-01, 7.30e+01]])
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
U = [[-0.32486018 -0.81524197 -0.01735449 -0.17188722 0.4472136 ]
+ [ 0.63705869 0.10707263 -0.3450375 -0.51345964 0.4472136 ]
+ [-0.42643013 0.35553416 -0.61058318 0.34487822 0.4472136 ]
+ [-0.33034709 0.436448 0.61781883 -0.3445052 0.4472136 ]
+ [ 0.44457871 -0.08381281 0.35515633 0.68497384 0.4472136 ]]
+
+S = [5.44828440e+02 7.61035608e+00 8.91429037e-01 2.41987799e-01]
+
+Vh.T = [[ 0.95017495 0.29361033 0.08182661 0.06530651]
+ [ 0.08827897 0.06690917 -0.71081981 -0.69459714]
+ [ 0.00276797 -0.04366082 0.69629997 -0.71641638]
+ [ 0.29894268 -0.95258064 -0.05662119 0.00417714]]
+
+Condition number of X_centered = 2251.4707027583063
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
[[-168.1743765 -15.62476472 -0.48991109 -52.91078079]
+ [ 329.79403078 30.64054254 0.96072753 103.7593243 ]
+ [-220.7553464 -20.50996365 -0.64308544 -69.45373002]
+ [-171.01485494 -15.88866823 -0.49818573 -53.80444804]
+ [ 230.15054706 21.38285405 0.67045472 72.40963456]]
+ k=1: relative Frobenius reconstruction error on centered data = 0.0141
+
+[[-1.69996018e+02 -1.60398881e+01 -2.19027093e-01 -4.70007022e+01]
+ [ 3.30033282e+02 3.06950642e+01 9.25150039e-01 1.02983104e+02]
+ [-2.19960913e+02 -2.03289247e+01 -7.61220318e-01 -7.20311670e+01]
+ [-1.70039621e+02 -1.56664278e+01 -6.43206200e-01 -5.69684681e+01]
+ [ 2.29963269e+02 2.13401763e+01 6.98303572e-01 7.30172337e+01]]
+ k=2: relative Frobenius reconstruction error on centered data = 0.0017
+
+[[-1.69997284e+02 -1.60288915e+01 -2.29799059e-01 -4.69998263e+01]
+ [ 3.30008114e+02 3.09136956e+01 7.10984571e-01 1.03000519e+02]
+ [-2.20005450e+02 -1.99420315e+01 -1.14021052e+00 -7.20003486e+01]
+ [-1.69994556e+02 -1.60579058e+01 -2.59724807e-01 -5.69996518e+01]
+ [ 2.29989175e+02 2.11151332e+01 9.18749820e-01 7.29993076e+01]]
+ k=3: relative Frobenius reconstruction error on centered data = 0.0004
+
+
+
+
+
+
+
+
+
+
+
diff --git a/notebooks/html/05_svd_image_denoising.html b/notebooks/html/05_svd_image_denoising.html
new file mode 100644
index 0000000..64e329c
--- /dev/null
+++ b/notebooks/html/05_svd_image_denoising.html
@@ -0,0 +1,8530 @@
+
+
+
+
+
+05_svd_image_denoising
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
scikit-image available: True
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Using image: ../images/bella.jpg
+Image shape: (3456, 5184, 3)
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Saved noisy image to: ../images/bella_noisy.png
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Saved comparison figure to: ../images/bella_truncated_svd_multiple_ks.png
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Noisy image baseline -> MSE: 1426.31, PSNR: 16.59 dB
+
+Rank-k reconstructions:
+k = 5 | MSE = 314.20 | PSNR = 23.16 dB
+k = 20 | MSE = 120.90 | PSNR = 27.31 dB
+k = 50 | MSE = 104.98 | PSNR = 27.92 dB
+k = 100 | MSE = 155.79 | PSNR = 26.21 dB
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Baseline noisy vs clean:
+ MSE : 1426.31
+ PSNR: 16.59
+ SSIM: 0.0674
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Plain SVD:
+ Best by MSE : (41, np.float64(100.86643831213188), np.float64(28.093336751115086), 0.6093116647651357)
+ Best by PSNR: (41, np.float64(100.86643831213188), np.float64(28.093336751115086), 0.6093116647651357)
+Centered SVD:
+ Best by MSE : (36, np.float64(100.77445519633338), np.float64(28.097299019055427), 0.6300189753895039)
+ Best by PSNR: (36, np.float64(100.77445519633338), np.float64(28.097299019055427), 0.6300189753895039)
+Plain SVD:
+ Best by SSIM: (1, np.float64(1048.5582022585174), np.float64(17.924878190392008), 0.7729763808220854)
+Centered SVD:
+ Best by SSIM: (1, np.float64(891.0525380576963), np.float64(18.631770492935846), 0.7731825777973053)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+

+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Plain SVD ranks to inspect : [31, 36, 41, 46, 51]
+Centered SVD ranks to inspect: [26, 31, 36, 41, 46]
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
diff --git a/notebooks/html/06_modelling_101.html b/notebooks/html/06_modelling_101.html
new file mode 100644
index 0000000..5502c59
--- /dev/null
+++ b/notebooks/html/06_modelling_101.html
@@ -0,0 +1,10539 @@
+
+
+
+
+
+06_modelling_101
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Data shape: (20640, 9)
+
+
+
+
+
Out[1]:
+
+
+
+
+
+
+ |
+MedInc |
+HouseAge |
+AveRooms |
+AveBedrms |
+Population |
+AveOccup |
+Latitude |
+Longitude |
+MedHouseVal |
+
+
+
+
+| 0 |
+8.3252 |
+41.0 |
+6.984127 |
+1.023810 |
+322.0 |
+2.555556 |
+37.88 |
+-122.23 |
+4.526 |
+
+
+| 1 |
+8.3014 |
+21.0 |
+6.238137 |
+0.971880 |
+2401.0 |
+2.109842 |
+37.86 |
+-122.22 |
+3.585 |
+
+
+| 2 |
+7.2574 |
+52.0 |
+8.288136 |
+1.073446 |
+496.0 |
+2.802260 |
+37.85 |
+-122.24 |
+3.521 |
+
+
+| 3 |
+5.6431 |
+52.0 |
+5.817352 |
+1.073059 |
+558.0 |
+2.547945 |
+37.85 |
+-122.25 |
+3.413 |
+
+
+| 4 |
+3.8462 |
+52.0 |
+6.281853 |
+1.081081 |
+565.0 |
+2.181467 |
+37.85 |
+-122.25 |
+3.422 |
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Out[2]:
+
+
+
+
+
+
+ |
+MedInc |
+HouseAge |
+AveRooms |
+AveBedrms |
+Population |
+AveOccup |
+Latitude |
+Longitude |
+MedHouseVal |
+
+
+
+
+| count |
+20640.000000 |
+20640.000000 |
+20640.000000 |
+20640.000000 |
+20640.000000 |
+20640.000000 |
+20640.000000 |
+20640.000000 |
+20640.000000 |
+
+
+| mean |
+3.870671 |
+28.639486 |
+5.429000 |
+1.096675 |
+1425.476744 |
+3.070655 |
+35.631861 |
+-119.569704 |
+2.068558 |
+
+
+| std |
+1.899822 |
+12.585558 |
+2.474173 |
+0.473911 |
+1132.462122 |
+10.386050 |
+2.135952 |
+2.003532 |
+1.153956 |
+
+
+| min |
+0.499900 |
+1.000000 |
+0.846154 |
+0.333333 |
+3.000000 |
+0.692308 |
+32.540000 |
+-124.350000 |
+0.149990 |
+
+
+| 25% |
+2.563400 |
+18.000000 |
+4.440716 |
+1.006079 |
+787.000000 |
+2.429741 |
+33.930000 |
+-121.800000 |
+1.196000 |
+
+
+| 50% |
+3.534800 |
+29.000000 |
+5.229129 |
+1.048780 |
+1166.000000 |
+2.818116 |
+34.260000 |
+-118.490000 |
+1.797000 |
+
+
+| 75% |
+4.743250 |
+37.000000 |
+6.052381 |
+1.099526 |
+1725.000000 |
+3.282261 |
+37.710000 |
+-118.010000 |
+2.647250 |
+
+
+| max |
+15.000100 |
+52.000000 |
+141.909091 |
+34.066667 |
+35682.000000 |
+1243.333333 |
+41.950000 |
+-114.310000 |
+5.000010 |
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
Total samples: 50
+Training samples: 35 (70%)
+Test samples: 15 (30%)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Training set size: 16512
+Test set size: 4128
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Train MSE: 0.5229, Train R²: 0.6079
+Test MSE: 0.5381, Test R²: 0.5931
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Original training features: 8
+Polynomial training features: 44
+Condition number of polynomial design matrix: 1.55e+11
+Condition number of polynomial design matrix: 1.55e+11
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Polynomial (deg=2) Train MSE: 0.4217
+Polynomial (deg=2) Test MSE: 0.4669
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=5.61091e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.8355e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=6.12863e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=6.2106e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=6.54461e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=5.8756e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.92268e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=6.38803e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=6.47667e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=6.81721e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=6.42347e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=2.1031e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=6.92486e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=7.02729e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=7.38133e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=7.55789e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=2.47656e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=8.03616e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=8.16704e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=8.54886e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=9.90852e-21): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=3.24995e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.03379e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.05273e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.09661e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.47609e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=4.8532e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.51111e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.54201e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.59756e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=2.48243e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=8.18458e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=2.5036e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=2.55854e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=2.63843e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=4.58073e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.5141e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=4.57846e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=4.68043e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=4.81413e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=8.99842e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=2.97942e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=8.95471e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=9.14672e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=9.41068e-20): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.84197e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=6.10261e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.82904e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.8654e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.92511e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=4.2525e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.40749e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=4.23705e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=4.31565e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=4.36802e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=8.62957e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=2.85107e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=8.62987e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=8.72841e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=8.85718e-19): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.79172e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=5.9137e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.79522e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.81032e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.83658e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=3.74634e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.23675e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=3.75365e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=3.78706e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=3.837e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=7.84044e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=2.6169e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=7.83226e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=7.97521e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=8.10899e-18): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.62385e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=5.40767e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.62848e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.64981e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=1.67422e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=3.37154e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=3.37402e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=3.42091e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=3.46597e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=7.00456e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=7.00113e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=7.10119e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+/usr/lib64/python3.14/site-packages/scipy/_lib/_util.py:1233: LinAlgWarning: Ill-conditioned matrix (rcond=7.18479e-17): result may not be accurate.
+ return f(*arrays, *other_args, **kwargs)
+
+
+
+
+
+
+
Best alpha from CV: 1000.0000
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Best alpha from CV: 233.5721
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
Ridge (poly deg=2) Test MSE: 0.4791
+Ridge improved over plain polynomial (MSE 0.4669 ā 0.4791)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
Singular values: [182.41 176.4 144.85 130.91 128.68 104.03 37.56 28.16]
+
+Condition number (OLS): 6.48
+Effective condition number (Ī»=0.1): 41.96
+Effective condition number (Ī»=1): 41.91
+Effective condition number (Ī»=10): 41.45
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:716: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.725e+03, tolerance: 2.202e+00
+ model = cd_fast.enet_coordinate_descent(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.501e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.286e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+
+
+
+
+
+
+
Number of non-zero coefficients: 33 out of 44
+Lasso (poly deg=2) Test MSE: 0.4538
+
+
+
+
+
+
+
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.247e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.192e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.136e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.106e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.047e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.229e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.927e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.995e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.035e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.017e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.002e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.989e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.977e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.966e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.957e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.949e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.941e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.933e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.926e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.920e+03, tolerance: 1.774e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.902e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.045e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.033e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.987e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.939e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.892e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.049e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.055e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.031e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.009e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.990e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.975e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.964e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.953e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.944e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.936e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.929e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.921e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.913e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.906e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.900e+03, tolerance: 1.759e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.612e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.935e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.917e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.900e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.885e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.871e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.857e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.845e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.834e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.826e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.816e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.806e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.798e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.791e+03, tolerance: 1.753e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.694e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.248e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.219e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.166e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.117e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.075e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.039e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.133e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.112e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.090e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.071e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.056e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.044e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.033e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.022e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.012e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.002e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.991e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.981e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.972e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.965e+03, tolerance: 1.771e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.153e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.201e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.168e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.116e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.067e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.026e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.987e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.057e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.040e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.016e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.995e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.979e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.965e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.952e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.940e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.930e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.922e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.913e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.904e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.896e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:701: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.889e+03, tolerance: 1.752e+00
+ model = cd_fast.enet_coordinate_descent_gram(
+
+
+
+
+
+
+
Best alpha from LassoCV: 0.0067
+Number of non-zero coefficients (CV best): 34
+LassoCV Test MSE: 0.4587
+
+
+
+
+
+
+
/home/pks/.local/lib/python3.14/site-packages/sklearn/linear_model/_coordinate_descent.py:716: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.709e+03, tolerance: 2.202e+00
+ model = cd_fast.enet_coordinate_descent(
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Number of non-zero coefficients: 15 out of 44
+Lasso Test MSE: 0.5347
+Best alpha from LassoCV: 0.0067
+Number of non-zero coefficients (CV best): 16
+LassoCV Test MSE: 0.5305
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
Best number of components: 8
+Variance explained: 100.00%
+
+Model Comparison (Test MSE):
+ OLS (all features): 0.5381
+ PCR (k=8): 0.5272
+ Ridge (Ī»=233.57): 0.4791
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Eigenvalue range: [5.59e-02, 3.07e+09]
+Condition number: 5.49e+10
+Max stable learning rate: 6.52e-10
+
+
+
+
+
+
+

+
+
+
+
+
+
+Scaling reduced condition number from 5.5e+10 to 42.2
+This allows much larger learning rates and faster convergence.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
/usr/lib64/python3.14/site-packages/numpy/linalg/_linalg.py:2792: RuntimeWarning: overflow encountered in dot
+ sqnorm = x.dot(x)
+/tmp/ipykernel_66479/3132444904.py:7: RuntimeWarning: overflow encountered in matmul
+ grad = - (1/n) * X.T @ (y - X @ beta)
+/tmp/ipykernel_66479/3132444904.py:8: RuntimeWarning: invalid value encountered in subtract
+ beta -= learning_rate * grad
+
+
+
+
+
+
+

+
+
+
+
+
+
Difference between GD and closed-form: nan
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Iter 0: loss = 2.758919
+Iter 200: loss = 0.234124
+Iter 400: loss = 0.230774
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Coefficients: [ 3.97676073e+09 -1.14418633e+10 -1.78357850e+10 1.01065426e+11
+ -1.80378121e+10 -3.02815983e+09 -5.43520408e+10 -4.51215845e+10]
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Decision Tree Test MSE: 0.3961
+Random Forest Test MSE: 0.2752
+Ridge (poly) Test MSE: 0.4791
+LassoCV Test MSE: 0.5305
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Logistic Regression Accuracy: 0.8324
+
+Classification Report:
+ precision recall f1-score support
+
+ 0 0.83 0.84 0.83 2083
+ 1 0.83 0.83 0.83 2045
+
+ accuracy 0.83 4128
+ macro avg 0.83 0.83 0.83 4128
+weighted avg 0.83 0.83 0.83 4128
+
+
+Logistic Regression Coefficients (scaled features):
+ Feature Coefficient
+6 Latitude -3.532385
+7 Longitude -3.328767
+5 AveOccup -3.094635
+0 MedInc 2.512414
+3 AveBedrms 0.899888
+2 AveRooms -0.786384
+1 HouseAge 0.276838
+4 Population 0.062450
+ precision recall f1-score support
+
+ 0 0.83 0.84 0.83 2083
+ 1 0.83 0.83 0.83 2045
+
+ accuracy 0.83 4128
+ macro avg 0.83 0.83 0.83 4128
+weighted avg 0.83 0.83 0.83 4128
+
+
+Logistic Regression Coefficients (scaled features):
+ Feature Coefficient
+6 Latitude -3.532385
+7 Longitude -3.328767
+5 AveOccup -3.094635
+0 MedInc 2.512414
+3 AveBedrms 0.899888
+2 AveRooms -0.786384
+1 HouseAge 0.276838
+4 Population 0.062450
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
5-fold CV R² scores: [0.60709214 0.59544452 0.58112984 0.63060861 0.61005689]
+Mean R²: 0.6049 (+/- 0.0328)
+Shuffled CV R² scores: [0.60563739 0.59602593 0.5917264 0.61941109 0.62268184]
+Mean R² (shuffled): 0.6071
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Condition number (original): 2.40e+05
+Condition number (scaled): 6.48e+00
+Linear regression (scaled) Test MSE: 0.5381
+Linear regression (original) Test MSE: 0.5381
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Ridge coefficients (scaled features):
+ Feature Coefficient
+6 Latitude -0.896656
+7 Longitude -0.870257
+0 MedInc 0.848402
+3 AveBedrms 0.332536
+2 AveRooms -0.287161
+1 HouseAge 0.125807
+5 AveOccup -0.040522
+4 Population -0.002190
+
+Random Forest Feature Importances:
+ Feature Importance
+0 MedInc 0.589486
+5 AveOccup 0.137379
+6 Latitude 0.078123
+7 Longitude 0.077486
+1 HouseAge 0.047525
+2 AveRooms 0.034377
+4 Population 0.018634
+3 AveBedrms 0.016990
+
+
+
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Fitting 3 folds for each of 12 candidates, totalling 36 fits
+Best parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}
+Best CV MSE: 0.3210370034255883
+Tuned Random Forest Test MSE: 0.2928
+
+
+
+
+
+
+
+
+
+
diff --git a/pictures/bella_grayscale.jpg b/pictures/bella_grayscale.jpg
deleted file mode 100644
index 492c965..0000000
Binary files a/pictures/bella_grayscale.jpg and /dev/null differ
diff --git a/pictures/bella_grayscale_noisy.jpg b/pictures/bella_grayscale_noisy.jpg
deleted file mode 100644
index 1afc096..0000000
Binary files a/pictures/bella_grayscale_noisy.jpg and /dev/null differ
diff --git a/pictures/bella_grayscale_svd.jpg b/pictures/bella_grayscale_svd.jpg
deleted file mode 100644
index bfff369..0000000
Binary files a/pictures/bella_grayscale_svd.jpg and /dev/null differ
diff --git a/pictures/bella_grayscale_svd_200_500_1000_3000.jpg b/pictures/bella_grayscale_svd_200_500_1000_3000.jpg
deleted file mode 100644
index 1ffc616..0000000
Binary files a/pictures/bella_grayscale_svd_200_500_1000_3000.jpg and /dev/null differ
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..0106600
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,5 @@
+pandas
+numpy
+matplotlib
+pillow
+scikit-image
\ No newline at end of file