# Data Science for the Linear Algebraist

A practical, linear-algebra-first introduction to data science.

This repository demonstrates how core linear algebra concepts -- least squares, matrix decompositions, and spectral methods -- directly power modern data science and machine learning workflows. It finishes with a mini-project on image denoising using the truncated SVD.

Rather than treating data science as a collection of tools, this project builds everything from first principles and connects theory to implementation through Jupyter notebooks.

## Structure

This project is organized as a collection of focused notebooks:

```text
images/            # saved images/visualizations
notebooks/         # jupyter notebooks containing theory, code, visuals
bibliography.md    # references for essentially everything
requirements.txt   # python requirements
LICENSE            # project license
```

Each notebook is self-contained and moves from theory to implementation to visualization.

## Dependencies

* **Python 3**
* **NumPy** -- linear algebra
* **Pandas** -- data handling
* **Matplotlib** -- visualization
* **Pillow** -- imaging library

## How to Run

```bash
git clone https://gitlab.com/psark/ds-for-la.git
cd ds-for-la

pip install -r requirements.txt

jupyter notebook
```

Open any notebook inside the `notebooks/` folder.

---

## Topics

### 1. Least Squares Regression

* Overdetermined systems
* Normal equations
* Geometric interpretation (projection onto column space)
* Implementation using NumPy

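
The ideas listed for this topic can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notebook's own code: the synthetic data and the names `X`, `y`, and `beta_true` are assumptions made for the example.

```python
import numpy as np

# Synthetic overdetermined system (assumed data): 50 equations, 2 unknowns.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept column
beta_true = np.array([1.0, 2.0])           # hypothetical ground truth
y = X @ beta_true + 0.05 * rng.standard_normal(x.size)

# Normal equations: solve (X^T X) beta = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# NumPy's least-squares solver, for comparison.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Geometric interpretation: X @ beta is the projection of y onto the column
# space of X, so the residual is orthogonal to every column of X.
residual = y - X @ beta_normal

print("normal equations:", beta_normal)
print("np.linalg.lstsq: ", beta_lstsq)
print("X^T residual:    ", X.T @ residual)
```

Both routes should agree on a small, well-conditioned problem like this one; the difference between them only shows up when `X` is ill-conditioned.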
### 2. QR Decomposition & SVD

* Numerical stability vs. normal equations
* Orthogonal bases and conditioning
* Solving linear systems without forming $X^T X$

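
As a rough sketch of why one might avoid forming $X^T X$, the snippet below compares condition numbers and solves the same least-squares problem via QR and via the SVD. The Vandermonde test matrix and all variable names are assumptions chosen for illustration, not taken from the notebooks.

```python
import numpy as np

# Ill-conditioned example (assumed): monomial basis 1, x, ..., x^7 on [0, 1].
x = np.linspace(0.0, 1.0, 30)
X = np.vander(x, 8, increasing=True)
beta_true = np.ones(8)        # hypothetical coefficients
y = X @ beta_true             # noise-free right-hand side

# Forming X^T X roughly squares the condition number.
cond_X = np.linalg.cond(X)
cond_gram = np.linalg.cond(X.T @ X)

# QR route: X = QR with orthonormal Q, then solve R beta = Q^T y.
Q, R = np.linalg.qr(X)        # reduced QR: Q is 30x8, R is 8x8 upper triangular
beta_qr = np.linalg.solve(R, Q.T @ y)

# SVD route: X = U diag(s) V^T, so beta = V diag(1/s) U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y) / s)

print(f"cond(X)     = {cond_X:.2e}")
print(f"cond(X^T X) = {cond_gram:.2e}")
print("QR  error:", np.max(np.abs(beta_qr - beta_true)))
print("SVD error:", np.max(np.abs(beta_svd - beta_true)))
```

`np.linalg.solve` is used on `R` for simplicity; a dedicated triangular solver would exploit the structure but would require a dependency beyond those listed above.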
### 3. Some Notes & What Can Go Wrong

* Other vector norms ($L^1$, $L^\infty$) and matrix norms (Frobenius, operator)
* What can go wrong?

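
For reference, the norms mentioned above can all be computed with `np.linalg.norm`. The vector and matrix below are arbitrary examples chosen for this sketch.

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])
l1 = np.linalg.norm(v, 1)          # sum of absolute values
l2 = np.linalg.norm(v)             # Euclidean norm (the default)
linf = np.linalg.norm(v, np.inf)   # largest absolute entry

A = np.array([[1.0, 2.0], [3.0, 4.0]])
fro = np.linalg.norm(A, "fro")     # square root of the sum of squared entries
op = np.linalg.norm(A, 2)          # operator (spectral) norm

# Both matrix norms can be read off the singular values:
# operator norm = largest singular value, Frobenius norm = sqrt(sum of squares).
s = np.linalg.svd(A, compute_uv=False)

print(l1, l2, linf)
print(fro, np.sqrt(np.sum(s**2)))
print(op, s[0])
```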
### 4. Principal Component Analysis (PCA)

* Dimensionality reduction via spectral methods
* Relationship between covariance matrices and eigenvectors
* Handling correlated features

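
The link between the covariance matrix, its eigenvectors, and the SVD of the centered data can be checked numerically. The sketch below uses assumed synthetic data with two strongly correlated features; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated features (assumed synthetic data).
t = rng.standard_normal(200)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.standard_normal(200)])

# Center the data and form the sample covariance matrix.
Xc = X - X.mean(axis=0)
n = Xc.shape[0]
C = Xc.T @ Xc / (n - 1)

# Eigendecomposition of the symmetric covariance matrix.
# eigh returns eigenvalues in ascending order, so reorder to descending.
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# The same quantities from the SVD of the centered data:
# right singular vectors are the principal directions, s^2/(n-1) the variances.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
evals_svd = s**2 / (n - 1)

# Dimensionality reduction: project onto the first principal component.
scores = Xc @ evecs[:, :1]
explained = evals / evals.sum()

print("explained variance ratio:", explained)
```

Because the two features are nearly collinear, almost all of the variance should land on the first component.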
### 5. Project: Spectral Image Denoising via Truncated SVD

* Low-rank approximation of images
* Noise removal using singular value truncation
* RGB images (channel-wise SVD)
* Quantitative evaluation (MSE, PSNR)

---

## Example: Image Denoising via SVD

Given an image matrix $A$ (greyscale, for simplicity), we compute its singular value decomposition:

$$
A = U \Sigma V^T
$$

We approximate the image using only the top $k$ singular values:

$$
A_k = U_k \Sigma_k V_k^T
$$

This low-rank approximation provides:

* **Noise reduction**
* **Compression**

It is a direct application of the **Eckart–Young–Mirsky theorem**: $A_k$ is the best rank-$k$ approximation of $A$ in both the Frobenius and operator norms.

For color images, this is applied independently to each channel (R, G, B).

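
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project notebook: the image is a synthetic rank-2 array rather than a file loaded with Pillow, and the helper names `truncated_svd`, `denoise_rgb`, `mse`, and `psnr` are hypothetical.

```python
import numpy as np


def truncated_svd(A, k):
    """Return the rank-k approximation A_k = U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def denoise_rgb(img, k):
    """Apply the truncated SVD independently to each color channel."""
    channels = [truncated_svd(img[..., c], k) for c in range(img.shape[-1])]
    return np.stack(channels, axis=-1)


def mse(reference, estimate):
    """Mean squared error between two images of the same shape."""
    return float(np.mean((reference - estimate) ** 2))


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    return 10.0 * np.log10(peak**2 / mse(reference, estimate))


# Synthetic greyscale "image" of rank 2 with values in [0, 1] (assumed test data).
grid = np.linspace(0.0, 1.0, 64)
clean = 0.5 + 0.25 * np.outer(np.sin(2 * np.pi * grid), np.cos(2 * np.pi * grid))

# Add Gaussian noise, then keep only the top k singular values.
rng = np.random.default_rng(0)
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

k = 2
denoised = truncated_svd(noisy, k)

print(f"PSNR noisy:    {psnr(clean, noisy):.1f} dB")
print(f"PSNR denoised: {psnr(clean, denoised):.1f} dB")
```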

---

## Key Takeaways

* Data science problems can be framed as:

  > *approximate solutions to linear systems*

* Numerical linear algebra is necessary; it determines:

  * stability
  * performance
  * model reliability

* Spectral methods (SVD, PCA) provide:

  * structure
  * compression
  * interpretability

---

## Purpose

This project is part of a broader effort to translate a background in pure mathematics into practical data science and machine learning skills.

## Future Work

* [ ] Add regularization (Ridge, Lasso)
* [ ] Extend PCA to real datasets
* [ ] Compare SVD vs. autoencoders for compression
* [ ] Add performance benchmarks (QR vs. SVD vs. normal equations)

---

## License

This project is licensed under the MIT License.
See the [`LICENSE`](./LICENSE) file for details.