# Data Science for the Linear Algebraist
A practical, linear-algebra-first introduction to data science.
This repository demonstrates how core linear algebra concepts -- least squares, matrix decompositions, and spectral methods -- directly power modern data science and machine learning workflows. It culminates in a mini-project on image denoising with the truncated SVD.
Rather than treating data science as a collection of tools, this project builds everything from first principles and connects theory to implementation through Jupyter notebooks.
## Structure
This project is organized as a collection of focused notebooks:
```text
images/ # saved images/visualizations
notebooks/ # jupyter notebooks containing theory, code, visuals
bibliography.md # references for essentially everything
requirements.txt # python requirements
LICENSE # project license
```
Each notebook is self-contained and moves from theory to implementation to visualization.
## Dependencies
* **Python 3**
* **NumPy** -- linear algebra
* **Pandas** -- data handling
* **Matplotlib** -- visualization
* **Pillow** -- imaging library
## How to Run
```bash
git clone https://gitlab.com/psark/ds-for-la.git
cd ds-for-la
pip install -r requirements.txt
jupyter notebook
```
Open any notebook inside the `notebooks/` folder.
---
## Topics
### 1. Least Squares Regression
* Overdetermined systems
* Normal equations
* Geometric interpretation (projection onto column space)
* Implementation using NumPy
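
The normal-equations approach above can be sketched in a few lines of NumPy. This is an illustrative example on synthetic data, not code from the notebooks; the variable names are hypothetical:

```python
import numpy as np

# Hypothetical overdetermined system: 100 samples, 3 features
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(100)

# Normal equations: solve (X^T X) beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
```

Geometrically, `X @ beta` is the orthogonal projection of `y` onto the column space of `X`.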
### 2. QR Decomposition & SVD
* Numerical stability vs. normal equations
* Orthogonal bases and conditioning
* Solving linear systems without forming $X^T X$
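
As a sketch of the QR route (again on synthetic data with illustrative names), the least-squares problem is solved without ever forming $X^T X$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)

# Reduced QR factorization: X = QR with Q (50x4) orthonormal, R (4x4) upper-triangular
Q, R = np.linalg.qr(X)
# Least-squares solution from R beta = Q^T y
beta_qr = np.linalg.solve(R, Q.T @ y)

# Cross-check against NumPy's SVD-based solver
beta_ref = np.linalg.lstsq(X, y, rcond=None)[0]
```

Avoiding $X^T X$ matters because forming it squares the condition number of the problem.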
### 3. Some Notes & What Can Go Wrong
* Other vector norms ($L^1, L^\infty$), as well as matrix norms (Frobenius, operator)
* What can go wrong?
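
All of the norms listed above are available through `np.linalg.norm`; a minimal sketch with hand-checkable values:

```python
import numpy as np

v = np.array([3.0, -4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

l1 = np.linalg.norm(v, 1)         # L^1 norm: |3| + |-4| = 7.0
linf = np.linalg.norm(v, np.inf)  # L^inf norm: max(|3|, |-4|) = 4.0
fro = np.linalg.norm(A, 'fro')    # Frobenius norm: sqrt(1 + 4 + 9 + 16)
op = np.linalg.norm(A, 2)         # operator (spectral) norm: largest singular value
```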
### 4. Principal Component Analysis (PCA)
* Dimensionality reduction via spectral methods
* Relationship between covariance matrices and eigenvectors
* Handling correlated features
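
The covariance/eigenvector relationship can be sketched directly (synthetic correlated data, illustrative names only):

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated 2-D data: the second feature is a noisy copy of the first
x = rng.standard_normal(200)
data = np.column_stack([x, x + 0.1 * rng.standard_normal(200)])

# Center the data, then eigendecompose the covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order

# Principal components: eigenvectors sorted by decreasing eigenvalue
components = eigvecs[:, ::-1]
scores = centered @ components
```

For strongly correlated features like these, nearly all the variance lands on the first principal component.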
### 5. Project: Spectral Image Denoising via Truncated SVD
* Low-rank approximation of images
* Noise removal using singular value truncation
* RGB images (channel-wise SVD)
* Quantitative evaluation (MSE, PSNR)
---
## Example: Image Denoising via SVD
Given an image matrix $A$ (for simplicity, let's go with greyscale), we compute its singular value decomposition:
$$
A = U \Sigma V^T
$$
We approximate the image using only the top $k$ singular values:
$$
A_k = U_k \Sigma_k V_k^T
$$
This produces:
* **Noise reduction**
* **Compression**
* A direct application of the **Eckart--Young--Mirsky theorem**
For color images, this is applied independently to each channel (R, G, B).
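
A minimal sketch of the greyscale case on a synthetic low-rank "image" (the helper name `truncated_svd_denoise` and the test matrix are illustrative, not from the notebooks):

```python
import numpy as np

def truncated_svd_denoise(A, k):
    """Rank-k approximation A_k = U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

rng = np.random.default_rng(3)
# A rank-1 "image" (outer product of two gradients) plus additive noise
clean = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
noisy = clean + 0.05 * rng.standard_normal((64, 64))

denoised = truncated_svd_denoise(noisy, k=1)
mse = np.mean((denoised - clean) ** 2)
```

Because the clean signal is exactly rank 1 here, keeping a single singular value discards most of the noise while preserving the structure.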
---
## Key Takeaways
* Data science problems can be framed as:
> *approximate solutions to linear systems*
* Numerical linear algebra matters in practice; it determines:
* stability
* performance
* model reliability
* Spectral methods (SVD, PCA) provide:
* structure
* compression
* interpretability
---
## Purpose
This project is part of a broader effort to translate a background in pure mathematics into practical data science and machine learning skills.
## Future Work
* [ ] Add regularization (Ridge, Lasso)
* [ ] Extend PCA to real datasets
* [ ] Compare SVD vs. autoencoders for compression
* [ ] Add performance benchmarks (QR vs SVD vs normal equations)
---
## License
This project is licensed under the MIT License.
See the [`LICENSE`](./LICENSE) file for details.