Pawel Sarkowicz 290e8bc3e6 link to website
2026-03-31 16:05:58 -04:00

Data Science for the Linear Algebraist

A practical, linear-algebra-first introduction to data science.

This repository demonstrates how core linear algebra concepts -- least squares, matrix decompositions, and spectral methods -- directly power modern data science and machine learning workflows. We finish off with a mini-project involving image denoising using the truncated SVD.

Rather than treating data science as a collection of tools, this project builds everything from first principles and connects theory to implementation through Jupyter notebooks.

The compiled notebooks in this project can be viewed as a single webpage on my website.

Structure

This project is organized as a collection of focused notebooks:

images/           # saved images/visualizations
notebooks/        # jupyter notebooks containing theory, code, visuals
bibliography.md   # references for essentially everything
requirements.txt  # python requirements
LICENSE           # project license

Each notebook is self-contained and moves from theory to implementation to visualization.

Dependencies

  • Python 3
  • NumPy -- linear algebra
  • Pandas -- data handling
  • Matplotlib -- visualization
  • Pillow -- imaging library

How to Run

git clone https://gitlab.com/psark/ds-for-la.git
cd ds-for-la

pip install -r requirements.txt

jupyter notebook

Open any notebook inside the notebooks/ folder.


Topics

1. Least Squares Regression

  • Overdetermined systems
  • Normal equations
  • Geometric interpretation (projection onto column space)
  • Implementation using NumPy
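
As a rough illustration of the ideas listed above (not the notebook's actual code), the normal equations can be solved directly in NumPy and compared against `np.linalg.lstsq`. The synthetic data and the names `X`, `y`, `beta_normal`, and `beta_lstsq` are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic overdetermined system: 50 equations, 2 unknowns (intercept, slope).
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(x.size)

# Normal equations: solve (X^T X) beta = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's SVD-based least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Geometric interpretation: X @ beta is the orthogonal projection of y onto
# the column space of X, so the residual is orthogonal to every column of X.
residual = y - X @ beta_normal
print(beta_normal, beta_lstsq)
print(X.T @ residual)  # entries should be numerically close to zero
```

For a well-conditioned `X` like this one, both approaches should agree to near machine precision.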

2. QR Decomposition & SVD

  • Numerical stability vs. normal equations
  • Orthogonal bases and conditioning
  • Solving linear systems without forming X^T X
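
A minimal sketch of the QR route, assuming synthetic data and illustrative names (`beta_qr`, `cond_X`, `cond_gram`) that do not come from the notebooks. It also shows why forming X^T X can hurt: in the 2-norm, the condition number of X^T X is the square of that of X.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression problem with a known coefficient vector.
X = rng.standard_normal((100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(100)

# Reduced QR: X = Q R, where Q (100x3) has orthonormal columns
# and R (3x3) is upper triangular.
Q, R = np.linalg.qr(X, mode="reduced")

# Least squares reduces to the small system R beta = Q^T y;
# X^T X is never formed.
beta_qr = np.linalg.solve(R, Q.T @ y)

# Conditioning: cond(X^T X) equals cond(X)^2 in the 2-norm.
cond_X = np.linalg.cond(X)
cond_gram = np.linalg.cond(X.T @ X)
print(beta_qr)
print(cond_X, cond_gram)
```

`np.linalg.solve` is used here only to stay within the listed dependencies; it does not exploit the triangular structure of `R`. A dedicated back-substitution routine, such as `scipy.linalg.solve_triangular` if SciPy is available, would.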

3. Some Notes & What Can Go Wrong

  • Other vector norms (L^1, L^\infty), as well as matrix norms (Frobenius, Operator)
  • What can go wrong?
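
For reference, the vector and matrix norms mentioned above can all be computed with `np.linalg.norm`. The example vector `v` and matrix `A` below are arbitrary choices for illustration.

```python
import numpy as np

v = np.array([3.0, -4.0, 1.0])
l1 = np.linalg.norm(v, 1)          # sum of absolute values
l2 = np.linalg.norm(v)             # Euclidean norm (the default)
linf = np.linalg.norm(v, np.inf)   # largest absolute entry

A = np.array([[1.0, 2.0], [3.0, 4.0]])
fro = np.linalg.norm(A, "fro")     # square root of the sum of squared entries
op = np.linalg.norm(A, 2)          # operator (spectral) norm

# Both matrix norms can be read off the singular values:
# the operator norm is the largest one, and the Frobenius norm is
# the Euclidean norm of the vector of singular values.
sigma = np.linalg.svd(A, compute_uv=False)
print(l1, l2, linf)
print(fro, op, sigma)
```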

4. Principal Component Analysis (PCA)

  • Dimensionality reduction via spectral methods
  • Relationship between covariance matrices and eigenvectors
  • Handling correlated features
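
The link between the covariance matrix and its eigenvectors can be sketched as follows. This is an illustrative example on synthetic, strongly correlated data rather than the notebook's implementation, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two strongly correlated synthetic features.
t = rng.standard_normal(500)
data = np.column_stack([t, 2.0 * t + 0.1 * rng.standard_normal(500)])

# Center the data and form the sample covariance matrix.
centered = data - data.mean(axis=0)
n = centered.shape[0]
cov = centered.T @ centered / (n - 1)

# eigh is meant for symmetric matrices and returns eigenvalues in
# ascending order, so reorder to put the largest variance first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Dimensionality reduction: project onto the first principal component.
scores = centered @ eigvecs[:, :1]
explained = eigvals / eigvals.sum()

# The same spectrum comes from the SVD of the centered data:
# squared singular values divided by (n - 1) are the covariance eigenvalues.
s = np.linalg.svd(centered, compute_uv=False)
print(explained)
```

Because the two features are nearly collinear, almost all of the variance should land on the first component.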

5. Project: Spectral Image Denoising via Truncated SVD

  • Low-rank approximation of images
  • Noise removal using singular value truncation
  • RGB images (channel-wise SVD)
  • Quantitative evaluation (MSE, PSNR)
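
The two evaluation metrics can be written in a few lines. The helper names `mse` and `psnr` and the default peak value of 255 (8-bit images) are assumptions for this sketch and may differ from what the notebook uses.

```python
import numpy as np

def mse(reference, estimate):
    """Mean squared error between two images of the same shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    return float(np.mean((reference - estimate) ** 2))

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    err = mse(reference, estimate)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))

# Toy check: a constant offset of 5 grey levels gives an MSE of 25.
reference = np.zeros((4, 4))
estimate = np.full((4, 4), 5.0)
print(mse(reference, estimate), psnr(reference, estimate))
```

Casting to `float64` first avoids the wrap-around that would occur when subtracting `uint8` image arrays directly.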

Example: Image Denoising via SVD

Given an image matrix A (for simplicity, let's go with greyscale), we compute its singular value decomposition:


A = U \Sigma V^T

We approximate the image by keeping only the k largest singular values and their corresponding singular vectors:


A_k = U_k \Sigma_k V_k^T

This produces:

  • Noise reduction
  • Compression
  • A direct application of the Eckart-Young-Mirsky theorem

For color images, this is applied independently to each channel (R, G, B).
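
A minimal sketch of the procedure described above, using a synthetic low-rank "image" instead of a real photograph. The helper names `truncated_svd` and `denoise_rgb`, the noise level, and the choice k = 2 are assumptions for illustration. The last lines check the Eckart-Young-Mirsky identity: the Frobenius error of the rank-k truncation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k approximation A_k = U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def denoise_rgb(image, k):
    """Apply the rank-k truncation independently to each color channel."""
    channels = [truncated_svd(image[..., c], k) for c in range(image.shape[-1])]
    return np.stack(channels, axis=-1)

rng = np.random.default_rng(3)

# Synthetic rank-2 greyscale "image" plus Gaussian noise.
rows = np.linspace(0.0, 3.0, 64)
cols = np.linspace(0.0, 3.0, 48)
clean = np.outer(rows / 3.0, 1.0 + cols / 3.0) + np.outer(np.sin(rows), np.cos(cols))
noisy = clean + 0.05 * rng.standard_normal(clean.shape)

denoised = truncated_svd(noisy, k=2)

# Eckart-Young-Mirsky: the truncation error is determined by the
# singular values that were thrown away.
s = np.linalg.svd(noisy, compute_uv=False)
fro_error = np.linalg.norm(noisy - denoised, "fro")
print(fro_error, np.sqrt(np.sum(s[2:] ** 2)))

# Channel-wise version on a 3-channel stack of the same noisy image.
rgb_denoised = denoise_rgb(np.stack([noisy] * 3, axis=-1), k=2)
print(rgb_denoised.shape)
```

On real images the rank k is a tuning parameter: too small and detail is lost, too large and the noise comes back.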


Key Takeaways

  • Many data science problems can be framed as:

    approximate solutions to linear systems

  • Numerical linear algebra is necessary; it determines:

    • stability
    • performance
    • model reliability
  • Spectral methods (SVD, PCA) provide:

    • structure
    • compression
    • interpretability

Purpose

This project is part of a broader effort to translate a background in pure mathematics into practical data science and machine learning skills.

Future Work

  • Add regularization (Ridge, Lasso)
  • Extend PCA to real datasets
  • Compare SVD vs. autoencoders for compression
  • Add performance benchmarks (QR vs SVD vs normal equations)

License

This project is licensed under the MIT License. See the LICENSE file for details.

Description
A not-so-comprehensive guide bridging linear algebra theory with practical data science implementation, meant as a learning resource for someone with a strong linear algebra background. This project demonstrates how fundamental linear algebra concepts power modern machine learning algorithms, with hands-on Python implementations.