{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Principal Component Analysis\n", "\n", "Principal Component Analysis (PCA) addresses the issues of multicollinearity and dimensionality mentioned at the end of the previous section by transforming the data into a new coordinate system. The new axes -- called principal components -- are chosen to capture the maximum variance in the data. In linear algebra terms, we are finding a subspace of potentially smaller dimension that best approximates our data.\n", "\n", "> **Example**: Let us return to our house example. Suppose we decide to list the square footage in both square feet and square meters. Let's add this feature to our dataset.\n", "> |House | Square ft | Square m | Bedrooms | Price (in $1000s) |\n", "> | --- | --- | --- | --- | --- |\n", "> | 0 | 1600 | 148 | 3 | 500 |\n", "> | 1 | 2100 | 195 | 4 | 650 |\n", "> | 2 | 1550 | 144 | 2 | 475 |\n", "> | 3 | 1600 | 148 | 3 | 490 |\n", "> | 4 | 2000 | 185 | 4 | 620 |\n", "> \n", "> In this case, our associated matrix is:\n", "> $$ X = \\begin{bmatrix} 1600 & 148 & 3 & 500 \\\\ 2100 & 195 & 4 & 650 \\\\ 1550 & 144 & 2 & 475 \\\\ 1600 & 148 & 3 & 490 \\\\ 2000 & 185 & 4 & 620 \\end{bmatrix} $$\n", "\n", "There are a few problems with the above data and the associated matrix $X$ (this time, we're not looking to make predictions, so we don't omit the last column).\n", "- **Redundancy**: Square feet and square meters give the same information. It's just a matter of if you're from a civilized country or from an uncivilized country.\n", "- **Numerical instability**: The columns of $X$ are nearly linearly dependent. Indeed, the second column is almost a multiple of the first. 
Moreover, one can make a safe bet that the number of bedrooms increases as the square footage does, so that the first and third columns are correlated.\n", "- **Interpretation difficulty**: We used the square footage and bedrooms *together* in the previous section to predict the price of a house. However, because of their correlation, this obfuscates the true relationship, say, between the square footage and the price of a house, or the number of bedrooms and the price of a house. \n", "\n", "So the question becomes: what do we do about this? We will try to get a smaller matrix (fewer columns) that contains the same information, or close to it. The point is that the data is *effectively* lower-dimensional. \n", "\n", "Let's do a little analysis on our dataset before progressing. Let's use `pandas.DataFrame.describe`, `pandas.DataFrame.corr` and `numpy.linalg.cond`. First, let's set up our data.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "# First let us make a dictionary incorporating our data.\n", "# Each entry corresponds to a column (feature of our data)\n", "data = {\n", "    'Square ft': [1600, 2100, 1550, 1600, 2000],\n", "    'Square m': [148, 195, 144, 148, 185],\n", "    'Bedrooms': [3, 4, 2, 3, 4],\n", "    'Price': [500, 650, 475, 490, 620]\n", "}\n", "\n", "# Create a pandas DataFrame\n", "df = pd.DataFrame(data)\n", "\n", "# Create our matrix X\n", "X = df.to_numpy()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Now let's see what it has to offer. 
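\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick aside (a minimal check added for illustration, reusing the `X` defined above): the redundancy claim is easy to verify numerically, since the ratio of the two area columns should be nearly constant and close to the conversion factor of about 0.092903 square meters per square foot.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Ratio of the two area columns -- nearly constant,\n", "# confirming that they carry the same information.\n", "# (0.092903 is the ft^2 -> m^2 conversion factor.)\n", "ratios = X[:, 1] / X[:, 0]\n", "print(ratios)\n", "print(np.allclose(ratios, 0.092903, atol=1e-3))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that confirmed, let's look at the summary statistics.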
\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8514ed8b", "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": null, "id": "0eb032aa", "metadata": {}, "outputs": [], "source": [ "df.corr()" ] }, { "cell_type": "code", "execution_count": null, "id": "6a166792", "metadata": {}, "outputs": [], "source": [ "np.linalg.cond(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "As we can see, everything is basically correlated, and we clearly have some redundancies. \n", "\n", "This section is structured as follows. \n", "- [Low-rank approximation via SVD](#low-rank-approximation-via-svd)\n", "- [Centering data](#centering-data)\n", "\n", "\n", "## Low-rank approximation via SVD\n", "\n", "Let $A$ be an $n \\times p$ matrix and let $A = U\\Sigma V^T$ be a SVD. Let $u_1,\\dots,u_n$ be the columns of $U$, $v_1,\\dots,v_p$ be the column of $V$, and $\\sigma_1 \\geq \\cdots \\sigma_r > 0$ be the singular values, where $r \\leq \\min\\{n,p\\}$ is the rank of $A$. Then we have the **reduced singular value decomposition** (see [Pseudoinverses and using the svd](#pseudoinverses-and-using-the-svd))\n", "$$ A = \\sum_{i=1}^r \\sigma_i u_iv_i^T $$\n", "(note that $u_i$ is a $n \\times 1$ matrix and $v_i$ is a $p \\times 1$ matrix, so $u_iv_i^T$ is some $n \\times p$ matrix).\n", "The key idea is that if the rank of $A$ is higher, say $s$, but the latter singular values are small, then we should still have an approximation like this. Say $\\sigma_{r+1},\\dots,\\sigma_{s}$ are tiny. Then\n", "$$ \\begin{split} A &= \\sum_{i=1}^s \\sigma_i u_i v_i^T \\\\ &= \\sum_{i=1}^r \\sigma_i u_iv_i^T + \\sum_{i=r+1}^{s} \\sigma_i u_iv_i^T \\\\ &\\approx \\sum_{i=1}^r \\sigma_iu_i v_i^T \\end{split}. $$\n", "So defining $A_r := \\sum_{i=1}^r \\sigma_i u_iv_i^T$, we are approximating $A$ by $A_r$.\n", "\n", "In what sense is this a good approximation though? 
Recall that the Frobenius norm of a matrix $A$ is defined as the square root of the sum of the squares of all the entries:\n", "$$ \\|A\\|_F = \\sqrt{\\sum_{i,j} a_{ij}^2}. $$\n", "The Frobenius norm acts as a very nice generalization of the $L^2$ norm for vectors, and is an indispensable tool in both linear algebra and data science. The point is that the approximation above is quantifiably good in the Frobenius norm: the truncated sum in fact minimizes the error among all matrices of rank at most $k$.\n", "\n", "> **Theorem** (Eckart–Young–Mirsky). Let $A$ be an $n \\times p$ matrix of rank $r$. For $k \\leq r$,\n", "> $$ \\min_{B \\text{ such that rank}(B) \\leq k} \\|A - B\\|_F = \\|A - A_k\\|_F. $$\n", "> The (at most) rank $k$ matrix $A_k$ also realizes the minimum when optimizing for the operator norm.\n", "\n", "> **Example**. Recall that we have the following SVD:\n", "> $$ \\begin{bmatrix} 3 & 2 & 2 \\\\ 2 & 3 & -2 \\end{bmatrix} = \\begin{bmatrix} \\frac{1}{\\sqrt{2}} & \\frac{1}{\\sqrt{2}} \\\\ \\frac{1}{\\sqrt{2}} & -\\frac{1}{\\sqrt{2}} \\end{bmatrix} \\begin{bmatrix} 5 & 0 & 0 \\\\ 0 & 3 & 0 \\end{bmatrix} \\begin{bmatrix} \\frac{1}{\\sqrt{2}} & \\frac{1}{3\\sqrt{2}} & -\\frac{2}{3} \\\\ \\frac{1}{\\sqrt{2}} & -\\frac{1}{3\\sqrt{2}} & \\frac{2}{3} \\\\ 0 & \\frac{4}{3\\sqrt{2}} & \\frac{1}{3} \\end{bmatrix}^T. $$\n", "> So if we want a rank-one approximation for the matrix, we truncate the sum after the first term. We have\n", "> $$ \\begin{split} A_1 &= \\sigma_1u_1v_1^T \\\\ &= 5\\begin{bmatrix} \\frac{1}{\\sqrt{2}} \\\\ \\frac{1}{\\sqrt{2}} \\end{bmatrix}\\begin{bmatrix} \\frac{1}{\\sqrt{2}} & \\frac{1}{\\sqrt{2}} & 0 \\end{bmatrix} \\\\ &= \\begin{bmatrix} \\frac{5}{2} & \\frac{5}{2} & 0 \\\\ \\frac{5}{2} & \\frac{5}{2} & 0 \\end{bmatrix} \\end{split}$$\n", "> Now let's compute the (square of the) Frobenius norm of the difference $A - A_1$. 
We have\n", "> $$ \\begin{split} \\|A - A_1\\|_F^2 &= \\left\\| \\begin{bmatrix} \\frac{1}{2} & -\\frac{1}{2} & 2 \\\\ -\\frac{1}{2} & \\frac{1}{2} & -2 \\end{bmatrix}\\right\\|_F^2 \\\\ &= 4(\\frac{1}{2})^2 + 2(2^2) = 9. \\end{split} $$\n", "> So the Frobenius distance between $A$ and $A_1$ is 3, and we know by Eckart-Young-Mirsky that this is the smallest we can get when looking at the difference between $A$ and a (at most) rank one $2 \\times 3$ matrix. As mentioned, the operator norm $\\|A - A_1\\|$ also minimizes the distance (in operator norm). We know this to be the largest singular value. As $A - A_1$ has SVD\n", "> $$ \\begin{bmatrix} \\frac{1}{2} & -\\frac{1}{2} & 2 \\\\ -\\frac{1}{2} & \\frac{1}{2} & -2 \\end{bmatrix} = \\begin{bmatrix} -\\frac{1}{\\sqrt{2}} & \\frac{1}{\\sqrt{2}} \\\\ \\frac{1}{\\sqrt{2}} & \\frac{1}{\\sqrt{2}} \\end{bmatrix}\\begin{bmatrix} 3 & 0 & 0 \\\\ 0 & 0 & 0 \\end{bmatrix} \\begin{bmatrix} -\\frac{1}{3\\sqrt{2}} & -\\frac{4}{\\sqrt{17}} & \\frac{1}{3\\sqrt{34}} \\\\ \\frac{1}{3\\sqrt{2}} & 0 & \\frac{1}{3}\\sqrt{\\frac{17}{2}} \\\\ -\\frac{2\\sqrt{2}}{3} & \\frac{1}{\\sqrt{17}} & \\frac{2}{3}\\sqrt{\\frac{2}{17}} \\end{bmatrix}, $$\n", "> the operator norm is also 3. \n", "\n", "Now let's do this in python. We'll set up our matrix as usual, take the SVD, do the truncated construction of $A_1$, and use `numpy.linalg.norm` to look at the norms. 
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Create our matrix A\n", "A = np.array([[3,2,2],[2,3,-2]])\n", "\n", "# Take the SVD\n", "U, S, Vh = np.linalg.svd(A)\n", "\n", "# Create our rank-1 approximation\n", "sigma1 = S[0]\n", "u1 = U[:, [0]]\t\t#shape (2,2)\n", "v1T = Vh[[0], :]\t\t#shape (3,3)\n", "A1 = sigma1 * (u1 @ v1T)\n", "\n", "# Take norms and view errors\n", "frobenius_error = np.linalg.norm(A - A1, ord=\"fro\")\t#Frobenius norm\n", "operator_error = np.linalg.norm(A - A1, ord=2)\t\t#operator norm\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Let's see if we get what we expect.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "799ea5da", "metadata": {}, "outputs": [], "source": [ "sigma1" ] }, { "cell_type": "code", "execution_count": null, "id": "e17ad031", "metadata": {}, "outputs": [], "source": [ "u1" ] }, { "cell_type": "code", "execution_count": null, "id": "b75d1b41", "metadata": {}, "outputs": [], "source": [ "v1T" ] }, { "cell_type": "code", "execution_count": null, "id": "cda3bc1a", "metadata": {}, "outputs": [], "source": [ "A1" ] }, { "cell_type": "code", "execution_count": null, "id": "5741dc92", "metadata": {}, "outputs": [], "source": [ "frobenius_error" ] }, { "cell_type": "code", "execution_count": null, "id": "b1171244", "metadata": {}, "outputs": [], "source": [ "operator_error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "So this numerically confirms the EYM theorem. \n", "\n", "## Centering data \n", "In data science, we rarely apply low-rank approximation to raw values directly, because translation and units can dominate the geometry. Instead, we apply these methods to centered (and often standardized) data so that low-rank structure reflects relationships among features rather than the absolute location or measurement scale. 
Centering converts the problem from approximating an affine cloud to approximating a linear one, in direct analogy with including an intercept term in linear regression. Therefore, before we can analyze the variance structure, we must ensure our data is centered, i.e., that each feature has a mean of 0. We achieve this by subtracting the mean of each column from every entry in that column.\n", "Suppose $X$ is our $n \\times p$ data matrix, let $\\mathbb{1}$ denote the $n \\times 1$ vector of ones, and let\n", "$$ \\mu = \\frac{1}{n}\\mathbb{1}^T X $$\n", "be the $1 \\times p$ row vector of column means. Then\n", "$$ \\hat{X} = X - \\mathbb{1}\\mu $$\n", "will be the centered data matrix.\n", "\n", "> **Example**. Going back to our housing example, the means of the columns are 1770, 164, 3.2, and 547, respectively. So our centered matrix is\n", "> $$ \\hat{X} = \\begin{bmatrix} -170 & -16 & -0.2 & -47 \\\\ 330 & 31 & 0.8 & 103 \\\\ -220 & -20 & -1.2 & -72 \\\\ -170 & -16 & -0.2 & -57 \\\\ 230 & 21 & 0.8 & 73 \\end{bmatrix}. $$\n", "\n", "Let's do this in python.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "# First let us make a dictionary incorporating our data.\n", "# Each entry corresponds to a column (feature of our data)\n", "data = {\n", "    'Square ft': [1600, 2100, 1550, 1600, 2000],\n", "    'Square m': [148, 195, 144, 148, 185],\n", "    'Bedrooms': [3, 4, 2, 3, 4],\n", "    'Price': [500, 650, 475, 490, 620]\n", "}\n", "\n", "# Create a pandas DataFrame\n", "df = pd.DataFrame(data)\n", "\n", "# Create our matrix X\n", "X = df.to_numpy()\n", "\n", "# Get our vector of means\n", "X_means = np.mean(X, axis=0)\n", "\n", "# Create our centered matrix\n", "X_centered = X - X_means\n", "\n", "# Get the SVD for X_centered\n", "U, S, Vh = np.linalg.svd(X_centered)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This returns the following.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4288abb2", "metadata": {}, "outputs": [], 
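"source": [ "# Sanity check (added for illustration): after centering, each column\n", "# of X_centered should have numerically zero mean, and X_means should\n", "# agree with the formula mu = (1/n) * 1^T X from the text.\n", "n = X.shape[0]\n", "mu = (np.ones(n) @ X) / n\n", "print(np.allclose(mu, X_means))\n", "print(np.allclose(X_centered.mean(axis=0), 0))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [],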
"source": [ "X_means" ] }, { "cell_type": "code", "execution_count": null, "id": "31c2ebf2", "metadata": {}, "outputs": [], "source": [ "X_centered" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "We will apply the low-rank approximations from the previous sections. First let's see what our SVD looks like, and what the condition number is.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d944d257", "metadata": {}, "outputs": [], "source": [ "print(f\"U = {U}\\n\\nS = {S}\\n\\nVh.T = {Vh.T}\\n\")\n", "print(\"Condition number of X_centered = \", np.linalg.cond(X_centered))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now let's approximate our centered matrix $\\hat{X}$ by some lower-rank matrices. First, we'll define a function which will give us a rank $k$ truncated SVD. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Defining the truncated svd\n", "def reduced_svd_matrix_k(U, S, Vh, k):\n", "\tUk = U[:, :k]\n", "\tSk = np.diag(S[:k])\n", "\tVhk = Vh[:k, :]\n", "\treturn Uk @ Sk @ Vhk\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now, as $\\hat{X}$ has rank 4, we can do a reduced matrix of rank 1,2,3. We will do this in a loop.\n", "\n", "> **Remark**. We'll divide the error by the (Frobenius) norm so that we have a relative error. E.g., if two houses are within 10k of each other, they are similarly priced. 
The magnitude of error being large doesn't say much if our quantities themselves are large.\n", "> \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for k in [1, 2, 3]:\n", "    # Define our reduced matrix\n", "    Xck = reduced_svd_matrix_k(U, S, Vh, k)\n", "    # Compute the relative error\n", "    rel_err = np.linalg.norm(X_centered - Xck, ord=\"fro\") / np.linalg.norm(X_centered, ord=\"fro\")\n", "    # Print the information\n", "    print(Xck, \"\\n\", f\"k={k}: relative Frobenius reconstruction error on centered data = {rel_err:.4f}\", \"\\n\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "This seems to check out -- it says that rank one (in effect, a single underlying feature) is roughly enough to describe this data. This should make sense, because the square meters, number of bedrooms, and price all track the square footage. \n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.14.3" } }, "nbformat": 4, "nbformat_minor": 5 }