Before we apply PCA, we often normalize our data so that each feature has mean 0 and variance 1. We do this by computing the mean and standard deviation of each feature and then, for every example, subtracting each feature's mean and dividing by its standard deviation.
$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$$

$$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x_j^{(i)} - \mu_j \right)^2$$

$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{\sigma_j}$$
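As a concrete illustration, here is a minimal NumPy sketch of this standardization step. The data matrix `X` (one example per row, one feature per column) is a hypothetical stand-in, not something from the derivation above:

```python
import numpy as np

# Hypothetical data: m = 100 examples, 3 features, one example per row
X = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 3))

mu = X.mean(axis=0)        # per-feature mean mu_j
sigma = X.std(axis=0)      # per-feature standard deviation sigma_j
X_std = (X - mu) / sigma   # each feature now has mean 0 and variance 1
```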
To select the first principal component, we need to find the unit vector u that maximizes the variance of the data when projected onto u. The larger the projections of the data onto u, the higher the variance, meaning more information is captured in that direction.
$$u = \arg\max_{u} \; \frac{1}{n} \sum_{i=1}^{n} \left\lVert \operatorname{proj}_u\!\left(x^{(i)}\right) \right\rVert_2^2$$

$$= \arg\max_{u} \; \left[ u^T \Sigma u \right] \quad \text{where } \Sigma = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} x^{(i)T}$$
See derivation
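To make the objective concrete, here is a short sketch (continuing from the standardization example above, with an arbitrarily chosen unit vector `u` as an assumption) showing that the mean squared projection onto u equals uᵀΣu:

```python
n = X_std.shape[0]
Sigma = (X_std.T @ X_std) / n      # Sigma = (1/n) * sum_i x_i x_i^T

u = np.array([1.0, 0.0, 0.0])      # a candidate unit direction (hypothetical)
projected = X_std @ u              # scalar projection of each x_i onto u
variance_along_u = np.mean(projected ** 2)

# Both expressions compute the variance captured along u
assert np.isclose(variance_along_u, u @ Sigma @ u)
```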
We can now use Lagrange multipliers (with the constraint that u is a unit vector) to find the u that maximizes the variance, and it turns out the variance is maximized when u is an eigenvector of the symmetric matrix Σ.
$$\Sigma u = \lambda u$$
See derivation
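A quick numerical check of this eigenvector condition, using NumPy's symmetric eigensolver on the Σ computed in the sketch above (an illustration, not part of the derivation):

```python
# eigh is the eigensolver for symmetric matrices; eigenvectors are the columns
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Every eigenpair satisfies Sigma @ u = lambda * u
for lam, u in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(Sigma @ u, lam * u)
```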
However, if multiple eigenvalue–eigenvector pairs satisfy this equation, it is not yet clear which one to choose. We can show that the variance is maximized when we choose the largest eigenvalue λ.
$$\max \sigma^2 \;\leftrightarrow\; \max \lambda$$
See derivation
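We can also verify numerically that the variance of the data projected onto each eigenvector equals its eigenvalue, so the direction of maximum variance is the one with the largest eigenvalue (again a sketch reusing the arrays defined above):

```python
# Variance along each eigenvector equals its eigenvalue: u^T Sigma u = lambda
for lam, u in zip(eigenvalues, eigenvectors.T):
    assert np.isclose(np.mean((X_std @ u) ** 2), lam)

# The first principal component is the eigenvector with the largest eigenvalue
u1 = eigenvectors[:, np.argmax(eigenvalues)]
```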
Therefore, to maximize the variance, we need to choose the eigenvector with the largest eigenvalue.
In practice, we compute the eigenvalues and eigenvectors of Σ, typically via singular value decomposition (which coincides with the eigendecomposition here because Σ is symmetric and positive semidefinite), and then keep the top k eigenvectors with the largest eigenvalues.
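Putting it together, here is a minimal sketch of the SVD route, applied to the standardized data matrix from the earlier sketches; the choice `k = 2` is an arbitrary assumption for illustration:

```python
k = 2  # number of principal components to keep (assumption)

# Right singular vectors of X_std (rows of Vt) are the eigenvectors of Sigma,
# and S**2 / n gives the corresponding eigenvalues, sorted in descending order.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
top_k_components = Vt[:k]                # shape (k, num_features)

# Project the data onto the top-k principal directions
X_reduced = X_std @ top_k_components.T   # shape (n, k)
```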