Before we apply PCA, we often normalize our data so that each feature has mean 0 and variance 1. We do this by computing the mean and standard deviation of each feature and then, for every example, subtracting each feature's mean and dividing by its standard deviation.
$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$$

$$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x_j^{(i)} - \mu_j \right)^2$$

$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{\sigma_j}$$
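As a concrete illustration, here is a minimal NumPy sketch of this standardization step. The data matrix `X` (one example per row, one feature per column) is a hypothetical stand-in, not something from the derivation above:

```python
import numpy as np

# Hypothetical data: m = 100 examples, 3 features, one example per row
X = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 3))

mu = X.mean(axis=0)        # per-feature mean mu_j
sigma = X.std(axis=0)      # per-feature standard deviation sigma_j
X_std = (X - mu) / sigma   # each feature now has mean 0 and variance 1
```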
To select the first principal component, we need to find the unit vector u that maximizes the variance of the data when projected onto u. The larger the projections of the data onto u, the higher the variance, meaning more information is captured in that direction.
$$u = \arg\max_{u} \; \frac{1}{n} \sum_{i=1}^{n} \left\lVert \operatorname{proj}_u\!\left(x^{(i)}\right) \right\rVert_2^2$$

$$= \arg\max_{u} \; \left[ u^T \Sigma u \right] \quad \text{where } \Sigma = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} x^{(i)T}$$
See derivation
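To make the objective concrete, here is a short sketch (continuing from the standardization example above, with an arbitrarily chosen unit vector `u` as an assumption) showing that the mean squared projection onto u equals uᵀΣu:

```python
n = X_std.shape[0]
Sigma = (X_std.T @ X_std) / n      # Sigma = (1/n) * sum_i x_i x_i^T

u = np.array([1.0, 0.0, 0.0])      # a candidate unit direction (hypothetical)
projected = X_std @ u              # scalar projection of each x_i onto u
variance_along_u = np.mean(projected ** 2)

# Both expressions compute the variance captured along u
assert np.isclose(variance_along_u, u @ Sigma @ u)
```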
We can now use Lagrange multipliers (with the constraint that u is a unit vector) to find the u that maximizes the variance, and it turns out the variance is maximized when u is an eigenvector of the symmetric matrix Σ.
$$\Sigma u = \lambda u$$
See derivation
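A quick numerical check of this eigenvector condition, using NumPy's symmetric eigensolver on the Σ computed in the sketch above (an illustration, not part of the derivation):

```python
# eigh is the eigensolver for symmetric matrices; eigenvectors are the columns
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Every eigenpair satisfies Sigma @ u = lambda * u
for lam, u in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(Sigma @ u, lam * u)
```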
However, if multiple eigenvalue–eigenvector pairs satisfy this equation, it is not yet clear which one to choose. We can show that the variance is maximized when we choose the largest eigenvalue λ.
$$\max \sigma^2 \;\leftrightarrow\; \max \lambda$$
See derivation
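We can also verify numerically that the variance of the data projected onto each eigenvector equals its eigenvalue, so the direction of maximum variance is the one with the largest eigenvalue (again a sketch reusing the arrays defined above):

```python
# Variance along each eigenvector equals its eigenvalue: u^T Sigma u = lambda
for lam, u in zip(eigenvalues, eigenvectors.T):
    assert np.isclose(np.mean((X_std @ u) ** 2), lam)

# The first principal component is the eigenvector with the largest eigenvalue
u1 = eigenvectors[:, np.argmax(eigenvalues)]
```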
Therefore, to maximize the variance, we need to choose the eigenvector with the largest eigenvalue.
In practice, we compute the eigenvalues and eigenvectors of Σ, typically via singular value decomposition (which coincides with the eigendecomposition here because Σ is symmetric and positive semidefinite), and then keep the top k eigenvectors with the largest eigenvalues.
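Putting it together, here is a minimal sketch of the SVD route, applied to the standardized data matrix from the earlier sketches; the choice `k = 2` is an arbitrary assumption for illustration:

```python
k = 2  # number of principal components to keep (assumption)

# Right singular vectors of X_std (rows of Vt) are the eigenvectors of Sigma,
# and S**2 / n gives the corresponding eigenvalues, sorted in descending order.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
top_k_components = Vt[:k]                # shape (k, num_features)

# Project the data onto the top-k principal directions
X_reduced = X_std @ top_k_components.T   # shape (n, k)
```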