Factor Analysis
If we have a dataset with $n$ data points and $d$ features and we want to model the data using a mixture of Gaussians, it is very hard to do so when $d \gg n$: with fewer data points than dimensions, the empirical covariance matrix is singular, so we cannot even evaluate the Gaussian density.
One way around this is to restrict the covariance matrix of the data, forcing it to be diagonal, where each entry is just the empirical variance of the $j$-th coordinate:

$$\Sigma_{jj} = \frac{1}{n}\sum_{i=1}^{n}\left(x_j^{(i)} - \mu_j\right)^2$$

However, this fails to capture any correlations between the different features of our data. A better solution is therefore to learn a lower-dimensional representation of our data.
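As a quick illustration, this diagonal estimate is a couple of lines of numpy (a sketch; the variable names and the random data are ours, not from the notes):

```python
import numpy as np

# Hypothetical data matrix with n data points and d features.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))

mu = X.mean(axis=0)                            # per-coordinate means mu_j
Sigma = np.diag(((X - mu) ** 2).mean(axis=0))  # Sigma_jj = (1/n) sum_i (x_j^(i) - mu_j)^2
```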
In the factor analysis model, we assume that each feature $x_j$ is generated by a linear combination of the entries of a lower-dimensional latent variable $z \in \mathbb{R}^k$ (with $k < d$), plus Gaussian noise with diagonal covariance $\Psi \in \mathbb{R}^{d \times d}$. We also assume that $z$ is normally distributed:

$$z \sim \mathcal{N}(0, I)$$
$$x \mid z \sim \mathcal{N}(\mu + Wz, \Psi)$$

where $W \in \mathbb{R}^{d \times k}$. We can rewrite this as:

$$z \sim \mathcal{N}(0, I)$$
$$\epsilon \sim \mathcal{N}(0, \Psi)$$
$$x = \mu + Wz + \epsilon$$
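To make the generative story concrete, here is a minimal numpy sketch that samples a dataset from this model (the dimensions and parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 50, 3          # assumed sizes: n samples, d features, k latent dims

# Arbitrary "true" parameters, for illustration only.
mu = rng.normal(size=d)
W = rng.normal(size=(d, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))             # diagonal noise covariance

z = rng.normal(size=(n, k))                              # z ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(d), Psi, size=n)  # eps ~ N(0, Psi)
X = mu + z @ W.T + eps                                   # x = mu + W z + eps
```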
The two forms are equivalent because any random variable following a normal distribution $x \sim \mathcal{N}(\mu, \sigma^2)$ can be written in terms of $\epsilon \sim \mathcal{N}(0, 1)$:

$$x = \mu + \sigma \cdot \epsilon$$

We can derive the mean and covariance of $x$: since $z$ and $\epsilon$ have zero mean, $\mathbb{E}[x] = \mu$, and since they are independent, $\mathrm{Cov}(x) = W\,\mathrm{Cov}(z)\,W^T + \Psi = WW^T + \Psi$. Therefore $x$ follows the distribution:

$$x \sim \mathcal{N}\left(\mu, WW^T + \Psi\right)$$
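We can sanity-check this result numerically by sampling from the generative model and comparing the empirical covariance with $WW^T + \Psi$ (a self-contained sketch with made-up parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200_000, 5, 2
mu = rng.normal(size=d)
W = rng.normal(size=(d, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))

X = mu + (rng.normal(size=(n, k)) @ W.T
          + rng.multivariate_normal(np.zeros(d), Psi, size=n))

# The empirical covariance should approach W W^T + Psi as n grows.
print(np.abs(np.cov(X, rowvar=False) - (W @ W.T + Psi)).max())
```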
Thus, we can write the log likelihood of our data under this model as:

$$\ell(\mu, W, \Psi) = \log \prod_{i=1}^{n} \frac{1}{(2\pi)^{d/2}\,|WW^T + \Psi|^{1/2}} \exp\left(-\frac{1}{2}\left(x^{(i)} - \mu\right)^T \left(WW^T + \Psi\right)^{-1} \left(x^{(i)} - \mu\right)\right)$$

However, there is no closed-form solution for the maximum likelihood estimates of $\mu$, $W$, and $\Psi$. This is because $W$ and $\Psi$ are coupled in the likelihood function (they appear only through $WW^T + \Psi$), so we cannot optimize them separately. Instead, we fit the model with the EM algorithm, treating $z$ as the latent variable.
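Even without a closed-form maximizer, the log likelihood itself is straightforward to evaluate, since the marginal of $x$ is Gaussian. A sketch using scipy (`log_likelihood` is our own helper name):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, mu, W, Psi):
    """l(mu, W, Psi): log density of each row of X under N(mu, W W^T + Psi), summed."""
    return multivariate_normal(mean=mu, cov=W @ W.T + Psi).logpdf(X).sum()
```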
For the E-step, we need the posterior distribution of $z$ given $x$. Since $z$ and $x$ are jointly Gaussian, this conditional is also Gaussian, and the standard Gaussian conditioning formulas give:

$$z \mid x \sim \mathcal{N}\left(\mu_{z|x}, \Sigma_{z|x}\right)$$
$$\mu_{z|x} = W^T\left(WW^T + \Psi\right)^{-1}(x - \mu)$$
$$\Sigma_{z|x} = I - W^T\left(WW^T + \Psi\right)^{-1}W$$
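In code, both posterior quantities can be computed for all data points at once; note that $\Sigma_{z|x}$ does not depend on $x$, so it is shared across the dataset (a sketch; the function name is ours):

```python
import numpy as np

def posterior(X, mu, W, Psi):
    """Parameters of p(z | x) for each row of X, per the formulas above."""
    k = W.shape[1]
    G = np.linalg.inv(W @ W.T + Psi)   # (W W^T + Psi)^{-1}
    mu_z = (X - mu) @ G @ W            # row i is mu_{z|x}^T for x^(i); G is symmetric
    Sigma_z = np.eye(k) - W.T @ G @ W  # identical for every data point
    return mu_z, Sigma_z
```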
Now, the E-step for our EM algorithm is given by:

$$Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \mu, W, \Psi) = \frac{1}{(2\pi)^{k/2}\,|\Sigma_{z^{(i)}|x^{(i)}}|^{1/2}} \exp\left(-\frac{1}{2}\left(z^{(i)} - \mu_{z^{(i)}|x^{(i)}}\right)^T \Sigma_{z^{(i)}|x^{(i)}}^{-1} \left(z^{(i)} - \mu_{z^{(i)}|x^{(i)}}\right)\right)$$
In the M-step, we need to maximize our evidence lower bound:

$$\mathrm{ELBO}(x; Q, \mu, W, \Psi) = \sum_{i=1}^{n} \mathbb{E}_{z^{(i)} \sim Q_i}\left[\log \frac{p(x^{(i)}, z^{(i)}; \mu, W, \Psi)}{Q_i(z^{(i)})}\right]$$

Maximizing the ELBO with respect to each of the parameters $\mu$, $W$, and $\Psi$, we get the following update equations:
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x^{(i)}$$

$$W = \left(\sum_{i=1}^{n}\left(x^{(i)} - \mu\right)\mu_{z^{(i)}|x^{(i)}}^T\right)\left(\sum_{i=1}^{n}\left(\mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)\right)^{-1}$$

$$\Psi = \frac{1}{n}\sum_{i=1}^{n}\Big(\left(x^{(i)} - \mu\right)\left(x^{(i)} - \mu\right)^T - W\mu_{z^{(i)}|x^{(i)}}\left(x^{(i)} - \mu\right)^T - \left(x^{(i)} - \mu\right)\mu_{z^{(i)}|x^{(i)}}^T W^T + W\left(\mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)W^T\Big)$$

Since $\Psi$ is constrained to be diagonal, we keep only the diagonal entries of this last matrix.
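Putting the E-step and M-step together gives the full algorithm. The following numpy sketch implements the updates above (function and variable names are ours; it assumes the diagonal-$\Psi$ convention just described):

```python
import numpy as np

def factor_analysis_em(X, k, n_iter=100, seed=0):
    """EM for factor analysis, implementing the updates above.

    X: (n, d) data matrix; k: latent dimension. A sketch; all names are ours.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)

    mu = X.mean(axis=0)          # closed-form; independent of W and Psi
    Xc = X - mu                  # centered data, (n, d)
    W = rng.normal(size=(d, k))  # random initialization
    psi = Xc.var(axis=0)         # diagonal of Psi, stored as a vector

    for _ in range(n_iter):
        # E-step: posterior moments of z^(i) given x^(i).
        G = np.linalg.inv(W @ W.T + np.diag(psi))  # (W W^T + Psi)^{-1}
        mu_z = Xc @ G @ W                          # (n, k); row i is mu_{z|x}^T
        Sigma_z = np.eye(k) - W.T @ G @ W          # (k, k); same for every i

        # M-step for W: (sum_i (x-mu) mu_z^T)(sum_i mu_z mu_z^T + Sigma_z)^{-1}
        A = Xc.T @ mu_z                  # (d, k): sum_i (x^(i)-mu) mu_z^T
        B = mu_z.T @ mu_z + n * Sigma_z  # (k, k): sum_i (mu_z mu_z^T + Sigma_z)
        W = A @ np.linalg.inv(B)

        # M-step for Psi: evaluate the update and keep only the diagonal.
        S = Xc.T @ Xc / n                # (1/n) sum_i (x-mu)(x-mu)^T
        C = A / n                        # (1/n) sum_i (x-mu) mu_z^T
        Phi = S - W @ C.T - C @ W.T + W @ (B / n) @ W.T
        psi = np.diag(Phi).copy()

    return mu, W, np.diag(psi)
```

Note that $\mu$ is just the sample mean and does not depend on the other parameters, so it only needs to be computed once. Also, $W$ is identifiable only up to a rotation of the latent space: any $WR$ with orthogonal $R$ yields the same $WW^T + \Psi$, so on data sampled from the model one should compare the recovered $WW^T + \Psi$ rather than $W$ itself.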