Gaussian Mixture Models
Suppose that we are given a training set x(1), …, x(n) and we wish to model the data by specifying a joint distribution p(x(i), z(i)), where z(i) ∼ Multinomial(ϕ) is a latent variable with p(z(i) = j) = ϕ_j and ∑_{j=1}^{k} ϕ_j = 1, and where k is the number of values that z(i) can take.
Moreover, we assume that x(i) ∣ z(i) = j ∼ N(μ_j, Σ_j). Gaussian mixture models are similar to the K-means algorithm, except that clusters are allowed to overlap and each cluster follows a Gaussian distribution.
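As a minimal sketch of this generative story (the values of k, ϕ, μ_j, and Σ_j below are illustrative assumptions, not from the text), sampling from a Gaussian mixture might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

k, d, n = 3, 2, 500                              # components, dimensions, examples (illustrative)
phi = np.array([0.5, 0.3, 0.2])                  # mixing proportions, sum to 1
mu = rng.normal(size=(k, d)) * 5.0               # component means (made-up values)
Sigma = np.stack([np.eye(d) for _ in range(k)])  # component covariances

# z(i) ~ Multinomial(phi), then x(i) | z(i)=j ~ N(mu_j, Sigma_j)
z = rng.choice(k, size=n, p=phi)
x = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])
```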
To fit the parameters, we maximize the log likelihood:
$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{n} \log p(x^{(i)}; \phi, \mu, \Sigma)$$
$$= \sum_{i=1}^{n} \log \sum_{z^{(i)}=1}^{k} p(x^{(i)} \mid z^{(i)}; \mu, \Sigma)\, p(z^{(i)}; \phi)$$
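Under the same assumed shapes as the sketch above, this marginal log likelihood can be evaluated by summing p(x(i) ∣ z(i) = j) p(z(i) = j) over the k components in log space (the helper name log_likelihood is hypothetical):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood(x, phi, mu, Sigma):
    """Marginal log likelihood: sum_i log sum_j p(x(i)|z(i)=j) p(z(i)=j)."""
    # log p(x(i) | z(i)=j; mu, Sigma) + log p(z(i)=j; phi), one column per component j
    log_joint = np.stack(
        [multivariate_normal.logpdf(x, mean=mu[j], cov=Sigma[j]) + np.log(phi[j])
         for j in range(len(phi))], axis=1)       # shape (n, k)
    # log-sum-exp over j gives log p(x(i)); summing over i gives the log likelihood
    return logsumexp(log_joint, axis=1).sum()
```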
The random variables z(i) indicate which of the k Gaussians each x(i) came from. Note that if we knew the z(i)'s, the maximum likelihood problem would be easy and nearly identical to that of Gaussian Discriminant Analysis.
Since setting the derivatives of the above equation to zero does not yield a closed-form solution, we use the Expectation-Maximization (EM) algorithm instead.
In the E step, for all values of i and j, we compute p(z(i) = j ∣ x(i); ϕ, μ, Σ) using the current parameters and Bayes' rule:
$$p(z^{(i)}=j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{p(x^{(i)} \mid z^{(i)}=j; \mu, \Sigma)\, p(z^{(i)}=j; \phi)}{\sum_{l=1}^{k} p(x^{(i)} \mid z^{(i)}=l; \mu, \Sigma)\, p(z^{(i)}=l; \phi)}$$
When we plug in the density of the Gaussian distribution, this takes a form very similar to the softmax function. So in the E step, we compute an estimated probability distribution over the values z(i) can take for each x(i).
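A sketch of the E step under these assumptions: the responsibilities w_j^{(i)} are exactly a per-example softmax over the log joint probabilities (the function name e_step is just illustrative):

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import multivariate_normal

def e_step(x, phi, mu, Sigma):
    """Responsibilities w[i, j] = p(z(i)=j | x(i); phi, mu, Sigma) via Bayes' rule."""
    log_joint = np.stack(
        [multivariate_normal.logpdf(x, mean=mu[j], cov=Sigma[j]) + np.log(phi[j])
         for j in range(len(phi))], axis=1)       # shape (n, k)
    return softmax(log_joint, axis=1)             # normalize each row: Bayes' rule in log space
```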
Now for the M step, for each j, we find the values of our parameters that maximize the ELBO, where:
$$w_j^{(i)} = Q_i(z^{(i)}=j) = p(z^{(i)}=j \mid x^{(i)}; \phi, \mu, \Sigma)$$
Therefore we need to maximize the following:
$$\sum_{i=1}^{n} \sum_{j=1}^{k} Q_i(z^{(i)}=j) \log \frac{p(x^{(i)} \mid z^{(i)}=j; \mu, \Sigma)\, p(z^{(i)}=j; \phi)}{Q_i(z^{(i)}=j)}$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{k} w_j^{(i)} \log \frac{p(x^{(i)} \mid z^{(i)}=j; \mu, \Sigma)\, p(z^{(i)}=j; \phi)}{w_j^{(i)}}$$
When we take the derivative of the above equation with respect to each of our parameters ϕ,μ,Σ, and set it equal to 0, we get our update equations for each of the parameters:
$$\phi_l = \frac{1}{n} \sum_{i=1}^{n} w_l^{(i)}$$
$$\mu_l = \frac{\sum_{i=1}^{n} w_l^{(i)} x^{(i)}}{\sum_{i=1}^{n} w_l^{(i)}}$$
$$\Sigma_l = \frac{\sum_{i=1}^{n} w_l^{(i)} (x^{(i)} - \mu_l)(x^{(i)} - \mu_l)^T}{\sum_{i=1}^{n} w_l^{(i)}}$$
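A sketch of the corresponding M step, assuming responsibilities w of shape (n, k) from the E step; the closed-form updates above translate directly into weighted averages (the helper name m_step is hypothetical):

```python
import numpy as np

def m_step(x, w):
    """Closed-form parameter updates given responsibilities w of shape (n, k)."""
    n, d = x.shape
    Nk = w.sum(axis=0)                            # sum_i w_j(i), shape (k,)
    phi = Nk / n                                  # phi_j
    mu = (w.T @ x) / Nk[:, None]                  # mu_j, shape (k, d)
    Sigma = np.empty((len(Nk), d, d))
    for j in range(len(Nk)):
        diff = x - mu[j]                          # x(i) - mu_j for every i
        Sigma[j] = (w[:, j, None] * diff).T @ diff / Nk[j]
    return phi, mu, Sigma
```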
Repeat until convergence: {

(E-step) For each i and j, set
$$w_j^{(i)} \leftarrow p(z^{(i)}=j \mid x^{(i)}; \phi, \mu, \Sigma)$$

(M-step) For each j, set
$$\phi_j \leftarrow \frac{1}{n} \sum_{i=1}^{n} w_j^{(i)}$$
$$\mu_j \leftarrow \frac{\sum_{i=1}^{n} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{n} w_j^{(i)}}$$
$$\Sigma_j \leftarrow \frac{\sum_{i=1}^{n} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T}{\sum_{i=1}^{n} w_j^{(i)}}$$

}
In the E step, we estimate the probability of each z(i) given x(i) using the current parameters ϕ,μ,Σ.
In the M step, we update the values of our parameters to maximize the ELBO, which, for the Q_i chosen in the E step, is equal to the log likelihood log p(x; ϕ, μ, Σ) at the current values of ϕ, μ, Σ.
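Putting the pieces together, a possible EM loop (reusing the hypothetical e_step, m_step, and log_likelihood helpers sketched above, with a naive initialization) might look like:

```python
import numpy as np

def fit_gmm(x, k, n_iter=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture, using the e_step, m_step, and log_likelihood helpers above."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    # naive initialization: uniform weights, k random examples as means, identity covariances
    phi = np.full(k, 1.0 / k)
    mu = x[rng.choice(n, size=k, replace=False)]
    Sigma = np.stack([np.eye(d)] * k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        w = e_step(x, phi, mu, Sigma)             # E-step: estimate p(z(i)=j | x(i))
        phi, mu, Sigma = m_step(x, w)             # M-step: maximize the ELBO
        ll = log_likelihood(x, phi, mu, Sigma)    # non-decreasing across iterations
        if ll - prev_ll < tol:                    # stop once the improvement is tiny
            break
        prev_ll = ll
    return phi, mu, Sigma, w
```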