Gaussian Mixture Models
Suppose that we are given a training set $x^{(1)}, \ldots, x^{(n)}$ and we wish to model the data by specifying a joint distribution $p(x^{(i)}, z^{(i)})$, where $z^{(i)} \sim \mathrm{Multinomial}(\phi)$ is a latent variable with $p(z^{(i)} = j) = \phi_j$ and $\sum_{j=1}^{k} \phi_j = 1$, and $k$ is the number of values that $z^{(i)}$ can take. Moreover, we assume that $x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$. Gaussian mixture models are similar to the K-means Algorithm, except that we allow for overlapping clusters and each cluster follows a Gaussian distribution.
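To make the generative story concrete, here is a minimal sampling sketch, assuming $k = 3$ two-dimensional Gaussians with illustrative (not fitted) parameters; the variable names `phi`, `mus`, and `sigmas` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for k = 3 clusters in 2 dimensions.
phi = np.array([0.5, 0.3, 0.2])                                    # mixing proportions, sum to 1
mus = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])              # one mean per cluster
sigmas = np.array([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])   # one covariance per cluster

# Generative process: draw z^(i) ~ Multinomial(phi), then x^(i) | z^(i)=j ~ N(mu_j, Sigma_j).
n = 500
z = rng.choice(len(phi), size=n, p=phi)
x = np.array([rng.multivariate_normal(mus[j], sigmas[j]) for j in z])
```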
To fit the parameters, we maximize the log likelihood:

$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{n} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{n} \log \sum_{z^{(i)}=1}^{k} p(x^{(i)} \mid z^{(i)}; \mu, \Sigma)\, p(z^{(i)}; \phi)$$

The random variables $z^{(i)}$ indicate which of the $k$ Gaussians each $x^{(i)}$ came from. Note that if we knew what the $z^{(i)}$'s were, the maximum likelihood problem would be easy and very similar to that of Gaussian Discriminant Analysis.
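As a rough sketch of how $\ell$ might be evaluated numerically (useful, for example, as a convergence check), the function below marginalizes out $z^{(i)}$ for each point; it assumes the hypothetical variable names from the sampling sketch above and SciPy's `multivariate_normal` density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(x, phi, mus, sigmas):
    # densities[i, j] = p(x^(i) | z^(i)=j; mu, Sigma) * p(z^(i)=j; phi)
    densities = np.column_stack([
        phi[j] * multivariate_normal.pdf(x, mean=mus[j], cov=sigmas[j])
        for j in range(len(phi))
    ])
    # l(phi, mu, Sigma) = sum_i log sum_j p(x^(i), z^(i)=j)
    return np.sum(np.log(densities.sum(axis=1)))
```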
In the E step, for all values of $i$ and $j$, we find $p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$ using the current parameters and Bayes' rule:

$$p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{\sum_{l=1}^{k} p(x^{(i)} \mid z^{(i)} = l; \mu, \Sigma)\, p(z^{(i)} = l; \phi)}$$

When we plug in the Gaussian densities, this takes a form very similar to the softmax function. So in the E step, we find our estimated probability distribution over the values $z^{(i)}$ can take for each $x^{(i)}$.
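A minimal sketch of this step, under the same assumed names as before: the returned matrix `w` has one row per example and one column per Gaussian, with `w[i, j]` playing the role of $w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, phi, mus, sigmas):
    # Numerator of Bayes' rule: p(x^(i) | z^(i)=j; mu, Sigma) p(z^(i)=j; phi) for every i, j.
    joint = np.column_stack([
        phi[j] * multivariate_normal.pdf(x, mean=mus[j], cov=sigmas[j])
        for j in range(len(phi))
    ])
    # Normalize each row over j, which is the softmax-like denominator.
    return joint / joint.sum(axis=1, keepdims=True)
```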
Now for the M step, for each $j$ we find the values of our parameters that maximize our new ELBO, with

$$w_j^{(i)} = Q_i(z^{(i)} = j) = p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$$

Therefore we need to maximize the following:

$$\sum_{i=1}^{n} \sum_{j=1}^{k} \left[ Q_i(z^{(i)} = j) \log \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{Q_i(z^{(i)} = j)} \right] = \sum_{i=1}^{n} \sum_{j=1}^{k} \left[ w_j^{(i)} \log \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{w_j^{(i)}} \right]$$

When we take the derivative of the above expression with respect to each of our parameters $\phi, \mu, \Sigma$ and set it equal to $0$, we get the update equations for each of the parameters:
$$\phi_l = \frac{1}{n} \sum_{i=1}^{n} w_l^{(i)} \qquad \mu_l = \frac{\sum_{i=1}^{n} w_l^{(i)} x^{(i)}}{\sum_{i=1}^{n} w_l^{(i)}} \qquad \Sigma_l = \frac{\sum_{i=1}^{n} w_l^{(i)} \left(x^{(i)} - \mu_l\right)\left(x^{(i)} - \mu_l\right)^T}{\sum_{i=1}^{n} w_l^{(i)}}$$

Putting the E and M steps together, the full algorithm is:

Repeat until convergence {

  (E-step) For each $i$ and $j$, set
  $$w_j^{(i)} \leftarrow p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$$

  (M-step) For each $j$, set
  $$\phi_j \leftarrow \frac{1}{n} \sum_{i=1}^{n} w_j^{(i)} \qquad \mu_j \leftarrow \frac{\sum_{i=1}^{n} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{n} w_j^{(i)}} \qquad \Sigma_j \leftarrow \frac{\sum_{i=1}^{n} w_j^{(i)} \left(x^{(i)} - \mu_j\right)\left(x^{(i)} - \mu_j\right)^T}{\sum_{i=1}^{n} w_j^{(i)}}$$

}
In the E step, we estimate the probability of each $z^{(i)}$ given $x^{(i)}$ using the current parameters $\phi, \mu, \Sigma$. In the M step, we update our parameters to maximize the ELBO, which, for the $w_j^{(i)}$ chosen in the E step, is equal to the log likelihood $\log p(x; \phi, \mu, \Sigma)$ at the current values of $\phi, \mu, \Sigma$; this is why each iteration can only increase (or leave unchanged) the log likelihood.
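As a concluding sketch, here is the full EM loop implemented under the same assumptions as the earlier snippets; the initialization scheme, the fixed iteration count, and the function name `fit_gmm` are illustrative choices, not prescribed by the derivation above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm(x, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = x.shape

    # Illustrative initialization: uniform mixing weights, means at random data
    # points, and every covariance set to the overall sample covariance.
    phi = np.full(k, 1.0 / k)
    mus = x[rng.choice(n, size=k, replace=False)]
    sigmas = np.array([np.cov(x, rowvar=False) for _ in range(k)])

    for _ in range(n_iters):
        # E-step: w[i, j] = p(z^(i)=j | x^(i); phi, mu, Sigma) via Bayes' rule.
        joint = np.column_stack([
            phi[j] * multivariate_normal.pdf(x, mean=mus[j], cov=sigmas[j])
            for j in range(k)
        ])
        w = joint / joint.sum(axis=1, keepdims=True)

        # M-step: plug the responsibilities into the update equations above.
        nj = w.sum(axis=0)                       # effective number of points per cluster
        phi = nj / n
        mus = (w.T @ x) / nj[:, None]
        sigmas = np.array([
            ((w[:, j, None] * (x - mus[j])).T @ (x - mus[j])) / nj[j]
            for j in range(k)
        ])

    return phi, mus, sigmas
```

For example, calling `fit_gmm(x, k=3)` on the data sampled in the first snippet should recover parameters close to the ones used to generate it, up to a permutation of the cluster labels.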