Factor Analysis
If we have a dataset with $n$ data points and $d$ features and we want to model the data using a mixture of Gaussians, it is very hard to do so when $d \gg n$: with fewer data points than dimensions, the empirical covariance matrix is singular, so we cannot even evaluate the Gaussian density.
One way around this is to restrict the covariance matrix of the data, forcing it to be diagonal, where each entry is just the empirical variance of the $j$-th coordinate:

$$\Sigma_{jj} = \frac{1}{n}\sum_{i=1}^{n}\left(x_j^{(i)} - \mu_j\right)^2$$

However, this fails to capture any correlations between the different features of our data. A better solution is therefore to learn a lower-dimensional representation of our data.
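As a quick illustration, this diagonal estimate is a couple of lines of numpy (a sketch; the variable names and the random data are ours, not from the notes):

```python
import numpy as np

# Hypothetical data matrix with n data points and d features.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))

mu = X.mean(axis=0)                            # per-coordinate means mu_j
Sigma = np.diag(((X - mu) ** 2).mean(axis=0))  # Sigma_jj = (1/n) sum_i (x_j^(i) - mu_j)^2
```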
In the factor analysis model, we assume that each feature $x_j$ is generated by a linear combination of the entries of a lower-dimensional latent variable $z \in \mathbb{R}^k$ (with $k < d$), plus Gaussian noise with diagonal covariance $\Psi \in \mathbb{R}^{d \times d}$. We also assume that $z$ is normally distributed:

$$z \sim \mathcal{N}(0, I)$$
$$x \mid z \sim \mathcal{N}(\mu + Wz, \Psi)$$

where $W \in \mathbb{R}^{d \times k}$. We can rewrite this as:

$$z \sim \mathcal{N}(0, I)$$
$$\epsilon \sim \mathcal{N}(0, \Psi)$$
$$x = \mu + Wz + \epsilon$$
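To make the generative story concrete, here is a minimal numpy sketch that samples a dataset from this model (the dimensions and parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 50, 3          # assumed sizes: n samples, d features, k latent dims

# Arbitrary "true" parameters, for illustration only.
mu = rng.normal(size=d)
W = rng.normal(size=(d, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))             # diagonal noise covariance

z = rng.normal(size=(n, k))                              # z ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(d), Psi, size=n)  # eps ~ N(0, Psi)
X = mu + z @ W.T + eps                                   # x = mu + W z + eps
```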
The two forms are equivalent because any random variable following a normal distribution $x \sim \mathcal{N}(\mu, \sigma^2)$ can be written in terms of $\epsilon \sim \mathcal{N}(0, 1)$:

$$x = \mu + \sigma \cdot \epsilon$$

We can derive the mean and covariance of $x$: since $z$ and $\epsilon$ have zero mean, $\mathbb{E}[x] = \mu$, and since they are independent, $\mathrm{Cov}(x) = W\,\mathrm{Cov}(z)\,W^T + \Psi = WW^T + \Psi$. Therefore $x$ follows the distribution:

$$x \sim \mathcal{N}\left(\mu, WW^T + \Psi\right)$$
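We can sanity-check this result numerically by sampling from the generative model and comparing the empirical covariance with $WW^T + \Psi$ (a self-contained sketch with made-up parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200_000, 5, 2
mu = rng.normal(size=d)
W = rng.normal(size=(d, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))

X = mu + (rng.normal(size=(n, k)) @ W.T
          + rng.multivariate_normal(np.zeros(d), Psi, size=n))

# The empirical covariance should approach W W^T + Psi as n grows.
print(np.abs(np.cov(X, rowvar=False) - (W @ W.T + Psi)).max())
```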
Thus, we can write the log likelihood of our data under this model as:

$$\ell(\mu, W, \Psi) = \log \prod_{i=1}^{n} \frac{1}{(2\pi)^{d/2}\,|WW^T + \Psi|^{1/2}} \exp\left(-\frac{1}{2}\left(x^{(i)} - \mu\right)^T \left(WW^T + \Psi\right)^{-1} \left(x^{(i)} - \mu\right)\right)$$

However, there is no closed-form solution for the maximum likelihood estimates of $\mu$, $W$, and $\Psi$. This is because $W$ and $\Psi$ are coupled in the likelihood function (they appear only through $WW^T + \Psi$), so we cannot optimize them separately. Instead, we fit the model with the EM algorithm, treating $z$ as the latent variable.
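Even without a closed-form maximizer, the log likelihood itself is straightforward to evaluate, since the marginal of $x$ is Gaussian. A sketch using scipy (`log_likelihood` is our own helper name):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, mu, W, Psi):
    """l(mu, W, Psi): log density of each row of X under N(mu, W W^T + Psi), summed."""
    return multivariate_normal(mean=mu, cov=W @ W.T + Psi).logpdf(X).sum()
```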
For the E-step, we need the posterior distribution of $z$ given $x$. Since $z$ and $x$ are jointly Gaussian, this conditional is also Gaussian, and the standard Gaussian conditioning formulas give:

$$z \mid x \sim \mathcal{N}\left(\mu_{z|x}, \Sigma_{z|x}\right)$$
$$\mu_{z|x} = W^T\left(WW^T + \Psi\right)^{-1}(x - \mu)$$
$$\Sigma_{z|x} = I - W^T\left(WW^T + \Psi\right)^{-1}W$$
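In code, both posterior quantities can be computed for all data points at once; note that $\Sigma_{z|x}$ does not depend on $x$, so it is shared across the dataset (a sketch; the function name is ours):

```python
import numpy as np

def posterior(X, mu, W, Psi):
    """Parameters of p(z | x) for each row of X, per the formulas above."""
    k = W.shape[1]
    G = np.linalg.inv(W @ W.T + Psi)   # (W W^T + Psi)^{-1}
    mu_z = (X - mu) @ G @ W            # row i is mu_{z|x}^T for x^(i); G is symmetric
    Sigma_z = np.eye(k) - W.T @ G @ W  # identical for every data point
    return mu_z, Sigma_z
```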
Now, the E-step for our EM algorithm is given by:

$$Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \mu, W, \Psi) = \frac{1}{(2\pi)^{k/2}\,|\Sigma_{z^{(i)}|x^{(i)}}|^{1/2}} \exp\left(-\frac{1}{2}\left(z^{(i)} - \mu_{z^{(i)}|x^{(i)}}\right)^T \Sigma_{z^{(i)}|x^{(i)}}^{-1} \left(z^{(i)} - \mu_{z^{(i)}|x^{(i)}}\right)\right)$$
In the M-step, we need to maximize our evidence lower bound:

$$\mathrm{ELBO}(x; Q, \mu, W, \Psi) = \sum_{i=1}^{n} \mathbb{E}_{z^{(i)} \sim Q_i}\left[\log \frac{p(x^{(i)}, z^{(i)}; \mu, W, \Psi)}{Q_i(z^{(i)})}\right]$$

Maximizing the ELBO with respect to each of the parameters $\mu$, $W$, and $\Psi$, we get the following update equations:
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x^{(i)}$$

$$W = \left(\sum_{i=1}^{n}\left(x^{(i)} - \mu\right)\mu_{z^{(i)}|x^{(i)}}^T\right)\left(\sum_{i=1}^{n}\left(\mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)\right)^{-1}$$

$$\Psi = \frac{1}{n}\sum_{i=1}^{n}\Big(\left(x^{(i)} - \mu\right)\left(x^{(i)} - \mu\right)^T - W\mu_{z^{(i)}|x^{(i)}}\left(x^{(i)} - \mu\right)^T - \left(x^{(i)} - \mu\right)\mu_{z^{(i)}|x^{(i)}}^T W^T + W\left(\mu_{z^{(i)}|x^{(i)}}\mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}\right)W^T\Big)$$

Since $\Psi$ is constrained to be diagonal, we keep only the diagonal entries of this last matrix.
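Putting the E-step and M-step together gives the full algorithm. The following numpy sketch implements the updates above (function and variable names are ours; it assumes the diagonal-$\Psi$ convention just described):

```python
import numpy as np

def factor_analysis_em(X, k, n_iter=100, seed=0):
    """EM for factor analysis, implementing the updates above.

    X: (n, d) data matrix; k: latent dimension. A sketch; all names are ours.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)

    mu = X.mean(axis=0)          # closed-form; independent of W and Psi
    Xc = X - mu                  # centered data, (n, d)
    W = rng.normal(size=(d, k))  # random initialization
    psi = Xc.var(axis=0)         # diagonal of Psi, stored as a vector

    for _ in range(n_iter):
        # E-step: posterior moments of z^(i) given x^(i).
        G = np.linalg.inv(W @ W.T + np.diag(psi))  # (W W^T + Psi)^{-1}
        mu_z = Xc @ G @ W                          # (n, k); row i is mu_{z|x}^T
        Sigma_z = np.eye(k) - W.T @ G @ W          # (k, k); same for every i

        # M-step for W: (sum_i (x-mu) mu_z^T)(sum_i mu_z mu_z^T + Sigma_z)^{-1}
        A = Xc.T @ mu_z                  # (d, k): sum_i (x^(i)-mu) mu_z^T
        B = mu_z.T @ mu_z + n * Sigma_z  # (k, k): sum_i (mu_z mu_z^T + Sigma_z)
        W = A @ np.linalg.inv(B)

        # M-step for Psi: evaluate the update and keep only the diagonal.
        S = Xc.T @ Xc / n                # (1/n) sum_i (x-mu)(x-mu)^T
        C = A / n                        # (1/n) sum_i (x-mu) mu_z^T
        Phi = S - W @ C.T - C @ W.T + W @ (B / n) @ W.T
        psi = np.diag(Phi).copy()

    return mu, W, np.diag(psi)
```

Note that $\mu$ is just the sample mean and does not depend on the other parameters, so it only needs to be computed once. Also, $W$ is identifiable only up to a rotation of the latent space: any $WR$ with orthogonal $R$ yields the same $WW^T + \Psi$, so on data sampled from the model one should compare the recovered $WW^T + \Psi$ rather than $W$ itself.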