We imagine there is some data $s \in \mathbb{R}^d$ that is generated by $d$ independent sources.
We also have a mixing matrix $A \in \mathbb{R}^{d \times d}$ that mixes the sources to produce the observed data $x \in \mathbb{R}^d$:
$$x = As$$
Repeated observations of the data give us a set of training samples $\{x^{(1)}, \dots, x^{(n)}\}$, and we want to find the unmixing matrix $W = A^{-1}$ that allows us to reconstruct the sources:
$$s = Wx$$
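As a small illustration of the mixing model, the following NumPy sketch draws independent sources, mixes them, and recovers them with the inverse of the mixing matrix. The Laplace source distribution, the particular matrix `A`, and the sample count are illustrative assumptions, not part of the text. In practice `A` is unknown and `W` has to be estimated from the data.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 2, 1000                      # number of sources and samples (illustrative)
S = rng.laplace(size=(n, d))        # row i holds the source vector s^(i) (assumed Laplace)

A = np.array([[1.0, 0.5],
              [0.3, 2.0]])          # hypothetical mixing matrix, chosen to be invertible

X = S @ A.T                         # x^(i) = A s^(i) for every sample
W = np.linalg.inv(A)                # the ideal unmixing matrix W = A^{-1}
S_rec = X @ W.T                     # s^(i) = W x^(i)

print(np.allclose(S_rec, S))        # True: sources recovered up to floating-point error
```

Samples are stored as rows, so the column-vector equations `x = As` and `s = Wx` become `X = S @ A.T` and `S = X @ W.T`.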
We suppose that each source $s_j$ has density $p_s(s_j)$ and is independent of the other sources. The joint distribution of the sources is therefore:
$$p(s) = \prod_{j=1}^{d} p_s(s_j)$$
Change of Variables in Probability
For two vectors related linearly by $s = Wx$, the absolute value of the determinant of $W$, written $|W|$ here, is the factor by which the volume of any region changes under the transformation.
The probability mass in a small region must be the same whether it is measured in $x$ or in $s$, so the two densities differ by exactly this volume factor:
$$p(s) = \frac{1}{|W|} \cdot p(x)$$
Rearranging gives the density of the data in terms of the density of the sources:
$$p(x) = p(s) \cdot |W|$$
Substituting the factorized source distribution with $s_j = w_j^T x$, where $w_j^T$ is the $j$-th row of $W$, gives:
$$p(x) = \left( \prod_{j=1}^{d} p_s(w_j^T x) \right) \cdot |W|$$
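The change-of-variables formula can be sanity-checked in a case where the density of the data is known in closed form. The sketch below assumes standard normal sources purely for this check (then $x = As$ is Gaussian with covariance $AA^T$); the matrix `A`, the test point `x`, and the helper names are hypothetical.

```python
import numpy as np

def std_normal_pdf(z):
    """Density of a standard normal, applied elementwise."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def density_via_sources(x, W):
    """p(x) = prod_j p_s(w_j^T x) * |det W|, with p_s standard normal here."""
    s = W @ x
    return np.prod(std_normal_pdf(s)) * abs(np.linalg.det(W))

def gaussian_density(x, cov):
    """Closed-form density of N(0, cov) evaluated at x."""
    d = x.shape[0]
    quad = x @ np.linalg.solve(cov, x)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

A = np.array([[1.0, 0.5],
              [0.3, 2.0]])          # hypothetical mixing matrix
W = np.linalg.inv(A)
x = np.array([0.7, -1.2])           # arbitrary test point

p_formula = density_via_sources(x, W)
p_closed = gaussian_density(x, A @ A.T)
print(np.isclose(p_formula, p_closed))   # True: both routes give the same density
```

The Gaussian choice is only convenient for verification; it is not a suitable source density for ICA itself.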
To specify $p_s$, we choose a cumulative distribution function (CDF), that is, a monotonically increasing function that rises from 0 to 1; its derivative is the corresponding probability density function. Choosing the sigmoid function $g$ as the CDF, we get:
$$p_s(s) = g'(s)$$
$$g(s) = \frac{1}{1 + \exp(-s)}$$
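A short numerical check that the derivative of the sigmoid behaves like a density may help here. The sketch uses the identity $g'(s) = g(s)(1 - g(s))$ and approximates the integral with a Riemann sum; the grid range and resolution are arbitrary choices.

```python
import numpy as np

def sigmoid(s):
    """Logistic CDF g(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_density(s):
    """Density g'(s) = g(s) * (1 - g(s))."""
    g = sigmoid(s)
    return g * (1.0 - g)

grid = np.linspace(-30.0, 30.0, 60001)        # wide grid; the tails beyond it are negligible
step = grid[1] - grid[0]
area = np.sum(sigmoid_density(grid)) * step   # Riemann-sum approximation of the integral

print(np.isclose(area, 1.0, atol=1e-6))       # True: the density integrates to 1
```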
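Taking the logarithm of $p(x^{(i)})$ and summing over the $n$ training samples, the log-likelihood of the data is:
$$\ell(W) = \sum_{i=1}^{n} \left( \sum_{j=1}^{d} \log g'(w_j^T x^{(i)}) + \log |W| \right)$$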
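One way the log-likelihood might be evaluated in NumPy is sketched below. It assumes samples are stored as rows of `X` and uses $\log g'(z) = \log g(z) + \log g(-z)$ for numerical stability; the function names are hypothetical.

```python
import numpy as np

def log_sigmoid(z):
    """Numerically stable log g(z) = -log(1 + exp(-z))."""
    return -np.logaddexp(0.0, -z)

def log_likelihood(W, X):
    """ICA log-likelihood with a sigmoid CDF for every source.

    W: (d, d) unmixing matrix, rows are w_j^T.
    X: (n, d) data matrix, row i is x^(i).
    """
    Z = X @ W.T                                       # Z[i, j] = w_j^T x^(i)
    log_density = log_sigmoid(Z) + log_sigmoid(-Z)    # log g'(z) = log g(z) + log(1 - g(z))
    _, logabsdet = np.linalg.slogdet(W)               # log |det W|
    return np.sum(log_density) + X.shape[0] * logabsdet

# Toy check (illustrative data): with W = I and x = 0, every term is log g'(0) = log(1/4).
X = np.zeros((3, 2))
print(np.isclose(log_likelihood(np.eye(2), X), 3 * 2 * np.log(0.25)))   # True
```

The determinant term is multiplied by the number of samples because it appears once per training sample.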
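We maximize the log-likelihood by gradient ascent. Using $\frac{d}{dz} \log g'(z) = 1 - 2g(z)$ and $\nabla_W \log |W| = (W^T)^{-1}$, the stochastic update for a single training sample $x^{(i)}$ with learning rate $\eta$ is:
$$W := W + \eta \left( \left(1 - 2g(W x^{(i)})\right) {x^{(i)}}^{T} + (W^T)^{-1} \right)$$
where $g$ is applied elementwise.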
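A gradient-based fit of $W$ might be sketched as follows. This is a minimal illustration, not a tuned implementation: it uses the batch-averaged gradient rather than a per-sample stochastic update, and the names `mean_gradient` and `fit_ica`, the step size, the step count, and the toy data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_likelihood(W, X):
    """Average per-sample log-likelihood (the objective divided by n)."""
    Z = X @ W.T
    log_density = -np.logaddexp(0.0, -Z) - np.logaddexp(0.0, Z)   # log g'(z)
    return np.mean(np.sum(log_density, axis=1)) + np.linalg.slogdet(W)[1]

def mean_gradient(W, X):
    """Gradient of mean_log_likelihood with respect to W."""
    Z = X @ W.T                                               # (n, d)
    data_term = (1.0 - 2.0 * sigmoid(Z)).T @ X / X.shape[0]   # average of (1 - 2g(Wx)) x^T
    return data_term + np.linalg.inv(W).T                     # plus (W^T)^{-1} from log|W|

def fit_ica(X, lr=0.01, n_steps=2000):
    """Batch gradient ascent starting from W = I (step size and count are guesses)."""
    W = np.eye(X.shape[1])
    for _ in range(n_steps):
        W = W + lr * mean_gradient(W, X)                      # ascent: move along the gradient
    return W

rng = np.random.default_rng(0)
S = rng.laplace(size=(2000, 2))                   # assumed non-Gaussian sources
A = np.array([[1.0, 0.5], [0.3, 2.0]])            # hypothetical mixing matrix
X = S @ A.T
W_hat = fit_ica(X)
print(W_hat @ A)    # ideally close to a scaled permutation matrix
```

If the fit succeeds, `W_hat @ A` has one dominant entry per row; it will generally not be the identity, because the order and scale of the sources cannot be determined.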
Independent Component Analysis (ICA) has some inherent ambiguities and limitations:
If our unmixing matrix W is multiplied by a permutation matrix P, there is no way for us to know about it. In this case we won't be able to know which signal was from which source.
$$W \to PW$$
If a row of the unmixing matrix $W$ is scaled by a constant $\alpha$, the corresponding recovered source is scaled by $\alpha$ as well, which is equivalent to scaling the corresponding column of $A$ by $1/\alpha$. There is no way for us to detect this scaling from the data, so we cannot retrieve the true amplitude of a source.
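The permutation and scaling ambiguities can be seen directly on a toy example. In the sketch below the sources, the mixing matrix `A`, the permutation `P`, and the scaling matrix `D` are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.laplace(size=(5, 2))                 # a few illustrative source samples
A = np.array([[1.0, 0.5], [0.3, 2.0]])       # hypothetical mixing matrix
X = S @ A.T
W = np.linalg.inv(A)

P = np.array([[0.0, 1.0], [1.0, 0.0]])       # permutation: swap the two sources
D = np.diag([3.0, 1.0])                      # multiply the first row of W by 3

S_perm = X @ (P @ W).T                       # sources recovered with PW
S_scaled = X @ (D @ W).T                     # sources recovered with DW

print(np.allclose(S_perm, S[:, ::-1]))              # True: same signals, order swapped
print(np.allclose(S_scaled[:, 0], 3.0 * S[:, 0]))   # True: first source rescaled
print(np.allclose(S_perm @ np.linalg.inv(P @ W).T, X))   # True: the data are explained equally well
```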
If the data $x$ follows a Gaussian distribution, then the sources $s = Wx$ are also Gaussian. A Gaussian with identity covariance is rotationally symmetric: its density depends only on the length of the vector. Therefore, if our unmixing matrix $W$ is multiplied by a rotation or reflection matrix $R$, the distribution of the recovered sources is unchanged and there is no way for us to know about it.
$$W \to RW$$
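The rotation ambiguity can be illustrated by comparing log-likelihoods before and after rotating $W$. The sketch below assumes a standard normal source density for the Gaussian case and the sigmoid-based density for comparison; the data, the matrix `W`, and the angle `theta` are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_log_likelihood(W, X):
    """Log-likelihood of X when every source has a standard normal density."""
    Z = X @ W.T
    log_density = -0.5 * Z**2 - 0.5 * np.log(2.0 * np.pi)
    return np.sum(log_density) + X.shape[0] * np.linalg.slogdet(W)[1]

def sigmoid_log_likelihood(W, X):
    """Log-likelihood of X when every source has the sigmoid (logistic) CDF."""
    Z = X @ W.T
    log_density = -np.logaddexp(0.0, -Z) - np.logaddexp(0.0, Z)
    return np.sum(log_density) + X.shape[0] * np.linalg.slogdet(W)[1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                # illustrative data
W = np.array([[1.0, 0.2], [-0.3, 0.8]])      # arbitrary invertible unmixing matrix

theta = 0.7                                  # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Under Gaussian sources, W and RW are indistinguishable by likelihood.
print(np.isclose(gaussian_log_likelihood(W, X), gaussian_log_likelihood(R @ W, X)))   # True

# Under the sigmoid-based density the two values generally differ.
print(sigmoid_log_likelihood(W, X) - sigmoid_log_likelihood(R @ W, X))
```

The Gaussian likelihood depends on $Wx$ only through its length, and $|RW| = |W|$ for any rotation or reflection, which is why the first comparison holds exactly up to rounding.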
Moreover, we have assumed that our data points are independent and identically distributed. This is, however, not true for time-series data, where consecutive samples are usually correlated.
Despite all these limitations, ICA still works very well given enough data.