The gradient descent update for least mean squares (LMS) looks like this:
$$\theta \leftarrow \theta + \alpha \sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right) x^{(i)}$$
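As a concrete reference, here is a minimal sketch of this batch update in Python with NumPy. The function and parameter names (`lms_gradient_descent`, `alpha`, `n_steps`) are illustrative choices, not part of the original derivation.

```python
import numpy as np

def lms_gradient_descent(X, y, alpha=0.01, n_steps=1000):
    """Batch gradient descent for least mean squares.

    X: (n, d) matrix whose rows are the inputs x^(i)
    y: (n,) vector of targets y^(i)
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_steps):
        residuals = y - X @ theta                   # y^(i) - theta^T x^(i) for all i
        theta = theta + alpha * (X.T @ residuals)   # add alpha * sum_i residual_i * x^(i)
    return theta
```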
However, sometimes the data cannot be fit well by a linear function of $x$. In that case, we can use a feature mapping $\phi$ to transform the data into a higher-dimensional space, and then apply gradient descent as follows:
$$\theta \leftarrow \theta + \alpha \sum_{i=1}^{n} \left( y^{(i)} - \theta^T \phi(x^{(i)}) \right) \phi(x^{(i)})$$
However, this can become computationally expensive if the feature mapping is too complex. For example, if $\phi(x)$ is a vector that contains all the monomials of $x$ with degree $\le 3$, then the dimension of our feature vector is on the order of $d^3$, where $d$ is the dimension of $x$.
This means that $\theta$ also has on the order of $d^3$ parameters, and each gradient update takes on the order of $d^3$ operations. For example, with $d = 1000$ input features, that is already around $10^9$ parameters.
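To make the blow-up concrete, here is a small sketch that builds this degree-$\le 3$ monomial feature map explicitly, counting ordered products separately so the dimension is exactly $1 + d + d^2 + d^3$. The function name `phi_monomials` is just for illustration.

```python
import itertools
import numpy as np

def phi_monomials(x):
    """All monomials of x up to degree 3 (ordered products counted separately)."""
    d = len(x)
    features = [1.0]                                   # degree 0
    features += [x[i] for i in range(d)]               # degree 1
    features += [x[i] * x[j]
                 for i, j in itertools.product(range(d), repeat=2)]     # degree 2
    features += [x[i] * x[j] * x[k]
                 for i, j, k in itertools.product(range(d), repeat=3)]  # degree 3
    return np.array(features)

x = np.random.randn(20)
print(len(phi_monomials(x)))  # 8421 features already, for d = 20
```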
To fix this, we first assume that $\theta$ can be represented as a linear combination of our training examples (or rather, their feature mappings). This is a safe assumption: if $\theta$ is initialized to $0$, every gradient update only adds multiples of the $\phi(x^{(i)})$'s, so $\theta$ always stays in their span.
$$\theta = \sum_{i=1}^{n} \beta_i \phi(x^{(i)})$$
We can then rewrite the gradient descent update as follows:
$$\theta \leftarrow \theta + \alpha \sum_{i=1}^{n} \left( y^{(i)} - \theta^T \phi(x^{(i)}) \right) \phi(x^{(i)})$$
$$\theta \leftarrow \sum_{i=1}^{n} \beta_i \phi(x^{(i)}) + \alpha \sum_{i=1}^{n} \left( y^{(i)} - \theta^T \phi(x^{(i)}) \right) \phi(x^{(i)})$$
$$\theta \leftarrow \sum_{i=1}^{n} \underbrace{\left( \beta_i + \alpha \left( y^{(i)} - \theta^T \phi(x^{(i)}) \right) \right)}_{\text{new } \beta_i} \phi(x^{(i)})$$
With this, to find the new $\theta$, we only need to compute the new $\beta_i$'s for all of our training examples. The new $\beta_i$'s can be found using the following update:
$$\beta_i \leftarrow \beta_i + \alpha \left( y^{(i)} - \theta^T \phi(x^{(i)}) \right)$$
$$\beta_i \leftarrow \beta_i + \alpha \left( y^{(i)} - \Big( \sum_{j=1}^{n} \beta_j \phi(x^{(j)}) \Big)^T \phi(x^{(i)}) \right)$$
$$\beta_i \leftarrow \beta_i + \alpha \left( y^{(i)} - \sum_{j=1}^{n} \beta_j \, \phi(x^{(j)})^T \phi(x^{(i)}) \right)$$
To avoid recomputing the dot products of the feature mappings for all pairs $i, j$ on every iteration, we can precompute all of these dot products once before training starts. They are given by the kernel function:
$$K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^T \phi(x^{(j)}) = \left\langle \phi(x^{(i)}), \phi(x^{(j)}) \right\rangle$$
It would seem that this precomputation is still expensive when we have a lot of training examples, since each dot product naively takes $O(d^3)$ operations. But in reality, the dot product can be rearranged so that it takes only $O(d)$ operations.
See derivation
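For the ordered degree-$\le 3$ monomial map sketched earlier, the rearrangement works out to $\phi(x)^T \phi(z) = 1 + x^T z + (x^T z)^2 + (x^T z)^3$, which only needs the $d$-dimensional inner product $x^T z$. Below is a minimal sketch of that identity; the exact closed form depends on how the monomials in $\phi$ are defined, so treat it as one concrete instance rather than the general rule.

```python
import numpy as np

def poly3_kernel(x, z):
    """K(x, z) = phi(x)^T phi(z) for the ordered degree-<=3 monomial map,
    computed in O(d) as 1 + x^T z + (x^T z)^2 + (x^T z)^3."""
    s = float(np.dot(x, z))  # x^T z, O(d) work
    return 1.0 + s + s ** 2 + s ** 3

# Sanity check against phi_monomials from the earlier sketch:
x, z = np.random.randn(5), np.random.randn(5)
assert np.isclose(phi_monomials(x) @ phi_monomials(z), poly3_kernel(x, z))
```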
The update rule for $\beta_i$ is now:
$$\beta_i \leftarrow \beta_i + \alpha \left( y^{(i)} - \sum_{j=1}^{n} \beta_j K(x^{(i)}, x^{(j)}) \right)$$
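Putting this together, here is a minimal sketch of the kernelized training loop: it precomputes the $n \times n$ matrix of kernel values once and then runs the $\beta$ update, so $\phi$ never has to be formed explicitly. The names `kernelized_lms` and `kernel_fn` are illustrative.

```python
import numpy as np

def kernelized_lms(X, y, kernel_fn, alpha=0.01, n_steps=1000):
    """Kernelized LMS: learn the coefficients beta instead of theta.

    X: (n, d) training inputs, y: (n,) targets,
    kernel_fn(x, z): returns the kernel value K(x, z).
    """
    n = X.shape[0]
    # Precompute all kernel values K(x^(i), x^(j)) once, before training.
    K = np.array([[kernel_fn(X[i], X[j]) for j in range(n)] for i in range(n)])
    beta = np.zeros(n)
    for _ in range(n_steps):
        # beta_i <- beta_i + alpha * (y^(i) - sum_j beta_j K(x^(i), x^(j)))
        beta = beta + alpha * (y - K @ beta)
    return beta
```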
Similarly, for inference, we can predict the value for a new example $x$ as follows:
$$\theta^T \phi(x) = \sum_{i=1}^{n} \beta_i K(x^{(i)}, x)$$
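A matching prediction sketch, again with illustrative names; note that both training and prediction only ever touch the data through kernel values.

```python
def predict(x_new, X_train, beta, kernel_fn):
    """Predict theta^T phi(x_new) = sum_i beta_i K(x^(i), x_new)."""
    return sum(b_i * kernel_fn(x_i, x_new) for b_i, x_i in zip(beta, X_train))
```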
Validity of Kernels
A kernel $K(x, z)$ is valid if there exists a feature mapping $\phi$ such that $K(x, z) = \langle \phi(x), \phi(z) \rangle$.
One example of a valid kernel is the quadratic kernel, $K(x, z) = (x^T z + c)^2$. The feature mapping of this kernel is: