The classification problem is different from the regression problem in that y takes a discrete value (a category label) rather than a continuous value.
Therefore, for logistic regression, we will choose our $h_\theta(x)$ to be a sigmoid function that squishes any real number to a value between 0 and 1.
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
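As a concrete illustration, here is a minimal NumPy sketch of this hypothesis; the function names `sigmoid` and `h` are just illustrative choices, not from any particular library.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z)): maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x) for 1-D arrays theta and x."""
    return sigmoid(np.dot(theta, x))

# Large negative inputs are squashed toward 0, large positive inputs toward 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # approx. [4.5e-05, 0.5, 0.99995]
```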
Let us assume that:
$$p(y=1 \mid x; \theta) = h_\theta(x)$$
$$p(y=0 \mid x; \theta) = 1 - h_\theta(x)$$
These two cases can be combined into a single expression:
$$p(y \mid x; \theta) = \left(h_\theta(x)\right)^{y} \left(1 - h_\theta(x)\right)^{1-y}$$
Now, the log-likelihood can be written as:
$$
\begin{aligned}
\ell(\theta) &= \log \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta) \\
&= \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; \theta) \\
&= \sum_{i=1}^{n} \log \left[ \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}} \right]
\end{aligned}
$$
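A minimal NumPy sketch of this log-likelihood, assuming a design matrix `X` of shape `(n, d)`, 0/1 labels `y` of shape `(n,)`, and parameters `theta` of shape `(d,)` (these names and shapes are assumptions made for the example):

```python
import numpy as np

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for every example
    # Taking the log of (h^y * (1 - h)^(1 - y)) yields the familiar two-term sum.
    return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```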
Taking its derivative with respect to $\theta_j$, we get:
$$\frac{\partial \ell(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$
And so our gradient ascent update rule to maximize the log-likelihood (written here for a single training example $(x^{(i)}, y^{(i)})$, i.e. the stochastic version) becomes:
$$\theta_j \leftarrow \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$
See derivation
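A sketch of this rule in NumPy, under the same assumed shapes as the log-likelihood example above; `sgd_ascent_step` applies the single-example update, and `batch_gradient` computes the corresponding full-batch gradient:

```python
import numpy as np

def sgd_ascent_step(theta, x_i, y_i, alpha=0.1):
    """One stochastic gradient ascent step:
    theta_j <- theta_j + alpha * (y_i - h_theta(x_i)) * x_ij for every j."""
    h_i = 1.0 / (1.0 + np.exp(-np.dot(theta, x_i)))
    return theta + alpha * (y_i - h_i) * x_i   # vectorized over all components j

def batch_gradient(theta, X, y):
    """Full gradient of ell(theta): sum_i (y_i - h_theta(x_i)) x^(i)."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ (y - h)
```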
Also, note that maximizing the log-likelihood is equivalent to minimizing the logistic loss, where $t = \theta^T x$:
$$\arg\min_{\theta} \ell_{\text{logistic}}(t, y) = \arg\max_{\theta} \ell(\theta)$$
See derivation
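For a single example, the equivalence can be sketched directly, under the common convention $\ell_{\text{logistic}}(t, y) = y \log(1 + e^{-t}) + (1 - y) \log(1 + e^{t})$ for $y \in \{0, 1\}$ (this particular form of the loss is an assumption of the sketch):
$$
\begin{aligned}
\log p(y \mid x; \theta) &= y \log h_\theta(x) + (1 - y) \log(1 - h_\theta(x)) \\
&= -y \log(1 + e^{-t}) - (1 - y) \log(1 + e^{t}) \\
&= -\ell_{\text{logistic}}(t, y)
\end{aligned}
$$
so minimizing the summed logistic loss over the training set is the same as maximizing $\ell(\theta)$.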
Multiclass Classification
For multi-class classification with $k$ classes, we will have $k$ parameter vectors $\theta_1, \dots, \theta_k$ and will use the softmax function to model the class probabilities:
$$p(y = i \mid x; \theta) = \phi_i = \frac{\exp(\theta_i^T x)}{\sum_{j=1}^{k} \exp(\theta_j^T x)}$$
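A minimal NumPy sketch of these class probabilities, assuming the $k$ parameter vectors are stacked as rows of a `(k, d)` matrix `Theta` (this layout, and the max-subtraction for numerical stability, are choices made for the example):

```python
import numpy as np

def softmax_probs(Theta, x):
    """phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x).

    Theta: (k, d) matrix whose i-th row is theta_i; x: (d,) feature vector.
    Returns a (k,) vector of class probabilities that sums to 1.
    """
    logits = Theta @ x                # theta_i^T x for every class i
    logits = logits - np.max(logits)  # shift by the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)
```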
Our cross-entropy loss (which is the negative log-likelihood) can then be written as:
$$\ell_{ce}(\theta) = -\sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) = -\sum_{i=1}^{m} \log \phi_{y^{(i)}}^{(i)}$$
Taking the derivative of the cross-entropy loss with respect to $\theta_j$, we get:
$$\frac{\partial \ell_{ce}(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left(\phi_j^{(i)} - 1\{y^{(i)} = j\}\right) x^{(i)}$$
Note that $\phi_j^{(i)} = p(y^{(i)} = j \mid x^{(i)}; \theta)$.
Therefore, since $\phi_j^{(i)}$ is a probability between 0 and 1: if $y^{(i)} = j$, the term $\phi_j^{(i)} - 1$ is negative, so we add a negative multiple of $x^{(i)}$ to our gradient; and if $y^{(i)} \neq j$, the term is simply $\phi_j^{(i)} > 0$, so we add a positive multiple of $x^{(i)}$ to our gradient.
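A minimal NumPy sketch of this gradient, assuming integer labels in $\{0, \dots, k-1\}$, a design matrix `X` of shape `(m, d)`, and the same `(k, d)` parameter matrix `Theta` as in the softmax sketch above:

```python
import numpy as np

def cross_entropy_grad(Theta, X, y):
    """Returns a (k, d) matrix whose j-th row is
    sum_i (phi_j^(i) - 1{y^(i) = j}) x^(i), the gradient w.r.t. theta_j."""
    logits = X @ Theta.T                                  # (m, k): theta_j^T x^(i)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    phi = np.exp(logits)
    phi = phi / phi.sum(axis=1, keepdims=True)            # (m, k): phi_j^(i)
    onehot = np.eye(Theta.shape[0])[y]                    # (m, k): 1{y^(i) = j}
    return (phi - onehot).T @ X                           # (k, d)
```

When this gradient is used in a descent step, $\theta_j \leftarrow \theta_j - \alpha \, \partial \ell_{ce} / \partial \theta_j$, the signs above mean $\theta_j$ is pushed toward examples of class $j$ and away from examples of the other classes.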