The classification problem is different from the regression problem in that y takes a discrete value (a category label) rather than a continuous value.
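Therefore, for logistic regression, we choose our $h_\theta(x)$ to be the sigmoid (logistic) function applied to $\theta^T x$, which squashes any real number to a value strictly between 0 and 1:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$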
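As a small illustration of the squashing behaviour, here is a minimal sketch of the sigmoid and the resulting hypothesis in plain Python. The function names `sigmoid` and `hypothesis` are illustrative choices, not names from the text, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form exp(z) / (1 + exp(z)),
    # so that exp() is only ever called on a non-positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(theta: Sequence[float], x: Sequence[float]) -> float:
    """h_theta(x) = sigmoid(theta^T x) for equal-length vectors."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))
```

With all parameters at zero, `hypothesis` returns 0.5 for any input, which matches the idea that the model starts out undecided between the two classes.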
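Let us assume that:

$$p(y = 1 \mid x; \theta) = h_\theta(x), \qquad p(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

This can be rewritten more compactly as:

$$p(y \mid x; \theta) = h_\theta(x)^{y} \, (1 - h_\theta(x))^{1 - y}$$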
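Assuming the $m$ training examples are independent, the log-likelihood can be written as:

$$\ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$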
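To make the log-likelihood concrete, the sketch below evaluates it for labels in {0, 1}, assuming the usual Bernoulli form: the sum over examples of y log h(x) + (1 - y) log(1 - h(x)). The helper names (`log_sigmoid`, `log_likelihood`) and the toy data are hypothetical. It uses the identity log(1 - sigmoid(z)) = log sigmoid(-z) so that no logarithm of zero is taken for extreme scores.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """log(sigmoid(z)) computed stably as -softplus(-z)."""
    # softplus(t) = log(1 + exp(t)) = max(t, 0) + log1p(exp(-|t|))
    t = -z
    return -(max(t, 0.0) + math.log1p(math.exp(-abs(t))))


def log_likelihood(theta: Sequence[float],
                   X: Sequence[Sequence[float]],
                   y: Sequence[int]) -> float:
    """Bernoulli log-likelihood of labels y (0 or 1) under parameters theta."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * v for t, v in zip(theta, x_i))
        # log h(x) = log_sigmoid(z) and log(1 - h(x)) = log_sigmoid(-z)
        total += y_i * log_sigmoid(z) + (1 - y_i) * log_sigmoid(-z)
    return total


# Toy data (made up for illustration); the first feature is a constant intercept.
X_toy = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y_toy = [1, 0, 1]
ll_at_zero = log_likelihood([0.0, 0.0], X_toy, y_toy)
```

At theta = 0 every example has probability 0.5, so the value is simply the number of examples times log(0.5).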
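Taking its derivative with respect to $\theta$, we get:

$$\nabla_\theta \ell(\theta) = \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$$

Because we are maximizing the log-likelihood, the update is a gradient ascent step (note the plus sign), where $\alpha$ is the learning rate:

$$\theta := \theta + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$$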
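The update rule can be sketched as a short batch loop. This is a minimal illustration, assuming the gradient of the log-likelihood is the sum over examples of (y - h(x)) x and a fixed learning rate; the names `fit`, `alpha`, and `steps`, and the toy data, are assumptions made for the example.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(theta: Sequence[float], x: Sequence[float]) -> float:
    """h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def gradient(theta: Sequence[float],
             X: Sequence[Sequence[float]],
             y: Sequence[int]) -> List[float]:
    """Gradient of the log-likelihood: sum_i (y_i - h(x_i)) * x_i."""
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        error = y_i - predict(theta, x_i)
        for j, v in enumerate(x_i):
            grad[j] += error * v
    return grad


def fit(X: Sequence[Sequence[float]], y: Sequence[int],
        alpha: float = 0.1, steps: int = 2000) -> List[float]:
    """Batch gradient ascent on the log-likelihood, starting from zeros."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = gradient(theta, X, y)
        # Ascent: move *along* the gradient because we are maximizing.
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy 1-D data with a constant intercept feature (made up for illustration).
X_toy = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y_toy = [0, 0, 1, 1]
theta_hat = fit(X_toy, y_toy)
```

The learning rate here is small enough for this toy data set; on other data it would typically need tuning, and a stopping criterion based on the gradient norm is a common alternative to a fixed number of steps.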
Also, note that maximizing the log-likelihood is equivalent to minimizing the logistic loss, where $t = \theta^T x$.
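The equivalence can be checked numerically. The text does not show the form of the logistic loss, so this sketch assumes the common definition log(1 + exp(-s t)) with labels s in {-1, +1}, obtained from y in {0, 1} via s = 2y - 1. Under that assumption, the per-example negative log-likelihood and the logistic loss agree for every score t.

```python
import math


def negative_log_likelihood(y: int, t: float) -> float:
    """-[y log sigma(t) + (1 - y) log(1 - sigma(t))] for y in {0, 1}."""
    p = 1.0 / (1.0 + math.exp(-t))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def logistic_loss(s: int, t: float) -> float:
    """log(1 + exp(-s * t)) for s in {-1, +1} (assumed convention)."""
    return math.log1p(math.exp(-s * t))


# Compare the two quantities on a few moderate scores and both labels.
max_gap = 0.0
for t in (-3.0, -0.5, 0.0, 2.0):
    for y in (0, 1):
        s = 2 * y - 1  # map {0, 1} labels to {-1, +1}
        gap = abs(negative_log_likelihood(y, t) - logistic_loss(s, t))
        max_gap = max(max_gap, gap)
```

Since the two expressions match example by example, summing over the training set shows that the parameters maximizing the log-likelihood are the same ones minimizing the total logistic loss.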
Multiclass Classification
For multiclass classification with $k$ classes, we will have $k$ parameter vectors $\theta_1, \dots, \theta_k$ (one per class) and will use a one-vs-all approach.
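Our cross-entropy loss (which is the negative log-likelihood) can then be written as follows, where $1\{\cdot\}$ is the indicator function and $\phi_j^{(i)}$ is the predicted probability that example $i$ belongs to class $j$:

$$J(\theta) = -\sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y^{(i)} = j\} \log \phi_j^{(i)}$$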
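A minimal sketch of the multiclass cross-entropy follows. The text does not say how the class probabilities are computed, so this example assumes a softmax over the k scores theta_j^T x, which is one common choice; other normalizations are possible. Classes are indexed from 0 in the code, and all function names are illustrative.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize scores into probabilities; subtracting the max avoids overflow."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(thetas: Sequence[Sequence[float]],
                        x: Sequence[float]) -> List[float]:
    """phi_j = p(y = j | x; theta) under the assumed softmax model."""
    scores = [sum(t * v for t, v in zip(theta_j, x)) for theta_j in thetas]
    return softmax(scores)


def cross_entropy(thetas: Sequence[Sequence[float]],
                  X: Sequence[Sequence[float]],
                  y: Sequence[int]) -> float:
    """Negative log-likelihood: minus the sum of log phi for each true class."""
    loss = 0.0
    for x_i, y_i in zip(X, y):
        phi = class_probabilities(thetas, x_i)
        # The indicator 1{y_i = j} keeps only the term for the true class.
        loss -= math.log(phi[y_i])
    return loss


# Three classes, two features, all parameters zero (made-up toy setup).
thetas_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
X_toy = [[1.0, 2.0], [1.0, -1.0]]
y_toy = [0, 2]
loss_at_zero = cross_entropy(thetas_zero, X_toy, y_toy)
```

With all parameters at zero each class gets probability 1/3, so the loss equals the number of examples times log(3).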
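Taking the derivative of the cross-entropy loss $J(\theta)$ with respect to $\theta_j$, we get:

$$\nabla_{\theta_j} J(\theta) = \sum_{i=1}^{m} \left( \phi_j^{(i)} - 1\{y^{(i)} = j\} \right) x^{(i)}$$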
Note that $\phi_j^{(i)} = p(y^{(i)} = j \mid x^{(i)}; \theta)$.
Therefore, since $\phi_j^{(i)}$ is a probability between 0 and 1, if $y^{(i)} = j$, example $i$ contributes a negative multiple of $x^{(i)}$ to the gradient, namely $(\phi_j^{(i)} - 1)\, x^{(i)}$. If $y^{(i)} \neq j$, it contributes a positive multiple, $\phi_j^{(i)}\, x^{(i)}$.
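The sign pattern described above can be seen in a short sketch that computes one example's contribution to each class gradient as (phi_j - 1{y = j}) x. It assumes softmax probabilities, which the text does not specify, and classes indexed from 0. The function names and numbers are made up for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Overflow-safe softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_contributions(thetas: Sequence[Sequence[float]],
                           x: Sequence[float],
                           label: int) -> List[List[float]]:
    """Per-class gradient contribution of one example: (phi_j - 1{label = j}) * x."""
    scores = [sum(t * v for t, v in zip(theta_j, x)) for theta_j in thetas]
    phi = softmax(scores)
    contributions = []
    for j, phi_j in enumerate(phi):
        indicator = 1.0 if j == label else 0.0
        contributions.append([(phi_j - indicator) * v for v in x])
    return contributions


# One example with positive features whose true class is 0 (hypothetical values).
thetas = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.0]]
x_example = [1.0, 2.0]
contribs = gradient_contributions(thetas, x_example, label=0)
```

Because the features in this example are positive, the contribution for the true class has all negative entries and the contributions for the other classes have all positive entries. A descent step on the loss therefore moves the true class's parameters toward the example and the other classes' parameters away from it.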