Probabilistic Interpretation
We assume that each target $y^{(i)}$ is generated as

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)},$$

where $\epsilon^{(i)}$ is the noise in the $i$th example and $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. This can be rewritten as
$\epsilon^{(i)} = y^{(i)} - \theta^T x^{(i)}$. Now, since $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$, it follows that

$$\left(y^{(i)} - \theta^T x^{(i)}\right) \sim \mathcal{N}(0, \sigma^2).$$
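As a quick numerical illustration (a minimal sketch with made-up parameters, not part of the original derivation), we can draw data from this model and check that the residuals $y^{(i)} - \theta^T x^{(i)}$ behave like samples from $\mathcal{N}(0, \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 10_000, 3                          # examples and features (made up)
sigma = 0.5                               # noise standard deviation (made up)
theta_true = np.array([1.0, -2.0, 0.5])   # "true" parameters (made up)

X = rng.normal(size=(n, d))               # one x^(i) per row
eps = rng.normal(0.0, sigma, size=n)      # eps^(i) ~ N(0, sigma^2)
y = X @ theta_true + eps                  # y^(i) = theta^T x^(i) + eps^(i)

residuals = y - X @ theta_true            # recovers eps^(i)
print(residuals.mean(), residuals.std())  # approx 0.0 and 0.5
```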
Finally, if we are given $x^{(i)}$, then $y^{(i)} \mid x^{(i)}$ follows the same $\mathcal{N}(0, \sigma^2)$ shifted by $\theta^T x^{(i)}$:

$$y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}\!\left(\theta^T x^{(i)}, \sigma^2\right),$$

that is,

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right).$$
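To make the density concrete, here is a small sketch (function name and values are illustrative) that evaluates the formula above and checks it against `scipy.stats.norm`:

```python
import numpy as np
from scipy.stats import norm

def p_y_given_x(y_i, x_i, theta, sigma):
    """Density p(y | x; theta) under the Gaussian noise model."""
    mu = theta @ x_i                       # mean is theta^T x^(i)
    return np.exp(-(y_i - mu) ** 2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

theta = np.array([1.0, -2.0, 0.5])         # illustrative values
x_i = np.array([0.3, 1.2, -0.7])
y_i, sigma = -2.1, 0.5

print(p_y_given_x(y_i, x_i, theta, sigma))          # formula above
print(norm(loc=theta @ x_i, scale=sigma).pdf(y_i))  # same value via scipy
```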
To find the maximum likelihood estimate of $\theta$, we need to maximize the likelihood

$$L(\theta) = L(\theta; X, Y) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta).$$

Since $\log$ is a monotonically increasing function, we can maximize the log-likelihood instead:
$$\ell(\theta) = \log \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta) = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2.$$

Hence, maximizing $\ell(\theta)$ is equivalent to minimizing

$$\frac{1}{2} \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right)^2,$$

which is exactly the least-squares cost. Notice that the function we need to minimize does not depend on $\sigma$, which means we don't need to know the variance of the noise to be able to maximize the likelihood.
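We can check this numerically. The sketch below (an illustrative setup, using `scipy.optimize.minimize` as a generic optimizer) maximizes $\ell(\theta)$ under two very different assumed values of $\sigma$ and verifies that both recover the closed-form least-squares solution $\theta = (X^T X)^{-1} X^T y$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.5, size=n)

def neg_ll(theta, sigma):
    """Negative log-likelihood -l(theta) for a fixed sigma."""
    r = y - X @ theta
    return -(n * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - r @ r / (2 * sigma**2))

theta_ols = np.linalg.solve(X.T @ X, X.T @ y)  # closed-form least squares

for sigma in (0.1, 10.0):                      # very different noise scales
    theta_mle = minimize(neg_ll, np.zeros(d), args=(sigma,)).x
    print(np.allclose(theta_mle, theta_ols, atol=1e-4))  # True both times
```

Whatever value of $\sigma$ we assume only rescales and shifts $\ell(\theta)$; it never moves the maximizer, which is why both runs agree with ordinary least squares.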