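The training set's inputs can be stacked into a design matrix $X$ whose rows are the training inputs:
$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$
Also, let $y$ be the $m$-dimensional vector containing the target values of the training set:
$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$
Now, since $h_\theta(x^{(i)}) = (x^{(i)})^T \theta$, the $i$-th entry of $X\theta - y$ is $h_\theta(x^{(i)}) - y^{(i)}$, and we can verify that:
$$\frac{1}{2}(X\theta - y)^T (X\theta - y) = \frac{1}{2}\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = J(\theta)$$
To minimize $J$, we take its gradient with respect to $\theta$:
$$\nabla_\theta J(\theta) = X^T X \theta - X^T y$$
Setting this gradient to zero, we obtain the normal equations:
$$X^T X \theta = X^T y$$
Thus, provided $X^T X$ is invertible, the value of $\theta$ that minimizes $J$ is:
$$\theta = (X^T X)^{-1} X^T y$$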
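As a minimal sketch of the closed-form solution, the NumPy snippet below solves the normal equations on a small synthetic data set. The function name `fit_normal_equations` and the toy data are illustrative assumptions, and the code assumes $X^T X$ is invertible.

```python
import numpy as np

def fit_normal_equations(X, y):
    """Return theta satisfying the normal equations X^T X theta = X^T y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solve the linear system directly instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data generated from y = 1 + 2x (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # column of ones = intercept term
y = 1.0 + 2.0 * x

theta = fit_normal_equations(X, y)
print(theta)
```

Solving the system with `np.linalg.solve` gives the same $\theta$ as the inverse formula when $X^T X$ is invertible, and it is usually the more numerically stable choice.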
Probabilistic interpretation
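To discover whether $J$ is a reasonable choice, let us assume that the target variables and the inputs are related via the equation
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
where $\epsilon^{(i)}$ is an error term that captures unmodeled effects and random noise.
Let us further assume that the $\epsilon^{(i)}$ are IID (independently and identically distributed) according to a Gaussian distribution with mean zero and variance $\sigma^2$. We can write this as follows:
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
When we view the probability of the training data as a function of $\theta$, we call it the likelihood function:
$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$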
The principle of maximum likelihood says we should choose $\theta$ to maximize $L(\theta)$.
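To make the maximization simpler, we instead maximize the log likelihood $\ell(\theta)$:
$$\ell(\theta) = \log L(\theta) = m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$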
Hence, maximizing $\ell(\theta)$ is the same as minimizing $\frac{1}{2}\sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2 = J(\theta)$.
Note that our final choice of $\theta$ did not depend on $\sigma^2$.
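The equivalence can also be checked numerically. The sketch below is an illustration under assumed synthetic data; the helper names `cost` and `log_likelihood`, the noise level, and the perturbation are all hypothetical. It evaluates the Gaussian log likelihood for several values of $\sigma$ and compares the least-squares $\theta$ against a perturbed one.

```python
import numpy as np

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum of squared residuals."""
    r = y - X @ theta
    return 0.5 * float(r @ r)

def log_likelihood(theta, X, y, sigma):
    """Sum over examples of log N(y_i; theta^T x_i, sigma^2)."""
    r = y - X @ theta
    per_example = -np.log(np.sqrt(2.0 * np.pi) * sigma) - r**2 / (2.0 * sigma**2)
    return float(np.sum(per_example))

# Hypothetical synthetic data: y = 1 + 2x plus Gaussian noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1.0, 1.0, size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=50)

theta_ls = np.linalg.solve(X.T @ X, X.T @ y)      # least-squares solution
theta_other = theta_ls + np.array([0.1, -0.2])    # an arbitrary different theta

# Log-likelihood advantage of the least-squares theta for each assumed sigma.
gaps = [
    log_likelihood(theta_ls, X, y, s) - log_likelihood(theta_other, X, y, s)
    for s in (0.1, 1.0, 10.0)
]
print(gaps)
```

Every gap is positive: changing $\sigma$ rescales the log likelihood but does not change which $\theta$ maximizes it.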
Locally weighted linear regression
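In contrast to the original linear regression algorithm, locally weighted regression gives the squared error of every training example a non-negative weight $w^{(i)}$ and fits $\theta$ to minimize $\sum_{i} w^{(i)} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$. Training examples with very small weights are thus effectively "ignored" in the fit.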
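A fairly standard choice for the weights is:
$$w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$$
where $x$ is the query point at which we want a prediction and $\tau$ is the bandwidth parameter, which controls how quickly a training example's weight falls off with its distance from $x$.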
Locally weighted linear regression is a non-parametric algorithm: to make predictions, we need to keep the entire training set around, so the amount of information needed to represent the hypothesis $h$ grows linearly with the size of the training set.
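A rough sketch of a single locally weighted prediction follows. It assumes Gaussian-shaped weights with a bandwidth `tau`, uses squared Euclidean distance for vector-valued inputs, and solves the weighted normal equations $X^T W X \theta = X^T W y$ with $W = \mathrm{diag}(w^{(i)})$. The function name `lwr_predict` and the sine-curve data are hypothetical.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query by fitting a weighted linear model around it."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Weights are near 1 for examples close to the query and near 0 far away.
    sq_dist = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-sq_dist / (2.0 * tau**2))
    # Weighted normal equations: (X^T W X) theta = X^T W y, with W = diag(w).
    XtW = X.T * w  # broadcasting scales column i of X^T by w[i]
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return float(x_query @ theta)

# Hypothetical non-linear data: samples of sin(x) on [0, 6].
x = np.linspace(0.0, 6.0, 61)
X = np.column_stack([np.ones_like(x), x])  # intercept column plus the input
y = np.sin(x)

# The whole training set (X, y) is needed again for every new query point.
prediction = lwr_predict(np.array([1.0, 1.5]), X, y, tau=0.3)
print(prediction, np.sin(1.5))
```

Because $\theta$ is refit for each query, the cost of one prediction scales with the number of stored training examples, which is the practical consequence of the algorithm being non-parametric.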