The training set’s inputs can be written as the following matrix, with one training example per row:
$$
X = \begin{bmatrix}
(x^{(1)})^T \\
(x^{(2)})^T \\
\vdots \\
(x^{(m)})^T
\end{bmatrix}
$$
Also, let $y$ be the $m$-dimensional vector of target values:
$$
y = \begin{bmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(m)}
\end{bmatrix}
$$
Now, since $h_\theta(x^{(i)}) = (x^{(i)})^T \theta$, we can verify that:
$$
J(\theta) = \frac{1}{2}(X\theta - y)^T(X\theta - y)
$$
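As a concrete check, here is a small NumPy sketch (with made-up synthetic data) confirming that the vectorized expression $\frac{1}{2}(X\theta - y)^T(X\theta - y)$ matches the per-example sum of squared errors:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
X = rng.normal(size=(m, n))   # rows are the (x^(i))^T
y = rng.normal(size=m)        # target vector
theta = rng.normal(size=n)

# Vectorized cost: (1/2) (X theta - y)^T (X theta - y)
J_matrix = 0.5 * (X @ theta - y) @ (X @ theta - y)

# Per-example cost: (1/2) sum_i (theta^T x^(i) - y^(i))^2
J_sum = 0.5 * sum((theta @ X[i] - y[i]) ** 2 for i in range(m))

assert np.isclose(J_matrix, J_sum)
```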
To minimize $J$, we take its gradient with respect to $\theta$:
$$
\nabla_\theta J(\theta) = X^T X\theta - X^T y
$$
Setting this gradient to zero, we obtain the normal equations:
$$
X^T X\theta = X^T y
$$
Thus, the value of $\theta$ that minimizes $J(\theta)$ is:
$$
\theta = (X^T X)^{-1} X^T y
$$
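A minimal NumPy sketch of this closed-form solution, assuming a synthetic dataset with an explicit intercept column (the data and variable names here are purely illustrative); solving the linear system is numerically preferable to explicitly forming the inverse:

```python
import numpy as np

# Synthetic training set: m examples, n features plus an intercept column.
rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # design matrix
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)              # noisy targets

# Normal equations: X^T X theta = X^T y, solved as a linear system.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # should be close to true_theta
```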
Probabilistic interpretation
To see why $J$ is a reasonable choice of cost function, let us assume that the target variables and the inputs are related via the equation
$$
y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}
$$
Let us further assume that the $\epsilon^{(i)}$ are distributed IID (independently and identically distributed) according to a Gaussian distribution with mean zero and variance $\sigma^2$. We can write this as follows:
$$
p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)
$$
When we instead view this quantity as a function of $\theta$ (with the data fixed), we call it the likelihood function:
$$
L(\theta) = L(\theta; X, y) = p(y \mid X; \theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)
$$
The principle of maximum likelihood says we should choose $\theta$ to maximize $L(\theta)$.
To simplify the algebra, we instead maximize the log likelihood $\ell(\theta)$ (since the logarithm is strictly increasing, this yields the same maximizer):
$$
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) = m \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2
$$
Hence, maximizing $\ell(\theta)$ is the same as minimizing $\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2 = J(\theta)$.
Note that our final choice of $\theta$ did not depend on $\sigma^2$.
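As a quick numerical sanity check on this equivalence, the following sketch (synthetic data; `scipy.optimize.minimize` used as a generic optimizer, with $\sigma$ treated as fixed and known) verifies that maximizing the Gaussian log likelihood recovers essentially the same $\theta$ as the least-squares / normal-equation solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
m, n = 200, 3
X = rng.normal(size=(m, n))
true_theta = np.array([1.5, -2.0, 0.7])
sigma = 0.3
y = X @ true_theta + sigma * rng.normal(size=m)

# Least-squares solution via the normal equations.
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Negative Gaussian log likelihood of theta (sigma held fixed).
def neg_log_likelihood(theta):
    residuals = y - X @ theta
    return (m * np.log(np.sqrt(2 * np.pi) * sigma)
            + residuals @ residuals / (2 * sigma ** 2))

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(n)).x

print(np.allclose(theta_ls, theta_mle, atol=1e-3))  # True: the two coincide
```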
Locally weighted linear regression
In contrast to the original linear regression algorithm, locally weighted linear regression gives each training example a weight $w^{(i)}$ and fits $\theta$ to minimize $\sum_{i} w^{(i)}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$; training examples far from the query point $x$ receive a small weight, so they are effectively “ignored” in the fit.
A fairly standard choice for the weights is:
$$
w^{(i)} = \exp\!\left(-\frac{\left(x^{(i)} - x\right)^2}{2\tau^2}\right)
$$
where $x$ is the query point at which we want to make a prediction, and the bandwidth parameter $\tau$ controls how quickly a training example’s weight falls off with its distance from $x$.
Locally weighted linear regression is a non-parametric algorithm: to make predictions we need to keep the entire training set around, because the amount of data needed to represent the hypothesis $h$ grows linearly with the size of the training set.
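A minimal sketch of locally weighted prediction, assuming one-dimensional synthetic data and an illustrative helper `lwr_predict` of my own naming: for each query point it solves the weighted normal equations $X^T W X\,\theta = X^T W y$ with $W = \mathrm{diag}(w^{(1)}, \dots, w^{(m)})$, which minimizes $\sum_i w^{(i)}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear regression prediction at a single query point.

    Solves the weighted normal equations X^T W X theta = X^T W y, where the
    weights fall off with each x^(i)'s distance from the query point.
    """
    # Gaussian-shaped weights w^(i) = exp(-||x^(i) - x||^2 / (2 tau^2))
    diffs = X - x_query
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Synthetic 1-D example with an intercept column; note that the entire
# training set (X, y) must be kept around to answer each query.
rng = np.random.default_rng(3)
x_raw = np.linspace(0, 10, 200)
X = np.column_stack([np.ones_like(x_raw), x_raw])
y = np.sin(x_raw) + 0.1 * rng.normal(size=x_raw.size)

print(lwr_predict(np.array([1.0, 2.5]), X, y, tau=0.5))  # roughly sin(2.5)
```

Note how, unlike the parametric case, every call to `lwr_predict` re-solves a fit over the full training set.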