And so that's how you compute this thing.
You compute X'X and then you compute its inverse.
We haven't yet talked about Octave; we'll do so in a later
set of videos. But in the
Octave programming language (and the
MATLAB programming language, which is very similar),
the command to compute this quantity,
X transpose X inverse times
X transpose y, is as follows.
In Octave, X prime is
the notation that you use to denote X transpose.
And so this expression that's
boxed in red is computing
X transpose times X.
pinv is a function for
computing the inverse of
a matrix, so this computes
X transpose X inverse,
and then you multiply that by
X transpose, and you multiply
that by y. So you
end up computing that formula
which I didn't prove,
but it is possible to
show mathematically, even though I'm
not going to do so
here, that this formula gives you
the optimal value of theta,
in the sense that if you set theta equal
to this, that's the value
of theta that minimizes the
cost function J of theta
for linear regression.
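To make that concrete, here is a minimal Octave sketch of that one-line command, assuming X is the design matrix with a leading column of ones and y is the vector of target values; the particular numbers below are made up purely for illustration.

```
% Illustrative design matrix X (first column of ones for the intercept
% term) and target vector y -- the values here are just made up.
X = [1 2104; 1 1416; 1 1534; 1 852];
y = [460; 232; 315; 178];

% Normal equation: theta = (X'X)^-1 X'y
% X' denotes X transpose, and pinv computes the (pseudo-)inverse.
theta = pinv(X' * X) * X' * y;
```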
One last detail: in an earlier video,
I talked about feature
scaling and the idea of
getting features to be
on similar ranges of values as each other.
If you are using this normal
equation method, then feature
scaling isn't actually necessary.
It is okay if,
say, some feature x1
ranges from zero to one,
some feature x2
ranges from zero to
one thousand, and some feature
x3 ranges from zero
to ten to the
minus five; with the normal equation method
this is okay and there is
no need to do feature
scaling. Although of course,
if you are using gradient descent,
then feature scaling is still important.
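As a small illustration of that point, here is a hedged sketch (the data are made up) in which the three features live on exactly those very different scales, and the normal equation is applied directly with no scaling step.

```
% Features on very different scales: x1 in [0, 1], x2 in [0, 1000],
% x3 around 1e-5 -- the numbers are made up for illustration.
X = [1  0.3  250   2e-6;
     1  0.8  900   7e-6;
     1  0.1   50   1e-6;
     1  0.6  600   9e-6];
y = [1.2; 3.4; 0.5; 2.8];

% No feature scaling needed; the normal equation works on the raw values.
theta = pinv(X' * X) * X' * y;
```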
Finally, when should you use gradient descent
and when should you use the normal equation method?
Here are some of their advantages and disadvantages.
Let's say you have m training
examples and n features.
One disadvantage of gradient descent
is that you need to choose the learning rate alpha.
And often this means running
it a few times with different learning
rates alpha and then seeing what works best.
So that is sort of extra work and extra hassle.
Another disadvantage of gradient descent
is that it needs many more iterations.
So, depending on the details,
that could make it slower, although
there's more to the story, as we'll see in a second.
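For comparison, here is a hedged sketch of what the gradient descent alternative looks like for linear regression, reusing X and y from the earlier sketch; the learning rate alpha and the number of iterations are exactly the extra choices being described, and the specific values shown are arbitrary.

```
alpha = 0.01;        % learning rate -- must be chosen by hand
num_iters = 1500;    % number of iterations -- also must be chosen
m = length(y);       % number of training examples

theta = zeros(size(X, 2), 1);   % initialize the parameters to zero
for iter = 1:num_iters
    % Batch gradient descent update for linear regression.
    theta = theta - (alpha / m) * (X' * (X * theta - y));
end
```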
As for the normal equation, you don't need to choose any learning rate alpha.
So that, you know, makes it really convenient, makes it simple to implement.
You just run it and it usually just works.
And you don't need to
iterate, so you don't need
to plot J of theta or
check for convergence or take all those extra steps.
So far, the balance seems to
favor the normal equation.