And so that's how you compute this thing.

You compute X'X and then you compute its inverse.

We haven't yet talked about Octave. We'll do so in a later set of videos, but in the Octave programming language, or a similar environment (the MATLAB programming language is very similar), the command to compute this quantity, X transpose X inverse times X transpose Y, is as follows.

In Octave, X prime (X') is the notation that you use to denote X transpose. And so this expression that's boxed in red is computing X transpose times X. pinv is a function for computing the pseudo-inverse of a matrix, so this computes X transpose X inverse; then you multiply that by X transpose, and you multiply that by Y. So you end up computing that formula, which I didn't prove. But it is possible to show mathematically, even though I'm not going to do so here, that this formula gives you the optimal value of theta, in the sense that if you set theta equal to this, that's the value of theta that minimizes the cost function J of theta for linear regression.
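In NumPy notation, the same one-line Octave command pinv(X'*X)*X'*y can be sketched as follows (the toy data here is hypothetical, not from the lecture):

```python
import numpy as np

# Toy design matrix: a column of ones for the intercept plus one feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])  # exactly y = 1 + x

# Normal equation: theta = (X^T X)^{-1} X^T y,
# using the pseudo-inverse, as in the Octave command pinv(X'*X)*X'*y.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)  # close to [1., 1.]
```

Because the data lies exactly on the line y = 1 + x, the recovered theta is (approximately) the intercept 1 and slope 1.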

One last detail from the earlier video: I talked about feature scaling and the idea of getting features to be on similar ranges of values as each other.

If you are using this normal equation method, then feature scaling isn't actually necessary. It is actually okay if, say, some feature x1 is between zero and one, some feature x2 ranges from zero to one thousand, and some feature x3 ranges from zero to ten to the minus five. If you are using the normal equation method, this is okay and there is no need to do feature scaling, although of course if you are using gradient descent, then feature scaling is still important.
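As a sketch of this point in NumPy (synthetic data; the 10^-5-scale third feature from the example is omitted here only to keep the demo numerically comfortable in double precision): the normal equation recovers the parameters in one shot even though x2 is a thousand times larger than x1, with no scaling applied.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
# Two features on very different scales, in the spirit of the lecture's
# example: x1 in [0, 1], x2 in [0, 1000] -- no feature scaling applied.
x1 = rng.uniform(0.0, 1.0, m)
x2 = rng.uniform(0.0, 1000.0, m)
X = np.column_stack([np.ones(m), x1, x2])

true_theta = np.array([2.0, 3.0, 0.5])
y = X @ true_theta  # noiseless target, so recovery can be checked directly

# Normal equation, same form as the Octave command pinv(X'*X)*X'*y.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)  # approximately [2.0, 3.0, 0.5]
```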

Finally, when should you use gradient descent and when should you use the normal equation method? Here are some of their advantages and disadvantages. Let's say you have m training examples and n features.

One disadvantage of gradient descent is that you need to choose the learning rate alpha. And often this means running it a few times with different learning rates alpha and then seeing what works best. So that is sort of extra work and extra hassle.

Another disadvantage of gradient descent is that it needs many more iterations.

So, depending on the details,

that could make it slower, although

there's more to the story as we'll see in a second.
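For contrast, here is a minimal batch-gradient-descent sketch for linear regression (synthetic, already-scaled data; alpha = 0.5 and the 2000-iteration budget are hand-picked by trial, which is exactly the extra hassle being described):

```python
import numpy as np

# Tiny scaled dataset lying exactly on y = 1 + 2*x.
X = np.array([[1.0, 0.0],
              [1.0, 0.5],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
m = len(y)

alpha = 0.5            # learning rate: chosen by trial and error
theta = np.zeros(2)
for _ in range(2000):  # many iterations, vs. one shot for the normal equation
    grad = (X.T @ (X @ theta - y)) / m  # gradient of the cost J(theta)
    theta = theta - alpha * grad

print(theta)  # approaches [1., 2.]
```

The normal equation would reach the same theta in a single expression, with no alpha and no iteration count to tune.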

As for the normal equation, you don't need to choose any learning rate alpha.

So that, you know, makes it really convenient, makes it simple to implement.

You just run it and it usually just works.

And you don't need to

iterate, so, you don't need

to plot J of Theta or

check the convergence or take all those extra steps.

So far, the balance seems to favor the normal equation.