0:00

In this video, we'll talk about

the normal equation, which, for some linear regression problems, will

give us a much better way

to solve for the optimal value

of the parameters theta.

Concretely, so far the

algorithm that we've been using

for linear regression is gradient

descent, where, in order to minimize the cost function J of Theta, we take this iterative algorithm, which requires many steps, multiple iterations of gradient descent, to converge to the global minimum.

In contrast, the normal equation

would give us a method to

solve for theta analytically, so

that rather than needing to run

this iterative algorithm, we can

instead just solve for the

optimal value for theta

all in one go, so that in

basically one step you get

to the optimal value right there.

0:49

It turns out the normal equation has some advantages and

some disadvantages, but before

we get to that and talk about

when you should use it, let's

get some intuition about what this method does.

For the sake of this example, let's imagine taking a very simplified cost function J of Theta that's just a function of a real number Theta. So, for now, imagine that Theta is just a scalar value, or that Theta is just a real value. It's just a number, rather than a vector.

Imagine that we have a cost function J that's a quadratic function of this real value

parameter Theta, so J of Theta looks like that.

Well, how do you minimize a quadratic function?

For those of you that know a little bit of calculus,

you may know that the way to

minimize a function is to

take derivatives and to

set derivatives equal to zero.

So, you take the derivative of J with respect to the parameter Theta. You get some formula, which I am not going to derive; you set that derivative equal to zero, and this allows you to solve for the value of Theta that minimizes J of Theta.

That was the simpler case, where Theta was just a real number.
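As a quick worked illustration (a generic quadratic with made-up coefficients a, b, c, not the exact function on the slide):

```latex
J(\theta) = a\theta^2 + b\theta + c \quad (a > 0),
\qquad
\frac{dJ}{d\theta} = 2a\theta + b = 0
\quad\Longrightarrow\quad
\theta = -\frac{b}{2a}
```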

In the problem that we are interested in, Theta is

2:04

no longer just a real number but is instead this n+1-dimensional parameter vector, and the cost function J is a function of this vector of values, Theta 0 through Theta n. And the cost function looks like this, some squared-error cost function on the right.

How do we minimize this cost function J?

Calculus tells us that one way to do so is to take the partial derivative of J with respect to every parameter Theta j in turn, and then to set all of these to zero.

If you do that, and you

solve for the values of

Theta 0, Theta 1,

up to Theta n, then this would give you the values of Theta that minimize the cost function J. If you actually work through the calculus and work out the solution for the parameters Theta 0 through Theta n, the derivation ends up being somewhat involved.
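In symbols, using the squared-error cost function from the earlier linear regression videos, the condition just described is:

```latex
J(\theta_0, \ldots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,
\qquad
\frac{\partial}{\partial \theta_j} J(\theta) = 0 \quad \text{for every } j = 0, 1, \ldots, n
```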

And, what I am going

to do in this video,

is actually to not go

through the derivation, which is kind

of long and kind of involved, but

what I want to do is just

tell you what you need to know

in order to implement this process

so you can solve for the values of theta that correspond to where the partial derivatives are equal to zero, or, equivalently, the values of theta that minimize the cost function J of theta.

I realize that some of the comments I made may make sense only to those of you who are more familiar with calculus. But if you're less familiar with calculus, don't worry about it.

I'm just going to tell you what

you need to know in order to

implement this algorithm and get it to work.

For the example that I want to use as a running example, let's say that I have m = 4 training examples.

3:50

In order to implement this normal equation, what I'm going to do is the following.

I'm going to take my

data set, so here are my four training examples.

In this case, let's assume that, you know, these four examples are all the data I have. What I am going to do is take my data set and add an extra column that corresponds to my extra feature, x0, which always takes on the value of 1.

What I'm then going to do is construct a matrix called X that basically contains all of the features from my training data. So here are all my features, and we're going to take all those numbers and put them into this matrix X, okay? So just, you know, copy the data over one column at a time, and then I am going to do something similar for the y's.

I am going to take the

values that I'm trying to

predict and construct a vector, like so, and call that the vector y.
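As a minimal sketch in Octave of this construction, with made-up numbers for the four training examples (illustrative values, not necessarily the ones on the slide):

```octave
% Design matrix X: one row per training example; the first column is the
% extra feature x0 = 1, the remaining columns are the original features.
X = [1 2104 5;
     1 1416 3;
     1 1534 3;
     1  852 2];              % m x (n+1), here 4 x 3

% Vector y: the values we are trying to predict, stacked up.
y = [460; 232; 315; 178];    % m-dimensional, here 4 x 1
```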

So X is going to be a

4:59

an m by (n+1)-dimensional matrix, and Y is going to be an m-dimensional vector, where m is the number of training examples and n is the number of features. The matrix has n+1 columns because of this extra feature x0 that I added.

Finally, if you take

your matrix X and you take

your vector Y, and if you

just compute this, and set

theta to be equal to

X transpose X inverse times

X transpose Y, this would

give you the value of theta

that minimizes your cost function.

There was a lot that happened on the slides, and I worked through it using one specific example of one dataset. Let me just write this out in a slightly more general form, and then, later on in this video, let me explain this equation a little bit more.

6:16

The way I'm going to construct the matrix X, which is also called the design matrix, is as follows. Each training example gives me a feature vector like this, say, an n+1-dimensional vector. The way I am going to construct my design matrix X is like this,

and what I'm going to

do is take the first

training example, so that's

a vector, take its transpose

so it ends up being this,

you know, long flat thing and

make x1 transpose the first row of my design matrix.

Then I am going to take my

second training example, x2, take

the transpose of that and

put that as the second row

of x and so on,

down until my last training example.

Take the transpose of that,

and that's my last row of

my matrix X. And so, that makes my matrix X an m by (n+1)-dimensional matrix.

As a concrete example, let's

say I have only one

feature, really, only one

feature other than X zero,

which is always equal to 1.

So if my feature vectors x(i) are equal to this: 1, which is x0, and then some real feature, like maybe the size of the house, then my design matrix X would be equal to this.

For the first row, I'm going

to basically take this and take its transpose.

So, I'm going to end up with 1, and then X-1-1.

For the second row, we're going to end

up with 1 and then

X-1-2 and so

on down to 1, and

then X-1-M.

And thus, this will be an m by 2-dimensional matrix.
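Written out in matrix form (a sketch using the video's notation, with x_1^{(i)} denoting the single real feature of example i), the construction just described is:

```latex
X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}
\in \mathbb{R}^{m \times (n+1)},
\qquad
\text{one-feature case: }
X = \begin{bmatrix} 1 & x_1^{(1)} \\ 1 & x_1^{(2)} \\ \vdots & \vdots \\ 1 & x_1^{(m)} \end{bmatrix}
```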

So, that's how to construct

the matrix X. And, the

vector Y--sometimes I might

write an arrow on top to

denote that it is a vector,

but very often I'll just write this as Y, either way.

The vector Y is obtained by

taking all the labels,

all the correct prices of

houses in my training set, and

just stacking them up into

an M-dimensional vector, and

that's Y. Finally, having constructed the matrix X and the vector Y, we then just compute theta as X transpose X inverse, times X transpose Y. I just want to make sure that this equation makes sense to you and that you know how to implement it. So, concretely, what is this X transpose X inverse? Well, it is the inverse of the matrix X transpose X. Concretely, if you were to set A equal to X transpose times X, so X transpose is a matrix, and X transpose times X gives you another matrix, and we call that matrix A, then X transpose X inverse is just taking this matrix A and inverting it.

9:26

And so that's how you compute this thing.

You compute X'X and then you compute its inverse.

We haven't yet talked about Octave.

We'll do so in a later set of videos, but in the Octave programming language, and also in the MATLAB programming language, which is very similar, the command to compute this quantity, X transpose X inverse times X transpose Y, is as follows.

In Octave X prime is

the notation that you use to denote X transpose.

And so, this expression that's

boxed in red, that's computing

X transpose times X.

pinv is a function for

computing the inverse of

a matrix, so this computes

X transpose X inverse,

and then you multiply that by

X transpose, and you multiply

that by Y. So you

end up computing that formula, which I didn't prove,

but it is possible to

show mathematically even though I'm

not going to do so

here, that this formula gives you

the optimal value of theta

in the sense that if you set theta equal

to this, that's the value

of theta that minimizes the

cost function J of theta

for linear regression.
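Putting it together, the whole computation described above is a single line of Octave (a sketch, assuming X and y were built as shown earlier):

```octave
% theta = (X' * X)^(-1) * X' * y; X' is X transpose, pinv computes the (pseudo)inverse
theta = pinv(X' * X) * X' * y;
```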

One last detail: in an earlier video, I talked about feature scaling and the idea of getting features to be on similar scales, in similar ranges of values to each other.

If you are using this normal equation method, then feature scaling isn't actually necessary, and it is actually okay if, say, some feature x1 is between zero and one, some feature x2 ranges from zero to one thousand, and some feature x3 ranges from zero to ten to the minus five. If you are using the normal equation method, this is okay and there is no need to do feature scaling, although of course, if you are using gradient descent, then feature scaling is still important.

Finally, when should you use gradient descent and when should you use the normal equation method? Here are some of their advantages and disadvantages.

Let's say you have m training

examples and n features.

One disadvantage of gradient descent

is that you need to choose the learning rate alpha. And often this means running it a few times with different learning rates alpha and then seeing what works best.

And so that is sort of extra work and extra hassle.

Another disadvantage with gradient descent

is it needs many more iterations.

So, depending on the details,

that could make it slower, although

there's more to the story as we'll see in a second.

As for the normal equation, you don't need to choose any learning rate alpha.

So that, you know, makes it really convenient, makes it simple to implement.

You just run it and it usually just works.

And you don't need to

iterate, so, you don't need

to plot J of Theta or

check the convergence or take all those extra steps.

So far, the balance seems to favor the normal equation.

12:27

Here are some disadvantages of the normal equation, and some advantages of gradient descent.

Gradient descent works pretty well,

even when you have a very large number of features.

So, even if you

have millions of features you

can run gradient descent and it will be reasonably efficient.

It will do something reasonable.

In contrast, for the normal equation, in order to solve for the parameters theta, we need to compute this term, X transpose X inverse. This matrix X transpose X is an n by n matrix, if you have n features.

13:00

Because, if you look at the dimension of X transpose and the dimension of X, and figure out what the dimension of the product is, the matrix X transpose X is an n by n matrix, where n is the number of features. And for most implementations, the cost of inverting a matrix grows roughly as the cube of the dimension of the matrix. So, computing this inverse costs roughly order n cubed time. Sometimes it's slightly faster than n cubed, but it's, you know, close enough for our purposes.

So if n, the number of features, is very large,

13:37

then computing this

quantity can be slow and

the normal equation method can actually be much slower.

So if n is large, then I might usually use gradient descent, because we don't want to pay this order n cubed cost. But if n is relatively small, then the normal equation might give you a better way to solve for the parameters.

What does small and large mean?

Well, if n is on

the order of a hundred, then

inverting a hundred-by-hundred matrix is

no problem by modern computing standards.

If n is a thousand, I would still use the normal equation method.

Inverting a thousand-by-thousand matrix is

actually really fast on a modern computer.

If n is ten thousand, then I might start to wonder.

Inverting a ten-thousand-by-ten-thousand matrix

starts to get kind of slow,

and I might then start to

maybe lean in the

direction of gradient descent, but maybe not quite.

With n equal to ten thousand, you can sort of still invert a ten-thousand-by-ten-thousand matrix.

But if it gets much bigger than that, then, I would probably use gradient descent.

So, if n equals ten

to the sixth with a million

features, then inverting a

million-by-million matrix is going

to be very expensive, and

I would definitely favor gradient descent if you have that many features.

So exactly how large the set of features has to be before you switch to gradient descent, it's hard to give a strict number.

But, for me, it is usually

around ten thousand that I might

start to consider switching over to gradient descent, or maybe some other algorithms that we'll talk about later in this class.

To summarize, so long

as the number of features is

not too large, the normal equation

gives us a great alternative method to solve for the parameter theta.

Concretely, so long as

the number of features is less

than a thousand, you know, I would usually use the normal equation method rather than gradient descent.

To preview some ideas that

we'll talk about later in this

course, as we get

to more complex learning algorithms, for example, when we talk about classification algorithms, like the logistic regression algorithm,

15:32

we'll see that the normal equation method actually does not work for those more sophisticated learning algorithms, and we will have to resort to gradient descent for those algorithms.

So, gradient descent is a very useful algorithm to know.

It's useful both for linear regression problems with a large number of features and for some of the other algorithms that we'll see in this course, because, for them, the normal equation method just doesn't apply and doesn't work.

But for this specific model of

linear regression, the normal equation

can give you an alternative way to solve for the parameter theta.