In the previous video, we talked about the form of the hypothesis for linear

regression with multiple features or with multiple variables.

In this video, let's talk about how to fit the parameters of that hypothesis.

In particular let's talk about how to use gradient descent for linear

regression with multiple features.

To quickly summarize our notation, this is the form of our hypothesis in

multivariate linear regression, where we've adopted the convention that x0 = 1.

The parameters of this model are theta0 through theta n, but instead of thinking

of this as n+1 separate parameters, which is valid, I'm instead going to think of

the parameters as theta, where theta here is an (n+1)-dimensional vector.

So I'm just going to think of the parameters of this model

as itself being a vector.
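
Written out, the hypothesis and parameter vector summarized here would look roughly like this. This is a reconstruction in standard notation, assuming the vectorized form from the previous video; the slide itself is not reproduced in this transcript.

```latex
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \theta^{T} x,
\qquad x_0 = 1,
\qquad
\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}
```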

Our cost function is J of theta0 through theta n which is given by this usual

sum of squared errors term. But again, instead of thinking of J as a function

of these n+1 numbers, I'm going to more commonly write J as just a

function of the parameter vector theta so that theta here is a vector.
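
To make the "J as a function of the vector theta" idea concrete, here is a minimal sketch in plain Python. It assumes the 1/(2m) scaling of the squared-error cost used earlier in the course; the function names and the toy data are made up for illustration and are not from the lecture.

```python
def hypothesis(theta, x):
    """h_theta(x) = theta^T x, where x[0] is assumed to be 1."""
    return sum(t * x_j for t, x_j in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum over the m examples of (h_theta(x) - y)^2."""
    m = len(y)
    squared_errors = (
        (hypothesis(theta, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y)
    )
    return sum(squared_errors) / (2 * m)


# Hypothetical toy data: every row starts with x0 = 1, followed by n = 2 features.
X = [[1.0, 1.0, 2.0],
     [1.0, 2.0, 0.0],
     [1.0, 3.0, 1.0]]
y = [5.0, 5.0, 8.0]

# theta is passed as one (n+1)-element list rather than n+1 separate arguments.
print(cost([0.0, 0.0, 0.0], X, y))
print(cost([1.0, 2.0, 1.0], X, y))
```

The second call uses the theta that generated the toy targets, so its cost should be zero, while the all-zeros theta gives a larger cost.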

Here's what gradient descent looks like. We're going to repeatedly update each

parameter theta j according to theta j minus alpha times this derivative term.

And once again we just write this as J of theta, so theta j is updated as

theta j minus the learning rate alpha times the derivative, a partial

derivative of the cost function with respect to the parameter theta j.
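
In symbols, the update being described is roughly the following. The note about updating every j simultaneously is carried over from the earlier gradient descent videos and is an assumption here, since this passage does not restate it.

```latex
\text{repeat until convergence: } \quad
\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta)
\qquad \text{(simultaneously for every } j = 0, 1, \ldots, n \text{)}
```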

Let's see what this looks like when we implement gradient descent and,

in particular, let's go see what that partial derivative term looks like.

Here's what we have for gradient descent for the case when we had n = 1 feature.

We had two separate update rules for the parameters theta0 and theta1, and

hopefully these look familiar to you. And this term here was of course the

partial derivative of the cost function with respect to the parameter theta0,

and similarly we had a different update rule for the parameter theta1.

There's one little difference which is that when we previously had only one

feature, we would call that feature x(i) but now in our new notation

we would of course call this x(i)_1 to denote our one feature.

So that was for when we had only one feature.
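
For reference, the two single-feature update rules being recalled here would read roughly as follows. This is a reconstruction that assumes the 1/(2m) squared-error cost from earlier in the course, with m the number of training examples.

```latex
\begin{aligned}
\theta_0 &:= \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\
\theta_1 &:= \theta_1 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
\end{aligned}
```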

Let's look at the new algorithm for when we have more than one feature,

where the number of features n may be much larger than one.

We get this update rule for gradient descent and, maybe for those of you that

know calculus, if you take the definition of the cost function and take

the partial derivative of the cost function J with respect to the parameter

theta j, you'll find that that partial derivative is exactly that term that

I've drawn the blue box around.

And if you implement this you will get a working implementation of

gradient descent for multivariate linear regression.
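
As an illustration of what such an implementation might look like, here is a minimal sketch in plain Python. It assumes the 1/(2m) squared-error cost, so the partial derivative with respect to theta j is (1/m) times the sum of (h(x(i)) - y(i)) * x(i)_j. The function names, the default learning rate and iteration count, and the toy data are all illustrative assumptions rather than anything given in the lecture.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x, where x[0] is assumed to be 1."""
    return sum(t * x_j for t, x_j in zip(theta, x))


def gradient_descent(X, y, alpha=0.1, num_iters=5000):
    """Batch gradient descent for linear regression with n features.

    X is a list of m rows, each of length n + 1 with x0 = 1 in position 0.
    Returns the learned parameter vector theta as a list of n + 1 floats.
    """
    m = len(y)
    num_params = len(X[0])          # n + 1
    theta = [0.0] * num_params      # start from all zeros

    for _ in range(num_iters):
        # Prediction errors h_theta(x(i)) - y(i), computed with the current theta.
        errors = [predict(theta, x_i) - y_i for x_i, y_i in zip(X, y)]

        # Partial derivative of J with respect to each theta_j.
        gradients = [
            sum(err * x_i[j] for err, x_i in zip(errors, X)) / m
            for j in range(num_params)
        ]

        # Simultaneous update: every theta_j uses errors from the old theta.
        theta = [t - alpha * g for t, g in zip(theta, gradients)]

    return theta


# Hypothetical toy data generated from theta = [1, 2, 1], with x0 = 1 in each row.
X = [[1.0, 1.0, 2.0],
     [1.0, 2.0, 0.0],
     [1.0, 3.0, 1.0]]
y = [5.0, 5.0, 8.0]

theta = gradient_descent(X, y)
print(theta)
```

On this small data set the learned values should come out close to 1, 2, and 1. With real data a fixed iteration count is only a placeholder, and the learning rate usually needs tuning.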

The last thing I want to do on this slide is give you a sense of

why these new and old algorithms are sort of the same thing or why they're

both similar algorithms or why they're both gradient descent algorithms.

Let's consider a case where we have two features

or maybe more than two features, so we have three update rules for

the parameters theta0, theta1, theta2 and maybe other values of theta as well.

If you look at the update rule for theta0, what you find is that this

update rule here is the same as the update rule that we had previously

for the case of n = 1.

And the reason that they are equivalent is, of course,

because in our notational convention we had x(i)_0 = 1, which is

why these two terms that I've drawn the magenta boxes around are equivalent.

Similarly, if you look at the update rule for theta1, you find that

this term here is equivalent to the term we previously had,

or the equation or the update rule we previously had for theta1,

where of course we're just using this new notation x(i)_1 to denote

our first feature, and now that we have more than one feature we can have

similar update rules for the other parameters like theta2 and so on.
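
To see the correspondence in symbols, here is the general rule followed by its first three instances. This is a reconstruction assuming the 1/(2m) squared-error cost, with each rule applied simultaneously. Because x_0^{(i)} = 1, the factor drops out of the theta0 rule, which is why it matches the single-feature case.

```latex
\begin{aligned}
\theta_j &:= \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
  && \text{for } j = 0, 1, \ldots, n \\[4pt]
\theta_0 &:= \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}
  && \text{where } x_0^{(i)} = 1 \\
\theta_1 &:= \theta_1 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)} \\
\theta_2 &:= \theta_2 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_2^{(i)}
\end{aligned}
```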

There's a lot going on on this slide, so I definitely encourage you,

if you need to, to pause the video and look at all the math on this slide

slowly to make sure you understand everything that's going on here.

But if you implement the algorithm written up here then you have

a working implementation of linear regression with multiple features.