0:00

So, in the last video,

we learned two very important things.

First, we learned that linear regression is nothing but a very special neural network.

Namely, it is a network made of one linear neuron.

There are more advanced neural networks called feed-forward networks

that organize cascades of information processing between layers of the network,

all the way from the inputs to the outputs.

We also said that weights W for all neurons in

such a network can be fine-tuned to

some training data using a method called Gradient Descent.

Now, because linear regression is a very special neural network,

let's see first how it works for a single neuron that implements linear regression.

To see this, let's go back again to our example from the last video.

To recap, we have a set of inputs x naught to xN,

which are transformed by a linear neuron, that is,

we use a linear activation function into their output h

sub w of x depends on both inputs and weights W. Now,

how do we fine tune parameters W to feed some data here?

Well, for linear regression we already know the answer.

It is given by the normal equation that we talked about the last week.

Let's briefly recap the math that we covered in the last week.

We said that the optimal weights should be such that the gradient of

the mean square error MSE sub train should vanish for these optimal weights.

And after some algebra,

this produced an analytical expression for the vector of

optimal weights known as the normal equation.

This is a great solution for many cases.

But in many other cases,

it's not so simple.

First, this exact solution is only available

for linear regression that has a quadratic loss function.

Change the loss function and then exact solution is not available anymore.

Second, normal equation involves semantic inversion here,

which has a high complexity cost of O of N to the third power,

which can be reduced to something like O of N to power 2.4,

by using dedicated algorithms for matrix inversion.

But still this is a high computation of

complexity that can lead to a very slow calculation when

your data matrix X has the number of columns in tens or hundreds of thousands.

It turns out that the method of gradient descent is a method of choice,

both for this case,

as well as for other loss functions that might be needed if we

use non-linear regression instead of linear regression.

Gradient descent simply discards

the normal equation and goes one step back.

As said, the optimal point the gradient of the loss function should be zero,

the gradient descent finds this optimal point agilely in steps.

Let us see how it works on the simple example of one variable.

Let's call the argument of this function,

theta, and the function that we want to minimize, g of theta.

If we deal with a linear regression,

g of theta would be the mean squared of loss,

and the theta would be a regression parameter.

But I want to use a bit more general notation here.

So, how does it work?

We start with some initial guess theta naught and

compute the derivative of the function g prime at this point.

The derivative is given by this red line and it's negative in this example.

Then, we update the value of theta by

shifting our initial value by the amount equal to the

negative of parameter eta times the gradient of the function at point theta naught.

The parameter eta is called the Learning rate,

and it plays a very important role in the gradient descent method.

Let us see what this equation means.

Because eta is positive,

while the gradient at theta naught is negative,

and because of this negative side here,

the net effect of this whole term,

including the minus sign, would be positive.

This is exactly what we want because the minimum

is to the right of our initial guess theta naught.

But if it was on the right of it, like here,

then this term would be negative,

and the initial guess would move to the left at the first iteration.

So, the negative sign here is correct because it means that to minimize the function,

we have to go in an opposite direction to its gradient,

which points to a direction of maximum of this function.

The learning rate eta is just the amount by which we move in this direction.

If we take eta larger,

you move more along this line

and it gives you the next iteration of the gradient descent method.

Now, all it does afterwards is a repetition of this very first step.

Again, a new initial guess theta 1 is used and the update of

the current optimal value of theta is done using

the same equation and the same learning rate eta.

The new candidate solution theta 2 is now closer to the true minimum theta start.

The process then continues until the function

stops to change at the given tolerance level.

This means the algorithm has converged.

Okay, this was an explanation of how the algorithm works in one dimension.

When there are many dimensions,

the algorithm makes a gradient descent of

data along each dimension following the same rule.

Let me show you a simple animation of how gradient descent works

for a two-dimensional loss function.

As you can see,

the initial point shown here,

as there red dot gradually descends to the true minimum

of the function in a few gracious steps.

How well all this works in practice depends on both the function

we want to minimize and the learning rate eta.

For example, our learning rate might be too low,

so, it will take way too many steps to find the minimum.

As another example, the learning rate might be too high.

In this scenario, the solution may even diverge as illustrated here.

For a large learning rate,

your iteration would overshoot each time,

and each time the side overshoot would in fact grow larger.

This will cause the solution to numerically diverge.

Other subtle points with using gradient descent

arise when the function we want to minimize is non-convex.

That is, it has more than one minimum.

In a scenario shown here,

it might be difficult for gradient descent to find the global minimum of theta star

using the initial guess theta naught because most likely,

it will land in the local minimum shown here.

Later, in this visualization,

we will talk about how making the learning rate eta,

a function of the iteration number.

The gradient descent method can be used for such non-convex optimization problems.

We will also see how the same technique helps to optimize the optimization, in the sense,

where the last function has nearly flat plateaus like this one here,

where the function value is nearly constant.

But for now, we will consider a basic version

of the gradient descent method with the constant learning rate eta.

So, we understood how gradient descent works

for a simple case of one linear neuron that implements linear regression.

Now, how about more general neural networks with more nodes?

A general neural network has many free parameters,

that is one set of parameters for each node.

Can gradient descent work for this case as well?

As we will see, it can.

But to make it efficient,

it needs to be used in a combination with a method called Backpropagation.

Let us see how it works in the next video.