Real world datasets often have hundreds of thousands or millions of records.

The main problem with gradient descent with backpropagation for such datasets

is that it can become very slow.

This is because each iteration of gradient descent needs to scan the whole dataset.

Not only is this computationally costly,

it might not be the end of the story.

Remember that while we minimize the training error,

we in fact care about the generalization error.

It might happen that after training on a given training dataset,

our test error is still not good.

And then, we would need to retrain the model after some modifications.

Finally, gradient descent may only find

a local minimum of the loss function rather than the global minimum.

It turns out that there is

a modification of gradient descent that simultaneously tackles all these issues.

It is called stochastic gradient descent.

And to this date,

it's the most popular algorithm used by

researchers and practitioners that work with neural nets.

Let's now see how it works.

Let's start with recalling what gradient descent actually does.

We are interested in minimizing the expected loss,

given by an integral of some loss function L of inputs x and labels y,

over some data-generating distribution P of x and y.

We approximate this expectation by the training error, E sub n.

It is given by a sum over all examples in the training set.

Now, a step of gradient descent uses this error function to make its update.

It goes like this.

To recap, the parameter eta here is the learning rate.

As you see, the sum here runs over all m examples in the training set.
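In symbols, the quantities just described can be sketched as follows (a sketch using theta for the model parameters and m for the number of training examples, as in the narration):

```latex
% Expected loss over the data-generating distribution P(x, y)
\mathbb{E}[L] = \int L(x, y; \theta)\, dP(x, y)

% Training-error approximation: a sum over all m training examples
E_n(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(x_i, y_i; \theta)

% One step of gradient descent with learning rate \eta
\theta_{k+1} = \theta_k - \eta \,\nabla_\theta E_n(\theta_k)
```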

Now, the idea of

the stochastic gradient descent method is precisely not to compute this whole sum.

Instead, at each iteration k of gradient descent,

we randomly select a mini-batch of

size N sub MB of samples from our dataset.

If N sub MB is much smaller than N,

then there are many fewer terms to evaluate in the sum.

So the first thing stochastic gradient descent buys us is

a faster evaluation of a single step of gradient descent.

But it actually buys us much more than this.
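As a concrete sketch (the linear model, squared loss, and synthetic data here are illustrative assumptions, not from the lecture), one step of mini-batch stochastic gradient descent might look like this:

```python
import numpy as np

def sgd_step(theta, X, y, eta=0.1, n_mb=32, rng=None):
    """One SGD step: sample a mini-batch, compute its gradient, update theta."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=n_mb, replace=False)  # random mini-batch
    X_mb, y_mb = X[idx], y[idx]
    # Gradient of the mean squared error of a linear model on the mini-batch
    grad = 2.0 / n_mb * X_mb.T @ (X_mb @ theta - y_mb)
    return theta - eta * grad

# Tiny synthetic problem: recover the weights w = [2.0, -1.0]
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w  # noiseless labels for this illustration
theta = np.zeros(2)
for k in range(500):
    theta = sgd_step(theta, X, y, eta=0.1, n_mb=32, rng=rng)
```

On this noiseless synthetic problem, theta converges to the true weights, and each step touches only 32 examples instead of all 1000.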

Unlike gradient descent, which operates only on a fixed training dataset,

and minimizes its training error,

stochastic gradient descent attempts to directly minimize the generalization loss.

And why is it so?

This is because each time we randomly sample the next mini-batch,

we actually sample from the true data-generating distribution.

Therefore, when we minimize the training error on this mini-batch,

we actually estimate the generalization error.

In other words, we can say that stochastic gradient descent

minimizes the generalization error plus random noise.

This is represented in this equation.

The last term, epsilon k,

stands for the random noise due to

the choice of a particular mini-batch at the k-th iteration.
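Written out, the update just described has the schematic form below (a sketch; E here denotes the generalization error and epsilon k the mini-batch noise):

```latex
% SGD update as a discrete Langevin-type equation:
% a deterministic drift along the gradient, plus random noise
\theta_{k+1} = \theta_k - \eta \,\nabla_\theta E(\theta_k) + \varepsilon_k
```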

This equation is known in physics as

the Langevin equation, and it typically describes what is called the stochastic

relaxation of physical quantities

subject to an external force plus random noise.

Now, the presence of random noise in

this problem has some profound consequences for learning.

First, it often helps escape local minima during training.

Because of the random term,

gradient descent may not get trapped in a local minimum but has a chance to escape it,

due to random fluctuations in the mini-batch.

Another interesting observation concerns the relative strengths of

the deterministic and random terms in this equation.

If your mini-batches are not too small,

then at least initially,

the first deterministic term in the Langevin equation dominates.

This is because the initial guess usually has very large gradients of the loss function.

However, as the training progresses,

gradients become smaller and smaller.

And eventually, we get a regime that is dominated by the second random term,

while the deterministic force proportional to the gradient

of the generalization error, becomes very small.

As was recently discovered by Naftali Tishby of the Hebrew University of Jerusalem,

this is a very special regime where the neural network

compresses its data representation for a better generalization.

We will talk more about this in our next course.

But here, I want to continue with our discussion of stochastic gradient descent.

What happens if the size of a mini-batch is taken to be one?

That is, only one randomly chosen point from the dataset

is used at each iteration of gradient descent.

In this case, we have

online stochastic gradient descent, as opposed to the previous case,

which assumes that N sub MB is larger than one,

and which is sometimes referred to as mini-batch stochastic gradient descent.

For online stochastic gradient descent,

the random noise in the Langevin equation

becomes much stronger than for a mini-batch gradient descent.
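A rough numerical illustration of this point, again assuming a simple linear least-squares setup (an assumption for illustration, not from the lecture): the spread of the mini-batch gradient estimate shrinks as the mini-batch grows, so a batch size of one gives the noisiest updates.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=1000)
theta = np.zeros(2)  # gradients are measured at this fixed (non-optimal) point

def batch_grad(idx):
    """Gradient of the mean squared error on the examples indexed by idx."""
    X_mb, y_mb = X[idx], y[idx]
    return 2.0 / len(idx) * X_mb.T @ (X_mb @ theta - y_mb)

def grad_noise(n_mb, trials=2000):
    """Average std. dev. of the mini-batch gradient across random batches."""
    grads = [batch_grad(rng.choice(1000, size=n_mb, replace=False))
             for _ in range(trials)]
    return np.std(grads, axis=0).mean()

online_noise = grad_noise(1)      # online SGD: mini-batch of size one
minibatch_noise = grad_noise(32)  # mini-batch SGD
```

The noise of an averaged gradient estimate falls roughly as one over the square root of N sub MB, so the batch-of-one estimate is several times noisier than the batch-of-32 estimate here.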

Therefore, convergence with this method is

typically much slower than for the mini-batch method.

And for this reason,

it's the latter version that is usually used in practice.

Let's summarize these findings as follows.

Stochastic gradient descent is arguably

the most popular method for training neural networks,

as well as other Machine Learning algorithms that

we will discuss later in this specialization.

Simply by using mini-batches for training instead of the whole training dataset,

stochastic gradient descent achieves several objectives simultaneously.

First, it makes gradient descent work faster,

especially for large datasets.

Second, it often provides a smaller generalization error than gradient descent,

because it specifically minimizes the generalization error.

Third, due to the stochasticity of mini-batch selection,

random noise in stochastic gradient descent sometimes helps

escape local minima of the training loss function.

Now, after this theoretical introduction to stochastic gradient descent,

we are fully ready to try it in TensorFlow.

Let's do it in the next video.