0:00

[MUSIC]

In this video, we'll discuss advanced optimization techniques that allow us to improve gradient descent methods.

As we remember from the previous video, if our function is difficult, for example, it has elliptic level lines like on this graph, then gradient descent will oscillate, and it will take many iterations for it to converge to the minimum.

To improve this, we can somehow change our gradient descent methods. For example, take mini-batch gradient descent.

In this method, we take some random m examples from our training set, approximate the gradient based on these m examples, and then make a gradient step.

So in this video, we'll suppose that we have some way to approximate the gradient: maybe by a mini-batch, maybe by stochastic gradient descent with just one point, or maybe by calculating the full gradient.

That doesn't matter; we just have some approximation of the gradient, and we denote it by gt.
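To make this concrete, here is a minimal sketch of a mini-batch gradient approximation gt. The linear model with squared-error loss, and all the names here, are my own illustrative choices, not something fixed by the video:

```python
import numpy as np

def minibatch_gradient(X, y, w, m, rng):
    """Approximate the gradient gt from m random training examples.

    Assumes mean squared error for a linear model y ~ X @ w
    (an illustrative choice; any differentiable loss would do)."""
    idx = rng.choice(len(X), size=m, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of (1/m) * ||Xb @ w - yb||^2 with respect to w.
    return 2.0 / m * Xb.T @ (Xb @ w - yb)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
g = minibatch_gradient(X, y, np.zeros(3), m=32, rng=rng)
```

Any of the update rules discussed in this video can consume such an approximation in place of the full gradient.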

And in this video, we'll discuss how we can modify the way we make a gradient step: how we add the gradient, or antigradient, to our previous approximation, w t-1.

So let's start with the momentum method.

In this method, we maintain an additional vector h at every iteration. To calculate h at step t, we take the vector h from step t-1, multiply it by some coefficient alpha, and add the gradient at the current iteration, gt, multiplied by the learning rate, eta t.

So actually, ht is just a weighted sum of gradients from all previous iterations, and from this iteration also.

And then we just make a step in the direction of -ht.

So what is the point of this method? Suppose that we run gradient descent on some function. Maybe for some coordinates of our parameter vector, the gradients always have the same sign, so they lead us to the minimum. And for other coordinates, the sign of the gradient changes from iteration to iteration.

The vector ht will be large for components where the gradients have the same sign on every iteration, so we'll make large steps along these coordinates. And for coordinates where the sign changes, the gradients will just cancel each other out, and ht will be close to zero.

So ht dampens the coordinates that lead to oscillation, and helps us to achieve better convergence.
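The momentum update just described can be sketched as follows; the quadratic test function and the hyperparameter values are illustrative assumptions, not prescribed by the video:

```python
import numpy as np

def momentum_step(w, h, g, alpha=0.9, eta=0.01):
    # h_t = alpha * h_{t-1} + eta_t * g_t;  w_t = w_{t-1} - h_t
    h = alpha * h + eta * g
    return w - h, h

# Illustrative test function f(w) = w0**2 + 10 * w1**2,
# whose level lines are ellipses, as in the lecture's picture.
grad = lambda w: np.array([2.0 * w[0], 20.0 * w[1]])

w, h = np.array([5.0, 1.0]), np.zeros(2)
for _ in range(300):
    w, h = momentum_step(w, h, grad(w))
```

Note that h keeps accumulating along the flat direction while sign flips along the steep direction cancel out, which is exactly the dampening effect described above.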

Â 2:35

In practice, this momentum method indeed leads to faster convergence. So for our last function, with elliptic level lines, we'll see a picture like this one. Here, we don't oscillate as much as in the previous picture, and we converge better.

There is also an extension of the momentum method called Nesterov momentum.

In the simple momentum method, on every iteration we calculate the gradient at the current point, w t-1, use it to update our momentum, ht, and then make a step.

But since it's certain that we'll move in the direction of the momentum, it would be cleverer to first step in the direction of the momentum to get some new approximation of the parameter vector, and then calculate the gradient at this new point, w t-1 minus alpha multiplied by h t-1. In this case, we'll get a better approximation on the next step.

Â 3:37

Mathematically, we take the current point, w t-1, subtract alpha multiplied by h t-1, and calculate the gradient at this new point. Then we add this gradient, multiplied by the learning rate, to the momentum from the previous iteration, alpha multiplied by h t-1.

And in practice, this method indeed leads to better convergence than the momentum method.
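A minimal sketch of this look-ahead update, reusing the same illustrative quadratic test function as for plain momentum (my own choice of example):

```python
import numpy as np

def nesterov_step(w, h, grad_fn, alpha=0.9, eta=0.01):
    # Evaluate the gradient at the look-ahead point w - alpha*h:
    # h_t = alpha * h_{t-1} + eta_t * grad(w_{t-1} - alpha * h_{t-1})
    g = grad_fn(w - alpha * h)
    h = alpha * h + eta * g
    return w - h, h

# Illustrative quadratic with elliptic level lines.
grad = lambda w: np.array([2.0 * w[0], 20.0 * w[1]])

w, h = np.array([5.0, 1.0]), np.zeros(2)
for _ in range(300):
    w, h = nesterov_step(w, h, grad)
```

The only difference from plain momentum is where the gradient is evaluated: at the look-ahead point rather than at the current point.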

So the momentum method and the Nesterov momentum method work better with difficult functions with complex level sets. But they still require us to choose a learning rate, and they're very sensitive to it.

So now we'll discuss some other optimization methods that try to choose the learning rate adaptively, so we don't have to choose it ourselves.

One such method is AdaGrad.

Let's remember how we make a gradient step, just for one coordinate j of the parameter vector w. To make a gradient step, we take the j-th component of the parameter vector from the previous iteration, w t-1 j, and subtract the j-th component of the gradient at the current point, multiplied by the learning rate, to find the next parameter approximation, w t j.

Â 4:48

In AdaGrad, we maintain an additional value for each parameter of our parameter vector, G. To calculate it, we take G t-1, the G from the previous iteration, take its j-th component, and add the square of the gradient at the current iteration.

So essentially, G is a sum of squares of gradients from all previous iterations, and then we modify our gradient step: we divide our learning rate, eta t, by the square root of G t j plus some small number, epsilon. We add epsilon just to make sure that we don't divide the learning rate by zero.
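One per-coordinate AdaGrad step can be sketched like this; the test function and the value of eta are illustrative assumptions:

```python
import numpy as np

def adagrad_step(w, G, g, eta=0.5, eps=1e-8):
    # Accumulate squared gradients per coordinate:
    # G_{t,j} = G_{t-1,j} + g_{t,j}**2
    G = G + g ** 2
    # Divide the learning rate by sqrt(G + eps), coordinate-wise.
    w = w - eta * g / np.sqrt(G + eps)
    return w, G

# Illustrative quadratic with elliptic level lines.
grad = lambda w: np.array([2.0 * w[0], 20.0 * w[1]])

w, G = np.array([5.0, 1.0]), np.zeros(2)
for _ in range(1000):
    w, G = adagrad_step(w, G, grad(w))
```

Because G only ever grows, the effective step size shrinks over time, which is both the appeal and, as discussed next, the main weakness of the method.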

So this AdaGrad method chooses the learning rate adaptively. Here, we can just set eta t, our learning rate, to some constant, for example 0.01, and forget about it.

So we don't need to choose the learning rate wisely, but the method also has some disadvantages. Our auxiliary parameter, G, accumulates squares of gradients, and at some step it can become too large. We'll then divide our learning rate, eta t, by a large number, and gradient descent will effectively stop. So to overcome this, we need some other methods that improve on AdaGrad.

By the way, AdaGrad has another advantage: it chooses its own learning rate for each parameter. Suppose that we are analyzing texts, and each feature from our sample corresponds to one word. For some frequent word that we see in every document, we have gradient updates on every step, so we'll make smaller steps. And for some rare words that we meet only in one of a thousand or ten thousand documents, we'll make large updates, because they are rare and we don't see them very often, so we need to move faster in the direction of these words.

One method that improves on AdaGrad is RMSprop. This method is very similar to AdaGrad, but here we take an exponentially weighted average of squares of gradients on every step. To calculate G j at step t, we take G j from the previous step, t-1, and multiply it by some coefficient alpha. Then we add the square of the gradient at this iteration, g t j squared, multiplied by 1 - alpha.
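The change from AdaGrad is only in how G is maintained, as this sketch shows (test function and hyperparameters are again my own illustrative choices):

```python
import numpy as np

def rmsprop_step(w, G, g, alpha=0.9, eta=0.01, eps=1e-8):
    # Exponentially weighted average of squared gradients:
    # G_{t,j} = alpha * G_{t-1,j} + (1 - alpha) * g_{t,j}**2
    G = alpha * G + (1 - alpha) * g ** 2
    w = w - eta * g / np.sqrt(G + eps)
    return w, G

# Illustrative quadratic with elliptic level lines.
grad = lambda w: np.array([2.0 * w[0], 20.0 * w[1]])

w, G = np.array([5.0, 1.0]), np.zeros(2)
for _ in range(2000):
    w, G = rmsprop_step(w, G, grad(w))
```

Since old squared gradients decay with factor alpha instead of accumulating forever, G stays bounded and the effective learning rate does not vanish the way it can in AdaGrad.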

Â 7:30

Now let's take the RMSprop method and slightly modify it. In RMSprop, we maintained an additional variable, which we'll now denote by v t j. It is just an exponentially weighted sum of squares of gradients from all iterations.

To calculate v for the j-th component at step t, we take v from the previous step, v t-1 j, multiplied by some coefficient beta 2, and add the square of the gradient at the current iteration, g t j squared, multiplied by 1 - beta 2.

We can notice that this approximation v has some bias towards zero, especially at the first steps, because we initialize it with zero. To overcome this, we just divide it by 1 - beta 2 to the power of t. This normalization allows us to get rid of the bias: at the first steps the correction is significant, and for large t it almost equals 1.

And then we use this v to divide our learning rate, eta t.

Â 8:36

But as we go from momentum method,

Â an approximation of gradient from one step can be noisy and lead to oscillations.

Â So let's smooth our gradients.

Â To do it, we maintain another auxiliary variable, m,

Â that is essentially a sum of gradients.

Â Not squares of gradients, like in v, but just gradients.

Â And it's calculated in the same way,

Â it's just an exponentially weighted average with coefficient beta 1.

Â And then we replace g, the gradient approximation in our gradient step,

Â by m, by weighted average of gradients from all previous iterations.

Â So this method, Adam, combines both momentum methods and

Â adaptive learning rate methods.

Â And in practice, it achieves better convergence and faster convergence.
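Putting the two averages and the bias corrections together, one Adam step can be sketched as follows; the test function, the step count, and the hyperparameter values are illustrative assumptions:

```python
import numpy as np

def adam_step(w, m, v, g, t, beta1=0.9, beta2=0.999, eta=0.1, eps=1e-8):
    # m: exponentially weighted average of gradients,
    # v: exponentially weighted average of squared gradients.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias corrections: both m and v start at zero, so divide by
    # 1 - beta**t (t starts at 1) to remove the bias towards zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Illustrative quadratic with elliptic level lines.
grad = lambda w: np.array([2.0 * w[0], 20.0 * w[1]])

w, m, v = np.array([5.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    w, m, v = adam_step(w, m, v, grad(w), t)
```

Setting beta1 to zero recovers an RMSprop-style update, and dropping the division by sqrt(v_hat) recovers a momentum-style update, which is the sense in which Adam combines the two families.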

In this video, we discussed some advanced optimization techniques that can improve gradient descent: the momentum method and the Nesterov momentum method, which allow us to work with complex functions with complex level sets, and AdaGrad and RMSprop, which have adaptive learning rates, so we don't have to choose eta t manually. And there is also the Adam method, which combines both of these approaches.

[MUSIC]
