0:00

In the early days of deep learning, people used to worry a lot about the optimization algorithm getting stuck in bad local optima. But as the theory of deep learning has advanced, our understanding of local optima has also changed. Let me show you how we now think about local optima and other problems in optimization in deep learning.

This was the picture people used to have in mind when they worried about local optima. Maybe you are trying to optimize some set of parameters, call them W1 and W2, and the height of the surface is the cost function. In this picture, it looks like there are a lot of local optima in all those places, and it would be easy for gradient descent, or one of the other algorithms, to get stuck in a local optimum rather than find its way to the global optimum.

It turns out that if you are plotting a figure like this in two dimensions, it's easy to create plots with a lot of different local optima, and these very low-dimensional plots used to guide our intuition. But that intuition isn't actually correct.

It turns out that if you create a neural network, most points of zero gradient are not local optima like the points in this picture. Instead, most points of zero gradient in the cost function are saddle points. So that's a point where the gradient is zero; again, the axes are maybe W1 and W2, and the height is the value of the cost function J.

But informally, for a function in a very high-dimensional space, if the gradient is zero, then in each direction the function can curve either like a convex function or like a concave function. And if you are in, say, a 20,000-dimensional space, then for a point to be a local optimum, all 20,000 directions need to curve upwards like this. The chance of that happening is very small, maybe two to the minus 20,000.

Instead, you're much more likely to get some directions where the curve bends up, as well as some directions where it bends down, rather than having all of them bend upwards. So that's why in very high-dimensional spaces you're actually much more likely to run into a saddle point, like the one shown on the right, than a local optimum.
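As a quick sketch of this idea (my illustration, not code from the lecture): at a zero-gradient point you can check the curvature along each axis, and if some directions bend up while others bend down, the point is a saddle. The function J(w1, w2) = w1^2 - w2^2 has zero gradient at the origin:

```python
# Illustration (not from the lecture): classify a zero-gradient point by
# checking the curvature along each axis with a finite difference.
def J(w1, w2):
    return w1**2 - w2**2  # zero gradient at the origin

def curvature_along_axis(f, axis, h=1e-3):
    """Second-order finite difference of f at the origin along one axis."""
    if axis == 0:
        return (f(h, 0) - 2 * f(0, 0) + f(-h, 0)) / h**2
    return (f(0, h) - 2 * f(0, 0) + f(0, -h)) / h**2

c1 = curvature_along_axis(J, 0)  # positive: bends up along w1
c2 = curvature_along_axis(J, 1)  # negative: bends down along w2
print(c1 > 0 and c2 < 0)  # True: mixed curvature means a saddle, not an optimum
```

With 20,000 parameters there are 20,000 such axis directions, and a local optimum would need every single one to bend upwards, which is why saddle points dominate.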

As for why this surface is called a saddle point: if you can picture it, this is the sort of saddle you put on a horse, right? Maybe this is the horse, this is the head of the horse, and this is the eye of the horse. Well, not a great drawing of a horse, but you get the idea. You, the rider, would sit here in the saddle. That's why this point, where the derivative is zero, is called a saddle point: it's really the point on the saddle where you would sit, and it happens to have derivative zero.

And so one of the lessons we've learned in the history of deep learning is that a lot of our intuitions about low-dimensional spaces, like the one plotted on the left, really don't transfer to the very high-dimensional spaces that our learning algorithms operate in. Because if you have 20,000 parameters, then J is a function over a 20,000-dimensional vector, and you're much more likely to see saddle points than local optima.

If local optima aren't a problem, then what is? It turns out that plateaus can really slow down learning. A plateau is a region where the derivative is close to zero for a long time. So if you're here, gradient descent will move down the surface, but because the gradient is zero or near zero and the surface is quite flat, it can actually take a very long time to slowly find its way to, say, this point on the plateau.

And then, because of a random perturbation to the left or right, maybe then, let me switch pen colors for clarity, your algorithm can finally find its way off the plateau. So it can take this very long time on the flat region before it finds its way down here and gets off the plateau.

So the takeaways from this video are: first, you're actually pretty unlikely to get stuck in a bad local optimum, so long as you're training a reasonably large neural network with a lot of parameters and the cost function J is defined over a relatively high-dimensional space. But second, plateaus are a problem, and they can actually make learning pretty slow.

And this is where algorithms like momentum, RMSprop, or Adam can really help your learning algorithm. These are scenarios where more sophisticated optimization algorithms, such as Adam, can actually speed up the rate at which you move down the plateau and then get off the plateau.
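To give a rough sketch of why this helps (my illustration, not code from the course): on a near-flat cost, plain gradient descent takes tiny steps proportional to the tiny gradient, while Adam rescales each step by a running RMS of past gradients, so its step size stays close to the learning rate even on a plateau:

```python
import math

# Illustration (not from the course): plain gradient descent vs. Adam on the
# near-flat cost J(w) = 1e-6 * w^2, whose gradient is tiny everywhere.
def grad(w):
    return 2e-6 * w

def gradient_descent(w, lr=0.1, steps=1000):
    for _ in range(steps):
        w -= lr * grad(w)  # step is proportional to the tiny gradient
    return w

def adam(w, lr=0.1, steps=1000, beta1=0.9, beta2=0.999, eps=1e-8):
    v = s = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        v = beta1 * v + (1 - beta1) * g      # momentum (first moment)
        s = beta2 * s + (1 - beta2) * g * g  # RMSprop (second moment)
        v_hat = v / (1 - beta1**t)           # bias correction
        s_hat = s / (1 - beta2**t)
        w -= lr * v_hat / (math.sqrt(s_hat) + eps)  # step magnitude stays ~lr
    return w

print(abs(gradient_descent(10.0)))  # barely moves off the plateau
print(abs(adam(10.0)))              # travels much farther toward the minimum at 0
```

Because Adam divides by the RMS of the gradients, the ratio is roughly the same size whether the gradient is large or tiny, which is exactly what lets it keep moving across a plateau.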

So, because your neural network is solving optimization problems over such high-dimensional spaces, to be honest, I don't think anyone has great intuition about what these spaces really look like, and our understanding of them is still evolving. But I hope this gives you some better intuition about the challenges that the optimization algorithms may face.

So, congratulations on coming to the end of this week's content. Please take a look at this week's quiz as well as the programming exercise. I hope you enjoy practicing some of these ideas in this week's programming exercise, and I look forward to seeing you at the start of next week's videos.
