In the previous video, we saw that different loss functions can present different challenges for learning algorithms that minimize the loss. We introduced the idea of iterative methods, which provide a mathematical tool for approximating solutions. In this video, we're going to take a closer look at gradient descent, an iterative method that is very commonly used in machine learning. We're going to assume that you're familiar with the topics discussed in the calculus section of the math review reading at the start of this course. If those are not fresh in your mind, we encourage you to go back and revisit that section of the reading.

Earlier, we said iterative methods start with some initial point and then take small steps to bring us closer to an approximate solution. Since we want to minimize loss, we use an iterative method that takes small steps in whatever direction the function is decreasing. We also talked about two loss functions, whose penalties are based on how far off our model is from the training data. We care about minimizing loss functions because finding the minimum of our loss function means finding the model that has the smallest penalty. In other words, as long as our loss function is convex, it means we'll find the optimal model according to that loss function. So let's take a look at how we can use an iterative method called gradient descent in order to find the minimum of loss functions.

We said you can think about gradient descent as a ball rolling down a hill, but of course there's more to the story than that, so let's take a look. The hill in this story is the graph of the loss function. In gradient descent, we rely on two properties of this graph: it's smooth and continuous, so that we can take the derivative, and it's convex, so that there's no more than one minimum value. We'll use an example in two-dimensional space to make it easy to visualize, but everything we talk about can be extended to higher dimensions, just like we described earlier. In this example, we'll use the L2 loss function for the linear regression model.

Look closely at the vertical and horizontal axes, as this is often a point of confusion. On the vertical axis, we have the value of our loss function, the value we're trying to minimize. Remember, that's the penalty for the mistakes our function is making, so we want to find where it's as small as possible. Now, what about the horizontal axis? Since we're using linear regression as the example, we're comparing hypothesis functions from the hypothesis space of lines. In the two-dimensional case, lines are of the form h of x equals wx plus b. Since we're interested in finding the line that minimizes our loss, loss is the dependent variable plotted on the vertical axis of our loss graph. The independent variable, the one we're allowed to change, is the parameter w, so we plot w along the horizontal axis. We want to find the w that has the smallest possible loss, that is, the lowest point on this graph. Although the loss function depends on both the parameter w and the parameter b of the line h of x, we're going to look at optimizing only the slope parameter w. Optimizing b would follow the same procedure, so we won't repeat ourselves. The thing is, it's easy to assume that our x value should be measured on the horizontal axis. I mean, we even call it the x-axis in general. But remember that in this case, x represents the values in our training data, and that's not something our learning algorithm can change.
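To make that picture concrete, here's a minimal sketch in Python of what the loss graph represents. The tiny training set, the variable names, and the use of NumPy are our own illustrative assumptions, not something from the lecture; the point is simply that the training data stays fixed while the loss is evaluated as a function of the parameter w.

```python
# A minimal sketch of the L2 loss for a line h(x) = w*x + b,
# using a small made-up training set purely for illustration.
import numpy as np

x_train = np.array([1.0, 2.0, 3.0, 4.0])   # inputs: fixed, the learning algorithm cannot change these
y_train = np.array([2.1, 3.9, 6.2, 8.1])   # target outputs

def l2_loss(w, b=0.0):
    """Mean squared error of h(x) = w*x + b over the training data."""
    predictions = w * x_train + b
    return np.mean((predictions - y_train) ** 2)

# The loss is a function of the parameter w, not of x:
# evaluate it at a few candidate slopes.
for w in [0.0, 1.0, 2.0, 3.0]:
    print(f"w = {w:.1f}  loss = {l2_loss(w):.3f}")
```

Sweeping w like this traces out exactly the curve we've been describing: w along the horizontal axis, the loss along the vertical axis.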
The learning algorithm doesn't change the training data; it changes the hypothesis itself to find the best model for minimizing that loss. Okay, now we've sorted out our picture. The hill we were talking about is the surface created by this loss function over all the allowable variations of our model. Now we can get into gradient descent and what we mean by rolling a ball down a hill.

We randomly pick a starting position and imagine placing a ball there. For minimizing loss, this amounts to assigning random parameters to our model. This is simply an initial guess; we have to start somewhere before we can improve it. Say our random initialization is here. We hinted that the ball will roll down the hill somehow, but we can calculate exactly which direction the ball will roll using the derivative, the slope of the tangent to the graph at this point. We'll use this derivative to take small steps towards the minimum of our graph. We have a mathematical notation for how to take a step, or in other words, how we can tweak the parameter w: we update it by taking the current estimate and then subtracting alpha times the derivative, with respect to w, of the loss function evaluated at the current estimate. Let's go through that bit by bit to see what's going on. Of course, we start at time zero, so w0 is our first guess for our model parameter. Now we want the ball to roll down the hill, but we need to know which direction is down. Hurray for calculus: that's the gradient of the loss function, the derivative with respect to w. So that tells us which direction to step, but how far do we adjust w? We need to know how large a step to take. There's a parameter for this known as the learning rate, or step size. In our equation, we've represented it with the Greek letter alpha. It's your choice what learning rate to use, but it should be some number greater than zero, or you wouldn't learn anything, and smaller than one, or you'll overcorrect. Multiplying the learning rate by the gradient determines how much we should change the current position on the horizontal axis. Remember, this is the value of our model's parameter w. Notice that the steeper the gradient, the bigger the change, and vice versa. When the gradient is less steep, the learning rate multiplied by the gradient will be smaller, so the amount we update our position by will be smaller. This is true even if we have a rather large learning rate.

What is a good learning rate to use, then? It turns out this is a hard question to answer. A good learning rate is one that's not too big, because you'll end up stepping past local minima and never settle on a good parameter value. But it can't be too small either, or it will take forever to converge, that is, to settle down near the best value. So we want an alpha that's as big as possible, so that learning is quick, but not so big that it diverges. Some techniques use a decreasing alpha so that you can start big but settle right down, and depending on the rate of decrease, you can actually guarantee convergence. But it's usually simplest to sacrifice optimal convergence for convenience and use some small alpha. The best alpha depends entirely on your particular problem. When in doubt, try 0.1; if that seems unstable, try 0.01, or start with a default set by your particular library.

Enough about alpha. Let's look at the key operation in this equation: subtraction. We're taking our current position and subtracting the learning rate multiplied by the gradient. This gives us our new value, which we want to be closer to the optimal value.
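To make the update rule concrete, here's a minimal sketch that continues the snippet above, reusing x_train, y_train, and l2_loss. The analytic derivative, the learning rate of 0.05, and the starting guess of zero are all illustrative assumptions.

```python
# A sketch of the update w <- w - alpha * dL/dw, reusing x_train, y_train,
# and l2_loss from the earlier snippet. The derivative of the mean squared
# error is worked out analytically for h(x) = w*x + b with b held at zero.
def l2_loss_gradient(w, b=0.0):
    """Derivative of the mean squared error with respect to w."""
    predictions = w * x_train + b
    return np.mean(2.0 * (predictions - y_train) * x_train)

alpha = 0.05   # learning rate: a small illustrative choice
w = 0.0        # initial guess for the slope

for step in range(5):
    w = w - alpha * l2_loss_gradient(w)  # subtract: step against the gradient
    print(f"step {step}: w = {w:.4f}, loss = {l2_loss(w):.4f}")
```

Notice that the update itself is just the subtraction from the lecture's equation; everything else is bookkeeping.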
So why subtraction? Notice that at time t, we're here on our graph. In our 2D example, we can see that in order to get closer to the minimum, we want to step to the right. Stepping to the right is equivalent to increasing w along the horizontal axis. Notice that at this position the gradient is negative. The learning rate is always a positive number, so subtracting a negative number from our current position moves us to the right. Similarly, if we'd been located here at time t, we'd want to move left to get closer to the minimum. Our gradient at this point is positive, and so subtracting it from our current position will make w smaller, moving us to the left. To summarize, because we've made sure our loss graph is smooth and convex, the derivative, or gradient, always exists and always points away from the minimum point. In order to get closer to the minimum, we move against the gradient. This is why we have a minus sign in our update equation, and it's also why we call it gradient descent.

The process is iterative, so after subtracting the gradient multiplied by our learning rate, we have a new position for our ball. We calculate the gradient at this new point, multiply it by our learning rate, and subtract that from the current position. We repeat this process over and over until we're not changing our position very much, because the gradient has gotten so small. This lets us converge on, or close to, the minimum. What does convergence mean? Remember that local minima happen where the derivative is equal to zero. As we step closer and closer to the minimum, the derivative will get closer and closer to zero, as long as we're not overstepping. When we arrive at the minimum, the change will be zero, and so our parameter value won't change anymore. We converge to the position where the minimum occurs.

We've walked through a two-dimensional example, but the good news is that the process generalizes to any number of dimensions. The difference is that instead of using w as a single parameter on the horizontal axis, we have a whole bunch of parameters in the vector w, which means the loss graph becomes a surface over the dimensions of w, and the gradient becomes the vector of partial derivatives with respect to each parameter. There's a short sketch of this at the end of this transcript. We've mentioned that your loss function must be smooth in order to apply gradient descent. But there are many cases where you may have a convex function that you want to optimize, but the function is not smooth at some point, and so the gradient does not exist there. For more information on why the gradient does not exist at such a point, see the calculus section of the math review reading. We saw this in the case of the L1 loss: right at the minimum point, it's not smooth. The good news is that there are other iterative methods you can use to get close to the minimum value of your loss function. Examples include linear programming solvers, such as the simplex method, and quadratic programming solvers for problems of degree two. We're not going to go into the details of these techniques in this course, but you should know they're only a library away if you need them.

In this video, you saw how gradient descent lets you minimize loss functions for a two-dimensional example. You learned that gradient descent generalizes to any number of dimensions but requires the loss function to be smooth and continuous. We also gave you some keywords for other iterative methods when you don't have a differentiable function. We hope you enjoyed learning about the math behind gradient descent, and that next time you see a ball rolling down a hill, you'll think of Machine Learning.
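As mentioned above, here is a short sketch of the full iterative loop, generalized from a single parameter w to a parameter vector holding both the slope and the intercept. It reuses x_train and y_train from the earlier snippets, and the learning rate, tolerance, and step limit are illustrative assumptions rather than anything prescribed in the lecture.

```python
# Full gradient descent loop for the L2 loss, with w generalized to a vector
# [slope, intercept]. Reuses x_train, y_train, and numpy from the earlier snippets.
X = np.column_stack([x_train, np.ones_like(x_train)])  # a column of ones stands in for the intercept b
w_vec = np.zeros(2)        # initial guess for [w, b]
alpha = 0.05               # learning rate (illustrative choice)
tolerance = 1e-6           # stop once the step becomes negligibly small

for step in range(10_000):
    gradient = 2.0 * X.T @ (X @ w_vec - y_train) / len(y_train)  # vector of partial derivatives
    update = alpha * gradient
    w_vec = w_vec - update                                       # move against the gradient
    if np.max(np.abs(update)) < tolerance:                       # converged: parameters barely changing
        break

print(f"converged after {step + 1} steps: w = {w_vec[0]:.3f}, b = {w_vec[1]:.3f}")
```

The stopping condition mirrors what convergence means in the lecture: as the gradient approaches zero near the minimum, the updates shrink until the parameters effectively stop changing.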