In the previous video, we introduced linear regression through a two-dimensional illustrative example that used the L2 loss function. Recall that we used this loss function to calculate the penalty for each line, to measure how far off that line is from the training data. But why did we choose the L2 loss? Is this always the penalty you want to use? In many cases, the answer to that question is no. So in this video, we're going to introduce you to an alternative loss function, formalize what we require in general from loss functions, and explain how those requirements affect finding a minimum penalty value. We'll also ground these ideas in mathematical notation.

Recall that the L2 loss is calculated by squaring the difference between the predicted value and the label for each example, then summing up all those squared differences. We said before that we squared each difference so that each value we're summing together would be positive. Fine, but we could just take the absolute value of each difference to get the actual magnitude of the gap between the predicted value and the label. Why not do that? This is a totally valid loss function known as the L1 loss, or least absolute error. It looks just like our L2 loss function, except instead of squaring each difference, we take its absolute value.

But remember, we plotted the shape of the L2 loss function on its own graph and got that nice bowl shape: the hypothesis parameters are the independent variables on the x-axis, and the loss for those parameters is the dependent variable. We can do that for the L1 loss too. Just like the L2 loss function, the shape of the L1 loss is convex, and we can see in the picture that it has an obvious minimum penalty value. However, notice that there's a sharp corner at that minimum value. You might remember from calculus that this means the function isn't smooth there, so it's not differentiable at that point, and we can't use that tangent trick.
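As a concrete sketch of the two penalties, here's how you might compute them for a candidate line h(x) = w*x + b. The data points and parameter names here are invented for illustration, not taken from the video's example:

```python
# Compare L1 and L2 loss for a candidate line h(x) = w*x + b.
# Data and parameters are illustrative.

def predict(w, b, xs):
    """Predicted value of the line h(x) = w*x + b at each input."""
    return [w * x + b for x in xs]

def l2_loss(w, b, xs, ys):
    """L2 loss: sum of squared differences between predictions and labels."""
    return sum((p - y) ** 2 for p, y in zip(predict(w, b, xs), ys))

def l1_loss(w, b, xs, ys):
    """L1 loss: sum of absolute differences between predictions and labels."""
    return sum(abs(p - y) for p, y in zip(predict(w, b, xs), ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # these labels lie exactly on y = 2x + 1

print(l2_loss(2.0, 1.0, xs, ys))   # 0.0 -- a perfect fit has zero penalty
print(l1_loss(2.0, 1.0, xs, ys))   # 0.0
print(l2_loss(1.0, 0.0, xs, ys))   # 30.0 -- squaring punishes big misses harder
print(l1_loss(1.0, 0.0, xs, ys))   # 10.0
```

Note how the L2 loss grows much faster than the L1 loss for the badly fitting line: squaring amplifies large errors, while the absolute value weights all errors linearly.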
We can't take the derivative of the L1 loss to find which parameters give us the minimum penalty, like we did with the L2 loss function. If we can't set the derivative to zero to find the minimum value of the L1 loss function, what can we do? Well, we'll have to use some kind of iterative process: we start at any old point, then make small adjustments to our parameters that bring us closer to an approximate solution. This kind of approach is known as an iterative method, and there are a whole lot of those. In this course, you'll see and use one particular iterative method that's really important to machine learning: gradient descent. To visualize how gradient descent works, you can think of it as a ball rolling down a hill. The ball is dropped at some initial point and then rolls down the hill, or the graph of a function in our case, until it eventually, hopefully, stops at some minimum value. We'll discuss gradient descent in much more detail in the next video. For now, I just want to point out that although gradient descent is an iterative method, it still requires a smooth function. In other words, it works for the L2 loss, but not the L1.

But back to loss functions in general. Although we've only introduced two, many different loss functions exist. Across all of them, though, there's one crucial property each must possess if you want to guarantee you're finding the global minimum: convexity. Let's take a look at what convexity means. A function is called convex if, for any two points on the graph of the function, the line segment connecting those two points lies on or above the graph. If you think of a function as describing some surface, you have to be able to attach the ends of a string anywhere along the surface and pull that string taut without it intersecting the surface. If there's some way to attach the string where you can't pull it flat without cutting through the surface, then it's not a convex surface. Why is convexity important?
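To make the ball-rolling picture concrete, here's a minimal sketch of gradient descent on the smooth convex bowl f(w) = (w - 3)^2. The starting point, step size, and iteration count are illustrative choices, not values from the video:

```python
# A toy run of gradient descent on the smooth convex bowl f(w) = (w - 3)**2.

def grad(w):
    # Derivative of (w - 3)**2 with respect to w.
    return 2 * (w - 3)

w = 10.0                  # drop the "ball" at an arbitrary starting point
step = 0.1                # size of each small adjustment
for _ in range(200):      # repeatedly step downhill, against the gradient
    w = w - step * grad(w)

print(round(w, 4))        # 3.0 -- the ball has settled at the bottom of the bowl
```

Because the bowl is convex and smooth, it doesn't matter where we drop the ball; it always rolls to the same minimum at w = 3.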
Convexity guarantees that our function has no more than one local minimum value. Notice that the graph on the left has one local minimum, but the graph on the right has two. The global minimum is, not surprisingly, the smallest value over the entire domain of the function. In the math review reading at the start of the course, we talked about this difference, and it's an important distinction. Feel free to pause the video and look back at that part of the reading if you're not sure about it. So let's take our visualization for gradient descent and think of that ball rolling down a hill. In this particular graph, we can see that when the ball starts from this point, it rolls into the nearest trough and says, I found a minimum. But if we start from some other location, it might end up in a different trough. Starting over here, it rolls into that trough and says, I found a minimum. So we might have found some local minimum value that has a small penalty, and it does an okay job of fitting our data, but it might not be the global minimum value, which means it's not the optimal model for the data. When our graph is not convex, we don't know for sure that we've found the absolute smallest value unless we explore the entire space. Whereas no matter where we drop the ball on a convex graph, we'll reach the same final point: the only local minimum. That's the key: convex functions have at most one local minimum value. So whether you use derivatives, if your function is differentiable, or some other iterative numerical solver, if it's not, when your function is convex, you're guaranteed to find the optimal solution in that hypothesis space. One local minimum means we're able to find the smallest penalty value for our loss function.
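You can see this trap in miniature with the non-convex curve f(w) = (w^2 - 1)^2, which has two troughs, at w = -1 and w = +1. The curve, step size, and starting points below are all invented for illustration; the point is just that where the ball ends up depends on where we drop it:

```python
# Gradient descent on the non-convex curve f(w) = (w**2 - 1)**2,
# which has two local minima: one at w = -1 and one at w = +1.

def grad(w):
    # Derivative of (w**2 - 1)**2 with respect to w.
    return 4 * w * (w ** 2 - 1)

def descend(w, step=0.01, iters=1000):
    """Roll the ball downhill from starting point w."""
    for _ in range(iters):
        w = w - step * grad(w)
    return w

print(round(descend(-2.0), 4))   # -1.0 -- rolls into the left trough
print(round(descend(+2.0), 4))   #  1.0 -- rolls into the right trough
```

Both runs report "I found a minimum," but they find different ones. On a convex loss, every starting point would have converged to the same answer.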
And if our loss function is differentiable as well as convex, we may be able to find an optimal function by setting the derivative of our loss function to zero, as we did in the case of the L2 loss. Let's now connect these ideas to mathematical notation, so we can refer to them more easily in the future. We start with some set of data we hope to fit with a model, and we call that data set X. We try to fit some model, or hypothesis, to this data set; in general, we call that function little h. Of course, we're not concerned with just one hypothesis function. We want to find the optimal hypothesis function in our hypothesis space. So we have many different little h's belonging to our hypothesis space, big H. In this case, our hypothesis space is the set of lines in two-dimensional space, three of which are shown on the slide. We need some way to compare which of these hypotheses is better, so we introduce a loss function that measures how far off the hypotheses are from the training data. In mathematical notation, we write this as L(h, X), the loss of the hypothesis h over the training data X. On the slide, you can see the loss we've calculated for each of the three hypotheses shown. Of course, we want to minimize the value of our loss function. We write this as the min, over all little h in big H, of the loss. This means we're interested in the smallest value of the loss function across every single hypothesis function in our hypothesis space. But the main goal is not to find the smallest value itself, but rather the hypothesis function that achieves the smallest loss value. In mathematical notation, we write this as the arg min of the loss function over all hypotheses in H, because ultimately we want the argument, in this case the hypothesis function, that achieves that minimum value. In other words, we want to find the model that fits our data optimally.
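The difference between min and arg min is easy to see with a tiny, finite hypothesis space. In this sketch, the three candidate lines and the data points are invented for illustration; each hypothesis is a (slope, intercept) pair scored by L2 loss:

```python
# min vs. arg min over a tiny hypothesis space of candidate lines.
# Each hypothesis h is a (slope, intercept) pair; data and candidates are illustrative.

def l2_loss(h, data):
    """L(h, X): sum of squared errors of the line h over the data set."""
    w, b = h
    return sum((w * x + b - y) ** 2 for x, y in data)

X = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]    # training data as (x, label) pairs
H = [(1.0, 0.0), (2.0, 1.0), (3.0, -1.0)]   # three candidate hypotheses

# min: the smallest loss value achieved by any hypothesis in H.
best_value = min(l2_loss(h, X) for h in H)

# arg min: the hypothesis itself that achieves that smallest loss.
best_h = min(H, key=lambda h: l2_loss(h, X))

print(best_value)   # 0.0 -- the minimum loss value
print(best_h)       # (2.0, 1.0) -- the line y = 2x + 1 achieves it
```

The min tells us how good the best fit is; the arg min hands us the model we actually want to use.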
In this video, we learned there are different kinds of loss functions we can use to evaluate how well our model fits a set of data. The loss function acts as a penalty: it measures how far off our model is from the training data. We looked specifically at two different loss functions, L1 and L2, but it's important to know that these are not the only two you can use; many others exist that you might find useful. We saw that, depending on the properties of the loss function we choose, we may need different techniques to find where it's smallest, and that finding this minimum value can be tricky. Luckily, we have techniques from calculus and iterative methods to help us out. More on this in the next video.