In the previous video, we discussed how to fit a line to a given set of data points and how to calculate the penalty, or loss, for that particular line. But how do we know this is actually the best-fit line? In other words, how do we know we have the best line in the entire hypothesis space of lines, and not just the best one out of the few we happened to calculate the loss for? In this video, we're going to describe how and when we can know for sure our hypothesis is the best candidate in the entire space of hypotheses. Remember that in linear regression, our hypothesis space is the space of all possible lines that return a number. The hypothesis space of lines is one of those infinite spaces: picking and choosing lines we think look pretty good is not efficient, and there's no guarantee that we'll find the best one that way. We need a strategic way to find the line that minimizes our penalty. So remember that our hypothesis is called h(x), and the L2 loss function uses the sum over all training examples of the squared difference between predicted and actual labels. If a line in our hypothesis space is described by the function h(x) = wx + b, the L2 loss is calculated by taking the average of the square of wx + b - y for each (x, y) example. w and b are the things the learning algorithm can adjust in its quest to find the best line, while x and y are determined by the training set. So finding the line that minimizes our loss function amounts to finding those values for w and b that give us the smallest possible loss: the min over w and b of the sum of the squared error, (wx + b - y)², over every (x, y) example. But wait, how does this help us? Calculus to the rescue! As unwieldy as that loss function sounded, it happens to be a perfectly lovely function for computing gradients. If you don't remember calculus, don't worry about these details.
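To make the loss calculation concrete, here is a minimal sketch of the L2 loss for one candidate line h(x) = wx + b. The temperature and ice-cream numbers are made up for illustration; they are not from the video.

```python
def l2_loss(w, b, xs, ys):
    """Average squared difference between predicted and actual labels."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

temps = [20, 25, 30, 35]   # feature x: temperature outside (hypothetical data)
sales = [40, 50, 60, 70]   # label y: ice creams sold (exactly y = 2x here)

print(l2_loss(2.0, 0.0, temps, sales))  # a perfect fit gives loss 0.0
print(l2_loss(1.5, 0.0, temps, sales))  # a worse line gives a larger loss
```

The learning algorithm's job is to search over w and b for the smallest value this function can return.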
You don't actually need to understand them fully to grasp the ideas in this course, but if you do, you might have noticed that we just so happened to define our loss function to be both smooth and convex, which means it's going to have a unique minimum point. We don't know what that minimum is, but using calculus we can find it. So let's go back to the four hypotheses that we already calculated loss for in the previous video. What we have is a particular total loss associated with each hypothesis. This is itself a function: we can graph the total loss for each hypothesis, that is, for each particular setting of the slope w and the y-intercept b. We can plot those penalty values on a graph of their own, with the choices of w and b as the independent variables and the resulting loss as the dependent variable. That means we plot the loss along the vertical axis, and for simplicity, so we can use two dimensions, we'll ignore the choice of b and only worry about the slope w. For each of our hypotheses, we plot their w along the horizontal axis. If we were to do this for more than just the four lines on the left, we would get a lovely bowl shape, and on our example graph, we can see visually that the minimum loss, the lowest point of that bowl, is at w = 10. In other words, the hypothesis that uses 10 for w results in the smallest possible loss. Now, if only there were some way to calculate that point of minimum loss and solve for the best possible model. We don't know what the point of minimum loss is exactly, but we do know something about that point. Remember from the math review that we can find the slope of a tangent to a function at a particular point by calculating the derivative of the function. Take a moment to guess what's special about the slope of the tangent at this point on the loss graph, where the loss is as small as possible. This is why the derivative is so important to us.
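You can trace the bowl shape yourself by sweeping a range of slopes w and evaluating the loss at each, ignoring b just as in the video. The data below is fabricated so that the bottom of the bowl lands at w = 10, matching the graph described above.

```python
# Hypothetical data constructed so that the best slope is exactly 10.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 20.0, 30.0, 40.0]   # exactly y = 10x

def loss(w):
    """L2 loss of the line h(x) = w*x (b ignored for the 2D picture)."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

sweep = [6, 8, 10, 12, 14]
losses = [loss(w) for w in sweep]
best = sweep[losses.index(min(losses))]
print(best)   # 10 -- the lowest point of the bowl
```

Of course, sweeping a handful of candidate slopes is exactly the inefficient strategy we want to avoid; the derivative lets us jump straight to the minimum instead.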
At that all-important lowest point of our loss function, the tangent has a special characteristic: a slope of zero. So the point we care about most on this graph of our loss is conveniently the point where the derivative of the loss function is equal to zero, and our L2 loss function just so happens to have a very easy-to-calculate derivative. For calculus fans, notice the equation on the slide. All we have to do now is calculate the derivative, set it equal to zero, and solve for the parameters of the line, w and b. Voila, the solution that absolutely, definitely gives us the smallest possible loss. We can find exactly the equation for the line where the least penalty happens. This gives us the truly best-fit line, or in other words, the optimal hypothesis. This is what a linear regression algorithm does. We now have a mathematically precise way for our computer to choose the best line within the entire hypothesis space of lines. That doesn't mean it's the best possible hypothesis for all the examples we might possibly care about, because we already narrowed down the hypothesis space to only consider linear functions, and we can only use the examples we have labels for to find the best fit. But thanks to calculus, we do know that we found the candidate within the space of linear functions that gives us the lowest possible L2 loss on the training data, given the features we're using. We can't actually ask for any more from a learning algorithm than to find the provably best solution within its own hypothesis space. Our illustration was only in two dimensions, though. We had one variable as the feature, namely the temperature outside, and another variable for the label, namely the number of ice creams sold. But your computer isn't limited to two dimensions, and in fact most machine learning projects will have a very high-dimensional feature space. The good news is we can do the exact same process in arbitrary m-dimensional space.
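Setting the derivatives of the L2 loss with respect to w and b to zero and solving yields the standard closed-form least-squares formulas for a 2D line. The following sketch implements that solution directly; the data points are invented (they lie exactly on y = 2x + 1), so the recovered line is exact.

```python
def best_fit(xs, ys):
    """Closed-form solution of dLoss/dw = 0 and dLoss/db = 0 for h(x) = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Setting the derivative of the summed squared error to zero gives:
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return w, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]       # exactly y = 2x + 1
print(best_fit(xs, ys))          # recovers slope 2.0 and intercept 1.0
```

No sweeping, no guessing: one calculation lands on the provably lowest point of the bowl.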
Recall that for an arbitrary line in two dimensions, we use the notation h(x) = wx + b. For m-dimensional space, we still have a linear function, but now instead of one single feature x, we have a feature vector x with a spot reserved for each dimension, and for each of those features, we have some weighting w. So a line in m-dimensional space has the following form, with column vectors w and x: h(x) = wᵀx. We use the letter w, and we call it a weight vector, because it ends up telling us how much weight, how much importance, to place on each feature. How much does each feature contribute to our estimate of that final number? We can also write this using summation notation: to predict in m-dimensional space, we take the weighted sum over each of our m features. We multiply each feature by its weight and then add them all together. So in m-dimensional space, our goal is still to find the weight vector that minimizes the penalty function we chose, in this case the L2 loss. We don't even have to worry about b anymore, because in general we can make one of our features a constant 1, and b becomes the weight on that feature. All the calculus we used when finding the best hypothesis in the space of all possible 2D lines generalizes perfectly well to finding the best of all m-dimensional hyperplanes. This generalization is good news for us because it means we can use linear regression for machine learning applications that have more than two features. Now, let's take a look at how you use the linear regression function. To use linear regression models, you'll need to import the linear_model package from scikit-learn, unsurprisingly. Then we'll create a regressor object by calling LinearRegression from the linear_model package. There are several different parameters you can input in the LinearRegression function call.
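Before turning to scikit-learn, here is a tiny sketch of the m-dimensional prediction itself: the weighted sum wᵀx written in summation form, with the bias b folded in as the weight on a constant feature of 1. The particular numbers are illustrative only.

```python
def predict(w, x):
    """h(x) = sum over all m features of w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))

w = [0.5, 2.0, 3.0]   # weight vector; w[0] plays the role of b
x = [1.0, 4.0, 2.0]   # feature vector; x[0] is the constant 1
print(predict(w, x))  # 0.5*1 + 2.0*4 + 3.0*2 = 14.5
```

Note how the bias trick works: because x[0] is always 1, its weight contributes a constant offset, exactly the job b did in two dimensions.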
We won't be changing any of these here, but go ahead and take a look at the function documentation for more details, and we'll discuss some of them in a later module. Now that we've created the linear regression model maker, we need to put it to work and fit a line or hyperplane to some data. To do so, we call the fit function, which takes in X and y: y is the one-dimensional array of the label for each example, and X is the corresponding feature matrix for our training examples, sized num_samples by num_features. Once we call the fit function, the regressor object contains our regression model, the best hypothesis in the space of linear functions with num_features dimensionality. We use it by calling the predict function with our unseen or operational data as the input, again a feature matrix of num_samples by num_features. Calling predict uses the model we just fit to calculate the value for each sample and returns the array of predicted labels. There. Now you've seen the main concepts behind linear regression, including one form of penalty known as the L2 loss function. This loss was useful because it gives a shape that has a unique point for the minimum penalty, which means we can use the derivative to find our guaranteed best-fit line. We also saw how these ideas generalize to arbitrary m-dimensional space, which makes them extremely useful for machine learning, and of course, we went over how to use scikit-learn to call the linear regression function. We hope you'll try this out for yourself and make some interesting predictions using lines.
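Putting the whole workflow together, here is a minimal sketch of the scikit-learn steps described above: create the regressor, fit it, then predict on unseen data. The training data is fabricated (it lies exactly on y = 2x + 1), so the fitted model recovers that line almost exactly.

```python
import numpy as np
from sklearn import linear_model

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # feature matrix: num_samples x num_features
y = np.array([3.0, 5.0, 7.0, 9.0])           # one label per example (exactly y = 2x + 1)

regr = linear_model.LinearRegression()
regr.fit(X, y)                               # find the best hypothesis for this data

X_new = np.array([[5.0], [6.0]])             # unseen data, same num_features shape
print(regr.predict(X_new))                   # close to [11., 13.]
print(regr.coef_, regr.intercept_)           # close to [2.] and 1.0
```

After fitting, the learned weight vector and intercept are available as the `coef_` and `intercept_` attributes, which is a handy way to inspect how much weight the model placed on each feature.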