0:00

What about the questions that we can answer once we've run a regression?

Well, perhaps the most used aspect of a regression model is as a methodology for predictive analytics. Businesses have really embraced predictive analytics in the last few years, always trying to predict outcomes: predicting, for example, a product that an individual might buy on a website, or the rating that somebody gives to a movie that they watch on a streaming service, or the price of a stock tomorrow. Prediction is a very common task that we face in business, and we call our approaches to prediction, in general, predictive analytics.

And if you have a regression, you certainly have a tool for prediction, because once you've got that regression line there, the prediction is pretty straightforward: take a value of X, go up to the line, and read off the value in the Y direction.

So an example question would be: based on our regression model for the diamonds data set, what do you expect to pay for a diamond that weighs 0.3 of a carat? The answer would be to take 0.3 on the X-axis, go up to the line, and read off the value. Or, equivalently, you can plug 0.3 into the regression equation to work out that expected value.
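That plug-in step can be sketched in a few lines of code. The intercept and slope below are illustrative placeholders, not the actual coefficients fitted to the diamonds data in the lecture:

```python
# "Plugging in" to a fitted regression equation.
# These coefficients are assumed for illustration only.
intercept = -260.0   # assumed intercept (dollars)
slope = 3721.0       # assumed slope (dollars per carat)

def predicted_price(weight_in_carats):
    """Expected price from the linear regression equation."""
    return intercept + slope * weight_in_carats

print(round(predicted_price(0.3), 2))  # 856.3 under these assumed coefficients
```

Reading the value off the chart and evaluating the equation give the same answer; the equation is just the line in algebraic form.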

One of the other things, though, that regression will do for you: it won't just give you a single prediction. With suitable assumptions, which we will have a look at in a little while in this module, we're able to get a prediction interval as well. And that prediction interval gives us a range of feasible values for where we think the outcome, or forecast, is going to lie. In practice, that tends to be much more realistic than just trying to give a single best guess.

Another thing that we do with these regression models is interpret the coefficients coming out of the model. The coefficients themselves can tell us things; they can give us information. So I might also ask the question: how much, on average, do you expect to pay for diamonds that weigh 0.3 of a carat versus diamonds that weigh 0.2 of a carat? Well, that's a change in X of 0.1, and given a linear regression with a slope that happens to equal 3,720, what we can say is: if we look at diamonds weighing 0.3 of a carat versus 0.2 of a carat, we can anticipate paying an additional $372 for them, given the underlying regression equation. So we're essentially interpreting the slope in the regression. Likewise, intercepts sometimes have interpretations: an intercept might be interpreted as a fixed cost, or as a start-up time. So we often want to interpret coefficients.
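That slope interpretation is just a multiplication, which we can check directly (the 3,720 slope comes from the lecture; everything else is arithmetic):

```python
# Checking the slope interpretation: a slope of 3,720 dollars per carat
# means a 0.1-carat difference in weight shifts the expected price by $372.
slope = 3720.0        # dollars per carat, from the lecture
delta_weight = 0.1    # 0.3 carat versus 0.2 carat
delta_price = slope * delta_weight
print(round(delta_price, 2))  # 372.0
```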

3:36

So let's zoom in a little bit more on this prediction idea and see one of the ways that you can immediately put a regression to work. This harks back to a discussion that we had in another module, when we were talking about prospecting for other opportunities. Now, in this particular example, where I'm looking at diamonds, I'm imagining that I'm a diamond merchant, or a diamond speculator, but the same ideas could easily work for looking for new customers, or looking for new investment opportunities. Well, let's say we've collected some data and we've fit our linear regression model, that is, we've found the best-fitting line to the data, and then we come across a diamond. That diamond weighs 0.25 of a carat and it's being sold for $500.

So I've added that point to the graph here; it's the big red dot. Now, if I see a point like that, which is a long, long way beneath the regression line, then it's potentially of great interest to me. Because if I believe my model, and that is a huge caveat here, then there's something going on with this particular diamond. One of the possibilities is that it's being mispriced by the market, and if it's being mispriced by the market, then it's potentially a great investment opportunity. There is another explanation, though: maybe there's some flaw associated with this diamond, and that's why it's going for such a low price. I don't know which of those two is the right explanation until I've gone to have a look at the diamond. The point that I'm making here is that this activity of looking at how far away the points are from the regression line is a technique for ranking potential candidates, what some people call prospecting: coming up with a set of candidates that look the most interesting to me. And so that's one of the uses that you can put a regression model to.
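A minimal sketch of that idea, scoring a candidate by its distance below the line. The coefficients here are illustrative placeholders, not the actual fit from the diamonds data:

```python
# Prospecting with residuals: how far below the regression line is a candidate?
intercept = -260.0   # assumed intercept (dollars)
slope = 3721.0       # assumed slope (dollars per carat)

def residual(weight, price):
    """Observed price minus the price the line predicts."""
    return price - (intercept + slope * weight)

# The diamond from the lecture: 0.25 carat, offered at $500.
r = residual(0.25, 500.0)
print(r)  # -170.25: well below the line under these assumed coefficients
```

A large negative residual flags the diamond as either a bargain or a flawed stone; the model can't tell you which, only that the point deserves a closer look.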

5:44

In summary, points a long way from the line can be of great interest.

I've shown you some regression lines, but I haven't yet told you how they're calculated. So where does this regression line, sometimes called the line of best fit, come from? There's a methodology, called the method of least squares, that is the most frequently used one for calculating these best-fitting lines. It's not the only way of calculating a line to go through the data, but it's a very commonly used one, and if you pick up a typical spreadsheet program, it's the one that's going to be implemented when you run your regressions there.

So the optimality criterion, because we're going to fit the best line, is known as the method of least squares. In words, what the least squares method does is find, among the infinite number of lines that you could potentially draw through the data, the line that minimizes the sum of the squares of the vertical distances from the points to the line. I've illustrated that idea by zooming in on the diamonds data: I've taken a small range, I've drawn a line there, and I've drawn the points around it. The red lines are picking up the vertical distance from each point to the line, and what we want to do is find the line that minimizes the sum of the squares of those vertical distances. We're going to call such a line the least squares line, or the line of best fit.

So basically, what you're trying to do is find the line that most closely follows the data; that's another way of thinking about it. But there is a formal criterion, that criterion is implemented in software, and you'll use that software to actually calculate a least squares line, a regression, for any particular data set that you might have. So the least squares criterion is a line-fitting criterion. We've now seen how these lines of best fit are derived, through the least squares criterion.
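For concreteness, here is a small self-contained sketch of the least squares calculation, using the standard closed-form solution for the slope and intercept; the tiny data set is made up so the answer can be checked by hand:

```python
# The least squares criterion: among all lines y = a + b*x, pick the one
# minimizing the sum of squared vertical distances to the points.
def least_squares(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)                      # spread of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) # co-variation
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data lying exactly on y = 1 + 2x, so the fit recovers that line.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(least_squares(xs, ys))  # (1.0, 2.0)
```

A spreadsheet's built-in regression is computing exactly these quantities; the formulas are just hidden behind the chart.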

7:52

What these lines allow us to do is decompose the data into two parts. That's one of the key insights of a regression: a regression line can be used to decompose the data, in our case the diamond prices, into two components. One component is called the fitted values, those are the predictions, and the other component is known as the residuals. In terms of the picture on the previous slide, for any given value of X, the fitted value goes up to the blue line, and the residual is the vertical distance from the blue line to the point. So you can see that you can, ultimately, get to any one of those points in two steps: take your X value, first take a step up to the line, and then, once you're on the line, add on the little red line, the residual, and you'll get to the data point. That says that the data point can be expressed in two components: one, the line, and two, the residual about that line.

That decomposition of the data into two parts mirrors a basic idea that we bring to fitting these regression models: the idea that the data we see is made up of two parts, which we often call the signal and the noise. The regression line is our model for the signal, and the residuals are encoding the noise in the problem. Both of the components that come out of the regression, the fitted values and the residuals, are useful. The fitted values become our forecasts: if you bring me a new diamond of a given weight, let's say 0.25 of a carat, what do I think its price is going to be? I simply go up to the regression line, the so-called fitted values, and read off the value of Y, the price.
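That two-step decomposition can be sketched directly; the coefficients and data below are made-up illustrations:

```python
# Each observation splits exactly into a fitted value (the line, the
# "signal") plus a residual (the "noise").
intercept, slope = 1.0, 2.0          # illustrative fitted line
xs = [0.0, 1.0, 2.0]
ys = [1.5, 2.5, 5.5]                  # made-up observations

fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# The decomposition is exact: fitted value + residual recovers the data.
for y, f, r in zip(ys, fitted, residuals):
    assert y == f + r
print(fitted, residuals)
```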

9:49

Now, the residuals are useful as well, because they allow me to assess the quality of fit of the regression model. Ideally, all our residuals would be zero; that would mean that the line went through all the points. In practice, that is simply not going to happen, but we will often examine the residuals from a regression, because by examining the residuals we can potentially gain insight into that regression. Typically, when I run a regression, one of the very first things I'm going to do is take all the residuals out of the regression, sort that list of residuals, and look at the most extreme ones. The points with the biggest residuals are, by definition, those points that are not well fit by the current regression. If I'm able to look at those points and explain why they're not well fit, then I have typically learned something that I can incorporate in a subsequent iteration of the regression model.
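The sort-and-inspect step just described might look like this; the data and coefficients are made up for illustration:

```python
# Rank points by the size of their residuals and inspect the worst fits first.
points = [("A", 1.0, 3.1), ("B", 2.0, 5.0), ("C", 3.0, 9.5)]  # (id, x, y)
intercept, slope = 1.0, 2.0  # illustrative fitted coefficients

def residual(x, y):
    return y - (intercept + slope * x)

# Biggest absolute residual first: those are the points the line fits worst.
ranked = sorted(points, key=lambda p: abs(residual(p[1], p[2])), reverse=True)
print([p[0] for p in ranked])  # ['C', 'A', 'B']
```

Point C tops the list, so it is the first one to drill down on and try to explain.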

Now, that all sounded a little bit abstract, so I've got an example to show you right now.

So here's another data set that lends itself to a regression analysis. In this data set, I've got two variables. The outcome variable, or the Y variable, is the fuel economy of a car; to be more precise, it's the fuel economy as measured by gallons per 1,000 miles in the city. So let's say you live in the city and you only drive in the city: how many gallons are you going to have to put in the tank to be able to drive your car a thousand miles over some period of time? That's the outcome variable. Clearly, the more gallons you have to put in the tank, the less fuel efficient the vehicle is; that's the idea.
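This outcome variable is just the reciprocal scale of the familiar miles-per-gallon figure, as a quick calculation shows (the 25-mpg example is made up):

```python
# Gallons per 1,000 miles is the reciprocal of miles per gallon, scaled
# by 1,000: a higher value means a less fuel-efficient car.
def gallons_per_1000_miles(mpg):
    return 1000.0 / mpg

print(gallons_per_1000_miles(25.0))  # 40.0: a 25-mpg car needs 40 gallons per 1,000 miles
```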

Now, we might want to create a predictive model for fuel economy as a function of the weight of the car. So here, I've got weight as my X variable, and I'm going to look for the relationship between the weight of a car and its fuel economy. We collect a set of data; that's what you can see in the scatter plot, the bottom left-hand graph on this slide. Each point is a car, and for each car we've found its weight, we've found its fuel economy, and we've plotted the variables against one another. And we have run a regression through those points, using the method of least squares. That regression gives us a way of predicting the fuel economy of a vehicle of any given weight.

Now, why might you want to do that? Well, one of the things that many vehicle manufacturers are thinking about these days is creating more fuel-efficient vehicles, and one approach to doing that is actually to change the materials that vehicles are manufactured from. So, for example, they might be moving from steel to aluminum. Well, that will reduce the weight of the vehicle, and if the vehicle's weight is reduced, I wonder how it will impact the fuel economy? And so that's the sort of question that we'll be able to start addressing through such a model.

So that's the setup for this problem, but I want to show you why looking at the residuals can be such a useful thing. When I looked at the residuals from this particular regression, I found the biggest residual in the whole data set, and that's the point that I have identified in red in the scatter plot. It's a big positive residual, which means that in reality this particular vehicle needs a lot more gas going into the tank than the regression model would predict. The regression model would predict the value on the line; the red data point is the actual observed value. It's above the line, so it's less fuel efficient than the model predicts. It needs more gas to go in the tank than the model predicts, so is there anything special about that vehicle?

Well, at that point, I go back to the underlying data set and I drill down; when I see big residuals, I'm going to drill down on them. Drilling down on this residual actually identifies the vehicle, and the vehicle turns out to be something called a Mazda RX-7. This particular vehicle is somewhat unusual, because it had what's termed a rotary engine, which is a different sort of engine from every other vehicle in this data set. Every other vehicle had a standard engine, but the Mazda RX-7 had a rotary engine, and that actually explains why its fuel economy is bad in the city. And so, by drilling down on the point, by looking at the residuals, I've identified a feature that I hadn't originally incorporated into the model, and that would be the type of engine. And so the residual, and the exploration of the residual, has generated a new question for me that I didn't have prior to the analysis. And that question is: I wonder how the type of engine impacts the fuel economy as well?

So that's one of the outcomes of regression that can be very, very useful. It's not the regression model directly talking to you; it's the deviations from the underlying model that can sometimes be the most insightful part of the model itself, or the modeling process. I remember that in one of the other modules I talked about the benefits of modeling, and one of them is serendipitous outcomes: things that you find that you hadn't expected to at the beginning. And I would put this out there as an example of that: by exploring the residuals carefully, I've learned something new, something that I hadn't anticipated, and I might be able to subsequently improve my model by incorporating this idea of the type of engine into the model itself. So the residuals are an important part of a regression model.
