0:00

This lecture's about one of the most direct and simple ways to perform machine learning: using regression modeling. If you've taken the regression modeling class in the data science specialization, then a lot of this material will be familiar to you. We're just using it in the service of performing prediction.

0:16

So the key idea here is that we're just going to fit a simple regression model. If you don't know what that is, don't worry about it; I'll explain it briefly in the rest of the lecture. But the idea is that basically we're going to fit a line to a set of data. That line will consist of multiplying a set of coefficients by each of the different predictors. Then, when we get new predictors, or new covariates, we multiply them by the coefficients that we estimated with our prediction model, and that gives us a prediction for the new value. This is useful when the linear model is nearly correct; in other words, when the relationship between the variables can be modeled in a linear way, as a function of lines, then this is a useful way to predict. It's very easy to implement, and it's also quite easy to interpret compared to many machine learning algorithms, in the sense that you're fitting a set of lines to a data set, and lines are relatively easy to interpret.

1:17

So we're going to be using data on eruptions of geysers. Geysers have a waiting time between their different eruptions and an amount of time that they actually erupt for, and there's a data set we can load in very easily that contains some information on eruptions for a particular geyser: Old Faithful in the United States, a famous geyser. So we can load the caret package and load the data for these eruptions. I'm going to set the seed so that all the analysis I perform after this can be reproduced. Then I create a training set and a test set, just like usual. I create a training set because we're going to be building models only on the training set and then applying them to the test set.
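In code, the steps just described might look like the following sketch. The 50/50 split and the object names (trainFaith, testFaith) are illustrative choices, not taken verbatim from the lecture:

```r
# Load caret and R's built-in Old Faithful data ("faithful")
library(caret)
data(faithful)
set.seed(333)  # fix the random seed so the split is reproducible

# Split the data into a training set and a test set
inTrain <- createDataPartition(y = faithful$waiting, p = 0.5, list = FALSE)
trainFaith <- faithful[inTrain, ]
testFaith  <- faithful[-inTrain, ]
head(trainFaith)
```

All model building below happens on trainFaith; testFaith is held out for evaluation.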

2:00

And so, if you look at the data set, in the training set you can see that we have just two variables, so it's a very easy example: an eruption time and a waiting time. The waiting time is the time between eruptions, and the eruption time is the length of time that the geyser was erupting.

2:18

If I make a plot of these two variables, with waiting time here on the x axis and duration here on the y axis, you can see that there's roughly a linear relationship: you can imagine drawing a line through this that predicts the duration relatively well from the waiting time.

2:36

So we can do that by fitting a formula that's just a line. Remember, the formula for a line says that the eruption duration is equal to a constant, what people call an intercept term, plus another constant times the waiting time, plus an error term:

ED_i = b0 + b1 * WT_i + e_i

As you saw in the previous slide, even if a line fit through the middle of the data looks like a reasonable approximation to the relationship, the points obviously don't fall exactly on a line. That's why we allow for some error in our model: the error term models everything we didn't measure or didn't understand about the relationship. We can use the lm command in R to fit the linear model. In the lm call, eruptions is the outcome variable that we're trying to predict, the tilde says we're going to predict it as a function of everything on the other side, here the waiting data, and we build the model using the data from the training set. If we do that, we get a summary of the output, and the part to look at for prediction is the estimates: the intercept estimate is b0 in this formula, and the waiting time estimate is b1. So to get a new prediction, we just take minus 1.79 plus 0.073 times whatever our new waiting time is, and that produces our prediction for the expected duration.
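A minimal sketch of that fit; for self-containment this uses the full faithful data rather than a training split, so the estimates differ slightly from the lecture's -1.79 and 0.073:

```r
# Fit eruption duration as a linear function of waiting time
data(faithful)
lm1 <- lm(eruptions ~ waiting, data = faithful)

# The Estimate column holds b0 (intercept) and b1 (waiting slope)
summary(lm1)$coefficients
```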

4:16

So this is what the model fit looks like. Again, I'm plotting the training set, the waiting times versus the eruption durations, and then I plot the fitted values. The way I do that is to extract them from the linear model object with lm1$fitted, which gives me the fitted values, and then plot them against the predictor variable I used, the waiting time. So this is waiting time plotted versus duration: the points come from the first plot command, and the black line fit through them comes from a lines command that adds the line of fitted values. Just like we saw previously, there's a line that's a reasonably good representation of the relationship between these two variables. Obviously it's not perfect, the points don't lie exactly on the line, but it captures the main trend of the data set reasonably well.
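The plotting step could be sketched like this (again on the full faithful data for self-containment); note that the fitted values all lie on one straight line, so the lines command draws that line through the cloud of points:

```r
data(faithful)
lm1 <- lm(eruptions ~ waiting, data = faithful)

# Scatter the observed data
plot(faithful$waiting, faithful$eruptions, pch = 19, col = "blue",
     xlab = "Waiting", ylab = "Duration")
# Overlay the fitted line: model predictions at the observed waiting times
lines(faithful$waiting, lm1$fitted.values, lwd = 3)
```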

5:12

So, to predict for a new value, we again, like I said, take the estimated value for b0 and the estimated value for b1. We usually denote those with little hats above the values. Then we plug them into the formula from the previous page. Notice that in this formula we don't have an error term, because we don't know what the error is for this particular value, so we just use the parts that we can estimate. To extract those values from the linear model object, you can use the coef command, which gives you the coefficients, the name we use for the estimates of those two quantities. So coef(lm1)[1] gives you the intercept, beta hat zero, and coef(lm1)[2] gives you beta hat one, the value fit for the waiting time. Then, suppose we have a new waiting time of 80: the prediction we get out is 4.119 as the eruption duration.
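The hand calculation might look like this; fitting on the full data rather than the training split, the result comes out near, but not exactly equal to, the lecture's 4.119:

```r
data(faithful)
lm1 <- lm(eruptions ~ waiting, data = faithful)

# Predict by hand: beta0-hat + beta1-hat * (new waiting time)
newWaiting <- 80
coef(lm1)[1] + coef(lm1)[2] * newWaiting
```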

6:16

So, another thing we can do is predict using that lm object directly; you don't have to extract the coefficients and multiply them together every time. If we create a new data frame that holds the one new value we want to predict for, say a waiting time equal to 80, then typing predict and passing it the fitted model from the training set and the new data frame gives me the prediction for the new value. And it matches: the predict command uses the same formula you would use if you calculated the prediction by hand.
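A sketch of the same prediction through predict(), which should agree exactly with the hand calculation:

```r
data(faithful)
lm1 <- lm(eruptions ~ waiting, data = faithful)

# A one-row data frame holding the new waiting time
newdata <- data.frame(waiting = 80)

# predict() applies the fitted formula for us
predict(lm1, newdata)
```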

6:52

So another thing to look at: remember, we built this model on the training set, just like we always do, and we want to see how it does on the test set. Here I have separated the data into the two sets, training and test, and made two plots. On the left is a plot of the training data, and on the right is a plot of the test data. For the training data, I plot the waiting time versus the duration and then add the model fit, and it's a reasonably good fit. That's to be expected, because it is the exact data we used to build the model.

7:37

So you can see that the line doesn't fit the test data quite as perfectly as it did in the training set; it's a little off in places here. But that's to be expected, because the test set is a slightly different set of data. As you can see, it still captures the overall trend, the part of the variation that can be explained by the waiting time.

7:59

The next thing to do is to get the training and test set errors. To get the training set error, we take the fitted values, lm1$fitted (remember, lm1 was the object we used to fit the model, and the fitted values are the predictions on the training set), subtract the actual values of the eruption duration on the training set, square the differences, sum them up, and take the square root. That gives us the root mean squared error, if you remember that from our lecture on the types of errors. It basically measures how close the fitted values are to the real values. And we get a value of 5.752.
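The training-error calculation could be sketched like this; computed on the full data rather than the lecture's training split, so the number will not be 5.752:

```r
data(faithful)
lm1 <- lm(eruptions ~ waiting, data = faithful)

# Root of the summed squared differences between fitted and observed values
sqrt(sum((lm1$fitted.values - faithful$eruptions)^2))
```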

8:39

We can also calculate the root mean squared error on the test set. The way we do that is to predict again, using the lm object that we fit on the training set, but now passing it the new data, the test data set. So now we're predicting values on the test data set, and we subtract off the actual values, since we know what they are on the test data set, square them, and sum them up. Since we didn't use the test set at all when we built our algorithm, this is a more realistic estimate of the root mean squared error you would get on a new data set than the value we got on the training set. And, just like always, the test set error is almost always larger than the training set error, because we're moving to a new set of values that weren't used to fit the model. That represents the added error and variability you get when you move to a new data set: the out-of-sample error.
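A self-contained sketch of both errors; it uses a simple random half split rather than caret's createDataPartition, so the exact numbers differ from the lecture's:

```r
data(faithful)
set.seed(333)

# Simple random half split (the lecture uses caret's createDataPartition)
inTrain <- sample(seq_len(nrow(faithful)), size = nrow(faithful) %/% 2)
trainFaith <- faithful[inTrain, ]
testFaith  <- faithful[-inTrain, ]

lm1 <- lm(eruptions ~ waiting, data = trainFaith)

# Error on the data the model was fit to...
trainErr <- sqrt(sum((lm1$fitted.values - trainFaith$eruptions)^2))
# ...versus error on the held-out test data
testErr  <- sqrt(sum((predict(lm1, newdata = testFaith) - testFaith$eruptions)^2))
c(train = trainErr, test = testErr)
```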

9:40

The other thing you can do, a nice component of using linear modeling for prediction, is that you can calculate prediction intervals. So, again, here I'm calculating a new set of predictions for the test data set, using the linear model we built on the training data set. And I say that I also want a prediction interval out; that's just an argument I pass to the predict function. Then I order the values for the test data set and plot the test

10:14

waiting times versus eruption times. I can also add lines that show not only my predictions, the black line showing the predicted values, but also an interval that captures

10:32

the region where we expect the predicted values to land: if our linear model is correct, we expect most of the predicted values to land between these two red lines. And so this shows you a little bit about the range of possible values we could predict, not just a single prediction, which can be useful for giving you an idea of how well your model is likely to do on new predictions: it tells you the range of possible predictions you might get out.
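The interval calculation and plot could be sketched like this; predict() returns a matrix with the point prediction (fit) and the lower and upper interval bounds (lwr, upr), and ordering by waiting time keeps the interval lines smooth:

```r
data(faithful)
lm1 <- lm(eruptions ~ waiting, data = faithful)

# Ask predict() for a prediction interval alongside the point predictions
pred1 <- predict(lm1, newdata = faithful, interval = "prediction")
head(pred1)  # columns: fit, lwr, upr

# Plot the data, the fitted line, and the interval, ordered by waiting time
ord <- order(faithful$waiting)
plot(faithful$waiting, faithful$eruptions, pch = 19, col = "blue",
     xlab = "Waiting", ylab = "Duration")
matlines(faithful$waiting[ord], pred1[ord, ],
         type = "l", lty = c(1, 2, 2), col = c("black", "red", "red"), lwd = 3)
```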

11:03

You can do the same thing in the caret package. I've shown you how to do it by hand, but you can also do it very easily with caret. I use the train function in the caret package to build the model: again, eruptions is the outcome, waiting time is the predictor, and they're separated by the tilde. Then I say which data set I want to build the model on, and for the method I tell it "lm", for linear modeling. If you do a summary of the final model fit, where finalModel is the part of the modFit object created by train that holds the exact final model being used for prediction, it looks very similar to the model we fit by hand: minus 1.79 for the intercept and 0.07 for the waiting time.
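The caret version might be sketched like this; with the full data instead of the lecture's training split, the coefficients will be close to, but not exactly, -1.79 and 0.07:

```r
library(caret)
data(faithful)
set.seed(333)

# train() with method = "lm" fits the same kind of linear model
modFit <- train(eruptions ~ waiting, data = faithful, method = "lm")
summary(modFit$finalModel)  # the exact model caret uses for prediction
```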

11:51

So regression modeling can be done with multiple covariates as well, and we'll have a lecture on that. You can also combine it with all the other prediction and machine learning methodology. Again, it's a good quick-and-dirty method to use, but it can suffer

12:06

from higher prediction error when the relationship isn't actually linear. A lot of prediction with regression modeling is covered in these books, which would be a good place to go if you want more information.

Â [SOUND]

Â