0:11

We're going to be working with a data set of 15 books,

some of which are hardback, labeled as hb under cover type, and

the rest are paperback, labeled as pb under cover type.

We have data on the weights and the volumes of these books.

And we're going to be building a model predicting the weight of the book using

its volume and cover type.

So volume and cover type are going to be our two explanatory variables.

In other words, our two predictors.

0:40

Here we have a scatter plot of weights versus volumes of these books.

Blue stars indicate books that are hardcover, and

orange squares indicate books that are paperback.

Can you identify a trend in the relationship between volume and

weight of hardcover and paperback books?

0:58

It appears that paperbacks that are indicated in orange squares

generally weigh less than hardcover books.

We can see that the relationship between weight and volume is similar for

the two types of books, which is expected.

As volume increases, so does weight.

But we're also noting this difference that paperbacks generally

weigh less than hardcover books.

Next, we're actually going to fit the model.

You can actually follow along in R.

The data can be found in the library called DAAG, so

we can first load that library.

And the name of the data set is allbacks, so we can load that data file.
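As a minimal sketch of these two steps, assuming the DAAG package is already installed on your machine (for example via install.packages("DAAG")):

```r
# Assumes the DAAG package is installed, e.g. via install.packages("DAAG")
library(DAAG)   # package that ships the allbacks data set

data(allbacks)  # load the data frame into the workspace
str(allbacks)   # inspect the variables, including weight, volume, and cover
```

The call to str() is just a quick check that the data loaded; it is not required for fitting the model.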

1:36

Once you have the data loaded, you can go ahead and fit the model.

We're going to use the same lm, linear model function that we used for

simple linear regression, to fit multiple linear regression models and

we once again use the same structure of the formula.

On the left-hand side we have weight, the response variable.

And on the right-hand side we have our explanatory variables, or in other words,

our predictors, separated by + signs.

So we're predicting weight from volume and cover.
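The formula described above might look like this in R. This is a sketch that assumes DAAG is installed; the object name book_mlr is just an illustrative choice, not something named in the video.

```r
library(DAAG)
data(allbacks)

# Response on the left of ~, predictors on the right separated by +
book_mlr <- lm(weight ~ volume + cover, data = allbacks)

# Print the estimates, standard errors, and R-squared
summary(book_mlr)
```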

2:08

The summary output for a multiple linear regression model looks

a lot like the summary output for a simple linear regression model.

That is a linear regression model with a single predictor,

except we have more estimates here because we have more variables for

which we're estimating a slope.

We can see our estimate for the intercept, 197.96,

as well as our estimates for the slope of the volume and the cover variables.

We can also see that one of the levels of the cover variable is

noted on the regression output.

And remember that the one that is noted on the regression output is

the non-reference level.

Which means that the hardcover books must be the reference level.

Also at the bottom of this regression output,

we can see our multiple R-squared to be 92.75%.

That means 92.75% of the variability of weight of

books can be explained by the volume and cover type.

This is a pretty high R-squared, but we would expect that to be the case because

what else could the weight of a book depend on?

Perhaps the paper type, or something like that, accounts for the remaining

roughly 7% of the variability that is unexplained.

3:19

We're going to actually parse through this regression output in detail

throughout this unit, but for now, let's just focus on the estimates for

the slopes and the intercepts and think about how to interpret these values.

First off, using these estimates, we can easily write the linear model.

3:38

The predicted weight is equal to 197.96 + 0.72

times the volume of the book - 184.05 times cover:pb.
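Written out as an equation, the same model reads as follows. The hat over weight to mark a predicted value is a notational assumption here; the transcript only describes the equation in words.

```latex
\widehat{\text{weight}} = 197.96 + 0.72 \times \text{volume} - 184.05 \times \text{cover:pb}
```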

And remember we said that pb, the paperbacks, are the non-reference level.

Meaning that the hardcover books are the reference level.

And for reference levels, we always plug in 0 in our linear model.

So if we wanted to simplify this

linear model to see what it would look like only for hardcover books,

we would simply write out our linear model, except plug in a 0 for

the cover variable, because the reference level is coded as 0.

If you have a book that is a paperback, we would actually plug in 1 for that.

A little bit of simplification gives us that the estimated weight for

hardcover books is 197.96 + 0.72 times the volume.

So here we have started with a multiple regression model and we have simplified

it down to a simple regression model for only one type of the books.

We are able to do this simplification easily because we have a categorical

variable as our second variable in our regression model.

For paperback books on the other hand, we're going to plug in a 1 for

the cover type in our regression model.

So if we simplify that out, we're going to get an intercept of 197.96 - 184.05 = 13.91.

Note that this is a lower intercept than the intercept for the hardcover books,

which makes sense because we had said that paperback books generally weigh

less on average.

So the line for those books, 13.91 plus 0.72 times the volume, should sit lower

on the scatter plot than the line for the hardcover books.

So same slope, different intercepts.

This is what the regression models actually look like on the scatter plot.

For the hardcover books,

we have predicted weight is equal to 197.96 + 0.72 times the volume.

And for paperback books, we have predicted weight = 13.91 + 0.72 times the volume.
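The two simplifications can be laid out side by side, plugging 0 (hardcover) or 1 (paperback) into the cover term. The notation is an assumption for illustration; the numbers are the ones quoted above.

```latex
\begin{aligned}
\text{hardcover (cover:pb = 0):}\quad \widehat{\text{weight}} &= 197.96 + 0.72 \times \text{volume} - 184.05 \times 0 = 197.96 + 0.72 \times \text{volume} \\
\text{paperback (cover:pb = 1):}\quad \widehat{\text{weight}} &= 197.96 + 0.72 \times \text{volume} - 184.05 \times 1 = 13.91 + 0.72 \times \text{volume}
\end{aligned}
```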

What this multiple linear regression has allowed us to do,

is to fit these separate parallel lines for the two types of books.

As opposed to imposing that we pick only one single line

describing all of these books. Imagine that line would be somewhere in between

these two in order to be able to minimize the residuals.

But it really wouldn't do a good job explaining either the hardcover or

the paperback books.
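If you are following along in R, one way to draw these two parallel lines over the scatter plot is sketched below. It assumes DAAG is installed; the object names and the plotting symbols (stars and squares chosen to mimic the slide) are illustrative choices, not taken from the video.

```r
library(DAAG)
data(allbacks)

# Fit the parallel-lines model and pull out the estimates
fit <- lm(weight ~ volume + cover, data = allbacks)
b <- coef(fit)

# Blue stars for hardcover, orange squares for paperback
is_hb <- allbacks$cover == "hb"
plot(allbacks$volume, allbacks$weight,
     col = ifelse(is_hb, "blue", "orange"),
     pch = ifelse(is_hb, 8, 15),
     xlab = "volume (cubic cm)", ylab = "weight (g)")

# Hardcover line: the model intercept and the volume slope
abline(a = b[["(Intercept)"]], b = b[["volume"]], col = "blue")

# Paperback line: intercept shifted by the cover estimate, same slope
abline(a = b[["(Intercept)"]] + b[["coverpb"]], b = b[["volume"]], col = "orange")
```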

6:16

Next, let's think about how we would interpret the regression parameters.

Let's start with the slope.

We have two slope estimates here, one for volume and one for cover.

The slope estimate for volume is 0.72.

And this means that all else held constant,

for each 1 cm cubed increase in volume,

the model predicts the books to be heavier on average by 0.72 grams.

Remember, we interpret slopes as what happens to y, or what is expected or

predicted to happen to y on average, as x increases by 1 unit.

Since we measured volume in cubic centimeters, we're saying here a 1 unit

increase corresponds to a 1 cubic centimeter increase in volume.

And the other bit here that is new, is this bit about all else held constant.

Simply, what this is saying is that if we were to keep cover type constant,

so looking either at books that are hardcover or at books that are paperback, but

not comparing one group to another,

each 1 cubic centimeter increase in volume is predicted to be

associated with a 0.72 gram increase in weight.

The slope for the cover variable is -184.05.

And remember that hardcover is the reference level and

paperback is the non-reference level.

In this case we can think of this value as going from hardcover to paperback,

there is an expected decrease of 184.05 grams in weight.
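Both slope interpretations can be checked with a small helper built from the rounded estimates. This is only a sketch: predict_weight is a hypothetical function written for illustration, and the volume of 500 is an arbitrary value.

```r
# Hypothetical helper using the rounded estimates from the regression output
predict_weight <- function(volume, cover) {
  pb <- as.numeric(cover == "pb")  # 1 for paperback, 0 for hardcover (reference)
  197.96 + 0.72 * volume - 184.05 * pb
}

# Same cover type, volume 1 cubic cm larger: the difference is the volume slope
diff_volume <- predict_weight(501, "hb") - predict_weight(500, "hb")

# Same volume, hardcover to paperback: the difference is the cover slope
diff_cover <- predict_weight(500, "pb") - predict_weight(500, "hb")

diff_volume  # about 0.72
diff_cover   # about -184.05
```

The first difference comes out the same whichever cover type is held fixed, which is what "all else held constant" means for this model.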

8:11

Next, let's consider the intercept.

The intercept of this model is 197.96.

And remember the intercept is the predicted value of the response variable,

when x = 0.

In this case, we have two x's, so when volume is equal to 0,

as well as when cover is equal to 0.

And cover is a categorical variable.

It's never going to be equal to 0.

What that means is the reference level for the cover is being considered.

In other words, the intercept can be interpreted as hardcover books

with no volume are expected on average to weigh 198 grams.

9:15

To predict the weight of a paperback book with a volume of 600 cubic centimeters,

all we have to do is plug in the appropriate values for the volume and cover.

In this case, we have 197.96 for the intercept,

+ 0.72 x 600 for the volume, - 184.05 x 1 for

cover type, because we have a paperback book and that is the non-reference level.

Doing the math gives us 445.91 grams.

Meaning that this model predicts a paperback book that

is 600 cubic centimeters to weigh 445.91 grams.
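The same arithmetic can be done in R using the rounded estimates from the output. The variable names below are illustrative assumptions.

```r
# Rounded estimates quoted from the regression output
intercept    <- 197.96
slope_volume <- 0.72
slope_pb     <- -184.05

volume <- 600  # cubic centimeters
pb     <- 1    # 1 because paperback is the non-reference level

predicted_weight <- intercept + slope_volume * volume + slope_pb * pb
predicted_weight  # about 445.91 grams
```

Calling predict() on a fitted model object would use the unrounded estimates instead, so its answer may differ slightly from this hand calculation.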

Before we wrap up this video I'd like to make a point about interaction variables.

Note that this model assumes that hardcover and paperback books

have the same slope for the relationship between their volume and weight.

This is probably not an unreasonable assumption for books.

However, we can think of examples where this may not be the case.

Imagine that we were trying to predict calories burned from number of minutes of

exercise and a categorical variable that we also considered in the model was sex.

So, male or female.

The relationship between number of minutes of exercise and

calories burned may not be the same for males and females.

And in that case,

it wouldn't really make sense to model this using two parallel lines.

If this assumption of parallel lines or this idea of the same slope for

the two levels of the categorical variable is not reasonable,

then we would introduce an interaction variable in the model.

These variables are beyond the scope of this course, but I wanted to make a point

about them, just so you realize that the simplifying assumption may not

always make sense and that there are remedies for when it does not.
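For the curious, one common way to request an interaction in R's formula syntax is to replace + with *. This is a sketch on the books data purely for illustration, assuming DAAG is installed; it is not something fit in this video, and book_int is a made-up object name.

```r
library(DAAG)
data(allbacks)

# volume * cover expands to volume + cover + volume:cover,
# so each cover type gets its own slope for volume
book_int <- lm(weight ~ volume * cover, data = allbacks)

# The volume:coverpb estimate is the difference in slopes between the two types
coef(book_int)
```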