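This course introduces simple and multiple linear regression models. These models allow you to assess the relationship between variables in a data set and a continuous response variable. Is there a relationship between the physical attractiveness of a professor and their student evaluation scores? Can we predict the test score of a child based on certain characteristics of his or her mother? In this course, you will learn the fundamental theory behind linear regression and, by analyzing data examples with the free statistical software R and RStudio, learn to fit, examine, and use regression models to examine relationships between multiple variables.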


A course from Duke University

Linear Regression and Modeling

706 ratings


From this lesson

Multiple Regression

In this week, we’ll explore multiple regression, which allows us to model numerical response variables using multiple predictors (numerical and categorical). We will also cover inference for multiple linear regression, model selection, and model diagnostics. Hope you enjoy!

- Mine Çetinkaya-Rundel, Associate Professor of the Practice, Department of Statistical Science

Using inferential techniques, we can determine which variables in the model are significant predictors of our response variable. In this video, we're going to talk about doing a hypothesis test and constructing a confidence interval for the slope estimates of the predictors in our models. And in addition to calculating these values, we're also going to go through how to interpret them.

The data that we're going to be working with come from the National Longitudinal Survey of Youth. These are cognitive test scores of three- and four-year-old children and characteristics of their mothers. We have data on the kid's score, whether or not the mom went to high school, the IQ score of the mother, whether the mom worked during the first three years of the kid's life, and the age of the mother at the birth of her child.

Using R, we can easily fit a model predicting the kid's score from the other variables that are given in the data set. First, we need to load our data. So if you would like to follow along, you can do so using the code provided here.

Next, we want to fit our model. We're going to start with what we call the full model, meaning it includes all the explanatory variables that are given to us in the data set. So, we're using the linear model function again, lm. On the left side of the formula, we're putting the kid's score, our response variable. And on the right side of the formula, we list our explanatory variables: the high school status of the mother, the IQ score of the mother, whether or not the mom worked early on in the kid's life, and the age of the mother at the birth of her child. To view the regression output, we use the summary function. We're actually going to go through in detail what just about every single value we're seeing on the regression output means throughout the rest of the slides in this video.
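The fitting step described above might look roughly like the following sketch. The data-loading code from the slide is not reproduced in this transcript, so the sketch builds a simulated stand-in data frame instead. The column names `kid_score`, `mom_iq`, and `mom_age`, and every simulated value, are assumptions for illustration only; just `mom_hs` and `mom_work` are named in the video.

```r
# Simulated stand-in data (NOT the NLSY data used in the video).
# Column names other than mom_hs and mom_work are hypothetical.
set.seed(1)
n <- 434
sim <- data.frame(
  mom_hs   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  mom_iq   = rnorm(n, mean = 100, sd = 15),
  mom_work = factor(sample(c("no", "yes"), n, replace = TRUE)),
  mom_age  = sample(17:29, n, replace = TRUE)
)
sim$kid_score <- 20 + 5 * (sim$mom_hs == "yes") + 0.5 * sim$mom_iq +
  rnorm(n, sd = 18)

# Full model: response on the left of ~, all four predictors on the right.
full_model <- lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data = sim)

# Coefficient table, residual standard error, R-squared, and F-statistic.
summary(full_model)
```

Because the data are simulated, the numbers in this output will not match the slides; only the shape of the call and of the output carries over.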

First, we will do inference for the model as a whole. Here, our null hypothesis is that each one of the slopes is equal to 0. In other words, none of the explanatory variables is a significant predictor of the response variable. And the alternative says that at least one of the slopes is different from 0. The test statistic that we use here is an F-statistic, and this output simply comes from the bottom of the regression output that we saw on the previous slide. We have an F-statistic on 4 and 429 degrees of freedom. 4 is the number of predictors, and 429 is simply n minus k minus 1. We had 434 observations, minus 4, the number of predictors, minus 1, which gives us the residual degrees of freedom. And we are also given the p-value. So, we really do not need to do any calculations by hand here. We already have the p-value provided for us. So what we need to focus on instead is the interpretation of what this means.

Since our p-value is less than 0.05, we say that the model as a whole is significant. We reject the null hypothesis, and the alternative hypothesis suggests that there is at least something interesting to look for here. The F test yielding a significant result doesn't mean the model fits the data well. It just means that at least one of the betas is non-zero. On the other hand, the F test not yielding a significant result doesn't mean individual variables included in the model are not good predictors of y. It just means that the combination of these variables doesn't yield a good model.
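As a rough sketch of where the overall p-value comes from, the F-statistic and its two degrees of freedom can be pulled out of the model summary and passed to `pf`. The F value from the video is not reproduced in this transcript, so the example uses simulated stand-in data with hypothetical variable names (`x1` to `x4`, `y`). Only the sample size of 434 and the four predictors mirror the lecture.

```r
# Simulated stand-in data (NOT the data from the video); names are hypothetical.
set.seed(42)
n <- 434
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3 + x4, data = sim)

# summary()$fstatistic holds the F value, the numerator df (k),
# and the denominator df (n - k - 1).
f <- summary(fit)$fstatistic
f_value <- unname(f["value"])
df_num  <- unname(f["numdf"])
df_den  <- unname(f["dendf"])

# The p-value is the upper-tail area of the F distribution beyond f_value.
p_value <- pf(f_value, df_num, df_den, lower.tail = FALSE)
p_value
```

With 434 observations and 4 predictors, `df_num` and `df_den` come out as 4 and 429, the same degrees of freedom quoted in the video.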

Now that we know there is something worthwhile to look for in this model, because we found out that at least one of the betas is different from 0, we can do individual tests on the slopes. For example, the question is whether or not the mother having gone to high school is a significant predictor of the cognitive test scores of children, given all other variables in the model. The null hypothesis here is that the beta associated with the high school status of the mother is equal to 0 when all other variables are included in the model. And the alternative is that it's different from 0 when all other variables are included in the model. The regression output, as usual, gives us everything that we need. All we need to do is look at the row for the mother's high school status and take a look at the p-value for that. Since this is a small p-value, we can determine that whether or not mom went to high school is a significant predictor of the cognitive test scores of children, given all other variables in the model.

Even though we don't need to do any calculations by hand, it's always a good idea to try to understand how the calculations that are included in the regression output are actually getting done by the software that you're using, so that you can understand what they mean. So, let's go through the mechanics of testing for the slope within the framework of a multiple linear regression. As usual with a regression, we use a t-statistic in inference. The t-statistic looks like point estimate minus the null value, divided by the standard error. Our point estimate is simply our slope estimate, and the standard error is the standard error of this estimate that we can grab easily from the regression output. So, the t-statistic for the slope is simply b1 minus 0, divided by the standard error of b1. How this is different from the single predictor regression case that we covered in the previous unit is how we calculate the degrees of freedom. The degrees of freedom here is n minus k minus 1, where k is the number of predictors included in the model.
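Written out in symbols, the test statistic just described is roughly the following, where the notation $SE_{b_1}$ for the standard error of the slope estimate is an assumed shorthand that does not appear in the transcript:

```latex
T = \frac{b_1 - 0}{SE_{b_1}}, \qquad df = n - k - 1
```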

Let's take a moment to focus on this new measure of degrees of freedom and actually highlight that it is not a new measure at all. We just said that for a multiple linear regression, the degrees of freedom is n minus k minus 1, where n is the sample size and k is the number of predictors. And earlier, in the previous unit, we had said that for a regression with a single predictor, the degrees of freedom can be calculated as n minus 2. If you think about it, in a single predictor regression, the number of predictors is 1. So, if we were to calculate the degrees of freedom as n minus k minus 1 for that case as well, we would simply get n minus 1 minus 1, which comes out to be n minus 2. And remember, the additional minus 1 is because, along with the slope estimates for the predictors, we also estimate an intercept, and that's where we're losing that one additional degree of freedom. So, while we've introduced these two formulas slightly differently, note that they mean exactly the same thing. You start with your sample size, which is the total degrees of freedom you have to play with. Then, you lose one degree of freedom for each predictor you have. And then, you lose one more for the intercept.
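The bookkeeping above can be checked with a tiny helper. This is only an illustrative sketch; the function name `residual_df` is made up for this example and is not part of R.

```r
# Residual degrees of freedom: start with n, lose one per predictor,
# and lose one more for the intercept.
residual_df <- function(n, k) {
  n - k - 1
}

residual_df(434, 4)  # multiple regression from the video: 429
residual_df(434, 1)  # single predictor: 432, the same as n - 2
```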

So, let's go ahead and verify the T-score and the p-value for the slope of the variable mom_hs, that is, the high school status of the mother. We can calculate the T-score as the estimated slope minus the null value 0, divided by the standard error of the estimated slope, 2.315. And that yields a T-score of approximately 2.201, which is what is given to us on the table anyway. To calculate the p-value, we're going to need the degrees of freedom associated with this slope, and that's n minus k minus 1. 434 is our sample size, minus 4 predictors, minus 1, which gives us 429 degrees of freedom. That's a value we can also find on the regression output, right next to the residual standard error at the bottom of the output.

Now that we know the T-score and the degrees of freedom, we can use R to calculate the p-value. And for that, as usual, we would use the pt function. We also want to keep in mind that when we're dealing with slopes and looking at the p-values that are provided to us on the regression output, those p-values are always calculated for hypothesis tests where the alternative hypothesis is two-sided. The p-value here comes out to be spot on with what we saw in the table, 2.82%. And given that this is a small p-value, we would reject the null hypothesis in favor of the alternative, and determine that mom's high school status is indeed a significant predictor of the kid's cognitive score.
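A minimal sketch of the pt calculation follows. The slope estimate itself is not reproduced in this transcript, so the code starts from the reported T-score of 2.201 rather than recomputing it. The variable names are illustrative.

```r
# Reported T-score for mom_hs and the residual degrees of freedom.
t_score <- 2.201
n <- 434
k <- 4
df <- n - k - 1  # 429

# Two-sided p-value: twice the upper-tail area beyond |t|.
p_value <- 2 * pt(abs(t_score), df = df, lower.tail = FALSE)
p_value  # roughly 0.028, in line with the 2.82% quoted in the video
```

Any small difference in the last digit relative to the regression table would come from the T-score being rounded to three decimals here.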

We've said numerous times throughout the course that the construction of a confidence interval follows the same structure, regardless of the estimate for which you're constructing the confidence interval. It is always a point estimate plus or minus a margin of error. And in this case, our point estimate is simply our slope estimate. And we can calculate our margin of error as the critical T-score, t star, times the standard error of the slope.

So, let's go ahead and calculate a 95% confidence interval for the slope of mom_work. This was the variable that said whether or not the mother worked during the first three years of the kid's life. First, let's find our critical value. And before we can get there, we need to know our degrees of freedom. We've already confirmed that that was 429, so we want to find the critical T-score associated with a 95% confidence level and 429 degrees of freedom. This is a really, really high number for degrees of freedom, so we know that the T-score is going to be pretty close to a Z-score of 1.96. But let's go through the steps anyway to get the exact T-score. We can draw our curve, mark the central area of our distribution as 0.95, and remind ourselves that each tail is then going to have 2.5% left. And we can find the critical T-score using the qt function and the associated degrees of freedom. R tells us that the cutoff on the lower bound is negative 1.97. When constructing confidence intervals, we always use positive critical values, so the t star associated with a 95% confidence level and 429 degrees of freedom is 1.97. As expected, it is very close to the Z-score because we have really high degrees of freedom here.
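The critical value lookup described above can be sketched with `qt`. The variable names are illustrative.

```r
conf_level <- 0.95
df <- 429

# Each tail holds (1 - 0.95) / 2 = 0.025 of the area.
tail_area <- (1 - conf_level) / 2

# qt returns the cutoff with tail_area to its left, so it is negative.
lower_cutoff <- qt(tail_area, df = df)

# For a confidence interval we use the positive critical value.
t_star <- abs(lower_cutoff)
round(t_star, 2)  # 1.97

# For comparison, the corresponding Z-score is about 1.96.
qnorm(1 - tail_area)
```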

So to finalize our calculations, we start with our point estimate, roughly 2.54, plus or minus 1.97 for the T-score, times roughly 2.35 for the standard error. That gives us a confidence interval of negative 2.09 to positive 7.17.
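The arithmetic can be reproduced with the rounded values quoted in the video. This is only a sketch with illustrative variable names; plugging in the unrounded critical value from `qt` would shift the bounds slightly in the second decimal.

```r
# Rounded values quoted in the video for mom_work.
b_mom_work  <- 2.54  # slope estimate
se_mom_work <- 2.35  # standard error of the slope
t_star      <- 1.97  # critical T-score for 95% confidence, df = 429

# Point estimate plus or minus the margin of error.
margin <- t_star * se_mom_work
ci <- c(lower = b_mom_work - margin, upper = b_mom_work + margin)
ci  # roughly -2.09 and 7.17
```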

And how do we interpret this interval? This is simply going to be the interpretation of the slope for this variable, except now, we're also adding a statement to the beginning of that about how confident we are of that estimate. So, we are 95% confident that, all else being equal, the model predicts that children whose moms worked during the first three years of their lives scored 2.09 points lower to 7.17 points higher than those whose moms did not work.