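This course introduces simple and multiple linear regression models. These models allow you to assess the relationship between variables in a data set and a continuous response variable. For example: is there a relationship between the physical attractiveness of a professor and their student evaluation scores? Can we predict the test score for a child based on certain characteristics of his or her mother? In this course, you will learn the fundamental theory behind linear regression and, by analyzing data examples with the free statistical software R and RStudio, learn how to fit, examine, and use regression models to examine relationships between multiple variables.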

A course from Duke University

Linear Regression and Modeling

From this lesson

Multiple Regression

In this week, we’ll explore multiple regression, which allows us to model numerical response variables using multiple predictors (numerical and categorical). We will also cover inference for multiple linear regression, model selection, and model diagnostics. Hope you enjoy!

- Mine Çetinkaya-Rundel, Associate Professor of the Practice

Department of Statistical Science

Next, we introduce a new measure: adjusted R squared.

We're going to talk about how to calculate this value, as well as what it means and

how to use it.

We're going to use data from the US states on poverty as an example in this video.

Remember, we had data from 50 states plus the District of Columbia.

The variables are percentage living in poverty in each state,

percentage of residents living in a metropolitan area, percentage white,

percentage of high school graduates, and percentage of female householders.

Here, we can see a bunch of scatter plots and a bunch of numbers.

So let's first pause for a moment and check to see what is going on here.

In the first plot for example, our y axis is percentage living in poverty and

x axis is percentage of metropolitan residents.

The correlation between these two variables is -0.20.

Another plot we can take a look at here is in the intersection of metropolitan

residents and female householders.

So here the y axis is the percentage of residents living in a metropolitan area,

and the x axis is the percentage of female householders.

And the correlation coefficient between these two variables is on

the opposite side of the matrix, 0.30.

So, we can see scatter plots between

each one of the variables that are involved in our dataset

as well as the correlation coefficients on the lower half of the matrix.

The font sizes of the correlation coefficients vary by their magnitude.

So, those pairs that are highly correlated, either negatively or positively,

are noted using larger font sizes.

And those that are not highly correlated are noted using smaller font sizes.

We call such plots pairwise scatter plots, and

they're very useful for an initial exploratory analysis of our data,

especially if all of the variables involved are numerical.

We're going to start with a simple linear regression for

this dataset, where we have only one predictor.

The first step is going to be to load the dataset.

So, if you would like to follow along,

you can do so by loading the dataset at this address.

And next, we fit our model.

We're going to call our model poverty simple linear regression or pov_slr.

You can call it whatever you want.

And remember, we're using the linear model function, lm.

The first argument is our response variable, then a tilde, which stands for

versus, then the explanatory variable, and the data file we're using is the file that we

loaded earlier, the states data file.

Let's take a look at the summary output for this model,

and we can see the estimates for the intercept and the slope, as

well as many other statistics that might be useful in evaluating this model.

On this slide, we can more easily see the scatter plot for the relationship

between percentage living in poverty and percentage female householder.

The correlation coefficient between these variables is 0.53, and

R squared, the percentage of variability in poverty explained by

female householder, is simply the square of that number, or 28%.
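
As a quick check of that arithmetic, here is a minimal Python sketch. The lecture itself works in R; the variable names are illustrative, and only the value 0.53 comes from the lecture.

```python
# Correlation between percentage in poverty and percentage female householder,
# as quoted in the lecture.
r = 0.53

# With a single predictor, R squared is the square of the correlation coefficient.
r_squared = r ** 2

print(f"R squared = {r_squared:.4f}, or about {r_squared:.0%}")
```

Squaring the rounded correlation gives about 0.281, which rounds to the 28% quoted above.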

We also have our regression output cleaned up a little bit and rounded,

and we got rid of some of the values that we don't need for the time being.

And we can simply see the estimates,

the standard error of these estimates, the t scores and the p values.

And we can see with a small p value that female householder

is actually a significant predictor of percentage living in poverty.

We also mentioned that with linear models we can take a look at an ANOVA output,

which allows us to partition the variability in our response variable.

So the total measure of the variability in the response variable is the sum of

squares total, that's 480.25.

Remember, this is very similar to the variance of that

variable, except not scaled by the sample size. And

we can also see how much of this can be attributed to our explanatory variable,

percentage of female householders,

versus how much of it is unexplained by the model, that's the sum of squares error.

Or we can think about it as the variability that's

left over still in the residuals.

Using the definition of R squared, we can also confirm that this is simply

the ratio of the explained variability to total variability.

Remember, explained variability is the sum of squares of the regression, 132.57,

and the total variability is the sum of squares total, that's 480.25.

And we actually get the same value for R squared, as expected, of 28%.
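
The ratio described here can be sketched in a few lines of Python (not the R used in the course). The two sums of squares are the values quoted in the lecture; the variable names and the derived error sum of squares are illustrative.

```python
# Sums of squares quoted from the ANOVA output for the one-predictor model.
ss_regression = 132.57  # variability explained by female householder
ss_total = 480.25       # total variability in percentage living in poverty

# Whatever the model does not explain is left over in the residuals.
ss_error = ss_total - ss_regression

# R squared is the ratio of explained variability to total variability.
r_squared = ss_regression / ss_total

print(f"SS error  = {ss_error:.2f}")
print(f"R squared = {r_squared:.3f}")
```

The ratio is roughly 0.276, which matches the 28% obtained from squaring the correlation once both are rounded to whole percentages.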

Now that we have our baseline model, we can add another variable to it, and

let's start with percentage white.

So what we need to do in R is use the same linear model function

and add white as an additional predictor to our model. And

then we can take a look at the summary output for this model,

as well as the ANOVA output for the model, and for that we use the function anova,

that's wrapped around the regression model that we had specified earlier.

That looks something like this,

it's very similar to what we saw before except with an additional line

in both of our tables for the new variable that we've added as a predictor.

Note that the total variability,

the sum of squares total, has not changed, because this is the inherent variability

that is in our response variable, percentage living in poverty.

So, regardless of how many variables you're using

in your model, the total variability should not change.

However, what has changed is how this variability is being partitioned.

In this case, part of it can be attributed to female householder, and

a much smaller part of it can now be attributed to percentage white.

So if we wanted to calculate R squared based on this output, keeping

in mind that R squared is the percentage of variability in the response variable

that is explained by the model,

and that in this case our model is comprised of two explanatory variables,

we could calculate that as 132.57 + 8.21, to get us

the total explained variability in the model, divided by the total

variability in our response variable, which comes out to be roughly 29%.

We can see that, with another variable added, our model now explains about one more

percent of the variability in our response variable.

The R squared used to be 0.28 and now, it's 0.29.
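
The same calculation for the two-predictor model can be sketched as follows, again in Python rather than R. The sums of squares are the values quoted in the lecture, and the variable names are illustrative.

```python
# Sums of squares quoted from the ANOVA output for the two-predictor model.
ss_female_house = 132.57  # attributed to female householder
ss_white = 8.21           # attributed to percentage white
ss_total = 480.25         # unchanged: it depends only on the response variable

# Explained variability is the sum over all predictors in the model.
ss_explained = ss_female_house + ss_white

r_squared_one = ss_female_house / ss_total  # one-predictor model
r_squared_two = ss_explained / ss_total     # two-predictor model
gain = r_squared_two - r_squared_one

print(f"R squared, one predictor:  {r_squared_one:.3f}")
print(f"R squared, two predictors: {r_squared_two:.3f}")
print(f"Gain from adding white:    {gain:.3f}")
```

Before rounding, the gain is about 1.7 percentage points (from roughly 27.6% to 29.3%); the "one more percent" figure comes from comparing the rounded values 28% and 29%.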

The R squared value is going to go up

each time you add a new predictor to your model.

However, we need a more honest measure of whether

the added variable is actually a useful one.

And for that we introduce adjusted R squared.

This measure applies a penalty to R squared for

the number of predictors included in the model, and the magnitude of this penalty

is going to depend on how k, the number of predictors, compares to n, our sample size.

The larger the sample size, the more predictors the model can handle, and

therefore the smaller the penalty is going to be for

additional predictors being added to the model.
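
To make the relationship between n and k concrete, here is a small Python sketch. It assumes the penalty takes the form of the multiplier (n - 1) / (n - k - 1), which is the factor used in the lecture's worked calculation; the function name and the sample sizes other than 51 are illustrative.

```python
def penalty_factor(n, k):
    """Multiplier applied to (SS error / SS total) in adjusted R squared.

    n is the sample size and k is the number of predictors.
    """
    if n - k - 1 <= 0:
        raise ValueError("the sample size must exceed k + 1")
    return (n - 1) / (n - k - 1)


# Compare the multiplier for a small, the lecture's, and a large sample size.
for n in (20, 51, 500):
    factors = [round(penalty_factor(n, k), 3) for k in (1, 2, 5)]
    print(f"n = {n:3d}: penalty factors for k = 1, 2, 5 -> {factors}")
```

A factor of exactly 1 would mean no penalty. For a fixed k the factor moves toward 1 as n grows, and for a fixed n it grows with k, which is the behavior described above.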

While R squared always increases with the addition of each variable to the model,

regardless of whether that variable is useful or not,

adjusted R squared is only going to increase

if the added variable is actually of value.

In other words, if the additional percentage of variability in the response

variable explained by that new variable can offset the penalty for

the additional number of predictors in the model.

First, let's take a look to see how we can calculate adjusted R squared.

Here we have the multiple linear regression model, predicting percentage

living in poverty from percentage of female householders and percentage white.

And remember that our sample size n was 51, that's the 50 states plus DC.

So to calculate adjusted R squared,

I simply find the ratio of the unexplained variability to the total variability,

apply my penalty to that, and then subtract that from 1.

That is 1 minus 339.47 over 480.25

times 51 minus 1 divided by 51 minus 2 minus 1.

51 was our sample size, and k, the number of predictors, is 2:

female householder and white. And this comes out to be 26%.

Remember, our R squared was 29%, however, our adjusted R squared,

with the penalty for the additional predictor, is only 26%.
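
Written out, the spoken calculation corresponds to the following formula. The notation is an assumption for illustration: SS error and SS total stand for the unexplained and total sums of squares, n for the sample size, and k for the number of predictors.

```latex
\[
\begin{aligned}
R^2_{\text{adj}}
  &= 1 - \frac{SS_{\text{error}}}{SS_{\text{total}}} \times \frac{n - 1}{n - k - 1} \\
  &= 1 - \frac{339.47}{480.25} \times \frac{51 - 1}{51 - 2 - 1} \\
  &\approx 0.26
\end{aligned}
\]
```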

So in summary, for the first model, where we simply had

female householder as our only predictor, our R squared was 28%.

However, for the second model, where we have the additional variable white,

our R squared has increased to 29%, while our adjusted R squared,

which applies the penalty for this additional variable, stayed at 26%.

Remember, when any variable is added to the model, the R squared increases.

However, if the added variable doesn't really provide any new information or

is completely unrelated, the adjusted R squared does not increase.
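
The comparison between the two models can be sketched in Python as below. The sums of squares and the sample size are the values quoted in the lecture; the function name, the model labels, and the dictionary layout are illustrative, and the formula is the one from the lecture's worked calculation.

```python
def adjusted_r_squared(ss_error, ss_total, n, k):
    """Adjusted R squared: 1 - (SS error / SS total) * (n - 1) / (n - k - 1)."""
    return 1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)


n = 51             # 50 states plus DC
ss_total = 480.25  # total variability in percentage living in poverty

# Explained sum of squares and number of predictors for each model.
models = {
    "female householder": {"ss_explained": 132.57, "k": 1},
    "female householder + white": {"ss_explained": 132.57 + 8.21, "k": 2},
}

results = {}
for label, model in models.items():
    ss_error = ss_total - model["ss_explained"]
    r_squared = model["ss_explained"] / ss_total
    adjusted = adjusted_r_squared(ss_error, ss_total, n, model["k"])
    results[label] = (r_squared, adjusted)
    print(f"{label}: R squared = {r_squared:.3f}, adjusted = {adjusted:.3f}")
```

With these numbers, R squared moves from about 0.276 to about 0.293, while adjusted R squared moves only from about 0.261 to about 0.264, so both models show 26% once rounded.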

Some properties of adjusted R squared:

first, k, the number of predictors, can never be negative.

Therefore, adjusted R squared is always going to be less than R squared.

Second, adjusted R squared applies a penalty for

the number of predictors included in the model.

And third, we choose models with higher adjusted R squared over others.

The decision criterion is based on adjusted R squared, as opposed to R squared, because

R squared is always going to be higher for models with a higher number of predictors,

but those may not always be the favorable ones.