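This course introduces simple and multiple linear regression models. These models let you assess the relationship between variables in a data set and a continuous response variable. For example: is there a relationship between the physical attractiveness of a professor and their student evaluation scores? Can we predict a child's test score from certain characteristics of the child's mother? In this course, you will learn the fundamental theory behind linear regression and, by working through data examples in the free statistical software R and RStudio, learn how to fit and examine regression models and how to use them to examine relationships between multiple variables.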

A course from Duke University

Linear Regression and Modeling

753 ratings

From this lesson

Linear Regression

In this week we’ll introduce linear regression. Many of you may be familiar with regression from reading the news, where graphs with straight lines are overlaid on scatterplots. Linear models can be used for prediction or to evaluate whether there is a linear relationship between two numerical variables.

- Mine Çetinkaya-Rundel, Associate Professor of the Practice

Department of Statistical Science

Previously, we worked on evaluating the relationship between a

numerical and a categorical variable, using statistical inference methods.

Now, we're going to take a modeling approach to this, and we're going to fit

a regression model where the response variable

is numerical, and the explanatory variable is categorical.

We're still working with the poverty variable from the U.S.,

but this time the explanatory variable we're considering is region.

And this variable takes the value of zero, if the state is on the eastern side

of the U.S. and one, if the state is on the western side of the U.S.

So we can write our linear model as

predicted poverty rate equals 11.17, that's our intercept,

plus 0.38, the slope, times region west.
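Written out as an equation, the model just described looks like the following. The symbol names are assumptions chosen for illustration; region:west stands for the 0/1 indicator that is one for western states and zero for eastern states.

```latex
\widehat{\text{poverty}} = 11.17 + 0.38 \times \text{region:west},
\qquad
\text{region:west} =
\begin{cases}
1 & \text{if the state is in the west} \\
0 & \text{if the state is in the east}
\end{cases}
```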

We're not going to talk about how to estimate

these values now; we're just giving those to you.

But we're instead going to talk about how to interpret them.

So, for states that are on the eastern side, what we

want to do is to plug in zero for the x variable.

So that would give us 11.17 plus 0.38 times zero, just 11.17.

So the predicted poverty rate for eastern states is 11.17%.

For western states we plug in the value of one.

So in this case we're calling west a success, east a failure, if you will.

And if we solve for that, the predicted poverty rate then comes out to be 11.55%.
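As a rough sketch of the plug-in calculation just described, the two predictions can be reproduced in a few lines of Python. The function name `predict_poverty` and the labels "east" and "west" are assumptions made for illustration; only the coefficients 11.17 and 0.38 come from the lecture, and the course itself works in R.

```python
# Coefficients quoted in the lecture for the two-level region model.
INTERCEPT = 11.17   # predicted poverty rate (%) for the reference level, east
SLOPE_WEST = 0.38   # added to the prediction when the state is in the west


def predict_poverty(region: str) -> float:
    """Return the predicted poverty rate (%) for 'east' or 'west'."""
    if region not in ("east", "west"):
        raise ValueError(f"unknown region: {region!r}")
    # Indicator variable: 1 for a western state, 0 for an eastern state.
    region_west = 1 if region == "west" else 0
    return round(INTERCEPT + SLOPE_WEST * region_west, 2)


print("east:", predict_poverty("east"))
print("west:", predict_poverty("west"))
```

The east prediction is just the intercept, because the indicator is zero; the west prediction adds the slope once.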

In regression models with categorical explanatory variables, we always code one

of the levels of that categorical variable to be what we call the reference level.

This is the level that we plug in zero for.

So then what do the slope and the intercept mean in this context?

The intercept basically tells us that the model

predicts an 11.17% average poverty percentage, in eastern states.

Remember, we've calculated this by plugging in the value

of zero for the explanatory variable, because the variable is

called region west, and an eastern state is not on

the west, therefore we're plugging in a zero for that.

The reason why we have to do this trick of plugging in a numerical value is

that we couldn't simply plug in a level of

a categorical variable and solve a mathematical equation.

So we're making do by labeling some of the levels successes and

some of the levels failures and denoting these with zeros and ones.

The slope, on the other hand, is once again going to

be about the actual relationship between

the explanatory and the response variables.

It tells us that the model predicts that the average poverty

percentage in western states is 0.38 percentage points higher than in the eastern states.
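To see why the intercept and slope read this way, here is a minimal sketch in Python that fits a least-squares line to a 0/1 indicator. The poverty values are made up for illustration and are not the course data, and the variable names are assumptions. With a single binary explanatory variable, the least-squares intercept equals the mean of the reference group and the slope equals the difference between the two group means.

```python
from statistics import mean

# Hypothetical poverty rates (%), invented purely for illustration.
east = [10.0, 12.0, 11.0]
west = [11.5, 12.5, 13.5, 10.5]

# Response values, and the indicator: 0 for east (reference level), 1 for west.
y = east + west
x = [0.0] * len(east) + [1.0] * len(west)

# Closed-form least-squares estimates for simple linear regression.
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

print("intercept:", round(intercept, 4), "east mean:", mean(east))
print("slope:", round(slope, 4), "west mean - east mean:", mean(west) - mean(east))
```

On these made-up numbers the fitted intercept matches the east mean and the fitted slope matches the gap between the west and east means, which is the same reading given to 11.17 and 0.38 in the lecture.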

Let's take a look at a new categorical variable.

We're going to call this variable region four, because it has four levels.

Northeast, midwest, west, and south.

So we've gone through our data set, and identified each one of the

states to be either in the northeast, the midwest, the west, or the south.

We want to write the linear regression

model based on the regression output below.

Remember that to write the regression model all

we need are the slope and intercept estimates.

So we're only going to be focusing on values

that come from the estimate column in the regression output.

Our response variable is percentage living in poverty.

So we're going to set the predicted

percentage living in poverty equal to the intercept

plus 0.03 times the midwest variable, plus

1.79 times the west variable,

plus 4.16 times the south variable.

Once you have this linear model, depending on what state you are trying to

do a prediction for, we can plug in a bunch of zeros and ones.

For example, if you have a state from the midwest you would plug

in a one for mid-west and a zero for all the other levels.

Then, based on this, what is the reference level of this region four variable?

Is it northeast, midwest, west or south?

Remember, the reference level is what we identify as the zero level.

And from the regression output, you can think about that level

as the level that does not show up in the regression output.

So what if I were to plug in

zero for midwest, zero for west, and zero for south?

Then necessarily, such a state is going to have to be in the northeast.

So here the reference level is the northeast.
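The zero-and-one coding with a reference level can be sketched in Python as follows. The names `LEVELS`, `REFERENCE`, and `indicators` are hypothetical and chosen only for this illustration; the course itself uses R, where this coding is handled for you when the model is fit.

```python
# The four levels of the region variable; the reference level gets no indicator.
LEVELS = ["northeast", "midwest", "west", "south"]
REFERENCE = "northeast"


def indicators(region: str) -> dict[str, int]:
    """Return the 0/1 indicator values for every non-reference level."""
    if region not in LEVELS:
        raise ValueError(f"unknown region: {region!r}")
    return {level: int(region == level) for level in LEVELS if level != REFERENCE}


print(indicators("midwest"))    # a one for midwest, zeros elsewhere
print(indicators("northeast"))  # all zeros: this is the reference level
```

A state at the reference level is the one case where every indicator is zero, which is why that level never appears as its own row in the regression output.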

Given what we know so far, we can

calculate the predicted poverty rate for western states.

We're going to start with our linear model that we developed a couple slides before.

And since we're looking for a western state, we would plug in

a zero for midwest, a one for west, and a zero for south.

And all we have to do now is to solve the linear model.

This is going to be 9.50

plus 0 plus 1.79 plus 0.

The 1.79 is simply the multiplier for the west level times one.

And the other zeros are because we've plugged in zeros for the other levels.

This comes out to be 11.29%.

So we can say that the model predicts that the poverty rate for western

states is 11.29% on average.
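The same plug-in arithmetic can be sketched for all four regions in Python. The coefficients 9.50, 0.03, 1.79, and 4.16 are the ones quoted in the lecture; the function name `predict_poverty_region4` is an assumption for illustration, and the predictions for regions other than the west are not stated in the lecture but follow from the same model.

```python
# Estimates quoted in the lecture for the four-level region model.
INTERCEPT = 9.50  # predicted poverty rate (%) for the reference level, northeast
SLOPES = {"midwest": 0.03, "west": 1.79, "south": 4.16}


def predict_poverty_region4(region: str) -> float:
    """Return the predicted poverty rate (%) for one of the four regions."""
    if region != "northeast" and region not in SLOPES:
        raise ValueError(f"unknown region: {region!r}")
    total = INTERCEPT
    for level, slope in SLOPES.items():
        # Each indicator is 1 only for the state's own region, 0 otherwise.
        total += slope * (1 if region == level else 0)
    return round(total, 2)


for region in ["northeast", "midwest", "west", "south"]:
    print(region, predict_poverty_region4(region))
```

For a western state only the west term survives, so the result is the intercept plus 1.79, matching the 11.29% worked out above.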