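This course introduces simple and multiple linear regression models. These models allow you to assess the relationship between variables in a data set and a continuous response variable. For example: is there a relationship between the physical attractiveness of a professor and their student evaluation scores? Can we predict a child's test score from certain characteristics of the child's mother? In this course, you will learn the fundamental theory behind linear regression and, by analyzing data examples with the free statistical software R and RStudio, learn how to fit and examine regression models and how to use them to examine relationships between multiple variables.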


A course from Duke University

Linear Regression and Modeling

704 ratings


From this lesson

Linear Regression

In this week we’ll introduce linear regression. Many of you may be familiar with regression from reading the news, where graphs with straight lines are overlaid on scatterplots. Linear models can be used for prediction or to evaluate whether there is a linear relationship between two numerical variables.

- Mine Çetinkaya-Rundel, Associate Professor of the Practice

Department of Statistical Science

In this video we will define the least squares line and also talk

about how to calculate and interpret the slope and the intercept of the line.

We can't simply add up all of the residuals

because some of them are going to be negative and

some of them are going to be positive, depending on

whether the model is over or underestimating certain data points.

So we need to come up with a more clever approach.

One option is to minimize the sum of magnitudes,

or in other words the absolute values, of the residuals.

Another option is to minimize the sum of

squared residuals, and this is what we call

least squares, and this is the option that we're going to be sticking with.

So, why least squares?

This is, indeed, the most commonly used approach.

And it's also easier to compute by hand as well as using software.

But most importantly, in many applications, a residual twice

as large as another is more than twice as bad.

We used the same idea when we calculated

the standard deviation earlier in the course.
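The contrast between simply adding residuals, summing their magnitudes, and summing their squares can be sketched in a few lines. The residual values here are hypothetical and chosen only for illustration; they do not come from the lecture's data.

```python
# Hypothetical residuals, invented for illustration (not from the course data).
residuals = [3.0, -3.0, 1.5, -1.5]

# Positive and negative residuals cancel, so the plain sum hides the errors.
raw_sum = sum(residuals)

# Two ways to stop the cancellation.
sum_abs = sum(abs(e) for e in residuals)   # sum of magnitudes
sum_sq = sum(e ** 2 for e in residuals)    # sum of squared residuals

# A residual twice as large counts twice as much under absolute values,
# but four times as much under squares.
abs_ratio = abs(3.0) / abs(1.5)
sq_ratio = 3.0 ** 2 / 1.5 ** 2

print(raw_sum, sum_abs, sum_sq, abs_ratio, sq_ratio)
```

Squaring is what makes a residual twice as large count as more than twice as bad.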

This is the general form of the least squares line.

We have our explanatory variable x, that gets multiplied by this slope beta

1, and we also have an intercept where the line intersects the y axis.

And finally we have y hat which stands

for the predicted value of the response variable.
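Written out, the general form described here would look as follows. The symbols follow the spoken notation: beta 0 for the intercept, beta 1 for the slope.

```latex
\hat{y} = \beta_0 + \beta_1 x
```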

Before we talk more about how we actually come up with

this line, let's first focus a little bit on the notation.

Once again we have data from a sample and we are

going to be using that sample to estimate unknown population parameters.

The unknown population parameter for the intercept is beta 0, and

the point estimate counterpart of it, the observed value, is b0.

So once again, we're using the Greek alphabet, Latin

alphabet approach for denoting our parameters and point estimates.

Similarly, for the slope, the parameter is beta 1, and the point estimate is b1.

So how do we estimate the regression parameters?

Let's start with the slope.

Remember earlier, we said that this is a least squares line.

In other words, we're minimizing the sum of squared residuals.

To minimize sum of squared residuals, we could actually use a little

bit of calculus and calculate the slope and the intercept using that approach.

However, since this is not a calculus-based

course, we'll actually introduce some shortcut formulas.

So we can calculate the slope b1 as

the standard deviation of y divided by the standard

deviation of x, which you may have heard of

as rise over run, times R, the correlation coefficient.
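In symbols, with s_y and s_x standing for the sample standard deviations of the response and explanatory variables (subscript notation assumed here), the shortcut formula reads:

```latex
b_1 = \frac{s_y}{s_x} \, R
```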

Let's illustrate this with an example. The standard

deviation of percentage living in poverty is 3.1%,

and the standard deviation of percentage of high

school graduates is 3.73% in our data set.

Given that the correlation between these variables is -0.75,

what is the slope of the regression line for predicting

percentage living in poverty from percentage of high school graduates?

First, let's parse through the information that's given to us in the problem.

We're told that the standard deviation of percentage living in poverty is 3.1%.

So we can say that s_y is 3.1%.

Because remember, poverty is our response variable.

And we are also told that the standard deviation of percentage

of high school graduates is 3.73%, so s_x is 3.73%.

We are also given the correlation coefficient as negative

0.75. So, putting all of these into the formula for the slope,

b1 equals s_y over s_x times R,

we simply need to plug in the numbers: 3.1 divided by 3.73 times

negative 0.75 gives us a slope of negative 0.62.
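The same arithmetic can be checked in a few lines of Python. The variable names are illustrative; the three input values are the ones given in the example.

```python
# Values given in the example.
s_y = 3.1    # standard deviation of % living in poverty (response)
s_x = 3.73   # standard deviation of % high school graduates (explanatory)
r = -0.75    # correlation coefficient

# Shortcut formula for the slope: b1 = (s_y / s_x) * R
b1 = (s_y / s_x) * r

print(round(b1, 2))  # -0.62
```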

Note that the sign of the slope is always going

to be equal to the sign of your correlation coefficient.

Conceptually speaking, this is true, because

we're clearly seeing a negative relationship

between the two variables so it makes sense that the slope is negative.

Mathematically speaking, remember that the standard deviation

is the square root of the variance.

And it's a measure of variability.

So the standard deviation of y and x are

always going to necessarily have to be positive numbers.

In the first part of the equation, we're dividing two

positive numbers, which is always going to yield a positive response.

And then we're multiplying by a value that could be negative or

positive, depending on the direction

of the relationship between the two variables.

So mathematically speaking as well, the sign of the slope is always

going to be the same as the sign of the correlation coefficient.
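The sign argument can be sketched as a small check. The helper names `slope_from_summary` and `sign` are hypothetical, and the correlation values tried below (other than -0.75) are arbitrary illustrations.

```python
def slope_from_summary(s_y: float, s_x: float, r: float) -> float:
    """Shortcut slope b1 = (s_y / s_x) * r for positive standard deviations."""
    if s_y <= 0 or s_x <= 0:
        raise ValueError("standard deviations must be positive")
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie between -1 and 1")
    return (s_y / s_x) * r


def sign(value: float) -> int:
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


# The ratio s_y / s_x is positive, so only r can decide the sign of the slope.
signs_match = all(
    sign(slope_from_summary(3.1, 3.73, r)) == sign(r)
    for r in (-0.75, -0.2, 0.0, 0.4, 0.9)
)
print(signs_match)  # True
```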

Calculating the slope by hand is clearly

very simple and actually sometimes unnecessary as well

because often times, we don't calculate these

values by hand, but we simply use computation.

What is really important is how do we interpret this number, negative 0.62.

For each percentage point increase in high school graduation rate,

we would expect the percentage living in poverty

to be lower on average by 0.62 percentage points.

There are a few things to remember here. First, the interpretation of the slope

is about the relationship between your explanatory and your response variable.

In other words, how do we expect the response variable

to change as we increase the explanatory variable by 1 unit.

When we're interpreting these, we also want to make sure that if we are dealing

with an observational study, such as the

one we have here, we avoid causal language.

So this is why we're saying that we expect this to

happen on average, as opposed to interpreting this value as something

like: if you increase the high school graduation rate by 1 percentage

point, you would be able to decrease poverty by 0.62 percentage points.

Next, we want to estimate the intercept and remember that

the intercept is where the regression line crosses the y axis.

For this, we're going to make use of the property that

the least squares line always goes through x bar, y bar.

In other words, it's always going to go through the mean of y and x.

We know that we can write the linear model as

y hat equals b0, the intercept, plus b1 times x.

And all we need to do now is to plug in x bar and y bar in our equation because we

know that the line has to go through this point and

rearrange things a bit to get the formula for the intercept.

So by rearranging a bit, we can see that the

intercept can be calculated as the average value of the

response variable minus the slope, which we already calculated in

the first step, times the average value of the explanatory variable.
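The rearrangement described here takes one step. Substituting the point (x bar, y bar) into the linear model and solving for the intercept gives:

```latex
\bar{y} = b_0 + b_1 \bar{x}
\quad\Longrightarrow\quad
b_0 = \bar{y} - b_1 \bar{x}
```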

Let's see how we can do that.

Given that the average percentage living in poverty is 11.35%,

and the average percentage of high school graduates is 86.01%,

what is the intercept of the regression line for predicting

percentage living in poverty, from percentage of high school graduates?

One value that's given to us is the

average value of the response variable, that's 11.35%.

Another value that's given to us is

the average value of the explanatory variable, 86.01%.

And we also know that we can calculate the intercept as

simply the average value of the response variable, minus the slope

times the average value of the explanatory variable and we calculated

the slope in the previous step as negative 0.62.

So we have all the building blocks we just need to plug

them into the equation and the intercept comes out to be 64.68%.
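Plugging the givens into the intercept formula can be sketched as follows. The variable names are illustrative, and the slope is the rounded value of negative 0.62 from the previous step.

```python
# Values given in the example.
y_bar = 11.35   # average % living in poverty (response)
x_bar = 86.01   # average % high school graduates (explanatory)
b1 = -0.62      # slope, rounded, from the previous step

# The least squares line passes through (x_bar, y_bar), so b0 = y_bar - b1 * x_bar.
b0 = y_bar - b1 * x_bar

print(round(b0, 2))  # 64.68
```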

Once again, the calculation is very simple once you have these givens.

And most of the time you're not going to need to do this calculation by hand.

What is important is to understand 2 important points.

1, that the regression line always goes through the center of the data.

And 2, how do we interpret the intercept?

Remember, the intercept is where we said the regression line crosses the Y axis.

In other words, it's the expected value of the

response variable when the explanatory variable is equal to zero.

So, in context, what we can say is that, states with no high school graduates

are expected on average to have 64.68%

of their residents living below the poverty line.

Does this seem realistic, that there would be states

in the U.S. with absolutely no high school graduates?

Looking at the data we have that actually seems very unlikely.

We can see that all the states in the US actually have high school graduation rate

varying somewhere between 75% and about 95%, maybe.

So mathematically speaking, this is a construct that

is important for putting together our linear model.

However, in context, it is not a very useful number.
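One way to make this caution concrete is to flag any input that falls outside the range of the observed data. This is a minimal sketch: the bounds of 75 and 95 are the approximate values read off the plot in the lecture, and the function name `is_extrapolation` is hypothetical.

```python
# Approximate range of observed graduation rates, as described in the lecture.
OBSERVED_MIN = 75.0
OBSERVED_MAX = 95.0


def is_extrapolation(hs_grad_pct: float,
                     lo: float = OBSERVED_MIN,
                     hi: float = OBSERVED_MAX) -> bool:
    """Return True when the input lies outside the observed range of x."""
    return not (lo <= hs_grad_pct <= hi)


print(is_extrapolation(0.0))    # True: the intercept is far outside the data
print(is_extrapolation(86.01))  # False: the mean graduation rate is inside it
```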

So putting the information together from the previous two steps,

we can write our linear model as a predicted percentage

living in poverty, is equal to 64.68, the intercept, minus

0.62, the slope, times the percentage that are high school graduates.
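The fitted model can be written as a small function. The name `predict_poverty` is hypothetical; the coefficients are the rounded hand-calculated values from this example.

```python
B0 = 64.68   # intercept from the hand calculation
B1 = -0.62   # slope from the hand calculation


def predict_poverty(hs_grad_pct: float) -> float:
    """Predicted % living in poverty for a given % of high school graduates."""
    return B0 + B1 * hs_grad_pct


# At the mean graduation rate the prediction recovers the mean poverty rate.
print(round(predict_poverty(86.01), 2))  # 11.35

# A one percentage point increase in x changes the prediction by the slope.
print(round(predict_poverty(81.0) - predict_poverty(80.0), 2))  # -0.62
```

Evaluating the function at the mean of x returns approximately the mean of y, consistent with the line passing through the center of the data.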

If instead of calculating these values by hand, we had

actually used computation, the regression output would look something like this.

Depending on the software you're using, you might get slightly different

formatting, but usually this is the

general structure of the regression output.

We can see our intercept.

And we can see that it's slightly off

from what we calculated and that's probably simply due

to rounding and we can also see our slope

as the estimate that's associated with the explanatory variable.

We are going to talk about what the standard error, the

t score and the p value mean later on in the course.

So for now let's just focus on the estimates column, where we can find

what we call our parameter or coefficient estimates for the slope and the intercept.
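To show where the numbers in an estimates column come from, here is a minimal sketch that computes the intercept and slope from raw data using only the Python standard library. The data points are hypothetical toy values invented for illustration, not the course's data set, and this is not the output of any particular statistics package.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical toy data, invented for illustration only.
x = [78.0, 82.0, 85.0, 88.0, 91.0]   # e.g. % high school graduates
y = [16.0, 13.5, 12.0, 10.0, 8.5]    # e.g. % living in poverty

x_bar, y_bar = mean(x), mean(y)

# Sums of squares and cross products around the means.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Slope that minimizes the sum of squared residuals.
b1_direct = sxy / sxx

# Shortcut from the lecture: b1 = (s_y / s_x) * R.
r = sxy / sqrt(sxx * syy)
b1_shortcut = (stdev(y) / stdev(x)) * r

# Intercept from the property that the line passes through (x_bar, y_bar).
b0 = y_bar - b1_direct * x_bar

print("intercept:", b0)
print("slope (direct):", b1_direct)
print("slope (shortcut):", b1_shortcut)
```

The two slope calculations agree up to floating point error, which is why the shortcut formula can stand in for the calculus.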

So to recap, we interpret the intercept as when

x equals 0, y is expected to equal the intercept.

As we discussed, this may be a meaningless value in context of the data, and

in those cases it might only be serving to adjust the height of the line.

The interpretation of the slope is slightly different,

since it's about the relationship between the two variables.

For each unit increase in x,

y is expected to be higher or lower on average by the value of the slope.

So this basically tells us as we increase x by

one unit, what do we expect to happen to y?

And once again, depending on the type of study you have, you want to

be careful about interpreting the slope in a causal versus a correlational way.