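This course introduces simple and multiple linear regression models. These models allow you to assess the relationship between variables in a data set and a continuous response variable. Is there a relationship between the physical attractiveness of a professor and their student evaluation scores? Can we predict the test score for a child based on certain characteristics of his or her mother? In this course, you will learn the fundamental theory behind linear regression and, by analyzing data examples with the free statistical software R and RStudio, learn to fit, examine, and use regression models to examine relationships between multiple variables.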

A course from Duke University

Linear Regression and Modeling

From this lesson

Linear Regression

In this week we’ll introduce linear regression. Many of you may be familiar with regression from reading the news, where graphs with straight lines are overlaid on scatterplots. Linear models can be used for prediction or to evaluate whether there is a linear relationship between two numerical variables.

- Mine Çetinkaya-Rundel, Associate Professor of the Practice

Department of Statistical Science

As with any technique we encounter, there

are conditions associated with linear regression as well.

These are linearity, nearly normal residuals and constant variability.

In this video we're going to go through

these conditions one by one, and look at diagnostic

tools that we can use to assess whether the

conditions have been met or have not been met.

First, linearity, which says that the relationship between

the explanatory and the response variable should be linear.

Which makes sense because we're using a linear model

to predict the response variable from the explanatory variable.

There are indeed methods for fitting a model to non-linear relationships.

However, those are beyond the scope of this course.

So for this course we're going to be sticking with linear models only.

In order to check if the linearity condition has been met,

we can use a scatter plot of the data or a residuals plot.

Here we have a set of three plots, three relationships that are being displayed.

And on top we have the scatter plots of y versus x.

And at the bottom we have the residuals plots.

We're going to talk about what a residuals plot is in the next slide

and go through how to decipher that.

So for now let's stick to the scatter plots and let's look to see, based

on the scatter plot, which of these display

a linear relationship versus which ones do not.

The first plot seems to display a pretty linear relationship.

In the second plot, we can see a bit of a bend.

And the third plot is hard to tell,

because there's a lot of scatter in the data.

However, even with a lot of scatter, there isn't an obvious non-linear pattern.

So, what is our residuals plot?

Since previously we looked at the residuals of Rhode

Island and District of Columbia, for their relationship between poverty

and high school graduation rate in the U.S., we'll stick

with those to exemplify how we build a residuals plot.

In Rhode Island, the observed high school graduation rate

is 81% and the observed rate of poverty is 10.3%.

Using our linear model, we can calculate what

the predicted rate of poverty would be for Rhode Island.

So, for that we simply plug in 81% into our linear model.

And we see that the model predicts 14.46% for the poverty rate in Rhode Island.

The difference between the observed and the predicted rates is the residual,

and that comes out to be negative 4.16%.

This is basically the value that's shown in the residuals plot.

So the x-axis of the residuals plot here is once again high school graduation rate.

So, again, we have 81% for Rhode Island.

And on the y-axis we have the residuals.

And since Rhode Island has a negative residual,

the point associated with Rhode Island appears below the

zero line in the residuals plot and is

4.16% away from the zero line.

For DC, the observed high school graduation rate is

86% and the observed rate of poverty is 16.8%.

Using the same linear model and plugging

in 86%, we can actually calculate the predicted

poverty rate for DC, and we see the model predicts a poverty rate of 11.36%.

In this case, the residual can once again be calculated as the observed value minus

the predicted value, and that comes out to be positive 5.44%.

And that's the same value that we're seeing on the

residuals plot, where on the x-axis we have 86% for DC.

And on the y-axis we have the associated residual.
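
The two residual calculations above can be reproduced in a few lines of Python. This is only a sketch: the fitted line is not stated in this video, so the intercept (64.68) and slope (-0.62) below are assumptions, chosen because they give back the predicted values quoted here (14.46% and 11.36%). The function names are made up for illustration.

```python
# Assumed fitted line (not stated in this video):
#   predicted poverty % = 64.68 - 0.62 * (high school graduation %)
INTERCEPT = 64.68
SLOPE = -0.62


def predict_poverty(hs_grad_pct: float) -> float:
    """Predicted poverty rate (%) for a given high school graduation rate (%)."""
    return INTERCEPT + SLOPE * hs_grad_pct


def residual(observed: float, predicted: float) -> float:
    """Residual = observed value minus predicted value."""
    return observed - predicted


# Observed values quoted in the lecture: (graduation rate %, poverty rate %)
observations = {
    "Rhode Island": (81.0, 10.3),
    "District of Columbia": (86.0, 16.8),
}

residuals = {}
for name, (hs_grad, poverty) in observations.items():
    predicted = predict_poverty(hs_grad)
    residuals[name] = residual(poverty, predicted)
    print(f"{name}: predicted {predicted:.2f}%, residual {residuals[name]:+.2f}%")
```

A negative residual means the observed value sits below the regression line, and a positive residual means it sits above.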

The ideal residual would be zero, because that would mean

that the data point falls exactly on the regression line.

And that there is no difference between the

predicted and observed values for that particular data point.

With random checks, this is going to be unlikely to happen, but we like small

residuals and we want our residuals in the

residuals plot to be randomly scattered around zero.

There's going to be some that are positive

and some that are negative, because that corresponds

to some points falling above the regression line,

and other points falling below the regression line.

And we want them to have absolutely no pattern,

because what we want is for the linear model

to capture all of the pattern in the data,

and anything that's left over to be simply random scatter.

So just like we look for a straight line

in the scatter plot to check for the linearity condition,

in the residuals plot, we look for a random scatter around zero.
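
To make "random scatter versus pattern" concrete, here is a minimal sketch that fits a least squares line to two made-up data sets and looks at the signs of the residuals in x order. Counting sign runs is only a crude stand-in for eyeballing the residuals plot, not a method from the lecture, and the data and function names are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals_of(xs, ys):
    """Observed minus predicted for every point, using the fitted line."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def sign_runs(values):
    """Count runs of consecutive same-sign values (values taken in x order)."""
    signs = [v > 0 for v in values]
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)


xs = list(range(1, 11))
# Made-up data: a straight line with small alternating noise.
linear_ys = [2 * x + (0.5 if x % 2 else -0.5) for x in xs]
# Made-up data: a curved (quadratic) relationship.
curved_ys = [x ** 2 for x in xs]

linear_res = residuals_of(xs, linear_ys)
curved_res = residuals_of(xs, curved_ys)

print("linear data, sign runs:", sign_runs(linear_res))
print("curved data, sign runs:", sign_runs(curved_res))
```

For the curved data the residuals are positive at both ends and negative in the middle, which is the kind of leftover pattern a residuals plot reveals. For the linear data the signs keep switching, with no such structure.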

The next condition is nearly normal residuals, which says

that residuals should be nearly normally distributed, centered at zero.

This condition may not be satisfied if there are unusual observations

that don't follow the trend of the rest of the data.

And we can check this condition using a

histogram or a normal probability plot of residuals.

The histogram shows a somewhat symmetric distribution.

It is indeed centered at zero, and

the normal probability plot shows that there are

some values on the higher end of the tail that actually steer away from normality.

But that's only just a few observations.
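
As a rough illustration of what a normal probability plot puts on its axes, the sketch below pairs each sorted residual with a standard normal quantile. The residuals are made up, and the plotting position (i - 0.5) / n is one common convention chosen here as an assumption; statistical software may use a slightly different one.

```python
from statistics import NormalDist


def normal_probability_points(residuals):
    """Pair each sorted residual with the standard normal quantile
    for the plotting position (i - 0.5) / n."""
    n = len(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    ordered = sorted(residuals)
    return [
        (std_normal.inv_cdf((i - 0.5) / n), r)
        for i, r in enumerate(ordered, start=1)
    ]


# Made-up residuals: roughly symmetric around zero, plus one large positive value.
example_residuals = [-2.1, -1.4, -0.9, -0.5, -0.2, 0.1, 0.4, 0.8, 1.3, 6.0]

points = normal_probability_points(example_residuals)
for theoretical, observed in points:
    print(f"theoretical {theoretical:6.2f}   observed {observed:6.2f}")
```

If the residuals were nearly normal, the printed pairs would fall close to a straight line when plotted. Here the largest residual is far bigger than its theoretical quantile would suggest, which is what a few points steering away on the high end look like.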

The last condition is constant variability, which says that variability

of points around the least squares line should be roughly constant.

This implies that the variability of residuals around

the zero line should be roughly constant as well.

This condition is also called homoscedasticity.

And we can check this using a residuals plot.

On the scatter plot, we can see that as the

x value varies, the variability of the data does not

vary a whole lot; the points actually seem to be captured

within this constant-width grey band around the regression line.

And when we look at the residuals plot, we

can also confirm that the variability of the residuals,

that is how far they are from zero, does

not vary by the value of the explanatory variable.
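
One simple way to put a number on "roughly constant" is to compare the spread of the residuals for low and high values of x. The sketch below is a hypothetical check on made-up residuals, not a procedure from the lecture, which relies on reading the residuals plot.

```python
from statistics import pstdev


def spread_by_half(xs, residuals):
    """Standard deviation of the residuals for the lower and upper half of x."""
    paired = sorted(zip(xs, residuals))
    mid = len(paired) // 2
    low = [r for _, r in paired[:mid]]
    high = [r for _, r in paired[mid:]]
    return pstdev(low), pstdev(high)


xs = list(range(1, 21))
# Made-up residuals with the same size everywhere, alternating in sign.
constant_res = [1.0 if x % 2 else -1.0 for x in xs]
# Made-up residuals whose size grows with x (spread is not constant).
growing_res = [0.2 * x * (1 if x % 2 else -1) for x in xs]

for label, res in [("constant", constant_res), ("growing", growing_res)]:
    low_sd, high_sd = spread_by_half(xs, res)
    print(f"{label}: low-x sd = {low_sd:.2f}, high-x sd = {high_sd:.2f}")
```

Similar spreads in both halves are consistent with constant variability, while a much larger spread at high x suggests the condition is not met.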

Checking regression diagnostics is somewhat of an art.

And it takes lots of practice to be able to tell

when a condition has been met or has not been met.

Let's play around with the following applet to get some of that practice.

Let's start with an example where things actually work well.

Here we have a linear trend between our explanatory and our response variable.

We can see a completely random scatter in the residuals plot.

The histogram of the residuals is centered at zero,

and the shape of the distribution looks fairly symmetric.

And the normal probability plot with almost all of the dots aligned on

the straight line, also indicates that the

distribution of the residuals is nearly normal.

Let's take a look at another example.

Once again, a linear trend, except this time the direction has changed.

So we have a downwards trend between our response and our explanatory variables.

Once again, a completely random scatter in the residuals plot.

A fairly symmetric distribution in the histogram of the residuals.

And the normal probability plot looks pretty good as well.

So what do these look like when the conditions have not been met?

What if we actually have a curved

relationship between our response and our explanatory variable?

In this case we can definitely see that the residuals

plot is no longer displaying a random scatter around zero.

The histogram of the residuals shows a right skew.

And that same right skew is shown on the normal probability plot as well.

So in this case, would it be appropriate to fit a linear model to predict y from x?

Definitely not.

Let's take a look at another example.

Once again, we have a curved relationship.

Not as extreme a curve, and it might actually be somewhat difficult to

tell from the scatter plot if we didn't have the grey band around it.

But, the residuals plot highlights very well for us, that the relationship is not

linear, because the distribution of the residuals

does not show a random scatter around zero.

The histogram of the residuals shows a distribution centered

at zero, but the distribution doesn't exactly look very normal.

And we can see that the normal probability plot also shows that

a lot of the points on the tails actually steer away from normality.

So these were two examples where the linearity condition has not been met.

What if the constant variability condition has not been met?

This is usually when we have what we call fan-shaped data.

We can see that when the value of the explanatory variable

is low, the variability of the response variable is low as well.

However, as x increases the data are fanning out

such that the response variable becomes more and more variable.

This yields what we call a fan-shaped residuals plot where we can clearly see

that as x increases, the variability of the residuals increases as well.

The histogram of the residuals looks

fairly symmetric, and it's centered at zero.

But looking at the normal probability plot, we can see

that we're actually steering quite a bit away from normality.

I hope that you will play around with this applet a little bit more to

get practice working with situations where the conditions

have been met and have not been met.

And the more you see these plots, the easier it's going to get for

you to be able to tell whether a condition has been met or not.