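This course introduces simple and multiple linear regression models. These models allow you to assess the relationship between variables in a data set and a continuous response variable. Is there a relationship between the physical attractiveness of a professor and their student evaluation scores? Can we predict the test score for a child based on certain characteristics of his or her mother? In this course, you will learn the fundamental theory behind linear regression and, by analyzing data examples with the free statistical software R and RStudio, learn to fit, examine, and use regression models to examine relationships between multiple variables.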


A course from Duke University

Linear Regression and Modeling

781 ratings


From this lesson

Linear Regression

This week we’ll introduce linear regression. Many of you may be familiar with regression from reading the news, where graphs with straight lines are overlaid on scatterplots. Linear models can be used for prediction or to evaluate whether there is a linear relationship between two numerical variables.

- Mine Çetinkaya-Rundel, Associate Professor of the Practice

Department of Statistical Science

In this video,

we're going to talk about the correlation between two numerical variables.

We're going to define what correlation means and

we're going to go through its properties.

Here we have a scatter plot of poverty rate versus high school graduation rate.

These data are from 2012, and in 2012 the poverty line in

the US was defined as having an income below $23,050 for a family of 4.

The response variable here is the percentage living in poverty.

Note that this is the variable on the y axis.

The explanatory variable is the percentage of high school graduates or

the high school graduation rate.

The relationship between these variables is linear, negative and moderately strong.

When we discuss the relationship between two numerical variables,

we always talk about the form (usually, is it linear or nonlinear?),

the direction (is it negative or positive?), and

the strength, going from very weak to extremely strong.

One measure of the strength of the association between two numerical

variables is correlation.

In fact, correlation specifically describes the strength of the linear

association between two variables.

The key word here is linear.

So we only measure the linear association using correlation.

We denote correlation with an R.

Next, we're going to go through the properties of the correlation coefficient.

First, the magnitude, or in other words,

the absolute value of the correlation coefficient,

measures the strength of the linear association

between two numerical variables.

Here we can see three scatter plots.

And the strength of the association is going from pretty strong to very weak.

For the first plot, the correlation coefficient is 0.97.

For the second one it's 0.69, and for the third one it's 0.07.

We can see that the higher the magnitude, the stronger the association

between the two variables.

Two, the sign of the correlation coefficient indicates the direction of

association.

So here we have two scatter plots,

one with a positive association where the correlation coefficient is 0.98, and

one with a negative association where the correlation coefficient is negative 0.96.

Three, the correlation coefficient is always between -1,

which is a perfect negative linear association and positive 1,

which is a perfect positive linear association.

And a correlation coefficient of 0 indicates no linear relationship.

Here we have three scatter plots again.

The first one shows a positive perfect linear association.

The second one shows a negative perfect linear association.

And then for the third one, the correlation coefficient is 0.

As x increases nothing is happening to y.

Therefore, there's no relationship between these two variables.
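To make the three cases concrete, here is a minimal Python sketch. The helper `pearson_r` and all of the data values are made up for illustration and are not from the lecture. The third data set is deliberately not perfectly flat: with a constant y this formula would divide by zero, so the sketch uses values that vary but have no linear trend.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_up = [2 * v + 1 for v in x]      # points exactly on a rising line
y_down = [10 - 3 * v for v in x]   # points exactly on a falling line
y_none = [1, -1, 0, -1, 1]         # made-up values with no linear trend

r_up = pearson_r(x, y_up)          # perfect positive linear association
r_down = pearson_r(x, y_down)      # perfect negative linear association
r_none = pearson_r(x, y_none)      # no linear association
print(r_up, r_down, r_none)
```

The first two values sit at the ends of the allowed range and the third sits in the middle of it.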

Four, the correlation coefficient is unitless and is not affected by

changes in the center or scale of either variable, such as unit conversions.

So here we have two scatter plots. The data come from mammals, and

on the y axis we have total number of hours of sleep and

on the x axis, we have body weight of these mammals.

The first scatter plot shows the body weight in kilograms and

the second scatter plot shows the body weight in pounds.

And remember that one kilogram is roughly 2.2 pounds.

We can see that the shape of the relationship between the two variables

looks very similar. In fact, without the axis labels it would be very difficult

to tell the difference between these two plots.

For both of these plots the correlation coefficient is -0.34.

Now one might ask: is it even appropriate to calculate a correlation

coefficient here?

Because the relationship between these two variables does not appear to be linear,

and that's a very good point.

And we probably wouldn't want to

claim a linear relationship between these two variables.

But to demonstrate the fact that changing the units does

not affect the correlation coefficient,

these plots still serve a purpose.
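The unit-conversion property can be checked with a short Python sketch. The body weights and sleep hours below are invented for illustration (they are not the mammals data from the plots), `pearson_r` is a hypothetical helper, and 2.2 is the rough kilogram-to-pound factor mentioned above.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values, not the lecture's data set.
body_kg = [0.5, 3.0, 10.0, 60.0, 250.0, 500.0]
sleep_hours = [14.0, 12.5, 10.0, 8.0, 4.0, 3.5]

body_lb = [2.2 * w for w in body_kg]          # change of scale (kg to lb)
body_shifted = [w + 100.0 for w in body_kg]   # change of center (add a constant)

r_kg = pearson_r(body_kg, sleep_hours)
r_lb = pearson_r(body_lb, sleep_hours)
r_shifted = pearson_r(body_shifted, sleep_hours)
print(r_kg, r_lb, r_shifted)
```

All three values agree up to floating-point rounding, because rescaling or shifting a variable rescales or leaves unchanged its deviations from the mean, and that cancels out in the ratio.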

Five, the correlation of X with Y is the same as of Y with X.

So what this means is that even if you swap the axes,

your correlation coefficient should stay the same.

Here, in the first plot, we're looking at the total number of hours of sleep

on the Y axis versus life span of these mammals on the X axis.

And for the second plot, we simply swap the variables.

So now, the response variable is life span and

the explanatory variable is the total number of hours of sleep.

For both of these, the correlation coefficient is -0.38.

So changing the variables around does not affect the correlation coefficient.
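A quick Python sketch of this symmetry, using invented sleep and life span values (not the lecture's data) and a hypothetical `pearson_r` helper:

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values for illustration only.
sleep_hours = [14.0, 12.5, 10.0, 8.0, 4.0, 3.5]
life_span_years = [3.0, 7.0, 12.0, 20.0, 40.0, 35.0]

r_xy = pearson_r(life_span_years, sleep_hours)  # life span on x, sleep on y
r_yx = pearson_r(sleep_hours, life_span_years)  # axes swapped
print(r_xy, r_yx)
```

The two calls return the same value because x and y enter the calculation in exactly the same way.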

Six, the correlation coefficient is sensitive to outliers.

Here we have two scatter plots again.

In the first one there is no outlier, and we have extended the x axis all the way to 50

so that we can do a comparison, and

in the second one there is one stray point that's an outlier.

The first plot shows a pretty strong relationship between the two variables and

in fact the correlation coefficient is close to perfect.

It's roughly 0.98. In the second plot, we've simply moved one of the data

points from this original data set further away from the rest of the cloud.

And the correlation coefficient has gone down to 0.68.

So we can see that moving even one data point to become an outlier will affect

the correlation coefficient greatly, because it is so sensitive to outliers.

Let's take a look at this practice question.

Which of the following is the best guess for the correlation between percentage

living in poverty and percentage of high school graduates?

Note that we haven't provided a formula for the correlation coefficient.

There of course is one.

And you will get to use computation to calculate the correlation coefficient, but

there's absolutely no reason in this day and age to try to calculate that by hand.
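For reference only, since the lecture does not show the formula: the standard sample (Pearson) correlation coefficient is commonly written as below. The symbols are assumptions introduced here, not notation from the video: $n$ is the number of observations, $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$ and $s_y$ are the sample standard deviations.

$$
R = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
$$

Each term multiplies the standardized x value by the standardized y value, which is why the result is unitless and why swapping x and y does not change it.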

However, given a bunch of choices, we should be able to pinpoint

which of the following sounds like a reasonable guesstimate for

the correlation between these two variables.

First off, we can get rid of 1.5 or -1.5 right off the bat.

Because we know that the correlation coefficient can only be between

negative one and positive one.

We also see that the relationship between these two variables is negative.

Therefore, any positive correlation coefficient doesn't make sense here.

So next we need to choose between negative 0.75 and negative 0.1.

Note that negative 0.75 is much closer to negative 1,

meaning that it indicates a much stronger relationship.

So the question becomes, do we see a strong relationship here, or

a pretty weak relationship?

Sometimes it helps to look at the negative spaces on our plot.

So we can see for example, that there are some negative spaces on our plot and

if we were to block those off it would be a little easier to see that there is

indeed a somewhat strong relationship between these two variables,

even with all the scatter around the line.

Therefore, the correct answer here is going to be -0.75.

A correlation coefficient of negative 0.1 would look much more like

a random scatter that takes up the entire plot, without leaving any negative

spaces for us to get rid of so that we can better see the linear relationship.

Another practice question.

Which of the following has the strongest correlation?

In other words, the correlation coefficient is closest to positive 1 or

-1.

The first plot shows a very strong relationship, but

the relationship is not linear.

Remember that we can determine the strength of the relationship

looking at how much scatter there is in the data.

Because there's very little scatter here,

we can see that the relationship is strong.

But once again, we wouldn't expect the correlation coefficient to be very

close to positive or negative 1, because the relationship is not linear, and

the correlation coefficient measures the strength of the linear relationship.

The last plot shows a weaker relationship, and

the second to last plot shows an even weaker relationship.

So the strongest linear relationship here is option b.

So this one should be the scatter plot with a correlation coefficient

closest to positive one in this case, since the relationship is positive.
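The point about the first plot, a strong relationship that is not linear, can be illustrated with a Python sketch. The data are synthetic and are not the points from the practice question: one set lies exactly on a symmetric curve with no scatter at all, the other is a line with a little made-up noise. The helper `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]

# Exact curved relationship: y is fully determined by x, but not linearly.
y_curve = [v ** 2 for v in x]

# Linear relationship with small made-up noise.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.3]
y_line = [2 * v + e for v, e in zip(x, noise)]

r_curve = pearson_r(x, y_curve)
r_line = pearson_r(x, y_line)
print(r_curve, r_line)
```

Even though the curved data have no scatter, their coefficient is far from positive or negative 1, while the noisy line is close to 1. That is the sense in which the coefficient measures only the linear part of a relationship.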