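This course introduces simple and multiple linear regression models. These models allow you to assess the relationship between variables in a data set and a continuous response variable. For example: is there a relationship between a professor's physical attractiveness and their student evaluation scores? Can we predict a child's test score from certain characteristics of the child's mother? In this course, you will learn the fundamental theory behind linear regression and, by working through data examples with the free statistical software R and RStudio, learn how to fit and examine regression models and how to use them to examine relationships between multiple variables.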

A course from Duke University

Linear Regression and Modeling

639 ratings

From this lesson

Linear Regression

In this week we’ll introduce linear regression. Many of you may be familiar with regression from reading the news, where graphs with straight lines are overlaid on scatterplots. Linear models can be used for prediction or to evaluate whether there is a linear relationship between two numerical variables.

- Mine Çetinkaya-Rundel, Associate Professor of the Practice

Department of Statistical Science

Now that we've learned how to fit a least squares line, a natural question is why we would want to do it. One use of the least squares line is that it allows us to evaluate the relationship between two numerical variables. Another important use is prediction. In this video, we're going to talk about how we do prediction, what extrapolation means, and why we don't like it.

Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction. It's as simple as plugging the value of x into the linear model and seeing what the resulting y, the response variable, is.

According to the following linear model, the model that we came up with previously, what is the predicted percentage living in poverty in states where the high school graduation rate is 82%? All we need to do to answer this question is plug in 82 for the explanatory variable and solve for the predicted response variable. So we can write out our model and simply plug in 82.

Note that we're plugging in 82 here, not 0.82, even though we're saying 82%. You can check this by looking at the data: the high school graduation rate is a value between 0 and 100, as opposed to between 0 and 1, as shown by the values on the x-axis of the plot. Doing a little bit of math, the result comes out to be 13.84%. So the way we would interpret this number is that the model predicts that in states where the high school graduation rate is 82%, the percentage living in poverty is, on average, 13.84%.
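To make the arithmetic concrete, here is a minimal Python sketch of that plug-in step. The coefficients are an assumption: this transcript never states them, so the intercept of 64.68 and slope of -0.62 below are taken to be the model from the earlier video, chosen because they reproduce the 13.84% quoted above. The function name `predict_poverty` is made up for illustration.

```python
# Assumed coefficients (not stated in this transcript); they reproduce
# the 13.84% result quoted in the video.
INTERCEPT = 64.68   # predicted % in poverty when the graduation rate is 0
SLOPE = -0.62       # change in predicted % in poverty per 1-point rise in graduation rate


def predict_poverty(hs_grad_pct):
    """Return the predicted % living in poverty.

    hs_grad_pct is the high school graduation rate on the 0-100 scale.
    """
    return INTERCEPT + SLOPE * hs_grad_pct


prediction = predict_poverty(82)    # plug in 82, matching the scale of the data
wrong_scale = predict_poverty(0.82) # plugging in 0.82 silently gives a very different answer

print(f"Graduation rate 82   -> {prediction:.2f}% predicted poverty")
print(f"Graduation rate 0.82 -> {wrong_scale:.2f}% predicted poverty")
```

The second call is there only to show why the scale matters: the formula happily accepts 0.82, but the number it returns describes a state with a graduation rate below 1%, not 82%.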

Prediction is a useful and powerful tool, but we want to be careful. Applying a model estimate to values outside the realm of the original data is called extrapolation. In other words, this is plugging into the linear model a value of x that was not in the range of the originally observed data. In fact, sometimes the intercept might be an extrapolation as well.

Here, we've stretched out the scatterplot of the data we've been working with to illustrate this. Plugging in 0 for x in the linear model will indeed give us the intercept. But does that seem like a wise thing to do? We have no idea whether this line actually continues as a straight line beyond the data, or whether it curves down, curves up, or curves down even more. Any of these is possible. And since we don't have data from states with such low high school graduation rates, it's really not wise to treat the intercept as a plausible poverty rate for a state whose high school graduation rate is 0.

Extrapolation doesn't have to happen in the extremes, though. Take a look at this problem. According to the following linear model, what is the predicted percentage living in poverty in states where the high school graduation rate is 20%? Mathematically speaking, this is a very simple problem. All we have to do is plug in 20 for the high school graduation rate, and the formula will spit out a predicted poverty value for us. But is it wise to do?

Before we do, we always want to look back at our data, whether in a scatterplot or at least in the summary statistics: something that will tell us whether 20% is within the realm of the data that we observed. In this case it is not, so we do not want to make this prediction, because it would yield an unreliable estimate.
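The "look back at the data first" habit can be sketched as a small guard around the prediction. Everything here is illustrative: the coefficients are again assumed from the earlier video (they are not stated in this transcript), the function name is made up, and the bounds 75 and 95 are placeholders, not the real range of the data. In practice we would read the minimum and maximum off our own scatterplot or summary statistics.

```python
# Assumed coefficients (not stated in this transcript).
INTERCEPT = 64.68
SLOPE = -0.62


def predict_within_range(hs_grad_pct, x_min, x_max):
    """Predict % in poverty, or return None when the input would be an extrapolation.

    x_min and x_max are the smallest and largest graduation rates
    actually observed in the data used to fit the line.
    """
    if not (x_min <= hs_grad_pct <= x_max):
        return None  # outside the observed data: the line is unverified here
    return INTERCEPT + SLOPE * hs_grad_pct


# Placeholder bounds for illustration only, NOT the real data range.
X_MIN, X_MAX = 75.0, 95.0

for rate in (82, 20, 0):
    result = predict_within_range(rate, X_MIN, X_MAX)
    if result is None:
        print(f"{rate}%: outside the observed range, no prediction made")
    else:
        print(f"{rate}%: predicted poverty {result:.2f}%")
```

Returning `None` rather than a number is a deliberate choice in this sketch: the formula can always produce a value, so the refusal has to come from us.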