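This course introduces simple and multiple linear regression models. These models allow you to assess the relationship between variables in a data set and a continuous response variable. For example: is there a relationship between the physical attractiveness of a professor and their student evaluation scores? Can we predict a child's test score from certain characteristics of the child's mother? In this course, you will learn the fundamental theory behind linear regression and, by analyzing data examples with the free statistical software R and RStudio, learn how to fit and examine regression models and how to use them to examine relationships between multiple variables.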

A course from Duke University

Linear Regression and Modeling

703 ratings

From this lesson

More about Linear Regression

Welcome to week 2! This week, we will look at outliers, inference in linear regression, and variability partitioning. Please use this week to strengthen your understanding of linear regression. Don't forget to post your questions, concerns, and suggestions in the discussion forum!

- Mine Çetinkaya-Rundel, Associate Professor of the Practice

Department of Statistical Science

We're going to wrap up this unit on introduction to linear

regression with a discussion on variability partitioning.

So far within the framework of regression we've used a t-test as

a way to evaluate the strength of evidence in a hypothesis test for

the slope of the relationship between x and y.

Alternatively, we can also consider the variability in y explained by x,

compared to the unexplained variability.

Remember that percentage of variability in y explained by x was our R-squared.

Remember also that we like large R-squareds so we wonder, could we use

that notion to also do this hypothesis test from another point of view?

This idea of partitioning the variability in y into explained and

unexplained variability should not be new to you. Remember that

we saw this when we first discussed analysis of variance, or ANOVA.

We can actually get an ANOVA-type output for our regression model as well.

So this type of output should be familiar to you from before.

Let's go through it one more time and

take a look at how these numbers relate to what we know about ANOVA from

before, where we had used it to compare means of multiple groups to each other.

One of the important columns here is the sum of squares where we have data

on the total variability.

So total variability in y is basically the sum of squares total, and

this looks very much like the variance of y if we didn't divide by the sample size minus one.

What we mean by unexplained variability in y within

the context of regression is basically the sum of squares of residual.

So imagine that you have the residual for

every single data point in your data set and you square those and add those up.

Then the explained variability simply becomes the balance of these two numbers

because remember that the explained plus the unexplained variability

will get you to the total variability.
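
The partition described above can be sketched in a few lines of Python. This is only an illustration: the `x` and `y` values are hypothetical toy data, not the twin IQ data from the lecture, and the variable names are chosen here for readability.

```python
# Hypothetical toy data (NOT the twin IQ data used in the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for a single predictor.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

# Total variability: squared deviations of y from its mean.
ss_total = sum((yi - y_bar) ** 2 for yi in y)
# Unexplained variability: squared residuals, added up.
ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
# Explained variability: the balance of the two.
ss_reg = ss_total - ss_resid

print(f"SST={ss_total:.2f}  SSE={ss_resid:.2f}  SSR={ss_reg:.2f}")
```

Computing the explained part directly, as the sum of squared deviations of the fitted values from the mean of y, gives the same number as the subtraction.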

Next, let's look at the degrees of freedom column.

The total degrees of freedom is simply your sample size minus one.

This is the twin data that we're still working with.

So we had 27 twins, and 27 minus 1 gives us 26.

Next we consider the degrees of freedom associated with a regression.

And since we only have one predictor here,

this degrees of freedom is simply going to be 1.

Then the residual degrees of freedom is the balance of these two,

26 minus one, 25.
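
The same bookkeeping, written out as a small sketch. The sample size of 27 and the single predictor come from the lecture; the variable names are just illustrative.

```python
n = 27  # sample size from the lecture's twin data
k = 1   # number of predictors in the model

df_total = n - 1              # total degrees of freedom
df_reg = k                    # one degree of freedom per predictor
df_resid = df_total - df_reg  # the balance of the two

print(df_total, df_reg, df_resid)  # 26 1 25
```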

In the next step, we want to have a measure of the average variability that we

call mean square and remember to get to mean square we take the sum of squares and

divide them by the associated degrees of freedom.

So the mean square regression is the sum of squares regression

divided by the degrees of freedom of regression.

And the mean square of residuals is simply the sum of squares residuals

divided by the degrees of freedom associated with the residuals.

Finally we get to the F statistic.

Remember that the F statistic is the ratio of explained to unexplained

variability. In this case, that's going to be the Mean Square of

Regression divided by the Mean Square of Residuals.
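
As a rough sketch, the mean squares and the F statistic follow directly from the sums of squares and degrees of freedom. The helper `anova_f` and the input values below are hypothetical (they are not the numbers from the lecture's ANOVA table).

```python
def anova_f(ss_reg, ss_resid, n, k=1):
    """Return (MS regression, MS residuals, F) for a model with k predictors."""
    df_reg = k
    df_resid = n - 1 - k
    ms_reg = ss_reg / df_reg        # mean square of regression
    ms_resid = ss_resid / df_resid  # mean square of residuals
    return ms_reg, ms_resid, ms_reg / ms_resid

# Hypothetical sums of squares from a toy data set with 5 observations.
ms_reg, ms_resid, f_stat = anova_f(ss_reg=3.6, ss_resid=2.4, n=5)
print(f"MSR={ms_reg:.2f}  MSE={ms_resid:.2f}  F={f_stat:.2f}")
```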

Now that we had a refresher on the ANOVA table,

we can actually move on to doing our hypothesis test.

Remember our goal was to see,

is the explanatory variable a significant predictor of the response variable?

And we had set our hypotheses as the slope equals 0 for the null hypothesis,

and the slope does not equal 0 for the alternative hypothesis.

We have a pretty small p-value, meaning we would reject the null hypothesis, and

in this case rejecting the null hypothesis means that the data provide

convincing evidence that the slope is significantly different from 0.

In other words,

the explanatory variable is a significant predictor of the response variable.
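
With a single predictor, this F test and the t-test for the slope are two views of the same test: the F statistic equals the square of the slope's t statistic. The sketch below checks this numerically on hypothetical toy data (not the twin data); it does not compute a p-value.

```python
import math

# Hypothetical toy data (NOT the twin IQ data used in the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_reg = ss_total - ss_resid

# Mean square of residuals uses n - 2 degrees of freedom with one predictor.
ms_resid = ss_resid / (n - 2)

# t statistic for the slope: estimate divided by its standard error.
se_slope = math.sqrt(ms_resid / sxx)
t_stat = slope / se_slope

# F statistic: mean square of regression (df = 1) over mean square of residuals.
f_stat = (ss_reg / 1) / ms_resid

print(f"t^2={t_stat ** 2:.4f}  F={f_stat:.4f}")
```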

Now that we've been talking about the variability in the response variable and

partitioning this variability into explained and unexplained variability,

let's revisit this notion of R-squared one more time.

Remember that R-squared is the proportion of variability

in y explained by the model.

If this value is large,

we then say that there's likely a linear relationship between x and y.

If the value, on the other hand, is small,

we say that the evidence provided by the data may not be very convincing.

There are actually two ways to calculate R-squared.

We've already seen one of them, using the correlation coefficient:

we simply take the square of the correlation coefficient.

But another one is actually from the definition of R-squared.

We can directly calculate it as a proportion of explained

to total variability.

Now that we've seen the ANOVA table, and we know the measures of total

variability and explained variability, taking the ratio should be a simple task.

So let's go ahead and

quickly check if these two methods actually yield the same result.

If we square the correlation coefficient we get an R-squared of roughly 78%.

To do the calculation using the definition of R-squared,

we need to take the ratio of explained variability to total variability.

Remember that this means sum of squares of regression

divided by sum of squares of total.

And doing the math we actually get to the same results.

78% of the variability in foster twins' IQs can be explained by the model,

or in other words, by the biological twins' IQs.
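
The agreement between the two ways of calculating R-squared can be checked numerically. This sketch uses hypothetical toy data rather than the twin data, so the resulting value is not the lecture's 78%.

```python
import math

# Hypothetical toy data (NOT the twin IQ data used in the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)  # this is also the total sum of squares

# Method 1: square the correlation coefficient.
r = sxy / math.sqrt(sxx * syy)
r_squared_from_r = r ** 2

# Method 2: explained variability divided by total variability.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_reg = syy - ss_resid
r_squared_from_ss = ss_reg / syy

print(f"r^2={r_squared_from_r:.4f}  SSR/SST={r_squared_from_ss:.4f}")
```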