0:20
So hopefully, you'll appreciate after this section that creating confidence intervals
for linear regression slopes means essentially creating confidence intervals
for mean differences.
And the approach is business as usual.
We take our estimated slope and add and subtract two or
sometimes a little bit more standard errors.
And if we want to get a p value, the approach is the same as well.
We start by assuming the slope or the mean difference is zero, and then looking at
how far our result is from what we'd expect under that null hypothesis.
Similarly, creating a confidence interval for an intercept is akin to
creating a confidence interval for a single population mean and
follows the logic we used in Statistical Reasoning One.
1:01
So let's take a look at our arm circumference and
height example again to start.
So in the last section we showed the results from several simple linear
regression models, including this one with arm circumference and height.
All we gave was the resulting estimated regression equation, based on these 150 data points, which suggested that
mean arm circumference is related to height as follows:
take 2.7 and add 0.16 times a group of children's
height to estimate the mean arm circumference for that group.
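As a quick worked example of using this equation, consider a group of children 60 centimeters tall (a height we will revisit later in this section):

$$\hat{y} = 2.7 + 0.16 \times 60 = 2.7 + 9.6 = 12.3 \text{ cm}$$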
I got this from a computer package,
but how does the algorithm work to estimate this equation?
2:16
Well, in regression, closeness is defined as the cumulative squared difference
between each point's observed y-value and the corresponding estimated mean,
y-hat, for that point's x-value.
In other words, the squared distance between an observed y-value and
the estimated mean value for all points with the same value of x.
So the distance for each observed point in our data set can be computed
by taking that point's value, for example, that child's value of arm circumference,
and subtracting the predicted mean of arm circumference for
children with the same height.
And so, on a scatter plot, this distance looks like
the vertical distance between each point and the mean for
children with that height value, which is shown on the red regression line.
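In symbols, the quantity being accumulated is the sum of squared vertical distances between each observed y-value and its fitted mean:

$$\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n}\left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2$$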
3:09
So the algorithm to actually estimate that regression line, again,
is called least squares,
because it minimizes the overall squared distance between all the points and the line.
And so what the computer does, given the data, is choose the values for
the intercept and the slope that minimize those cumulative squared distances.
So to find the values of beta-naught hat and
beta-one hat that minimize this cumulative squared distance, we take each point in our data set,
3:42
each child's arm circumference, subtract
the predicted mean from the regression equation,
square that distance, and add it up across all data points in the sample.
The algorithm chooses the values of the intercept and slope that minimize that
cumulative squared distance.
And the algorithm doesn't have to keep trying different combinations of
beta-naught and beta-one until it finds the one that gives the minimum squared distance.
This minimization can actually be done analytically, using a calculus-based
approach that chooses the values of beta-naught hat and
beta-one hat that minimize this function.
The end result of this minimization gives us what are sometimes called closed-form
equations: equations we can use to solve for the optimal values of beta-naught hat and
beta-one hat in terms of the x and y values in our data set.
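For reference, these are the standard closed-form least-squares solutions (written here in textbook notation, not taken from the lecture slides):

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}$$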
But I would never expect anyone to do a regression by hand; in fact,
I've never done a regression by hand, because the computations are arduous and
time-consuming.
However, the equation is very cool and makes for a nice piece of apparel
as evidenced by the fact that I actually have it on my tie.
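To make the mechanics concrete, here is a minimal sketch in Python of those closed-form calculations. The numbers below are made-up illustrative values, not the actual Nepal data from the lecture:

```python
import numpy as np

# Hypothetical heights (cm) and arm circumferences (cm) -- illustrative only
x = np.array([55.0, 58.0, 60.0, 63.0, 66.0, 70.0])
y = np.array([11.2, 11.9, 12.1, 12.8, 13.2, 14.0])

# Closed-form least-squares estimates of the slope and intercept
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Fitted means and the minimized cumulative squared distance
y_hat = beta0_hat + beta1_hat * x
ss_residual = np.sum((y - y_hat) ** 2)

print(f"intercept = {beta0_hat:.2f}, slope = {beta1_hat:.2f}, SS = {ss_residual:.3f}")
```

Any other choice of intercept and slope would produce a larger value of that sum of squares for the same data.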
The end results, however, are estimates based on the data we have at hand, and
these are just estimates based on our single sample from the population.
So if we were to have different random samples from the same population,
for example, different random samples of 150 Nepalese children
from the same population of Nepalese children less than 12 months old,
we might get different estimates of beta-naught and
beta-one depending on the sample we used.
In other words,
the values that minimize the cumulative squared distance for
different samples of the same size would likely differ across the samples.
So there's some sampling variability in these estimates.
That's why all regression coefficients, the intercept and
slope, have an associated standard error
that can help us make statements about the true relationship between the mean of
y and our x predictor based on a single sample.
So there is a true regression equation in the population that has a true slope and
a true intercept.
We can only estimate these quantities.
So just like we've done with everything else that we estimate,
we're ultimately going to have to deal with the uncertainty in these estimates.
6:07
So let's go back and look at the estimated regression equation relating arm
circumference to height based on this one sample of 150 Nepalese children.
And again, here's our equation.
But actually the computer will give us the resulting estimated standard error for
our intercept and slope.
So for example, the slope was 0.16 and
the estimated standard error is 0.014.
So it turns out, remember, these slopes are ultimately mean differences, and
the intercepts are means.
And so the random sampling behavior of these estimated regression coefficients is
essentially the random sampling behavior of mean differences and means,
which, as we've already shown, is generally normal from sample to sample and
centered at the true value we're estimating.
So we can use the same ideas we used back in Statistical Reasoning One
for creating 95% confidence intervals for the true underlying
population-level slopes and intercepts, and for getting p-values.
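In symbols, the large-sample 95% confidence interval for the slope takes the familiar form (and the intercept is handled the same way):

$$\hat{\beta}_1 \pm 2 \times \widehat{SE}\!\left(\hat{\beta}_1\right)$$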
So let's look at the estimated regression equation relating arm circumference to
height in Nepali children.
So this slope here, 0.16, estimated the mean difference in arm circumference
per one-centimeter difference in height.
8:05
Suppose we want to test whether the true association between arm circumference and
height is zero or not.
So our null hypothesis is that this true population level mean difference or
the slope is zero and the alternative is that it's not zero.
So we'll do this the same way we've always done hypothesis testing,
we'll assume our sample comes from a population where the true slope is zero.
And then we'll measure how far our result estimate is from zero
in standard error units.
And so if we do this, we get a slope that's 11.4
standard errors above what we'd expect to see under the null hypothesis.
So, translating this to a p-value means getting
the probability of being 11.4 or more standard errors away, either above or
below the mean of 0, on a standard normal curve.
And the p-value is very low.
We already knew it would come in at less than 0.05, if you think about it,
because the confidence interval for the slope did not include 0.
But it's quite low: it's less than 0.001.
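As a worked check using the numbers from this example:

$$z = \frac{0.16 - 0}{0.014} \approx 11.4, \qquad 0.16 \pm 2(0.014) \;\Rightarrow\; (0.13,\ 0.19) \text{ cm}$$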
So how could we write this up?
We could say something like this research used simple linear regression to estimate
the magnitude of the association between arm circumference and height in Nepali
children less than 12 months old, using data on a random sample of 150.
A statistically significant positive association was found
(we could put the p-value in parentheses).
The results estimate that two groups of such children who differ by one centimeter
in height, will differ on average by 0.16cm in arm circumference.
9:55
In other words, it's an increase in arm circumference, with an increase in height.
And a 95% confidence interval, which gives a range of possibilities for
the true mean difference in arm circumference per 1 unit difference in
height, in the entire population of such children,
goes from 0.13 centimeters to 0.19 centimeters.
What if I wanted to give an estimate in a 95% confidence interval, for
the mean difference in arm circumference for
children 60 centimeters tall compared to children 50 centimeters tall?
Well, from the previous lecture section, we know
that this estimated mean difference can be expressed in terms of the slope by taking
the difference in our x value, which is 10 centimeters, or 10 units, and multiplying
it by the estimated mean difference in y per one unit difference in x.
So, the estimated mean difference in arm circumference per 1 unit difference in
height was 0.16 centimeters.
So, if the difference in height is 10 centimeters,
this would accrue to a cumulative difference of 1.6 centimeters on average.
But how do we actually get the standard error for
this mean difference for more than a one unit difference in our x value?
Well it turns out anything we do to our slope we do to the standard error.
So if our resulting comparison yields an estimate of 10 times the slope estimate,
we would take the standard error for the slope, and multiply it by 10.
So the standard error, in other words, the estimated standard error of 10 times
the slope, is equal to 10 times the standard error of the slope.
So the standard error of 10 times beta-one hat is 10 times the standard error for
beta-one hat, which is 0.014 centimeters.
And that turns out to be 0.14 centimeters.
So the 95% confidence interval for the mean difference in arm circumference for
these two groups of children who differ by 10 centimeters in height is
the estimated 1.6-centimeter difference in average arm circumference,
plus or minus 2 times that standard error of 0.14 centimeters.
And if you do this out, we get a confidence interval of
1.32 centimeters to 1.88 centimeters.
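Putting that arithmetic together in one place:

$$10 \times 0.16 = 1.6 \text{ cm}, \qquad 10 \times 0.014 = 0.14 \text{ cm}, \qquad 1.6 \pm 2(0.14) \;\Rightarrow\; (1.32,\ 1.88) \text{ cm}$$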
So that interval describes our uncertainty in the estimated mean difference in arm
circumference between two groups of children who
differ by 10 centimeters in height.
13:03
So how are we going to compute a 95% confidence interval for the slope in
our other example, relating hemoglobin to packed cell volume?
Well, this is exactly the same idea as we just saw.
But this sample had only 21 subjects.
So in order to get a confidence interval and p-value,
we're going to have to go slightly more than plus or
minus two standard errors to get our confidence interval.
And we'll have to compare our resulting difference between our estimate and
the null value, not to the standard normal curve, but to a t-distribution
with n − 2, or 19, degrees of freedom.
13:36
And again, I'm not going to ask you to do this in a testing situation, or if I did,
I would give you this value.
The computer will handle this, but it's just nice to remember that in
smaller samples, we have to be a little more conservative.
So if we did this and actually went to a t-distribution, or
let our computer do the work for us,
the number of standard errors required to capture the middle 95% of values
in a t-distribution with 19 degrees of freedom is 2.09.
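If you wanted to see where the 2.09 comes from, a one-line check in Python (assuming the scipy package is available) would be:

```python
from scipy import stats

# Multiplier that captures the middle 95% of a t-distribution with n - 2 = 19 df
t_multiplier = stats.t.ppf(0.975, df=19)
print(round(t_multiplier, 2))  # roughly 2.09
```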
So, in order to get this confidence interval, we take the estimated
mean difference in hemoglobin per 1% difference in packed cell volume, and add and
subtract 2.09 times the estimated standard error of our slope, which is 0.046.
And we get a confidence interval that goes from 0.1 to 0.3 grams per
deciliter per 1% difference in packed cell volume.
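Written out, that interval is:

$$0.2 \pm 2.09 \times 0.046 = 0.2 \pm 0.096 \;\Rightarrow\; (0.10,\ 0.30) \text{ g/dL per 1\% difference in packed cell volume}$$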
So notice that that confidence interval does not include 0.
So we already know this result will be statistically significant at
the 0.05 level.
14:41
However, suppose we wanted to get the p-value for testing the null hypothesis
that the true slope for packed cell volume in the population from which the sample
was taken is 0, versus the alternative that it's not 0.
We'll again assume the null is true, assume the true slope is zero,
so that our sample comes from a population where there's
no association between hemoglobin and packed cell volume.
We look at how far our estimated slope of 0.2 is from 0 in
terms of standard errors, and we get something that's
4.35 standard errors above what we'd expect under the null.
So the resulting p-value is the probability of being 4.35 or
more standard errors above or below what we'd expect under the null,
but we're referring this to a t-curve with 19 degrees of freedom.
Nevertheless, in this example, the p-value comes in very low, at less than 0.001.
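As a sketch of how that p-value could be computed (again assuming scipy is available):

```python
from scipy import stats

# Distance of the estimated slope from 0, in standard error units
t_stat = 0.2 / 0.046               # roughly 4.35

# Two-sided p-value from a t-distribution with 19 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=19)
print(p_value)                      # comes out well below 0.001
```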
So, the estimated slope is 0.2 with a 95% CI of 0.10 to 0.30.
So how can we interpret these results?
We can say, based on a sample of 21 subjects, we estimated that
packed cell volume is positively associated with hemoglobin levels.
And we could include the p-value, less than 0.001, if we wanted to.
We estimated that a one-percent increase in packed cell volume is associated
with a 0.2 grams per deciliter increase in hemoglobin on average.
16:17
Accounting for sampling variability,
this mean increase could be as small as 0.1 grams per deciliter or
as large as 0.3 grams per deciliter in the population of all such subjects.
So that brings in the confidence interval to express our uncertainty
in how much that mean difference
in hemoglobin is per one-percent difference in packed cell volume.
16:42
In other words, we estimated that the average difference in hemoglobin levels
for two groups of subjects who differ by one percent in packed cell volume
is 0.2 grams per deciliter.
And accounting for sampling variability,
this mean difference could be as small as 0.1 grams per deciliter or
as large as 0.3 grams per deciliter in the populations of all such persons.
So what about the intercepts?
So far, I've shown you how to construct confidence intervals and
do hypothesis testing for the slope from linear regression, and
for multiples of the slope.
We can also create confidence intervals and get p-values for the intercept
in the same manner, although they often won't be that useful, and Stata and
other computer packages will present this in their regression output.
However, as we've talked about when X1 is a continuous predictor,
many times the intercept is just a placeholder and
does not describe a useful quantity or a quantity of relevance to our data.
As such, 95% confidence intervals are not always relevant.
However when our predictor is binary or
categorical, the intercept may have a substantive interpretation and
a 95% confidence interval at least, may be of interest.
So let's take a look at an example of that.
18:03
So you'll recall the analysis we did in Statistical Reasoning One, and
redid as a linear regression in a previous section here, of length of stay
by age at first claim among the subjects from the Heritage Health Study.
When we regressed average length of stay on an indicator of whether
the person was less than 40 at first claim or greater than or equal to 40,
we got a slope of -2.1 and an intercept of 4.9.
So we interpret the slope as the estimated mean difference in length of stay for
persons less than 40 at first claim,
compared to persons 40 and over, and that was -2.1 days.
The younger group had an average length of stay
2.1 days shorter than the older group.
And the intercept actually had meaning in this analysis.
It was the estimated mean length of stay for persons 40 and over for
their first stay in 2011, their first claim.
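Putting the two coefficients together (with the indicator x1 equal to 1 for the under-40 group, consistent with the interpretation above), the estimated mean length of stay for the younger group is:

$$\hat{y} = 4.9 + (-2.1)\times 1 = 2.8 \text{ days}$$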
19:25
After accounting for the uncertainty in our estimate,
this is the 95% confidence interval for
the true mean difference in length of stay, for all patients in 2011.
You can see it's rather tight because this was a large data set and
it indicates that it's on the order of two or more days.
If we did a hypothesis test of whether the true association was zero,
in other words, that there was no association between length of stay and
age at first claim, the p-value is quite low.
We know that it would come in at less than 0.05, because our 95% confidence
interval did not include zero, but this adds some specificity to the discussion.
If we did a confidence interval for the intercept: the estimated mean length of
stay for those who were 40 and over at their first claim in 2011 was 4.9 days.
And this confidence interval has meaning:
it goes from 4.8 days to 5.0 days and
expresses our uncertainty in that estimated mean.
So we have a pretty tight interval here that suggests the true
length of stay, on average, was close to 5 days,
between 4.8 and 5 days, for the population of patients who were 40 and over
when they entered the hospital in 2011.
We could get a p-value for this, but it really doesn't make sense to test
whether the mean length of stay for this single group is zero or not.
21:15
So in summary, the construction of confidence intervals for
linear regression slopes is business as usual.
Take the estimate and add or subtract two estimated standard errors, or
slightly more in smaller samples.
And we can also get a p-value by taking our slope estimate and converting it to
the number of standard errors it is above or below the null value of zero,
and then figuring out what percentage of results we could get that were that far or
farther, just by chance, if the null were true.
So the confidence intervals we get for slopes and the resulting p-values
are confidence intervals and p-values for mean differences.
And the confidence intervals for intercepts are confidence intervals for
the mean of y for a specific group or
a specific population, the population whose x1 values are equal to zero.
And as we've discussed, this is not always relevant or helpful when x1 is continuous.
But it can add information to the analysis when our predictor is binary or categorical.