A practical and example filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From the lesson

Introduction and Module 1A: Simple Regression Methods

In this module, a unified structure for simple regression models will be presented, followed by detailed treatises and examples of both simple linear and logistic models.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

All right, welcome back.

Now we're going to get into some totally new territory, and

we're not going to replicate the results of analyses we

did in Statistical Reasoning 1 using simple linear regression.

We're going to expand our toolbox now to allow our predictor of interest to go

beyond binary or categorical and actually respect it as a continuous measure.

So hopefully by the end of this lecture set, you'll understand why treating

a continuous predictor as continuous instead of dichotomizing it and making it

binary or putting it into several categories can actually be beneficial.

You’ll be able to use the scatter plot display to assess whether

an outcome predictor relationship is reasonably described by a line.

And you’ll be able to interpret the estimated slope and

intercept scientifically from a simple linear regression model with

a continuous predictor x1.

So let's start with an example.

Our arm circumference data.

But this time we'll look at the association between arm circumference and

a child's height.

So this is the same data set we used in the last section,

where we looked at arm circumference as a function of child sex, but

here we're going to consider it as a function of child height.

So, the question we might have is,

what is the relationship between average arm circumference and height?

Can we quantify this so

that we can compare it to relationships between arm circumference and

height in children less than a year old from other populations?

So, again we have the arm circumference data,

we laid that out in the last section.

And then the height data is such that the mean for the 150 children less than a year

old is 61.6 centimeters with a standard deviation of 6.3, and it ranges because

there's a range of ages here between 40.9 centimeters and 73.3 centimeters.

So some of that's because of the age difference, and

some of that's because of individual variability among children of similar ages.

So how can we handle this?

Well, given what we know up till now, this would require us to categorize our predictor, height.

So one crude way to categorize this is to actually categorize at the median, and

compare the mean arm circumference with a t-test between the group with greater than

median height and a group less than median height, and

we could also put a confidence interval in that.

We can also visually display this, and here's a box plot showing the arm

circumference distributions for the two height groups.

And I think it's pretty clear from this presentation that

the arm circumference distribution shifts up for

the taller group, relative to the shorter group.

And the potential advantages of doing it this way, well, we know how to do it.

We know how to create a mean difference between two groups, do

a two sample t-test to get a p value, and this gives us a single summary measure,

the sample mean difference for quantifying the arm circumference-height association.

But there are some potential disadvantages: this throws away a lot of

information in the height data that was originally measured as continuous.

We've taken things that were measured on a continuous

scale and put them into two crudely defined, very heterogeneous categories:

the height category below the median, or above the median.

And there'd be a lot of variation in the heights of each of those two

categories because we're taking something measured on a continuum and

putting it into two groups.

So you might say well,

we learned how to compare means between more than two groups.

Why don't we make our categorization for height less crude?

Maybe we'll make four categories and we'll do it arbitrarily by the quartiles.

So we'd roughly have a quarter of the data set,

25% of the observations in each of the height groups.

And then we could compare mean arm circumference with analysis of variance,

if we wanted to test for differences and 95% confidence intervals for

the mean differences between the different height quartile groups,

that mean difference in arm circumference.

So what are the potential advantages?

Well, it's always an advantage that we know how to do it.

This improves, perhaps a little bit, on our previous approach where we

were making height binary and crudely putting it into two categories; this

categorization into four groups is a little less crude than that previous approach.

But this still throws away a lot of information in the height data that

was originally measured as continuous.

This will also require multiple summary measures,

six sample mean differences between each unique combination of height categories,

to quantify the arm circumference/height relationship.
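
To see where the six comes from, here is a minimal sketch that just enumerates the unique pairs of four height groups. The group labels are placeholders made up for illustration; nothing here uses the actual data.

```python
from itertools import combinations

# Placeholder labels for the four height quartile groups
# (Q1 = shortest children, Q4 = tallest children).
height_groups = ["Q1", "Q2", "Q3", "Q4"]

# Each unique pair of groups is one mean difference we would have to report.
pairs = list(combinations(height_groups, 2))

for first, second in pairs:
    print(f"mean difference in arm circumference: {second} vs {first}")

print(len(pairs))  # 6
```

With more categories the count grows quickly, which is part of why a single slope is an attractive summary when a line fits.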

And most importantly,

this does not exploit the structure we see in the previous set of box plots.

The fact that as height increases so does arm circumference.

So let's take a look at this again.

I think it's pretty clear visually that these two track pretty

closely as height goes up, as evidenced by increase in these ordered quartiles.

The arm circumference distributions shift upward by somewhat similar amounts if

you're comparing, for example, the medians between them.

If we create these four different height groups and treat them

each as their own entity, each of these groups will only have about a quarter of

150 observations, so we'll have somewhere between 37 and 38 children in each of the groups.

And we'll estimate the mean arm circumference for

each of the four groups, each based on only 37 to 38 children.

And our precision will be affected by that smaller sample size.

When we view this and

estimate separate mean arm circumferences for each height group,

we are not recognizing the structure in this data, which actually might buy us

something in terms of how precisely we can estimate this relationship.
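
As a rough illustration of the sample size effect, here is a sketch that compares the standard error of a sample mean based on one quartile group with one based on all 150 children. It assumes the standard deviation of arm circumference is the same in both, which is an assumption made for illustration only, and the function name is hypothetical. It says nothing yet about the precision of a regression estimate, which comes in the next section.

```python
import math

def standard_error_ratio(n_full, n_subgroup):
    """Ratio of SE(mean) in a subgroup to SE(mean) in the full sample,
    assuming the same standard deviation in both (SE = sd / sqrt(n))."""
    return math.sqrt(n_full / n_subgroup)

n_full = 150
n_per_quartile = n_full / 4  # 37.5, i.e. 37 to 38 children per group

ratio = standard_error_ratio(n_full, n_per_quartile)
print(ratio)  # 2.0
```

So, under that equal-spread assumption, a mean based on a quarter of the sample is about half as precise as one based on the whole sample.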

So what about treating height as continuous when

estimating the arm circumference/height relationship?

Well as we sort of alluded to in the first section of this lecture set,

linear regression is a potential option.

It allows us to associate a continuous outcome

with a continuous predictor via a line.

And what the line will do is estimate the mean value of our outcome for

each continuous value of height or predictor in the sample used.

So in other words, we'll be able to get

height-specific mean estimates of arm circumference, treating height as continuous.

This idea makes a lot of sense, but

only if a line reasonably describes the outcome/predictor relationship.

Now, we get some tentative evidence that suggests that's reasonable.

But let's look at a more detailed graphic to try and assess this.

So what I'm showing here is a scatter plot, and this is a lot more

informative than when our predictor was binary, I think you'll agree.

But what I have on this picture here are 150 points,

each representing one of the children, and then for

each child what's plotted on the vertical axis is his or her arm circumference.

And it's plotted against his or

her height value measured on a continuum in centimeters.

And certainly there's some subjectivity in interpreting the association via

a graphic, and the longer you look at it the more you can see in the picture.

But I'm going to suggest that, at least to start, a line is a reasonable descriptor of

the relationship between the average arm circumference and height in these data.

And I will proceed to estimate such a line using the computer and

present the results to you.

So what we can do is estimate a line using the computer, and the line will be of

the form y-hat equals beta-zero-hat plus beta-one-hat times x1, where y-hat is the stand-in for the mean arm circumference

we estimate as a linear function of height, x1, measured in centimeters.

So what y-hat estimates is the average arm circumference for

a group of children all of the same height, x1.

So let me show you what we get when we run this on a computer.

We get an estimated equation that says based on these 150 observations we

estimate the mean arm circumference, as a function of height to

be such that you take 2.7 and add 0.16 times the height for the group you're looking at

to get the estimated average arm circumference for that group.

So our slope here is 0.16,

it's positive which corresponds to the relationship we saw on the scatter plot.

These are just estimates of the true relationship between height and

arm circumference in Nepali children less than a year old based on this sample.
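
The lecture doesn't show what the computer does to get these two numbers, so here is a minimal sketch of one standard way, ordinary least squares, written out by hand. The six data points are made up for illustration (they are not the Nepali sample) and were built to scatter evenly around the line from the lecture, so the fit recovers an intercept near 2.7 and a slope near 0.16. The function name `fit_line` is hypothetical.

```python
# Made-up heights (cm) and made-up deviations from the line (cm).
# The deviations were chosen to balance out, so least squares
# recovers the line they were built around.
heights = [45.0, 50.0, 55.0, 60.0, 65.0, 70.0]
deviations = [0.4, -0.6, 0.2, 0.2, -0.6, 0.4]
arm_circumferences = [2.7 + 0.16 * h + d for h, d in zip(heights, deviations)]

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = sum of cross-products / sum of squared x deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(heights, arm_circumferences)
print(round(intercept, 2), round(slope, 2))  # approximately 2.7 and 0.16
```

Notice that individual points sit above and below the fitted line even though the line itself is recovered; that is the variation around the mean we'll see in the scatter plot.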

So here's a picture or a scatter plot with the regression line superimposed.

So I plotted that line that the computer estimated on top of these data and you see

it cuts down the middle of the points so that, at any given height, if there's

a couple points, some values are above the mean estimated by the line and some are below.

So there's variation in arm circumference around the estimated mean.

But the mean appears to increase with increased height.

So for example if we were looking at this line and wanted to estimate the mean arm

circumference for children 60 centimeters in height, well we have this equation.

We know their height value, the group of children we're looking at is 60.

So if we actually plugged 60 into this equation, what we get is an estimated

mean arm circumference of 12.3 centimeters for children who are 60 centimeters tall.

So in other words, if we looked at 60 on the x axis, the height axis,

went up this line, this value on the line is 12.3 centimeters.
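
The plug-in calculation can be sketched in a couple of lines. The intercept 2.7 and slope 0.16 are the estimates reported in the lecture; the function name is just a hypothetical label.

```python
INTERCEPT = 2.7   # estimated intercept from the lecture (cm)
SLOPE = 0.16      # estimated slope from the lecture (cm of arm circumference per cm of height)

def estimated_mean_arm_circumference(height_cm):
    """Estimated mean arm circumference (cm) for children of a given height."""
    return INTERCEPT + SLOPE * height_cm

estimate_at_60 = estimated_mean_arm_circumference(60)
print(round(estimate_at_60, 1))  # 12.3
```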

Notice if you actually look at a thin band around that,

most of the points of individual children's arm circumferences at or

around 60 centimeters of height do not fall directly on the line.

What we're estimating at this point in the line is the mean arm circumferences for

children 60 centimeters tall, but the individual arm circumferences for

children 60 centimeters tall will vary about this mean.

So you can see some of that variation in these points.

The few observations we have at 60 centimeters vary about that mean of 12.3.

So how can we interpret the results here?

How can we interpret the estimated slope?

Well the estimated slope is positive and it's 0.16.

What units is it in?

Well, arm circumference is in centimeters, and height is in centimeters.

So this is 0.16 centimeters in

arm circumference per one centimeter in height.

So what this slope estimates generically is the average change in

arm circumference for a one centimeter increase in height.

Or another way to think about this is in our mean difference formulation that

beta one hat estimates the mean difference in arm circumference for

two groups of children who differ by one unit, or one centimeter, in height.

And this difference is such that it compares the taller group to

the shorter group where the difference is one centimeter.

So, putting in this 0.16,

this result estimates that the mean difference in arm circumferences for

a one-centimeter difference in height is 0.16 centimeters with

taller children having the greater arm circumference by 0.16 centimeters.

Notice that, and this is something we pointed out about lines, but

this estimate is constant across the entire height range of the sample.

That's the assumption we're making by estimating this line, that a one unit

difference in height results in the same 0.16 centimeter difference in

average arm circumference, regardless of the two heights we are comparing, so

long as they differ by one centimeter.

So for example, I could ask the question based on these results: what is

the estimated mean difference in arm circumference for

children 60 centimeters tall versus 59 centimeters tall?

And the answer is 0.16 centimeters.

How about children who are 45 centimeters versus 44?

The answer again is 0.16 centimeters.

72 versus 71? I could go on and on: 0.16 centimeters.
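
Here is a quick check of that constancy, using the estimated line from the lecture. The helper name is hypothetical; the three comparisons are the ones just mentioned.

```python
def estimated_mean_arm_circumference(height_cm):
    """Estimated mean arm circumference (cm) from the lecture's fitted line."""
    return 2.7 + 0.16 * height_cm

comparisons = [(60, 59), (45, 44), (72, 71)]

# Difference in estimated means, taller group minus shorter group.
differences = [
    estimated_mean_arm_circumference(taller) - estimated_mean_arm_circumference(shorter)
    for taller, shorter in comparisons
]

for (taller, shorter), diff in zip(comparisons, differences):
    print(f"{taller} cm vs {shorter} cm: {round(diff, 2)} cm")
```

Each difference comes out to 0.16 centimeters (up to floating point rounding), because the intercept cancels and only the slope times the one-centimeter gap remains.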

So what have we done by exploiting this linear relationship?

Well, we'll see how to get confidence intervals on this in the next section, but

think about this, we were able to use all of the data at

once instead of breaking it up into subgroups of smaller numbers of children.

And all we had to do was estimate two numbers, the intercept and

slope using the entire sample of 150.

What's that going to do to our precision of the relationship between

arm circumference and

height as compared to categorizing the children into different height groups?

Well, we can use all the data.

And we only have to estimate two numbers to describe this association,

as opposed to only being able to use a subset of the data like the 38

children who were in height quartile 1 to estimate the mean for that group.

So by exploiting this linear relationship, we see we're going to end up

with a more precise estimate of the arm circumference/height relationship.

Of course, this would not be an appropriate idea if this relationship were

not well described by a line, but when it is, this works well.

What if we wanted to compare the estimated mean difference in arm circumference for

children 60 centimeters tall versus children 50 centimeters tall?

Well we said that a one unit difference in height

results in an estimated 0.16 centimeter difference in arm circumference for

any two groups who differ by one unit in height.

If we were to extend the difference to 10 centimeters in height,

this would accrue additively.

So a difference of 10 centimeters in height will result in 0.16 plus 0.16

and so on, adding it to itself ten times or, in other words, 10 times 0.16.

So, the difference in estimated mean arm circumference for

a 10 unit difference in height is equal to 10 times the estimated mean difference for

a 1 unit difference in height.

So, 0.16 times 10 is 1.6 centimeters.
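
Written out, the reason the difference is just the slope times the gap is that the intercept cancels when you subtract one estimated mean from the other. Here the beta-hat symbols are the estimated intercept and slope from the line above:

```latex
\begin{aligned}
\hat{y}(60) - \hat{y}(50)
  &= \left(\hat{\beta}_0 + \hat{\beta}_1 \cdot 60\right)
   - \left(\hat{\beta}_0 + \hat{\beta}_1 \cdot 50\right) \\
  &= \hat{\beta}_1 \,(60 - 50) \\
  &= 0.16 \times 10 = 1.6 \text{ cm}
\end{aligned}
```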

So this slope is very powerful because it allows us to compare any two

groups who differ by any heights observed in our data range in terms of

estimating the average difference in arm circumference between them.

And just to reiterate this is a really powerful result if our

data meets the linearity assumption because we can use all the data to

estimate two quantities, the intercept and the slope, which will allow us to

quantify differences in the mean across the entire range.

We don't have to break it up into smaller subgroups, and lose precision, et cetera.

So under the linearity assumption, if that's met, this slope is a very powerful

number, because it describes all differences in the mean of the outcome for

all possible unit differences and multi-unit differences in the predictor.

What is the estimated mean difference for

children who are 90 centimeters tall versus 89, or 34 versus 33?

Your impulse here might be to say 0.16,

just like we did before.

But this is a trick question.

And this will have ramifications for interpreting the intercept as well.

So the range of observed heights in the example is 40.9 to 73.3 centimeters.

So even though this line we've estimated theoretically goes on forever in

two dimensional space, we can only use the portion that corresponds to the x or

height range in our data.

So all this part out here is not applicable;

we can't extrapolate this relationship to other height groups that

are outside of what we've observed in this sample.

So our regression results only apply to the relationship between arm circumference

and height, for children between 41 and 73 centimeters tall from Nepal.
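
One way to make this restriction concrete is to have the calculation refuse heights outside the observed range. This is only a sketch of the idea: the range 40.9 to 73.3 centimeters and the line come from the lecture, while the function name and the choice to raise an error are assumptions for illustration.

```python
OBSERVED_MIN_CM = 40.9  # shortest child in the sample
OBSERVED_MAX_CM = 73.3  # tallest child in the sample

def estimated_mean_arm_circumference(height_cm):
    """Estimated mean arm circumference (cm), only for heights seen in the sample."""
    if not OBSERVED_MIN_CM <= height_cm <= OBSERVED_MAX_CM:
        raise ValueError(
            f"{height_cm} cm is outside the observed height range "
            f"({OBSERVED_MIN_CM} to {OBSERVED_MAX_CM} cm); the line does not apply"
        )
    return 2.7 + 0.16 * height_cm

print(round(estimated_mean_arm_circumference(60), 1))  # 12.3

for height in (90, 34):
    try:
        estimated_mean_arm_circumference(height)
    except ValueError as error:
        print("cannot estimate:", error)
```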

So that leads us to question,

well then how do we interpret the estimated intercept?

Well by convention 2.7, the estimated intercept,

estimates the mean y, when x is zero.

So this is the estimated mean arm circumference for

children zero centimeters tall.

Well first of all it's impossible to have children who are zero centimeters tall, so

it's a pretty big heads up that this doesn't make sense.

And technically speaking, even if we believed there could be children zero

centimeters tall, the height range in our data started at 41 centimeters, so

we do not have children who are at a value of zero.

So, this intercept is an important mathematical place holder, and

I'll show why in a second, but it doesn't have any relevance to our sample.

That number of 2.7 doesn't tell us anything about any of

the children's arm circumference for any group of children in our sample.

And this is frequently the case when our predictor is continuous:

the intercept has no meaningful scientific interpretation.

But we need this intercept to fully specify the equation of the line.

And just note in this scatter plot, you can trick yourself into thinking,

well, the intercept must be over here, just where the line hits the Y axis, but

this Y axis is drawn at a height of about 39 centimeters, not at zero.

So, this is a visual trick.

If we were to actually include zero on this picture, it would be way over here,

and the portion of the line between 0 and

40 would not actually apply to the population from which the sample is taken,

because there are no heights in that range observed in the sample.

Why do we need the intercept though?

Well, without the intercept, if all I knew was the slope,

which tells us the change, there'd be no way to put this on the graph.

What I'm showing here are four different lines, including our regression line,

that all have the same slope of 0.16 centimeters.

They are all indistinguishable if we only know the slope.

But the intercept actually allows us to verify or

choose a specific line with the slope of 0.16 to describe our data.

So it's necessary to fully specify where this line sits,

even among heights that are far above zero, where the intercept is defined.
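
A small sketch of the same point: several lines sharing the slope 0.16 give identical differences between heights but different estimated means. Only the intercept 2.7 comes from the lecture; the other three intercepts are made up to stand in for the other lines on the slide.

```python
SLOPE = 0.16
# 2.7 is the estimated intercept; the others are hypothetical alternatives.
candidate_intercepts = [0.7, 1.7, 2.7, 3.7]

means_at_60 = []
differences_60_vs_50 = []
for intercept in candidate_intercepts:
    mean_at_50 = intercept + SLOPE * 50
    mean_at_60 = intercept + SLOPE * 60
    means_at_60.append(mean_at_60)
    differences_60_vs_50.append(mean_at_60 - mean_at_50)
    print(f"intercept {intercept}: mean at 60 cm = {round(mean_at_60, 1)}, "
          f"60 vs 50 difference = {round(mean_at_60 - mean_at_50, 1)}")
```

The difference column is the same for every line, while the estimated mean at 60 centimeters changes with the intercept, so the slope alone cannot tell us which line describes the data.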

Let's look at another example.

Here's data of laboratory measurements on a random sample of 21 clinical patients 20

to 67-years-old.

So a very wide variability in age in the sample, but what we

have on these people is their hemoglobin levels in grams per deciliter and

their packed cell volume, or hematocrit, as a percentage of packed cells.

So in this sample of 21, the mean hemoglobin is 14.1 grams per deciliter.

There's some variability, a standard deviation of 2.3 grams per deciliter, and

the values range from 9.6 grams per deciliter to 17.1 grams per deciliter.

The packed cell volume, or hematocrit, the average in the sample is 41.1%.

So this is actually measured on a continuum,

the percentage of cells in an assay that are packed.

So this is not binary, so each person's value is a percentage.

The standard deviation in the individual percentage measurements amongst the 21

people in the sample is 8.1% and the range goes from 25% to 55%.

Here is the scatter plot display of these data of the hemoglobin on the vertical

axis versus the packed cell volume on the horizontal axis.

So each point here is one of the 21 persons in the sample and

it shows their hemoglobin corresponding to their packed cell volume.

So this is sparse data, but I'm going to suggest that a line is a reasonable way to

start describing the average association between hemoglobin and packed cell volume.

So if we go ahead and do this and

use the computer, here's the equation I get based on these 21 data points.

So, we're saying that the

mean hemoglobin level for a given group of persons with a packed cell volume of

x1% can be estimated by taking this intercept of 5.77

plus the slope of 0.2 times the packed cell volume for that group.

So how do we interpret this slope?

Well, first we might want to get our units straight.

The outcome, y hat, the mean hemoglobin, is in grams per deciliter, and

x 1 is in percent.

So the slope is in units of grams per

deciliter per percent of packed cell volume.

So this slope is positive,

which is consistent with what we saw in our picture here.

And this result estimates that the mean difference in hemoglobin levels for

two groups of subjects who differ by 1% in hematocrit or

packed cell volume, PCV, is 0.2 grams per deciliter.

So subjects with the greater PCV have greater average hemoglobin levels.

Here's a scatter plot display with the regression line.

So again, there's not as many points as there were with the previous example, but

what we're estimating here, each point on a line estimates the mean given the packed

cell volume, and we're using this linear association to interpolate

for that relationship in areas of our data where we don't have any observations.

So we can estimate, for example, the mean hemoglobin for

persons with a packed cell volume of 45% even though we didn't have any

data points for that, if we're willing to make this linear assumption.

So what would be the average difference in hemoglobin levels for

subjects with packed cell volume of 40% compared to subjects with 32%?

Well again, this slope compares the average hemoglobin levels for

subjects who differ in packed cell volume by 1%.

So for any two groups who differ in packed cell volume by 1%, the group with the higher

packed cell volume has an average hemoglobin 0.2 grams per deciliter greater

than the group with the lower, as long as the difference in packed cell volume is 1%.

So we wanted to compare subjects

with 40% versus 32%; well, that's a difference of eight units, or

8% packed cell volume.

The per-unit difference in mean hemoglobin is 0.2, and

we have an eight-unit difference, and this is additive, so

we just take eight copies of 0.2 added together, or multiply 0.2 by eight.

And this difference is 1.6 grams per deciliter.

What is the estimated hemoglobin level for subjects with a packed cell volume of 41%?

Well, we can estimate

specific means for specific groups just by plugging their

x1 value into the equation, 41%, and if we do the math we say, well,

we estimate that on average subjects with hematocrit or packed cell volume of 41%,

their average hemoglobin level is 13.97 or nearly 14 grams per deciliter.
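
Both hemoglobin calculations above can be reproduced with the estimated line from the lecture (intercept 5.77, slope 0.2). The function name is a hypothetical label for this sketch.

```python
def estimated_mean_hemoglobin(pcv_percent):
    """Estimated mean hemoglobin (g/dL) for a given packed cell volume (%)."""
    return 5.77 + 0.2 * pcv_percent

# Mean difference for groups with PCV of 40% versus 32% (eight units apart).
difference_40_vs_32 = estimated_mean_hemoglobin(40) - estimated_mean_hemoglobin(32)

# Estimated mean hemoglobin for a group with PCV of 41%.
mean_at_41 = estimated_mean_hemoglobin(41)

print(round(difference_40_vs_32, 2))  # 1.6
print(round(mean_at_41, 2))           # 13.97
```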

What's the intercept interpretation here?

Well again, it's going to estimate the mean hemoglobin level,

when packed cell volume is 0%.

Packed cell volume, or hematocrit, cannot be 0%, and

in fact, the lowest value in our data sample is 25%.

So again, this is just the placeholder that's necessary to specify the full

equation of the line so that we can predict individual means given individual

group packed cell volume levels, but the number 5.77 does not

describe the average hemoglobin for anyone in our sample.

One more example, this is interesting, older but interesting data.

This was data on a random sample of 534 US workers in the year 1985.

And the data set had their hourly wages in

US dollars and other information about them, and

one of the pieces of information it had was their years of formal education.

So let's look at these data.

So on the whole, remember this is 1985,

the average hourly wage amongst the 534 workers in the data

set was 9.04 U.S. dollars per hour.

There was some variability, though, in these hourly wages.

The standard deviation was five dollars and 13 cents per hour, and

the range was from a dollar per hour, hopefully for people in service

industries like waiting tables, whose pay is really based on tips

(these are the wages reported by employers), up to $44.50 per hour.

So, a fair amount of variability.

And the years of formal education, the average is 13 years.

So in the United States, a completion of secondary school or

high school is at 12 years of education.

So an average of 13 would indicate that the average person in the data had

one year beyond secondary school.

And the standard deviation is 2.6 years and

it ranges from 2 years up to 18 years.

18 years of education would put somebody at the graduate school level.

When we look at this, it is not as clean cut of a display as we saw before, but

here's a scatter plot display of these hourly wages versus years of education.

And we have very little data for

persons who have less than six years of education, but

they are in our data set, so I'm going to argue to start that

maybe a line is a reasonable way to start with these 534 data points to try and

describe the trend in wages as a function of years of education.

And that's a poorly drawn line, but it's the idea.

So, if we actually did this, and ran this in Stata, or

some other computer package, the resulting line looks like this.

The average hourly wages for a group of persons given their years of

formal education is given by an intercept of negative 0.75,

plus the slope of 0.75 times the years of formal education.

Absolutely a coincidence that the intercept and

slope have the same absolute value of 0.75.

That's just a case example with these data, and,

as we've seen before, the two do not have to be the same in absolute magnitude.

Here's a scatter plot display with the regression line, and so what you see, and

something we'll get into measuring later,

is that, well, it does appear that the mean does increase with years of formal education.

But for any given level of education, there's still a fair amount of

variability in the individual wages around that group's mean.

So what is the interpretation of the slope?

Well the slope is 0.75 and

the units are dollars per year of formal education.

So what does this suggest? We could say that average

hourly wages increase by 75 cents per year of formal education,

or the expected or estimated average difference in hourly wages between two

groups who differ by one year in formal education is 75 cents.

And we could certainly use multiples of this to compare those with

a college education, 16 years, versus those with a high school education, et cetera.
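
As a worked instance of using a multiple of the slope, here is a sketch comparing 16 years of education with 12 years, using the estimated line from the lecture (intercept negative 0.75, slope 0.75). The function name is hypothetical, and the 12-year figure for high school is the one mentioned earlier in the lecture.

```python
def estimated_mean_hourly_wage(years_of_education):
    """Estimated mean hourly wage (1985 US dollars) for a given number of years of education."""
    return -0.75 + 0.75 * years_of_education

COLLEGE_YEARS = 16
HIGH_SCHOOL_YEARS = 12

# Four years apart, so the difference is 4 times the slope.
college_vs_high_school = (
    estimated_mean_hourly_wage(COLLEGE_YEARS)
    - estimated_mean_hourly_wage(HIGH_SCHOOL_YEARS)
)
print(college_vs_high_school)  # 3.0
```

So the line estimates that the college group earns, on average, three dollars per hour more than the high school group.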

What is the interpretation of the intercept at negative 0.75?

Well, it really has no relevance to anybody in our data set because it's

the estimated mean hourly wage for persons who had no formal education.

Now remember our range of formal education was low.

It started low, at two years, but there was nobody at zero.

And an hourly wage that's negative would imply that the person was

paying their employer to work.

So, this negative 0.75 has no relevance to the population from

which our sample was taken, but is necessary for specifying this full line.

So in summary, simple linear regression is a method for relating the mean of

an outcome y to a predictor, we'll call it x1, when x1 is a continuous variable.

It can also, as we've seen in previous sections, be binary or categorical.

So when x1 is a continuous variable the estimated slope for

x1, beta one hat, has a mean difference interpretation.

And in fact, it always has a mean difference interpretation regardless of

what type of variable x1 is.

But it estimates the mean difference in y for two groups who differ by one unit of

our predictor, x1, or in other words, the change in mean y per unit change in x1.

The estimated intercept is required, we need that to fully specify the line, but

frequently, it's not a scientifically relevant quantity.

It estimates the mean of our outcome when x is 0,

and unless there are observations with an x value of zero in our data set,

and hence in our population from which the sample is taken,

this intercept may not have a substantive interpretation.

In the next section, we'll show how to put confidence limits on these linear

regression quantities for both situations when our predictors are binary or

categorical, and when they're continuous.
