A practical and example filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

81 ratings

From this lesson

Introduction and Module 1A: Simple Regression Methods

In this module, a unified structure for simple regression models will be presented, followed by detailed treatises and examples of both simple linear and logistic models.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Hi, welcome to Lecture One Section B for Statistical Reasoning II.

In this lecture section, we'll make things a little bit more concrete by talking

about a specific type of regression, simple linear regression, and we'll

consider situations where our predictors are binary or nominal categorical.

So, hopefully by the end of this lecture set you'll understand that linear regression in

general provides a framework for estimating means and mean differences.

And be able to interpret the estimated slope or slopes and

intercept from a simple linear regression model with either a binary predictor or

a nominal categorical predictor.

So now let's bring in some specifics about that Left Hand Side I

was leaving as an empty box before.

For a linear regression, the equation is actually relatively straightforward.

The regression model expresses the mean value of a continuous outcome

as a linear function of the predictor x1.

And as noted in the previous section, and this applies to

any type of simple regression, X one can represent a binary predictor.

It can be modified to represent a nominal categorical predictor or

it can represent a continuous predictor.

We could also modify it to represent a nominal categorical predictor, so

we have a lot of flexibility with what our predictor choices look

like when we approach this problem as a regression framework.

So just to actually clarify that what we will be doing is

estimating our regression results from a sample from some larger population.

And to indicate that the intercept and slope quantities we get are just estimates of

some underlying population-level quantities that we can't directly observe,

I'm going to dress these up and

put hats on them to indicate that they're estimates.

And just to keep the notation uniform,

and to be comparable to what you'll likely see in textbooks and

papers and that sort of thing, even though we're modeling the mean,

which would usually be represented with a bar over the top of

the variable y, we'll frequently write this as y hat equals beta

naught hat plus beta one hat x1, where y hat is analogous to y bar,

just the mean of y.

So, for any given value of x1 we can estimate the mean of y via this equation.
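
Written out in symbols, the estimated equation described in words above would look roughly like this (a sketch in standard textbook notation, where the hats mark sample estimates and y hat stands for the estimated mean of y at a given x1):

```latex
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1
```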

And, to remind you,

the slope as we defined it generically compares the outcome on the

left-hand side for any two groups who differ by one unit in x1.

And since our left-hand side is now a mean of a continuous variable,

this slope compares the mean value of y for

two groups who differ by one unit in our predictor x1.

And hence, we'll have a nice interpretation.

The slope will be interpretable as a mean difference between two groups.

So let's look at the first example to get some data on the table and

look at some real results.

So this is data on anthropometric measures from a random sample of

150 Nepali children who were between zero and 12 months old.

So less than a year old.

A question we might ask is,

what is the relationship between average arm circumference and sex of a child?

We've already looked at this before in a t-test context,

let's look at how it shapes up as a regression.

So the data on these 150 children, the mean arm circumference is 12.4 cm.

The variability across children, the standard deviation, is 1.5 centimeters.

And the values range from 7.3 to 15.6 centimeters in this group of

children of mixed ages.

And a little over half, 51% of this sample is female.

Here's a box plot display of the data we'll be looking at.

And remember, we have a binary predictor.

One way to handle that is code one group as a zero, one group as a one.

So our males we will code as zeros arbitrarily.

And our females as ones.

And, so this is a box plot display.

And you can see a couple of things about the distributions.

First of all, the values for

males tend to be more variable than the values for females.

And they tend to sit a little bit higher.

For example, the median for males is at least slightly larger than the median for

females. The 75th percentile

for males is larger than the corresponding percentile for females.

But, there is more variability in that middle box as well.

This is called a scatterplot display, and it's not

particularly useful for this type of situation where the predictor is binary, but

I just wanted to introduce this now for reasons that I'll show in a few slides.

And this is not as informative as the box plot.

What this does is plot all 150 individual measurements of

arm circumference versus the sex of the child, when it's coded as zero for

males or one for females.

So each point here represents one male and

his particular arm circumference on the vertical axis.

Each point here represents one female and

her particular arm circumference on the vertical axis.

So here our y is arm circumference, a continuous measure, and our x, or

x1, is not continuous but binary: male or female.

And as we've laid out previously, how are we going to handle sex as an x in

regression? Well, it only takes on two categories, and one possibility, which is

arbitrary, is to code it as a 0 for male children and a 1 for female children.

So, let's take that approach for the moment.

And the equation we will estimate using these data looks like this.

We estimate the mean arm circumference y hat as a linear function of

sex through this equation.

So we'll end up with an estimated intercept and slope, as well.

So just to be clear and reiterate something,

notice this equation, first of all: it's only estimating two values.

We only have two groups of children as defined by their predictor.

We have females and males.

So we're only ultimately estimating two mean values.

In the generic representation from before, our

estimated mean arm circumference, or y hat, for female children is

the intercept plus the slope for sex times one, since females are coded one.

So the mean for females is the sum of the intercept, and that slope beta one hat.

For males, it's simpler, the beta one hat drops out because males are zero and

we get just the intercept.

So if we were to take the difference in average arm

circumference between females and males, we'd be left with beta one hat.

This estimates the mean difference in arm circumference for

female children compared to male children.

So it's still a slope estimating the mean difference in y for

a one unit difference in x1.

But the only possible one unit difference when our x is binary is between those who

are coded one and those who are coded zero.
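
The point just made, that with a 0/1 predictor the intercept is the mean of the group coded zero and the slope is the difference in group means, can be checked numerically. The following is a minimal sketch using NumPy on made-up toy values (not the Nepali sample); the variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy data, NOT the lecture's sample: x = 0 for males, 1 for females.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = np.array([12.9, 11.8, 13.4, 12.3, 12.1, 12.6, 11.9, 12.8])

# Least-squares fit of y = b0 + b1 * x.
design = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(design, y, rcond=None)

# Group means computed directly, for comparison.
mean_male = y[x == 0].mean()
mean_female = y[x == 1].mean()

print(f"intercept = {b0:.3f}   mean of group coded 0 = {mean_male:.3f}")
print(f"slope     = {b1:.3f}   mean difference (1 vs 0) = {mean_female - mean_male:.3f}")
```

The two numbers on each printed line should agree up to floating-point rounding, which is the sense in which the regression here is just a way of estimating two means.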

So here, done with the aid of the computer and

based on these actual data is the resulting equation.

We estimate the mean arm circumference to be equal to 12.5 plus

the slope of negative 0.13 times sex, or x1, which is a one for females.

So the slope, as we've just clarified, equals negative 0.13.

And from the previous slide, we know this is the estimated mean difference in

arm circumference for female children compared to male children.

In other words, female children have lower arm circumference by

0.13 centimeters on average, relative to males.

The intercept is equal to 12.5 centimeters and

that estimates the mean arm circumference for male children.
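
To see the two estimated means come out of the fitted equation, here is a small sketch that plugs the two possible values of x1 into it. The intercept and slope are the ones quoted above; the function name `estimated_mean` is just an illustrative choice.

```python
def estimated_mean(intercept, slope, x1):
    """Estimated mean of y for a given x1 under y_hat = intercept + slope * x1."""
    return intercept + slope * x1

# Estimates quoted in the lecture (centimeters); x1 = 0 for males, 1 for females.
INTERCEPT = 12.5
SLOPE = -0.13

male_mean = estimated_mean(INTERCEPT, SLOPE, 0)
female_mean = estimated_mean(INTERCEPT, SLOPE, 1)

print(f"males:   {male_mean:.2f} cm")    # 12.50
print(f"females: {female_mean:.2f} cm")  # 12.37
```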

You might say, well you've only got two groups here.

Is that slope really a slope?

Does it really describe the slope of a line?

Well we only need two points to establish any line in space and

what we're estimating are two means.

And in fact the slope is the slope of the line that connects the mean for

the group coded zero, males, to the mean for the group coded one, females.

And this slope, though it's hard to see with the scaling here, is that difference

between the mean for females compared to males of negative 0.13.

The coding choice we made for our sex predictor is completely arbitrary.

There's no reason females have to be one and males have to be 0.

So what I'd like you to ponder, and I'll come back to this in the review exercises, is: for

this arm circumference and sex analysis, what would the values of the intercept and

slope be if sex was coded as a one for males, and a zero for females?

Let's look at another example, data on 2011 hospitalizations.

Data from the nearly 13,000 members of Heritage Health who had a length of stay,

cumulative length of stay of at least one day in the hospital in 2011.

So the question we might have is what is the relationship between average length of

stay and age at first claim?

And what we're going to do is treat age as binary,

as it's represented in the data.

It's either less than 40 or greater than or equal to 40.

And I'm arbitrarily going to make the decision to make it

a 1 if they're less than 40,

and a 0 if they're greater than or equal to 40.

And in these data, the average length of stay for everyone was 4.3 days,

with standard deviation of 4.9 days.

And a range from one day total in 2011 to 41 days total.

And 29% of the observations in this data were from persons whose first

stay in the hospital happened when they were less than 40 years old.

Their first 2011 hospital stay.

So here's a box plot display of these data.

We've already looked at this,

ostensibly, in Stat Reasoning One.

But the distributions of length of stay are right skewed for both groups.

And the distribution shifts up for those who are greater than or

equal to 40 relative to the distribution for those who are less than 40.

So there's a lot of crossover here.

But visually, at least,

those who were greater than or equal to 40

when they were first hospitalized in 2011 tended to

have slightly longer lengths of stay.

But that's a little hard to see in the visual so

we'll ultimately want to quantify that and see if that holds.

So we could fit this regression equation where we relate average length of stay to

our predictor x1, which is coded as I noted before.

So 1 if the subject was less than 40 years old

when they first went into the hospital in 2011.

And a 0 if they were greater than or equal to 40 years old.

So this slope of negative 2.1 is the estimated mean difference for

those who were coded one versus zero.

It's the estimated mean difference in length of stay for persons who were less

than 40 at their first hospital claim in 2011 compared to persons 40 or older.

So the younger group had an average length of stay of

2.1 days less than the older group.

And the intercept estimates the mean length of stay when x1 is 0,

that's the group who were 40 or older at their first claim.

And the average length of stay for that group was 4.9 days.

So what would be the estimated average length of stay for the younger group?

Well, we take this average length of stay for the older group, 4.9.

And add the difference between the two of negative 2.1 and

that would give us the mean length of stay, which would be 2.8 days for

the group who was less than 40 when they entered the hospital in 2011.
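
The arithmetic just described can be laid out as a short worked equation. This is a sketch assembled from the intercept of 4.9 days and the slope of negative 2.1 days quoted above, with x1 equal to 1 for the under-40 group and 0 otherwise:

```latex
\hat{y} = 4.9 + (-2.1)\,x_1
\qquad\Longrightarrow\qquad
\begin{cases}
x_1 = 0: & \hat{y} = 4.9 \text{ days} \\
x_1 = 1: & \hat{y} = 4.9 - 2.1 = 2.8 \text{ days}
\end{cases}
```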

Let's look at another example.

Sometimes regression scenarios include predictors which are not continuous,

not binary, but are multi-categorical.

These are things, especially in the nominal world like subject's race, white,

African-American, Hispanic, Asian, or

some other classification, or say their city of residence,

amongst four different places, Baltimore, Chicago, Tokyo, Madrid for example.

So how can we handle this type of situation when we

have a nominal categorical variable in a regression framework?

So we're going to explore this using the example based on the academic physician

salary results we've looked at previously in Stat Reasoning One.

So this is the study in which data was collected on 800 U.S. academic physicians.

And included information about their yearly salary.

And a lot of additional information was collected including the sex of

the physician and other factors.

One of the other factors that was collected was the geographical region of

the United States where the job was located,

so whether their job was located in the West,

the Northeast, the South, or the Midwest.

So, the question that we might have to start is do average salaries differ by

geographical region and, if so, what is the magnitude of these differences?

So can we do this analysis as linear regression?

Previously, we would have thought of this as an analysis of variance, where we're

comparing mean salaries across four groups.

Can we set up an analysis of variance as a regression?

And, if so, how can we handle a predictor that takes on four categories,

like our predictor, region of the United States?

Well, as a first approach,

you might say, let's arbitrarily give each region a numerical value.

And, just for example, we'll say x1

is a one if their job is in the western part of the United States,

a two if they're in the Midwest, three for

the South, and four for the Northeast.

This is totally arbitrary.

You could do this differently.

And then we estimate an equation that relates the mean salary to region through this

formulation, in which our one x takes on a value of 1 through 4.

This is not a good idea.

That coding I just put out is completely arbitrary.

You could have come back and coded it as a one for the Midwest, a two for

the Northeast, etcetera, and depending on how we code this,

the results we get will be different.

We will be treating this as an ordinal categorical variable and

there's no logical ordering to the categories.

So our estimated regression will depend on how we've

arbitrarily coded these four categories.

Furthermore, this type of coding assumes that the mean salary

difference between regions is incremental.

For example, under the coding I put forward,

this makes the assumption, when it estimates things, that the difference in

average salaries between physicians in the South and the West, who differ by two

units in x, is twice the difference between physicians in the Midwest and the West.

And that's a strong assumption.

That's forcing an ordinality on these data that may not be there.
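
Both problems with a single 1-through-4 code can be seen on a small example. The sketch below uses made-up outcome values (not the physician salary data) and two arbitrary codings. It fits a straight line to each with NumPy and compares the fitted region means with the observed ones. All names and numbers here are hypothetical.

```python
import numpy as np

# Hypothetical toy outcomes (arbitrary units), two observations per region.
data = {
    "west": [10.0, 12.0],
    "midwest": [15.0, 17.0],
    "south": [8.0, 10.0],
    "northeast": [12.0, 14.0],
}

def fitted_means(coding):
    """Fit y = b0 + b1 * code by least squares; return the fitted mean per region."""
    codes = np.array([coding[r] for r, ys in data.items() for _ in ys], dtype=float)
    y = np.array([v for ys in data.values() for v in ys])
    design = np.column_stack([np.ones_like(codes), codes])
    (b0, b1), *_ = np.linalg.lstsq(design, y, rcond=None)
    return {r: float(b0 + b1 * c) for r, c in coding.items()}

# Two equally arbitrary numeric codings of the same nominal variable.
coding_a = {"west": 1, "midwest": 2, "south": 3, "northeast": 4}
coding_b = {"midwest": 1, "northeast": 2, "west": 3, "south": 4}

observed = {r: float(np.mean(ys)) for r, ys in data.items()}
fit_a = fitted_means(coding_a)
fit_b = fitted_means(coding_b)

for r in data:
    print(f"{r:<10} observed={observed[r]:6.2f}  coding A={fit_a[r]:6.2f}  coding B={fit_b[r]:6.2f}")
```

The fitted means change when only the labels change, and under coding A the South-versus-West difference is forced to be exactly twice the Midwest-versus-West difference, whatever the data say.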

And in my consulting collaborations I've seen people run models like

this where they just take the predictor as is.

And stick it in as coded one, two, three or

four and pay no attention to the fact that it's nominal and not ordinal.

And that can obviously have an effect on the results they get.

So, it's easy to get caught up in this idea when you're running models,

of just throwing things in and hitting the button on the computer.

But sometimes you want to pause and think about what you're actually doing.

So how can we handle this better when we have something that's not

inherently ordinal in nature, but is categorical.

Well it is kind of what we set up in the first section, but

now we'll do it specifically.

We designate one region as the reference region, so

I'm going to arbitrarily say the West.

And we make binary indicators for each of the three other regions.

You could do this differently, but

the ultimate conclusions would be exactly the same.

So I'm going to make three indicators for the other three regions. I'm going to

make x1 equal to one if the physician works in the Midwest, and a zero if they do not.

Then I create a variable called x2, which is equal to one if they work in

the South, a zero if not.

And an x3, which is equal to one if they work in the Northeast section of the US,

and a zero otherwise, if they work in any of the other three regions.

So here's a table showing these x values for each region.

The West is the reference region.

It doesn't get its own indicator, its value for each of those is zero.

Because the West is not in the Midwest, not in the South and not in the Northeast.

The Midwest, its indicator is x1, it takes on the value of one.

When the observation comes from somebody who works in the Midwest and has zero for

the other two indicators.

If they're from the South, they're not in the Midwest so

the indicator for the Midwest is a zero.

The indicator for

the South, X2 is a one and the indicator for the North East is a zero.

And similarly for those from the Northeast.

They're not in the Midwest, so that's a 0.

They're not in the South, so that's a 0.

But they are in the Northeast, so that x3 is a 1.
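
The coding just walked through can be sketched in a few lines. This is only an illustration of the reference-group scheme described above, with the West as reference; the function name `indicators` is a hypothetical choice.

```python
# Reference-group coding as described: West is the reference region.
REFERENCE = "West"
INDICATOR_REGIONS = ["Midwest", "South", "Northeast"]  # x1, x2, x3 in that order

def indicators(region):
    """Return (x1, x2, x3): each is 1 if the region matches that indicator, else 0."""
    return tuple(int(region == r) for r in INDICATOR_REGIONS)

# Print the table of x values for each region.
for region in [REFERENCE] + INDICATOR_REGIONS:
    x1, x2, x3 = indicators(region)
    print(f"{region:<10} x1={x1}  x2={x2}  x3={x3}")
```

The reference region gets no indicator of its own, so its row is all zeros.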

So we can now fit the regression model, and it looks like this.

And this is a fancy equation to estimate only four mean salaries,

but it does this in a linear equation framework.
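
In symbols, a model with these three indicators would be written roughly as follows (a sketch in standard notation, with x1, x2 and x3 being the Midwest, South and Northeast indicators defined above):

```latex
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3
```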

And so let's parse this for a minute.

So, what is the intercept estimate?

Well, this is the estimated salary when all of our x's are zero.

Well from the previous slide we saw that all x's were zero when we're looking at

the reference group, the West.

So here the intercept has meaning.

It is the estimated mean salary for physicians from the West.

So this intercept will be a number that estimates the mean salary for

physicians from the west.

And then each slope or each coefficient of the x's estimates the mean

salary difference between the region that has that corresponding x value of one and

the reference region, the western states.

So for example, just to write this out to give you an example.

If we looked at physicians in the midwest, their value of x1 was one and

their value of x2 and x3 is zero.

The model estimates the mean salary for

physicians from that group as being the intercept plus the slope for x1:

beta naught hat plus beta one hat.

For physicians who work in the West, all their x's are zero so, as

we've established before, their estimated mean salary is just the intercept.

So the difference in average salaries between those physicians who work in

the Midwest and those who work in the West is simply the slope for

that indicator of working in the midwest.

So here's the resulting equation.

The intercept is $194,474.

This slope for midwest is $4,416.

The slope for the south is negative $35.

You can think of this as plus negative 35.

So beta two hat, the slope for the South, equals negative $35.

And the slope for the northeast is negative $2222.

So let's make sense of this. For physicians in the West,

their average salary is $194,474 annually.

For physicians in, for example, the Midwest,

their average salary is $194,474 plus the slope for the Midwest,

this $4,416. We could work this out,

but this indicates that those physicians in the Midwest make $4,416 more

yearly, on average, than those physicians in the reference group, the West.

For the Southern physicians, their estimated salary is this intercept again.

Plus the slope for the indicator of the South, plus negative 35.

So on average, physicians in the South make $35 less per year

than those physicians in the West.
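
The sums mentioned for the Midwest and the South can be worked out explicitly. This is a small sketch that uses the intercept of $194,474 and the Midwest and South slopes quoted above; the function name `estimated_mean_salary` is hypothetical.

```python
# Estimates quoted in the lecture (US dollars per year).
INTERCEPT = 194_474                         # estimated mean salary in the West (reference)
SLOPES = {"Midwest": 4_416, "South": -35}   # differences relative to the West

def estimated_mean_salary(region):
    """Intercept for the reference region; intercept plus that region's slope otherwise."""
    if region == "West":
        return INTERCEPT
    return INTERCEPT + SLOPES[region]

for region in ["West", "Midwest", "South"]:
    print(f"{region:<8} ${estimated_mean_salary(region):,}")
# West     $194,474
# Midwest  $198,890
# South    $194,439
```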

And you can write this out for verification, but

if we were looking at physicians in the Northeast,

their average salary is $2,322 less than the average of

$194,474 made by physicians in the West.

So in summary, simple linear regression is a method for estimating the relationship

between the mean value of an outcome, y, and a predictor, x1, via a linear equation.

When x is binary, the slope estimate,

generically called beta 1 hat, estimates the mean difference in y for

the group with x 1 equal to 1, compared to the group with x 1 equal to 0.

The intercept estimates the mean value of Y for

the group with x1 equal to 0, the reference group.

When x is a nominal categorical variable,

this can also be done with the ordinal categorical and

we'll look at examples with that for some of the other regressions.

You need to designate one category to be the reference group and

make separate binary x's for all other categories, and

the slope of those binary indicators estimates the mean

difference between the group whose value is one and the reference group.

What we'll do in the next section is show how to interpret the results from

linear regressions when we allow our X1 predictor to be continuous.