A practical and example filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From this lesson

Introduction and Module 1A: Simple Regression Methods

In this module, a unified structure for simple regression models will be presented, followed by detailed treatises and examples of both simple linear and logistic models.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Welcome to Lecture 2.

In this set of lectures we'll take on the analysis method

called Simple Logistic Regression.

And hopefully you'll see that, although the results may look different on paper from what

we got with Simple Linear Regression, because our outcome type is different,

the spirit is exactly the same.

So in this set of lectures we will develop a framework for

simple logistic regression, which is a method for relating a binary outcome

to a single predictor, and the predictor can be binary as well, categorical, or

continuous, and the way in which we'll relate these two things,

the outcome to the predictor, is again via a linear equation.

So in this first section we're going to take on simple logistic regression when

our predictor of interest is either binary (two categories) or

categorical with more than two categories.

So hopefully at the conclusion of this lecture section, you'll understand how

logistic regression relates a function of the probability or proportion of

individuals with a binary outcome to a predictor via a linear equation.

And you'll be able to interpret the results, the intercept and slope or

slopes, from a simple logistic regression model when the model

includes a binary or categorical predictor. In the next section,

we'll take on the situation where the predictor is continuous.

The first thing we have to establish formally is what

the left hand side of the linear equation for a logistic regression looks like,

and this is a bit more convoluted than it was with linear regression.

With linear regression, the left hand side was simply the estimated mean

of the continuous outcome and

we were predicting that by a linear function of our x1.

For a logistic regression, we have to work off a function of

the probability or proportion of having the binary outcome.

If the binary outcome is a y coded as 1 if the outcome occurs, and 0 if not,

then we're looking at the proportion or probability that y equals 1. For

the function we're going to use, we're going to start by looking at the odds of

the outcome occurring, and recall that the odds is just

the probability of it occurring divided by the probability of it not occurring.

But we're going to look at this on the log scale.
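
As a minimal sketch of the odds and log odds just described (the function names `odds` and `log_odds` and the example probability of 0.75 are illustrative choices, not something taken from the lecture slides):

```python
import math

def odds(p):
    """Odds of an outcome whose probability is p (0 < p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds, the scale used on the left hand side."""
    return math.log(odds(p))

# Illustrative probability only
p = 0.75
print(odds(p))      # 3.0
print(log_odds(p))  # about 1.0986
```

A probability of one half gives odds of 1 and log odds of 0, so positive log odds mean the outcome is more likely than not.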

So our equation is going to relate the log odds of a binary outcome to a single

predictor x by this linear equation, and as noted previously, for

any type of simple regression, x can be binary,

nominal categorical, ordinal categorical, or continuous.
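
Written out, the equation being described has roughly the following form, where p stands for the probability that y equals 1. The symbols follow the lecture's spoken notation; the exact layout on the slide is an assumption.

```latex
\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1,
\qquad p = \Pr(y = 1)
```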

As with everything else we have done thus far, we're only going to be able to

estimate the equation from a sample of data.

So to indicate that the slope and intercept we get are estimates,

again just like we did before, we'll put hats on top of them.

Technically speaking we're only going to be able to estimate the relationship

between the sample based log odds, and the predictor via this equation.

So we should put hats on these p's here to indicate that

we're estimating the proportions that in turn are used to estimate the odds and

the log odds based only on our sample.

But that is not consistently done in the literature;

the hats are left off of the p's, so from here on in I will not include them.

You might say, well, log odds, that doesn't sound like a very natural way of

thinking about things, and for the moment, just take it on faith, if you will,

if you trust me. In the next section,

when we delve into the situation where x is a continuous predictor,

we'll explore the reason for this choice of scaling, and

see why it's necessary and appropriate.

So what we've got here ultimately is an equation

where, if you give me the value of x1 for a group of persons or

subjects, I can plug that into my equation and estimate their log odds.

And as for the slope of x1: remember, when we started talking about linear equations we

focused on the comparison being made by the slopes.

We said generically that a slope compares the estimate of the left hand side

for two groups whose x values differ by one.

So what the slope does here is it compares the log odds of the binary outcome for

two groups who differ by one unit of x1,

and hence this slope estimate is interpretable as an estimated difference

in the log odds of the outcome between two groups who differ by one unit in x1.

You might say, well, that's not very helpful. What is a difference in log odds?

How can I make sense of that?

Well, you may recall those wonderful properties of logarithms and

one of them includes the fact that the difference in

logs can be re-expressed mathematically as a log of a ratio.

So generically speaking, I'm going to write out what the slope estimates here:

it estimates the difference in the log odds of the outcome

for a group whose x1 value is, I'm just going to call it a plus 1,

minus the log odds for a group whose x1 value is 1 unit less, a.

I'm just trying to be generic here and not put in specific numbers;

a could be any number.

The point is that the two groups being compared here in

this difference differ by one unit in the predictor.

So you may recall that the difference in

the log of one thing minus the log of another can be

re-expressed as the log of a ratio: the first thing that we're taking the log of,

the odds for the group whose x1 value is a plus 1,

divided by the second thing we're taking the log of,

the odds for the group whose x1 value is a. So the difference

in the log odds can be expressed as the log of the ratio of those two odds.

So now we're starting to see something that looks a little familiar.

Odds divided by odds gives nothing more than an odds ratio.

The odds ratio comparing the odds of the outcome for two groups who differ by 1

unit in x1, and so what the slope is, is the log of an odds ratio,

and we'll see this with a concrete example and numbers in a minute.

If we have the log of the odds ratio, we can exponentiate or

anti-log that to get an odds ratio estimate.
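
The chain of steps just described can be written compactly as below. This is a sketch in the lecture's notation, with a standing for any value of x1 and OR-hat for the estimated odds ratio; the typesetting is an assumption, not a copy of the slide.

```latex
\begin{aligned}
\ln(\text{odds} \mid x_1 = a+1) - \ln(\text{odds} \mid x_1 = a)
  &= \bigl(\hat\beta_0 + \hat\beta_1 (a+1)\bigr) - \bigl(\hat\beta_0 + \hat\beta_1 a\bigr)
   = \hat\beta_1 \\
\hat\beta_1
  &= \ln\!\left(\frac{\text{odds} \mid x_1 = a+1}{\text{odds} \mid x_1 = a}\right)
   = \ln\bigl(\widehat{OR}\bigr) \\
\widehat{OR} &= e^{\hat\beta_1}
\end{aligned}
```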

So let's start with an example.

Let's go back to our data on anthropometric measures and

other measures from a random sample of Nepalese children, but

I'm going to expand the age range here to be between 0 and 3 years.

So we can explore the relationship between breastfeeding and

characteristics of children up to three years old.

So the first thing we're going to do is see

if there's any relationship between breastfeeding and sex of the child, and

we're going to quantify that.

In our data set the proportion of children in

this age group who are breast fed is three quarters, or

75%, and the sex distribution slightly favors females in the sample:

52% of the sample is female and the remaining 48% is male.

Now, we've seen several times now that if we want to handle a binary predictor such

as sex

as an x, we can code one of the groups as 1 and the other

group as 0, the reference group. So just to mix things up, I'm going to

make males the 1s, and females the 0s, or reference group, for this analysis.

So what we're going to end up doing is estimating

a logistic regression that looks like this:

the log odds of being breast fed, and I'm just going to shorten that to log odds,

equals an intercept plus a slope times our value of sex, x1.

And despite the fact that this is an equation, we're really only estimating two

outcomes here, the log odds of being breast fed amongst the males, and

the log odds of being breast fed amongst the females.

So just to get a little more practice and

understand where the difference in log odds comes from,

let's write out the estimated log odds of being breast fed for both of these sex groups.

So if we do males, this is the log odds when x1 equals 1: we put in our intercept,

beta nought hat, plus our slope times 1, and so the log odds estimate for

male children is the sum of the intercept and that slope, beta 1 hat.

For female children, their value of sex is 0.

So when we plug that in, the slope term disappears, and it's just the intercept.

So again, beta 1 hat here is the difference in the log odds

of being breast fed for males minus the log odds for females.

Another way to think about this is what we said before.

It's the difference in the log odds of being breast fed for males

minus the log odds of being breast fed for females,

and we just saw that this can

be re-expressed as the log of the ratio of the odds

of being breast fed for

male children relative to the odds of being breast fed for female children.

So this slope has a log odds ratio interpretation.
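
To see numerically that a difference in log odds between the two sex groups is the same thing as the log of their odds ratio, here is a small sketch. The counts are hypothetical, made up purely for illustration; they are not the Nepal data.

```python
import math

# Hypothetical counts, made up for illustration (not the lecture's data)
males = {"breast_fed": 150, "not_breast_fed": 50}     # x1 = 1
females = {"breast_fed": 120, "not_breast_fed": 60}   # x1 = 0, reference group

def group_odds(group):
    """Odds of being breast fed within one group."""
    return group["breast_fed"] / group["not_breast_fed"]

# Difference in log odds, males minus females
diff_in_log_odds = math.log(group_odds(males)) - math.log(group_odds(females))

# Log of the odds ratio, males relative to females
odds_ratio = group_odds(males) / group_odds(females)
log_odds_ratio = math.log(odds_ratio)

print(diff_in_log_odds, log_odds_ratio)  # the two values agree
```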

What about the intercept?

Well, we saw that when we looked at the log odds for

the reference group, the log odds

for the females, whose value of x1 is equal to 0, we got the intercept.

So this intercept estimates the log odds of breast feeding for one group,

the estimated log odds of breast feeding

for females, and we could certainly translate that into an odds.

And we'll show in a subsequent section how to backtrack or back compute this into

an estimated probability of being breast fed for females, for example.

So here's the result and equation that we get if we use a computer.

We get that the estimated log odds of being breast fed is

equal to an intercept of 1.12 plus a slope of 0.002 times sex.

So, this slope of .002 again estimates the log odds ratio of being breast fed for

males compared to the reference group of females, and the intercept is

equal to the log odds of being breastfed for female children in the sample.

So, these things are not so informative or helpful on the log scale, so

let's take things and antilog them to get some clarity.

So this beta 1 estimate is 0.002;

that's the estimated log odds ratio of breast feeding for

males to females. So if we exponentiate this, this will give us the estimated

odds ratio, e to the 0.002, which is about equal to 1.002.

So what does this suggest?

The odds ratio of being breast fed for male children to female children is 1.002,

essentially there's no difference in the odds.

The odds that males are breast fed are a slight 0.2% higher

than the odds for females, but this ratio is very close to one.
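
The exponentiation step can be checked in a couple of lines. This sketch uses only the slope estimate quoted in the lecture; the variable names are illustrative.

```python
import math

beta1_hat = 0.002  # estimated slope: log odds ratio, males vs. females

odds_ratio = math.exp(beta1_hat)
percent_higher = (odds_ratio - 1) * 100

print(round(odds_ratio, 3))      # 1.002
print(round(percent_higher, 1))  # 0.2
```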

What is the intercept estimate?

Well, the intercept here is equal to 1.12. This is a log odds;

it does not compare two groups. It's the log odds of being breast fed for

one group, the reference group, the group with x1 equals 0: females.

If we exponentiate this, taking e to the 1.12, we get 3.06.

This is the estimated odds that females are breast fed.

So the odds are roughly three to one that females in this sample are breast fed,

and very shortly (you may even be able to do it now, if you think about it)

we'll show how to convert an odds estimate for any

one group to the probability or proportion of that group that has the outcome.

So we'll be able to translate that into the estimate

of the probability of females in the sample being breast fed.
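
The lecture defers the conversion to a later section. As a preview, and assuming only the definition of odds given earlier (odds = p / (1 - p), which rearranges to p = odds / (1 + odds)), a sketch might look like this. The function name `odds_to_probability` is an illustrative choice.

```python
import math

beta0_hat = 1.12  # estimated intercept: log odds of being breast fed for females

odds_females = math.exp(beta0_hat)

def odds_to_probability(odds):
    """Invert odds = p / (1 - p) to recover p."""
    return odds / (1 + odds)

p_females = odds_to_probability(odds_females)

print(round(odds_females, 2))  # 3.06
print(round(p_females, 2))     # 0.75
```

The result of about 0.75 lines up with the overall breastfeeding proportion quoted for the sample, which is what we'd expect given that the odds barely differ by sex.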

The coding choice for this binary predictor, again, is completely arbitrary.

For this breast feeding and sex analysis,

I want you to think about what the values of the intercept and

slope would be if sex were coded as a 1 for females and a 0 for males, and

then what the subsequent odds ratio comparing males and females and

the odds for females would look like under this scenario.

I'll leave this to the review exercises, so you can start thinking about it now

and actually go through it, if you choose to, when you do the review exercises.

Let's look at another example.

This was a study published in 2010 in the Journal of

the American Medical Association looking at the risk of respiratory failure

after birth and gestational age.

Their context here was that they

were considering late preterm births,

which account for an increasing proportion of prematurity-associated

short-term morbidities, particularly respiratory,

that require specialized care and prolonged neonatal hospital stays.

And they wanted to assess short-term respiratory

morbidity in late preterm births compared with

term births in a contemporary cohort of deliveries in the US.

And so they retrospectively collected electronic data records from 19 hospitals across the US,

and they gathered an impressive amount of data on over 230,000

deliveries in these 19 hospitals between 2002 and 2008.

And what they used ultimately was a multiple, which we'll get to shortly,

logistic regression analysis comparing the risk of

respiratory failure by gestational age.

So we're going to look at an unadjusted analysis to get this started.

So here's the breakdown from their data set, collected from US records: the majority,

or 90% of the sample, had full term gestational ages, 37 to 40 weeks.

Another 5% came in as late preterm at 36 weeks,

another 3% at 35 weeks, and the remaining 2% had a 34 week gestational age.

So even though the gestational age categories are ordinal, the authors did

not want to treat this as a single measure and use the ordinality; we'll talk more about this in the next

section. If we did that,

we would be assuming that the log odds of respiratory failure linearly increases or

decreases with increasing gestational age,

and maybe the jump isn't consistent for each additional week of gestational age.

So what we're going to do, just to explore this, is treat this as categorical to

start, and we'll talk a little bit about the ramifications of that.

So I'm going to make four categories, okay?

We're going to use the four categories I laid out before:

34, 35, and 36 weeks, and then 37 to 40 weeks. So we've got

four categories, and what we're going to do is, like we've done with categorical

variables in linear regression, make one group the reference, and

then create binary indicators separately for each of the three other groups.
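
A sketch of this indicator coding, taking full term (37 to 40 weeks) as the reference and 34, 35, and 36 weeks as x1, x2, and x3, which is the coding the lecture goes on to use. The string labels and the function name `indicators` are illustrative assumptions.

```python
REFERENCE = "37-40"                    # full term: all indicators are 0
INDICATOR_GROUPS = ["34", "35", "36"]  # x1, x2, x3, in that order

def indicators(gestational_age_group):
    """Return (x1, x2, x3) for one gestational age group."""
    return tuple(int(gestational_age_group == g) for g in INDICATOR_GROUPS)

for group in ["34", "35", "36", REFERENCE]:
    print(group, indicators(group))
```

Four categories need only three indicators, because the reference group is identified by all three being 0.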

So, I'm going to be consistent with how the authors did it.

We're going to use full term births, 37 to 40 weeks, as the reference, and

then we're going to create individual indicators for

births that were at 34 weeks, 35 weeks, and 36 weeks, and we're

going to estimate a logistic regression equation that looks like this.

We're going to relate the log odds of respiratory failure to these indicators

via this equation.
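
The equation referred to here is presumably of the following form, with p the probability of respiratory failure (hat omitted, as the lecture does) and x1, x2, x3 the indicators for 34, 35, and 36 weeks. The typesetting is a reconstruction from the spoken description.

```latex
\ln\!\left(\frac{p}{1-p}\right)
  = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 x_3
```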

So what do we get?

For example, suppose we're looking at

children who were born at 34 weeks.

So x1 equals 1, x2 equals 0,

and x3 equals 0.

Well, this equation is going to estimate the log odds of respiratory failure for

this group to be the intercept plus beta one hat times 1

plus the other slopes times 0, so the estimated log odds for

the group born at 34 weeks is beta nought hat plus beta one hat.

For the reference group of 37 to 40 weeks all the x's are 0,

and the log odds for this group, at

37 to 40 weeks, is simply the intercept.

So this slope for the indicator of 34 weeks,

is going to estimate the difference in the log odds of respiratory failure, for

those born at 34 weeks, compared to the reference of 37 to 40 weeks.

So we've shown via that setup that

the intercept is interpretable as the log odds of respiratory failure, for

the reference group.

We showed that the slope beta 1 is equal to the difference in log odds of

respiratory failure for the gestational age group of 34 compared to the reference,

and a difference in log odds,

as we've shown, is interpretable as a log odds ratio.

You can go ahead and prove to yourself that beta 2 hat estimates the log of

the odds ratio of respiratory failure for the group x2 equals 1, which is the group

with 35 weeks gestational age compared to the same reference that we had before.

And that beta 3 hat is the log odds ratio for the group with 36 weeks of

gestational age compared to the reference of full term births.

So here's what the results look like.

The estimated intercept is negative 5.5.

The slope for x1, the indicator of gestational age, 34 weeks, is 3.4.

The slope for x2, the indicator of

gestational age being 35 weeks, is 2.8, and the slope for x3

is equal to 2.0.
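
Plugging these estimates into the equation gives an estimated log odds for each of the four groups. This is a sketch using the rounded values quoted in the lecture; the dictionary layout and function name are illustrative.

```python
# Estimates reported in the lecture (rounded)
beta0_hat = -5.5                            # intercept: reference group, 37 to 40 weeks
slopes = {"34": 3.4, "35": 2.8, "36": 2.0}  # beta1_hat, beta2_hat, beta3_hat

def estimated_log_odds(group):
    """Intercept plus the slope of the one indicator equal to 1, if any."""
    return beta0_hat + slopes.get(group, 0.0)

for group in ["34", "35", "36", "37-40"]:
    print(group, round(estimated_log_odds(group), 1))
```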

So, let's just think about this for a minute.

So 2.0 is the difference on the log odds scale between the group

with 36 weeks of gestational age and the reference of 37 to 40.

So the difference for that one unit difference, if you will

(I've lumped 37 to 40 together, but you can think of

them as qualitatively one unit higher than 36 weeks), is 2.0 on this scale.

The difference between the group at 35 weeks and that same reference is 2.8.

So this difference doesn't double when we go one more week away from the reference;

it grows by an additional 0.8, but

it doesn't double. And then when we go to 34 weeks compared to the reference we get

another 0.6, because the difference between this group and the reference is 3.4.

So it doesn't appear that the association between

these lower gestational ages

and the reference is strictly linear: the log odds difference does

not grow by a constant amount for each one unit change in gestational age.

And as such it's a good idea that

we made these categorical as opposed to treating gestational age as continuous,

because the incremental change in the log odds

is not the same for each one unit change in gestational age across

these four levels.
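
The week-by-week increments being compared here can be tabulated directly from the three slopes. This is a sketch; treating the lumped 37 to 40 week reference as "one unit" above 36 weeks follows the lecture's own simplification.

```python
# Slopes from the lecture: log odds difference versus the 37-40 week reference
slope_by_week = {36: 2.0, 35: 2.8, 34: 3.4}

increments = []
previous = 0.0  # the reference group differs from itself by 0
for week in [36, 35, 34]:
    increments.append(round(slope_by_week[week] - previous, 1))
    previous = slope_by_week[week]

print(increments)  # [2.0, 0.8, 0.6]
```

If the log odds were linear in gestational age, these three increments would be roughly equal.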

So let's try and make sense of this.

So let's, let's start with some odds ratios.

So, we said beta 1 hat equals 3.4,

so e to the beta 1 hat equals the odds ratio of respiratory failure for

kids born at 34 weeks compared to the reference of full term;

that's equal to essentially 30. That's pretty shocking.

This suggests that the odds of respiratory failure

for the group that was at 34 weeks are 30 times those of the reference group.

So that's a huge increase in risk, because the odds are increased 30 times here.

If we do the comparison for the 35 week

gestational age group to the same reference, that's 2.8;

that's beta 2 hat, and if we exponentiate that,

e to the 2.8 is 16.4, so not as high

an increase as with the 34 week gestational age group,

but certainly a substantial increase in the odds of respiratory failure. And

then if we do the last group and exponentiate that 2.0, we get 7.4.
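
All three odds ratios come from the same exponentiation step, which can be sketched as follows using the slopes quoted in the lecture (the dictionary keys are illustrative labels):

```python
import math

# Slopes (log odds ratios versus full term) quoted in the lecture
slopes = {"34 weeks": 3.4, "35 weeks": 2.8, "36 weeks": 2.0}

odds_ratios = {group: math.exp(b) for group, b in slopes.items()}

for group, ratio in odds_ratios.items():
    print(group, round(ratio, 1))
# 34 weeks 30.0
# 35 weeks 16.4
# 36 weeks 7.4
```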

So I think, you know, certainly getting closer to term is better.

But being preterm, regardless of whether it's 34, 35 or 36 weeks, is associated

with a large increase in the relative odds of having respiratory failure.

What is that intercept interpretable as? It's the log odds

of respiratory failure for the reference group, the full term group, and

that's equal to negative 5.5.

We exponentiate this to get the odds for

this group, and this is at least some good news here.

The odds is 0.004, which is relatively low, and we'll see that it translates

into a low probability. So luckily the probability of respiratory failure, which

I haven't shown explicitly, is low in the group with the best outcomes,

and it's certainly worse for the other, lower gestational ages. But

we'll talk about this in a subsequent section:

hopefully, this doesn't result in a large probability of respiratory failure,

even though the odds are increased by a sizeable amount.
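
As a rough preview of that later discussion, the estimated odds for every group can be computed from the quoted coefficients and converted with p = odds / (1 + odds), which follows from the definition of odds. This is a sketch only: the coefficients are rounded, so the resulting figures are approximate, and the names used are illustrative.

```python
import math

beta0_hat = -5.5                                         # reference: 37 to 40 weeks
slopes = {"34": 3.4, "35": 2.8, "36": 2.0, "37-40": 0.0}  # 0 for the reference

# Estimated odds of respiratory failure in each group
group_odds = {g: math.exp(beta0_hat + b) for g, b in slopes.items()}

# Corresponding estimated probabilities
group_prob = {g: o / (1 + o) for g, o in group_odds.items()}

print(round(group_odds["37-40"], 3))  # 0.004
for g in ["34", "35", "36", "37-40"]:
    print(g, round(group_odds[g], 3), round(group_prob[g], 3))
```

When the odds are this small, odds and probability are nearly equal, which is why a large odds ratio applied to a very low baseline can still leave the probability modest.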

So, in summary, logistic regression, again, is a method for

relating a binary outcome to a predictor x via a linear equation,

and the predictor can be binary, categorical, or continuous, but

we've only considered the first two situations, thus far.

What we get is a linear equation that relates the log odds of the binary outcome

to that predictor of interest, and we've shown that the slopes from the logistic

regression, for our x or x's, have a log odds ratio interpretation, and

can be exponentiated to estimate odds ratios.

And the intercept, for these situations, estimates the log odds of

the binary outcome for the group whose x or x's are 0.

So, in the next section we'll continue working on these ideas, but

we'll talk about the situation where we treat x as a continuous measure.