A practical and example filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.


A course from Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

81 ratings


From this lesson

Introduction and Module 1A: Simple Regression Methods

In this module, a unified structure for simple regression models will be presented, followed by detailed treatises and examples of both simple linear and logistic models.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Okay, in this section we'll consider simple logistic regression when our

predictor of interest is now continuous.

And hopefully, this will give you some insight as to why we need to transform the estimated binary outcome from a proportion to a log odds, and why this transformation is necessary to properly estimate logistic regression equations in general, including when predictors are continuous.

We'll also talk about something called a LOWESS plot, which is analogous to the scatterplot presentation for simple linear regression, and which helps us get a snapshot of the relationship between the log odds of the outcome and the continuous predictor x1.

And this will allow us to evaluate whether the relationship is relatively

linear in nature which is an assumption of the logistic regression model.

We'll also learn how to interpret the slope and

intercept from simple logistic regression models with the continuous predictor, and

translate this into an estimated odds ratio.

This will be very similar to what we did in the last section.

So let's just take a little deeper look at the background of

the underlying model.

Why do we have to model the log odds, rather than the proportion itself, as a function of x1?

Well, this becomes an issue when our predictor can be continuous.

So let's just think about it for a minute.

Well you might say, well John, the initial summary statistics we had for

binary data was a proportion.

It's a perfect way to summarize binary data, the proportion or

probability of the outcome occurring.

So why don't we just go ahead and fit a regression model that estimates the proportion as a linear function of our predictor?

And we want this to work for all situations, including the situation where

our predictor is measured as a continuous variable.

Well here's the potential rub with this, the potential difficulty,

is that the way we define proportions, they have to be between zero and one.

So if we're estimating the proportion as a function of a continuous predictor x1, we need an estimation procedure for the logistic regression intercept and slope that ensures valid predictions of the proportion or probability for all x1 values in our sample of data. That is, every combination of the intercept plus the slope times an x1 value must yield an estimate between 0 and 1 for the proportion.

And that's actually a difficult estimation procedure.

To constrain the slopes and

intercepts we get such that they always work with the x1 values in our sample

data to give us an estimate of the proportion between 0 and 1.

That's actually complicated.

So, maybe your first thought is well, we can be a little more flexible.

What if we turn the proportion into the odds, and try to model that as a linear function of our predictor, where our predictor can be continuous?

Well, this might work a little better, but here's the deal with the odds.

If the proportion itself, or probability, can lie between 0 and

1, then let's think about what the odds can be.

Well, let's take the lower bound on the probability or proportion.

If that's close to 0 then our odds is going to be

such that the numerator's close to 0, and the denominator's close to 1.

So when the probability or proportion is close to 0, the odds is close to 0.

However when our probability gets large, gets close to 1, the numerator is close to

1, and the denominator will be close to 0, pushing this towards positive infinity.

So odds lives on a more open range than proportion or probability.

But it's still constrained by the fact that all the estimates of the odds

are positive.

And if we have a continuous predictor trying to estimate intercept and

slope such that all predicted odds for all observations in our data set turn out to

be positive, is again a tricky estimation issue.

So you'll recall, hopefully, from stat reasoning one, that if we take something that lives on the positive number line and log it, the range of possible values covers the entire number line from negative infinity to positive infinity.

So if we translate things to the log odds scale,

any estimated value of the log odds is quote, unquote, legal.

And there's no constraints on estimating the intercept and

slope given a range of x1 values in a data set.

So this is the most flexible approach, and that's why we do things on the log odds scale.

If we take this formulation, where the log odds, ln(p / (1 - p)), is set equal to a linear function of x1, we can exponentiate both sides and then solve for p, the proportion. Another way to express the equation is:

p = e^(β0 + β1x1) / (1 + e^(β0 + β1x1))

So we're actually modeling the probability or proportion as this function.

And you'll notice that this is just another way of saying that no matter

what we get for our beta 1.

And regardless of what the resulting predicted equation is for

a given value of x1, the numerator of this probability formulation or

version of the equation will always be positive because e is a positive number.

And a positive number raised to any power, be it negative or positive, is positive.

And the denominator is always one larger than the numerator.

So with this formulation, if we estimate things on the log odds scale as a linear function of our x1, then when we translate back to the probability scale, we will always get an estimated probability between 0 and 1.

So it's just another way of saying that this estimation process yields legal

values.
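That "legal values" claim is easy to check numerically. Here is a minimal Python sketch, using hypothetical intercept and slope values chosen purely for illustration, confirming that whatever the coefficients are, the back-transformed estimate is always a valid proportion:

```python
import math

def inv_logit(log_odds):
    """Back-transform a log odds value to a probability: e^L / (1 + e^L)."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# Hypothetical coefficients -- any values at all are "legal" on the log odds scale
b0, b1 = 2.0, -0.8
for x1 in [-10, 0, 3, 50]:        # arbitrary predictor values
    p = inv_logit(b0 + b1 * x1)
    assert 0 < p < 1              # always a valid proportion
```

Because the numerator e^L is positive and the denominator is always one larger, no choice of intercept, slope, or x1 can push the result outside (0, 1).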

And we'll show later in this lecture set

how to convert from the log odds scale to the probability scale.

Let's get into some examples with continuous predictors to

really solidify what we're doing here.

So data here is taken from the 2009-10 NHANES,

or National Health and Nutrition Examination Survey.

There's a sample of over 6,400 US residents in these data, between 16 and

80 years old.

And what I want to look at here, based on the 6,400 residents is

the association between being obese and HDL cholesterol level.

So the HDL levels in the sample averaged 52.4 mg/dL, but ranged from a very low 11 to a very high 144 mg/dL.

15% of the sample is classified as

obese in terms of using their BMI measurement to make that classification.

So the question we might want to ask is, can we estimate the association between

the risk of being obese and HDL cholesterol level?

And we can certainly use logistic regression and estimate a line

that relates the log odds of being obese to a linear function of HDL cholesterol.

So here in this formulation p is the probability of being obese and

this is the log odds.

And x1 is the HDL cholesterol level in milligrams per deciliter.

We can certainly get the computer to do this, the question is,

is this a good idea?

The formulation here makes a strong assumption about the nature

of the relationship between the log odds of obesity and the HDL cholesterol level.

As measured on the continuous scale.

So how can we take a look and see whether this is reasonable?

Well, when we had continuous outcomes and a continuous predictor,

we could do a scatterplot.

And see whether it was reasonable to assume

a linear relationship between the mean of our continuous outcome and our predictor.

What we're actually assuming is linear when we have a binary outcome is the log

odds of the outcome as a function of the predictor.

There's no way to directly present this, but

there's something that the computer can do to help aid us in this investigation.

And this here is something called a LOWESS plot,

which I think of as a smoothed scatterplot, which tries to plot a visual of the observed log odds of the outcome here, obesity, as a function of HDL cholesterol level.

What this plot does is, for every value

of HDL cholesterol level in our data set, it goes to that value.

It chooses a small window of values around that.

And for all points in that window,

it computes the proportion of persons who are obese.

And it gives a little more weight to the points that are closer to the value we're estimating for than to the points farther away.

It computes that proportion, turns it into an odds, and takes the log of it, and

then plots that against that particular value of HDL cholesterol level.

And then the window moves over a little bit to the next observed value and

does the same thing.

And so this plots the log odds in small windows of HDL and connects them.

And we can get a sense of what the shape of the relationship is like.
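The windowed computation just described can be sketched in a few lines. This is only a rough approximation of the idea, not the true LOWESS algorithm (which fits weighted local regressions); the version below just takes a tricube-weighted proportion in each window and converts it to a log odds, assuming numpy arrays with y coded 0/1:

```python
import numpy as np

def windowed_log_odds(x, y, half_width):
    """Crude LOWESS-style smoother for a binary outcome.

    At each observed x value, take a window of nearby points, compute a
    distance-weighted proportion of y == 1, and convert it to a log odds.
    x, y: 1-D numpy arrays, with y coded 0/1.
    """
    xs = np.sort(np.unique(x))
    smoothed = []
    for x0 in xs:
        in_win = np.abs(x - x0) <= half_width
        d = np.abs(x[in_win] - x0) / half_width
        w = (1 - d**3) ** 3                  # tricube weights: nearer points count more
        p = np.average(y[in_win], weights=w)
        p = min(max(p, 1e-6), 1 - 1e-6)      # guard against log(0) at the extremes
        smoothed.append(np.log(p / (1 - p)))
    return xs, np.array(smoothed)
```

Plotting the returned log odds against x (here, HDL) gives exactly the kind of picture described above: if the smoothed curve is roughly a straight line, the linearity assumption of the simple logistic model looks reasonable.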

And so for the most part, this visual shows something that's relatively linear.

You may say, well, John, there's this huge drop-off here.

This is one of the problems with this type of graph.

The ends of it can be heavily influenced by one or two data points, or

at least a small proportion.

So in order to investigate what was going on here, I went back and

looked at the data.

The 99th percentile of the HDL cholesterol values is 101.

So there's very few data points over here that are above 101 mg/dL.

But it looks like not so many of them were obese, so that pulls this curve down.

But if we looked at the values between the 1st and 99th percentile cholesterol levels and looked at this function, it would be well described by this line here.

So I'm going to go ahead and

say it's reasonable to assume this association is linear.

And I'm going to estimate an equation using the computer to explain this.

So if we use the computer to estimate the results of this logistic regression,

what we get is something that looks like this.

The log odds of obesity is equal to an intercept of -0.05 +

-0.033, our slope, times cholesterol level.

So how can we interpret this slope of -0.033?

Well, to start, it shows that, at least based on the sample data, the association

between the log odds of obesity and HDL cholesterol level is negative.

Indicating higher HDL is associated with lower log odds, hence, lower risk.

But let's interpret the slope.

Well, generically speaking, we've said the slope of a linear equation equals the difference in the left-hand side for two groups who differ by one unit: the log odds for the group with value HDL + 1, minus the log odds for the group with value HDL, just indicating that these are generically 1 unit apart.

And again, we've said a difference in log odds is akin to the log of the first thing over the second thing.

So this is the log odds ratio of being obese for

two groups who differ by 1 mg/dL in HDL cholesterol level.

So if we exponentiate this, we get the estimated odds ratio of being obese for two groups who differ by 1 unit in HDL.

So this odds ratio estimate is 0.967, or about 0.97.

So this suggests that the odds ratio of being obese for two groups of persons who differ by one unit, which is one mg/dL in HDL levels, is 0.97 for the higher HDL group relative to the lower.

So in other words, subjects whose HDL is higher by one mg/dL have 3% lower odds of being obese when compared to the lower group, where, again, the difference is one mg/dL.

And this estimate is for any two groups who differ by one mg/dL

in HDL in our population from which the sample was taken.

And it only applies to the HDL values we saw in the data set.

So we saw an extreme range of 11 to 144 mg/dL.

And for any one-unit difference in that range, this odds ratio describes the relative odds of obesity.

So what's the interpretation of beta 0? Well, beta 0 = -0.05.

And generically speaking, beta 0 is the log odds of obesity for persons with x1, or HDL, equal to 0.

So again, this is just a placeholder.

Because we don't have any persons in our data set, thankfully,

with an HDL cholesterol level of 0.

We need this to fully specify our regression equation so that we could, for

example, predict the log odds for specific groups based on their HDL level.

But it doesn't have any scientific relevance to our data set.

So whether we exponentiate it or not, this is just a placeholder in our line.

You might say, well, I'm interested in comparing the relative odds of being obese

for two groups of people who differ by more than one unit, so, for example,

those who differ by 20 mg/dL.

And to put some concrete numbers on this, how about 100 mg/dL versus 80?

And, well,

this will turn out to be like it was in other linear comparisons we made.

But let's just write it out to be sure.

The log odds when x1 = 100 is, generically speaking, the intercept + 100 times the slope, versus the log odds when the cholesterol level is 80, which is the intercept + 80 times the slope. If we take that difference, we get 20 times the slope, or 20 times -0.033, which is -0.66.

This estimates the log odds ratio of obesity for

x1 = 100 to x1 = 80.

If we exponentiate this, we get the odds ratio estimate,

which turns out to be about 0.51.

So you see that 3% decrease compounds pretty quickly over a 20-unit difference.

The resulting odds ratio for those at 100 compared to those at 80 milligrams per deciliter is 0.51; those at 100 milligrams per deciliter have 49% lower odds.

So this relationship is additive on the log odds scale, and we'll show in the practice problems how we could do this directly from the original odds ratio without writing things out on the log scale.
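These calculations are quick to reproduce. The Python snippet below just redoes the lecture's arithmetic with the rounded slope of -0.033 (so the 20-unit result comes out near 0.52 rather than the quoted 0.51, a rounding difference), and it also shows the direct route: the 20-unit odds ratio equals the 1-unit odds ratio raised to the 20th power.

```python
import math

slope = -0.033                 # estimated slope from the obesity/HDL model

or_1 = math.exp(slope)         # odds ratio per 1 mg/dL difference in HDL, ~0.967
or_20 = math.exp(20 * slope)   # odds ratio per 20 mg/dL difference, ~0.52

# Additive on the log odds scale means multiplicative on the odds ratio scale:
# compounding the 1-unit ratio 20 times gives exactly exp(20 * slope).
assert abs(or_20 - or_1 ** 20) < 1e-12
```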

Here's another example: data on a random sample of 192 Nepali children between 1 and 3 years old, so it's from the same population of Nepali children we've been looking at.

One of the things we might be interested in, in this age range, is the relationship between breastfeeding and the age of the child. We might want to be able to estimate and quantify that so we can compare it to other populations of children one to three years old.

So what we're going to do is express the age association by a linear equation: we'll look at the log odds that a child is breastfed as a function of his or her age in months.

So here's a smoothed LOWESS scatterplot of the log odds versus the age of the child, and the age ranges, again, from 12 to 36 months.

And there's a little bit of a curve here, but to start, I'm going to suggest that fitting a line is not a bad approximation to what we're seeing. So I'm going to go ahead and do this on the computer.

The estimated equation I get looks like this, the log odds of being breastfed

is equal to 7.3 + -0.24 times age in months.

So again, we have a negative association, at least for these data.

So let's try and interpret these quickly now.

We won't go through the same detail we did before, but we've said that this

slope compares the outcome for two groups who differ by one unit in the x.

Our x is age in months, so this is the difference in the outcome for two groups of children who differ by one month of age.

And, of course, we're looking at difference in log odds so

this is the log odds ratio of being breast fed for

two groups who differ by one month in age.

And that's equal to -0.24, if we exponentiate that,

we get the odds ratio estimate.

It equals about 0.79.

So each month of age multiplies the relative odds of being breastfed by 0.79.

Or another way to say it, each month of age is associated with a 21% reduction

in the odds of being breastfed, okay?

So again, this estimate is for any two groups of children who differ by one month

of age in this population of Nepalese children between 12 and 36 months old.
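The arithmetic here is the same as in the HDL example; the short Python check below just redoes the lecture's calculation of the per-month odds ratio and the corresponding percent reduction in odds:

```python
import math

slope = -0.24                              # estimated slope: log odds of breastfeeding per month of age

or_per_month = math.exp(slope)             # odds ratio per month of age, ~0.79
pct_lower_odds = (1 - or_per_month) * 100  # ~21% lower odds per additional month
```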

So here's a question I'll leave for you to do, and we'll go over in the review questions: what is the estimated relative odds, i.e., odds ratio, of being breastfed for children who are 30 months old compared to children who are 6 months old?

What is the interpretation of the intercept here?

Well, the age of children could potentially be zero, so it's not as absurd, except our age range is between 12 and 36 months; we don't have any newborns.

So the log odds of being breastfed at birth or

when age is zero is estimated by the intercept.

But this doesn't really estimate the log odds of being breastfed at birth, because we don't have any newborns in our data set; we start at one year old.

So even though there are zero-month-old babies when they're newly born, this estimate does not apply to them, because we don't have any in the sample we're using to estimate the equation.

So this is going to be a placeholder, again, that doesn't describe the log odds for anyone in our data set.

We could exponentiate it to get the odds for this group, but again, it doesn't describe anyone in our data.

One more example, in the last section I showed you an example where something that

could've been measured on a continuum, gestational age, was actually categorized.

And I suggested there was a reason for that.

Now, I'm not privy to the raw data, so I can't show you an actual LOWESS plot that I generated here.

But I did have the data grouped by gestational age, and was able to come up with these logistic regression estimates.

But I just want to show you something quickly here.

So I don't know if you recall, but there were four gestational age groups; the reference group was 37 to 40 weeks, and then we had indicators for being 34 weeks, 35 weeks, and 36 weeks, and this was looking at the log odds of respiratory failure.

Let's just graph this just crudely for a moment.

Suppose these are our categories; they're ordinal categories: 34, 35, 36, and then 37 to 41, so I'll just say 37 plus here.

So we start with the reference group; let's make the estimated log odds -5.5, and I'm just going to draw this here.

Okay, so then for 36 weeks: how much larger is this than -5.5 on the log odds scale? Well, it's 2 units larger, so its value is -3.5.

And then, how much larger is 35 weeks than the same reference group? Well, it's 2.8 units, which puts it at about -2.7 on the log odds scale; I'm not drawing this perfectly to scale.

And then finally, when we get to the lowest gestational age group, it was 3.4 units higher than the reference, which puts it at -2.1; again, this isn't perfectly drawn to scale.

So what I'm showing here is that, for

the first three groups it looks like the relationship is roughly linear,

but then it really drops between the 36 and 37 weeks.

So in order to capture this disconnect between these three groups and the full-term babies, instead of smoothing that out, this categorization scheme recognizes that difference in trajectory.
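The arithmetic behind that sketch can be laid out directly. Using the numbers from the lecture (reference log odds of -5.5, with offsets of +3.4, +2.8, and +2.0 for the 34-, 35-, and 36-week indicators), the steps between adjacent categories make the break in trajectory plain:

```python
# Log odds of respiratory failure by gestational age group, reconstructed from
# the reference value and the indicator coefficients quoted in the lecture.
reference = -5.5                            # 37+ weeks (reference group)
offsets = {34: 3.4, 35: 2.8, 36: 2.0, 37: 0.0}

log_odds = {wk: reference + off for wk, off in offsets.items()}
# {34: -2.1, 35: -2.7, 36: -3.5, 37: -5.5}

# Steps between adjacent categories: roughly linear (-0.6, then -0.8)
# until the much larger drop (-2.0) between 36 weeks and full term.
steps = [log_odds[35] - log_odds[34],
         log_odds[36] - log_odds[35],
         log_odds[37] - log_odds[36]]
```

That last step is more than twice the size of the earlier ones, which is exactly the nonlinearity the categorical coding preserves and a straight line in gestational age would smooth away.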

I didn't draw this very well but I'm just trying to give you some insight.

So sometimes it makes sense to take something on a continuum and model it as categorical, if there is not evidence in the data that the association between the log odds of the outcome and our continuous predictor is strictly linear.

So in summary, simple logistic regression can be done with binary, categorical, and

continuous predictors.

When the predictor x1 is continuous, the model estimates a linear relationship between the log odds of y and x1, and there are visual tools to assess whether that assumption is met.

And the resulting estimated slope from logistic regression with a continuous predictor still has a log odds ratio interpretation, and the intercept a log odds when x1 = 0 interpretation, although in many cases that's not relevant to our data.
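To make the summary concrete, here is a minimal sketch of how the estimation itself works, using simulated data rather than the NHANES or Nepal samples (which we don't have access to here). It fits the simple logistic model by Newton-Raphson maximum likelihood, which is essentially what statistical software does under the hood:

```python
import numpy as np

def fit_simple_logistic(x, y, n_iter=25):
    """Fit log odds(y = 1) = b0 + b1 * x by Newton-Raphson maximum likelihood.

    x, y: 1-D numpy arrays, with y coded 0/1. Returns (b0, b1).
    """
    X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)        # linear predictor, clipped for stability
        p = 1 / (1 + np.exp(-eta))              # fitted probabilities
        W = p * (1 - p)                         # logistic variance weights
        grad = X.T @ (y - p)                    # score vector
        hess = X.T @ (X * W[:, None])           # observed information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta
```

Simulating data with a known slope and then checking that exp(b1) recovers the true odds ratio is a good way to convince yourself of the slope's log odds ratio interpretation.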