A practical and example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.


A course from Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods



From the lesson

Module 4: Additional Topics in Regression

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this lecture set, Lecture 10, we're going to talk about something called propensity scores, which will give us another approach to estimating adjusted associations above and beyond traditional multiple regression, where we include all potential confounders as predictors in the multiple regression model.

So, let me give you an overview of what we're going to do in

the next three sections.

In some non-randomized studies there is a very specific outcome/predictor association of interest, but because of the study design, confounding is a threat. In such situations there may be many other potential predictors that can also confound the primary outcome/exposure relationship, but the research interest in these other predictors is only for adjustment of the primary outcome/predictor association. The researcher isn't necessarily concerned with the other associations otherwise, and isn't concerned with quantifying them, etc.

Well, what is the potential problem when there are a lot of potential confounders? Using a finite set of data to estimate a multiple regression model with many predictors, in order to best adjust, may compromise the precision of the main outcome/predictor association.

So for example, suppose we have 200 observations and we're trying to estimate a regression that relates an outcome (whether it be continuous, binary, or time-to-event) to a single predictor, like participation in a program or not; we'll call this exposed or unexposed, where people self-select to be in the program. And we try to adjust for all other measurements we have on the persons.

Well, if we have a fixed amount of data and we have to estimate a bunch of other slopes quantifying relationships on some scale, and if some of these things aren't necessarily related to the outcome per se, then especially in that case we may end up compromising, or watering down, the precision of our main predictor of interest, because we have to estimate some things that aren't statistically significantly associated with the outcome. However, these things may help get us a better adjusted comparison for the outcome/main predictor association.

So in this situation a potential solution is propensity scoring, a method for creating a single measure such that subjects or observations with similar scores are similar in terms of their entire set of confounder values.

So in this lecture set we will briefly discuss, at a conceptual level, the definition and computation of propensity scores. We'll talk about straightforward adjustment using the propensity score as the adjustment factor, as opposed to multiple predictors. And then we'll talk briefly about an approach called propensity score matching of subjects in the exposed and unexposed groups, and the reasons why it may be employed in certain study situations.

So in this first section, Section A, we're just going to set up the idea of propensity scores, define them, and give an example of adjustment with propensity scores; we'll continue on and give more examples in Section B.

So hopefully by the end of this section, you'll be able to define a propensity score, define how it's computed with regard to a primary predictor of interest, and explain how a propensity score can be used to adjust for multiple predictors at once, along with the potential benefits over including each predictor separately in a multiple regression model. In other words, using a propensity score approach to adjust an outcome/exposure relationship, as opposed to putting the potential confounders of interest in separately, one at a time, in a multiple regression model.

So let's just motivate the situation.

Suppose a researcher is interested in the association between an outcome, whether it be continuous, binary, or time-to-event, and a binary predictor: whether the subject was exposed or unexposed. And this has to be estimated from an observational study; the exposure of interest cannot be randomized.

And let me just note that we can apply the approach we're talking about in these lecture sets to situations where the predictor is multi-categorical as well, but we won't expand on the mechanics of that, because conceptually it's exactly the same as what we'll be doing.

So because of the observational study design, confounding is a threat.

However, the researcher has collected data on many potential confounders and

is interested in adjusting for these.

How can she or he proceed?

Well, option one is what we've been talking about in the last three lectures.

Multiple regression where we include the predictor of interest, let's call it x1.

And, then put in potential confounders as separate predictors in the model.

And this might be fine, especially if the researcher were interested in the adjusted association between the outcome and each of these predictors. But suppose he or she isn't; all they really want is the best, most adjusted estimate of the outcome/x1 association.

What are some potential problems with doing the approach we had just talked

about in the last three lecture sets?

Well, the researcher has a limited amount of data. The researcher is only interested in adjusting for the confounders, x2 through xp we'll call them, but is not interested in the associations between the outcome and each of these things. So having to include so many extra x's in the regression model may compromise the precision of our estimate of interest. If some of these confounders are related to each other, and some of them are not statistically significantly related to the outcome, but we want to include them for adjustment nevertheless, we may be estimating things that don't actually add information about the outcome. And that's going to pull away from our ability to estimate the outcome/exposure association of interest with the most precision.

So another option under this scenario, where the researcher is really only interested in quantifying that one association, is to create a single numerical summary such that subjects with similar values on this numerical scale are similar in their values of a potential multitude of confounders, x2 through xp; the summary is just a number. So for example, if we had nine potential confounders we'd be looking at x2 through x10.

So one such approach to creating this single

numerical summary measure is called a propensity score.

And like I alluded to before, this can be estimated and created if the main predictor of interest is binary; it can also be extended to the multi-categorical case.

If our main predictor of interest is continuous, it would have to be dichotomized or categorized in order to proceed with a propensity score approach to adjustment, and we'll see why in a minute.

So, how do we create propensity scores?

Well, here's what we do.

If the main predictor is binary, we'll call it a 1 if the subject is exposed and a 0 if they are unexposed. And we have potential confounders; I'll call them again x2 through xp, where p is just the total number of x's.

Then what we would do to estimate a propensity score for

each observation is perform a logistic regression.

But what we're doing here: our outcome in this logistic regression happens to be our predictor of interest for the main analysis, x1. What we'd estimate with the logistic regression is the log odds of being a 1. So if 1 is exposed and 0 is unexposed, we're estimating, through logistic regression, the log odds of being in the exposed group as a function of the potential confounders.

So what we end up doing is estimating this multiple logistic regression and

getting an equation.

And then what we can do with this equation is, for each observation in our sample, use the logistic regression results to estimate the predicted probability of being in the exposed group, given the observation's values of x2, x3, and so on up through xp.

So, we just plug those values into this equation.

Get an estimated log odds for the observation.

Turn it into an odds.

Then turn it into the estimated probability.

So it's the estimated probability that observations with these values of the potential confounders are in the exposed group. This is called the propensity score for each observation: the propensity, or probability, of being in the exposed group given the confounder values.
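The recipe just described (regress the exposure indicator on the confounders with logistic regression, then turn each observation's fitted log odds into an odds and then into a probability) can be sketched in code. This is a minimal illustration on simulated data, not the course's own software; the confounders, coefficients, and sample size are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): two potential confounders x2, x3
# and a binary exposure x1 whose probability depends on them.
n = 500
x2 = rng.normal(size=n)           # e.g. age (standardized)
x3 = rng.normal(size=n)           # e.g. baseline knowledge score
p_true = 1 / (1 + np.exp(-(-0.5 + 0.8 * x2 + 0.4 * x3)))
x1 = rng.binomial(1, p_true)      # 1 = exposed, 0 = unexposed

# Fit logistic regression of exposure on the confounders via
# Newton's method (iteratively reweighted least squares).
X = np.column_stack([np.ones(n), x2, x3])
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    # Newton step: beta += (X' W X)^{-1} X' (x1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (x1 - p))

# The propensity score: each observation's estimated probability of
# exposure given its confounder values (log odds -> probability).
log_odds = X @ beta
propensity = 1 / (1 + np.exp(-log_odds))
```

Observations with similar `propensity` values are then comparable on the whole set of simulated confounders at once, which is exactly the single-number summary described above.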

So let me give you an example of this based on some research I've done.

And I actually didn't use propensity score adjustment for reasons I'll discuss.

But I'll show you, and compare and contrast, the results I got the traditional way versus using propensity score adjustment.

This'll be of interest to many of you, hopefully.

This is a study I did a few years back.

I was very curious about comparing course outcomes in

statistical reasoning between the online versus on-campus sections.

knowing that the student profiles were potentially very different in terms of demographic and educational characteristics. So I did a survey of students; they consented to participate (not everybody did), and I collected information on demographics and educational history, had students take a baseline test to get a baseline knowledge score, etc.

And so, here's the abstract from the article published on this. The objective was to compare student outcomes between concurrent online and on-campus sections of an introductory biostatistics course. And this was actually back in 2005; it took a while to publish because initially there wasn't much interest in this. And then when all these MOOCs (massive open online courses) became popular through Coursera, etc., there was rejuvenated interest.

As for the methods: we had 95 students in the online section and 92 on campus, all invited to participate in a confidential online survey. These were then linked to the course outcomes by a data manager, and the course outcomes were compared between the two sections, adjusting for differences in student characteristics.

I'll present the results in tabular format in a bit, but there were 72 participants from the online section, a participation rate of 76%, and 66 from the on-campus section, a participation rate of 72%. The unadjusted final exam scores for the online and on-campus sections were 85.1 out of 100 and 86.3 out of 100, respectively, in the first term, and 87.8 and 86.8, respectively, in the second term; these are the averages. So the unadjusted differences were on the order of one point on average.

After adjustment for student characteristics, the average difference in scores between the two sections in term one was negative 1.5: a slightly lower mean score on average for the online section, but not statistically significant. And in term two the adjusted difference was 0.8, that is, 0.8 points higher on average for the online section compared to the on-campus section, but that was not statistically significant either.

And so we concluded, and there's more discussion of this in the article, that the results demonstrate that online and on-campus course formats of an introductory biostatistics course in a graduate school of public health can achieve similar student outcomes, despite potential characteristic differences between the enrollees.

So let me just show you what I mean by characteristic differences.

Here's some information that we collected.

Comparing the results between the online and on-campus groups, let's look at a couple of things. Look at the sex distribution: in the on-campus course less than a quarter were male, versus 42% in the online course, and that difference was statistically significant. For part-time status, only about a fifth of the students on campus were part time, compared to 90% online. The distributions of degree programs were statistically equivalent; however, there was a greater proportion, by 5%, of MPH students online versus on campus.

Similar distribution of prior statistical coursework,

except when we looked at having had statistics in graduate school.

But what was interesting was that there were striking differences in the highest degree obtained. For about two-thirds, or 65%, of the on-campus section, the highest degree they had at that point was a bachelor's degree, whereas in the online section only a little over a quarter had a bachelor's as their highest degree, and nearly half had an MD, as compared to 20% in the on-campus section.

And then another striking difference was the age distribution, which was subsequently correlated with years since school and years since the last math course: the average age in the on-campus section that year was 30 years, versus 38.2 in the online section.

So if any of these things were also associated with course performance, they may confound the unadjusted comparison of exam scores between the online and on-campus sections.

So I did not use propensity scores to adjust for these other factors, because I was also interested in how these other factors associated with course outcomes, even if they didn't necessarily confound the difference between the online and on-campus courses. But I could have used a propensity score approach, and since I have these data, let me show you how that would work.

So if I wanted to do a propensity score approach, what I would first do is

create a propensity score by using the data to run a logistic regression.

Where my outcome would be a 1, if the student was in the online section,

and a 0, if they were on campus.

And what I do is estimate through logistic regression the log odds of being

enrolled in the online section.

As a function of some of those other characteristics I showed.

The potential confounders.

So the x's would include things like age, sex, having an MD degree versus not, years since last in school, baseline knowledge score, and having had a previous course in statistics.

I would run this logistic regression model, and once I had the results, I could ask the computer to estimate, for each of the observations in my dataset, the probability of being enrolled in the online section. We know whether each student was or not, but what we are ostensibly estimating is the probability that students with the same potential confounder values would be enrolled in the online section. And this is the propensity score for each student.

And then what we would do in the next part is estimate the relationship between exam scores and course section, adjusting for these other factors. To do so I would run another multiple regression, in this case a linear regression, because my primary outcome of interest is exam scores, which is measured from 0 to 100, and my primary predictor of interest is an indicator, we'll call it x1, of whether the student was in the online course or on campus.

One way to handle this: I could put the propensity scores in as continuous. But a more common way to do this, because of the underlying linearity assumptions I would otherwise have to check, would be to put the propensity scores into quartiles. The propensity scores themselves are most interpretable in terms of relative comparisons: people with propensity scores closer to each other are more similar in terms of their potential confounder values, so putting them into quartiles is a reasonable thing to do as well. So I put them into quartiles and included them in this model, ostensibly to adjust for the other factors I included in the propensity score computation by using the propensity score.
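That adjustment step can be sketched as follows, using entirely simulated data and hypothetical names rather than the actual course records: bin the estimated propensity scores into quartiles, then regress the exam score on the section indicator plus indicator variables for quartiles 2 through 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical inputs: an estimated propensity score per student,
# an online/on-campus indicator x1, and an exam score outcome.
propensity = rng.uniform(0.1, 0.9, size=n)
x1 = rng.binomial(1, propensity)                # 1 = online section
score = 85 + 20 * propensity - 1.0 * x1 + rng.normal(0, 5, size=n)

# Bin the propensity scores into quartiles (labeled 1..4).
cuts = np.quantile(propensity, [0.25, 0.5, 0.75])
quartile = np.searchsorted(cuts, propensity) + 1

# Design matrix: intercept, x1, and dummies for quartiles 2-4
# (quartile 1 is the reference group).
X = np.column_stack([np.ones(n), x1] +
                    [(quartile == q).astype(float) for q in (2, 3, 4)])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)

# coef[1] plays the role of the propensity-quartile-adjusted mean
# difference in exam scores between the two sections.
adjusted_diff = coef[1]
```

Here quartile 1 serves as the reference category, mirroring the usual dummy-variable coding for a categorical adjustment factor.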

Let me show you what the distribution of propensity scores looks like by course section. These are box plots, and again, what we're estimating is the probability of being in the online section. We see there are some differences: those in the online section, whose characteristics were used to estimate the probability of being in the online section, tend to have higher scores than those on campus. But even though the scores are shifted upwards for the online students, there's a fair amount of crossover in these scores between the two sections.

So let me just show you the results through three different approaches to estimating. I'll give the unadjusted comparisons, the unadjusted mean differences and confidence intervals, and I'll do this for each term: the final exams for term one and term two. Then I'll give the adjusted results that I report in the article; these were adjusted the traditional way, by doing a multiple linear regression of exam scores on course section, plus entering each of the potential confounders as separate x's and estimating their associations as well. And then I reanalyzed these, adjusting with the propensity score approach.

So let's look at, for example, the difference between online and on campus in the first term. The unadjusted mean difference shows a slightly lower score, negative 1.3: a lower score of 1.3 points on average, out of a 100-point test, for the online section compared to on campus, but this was not statistically significant.

After adjustment for the factors that I noted, which were also used in the propensity score, this difference was qualitatively similar, negative 1.5, and still not statistically significant. If I did the adjustment via propensity scores, I got something a little larger in absolute value: negative 1.8 versus negative 1.5. But still, after accounting for the sampling variability, it's not statistically significant.

And in term two, the unadjusted mean difference favored the online section slightly, with an average score difference of 0.9, higher scores for the online section, but this wasn't statistically significant. After adjustment the traditional way, by including the other potential confounders as their own x's in the regression model, this difference was very similar at 0.8, still not statistically significant.

In terms of adjusting with the propensity score approach, where the predictors of the propensity score were the same things I had adjusted for individually as individual x's in the first approach, this difference was a bit larger: 2.4. This was a 20-question exam worth 100 points, so 2.4 points is about half a question. So the magnitude increased with this different type of adjustment, but it still wasn't statistically significant.

So generally speaking, the conclusions I would make would be the same regardless of whether I adjusted the traditional way or with the propensity scores. In this example there isn't any specific benefit of using propensity scores that I can point to: the confidence intervals for the adjusted difference were not necessarily narrower when I used propensity score adjustment versus the traditional approach. And again, I was interested in the contributions of the other factors. But I wanted to give an actual demonstration of how this would be done, using data that I am privy to.

So let's talk first about potential benefits of using propensity scores. It may allow for a more precise estimate of the outcome/predictor-of-interest association than traditional adjustment with multiple regression. In the example I just gave you that wasn't the case: the widths of the confidence intervals for the adjusted mean difference in test scores between the online and on-campus sections were similar whether I did the traditional adjustment, including all my adjustment factors individually as their own x's in a multiple regression, or the propensity score adjustment; it didn't have much effect.

But suppose I had measured 20 or 30 potential confounders on these subjects. Then using the propensity score approach, versus including the 20 or 30 as individual predictors in the multiple regression, really might help me estimate the difference between the online and on-campus sections with a lot more precision.

The other potential benefit is that it provides a single measure for comparing observations in terms of their similarity on many factors. And that's a nice index: if we wanted to cluster, and look at graphics showing the relationship within different groupings based on this, we could actually visually look at adjusting for multiple factors in a single graphic, using the propensity score to measure the similarity.

Now for potential drawbacks. One potential drawback is that, to investigate effect modification, we can't include the potential effect modifier in the propensity score computation. So let's suppose I wanted to see if there were any differences in the outcomes between those who had a previous statistics course versus not; in other words, whether the relationship between exam scores and course section was modified by having had a previous statistics course. What I'd have to do for my final analysis is re-run the propensity scores, not including previous statistics courses as a predictor. And then when I did my multiple linear regression, as I just did with propensity scores, I would include my predictor of course section, my predictor of having had a previous stats course, the interaction (x1 times x2), and then the propensity score quartiles, where the propensity score no longer included previous stats course. So it is just a little bit more cumbersome when we want to involve interactions, but it can be done.
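As a sketch of that interaction model, with simulated data and hypothetical variable names (the point is only the shape of the design matrix): course section x1, a previous-stats indicator x2, their product, and the quartiles of a propensity score that excluded previous stats.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Hypothetical data: section indicator x1, previous-stats indicator
# x2, and quartiles of a propensity score computed WITHOUT x2.
x1 = rng.binomial(1, 0.5, size=n)          # 1 = online section
x2 = rng.binomial(1, 0.4, size=n)          # 1 = previous stats course
quartile = rng.integers(1, 5, size=n)      # propensity quartile, 1..4
score = 85 + rng.normal(0, 5, size=n)      # exam score

# Design matrix: intercept, x1, x2, the x1*x2 interaction, and
# dummies for quartiles 2-4 (quartile 1 is the reference).
X = np.column_stack(
    [np.ones(n), x1, x2, x1 * x2] +
    [(quartile == q).astype(float) for q in (2, 3, 4)])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)

# coef[3] estimates how the online/on-campus difference changes
# for students who had a previous statistics course.
interaction_effect = coef[3]
```

The key point from the lecture shows up in the coding: `x2` appears as its own predictor and in the interaction, but not inside the propensity quartiles.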

Another drawback is the possibility of some observations having similar propensity scores but very different values of some factor. Let me give you a toy example of this. Suppose that increased age is associated with increased probability of being in the online section, and let's suppose increased baseline knowledge score (this wasn't the case, but suppose it were) is also associated with increased probability of being in the online section. Well, we could have somebody who's very young with a very high baseline score, and somebody who's older with a low baseline score, look similar in terms of these two things: the combined contribution of these two factors to their estimated propensity score of being in the online section would be similar, because one would be getting a bump because of their high baseline score but not so much because of their age, and the other would be getting a bump because of their age and not so much from their low baseline score. So it's possible (and this is just an example with two predictors) that these things can operate in tandem to give a false sense of closeness on these distributions. There are ways to investigate that and make sure it doesn't happen, but it's a potential threat.
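To make that toy example concrete numerically, here is a small sketch with made-up coefficients (`b0`, `b_age`, and `b_score` are hypothetical, chosen only for illustration): under this model, a young student with a high baseline score and an older student with a low baseline score land on exactly the same propensity score.

```python
import numpy as np

def propensity(age, baseline, b0=-6.0, b_age=0.10, b_score=0.05):
    """Hypothetical logistic model: log odds of being in the online
    section as a function of age and baseline knowledge score."""
    log_odds = b0 + b_age * age + b_score * baseline
    return 1 / (1 + np.exp(-log_odds))

# A young student with a very high baseline score...
p_young_high = propensity(age=25, baseline=90)   # log odds = 1.0
# ...and an older student with a low baseline score.
p_old_low = propensity(age=45, baseline=50)      # log odds = 1.0

# Both work out to about 0.73: similar propensity scores,
# very different confounder profiles.
```

Checking the overlap of the individual confounders within propensity strata is one way to detect this kind of false closeness.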

But in general, I wanted to give you a sense of what propensity scores are and how they're estimated; in the next section we'll look at some examples from the literature. In sum, propensity scores provide an alternative approach to traditional multiple regression for estimating an adjusted outcome/predictor association. This is especially useful when the predictor of interest is binary, there are many potential confounders of the outcome/predictor relationship, and the adjusted outcome/confounder relationships are not of scientific interest.

Propensity scores are the estimated probabilities of being in the exposed group, as opposed to the unexposed group, estimated from a multiple logistic regression with all potential confounders as predictors. These scores are then estimated for each observation in the sample: in other words, the estimated probability of being in the exposed group for subjects with the same values of the confounders as that particular observation. And then the relationship between the outcome and predictor of interest can be adjusted using only the propensity scores in a regression model.