A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

From this lesson

Module 4A: Making Group Comparisons: The Hypothesis Testing Approach

Module 4A shows a complementary approach to confidence intervals when comparing a summary measure between two populations via two samples: statistical hypothesis testing. This module will cover some of the most used statistical tests, including the t-test for means, the chi-squared test for proportions, and the log-rank test for time-to-event outcomes.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Okay, in this lecture section

we're going to tackle the issue of

hypothesis tests for comparing incidence rates between

two populations, so the focus is on

time-to-event data comparisons between populations.

So upon completion of this lecture section you will be able to describe

two approaches in getting a p-value

for comparing incidence rates between two populations.

The first will look very familiar.

It'll be the two-sample z-test approach, which is based on comparing

the incidence rates through the incidence rate ratio on the natural log scale.

And this will look very similar to the two-sample z-test we did

for comparing proportions between two populations and

the two-sample t-test for comparing means.

There's another test that's commonly used, called

the log-rank test, and what it essentially

does is compare the Kaplan-Meier

curves for the two groups under comparison.

The nice thing about the log-rank test, and we'll look at this in lecture set

11, is that it can be extended to compare more than two populations in one test.

So let's look at our Primary Biliary

Cirrhosis clinical trial data from the Mayo

Clinic, where there were 312 patients randomized

to receive either the drug DPCA or

a placebo.

And the researchers were interested in asking the question: how does mortality,

and hence survival, for PBC patients randomized to receive the drug, DPCA,

compare to survival for PBC patients randomized to receive a placebo?

So, you may recall we summarized these data with the incidence

rates, and they were actually somewhat similar in the two groups.

In the DPCA group, when we computed the

incidence rate, it was 0.075 deaths per year.

In the placebo group, it was 0.071 deaths per year.

So if we computed the incidence rate ratio, you may recall, for

the DPCA group to the placebo group it's 1.06. So the risk of death in the drug

group in the study follow-up period is 1.06 times the risk in the placebo group.

That's what we estimate with the study, so there's a slightly

higher risk, at least in our DPCA sample, compared to the placebo.

We could say subjects in the DPCA group had an estimated 6% higher risk

for death in the follow-up period when

compared to the subjects in the placebo group.

So, you may recall the 95% confidence interval we computed.

Remember, we did that by taking things to the log scale

and then exponentiating, or anti-logging, them back to the ratio scale.

It went from 0.74 to 1.52.

This 95% confidence interval includes the null value for the ratio,

which is one.

So we have kind of a heads up about what

this means about the p-value for comparing the incidence rates.

So think about what this may mean in terms of how

the p-value will relate to the standard cut-off of 0.05, or 5%.
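As a reminder of how that interval comes about, here is a minimal Python sketch of the log-scale computation. It assumes the rounded values quoted in this section: a ratio of 1.06 and a standard error of about 0.18 for the log ratio (that second number is quoted a little later in this lecture). The variable names are only for illustration.

```python
import math

irr = 1.06       # estimated incidence rate ratio, DPCA to placebo (quoted in the lecture)
se_log = 0.18    # standard error of log(IRR), as quoted later in this section
z_crit = 1.96    # normal multiplier for a 95% interval

# Build the interval on the log scale, then exponentiate back to the ratio scale.
log_irr = math.log(irr)
ci_low = math.exp(log_irr - z_crit * se_log)
ci_high = math.exp(log_irr + z_crit * se_log)

# The null value for a ratio is 1.
includes_null = ci_low < 1.0 < ci_high
print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}; includes 1: {includes_null}")
```

Because the inputs are rounded, the endpoints land close to, but not exactly on, the 0.74 to 1.52 reported in the lecture. The point is only that the interval straddles one.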

So there are two approaches to getting

the p-value that generally yield very similar results.

The first is something for which we can hopefully by now easily intuit

the logic, and it's something that can be relatively easily done by

hand, even though the hope is that you use a computer in your research life:

the two-sample z-test approach.

The other is called the log-rank test, which is not easily done by hand

but can be easily implemented on a computer.

So let's first look at the two sample z approach just to ground ourselves.

So the competing hypotheses we're looking at here, the

null and alternative, boil down to the null that

the incidence rates of the outcome, here it's death,

in the two populations we're comparing are the

same versus the incidence rates are not the same.

That's the alternative.

We can express these same two hypotheses via summary measures.

So, we can look at the incidence rate ratio for DPCA

to placebo. The null would be expressed as the true population-level

incidence rate ratio being one, versus the alternative that it's not one.

And then finally, we can look at these on the log scale.

We express this in the log scale, and the null is that the log of

the true population level incidence rate ratio is

zero versus the alternative, that it's not zero.
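Written out in symbols, the three equivalent ways of stating these competing hypotheses look like this. The notation is an assumption made for illustration: the IR terms stand for the true population-level incidence rates in each group, and IRR for their ratio.

```latex
\begin{align*}
H_0 &: IR_{\text{DPCA}} = IR_{\text{placebo}} & H_A &: IR_{\text{DPCA}} \neq IR_{\text{placebo}} \\
H_0 &: IRR = \frac{IR_{\text{DPCA}}}{IR_{\text{placebo}}} = 1 & H_A &: IRR \neq 1 \\
H_0 &: \ln(IRR) = 0 & H_A &: \ln(IRR) \neq 0
\end{align*}
```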

So the approach we're going to take is,

we're going to assume the null is true.

We'll measure the distance

between, and we're going to do this on the log scale, the log

of our estimated incidence rate ratio and what we'd expect it to be

under the null, which is zero, and we'll do this in standard

errors, and then we'll convert this to a p-value to make a decision.

So with these data, we're going to actually

look at the standardized distance that we have.

So, we've got the estimated standard error of the

log incidence rate ratio, which is the square

root of 1 over the number of events

in the placebo group plus 1 over the number of events

in the DPCA group, and we're going to take the log of the observed incidence rate ratio

and calculate its distance from what we'd expect it to be under the null, which is zero.

So, the log incidence rate ratio is 0.06, and its estimated standard error is 0.18.

So our distance measure, in terms of standard errors, is this 0.06 over

0.18, which is about 0.33. I'm calling this Z here.

Sometimes you'll see it referred to generically as a Z in textbooks.

This is really an aesthetic choice.

The only reason it's called Z sometimes is because we're going to

get a P value from the normal distribution sometimes called the Z curve.

But you can think of this measure as just another distance measure between what

we have observed and what we would

expect, standardized by the uncertainty in our estimate.
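For reference, the standard error and the standardized distance just described can be written as follows. The symbols are assumptions for illustration: each E term is the number of events (deaths) in that group, and the hat marks a sample estimate.

```latex
\widehat{SE}\big(\ln \widehat{IRR}\big)
  = \sqrt{\frac{1}{E_{\text{placebo}}} + \frac{1}{E_{\text{DPCA}}}},
\qquad
z = \frac{\ln \widehat{IRR} - 0}{\widehat{SE}\big(\ln \widehat{IRR}\big)}
  \approx \frac{0.06}{0.18} \approx 0.33
```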

So let's get a p-value so we have a result.

We have our result that's 0.33 standard errors above

what we'd expect to happen when the null is true.

And now we know if the truth were equal incidence rates,

then the incidence rate ratio would be one, and the log of it would be zero.

And we're comparing the log here.

But we wouldn't expect all of our studies of two samples from this population

to yield log incidence rates that were exactly equal and differed by zero.

So we know there's some variation.

We got something, and this is not going to be drawn to scale,

something that was 0.33 standard errors above what we expect.

So our p-value is going to be found by

looking at a normal curve and looking at the

proportion of results that are 0.33 standard errors or

more away from the center of zero in either direction.

So the resulting p-value, if you look this up in a normal table the old-fashioned way,

or if you use a computer, or if you just believe me, which is fine, is 0.74.

And this result is not statistically significant.

We didn't expect it to be, because our confidence interval for the incidence

rate ratio included one, and the decision is to fail to reject the null.
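If you'd rather use a computer than a normal table, the lookup can be sketched in a few lines of Python using only the standard library. The inputs are the rounded values quoted above (0.06 and 0.18); the function name is just an illustrative choice.

```python
import math

def two_sided_p_from_z(z):
    """Proportion of the standard normal curve lying |z| or more from zero, in either direction."""
    return math.erfc(abs(z) / math.sqrt(2))

log_irr = 0.06   # log of the estimated incidence rate ratio (quoted in the lecture)
se = 0.18        # estimated standard error of the log ratio (quoted in the lecture)

# Distance from the null value of zero, measured in standard errors.
z = (log_irr - 0) / se
p = two_sided_p_from_z(z)
print(f"z = {z:.2f}, p = {p:.2f}")
```

With these inputs the distance rounds to 0.33 and the p-value rounds to 0.74, matching the numbers in the lecture.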

Now let's look at this other option for

getting a p-value, called the log-rank test.

And this too measures the difference between what

we observe in our samples compared to what

we'd expect, but it measures it through the

Kaplan-Meier curves. So we can think of

this underlying null, that the

incidence rates in the

two populations we're comparing are equivalent versus that they're

not, as being expressed in terms of the underlying

survival curves that we estimate with the Kaplan-Meier method.

Under the null, the true population-level survival curves that

chart the unfolding of deaths over time are equivalent, versus that they're not,

which would be the alternative.

So what the log-rank test does is

go across the follow-up time period.

And wherever there's an event,

at least one event in either of the two groups being compared, it stops and it

compares what happened at that time in terms of events and non-events

in both groups. It makes a little two-by-two table showing

the number of events and non-events in each of the two groups.

And then it creates another two-by-two table that

expresses the expected number of events in each group, and the

expected number of non-events, at that time.

So it's kind of like the chi-squared test we did for binary data,

but it does this at each event time across the entire follow-up period.

And then at each event time, it's got what we'd expect and what we observed. It

measures the discrepancy between those two two-by-two tables

and then aggregates the discrepancies across all the event times.

So it gets one

overall measure of discrepancy between the two curves.

It standardizes that measure by the uncertainty in the estimate coming from sampling

variability, in other words its standard error, and it then

compares that to a distribution to get a p-value.

And the

resulting sampling distribution for this comparison

turns out to be a chi-square distribution, similar

to what we looked at with binary data.

So, here are the Kaplan-Meier curves for the placebo and drug groups.

And what the log rank does, and I can't really demonstrate this perfectly, but

whenever there's an event in at least one of the groups it makes that

two by two table showing events and non-events

in both the groups, the DPCA and the placebo.

And then it makes the expected two by two table.

And then it can measure at that time point the difference between what we

observed and what we expect under the null at that one point in time.

It does the same procedure at all event times across both groups

across the entire follow-up period and aggregates those discrepancies into

a one-number summary, which it standardizes and

then compares to a chi-square distribution.

So this total aggregated discrepancy, added up across all the

event times in both groups across the entire follow-up period,

is compared to the distribution of such discrepancies across samples

of the same size when the null is the truth.

And this gets translated into a p-value, so the end process

is exactly the same as every other hypothesis test we've done.

You need a computer to do this though.

It would be painful and painstaking to do it by hand.

So for the DPCA/placebo comparison, the p-value from the log rank test is 0.75.

Almost identical to the p-value of 0.74 from the two sample z approach.

Of course, the end result is

the same in that we fail to reject the null.
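To make the event-time-by-event-time bookkeeping concrete, here is a minimal Python sketch of a two-group log-rank test. It is not the software used for the lecture's results, and the toy follow-up times at the bottom are hypothetical values invented for illustration, not the PBC data. It assumes each subject has a follow-up time and an indicator that is 1 for an event and 0 for censoring.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    events_* hold 1 where the follow-up time ended in an event, 0 if censored.
    """
    # Every distinct time at which at least one event occurred in either group.
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Subjects still being followed just before time t (the risk sets).
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        n = n_a + n_b
        # Observed events at exactly time t.
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        d = d_a + d_b
        # Expected events in group A if both groups shared one survival curve.
        expected_a = d * n_a / n
        observed_minus_expected += d_a - expected_a
        # Variance of the group A event count for this two-by-two table.
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("no information to compare the groups")
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical toy data: follow-up times in years and event indicators.
drug_times, drug_events = [2, 5, 7, 9, 12, 12], [1, 0, 1, 1, 0, 0]
placebo_times, placebo_events = [1, 3, 4, 8, 10, 12], [1, 1, 0, 1, 1, 0]
chi2, p = logrank_test(drug_times, drug_events, placebo_times, placebo_events)
print(f"chi-square = {chi2:.2f}, p = {p:.2f}")
```

In practice you would let a statistical package do this; the sketch is only meant to show that each event time contributes an observed-minus-expected piece, and that the pieces are summed, standardized, and compared to a chi-square distribution.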

Let's look at our study on

antiretroviral therapy and partner to partner transmission.

This is where more than 1,700

HIV-discordant couples were randomized such that

the HIV-positive partner would either

receive early antiretroviral therapy by a protocol

or standard antiretroviral therapy.

And what they looked at over the

follow-up period was the occurrence of partner-to-partner transmissions.

And they saw relatively few out of the 1,700-plus couples.

Only 28, and only one of those 28 occurred in the group randomized to early therapy.

And so their estimated incidence rate ratio,

they call it the hazard ratio, but for

our purposes, that can be interpreted as

an incidence rate ratio, the same thing, was

0.04.

So, a very large reduction in the risk of transmission among

couples where the HIV-positive partner got the early antiretroviral therapy.

So if we actually ran the numbers on this

and got p-values, the p-value from the two-sample z-test,

and I could do this by hand based on the limited

data I was able to glean from the article, was 0.002.

The p-value

from the log rank test as reported by the authors was less than 0.01.

So these two are consistent.

I don't know exactly what the p-value is from the log

rank and I don't have all the data to compute it.

But certainly our result from the two sample z-test is also less than 0.01.

So these are consistent in their order of magnitude and of course at the 5% level.

Each of these would lead to us rejecting the null and concluding

that there was a difference.

And that the resulting reduction in risk was statistically significant, and we

already saw that from the confidence intervals we created in lecture eight.
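The by-hand version of that two-sample z-test can be sketched in Python from the numbers quoted above: 1 transmission in the early-therapy group, the remaining 27 of the 28 in the standard-therapy group, and an estimated ratio of 0.04. Using the rounded ratio rather than the exact person-time data is an assumption of this sketch, so the result should only be expected to agree approximately.

```python
import math

events_early = 1        # transmissions in the early-therapy group (from the lecture)
events_standard = 27    # the remaining 27 of the 28 transmissions
irr = 0.04              # estimated incidence rate ratio quoted in the lecture

# Standard error of the log ratio: square root of the summed reciprocal event counts.
se_log = math.sqrt(1 / events_early + 1 / events_standard)

# Distance of the log ratio from the null value of zero, in standard errors.
z = math.log(irr) / se_log

# Two-sided p-value from the standard normal distribution.
p = math.erfc(abs(z) / math.sqrt(2))
print(f"SE = {se_log:.2f}, z = {z:.2f}, p = {p:.4f}")
```

The p-value comes out a little under 0.002, in line with the value reported in the lecture, and well below 0.01.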

Just one thing to note, though: what's

the difference really between how the two-sample z-test and the log-rank test work?

The two-sample z-test kind of looks

at one aggregate measure in each group,

just aggregated in one computation.

What the log-rank test does is it also looks at an aggregated measure, but it

breaks it up into pieces, and the pieces are at each event time in each group.

And it turns out the log-rank test,

by the way it's constructed

(and this is not drawn per se to reflect what's in these data),

gives a little more weight in its discrepancy computation

to differences that occur earlier in the

follow-up period compared to differences that occur later.

It gives more weight to the differences where there's more data around the time.

So we may in some

cases get slightly different resulting p-values from the two-sample z-test

and the log-rank test, because the two-sample z-test does not apply

similar differential weighting to discrepancies in terms

of how much they influence the distance measure.

But, as we've seen with the last two examples, in both cases

the resulting p-values were nearly identical, so this is not a big issue.

So sometimes in the literature, when looking at data from time-to-event outcomes,

you'll see p-values reported for comparing

survival curves using tests with other names.

The Wilcoxon test, the Tarone-Ware, and the

Fleming-Harrington are some of the ones that'll pop up.

And these are all similar to the

log-rank test but use slightly different weighting schemes.

For example, the Wilcoxon test weights each

event time by the number of subjects still at risk,

so it puts even more emphasis than the log-rank test on differences

between the Kaplan-Meier curves that occur early in the follow-up period.
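One way to see how these tests relate is to write them as a single weighted statistic. This is a sketch in notation that is assumed here rather than taken from the lecture: at the j-th event time, O_j and E_j are the observed and expected events in one group, V_j is the variance of O_j under the null, n_j is the total number still at risk, and the S-hat term is the pooled Kaplan-Meier estimate just before that time.

```latex
\chi^2_w = \frac{\Big(\sum_j w_j\,(O_j - E_j)\Big)^2}{\sum_j w_j^2\, V_j},
\qquad
w_j =
\begin{cases}
1 & \text{log-rank} \\
n_j & \text{Wilcoxon (Gehan-Breslow)} \\
\sqrt{n_j} & \text{Tarone-Ware} \\
\hat{S}(t_j^-)^{p}\,\big(1 - \hat{S}(t_j^-)\big)^{q} & \text{Fleming-Harrington}
\end{cases}
```

Since n_j is largest at the start of follow-up and shrinks over time, weights that grow with n_j emphasize early event times; the Fleming-Harrington exponents p and q let the analyst shift the emphasis earlier or later.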

So let me interpret everything that we've done

with this anti-retroviral therapy study, into a paragraph.

And I know the authors have done this as well, but this would just be how

I would, if I were writing a paragraph

about the results of this study, summarize it.

I could say something like: in a study of 1,763 HIV sero-discordant couples,

the risk of partner-to-partner transmission among the 886 randomized

to receive early antiretroviral therapy was 96% lower than among the

877 randomized to receive standard antiretroviral therapy.

And I would put in my p-value, and I'm

guilty of this again, leaving in a zero; this is 0.002.

After accounting for sampling variability, early

ART could reduce the risk of partner transmission

by anywhere from 69% to 99% at the population level.

And that's bringing in the confidence interval we computed in lecture eight.

And this p-value of 0.002 means that if the underlying rate of partner-to-partner

transmission were the same in the populations of

sero-discordant couples given early or standard antiretroviral therapy,

then the chance of getting

a sample relative risk, or I should have said incidence

rate ratio here, of 0.04 or something more extreme

is 0.002, or two out of 1,000.

Let's look at our maternal

vitamin supplementation and infant mortality results.

So you may recall, this is a study that randomized pregnant women

in Nepal to receive either vitamin

A during their pregnancy period, beta-carotene, or

placebo, and then it looked at the incidence of infant mortality in the

six months following birth and compared it between the three groups of mothers.

And the estimates showed little difference among the three groups.

The incidence rate ratio comparing the incidence of mortality

in the vitamin A group to the placebo group was 1.05,

so a slightly elevated risk in the vitamin A group.

The ratio comparing the beta-carotene group to the placebo group was 1.0,

indicating equal incidence rates in the six-month follow-up period.

And we saw before that certainly the

confidence intervals for these incidence rate ratios, comparing

vitamin A to placebo and beta-carotene to placebo, included the null value of one.

If we look at the comparison of vitamin A to placebo, the

p-value from the two-sample z-test is 0.55; from the log-rank test it's 0.52.

So they're similar order of magnitude, they differ slightly for the reasons

we've discussed previously.

They would both lead to the same overall decision to fail to reject the null.

Similarly, for the beta-carotene to placebo comparison,

the p-value from the two sample z test

and the log rank differ slightly but they're

both of similar order of magnitude above 0.8.

And again, would lead us to fail to reject the null.

And this

was also a large study.

So even though we haven't formally discussed power yet,

and even though the verbiage is such that the researchers failed

to reject the null of no difference in mortality between

the vitamin A and placebo or beta-carotene and placebo groups, this study had ample

power to detect a difference if it existed at the population level.

So the ultimate conclusion was that this type of

vitamin supplementation was not effective in reducing infant mortality.

So to summarize.

Both the two-sample z-test and the log-rank test can be

used to test competing null and

alternative hypotheses about time-to-event data.

The log-rank is most commonly presented in the literature, but

the two-sample z-test is a nice approach that's easy to implement by

hand and is very similar to the

two-sample t-test for comparing means and

the two-sample z-test for comparing proportions.

Because of slightly different mechanics, the p-values from the

log-rank and the two-sample z may differ slightly

in value, but both tests use the exact same

logic conceptually as all other hypothesis tests we have seen.

We set up the null and alternative hypotheses.

We use some measure of the standardized distance between what we

observed and what we'd expect to have seen under the null.

And then we convert that to a p-value, which

tells us how likely our results, or something even less

likely, would be to have occurred just by chance if

the data samples came from populations with equal incidence rates.