A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.


A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

235 ratings


From this lesson

Module 4A: Making Group Comparisons: The Hypothesis Testing Approach

Module 4A shows a complementary approach to confidence intervals when comparing a summary measure between two populations via two samples: statistical hypothesis testing. This module will cover some of the most used statistical tests, including the t-test for means, the chi-squared test for proportions, and the log-rank test for time-to-event outcomes.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Okay, in this next lecture set, we're going to look at the hypothesis testing

situations for comparing proportions,

and incidence rates from time-to-event data between two populations.

And we're going to see a lot of similarities with what we did in the last

section with some extensions.

So in this first section, Section A, we're going to look at comparing

proportions between two populations, using the z-test approach.

And this will actually look very similar to what we did with comparing means

between two populations with the two sample t-test approach.

So upon completion of this lecture section,

you will hopefully be able to estimate and interpret a p-value for comparing

proportions between two populations using the two sample z-test approach.

And explain why, even though there are three different measures of association

when comparing binary outcomes between two populations (the difference in

proportions, or risk difference; the relative risk; and the odds ratio),

we only need one p-value to test the corresponding null and

alternative hypotheses in terms of each of these three measures of association.

Let's again look at a sample of HIV positive persons from an urban population,

and look at the relationship between CD4 count at the start of therapy, and

whether or not the subjects responded to therapy.

We've used this example many times before, so you'll recall that among

persons with CD4 counts of less than 250 when they started therapy,

the observed proportion who responded to therapy in the study was 25%,

as compared to nearly 16% among those who had CD4 counts of greater than or

equal to 250 at the start of therapy.

And in the last lecture sections on confidence intervals,

we not only looked at the different measures of association,

the risk difference, relative risk, and odds ratio.

But we also created confidence intervals for each, and we saw that none of

the three confidence intervals included the null value for the respective measure.

So how could we get a p-value to complement these confidence intervals,

and test how likely our sample results are

if the underlying truth is that there's no association between CD4 count

at the start of therapy and response to therapy at the larger population level.

Well, one approach is called the two sample z-test,

and it sounds almost identical, and is certainly analogous to what we call

the two sample t-test when we compare means of continuous data.

Conceptually, the approach is exactly the same

with slightly different mechanics or inputs.

What we do, as we've done with the other hypothesis

tests, is specify our two competing hypotheses,

the null and the alternative, and start by assuming the null to be the truth.

Compute how far our sample result is from what's expected,

the expected difference under the null hypothesis assumption.

Translate this into a p-value reflecting how likely results like ours,

or results even less likely, are under that assumed null.

And then make a decision about whether our results are consistent with what we'd

expect to happen just by chance under the null, or are further outside of that

comfort range.

So let's just talk about the competing hypotheses here.

The null is that the underlying population proportions in the two groups we're

comparing are equal.

So in our example here,

that the underlying proportion of responders in the group with CD4 counts of

less than 250 is equal to the underlying proportion in the group with CD4 counts greater than or

equal to 250, versus the alternative that these two proportions are not the same.

Of course,

we could express this in terms of any of the measures of association we have.

We can express it in terms of the risk difference between the two groups,

being equal to zero under the null.

We could express it in terms of the relative risk,

if we compared the proportions of responders in the population for

those in the less than 250 group to the proportion in

the greater than or equal to 250 group, that ratio.

And the null ratio would be 1, indicating equal percentages, and

we could also do the same.

And I won't write out the full expression for this,

but we could also express this as the odds ratio being equal to 1.

All three of these things are saying exactly the same thing as this null here,

that the underlying population proportions are equal.

So even though the comparison can be measured by three different metrics,

one hypothesis test covers the null and alternative as represented by any of these

three measures.
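As a small illustration of that point (not something shown in the lecture), here is a Python sketch that computes all three measures from two proportions. The function name and the 0.20 value are made up for the example, and 0.16 is only an approximation of the "nearly 16%" from the CD4 example.

```python
import math

def association_measures(p1, p2):
    """Return (risk difference, relative risk, odds ratio) for group 1 vs. group 2."""
    risk_difference = p1 - p2
    relative_risk = p1 / p2
    odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
    return risk_difference, relative_risk, odds_ratio

# Equal proportions (the null): every measure sits at its own null value.
# The 0.20 here is an arbitrary illustrative proportion.
rd_null, rr_null, or_null = association_measures(0.20, 0.20)
print(rd_null, rr_null, or_null)

# Approximate sample proportions from the CD4 example (25% vs. roughly 16%).
rd, rr, odds = association_measures(0.25, 0.16)
print(round(rd, 2), round(rr, 2), round(odds, 2))
```

Whenever the two proportions are equal, the difference is 0 and both ratios are 1, and whenever they differ, all three move off their null values together, which is why one test is enough.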

So let's look at the example we have here,

and compute the distance between the observed results and

what would be expected under the null using the risk difference.

So under the null, we'd expect the risk difference to be zero.

This two sample z-test takes our observed risk difference between the two groups,

the 25% in the first CD4 count group minus the 16% in the other,

the risk difference of 0.09, or 9%.

And divides by the estimated standard error of this risk difference

to figure out, in terms of standard errors,

how far away from that expected value of zero our observed difference is.

And we get a result that is actually, if you do the math here,

3.6 standard errors above what we'd expect under the null, a difference of zero.

So now the question we have to ask, because that's certainly not,

we didn't get exactly what we'd expect under the null.

But is this difference outside the range of likely values we would expect,

just by random chance if the null were true?

So by now, we know well that we could have gotten

an estimate that differed from what we'd expect.

Because these estimates vary across different studies of the same size just by

chance, even if the data were coming from

populations with the same proportion.

So we want to figure out whether our estimate is within the realm of

possibilities, to have happened just by chance, when the null is true or not.

So we're going to ask, how likely is it to get a result that is 3.6, this is

where we ended up, or more standard errors either above zero or

below zero, and you probably all have a knee-jerk reaction.

By now, you know immediately that this p-value is less than 0.05, and

you already figured that out because none of the confidence intervals for

the measures included their null values.

But if we actually used a table or, preferably, a computer to do this analysis,

the p-value's 0.0003.
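For anyone who wants to see the "computer" step, here is a minimal Python sketch of turning a standardized distance into a two-sided p-value under a standard normal curve. The function name is made up for illustration; the only input taken from the lecture is the distance of 3.6 standard errors.

```python
import math

def two_sided_p_value(z):
    """Chance of landing |z| or more standard errors from 0, above or below,
    under a standard normal sampling distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

p_cd4 = two_sided_p_value(3.6)
print(f"{p_cd4:.4f}")  # about 0.0003
```

The sign of the distance does not matter here, since the p-value counts results that far above or that far below zero.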

So the interpretation of this p-value is, if these data came

from two populations with the same proportion, or probability, of response,

the chances of seeing the sample results, the difference that we saw in proportions,

or something even less likely, are 0.0003, 3 in 10,000, so very low.

This could have happened if the data came from two populations with equal

proportions of responders, but it's highly unlikely.

And it's certainly less than our threshold of 0.05, and we would reject the null.

But we already knew we were going to do that since our three measures

of association,

their respective confidence intervals did not include their null values.

How about our maternal/infant HIV transmission study?

Here, again, for

the 80th time in the course is this data presented in two by two table format.

And here are all three measures of association estimates and

the confidence intervals.

This comparison for all of these is in the direction of AZT to placebo.

And you can see the risk difference is negative because the observed

proportion of infants contracting HIV among the mothers given AZT

during pregnancy was much lower,

15 percentage points lower than the group that got the placebo,

and the confidence interval does not include 0.

Similarly, the relative risk and odds ratios are less than 1,

favoring the AZT group, in terms of having a lower proportion with the outcome.

And the confidence intervals for these two values do not include their respective

nulls of 1, so we know we have a result that's statistically significant

at the 0.05 level; these are 95% confidence intervals,

our standard for the course.

So how would we do the two sample z-test?

Well, in terms of the risk difference, the null is the proportion of children

contracting HIV after birth is the same between the mothers in the AZT group and

the placebo group, or that the difference is 0.

So we take what we observed, and

figure out its distance from 0, which is just what we observed,

the difference we observed, divided by its estimated standard error.

So our risk difference was -0.15.

The estimated standard error, which we showed, and we show again here,

but we showed in our confidence interval section was 0.036,

or 3.6%, giving a distance measure of -4.17.
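The arithmetic behind that distance measure can be sketched in a few lines of Python. The function and argument names are made up for illustration; the numbers are the risk difference and estimated standard error quoted above.

```python
def standardized_distance(observed, null_value, standard_error):
    """How many standard errors the observed estimate is from the null value."""
    return (observed - null_value) / standard_error

# AZT vs. placebo: observed risk difference, null difference of 0, estimated SE.
z_azt = standardized_distance(observed=-0.15, null_value=0.0, standard_error=0.036)
print(round(z_azt, 2))  # -4.17
```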

So sometimes, you'll see this referred to as z,

the implication here is that we'll be comparing it to a normal curve always.

As opposed to with the t-test for means where, sometimes in smaller samples

you could use the t-distribution for the sampling distribution.

But in reality, just like I said with the t-test,

just think of this as a standardized distance measure.

So our result was 4.17 standard errors below what is

expected under the null for the difference to be.

So even after accounting for

the potential variation that comes from chance sampling,

we have something that's pretty far away from what we'd expect.

And if we were to translate this into a p-value,

we'd get something way less than 0.05.

And that's consistent with the confidence interval results we show,

none of which include the null value.
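To see why the confidence interval and the test agree for the risk difference, here is a rough Python sketch. It assumes a 95% interval built as the estimate plus or minus 1.96 standard errors (the course may round this to 2), and the helper names are invented for the example. The actual intervals from the lecture slides are not reproduced in this transcript, so the interval below is recomputed from the quoted difference and standard error.

```python
def ci_95(estimate, standard_error, multiplier=1.96):
    """Approximate 95% confidence interval: estimate +/- multiplier * SE."""
    half_width = multiplier * standard_error
    return estimate - half_width, estimate + half_width

def rejects_at_5_percent(estimate, standard_error, null_value=0.0, multiplier=1.96):
    """True when the estimate is more than `multiplier` SEs from the null value."""
    return abs((estimate - null_value) / standard_error) > multiplier

# AZT example: risk difference of -0.15 with estimated SE of 0.036.
low, high = ci_95(-0.15, 0.036)
ci_excludes_null = not (low <= 0.0 <= high)
test_rejects = rejects_at_5_percent(-0.15, 0.036)
print((round(low, 3), round(high, 3)), ci_excludes_null, test_rejects)
```

Both checks ask the same question, whether zero is more than about two standard errors from the estimate, so with the same standard error they give the same answer.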

If you actually use the computer, the p-value here is about 0.0001.

So our interpretation is, if these samples came from two populations with the same

proportion, or probability, of HIV transmission to the infants,

the chances of seeing the sample results, or something even less likely,

are 0.0001, 1 in 10,000, so very low.

And again, we'd reject the null, and

that's consistent with what we saw on the confidence intervals.

But if we only had this p-value, we didn't have those estimates and confidence

intervals, it'd be very hard to comment on the clinical significance of this result.

Or even know whether the AZT group had the lower proportion of

transmissions than the placebo.

So our aspirin and CVD study in women, this is the randomized

trial of nearly 40,000 women who got aspirin daily or

placebo, and were followed for ten years.

They started at 45 years of age or older, and they were followed for 10 years, and

they were looking for their first major cardiovascular event, and

here are the results from the two by two table.

And then here are the three estimates of association and the confidence intervals.

The estimates all compare aspirin

to placebo in terms of the proportion who

incurred cardiovascular disease in the ten year follow-up period.

In other words, the incidence of disease in that follow-up period, and

all of the estimates show a lower proportion in the aspirin group.

But when you take into account the sampling variability,

these results all include the null value in the confidence intervals, so

we would say this is not statistically significant

at the 5% level, and

we already know that our resulting p-value will be greater than 0.05.

But if you actually go and do the computations and get the p-value,

it's 0.15.

So if these data, these samples came from two populations with the same

proportion of CVD cases, then the chances of seeing the sample results, or

something more extreme, is 15%.

Not very unlikely, given our cutoff for unlikeliness, and this is interesting.

This is an example of a high powered study and, again, we'll spend more time talking

about power, its relation to sample size, and interpreting non-significant results.

But this was a high powered study,

there were a large number of women in this study, a large sample size.

And so the take on this by the researchers and the clinical community

was that this showed they didn't just fail to reject this null.

They felt comfortable going with the null as the underlying truth because they had

a large opportunity, a large chance of detecting a difference,

if it really existed.

And so the clinical conclusion was that aspirin is not protective against

cardiovascular disease in women after 45 years old,

and they actually alluded to this in the results section.

They actually said, during follow-up,

477 major cardiovascular events were confirmed in the aspirin group,

as compared with 522 in the placebo group for a non-significant reduction in risk.

And the significance they're talking about here is statistical significance.
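Putting the pieces together, here is a hedged Python sketch of the whole two sample z-test starting from event counts. The event counts of 477 and 522 are the ones quoted above. The group sizes are NOT given in this transcript, which says only "nearly 40,000 women", so the 20,000 per arm below is a hypothetical equal split used purely for illustration. The function name is also made up, and the standard error is the unpooled one used for the confidence interval of the risk difference.

```python
import math

def two_sample_z_test(events_1, n_1, events_2, n_2):
    """Two sample z-test for a difference in proportions (group 1 minus group 2).

    Returns (risk difference, estimated SE, standardized distance, two-sided p-value).
    """
    p1 = events_1 / n_1
    p2 = events_2 / n_2
    # Unpooled standard error of the difference in sample proportions.
    se = math.sqrt(p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2)
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1 - p2, se, z, p_value

# ASSUMPTION: hypothetical group size; the transcript does not report the real ones.
ASSUMED_N_PER_ARM = 20_000

diff, se, z, p_value = two_sample_z_test(477, ASSUMED_N_PER_ARM, 522, ASSUMED_N_PER_ARM)
print(f"difference={diff:.4f}, z={z:.2f}, p={p_value:.2f}")
```

Under that assumed split, the p-value comes out close to the 0.15 mentioned in the lecture, well above the 0.05 cutoff.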

Finally, we'll look at the hormone replacement trial and

the risk of coronary heart disease.

And this was that famous trial that was stopped in the early 2000s,

where in the sample who were randomized to get hormone replacement therapy,

1.9% developed coronary heart disease, compared to 1.5% in the placebo group.

So a higher proportion incurring the disease in the treatment group,

and that's borne out by all three measures of association.

And it was statistically significant,

at least from the confidence intervals' perspective, in that none of these confidence

intervals included the respective null value.

So we know the p-value will be less than 0.05, these are 95% confidence intervals.

But the p-value from the two-sample z-test turned out to be

close to 0.05, but less than it, at 0.042.

And the interpretation is, if these data came from two populations

with the same proportion of developing coronary heart disease,

the chances of seeing the sample results, or

something more extreme, is 0.042, 42 in 1,000.

And so we know the researchers stopped the study based on this result.

So in summary,

the two sample z-test is very analogous to the two sample t-test for means.

It provides a method for getting a p-value for testing two competing

hypotheses about the true proportions of a binary outcome between two populations.

And the underlying null and alternative hypotheses are given by this,

that the proportions are equal versus if they're not.

And these can be rephrased in terms of all three measures of association we've

looked at, these are all saying the same thing.

If the risk difference is 0, the relative risk is 1, the odds ratio is 1, and

that's because the underlying proportions are equal.

Conversely, if the underlying proportions are not equal, the risk difference won't

be 0, the relative risk won't be 1, and the odds ratio won't be 1.

So as such, there's only one p-value needed to test for

the statistical significance of any of these three measures of association.

And the test is performed using the observed difference and its estimated

standard error by the same process that we did with the two sample t-test.

Where we specify null and alternative hypotheses, assume for

the moment the null is true.

Measure how far our result is, our observed result is, from

what we'd expect the difference to be under the null, which is 0, and

we'd translate that into standard errors.

And then, we'd get a p-value based on how far our results

are from what we'd expect under the null, and make a decision.

Sorry, I didn't mean to cross that out.

[LAUGH] I meant to highlight and underline.

Anyway, in the next section, we'll look at yet

another way of doing this same test that is mathematically equivalent to the two

sample z-test, but has some advantages that we'll discuss.