A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

210 ratings

From this lesson

Module 4A: Making Group Comparisons: The Hypothesis Testing Approach

Module 4A shows a complementary approach to confidence intervals when comparing a summary measure between two populations via two samples: statistical hypothesis testing. This module will cover some of the most used statistical tests, including the t-test for means, the chi-squared test for proportions, and the log-rank test for time-to-event outcomes.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this lecture set we'll talk about

some other ways to compare proportions between two

populations that will mechanically differ from the two

sample z-test but will be conceptually very similar,

or identical, to the other hypothesis tests

we've done, both for proportions and means.

So upon completion of this lecture section you should

hopefully be able to explain that the chi-square test for

comparing proportions between two populations gives the exact

same p value as the two sample z-test.

Explain the general principle of the chi-square approach.

Interpret the results from an exact test for

comparing proportions between two populations, called Fisher's Exact Test.

And explain the general principle of Fisher's Exact Test.

And then name situations where

the p-value from Fisher's Exact Test is preferable to

a p-value from the chi-squared or two sample z-test approaches.

So again, to start, for the first example

to motivate this discussion, we will again look at

the data from the random sample of 1,000

HIV positive patients enrolled in clinical trials who were starting therapy.

And remember, we had classified them by their CD4 count at the start

of therapy.

And then they were followed up for a similar amount

of time and we have looked at who responded to therapy.

And the resulting estimated proportions based on the sample data were 25% responding in

the group with a starting CD4 count of less than 250, and roughly 16% responding

in the group with starting CD4 counts of greater than or equal to 250.

And we've seen the measures of association, the risk

difference was positive, the relative risk and odds ratios were,

of course, then above one, and all three confidence intervals,

one for each of the three measures, did not include the null value.

And correspondingly the resulting p-value from the two sample z-test was

less than 0.05, in fact, quite a bit less than 0.05, at 0.0003.

So, let's talk about the chi-square,

sometimes referred to as the chi-squared, approach.

The chi-square test is a general test for comparing

binary outcomes between two or more populations.

In this specific case of comparing two

proportions across two populations, the results from the

chi-square test and the two-sample z-test are identical

in terms of the p-value, and both depend

on the same central limit theorem based results.

The only reason I am bringing up the chi-square test in this course is because

you will generally see it in

the literature, as opposed to the two-sample z-test.

Because the chi-square test has a nice feature, in that

it can be extended to more comparisons in one test.

So we can compare proportions between more than two populations in one test,

and we'll show that in lecture ten. So conceptually, the approach to hypothesis

testing with the chi-square test is the same; it's just the mechanics that are different.

And so, what we do with the chi-square test is we specify the two

competing hypotheses, the null and alternative and

then assume the null to be the truth.

And then compute how far the sample-based estimates

are from each other compared to what

we'd expect the difference to be under the null hypothesis.

Translate this distance into a p-value and then make a decision.

Exactly what we did with the two-sample z-test and the two-sample and paired t-tests.

So the two competing hypotheses we're testing again are the null, that

the proportion of responders is the same at the population level in the

population of HIV positive persons with CD4 counts of less than 250 when they start therapy

and the group with CD4 counts greater than or equal to 250 when they begin

a regimen of therapy, versus the alternative that

these two underlying population proportions are not equal.

And we can certainly re-express these.

I'll just do the null, but we could express this as the risk difference

between these two groups being equal to 0.

We could express this as the relative risk, comparing the

proportion in the CD4 count less than 250 group to that in the

CD4 count greater than or equal to 250 group, being equal to 1, or

we could do it in terms of the odds ratio being equal to 1.

These are all ways of expressing

the same underlying null that the

proportions are equivalent at the population level.
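
As a rough illustration of those three scales, the sketch below computes the risk difference, relative risk, and odds ratio from the counts quoted in this lecture (127 of 503 responding in the lower CD4 group, 79 of 497 in the higher one). The variable names are made up for the example, and no confidence intervals are computed.

```python
# Counts quoted in the lecture (responders, group size).
resp_low, n_low = 127, 503     # starting CD4 count < 250
resp_high, n_high = 79, 497    # starting CD4 count >= 250

p_low = resp_low / n_low       # sample proportion responding, about 0.25
p_high = resp_high / n_high    # sample proportion responding, about 0.16

# Three ways of comparing the same two proportions.
risk_difference = p_low - p_high                                # null value: 0
relative_risk = p_low / p_high                                  # null value: 1
odds_ratio = (p_low / (1 - p_low)) / (p_high / (1 - p_high))    # null value: 1

print(f"risk difference: {risk_difference:.3f}")
print(f"relative risk:   {relative_risk:.2f}")
print(f"odds ratio:      {odds_ratio:.2f}")
```

If the two population proportions were equal, the three quantities would sit at 0, 1, and 1 respectively, which is why a single null hypothesis covers all of them.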

So, how do we do this?

So the chi-square test follows a conceptually similar

approach to what we've done with all other hypothesis tests.

We compare what we see in our data to

what we'd expect to have happened under the null hypothesis.

But what we're comparing here is more than a single number now.

What we're going to actually do is compute another two

by two table that represents how we'd expect the cell counts

to fall out, with the same numbers of people in

each of the CD4 count groups and the same percentage responding overall,

under the null hypothesis.

So, the way to think about this is, if we assume

that the underlying proportion of those who respond is the same in both groups,

then

one way to estimate that based on all these data

is to pool the information across these two groups and estimate

the common null proportion based on the total number

of people who respond across the two groups out of the total.

So, we estimate a common proportion under the null hypothesis

of 20.6% responding in each of the two CD4

count groups.

And then what we do to actually get the expected values is apply this

proportion to the numbers in each of the groups that we have.

Let me show you analytically how this is done.

So for example, if we wanted to get the

expected number of people we'd see, under the null hypothesis,

responding in the CD4 count less than 250 group, we take this common proportion,

206 over 1,000, and multiply it by the total number of people

in this group, 503, and this would give us an expected count of 103.6 responders.

Now this is just a theoretical quantity and it doesn't have to be

an integer value even though obviously we

could only observe integer values of people

responding or not.

So, what we see here in this data, at least in

this group: we saw 127 responders out of the 503.

Under the null, we'd expect there to be 103.6.

So, again, the rationale behind that expected value is,

if we assume the underlying proportions of response

are the same, we can pool our information

across the two groups to estimate that common proportion

and then apply it to the total

number of persons in each of the groups to get the expected values.

So if we wanted the expected value here, we could take 206 over 1000 times 497.

And we could do the same thing for the

non-responders, except we'd be estimating the common probability of non-response

in each of these groups under the null, which, instead of 206

out of 1,000, would be the remaining 794 who didn't respond out

of the total of 1,000.

And we could take say for example, 794 over

1000 times 503 to get our expected value here.

But actually, given the structure of the two

by two table, again we are making the assumption

that the row totals and the column totals are fixed, and the

only thing that's random in our data is how the cell counts distribute.

Under that assumption,

once we have the expected cell count for the first cell,

we don't even have to do that multiplication

process because now given these column and row totals

the other three cells have to be whatever

makes things add up in the rows and columns.

So, for example, if there's 503 people in the CD4 count less than 250 group.

And 103.6

of them are expected under the null to respond.

Then the remaining 399.4 would be expected not to respond,

and I'll let you do the math if you're interested.
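
For anyone who does want to check the math, here is a minimal sketch of the expected-count calculation for all four cells. It assumes the observed table implied by the lecture's numbers (127 and 79 responders out of 503 and 497, so 376 and 418 non-responders); the variable names are illustrative only.

```python
# Observed 2x2 table implied by the lecture's counts.
# Rows: CD4 < 250, CD4 >= 250. Columns: responded, did not respond.
observed = [[127, 376],
            [79, 418]]

row_totals = [sum(row) for row in observed]          # group sizes
col_totals = [sum(col) for col in zip(*observed)]    # responders, non-responders
n = sum(row_totals)                                  # everyone in the sample

# Expected count under the null = pooled proportion * group size,
# which is the same as (row total * column total) / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for group, row in zip(["CD4 < 250 ", "CD4 >= 250"], expected):
    print(group, [round(cell, 1) for cell in row])
```

Each row of the expected table still adds up to the group size, and each column to the overall number of responders or non-responders, which is the "fixed totals" assumption at work.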

Again, I wouldn't make you do this by hand in

this class and you should really never do this by hand.

I'm just showing the behind-the-scenes approach once, just to

give you some sense of how we are computing discrepancies

between what we see and what we would expect

when we are comparing two by two tables.

The way we are going to measure the discrepancies here is a

little different and a little less intuitive than what we have done before.

But what we actually do is aggregate the discrepancies between the observed values

(that's what the O is here) and the expected values across each of the four cells,

to come up with one overall discrepancy measure.

What this does is take the difference between the observed and

expected count in each cell, square it, and then divide it by the expected value.

You can think of this as the standardized difference for each cell,

standardized by the expected

variability because of random sampling error.

And what we end up with is a measure called the chi-square.

This is the Greek letter chi, squared.

It measures the overall discrepancy; in this case, it was 13.37.

If you want to verify that by hand, you are welcome

to, but again it's more about the principle than the computation.

The way we do this, though, just to get you started, is in our first cell.

So, 127 responders. We would expect 103.6 under the null.

We take that difference, square it on top, and divide by the expected.

We go to the next cell. There were 79 responders

in the CD4 count greater than or equal to 250 group. If you look at the expected value

under the null, it's a lot higher, 102.4;

we square that difference, divide by 102.4, and you do it

across the four cells, and if you did the

math, you would end up with this number, 13.37.
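
The four-cell sum described above can be sketched in a few lines. This assumes the observed counts implied by the lecture and uses no continuity correction, which is what reproduces the 13.37 quoted here.

```python
# Observed 2x2 table implied by the lecture's counts.
# Rows: CD4 < 250, CD4 >= 250. Columns: responded, did not respond.
observed = [[127, 376],
            [79, 418]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under the null hypothesis of equal response proportions.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum over the four cells of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(f"chi-square statistic: {chi_square:.2f}")
```

Note that the two responder cells contribute most of the total, because their expected counts (the denominators) are much smaller than those of the non-responder cells.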

And then what we do is compare this resulting difference of 13.37

to the distribution of all

such differences occurring under the null hypothesis.

We look at the sampling distribution,

but for this difference.

And the sampling distribution we're going to

use is different than we've seen before.

It's something called the chi-square distribution, hence the name of the test,

with 1 degree of freedom.

And chi-square is a family of distributions, that

are uniquely defined by their degrees of freedom.

And the degrees of freedom depends on the number of comparisons we're making.

The idea behind one degree of

freedom, when we're comparing two proportions between two groups, is this.

We saw this when we were filling in expected values.

If the row totals and column totals, and hence

the grand total, are fixed in a two by two table,

then once I know the value of any one of the cells, the other three

fall out because of addition.

They have to be whatever adds up to the row and

column totals. So in a two by two table, only one of

the cell counts can vary freely. Once I fix that cell count, the others

are fixed by the row and column totals we've taken on.

This is a little trivia piece of where that comes from.
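
To make the "only one cell is free" idea concrete, here is a small sketch: given the fixed row and column totals and a single cell, the other three follow by subtraction. The helper name `fill_two_by_two` is hypothetical, written just for this illustration.

```python
def fill_two_by_two(first_cell, row_totals, col_totals):
    """Return the full 2x2 table implied by its top-left cell and fixed margins."""
    a = first_cell
    b = row_totals[0] - a      # rest of the first row
    c = col_totals[0] - a      # rest of the first column
    d = row_totals[1] - c      # whatever is left in the second row
    return [[a, b], [c, d]]

# Margins from the lecture: 503 and 497 people per group, 206 responders, 794 not.
table = fill_two_by_two(127, row_totals=(503, 497), col_totals=(206, 794))
print(table)
```

Passing the expected count for the first cell (about 103.6) instead of the observed 127 fills in the expected table the same way.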

So, what does a chi-square distribution with 1 degree of freedom look like?

Well, it looks different than the nice symmetric

and bell-shaped sampling distributions we've seen before.

Here, this is heavily skewed.

It can only take on positive values, but the drill is exactly the same.

We figure out where our results fall relative

to the discrepancies we could have gotten, and this

describes the distribution of the discrepancies we could have

gotten under the null, when the null is true

and the underlying proportions of response

are the same at the population level.

And then we look at the likelihood of being that far or farther,

for results as likely or less likely than that.

Because this is a decreasing distribution, only the results

in this direction will be as or less likely.
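
As a sketch of how the statistic becomes a p-value: for 1 degree of freedom, the area to the right of a value x under the chi-square curve equals the two-sided tail area of a standard normal beyond the square root of x, which Python's standard library can express with `math.erfc`. The function name below is made up for the example.

```python
import math

def chi_square_1df_upper_tail(x):
    """Area to the right of x under the chi-square curve with 1 degree of freedom."""
    # P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    return math.erfc(math.sqrt(x / 2.0))

p_value = chi_square_1df_upper_tail(13.37)
print(f"p-value: {p_value:.4f}")
```

To four decimal places this agrees with the 0.0003 reported for the two-sample z-test on the same data.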

And the p-value from this, as I said

before, is exactly the same as what we get from

the two sample z-test, at 0.0003. So why are these two tests equivalent?

Well, it turns out, and this is just another trivia piece to ground this for you,

if we take the test measure, the distance measure

we use for the two sample z-test,

take our observed difference in proportions

and standardize it by its estimated standard error,

and if we square that, what we get

is this chi-squared measure,

the one we got when we summed up the squared differences of the

observed from the expected values over the expected values.

These things are mathematically linked, the two-sample

z-test and the chi-square test, and they are equivalent on different scales.

It also turns out that if you took all

the values that fall under a

standard normal distribution, with mean 0

and standard error 1, which is what we

use as our sampling distribution for the difference in proportions,

squared them,

and then plotted the relative frequency of them, it would be

the chi-square distribution with 1 degree of freedom.
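
Here is a rough simulation of that idea, using only the standard library: square a large number of standard normal draws and compare the fraction beyond a cutoff with the area under the chi-square (1 degree of freedom) density. The seed, the number of draws, and the cutoff of 3.84 (roughly 1.96 squared) are arbitrary choices for this sketch.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Square a large number of standard normal draws.
n_draws = 200_000
squared = [random.gauss(0.0, 1.0) ** 2 for _ in range(n_draws)]

def chi_square_1df_density(x):
    """Density of the chi-square distribution with 1 degree of freedom."""
    return math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi * x)

def upper_tail_from_density(cutoff, upper=60.0, step=0.001):
    """Midpoint-rule approximation of the area to the right of `cutoff`."""
    n_steps = int(round((upper - cutoff) / step))
    return sum(chi_square_1df_density(cutoff + (i + 0.5) * step) * step
               for i in range(n_steps))

cutoff = 3.84
simulated_tail = sum(value > cutoff for value in squared) / n_draws
theoretical_tail = upper_tail_from_density(cutoff)

print(f"fraction of squared normal draws above {cutoff}: {simulated_tail:.3f}")
print(f"chi-square (1 df) area above {cutoff}:           {theoretical_tail:.3f}")
```

Both numbers should land near 0.05, the familiar two-sided tail area beyond 1.96 standard errors on the normal scale.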

So these two tests are doing exactly the same thing using the same data.

If you tried to verify this with our

data, because of rounding these won't look exactly

the same, but if you did it out to multiple decimal places they would be equivalent.
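
A minimal sketch of that check, carried out to full floating-point precision. It assumes the z statistic is standardized with the pooled standard error under the null, which is the version for which squaring z reproduces the chi-square statistic exactly; the variable names are illustrative.

```python
import math

# Counts quoted in the lecture.
x1, n1 = 127, 503    # responders, group size: CD4 < 250
x2, n2 = 79, 497     # responders, group size: CD4 >= 250

# Two-sample z statistic (assumption: pooled standard error under the null).
pooled = (x1 + x2) / (n1 + n2)
se_null = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (x1 / n1 - x2 / n2) / se_null
p_z = math.erfc(abs(z) / math.sqrt(2))            # two-sided normal p-value

# Chi-square statistic from observed and expected counts.
observed = [[x1, n1 - x1], [x2, n2 - x2]]
col_totals = [x1 + x2, (n1 + n2) - (x1 + x2)]
chi_square = sum(
    (o - r * c / (n1 + n2)) ** 2 / (r * c / (n1 + n2))
    for obs_row, r in zip(observed, (n1, n2))
    for o, c in zip(obs_row, col_totals)
)
p_chi = math.erfc(math.sqrt(chi_square / 2))      # 1 df upper-tail p-value

print(f"z = {z:.4f}, z squared = {z ** 2:.4f}, chi-square = {chi_square:.4f}")
print(f"p (z-test) = {p_z:.6f}, p (chi-square) = {p_chi:.6f}")
```

With an unpooled standard error the two statistics would be close but not identical, so that choice is worth stating whenever the tests are compared.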

So, here is one more approach for

getting the p-value when comparing proportions between two

populations, and this is called Fisher's exact test,

and it can be used regardless of the sample size.

The chi-square and two-sample z-tests break down a little bit in smaller samples.

And again, we're not going to worry about the cutoff for that,

but it's good to know that they are large sample based results.

However, regardless of the sample size, the Fisher's

exact test generally gives the same results as the two-sample z

or chi-square test in terms of the decision that will be made.

The larger the sample size, the more

equivalent the p-values will be from both tests.

The reason it's not used universally is partially

cultural, and that's probably because of computational issues.

In today's computer age, a Fisher's exact test,

which is computationally intensive, can be done for

any study, with any sample size, but traditionally it could only be done,

with less computing power, for small sample results,

and that's why it was presented in such situations.

So again, recall the results of this study. As we said before, in the entire

group, ignoring the division between those with starting CD4 counts of less than 250

and starting CD4 counts of greater than or equal to 250...

>> Yeah, the idea of Fisher's exact test is kind of neat.

[CROSSTALK] And I want to talk you through it once.

>> So in the entire group there are 1,000 persons, and 206 responded

and 794 did not.

It compares the observed study results to what is expected under the null.

And it has a clever way of thinking about what is expected under the null.

So, in the entire group of subjects, our HIV positive

individuals, there were 1,000, and 206 of them responded to therapy.

So the way Fisher's exact test works is

it thinks about the data as marbles in a jar.

So it may

do something like this. It may use red marbles to represent

the 206 responders to therapy out of the 1,000. So I'm

not going to draw, [LAUGH] in the director's cut I draw all 206, but

in this version I'm not going to draw all 206,

but these red marbles represent the responders in our sample.

And now I will use 794 blue marbles

to represent the number of non-responders across the CD4 count groups.

[BLANK_AUDIO]

So then Fisher's exact test says, in order to simulate the null

distribution, that is, equal proportions of responders across the two

groups, we'll shake this jar up. And if

we do that,

the resulting jar's proportion of

21% responders should be evenly distributed among the thousand

values that represent both groups.

[BLANK_AUDIO]

And so, what Fisher's exact test would do is say, well, in our study,

of the 503 people who were in the CD4 count group of less

than 250, we observed 127 who responded.

So, what's the probability of choosing 503 marbles

at random from this shaken jar

and getting 127 red

marbles and then 376 blue marbles,

which would represent the 127 responders out of the

503 we saw in the CD4 count less than

250 group? So, it figures out the probability

of this, or of something as or less likely, which is actually computationally intensive.

And that would give us the p-value.

So, that's just kind of a neat way of thinking about the null hypothesis and what we

are actually thinking about when we compute a

p-value, so that's why I showed it to you.
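
The marble picture can be sketched directly with exact counting, using `math.comb` from Python's standard library. The jar holds 206 red and 794 blue marbles, and we draw 503 (the size of the CD4 count less than 250 group). This sketch takes the two-sided p-value to be the total probability of every draw that is no more likely than the observed one, which is one common convention and an assumption here rather than something the lecture spells out.

```python
from math import comb

red, blue = 206, 794      # responders and non-responders in the jar
draws = 503               # marbles drawn: size of the CD4 < 250 group
observed_red = 127        # responders actually seen in that group

total_ways = comb(red + blue, draws)   # all ways to draw 503 marbles from 1,000

def ways(k):
    """Number of ways to draw exactly k red and (draws - k) blue marbles."""
    return comb(red, k) * comb(blue, draws - k)

# Probability of exactly the observed split under the shaken-jar (null) model.
prob_observed = ways(observed_red) / total_ways

# Two-sided p-value: add up every possible split that is as likely or less
# likely than the one observed (exact integer comparisons, no rounding).
k_min = max(0, draws - blue)
k_max = min(red, draws)
p_value = sum(
    ways(k) for k in range(k_min, k_max + 1) if ways(k) <= ways(observed_red)
) / total_ways

print(f"P(exactly {observed_red} red marbles) = {prob_observed:.6f}")
print(f"two-sided exact p-value = {p_value:.6f}")
```

The counting involves very large integers, which is the "computationally intensive" part; Python's arbitrary-precision integers handle it without approximation.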

And I won't expect you to do this by hand, and you won't need marbles or jars for this class.

So when all this shakes down, here are the results we get

with p-values.

The p-value from the two sample z-test we saw from the last lecture was 0.0003.

For the chi-square test, it turned out to be 0.0003, and

we knew that would be the case because these are equivalent tests,

which should result in equivalent p-values.

Turns out in this example though, the p-value

from Fisher's Exact test was exactly the same too.

So here, all 3 agree.

Sometimes in smaller samples, the p-values may differ between the chi-square and

two-sample z approach and Fisher's test.

But even in those situations, generally the decision made is the same.

Let's go to our maternal-infant HIV/AZT example.

I'm just going to quickly show you the

results of the chi-squared and Fisher's exact.

Here are the resulting comparisons on the measures of

association scales. They all show a reduction (this compares AZT

to placebo), a reduced

proportion of HIV transmission among children born to mothers

who took AZT during pregnancy compared to placebo.

None of the intervals include their null values, and the p-value from the two sample

z-test was, sorry, I got carried away and missed a zero here,

it was 0.0001. It turns out if

you do Fisher's exact, it's exactly the same, 0.0001.

So here the p-values agree

from both types of methods as well.

And the interpretation would be the same regardless; since they're

all equal, we can say it all in one fell swoop.

The interpretation of the p-value is the

same regardless of the method used to create it.

So if the data came from two populations

with the same probability of HIV transmission, the

chance of seeing the sample results or something

more extreme is 0.01%, or 1 in 10,000.

How about the aspirin CVD data?

The clinical trial on nearly 40,000 women, 45 years and older,

who did not have CVD at the time of the study.

They received aspirin on alternate days or a placebo for ten

years, and they were followed until their first major cardiovascular event.

And, this is the data we've seen before.

And, if we look at the measures of association comparing aspirin

to placebo, all the estimates favor the aspirin group in

terms of a reduced risk or proportion of cardiovascular events.

But only by a slight amount

and the results were not statistically significant.

All three intervals included their null value.

The p-value from the two sample z-test/chi-square test was

0.15, and I am taking it out to four decimal places

here just so we can compare to Fisher's exact.

But generally I would truncate that and round it to 0.15.

If we look at Fisher's exact, the p-value is 0.1586.

So they're similar but slightly different in value.

But they would result in the same

conclusion, that is failure to reject the null.

Let me show you one more example, a sort of smaller sample,

just so we can compare and contrast the

results from the tests in terms of their p-values.

So, this is a study where 65 pregnant women, all of whom were classified as

having a high risk of pregnancy-induced hypertension,

were recruited to participate in a study

of the effects of aspirin.

Another aspirin study here, on hypertension.

We can certainly use these tests for more than aspirin studies,

as we've seen.

[LAUGH] These women were randomized to receive

either 100 milligrams of aspirin daily

or a placebo during their third trimester of pregnancy.

Here are the results. Of the 34 women who were randomized to receive aspirin,

four developed pregnancy-induced hypertension,

for an observed proportion of 12%.

Of the 31 women who received the placebo,

11, or 35%, developed pregnancy-induced hypertension.

So let's look at the estimates of

association, the measures of association and 95% confidence intervals.

If we're comparing aspirin to placebo, we show a reduced risk. The risk

difference is negative and it's statistically significant, although

the confidence interval is wide because of the small sample size.

And similarly for the relative

risk and odds ratio: both favor the aspirin group

as well, and they have wide confidence intervals as

well, but neither interval includes the null value of 1.

If we look at the p-value from the two sample

z-test or chi-squared test, it comes in at 0.0234.

For Fisher's exact test,

it's 0.0378.

So in this case they are different; the Fisher's exact p-value is slightly higher.

But we would still make the same decision to reject the null.

You can imagine some small sample situations

where, if the p-value is close to

0.05 by either metric, you'd get one that was significant and one that wasn't.

But generally even in these smaller sample situations, the resulting decision will be

the same by either approach: the two-sample

z or chi-squared, or the Fisher's Exact.
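
For the small-sample aspirin and hypertension example, both p-values can be recomputed from the counts given (4 of 34 on aspirin, 11 of 31 on placebo). This is a sketch using only the standard library; it assumes the pooled standard error for the z/chi-square version and the "as likely or less likely" convention for the two-sided Fisher's exact p-value.

```python
import math
from math import comb

# Counts quoted in the lecture: women developing pregnancy-induced hypertension (PIH).
pih_aspirin, n_aspirin = 4, 34
pih_placebo, n_placebo = 11, 31

# Two-sample z / chi-square p-value (large-sample approximation, pooled SE).
pooled = (pih_aspirin + pih_placebo) / (n_aspirin + n_placebo)
se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_aspirin + 1 / n_placebo))
z = (pih_aspirin / n_aspirin - pih_placebo / n_placebo) / se_null
p_chi_square = math.erfc(abs(z) / math.sqrt(2))

# Fisher's exact p-value: a jar with 15 "PIH" and 50 "no PIH" marbles,
# from which the 34 women in the aspirin group are drawn at random.
total_pih = pih_aspirin + pih_placebo
total_no_pih = (n_aspirin + n_placebo) - total_pih
total_ways = comb(total_pih + total_no_pih, n_aspirin)

def ways(k):
    """Ways for the aspirin group to contain exactly k of the PIH cases."""
    return comb(total_pih, k) * comb(total_no_pih, n_aspirin - k)

k_min = max(0, n_aspirin - total_no_pih)
k_max = min(total_pih, n_aspirin)
p_fisher = sum(
    ways(k) for k in range(k_min, k_max + 1) if ways(k) <= ways(pih_aspirin)
) / total_ways

print(f"chi-square / z-test p-value: {p_chi_square:.4f}")
print(f"Fisher's exact p-value:      {p_fisher:.4f}")
```

Both values fall below 0.05, so the decision is the same even though the exact p-value is somewhat larger.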

So in summary,

in this lecture section and the previous one we've seen the two sample z-test,

the chi-squared test, and Fisher's exact test.

They all provide a method for getting a p-value for testing two

competing hypotheses about the true proportions

of a binary outcome between two populations.

And these two competing hypotheses can be expressed

in terms of multiple measures of

association, but only one p-value is needed.

The two sample z-test and chi-square test give exactly the same p-value.

However, the chi-square test is usually what is referred to in the literature.

Two-sample z-test is a nice way to link these computations to what

we did with means, and it's also something that you can do relatively

quickly by hand in a pinch.

But in general, what will be used is the chi-square

test, and the resulting p-value interpretation of it is the same.

Fisher's exact test is a computer-based test; its

results will usually align with the other two tests,

but the resulting p-values may differ slightly in smaller samples.

All three though are generally appropriate for comparing proportions between two

populations, and the resulting p-values are interpreted in the same way.

So in the literature you'll generally see a chi-square test used, and sometimes

the Fisher's exact test, but just know that the resulting p-value interpretation is

the same regardless of the name of the method, and there are multiple names for multiple

methods that end up testing the same hypothesis about binary outcomes

between two populations.

And most important to note from this section is while the mechanics

differ between the three tests, the basic approach is exactly the same.

We set up competing hypotheses.

We measure, in some metric, the distance between what we observed

and what we'd expect under the null, we convert it to a p-value,

and we make a decision.