A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

From this lesson

Module 4A: Making Group Comparisons: The Hypothesis Testing Approach

Module 4A shows a complementary approach to confidence intervals when comparing a summary measure between two populations via two samples: statistical hypothesis testing. This module will cover some of the most used statistical tests, including the t-test for means, chi-squared test for proportions, and log-rank test for time-to-event outcomes.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Okay, in Section B, this lecture section, we are going to learn about the

hypothesis testing approach for comparing means between

two populations when the populations are paired.

And we're privy only to sample data from those paired populations.

So we're going to discuss the paired approach.

In this lecture section, you will learn how to estimate and

interpret a P value for a hypothesis test of a mean

difference between two populations for a paired study design.

The method for getting the p-value is called the paired t-test.

The "paired" part comes from the study

design: we're looking at paired samples from paired populations.

And the "t-test" part comes from the fact that

sometimes the sampling distribution used for the

test, just like we saw with Confidence

Intervals for mean differences, is a t-distribution.

So let's go back to something we looked at

when we were developing Confidence Intervals for paired mean differences.

And approach it from the hypothesis testing direction.

So this is a study we looked at

where two different physicians assessed the number of

palpable lymph nodes in 65 randomly selected male

sexual contacts of men with AIDS or an AIDS-related condition.

And this is the data we looked at before. There were two doctors and they

each saw the same 65 patients, and doctor 1, on average, found 7.91 nodes across

the 65 patients as compared with doctor 2, who found a lower number of nodes.

And if we took the difference comparing doctor 2 to doctor

1, it was an average difference of negative 2.75 nodes found.

But there was a lot of variation in the patient-to-patient differences

between the two doctors.

And so we had constructed a Confidence Interval for

the difference in the mean number of lymph nodes,

doctor 2 compared to doctor 1, and it only included negative values,

indicating that doctor 2 was systematically estimating a

lower number of nodes on average than doctor 1.

And we said this wasn't a study per se about these

two doctors, but it showed that this method of diagnosis was

not reproducible.

So had all such men been examined by

these two physicians, the average difference in number

of lymph nodes discovered by the two physicians

would be between negative 3.45 and negative 2.05.

That's how we can interpret or give

some substantive context to the Confidence Interval.

And, like I said, all possibilities for this true

mean difference are negative; zero's not in the interval.

Let's take it now from the hypothesis testing approach.

With the Confidence Interval, what we did in lecture

80 was start with our estimated mean difference

and use our understanding of the sampling

behavior of such mean differences around their truth

to construct an interval that with

high probability would include the unknown truth.

For hypothesis testing, we're going to start in the opposite direction.

We're going to make a statement about the truth.

And then see how closely or

not our sample results jibe with that assumption.

So what we're going to do is set up two competing hypotheses. The null is that

the two doctors have the same mean: had they examined all such patients

and we looked at the mean number of nodes found

by each of the two doctors, the means would be equivalent.

And the very broad alternative is that in fact the means

of the two doctors at the population level are not equivalent.

Another way of phrasing them, is in terms of

a parameter, a mean difference between the two populations.

And this is something we estimate with our sample mean difference.

So, you could also rephrase this null as: the

mean difference between doctor 2 and doctor 1 is zero.

Versus the alternative that it's not.

And then what we start by doing is assume the null's true: assume that these

data came from a population of men such that the true mean difference in number

of nodes found at the population level, had

the doctors examined all such patients, is zero.

And what we're going to do is figure out how far

our observed mean difference was from zero, in terms of standard errors.

So we're going to standardize it by the potential

variability in our mean difference estimates around this assumed

truth of zero.

So I'm going to call this t, just to be consistent with textbooks that you'll see.

But this is really just a distance measure.

Like the statistical mile: we're converting our distance

in terms of node counts into standard errors.

And we understand the relationship between standard

errors and distance under a normal curve,

which is what the sampling distribution will look like.

So how we're

going to compute this is we take what

we observed, the mean difference between doctor 2

(I'm going to abbreviate here)

and doctor 1 in our study, and divide it by the estimated standard

error of this difference based on our study.

And I should note that on the top, we're taking the distance

between what we saw and what we'd expect under the null.

And under the null, our assumption

is that the two populations have equal means.

So, we'd expect the mean difference on average to be zero.

So, this is our raw distance on top,

divided by the number of nodes in a standard error.

So, let's do this.

The difference between doctor 2 and doctor 1

was negative 2.75 nodes on average.

We have the standard deviation of the differences, which was 2.83.

And we had 65 pairs.

That's just our usual standard error formula.

And so, when all the dust settles, we get a value of -7.83.

So our result, our difference of negative 2.75 nodes across the

65 patients, is 7.83 standard errors below what we'd expect

the difference to be, zero, if the null is true.
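
As a rough check on the arithmetic above, the distance calculation can be sketched in Python. The function name `paired_t_statistic` is only illustrative, and the standard deviation of 2.83 is an assumption: it is the value implied by the reported result of -7.83 and by the earlier Confidence Interval of -3.45 to -2.05.

```python
import math

def paired_t_statistic(mean_diff, sd_diff, n_pairs, null_value=0.0):
    """Distance between the observed mean difference and the null value,
    measured in estimated standard errors."""
    standard_error = sd_diff / math.sqrt(n_pairs)
    return (mean_diff - null_value) / standard_error

# Doctor 2 minus doctor 1: mean difference -2.75 nodes, 65 pairs.
# sd_diff=2.83 is an assumed value (see the note above).
t_nodes = paired_t_statistic(mean_diff=-2.75, sd_diff=2.83, n_pairs=65)
print(round(t_nodes, 2))
```

The subtraction of `null_value` is kept explicit even though it is zero here, to mirror the "what we saw minus what we'd expect under the null" structure.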

So what we need to do now is figure out whether

that's far away, and I think you already know the answer.

But whether that's far away relative to other possibilities, or not.

So one way to do that is to translate this distance into

a p-value, by comparing it to the distribution of such mean differences.

Based on sampling variability, when the truth

is a population level mean difference of zero.

So we'll get this p-value, and it'll tell you how likely

it is to be as far or farther away than our result was,

given all possibilities for the estimated mean differences

when the truth is a mean difference of zero.

And then we'll compare the p-value to the

preset rejection level, which is very commonly 0.05.

So, for our purposes and most of the research world, this is 0.05.

And we'll see whether it falls below or above

that, and we'll make a decision based on that.

So, let's translate this into a p-value.

So, we have a result that is 7.83

standard errors below the expected mean difference of zero.

Assuming the null is true.

How likely is this to occur just by chance when the null is true?

In other words, because of random sampling error.

Well, what are we looking at here? Our sampling

distribution, the distribution of all possible

mean difference estimates, we know is roughly normal.

Roughly normal, with some variation. And it's centered at the true difference.

And we're assuming the true mean difference, in

this hypothesis testing framework, to be zero.

So we got a result that was way off the charts, if you will.

It was 7.83 standard errors below zero.

And what the p-value will measure is

the proportion of results we could have gotten,

the proportion of sample mean differences just by chance, that are that far

or farther below zero. But it will also consider distance in

the opposite direction: the proportion that are that far or farther above zero.

So let me ask you this: how likely is

something that's beyond 7.83 standard errors

away from the mean of a normal distribution?

Well, we know that the proportion

of results that fall that far out is very small.

We certainly know that they're less than

0.05, because that's farther than two standard errors.

We know they're less than 0.01, because that's farther than three standard errors.

And in fact, to get the actual p-value, we

need to go to either a table or use the computer.

And even the computer reports it as

basically a number less than some threshold:

less than 0.0001.
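
To make the "two standard errors, three standard errors, 7.83 standard errors" comparison concrete, here is a minimal sketch of a two-sided p-value under a normal sampling distribution. It uses only Python's standard library; the helper name `two_sided_normal_p` is an illustrative choice, not something from the lecture.

```python
import math

def two_sided_normal_p(z):
    """Proportion of a standard normal curve that lies at least |z|
    standard errors from zero, counting both tails."""
    return math.erfc(abs(z) / math.sqrt(2))

# Distances mentioned in the lecture: 2, 3, and 7.83 standard errors
for z in (2.0, 3.0, 7.83):
    print(z, two_sided_normal_p(z))
```

The value for 7.83 is so small that software typically just reports it as below a display threshold such as 0.0001.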

So we got a p-value that was very small. So what is the interpretation of

this p-value? Well, the

p-value measures the

probability of getting

a sample mean difference,

in other words, our result,

as extreme, which means as far,

or more extreme, which

means farther, than what we

saw, negative 2.75,

if in fact the true mean difference at the population level

is zero, or the means are equal.

So this measures the likelihood of our study results if

the data samples came from populations with the same means.

So in other words our data result, our

mean difference of negative 2.75, is very unlikely.

The probability of getting it is very small when the null is true.

So let's think about what this means.

We could have gotten the result

that we got if the null were the guiding truth, but its likelihood is very low:

less than 0.0001.

So, if we have that threshold of 5%

to declare things likely or unlikely, results with p-values that

are above 5% we'd say are

likely enough to have happened under the null

to be consistent with the null hypothesis.

And if the p-value comes in at less than 5%,

less than 0.05, we reject the null hypothesis in favour of the alternative.

So for these data, what we've got is a p-value way less than 0.05.

And we would reject the null hypothesis in favour of the very broad alternative,

which is that the data came from populations with different means.

But how does this compare to the decision we would

make with the Confidence Interval, for the difference in population means.

It should agree, right?

And remember, the Confidence Interval for the difference

in population means did not include zero either.

So, we've ruled out zero as a possibility in

both approaches to handling the uncertainty in our estimate.

And what we're really doing, as we

established in the first part of the lecture

set, is this is about distance.

With the Confidence Interval, we take our estimate

and go a fixed distance in either direction

to create an interval that has a high

likelihood of including the true mean difference.

For hypothesis testing, we start with an assumption about

the truth and measure how far away we are.

So if we didn't include zero in our Confidence Interval, that means

our result must be more than two standard errors away from zero.

Either below or above it, and hence when we do the hypothesis testing,

and compute how far our result is from the assumed mean difference of zero.

The distance will come in at greater than

two, and our p-value will be less than 0.05.

So these two things will concur about the null value.
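
The agreement between the two approaches can be sketched with the lymph node numbers. This is only an illustration: it uses the "plus or minus two standard errors" rule of thumb from the lecture, the function name is made up, and the standard deviation of 2.83 is an assumed value implied by the reported results.

```python
import math

def interval_and_distance(mean_diff, sd_diff, n_pairs, multiplier=2.0):
    """Return the approximate 95% interval (estimate +/- multiplier * SE)
    and the distance of the estimate from zero in standard errors."""
    se = sd_diff / math.sqrt(n_pairs)
    lower = mean_diff - multiplier * se
    upper = mean_diff + multiplier * se
    distance = mean_diff / se
    return lower, upper, distance

# sd_diff=2.83 is an assumption, not a number stated directly in the source
lower, upper, distance = interval_and_distance(-2.75, 2.83, 65)

ci_excludes_zero = lower > 0 or upper < 0
more_than_two_se_away = abs(distance) > 2.0
print(round(lower, 2), round(upper, 2), ci_excludes_zero, more_than_two_se_away)
```

Both checks are the same comparison written two ways: zero is outside the interval exactly when the estimate is more than `multiplier` standard errors from zero.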

Just one thing to note.

The p-value's absolutely invariant to the direction of comparison.

If we instead presented the data and the

estimated difference in the other direction, in terms of doctor 1

minus doctor 2,

the underlying null is still exactly the same:

that for the two doctors, at the population level

of all such patients, the true mean numbers of nodes are equal

and that the

difference is zero. So, there's no difference in what we're

[INAUDIBLE]

in the null and alternative hypotheses.

But if we did the distance measure in this direction, what

we'd end up with is the opposite of what we got before.

Instead of getting something that was 7.83 standard errors below

zero, we'd end up with something that was 7.83 standard errors above zero.

And of course, because our p-value approach is such that we look at

being as far or farther away from the mean, which is assumed to be zero,

in either direction, we're considering results that are more than

7.83 standard errors above zero and more than 7.83 below.

And so we would get the exact same p-value we

got when we compared things in the opposite direction.

Because there we got a result on

that side, but we were considering all possibilities in both tails as well.
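
A small sketch of this invariance, assuming a roughly normal sampling distribution as in the lecture's picture. The helper name and the standard deviation of 2.83 are assumptions made for illustration.

```python
import math

def distance_and_two_sided_p(mean_diff, se):
    """Distance from zero in standard errors, plus the two-tailed
    proportion of a normal curve at least that far from zero."""
    distance = mean_diff / se
    p_value = math.erfc(abs(distance) / math.sqrt(2))
    return distance, p_value

se = 2.83 / math.sqrt(65)  # assumed sd of the differences, 65 pairs

d_2_minus_1, p_2_minus_1 = distance_and_two_sided_p(-2.75, se)  # doctor 2 - doctor 1
d_1_minus_2, p_1_minus_2 = distance_and_two_sided_p(+2.75, se)  # doctor 1 - doctor 2

print(d_2_minus_1, d_1_minus_2)   # same size, opposite signs
print(p_2_minus_1 == p_1_minus_2)
```

Because the p-value is computed from the absolute distance, flipping the direction of comparison flips the sign of the distance and leaves the p-value untouched.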

So let's look at our cereal and

cholesterol example and do a p-value on this.

So we had 14 males with high cholesterol levels.

And they were given oat bran cereal as part of a diet for two weeks,

and corn flakes as part of a diet for two weeks.

And then at the end of each two-week

period, each male's cholesterol level was measured.

The average cholesterol level after two weeks of corn flakes for the 14 men was

171.2 milligrams per deciliter. For the oat bran group it was 157.8.

We took the difference, corn flakes compared to oat bran.

The average cholesterol level was 13.4 milligrams per deciliter higher after two

weeks on the corn flakes breakfast versus two weeks on the oat bran.

And we saw the 95% Confidence Interval for difference did not include zero.

All possibilities for the difference in the direction of corn

flakes to oat bran were positive for the average difference.

So if we want to do a hypothesis testing approach, we go

ahead and in our minds set up the two competing hypotheses.

The null would be that the mean cholesterol level at

the end of the two-week period is the same

for the population of such men whether they

were all given corn flakes or all given oat bran.

Another way to say that null is that the mean difference is zero,

and the alternative is the very general one: the means are not the same

at the population level, or the difference is not zero.

So if we want to measure how far our result

is from what we would expect under the null hypothesis.

We take our observed mean difference, 13.4 mg

per deciliter, minus what it would be assumed to be

under the null, which is zero, so it is redundant to write that.

And then divide by our estimated standard error, which is the standard deviation of

the 14 individual differences divided by the

square root of our number of pairs, 14.

This is approximately 3.23. So, we now have a result that's

3.23 standard errors above what we'd expect to happen under the null.

But we can have sample mean differences that

differ from the underlying truth, just by sampling error.

So we have to figure out whether 3.23 is out

of the realm of what we'd expect just by chance.

If the null were true.

So what we're looking at is a sampling distribution.

It looks like this, and is centered at zero.

We got a result of 3.23, and we're looking at the

proportion of results, the estimated mean differences

that we could have gotten, that are as far or farther than 3.23

standard errors above zero, but we'll also look in the other direction, below.

So we're going to turn this into a probability.

So we know that if this sampling distribution were normal,

the p-value here would clearly be less than 0.05.

Because our results are more than two standard errors away from zero.

But because this is a smaller sample size, 14 people, the true

theoretical sampling distribution is a t distribution, with 13 degrees of freedom.

So I'm not going to have us look that up.

A computer would do it, but it's not quite a normal distribution.

But it turns out nevertheless, on a t

distribution with 13 degrees of freedom, this is

relatively far out.

And so the p-value for this comparison also comes in low, at less than 0.01.

And so we would reject that null hypothesis in favor

of the alternative. But we already knew we'd do that

because the Confidence Interval did not include zero.

For the mean difference.
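
For readers who want to see what "a computer would do it" amounts to, here is a minimal sketch of a two-sided p-value on a t distribution with 13 degrees of freedom. It uses only the standard library and integrates the t density numerically with Simpson's rule; that is a choice made for this illustration, and statistical software would normally use a dedicated routine. The function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_t_p_value(t_stat, df, steps=2000):
    """Area in both tails beyond |t_stat|, found as 1 minus the central
    area, which is integrated with Simpson's rule (steps must be even)."""
    b = abs(t_stat)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (total * h / 3)   # area between -b and +b
    return 1 - central_area

# Cholesterol example: 3.23 standard errors above zero, 14 pairs -> 13 df
p_t = two_sided_t_p_value(3.23, df=13)
p_normal = math.erfc(3.23 / math.sqrt(2))  # what a normal curve would give
print(p_t, p_normal)
```

The t-based p-value comes out larger than the normal-based one, reflecting the heavier tails of the t distribution in small samples, yet it still falls below 0.01.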

Let's look at one more study, the

before-versus-after study with the blood pressure data.

You remember this study where we had ten women whose blood pressure was

measured right before they were put on a regimen of oral contraceptive use.

And then after three months of consistent oral

contraceptive use, their blood pressure was measured again.

And the difference, after minus

before, was computed for each of the ten women.

So this was paired.

The average of these differences was 4.8 millimeters of mercury; on

average the blood pressure went up by 4.8 millimeters of mercury.

The variation in the ten individual

differences, the standard deviation, was 4.6 millimeters of mercury.

So, the resulting estimate and 95% Confidence Interval for the mean

difference in blood

pressure after oral contraceptive use compared to before:

the observed difference, we said, was 4.8 millimeters of mercury,

with a Confidence Interval of 1.5 to 8.1.

So all possibilities for the mean shift were positive.

The resulting p-value for this paired

t-test is 0.016. So again, to interpret

this we'd say: if there was no difference in the population mean

systolic blood pressures after and before OC use in this population of women,

then the chance of getting a sample of ten women from this population

with a sample mean difference of 4.8, or a difference even

more extreme, is 16 in 1,000. So,

it could happen.

We could have gotten this result just by chance, random sampling if the

two samples came from populations before and after contraceptive use

that were identical in terms of their mean blood pressure.

But the chances of getting this sample are low.

And they're lower than our threshold at 0.05.

So we could again reject the null in favour of the alternative, that

the true mean difference is not zero

and call this a statistically significant result.

So, in summary, the paired t-test is a

method of getting a p-value for testing

the competing hypotheses that the population means are

equal (that's the null hypothesis) versus that they're not.

And of course we can represent these in terms of differences in means.

And that's consistent with how we estimate a comparison

between the two populations using a sample difference in means.

So this paired t-test is a method for testing

these using data from paired samples from the paired populations.

And the resulting decision about whether to reject the null

or fail to reject the null will concur with the

results from the 95% Confidence Interval for the difference in means.

In all three examples here, the Confidence Interval

for our difference in means did not include

the null value of zero.

And in all three examples, our resulting p-value was less than 0.05,

and statistically significant.

We'll see in future sections other

situations where our Confidence Interval includes

the null value and our resulting p-value will be greater than 0.05.

So

how then, what is the approach?

First, set up

the two competing hypotheses about the unknown population means.

Then start by assuming the null is true. And then compute how far

the observed result, the sample mean difference, is from what we'd expect

it to be under the null, which is always zero.

So subtracting it is redundant. And then we divide:

we convert

that to standard errors. How many standard errors away

from that null value are we? Because we

are assuming that to be the truth.

And then we translate the distance into a p-value and we make a decision.

And the p-value measures the chance of

getting the study results that were observed,

or something even less likely (sometimes called more extreme), when the

samples are assumed to have come from populations with the same means.

In other words, the chance of getting the study results if the null hypothesis

is the truth.

And this p-value, which is called the two-sided p-value (but we'll

define that more specifically shortly), is invariant to the direction of comparison.

And the direction of comparison shouldn't matter in terms of

determining whether the results are likely or unlikely under a null.

Because the direction of comparison is arbitrary.
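
The whole recipe summarized above can be sketched end to end. This is a minimal illustration using only the standard library. The function names are made up, the t-distribution tail area is found by simple numerical integration, and the paired measurements are hypothetical numbers invented for the example, not data from any study in this lecture.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_t_p_value(t_stat, df, steps=2000):
    """Two-tailed area beyond |t_stat| via Simpson's rule (steps is even)."""
    b = abs(t_stat)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

def paired_t_test(first, second, alpha=0.05):
    """Steps from the summary: assume a true mean difference of zero,
    measure the observed mean difference (second minus first) in
    standard errors, convert that distance to a p-value, then decide."""
    diffs = [b - a for a, b in zip(first, second)]
    n_pairs = len(diffs)
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n_pairs)
    t_stat = (mean_diff - 0.0) / se          # null value is zero
    p_value = two_sided_t_p_value(t_stat, df=n_pairs - 1)
    return t_stat, p_value, p_value < alpha  # True means "reject the null"

# Hypothetical paired measurements, invented purely for illustration
first = [10, 12, 9, 11, 13, 10]
second = [12, 15, 10, 14, 15, 13]

t_fwd, p_fwd, reject_fwd = paired_t_test(first, second)
t_rev, p_rev, reject_rev = paired_t_test(second, first)  # opposite direction
print(t_fwd, p_fwd, reject_fwd)
print(t_rev, p_rev, reject_rev)
```

Running the comparison in both directions shows the point made above: the distance changes sign, while the two-sided p-value and the decision stay the same.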

In the next section, we'll do the same

thing but we'll be comparing means between unpaired samples.