A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

238 ratings

From this lesson

Module 4A: Making Group Comparisons: The Hypothesis Testing Approach

Module 4A shows a complementary approach to confidence intervals when comparing a summary measure between two populations via two samples: statistical hypothesis testing. This module will cover some of the most used statistical tests, including the t-test for means, the chi-squared test for proportions, and the log-rank test for time-to-event outcomes.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

In this section, we're going to actually debrief on the

p-value for the first but certainly not the last time.

Even though we've so far only covered hypothesis testing for comparing

means between two populations via two samples, both paired and unpaired,

the principles we discuss about the p-value will apply to

it and to all the hypothesis testing situations that we cover in the course.

So in this lecture section the focus will be on

what a p value can and can't reveal about study results.

Upon completion of this section, you will be able to define type 1 error and

understand its role in the hypothesis testing

process, which we've already alluded to previously.

Explain what a p-value is and what it is not.

Contrast statistical significance

with scientific significance.

And start to appreciate why a

non-statistically significant result yields a decision of

fail to reject the null hypothesis, as opposed to accept the null hypothesis.

So let's just remind ourselves what p-values are.

First and foremost, they are probabilities or proportions.

Numbers between zero and one.

The interpretation of a small

p-value, one under the threshold we set before of 0.05,

is that the sample results we have are unlikely under the

null assumption about the truth in

the populations that generated the sample data:

generally, that the populations are equal on some characteristic of interest, which in what

we've looked at thus far is the mean of the populations.

And what

the p-value is, is the probability of obtaining a result

as or more extreme than you did, or the researcher did,

by chance alone, assuming the null hypothesis is true:

how likely these sample results, and other results

less likely, are if the null hypothesis is true.

It's

very important to get this definition straight in terms

of understanding what likelihood we're measuring with the p-value.

We're measuring the only thing that is random about our

study the way we've defined it, our sample results,

and we're making a statement about the likelihood of

our sample results under some assumption about the truth.
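
To make that definition concrete, here is a minimal sketch of how a two-sided p-value can be computed once a result has been expressed as a distance from the null value in standard errors. It assumes a large-sample normal approximation rather than the t distribution, and the function name and arguments are illustrative, not something from the lecture.

```python
import math


def two_sided_p_value(estimate, standard_error, null_value=0.0):
    """Probability of a result as or more extreme than `estimate`,
    computed under the assumption that the null value is the truth.

    Assumption: the sampling distribution is approximately normal.
    """
    # Distance of the observed result from the null, in standard errors.
    z = (estimate - null_value) / standard_error
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))


# A result exactly at the null is as likely as it gets; a result far
# from the null is unlikely if the null is true.
print(two_sided_p_value(0.0, 1.0))
print(two_sided_p_value(3.0, 1.0))
```

Note that the null value enters the calculation as a fixed input: the probability is about the sample result given the null, never about the null itself.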

Here are some things the p-value gets interpreted as, but is not.

The p-value is not the probability that the null hypothesis is true.

We compute the p value under the

assumption that the null hypothesis is true.

As such, it cannot be a reflection on the probability of

the null hypothesis being true, because we've assumed it to be true.

The p-value tells us nothing about how well the study was conducted.

It tells us nothing about whether the study results

are important or not. It is not the

probability that the alternative hypothesis is true.

Again, the p-value is calculated under a strong and fixed assumption

about the truth: that the null hypothesis is what's going on. And then, finally,

and this, surprisingly, comes up a lot in my discussions:

there's some sense that the p-value legitimizes

the study results.

But the p-value is not the probability that

the study findings are legitimate or are not legitimate,

in which case a low p-value would be good in that regard.

So again, the p-value only tells us about

the likelihood of our sample or study results, or

results less likely, under a fixed assumption

about the truth that generated the samples we're comparing.

So the p value

alone imparts no information about the scientific or

substantive content in the results of a study.

So for example, suppose I initially presented the cornflake, oat bran study

and simply told you the researchers had found a statistically

significant difference in the average LDL cholesterol levels in men who had

been on a diet including cornflakes versus the same men on a diet

with the oat bran cereal.

That's all I told you: that the results were statistically significant.

You have no way of knowing more based simply on

the fact that the p-value was less than 0.05.

There's no information in that statement alone about which diet showed

lower average LDL levels, how much the average difference was, or

whether the average difference observed, or its confidence

limits, mean anything nutritionally. All the p-value tells you is that the

results observed by the researchers between the two groups were very unlikely

if the underlying truth was that the cornflake and oat bran

groups had the same average cholesterol levels.

So how do we interpret a p-value? How do we use it to make a decision?

Well, if the p-value is small,

either a rare event occurred under the null, or the null is not true.

And we actually set our threshold for rare events

by rejecting at the .05 level. It doesn't have to

be .05, but we're setting something called the type one error level.

We could set it lower if we wanted to be more conservative.

But what we're setting in advance is that we're allowing

for a 5% chance of rejecting the null in favor of the alternative

when in fact the null is true.

We're calling any study results we get that are less than 5%

likely if the null is true inconsistent with the null,

when in fact our results could have just been rare

events or rare samples relative to others under the null.

This rejection level we set is

frequently called the type one error level,

sometimes also called the alpha level or significance level, and it is

set in advance of performing the test, before any data is collected.

And the standard used in research is .05.

And we could argue about whether that's a reasonable standard or not.

That might be too high of a threshold for

rejecting when we shouldn't.

And you might say, I'd like to set the alpha

level at 0.01, but there are trade-offs for doing that.

So again, if the p value is less than some predetermined cutoff

like 0.05, the result is called

statistically significant at that cutoff level.

Usually just truncated to be called statistically significant

but technically it should be phrased as statistically significant

at the 0.05 level, or at the 0.01 level if your type one error threshold is lower.

And again this cutoff is called

the type one error level or the alpha level.

And the alpha level is the probability of a type one error.

So the idea of setting the alpha level or the type one error level relatively low, at

5% for example, is that it's the probability of

falsely rejecting the null when the null is true.

It's the probability of a false

positive: finding a difference in population means

when there really isn't one.

The idea of setting the alpha level low is keeping the

chance of making a mistake when the null is true low,

and only rejecting if the sample result is unlikely.

We could argue about how unlikely is unlikely, but the threshold is

determined by this alpha level, and again, the research standard is 5%.
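
One way to see what the alpha level means is to simulate many studies in which the null really is true and count how often we reject anyway. The sketch below is illustrative only: the group size, number of simulated studies, standard deviation of 1, and the normal approximation for the p-value are all assumptions made for the example, not values from the lecture.

```python
import math
import random
import statistics


def simulate_type_one_error(n_studies=2000, n_per_group=50, alpha=0.05, seed=0):
    """Fraction of simulated two-group studies that reject the null
    when both groups are drawn from the same population."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        # Both samples come from the same population, so the null is true.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = statistics.fmean(group_a) - statistics.fmean(group_b)
        se = math.sqrt(statistics.variance(group_a) / n_per_group
                       + statistics.variance(group_b) / n_per_group)
        # Two-sided p-value under a normal approximation (an assumption).
        p = math.erfc(abs(diff / se) / math.sqrt(2))
        if p < alpha:
            rejections += 1
    return rejections / n_studies


# Should land close to alpha: every rejection here is a type one error.
print(simulate_type_one_error())
```

Every rejection counted in this simulation is a false positive, so the rejection rate settling near the chosen alpha is exactly what "allowing for a 5% chance of rejecting when the null is true" means in practice.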

So let's just frame what we've done thus far. This

is like a two by two table, but it doesn't involve any data.

We're going to compare

the decision we make based on the data in our samples,

or the decision that researchers make, versus the underlying unknown truth,

and we will talk about some of these properties of hypothesis testing.

So suppose the null is really the truth that generated the two samples

we're comparing, but we decide to reject based on the data we have.

If we reject when the null is true, we've

made a type one error.

And we set the probability of a type one error to keep this chance relatively small.

So we set it at the alpha level.

It's typically 5%, but it doesn't have to be.

Okay.

What happens if we don't reject a null when the null is true?

Well, we've made the right decision under the null, right?

There's no fancy or unfancy, if you will, name for this, in Statistics.

It's just called, doing the right thing.

So right now, we've accepted, or I've urged you to accept,

the research standard of 5% to go along with current research practices.

But in reality, it's hard for us as humans to

process how unlikely is unlikely enough for us to be comfortable with.

And so many of you may think that 5% is too high, that you'd prefer

to set the threshold lower, at 1% or 0.1%, etc.

And I understand where you're coming from, but there are some

trade-offs for that, which we can discuss in live talk in

the BBS.

What's the other type of mistake we can make?

We have not focused on this yet, but

there's another type of mistake we can make, and

I just want to sow the seeds of something

we'll go into detail with, shortly in the course.

Well, what if the alternative hypothesis is really the truth,

that there is a difference, or some

difference, in the population means we're comparing,

but we do not find it?

We do not reject the null.

Well, this wrong decision when the alternative is the truth is called, very

unimaginatively (any guesses here?), a type two error.

Statisticians don't have a lot of ideas

sometimes, when it comes to creative names.

Its probability is also called the beta level.

But unlike the type one error level, which is presented as a fault of the hypothesis

testing procedure, we sometimes talk about the complement of the type two error level.

In other words, we turn it to its good side, which

is the probability or chance of rejecting when we should.

And that's something called power,

sometimes called one minus beta.

So here's the deal, and we're going to get into this in more detail shortly.

We haven't seen many

non-statistically significant results yet; we'll see some more in upcoming lectures.

But the reason we couch our decision, when we have a p-value

greater than the type one error or alpha level, greater than 0.05,

as fail to reject, instead of saying we accept the null,

is because of these ideas over here.

If we have a large type two error level for our hypothesis

test, that means that our chances of not rejecting when we should are high.

And therefore, when we don't reject the null it's

hard to determine whether we have not rejected the null

because the null is the truth.

Or because our chances of not rejecting it

when the alternative is true were relatively high.

So a high type two error level leads to something called low power.

In other words, the probability of rejecting when we should turns out to be low, and

that makes it hard to interpret non-statistically significant results.
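
The same kind of simulation can illustrate power: this time the alternative is the truth, and we count how often the test actually rejects. This is a rough sketch under assumed values (a true difference of half a standard deviation, group sizes of 10 and 200, and a normal approximation for the p-value); none of these numbers come from the lecture.

```python
import math
import random
import statistics


def rejection_rate(true_diff, n_per_group, n_studies=1000, alpha=0.05, seed=1):
    """Fraction of simulated two-group studies that reject the null of no
    difference in means when the true difference is `true_diff` (SD = 1)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        group_a = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = statistics.fmean(group_a) - statistics.fmean(group_b)
        se = math.sqrt(statistics.variance(group_a) / n_per_group
                       + statistics.variance(group_b) / n_per_group)
        # Two-sided p-value under a normal approximation (an assumption).
        p = math.erfc(abs(diff / se) / math.sqrt(2))
        if p < alpha:
            rejections += 1
    return rejections / n_studies


# Same true difference, two study sizes: estimated power (1 - beta).
power_small = rejection_rate(true_diff=0.5, n_per_group=10)
power_large = rejection_rate(true_diff=0.5, n_per_group=200)
print(power_small, power_large)
```

With the small study, most simulated samples fail to reject even though a real difference exists, which is why a non-significant result from a low-powered study cannot be read as evidence for the null.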

So, if p is greater than or

equal to .05, this is why we phrase the decision as

failure to reject the null as opposed to accepting the null.

We'll keep that verbiage going for the rest of the course, but we'll see very

shortly that there are ways we can either evaluate the power

of a test that's been performed or, if we design the study in advance,

design a study

to have high power and low type two error.

Let's just revisit this connection

between hypothesis testing and confidence intervals.

A confidence interval gives a range of

plausible values for the population parameter of interest.

And by population parameter of interest I

mean some measure at the population level.

So far, when comparing

populations, we've talked about the mean difference.

If we're talking about a single group we could talk about the

population mean, but in this context

of comparing groups we're focusing on differences.

And a confidence interval sort of says: here's the data; use that, coupled

with what we know about the behind-the-scenes sampling behavior of the data,

to give me an interval that hopefully includes the truth.

Data take me to the truth.

Hypothesis testing comes at it in the

other direction.

It starts out with two choices for the population parameter of interest:

that there's no difference (and we'll see

this again with proportions and rates), versus

[INAUDIBLE]

that the two are not equal, or the

difference is not equal to zero.

So let's think about this.

We've talked about this in the intro part of this lecture, but let's revisit it.

Let's just focus on 95% confidence intervals.

We could amend this to look at other

levels of confidence and other type one error levels.

But if zero is not in the 95% confidence interval,

then we would reject the null that

the mean difference of the populations we're comparing is zero.

Sorry,

I tried to wrap it all into one big Greek letter here.

Let's just say

at the alpha level, a rejection level, of 0.05.

And this is, in that font, a poor

representation of alpha there, so we'll do that.

Okay, so why is this? Well, let's think about this.

How do we create a confidence interval?

We start with our estimate, which when we're

comparing means is the mean difference between the

two samples we have, whether they're paired or unpaired.

And we go a fixed number of standard errors in either direction.

Two standard errors above our estimate,

and two standard errors below. I won't fill in that thing there.

So think about it.

If, in this range here, we do

not hit zero, if zero is not in this interval,

that must mean

that zero

is more than two standard errors away from

our estimate, whether it's above it or below it.

So, when we go to actually do a hypothesis test, we start with the null

that the mean difference is zero, and we measure how far our result is

from zero in terms of standard errors.

We already know, since the confidence interval

did not include zero, that this distance

in absolute value is going to be greater than two.

It's either going to be greater than two or less than negative two.

And by our convention at the 5% rejection

level, if we're more than two standard errors

away under a normal curve, our result is less than 5% likely and we will reject.

The converse is also true.

If our 95% confidence interval did include zero, then we would fail

to reject at the 0.05 level because of the same logic about distance.
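
The distance argument above can be checked numerically. The sketch below builds a 95% confidence interval and a two-sided p-value from the same estimate and standard error, then reports whether zero falls inside the interval and whether the test rejects at 0.05. It assumes a normal approximation and uses the exact normal multiplier (about 1.96) where the lecture rounds to 2; the estimates and standard errors passed in are made-up illustrative numbers.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
# Multiplier for a 95% interval under a normal curve (about 1.96).
Z_95 = STD_NORMAL.inv_cdf(0.975)


def summarize(estimate, standard_error):
    """95% CI and two-sided p-value for a mean difference, with the
    null value taken to be zero (normal approximation assumed)."""
    low = estimate - Z_95 * standard_error
    high = estimate + Z_95 * standard_error
    # Distance from zero in standard errors, converted to a probability.
    z = abs(estimate / standard_error)
    p = 2 * (1 - STD_NORMAL.cdf(z))
    return {
        "ci": (low, high),
        "p": p,
        "zero_in_ci": low <= 0.0 <= high,
        "reject_at_05": p < 0.05,
    }


# Hypothetical inputs: one result far from zero, one close to zero.
print(summarize(5.0, 2.0))
print(summarize(1.0, 2.0))
```

In both cases the two answers line up: zero outside the interval goes with rejecting, and zero inside the interval goes with failing to reject, because both are statements about the same distance in standard errors.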

Here's the thing.

The 95% confidence interval and the p-value are complementary.

You pretty much know something about one if you have the other, and vice versa.

But you don't have full information on the other based on one.

So, for example, in the blood pressure oral contraceptives example, a

95% confidence interval tells us that the p-value is less than 0.05.

Because it did not include zero in the interval.

However, it doesn't tell us that the p-value is 0.009, so if we just had the

confidence interval, we could ascertain where our p-value fell relative to 0.05.

But we wouldn't know its exact value.

The confidence interval and the p-value are complementary.

However, you can't get the exact p-value

from just looking at a confidence interval.

And as we said before, you can't get a sense of the scientific substantive

significance of your study results by looking at a p-value.

So it's generally appropriate to report both.

Now you may say, John, if the industry

standard is 5% for

rejection, why not just report confidence intervals at the 95% level

for differences?

And it will be clear whether or not you rejected at the .05 level.

And I agree with you to some extent, but

we're going to see going forward that there are some situations where, just

to show the lack of an association, or an association

which will then be quantified in more detail,

it's worth, believe it or not, presenting just a p-value to start.

And there are other types of tests we'll get

into where we're comparing more than two groups,

where we don't have an immediate confidence interval that

summarizes the entire comparison set, but the p-value will.

Furthermore, even though the industry standard is .05 for the p-value,

others may have their own internal alpha levels

and want to evaluate the results with regard to that.

If I just give

a 95% confidence interval, the reader can tell where the p-value lies, relative to

.05, but they can't necessarily evaluate it at .1, or .01 for example.

Another thing to note, and this goes back to what

we discussed about what you can and can't get out of a p-value:

statistical significance doesn't prove or imply causation.

So in the blood pressure oral contraceptives example, there was a

statistically significant difference between blood pressure after

oral contraceptive use and before.

But we can't necessarily immediately attribute that to

oral contraceptives, because there could be other factors

that can explain this change.

And so a significant p-value is only

ruling out random sampling chance as the explanation.

This is why we said that paired

groupings or paired tests are potentially problematic

in terms of making strong conclusions, even

if the results are statistically and substantively significant,

because we don't have another group to compare

these folks to, who did not receive oral contraceptives.

So a comparison group is better for establishing causality, and we then looked at

unpaired examples where it was easier to make that conclusion.

Very important here, and we'll continually harp on this:

statistical significance is not the same as scientific significance.

So, a significant p-value doesn't necessarily

mean the results are meaningful scientifically.

And vice versa, a non-significant p-value

doesn't mean the results aren't interesting scientifically.

So what can make differences almost

guaranteed to be statistically significant?

Well, that would

be the case if we had a very small standard error, such

that any difference in means being compared between the groups looked large

relative to the standard error.

And that can happen with big sample sizes.

So here's a hypothetical example, but we'll look at

examples going forward in the literature and talk about this.

Say we have 100,000 persons.

Suppose our blood pressure and oral contraceptive

paired study had 100,000 women enrolled.

And among the 100,000 women there was a slight

increase in blood pressure, on the order of 0.03

millimeters of mercury, but there was still a fair amount

of variation in the individual changes in blood pressure.

Well, in this situation, if you work this out, you'll get a p-value of 0.04.

So the result is statistically significant.

But the actual observed difference in means is

less than a tenth of a millimeter of mercury.

And if we slapped confidence limits on it, it would go

from almost zero to a little over half a tenth of a millimeter of mercury.

I think it would be very hard to find any clinician who considered these results to

be of scientific interest, or important, yet

the result was statistically significant, with a p-value less than .05.

So sometimes large samples can produce a small p-value even

if the magnitude of the effect is small and not scientifically

or substantively significant.
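
The arithmetic behind this hypothetical example can be sketched as follows. The lecture gives the sample size (100,000), the mean change (0.03 mm Hg), and the p-value (0.04), but not the standard deviation of the individual changes; the value of 4.6 mm Hg used below is an assumption chosen so that the p-value comes out near 0.04, and the normal approximation is likewise an assumption.

```python
import math

n = 100_000          # women in the hypothetical paired study (from the lecture)
mean_change = 0.03   # average change in blood pressure, mm Hg (from the lecture)
sd_change = 4.6      # SD of individual changes, mm Hg: ASSUMED, not from the lecture

# With a sample this large, the standard error is tiny even though
# the individual changes vary a lot.
standard_error = sd_change / math.sqrt(n)

# Distance of the observed mean change from zero, in standard errors.
z = mean_change / standard_error

# Two-sided p-value under a normal approximation.
p_value = math.erfc(abs(z) / math.sqrt(2))

# 95% confidence interval for the mean change.
ci_low = mean_change - 1.96 * standard_error
ci_high = mean_change + 1.96 * standard_error

print(f"SE = {standard_error:.4f}, z = {z:.2f}, p = {p_value:.3f}")
print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f}) mm Hg")
```

Under the assumed standard deviation, the p-value falls just under 0.05 while the entire interval stays below a tenth of a millimeter of mercury, which is the point of the example: the sample size, not the size of the effect, is driving the significance.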

In the same regard, lack of statistical significance

is not the same as lack of scientific significance.

We must evaluate in the context of the study and sample size,

and this is what I was talking about with regard to power.

And we'll spend a little more time with this going forward.

But when we have small samples, sometimes a

small n can produce a non-statistically significant result

even though the magnitude of the association at

the population level is real and important, and

the study can't detect it.

And sometimes small studies

are used, or sometimes they're even designed without

power in mind, just to generate preliminary data that can be used to design a larger,

high-powered study, and that's what we'll discuss in lecture sets 12 and 13.

So in summary, the p-value alone can only indicate

whether the study results were likely due to random sampling

chance or not

under the assumption of the null, that

there is no difference in the measure being compared between populations.

So far, we've only looked at comparing means, but

this idea will hold for proportions and rates as well.

And not rejecting the null hypothesis, again, is

not equivalent to accepting the null hypothesis as the truth.

And we'll stick with that

verbiage for now.

And we'll dig deeper into this in lectures 12 and 13.