A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

From this lesson

Module 4A: Making Group Comparisons: The Hypothesis Testing Approach

Module 4A shows a complementary approach to confidence intervals when comparing a summary measure between two populations via two samples: statistical hypothesis testing. This module will cover some of the most used statistical tests, including the t-test for means, the chi-squared test for proportions, and the log-rank test for time-to-event outcomes.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Okay, welcome back.

In this relatively short section we're just going to debrief a little bit more

and check in about where we are with

regard to the p-value and hypothesis testing.

So first I'd like to talk about the types of

tests we've been doing and explain them a little bit further.

The way we've demonstrated performing hypothesis tests is for

what are called two-sided hypothesis tests, which result in two-sided

p-values.

And this is the standard approach that

you'll see reported in most of the literature.

Two-sided refers to the method of getting a p-value

that measures being as far or farther from the null value,

or the expected value under the null, in either direction.

And we've seen that over and over again where, for

example, we get a result that is three standard errors above

what we'd expect under the null.

To get the p-value we actually looked at the proportion of

results, or the probability, of being that far or farther above.

And we also look at the other half,

the other tail, being that far or farther below,

to get our two-sided p-value.
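As a minimal sketch of the two-tail calculation just described, assuming the sampling distribution under the null is approximately standard normal (the function name `two_sided_p_value` is hypothetical, chosen for illustration):

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a result z standard errors from the null value.

    Assumes a standard normal sampling distribution under the null.
    """
    # Probability of being that far or farther above the expected value.
    upper_tail = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    # By symmetry, the probability of being that far or farther below.
    lower_tail = upper_tail
    return upper_tail + lower_tail

# A result three standard errors above what we'd expect under the null.
p_three_se = two_sided_p_value(3.0)
print(round(p_three_se, 4))  # 0.0027
```

Both tails are added because "as far or farther" is measured in either direction.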

It is possible, and sometimes perhaps more logical, to perform a one-sided test.

For example, if we're comparing two populations on some metric, I'll just use

means as an example, but we could replace this with proportions or incidence rates.

Maybe our null is that the underlying means

for the two populations we're comparing are the same.

Maybe we're only interested

in alternatives such that the mean of the first population is

greater than the second, so this is more specific than the

two-sided alternative, I'll call it, that we've been using, that the

means are not equal regardless of the direction of their inequality.

So for example, in a clinical trial comparing a new

treatment versus a standard treatment for

reducing blood pressure in hypertensive individuals.

The researchers understandably may only be interested in the new treatment

if it reduces blood pressure more than the standard treatment on average.

There's no interest in the treatment if it's less effective than what's currently

being used or if it perhaps even resulted in an increase in blood pressure.

So, this may be logical from a scientific perspective.

But it turns out that these types of tests raise some suspicion

in the literature, and among reviewers, because

the appropriate one-sided test will always result in a

p-value that is half of the two-sided p-value.

So, for example, suppose we were testing means,

and our one-sided alternative was that the true

mean for population one was greater than that for population two.

Suppose we ended up with sample results such that the sample mean for sample one,

from population one, was greater than the sample mean for population two,

and then we did a one-sided p-value based on this alternative.

And suppose we get a result that was three standard errors above what we'd expect.

We'd only consider this portion of the tail.

We wouldn't consider the other part.

And so it wouldn't be as far or farther in either direction, it would

be as high or higher above what we'd expect, and only look in that direction.

You can see how in some situations, though, this

could matter. Suppose something on the two-sided version

had a two-sided p-value of 0.08.

Well, the appropriate one-sided test would yield a p-value of 0.04,

because we'd only be looking at half of what we'd get for the two-sided.

And as such, this is seen as an attempt to make

things that are not initially

statistically significant using the two-sided approach

statistically significant by only considering half.
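The halving can be sketched numerically. This is an illustration only, assuming a standard normal sampling distribution and a hypothetical result of 1.75 standard errors above the null value, which gives a two-sided p-value of roughly 0.08:

```python
import math

def upper_tail(z):
    """Probability of a standard normal result at least as high as z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z = 1.75  # hypothetical standardized distance, in the direction of the alternative

# One-sided: only "as high or higher" than what was observed.
one_sided = upper_tail(z)
# Two-sided: as far or farther in either direction.
two_sided = 2 * upper_tail(abs(z))

print(round(two_sided, 3), round(one_sided, 3))
print(two_sided < 0.05, one_sided < 0.05)
```

With these assumed numbers the same data fail to reach the 0.05 level on the two-sided test but pass it on the one-sided test, which is the source of the suspicion described above.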

So these are not commonly

used or presented in the literature.

Even though it may make scientific sense

to only look at alternatives in one direction.

So unfortunately there's sort of a sanctity of

p being less than 0.05 in the research community.

As a collaborating statistician, I've

certainly been asked on numerous occasions,

by researchers, what can be done to make their results statistically significant

when the p-value or p-values they have are greater than 0.05.

You know, as we know and have discussed, a P value is only a piece of the story.

And significant P values do not necessarily mean important results

in terms of the science or medicine or public health.

Nor do non-significant p-values mean unimportant or useless results.

And we'll talk more about that in the section on power.

But because of this research culture sanctity around the p-value

being less than 0.05 as the be-all

and end-all sometimes, one-sided p-values are seen as opportunistic.

So how do we keep track of all

the tests we've named in lecture sets 9 and 10?

Well, the first thing I hope you've realized over the course of this is that,

regardless of the name, the

approach of our tests is universally the same.

We specify competing

hypotheses, our null and alternative.

Then we create some measure of the discrepancy between our sample results,

that is, our sample estimate of the measure of association

we used to characterize the null and alternative,

and the expected value

under the null.

And we standardize this distance by the uncertainty in our estimate.

And then we compare our discrepancy measure to

the theoretical sampling distribution

of all such

discrepancy measures

under the null hypothesis, when the null is true.

And we use this to get a p-value that tells us how likely our

results are relative to what else could

have happened just by chance under the null.

And we make a decision.
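The steps above can be sketched generically. This is a simplified illustration, assuming the sampling distribution under the null is approximately standard normal; the function name `hypothesis_test` and the numbers in the usage lines are hypothetical:

```python
import math

def hypothesis_test(estimate, null_value, standard_error, alpha=0.05):
    """Generic two-sided test: standardized discrepancy, p-value, decision."""
    # Discrepancy between the sample estimate and the value expected
    # under the null, standardized by the uncertainty in the estimate.
    distance = (estimate - null_value) / standard_error
    # Compare to the (assumed standard normal) sampling distribution of
    # such distances when the null is true: as far or farther, either tail.
    p_value = math.erfc(abs(distance) / math.sqrt(2))
    # Decision at the chosen rejection level.
    reject_null = p_value < alpha
    return distance, p_value, reject_null

# Hypothetical example: observed mean difference of 4.0 with a
# standard error of 1.6, tested against a null difference of 0.
z, p, reject = hypothesis_test(estimate=4.0, null_value=0.0, standard_error=1.6)
print(z, round(p, 4), reject)
```

The named tests differ in how the estimate, standard error, and reference distribution are obtained, not in this overall recipe.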

The names of the tests distinguish the tests both in terms of the

type of data being compared between the

groups, and the specific mechanics of getting

to a p-value.

One can always look up the name of

the appropriate test, given the data being compared.

The important thing to note is that they are conceptually the same, and

the resulting P values have the same interpretation across all the tests.

They measure how likely our study results were on the

assumption of no difference in some measure at the population level.

And the resulting p values will agree with the corresponding confidence intervals

for the chosen measures of association with regards to the null hypothesis.
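A small sketch of that agreement, assuming a normal-based 95% confidence interval and a normal-based two-sided p-value built from the same estimate and standard error (the helper name and the two example estimates are hypothetical):

```python
import math

Z_CRIT_95 = 1.959964  # approximate 97.5th percentile of the standard normal

def ci_and_p_value(estimate, standard_error, null_value=0.0):
    """95% confidence interval and two-sided p-value from the same inputs."""
    lower = estimate - Z_CRIT_95 * standard_error
    upper = estimate + Z_CRIT_95 * standard_error
    z = (estimate - null_value) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return (lower, upper), p_value

results = []
for estimate in (4.0, 2.0):  # hypothetical mean differences, SE = 1.6
    (lower, upper), p_value = ci_and_p_value(estimate, 1.6)
    ci_excludes_null = not (lower <= 0.0 <= upper)
    results.append((ci_excludes_null, p_value < 0.05))
    print(estimate, (round(lower, 2), round(upper, 2)), round(p_value, 4))
```

In each case the interval excludes the null value exactly when the p-value falls below 0.05.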

You will undoubtedly see other tests in the literature that we

have not yet covered or will not cover in this class.

Again, though, if you can figure out what measure is being compared via the test,

then you can interpret the p-value in the context of the comparison.

And we'll go further into this throughout the rest of the course.

And in fact, in the next lecture set we will discuss extensions to the tests

covered in lectures 9 and 10 to handle comparisons

between more than two populations in one test.

But let's think about what we've done thus far and let's just summarize.

For comparing two populations we'll first discuss the

unpaired situation because that's what we've done consistently

across the three.

So for means we have one test, really: the two-sample t-test.

For proportions, we have three tests, two of which are identical to each other.

They're identical mathematically,

but they look different in their computation:

the two-sample z-test and the chi-square.

And then we also have Fisher's exact. All three of these tests are testing

the exact same null and alternative; they just get at it in slightly different ways.
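The mathematical identity between the two-sample z-test and the chi-square test can be checked directly. The sketch below assumes a pooled standard error for the z-test and a Pearson chi-square without continuity correction; the counts (30 of 100 versus 18 of 100) are made up for illustration:

```python
import math

def two_sample_z(x1, n1, x2, n2):
    """Two-sample z-test for proportions using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

def chi_square_2x2(x1, n1, x2, n2):
    """Pearson chi-square for a 2x2 table, no continuity correction."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    row_totals = [n1, n2]
    col_totals = [x1 + x2, (n1 + n2) - (x1 + x2)]
    total = n1 + n2
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))

z, p_z = two_sample_z(30, 100, 18, 100)
chi2, p_chi2 = chi_square_2x2(30, 100, 18, 100)
print(z ** 2, chi2)
print(p_z, p_chi2)
```

Under these assumptions the chi-square statistic equals the square of the z statistic, so the two p-values coincide. Fisher's exact test is computed differently and generally gives a similar but not identical p-value.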

For time-to-event data we have the two-sample z approach

and the log-rank test.

And again, these are testing the same underlying null

hypothesis, and they do it in slightly different ways.

I'm just going to circle, for each of these, the most commonly used.

Obviously there's only one that we've

[UNKNOWN]

for means, but this is the most common

that you'll see in the literature for these comparisons.

And then for the paired situation,

there are paired approaches for binary data

and time-to-event data, but they're not that

common, and I haven't included them in this course.

But for the mean, sometimes you'll see a paired t-test.

One last thing I want to discuss in regard to

p-values and rejection levels at this juncture is that, just as

it is possible to have different levels for confidence intervals other

than 95% (we can have a 90% confidence interval, 99%, etcetera),

it is possible to evaluate p-values

at different alpha levels, or rejection levels.

You could have an alpha level of 10%, an alpha level of 1%, etcetera.

However, again, the standard in research is to use 95% confidence intervals

and a rejection level, or alpha level, or type 1 error level, of 5%,

and that's what we will do throughout the rest of this course.
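As a brief illustration of evaluating one p-value at different rejection levels, here is a sketch using a hypothetical p-value of 0.03:

```python
p_value = 0.03  # hypothetical p-value, for illustration only

# Reject the null at a given alpha level when the p-value falls below it.
decisions = {alpha: p_value < alpha for alpha in (0.10, 0.05, 0.01)}

for alpha, reject in decisions.items():
    print(f"alpha = {alpha:.2f}: reject null = {reject}")
```

The same result is called significant at the 10% and 5% levels but not at the 1% level, which is why the level being used needs to be stated.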