Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.


A course from Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

41 ratings

From this lesson

Discrete Data Settings

In this module, we'll discuss testing in discrete data settings. This includes the famous Fisher's exact test, as well as the many forms of tests for contingency table data. You'll learn the famous "observed minus expected, squared, over the expected" formula, which is broadly applicable.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Okay, so let's go through our more mathematical development, where we're assuming a model. Before, when we were talking about it as a randomization process, we were conditioning on the data. We said: you have so many treated, you have so many control, you have so many tumors and so many non-tumors, and we're simply redoing the randomization process on the computer under the hypothesis that the randomization is irrelevant, that is, that whether you received the treatment or the control was irrelevant. That's one way to think about Fisher's exact test. Now we're going to talk about a different way.

So let X be the number of tumors for the treated and Y be the number of tumors for the control. Our null hypothesis is going to be H0: p1 = p2 = p, the common proportion, where we assume that X is binomial with whatever its sample size was, n1, and success probability p, and Y is binomial with whatever its sample size was, n2, and success probability p under the null hypothesis. Under the alternative, the two probabilities would have to be different.

By the way, if this is true, if both X and Y are sums of IID Bernoullis, then X plus Y is just the sum of more Bernoullis: n1 plus n2 Bernoullis, all with a common success probability p. So it's an interesting and fairly obvious fact that if you add two independent binomials with a common probability, the sum is also binomial, with a total number of trials equal to n1 plus n2 and the same probability. This is clear because if X is a sum of n1 Bernoullis with probability p and Y is a sum of n2 Bernoullis with probability p, then X plus Y is simply the sum of n1 plus n2 IID Bernoullis with probability p; hence it's binomial with n1 plus n2 trials and probability p.

So, the way we've characterized the problem now, we have two numbers, X and Y, that are random. In our two by two table there are no other free numbers. If we know X and we know n1, then we know the number of non-tumors for the treated group. If we know Y and we know n2, then we know the number of non-tumors for the control group. So in that two by two table we know the margins, n1 and n2, and if we know X and Y, which is the first column of numbers, then we know the second column of numbers. So we only have two free numbers among the four numbers in our two by two table.

But we still have one parameter that we don't know, even under the null hypothesis. The null hypothesis says H0: p1 = p2 = p, and p itself is unknown. So what if we were to try to figure out a strategy to get rid of that parameter, to find a distribution that doesn't depend on it? It turns out that the probability of one of the data points given the sum does the job. It doesn't matter which one; we just pick the first data point, and you get the same procedure if you pick the second one. The probability of X given X plus Y equals z follows the hypergeometric probability mass function, and I give the hypergeometric probability mass function right there.
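
The slide with the formula is not reproduced in this transcript. For reference, the standard hypergeometric mass function for this setting, written with the symbols used in the lecture (n1, n2, x, z), is:

```latex
P(X = x \mid X + Y = z)
  = \frac{\binom{n_1}{x}\binom{n_2}{z - x}}{\binom{n_1 + n_2}{z}},
  \qquad \max(0,\, z - n_2) \le x \le \min(n_1,\, z).
```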

Now, what's interesting about this is that this hypergeometric mass function is exactly the probability distribution from a couple of slides earlier, where we have so many bins of T's and N's (tumors and non-tumors) and so many balls labelled T and C, for treated and control, and we randomly allocate the treated and control balls to the bins. The first bin is able to hold six balls, the last bin is able to hold only four balls, and I need to allocate ten balls, five treated and five control, randomly to those bins. That's the hypergeometric.
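
The bin-and-ball description can be simulated directly. The sketch below is an illustration under stated assumptions rather than code from the course: it uses the margins mentioned in the lecture (five treated balls, five control balls, a first bin that holds six), shuffles the balls, counts the treated balls that land in the first bin, and compares the simulated frequencies with the hypergeometric mass function. The function names, the number of trials, and the seed are hypothetical choices.

```python
import random
from math import comb

def hypergeom_pmf(x, n1, n2, z):
    """P(x of the n1 treated balls fall among the z slots of the first bin)."""
    if x < max(0, z - n2) or x > min(n1, z):
        return 0.0
    return comb(n1, x) * comb(n2, z - x) / comb(n1 + n2, z)

def simulate(n1, n2, z, trials, seed=0):
    """Randomly allocate n1 'T' and n2 'C' balls; the first z positions are bin one."""
    rng = random.Random(seed)
    balls = ["T"] * n1 + ["C"] * n2
    counts = {}
    for _ in range(trials):
        rng.shuffle(balls)
        x = balls[:z].count("T")  # treated balls in the first bin
        counts[x] = counts.get(x, 0) + 1
    return {x: c / trials for x, c in sorted(counts.items())}

n1, n2, z = 5, 5, 6  # margins from the lecture's example
freqs = simulate(n1, n2, z, trials=20000)
for x, f in freqs.items():
    # simulated frequency next to the exact hypergeometric probability
    print(x, round(f, 3), round(hypergeom_pmf(x, n1, n2, z), 3))
```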

The other way to think about this is that it is the distribution of the two by two table where you're permuting the T's and the C's, leaving the T's and the N's fixed, in the way that we described earlier. Of course, that's identical to permuting the T's and the N's, leaving the T's and the C's fixed. So again, if you have the same data, whether you assume that the row margins are the margins that include the randomized treatment or you assume that the column margins are the margins that had the randomized treatment, you wind up with the same procedure. So that's interesting.

Perhaps comforting, perhaps discomforting. Either way, remember that before we only had two free numbers, the two success counts X and Y. In this case a success is a tumor, so I'd hardly call that successful, but let's say success, using the convention of calling a binomial event a success regardless of how successful it is. We have the two success counts at the outset; when we know the value of the sum, we only have one left. So when the row margins were fixed, we had two free cells. Now that the row margins are fixed and the sum is fixed, we only have one free cell.

And this is exactly what Fisher's exact test really tells you: as you vary that upper left-hand cell, or any cell, holding both margins fixed, you get the remaining three elements. You can obviously put in a value for the upper left-hand cell and go through the exercise of finding the other three cells very easily, given the margins. But more than that, we also have a distribution on that cell, the hypergeometric distribution, which arises if we take the distribution of the upper left-hand cell and condition on the sum. Note that this distribution does not contain p. It got rid of it.
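
To make both points concrete, here is a small sketch (an illustration, not the course's code). It enumerates every two by two table compatible with the margins from the lecture's example, filling in the other three cells from the upper left-hand cell, and it computes the conditional probability of that cell directly from the binomial model for several values of p. The values of p and all function names are assumptions chosen for illustration; the table layout assumed here is rows for treated and control, columns for tumor and non-tumor.

```python
from math import comb, isclose

def binom_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def conditional_pmf(x, n1, n2, z, p):
    """P(X = x | X + Y = z) computed from the binomial model with common p."""
    joint = binom_pmf(x, n1, p) * binom_pmf(z - x, n2, p)
    return joint / binom_pmf(z, n1 + n2, p)

def hypergeom_pmf(x, n1, n2, z):
    """The same conditional probability written without p."""
    return comb(n1, x) * comb(n2, z - x) / comb(n1 + n2, z)

n1, n2, z = 5, 5, 6  # row margins 5 and 5, first column margin 6
rows = []
for x in range(max(0, z - n2), min(n1, z) + 1):
    # The upper left-hand cell x determines the remaining three cells.
    table = [[x, n1 - x], [z - x, n2 - (z - x)]]
    by_p = [conditional_pmf(x, n1, n2, z, p) for p in (0.2, 0.5, 0.8)]
    rows.append((table, by_p, hypergeom_pmf(x, n1, n2, z)))

for table, by_p, exact in rows:
    print(table, [round(q, 4) for q in by_p], round(exact, 4))
```

Each printed row shows the same probability for every p, matching the hypergeometric value up to rounding.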

And there's a mathematical reason for that. It's so-called conditioning on a sufficient statistic. When you condition on the sufficient statistic for p, you get rid of it. In this class we won't go over the mechanics of why, or the logic of how Fisher came to condition on X plus Y, or how that mathematical development works. Suffice it to say, for the needs of this class, that when you condition on the sum you do get rid of that probability, and there is a very general mathematical principle that relies on the fact that X plus Y is sufficient for the parameter p.

Okay, so let's derive this conditional distribution. We know the probability that X equals x; it's just the binomial probability here. We know the probability that Y equals, let's say, z minus x. Writing it this way will make the derivation a little bit easier, but we can plug in anything here, provided z minus x is an integer between 0 and n2. Then it's this binomial probability right here. And we said already that X plus Y is binomial, so the probability that X plus Y equals z is this probability right here.


Okay, now putting everything together: the probability that X equals x and X plus Y equals z, over the probability that X plus Y equals z. That's exactly this conditional probability, just using our rules of conditional probability that we know quite well from Mathematical Biostatistics Boot Camp 1. If X equals x and X plus Y equals z, then that's the same thing as saying X equals x and Y equals z minus x, of course. X and Y are independent, so we can factor those two probabilities. Then, on the previous slide, we had all three of these expressions; plug them in, and you'll find that you wind up with the hypergeometric distribution that we described before.
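
Since the slides are not reproduced in this transcript, the steps just described can be written out as follows. This is a reconstruction from the spoken derivation, using X ~ Binomial(n1, p), Y ~ Binomial(n2, p) independent, and X + Y ~ Binomial(n1 + n2, p):

```latex
\begin{aligned}
P(X = x \mid X + Y = z)
  &= \frac{P(X = x,\; X + Y = z)}{P(X + Y = z)}
   = \frac{P(X = x)\, P(Y = z - x)}{P(X + Y = z)} \\[4pt]
  &= \frac{\binom{n_1}{x} p^{x} (1 - p)^{n_1 - x}
           \cdot \binom{n_2}{z - x} p^{z - x} (1 - p)^{n_2 - z + x}}
          {\binom{n_1 + n_2}{z} p^{z} (1 - p)^{n_1 + n_2 - z}} \\[4pt]
  &= \frac{\binom{n_1}{x} \binom{n_2}{z - x}}{\binom{n_1 + n_2}{z}}.
\end{aligned}
```

In the middle line the powers of p combine to p^z and the powers of (1 - p) combine to (1 - p)^(n1 + n2 - z) in the numerator, so both cancel against the denominator.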