Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.

A course from Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

From this lesson

Techniques

This module is a bit of a hodge podge of important techniques. It includes methods for discrete matched pairs data as well as some classical non-parametric methods.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Okay, so let's move on to another very important distribution,

maybe not quite so important as the normal distribution but awfully darn important.

Maybe, maybe the second most important distribution.

I think a compelling case could be made for

the Poisson distribution being the second most important distribution.

I don't know, I have leanings in that direction.

Okay, so the Poisson distribution is used to model counts.

The Poisson mass function is this guy, lambda to the x,

e to the negative lambda, all over x factorial,

where this is the probability that X takes the value x for parameter lambda.
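
Written out, the mass function described here is the following (using the usual convention of capital X for the random variable and lowercase x for the value it takes):

```latex
P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \ldots
```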

And it turns out that lambda is an easy parameter because it's just the mean;

the expected value of X is lambda.

But what's interesting about the Poisson random variable

is that its variance is also lambda.

Well, we knew its variance had to be some function of lambda because the Poisson

distribution only depends on the one parameter.

But it is kind of neat that it works out to the mean and

the variance are equal in this case.
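
A quick numerical check of that claim, as a minimal sketch in Python rather than the R used in this lecture. The value lambda = 2.5 and the truncation of the infinite sum at 200 terms are assumptions made only for illustration, and `poisson_pmf` is a helper written here, not a library function.

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

lam = 2.5             # illustrative rate (an assumption, not from the lecture)
support = range(200)  # truncate the infinite sum; the tail is negligible here

# Mean and variance computed directly from the mass function.
mean = sum(x * poisson_pmf(x, lam) for x in support)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in support)

print(mean, var)  # both should be very close to lam
```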

And then notice in this case, x ranges from 0 to infinity.

So you might use the Poisson to model counts that are sort of unbounded.

So, however many people show up at a bus stop,

sure you know there's some bound, there's only so many people in the world.

But for all intents and purposes, that's an unbounded number, you can't really put

a number on what the limit of that really is in a meaningful way.

Of course it's bounded by the number of total people, but it's kind of

conceptually unbounded in a different way than if I flip a coin five times and

I know I'm going to flip it five times that's a different problem.

Right?

We know that the most number of successes I can get is five.

Whereas if you're modeling how many people show up at a bus stop, then you really

kind of don't know a realistic version of what that upper limit would be.

So, some uses for the Poisson. Event-time data:

if you're counting anything like the number of people that show up at a bus

stop, the number of photons that are detected from a nuclear reactor

per given unit time, these are all reasonable things to model with a Poisson.

So, radioactive decay is the classic one because you can kind of demonstrate via

some limiting arguments that radioactive decay does follow a Poisson distribution.

Survival data, it turns out there's this deep connection between a lot of these

classic survival analysis models and Poisson models.

Survival analysis being modeling the time until some event.

Classically, time until death when looking at diseases.

Any kind of unbounded count data you're going to model as Poisson.

Contingency tables: so if you collect a bunch of people,

collect a bunch of characteristics on them, and just create cross-classified

tables of how many people fell into each classification.

That's called a contingency table and it turns out modelling the counts of

contingency tables, you model them with Poisson usually.

And then binomials, which are clearly not Poisson, if you have n being large and

p being small, then people tend to model them as Poisson.

In fact, people do this so frequently that they don't even

bother to mention that they are approximating a binomial with a Poisson.

So if you're modeling say for example a disease that is very rare and

you have large sample sizes.

Let's suppose you're studying autism and vaccination rate so

that the percent of the kids with autism is very small,

the number of kids that gets vaccinated is very large.

So if you were to model the autism rates,

you would do it probably more likely with a Poisson than you would with a binomial.

I take that example because I saw someone doing that just the other day,

that they were studying that question.

Okay.

So this is where it comes from:

the Poisson distribution comes about from so-called

Poisson processes. For a Poisson process,

you define lambda as the mean number of events per unit time.

You let the kind of a window h that you're looking at be very small,

and we can assume that in that interval of length h

the probability of an event occurring is lambda times h.

While the probability of more than one event is negligible.

So imagine you are monitoring your bus stop, and

you're going to look at windows of 0.1 second, and

only one person can show up at a time in that 0.1 second.

So let's just assume no two people are showing up holding hands or

something like that, and it's a commuter day where everyone's coming by themselves.

Then if you take a small enough time window, you're only going to get one

person showing up to the bus stop at any given time.

And then whether or not a person shows up in one interval doesn't impact whether or

not a person shows up in another interval.

That assumption, and maybe all of these for the bus stop example, are suspect.

But at any rate, these are the underlying assumptions for

something to be a Poisson process.

And then if you take an interval and you count the number of

events that occur in that interval that's a Poisson random variable.

And that's the original derivation of the Poisson random variable through

Poisson processes.
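
The small-window construction can be simulated directly. This is a minimal sketch in Python under assumed values (a rate of 2.5 events per unit time, a window of h = 0.002, and 4000 repetitions); none of these numbers come from the lecture, and `simulate_count` is a hypothetical helper. Each window independently contains one event with probability lambda times h and never more than one.

```python
import random

def simulate_count(rate: float, total_time: float, h: float, rng: random.Random) -> int:
    """Count events over total_time by splitting it into windows of length h.

    Each window holds one event with probability rate * h, independently of
    the others; more than one event per window is treated as impossible.
    """
    n_windows = int(round(total_time / h))
    return sum(1 for _ in range(n_windows) if rng.random() < rate * h)

rng = random.Random(0)              # fixed seed so the run is repeatable
rate, total_time, h = 2.5, 1.0, 0.002  # assumed values for illustration

counts = [simulate_count(rate, total_time, h, rng) for _ in range(4000)]
sim_mean = sum(counts) / len(counts)
sim_var = sum((c - sim_mean) ** 2 for c in counts) / (len(counts) - 1)

# For a Poisson count, both should be near rate * total_time.
print(sim_mean, sim_var)
```

With a small enough window the simulated counts should have mean and variance that are both close to the rate times the interval length, which is the Poisson signature.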

So I want to emphasize that the idea of studying rates and

using the Poisson are really kind of highly tied together.

The lambda parameter from a Poisson has a unit, right?

If I'm looking at radioactive decay, lambda is the number of decays per unit time and

t is the number of time units that I monitor.

So if I wanted to look at radioactive decay per minute,

t would be in minutes and lambda would be the rate per minute.

So note, lambda is the expected value of X over t,

which is the expected count per unit time, and t is the total monitoring time.

So we always use the Poisson distribution this way.

Where we think of sometimes t is 1, in which case it drops out.

But in our heads, we're always thinking of a Poisson

random variable as being a model for rates.
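
In symbols, the rate formulation described here can be written as follows, with X the count observed over a monitoring time t:

```latex
X \sim \mathrm{Poisson}(\lambda t), \qquad
E[X] = \lambda t, \qquad
\lambda = E\!\left[\frac{X}{t}\right]
```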

Here's the Poisson approximation to the binomial, if you ever need to use it.

We assume that lambda is n times p and X is binomial with parameters n and p.

Recall, we're setting lambda equal to np.

If n gets large and p gets small, but lambda stays constant,

then you can approximate X as Poisson.
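
Stated as a limit, with p replaced by lambda over n so that np stays fixed at lambda, this is the standard result:

```latex
\lim_{n \to \infty} \binom{n}{x} \left(\frac{\lambda}{n}\right)^{x}
\left(1 - \frac{\lambda}{n}\right)^{n - x}
= \frac{\lambda^{x} e^{-\lambda}}{x!}
```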

Let's do it.

So imagine the number of people that show up at a bus stop is Poisson

with a mean of 2.5 per hour.

And you are watching the bus stop for.

Oh, I'm sorry, this is not an example of a Poisson approximation of a binomial.

This is just a regular Poisson.

So, the number of people that show up

at a bus stop is Poisson with a mean of 2.5 people per hour.

We watch the bus stop for four hours, what's the probability that three or

fewer people show up the whole time?

That is exactly just a Poisson probability.

ppois of 3, remembering that when we use the p function for a probability distribution,

it does three or less, which is what we want, so we put three in there.

Lambda is 2.5 times the number of hours that we monitored,

4, and that works out to be about 1%.

Okay, so see how we used it?

Lambda was the event rate per unit time, and

t was the number of units of time that we measured.
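
For anyone following along without R, the same bus stop calculation can be sketched in Python. The lecture uses R's `ppois`; the `poisson_cdf` helper below is a hand-written stand-in using only the standard library, not a library function.

```python
import math

def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam), summing the mass function from 0 to k."""
    return sum(
        math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))
        for x in range(k + 1)
    )

rate_per_hour = 2.5  # mean arrivals per hour, from the example
hours = 4            # total monitoring time, from the example

# Three or fewer arrivals over the whole four hours: lambda * t = 10.
p_three_or_fewer = poisson_cdf(3, rate_per_hour * hours)
print(round(p_three_or_fewer, 4))  # 0.0103
```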

Okay, so there's your Poisson example.

Let's go through an example of the Poisson approximating the binomial.

We flip a coin with success probability 0.01, so p is small.

We flip it 500 times.

What's the probability of 2 or fewer successes?

So again, pbinom(2, size = 500, prob = 0.01).

Here's the exact calculation.

It works out to be 0.12.

Here's the Poisson approximation, ppois(2, lambda = 500 * 0.01),

n times p, and that works out to be 0.1247.

So pretty close.

So that's the Poisson approximation of the binomial.
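
The comparison can be reproduced outside R as well. The sketch below is Python with only the standard library; `binom_cdf` and `poisson_cdf` are hand-written stand-ins for R's `pbinom` and `ppois`, not library functions.

```python
import math

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))

def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    return sum(
        math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))
        for x in range(k + 1)
    )

n, p = 500, 0.01  # number of flips and success probability from the example

exact = binom_cdf(2, n, p)        # exact binomial probability
approx = poisson_cdf(2, n * p)    # Poisson approximation with lambda = n * p

print(round(exact, 4), round(approx, 4))  # 0.1234 0.1247
```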

So here we just showed you exactly how accurate it was in this specific case.

And then, in your regression class, you will actually cover modeling

counts using kind of a Poisson version of regression.

So it's a very convenient model, and it's great for modeling things like rates.

And I would also mention that this Poisson approximation to the binomial is so

common that people don't even acknowledge that they're doing it.

It's just sort of done immediately in a lot of applications,

particularly certain epidemiological applications.

If you're studying something like an infection or something like that that's

rare relative to the size of the population you're studying,

people just automatically use Poisson approximations.