A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.


Course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

207 ratings


From the lesson

Module 2C: Summarization and Measurement

This module consists of a single lecture set on time-to-event outcomes. Time-to-event data comes primarily from prospective cohort studies with subjects who have not had the outcome of interest at their time of enrollment. These subjects are followed for a pre-established period of time until they either have the outcome, drop out during the active study period, or make it to the end of the study without having the outcome. The challenge with these data is that the time to the outcome is fully observed on some subjects, but not on those who do not have the outcome during their tenure in the study. Please see the posted learning objectives for each lecture set in this module for more details.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this set of lectures, we're actually going to take

a look at some ways to summarize time to event data.

And you'll recall, time to event data arises when we start with a cohort

of observations that we follow over time,

until they have some well defined binary event.

And over the course of the follow up period, some

people will drop out of the study before having the event.

Some people will have the event, in which case,

we'll know the time that they had the event.

And some

people will make it to the end of the study without having the event.

And so we'll need methods that account for two dimensions of these data.

One, the number of occurrences of the event of interest, and two, the total time the persons in a given group are followed during the study.

So what we're going to develop in this set of

lectures are numerical summary measures that encapsulate both those ideas.

These include incidence rates, for summarizing the results from any one sample, and incidence rate ratios, which compare the results from two samples.

But there's also a richness in these data that can't be fully appreciated with those aforementioned summary measures, so we'll also be moving towards a graphical approach that will help us track the two dimensions visually, and create a summary statistic that looks at both the time and frequency of event occurrence as it unfolds over time.

Okay, in this next series of lectures, we'll talk about yet another outcome

data type, Time to Event Data, most commonly called Survival Data.

In this lecture section, and subsequent sections, we'll define the types of studies that lend themselves to Time to Event Data, and the challenges with analyzing it using methods we've already established, like those for summarizing continuous data or binary data. And we'll talk about ways to summarize this data appropriately as well, both in numerical and visual formats.

So this first section we're actually going to talk

about a numerical summary measure for time to event analysis, also known as survival analysis. And again, we'll also set up

the idea of this type of data and explain

how it arises.

So hopefully upon completion of this lecture section, you will be able to

distinguish between calendar time and study time scales for time to event data.

Define censoring in the context of time to event studies.

Explain why either ignoring the time component, or averaging

subject follow-up times, can be problematic for summarizing such data.

And finally you'll be able to compute event

incidence rates using event counts and cumulative follow-up times.

So let's start with an example to set the stage for this.

This is a classic clinical trial performed at the Mayo

Clinic on subjects with Primary Biliary Cirrhosis, a condition of the liver.

And from here on in we'll call it PBC. And this was a randomized clinical trial that ran from January 1st, 1974 through December of 1983.

So there was a ten year accrual and follow-up period.

So patients were enrolled over that time and followed until the study ended, or until they died or dropped out first.

The primary endpoint here was death.

So just to sort of summarize the key characteristics

of this trial, this was a randomized clinical trial.

Set up to investigate the effects of D-Penicillamine on survival in patients with primary biliary cirrhosis.

So the primary outcome of interest was death in the followup period.

Ultimately the researchers were interested in evaluating the

effect of this drug, D-Penicillamine, or DPCA, on survival.

But what are some of the challenges with these types of data? The researchers recruited persons to be in this study who were afflicted with Primary Biliary Cirrhosis.

And then actually

had to follow them over time to see

whether they died or not in the follow-up period.

And then ultimately wanted to compare the treatment and placebo groups.

So let's talk about some situations that can arise in these types of data, using

three case studies on three potential, or

hypothetical, subjects who were enrolled in this study.

Subject number one, we'll call him or her, enters at the start of the study, so enters as soon as the study begins in January of 1974, and then dies seven years later.

So there's actually two ways of measuring his or her contribution to the study.

We can do it in terms of calendar time,


which actually tracks this person's tenure in the study as

a function of the dates he or she was involved.

So this person started at the beginning, January 1974, and actually was in the study for seven years, at which point they had the event of interest under study, if you will, death. So they made it through December of 1980, or January of 1981: seven years on the calendar.

If we wanted to map this to their tenure in the study, relabeling the axis as time zero when they started the study, and then the actual time after time zero that they had the event or finished their tenure in the study, it would look exactly the same as the previous diagram, but we relabel things.

Instead of putting the date at the beginning

where they started we call that their time zero.

That's when they enrolled in the study.

And then we'd follow them until they actually either

left the study or had the event of death.

In which case, this person did so at seven years.

How about subject number two, hypothetical subject number two, somebody who does not

start when the study started, but actually enters four and a half years later in June

of 1978 and then is lost to follow-up two years later in May of 1980.

So what do we mean by lost to follow-up?

Well, this means that they drop off the researcher's radar.

And the last known visit or follow up visit to the study, they were still alive.

So all the researchers know is that two years after this person entered the study, they were still alive, and they don't know what's happened since that last check-in.

So in terms of the study time window, we can reset this person's start

date to their time zero.

And then their accumulated follow-up time was two years.

And this is not drawn to scale.

But I just put a little circle here, just notation wise

to indicate that at that point, they were lost to follow up.

So what do we know about this person in terms of the outcome of death?

Well, we don't know when they died, whether it happened shortly after the researchers lost track of them, or whether they're still alive at this point in time.

But we do know they didn't die one year after they started the study.

We do know they didn't die a year and a half after they started the study.

And we do know they didn't die within two years of the start of the study.

So we don't have a full piece of information on them, but we do know a lower bound on the time, after the study started, when they could have died.

So now let's look at hypothetical patient number three in this Mayo Clinic study.

The subject enters in November of 1980 and actually stays until the completion of the study, at which point the researchers stop following this person up because the study is over.

So this person actually lived throughout their entire term in the study.

They did not drop out early, but neither did they die, or have the event under study, by the end of the study.

So what do they look like in terms of the study time calendar?

Well, their initial start date of 11/80, or November 1980, would be considered their time zero, their entry into the study, at which point they were followed for three years, and then they finished the study still alive.

So again, another person who did not ultimately die during their time in the study.

So the only thing we know about the event under study of death

is that it had to happen more than three years after this person started

the study.

So it's sort of a partial piece of information about their death time.

So now let's put these three hypothetical subjects in this real study together on this study-time window graphic.

So, let's look at what we've got.

We've got this first subject here, the first one we looked at, who contributed

7 years of time in the study

and ultimately had the event under study, death.

The other two subjects contributed less time in the study.

Subject number two was there for two years before he or she was lost to follow-up. And subject three enrolled later in the study window and made it to the end of the study, ultimately contributing three years of follow-up to the study.

So collectively we have three patients who contributed various times of follow-up, and only one had the event under study, death.

So we want to think about how we can analyze these data, but first let's introduce some terminology that comes up frequently in time to event data: censoring.

So first, we'll define patient one, the first person who was in

the study for seven years, then ultimately had the event under study.

Ultimately died, we'd call him or her a complete observation.

We have total information about the event under study.

We know that he or she had the outcome under study, death, and

we know how long it took after they enrolled in the study, seven years.

So we have the complete key pieces of information about time to death after study enrollment.

Patients two and three are considered to be censored observations, meaning that they either dropped out of the study or finished the study without having the event of interest, here death.

So we have partial information, as we talked about before, on these two people.

So while patient two was still alive when he or she was lost to follow-up, we know that he or she survived at least two years on the study clock.

So we know they didn't die in the first year after enrollment.

They didn't die in the year and a half after enrollment.

Similarly, patient three survived at least three years on the study clock.

So how could we summarize this situation numerically?

Well, the first thing to do is just say look, John, death is a binary outcome,

let's treat it as binary and report the

proportion who died in the follow up period.

Well, these three hypothetical subjects illustrate the problem. We have three people total; one of them died, so the proportion of those three who died in the study period was 33%.

There's a potential or a real problem with this, though.

As we saw before on that study-time calendar, where we compared the experiences of the three different subjects, the amount of time at risk of the outcome, at risk of death, in the study period varies from person to person. This proportion does not recognize the differential times at risk these three people had in terms of dying in the follow-up period, and it gives all three equal influence in this summary measure.

Well we could take another approach.

You might say, John, you know, these time measures that we have, they're continuous.

Seven years, two years, three years.

Why don't we just treat them as continuous and do what

we usually do to summarize numerically a distribution of continuous data?

Take a mean or something like that. A median etc.

Well, let's look at taking the mean.

If we took the mean of these three times,

these three follow-up times, seven plus two plus three

divided by three, the average time in the study is four years.

It does measure the average time in the study, and that is useful for characterizing patient or subject participation in this study.

However, this does not capture average time to death since enrollment, because only one of these three persons died.

So if we tried to use this as a measure of the average time to death after enrollment in this study, it would be systematically underestimated, because the two-year and three-year measurements for those two censored observations are lower bounds on their actual times to death.

So this is not going to work either.

By the way, this approach, treating these times as continuous data and summarizing them via means, will work if there's actually no censoring.

So what can we do to actually capture the two dimensions of these data? One, we want to capture the binary nature of the outcome, whether or not the person died. But two, there is a time component that varies from person to person.

And we saw that the previous two approaches using what we've

known from continuous or binary methods missed one of these two pieces.

So a third option, which is frequently used, is something called the incidence rate, and it is actually like a hybrid of the two previous approaches.

What we do is we take the total number of events, in this case deaths, that occurred in this sample, and divide by the total amount of person follow-up time contributed by the persons in the sample.

So we saw one outcome, one death.

And the total amount of follow-up time amongst the three people in the study was 7 plus 2 plus 3 years, or 12 years.

So we could summarize the time to death experience for these three persons in the follow-up period of the study by saying there was one death per 12 person-years of follow-up.

Now note, this computation assumes that the underlying rate of death is constant across the entire follow-up period.

We will get to some methods in the second term that allow us

to adjust that assumption a little bit, but it's not necessarily a bad assumption.

What we're doing is assuming that the risk of death, as quantified by this estimated incidence rate, is the same in one year of follow-up, two years of follow-up, etc.; that across this entire 10-year follow-up period on which we've theoretically measured these three people, the incidence rate is one in 12. Okay.
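To make this bookkeeping concrete, here is a minimal Python sketch (the data representation and variable names are my own, not from the lecture) computing the naive proportion, the naive mean follow-up time, and the incidence rate for the three hypothetical subjects:

```python
# Each subject is a (follow_up_years, had_event) pair; had_event is True
# only for the complete (uncensored) observation.
subjects = [(7, True), (2, False), (3, False)]

n = len(subjects)
events = sum(1 for _, had_event in subjects if had_event)  # 1 death
person_years = sum(t for t, _ in subjects)                 # 7 + 2 + 3 = 12

proportion_died = events / n        # 1/3: ignores differing times at risk
mean_follow_up = person_years / n   # 4 years: NOT mean time to death,
                                    # since two observations are censored
incidence_rate = events / person_years  # 1 death per 12 person-years

print(proportion_died, mean_follow_up, incidence_rate)
```

Note how the incidence rate, 1/12 of a death per person-year, uses both the event count and the accumulated person-time, while each naive summary uses only one of the two.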

So before we dig into the incidence rate more, let's look at some more examples.

Okay?

So this is an interesting study that was published

in the New England Journal of Medicine in 2011.

And it was about antiretroviral therapy and partner-to-partner HIV transmission.

And what they did in the study was they found what are called serodiscordant couples: sexual partners who are serodiscordant in terms of HIV, where one partner was HIV positive and the other wasn't.

And they did this across multiple countries.

And what they did was they randomized these couples so that the HIV-positive partner would receive antiretroviral therapy either immediately, as soon as they enrolled in the study, or after they met a certain clinical cut point, which is delayed therapy, the standard used.

And the primary prevention outcome of interest, the event they were looking for, was linked HIV transmission within these serodiscordant couples.

Here's a little snippet of the study.

They enrolled HIV-1 serodiscordant couples at 13 sites in nine countries. There was a pilot phase to assess the feasibility of recruiting, starting in April 2005.

assess the feasibility of recruiting in April 2005.

And then enrollment actually took place from June 2007 through May 2010.

And they actually followed people up through February 2011 to see if there were HIV-positive occurrences amongst the originally HIV-negative partners in these couples.

So what they said in the results section is that, as of February 21, 2011, there were a total of 39 HIV transmissions, and they quantified this in the cohort, this group of people, as an incidence rate of 1.2 cases per 100 person-years.

So, 1.2 cases of transmission per hundred person-years of follow-up. How can we think of that?

Well, we can think of it this way: if we were to follow 100 HIV-serodiscordant couples for a year under the study conditions, where some were randomized to early treatment and others to the standard, we'd expect, of those hundred couples each followed for a year, to see about 1.2 cases of HIV transmission.

Another way to think about this: if we were to follow 200 couples each for six months, they would contribute a total follow-up person-time of a hundred person-years, and in those 200 couples followed for six months, we'd also expect to see about 1.2 cases.
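As a quick check on that equivalence, here is a sketch assuming only the published figure of 1.2 transmissions per 100 couple-years: any follow-up scheme that accumulates 100 couple-years of person-time implies the same expected count.

```python
# Reported rate: 1.2 linked transmissions per 100 couple-years of follow-up.
rate_per_couple_year = 1.2 / 100

# Two schemes that accumulate the same total person-time:
#   100 couples followed for 1 year each, or 200 couples for half a year each.
for n_couples, years_each in [(100, 1.0), (200, 0.5)]:
    couple_years = n_couples * years_each            # 100 in both cases
    expected = rate_per_couple_year * couple_years   # expected transmissions
    print(n_couples, couple_years, expected)         # ~1.2 expected in each
```

The expected count is just rate times person-time, which is why the two schemes are interchangeable under the constant-rate assumption.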

So let's look at another study.


Maternal Vitamin Supplementation and Infant Mortality.

This was a study done on women in Nepal, where they

randomized the women to receive vitamin A, beta carotene, or placebo.

And a sub-cohort of these women were pregnant.

And so they used that to look at the birth outcomes

and risk of infant mortality in the first six months following birth.

And they ultimately want to compare that across the three groups of mothers, in terms of whether they received the treatments, vitamin A or beta carotene, or the placebo.

So they say, in their results section, a total of 43,559 women were enrolled.

15,892 of these women contributed 17,373 pregnancies and

ultimately there were nearly 16,000 live born infants in the trial.

So, the investigators are part of Hopkins, and they were kind enough to share a two-thirds random sample of the live birth data with me so that I could actually do some analyses.

So I don't have the full data set, but I have the two-thirds of it that was randomly sampled, so I'll just summarize what I found in these data to start.

If you were to actually look at the total follow-up time for the infants born to these women, the live births in the sample I have, collectively the total follow-up time contributed by these 10,295 infants in the first six months following birth was 1,627,725 days. And the total number of deaths in the six-month follow-up period was 644.

So let's see if we can now quantify the risk of death using the incidence rate.

So, if we do it based on the recipe I gave you before, we would compute the number of deaths, 644, divided by the total exposure time contributed by all the infants under study. And since the original time unit was days, this would come out to be a rate of roughly 0.0004 deaths per day.

And this is a little bit hard to interpret on its own; it will be easier to interpret when we start comparing these rates between differing groups.

But frequently, what's done when we have a rate that's a non-integer value, or at least doesn't have a whole-number component to it, is to convert it to a different per-unit-time measure.

So for example, to start, we can convert this measurement that was

deaths per day to deaths per year of follow up time by

multiplying it by the number of days in a year.

So this would come out to about 0.146 deaths per person-year, or infant-year, of follow-up time.

Now interestingly, we only followed these infants for six months, so we'll have to think about what that means in a minute.

But we can then also convert this to something where we have a whole number of deaths.

So if we took this 0.146 deaths per year and multiplied it by 500 years, we'd get a measure of 73 deaths per 500 person-years of follow-up time.

So these three numbers are actually equivalent in terms of expressing the risk of infant mortality in the six months following birth in this study sample.
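The unit conversions above can be reproduced directly from the reported counts. Using 365 days per year, the results land slightly below the lecture's rounded figures of 0.146 and 73 (about 0.144 and 72):

```python
deaths = 644
follow_up_days = 1_627_725  # total infant follow-up in the six-month window

rate_per_day = deaths / follow_up_days    # ~0.0004 deaths per infant-day
rate_per_year = rate_per_day * 365        # ~0.144 deaths per infant-year
rate_per_500_years = rate_per_year * 500  # ~72 deaths per 500 person-years

print(rate_per_day, rate_per_year, rate_per_500_years)
```

All three numbers express the same underlying rate; only the per-unit-time scale changes.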

But when we think about 73 deaths per 500 person-years, we can really only think about that applying to the six months of follow-up we have.

We can't extrapolate it beyond six months after their birth.

So, it's a little misleading to take this number at face value.

Obviously, we're not going to follow a single infant for 500 years.

But, even if we were to follow 500 infants for a year,

we don't really know what would happen in terms of the risk

of death after the six-month period.

So this number can really be interpreted as 500 years of follow-up within the six months after birth.

So it could be something like this: if we were to follow roughly 1,000 infants for up to six months, well, some would drop out and some would die early, but ultimately they would contribute almost 500 years of follow-up time, and we'd expect to see about 73 deaths.

Or if we were to follow 6,000 infants for a month, they would collectively, with the caveat that some may die in that month or drop out, contribute about 500 years of follow-up time, and we'd expect to see about 73 deaths.

So just a note on the terminology before we go further in this topic.

The analysis techniques here are for this type of prospective data, and I'll call it prospective cohort data. Prospective means that we're following our observation units over time from some starting point to some defined endpoint, or to some drop-out, like we've seen.

Cohort just means group. So we're following a group over time where

time to an event is of interest.

It has several synonymous titles, the one I've been using is Time to Event Analysis.

Most frequently it's called Survival Analysis.

But that implies that the event of interest is death.

And it can be death, but it doesn't have to be death.

Now, for example, these types of techniques that we'll talk about here, and explore further in this lecture section, can be used for any type of study where we're following subjects over time to a well-defined binary outcome, and there may be censoring.

So for example, if we took a bunch of people who

were smokers and put them into some sort of counseling program.

And then followed them for a year and wanted to see what

the rate of quitting smoking was, then our outcome would be quitting smoking.

It's not a bad outcome, it's a good outcome.

But we can compute the incidence rate for

risk of quitting smoking over the follow-up period.

Or we can look at something like time to completion of a high-school equivalency degree, in the U.S. sometimes called a GED, after some sort of mentoring program.

So these things do not have to be death, and they do not necessarily have to be negative outcomes, to be analyzed under this framework.

So, about the ideas we've developed in this lecture section, and will continue to develop throughout the lecture: these time to event data analysis techniques we're talking about sometimes fall under an umbrella called survival analysis. I prefer to call it time to event data analysis, but the common language used is survival analysis.

I think that's a little misleading, because we can certainly study things like factors associated with death, where we're following people over time until they die or are censored.

But there's other outcomes that certainly can be studied in this framework.

For example, these types of techniques have been used to study factors associated with time to quitting smoking, or time to finishing a PhD program, or time to recidivism after a person is released.

So don't feel limited by the nomenclature that's often used.

This can be used for any cohort study situation where we're following

observations until a well-defined binary event.

So, in summary, what have we covered here?

Well, we've set up the situation that gives rise to time to event data.

We've discussed the issues that come up with following subjects over time.

The issue of getting censored observations, and hence only partial pieces of information, from any cohort we're following over time.

And we've discussed why our previous summary methods for binary and continuous data fall short of adequately describing what's going on in the sample when we have censoring.

So we introduced the incidence rate as a way to summarize the time to event experience, which recognizes both the binary nature of the outcome and the dimension of time.

In subsequent sections in this lecture, we'll see how to compare the survival experience across samples by comparing incidence rates.

And also how to visually and graphically

describe the timed event experience in groups.