Welcome to Lecture Set 5 in Public Health Statistics where we'll talk about

data where we have an element of time that we have to deal with in summarizing it.

Throughout this set of lectures,

we'll look at numerical and graphical measures for both summarizing results in

single samples and comparing results between

samples when there's an element of time in the measures we're looking at.

So, we'll first talk about sample incidence rates as

summary measures for time-to-event analysis and spatial data collected over time.

So, for spatial data,

that is, event counts and person totals in a defined observation period,

perhaps across multiple regions,

you will be able to summarize the event counts, standardized by person counts, as event rates.

For time-to-event outcomes where individual outcome times are known,

so we have data at the individual level,

you will be able to distinguish between

calendar time and study time scales for time-to-event data,

define censoring in the context of time-to-event data,

and explain why either ignoring the time component or

averaging subject follow-up times can be problematic for summarizing

time-to-event data and hence we need other measures like

incidence rates using event counts and cumulative follow-up times.

So, whether we have spatial data,

where we have counts and person totals in a defined observation period,

so the number of cases per number of persons per year,

or we have individual times that we can aggregate to get the total person time,

we'll see that incidence rates are

a numerical summary measure that addresses the time component.

So, let's first talk about event rates for data without known event times,

and this would include spatially collected data over a fixed period of time.

So, for some data involving events occurring over time,

the exact event times are not recorded but are grouped into time intervals.

This is the case for many death and disease rates grouped by area (country, state, city,

region, zip code, et cetera)

and by year or some other unit of time.

So, let's look at an example of this.

Here are some data from the year 2002 on the incidence of

lung cancer diagnoses in the state of Pennsylvania in

the United States and on the overall state population as well.

The data that we have are stratified by county, sex,

race and age groups, but for now,

we'll aggregate them into an overall look at the state.

So, how can we summarize this for the entire state of Pennsylvania in the year 2002?

Well, we can compute what's called an incidence rate of

lung cancer in the year 2002 by taking the total number of

cases that accrued in that year and dividing by the total person-time at risk in that year.

So, in these data from Pennsylvania,

the individual diagnosis times for new cases in 2002 are not known,

so what we do for the total person-time at risk is assign

one year to each member of the sample being analysed.

So, we assume everybody lived in Pennsylvania for the entire year, was observed for

a year, and contributed a year of follow-up

to our understanding of the process of lung cancer.

Certainly, people develop lung cancer at various points during that year,

but we're going to count everyone equally in the person-year total.

So, in Pennsylvania in 2002,

there were 10,279 lung cancer diagnoses and 12,281,054 residents in the state.

So, again we don't know the exact times of

the lung cancer diagnoses, and we don't know

how many residents stayed for the entire year versus moved in or out,

so we're going to assume that all 12 million were there for

a year and that all cases were diagnosed at the end of the year,

and the incidence rate under this assumption, represented by IR

with a hat over it to indicate that it's an estimate,

is 10,279 cases per 12,281,054 person-years.

So, if we convert this to a decimal,

this comes out to be about 0.0008 lung cancer cases per person per year,

sometimes given as person-year.

The incidence rate can certainly be rescaled to different units of person-time;

this is usually done so that the numerator has an integer component.

So, for example, this incidence rate came out as a decimal,

0.0008 cases per person-year or per person per year;

we could rescale it to per 10,000 person-years by

multiplying it by 10,000, and then our numerator would be an integer, eight.

So, another way to express the same summary statistic is eight cases per

10,000 person years in the state of Pennsylvania in the year 2002.
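To make that arithmetic concrete, here is a minimal sketch in Python using the two figures quoted above. The variable names are just illustrative, and the one-year-per-resident person-time is the same assumption we made in the lecture, not something measured.

```python
# Figures quoted in the lecture for Pennsylvania, 2002.
cases = 10_279
residents = 12_281_054

# Assumption from the lecture: every resident contributes one full year at risk.
person_years = residents * 1.0

# Estimated incidence rate: events divided by total person-time at risk.
rate_per_person_year = cases / person_years

# Rescale so the numerator is easier to read.
rate_per_10k = rate_per_person_year * 10_000

print(f"{rate_per_person_year:.4f} cases per person-year")
print(f"{rate_per_10k:.1f} cases per 10,000 person-years")
```

Unrounded, the rescaled rate is a little above eight per 10,000 person-years; the "eight" in the lecture comes from rounding the decimal to 0.0008 first.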

So, you might say, "Well, wait a minute John,

isn't this just a proportion, a measure for a binary outcome?

You take the number of cases out of the total population count."

Essentially, it is, but there is an element of time,

so technically it's a rate because we are looking

at this proportion over a year of follow-up.

Nevertheless, you could think of this in percentage or proportion terms.

There's a 0.08 percent incidence of new cases in the year.

In other words, 0.08 percent of the sample under study developed lung cancer in 2002,

and remember there were 0.0008 cases per

person per year, which as a percentage is just 0.08 percent.

However, even if we think of this as a percent over some unit

of time or a proportion like we did with binary outcomes,

these rates tend to be very small as proportions and as such,

their statistical properties will differ

from the proportions as we have defined them previously.

These tend to be proportions that are very close in

numerical value to zero when taken as a proportion,

and technically, there's also an element of time.

So, in some situations,

we have time-to-event data where we know the event times.

So, in the lung cancer dataset,

we knew the total number of cases that were diagnosed in the year,

but we weren't given information on when the diagnosis was in that year.

Was it in the first month,

was it in the 11th month, et cetera?

We don't know and that's why we had to make

those assumptions about the cases being diagnosed at

the end of the year and everybody living in

Pennsylvania for the entire year to get an incidence rate.

But for some time-to-event data,

the individual event times are known,

and then this individual event time information

can be incorporated into incidence rate computations.

So, this is the case for many longitudinal cohort studies where

subjects are followed from a defined starting point up to a certain amount of time.

So, let's look at an example here to kick this off.

We're looking at a randomized trial conducted at

the Mayo Clinic in Rochester, Minnesota, in the US, and here's a description.

This was on patients with

primary biliary cirrhosis and they were randomized to either receive a drug or placebo.

The study began on January 1,

1974, and patients were accrued

up until December 1983.

During that 10 year period,

422 patients with primary biliary cirrhosis satisfied the admission or entry criteria for

the study and 312 of them consented to enter the study and were

then randomized to be on the drug or placebo arm.

The primary outcome of interest for this study was survival,

or, another way of looking at it, death in the follow-up period.

Ultimately, the researchers were interested in comparing the incidence of death

in those who got the treatment, D-penicillamine, versus those who got a placebo.

But let's use this as a springboard for thinking about what can

happen when we follow subjects over time in a cohort setting.

So let's look at a couple examples of patients we may get in the study.

So we have patient one, for example.

He or she enters the study right when the study started in January of 1974,

and is followed for seven years at which point he or she dies.

So he or she has the event after seven years.

In terms of the study-time window,

this person's time zero happened to be

the start of the study so in terms of the study time,

they were also followed from time of randomization for seven years.

So what do we know about this person?

We know that he or she did ultimately have

the event under study, death, and it was

seven years after they were assigned to a treatment group.

Let's look at another person. Subject two.

Subject two did not enter the study at

the beginning of the study. Remember, there was a long accrual period, so people

were admitted well after that initial start date of January 1974.

So this person enters in June of 1978,

and is followed up until the end of May 1980.

So in terms of the calendar time,

they started well after the beginning of the study

in the sense that when the study was open for people to participate,

they started four years after that time,

and were followed for two years after

they entered, at which point they were lost to follow-up.

At their last visit in May of 1980,

they were still alive.

So all we know about this person is that they entered in

June of 1978 and were still alive as of May of 1980.

We don't know when they went on to die,

we just know it didn't happen before May of 1980.

So in terms of our study time, if we're calculating time in study,

the moment this person entered the study is their time zero for the study,

that's when they were randomized to the treatment or placebo,

and after that they were followed for two years, the period

from the beginning of June 1978 to the end of May 1980.

They were followed for two years, at which point they were lost to follow-up.

We have no information about what happened after two years,

all we know is that they were still alive after two years from randomization.

Then let's suppose we have another person who entered the study in November of 1980.

So later in the accrual period,

they were still enrolling people in November of 1980 and they actually stayed

alive and were still alive at

the last measurement or checking for the study, December of 1983.

So this person wasn't lost to follow-up so much as that they

were not followed anymore because the study officially ended.

So what do we know about this person from their time

of randomization until they stopped being followed?

It was a three-year period.

So they survived, if you will, or made it three years into the study

without having the event of death, at which point we were

no longer doing the study and they were no longer being followed.

So if we put all three of these subjects together on the study time graphic,

so map everyone to their measurement in terms of study time,

here's the first patient.

He or she, from the time of randomization,

which was time zero, made it seven years, at which point they had the event, death.

Patient two made it two years from the time of randomization,

their time zero, and was still alive at

two years when the researchers last saw the patient.

Then patient three made it three years, at which point

the study ended, and patient three was still alive at three years.

So all we know about their time to death is that

it had to be more than three years after they were assigned to a treatment group.
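Here is a rough sketch of how that mapping from calendar time to study time could be done in Python. Only the months and years come from the lecture; the exact days of the month, the patient labels, and the `study_years` helper are assumptions made up for illustration.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # average year length, used to convert days to years

# patient: (entry date, exit date, died?)
# Days of the month are assumed; the lecture only gives months and years.
patients = {
    1: (date(1974, 1, 1), date(1981, 1, 1), True),     # died seven years in
    2: (date(1978, 6, 1), date(1980, 5, 31), False),   # lost to follow-up
    3: (date(1980, 11, 1), date(1983, 12, 31), False), # study ended
}

def study_years(entry, exit_):
    """Time on the study clock: each person's entry date is their own time zero."""
    return (exit_ - entry).days / DAYS_PER_YEAR

study_time = {
    pid: study_years(entry, exit_)
    for pid, (entry, exit_, _died) in patients.items()
}

for pid, years in study_time.items():
    status = "died" if patients[pid][2] else "still alive (censored)"
    print(f"patient {pid}: {years:.1f} years in study, {status}")
```

With these assumed dates, patient three comes out slightly over three years; the lecture rounds that follow-up to three.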

So in terms of complete versus censored observations,

subject one is what we might call a complete observation.

We know that he or she had the outcome under study;

they actually died after seven years in the study.

So we know that they died and when they died.

Subjects two and three are called censored observations.

We have partial information about the outcome under study of death.

We don't know when they died, but we have a lower bound on when they could have died.

So since subject two was still alive when he or she was lost to follow-up,

we know that he or she survived two years on the study clock.

So they couldn't have died one year after treatment assignment or

one and a half years after; if they did die, it was beyond two years.

Similarly, we know that subject three made it three years

without dying before the study ended.

So for both of them,

we don't have a death time, but we have a lower bound on what their death time could be.

For subject two it had to be

more than two years, and for subject three it had to be more than three years.

So there is some information, though it's only

partial information, from these censored observations.
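One common way to store this kind of data is a follow-up time paired with an indicator of whether the event was observed. The sketch below assumes that layout; the `Observation` class and the `what_we_know` function are hypothetical names for illustration, not part of any particular package.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    years: float  # time on the study clock
    died: bool    # True = complete observation, False = censored

def what_we_know(obs):
    """Describe the information an observation carries about time to death."""
    if obs.died:
        return f"died at exactly {obs.years:g} years"
    # Censored: the follow-up time is only a lower bound on the death time.
    return f"death time is more than {obs.years:g} years"

subjects = {
    1: Observation(7, True),
    2: Observation(2, False),
    3: Observation(3, False),
}

for sid, obs in subjects.items():
    print(f"subject {sid}: {what_we_know(obs)}")
```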

So how could we summarize what happened with these three patients numerically?

Well, option A would be to treat death as

a binary outcome and report the proportion who died in the follow-up period.

So if we were just looking at these three subjects,

a cohort of three,

one of the three patients died, so the proportion of those three who

died was one in three, or 33 percent.

The problem with this is the amount of time

that each of these patients was followed after

randomization and hence their time at risk of death

from randomization varies from person to person.

Taking a simple proportion ignores this fact and gives

all three persons equal influence in computing a summary measure on the event.
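A small sketch can show what "ignores this fact" means. Below, the first cohort uses the three follow-up times from the lecture, and the second is a made-up cohort where the two censored subjects were followed only briefly. The function name is hypothetical.

```python
def proportion_died(cohort):
    """Simple proportion: deaths divided by number of subjects, ignoring follow-up time."""
    return sum(died for _years, died in cohort) / len(cohort)

# (years of follow-up, died?)
lecture_cohort = [(7, True), (2, False), (3, False)]

# Hypothetical cohort, not from the study: same deaths, far less time at risk.
short_follow_up_cohort = [(7, True), (0.1, False), (0.2, False)]

p_lecture = proportion_died(lecture_cohort)
p_short = proportion_died(short_follow_up_cohort)

print(f"lecture cohort: {p_lecture:.2f}")
print(f"short follow-up cohort: {p_short:.2f}")
```

Both cohorts give the same one in three, even though the total time at risk is very different, which is exactly the information the proportion throws away.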

So, another option would be to treat the follow-up times

as continuous and report the average time.

So, treat this as a continuous measure.

So, we'd average them: seven plus two plus three, over three, is four years.

But what would this be an average time for?

Well, only one of the three subjects died while in the study,

so this average is not capturing the average time to

death, only the average follow-up time.

If we tried to use this as a measure of average time to death,

it would systematically underestimate

the true average because we would be

including two persons whose times were not their death times,

but their times of last follow-up.
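Here is a minimal sketch of why the average falls short, followed by the kind of measure mentioned at the start of the lecture: event counts over cumulative follow-up time. The "true" death times for subjects two and three are invented purely for illustration; all we actually know is that they exceed two and three years.

```python
follow_up = [7, 2, 3]         # years on the study clock, from the lecture
died = [True, False, False]   # only subject one had the event

mean_follow_up = sum(follow_up) / len(follow_up)

# Hypothetical death times, NOT from the study. For censored subjects the only
# constraint is that the death time is later than the last follow-up.
hypothetical_death_times = [7, 9, 5]
mean_death_time = sum(hypothetical_death_times) / len(hypothetical_death_times)

# Incidence rate: number of events divided by total person-time at risk.
events = sum(died)
person_years = sum(follow_up)
incidence_rate = events / person_years

print(f"mean follow-up time: {mean_follow_up:.1f} years")
print(f"mean of hypothetical death times: {mean_death_time:.1f} years")
print(f"incidence rate: {events} death per {person_years} person-years "
      f"= {incidence_rate:.3f} per person-year")
```

Whatever death times you plug in for the censored subjects, each one is at least as large as the follow-up time it replaces, so the mean follow-up can never exceed the true mean time to death.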