A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

A course from Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

235 ratings

From this lesson

Module 2A: Summarization and Measurement

Module 2A consists of two lecture sets that cover measurement and summarization of continuous data outcomes for both single samples, and the comparison of two or more samples. Please see the posted learning objectives for these two lecture sets for more detail.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Here we go, Lecture 2: Continuous Data Measures.

So in this first section, we're going to look at some

useful summary statistics for summarizing samples of continuous data measures.

Some of these will look familiar, like the mean and the median, and some

of them you may have heard of or may have not, like certain percentiles.

And we'll just go through all of these and look at some examples.

So upon completion of this lecture section, you will be

able to compute a sample mean and standard deviation by hand,

albeit we won't be doing that often in this class.

And interpret the estimated mean, standard deviation, the median

and various percentiles computed for a sample of continuous measures.

So what we're going to be talking about here include measures of central tendency.

Sometimes they're called measures of the center of the data, which includes

the aforementioned mean, and median, which is sometimes called the 50th percentile.

And then a measure of data

variability that's universally used, called standard deviation.

And then other measures of location,

percentiles, and we'll define all of these.

So let's start with the sample mean.

Many of you know how to compute a mean, or may remember fondly

how to compute a mean from some other point in your mathematical life.

But what we do to compute a mean of

data is add up the continuous data points and then

divide by the total number of points, the sample

size, which we'll frequently represent with a lowercase letter n.

The sample size n is the number of observations or pieces

of data in the sample.

So for example, we're going to look at an example of doing this by

hand. You would only want to do this by hand for small samples of data.

Obviously we'll rely on the computer from here on in to

compute it because that's a much more efficient use of our time.

But let's just look at a small sample, five systolic blood pressures

measured on five persons, and the readings are 120 millimeters of mercury,

80 millimeters of mercury, 90, 110, and 95 millimeters of mercury.

And just to introduce some mathematical notation that will be helpful,

we can represent this sample of five points in mathematical notation.

We might call the first observation x1, the second observation

x2, and the fifth observation of 95 millimeters of mercury x5.

And then the sample mean

for these five data points is easily computed by

adding up the five values and then dividing by five.

And then in stat notation, the sample mean is

frequently represented by a letter with a line over it.

For example, the mean of these x's would be represented by x with a

line over it, frequently called x bar, and this is a notation universally used.

So we'll be relying on this from here on in. Sometimes

the letter will change, but there will always be a bar over the sample mean.

So, just to show you how we compute this,

and jog your memories if you've done this before.

The sample mean for these five points is just the sum

of the five points divided by the number of them, five.

And the sample mean for these five data points is 99 millimeters of mercury.
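
As a minimal sketch of that hand computation (Python is an assumption here; the lecture doesn't name any particular software, and the variable names are just for illustration), the same arithmetic looks like this:

```python
# Five systolic blood pressure readings (mmHg) from the lecture example.
readings = [120, 80, 90, 110, 95]

# Sample mean: add up the values, then divide by the sample size n.
n = len(readings)
x_bar = sum(readings) / n

print(x_bar)  # 99.0
```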

So here's a generic formula representation of

what we just did, and we're going to use little notation in

this class, but it's going to be helpful to use it in certain situations.

So I want to introduce the summation notation here.

And so what the generic formula looks like in math speak is this: x bar,

the sample mean value, is computed as a fraction. The numerator

here is just math shorthand for adding up all the values.

That Greek

letter sigma means sum in math notation, and

it's referenced, I'll go down here to explain it.

The sigma is referenced by i equals one to n,

and right next to it is x with the subscript i.

What this all means is add up the x values from the first to the nth.

So x1 plus x2, and so on up to xn; that's just a generic way of saying add up everything.

And then up here, of course, we take

that sum and divide by n, the number of values.

So again Sigma just means sum.
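
Written out in LaTeX, the formula just described, followed by the five blood pressure readings plugged in, would look something like this:

```latex
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}

\bar{x} = \frac{120 + 80 + 90 + 110 + 95}{5} = \frac{495}{5} = 99
```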

So what else, let's talk about the sample mean a little bit more.

As I noted before, it's sometimes called the sample average or the arithmetic mean.

In smaller samples, and we'll look at this in a minute, one data

point can make a great change in

the sample mean, it's sensitive to extreme values.

What we'll see and we'll harp on for the rest of the

course is that increased sample sizes lead to a more stable sample mean.

In other words, it's less influenced by any single point in the

sample, and that will have implications for what we're doing down the road.

And just to remind you about what we are always

doing in statistics, is we want to study some larger population process.

But we can't observe every value in the population process, and

we can only estimate it from an imperfect subset, the sample.

So, why do we call this the sample mean? To distinguish

it from the unknown, unknowable value of interest, mu, the population mean.

The mean of all values of the population.

Just a reminder, in Statistics when we use Greek letters, it generally means the

thing we want but we can't have, and can only estimate by a sample quantity.

So, the bigger the sample, the less influence any single sample

point has on the value of the sample mean, and this is

sort of intuitive.

If I had a sample with two observations, then I would be averaging two observations,

and the influence of each of those two observations is on the order of 50%.

They each control half the mean.

If I have ten observations then the individual

influence of any one point diminishes to a tenth.

If I have 100 observations, each individual

point has a hundredth of the influence on the mean.

And the implications

of this are, as the sample size gets larger, the mean becomes

a more stable entity across different random samples of the same size.

And we'll see further on in this course that not

only does the uncertainty in the mean as an estimate of

the underlying truth decrease, but we're better able to predict

how the means across the multiple samples will vary in value.
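
To put numbers on that one-over-n influence, here is a small sketch. The samples are hypothetical (every value set to 100, then one value raised by 100), chosen only so the arithmetic is easy to follow; they are not data from the lecture.

```python
def mean(values):
    """Sample mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

shifts = {}
for n in (2, 10, 100):
    baseline = [100.0] * n            # hypothetical sample: every value is 100
    altered = [200.0] + baseline[1:]  # same sample with one point raised by 100
    shifts[n] = mean(altered) - mean(baseline)

print(shifts)  # {2: 50.0, 10: 10.0, 100: 1.0}
```

The same 100-unit change in one observation moves the mean by 50 when n is 2, by 10 when n is 10, and by only 1 when n is 100.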

Another value of interest that helps us understand where the center

of the data is, is called the median, also called the 50th percentile.

What does that mean, the 50th percentile?

Well, median is a synonym for middle and it literally means the middle value.

Another way to think about the middle value of

a sample of data is to think

of it being the value that's greater than or equal to half

the data set and less than or equal to the

other half.

And that's the idea of the 50th percentile.

It's greater than or equal to 50% of the

values and less than or equal to the other 50%, okay?

So it's easily defined in a small sample,

especially when you have an odd number of observations.

So we have five observations here.

If I line them up in order from smallest to largest, the median is the

one that's just equal distance from the

largest and the smallest, it's the middle value.

So five points, the value in the middle

is the third, and that's 95 millimeters of mercury.

The median, the sample median unlike the sample mean,

especially in small samples, is not sensitive to extreme values.

The median is based on order of the values not the

actual values themselves, whereas the mean involves the values themselves.

So, for example, take this small sample of five observations.

If we went back to the original data source and found

that we had misrecorded the largest value, that it was not 120 millimeters

but was extremely high, 200,

and we went back and recomputed our

summary statistics, the median would still be 95.

But the sample mean would go from

the previous estimate of 99 millimeters of mercury

up to 115 millimeters of mercury, a

substantial increase because of that one sample value.

And again, in smaller samples, this can be an issue. In larger samples,

we'll see that the mean is more robust

and less variable with changes in extreme values.
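
A quick way to check that comparison is sketched below, using Python's standard `statistics` module (the choice of software is an assumption; the lecture doesn't specify any):

```python
import statistics

original = [120, 80, 90, 110, 95]   # readings as first recorded (mmHg)
corrected = [200, 80, 90, 110, 95]  # largest reading changed from 120 to 200

mean_original = statistics.mean(original)        # 99
median_original = statistics.median(original)    # 95
mean_corrected = statistics.mean(corrected)      # 115
median_corrected = statistics.median(corrected)  # 95, unchanged

print(mean_original, median_original)
print(mean_corrected, median_corrected)
```

Only the mean moves, because the median depends on the order of the values and the 200 is still just the largest one.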

Well, how would you compute a sample median if you had an even number of observations?

It was very nice with an odd number, we just found a middle value.

Well, if the sample size is even, then we'll pretend

we picked up a sixth data point here to illustrate this.

We picked up one more observation with a blood pressure of 125

millimeters of mercury, what we would do is find the two middle values

and average those two.

So we have six observations, the third and fourth

ordered values are sort of in the middle of those.

And we would take the average of those two, the 95 and

110, to get an estimated sample median of 102.5 millimeters of mercury.

So we can handle this even when we have an even number of observations.
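
The even-sample rule can be sketched the same way. The by-hand steps below follow the description above, and `statistics.median` from the Python standard library (again, an assumed choice of tool) is included as a cross-check:

```python
import statistics

six = [120, 80, 90, 110, 95, 125]  # the five readings plus the new 125

ordered = sorted(six)                        # [80, 90, 95, 110, 120, 125]
half = len(ordered) // 2
middle_pair = ordered[half - 1 : half + 1]   # third and fourth values: [95, 110]
median_by_hand = sum(middle_pair) / 2        # 102.5

print(median_by_hand, statistics.median(six))
```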

What

about measuring variability?

Certainly, measures of center are interesting, but it's

also sort of nice to quantify how much

variation there is in the individual values around

at least one of those measures of center.

And so something that's commonly used is called the sample standard deviation.

And it starts off with just to sort of ground us, we

have to talk about another quantity called the sample variance, which we're going to

represent with the letter S squared, and we'll talk about why in a minute.

If we take the square root of this

value, we'll get what's called the sample standard deviation.

So let's look at this variance issue.

What we do, and we're going to introduce some more

math notation now that we've cleared up the sigma.

The sample variance is roughly the average, I'm going to put

it in quotes here because there's a slight twist on

that, but the average of the square of the deviations around the sample mean.

So, in other words, what we do is measure how far each individual data

point is in our sample, how far it is above or below the mean.

So we take that value and subtract the mean from it.

If it's above the mean, that difference will be positive.

If it's below the mean, that difference will be negative.

If it's exactly equal to the mean, the difference will be zero.

And we're

going to add these up, but so that they don't cancel each

other out because the mean is in the middle of the data.

If we just took the actual deviations as

is and added them up, we'd always get zero.

So in order to actually get a

measurement of the variation, we square these differences.

So we take the difference between the first data

point and the mean, square it, add that to the

difference between the second data point and the mean

squared et cetera, and we do that for all n

data points. And so, the numerator is sort of the total

squared deviation of all points from the sample mean, and then we divide by n-1.

Pretend it's n for the moment, the number of sample points.

So ostensibly, this is the average square distance

of any point from the centre of the data.

So now, in order to actually get the average distance, we can take the square

root of that average square distance, and that's our sample standard deviation.

So you can think of this as measuring how far, on average, any

single data point is above or below the mean of all data points.
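
In LaTeX, the two quantities just described, the sample variance and its square root, the sample standard deviation, can be written as:

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
\qquad
s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
```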

So let's just do an example of computing this by hand.

You only have to do it once or twice in this

course and from there on in, we'll rely on the computer.

And it's more important to understand what it

means and how to compare it across different samples.

But recall those five systolic blood pressure measures in millimeters

of mercury with the sample mean of 99 millimeters of mercury.

So if we were to actually try and compute the

sample standard deviation, we'd start with the numerator portion as the

first step.

For the variance, we're going to sum up the squared difference of each of the

five points from that sample mean of 99; that's what this says here.

Do that for all five points.

So the first blood pressure measurement unordered by

size was 120, 120 minus 99 is 21.

We'd square that, and then we'd add it to

the next blood pressure measurement minus the sample mean.

80 minus 99

is negative 19.

That would also be squared, and then add it to 90 minus 99 squared et cetera.

And here you can see the actual computation, we do it all out.

What we get is a total cumulative squared distance of all

five points from the sample mean, is 1020 millimeters of mercury squared.

So that's

not a unit we're familiar with.

I've never gone to a practitioner and have him or her tell me

my blood pressure in squared units, but this was just for the computation part.

Now we're going to divide

this by the sample size less one to get

the average square distance, more or less, and then

we'll take the square root to get the average distance.

So, when we divide this by 5 minus

1, and I'm going to explain that in a minute, you

can sort of think of this as the average

square distance, which is 255 millimeters of mercury squared.

So then we take the square root of that to get our sample standard deviation.

And when all the dust settles, it turns out to be about 16 millimeters of mercury.

So on average, any single person in this sample has

a blood pressure about 16 millimeters of mercury away from the sample mean.
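
The whole by-hand calculation can be retraced step by step in a short sketch (Python and the variable names are assumptions for illustration, not something the lecture prescribes):

```python
import math

readings = [120, 80, 90, 110, 95]  # systolic blood pressures (mmHg)
n = len(readings)
x_bar = sum(readings) / n          # 99.0

# Deviation of each point from the sample mean; these always sum to zero.
deviations = [x - x_bar for x in readings]  # [21.0, -19.0, -9.0, 11.0, -4.0]

# Square the deviations so they don't cancel, then add them up.
sum_of_squares = sum(d ** 2 for d in deviations)  # 1020.0 (mmHg squared)

# Divide by n - 1 to get the sample variance, then take the square root.
variance = sum_of_squares / (n - 1)  # 255.0 (mmHg squared)
sd = math.sqrt(variance)             # roughly 15.97 mmHg

print(sum_of_squares, variance, round(sd, 2))
```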

So a couple notes on the sample standard deviation.

The bigger it is, and we're going to refer to it as lowercase

s here, the more variation there is in the data points in the sample.

So if we were comparing this sample from some population to another

sample from another population and we

were comparing the variability of blood pressures.

If the second population had a higher estimated sample

standard deviation, that would show that there is more variation in it.

s measures the spread about the sample mean, and there will always be

spread about the sample mean unless all of our values are the same.

So s can only be positive or equal to zero.

If you think about it we start by measuring square distances which have

to be positive, we sum them up, take the square root of that.

So that will be positive unless all of our

values are exactly equal to each other, in which case

the mean will equal those individual values and there'll be no variation.

You won't see that in real life data samples.

The units of s, and we saw this, are the same as the units of the data.

For example, our distance of blood pressures from their sample mean

is measured in millimeters of mercury, the unit of blood pressure.

It's often abbreviated, you'll see it in papers, sometimes as

SD or sd. But we will

be using s, and s squared is the best estimate from our sample

of the underlying population variance, again the thing we can't have.

So we'll represent that with a Greek letter, the Greek letter sigma: sigma squared.

And so the square root of s squared, s, is the best estimate of the population

standard deviation.

How far all observations in the population fall from the population mean.

We can only estimate that based on our sample.

So speaking of that, let's get back to this n minus 1 in the denominator.

I keep saying we're averaging, but technically we're not fully

averaging because we're not dividing by n but n minus 1.

Why do we do this?

Well, it turns out what we really want to know, what we really want to measure

is not the distance of each point from the sample mean

but the distance of each point from the true population mean, mu.

But of course we don't know mu, so we have to estimate it with the sample

mean, so we put that x bar in; back to the formula here.

The thing is, x bar is customized by our sample values.

It's only dependent on our sample values and not values in the population

that are not in our sample.

So you can show mathematically that sample values will tend to get slightly

closer to the mean based only on them than to the underlying population mean.

And so if we just averaged these, we would tend to underestimate

the variation that we want to get at, the total variation of the population.

So to counter that underestimate, we make

the denominator slightly smaller so that this fraction is slightly higher in value.

You can see that this is going to be immaterial in most samples.

The difference between dividing by 200 and 199, for

example, when the sample size is 200, isn't going to make much of a difference.

But I wouldn't worry about this, I just wanted to

give you a heads up as to where it's from.
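
To see how little the n minus 1 matters as n grows, here is a rough sketch. It uses `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n minus 1) from the Python standard library; the tool choice is an assumption, not something from the lecture.

```python
import statistics

readings = [120, 80, 90, 110, 95]

divide_by_n = statistics.pvariance(readings)         # 1020 / 5 = 204
divide_by_n_minus_1 = statistics.variance(readings)  # 1020 / 4 = 255

# The two versions differ by a factor of (n - 1) / n, which approaches 1 as n grows.
ratios = {n: (n - 1) / n for n in (5, 200)}

print(divide_by_n, divide_by_n_minus_1)
print(ratios)  # {5: 0.8, 200: 0.995}
```

With five observations the two answers differ by 20 percent; with 200 observations they differ by half a percent.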

So let's look at another example.

We're going to study more than blood pressure in this class,

but we'll just look at another data example with blood pressure.

So this is blood pressure data taken from a random clinical sample of 113 men.

And here are the first 50 measurements of the 113, and

you can see there's a lot going on here.

I can't fit all 113 on the slide, and if I asked you

to make sense of this, it would be hard to do by looking at these data.

The first guy,

for example, his measurement is 142, the next person 116, et cetera.

So let's take these 113 points

and summarize them down to some key quantities.

So if we do this, and I used a computer to do this, I wouldn't expect any of us to do

this by hand, the sample mean for these 113 men is 123.6 millimeters of mercury.

However, there is variation around

that mean.

And if we estimate the sample standard deviation,

it turns out to be 12.9 millimeters of mercury.

So on average, a man in this sample is about 13 millimeters of mercury above or below that mean of 123.6.

And the median for these 113 values is 123.0

millimeters of mercury, similar to but slightly less than the mean.

And again, these are estimates based on the 113 observations

of their respective underlying population quantities.

So certainly other percentiles besides the median can be very important

and insightful when looking at the

distribution of continuous outcomes as well.

And this just reinforces the idea that the computer is

a very helpful tool when using and doing data analysis.

So there are some other values that are going to help us

understand and quantify characteristics of continuous

data, and this includes sample percentiles.

We've already talked about the median being the

50th percentile, and let's just generalize that idea.

In general, the pth sample percentile is that value in a sample of data

such that p% of the sample values are less than or equal to that value.

And the remaining 100

minus p% are greater than this value.

And we'll look at some examples to put a face on this.

Percentiles can be computed by hand, but again are generally done by the

computer, and that's the approach

we're going to take from here on in.

These values will be given to you from the computer.
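
For a sense of what the computer is doing, here is one possible sketch of the definition above. It uses a "nearest rank" rule, which is only one of several conventions; the function name is hypothetical, and statistical packages often interpolate between values instead, so their answers can differ slightly from this one.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest sample value with at least p% of the values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based position
    return ordered[rank - 1]

readings = [120, 80, 90, 110, 95]  # the five blood pressures from earlier

p25 = percentile_nearest_rank(readings, 25)  # 90
p50 = percentile_nearest_rank(readings, 50)  # 95, the median found earlier
p75 = percentile_nearest_rank(readings, 75)  # 110

print(p25, p50, p75)
```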

So, for example, take this data set on 113 men, the systolic blood pressure measurements.

Based on the results from the computer,

for example, the 10th percentile for these 113

blood pressure measurements is 107 millimeters of mercury.

Meaning that approximately 10% of the men in this sample

have systolic blood pressures less than or equal to that value.

And the remaining 90%, 100 minus 10, 90% of the men have systolic blood pressures

greater than this 10th percentile. Similarly, the 75th percentile

for these 113 blood pressure measurements is 132 millimeters of

mercury, meaning that approximately 75% of the men in the sample have

systolic blood pressures less than or equal to 132 millimeters of mercury.

And 100 minus 75 or 25% of the

men had systolic blood pressure values greater than 132.

These are sample-based estimates of the population level quantities once again.

Here are some other percentiles of the data, ones that

will be commonly used, including the 50th, 25th and 75th.

And then the 2.5th and 97.5th sort of describe the extremes in the data.

Not quite the largest or smallest value, but they give us some sense of where

most of the data values lie. You

know, the 2.5th percentile is greater than only 2.5% of the values.

And then, conversely, the 97.5th

percentile is less than only 2.5% of the values. So, the range between these two

values encapsulates most of the data, and we'll get into that in the next section.

Here's another example, length of stay claims at Heritage Health

with an inpatient stay of at least one day in 2011.

So this is from the Heritage Health data.

There are 12,928 claims, and included in these are only length

of stay values for people who had at least one day.

And obviously, I'm not going to show

you over 12,000 values on a single screen.

Here are the first 50 measurements, just to give you some

sense of these data.

For the first person in the data set, what this measures is the total length

of stay across all the hospital visits the person had in the year 2011.

So this person accumulated 21 inpatient days.

This next person had one day length of stay, et cetera.

So let's look at, this is interesting.

Let's look at this, and think about how these results differ slightly

from what we saw with the blood pressure data.

The estimated mean length of stay is 4.3 days;

on average, a person had an inpatient stay of 4.3 days.

There's a fair amount of variation relative to that:

the standard deviation is 4.9 days.

So on average, a person had values 4.9 days above or below the mean.

Of course

a value couldn't be 4.9 days below the mean, but we'll get to that.

And then finally, the sample median,

which estimates the population median, is two days. Look at that!

The median here is substantially smaller than the mean, and we'll

sort of look at the implications of that in the next section.
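
The same pattern, a mean pulled well above the median, shows up in even a tiny made-up example. The ten values below are invented purely for illustration and are not taken from the Heritage Health data:

```python
import statistics

# Hypothetical lengths of stay in days (invented for illustration only):
# mostly short stays plus a couple of long ones.
stays = [1, 1, 1, 1, 2, 2, 3, 5, 8, 21]

mean_stay = statistics.mean(stays)      # 45 / 10 = 4.5 days
median_stay = statistics.median(stays)  # average of the 5th and 6th ordered values: 2 days

print(mean_stay, median_stay)
```

A few long stays are enough to drag the mean upward, while the median stays with the bulk of the short stays.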

Here are some other key summary statistics, the percentiles.

So here is that median of two days.

Look at this, both the 2.5th percentile

and the 25th percentile are the same value.

One day, how could that be?

Well, it actually means that over 25% of persons in

this data set had the same value of one day.

That's the minimum required to get into the data

set, a lot of people only stayed for one day.

The 50th percentile is two days, and then the 75th percentile is five days.

And on the upper end, we have a 97.5th percentile of 20 days.

Start thinking about, look, the 50th percentile is

closer to the 25th in value than it

is to the 75th even though there's a

similar proportion of data in both those groups.

Just think about that; we'll get

into that in the next section.

So, just to summarize what we've talked about.

Summary measures that can be computed on a single sample of continuous data include

the mean, standard deviation, median (the 50th percentile), and other percentiles.

These sample based estimates are the

best estimates of unknown, underlying population quantities.

For example, x bar, the sample mean is the best estimate

of the unknown population mean mu.

s is the best estimate of the population standard deviation sigma.

Soon, we will discuss how to address the uncertainty in

the sample-based estimates as they relate to the unknown truths.