A practical and example filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.


A course from Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods



From this lesson

Introduction and Module 1A: Simple Regression Methods

In this module, a unified structure for simple regression models will be presented, followed by detailed treatises and examples of both simple linear and logistic models.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So the term correlation gets used a lot not just in statistics, but

in everyday life.

Well, in this section we're actually going to show how to use the results from

linear regression to measure correlation.

It's something we get in the output from a computer package that does

linear regression models.

And we'll show two different ways of measuring this that are related to

each other, but have some slight differences in terms of the,

how the numbers come out and their interpretations.

Now, let's talk about using the results from simple linear regression to

get information about the strength of the linear association between our

outcome and predictor.

So the slope of a regression line estimates the magnitude and

direction of the relationship between y and x1.

Especially when x1 is continuous, the slope encapsulates how much y differs on average with differences in x1.

The slope estimate and the standard error can be used to address the uncertainty in

this estimate, with regards to the true magnitude and direction of

the association in the population from which the sample was taken.

Slopes do not impart any information,

however about how well the regression line fits to the data in the sample.

The slope gives no indication of how close the points get to

the estimated regression line.

And one of the things about the slope is it can be made arbitrarily larger or

smaller just by changing the units of either x1,

when x1 is continuous or y, or both.

So let me give you an example where x1 is continuous.

This is arm circumference and height.

Arm circumference we had measured in centimeters on these 150 Nepali children.

And height was measured in centimeters.

And hence, the regression equation we got gave us an estimate of the relationship

between arm circumference and height when both were measured in centimeters.

You may recall that under this scenario the slope for height was 0.16 centimeters.

We would expect an average difference in

arm circumference of 0.16 centimeters per 1 centimeter difference in height between

two groups of Nepali children, in this age group.

But suppose we had actually gone back and converted our height measure to inches, so each child's height is in inches instead of centimeters, and re-ran the regression model.

We have done nothing to change the relationship between arm circumference and height in this sample.

We're simply expressing height in different units.

If we do this, the resulting slope for height is now 0.41, because a one unit difference in x is now a larger difference than when we were recording height in centimeters. This is an estimated 0.41 centimeter difference in arm circumference on average per 1 inch difference in height.

So we can arbitrarily inflate the absolute value of the slope or reduce it just by

playing around with the units of our outcome, or predictor, or both.
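To make this concrete, here's a minimal Python sketch using hypothetical, simulated heights and arm circumferences (not the actual Nepali data). It shows that rescaling the predictor rescales the slope by exactly the conversion factor, consistent with 0.16 × 2.54 ≈ 0.41 from the lecture:

```python
import random

# Least-squares slope: b1 = sum((x - mean x)(y - mean y)) / sum((x - mean x)^2)
def slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
# Hypothetical data: 150 children, heights and arm circumferences in cm
height_cm = [random.uniform(50, 100) for _ in range(150)]
arm_cm = [10 + 0.16 * h + random.gauss(0, 1) for h in height_cm]

b_cm = slope(height_cm, arm_cm)          # per 1 cm difference in height
height_in = [h / 2.54 for h in height_cm]
b_in = slope(height_in, arm_cm)          # per 1 inch difference in height

# Shrinking x by a factor of 2.54 inflates the slope by exactly 2.54:
# the relationship hasn't changed, only its units
assert abs(b_in - 2.54 * b_cm) < 1e-8
```

Nothing about the children changed between the two fits; only the units of x did.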

There's another quantity that comes out of simple linear regression that can

be estimated.

It has a fancy [LAUGH] kind of silly name: the coefficient of determination.

More frequently referred to as R squared.

This is a number that ranges from 0 to 1 with larger values, that is values closer

to 1, indicating closer fits if you will, of the data points to the regression line.

What R squared essentially does is, it's a relative comparison.

It measures the strength of the association between our outcome y and

our predictor x1, by comparing the overall variability of observed y-values

around their regression based estimates of the mean, for groups with

the same x-values to variability in the y-values ignoring the information in x.

So let's diagram this in a little more detail.

So how close, for example, do the arm circumference points get to the arm circumference mean estimates based on height, using that linear regression equation? Well, to understand this, let's just go back to statistical reasoning one for a minute.

And say, let's pretend, you know, we didn't know how to do regression.

And so we decided to ignore our information about height, because we

didn't know how to relate our continuous outcome to a continuous predictor.

And we forgot that we could dichotomize height or

put it into quartiles and do a t-test or an ANOVA.

Well, recall the way we'd originally quantify the variability in arm circumference if we didn't take into account any other information about our children: essentially we can think of it as the average difference, or distance, between any sample value and the overall sample mean.

So this measures how far on average each of the observed y-values, that is, each individual child's arm circumference, falls from the overall mean arm circumference of all 150 children. And in this example, s, the sample standard deviation, is 1.48 centimeters.

Some children fall closer to the mean, some fall farther.

But on average, the distance above or below the mean is 1.48 centimeters.

So let's try to visualize this on the scatterplot, which includes height, although we're ignoring that information, right? What this looks like in regression terms is just plotting a horizontal line at the overall mean of all 150 arm circumferences.

And what we do then is measure the distance of each observed child's arm circumference from the overall mean of all 150 arm circumferences.

We'd square those distances, add them, average them and

take square root to get the average distance of any individual child's arm

circumference from the overall mean of all children in the sample.

There's another quantity, the standard deviation of the regression, sometimes called the root mean square error.

That sounds fancy, but let's think about it for a minute.

Error is another word for deviation. So squared error is just squared deviation, mean would be the average squared deviation, and root is the square root of that. So this is basically the square root of the average squared distance.

Just like we saw with the previous quantity, but what this does is measure the distance of each individual y-value not from the same overall mean of all y-values, but from its x-specific mean.

So for example, in this arm circumference and height example we would

compare each child's arm circumference value to the estimated mean

by regression for all children with the same height or x1 value.

So we're allowing the mean estimate in regression to change depending on

another factor, our x variable.

And we're measuring the discrepancy between our observed y-values and

the x-specific means estimated by the regression line.

And in this example, if we calculate that (don't worry about this detail: we divide by n minus 2), you can think of it as the average distance of each observed point from its regression-estimated mean of y. It's 1.09 centimeters; that's less than when we ignored height.
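The two variability measures can be sketched in plain Python on simulated data. The 1.48 and 1.09 values in the lecture come from the actual Nepali sample, which isn't reproduced here; the point is only that the residual spread around the line is smaller than the spread around one overall mean:

```python
import math
import random

random.seed(1)
n = 150
# Hypothetical data standing in for the 150 children
height = [random.uniform(50, 100) for _ in range(n)]
arm = [10 + 0.16 * h + random.gauss(0, 1) for h in height]

mean_arm = sum(arm) / n
# Overall variability ignoring height: the sample standard deviation of y
s_y = math.sqrt(sum((y - mean_arm) ** 2 for y in arm) / (n - 1))

# Fit the least-squares line y-hat = b0 + b1 * x
mean_h = sum(height) / n
b1 = (sum((x - mean_h) * (y - mean_arm) for x, y in zip(height, arm))
      / sum((x - mean_h) ** 2 for x in height))
b0 = mean_arm - b1 * mean_h

# Variability around the x-specific means: root mean squared error,
# dividing the summed squared residuals by n - 2
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(height, arm))
rmse = math.sqrt(sse / (n - 2))

# Points sit closer to their regression-estimated means than to one overall mean
assert rmse < s_y
```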

So the idea is our individual points get closer to their height-specific arm circumference means, their y-hats based on their corresponding value of x1, than they do if we ignore the information in x1 and measure the distance of each y from the same overall mean.

Right?

So what we do here is instead of computing the distance from that flat overall line I

drew before, we compute the distance of each point from its

corresponding mean estimate based on height from the regression line.

We'd square these distances and average them to get the overall squared difference between the 150 sample points and their regression-based estimated mean arm circumferences, and we take the square root of that to get the average distance.

The idea is, if we don't actually reduce the variation in the y-values around their regression-estimated means once we've taken x1 into account, if essentially our variability in the y-values after accounting for x1 is the same as when we ignored x1, then knowing x does not yield a better estimate for the mean of y than using the overall mean, y bar. We don't do any better than if we estimate the same mean for everyone. In other words, there's no additional information about our outcome y in x1.

In the arm circumference height example, this would mean that we would not

reduce the variability in individual arm circumferences around our height specific

means, relative to the variability around one overall mean arm circumference for

all children in the sample.

However, consider s of regression, sometimes written as s with a little subscript y given x1, which just means the variation of y given a predictor x1 from regression. The smaller this variability of individual values around the regression-estimated means is, relative to the overall variability s on average when we ignore the predictor, the closer the points are to the regression line.

So what R squared functionally measures is how much smaller the variability of

individual values around their regression based estimates is

than the overall average variability, ignoring our predictor x1.

And as such it can be interpreted as the amount of

variability in y explained by taking x1 into account.
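As a sketch of that comparison, R squared can be computed as the proportional reduction in squared error when we use the regression-based means instead of one overall mean. The data below are simulated stand-ins, not the course data:

```python
import random

random.seed(2)
n = 150
x = [random.uniform(50, 100) for _ in range(n)]          # hypothetical predictor
y = [10 + 0.16 * xi + random.gauss(0, 1) for xi in x]    # hypothetical outcome

mx, my = sum(x) / n, sum(y) / n
b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
      / sum((a - mx) ** 2 for a in x))
b0 = my - b1 * mx

sst = sum((b - my) ** 2 for b in y)                        # variation ignoring x
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))  # variation around the line

# R squared: the share of the original variation "explained" by taking x into account
r_squared = 1 - sse / sst

assert 0.0 <= r_squared <= 1.0
```

If x carried no information, sse would be about as large as sst and r_squared would sit near 0; if the points fell exactly on the line, sse would be 0 and r_squared would be 1.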

So this you can get from a computer.

I'll throw up a bonus and lectures that are optional to show you how

explicitly this is calculated, and it may give you a little more intuition to it.

But, this generally comes from a computer, and so I got this from the computer

the R squared from this regression of arm circumference on height is 0.46, 46%.

So, child's height explains an estimated 46% of

the overall original variation in arm circumferences.

46% of that variation in the individual arm

circumference values around the same overall mean estimate.

We can also express this with another quantity called R,

not surprisingly, it's something to do with the square root of R squared.

Ironically though, R squared is always written with a capital R and r,

this quantity, is always written with a lower case r, but they are related.

What r is, is essentially the square root of R squared, but with a catch: it's the properly signed square root of R squared.

Technically speaking, the square root of R

squared has two potential values, r or negative r.

And the sign comes into play,

because it will tell us about the direction of the relationship.

R squared is always a positive number.

There's no information in this number about whether y tends to

increase with increasing x, or y tends to decrease.

But r, the square root of R squared with that sign appended to it, gives us information about the direction.

r is called the correlation coefficient, not to be confused with

regression coefficients, which are our slope and our intercepts.

So these are great names, huh?

They're not [LAUGH] very distinctive, but

correlation coefficient refers to this value.

r is just a transformation of R squared. In the grand scheme of things, R squared is between 0 and 1. If there's no added information in x1, then R squared will be close to 0; the points don't get any closer to the regression line that uses x1 than they do to a flat line giving the overall mean y.

If the points all line up perfectly on a line, R squared would be 1.

But that's not going to happen in real life where there's biological,

sociological and other variability in the measures we're looking at.

r could be negative or positive.

If the relationship between y and x1 is positive, then the corresponding value of r will be the positive square root of R squared, and it will also be between 0 and 1.

If the corresponding relationship between y and x1 is negative however, r will

be the negative square root of R squared and it will range from negative 1 to 0.

Negative 1 would mean perfect negative correlation, with all of the individual values lined up on a line with decreasing slope.

r would be higher in absolute value than R squared.

So in this example, r would be the positive square root of the R squared

value we got of 0.46 because we observed the positive slope, the relationship

between arm circumference and height is positive and r is equal to 0.68.
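A small sketch, with made-up paired values, of how r relates to R squared and to the sign of the slope:

```python
import math

# Hypothetical paired values with a positive linear trend
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient
b1 = sxy / sxx                   # regression slope
r_squared = r * r

# r is the square root of R squared, carrying the sign of the slope
assert abs(abs(r) - math.sqrt(r_squared)) < 1e-12
assert (r > 0) == (b1 > 0)
```

Given only r_squared you couldn't recover the sign; you'd need the slope (or a scatterplot) to know whether r is its positive or negative root.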

So, from this example, child's height explains an estimated 46% of

the variation in the arm circumferences.

And this, of course, is just an estimate based on the sample.

If we wanted to actually account for

the uncertainty in our estimated amount of variation explained at

the population level, we'd have to put a 95% confidence interval on this.

Unfortunately, that's not an easy thing to do, and the procedures we have for it are not so good.

So, a lot of times when this measure is reported in the literature,

unlike other quantities, there will be no information about the uncertainties.

So just know that when you see it.

It's not the truth, it's just an estimate.

[SOUND] So one way to think about this is well,

there's still 54% of the original variability in

arm circumferences not being explained by taking into account child's height.

So some of this remaining, leftover variability may be explained by other factors above and beyond height.

So we'll soon talk about expanding the regression to

include more than one predictor.

And this will allow us to see if we can even better explain the variability in

the outcome y by taking into account more than one predictor in the single model.

Let's give another example. If you go back and look at that hemoglobin and packed cell volume example we did, if you compute the R squared, or the computer did, it's equal to 0.51.

So packed cell volume explains an estimated 51% of

the original variation in individual hemoglobin levels in the sample.

It was a positive association, so if we compute the correlation coefficient, it's just the positive square root of 0.51, which is 0.71.

The correlation coefficient doesn't have as easy a physical interpretation, but higher absolute values mean stronger correlation and more variability being explained by the predictor.

Here's another example we looked at, wages and years of education.

And the R squared here is substantially lower than the previous two examples.

It's on the order of 0.15 or 15%.

The corresponding correlation coefficient is the positive square root of that 0.15, or about 0.39.

Even when we're comparing means between two groups, and we get a mean difference estimate, a confidence interval, and a p-value just like we did with the t-test, with the regression approach the computer will also give us this R squared, even when we have a binary predictor.

And so, if we looked at the R squared value for

arm circumference as a function of sex, it comes in at 0.042 or 4.2%.

So this means that sex explains about 4.2% of the original variation in individual arm circumferences.

You may recall when we looked at the box plots of the distributions of

the arm circumference values for males relative to females,

there was a lot of shared values in the two box plots.

But the percentiles, median 25th and 75th, for

females were shifted slightly lower than males.

And we estimated a small mean difference between the two.

So that means that, you know,

the amount of variability explained by sex is minimal, and this R squared is 4.2%.

The corresponding correlation coefficient, when x is coded as 1 for females and 0 for males, is the negative square root of that R squared, because there's a negative association between arm circumference and female sex.

If we had coded sex in the opposite direction, the R squared value would be exactly the same, 0.042, but the r value we'd get would be positive, 0.20, because we'd be associating arm circumference with being male compared to female, and males have slightly higher arm circumferences.
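A quick sketch with hypothetical arm circumference values, showing that flipping the 0/1 coding of a binary predictor flips the sign of r but leaves R squared untouched:

```python
import math

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical arm circumferences; females slightly lower on average
arm    = [14.2, 13.8, 15.1, 14.0, 13.5, 14.9, 15.3, 13.9]
female = [1, 1, 0, 1, 1, 0, 0, 1]          # 1 = female, 0 = male
male   = [1 - f for f in female]           # the reverse coding

r_f = corr(female, arm)   # negative: higher "female" goes with lower arm circumference
r_m = corr(male, arm)     # positive: same association, opposite coding

# Flipping the coding flips the sign of r but leaves R squared unchanged
assert abs(r_f + r_m) < 1e-12
assert abs(r_f ** 2 - r_m ** 2) < 1e-12
```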

So what? You might think [LAUGH] well, you know,

some of these R squared values are large, some aren't.

You know, what is a good R squared?

There's a couple of important things to keep in mind about these two quantities.

First of all, these are estimates based on the sample of data and

they're frequently reported without some recognition of sampling variability.

So, you just have to think that there's uncertainty in

the estimates that are reported, and it will not be accompanied by

something like a confidence interval as it would be with a slope, for example.

But here's the thing, low R squared and hence r, are not necessarily bad.

Many outcomes, especially in medicine and public health, sociology and such,

can not or will not be fully or close to fully be explained in terms of variability

by any one single predictor, or by any set of multiple predictors for that matter.

And this is really important for me to highlight, actually: a lower R squared, and hence absolute value of r, is not a non-finding. Sometimes when people see these values, they think a lower association level is not indicative of an important association. But many phenomena in public health and medicine cannot be easily or completely explained by any one factor, or multiple factors for that matter.

However, the trends we'd see are important enough to

influence medical practice decisions and policy decisions, et cetera.

The higher the R squared value, the better x1 predicts y for individuals in the sample, as well as the population, as individual y-values vary less about their estimated mean based on x1.

So this gives us more information about who will benefit, if you will, or

suffer from the increased exposure of x1, when the R squared is high.

But many times there may be important overall associations between the mean of y and x1, trends if you will.

Even though, there's still a lot of individual variability in

the y-values about their means estimated by the predictor.

So, for example, in the wages example, years of education explain an estimated 15% of the variability in hourly wages. The association was statistically significant, showing that average wages were greater for persons with more years of education.

However, for any single education level, measured by years of education,

there's still a fair amount of variation in wages for individual workers.

So I can advise that more education is associated with an average increase in wages, but it's hard to guarantee that any single person will experience an increase in wages if they get more education.

What about slope versus R square?

Let's come back to that now.

Slope estimates the magnitude and direction of the association between y and x1. It estimates a mean difference in y for two groups who differ by one unit in x.

The slope will change if the units change for y, or for our predictor x, or both. And because of this, the size of the slope, its absolute value, is totally affected by our choice of units.

Larger slopes, therefore, are not indicative of stronger linear association relative to smaller slopes, and smaller slopes are not indicative of weaker linear association relative to larger slopes.

The size of the slope is a function of the units we use.

R squared and hence r measures the strength of the linear association, and

r also includes information about the direction.

Neither of these two measures the magnitude,

how much the average y differs by one unit of x.

And neither R squared nor r changes with changes in units; each is invariant to the choice of units.

Because when we change the units,

we're not changing anything about the strength of the relationship,

just how we quantify the association.
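That unit invariance is easy to check. Here's a sketch with hypothetical measurements converted between units on both sides:

```python
import math

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (cm) and arm circumferences (cm)
height_cm = [52.0, 60.5, 71.0, 78.5, 84.0, 90.5, 97.0]
arm_cm = [12.1, 12.9, 14.0, 14.5, 15.3, 15.8, 16.9]

height_in = [h / 2.54 for h in height_cm]   # same heights, new units
arm_mm = [a * 10 for a in arm_cm]           # same outcome, new units

# Changing the units of x, y, or both leaves r (and hence R squared) unchanged,
# unlike the slope, which rescales with the units
assert abs(corr(height_cm, arm_cm) - corr(height_in, arm_mm)) < 1e-12
```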

If you have r, you can compute R squared simply by taking r and squaring it.

If you have R squared, you can almost compute r.

But if all you have is R squared, you won't know what sign to assign r.

So you need to see a scatterplot, you need to see the regression slope estimate or

something that tells you about the direction of the association.

r is useful, however, even though it doesn't have as easy a physical interpretation as R squared.

Like we said, if you have r, you can get R squared to get a sense of how much variability in one thing is explained by the other. But it's a really nice summary measure when we're comparing sets of variables.

For example, in a paper, we may just want to show how strongly associated pairs of variables are, and include information about the direction of the association.

So if I was writing a paper on anthropometric associations in Nepali

children, I might first present a table like this, where I have age, weight,

height and arm circumference in both the rows and columns of this table.

And wherever two of them intersect, it gives the correlation between the two,

the r value.

So, this not only tells me about the relative strength of

the linear association, it tells me the sign of it.

So for example, not surprisingly, weight and age are positively

correlated to a relatively high degree, height and age even more so.

Sex, which is in this case, coded as a 1 for females and 0 for

males, is negatively associated with weight and height, but

to a smaller degree than the other correlations.

And this just tells me that females have lower average weight and height values,

but there's a fair amount of variation in these values between the two sexes.
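A pairwise correlation table like the one described can be sketched as follows; the measurements and the exact variable set here are hypothetical, not the table from the paper:

```python
import math

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

# Hypothetical anthropometric measures for a handful of children
data = {
    "age":    [12, 24, 36, 48, 60, 30, 18, 54],      # months
    "weight": [8.0, 10.5, 12.9, 14.8, 17.0, 11.8, 9.4, 15.9],  # kg
    "height": [70, 80, 90, 98, 107, 86, 76, 103],    # cm
}

names = list(data)
# Each row/column intersection holds the r value for that pair of variables
table = {a: {b: corr(data[a], data[b]) for b in names} for a in names}

for a in names:
    print(a, ["%.2f" % table[a][b] for b in names])
```

The diagonal is 1 (each variable is perfectly correlated with itself), and the table is symmetric, so such tables are often printed as just the lower triangle.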

So in summary, R squared measures the strength of association between a continuous outcome and a predictor in a linear regression format, by comparing the variability of the points around the regression line (and remember, even with a binary predictor, there is a regression line, there are just two points on it) to the variability in the y-values ignoring that predictor.

The correlation coefficient r is the properly signed square root of R squared,

and hence provides information about the direction of the association as well.