0:48

And then what I'm going to do is simulate data,

where the outcome y depends only on x1 plus an error term,

and doesn't depend on the other two variables.

And then I'm going to look at the standard error for the x1 variable for

the model that just includes x1, that's this one.

The model that includes x1 and

x2, that's this one, and the model that includes x1 and x3.

And I'm going to do that over and over again.

And I'm going to look at the standard deviation of the simulated estimated

coefficients.

Now the reason I'm doing this by simulation, is because the variance

inflation occurs on the actual standard errors, not the estimated standard errors.

It's actually a little complicated, in that if I put another variable into

a regression model, it doesn't necessarily inflate the observed standard error.

And the reason is that the reported standard error uses an estimated variance rather than the actual variance.

And that has a kind of conflicting property.

So you don't necessarily see the variance

go up when you include an unnecessary regressor in a regression model.

You have to look at it through simulation.

Let's go ahead and do this.
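As a rough sketch of that simulation (in Python rather than the R shown in the lecture; the sample size, number of simulations, and variable names are my own choices):

```python
import numpy as np

# y depends only on x1; x2 and x3 are independent, unnecessary regressors.
rng = np.random.default_rng(0)
n, nsim = 100, 1000

def beta1_hat(y, X):
    # Least-squares fit with an intercept; return the coefficient on x1.
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

betas = np.zeros((nsim, 3))
for i in range(nsim):
    x1, x2, x3 = rng.normal(size=(3, n))
    y = x1 + rng.normal(size=n)
    betas[i] = [beta1_hat(y, x1),                             # y ~ x1
                beta1_hat(y, np.column_stack([x1, x2])),      # y ~ x1 + x2
                beta1_hat(y, np.column_stack([x1, x2, x3]))]  # y ~ x1 + x2 + x3

# SDs of the simulated x1 coefficients; with independent regressors
# all three should come out roughly equal (about 1/sqrt(n) here).
print(np.round(betas.std(axis=0), 3))
```

Because the regressors are simulated independently, the three standard deviations should agree up to Monte Carlo error, which is the point of this first setting.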

2:13

This is sort of the ideal setting,

where the three regressors don't have anything to do with one another.

Now what you see is, in fact, the following.

So for the first model, this is the standard deviation

of the beta coefficients I got across all the simulations.

Here's the second one.

It actually went down a little bit with the second one.

Now we know theoretically that can't happen.

But there is some Monte Carlo error when I do my simulation.

So let's just say, these two are about the same.

And, the third one is also about the same, 0.032 as opposed to 0.031.

So nothing that bad.

The variance inflation from including the extra variables

was negligible.

And the reason that is the case, is because these other two variables, x2 and

x3, have nothing to do with x1.

I simulated them independently.

Let's look at a case now,

where I've simulated them in such a way that they depend heavily on x1.

So I'm going to generate my x1 as I did the last time,

just as a random normal vector.

But my x2 depends on x1, plus also some noise.

3:20

And my x3 is very heavily dependent on x1, okay?

So now, let's repeat the simulation,

and see what happens to the variability in the estimated coefficients.
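A sketch of this second scenario, again in Python; the particular dependence weights below are my own choices, not the lecture's:

```python
import numpy as np

# Now x2 and x3 are built from x1: x2 moderately correlated with x1,
# x3 very strongly correlated with x1.
rng = np.random.default_rng(0)
n, nsim = 100, 1000

def beta1_hat(y, X):
    # Least-squares fit with an intercept; return the coefficient on x1.
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

betas = np.zeros((nsim, 3))
for i in range(nsim):
    x1 = rng.normal(size=n)
    x2 = x1 / np.sqrt(2) + rng.normal(size=n) / np.sqrt(2)      # corr ~0.71
    x3 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)  # corr ~0.95
    y = x1 + rng.normal(size=n)
    betas[i] = [beta1_hat(y, x1),
                beta1_hat(y, np.column_stack([x1, x2])),
                beta1_hat(y, np.column_stack([x1, x2, x3]))]

# The SD of the x1 coefficient now grows sharply across the three models.
print(np.round(betas.std(axis=0), 3))
```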

Okay, now we see huge amounts of variance inflation, especially for

this third model where we include x2 and x3, right.

We see the standard error has gone up by a factor of five, okay.

And that speaks to the general rule: if the variable that you

include is highly correlated with the thing that you're interested in, right?

You are going to inflate the variance more.

And so, you need to concern yourself with putting

highly correlated variables into your regression model, unnecessarily, okay?

So if I have diastolic blood pressure in my model, and

I put systolic blood pressure in my model, which is measuring basically

the same thing, they're very correlated.

If I did that unnecessarily, that's going to increase the standard error

for my diastolic blood pressure coefficient.

7:26

This looks at the case that you're in, versus the ideal case

where what you're interested in is uncorrelated with all the other variables.

That's the so-called variance inflation factor.

And here I give a little simulation, which I'm not going to show,

where I just demonstrate that these ratios can be estimated pretty well.

So with this Swiss data, let's actually look at the variance inflation factors.

I'm going to fit the model with all the variables included, so

when you do a ~., it includes all the variables.

And then I'm going to do vif(fit), that gives me my variance inflation factors.

Oh, but you have to remember to load this library, car, okay?

Let's load that, and then now let's try it again.

Rather than the variance inflation factors, I would look at

the inflation factors for the standard deviations, which are the square roots.

So let's look down here.

So on the first line I have the variance inflation factor by itself and

then in the line below it, I have the square root of it which I tend to prefer.
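For reference, what car::vif is computing under the hood is, for each regressor, 1 / (1 - R²) from the regression of that regressor on all the others. Here is a Python sketch on simulated correlated data (the Swiss dataset itself isn't loaded here, so the numbers differ from the lecture's):

```python
import numpy as np

# Simulated regressors with built-in correlation (my own weights).
rng = np.random.default_rng(1)
n = 10_000
x1 = rng.normal(size=n)
x2 = x1 / np.sqrt(2) + rng.normal(size=n) / np.sqrt(2)
x3 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    # Regress column j on the remaining columns (plus an intercept)
    # and return 1 / (1 - R^2).
    y = X[:, j]
    others = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    resid = y - others @ np.linalg.lstsq(others, y, rcond=None)[0]
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(np.round(vifs, 2))            # variance inflation factors
print(np.round(np.sqrt(vifs), 2))   # sd inflation factors (the square roots)
```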

So let's look at the first line just because it's the more standard thing,

the variance inflation, so for agriculture it's about 2.

What does that mean?

It means the variance for the agriculture effect is double

what it would be if agriculture were orthogonal to all the other regressors, okay?

And we see that some of them are quite high, examination is quite high,

3.67, so the variance is almost 4 times

what it would be if it were orthogonal to all the other regressors.

And we know that education, I'm sorry, examination and

education are highly related, so that's why these two have very high variance inflation factors.

11:53

So let's talk about model selection in general, automated model selection.

It really, I think, at one point was a statistical topic but

it's really moved in to the realm of machine learning for the most part.

And I would say, though, even for relatively simple

linear regression models, the space of model terms that you have to search among

explodes really quickly when you start including interactions and

polynomial terms like the square of a regressor and so on.

If you have a lot of regressors and you're interested in reducing this space,

then there are a lot of factor-analytic techniques and things like principal

components that are available to you,

to reduce your covariate space down to size.

Now however, those come with consequences.

Your principal components or factors that you obtain might be less interpretable

than the original data that you're interested in.

Again, this is probably better served in a multivariate analysis class,

or a class on machine learning.

But for us we're going to mostly consider the case where we have a relatively

small number of regressors, and we're going to pick through them with a highly

interactive process between the analyst, the data, and the scientific context.

13:23

Another thing I would mention is that good design can often eliminate the need for

a lot of this model discussion.

We've talked a lot about how randomization can really prevent a lot of

the problems that we're talking about with making our

variable of interest unrelated to nuisance variables that we're not interested in, or

even nuisance variables that we don't know about; however,

there's other aspects of design that can serve the same purpose.

For example, if we stratify and randomize within strata.

The classic example of this, from when these methods were developed, is R.A.

Fisher working on field crop experiments.

Let's say you were trying a different kind of seed,

you might block on different areas of the field that you were going to plant in and

randomize the different seeds to those areas.

So you might have two different kinds of seeds, but they will have been distributed

in a systematic way that is fair across the field, but

also then, within that design, there will also be some randomization.

This topic of experimental design is a pretty broad topic.

Another great example is in biostatistics, the field that I work in the most,

a very common kind of design is a cross-over design, and

in that case, you try to use every subject as their own control.

So let's say, for example,

you're interested at looking at two different kinds of aspirin, and

you might give one kind of aspirin to one group of people,

and the other aspirin to a different group of people.

Let's say they have different gels or whatever that determine

how they get absorbed in your stomach.

So if those two groups aren't the same, either the randomization wasn't

very good and there was some certain imbalance that you just got unlucky about,

or if the study was just observational, then the comparison of those two groups

might be biased by whatever differentiates the groups rather than group one receiving

one kind of aspirin and group two receiving a different kind of aspirin.

15:34

On the other hand, if you can give a person one kind of aspirin and

then later on give them a different kind of aspirin when they have another

headache, that would compare each person to themselves, right?

You control, or block, on the person, so to speak.

So that's a design strategy.

Now there's some nuance with this design strategy as well.

What happens if there's some residual effect of the first aspirin when you

give the second one?

So maybe you could handle that with some sort of long

washout period or something like that.

But anyway, the point of that design is that you're comparing people

with themselves, to control for everything that's intrinsic to the person.

These comparisons across time periods control for that by giving both aspirins to each person.

Maybe you would randomize the order in which they received them; that's

called a crossover design.

At any rate, the broader point that I'm trying to make is, it's often the case

that good thoughtful experimental design can really eliminate the need for

some of the main considerations that you would have to go through in model

building if you were to just collect data in an observational fashion.

The last thing I would say is there's one automated model search

technique that I like quite a bit and I find it very useful, and

it's the idea of looking at nested models.

So, I'm often interested in a particular variable and I'm very

interested in how the other variables that I've collected will impact it.

So, I'm interested in a treatment or something like that.

Some important variable, but I'm worried that my treatment groups are

imbalanced with respect to potentially some of these other variables.

So what I'd like to look at is the model that just includes the treatment by itself

and the model that includes the treatment and, let's say, age,

if the ages weren't really balanced between the two treatment groups;

and then one that adds gender, if maybe the genders between the two

groups weren't really balanced; and then so on.

And this idea of creating models that are nested,

where every successive model contains all the terms of the previous model,

leads to a very easy way of testing each successive model.

And these nested model examples are very easy to do, so I'm just

going to show you some code right here on how you do nested model testing in R.

So I fit three linear models to our Swiss dataset,

the first one just includes agriculture.

Let's pretend that, that's the variable that we're interested in and

then the next one includes agriculture plus examination and education.

I put both of those in,

because I'm thinking they're kind of measuring the same thing.

But now after this lecture, I'm concerned over the possibility that they're

measuring too much of the same thing, but let's put that aside for the time being.

And then the third model includes Agriculture + Examination + Education +

Catholic + Infant.Mortality.

So, all the terms.

So now, I have three nested models and I'm interested in seeing what happens to

my effect as I go through those three models.

The point being, in this case, you can test whether or not the inclusion of

the additional set of extra terms is necessary with the ANOVA function.

So I do anova(fit1, fit3, fit5).

That's what I named them, one, three, five.

And then you see down here, what you get is a listing of the models.

Model 1, model 2, model 3 and then it gives you the degrees of freedom.

That's the number of data points minus the number of parameters that it had to fit.

The residual sums of squares, and

then Df, which is the excess degrees of freedom of

going from model 1 to model 2, and then from model 2 to model 3.

So we added two parameters going from model 1 to model 2, that's why that Df is

2 and then we added two additional parameters going from model 2 to model 3.

So the two parameters we added going from model 1 to model 2 are

examination and education; they're two regression coefficients.

Going from model 2 to model 3, we added Catholic and

Infant.Mortality, two more regression coefficients.

With these residual sums of squares and the degrees of freedom,

you can calculate the so-called F statistic,

and thus get a p-value.
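The statistic anova() reports can be computed by hand from those pieces. Here is a Python sketch on simulated data (the lecture uses the Swiss dataset; the data and model here are my own illustration):

```python
import numpy as np

# F = ((RSS_small - RSS_big) / extra_df) / (RSS_big / resid_df_big)
rng = np.random.default_rng(2)
n = 50
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + x1 + 0.5 * x2 + rng.normal(size=n)  # x2 matters, x3 doesn't

def rss_df(y, X):
    # Residual sum of squares and residual degrees of freedom
    # for a least-squares fit with an intercept.
    X = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2), len(y) - X.shape[1]

rss1, df1 = rss_df(y, x1)                             # y ~ x1
rss2, df2 = rss_df(y, np.column_stack([x1, x2, x3]))  # y ~ x1 + x2 + x3

f_stat = ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
# Compare f_stat to an F(df1 - df2, df2) distribution to get the p-value.
print(round(f_stat, 2))
```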

This gives you the F statistic and the p-value associated with each of them,

then here it shows that yes, the inclusion of examination and education

appears to be necessary when we're just looking at agriculture by itself.

Then I look at the next one, and it says, yes, the inclusion of Catholic and

Infant.Mortality appears to be necessary beyond just

including examination, education and agriculture.

So if the way in which you're interested in looking at your data naturally

falls into a nested model search, as it often does

when you're interested in one variable in particular,

as in this case,

then I think some kind of nested model search is a reasonable thing to do,

and a pretty natural way of thinking about the series of analyses.

It doesn't work if the models that you're looking at aren't nested.

For example, if model 2 had examination but

not education, and the third model had education but not examination.

This wouldn't apply, you'd have to do something else.

And there, I think you get into the harder world of automated model selection with

things like information criteria.

So I would put all that stuff off to our prediction class and

just leave you this one technique that's useful in the one specific instance

where you've decided to look along a series of models,

each getting increasingly more complicated, but including the previous one.

So, I hope in this lecture that you've gotten a couple of model selection

techniques that you can use.

I hope you've also learned that there are some basic consequences that occur, if you

include variables that you shouldn't have or exclude variables that you should have.

This has consequences for the coefficients that you're interested in,

and for your residual variance estimate.

We didn't even touch on some other aspects of [INAUDIBLE] model that could occur,

such as absence of linearity and other things like that, non-normality and so on.

So again, it's generally necessary to take your model with a grain of salt,

because more than likely one aspect of your model is wrong.

And I'll leave you then with this famous quote by George Box, who very

famously said: all models are wrong, some models are useful.

And I think that's a very good credo to go along with that: yes, for

sure your model is wrong, but it might be useful in the sense of being

a lens to teach you something useful and true about your data set.