Too much choice can be a bad thing.

You have a dataset,

you need to fit a regression model to predict something,

but you have possible predictor variables

coming out of your ears.

How are you going to decide which predictors

to leave in and which to leave out?

That's a really important question.

Let's look at some guiding principles

to steer your course.

So a good start is to read existing relevant literature.

Studies in high-profile, peer-reviewed journals

are more likely to have been done well than

those in little-known journals at

unscrupulous publishing houses that

will accept anything if you pay them.

You can also ask experts if you know any.

These sources will give you a few suggestions of what to

include but they probably won't do the whole job for you.

So before going any further,

let's be clear about what your model is trying to do.

You want to predict a patient outcome with

enough accuracy to be useful and realistic,

but you don't want a model that's so

complicated that you can't interpret its coefficients,

which for logistic regression are odds ratios

once you've exponentiated them.

You also need your model to be robust;

that means it should still work well when you

apply it to another dataset with different patients.

So let's first consider the pros and

the cons of a model with only one predictor,

and then of one with 100.

I always like to exaggerate to

illustrate a point; it's fun.

First, let's say your model has only one predictor,

which you've selected based

on your reading of the relevant literature.

To make it really easy,

let's say it's gender, defined

as either male or female.

This model will have a grand total of two parameters.

One for one gender, say female,

and one for the intercept, which captures

the odds for the other gender, males.

This model has some obvious advantages,

it's quick to run even on a slow computer,

and simple to interpret and explain to other people.

The parameters of the model,

that's the intercept and the odds ratio for

the effect of being female compared with being male,

will have nice narrow confidence intervals

because they're each based on a lot of patients.

If your dataset contains

1000 patients and the gender split is 50-50,

you have 500 patients to estimate

the odds for each gender. That's a lot.
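As a sketch of how those two parameters work out in practice, here is the arithmetic for a hypothetical 2-by-2 table of made-up counts (for a single binary predictor, the maximum-likelihood odds ratio is just the cross-product ratio of the table, and the intercept is the log-odds in the reference group):

```python
import math

# Hypothetical counts for 1000 patients, 50-50 gender split:
#                 outcome   no outcome
# female            120        380
# male              100        400
a, b = 120, 380   # females with / without the outcome
c, d = 100, 400   # males with / without the outcome

# Intercept of the logistic model: log-odds of the outcome for males
intercept = math.log(c / d)

# Coefficient for "female": the log odds ratio, female vs male
log_or = math.log((a * d) / (b * c))
odds_ratio = math.exp(log_or)

# 95% confidence interval for the odds ratio (Woolf's large-sample method)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)
ci_low = math.exp(log_or - 1.96 * se)
ci_high = math.exp(log_or + 1.96 * se)

print(f"OR = {odds_ratio:.2f}, 95% CI ({ci_low:.2f}, {ci_high:.2f})")
# prints: OR = 1.26, 95% CI (0.94, 1.71)
```

With 500 patients per gender, the cell counts are large, so the standard error is small and the interval is narrow, which is exactly the point being made here.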

This model will be robust, but

the outcome is hardly likely

to be due only to the patient's gender;

the model's predictive power will be poor.

To get better prediction,

you'll need to use more predictors,

so let's consider a model with 100 of them.

Say you've just thrown them all in together.

The discrimination of the model,

as measured by the C-statistic,

may well be high.

Let's say it's now 0.85,

whereas with just gender in it,

it was only 0.53,

so a huge improvement.

But this model would have taken much longer to run,

and you've got a lot of

interpreting and explaining to do.

Some predictors will have low p-values, but many won't.

Some predictors will have

large standard errors and wide confidence intervals,

meaning that the estimated odds ratios for

these predictors have a lot of

uncertainty about their real values;

that is, they are unstable.

This model is not robust;

its output probably can't be trusted.

If you fitted the same model

to a different set of patients,

you'd probably get some very different odds ratios.

This is called overfitting, which I'll

explain in more detail separately.

So what should you do?

You need to prune the model and clear out the junk.

To do this, there are

some exotically named technical tricks

that can be used in regression

but are also considered machine learning methods;

these are beyond the scope of this course.

If prior knowledge isn't enough to help,

there are some other commonly used approaches.

Commonly used, but really smelly,

so smelly that I can

barely bring myself to describe them,

but I must, because they are so widespread.

The first is forward selection.

Here, you start off with no predictors in

the model and then you try them one at a time,

starting with the one with the lowest p-value.

You keep adding variables

until none of the remaining ones are significant,

with significance often defined as p less than 0.1

so you don't miss important ones. This is horrible.

You might think you're keeping

only those where p is less than 0.1,

but actually, all this testing

of this and testing of that

means you've no idea what the real p-values are,

and the confidence intervals don't make sense in

this situation either. It's not robust.
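For concreteness, here is the forward-selection loop in skeleton form. The `pvalue_of` callback is a hypothetical stand-in for refitting the model and reading off a candidate's p-value, and the p-values below are invented purely to exercise the loop; remember that in real use this repeated testing is exactly what makes the resulting p-values untrustworthy.

```python
def forward_selection(candidates, pvalue_of, threshold=0.1):
    """Forward selection sketch: repeatedly add the candidate with the
    smallest p-value until none falls below the threshold.
    NOTE: the p-values this produces are distorted by repeated testing."""
    selected = []
    remaining = list(candidates)
    while remaining:
        # p-value of each remaining candidate, given what's already in
        pvals = {var: pvalue_of(selected, var) for var in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= threshold:
            break          # nothing left that's "significant"
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical p-values, made up purely for illustration; a real
# pvalue_of would refit the regression at every step.
fake_p = {"age": 0.001, "gender": 0.2, "bmi": 0.04, "smoking": 0.5}
result = forward_selection(fake_p, lambda selected, var: fake_p[var])
print(result)   # → ['age', 'bmi']
```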

A variant on this is stepwise selection,

which allows you to drop variables if they

become nonsignificant when you add a new one.

This is also horrible, for the same reasons.

Thirdly, there's backwards elimination.

Here, you put all the possible predictors in at

once and drop the nonsignificant ones one at a time,

beginning with the least significant,

the one with the highest p-value.

This is the least bad of the three;

I use it sometimes, though with caution.
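Backwards elimination has the mirror-image structure. As with the forward-selection sketch, the `pvalue_of` callback and the p-values are hypothetical stand-ins; a real run would refit the model after each drop, so the remaining p-values would change at every step.

```python
def backward_elimination(predictors, pvalue_of, threshold=0.05):
    """Backwards elimination sketch: start with everything in, then
    repeatedly drop the predictor with the highest p-value until all
    remaining ones fall below the threshold. Use with caution."""
    kept = list(predictors)
    while kept:
        pvals = {var: pvalue_of(kept, var) for var in kept}
        worst = max(pvals, key=pvals.get)
        if pvals[worst] < threshold:
            break          # everything left is significant
        kept.remove(worst)
    return kept

# Hypothetical p-values, made up purely for illustration
fake_p = {"age": 0.001, "gender": 0.3, "bmi": 0.02, "smoking": 0.6}
print(backward_elimination(fake_p, lambda kept, var: fake_p[var]))
# → ['age', 'bmi']
```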

Choosing which predictors to keep in your model

is a vital task and an art,

but can be fraught with danger.

Using prior knowledge is good,

and backwards elimination is useful when used carefully,

but forward selection and

stepwise selection are too smelly

even to be considered.