0:09

With associational analyses, the basic goal is to determine whether a key

predictor and an outcome are associated with each other in the presence of many

other potentially confounding factors.

But sometimes you want to be able to predict the outcome with all of the available information that you have.

0:25

So you don't necessarily have to distinguish between, say, a key predictor and a set of other predictors; you just want to use all the information. Furthermore, it doesn't matter whether the variables are related to the outcome in some causal or mechanistic way; if they carry any information at all about the outcome, they may be useful in a prediction setting, and you might want to use them.

That's because you're usually not interested in developing a detailed understanding of how the variables are related to each other, or how they're related to the outcome. You just want to be able to produce solid, high-quality predictions, and so any variable that could play a role in that might be useful to you.

So, what are the expectations that we have about prediction problems?

So when we build a prediction model, what we want is to be able to find a feature or a set of features that can produce good separation in the outcome.

Typically, the outcome is going to be some binary or multi-class outcome that can take either two values or just a handful of values.

That is a typical setup of a prediction problem.

And you want to be able to separate the two classes using a set of features that you collect and a model that you develop.

So here's a very simple example of a single predictor of a binary outcome that produces very good separation. This data is simulated; I just want to show you what that looks like.

On the y-axis I have the values of the outcome which are just 0 and 1, so

it's a binary outcome.

You can think of this as, say, not having the disease versus having the disease, or

2:00

any sort of binary class outcome like that.

On the x-axis here I've got a simulated predictor that ranges between -2 and 2,

roughly.

And it's continuous, so it takes all the different values in between there.

And you can see that there's a gray area that I've highlighted in the plot, near the middle. In that gray area, the outcome will take values 0 and 1 depending on the value of the predictor. So the outcome can take either value in that gray area.

To the right of the gray area you'll notice that the outcome is always 1, and to the left of the gray area you'll see that the outcome is always 0.

So the goal of most prediction algorithms is essentially to minimize the size of that gray area, because that gray area is where you have the most uncertainty: it's the range of the predictor where the outcome could actually take both values.

So you have some uncertainty there.

Once you're outside that gray area, you have almost absolute certainty, because the outcome is always 0 on one side and always 1 on the other.

So the goal is to minimize the size of this gray area using some set

of features that you can collect.
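To make that concrete, here's a minimal sketch in Python of how data like this could be simulated; the gray-area cutoffs and sample size are arbitrary choices for illustration, not taken from the example itself.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# A continuous predictor ranging roughly between -2 and 2
x = rng.uniform(-2, 2, size=200)

# Outside the gray area (here, |x| > 0.5) the outcome is determined by
# which side of the area x falls on; inside it, either value can occur.
gray_lo, gray_hi = -0.5, 0.5
y = np.where(x < gray_lo, 0,
             np.where(x > gray_hi, 1,
                      rng.integers(0, 2, size=x.size)))

plt.axvspan(gray_lo, gray_hi, color="lightgray")  # the gray area
plt.scatter(x, y, s=15)
plt.xlabel("predictor")
plt.ylabel("outcome (0 = no disease, 1 = disease)")
plt.show()
```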

3:05

So, for this example, I'm going to use a dataset on the creditworthiness of a group of individuals.

And the dataset is taken from the UCI Machine Learning Repository,

which is an excellent repository for all kinds of machine learning and

prediction types of data sets.

So the data set classifies individuals as having good or bad creditworthiness, and it includes a variety of variables to help you predict the creditworthiness of these people.

So the basic process that we'll go through here is: first we'll split the data set into a training set and a test set. Then we'll fit the model to the training data. Then we'll use that model to make predictions on the test data. And then we'll compare those predictions to the truth that we know from the test data to see how well we did.
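Here is a minimal sketch of that whole process in Python with scikit-learn. I'm assuming the OpenML mirror of the UCI German credit data ("credit-g"); the model actually used in this example isn't shown, so the logistic regression here is just one plausible choice.

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# 1,000 individuals labeled "good" or "bad" credit, with 20 predictors
credit = fetch_openml("credit-g", version=1, as_frame=True)
X, y = credit.data, credit.target  # y takes values "good" / "bad"

# Step 1: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# Step 2: fit a model to the training data (one-hot encode the
# categorical predictors, then fit a logistic regression)
model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_include="category")),
        remainder="passthrough"),
    LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Step 3: make predictions on the test data
pred = model.predict(X_test)
prob_good = model.predict_proba(X_test)[:, list(model.classes_).index("good")]
```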

So here are some of the results from the model that we fit. I won't get into the details of the model because it doesn't really matter at this point.

But here's a plot that you might make.

And the basic idea here is that on the x-axis we have our predicted probability of having good credit quality, and on the y-axis we have the truth of whether you actually have good or bad credit.

Because this is the test data set, we actually know the truth, and so

we can compare the truth with what our prediction says.
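Continuing the hypothetical sketch from above, a plot along these lines could be drawn like so:

```python
import matplotlib.pyplot as plt

# Predicted probability of "good" on the x-axis, truth on the y-axis
plt.scatter(prob_good, (y_test == "good").astype(int), s=12, alpha=0.4)
plt.yticks([0, 1], ["bad", "good"])
plt.xlabel("predicted probability of good credit")
plt.ylabel("truth")
plt.show()
```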

And so there are a couple of things you'll notice here. First of all, it doesn't quite look like that picture where I simulated the data, so it doesn't quite match our expectations of having very good separation. All along the range of the x-axis you'll see both bad and good values, so the separation isn't very good.

Now, you will notice there is a big clump of points on the x-axis above about 0.6, and they are mostly in the good credit quality category. So as the predicted probability goes up, the number of people who actually have good credit quality increases.

So there is at least an association there, so that's good.

But one thing you'll notice is that the prediction scores are all kind of on the high end, basically all greater than one half, so there isn't a lot of range there.

5:10

So ultimately, it's not clear that this prediction algorithm is particularly good. It seems to have some difficulty finding a good combination of features that can separate people with good and bad credit risk.

Something that can also be helpful is to compute a set of summary statistics about

the prediction algorithm, and you can see that here.

5:28

So here at the very top is what's called a confusion matrix. It shows, for each value of the truth, bad or good (that's called the reference), the number of individuals we predict to be bad or good. And you'll notice immediately that most of the predictions are just "good": the algorithm basically classifies everyone as having good credit quality.

I guess that makes sense, because most of the individuals in this data set have good credit quality (in the German credit data, 700 of the 1,000 individuals do). So if you were to make a prediction, the easiest thing would be to say everyone has good credit, and then your prediction would be right about 70% of the time.

6:04

You can see that the accuracy of the algorithm is about 70%.

And so that's okay, but the problem is that the algorithm's specificity is very poor, which means that if you actually have bad credit, the probability that the algorithm will classify you as such is only about 2.6%.

So if you truly have bad credit the algorithm will have a difficult time

picking that up.
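Continuing the sketch from before, the same kind of table and summary statistics could be computed like this (the exact numbers depend on the model and the split, so don't expect to reproduce the 70% and 2.6% figures exactly):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Rows are the truth (the "reference"), columns are the prediction
cm = confusion_matrix(y_test, pred, labels=["bad", "good"])
tn, fp, fn, tp = cm.ravel()  # treating "bad" as the negative class

accuracy = accuracy_score(y_test, pred)
sensitivity = tp / (tp + fn)  # P(classified good | truly good)
specificity = tn / (tn + fp)  # P(classified bad  | truly bad)
print(f"accuracy={accuracy:.3f}  "
      f"sensitivity={sensitivity:.3f}  specificity={specificity:.3f}")
```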

6:28

So there are a couple of things to think about

when you see the results of a prediction model like this one.

So, the first thing to think about is prediction quality. You have to ask yourself: is the model's accuracy good enough for your purposes?

You saw the summary statistics from this particular model, which were okay: it had 70% accuracy and 2.6% specificity.

Is that good enough for your purposes?

Well, it depends on what your purposes are.

For example, in many medical applications where the outcome is the presence of

a disease you may want that test or that algorithm to have a very high sensitivity.

So if they truly have the disease, you'll want the algorithm to pick it up.

Because then you can send them into treatment and send them down the road to recovery.

However, if the treatment for that disease is very painful or has a lot of bad side effects, you may want to be careful about exactly who you send to treatment.

And particularly, you wouldn't want to send someone who didn't have the disease

to a treatment that's going to be very painful and have a lot of side effects.

So there, you would want to make sure that if someone does not have the disease, the algorithm picks that up; in other words, you would favor specificity.

And so there are different metrics that you might favor over others, depending on the kind of decision that will be made and the consequences of those decisions.

And so, for example, in a financial application like the dataset we just

looked at with good and bad credit quality, there may be asymmetric costs

associated with mistaking good credit for bad versus bad credit for good.

So one scenario might have very little cost while another might have a very high cost.

And so you want to think about, given the consequences of your decisions, which metrics, whether sensitivity, specificity, or any of the other kinds of metrics, are going to be most important to you in your setting.
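One common way to act on that choice, sketched here on the hypothetical model from before, is to move the classification threshold: requiring a higher predicted probability before calling someone "good" raises specificity at the cost of sensitivity.

```python
# Trade sensitivity against specificity by moving the threshold on the
# predicted probability of "good" (the default is effectively 0.5).
truth_good = (y_test == "good").to_numpy()
for t in [0.5, 0.6, 0.7, 0.8]:
    pred_good = prob_good >= t
    sens = (pred_good & truth_good).sum() / truth_good.sum()
    spec = (~pred_good & ~truth_good).sum() / (~truth_good).sum()
    print(f"threshold={t:.1f}  sensitivity={sens:.3f}  specificity={spec:.3f}")
```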

8:15

Every setting is going to be a little bit different, and so

you're not going to always focus on a given metric for every single application.

So a hallmark of almost all prediction algorithms is tuning parameters.

Most algorithms have lots and lots of tuning parameters.

And even though they're called tuning parameters, they can often have a very big

impact on the prediction quality of the algorithm depending on how you set them.

8:41

And so you should be careful about how you set them and how you change them around.

Now, there's no prediction algorithm that I'm aware of

where a single set of tuning parameters works fine for all problems.

So for any new data set that you bring into a prediction algorithm,

you'll probably have to tune it a little bit, and that's okay.

But the most important thing is to make sure you keep track of all the tuning parameters you set and the process through which you set them.

Because ultimately you're gonna want this model to be reproducible.

And if you can't remember or you can't reproduce the tuning parameters,

you'll never be able to reproduce the algorithm itself.
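As a hedged illustration of both points, here's one way to tune a parameter of the earlier hypothetical model by cross-validation and to record exactly what was chosen; the grid, file name, and scoring metric are all arbitrary choices of mine, not from the example.

```python
import json
from sklearn.model_selection import GridSearchCV

# Tune the regularization strength C of the logistic regression from
# the earlier sketch; the grid itself is a tuning choice worth recording.
grid = GridSearchCV(
    model,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="accuracy", cv=5)
grid.fit(X_train, y_train)

# Write down which tuning parameters were used and how they were chosen,
# so the fitted model can be reproduced later.
with open("tuning_log.json", "w") as f:
    json.dump({"best_params": grid.best_params_,
               "search_grid": grid.param_grid,
               "cv_folds": 5,
               "scoring": "accuracy"}, f, indent=2)
```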

9:21

So one last thing I want to mention is about the availability of other data.

So many prediction algorithms

these days are very good at exploring the structure of complex data and

making very good predictions, especially once you get the tuning parameters right.

But it may be that if your model is not working very well, you have to change to another algorithm or another procedure, because different procedures can work well in different settings, with different types of data structures and different types of data setup. So it's not necessarily true that all the algorithms are exchangeable; you may want to change the algorithm.

However, if you try a few algorithms and they all seem to be producing kind of

a similar quality of prediction, regardless of how well you tune them,

it may be time to get more data or other data to help you predict the outcome.

It could just be that the dataset you have has only a limited intrinsic ability to predict whatever outcome you're interested in, and you have to get other data that will have better predictive power.

So think about that as you're building prediction algorithms and seeing the results: there is always the possibility that you'll have to bring in additional data to improve your predictions.