
Let's talk a little bit about P hacking or hacking the P value.

Let's begin by understanding what P values mean.

Generally you have a population in

the world and there is some distribution of that population.

And often the kind of distribution that you see is

a Gaussian or a bell curve that looks something like this.

So for instance, if we look at heights of people,

we have a distribution like this.

Most people are kind of medium height,

and there are a few very tall people and a few very short people.

Now let's suppose that somebody has this magic tonic that

they claim is going to make people taller.

So what do you do? You do an experiment.

You have a null hypothesis which is to say,

this tonic is useless.

And then you say, well,

let me give somebody this tonic,

actually it won't be one person,

it'll be a control group and a treatment group,

but let's just consider for my discussion right

now what happens with one because it's easier to see what happens.

And then we say, well where does this person lie?

We just observe what the height is of this child who was given this tonic growing up,

and we say what height this person achieved,

and what we do is draw a line at this point,

which is usually a line that we call

the P value boundary and the typical P value that one sets is at 0.05.

So this orange area here is five percent or .05 of the entire area under the curve.

So we just choose this vertical line such that that happens.

And then we say this person was given this tonic.

Does this person lie to the right or to the left of this vertical line?

If this person is to the left then we say,

well it looks like the height of

this person could just have been a random draw from the normal population,

so the tonic is probably not effective.

If this person has a height to the right then we say,

oh this person may not have had

any benefit from the tonic, but

the chance of seeing a height this large without any benefit is less than five percent.

And since this person's height is high compared to the typical for the population,

we are going to call this a significant effect,

and we are going to guess that it is quite likely that this tonic was valuable.
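
The one-person version of this test can be sketched in a few lines of Python. This is only an illustration: the population mean of 170 cm and standard deviation of 10 cm are made-up numbers, not figures from the lecture, and the names `cutoff`, `p_value`, and `is_significant` are hypothetical helpers.

```python
from statistics import NormalDist

# Hypothetical untreated population of heights in cm (illustrative numbers only).
population = NormalDist(mu=170.0, sigma=10.0)
alpha = 0.05  # the conventional P value boundary

# The vertical line: only 5% of the untreated population is taller than this.
cutoff = population.inv_cdf(1 - alpha)


def p_value(height_cm: float) -> float:
    """One-sided p-value: chance of a height at least this large if the tonic is useless."""
    return 1 - population.cdf(height_cm)


def is_significant(height_cm: float) -> bool:
    """True when the observed height lies to the right of the cutoff line."""
    return p_value(height_cm) < alpha


if __name__ == "__main__":
    print(f"cutoff height: {cutoff:.1f} cm")
    for height in (180.0, 190.0):
        print(height, round(p_value(height), 4), is_significant(height))
```

Under these assumed numbers, a child who reaches 180 cm falls to the left of the line and one who reaches 190 cm falls to the right.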

So the thing that we see is that this computation of P values has,

underlying it, many assumptions.

First, is the distribution nicely bell-shaped?

For many real world phenomena,

things are actually bell-shaped,

but for others they aren't.

The theory of P values can be applied irrespective of the shape of the distribution.

But if that's what one is doing then the mathematics changes

and one is often not so careful in terms of thinking through what the distribution is,

and where the P value boundary should be.
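
To see why the shape matters, here is a small sketch comparing where the five percent line falls for a bell curve and for a skewed distribution with the same mean and spread. The exponential distribution with rate 1 is my own choice of example, not one from the lecture.

```python
import math
from statistics import NormalDist

alpha = 0.05

# An exponential distribution with rate 1 has mean 1 and standard deviation 1.
mean, sd = 1.0, 1.0

# Cutoff we would draw if we wrongly assumed a bell curve with that mean and sd.
normal_cutoff = NormalDist(mean, sd).inv_cdf(1 - alpha)

# Correct cutoff for the exponential: solve exp(-x) = alpha for x.
exponential_cutoff = -math.log(alpha)

# Fraction of the exponential population that actually lies beyond the
# bell-curve cutoff, i.e. the real false-positive rate of the wrong line.
actual_rate = math.exp(-normal_cutoff)

if __name__ == "__main__":
    print(f"bell-curve cutoff:  {normal_cutoff:.3f}")
    print(f"exponential cutoff: {exponential_cutoff:.3f}")
    print(f"real tail area beyond the bell-curve cutoff: {actual_rate:.3f}")
```

In this example the bell-curve line sits too far to the left, so more than five percent of a useless-treatment population would be declared significant.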

The other problem we have is how many things we are testing,

not how many subjects but how many treatments.

So a standard problem we have is one of

multiple hypothesis testing and the issue here is the following:

Suppose I have a hundred different claimed

tonics and I just test each of them independently.

We know that for each of these there is

a five percent probability of its being declared significant,

purely by chance even if this tonic was doing nothing good.

Well if I'm testing a hundred different independent tonics,

well, I'd expect five out of these hundred on average to meet that criterion.

And so just by testing a hundred useless tonics

that did nothing whatsoever to help in changing the height of a child,

I would have found five that I could claim after a scientific test did so.
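
A quick simulation makes the point concrete. This is a sketch under simplifying assumptions: each useless tonic is tested on one person drawn from a standard bell curve, and the function names are hypothetical.

```python
import random
from statistics import NormalDist


def count_false_positives(n_tonics: int = 100, alpha: float = 0.05, seed: int = 0) -> int:
    """Test n_tonics useless tonics and count how many look significant by chance."""
    rng = random.Random(seed)
    # One-sided cutoff on a standard normal (heights expressed as z-scores).
    cutoff = NormalDist(0.0, 1.0).inv_cdf(1 - alpha)
    # Every tonic is useless, so each outcome is just a draw from the null.
    return sum(rng.gauss(0.0, 1.0) > cutoff for _ in range(n_tonics))


def average_false_positives(n_repeats: int = 2000, n_tonics: int = 100,
                            alpha: float = 0.05) -> float:
    """Repeat the whole 100-tonic study many times and average the counts."""
    counts = [count_false_positives(n_tonics, alpha, seed=s) for s in range(n_repeats)]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    print("one study:", count_false_positives())
    print("average over many studies:", average_false_positives())
```

Any single study may find more or fewer than five, but the average over many repeated studies settles close to five.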

Now this may sound very hokey but this is actually

the kind of thing that one ends up having to do in high-throughput biology.

We have things called gene chips and we can test

many different genes for biological questions of interest.

And if a single gene chip has 20000 genes on it,

you're actually running 20000 independent tests.

And at the end of the day if you're

testing to a five percent probability in terms of a P value,

you're going to get about a thousand genes

on average, in one gene chip with

20000 genes, that are shown to be significant purely by chance.
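
The arithmetic behind these numbers, written out. The symbols $m$ (number of tests) and $\alpha$ (the P value boundary) are notation introduced here, and the second formula assumes the tests are independent and every null hypothesis is true.

```latex
% Expected number of false positives among m tests at level alpha
\mathbb{E}[\text{false positives}] = m\,\alpha
  \qquad\Rightarrow\qquad
  100 \times 0.05 = 5, \qquad 20000 \times 0.05 = 1000

% Probability of at least one false positive among m independent tests
P(\text{at least one}) = 1 - (1 - \alpha)^{m}
  \qquad\Rightarrow\qquad
  1 - 0.95^{100} \approx 0.994
```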

Now the actual problems that we have are quite a bit deeper than that.

When you have independent hypotheses,

such as the 20000 genes we're talking about, and we test them in parallel,

then there is good mathematical theory.

One can have the right statistics,

apply that to correct for the multiple tests,

and we can do the right thing.
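
The lecture does not name a specific correction. As an illustration, here is a sketch of two standard ones that I am choosing myself: the Bonferroni correction and the Benjamini-Hochberg procedure. The p-values in the usage example are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only the tests whose p-value is below alpha divided by the number of tests."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure, which controls the false discovery rate."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * alpha.
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank
    # Reject every test ranked at or below that k.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected


if __name__ == "__main__":
    example = [0.001, 0.012, 0.025, 0.039, 0.6]  # made-up p-values
    print(bonferroni(example))
    print(benjamini_hochberg(example))
```

Bonferroni is the stricter of the two; Benjamini-Hochberg tolerates a controlled fraction of false discoveries, which is often the more practical choice when there are thousands of tests.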

However there are more complex situations where it's much harder.

So if we have sequential hypotheses,

each slightly different from the previous one,

then it's hard to even develop the mathematics

for this because the exact dependence

between different hypotheses may be hard to characterize.

So for example if you are a pharmaceutical company developing dozens of drug candidates,

and we test them each independently

but each candidate comes along at a different point in time,

and they have complex interrelationships in terms of

how the research on one has informed the research on the other,

what one should expect is most of these drug candidates to fail and a few to succeed.

And since we are not actually testing these dozens of candidates in parallel,

we may not do a multiple hypothesis testing correction.

Things can be even more complicated.

Often we have observed data first and then we devise the hypothesis.

Standard P value mathematics was developed for

traditional experimental techniques where you designed

the experiment first and collected the data afterwards.

And so you have your hypothesis first and then you collect

the data specifically to test the hypothesis.

In data science we have the data first and we often don't even have a hypothesis in mind.

We want to go poke around in the data and learn from it.

And indeed there's a lot that you can learn by performing

exploratory analyses and so in today's data science driven world,

exploration is the first phase of data analysis.

Well, what that means is that we are now

developing hypotheses that fit the observed data.

Of course if you test a hypothesis afterwards,

it's going to do well with the observed data.

But to avoid this problem,

one could, if one were correctly doing this,

separate the exploratory data, that

is, the data used for exploratory analysis, which is the training data,

from the test data on which the evaluation is reported.

And if we have enough data that we've collected,

this is indeed something that is possible to do.
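
One way to do this separation is a random split made before any exploration starts. This is a minimal sketch; the function name, the 25 percent held-out fraction, and the fixed seed are all assumptions for illustration.

```python
import random


def split_exploration_and_test(records, test_fraction=0.25, seed=0):
    """Randomly split records into an exploration set and a held-out test set."""
    shuffled = list(records)
    # Shuffle a copy with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    exploration = shuffled[n_test:]
    return exploration, test


if __name__ == "__main__":
    exploration, test = split_exploration_and_test(range(100))
    # Poke around in `exploration` to form hypotheses,
    # then evaluate each chosen hypothesis once on `test`.
    print(len(exploration), len(test))
```

The held-out set only protects against this problem if it is not looked at while the hypotheses are being formed.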

To conclude, humans have many biases.

No human is perfectly fair even with the best of intentions.

Psychologists have done considerable work on

explaining to us many unconscious biases that we all have.

Biases in algorithms are usually easier to

measure and so if we say we expect algorithms to be perfect,

and the result of algorithmic decision making to be perfect,

I think we're asking for possibly the impossible.

However,

there are mathematical definitions of fairness that can be applied in many cases.

We can prove fairness within some scope of assumptions for algorithms,

and by watching for some of the kinds of errors that we've been talking about,

we can try to make sure that our algorithms are as unbiased as possible.

The point is we need to recognize that algorithms too can have

biases and they can reflect the biases of the people who built them.

And by being cognizant of this,

we can minimize the bias that algorithms actually show.