0:00

This lecture is about the basics of Experimental design.

You'll see a lot more about this

topic in other courses like Inference and Prediction.

But I thought I'd just give you a little bit of the basics, so you can get started.

0:11

So why should you care about experimental design?

I'm going to use an example to illustrate why you should.

This is an example that actually comes from my area of research.

This is an area where they have used ge-, genomic information, so information

about your genomes to predict what kind of chemotherapy you might respond to best.

This was an incredibly exciting result, because it would open

the door for all sorts of personalized medicine within cancer therapeutics.

0:44

Even more unfortunately, the studied had already been on going

for a long time, and they'd actually started clinical trials.

But they were using those predictions that they'd developed in

the earlier study to decide what chemo-therapies people should get.

This was causing all sorts of problems for

people because they had performed a poor analysis.

So this ultimately led to a lawsuit from the people that were

enrolled in the clinical trial against the original investigators who developed the

predictive model, this illustrates how a very exciting result can lead you

astray if you are not very careful about experimental design and analysis.

1:18

The first thing to be aware of when performing any sort of an

experimental design or data science project

is to care about the analysis plan.

This is actually a publishes paper where, believe it

or not, in the abstract of the published paper, it

says insert statistical method here because the person who was

writing the paper didn't know what the statistical method was.

It's critically important that you pay attention to

all aspects of the design and analysis of the

study so that everything from the data cleaning

to the data analysis to the reporting so that

you don't end up, first of all in

silly, embarrassing situations like this, but more importantly so

that you're aware of what are the key issues

in the study design that can trip you up.

1:57

Regardless of what study you are doing, you need to plan

for data and code sharing, you all now have GitHub accounts

that you have developed either through this course or previously to

the course, so that is a good place to share your code.

If you want to share your data with very

small amounts, you can share it on the GitHub.

Larger amounts might go on a site like FigShare, where you can

share your scientific data, and also other kinds of data with other people.

But you might also need a larger data sharing plan, if your data

are much more unwieldy, or large and can't be shared on the websites.

If you don't have a data sharing plan, can I recommend mine?

The leek group has developed a guide to data

sharing that you can find at this website here, that

outlines all the steps that are involved in taking a

broad data set and making it available to other people.

So

2:46

the first thing that you need to do when you're

actually performing an experiment is formulate your question in advance.

So the key component of data science, if we were going to define it

for the purposes of this course track,

is that it's actually a scientific discipline.

And a scientific discipline requires that you're answering a specific

question when you're using data, so here's one example of that.

This is actually a study that was performed by Barack

Obama when he was running for re-election for United States Presidency.

3:39

They wouldn't want to show the website to every single

person, because it would be expensive to run that experiment.

So the population of people is all possible donors in the United States.

All possible people that might go to Barack Obama's website to donate.

They might select some small subset of those with a probability argument

to decide to show them one version or the other of the website.

This would result in a sample, a much smaller number of people that had

observed the data, or had observed the

websites and decided whether to donate or not.

They would then calculate descriptive statistics, say the total number of

donations that they got over a certain number of visits, or

the average number of donations, amount of donations, they got over

a certain number of visits, and then they would use inferential statistics.

To try to decide if the statistics that they

calculated on this small sample would play out in the

same way if they applied it to the large

population of all people that might come to the website.

4:38

So in here is one scenario, suppose that the

data work that they collected hypothetically turned out this

way, there were two versions of the website, there

was the donate version and there was the sign-up version.

And so what they could do is they could say for every

1000 visitors, what was the total number of dollars that were donated?

So if they did that over a, or the average number of dollars donated.

5:01

So for a 1000 visits you might get about $6,000 on average over a course of a week.

And so, suppose you ran that experiment three different times.

You might get observations of very

different amounts of dollars that you would

observe for the donate case, and for the sign-up version of the website.

5:21

So what you see here is that the average number of donations may

be more or less the same, or it may be slightly different, but

it's very hard to tell, because at each of the different experiments, you

get a highly variable answer, under both of the different versions of the website.

5:35

Here's another hypothetical case.

Suppose in this case, you got a very slightly smaller number of dollars donated

under when you showed the sign up version than you did the donate version.

Here, the variability is much smaller, so you can tell

for sure that the donate version gives you more money.

You might implement the donate version, but it's not

sort of the that would be a huge benefit.

5:58

The much bigger benefit comes when you run

the same experiment, and a domain version has

small variability, and the sign up version has

small variability in the total number of dollars donated.

But there's a large difference between the amount people donated when they saw

the donate sign, and the amount they donated when they saw the sign outside.

So this overwhelms the variability that they saw in

the experiment and so it is very interesting for them,

and would suggest they should only show the donate

version of the website to people that came to visit.

Another important issue is the issue of confounding.

So suppose in a particular study that you measured both shoe size and

literacy and suppose that you were looking

for correlations between shoe size and literacy.

6:43

You might actually observe quite a few of these correlations,

because people with small shoes tend to have less literacy.

But what you might be missing is that

age is actually the variable that's causing this relationship.

When you're very young, say a baby, you have a

very small shoe size, you also have a very small literacy.

7:02

As you get older, you get a larger shoe size

and a larger literacy so that the variable age is actually

a confounder for the relationship between shoe size and literacy,

which might be very, very small once you account for age.

So if you only measure shoe size and literacy, you might

be led astray if you observe the correlation between these two

variables this is what is called confounding, so it's paying attention

to what are the other variables that might be causing a relationship.

This plays out all the time in lots of different studies, so here's one example.

This is actually from a possibly serious paper in

The New England Journal of Medicine, where they plotted

chocolate consumption in kilograms per year per capita versus

the Nobel laureates per ten million in the population.

And they observed that there's a relationship between these two variables.

7:53

So what could be going on here?

There's lots of reasons why you might consume

more chocolate and have more Nobel Prize winners.

For example, your country might have better education system if they

have more money and so they would have more Nobel Laureates.

And if they have more money per capita, they

would also probably consume more choclat, chocolate per capita.

And so there's one reason, not many reasons why

though, this observed correlation may not be an actual relationship.

That is due to a scientific relationship

between chocolate consumption and Nobel Laureates per capita.

This is sometimes called Spurious correlation and it's also

why you often here the phrase correlation is not causation.

Even if you observe that two variables are correlated with each other, you have to

prove to yourself that they're not correlated

because of some other variables we didn't measure.

So there are some ways that you can deal with potential confounders.

So one way is you can fix var, fix a bunch of variables for example, in the

case where you're considering websites for Obama, 20 2012,

you can fix Obama 2012 on all the websites.

That should be a two there and then, that way, that

variable doesn't change no matter what the text is that you're showing.

People see you fix that variable so it can't be a con factor.

Another way is, that you can stratify.

So suppose you have two website colors, and you want to try out both of

those website colors with both phrases, and

you want to know which phrase works better.

Then what you can do is use both phrases equally on both colors.

That way, you've stratified your sample so you

have both website colors used equally with both phrases.

9:28

If you can't do any of the, either of the other two things, if you

can't fix a variable or stratify it, then what you can do is randomize it.

And so what we mean by randomization is that you need to be able to

use a computer program, or flip a coin in order randomly assign people to groups.

Why does this help?

So, suppose that there's a particular study that we're preforming, and

suppose that there are ten experimental units that look like this.

So suppose there's a confounding variable, and lighter values of that confounding

variable correspond, lower values of that

confounding variable correspond to lighter colors.

And higher value of course, compounding variable correspond to the

darker colors, so you have these experimental units, and suppose

they are ordered by values in the compounding variable, and

suppose we are silly enough to apply one treatment applied it.

All of the var, all of the people that have a high value of the confounding

variable, and another treatment to all the people

that have a low value of the confounding variable.

Then it would be impossible to distinguish whether

any difference we saw was due to differences

in treatment, or due to differences in the

confounding variable, but if we randomly assign the treatment.

Then some of the people that get the green treatment will have

low value of the confounding variable and some will have high values.

And so since that'll be balanced, we'll

be able to determine whether the difference will

be to determine that the difference in treatment

is what we're actually observing in the outcome.

10:52

So for a prediction study, you actually have slightly

different issues that come up from the inferential case.

So again, you might have a population of individuals, and you might not be able

to collect the data on all those

individuals but you want to predict something about them.

So for example, we might get individuals that

come in and we measure something about their genome,

and we want to predict whether they should get

whether they're going to respond to chemotherapy or not.

11:15

So what we do is, we collect these individuals and we might collect

observations from people that did respond to

chemotherapy, and did not respond to chemoptherapy.

And then what we want to do is build a predictive

function so that if we get a new individual, we can

predict whether they're going to respond to chemotherapy up here, as

an orange person, or not respond down here as a green person.

So the idea here is that we'll still need to deal with probability and sampling and

potential compounding variables, because, when we're building this

prediction function, we want it to be highly accurate.

11:48

Another issue that comes up is that,

prediction is slightly more challenging than inference.

So for example, if you look at these two

populations, the, the population for the light grey curve has

a mean, value about here, and the population for

the dark grey curve has a mean value about here.

If you look at the distribution of

observations from these two populations, and so what

you can see is, there's a difference in the P value of these two populations.

But if I tell you, for example, that

I've observed the value that comes, say, right here.

It's very difficult to know which of these two populations it came

from, because it's relatively likely that it came from the light gray population.

But it's also relatively likely that it came from the

dark gray population, so it's very difficult to tell the difference.

For prediction, you actually need the distributions to

be a little bit more separated, so these also,

these two distributions also have a different mean, but

now, they're far enough apart based on their variability.

That if I give you an observation that lands right about

here, you know it probably came from the dark grid population.

Where if I give you an observation here,

it probably came from a light grid population.

So it's important to pay attention to the

relative size of effects when considering prediction versus inference.

13:02

For prediction there's several key qual, quantities that we need to be aware of.

So the first set of key quantities is, positive and negative statuses

so, for example, this is comes from a test a medical testing scenario.

So you can have, suppose you can either have a disease or not have a disease.

So plus means you have the disease, and minus means you don't have the

disease, and you have a test to determine whether you have that disease or not.

And the plus means you do, the test was

positive, and a negative means the test was negative.

13:39

If you do not have the disease, and you test positive, that's a false positive.

If you do not if you have the disease, but don't test positive, that's a false

negative, and if you don't have the disease,

and you test negative, that's a true negative.

13:53

Some quantities that you need to pay attention to are the

probability you have a positive test, given you have disease that sensitivity.

The probability you have a negative test,

given you have no disease that specificity.

The probability you have disease, given you have a positive test.

This is probably what you care about if you come in to

the clinics, so I tell you that you have a positive test.

Then you want to know, what's the probability you have disease.

Similarly, I tell you if you have a negative test, you might

know, you might want to know what's the probability that you have disease.

Accuracy is just the probability that you are correct in

the outcome, regardless of whether you are positive or negative.

14:32

Another key component of paying attention to big data

and data science is data being aware of data dredging.

So suppose you want to consider something silly, like jelly beans cause acne.

This comes from an XKCD cartoon.

So, scientists could investigate, and they find out that jelly

beans aren't related to causing accuracy or sorry, causing acne.

So then what they say well that settles

that, but you could try to change your hypothesis.

You could say oh, well, actually it's just purple jelly beans that cause acne.

No, it's brown jelly beans that cause acne.

No, it's pink jelly beans, and you keep

going and going and going until you found a

result that you liked and you'd come up with

good news, green jelly beans are linked to acne.

But this of course ignores the fact that you

can try a whole bunch of different things first.

So one thing you have to pay attention to when dealing with big data,

or any data science problem, is being aware of the problem of data dredging.

15:27

So in summary, good experiments have

replication so that you can measure variability.

They'd measure that variability, and compare it to the signal that they

were looking for, and they'd generalize to the problem that you care about.

Also important, good experiments are transparent

in both their code and their data.

15:50

Also, in any data science pro, problem, you need to be aware of data dredging.

The other important issues with experimental

design, will be covered in later classes.