
So this lecture's about why you should care about statistics. As Steven already told you, genomic data science consists of three different components: biology, computer science, and statistics. When people talk about genomic data science, they often think about biology and computer science, and I think statistics often ends up being the third wheel. And so this lecture is meant to motivate you as to why statistics is a very important component of genomic data science.

This is a really exciting result that came out in the journal Nature Medicine. The results suggest that it's possible to take genomic measurements and predict which chemotherapies are going to work for which people. This is an incredibly exciting result in genomic data science, because it was sort of the holy grail: using genomic measurements to personalize therapy, and particularly to personalize therapy for cancer. And so everybody was very excited about this, and people at institutions all over the world tried to go back and reproduce that result.

One of those groups was at MD Anderson Cancer Center. That group included two statisticians, Keith Baggerly and Kevin Coombes, and those statisticians tried to chase down all of the details and reperform the analysis. They did this because their collaborators were really excited about it and actually wanted to use it at MD Anderson in order to tailor therapy. But it turned out that there were all sorts of problems with the analysis, and they had trouble getting hold of the data. And so because of these problems, they were actually unable to reproduce most of the analysis.

And this ended up being a huge scandal in the world of genomic data science, because this very high-profile result, this result that everybody was chasing after, turned out not to hold up once all the details were checked.

So this is actually an ongoing saga. It started off as a discussion between the statisticians at MD Anderson and the group at Duke that performed the original analysis. Over time, they had a large set of interactions where they were trying to settle on the details of how the analysis was performed. It turned out that, due to some lack of transparency by the people who did the original analysis, clinical trials actually got started using this technology. They were assigning chemotherapy to people using an incorrect data analysis, and it was because the statistics weren't really well worked out.

This is so serious that there are now ongoing lawsuits between some of the people who were involved in those clinical trials and had been assigned therapy, and Duke, the institution that was behind the creation of these signatures. So missing why statistics needs to be part of the genomic data science pipeline caused a major issue, so big that lawsuits actually resulted.


This actually spurred an Institute of Medicine report. That report laid out a whole new set of standards by which people should develop genomic data technologies. And much of the report focused on statistical issues: reproducibility, how to build statistical models, how to lock those statistical models down, and so forth.

And so the first thing that I hope to motivate you on is that we should care about statistics. I've just got a couple of silly examples here. This is actually from the published abstract of a paper. In the abstract, you can see where I've highlighted that it says "insert statistical method here." So the authors of this paper cared so little about the statistical analysis that they left a generic placeholder where the statistical method they were using should have been. This suggests the relative ranking of where statistics falls in people's minds when they're thinking about genomic data science. And that sort of issue can cause major problems, like we saw with the Potti scandal.

This is actually not just a problem in genomics; it's a more general problem. This is from a flyer from Berkeley, where they talk about all the different areas that are applying data science these days. And if you notice, statistics is listed, but there's actually no application area. And so this, again, suggests that people don't necessarily think of statistics as something that's important for data science. That lack of statistical thinking is a major contributor to problems in genomic data analysis, both at the level of major projects and at the level of individual investigators.

And so the question is, how do we change this perspective, and how do we make sure that people know that caring about statistics is just as important as caring about the biology or the computer science when doing genomic data science?
