[MUSIC] Okay, so where are we now? We motivated the discussion of statistical inference and estimation by bringing up this so-called decline effect, where the effect sizes of scientific results seem to be going down over time and reproducibility is suffering. And we gave some reasons for this: publication bias, mistakes and fraud, and the multiple hypothesis testing problem. We used these motivations, these scenarios, to bring out various topics and techniques. We talked a little bit about basic statistical inference, where I just gave you an overview and that's it. We talked about effect size. We brought up a specific term, heteroskedasticity. For fraud detection we brought up Benford's Law. And then with multiple hypothesis testing, which is perhaps the most important part of the discussion, we talked about the familywise error rate and the false discovery rate, and gave correction procedures for both of these.
Okay. So this hopefully was a tour of not just some basic concepts, but also some things that, if not advanced, at least don't necessarily come up in a Stats 101 course, but that I think are pretty important for us data scientists to understand. In fact, there's a view among statisticians that these topics are not very well understood by data scientists. They'll point to typical machine learning classes, where understanding the population, understanding the various biases, and understanding how to correct for the problems that can arise are not taught at all, and it's more of a blind application of algorithms. And so I think it's pretty important to go over this choice of topics.
Now, what about big data? What changes? Well, Brad Efron, who's a world-renowned statistician, describes it this way. He says classical statistics was fashioned for small problems, a few hundred data points at most, and just a few parameters.
And the bottom line is that we've entered an era of massive scientific data collection, with a demand for answers to large-scale inference problems that lie beyond the scope of classical statistics. And so this suggests that something is changing in the era of big data.
Now, what can go wrong here? Well, as we talked about, you can find spurious relationships in big data. This is a picture I got recently from a colleague, who had it emailed to him. It's a plot that someone took the time to make, which may or may not have been a joke, and as you can see, it's titled Internet Explorer vs the Murder Rate. So this is murders in the US in blue, along with the market share of Internet Explorer in green. The discussion that went along with this plot was somewhat amusing, right? It offered various theories for why the murder rate goes down as Internet Explorer market share also goes down. But the point here is that without some common sense, without applying an understanding of the scenario behind the problem, you can make discoveries of this form.
Okay. There are other examples that have been talked about in the literature, again brought up as bad examples. Take the number of police officers and the number of crimes. Why might these two things be correlated? You know, maybe police officers cause crimes. Well, no, it's probably because in densely populated areas there are both more police officers and more crimes. By the way, just to point out again, these authors here are not authors that made these claims. They're authors that brought up the mistake.
Okay, amount of ice cream sold and deaths by drowning. Why would these things be correlated? Well, there's seasonality, right? In the summertime, you sell more ice cream and more people go swimming. And then another one is stork sightings and population increase, used as evidence that storks do indeed bring newborns to families. Well, again, in more densely populated areas there are more people to actually see the storks, and so you get an increase in sightings.
So these kinds of procedures to remove bias, to understand the population you're sampling from, and to understand the possible sources of these correlations are taught in statistics programs but are not typically taught in machine learning classes. [MUSIC]
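
To make the police-and-crime example concrete, here is a minimal sketch in Python. It is not from the lecture: the data is simulated, and the names `density`, `police`, `crimes`, `pearson`, and `residuals` are all hypothetical choices for illustration. It assumes population density drives both quantities linearly, with no direct link between them, and then compares the raw correlation with the correlation left over once density has been accounted for (a simple partial correlation).

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a least-squares line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated (made-up) data: density is the common cause of both series.
# Neither police nor crimes appears in the formula for the other.
density = [rng.uniform(1.0, 10.0) for _ in range(2000)]
police = [2.0 * d + rng.gauss(0.0, 1.0) for d in density]
crimes = [5.0 * d + rng.gauss(0.0, 3.0) for d in density]

# Raw correlation ignores the common cause.
raw_r = pearson(police, crimes)

# Partial correlation: correlate the parts that density does not explain.
partial_r = pearson(residuals(police, density), residuals(crimes, density))

print(f"raw correlation:         {raw_r:.2f}")
print(f"controlling for density: {partial_r:.2f}")
```

Under these assumptions the raw correlation should come out strongly positive, while the correlation after controlling for density should sit near zero. That is the pattern you would expect when a third variable, rather than a causal link, explains the relationship. Real data is rarely this clean, and the confounder is usually not handed to you.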