Most people who don't understand statistics are very suspicious about what anyone says based on a relatively small sample. As you know, we don't have the time or the means to study entire populations, so we rely on sample studies. Suspicious or not, time and time again, if we have an unbiased and representative sample, our conclusions will be right in the majority of cases. We test a drug on a relatively small number of people and then decide that the drug is safe and effective for the entire world. Why does it work? Where does this powerful insight come from? Most of it is due to what is known as the central limit theorem. If you could have one superhero in the world of statistics, it would be this theorem.
To understand this theorem, it is best to go through an example. Movies are made to make a profit for the studios. One important group of moviegoers is teenagers, those aged between 13 and 19. In order to prioritize the type of movies that the studio makes, they may first want to know something about this age group. Are they spending money to go see the movies, or do they wait for them to come out on DVD and online, which may not be as profitable? Or which genre do they prefer to see in the movie theatre? If you could ask every teenager in the country, most would fall near the average spending, and the further we get from the average, the fewer teenagers we would find who spend that kind of money. But the studio can't ask all teenagers. So instead, to do the study, we select a random sample of teenagers and ask them.
Let's assume that in our first study we ask a group of teenagers how much they spent on movies in a year. This histogram shows the response from each of the 25 teenagers surveyed. There's a great range among these 25, but when we take their average as a group, we find the average spending to be about $728. We know the chosen group may not be representative of the entire population. So we take many such samples.
In this study, we are going to take 99 more groups of 25 teenagers and record each group's average spending on movies. So at the end, we'll end up with a total of 100 average spendings. Plotting the average of each group, we find that the averages of these groups are somewhat bell-shaped. This histogram shows the distribution of the sample means. Now, let's see what happens if we take a sample size of 50 in each group, and survey 100 such groups. Now, let's do the survey including 100 in each group, and survey 100 such groups. Now, do the survey including 250 in each group, and survey 100 such groups. Did you notice what happened to the distribution of the mean dollars being spent? As the sample size increased, the spread decreased. And the distribution is becoming more and more evenly spread around the mean, making it a more classical normal distribution. This is what the central limit theorem gives us: the averages of samples have an approximately normal distribution. And as the sample size increases, the distribution of averages becomes both more normal and narrower. Recall, when the distribution is narrower it has a smaller standard deviation, and therefore the mean is much more representative of a typical observation.
Would this also be true if we didn't have a population that is normal? Here, we had assumed that the money spent on movies by the population of teenagers is normally distributed. What if this was not the case? Here is the distribution of average daily temperatures for the city of Chicago, collected over many, many years. This distribution is bimodal. The two peaks represent the most observed temperatures during the cold months and the warmer months. What would the sampling distribution of sample means look like for such a distribution? So let's do the same. To give an estimate for the average temperature of Chicago, we will take several samples: 100 samples. The sample sizes are 5, then 25, then 50, and finally 100. Let's see how this will turn out.
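The repeated-sampling procedure described above can be tried out in a few lines of Python. This is only a sketch under stated assumptions: the lecture does not give the population parameters, so the spending population (normal, mean 700 dollars, standard deviation 200 dollars) and the bimodal temperature population (an even mix of a cold season near 30 degrees and a warm season near 72 degrees) are hypothetical stand-ins, and the function names are made up for illustration.

```python
import random
import statistics


def draw_spending(rng):
    # Assumption: yearly movie spending is normal, mean 700, sd 200 (dollars).
    return rng.gauss(700.0, 200.0)


def draw_temperature(rng):
    # Assumption: a bimodal mix of a cold season (around 30 degrees) and a
    # warm season (around 72 degrees), each equally likely.
    if rng.random() < 0.5:
        return rng.gauss(30.0, 8.0)
    return rng.gauss(72.0, 8.0)


def sample_means(draw, sample_size, num_samples=100, seed=0):
    """Take num_samples random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


def spread_by_sample_size(draw, sample_sizes):
    """Map each sample size to the standard deviation of its sample means."""
    return {n: statistics.stdev(sample_means(draw, n)) for n in sample_sizes}


spending_spread = spread_by_sample_size(draw_spending, (25, 50, 100, 250))
temperature_spread = spread_by_sample_size(draw_temperature, (5, 25, 50, 100))

for label, spreads in (("spending", spending_spread),
                       ("temperature", temperature_spread)):
    for n, spread in spreads.items():
        print(f"{label:11s} n={n:3d}  spread of sample means = {spread:6.2f}")
```

For both populations, the printed spread of the 100 sample means should shrink as the sample size grows, which is the narrowing of the histograms described in the lecture.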
This is the distribution of the sample means. Remember, we took 100 samples, with 5 observations selected in each of these 100. The histogram here shows the distribution of means for each of these 100 samples. Increase the sample size to 25. The histogram here shows the distribution of the means for these 100 samples. Now, increase the sample size to 50. The histogram here shows the distribution of means for these 100 samples. And here's a histogram that shows the distribution of the means for 100 samples when each sample had 100 randomly selected observations. Just as we saw before, as the sample size increases, the distribution of averages becomes both more normal and narrower, even though the underlying distribution for our data was not normal.
So the central limit theorem shows that regardless of the true distribution of the population, the distribution of these sample means will be approximately normal if the sample is large enough. There is something very special about this phenomenon, which gives us the normal distribution. It means that we can use the properties of the normal distribution whether or not the actual population is normal. Using this knowledge, we can now take samples and make inferences about the population. For example, based on a sample of consumers tested, we can decide which features will be appreciated most by the overall consumer market. Or, based on a sample of likely voters, we can predict who will win an election.
I mentioned this earlier, but it's important to repeat it. The histogram you see on this slide represents the distribution of the means of the 100 samples we had taken. Like before, we can describe the distribution by its central tendency and its variability.
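One rough way to check the claim that sample means look normal even when the population is bimodal is to count how many values fall close to the center. The sketch below again uses a hypothetical temperature population (the even mix of a cold and a warm season is an assumption, not the Chicago data), and it uses 1,000 samples of size 100 rather than the lecture's 100 samples, purely to make the estimate steadier.

```python
import random
import statistics


def draw_temperature(rng):
    # Assumption: hypothetical bimodal population, cold season near 30
    # degrees and warm season near 72 degrees, each equally likely.
    if rng.random() < 0.5:
        return rng.gauss(30.0, 8.0)
    return rng.gauss(72.0, 8.0)


def share_near_center(values, width=0.5):
    """Fraction of values within `width` standard deviations of their mean."""
    center = statistics.fmean(values)
    sd = statistics.stdev(values)
    return sum(abs(v - center) <= width * sd for v in values) / len(values)


rng = random.Random(1)

# Individual observations drawn straight from the bimodal population.
raw = [draw_temperature(rng) for _ in range(10_000)]

# Means of 1,000 samples, each with 100 observations.
means = [
    statistics.fmean(draw_temperature(rng) for _ in range(100))
    for _ in range(1_000)
]

# Share a true normal distribution puts within half a standard deviation.
z = statistics.NormalDist()
normal_share = z.cdf(0.5) - z.cdf(-0.5)

print(f"raw observations near center: {share_near_center(raw):.2f}")
print(f"sample means near center:     {share_near_center(means):.2f}")
print(f"normal distribution:          {normal_share:.2f}")
```

The raw observations put very little mass near their overall mean, because the two peaks sit on either side of it, while the sample means should land close to the share a normal curve predicts.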
When we are working with sample means, that is, with the distribution of these means, the central tendency is X bar and the variability is referred to as the standard error, which is calculated by taking the population standard deviation and dividing it by the square root of n, our sample size. Mathematically, then, you should see that as the sample size increases, the denominator of this equation increases, which means the variability we see in the distribution of the sample means decreases. That is why results obtained from larger sample sizes are more reliable. The larger the sample size, the narrower the distribution of the sample means becomes, and thus the closer a sample mean tends to be to the true mean of the population. This is why sample size matters so much in any statistical analysis. We will learn more about what the sample size should be in the next module.
Finally, we have arrived at the payoff of all this. Since the sample means are distributed approximately normally, thanks to our superhero, the central limit theorem, we can harness the power of the normal curve. That is, we expect about 68% of all sample means to lie within 1 standard error of the population mean. About 95% of the sample means will be within 2 standard errors of the population mean. And about 99.7% of the sample means will lie within 3 standard errors of the population mean. Knowing this, we can now take just one sample and assess whether we were unlucky and ended up with a sample whose mean is closer to the tails than to the center of the distribution. Being able to assess this will allow us to know to what degree we can be confident about our sample study. This is the topic of our next module, so more on this later.
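The standard error calculation and the 68, 95, and 99.7 percent figures can be reproduced with Python's standard library. This is a small sketch, not part of the lecture: the population standard deviation of 200 dollars is an assumed value chosen only for illustration, and `standard_error` is a helper name introduced here.

```python
import math
from statistics import NormalDist


def standard_error(population_sd, sample_size):
    """Standard error of the sample mean: population sd / sqrt(n)."""
    return population_sd / math.sqrt(sample_size)


POP_SD = 200.0  # assumption: hypothetical population standard deviation

# The standard error shrinks as the sample size grows.
errors = {n: standard_error(POP_SD, n) for n in (25, 50, 100, 250)}
for n, se in errors.items():
    print(f"n={n:3d}  standard error = {se:5.1f}")

# Share of a normal distribution within 1, 2, and 3 standard errors.
standard_normal = NormalDist()
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)
}
for k, share in coverage.items():
    print(f"within {k} standard error(s): {share:.1%}")
```

Because n sits under a square root, quadrupling the sample size from 25 to 100 only halves the standard error. The computed shares are slightly more precise than the rounded figures quoted in the lecture; the exact 95% mark falls at about 1.96 standard errors rather than 2.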