>> In this video, I would like to illustrate the concept of the empirical rule and the central limit theorem. I'm going to do this by using our temperature data for New York. We have over 26,000 data points for New York, which represent average daily temperatures for the last 25 years. And what I'm going to do is select a sample from this data set over and over again. Let me first show you some characteristics of this data. I have calculated the minimum average daily temperature observed in the last 25 years, and this turns out to be 2.6 degrees, and the maximum observed in this data set is 94.5. For this data set, the over 26,000 data points show an average of 55.2 as the average daily temperature that you expect, with a standard deviation of roughly 17.38. So what I have done is take samples from this data set. Each sample I have taken has 72 observations. So, for example, for sample number one, we selected 72 random days out of our larger data set, and we found that the average for that sample was 54.7. And then I repeated this and took a second sample. I have, in total, taken 100,000 different samples from this data set, and what I'm going to do is show you how the averages of these samples are distributed. So that will be the central limit theorem, and then I'll show you the concept of the empirical rule. To do this, I first want to know what bins I need to pick, and for that, I'm going to see what the minimum average was. The minimum average that I noticed in my 100,000 samples was 47.2. The maximum that I saw was 64.2. So at least one sample resulted in an average daily temperature of 47.2, and one sample resulted in 64.2. And, as you know, my actual average was 55.2. So there are a couple of samples that are quite far off from here.
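The repeated sampling described above can be sketched in Python. This is only an illustration under stated assumptions: the real temperature recordings are not available here, so `population` is a made-up, deliberately non-normal stand-in, and the number of samples is cut from 100,000 to 2,000 so that it runs quickly. All variable names are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for the temperature data: a two-humped,
# clearly non-normal population. These are NOT the real recordings.
population = (
    [random.gauss(38, 9) for _ in range(13_000)]
    + [random.gauss(72, 8) for _ in range(13_000)]
)

SAMPLE_SIZE = 72      # observations per sample, as in the video
NUM_SAMPLES = 2_000   # the video uses 100,000; kept small here for speed

# Draw a random sample of 72 values and record its average, over and over.
sample_means = [
    statistics.fmean(random.sample(population, SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]

pop_mean = statistics.fmean(population)
mean_of_means = statistics.fmean(sample_means)

print(f"population mean       : {pop_mean:.2f}")
print(f"mean of sample means  : {mean_of_means:.2f}")
print(f"min / max sample mean : {min(sample_means):.2f} / {max(sample_means):.2f}")
```

With the stand-in data, the mean of the sample means should land very close to the population mean, while individual sample means spread out on either side of it.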
But let me find out what average I get from the entire set of samples I have taken. This would be the average of the 100,000 sample averages, and it turns out to be 55.2, which is what I had for my population. So this is something we know: the average of the averages oftentimes comes very close to the actual population average. And it would be the same thing if you were working with proportions. We can take the standard deviation of the sample means also. I can come up with this value by just taking the standard deviation function and selecting the average temperatures for my 100,000 samples, and this gives me the value of 2.05. Now, as a reminder, when you are talking about the standard deviation of the sample means, we actually call this the standard error. The standard error, which is the standard deviation of the distribution of the sample means, is your population standard deviation divided by the square root of N. And, of course, if I don't have the standard deviation of the population, and I have only one sample, I would use S, the standard deviation of that sample, here. So let me see what value I get if I use the theoretical way of calculating it. Based on our calculations, the standard deviation is 17.38. So, therefore, I am looking for 17.38 divided by the square root of 72, because that's the size I had taken for each of the samples. So if I use this equation, what would the standard error be? This is my theoretical way of finding the standard error. It would be equal to 17.38 divided by the square root of 72, my sample size in each sample. And it's 2.048. So look at these two values now: 2.05 and 2.048. They are roughly the same. So this is why the standard error is calculated by taking the standard deviation and dividing it by the square root of the sample size, which is 72 here. This is the variability you see in the sample means.
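The theoretical standard error calculation above is small enough to check directly. A minimal sketch, using only the two figures quoted in the video (the variable names are my own):

```python
import math

population_sd = 17.38  # standard deviation of the full data set (from the video)
n = 72                 # observations in each sample

# Standard error = population standard deviation / sqrt(sample size)
standard_error = population_sd / math.sqrt(n)

print(round(standard_error, 3))  # 2.048
```

This matches the 2.048 quoted in the video and sits very close to the 2.05 measured from the 100,000 sample means.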
So now let's look at our complete data, and see what the histogram would look like for the entire 26,000 observations that we had. Here's the histogram I have for the over 26,000 temperature recordings that I have for New York City. And this does not look like the normal distribution that you are used to seeing. So what would it look like if I plot the sample averages instead? I'm going to use the minimum and maximum sample averages as a way of selecting my bins when I'm creating the histogram for my sample averages. So let's say that I would like my bins to go anywhere between 45 and, let's say, 66 or 67 or 65, something like that. So I'm going to type 45, and I don't want to enter all the numbers by myself. What I'm going to do then is go to Home, pick this, pick Series, and now say go from 45 in half-degree steps all the way to, let's say, 66, and I'm going to fill down a column. Just remember to click on this also. And it should fill this column for me all the way to 66 in half-degree increments. And it does. If I just look at this, we will see that it goes all the way to 66. So I'm going to use this as my bins to create the histogram that represents the distribution of my sample means. I'll go to Data, Data Analysis, and pick Histogram. In Histogram, the input range is going to be where I have my 100,000 sample averages, so I scroll back up, and I am also going to give it my bins. I'm going to say, use these bins to find the frequencies. So I pick the entire set of bins that I have, and I have used labels for both, so I'm going to say there is a label. I want to make sure it gives me a histogram, so I click on chart output, and I'm going to put it on this sheet somewhere so we can look at it. So I'm going to say output range, and, remembering to click in this window before clicking anywhere on the spreadsheet, I'm going to put it somewhere here.
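The bin column and the frequency count behind the histogram can be sketched in Python. This assumes the convention that each value is counted in the first bin whose upper edge is at least as large as the value, with one extra slot for anything above the top edge. The `sample_means` list is made up for illustration, and `histogram_counts` is a hypothetical helper, not a spreadsheet function.

```python
from bisect import bisect_left

# Bin upper edges from 45 to 66 in half-degree steps, like the filled column.
bins = [45 + 0.5 * i for i in range(43)]


def histogram_counts(values, bin_edges):
    """Count values per bin; a value goes in the first bin whose edge is >= it.

    The extra last slot collects anything above the top edge.
    """
    counts = [0] * (len(bin_edges) + 1)
    for v in values:
        counts[bisect_left(bin_edges, v)] += 1
    return counts


# Hypothetical sample means, made up for illustration only.
sample_means = [54.7, 55.9, 53.2, 57.1, 55.2, 47.2, 64.2, 55.0]
counts = histogram_counts(sample_means, bins)

# Print only the bins that received at least one value.
for edge, count in zip(bins, counts):
    if count:
        print(f"<= {edge:4.1f}: {count}")
```

Starting the bins at 45 and ending at 66 leaves room on both sides of the observed minimum (47.2) and maximum (64.2) sample averages.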
It's going to recreate my bins and then it's going to put the frequencies next to them and then give me the histogram, so, okay. And here's my histogram. I'm going to move it so you can see, and I'm going to make it larger. So what do you see? This is the distribution of my sample means. It definitely looks a whole lot like a normal distribution. If it's not a perfect match, it's pretty close, right? It is centered around 55.2, and the distribution's variability, its spread, is your standard error, which is 2.048. So it is much more tightly packed together, and it's a normal distribution, as compared to my complete data. So this is the illustration of the central limit theorem. Most of the sample means will fall near the center, and it will be a normal distribution. Why does the empirical rule matter here? Please remember that our standard error is about 2.05, so I am just going to round this to 2.05 to show you the idea of the empirical rule. I have gone ahead and redrawn my histogram, and also have [inaudible] with the theoretical normal curve, and what you can see is that it is very close. Sure enough, there are a few places where our observations do not exactly fit the normal curve. But we can all agree that it is a very, very close fit. So now I can talk about the empirical rule, and why it matters. If you remember, the empirical rule states that 68 percent of all observations for a normal distribution will fall within plus or minus one standard error of the mean. In our case, our mean is 55.2, and our standard error for the distribution of the sample means is 2.05. So when I say plus or minus one standard error, this implies 55.2 plus 2.05, and 55.2 minus 2.05. And that would be 57.25 and 53.15. So if I were drawing this now on my histogram, where would that fall? 53.15 is roughly around 53, so I'm just going to take that bar and highlight it. So right here. And 57.25 falls between 57 and 57.5. So somewhere here.
So what we're saying is that 68 percent of samples will return a mean that is within plus or minus one standard error of the true mean, which is 55.2. And 95.5 percent will fall within two standard errors. In this case, that would be 55.2 plus 2 times 2.05, and 55.2 minus 2 times 2.05. And these values will be 59.3 and 51.1. So, again, if you were trying to draw this, 59.3 is roughly around here, and 51.1 is about here. So 95.5 percent of the sample means that you would get would fall within these two boundaries. And 99.7 percent of the sample means will fall within three standard errors, and in this case, that will be 55.2 plus 3 times 2.05, and 55.2 minus 3 times 2.05, and that would be 61.35 and 49.05. So when we go to draw that, this is about 61, and this is about 49, and this would include 99.7 percent. And you can see that, with our data, the empirical rule has worked out. And that would be the case every time. That's why this is a theorem. So it doesn't matter if your underlying distribution is normal or not. The distribution of the sample means, if you have a large enough sample size, would be approximately a normal distribution, and the empirical rule will hold.
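The three empirical-rule intervals above can be reproduced with a few lines of Python. This is a sketch that treats the sampling distribution as exactly normal with the mean and rounded standard error quoted in the video; the variable names are my own.

```python
from statistics import NormalDist

mean = 55.2            # mean of the sample means (from the video)
standard_error = 2.05  # rounded standard error used in the video

# Idealized sampling distribution: normal, centered on the population mean.
sampling_dist = NormalDist(mu=mean, sigma=standard_error)

rows = []
for k in (1, 2, 3):
    low = mean - k * standard_error
    high = mean + k * standard_error
    # Share of a normal distribution lying between the two boundaries.
    coverage = sampling_dist.cdf(high) - sampling_dist.cdf(low)
    rows.append((k, round(low, 2), round(high, 2), coverage))
    print(f"+/- {k} SE: {low:.2f} to {high:.2f} covers {coverage:.1%}")
```

The boundaries come out as 53.15 to 57.25, 51.10 to 59.30, and 49.05 to 61.35, and the computed coverage is roughly 68 percent, 95 percent, and 99.7 percent for one, two, and three standard errors.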