This next set of lectures will consider a famous distribution in the realm of statistics: the normal distribution. We'll define the normal distribution, talk about some of its unique properties, and discuss when we can and cannot use these properties to learn about population-level data from a sample taken from the population. In this first lecture section on the normal, we're going to talk about some general properties of the normal distribution. Upon completion of this lecture section, you will be able to describe the basic properties of the normal distribution, or normal curve; describe how the normal distribution is completely defined by its mean and standard deviation; and recite the 68-95-99.7% rule for the normal distribution with regard to standard deviation distances from the mean and the proportion of observations that fall within those distances. So let's start this off. The normal distribution is a theoretical probability distribution that is perfectly symmetric about its mean, its median, and its mode. The normal distribution is such that the mean is exactly equal to the median, which is exactly equal to the mode, which we haven't defined yet, but that's the point of highest peak in the distribution, or the most likely observation. You can think of this curve as what we'd get if we had an infinite number of observations that follow a normal curve, did a histogram of those values, and smoothed the curve over: it would look like this bell-shaped, symmetric distribution. The normal distribution is also called the Gaussian distribution; those two are synonymous. It's sometimes called this in honor of its discoverer, Carl Friedrich Gauss, a very famous German mathematician. There are an infinite number of normal distributions, but all are uniquely defined by two quantities: a mean, which we'll call mu for now, like we did with population means before,
and a standard deviation, which we'll call sigma for now. There are literally an infinite number of possible normal curves, one for every possible combination of mu and sigma. Here I've tried to draw three of this infinite number on the same number line to illustrate different distributions with different means and standard deviations. The first curve has a mean of -2, and it has the most variable values of all three of the curves. The second curve has a mean of 0, and the third curve has a mean of positive 1. I could have drawn many more to fill this out; we could spend our whole lifetime drawing all possible combinations of mu and sigma and their respective normal curves and still not finish the job. So where does this curve come from? Well, it's a theoretical distribution, and given a data set or a data population whose values follow a normal curve, the proportion of values falling between a and b, so within 0 to 1, or within -1.5 to -1, etc., is given by a complicated integral that takes the area between a and b under this function. Look at this formula: it's got Greek letters, it's got exponents, it's got an integral; it's a mathematician's dream. I don't expect you to use this formula or memorize it; in fact, I don't use it. But I just want to show you that when all the dust settles, despite the fact that this is very complicated, the only pieces in here that are variable are the mean and the standard deviation. Once we specify those, we have the entire information about the function and can integrate it across any range of values to get the respective area, the proportion of observations in the distribution that fall between a and b. So again, the only reason I point this out is to show you that despite the fact that this function is complicated and involves things like the numerical constant pi,
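For reference, the integral described above (which appears on the slide but not in this transcript) is the standard normal density formula; a likely reconstruction, for a normal distribution with mean $\mu$ and standard deviation $\sigma$:

$$
P(a \le X \le b) \;=\; \int_a^b \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx
$$

Note that, just as the lecture says, once $\mu$ and $\sigma$ are specified, everything else in the formula is a fixed constant.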
and the numerical constant e, and square roots, and exponents, etc., the only two pieces it needs to be fully specified are the values of mu and sigma. So all normal distributions, regardless of their mean and standard deviation values, have the same structural properties. We already know the first one: the mean is equal to the median, which is equal to the most frequently occurring value in the distribution, the mode. The values are symmetrically distributed about the mean in a bell-like shape, such that values further from the mean are less common than those closer to the mean; the closer we are to the mean, the greater the proportion of data points. Again, the entire distribution of values described by a normal distribution can be completely specified by knowing just its mean and standard deviation. And since all normal distributions have the same structural properties, we can use a single reference distribution, sometimes called the standard normal distribution, to elaborate on some of these properties. In the next section, we'll show that every normal distribution can be easily rescaled to this standard normal distribution, which means we only need one set of reference tables, or one reference function on a computer, to calculate anything we need with regard to any normal distribution. So here is the first piece of the 68-95-99.7% rule for the normal distribution: in any normal distribution, 68% of the observations fall within plus or minus 1 standard deviation of the mean. So if I have data that follow a normal distribution, the majority of the data values, 68%, will fall within plus or minus one standard deviation of the mean of all the values. And if we know this, we know some other key things about the normal distribution. If 68% of the observations fall within one standard deviation of the mean, then the remaining 32% fall beyond, or outside of, one standard deviation from the mean.
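The rescaling mentioned above, covered in detail in the next section, is just the standardization (z-score) transformation:

$$
z = \frac{x - \mu}{\sigma}
$$

A value $x$ from any normal distribution with mean $\mu$ and standard deviation $\sigma$ maps to $z$, its distance from the mean measured in standard deviations, and $z$ follows the standard normal distribution.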
But because of the symmetry of the normal distribution, this 32% is equally split between the proportion that's more than one standard deviation above the mean and the proportion that's more than one standard deviation below the mean. These are the tails with regard to one standard deviation, and because of the symmetry of the curve, 16% of the data fall in each of these pieces. There are several ways to talk about the 68% above and beyond what I just showed visually. We can say that for data whose distribution is normal, 68% of the observations fall within one standard deviation of the mean. Another way to say this is that the proportion of observations in a normal distribution that fall within one standard deviation of the mean is 68%. And since we use proportion and probability interchangeably, generally speaking, in statistics, we could also say that the probability that any randomly selected value from this distribution falls within plus or minus one standard deviation of the mean is 0.68, or 68%. So if I pull a random observation from this distribution, there's a 68% chance that it's within plus or minus one standard deviation of the mean. A rule that we'll use commonly throughout the rest of the course is this one: 95% of the observations fall within two standard deviations of the mean. Technically speaking, it's not two, it's 1.96, but for quick back-of-the-envelope calculations we can use plus or minus two; when we're using the computer for most of our computations, it will use the technically correct 1.96. So what does this mean? If 95% of the observations under a normal curve fall within plus or minus two standard deviations of its mean, the remaining 5% is equally spread between the tail that's more than two standard deviations above the mean and the tail that's more than two standard deviations below. Cumulatively these tails have 5%, equally distributed as 2.5% in each of them.
And finally, almost all, 99.7%, of the observations fall within three standard deviations of the mean. So almost all the points in a normal distribution fall within plus or minus three standard deviations; there's a very small percentage beyond three standard deviations in either direction. Now, technically speaking, in a true normal curve, which again is a theoretical distribution, the tails go on forever, meaning that there are very few observations, percentage-wise, that fall beyond three standard deviations, but technically they can be any value on the real number line. In real life no data has infinite range, so we'll see that the normal distribution is only an approximate model for some types of real data. So what does this mean? Let's go back to the 95%, plus or minus 2 standard deviations result, because this is what we'll use most often in the course. If we look at this and parse it similarly to what I did before, but in terms of percentiles: the middle 95% of values in the normal curve fall between the mean minus 2 standard deviations and the mean plus 2 standard deviations. That means, as we said before, that if 95% is in the middle, the remaining 5% is equally split between the two extremes. So 2.5% of the observations that follow a normal curve are smaller than, and hence the remaining 97.5% are greater than, the mean minus 2 standard deviations. On the flip side, 97.5% of the values are less than or equal to, and hence 2.5% of the values are greater than, the mean plus 2 standard deviations. So the mean minus 2 standard deviations gives us the 2.5th percentile for a normal distribution, and the mean plus 2 standard deviations gives us the 97.5th percentile. So where do these rules come from? How do I know about these relationships?
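As a quick worked illustration of the percentile logic above, with hypothetical numbers not taken from the lecture: suppose a measurement is approximately normal with mean $\mu = 120$ and standard deviation $\sigma = 10$. Then

$$
\text{2.5th percentile} \approx \mu - 2\sigma = 120 - 2(10) = 100, \qquad
\text{97.5th percentile} \approx \mu + 2\sigma = 120 + 2(10) = 140,
$$

so roughly the middle 95% of values fall between 100 and 140.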
Well, to tell you the truth, I did not pull out that integral I showed you before and integrate from negative two to positive two standard deviations to figure out that 95% of the values under the curve fall within that range. Somebody else did the hard work for us, and now it can be done by computer. All of the information I quoted, and much, much more, can be found in what's called a standard normal table. The standard normal distribution is a normal distribution with mean mu equal to zero and standard deviation one. And as I alluded to before, any normal distribution with mean mu and standard deviation sigma can be rescaled to the standard normal distribution, so we can use this single distribution as a reference for making statements about the proportion of observations falling within plus or minus a certain number of standard deviations, or beyond. If you wanted to find a standard normal table in the pre-Internet era, you would only need to find any statistics textbook and open to the back. In this Internet era, if you go into Google or your favorite search engine and type "standard normal table," you'll get plenty of hits. Here's an example of one of the first hits I got by typing "standard normal table" into Google, along with the URL for this reference. Now, here's the thing about these tables: there are several different ways to tell the story of standard deviation values under the normal curve, and once we have one of these pieces, we can figure out the rest by the symmetry properties of the normal curve, but you have to be careful in looking at what the table tells you. So, for example, when I go to this table, what it shows me for a given number of standard deviations is not framed in the way I explained it to you before. What it's telling me is the proportion of observations that fall between the mean and that number of standard deviations above the mean.
Not within plus or minus that number, but just between the mean and that number above. So here's the mean of zero, and here's the number of standard deviations above the mean, which I'll call z; the table is telling me the proportion in between. For example, if I were to look up one in this table, the proportion of observations that fall within one standard deviation above the mean, it's going to give me roughly 34%. And we'd say, well, how did you know that a priori? Well, what do we know about one? We know from what I told you that 68% of observations fall within plus or minus one standard deviation of the mean of any normal distribution, including the standard normal. We're getting half that value in this table, and half of 68% is 34%. But certainly if I know 34%, I can figure out the rest, right? I can use the symmetry of the distribution to fill in the other part: another 34% falls between 0 and -1 standard deviations, for a total of 68% in the middle. Then we can figure out that there's 32% remaining in the tails, and because of symmetry that's split: 16% more than one standard deviation above the mean, 16% more than one standard deviation below the mean. Here's another example of a normal table that's telling you something different. If you look at this one, for any given number of standard deviations, it's giving you the proportion of observations that are less than or equal to that number of standard deviations above the mean. I'm going to skip explaining this one, because we're going to jump to a function in R that gives the same thing and is much more user-friendly than trying to figure out what a particular table has given you and where the standard deviations come into play. In this class, anyway, I'm not going to have you look these things up that often, just for the practice of doing so once or twice; we will use R when we need to make these computations, and any results I give you will be coming from the computer.
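To make the two table styles above concrete: writing $\Phi(z)$ for the proportion of observations at or below $z$ standard deviations above the mean (what the second style of table reports), the first style of table reports the proportion between the mean and $z$, which is $\Phi(z) - \Phi(0)$. For $z = 1$:

$$
\Phi(1) - \Phi(0) \approx 0.8413 - 0.5000 = 0.3413,
$$

the roughly 34% read off the first table.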
So, generally speaking, for working off the cuff, I want you to be familiar with the 68-95-99.7% rule. But other such computations, including those wrapped into other analyses later in the course, will be completely handled by the computer. We can use R as a calculator to give us an automatic standard normal table. The relevant command that looks up values as if from a standard normal table in R is called pnorm. If I want to convert any standard deviation value above or below the mean, which I'll call z, to the corresponding proportion under a normal curve, the syntax is pnorm(z). As with the print versions of the normal table, however, it is important to know what information this returns. It gives us something akin to that second table I showed you: if I give it a standard deviation value, it will give me the proportion of observations under a normal curve that are less than or equal to that number of standard deviations above the mean (or below the mean, if the number is negative). So it's different from how I explained these things to you, but we can certainly use the rules I gave you to figure out what this will return. For example, if I type in pnorm(1) and hit Return, it's going to give me 84.1%. Why is that? Well, what do we know about one standard deviation? We know that the percentage of observations within plus or minus 1 standard deviation under the normal curve is 68%. We also know, by the symmetry of the normal curve, that the remaining 32% is equally distributed in the tails, which means there's 16% more than one standard deviation below the mean. So the total proportion of observations that are less than or equal to one standard deviation above the mean is the 68% in the middle plus that 16%, or the 84% that's given by pnorm. We'll be using pnorm again in subsequent sections,
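The pnorm arithmetic just described can be checked directly in R; here is a minimal sketch of the calls, using only the base-R pnorm function mentioned in the lecture:

```r
# pnorm(z): proportion of observations under a standard normal curve
# that fall at or below z standard deviations from the mean
pnorm(1)               # about 0.841: the 68% in the middle plus the 16% lower tail
pnorm(-1)              # about 0.159: the lower tail alone
pnorm(1) - pnorm(-1)   # about 0.683: the "68%" piece of the rule
pnorm(2) - pnorm(-2)   # about 0.954 (exactly 0.950 at z = 1.96)
pnorm(3) - pnorm(-3)   # about 0.997: the "99.7%" piece of the rule
```

Differencing two pnorm calls in this way recovers the "plus or minus" proportions of the 68-95-99.7% rule from the cumulative proportions the function actually returns.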
so we'll get a little more practice with it. But in summary: the normal distribution is a theoretical probability distribution that is symmetric and bell-shaped. There are literally an infinite number of normal distributions, and each can be completely described by only two quantities, the mean and the standard deviation. For all normal distributions, regardless of their mean or standard deviation, 68% of the observations fall within plus or minus 1 standard deviation of the mean, 95% fall within plus or minus 2 standard deviations of the mean, and almost all, 99.7%, fall within plus or minus 3 standard deviations of the mean. Percentages for other values can be found using the standard normal table, or the pnorm function that we saw in R. In subsequent sections we'll be applying these properties to data examples where we can assume that the data come from an approximately normally distributed population. We'll show that if that's the case, we can use the data mean and standard deviation to make statements about percentile values in the sample as estimates for the population. And we'll also show what happens when we apply these rules to data that are clearly not from an underlying normal distribution.