Earlier in the course, we looked at how to use the mean body mass index, BMI, of a sample of people to estimate the mean BMI of the whole population, to avoid having to chase millions of people around with a tape measure and scales. Now, that's OK for means of an approximately normally distributed variable, but does it work for proportions, and why? Suppose you want to know what proportion of your country's population eats the recommended five portions of fruit and vegetables a day. You find 100 random people and you ask them. 20 of them say they eat five a day. If you take another random 100 and ask them, maybe 27 of them would say they meet the target amount. If you do it a third time, maybe 22 meet the target. If you keep taking random samples and recording the proportion of each sample who get their five a day, then the resulting frequency distribution will look something like this. You can see it's a bit skewed to the right.
This is the binomial distribution. It describes the probability of having a given number of events in a given number of trials. In our example, the number of events is the number of people getting their five a day. The number of trials is the number of people in the sample. The event here can be any binary variable. It's either yes or no, success or failure, survival or death. Each person either eats their five a day or they don't. Each sample of 100 people gives you one data point on this graph.
Now, continue gathering random samples. To demonstrate the results, let's ask R to randomly generate samples and plot the resulting frequency distribution. First, let's ask it to plot 15 sets of 100 people, where the real proportion is 25%. Notice that it's skewed to the right. Now, let's ask it to plot 500 sets of 100 people. The skew is still there, but it's much less noticeable. Does this remind you of something else? It looks rather like the normal distribution. But why is that?
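
The lecture runs this simulation in R. As a rough stand-in, here is a minimal Python sketch of the same idea, using only the standard library. The function names `simulate_counts` and `text_histogram`, the fixed seed, and the text-based plot are all assumptions made for illustration and are not part of the course material.

```python
import random
from collections import Counter


def simulate_counts(n_sets, n_people=100, p=0.25, seed=1):
    """Return, for each simulated sample, how many people eat five a day."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    return [
        # Each person independently says "yes" with probability p.
        sum(rng.random() < p for _ in range(n_people))
        for _ in range(n_sets)
    ]


def text_histogram(counts):
    """Build a crude frequency plot: one row per count, one '#' per sample."""
    freq = Counter(counts)
    return "\n".join(
        f"{k:3d} {'#' * freq[k]}" for k in range(min(counts), max(counts) + 1)
    )


if __name__ == "__main__":
    for n_sets in (15, 500):
        print(f"{n_sets} sets of 100 people")
        print(text_histogram(simulate_counts(n_sets)))
```

With only 15 sets the plot tends to look ragged. With 500 sets it should start to look like the familiar bell shape centred near 25.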
Why would something that's either yes or no, when sampled enough times, look like the normal, which is for continuous variables? Because of one of the most powerful results in statistical theory, something called the central limit theorem. Try dropping that phrase into casual conversation with your friends and bask in their admiration. The theorem says that if you take samples from almost any distribution, then as you increase the sample size, the distribution of the sample mean increasingly resembles the normal. A proportion is just the mean of a set of zeros and ones, so the theorem covers proportions too. There are one or two exceptions to this that you can forget about, but the theorem holds for the binomial and the Poisson, and for things like one random variable subtracted from another random variable.
You can use this theorem to derive 95% confidence intervals just as we did with BMI earlier in the course. Suppose you have a sample of 50 people, and 20 percent of them eat the recommended five or more portions of fruit and vegetables a day. You want to use this to estimate the proportion eating five a day in the whole population. Now, we know that 95% of values of the normal distribution lie within 1.96 standard deviations of the mean. We also know the sample mean has some uncertainty about it if we use it to estimate the whole population mean. The same is true for proportions. This uncertainty is measured by the standard error of a proportion. For a sample proportion p from n people, it is the square root of p(1 - p)/n. This and other related formulae are all easily found online. So, using the formula for the standard error of a proportion, you can construct a 95 percent confidence interval as follows. Our sample proportion is 20 percent, and with 50 people the 95 percent confidence interval works out at roughly 9 percent to 31 percent. The size of the standard error here and the width of the resulting 95 percent confidence interval depend on the probability of the thing you are measuring. Here, that's the proportion of people getting their five a day. They also depend on the number of people in your sample.
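
To make the calculation concrete, here is a small Python sketch of the simple normal-approximation interval (often called the Wald interval): the sample proportion plus or minus 1.96 standard errors, with the standard error taken as the square root of p(1 - p)/n. The lecture does not name the formula, so treating it as the Wald interval is an assumption, as is the function name `proportion_ci`.

```python
import math


def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
    return p_hat - z * se, p_hat + z * se


if __name__ == "__main__":
    # Same sample proportion, two sample sizes, to show the effect of n.
    for n in (50, 400):
        low, high = proportion_ci(0.20, n)
        print(f"n = {n}: {low:.3f} to {high:.3f}")
```

For a sample proportion of 0.20 this gives roughly 0.089 to 0.311 with 50 people and roughly 0.161 to 0.239 with 400 people. That shows how strongly the width of the interval depends on the sample size.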
So, the bigger the sample, the more the binomial distribution resembles the normal, and therefore the more valid this approximate method for calculating the 95 percent confidence interval becomes. For ten people, it won't work. So, how many people do you need for it to work? That depends on the proportion. A rule of thumb is for np to be greater than five, where n is the number of people and p the proportion eating five a day, and also for n times one minus p to be greater than five. If either condition fails, you'll need to use tables of the binomial distribution or get the software to help you. So with large enough samples, many important distributions in medicine resemble the normal distribution. The great thing about this is that it makes the important task of calculating 95 percent confidence intervals much easier.
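
The rule of thumb is easy to check in code. This is a minimal Python sketch. The function name `normal_approx_ok` and the example sample sizes are made up for illustration, and the threshold of five is simply the one quoted in the lecture.

```python
def normal_approx_ok(n, p, threshold=5):
    """Rule of thumb: both n*p and n*(1 - p) should exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold


if __name__ == "__main__":
    # Hypothetical sample sizes and proportions chosen for illustration.
    for n, p in [(10, 0.2), (50, 0.2), (100, 0.25), (100, 0.02)]:
        verdict = "fine" if normal_approx_ok(n, p) else "use exact binomial methods"
        print(f"n = {n}, p = {p}: {verdict}")
```

Ten people with a proportion of 0.2 fail the check, because np is only 2. So does a sample of 100 when the proportion is as small as 0.02.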