In this video, we introduce the chi-square goodness of fit test, where we evaluate the distribution of one categorical variable that has more than two levels. And by evaluate, we mean we are going to be comparing the distribution of that categorical variable to a hypothetical distribution. Let's give an example. In a county where jury selection is supposed to be random, a civil rights group sues the county, claiming racial disparities in jury selection. The distribution of ethnicities of people in the county who are eligible for jury duty, based on census results, is given in this table. So we can see that in this population we have 80.29% whites, 12.06% blacks, 0.79% Native Americans, 2.92% Asians and Pacific Islanders, and 3.94% other ethnicities. We are also given the distribution of 2,500 people who were selected for jury duty in the previous year. And we can see that of the 2,500 people, 1,920 were white, 347 were black, 19 were Native American, 84 were Asian and Pacific Islander, and 130 were categorized as other race or ethnicity. The court retains you as an independent expert to assess the statistical evidence that there was discrimination. You propose to formulate this issue as a hypothesis test. In this case, your null hypothesis, remember, always says there's nothing going on: people selected for jury duty are a simple random sample from the population of potential jurors, and the observed counts of jurors from the various ethnicities follow the same ethnicity distribution as the population. Your alternative hypothesis says there is indeed something going on. In this case, you're hypothesizing that people selected for jury duty are not a simple random sample from the population of potential jurors, and the observed counts of jurors from the various ethnicities do not follow the same ethnicity distribution as the population.
So when we're evaluating the distribution of one categorical variable against this hypothetical distribution, that is, the true distribution of potential jurors in our population, our null hypothesis says that the observed data follow the hypothesized distribution, and the alternative says the observed data do not follow the hypothesized distribution. So how do we evaluate these hypotheses? We want to quantify how different the observed counts are from the expected counts. If the observed counts are very different from the expected counts, in other words, if the deviations are larger than what we would expect based on sampling variation or simply chance alone, that would provide strong evidence for the alternative hypothesis. This is called a goodness of fit test, since we're evaluating how well the observed data fit the expected distribution. If the jury selection is random, then we would expect the observed counts to follow the percentage distribution in the population. Meaning that we would expect, for example, 80.29% of the 2,500 people to be white; that means we would expect about 2,007 white jurors to be selected if in fact the jury selection is random. Now, this doesn't mean that that's exactly what must happen. If the jury selection is random, we would of course expect some sampling variation, or chance, around this. But what we're going to be evaluating at the end of the day is: are the observed counts so different from the expected counts that we might suspect something is going on here? Or are they only slightly off from the expected counts, so that we wouldn't suspect anything is going on? Similarly, for the black jurors, we would expect 12.06% of the 2,500 people to be black, so that gives us 302. And we can actually go through the entire list and calculate the expected number of Native Americans; that would be 2,500 times 0.79%, so only about 20. And that would be 73 for the Asian and Pacific Islanders, and finally, 98 for the other races.
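As a quick sketch, the expected-count arithmetic above can be reproduced in a few lines of Python. The proportions and the 2,500 total come from the example; the cell order and the rounding to whole jurors are my own assumptions:

```python
# Census proportions in the order: white, black, Native American,
# Asian and Pacific Islander, other -- taken from the example.
proportions = [0.8029, 0.1206, 0.0079, 0.0292, 0.0394]
n = 2500  # number of people selected for jury duty

# Expected count in each cell = proportion * total sample size,
# rounded to whole jurors. (Python rounds halves to the nearest even
# number, which here happens to match the counts used in the video.)
expected = [round(p * n) for p in proportions]
print(expected)       # [2007, 302, 20, 73, 98]
print(sum(expected))  # 2500 -- the rounded counts still add up to n
```

Printing the sum is the same sanity check described next: after rounding, the expected counts should still add up to the total sample size.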
Once you do these expected count calculations, you should always check that the counts you calculated actually add up to your total sample size. Especially if you need to do a little bit of rounding to get these counts right, you want to make sure that your counts at the end of the day add up to the total sample size, and in this case they in fact do. We are starting to introduce a new technique here, so let's go through the conditions required for this technique. The first one is independence. We want to make sure that our sampled observations are independent of each other, for which we need random sampling or random assignment. And if we're sampling without replacement, we want our sample size to be less than 10% of the population. We also want to make sure that each case only contributes to one cell in the table. So we don't want to, for example, identify a potential juror as both white and black. That is a possibility, but for the purposes of the chi-square test, we want to make sure each case can only go into one cell in the table. This is also another way of thinking about independence, because if our cases showed up in multiple cells in the table, then the observations wouldn't exactly be independent of each other. For the sample size, we want to make sure that each particular scenario, in other words each cell, has at least five expected cases. Earlier we calculated the expected counts for our table, and it appears that we indeed have at least five expected cases in each cell. We also have no reason to believe that these observations are not independent of each other. So it appears that we have met the conditions for this hypothesis test. What we need to do next is to develop a new test statistic for count data.
But let's take a look back at what we've been working with so far. The general form of a test statistic is a point estimate minus a null value, divided by the standard error of the point estimate. There are two things a test statistic tries to accomplish. One, it identifies the difference between a point estimate and an expected value, assuming that the null hypothesis were true. Two, it standardizes that difference using the standard error of the point estimate. These two ideas are going to be useful when we start thinking about developing a new test statistic for count data, called the chi-square statistic. When dealing with counts and investigating how far the observed counts are from the expected counts, we use this new test statistic. It's calculated as the observed count minus the expected count for each cell, squared, divided by the expected count, and we sum this over all of the cells. Remember, when we say cell, we're basically referring to levels of the categorical variable. We're introducing a new term here not to confuse you, but because when we get to the other chi-square test, where we're dealing with more than one categorical variable, it's going to help to make the distinction between a level and a cell. So in this case, for example, we had five cells, so k would be five. We saw in the formula that we squared the differences between the observed and expected counts, and of course this is a chi-square test statistic, so it's obvious that we're doing some squaring here. But why do we do that? One, we want to make sure that our standardized differences are positive, because otherwise, if you add some positives and negatives to each other, they're going to cancel each other out. But another way of getting rid of negative signs would have been to use an absolute value, and that's not what we're doing here, because by squaring, we accomplish one more thing.
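To make the formula concrete, here is a minimal Python sketch of the chi-square statistic for this example. I'm assuming observed counts of 1920, 347, 19, 84, and 130 (which sum to the 2,500 jurors) together with the rounded expected counts; under those assumptions the sum reproduces the 22.63 reported in the video:

```python
# Observed and (rounded) expected juror counts per cell, in the order:
# white, black, Native American, Asian and Pacific Islander, other.
observed = [1920, 347, 19, 84, 130]
expected = [2007, 302, 20, 73, 98]

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))  # 22.63
```

Notice how the Asian and Pacific Islander cell contributes only about 1.7 to the sum, while the white cell, with a much larger raw deviation, contributes about 3.8: each squared difference is scaled by its own expected count.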
We accomplish that highly unusual differences between the observed and expected counts appear even more unusual. In order to determine if the calculated chi-square statistic is considered unusually high or not, we need to first describe its distribution. The chi-square distribution has only one parameter, the degrees of freedom. It influences the shape, the center, and the spread of the chi-square distribution. And for a goodness of fit test, the degrees of freedom can be calculated as k minus 1, where k stands for the number of cells. Here we can see a bunch of chi-square distributions, starting from the thick blue line that has only 2 degrees of freedom, going up to the dotted pink line that has 9 degrees of freedom. Take a look over here and think about how the shape of the chi-square distribution changes, as well as the center and the spread, as the degrees of freedom increase. So let's put everything that we've learned so far together. Our null hypothesis was that the observed counts of jurors from the various ethnicities follow the same ethnicity distribution as the population. Our alternative hypothesis was that they do not follow the same ethnicity distribution as the population. To calculate the chi-square statistic, we're going to need those individual components. So for the first cell, white, we take our observed count, subtract from that the expected count, square that, and then divide by the expected count. Then we add to that the same quantity for the blacks: that's going to be 347 minus 302, squared, divided by 302, again the expected count. And we can go through this for each one of the other cells as well. The chi-square statistic then comes out to be 22.63. In order to find our p-value, we also need to know something about the distribution of the chi-square statistic, and for that we need our degrees of freedom. We have five levels here.
White, black, Native American, Asian and Pacific Islander, and other. So 5 minus 1 equals 4 degrees of freedom for this test statistic. Then the only thing left is the calculation of the p-value. The p-value for a chi-square test is defined as the tail area above the calculated test statistic, because the test statistic is always positive, and a higher test statistic means a higher deviation from the null hypothesis. So just like F tests, with chi-square tests the p-value is always defined as the tail area above the observed test statistic. So our chi-square distribution looks something like this. It's right skewed, and remember that it's based on squared values, so it always needs to be a positive number, and we shade the tail area beyond the observed chi-square statistic that we calculated. Well, how do we find that tail area? One option would be to use R. For that we can use the pchisq function, where we feed in our observed chi-square statistic and the degrees of freedom, and I've also specified that we don't want the lower tail, because as we just discussed, in a chi-square test we always want the upper tail. That p-value comes out to be pretty small, 0.0002. Another possibility is to use the applet. In the applet we first pick the chi-square distribution, then we select our degrees of freedom, which was 4. And then we want to make sure we're getting the upper tail, and we're going to look for the tail area beyond 22.63. It appears that in this particular applet, the maximum we can enter is 14.9. But even the tail area for the chi-square distribution with four degrees of freedom beyond 14.9 is only roughly 0.5%. Remember that our test statistic was much larger than 14.9, meaning that the tail area left beyond it is going to be much smaller than 0.5%. And lastly, we can also use a table to find this p-value. So a chi-square table looks something like this.
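For readers without R at hand, the same upper-tail area can be sketched in plain Python. I'm using the closed-form survival function of the chi-square distribution, which holds when the degrees of freedom are even; this is a standard identity, not something shown in the video:

```python
import math

def chi_square_upper_tail(x, df):
    """Upper-tail area P(X > x) for a chi-square distribution.
    Uses the closed-form series, valid only for even degrees of freedom."""
    assert df > 0 and df % 2 == 0, "this closed form requires even df"
    half = x / 2.0
    # P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    series = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * series

# Tail area beyond the observed statistic, with 5 - 1 = 4 degrees of freedom.
p_value = chi_square_upper_tail(22.63, 4)
print(p_value)  # about 1.5e-04 -- the small p-value reported in the video
```

The equivalent R call from the video would be pchisq(22.63, 4, lower.tail = FALSE).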
It works a lot like the t table, except that instead of probabilities inside, we actually have critical values. First, we want to locate the row that's associated with our degrees of freedom, so that's 4. And then within this row, we want to locate our observed chi-square statistic, which was 22.63. It seems like that value would be off the table, on the right side, meaning that our p-value, the tail area left beyond it, is going to be off the table in the same direction as well. We can see that as we move to the right on this table, the p-values are getting smaller and smaller. Therefore, we can say that based on this table, our p-value is a number that's less than 0.001, which agrees with the exact p-value calculation from R. With such a small p-value, we would reject the null hypothesis, which in this context means that the data provide convincing evidence that the observed distribution of the counts of ethnicities of jurors does not follow the distribution in the population.