In this video, we introduce the chi-square goodness of fit test, where we evaluate

the distribution of one categorical variable that has more than two levels.

And by evaluate, we mean we are going to be comparing the distribution of that

categorical variable to a hypothetical distribution.

Let's give an example.

In a county where jury selection is supposed to be random, a civil rights

group sues the county, claiming racial disparities in jury selection.

The distribution of ethnicities of people in the county who are eligible for

jury duty, based on census results, is given in this table.

So we can see that in this population we have 80.29% whites, 12.06% blacks,

0.79% Native Americans, 2.92% Asians and

Pacific Islanders, and 3.94% other ethnicities.

We are also given the distribution of 2,500 people who were selected for

jury duty in the previous year.

And we can see that of the 2,500 people, 1,920 were white,

347 were black, 19 were Native American, 84 were Asian and

Pacific Islander, and 130 were categorized as other race or ethnicity.

The court retains you as an independent expert to assess

the statistical evidence that there was discrimination.

You propose to formulate this issue as a hypothesis test.

In this case,

your null hypothesis, remember, always says there's nothing going on.

People selected for

jury duty are a simple random sample from the population of potential jurors.

The observed counts of jurors from various races/ethnicities

follow the same race/ethnicity distribution as the population.

Your alternative hypothesis says there is, indeed, something going on.

In this case, you're hypothesizing that people selected for

jury duty are not a simple random sample from the population of potential jurors.

The observed counts of jurors from various races/ethnicities

do not follow the same race/ethnicity distribution as the population.
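
To make the null hypothesis concrete, here is a minimal Python sketch of how many jurors we would expect from each group if the 2,500 people really were a simple random sample from the population. The percentages are the census figures quoted earlier; the names `population_share`, `n_selected`, and `expected_counts` are illustrative assumptions, not something shown in the video.

```python
# Population shares of eligible jurors, as quoted from the census table.
population_share = {
    "White": 0.8029,
    "Black": 0.1206,
    "Native American": 0.0079,
    "Asian and Pacific Islander": 0.0292,
    "Other": 0.0394,
}

n_selected = 2500  # people selected for jury duty in the previous year


def expected_counts(shares, n):
    """Expected count per group if the n jurors were a simple random sample."""
    return {group: share * n for group, share in shares.items()}


expected = expected_counts(population_share, n_selected)

# Print one row per group: the count the null hypothesis would predict.
for group, count in expected.items():
    print(f"{group:28s} {count:8.2f}")
```

These expected counts are what the observed counts get compared against: the further the observed counts stray from them, the less plausible the null hypothesis looks.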

So when we're evaluating the distribution of one categorical variable

against a hypothetical distribution,

which here is the true distribution of potential jurors in our population,

our null hypothesis says that the observed

data follow the hypothesized distribution, and

the alternative says the observed data do not follow the hypothesized distribution.
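
One way to write this pair of hypotheses in symbols is sketched below. The notation is an assumption added for illustration and does not come from the video: $p_i$ is the true probability that a selected juror falls in level $i$ of the categorical variable, $p_i^{0}$ is the hypothesized proportion for that level (here, the census percentage), and $k$ is the number of levels (five in the jury example).

```latex
\begin{align*}
H_0 &: p_i = p_i^{0} \quad \text{for every level } i = 1, \dots, k \\
H_A &: p_i \neq p_i^{0} \quad \text{for at least one level } i
\end{align*}
```

Note that the alternative does not say which level differs or in which direction, only that the observed data do not follow the hypothesized distribution as a whole.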