Now we're going to use the chi-square test of independence to test the hypothesis I proposed about smoking frequency and nicotine dependence, from working with the NESARC data. Specifically, is how often a person smokes related to nicotine dependence among current young adult smokers? Or, in hypothesis testing terms, are smoking frequency and nicotine dependence independent or dependent? That is, are the rates of nicotine dependence equal or not equal among individuals from my different smoking frequency categories?

For this analysis, I'm going to use a categorical explanatory variable with six levels: the number of days smoked per month, which you may remember I called USFREQMO. Its categorical values are smoking approximately 1 day per month, 2.5 days per month, 5 days per month, 14 days per month, 22 days per month, and 30 days per month. My response variable is categorical with two levels: the presence or absence of nicotine dependence in the past 12 months, called TAB12MDX in the NESARC data set.

To run this in Python, we'll import the SciPy stats library. Next we will request our contingency table of observed counts, which I am calling ct1, and we'll use the pandas crosstab function to generate it. Within parentheses, I include my two-level categorical response variable, TAB12MDX, followed by a comma and then my categorical explanatory variable, which can have two or more levels. Here, that is USFREQMO.

Now I want to generate the column percentages, which will show me the percent of individuals with nicotine dependence within each smoking frequency level. As you can see, I used the counts from my contingency table ct1 to generate these. There are actually two steps to calculating the column percentages. In the first step, I create an object called colsum in which I sum the values in each column using the sum function. The axis=0 statement tells Python to sum all the values in each column.
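As a minimal sketch of these first two steps, here is the contingency table and the column sums built on a tiny made-up data frame. The frame name `sub` and its twelve rows are assumptions for illustration only; they are not the NESARC sample, and only three of the six frequency levels appear.

```python
import pandas as pd

# Made-up toy data, NOT the NESARC sample: one row per person.
# `sub` is a hypothetical name for the subset of young adult smokers.
sub = pd.DataFrame({
    "USFREQMO": [1, 1, 1, 1, 5, 5, 5, 5, 30, 30, 30, 30],
    "TAB12MDX": [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1],
})

# Observed counts: response variable first (rows),
# explanatory variable second (columns).
ct1 = pd.crosstab(sub["TAB12MDX"], sub["USFREQMO"])
print(ct1)

# Step one of the column percentages: axis=0 adds down the rows,
# which gives one total per column.
colsum = ct1.sum(axis=0)
print(colsum)
```

With real data, the only change would be to pass the two columns of the actual subset to `pd.crosstab`.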
Once we have the column sums, all we have to do is create another object, which I called colpct, in which I divide each value in the ct1 object by its column sum. I do this in the second line of code. Then I ask Python to print the column percentages using the print function. As an extra note, the object ct1 here is a two-dimensional table (a pandas DataFrame). Its first dimension, axis=0, runs down the rows, and its second dimension, axis=1, runs across the columns. Summing along axis=0 therefore adds up the rows within each column and gives one total per column.

Finally, I request the chi-square calculations, which include the chi-square value, the associated p-value, and a table of expected counts that are used in these calculations. I call these calculations cs1 and ask Python to print them.

My results first include the table of counts of the response variable by the explanatory variable. You can see that there were 64 participants who smoked approximately one day a month without nicotine dependence, and 7 participants who smoked one day a month with nicotine dependence. At the other end of the table, among those smoking daily, that is, 30 days a month, 521 participants do not have nicotine dependence and 799 do. Next, our table of column percentages makes these counts more meaningful by showing the percent of individuals with or without nicotine dependence within each smoking frequency category or level. Examining these column percents for those with nicotine dependence, that is, TAB12MDX = 1, we see that as smoking frequency increases, the rate of nicotine dependence also increases.

Now, looking at the chi-square results, the chi-square value is large, about 165, and the p-value, shown in scientific notation, is quite small, approximately 7.4e-34. This clearly tells us that smoking frequency and nicotine dependence are significantly associated. So why did we calculate the column percents?
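Before answering that, here is a rough sketch of the column percentage and chi-square steps just described. It uses only the two columns whose counts are quoted above (1 day and 30 days per month); the other four levels are left out because their counts are not given here, so the statistic it prints will not match the six-level result of 165.

```python
import pandas as pd
import scipy.stats

# Only the two columns quoted in the lecture; the remaining four
# smoking frequency levels are omitted, so this is a partial table.
ct1 = pd.DataFrame(
    {1.0: [64, 7], 30.0: [521, 799]},
    index=pd.Index([0, 1], name="TAB12MDX"),
)
ct1.columns.name = "USFREQMO"

colsum = ct1.sum(axis=0)   # step one: one total per column
colpct = ct1 / colsum      # step two: each cell divided by its column total
print(colpct)

# Returns the chi-square value, the p-value, the degrees of freedom,
# and the table of expected counts. For a 2x2 table like this one,
# SciPy applies Yates' continuity correction by default.
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)
```

The row labelled 1 in `colpct` is the one of interest: it holds the proportion with nicotine dependence within each smoking frequency column.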
To better understand this choice, let's look at three different tables that pull apart the different numbers represented in a cross-tabs contingency table. For example, we're going to use percentages from a chi-square table examining the distribution of insured and uninsured individuals by geographic region. Table A shows row percentages. Each cell includes the percent of observations within each row, that is, within each region (Northeast, Midwest, South, and West), that are either insured or uninsured. As you can see, adding across the rows gives us 100% of the observations within region. Table B includes the total percent of observations in each cell. Here, the percentages in all of the cells together add up to 100%. Finally, Table C shows column percentages. Each cell includes the percent of observations within a column, that is, within the insured group or the uninsured group. Adding down the columns gives us 100% of observations by insurance status.

So which of these percentage types should we calculate when trying to interpret the chi-square results for smoking frequency and nicotine dependence? If the output is set with the explanatory variable categories across the top of the table and the response variable categories down the side, it will be the column percents that we want to interpret. In other words, we're interested in whether the rate of nicotine dependence differs according to which explanatory group the observations belong to, that is, which smoking frequency group. Notice that we are not interested in the column percentages for those observations without nicotine dependence, indicated with a dummy code of 0. Instead, we're interested in describing the presence of nicotine dependence within the smoking frequency groups; that is, these column percentages circled in blue.
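To make the three percentage types concrete, here is a small sketch that computes row, total, and column percentages from one table of counts. The counts are made up for illustration and are not the figures behind Tables A, B, and C; the names `counts`, `row_pct`, `total_pct`, and `col_pct` are likewise hypothetical.

```python
import pandas as pd

# Made-up counts, NOT the real insurance data: regions as rows,
# insurance status as columns.
counts = pd.DataFrame(
    {"insured": [80, 70, 60, 90], "uninsured": [20, 30, 40, 10]},
    index=["Northeast", "Midwest", "South", "West"],
)

# Table A style: each cell as a percent of its row total.
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100

# Table B style: each cell as a percent of the grand total.
total_pct = counts / counts.to_numpy().sum() * 100

# Table C style: each cell as a percent of its column total.
col_pct = counts.div(counts.sum(axis=0), axis=1) * 100

print(row_pct)
print(total_pct)
print(col_pct)
```

Which of the three to read depends on where the explanatory variable sits: the percentages should add to 100% within each explanatory group.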
If I want to graph the percent of young adult smokers with nicotine dependence within each smoking frequency category, I would first import the seaborn and matplotlib.pyplot libraries and then add the following code: first setting our explanatory variable to categorical and our response variable to numeric, and then requesting a bivariate bar chart with smoking frequency categories on the x-axis and the mean for nicotine dependence, which is the proportion of ones, on the y-axis. Now I can visualize the association and see even more clearly that there seems to be a positive linear relationship; that is, the more days per month a young adult smokes, the more likely they are to have nicotine dependence.

I know from looking at the significant p-value that I will reject the null hypothesis and accept the alternate hypothesis, that not all nicotine dependence rates are equal across smoking frequency categories. If my explanatory variable had only two levels, I could interpret the two corresponding column percentages and be able to say which group had a significantly higher rate of nicotine dependence. But my explanatory variable has six categories. So I know that not all are equal, but I don't know which are different and which are not.
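The graphing step described above could look roughly like the sketch below. It assumes a recent seaborn release and uses `catplot` with `kind="bar"`, which may not be the exact plotting call used in the lecture; the data frame `sub` is again made-up toy data rather than the NESARC sample.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd
import seaborn

# Made-up toy data, NOT the NESARC sample; `sub` is a hypothetical name.
sub = pd.DataFrame({
    "USFREQMO": [1, 1, 1, 1, 5, 5, 5, 5, 30, 30, 30, 30],
    "TAB12MDX": [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1],
})

# Explanatory variable as categorical, response variable as numeric.
sub["USFREQMO"] = sub["USFREQMO"].astype("category")
sub["TAB12MDX"] = pd.to_numeric(sub["TAB12MDX"])

# Bar height is the mean of the 0/1 response within each category,
# which is the proportion of ones.
grid = seaborn.catplot(x="USFREQMO", y="TAB12MDX", data=sub, kind="bar")
grid.set_axis_labels("Days smoked per month", "Proportion nicotine dependent")

# The same proportions as numbers, for checking the bars.
props = sub.groupby("USFREQMO", observed=True)["TAB12MDX"].mean()
print(props)
```

In an interactive session, `plt.show()` displays the chart; the printed `props` values are the heights the bars should have.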