Is major depression associated with smoking quantity among current young adult smokers? Or, in hypothesis testing terms, is the mean number of cigarettes smoked per month equal or not equal for individuals with and without major depression? The explanatory variable here is categorical with two levels, that is, the presence or absence of major depression. The response variable, smoking quantity, measured by the number of cigarettes smoked per month, ranges from 1 to 2940.
First, we have the LIBNAME statement, which points to the course data set and tells SAS where to find the data. Next, the data step begins with the DATA statement, which then reads in the specific data set, in this case the nesarc data. We also added LABELS to make the output easier to read, and set missing data appropriately. Then we created secondary variables measuring usual smoking frequency per month, number of cigarettes smoked per month, and also packs of cigarettes smoked per month, as both a quantitative and a categorical variable. Just before the end of the data step, the data was subset to young adult smokers who had smoked in the past year. We concluded the data step by sorting the data by the unique identifier.
As you know, all procedures that follow the data step, which is the code located between the DATA statement and the PROC SORT statement, are written to request specific output or results. So we'll need to add the syntax for analysis of variance after the PROC SORT statement. To conduct an analysis of variance, we're going to use the ANOVA procedure. We start with PROC ANOVA; and follow this with the CLASS statement.
>> The CLASS statement identifies the categorical explanatory variable. The name of my categorical explanatory variable is MAJORDEPLIFE, so I include this after the word CLASS and, as always, end the statement with a semicolon.
>> Next we write a MODEL statement naming the quantitative response variable.
In this case, that's NUMCIGMO_EST, then the equal sign, then the categorical explanatory variable, which is MAJORDEPLIFE, and we end the statement with a semicolon. Finally, the MEANS statement tells SAS the groups for which we would like to compare the mean number of cigarettes smoked per month. So again, we include the categorical explanatory variable, MAJORDEPLIFE, and end the statement with a semicolon. Don't forget RUN and a semicolon. So now we're ready to run the program and take a look at the output.
>> PROC ANOVA first displays a table that includes the following: the name of the variable in the CLASS statement, the number of different values or levels of the class variable, the values of the class variable, the number of observations in the data set, and the number of observations excluded from the analysis because of missing data, if any. So here we see that our categorical explanatory variable, MAJORDEPLIFE, has two levels, and the values are 0 and 1. Of the 1706 observations, 1697 were included in the analysis. PROC ANOVA then displays an analysis of variance table for the response variable, also known as the dependent variable, from the MODEL statement.
>> In this case, our response or dependent variable was NUMCIGMO_EST. Our calculated F statistic, called the F Value in this output, is 3.54. The significance probability, or P value, associated with this F statistic is labeled Pr > F. And as you can see, the P value is .0601, just over our .05 cut point. If we look at the means table, we see that young adult smokers without major depression, indicated by a value of 0, smoke an average of 312 cigarettes per month, and that those with major depression, indicated by a value of 1, smoke an average of 341.5 cigarettes per month. Because the P value is greater than 0.05 (it is about 0.06), we cannot reject the null hypothesis, so we conclude that these means are not significantly different.
In other words, we have no evidence of an association between the presence or absence of major depression and the number of cigarettes smoked per month among young adult smokers.
>> If there were really no association and I ran this study many times, I would still see a difference at least this large about six out of 100 times. By normal scientific standards, that is not adequate certainty to reject the null hypothesis and say that there is an association. Instead, we fail to reject the null hypothesis and say that we have not found an association. Had the P value been less than .05, I would know that there was a significant association, and to interpret it I would look at the means table. There I can see that individuals with major depression smoke more than individuals without. So with a significant P value, I could have said that young adult smokers with major depression smoke significantly more cigarettes per month than young adult smokers without major depression.
>> So, we've shown you the ropes in terms of a categorical explanatory variable that has two levels, as it did here with depression. For this interpretation, all we need to know is the P value and the means for each of the two groups.
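Putting the pieces together, the analysis step described above might look something like the sketch below. It uses only the statements and variable names mentioned in the walkthrough (CLASS, MODEL, MEANS, MAJORDEPLIFE, NUMCIGMO_EST). It assumes the data step and PROC SORT have already run, so that PROC ANOVA picks up the most recently created data set. The layout and comments are illustrative and are not taken from the course program itself.

```sas
/* Sketch only: assumes the DATA step and PROC SORT described earlier have  */
/* already run, so the most recently created data set contains the subset   */
/* of young adult smokers with MAJORDEPLIFE and NUMCIGMO_EST defined.       */
PROC ANOVA;
  CLASS MAJORDEPLIFE;                 /* categorical explanatory variable (0 or 1)      */
  MODEL NUMCIGMO_EST = MAJORDEPLIFE;  /* quantitative response = categorical explanatory */
  MEANS MAJORDEPLIFE;                 /* mean cigarettes per month for each group       */
RUN;
```

When reading the output, the two things to look for are the Pr > F value in the analysis of variance table and the group means printed by the MEANS statement.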