Using inferential techniques, we can determine which variables in the model are significant predictors of our response variable. In this video, we're going to talk about doing a hypothesis test and constructing a confidence interval for the slope estimates of the predictors in our models. And in addition to calculating these values, we're also going to go through how to interpret them.

The data that we're going to be working with come from the National Longitudinal Survey of Youth. These are cognitive test scores of three- and four-year-old children and characteristics of their mothers. We have data on the kid's score, whether or not the mom went to high school, the IQ score of the mother, whether the mom worked during the first three years of the kid's life, and the age of the mother at the birth of her child.

Using R, we can easily fit a model predicting the kid's score from the other variables that are given in the data set. First, we need to load our data. If you would like to follow along, you can do so using the code provided here. Next, we want to fit our model. We're going to start with what we call the full model, meaning it includes all the explanatory variables that are given to us in the data set. So, we're using the linear model function again, lm. On the left side of the formula, we put the kid's score, our response variable. And on the right side, we list our explanatory variables: the high school status of the mother, the IQ score of the mother, whether or not the mom worked early on in the kid's life, and the age of the mother at the birth of her child. To view the regression output, we use the summary function. We're going to go through in detail what just about every value on the regression output means throughout the rest of the slides in this video.

First, we will do inference for the model as a whole. Here, our null hypothesis is that all of the slopes are equal to 0. In other words, none of the explanatory variables is a significant predictor of the response variable. The alternative says that at least one of the slopes is different from 0. The test statistic that we use here is an F-statistic, and this output comes straight from the bottom of the regression output that we saw on the previous slide. We have an F-statistic with 4 and 429 degrees of freedom: 4 is the number of predictors, and 429 is simply n minus k minus 1. We had 434 observations, minus 4, the number of predictors, minus 1, which gives us the residual degrees of freedom. We are also given the p-value, so we really do not need to do any calculations by hand here. What we need to focus on instead is the interpretation of what this means.

Since our p-value is less than 0.05, we say that the model as a whole is significant. We reject the null hypothesis in favor of the alternative, which suggests that there is at least something interesting to look for here. Note that the F test yielding a significant result doesn't mean the model fits the data well; it just means that at least one of the betas is non-zero. And the F test not yielding a significant result doesn't mean the individual variables included in the model are not good predictors of y; it just means that the combination of these variables doesn't yield a good model.
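If you want to reproduce these steps yourself, a minimal sketch in R might look like the following. The file name cognitive.csv and the column names (kid_score, mom_hs, mom_iq, mom_work, mom_age) are assumptions here, so substitute whatever your copy of the data uses.

```r
# Load the data (file name is an assumption; point this at your own
# copy of the NLSY cognitive score data)
cognitive <- read.csv("cognitive.csv")

# Fit the full model: kid's score predicted by all four mom variables
cog_full <- lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age,
               data = cognitive)

# View the regression output
summary(cog_full)
```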
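As a sanity check on the F test, we can also recompute its p-value ourselves. This sketch pulls the F-statistic and its two degrees of freedom straight out of the summary object rather than hard-coding them.

```r
# summary.lm stores the F-statistic and its degrees of freedom
# (numerator df = k, denominator df = n - k - 1)
fstat <- summary(cog_full)$fstatistic
fstat   # named vector: value, numdf (4), dendf (429)

# Upper-tail area of the F distribution gives the p-value for
# H0: all slopes are 0
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```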
Now that we know there is something worthwhile to look for in this model, because we found out that at least one of the betas is different from 0, we can do individual tests on the slopes. For example, we can ask: is whether or not the mother went to high school a significant predictor of the cognitive test scores of children, given all other variables in the model? The null hypothesis here is that the beta associated with the high school status of the mother is equal to 0 when all other variables are included in the model, and the alternative is that it's different from 0 when all other variables are included in the model. The regression output, as usual, gives us everything that we need. All we need to do is look at the row for the mother's high school status and take a look at the p-value there. Since this is a small p-value, we can determine that whether or not mom went to high school is a significant predictor of the cognitive test scores of children, given all other variables in the model.

Even though we don't need to do any calculations by hand, it's always a good idea to understand how the calculations included in the regression output are actually done by the software you're using, so that you can understand what they mean. So, let's go through the mechanics of testing for the slope within the framework of a multiple linear regression. As usual with regression, we use a t-statistic for inference. The t-statistic is the point estimate minus the null value, divided by the standard error. Our point estimate is simply our slope estimate, and the standard error is the standard error of this estimate, which we can grab easily from the regression output. So, the t-statistic for the slope is simply b1 minus 0, divided by the standard error of b1. What's different from the single-predictor regression case that we covered in the previous unit is how we calculate the degrees of freedom. The degrees of freedom here is n minus k minus 1, where k is the number of predictors included in the model.

Let's take a moment to focus on this new measure of degrees of freedom and highlight that it is not a new measure at all. We just said that for a multiple linear regression, the degrees of freedom is n minus k minus 1, where n is the sample size and k is the number of predictors. And earlier, in the previous unit, we had said that for a regression with a single predictor, the degrees of freedom can be calculated as n minus 2. If you think about it, in a single-predictor regression, the number of predictors is 1. So, if we were to calculate the degrees of freedom as n minus k minus 1 for that case as well, we would get n minus 1 minus 1, which comes out to n minus 2. And remember, the additional minus 1 is there because along with a slope estimate for every single predictor, we also calculate an intercept, and that's where we lose that one additional degree of freedom. So, while we've introduced these two formulas slightly differently, note that they mean exactly the same thing: you start with your sample size, which is the total degrees of freedom you have to play with, then you lose one degree of freedom for each predictor you have, and then you lose one more for the intercept. So, let's go ahead and verify the t-score and the p-value for the slope of the variable mom_hs, that is, the high school status of the mother.
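Before we do, here is that degrees-of-freedom arithmetic as a quick check in R:

```r
n <- 434   # sample size
k <- 4     # number of predictors in the full model
n - k - 1  # 429, the residual degrees of freedom in the output

# With a single predictor, k = 1, and n - k - 1 reduces to n - 2
n - 1 - 1  # 432
```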
We can calculate the t-score as the estimated slope minus the null value, 0, divided by the standard error of the estimated slope, 2.315. That yields a t-score of approximately 2.201, which is what is given to us in the table anyway. To calculate the p-value, we're going to need the degrees of freedom associated with this slope, and that's n minus k minus 1: 434, our sample size, minus 4, the number of predictors, minus 1, gives us 429 degrees of freedom. That's a value we can also find on the regression output, right next to the residual standard error at the bottom of the output. Now that we know the t-score and the degrees of freedom, we can use R to calculate the p-value, and for that, as usual, we would use the pt function. We also want to keep in mind that the p-values provided for the slopes on the regression output are always calculated for hypothesis tests where the alternative hypothesis is two-sided, so we double the one-tail area. The p-value here comes out to 2.82%, spot on with what we saw in the table. Given that this is a small p-value, we reject the null hypothesis in favor of the alternative and determine that mom's high school status is indeed a significant predictor of the kid's cognitive score.

We've said numerous times throughout the course that the construction of a confidence interval follows the same structure regardless of the estimate for which you're constructing it: it is always a point estimate plus or minus a margin of error. In this case, our point estimate is simply our slope estimate, and we can calculate our margin of error as the critical t-score times the standard error of the slope. So, let's go ahead and calculate a 95% confidence interval for the slope of mom_work. This was the variable that said whether or not the mother worked during the first three years of the kid's life.

First, let's find our critical value. Before we can get there, we need to know our degrees of freedom, and we've already confirmed that that was 429. So we want to find the critical t-score associated with a 95% confidence level and 429 degrees of freedom. This is a really high number of degrees of freedom, so we know that the t-score is going to be pretty close to a Z-score of 1.96, but let's go through the steps anyway to get the exact t-score. We can draw our curve, mark the middle 95% of the distribution, and remind ourselves that each tail then has 2.5% left in it. We can find the critical t-score using the qt function and the associated degrees of freedom, and R tells us that the cutoff on the lower end is negative 1.97. When constructing confidence intervals, we always use positive critical values, so the t-star associated with a 95% confidence level and 429 degrees of freedom is 1.97. As expected, it is very close to the Z-score, because we have really high degrees of freedom here.

To finalize our calculations, we start with our point estimate, roughly 2.54, plus or minus 1.97 for the critical t-score, times roughly 2.35 for the standard error. That gives us a confidence interval of negative 2.09 to positive 7.17.
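Here is a sketch of those calculations in R, plugging in the rounded values quoted above; tiny differences from the printed table are just rounding.

```r
# t-score and degrees of freedom for the mom_hs slope
t_score <- 2.201    # (estimate - 0) / SE, from the regression output
df <- 434 - 4 - 1   # n - k - 1 = 429

# Two-sided p-value: double the upper-tail area
2 * pt(t_score, df, lower.tail = FALSE)   # about 0.0282, i.e. 2.82%

# Critical t-score for a 95% confidence level
t_star <- abs(qt(0.025, df))   # qt(0.025, 429) is about -1.97

# 95% CI for the mom_work slope: point estimate +/- t-star x SE
pe <- 2.54   # mom_work slope estimate (rounded)
se <- 2.35   # its standard error (rounded)
pe + c(-1, 1) * t_star * se    # about (-2.09, 7.17), up to rounding
```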
And how do we interpret this interval? It's simply the interpretation of the slope for this variable, except now we also add a statement to the beginning about how confident we are of that estimate: we are 95% confident that, all else being equal, the model predicts that children whose moms worked during the first three years of their lives score 2.09 points lower to 7.17 points higher than those whose moms did not work.