Next we discuss model selection, which is the science and art of picking variables for a multiple regression model. We're going to talk about stepwise model selection methods based on two criteria: p-values and adjusted R squared. We're also going to mention briefly at the end that sometimes we might just pick variables based on expert opinion.

One stepwise model selection method is backwards elimination. Here, we start with a full model, that is, a model with all possible covariates or predictors included, and then we drop variables one at a time until a parsimonious model is reached. On the other hand, we could also do forward selection, where we start with an empty model and then add variables one at a time until a parsimonious model is reached.

There are many criteria for model selection. We're going to be focusing on p-values and adjusted R squared. However, other model selection criteria that you might hear of are AIC, the Akaike Information Criterion; BIC, the Bayesian Information Criterion; DIC, the Deviance Information Criterion; Bayes factors; and Mallows's Cp. There are many others that you can stumble upon as well, but these tend to be the most commonly used ones. These other criteria are beyond the scope of this course, though.

Let's start with backwards elimination using the adjusted R squared method. Here, we start with the full model, the model with all possible predictors. We drop one variable at a time and record the adjusted R squared of each smaller model. Then we pick the model with the largest increase in adjusted R squared. We repeat until none of the smaller models yields an increase in adjusted R squared.

Let's give an example of how to do that using the dataset from earlier, where we predict kids' cognitive test scores from mom's high school status, mom's IQ score, whether or not the mom worked during the first three years of the kid's life, and mom's age. The adjusted R squared for the full model is 20.98%.
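To see why dropping a predictor can raise adjusted R squared, it may help to look at how the statistic is computed. The sketch below uses the standard formula, adjusted R squared = 1 - (1 - R squared) × (n - 1) / (n - k - 1), where n is the number of observations and k is the number of predictors. The function name and the numbers in the example are made up for illustration; they are not the values from the kids' test score data.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R squared for a model with k predictors fit to n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical numbers, not taken from the lecture's dataset.
full = adjusted_r_squared(0.500, n=50, k=4)     # model with four predictors
reduced = adjusted_r_squared(0.499, n=50, k=3)  # one weak predictor dropped

# R squared fell slightly, but the smaller penalty for k more than makes up for it.
print(f"full: {full:.4f}, reduced: {reduced:.4f}")
```

Because the penalty grows with the number of predictors, a variable that adds almost nothing to R squared can lower the adjusted value, which is what backwards elimination looks for.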
In the next step, we try removing each one of the variables, one at a time. For example, here we've removed high school status, and the adjusted R squared is 20.27%. This is not an increase over what we started with, so we know that it's not a good idea to move to this model. We can also try removing mom's IQ, and we get a really low adjusted R squared if we do that. It must be that the IQ variable is very important for the prediction of the kid's cognitive score. We can also try removing mom's work status, and that gives us an adjusted R squared of 20.95%, still not an increase over the original full model. Lastly, let's try removing mom's age at the birth of the child, and we can see that the adjusted R squared has actually increased: we started with 20.98% and now we are at 21.09%. A tiny increase, but still an increase, so we know that in the first step we need to pick the model where we're predicting the kid's score from high school status, IQ, and work status of the mother.

Next, we move on to the second step, where we once again try removing each one of the remaining variables one at a time, and we can see that none of these options actually yields an increased adjusted R squared. Therefore, our final result is going to be the model that predicts kids' cognitive test score from mom's high school status, mom's IQ, and the work status of the mother.

In backwards elimination using the p-value method, we once again start with the full model. Then we drop the variable with the highest p-value and refit a smaller model. We repeat this until all variables left in the model are significant. To give an example, here is our full model, where we were predicting the kids' cognitive scores from the four predictors, and the variable with the highest p-value is mom's age. So in the first step, we remove mom's age from the model and refit the model using only high school status, IQ, and work status.
Once again, we can see that mom's work status has a non-significant p-value, and therefore we would remove that from the model as well. We refit the model one more time with only high school status and IQ score, and we can see that now both of these predictors have significant p-values, so we would stop here. As you can see, we ended up with a slightly different model using the p-value approach versus the adjusted R squared approach. This is not unexpected. We would expect to get very similar models, but not necessarily exactly the same model, because our decision criterion is different.

Let's take a look at another example for practice. The following model uses data from the American Community Survey to predict income from hours worked per week, race, and gender. Which variable, if any, should be dropped from the model first when doing backwards elimination using the p-value approach? Hours worked has a tiny p-value, so we would certainly not drop that from our model, and similarly for gender. The other variable in the model is race. We need to consider this variable all at once, because we can't simply drop one level of a categorical variable. And because at least one of the levels of this variable has a significant p-value (you can see that for Asian we're seeing a tiny p-value), we would actually keep this variable in the model as well. Therefore, we don't drop any variables here.

This is an important point, so let's repeat it. If you have a categorical variable with multiple levels, you cannot drop some of the levels of that variable and keep others. You need to decide either to keep the entire variable as a whole or to drop it as a whole. In this case, because there is at least one level that has a small p-value, meaning that there is some significance there, we would actually keep the entire variable.
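One step of the p-value approach, including the all-or-nothing rule for categorical variables, can be sketched as follows. This is only an illustration under stated assumptions: the function name, the 0.05 default cutoff, and all p-values below are hypothetical, and judging a categorical variable by its smallest level p-value is one simple way to encode "keep it if any level is significant".

```python
def next_variable_to_drop(pvalues, alpha=0.05):
    """Return the variable to drop next, or None if every variable stays.

    pvalues maps each variable name to a list of coefficient p-values:
    one entry for a numeric variable, one per level for a categorical one.
    """
    # Assumption: a variable is judged by its smallest p-value, so a
    # categorical variable counts as significant if at least one level is.
    smallest = {name: min(ps) for name, ps in pvalues.items()}
    droppable = {name: p for name, p in smallest.items() if p >= alpha}
    if not droppable:
        return None
    # Among the non-significant variables, drop the one with the highest p-value.
    return max(droppable, key=droppable.get)


# Hypothetical p-values shaped like the income example: one race level is
# significant, so the whole race variable stays and nothing is dropped.
income_model = {
    "hours_worked": [0.0001],
    "gender": [0.0003],
    "race": [0.41, 0.0002, 0.28],
}
print(next_variable_to_drop(income_model))

# Hypothetical model in which variable "b" is the least significant.
other_model = {"a": [0.01], "b": [0.60], "c": [0.30, 0.20]}
print(next_variable_to_drop(other_model))
```

In a full backwards elimination, the model would be refit after each drop and the function called again on the new p-values.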
Conversely, if all of the levels of a categorical variable had high p-values, so that none of the levels were significant, then we would drop the entire variable as a whole.

So we talked about two approaches, adjusted R squared versus p-value, and we mentioned that sometimes they yield slightly different results. How, then, do we know which one to use? We use the p-value approach if what we're interested in is finding out which predictors are statistically significant. On the other hand, if we're interested in more reliable predictions from our model, we want to use the adjusted R squared method. The p-value method depends on the somewhat arbitrary 5%, or whatever other percentage you use for your significance level cutoff, and if you use a different significance level, you may end up with a different model. It's used more commonly, though, since it requires fitting fewer models and is easier to implement. Remember that at each stage of the adjusted R squared method, we dropped one variable at a time and refit a bunch of models to determine which one to go with, whereas in the p-value approach you simply drop the variable with the highest p-value and proceed. However, because the p-value approach relies on this arbitrary significance level cutoff, it might be more favorable to use the adjusted R squared method for model selection.

Let's now talk about forward selection. We start with single-predictor regressions of the response versus each explanatory variable. We then pick the model with the highest adjusted R squared, add the remaining variables one at a time to the existing model, and again pick the model with the highest adjusted R squared. We repeat until the addition of any of the remaining variables does not result in a higher adjusted R squared. Let's illustrate this with an example using the cognitive test scores data.
We start with four simple linear regressions, one for each of the candidate predictors in our dataset, and then we pick the model with the highest adjusted R squared. So the first variable that we're going to add to our model is mom's IQ. In the next step, we try each of the remaining three variables and once again pick the model with the highest adjusted R squared, and that's going to be mom's IQ with the addition of mom's high school status. In the next step, we once again try each of the two remaining variables, and if there's an increase in the adjusted R squared, which there is, then we move on to the more complicated model. Lastly, we try the full model, but the adjusted R squared does not go up, so we're going to stick with the model from step three. Note that we arrived at the same model whether we went backwards or forwards using the adjusted R squared criterion.

To do forward selection using p-values, we start with single-predictor regressions of the response versus each explanatory variable. We then pick the variable with the lowest significant p-value. We add the remaining variables one at a time to the existing model and again pick the variable with the lowest significant p-value. We repeat until none of the remaining variables has a significant p-value.

We talked about algorithmic ways of doing model selection. However, sometimes variables can be included in or eliminated from the model based on expert opinion as well. For example, if you're studying a certain variable, you might choose to leave that variable in the model regardless of whether it's significant or whether it yields a higher adjusted R squared.

So to wrap things up, let's fit our final model. Remember, we had selected the variables mom's high school status, mom's IQ, and mom's work status. If we take a look at the summary output, we can see that mom's high school status and mom's IQ are statistically significant at the 5% level, and mom's work status is not. But remember, we selected this model using the adjusted R squared method, which tells us that including that variable actually gives the model higher predictive power even though the variable may not be statistically significant. If we had used the p-value approach to do the model selection, we would not end up with any variables in our model that are not statistically significant.
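The adjusted R squared versions of backwards elimination and forward selection described in this lecture can be sketched in plain Python. This is a minimal illustration, not the software used in the lecture: the function names, the small least-squares solver, and the toy dataset (two useful predictors and one unrelated predictor, built by hand) are all assumptions made for the example.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def adjusted_r2(data, response, predictors):
    """Fit least squares with an intercept and return adjusted R squared."""
    y = data[response]
    n, k = len(y), len(predictors)
    X = [[1.0] + [data[p][i] for p in predictors] for i in range(n)]
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(k + 1)] for a in range(k + 1)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k + 1)]
    beta = solve(XtX, Xty)  # normal equations: (X'X) beta = X'y
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


def backward_eliminate(data, response, predictors):
    """Drop one variable at a time while adjusted R squared keeps increasing."""
    current = list(predictors)
    best = adjusted_r2(data, response, current)
    while len(current) > 1:
        trials = [[p for p in current if p != drop] for drop in current]
        score, model = max(((adjusted_r2(data, response, t), t) for t in trials),
                           key=lambda pair: pair[0])
        if score <= best:  # no smaller model improves on the current one
            break
        best, current = score, model
    return current, best


def forward_select(data, response, predictors):
    """Add one variable at a time while adjusted R squared keeps increasing."""
    current, best = [], float("-inf")
    remaining = list(predictors)
    while remaining:
        score, var = max(((adjusted_r2(data, response, current + [p]), p)
                          for p in remaining), key=lambda pair: pair[0])
        if score <= best:  # no remaining variable improves the model
            break
        best = score
        current.append(var)
        remaining.remove(var)
    return current, best


# Toy data built by hand: y depends on x1 and x2; "junk" is unrelated to y.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, 1, -1, -1, -1, -1, 1, 1],
    "junk": [1, -1, -1, 1, -1, 1, 1, -1],
}
noise = [1, -1, -1, 1, 1, -1, -1, 1]
data["y"] = [2 * a + 3 * b + e for a, b, e in zip(data["x1"], data["x2"], noise)]

backward_model, backward_score = backward_eliminate(data, "y", ["x1", "x2", "junk"])
forward_model, forward_score = forward_select(data, "y", ["x1", "x2", "junk"])
print(backward_model, forward_model)
```

On this toy data both directions keep x1 and x2 and leave out the unrelated predictor. In general the two directions are not guaranteed to land on the same model.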