In this part of the module, we will look at variable selection. So why do we do variable selection? I think we should look at the pros and cons of it. Very broadly, this is the idea of bias versus precision. If we use fewer of the variables, our estimate is biased because we're not using all the information, as simple as that. So there is a little bias in the way we're estimating. But the precision of the prediction, or how certain we are about the coefficients of the regression, improves. So the trade-off is really between the two: deliberately not using some of the variables in order to make the model more precise, but with the danger of creating bias. As you do more and more of this, you will get a feel for it. There are often a lot of features, and you don't want to use all of them, because if you use all of them, there is too much noise, and then you run into another problem, which we'll talk about in a second. Before we come to it, there are four measures we use for measuring how well a model fits the data. One is the one we just saw, the Root Mean Square Error (RMSE). We can use another metric called Mallows's Cp, and I'll show it to you; it's an automatic output from R or Rattle. Basically, it gives a stopping criterion as we keep comparing different models: we select a model which satisfies a particular criterion as far as Cp is concerned. Or we could use what are known as the Information Criteria, which in regression and estimation parlance are called AIC and BIC. I would not be going as much into those, but mostly into RMSE and how well the model fits. The main idea in any kind of modeling is parsimony: can I use the smallest set of the most important variables to predict or explain the phenomenon? That way, I'm not overfitting the model, I'm not reacting to noise.
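To make the RMSE measure concrete before moving on: it is just the square root of the average squared residual. Here is a minimal sketch in Python (the course itself uses R and Rattle; the numbers below are invented toy values):

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared residual."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

# Invented toy values for illustration
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]
print(rmse(actual, predicted))  # 0.5
```

A lower RMSE means the model's predictions sit closer to the observed values, which is why it serves as a comparison criterion between candidate models.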
So often, when you're reacting to noise, when you're trying to predict something new, your prediction may be off. The other danger of using many variables is that some of these variables may be collinear. What we mean by collinearity is that these variables may be strongly related to one another. That leads to a problem: you may find your fit is excellent, but none of the variables look significant, because they're significant only together, unless you remove one of them that depends on the others. Remember, in the previous example, we had three fuel types and we removed one of them, because with all three we just wouldn't be able to fit the model. There is unnecessary information out there, what we call collinearity: fuel diesel plus fuel petrol plus fuel CNG is equal to one. When such a situation happens, the regression cannot be estimated. Therefore, collinearity is a problem. Collinearity is also a problem when you try to do what we're trying to do next, which is called variable selection. I'll talk about variable selection in a minute. But one of the dangers is that if the data are highly collinear, you may have to get rid of the collinearity before you go into variable selection. So that's a cautionary statement I make. You can read the references given at the end of this lecture, which will tell you why that's a good thing. So remember, if there is a lot of collinearity in the data, you may like to understand it and eliminate it before you do what I am going to do next. What is model selection? Let's say I have 10 variables. For each one, I can decide to include it in the model or drop it out. So think of it: each variable can either be in or out. In total, you have 2 times 2 times 2, and so on, 2 to the power of 10 possibilities, 1,024 possible combinations. To test 1,024 models to develop, it's sometimes difficult, isn't it?
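The counting argument above is just repeated doubling: each of k candidate variables is either in or out, giving 2 to the power of k candidate models. A one-liner makes the growth concrete:

```python
# Each of k candidate variables is either in or out of the model,
# so an exhaustive search has to compare 2**k candidate models.
for k in (10, 25, 64):
    print(k, "variables ->", 2 ** k, "models")
# 10 variables -> 1024 models
# 25 variables -> 33554432 models
# 64 variables -> 18446744073709551616 models
```

Already at a few dozen variables, checking every combination is hopeless, which is why the heuristic methods below exist.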
But let's say we had 25 variables; then you have 2 to the power of 25 models. Or if you have 64 variables, you have 2 to the power of 64 models, and that's the famous chessboard problem, remember. You just cannot estimate them all. So we need to use heuristic methods, which are not necessarily optimal. We cannot do exhaustive enumeration of all the models and then do variable selection. So in this very [inaudible] of subset selection, I'm going to talk about three techniques which are programmed in R as a package: forward selection, backward elimination, and exhaustive search. In exhaustive search, you search every combination and find the best one in terms of one of the criteria we've talked about: Root Mean Square Error, Mallows's Cp, or the Information Criteria. The two other methods, which are more heuristic and less time consuming, are forward selection and backward elimination. How does forward selection work? You start with the variable which is most correlated with the response variable, include it in the model, and see if it is significant. If it is significant, you keep going. You regress that variable against the response variable and look at the residuals; that means you take out the effect of the first variable, then see how those residuals correlate with the remaining variables, pick the variable which has the highest correlation, and bring that into the model. So basically, it is searching one variable at a time, bringing a variable into the model if it is highly correlated with the residuals of the model developed so far. It stops when the variable brought in is no longer significant. Backward elimination goes the other way around. You start with the full model with all the variables, and you drop one variable at a time. Technically, there are two ways of thinking about it, but the easy way of thinking is that you drop the variable which has the least effect on the Root Mean Square Error.
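The forward-selection loop just described can be sketched in code. This is only an illustration, not the R package the course uses: instead of formal significance tests, it greedily adds, at each step, the variable that most reduces the training RMSE, and stops when no addition improves it by more than a small tolerance. This RMSE-reduction rule roughly matches the residual-correlation description above. The data, variable names, and tolerance are all invented.

```python
import math

def fit_ls(X, y):
    """Ordinary least squares: solve the normal equations (X'X)b = X'y
    by Gaussian elimination with partial pivoting. Rows of X are observations."""
    k = len(X[0])
    A = [[sum(row[p] * row[q] for row in X) for q in range(k)] for p in range(k)]
    b = [sum(row[p] * yi for row, yi in zip(X, y)) for p in range(k)]
    for col in range(k):                                  # forward elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * k
    for r in range(k - 1, -1, -1):                        # back substitution
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, k))) / A[r][r]
    return coef

def rmse_of(X, y, coef):
    """Training RMSE of a fitted model."""
    res = [yi - sum(c * x for c, x in zip(coef, row)) for row, yi in zip(X, y)]
    return math.sqrt(sum(r * r for r in res) / len(res))

def design(columns, idx):
    """Design matrix with an intercept, using the columns listed in idx."""
    n = len(columns[0])
    return [[1.0] + [columns[k][i] for k in idx] for i in range(n)]

def forward_select(columns, y, tol=1e-6):
    """Greedily add the variable that most reduces training RMSE;
    stop when no addition improves RMSE by more than tol."""
    chosen = []
    mean = sum(y) / len(y)
    best = math.sqrt(sum((v - mean) ** 2 for v in y) / len(y))  # intercept-only model
    while len(chosen) < len(columns):
        trials = [(rmse_of(design(columns, chosen + [j]), y,
                           fit_ls(design(columns, chosen + [j]), y)), j)
                  for j in range(len(columns)) if j not in chosen]
        err, j = min(trials)
        if best - err <= tol:          # no meaningful improvement: stop
            break
        chosen.append(j)
        best = err
    return chosen, best

# Invented toy data: y depends on x1 and x2 only; x3 is irrelevant.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
x3 = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]

chosen, best = forward_select([x1, x2, x3], y)
print(sorted(chosen))  # [0, 1]: x1 and x2 are selected, x3 is never brought in
```

Notice that the search fits only a handful of models per step, one per remaining candidate, rather than all 2 to the power of k subsets.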
You start with an insignificant variable, the one which has the least effect on the Root Mean Square Error, drop it, rerun the regression, and continue: find the next variable which has the least effect on the Root Mean Square Error, and so on. You stop when none of the remaining variables are insignificant. Of course, if all the variables are significant initially, you don't drop any of them. So that's the basic idea. Now, here's the bad news. The bad news, which is perhaps good news for us, is that Rattle cannot do this. So I'm going to use another method of doing this, and I'm going to teach you a way to run an R script.
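Backward elimination can be sketched the same way. Again, this is an invented Python illustration rather than the R script the course will use, and it substitutes an RMSE tolerance for the significance test: start from the full model and repeatedly drop the variable whose removal hurts training RMSE least, stopping once every remaining removal would raise RMSE noticeably. All names and data are made up.

```python
import math

def fit_ls(X, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y,
    solved with Gaussian elimination and partial pivoting."""
    k = len(X[0])
    A = [[sum(row[p] * row[q] for row in X) for q in range(k)] for p in range(k)]
    b = [sum(row[p] * yi for row, yi in zip(X, y)) for p in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * k
    for r in range(k - 1, -1, -1):
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, k))) / A[r][r]
    return coef

def rmse_of(X, y, coef):
    """Training RMSE of a fitted model."""
    res = [yi - sum(c * x for c, x in zip(coef, row)) for row, yi in zip(X, y)]
    return math.sqrt(sum(r * r for r in res) / len(res))

def design(columns, idx):
    """Design matrix with an intercept, using the columns listed in idx."""
    n = len(columns[0])
    return [[1.0] + [columns[k][i] for k in idx] for i in range(n)]

def backward_eliminate(columns, y, tol=1e-6):
    """Repeatedly drop the column whose removal raises training RMSE least;
    stop when every removal would raise RMSE by more than tol."""
    kept = list(range(len(columns)))
    X = design(columns, kept)
    current = rmse_of(X, y, fit_ls(X, y))
    while len(kept) > 1:
        trials = []
        for j in kept:
            idx = [k for k in kept if k != j]
            X = design(columns, idx)
            trials.append((rmse_of(X, y, fit_ls(X, y)), j))
        err, j = min(trials)
        if err - current > tol:        # every remaining variable matters: stop
            break
        kept.remove(j)
        current = err
    return kept, current

# Invented toy data: y depends on x1 and x2 only; x3 is irrelevant.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
x3 = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]

kept, best = backward_eliminate([x1, x2, x3], y)
print(kept)  # [0, 1]: x3 is eliminated, x1 and x2 survive
```

On this toy data, forward selection and backward elimination agree, but on real data with correlated predictors the two heuristics can end up with different subsets, which is one more reason to handle collinearity first.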