So far in this unit, we have learned how to fit multiple linear regression models and how to interpret the results coming out of them. We've also talked about inference using the multiple linear regression model. Lastly, what we are going to do now is go through the conditions required for the multiple linear regression model to be valid. These conditions are that we need linear relationships between our numerical explanatory variables and our response variable, our residuals need to be nearly normally distributed, we want constant variability of residuals, and we also want independence of residuals, which basically speaks to independence of the observations in our sample.

First, linear relationships between the numerical explanatory variables and our response variable y. We're specifying numerical here because it doesn't make sense to ask for a linear relationship between a categorical variable and a numerical variable. So each numerical explanatory variable needs to be linearly related to the response variable. We check this condition using residuals plots, that is, plots of the residuals versus the explanatory variable, and we're looking for a random scatter around zero. Note that we're using the residuals plot instead of a scatter plot of the response variable versus the explanatory variable, because the residuals plot allows for considering the other variables that are also in the model, and not just the bivariate relationship between a given x and our y.

As an illustrative example, we're once again going to use the cognitive scores data set from the previous videos. We had decided that the final model is going to have the mom's high school status, mom's IQ, and mom's work status as the explanatory variables in the model. Note that the only numerical variable in our model is mom's IQ score, so that's the variable we're going to focus on for the linearity condition.
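As a minimal sketch of the kind of model fit described above: the cognitive scores data set is not bundled with base R, so the built-in mtcars data stands in here, with a model structure (categorical predictors plus one numerical predictor) that mirrors the transcript's final model.

```r
# Fit a multiple linear regression in base R.
# mtcars is a stand-in: factor(am) and factor(vs) play the role of the
# categorical predictors, and wt plays the role of the numerical one.
cog_final <- lm(mpg ~ factor(am) + wt + factor(vs), data = mtcars)
summary(cog_final)
```

The fitted model object (`cog_final` here) is what stores the residuals and fitted values used in the diagnostic plots that follow.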
In order to create this residuals plot, we're going to need our residuals, and note that those are saved in the object that R creates for our linear regression model. We can use the dollar sign operator to get at the residuals, place them on the y-axis of our plot, and plot them against mom's IQ score, which we can obtain from the original data set. This is what our residuals plot looks like. Remember, we said that we want a completely random scatter of residuals around zero. It seems like we're definitely meeting the condition here.

The next condition is nearly normal residuals with mean zero. Remember that some residuals are going to be positive and some are going to be negative. On a residuals plot we look for a random scatter of residuals around zero, and this translates to a nearly normal distribution of residuals centered at zero. We can check this using a histogram or a normal probability plot. So, once again using R, we can make a histogram of the residuals that are stored in the object for the regression model. We can also make a normal probability plot using the function qqnorm for the plot and qqline for the guide line that we use to see if the points actually align on a straight line. This is what our plots look like. We are seeing a little bit of a skew in the residuals; however, the skew doesn't look too bad. Looking at the normal probability plot as well, except for the tail areas, we're not seeing huge deviations from the line. So I think we can say that this condition seems to be fairly well satisfied.

The next condition is constant variability of residuals. We want our residuals to be equally variable for low and high values of the predicted response variable, so we check a residuals plot of residuals versus the predicted values, that is, e versus y hat.
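The linearity and normality checks described above can be sketched as follows, again using the built-in mtcars data as a stand-in, with wt playing the role of the numerical explanatory variable.

```r
# Stand-in model: mtcars in place of the cognitive scores data set.
m <- lm(mpg ~ factor(am) + wt, data = mtcars)

# Linearity: residuals versus the numerical explanatory variable,
# pulled out of the model object with the dollar sign operator.
plot(m$residuals ~ mtcars$wt)
abline(h = 0, lty = 2)  # dashed reference line at zero

# Nearly normal residuals: histogram and normal probability plot.
hist(m$residuals)
qqnorm(m$residuals)
qqline(m$residuals)  # guide line for judging straightness
```

In both plots we are looking for the features described above: random scatter around zero, and points hugging the straight line.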
Note that we're using residuals versus predicted, instead of residuals versus x, because it allows for considering the entire model with all explanatory variables at once. We want our residuals to be randomly scattered in a band of constant width around zero; in other words, we're looking to see nothing that resembles a fan shape. It is also worthwhile to view the absolute value of residuals versus the predicted values, to identify any unusual observations more easily. As usual, we can easily create both of these plots in R. Here, for example, we have our residuals on the y-axis, and on the x-axis we have what R calls the fitted values, which basically means our predicted values, or in other words our y hats. We can also calculate the absolute values of these residuals and plot those against the fitted values as well.

So here's what our plots look like. The first plot is a residuals versus fitted plot. We don't see a fan shape here; it appears that the variability of the residuals stays constant as the fitted, or predicted, values change, so the constant variability condition appears to be met. The absolute value of residuals plot can be thought of as simply the first plot folded in half. If we were to see a fan shape in the first plot, we would see a triangle in the absolute value of residuals versus fitted plot. That doesn't seem to be the case here, so it seems like this condition is met as well.

Lastly, independent residuals, and note that independent residuals basically means independent observations. If we have any time series structure, or if we suspect that there may be any time series structure in our data set, we can check for independent residuals using a plot of the residuals versus the order of data collection. If, on the other hand, that is not a consideration, we don't really have another diagnostic graph that we can use to check whether the residuals are independent.
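The constant variability checks described above can be sketched as follows, once more with mtcars standing in for the cognitive scores data set.

```r
# Stand-in model for the constant variability diagnostics.
m <- lm(mpg ~ factor(am) + wt, data = mtcars)

# Residuals versus fitted (predicted) values: look for a band of
# constant width around zero, not a fan shape.
plot(m$residuals ~ m$fitted.values)
abline(h = 0, lty = 2)

# Absolute value of residuals versus fitted values: the first plot
# folded in half; a fan would show up as a triangle here.
plot(abs(m$residuals) ~ m$fitted.values)
```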
Instead, we want to go back to first principles and think about how the data are sampled. We've talked numerous times in this course about what independence of observations means and what we need in terms of the sampling of the data to obtain independent observations. So let's quickly take a look to see if this order of data collection plot looks wonky in any way. For that, we simply plot our residuals, and we don't even have to specify anything for the x-axis, because R will plot them in the order that they appear in our data set. The order of data collection plot, where we have the residuals on the y-axis and the order of data collection on the x-axis, does not show any patterns. If there were some non-independent structure, we would see these residuals increasing or decreasing over the order of collection, but we don't see any such pattern, so it appears that time series structure is not a concern for this data set.
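The order of data collection plot described above can be sketched as follows: when plot() is given a single numeric vector, R uses the index (the order the observations appear in the data) as the x-axis automatically. As before, mtcars stands in for the cognitive scores data set.

```r
# Stand-in model for the independence diagnostic.
m <- lm(mpg ~ factor(am) + wt, data = mtcars)

# Residuals in the order they appear in the data set; no x needs to
# be specified, since plot() defaults to the observation index.
plot(m$residuals)
abline(h = 0, lty = 2)
```

A trend or drift in this plot would suggest a time series structure and therefore non-independent observations.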