To evaluate the overall fit of the predicted values of the response variable to the observed values, and to look for outliers, we can examine a plot of the standardized residuals for each of the observations.

The standardized residuals are simply the residual values

transformed to have a mean of 0 and a standard deviation of 1.

This transformation is called normalizing or standardizing the values so that they fit a standard normal distribution.

In a standard normal distribution,

68% of observations are expected to fall within one standard deviation of the mean.

That is between negative one and

one standard deviation, and 95% of the observations

are expected to fall within two standard deviations of the mean.
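As an aside, the simple mean-0, standard-deviation-1 rescaling described above can be sketched in SAS with PROC STANDARD (the data set and variable names here are hypothetical, and note that the standardized residuals reported by the regression procedures are actually computed from the model's error variance rather than by this two-step rescaling):

```sas
/* Rescale raw residuals to mean 0 and standard deviation 1.          */
/* "resids" and "resid" are assumed names, not from the lesson code.  */
proc standard data=resids mean=0 std=1 out=resids_std;
   var resid;
run;
```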

The plot of the standardized residuals for each of the observations is not one of the plots that is printed out by PROC GLM.

So we need to run some additional code.

We use the gplot procedure to do this.

First, we can ask for labels for the standardized residuals, to which we gave the name stdres in the procedure above, and for the observations, which in this data set are the countries.

Then, we use the plot command to plot the standardized residuals by country.

The slash vref equals zero option draws a horizontal reference line at the mean, which for the standardized residuals is zero.
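Putting the steps just described together, the gplot code might look like the following sketch (stdres and country follow the narration; the data set name new is taken from the data step described later in the lesson):

```sas
proc gplot data=new;
   /* Label the standardized residuals and the observations (countries) */
   label stdres='Standardized Residual'
         country='Country';
   /* vref=0 draws a horizontal reference line at the mean of 0 */
   plot stdres*country / vref=0;
run;
```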

If we take a look at this plot,

we see that most of the residuals fall within one standard deviation of the mean.

So basically, they're between -1 and 1.

A few countries have residuals that are more than two standard deviations above or below the mean of zero.

For the standard normal distribution, we would expect 95% of the values of the residuals to fall within two standard deviations of the mean.

There are no observations that are three or more standard deviations from the mean.

So we do not appear to have any extreme outliers.

In terms of evaluating the overall fit of the model, there are some rules of thumb that you can use to determine how well your model fits the data based on the distribution of the residuals.

If more than 1% of our observations have standardized residuals with an absolute value greater than 2.5, or more than 5% have an absolute value greater than or equal to 2, then there's evidence that the level of error within our model is unacceptable.

That is, the model's a fairly poor fit to the observed data.

None of the residuals from our model exceeded an absolute value of 2.5,

but 5.4% were greater than or equal to an absolute value of 2.0.

This suggests that the fit of this model is relatively poor.
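Assuming the standardized residuals had been saved to an output data set, these percentages could be checked directly; the following is a sketch in which diag and stdres are assumed names:

```sas
/* Proportion of observations violating each rule of thumb.             */
/* A comparison such as abs(stdres) > 2.5 evaluates to 1 or 0, so its   */
/* mean is the proportion of observations meeting the condition.        */
proc sql;
   select mean(abs(stdres) >  2.5) as prop_gt_2_5 format=percent8.1,
          mean(abs(stdres) >= 2.0) as prop_ge_2_0 format=percent8.1
   from diag;
quit;
```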

The biggest contributor to poor model fit

is leaving out important explanatory variables.

In order to improve the fit of this model,

we should include more explanatory variables to better

explain the variability in our female employment rate response variable.

Going back to the output from the regression analysis, the plots equals all option in the procedure also prints out residual plots for the individual explanatory variables.

Here's the residual for each observation at different values of Internet use rate.

There's clearly a funnel shaped pattern to the residuals

where we can see that the absolute values of the residuals

are significantly larger at lower values of Internet use rate, but

get smaller, that is closer to zero, as Internet use rate increases.

But then the residuals start to get larger at higher levels.

This is consistent with the other regression diagnostic plots, which indicate that this model does not predict female employment rate as well for countries that have either high or low levels of Internet use rate, but is particularly poor at predicting female employment rates for countries with low Internet use rates.

Similar to our urban rate variable, there also appears to be sort of a curvilinear

pattern to these observations where the residuals get larger again for

countries for which Internet use rate exceeds about 80 per 100 residents.

This suggests that the association between Internet use rate and female employment rate may also be curvilinear.

So maybe we also want to add a second order polynomial or quadratic term for

Internet use rate to the model as well.

Because we have multiple explanatory variables, we might want to take a look at

the contribution of each individual explanatory variable to the model fit,

controlling for the other explanatory variables.

One type of plot that does this is the partial regression residual plot.

The GLM procedure does not print partial regression plots so

we need to run some additional code.

There's another SAS procedure called reg which does provide partial regression plots.

The reg procedure specifically estimates linear regression models.

The regression models we've tested so

far could also have been tested using the reg procedure.

But we prefer to teach the GLM procedure because it is much more flexible for

specifying your models, and accommodates many different kinds of linear models.

However, because the reg procedure was designed to run linear

regression models specifically, it can provide some additional diagnostic

plots that are not available with the GLM procedure.

So to make a long story short, we can use the reg procedure to test the same

regression model, and it will provide the same results, along with the options for

a few different plots.

Here's the SAS code.

First, because the reg procedure does not allow you to specify multiplicative terms in the model statement, we have to square our urban rate variable ahead of time.

Because this is an additional data management step, I will create a new temporary data set called partial from my previously managed temporary data set, which I had named new.

Then I create a new variable called urbanrate2 that is equal to the centered urban rate variable times itself, that is, squared, which I will use in my PROC REG regression model.

I use the PROC reg procedure and

I add plots=partial to request a partial regression plot.

Then I specify my regression model after the model command.

Note that I am using my new urbanrate2 variable to fit the quadratic curve.

I add a slash and then partial to ask SAS to also estimate the partial regression coefficients, followed by a semicolon, so that we can plot the partial residuals, and then run to run the code.
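The code just described might look like the following sketch (the variable names femaleemployrate, urbanrate_c, and internetuserate are assumptions standing in for the lesson's actual variables):

```sas
/* Square the centered urban rate variable ahead of time, since the  */
/* model statement in proc reg cannot contain multiplicative terms.  */
data partial;
   set new;
   urbanrate2 = urbanrate_c*urbanrate_c;
run;

/* plots=partial requests the partial regression plots; the partial  */
/* option after the slash asks SAS for the partial regression        */
/* estimates so that the partial residuals can be plotted.           */
proc reg data=partial plots=partial;
   model femaleemployrate = urbanrate_c urbanrate2 internetuserate / partial;
run;
```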

The output is the same as we saw with the GLM procedure.

If we go to the partial regression residual plot for the Internet use explanatory variable, we see that it's a scatter plot.

This scatter plot shows the effect of adding Internet use

as an additional explanatory variable to a model that includes

only the two urban rate explanatory variables.

The residuals from a model predicting the female employment rate response

from the other explanatory variables excluding Internet use rate

are plotted on the vertical axis.

And the residuals from a model predicting Internet use rate from

all the other explanatory variables are plotted on the horizontal axis.

What this means is that the partial regression plot shows a relationship

between the response variable and the specific explanatory variable

after controlling for the other explanatory variables.

We can examine the plot to see if the Internet use rate residuals

show a linear or nonlinear pattern.

If the Internet use variable shows a linear relationship

to the dependent variable after adjusting for the variables already in the model,

it meets the linearity assumption in the multiple regression.

If there's an obvious non-linear pattern, this would be additional support for

adding a polynomial term for Internet use rate to the model.

When we take a look at the plot for Internet use rate here, we see that, in contrast to the plot we just looked at of the residuals at different values of Internet use rate without adjusting for the urban rate variables, shown previously, the partial regression residual plot for Internet use does not clearly indicate a nonlinear association.

Rather the residuals are spread out in a random pattern around the partial

regression line.

In addition, many of the residuals are pretty far from this line, indicating a great deal of female employment rate prediction error.

This suggests that although Internet use rate has a statistically significant association with female employment rate, this association is pretty weak after controlling for urbanization rate.

We can also look at the partial regression residual plots for each of the other explanatory variables as well.

Finally, we can examine a leverage plot to identify observations that have an unusually large influence on the estimation of the predicted values of the response variable, female employment rate, or that are outliers, or both.

The leverage of an observation can be thought of in terms of how much the predicted scores for the other observations would differ if the observation in question were not included in the analysis.

The leverage always takes on values between zero and one.

A point with zero leverage has no effect on the regression model.

Outliers are observations with standardized residuals greater than two or less than negative two.
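As a sketch, leverage values and studentized residuals can also be saved to a data set with PROC REG's OUTPUT statement (h= and student= are standard OUTPUT keywords; the data set and variable names here are assumptions):

```sas
proc reg data=partial;
   model femaleemployrate = urbanrate_c urbanrate2 internetuserate;
   /* h= saves the leverage values; student= saves studentized residuals */
   output out=diag h=leverage student=stdres;
run;
quit;
```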

If we go back to the GLM results, we can find the leverage plot in the output.

SAS kindly shows outliers as red symbols,

observations with high leverage values as green symbols, and

observations that are both outliers and high leverage as brown symbols.

One of the first things we see in the leverage plot is that we have a few

outliers.

That is, countries that have a residual that is greater than 2 or

less than negative 2.

We've already identified these outliers in some of the other plots we've looked at.

But this plot also tells us that these outliers have small, that is, close to zero, leverage values, meaning that although they are outlying observations, they do not appear to have a strong influence on the estimation of the regression parameters.

On the other hand, we see that there are a few cases with higher than average leverage, but one in particular is more obvious in terms of having an influence on the estimation of the predicted value of female employment rate.

This observation has high leverage but is not an outlier.

We don't have any observations that are both high leverage and outliers.