If you drew a straight linear regression through these points,

most points would be really far away from the line.

Meaning that there is a lot of prediction error.

The best fitting line is not straight,

rather it is one that curves to capture the non-linear nature of the association.

Here's an example of a less extreme curvilinear association

between urban rate and female employment rate, with a linear regression line.

Returning to the SAS code for the GapMinder data set,

the code to produce this scatterplot is here.

We use the SGPLOT procedure to create a scatterplot for our x variable,

urban rate, and our y variable, female employment rate.

After the slash, we add some options for

the regression line using the LINEATTRS option,

which stands for line attributes.

Specifically, we set the line attributes, L-I-N-E-A-T-T-R-S, equal to,

and then in parentheses, color=blue and thickness=2.

These options ask for a blue regression line

that is a little bit thicker than the default of thickness=1.

Outside the parentheses, we add the option CLM, which asks SAS to print the 95%

confidence interval for the regression line, followed by a semicolon.

Then we label our axes and type run to run the code.
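The code described above might look like the following sketch. The dataset name (mydata) and the exact variable spellings are assumptions for illustration; substitute the names used in your own GapMinder data set.

```
PROC SGPLOT DATA=mydata;  /* dataset name assumed */
  /* Scatterplot with a blue linear regression line and
     95% confidence limits for the mean (CLM) */
  REG X=urbanrate Y=femaleemployrate / LINEATTRS=(COLOR=blue THICKNESS=2) CLM;
  XAXIS LABEL="Urbanization Rate";
  YAXIS LABEL="Female Employment Rate";
RUN;
```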

You can see that it looks like female employment rate decreases as urbanization

rate increases.

But around urban rates of 80 or

higher, female employment rate appears to increase.

So, it looks like kind of a U-shaped association.

Just like with the anxiety and performance association, a straight linear regression

line isn't doing a good job of picking up on the curvilinear part of the association.

We can actually fit a line that curves to better capture this association

by adding a polynomial term.

For example, we could add a second-order polynomial, or

quadratic term, to draw a line of best fit that captures the curvature we are seeing.

To do this, we use the same SGPLOT procedure

code that was used to draw the scatterplot with a linear regression line.

The difference is that we're asking for two lines to be plotted:

a straight linear regression line and a curved quadratic regression line.

In the second line of code, we ask for a linear regression line by

adding degree=1 to the options, following the slash.

Degree=1 asks for a first-order polynomial, or straight line.

In the third line of code, we ask for a quadratic regression line

by adding degree=2 to the options, following the slash.

Degree=2 asks for a second-order polynomial, or quadratic line.
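The two REG statements being described could be sketched like this; as before, the dataset and variable names are assumptions, and the green line attributes match the styling mentioned in this lesson.

```
PROC SGPLOT DATA=mydata;  /* dataset name assumed */
  /* Straight (degree=1) linear regression line in blue */
  REG X=urbanrate Y=femaleemployrate / DEGREE=1 LINEATTRS=(COLOR=blue THICKNESS=2);
  /* Curved (degree=2) quadratic regression line in green */
  REG X=urbanrate Y=femaleemployrate / DEGREE=2 LINEATTRS=(COLOR=green THICKNESS=2);
  XAXIS LABEL="Urbanization Rate";
  YAXIS LABEL="Female Employment Rate";
RUN;
```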

Now, my scatter plot shows the original linear regression line in blue.

And the quadratic regression line in green.

You can also see that I added some line attributes to make the color

of the regression line green, and to increase its thickness to two.


Notice how the quadratic line does a better job of capturing the association at

lower and higher urbanization rates.

The points at these levels are close to the quadratic, or

second-order polynomial, curve,

meaning that the expected or

predicted values are closer to the actual observed values.

So, based on just looking at the two curves, it looks like the green quadratic

curve fits the data better than the blue linear regression line.

But, we can be even more sure of this conclusion if we test to see

whether adding a second order polynomial term

to our regression model gives us a significantly better fitting model.

I do this by simply adding another variable, the squared value of my

explanatory variable x, that is, x squared, to my regression model.

First, let's test a regression model for just a linear association between

urbanization rate and female employment rate using the GLM procedure.

Note that we have centered our quantitative explanatory variable, urban rate,

creating urbanrate_c.

Centering is especially important when you're testing a polynomial

regression model,

because it makes it considerably easier to interpret the regression coefficients.
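The centering step and the linear GLM model might be sketched as follows. The dataset name and the use of PROC SQL to compute the mean are assumptions; any method of subtracting the sample mean from urbanrate works.

```
/* Compute the mean of urbanrate into a macro variable
   (dataset name mydata is assumed) */
PROC SQL NOPRINT;
  SELECT MEAN(urbanrate) INTO :urbanmean FROM mydata;
QUIT;

/* Center the explanatory variable around its mean */
DATA mydata;
  SET mydata;
  urbanrate_c = urbanrate - &urbanmean;
RUN;

/* Linear regression of female employment rate on centered urban rate */
PROC GLM DATA=mydata;
  MODEL femaleemployrate = urbanrate_c / SOLUTION;
RUN;
QUIT;
```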

If we look at the results, we can see from the significant P value and

negative parameter estimate that female employment rate is negatively associated

with urbanization rate.

So, the linear association, the blue line in the scatterplot,

is statistically significant.

But, the R-square is 0.09, indicating that the linear association of urban

rate is capturing only about 9% of the variability in female employment rate.

But what happens if we allow that straight line to curve by adding a second order

polynomial to that regression equation?

The SAS code to do this is here.

As you can see, it's the same code as for the linear regression model,

with the exception that we've added another explanatory variable,

which is urbanrate_c*urbanrate_c.

This gives us the square of the urbanrate variable,

which is a second order polynomial, or quadratic term.
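A sketch of the quadratic model described above, under the same assumed dataset and variable names as before:

```
/* Quadratic regression: the crossed term urbanrate_c*urbanrate_c
   adds the square of the centered explanatory variable */
PROC GLM DATA=mydata;
  MODEL femaleemployrate = urbanrate_c urbanrate_c*urbanrate_c / SOLUTION;
RUN;
QUIT;
```

If the quadratic term's p-value is significant, the curved model fits significantly better than the straight-line model.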