
Hello. This lesson is going to introduce ordinary linear regression.

Ordinary linear regression computes

a best fit linear model between two dimensions under certain assumptions.

This linear model can be used with

these caveats that I've mentioned to perform predictive analytics,

as well as to visually understand relationships

between different dimensions of a data set.

OLS (ordinary least squares) is

a very simple idea and it can be applied in a lot of different situations,

but sometimes it's misapplied and we have to be careful about that.

Future modules in this course will introduce

ideas that might be more powerful and be able to provide a better model.

This lesson includes two things.

First is using a visual website that explores

ordinary linear regression and second looks

at our notebook introduction to ordinary linear regression.

So what is this visual website?

The idea here is that you can play with the points and see what happens.

So for instance, here's some points and here's a best fit line and you might say,

"What happens if this point was down here?"

And you could see as you drag it, the line changes.

Not only that, but the fit parameters over here change as well.

So we can move the points up such that they are all very

nicely co-aligned and you can see that the fit parameters get pretty tight,

and our parameters that control that are going to get tight as well.

So we can also make changes here,

we can of course play with our website and see how this affects things.

This is actually demonstrating how this idea works and basically what we do is we create

little squares whose sides are each point's vertical deviation from the line,

and we want to minimize these squares.

That's what we talk about when we say regression,

we're regressing the points to this line, and it's ordinary least squares.

And that's what I said earlier, OLS,

the squares are sort of what we're trying to minimize.

You can do this by playing with both of these.

Right? You can change the intercept and you see some of

the squares get really small and you see the squares listed over here,

and we can also change the slope of our line.

By doing these together,

we may get a really good fit or we may get a really bad fit.

So hopefully that shows you what's going on with simple linear regression.
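
As a rough sketch of what the sliders on that website are doing: the points below are made up for illustration, and `sum_of_squares` is a hypothetical helper name, not something from the website or the notebook. It adds up the areas of the little squares for any slope and intercept you try, and then compares your guesses with the least squares line from `np.polyfit`.

```python
import numpy as np

def sum_of_squares(x, y, slope, intercept):
    """Total area of the squares: sum of squared vertical gaps between points and line."""
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

# Made-up points for illustration only (they lie near y = 2x + 1).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# Try a few hand-picked lines, the way you would by dragging the sliders.
guesses = [(1.0, 1.0), (2.0, 0.0), (2.0, 1.0)]
for slope, intercept in guesses:
    print(f"slope={slope}, intercept={intercept}, "
          f"squares={sum_of_squares(x, y, slope, intercept):.3f}")

# The least squares line: no other slope and intercept gives a smaller total.
best_slope, best_intercept = np.polyfit(x, y, 1)
best = sum_of_squares(x, y, best_slope, best_intercept)
print(f"best fit: slope={best_slope:.3f}, intercept={best_intercept:.3f}, squares={best:.3f}")
```

Whatever line you guess, its total should never come out below the best fit total.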

Now, our notebook is going to talk about this in a little bit more detail.

We're going to analyze a data set.

First, we're going to use one data set that's included in Seaborn called Anscombe.

And so first we just grab it.

There are four different subsets in this particular data set.

Here's the first part of one of them and you can just see there's X and Y.

So we're going to make a correlation measurement,

we're going to see what is our correlation,

and it's not too bad.

The Pearson r is 0.816.
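
Here is a minimal sketch of that correlation measurement. It assumes the first subset is data set I of Anscombe's quartet, and it types the eleven points in by hand instead of loading them through Seaborn, so it runs without a download. It uses `np.corrcoef` where the notebook may well use a different call.

```python
import numpy as np

# Data set I of Anscombe's quartet, typed in by hand (assumed to be the subset shown).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is Pearson's r.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")
```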

We can then say, well, what does the data look like?

Here's our data and that looks like there's some sort of nice relationship present,

so we can actually fit a linear model using Seaborn.

We use regplot and just say

fit_reg=True, and it displays our line, and you go,

That does look pretty good.

Now what about the residuals about this line?

You can plot these with Seaborn's residplot.

And there you see that they're not too bad.

They're scattered around the line pretty nicely,

and we can put it all together and plot the best fit line with

our Pearson correlation coefficient by using jointplot.
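
If you want the numbers behind a residual plot, here is a small sketch. It assumes the same hand-typed Anscombe data set I, and it fits the line with `np.polyfit` instead of Seaborn. The residuals are just the data minus the fitted line.

```python
import numpy as np

# Anscombe data set I, typed in by hand (an assumption about which subset is plotted).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Fit a straight line, then subtract the fitted values from the observed ones.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

print(np.round(residuals, 2))
print("mean residual:", residuals.mean())       # essentially zero for a least squares line
print("largest |residual|:", np.abs(residuals).max())
```

For a least squares line with an intercept, the residuals average out to zero, so what you look for in the plot is any pattern in how they scatter.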

That's great. The problem is Anscombe includes

four different data sets that all have the exact same regression coefficient,

the same variance, all of the same parameters,

but when you look at them visually,

you can clearly see differences.

Here's the original one we looked at.

But if we simply looked at the numerical or analytic results,

these four are indistinguishable.

It's not until we actually view them visually that we see,

here is a line with a clear outlier,

here's a bunch of points that are all set at

the same thing with one point way off by itself,

and these points clearly don't have a linear relationship.

That's why I'm showing this to demonstrate that you must visualize your data,

view your data, to ensure that a linear model is even appropriate.

If we simply computed a linear regression,

we'd get the correlation coefficient that we got out and think,

hey, life is good, we've got a pretty good correlation, we can go forward.

But when you visualize it,

you can see these are not good fits to the data set.
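
To check that claim numerically, here is a sketch that types in all four data sets of Anscombe's quartet by hand (instead of loading them through Seaborn) and fits each one with `np.polyfit`. The `quartet` dictionary and its labels are names chosen for this example.

```python
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # sets I, II and III share the same x
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x, y = np.array(x, dtype=float), np.array(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)       # least squares straight line
    r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation coefficient
    summary[name] = (slope, intercept, r)
    print(f"{name:>3}: slope={slope:.2f}  intercept={intercept:.2f}  "
          f"r={r:.2f}  mean y={y.mean():.2f}")
```

To two decimal places the four rows come out the same, which is why the numbers alone cannot tell these data sets apart.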

So what else can we do?

We can actually perform the regression and calculate the parameters.

Here we're fitting a line and getting a Pearson correlation coefficient by using numpy.

We can also do the same thing by using Seaborn,

here's our Seaborn plot and then we can fit a line to that.

It's the exact same thing.

Here notice what we're trying to do,

here are different points.

And these epsilons are what we're trying to minimize in our equation.

If I scroll back up, you'll see these here.

This is what we're trying to do: we have a series of

points x and y and we're trying to fit a linear model to them,

where we model y as x times a beta, or slope, parameter plus

an intercept, and have these epsilon terms

which account for the difference between the model and the data point.

We want to minimize those by performing our regression.

So you can see, that's what we want to do,

the best fit line will minimize the sum of the squares of those differences.
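
Written out, the model and the quantity being minimized look roughly like this. The symbol alpha for the intercept and the hats on the fitted values are notation assumed here, since the transcript only names beta and epsilon.

```latex
% Model: each observed y_i is a line in x_i plus an error term
y_i = \beta x_i + \alpha + \epsilon_i , \qquad i = 1, \dots, n

% Ordinary least squares picks the slope and intercept that minimize the sum of squared errors
S(\alpha, \beta) = \sum_{i=1}^{n} \epsilon_i^{2}
                 = \sum_{i=1}^{n} \left( y_i - \beta x_i - \alpha \right)^{2}

% Setting both partial derivatives of S to zero gives the standard solution
\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                   {\sum_{i=1}^{n} (x_i - \bar{x})^{2}} ,
\qquad
\hat{\alpha} = \bar{y} - \hat{\beta}\, \bar{x}
```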

We can look at this visually,

as we saw in the previous linear website.

The idea is that we want to minimize

the squared differences and there are different ways

we could do it which this notebook walks you through.
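
As an illustration of "different ways", here is a sketch with three routes to the same line on the hand-typed Anscombe data set I. These three are picked for this example and are not necessarily the ones the notebook uses: the textbook closed form, `np.polyfit`, and `np.linalg.lstsq` on a design matrix.

```python
import numpy as np

# Anscombe data set I, typed in by hand.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# 1. Closed form: slope = covariance-like sum over variance-like sum.
slope_cf = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_cf = y.mean() - slope_cf * x.mean()

# 2. Polynomial fit of degree 1.
slope_pf, intercept_pf = np.polyfit(x, y, 1)

# 3. Linear algebra: solve the least squares problem A @ [slope, intercept] ~ y.
A = np.column_stack([x, np.ones_like(x)])
(slope_ls, intercept_ls), *_ = np.linalg.lstsq(A, y, rcond=None)

for label, s, b in [("closed form", slope_cf, intercept_cf),
                    ("polyfit", slope_pf, intercept_pf),
                    ("lstsq", slope_ls, intercept_ls)]:
    print(f"{label:>11}: slope={s:.4f}  intercept={b:.4f}")
```

All three minimize the same sum of squares, so they should agree to within floating point rounding.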

One other thing I wanted to talk about here is that we can use

linear regressions across the levels of a categorical variable to see differences.

We've seen this before with the tips data set.

We noticed that the lunch has a stronger slope but there's

a lot more scatter at the dinner time for the total bill and the tip.

We can also look at these and get the correlation coefficient;

you see that just as we just saw visually,

there's a stronger correlation in the lunch and a much weaker one in the dinner.
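
Here is a sketch of the per-category idea. The numbers below are invented for illustration and are NOT rows from the tips data set, and `correlation_by_group` is a hypothetical helper name. It splits the points by a label and computes Pearson's r inside each group.

```python
import numpy as np

def correlation_by_group(x, y, groups):
    """Return {group label: Pearson r between x and y within that group}."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    return {str(g): float(np.corrcoef(x[groups == g], y[groups == g])[0, 1])
            for g in np.unique(groups)}

# Invented values: the first five rows are tight, the last five are scattered.
total_bill = [10, 15, 20, 25, 30, 12, 18, 24, 30, 36]
tip = [1.5, 2.3, 3.0, 3.8, 4.4, 2.5, 2.0, 4.5, 3.0, 5.5]
time = ["Lunch"] * 5 + ["Dinner"] * 5

result = correlation_by_group(total_bill, tip, time)
for label, r in result.items():
    print(f"{label}: r = {r:.2f}")
```

With real data you would pass in the columns of the data set instead of these toy lists.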

There's also the ability to look at nonlinear data sets: here we're generating

a compound interest data set and then we're going to visually explore that,

so we can perform regression on both of those.

And when you do this with linear regression,

you see, that slope doesn't look too bad, that fit doesn't look too bad,

but if we change the number of years to even larger,

you'll note that this data is just going to continue to spiral

upwards as the exponential compounding continues.

So instead what we can do is change our variable to the logarithm of that.

And then you notice that we get a perfect straight line fit.

So this again demonstrates that it's important to look at your data,

to understand your data,

and that linear regression doesn't have to have the variables being linear,

it's the parameters that must be linear.

So we can take the logarithm of our data and compute the regression

on that and get a really good fit when it is distributed in a logarithmic manner.
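
Here is a minimal sketch of the compound interest example. The principal, rate, and number of years are assumed values, not the ones used in the notebook. It fits a straight line to the raw balance and then to the logarithm of the balance.

```python
import numpy as np

# Assumed values for illustration: 1000 invested at 5% per year for 30 years.
principal, rate = 1000.0, 0.05
years = np.arange(0, 31)
value = principal * (1 + rate) ** years

# Straight line through the raw balance: looks reasonable, but it is not exact.
lin_slope, lin_intercept = np.polyfit(years, value, 1)
r_linear = np.corrcoef(years, value)[0, 1]

# Straight line through log(balance): log(value) = log(principal) + years * log(1 + rate).
log_slope, log_intercept = np.polyfit(years, np.log(value), 1)
r_log = np.corrcoef(years, np.log(value))[0, 1]

# The fitted parameters of the log model map back to the rate and the principal.
recovered_rate = np.exp(log_slope) - 1
recovered_principal = np.exp(log_intercept)

print(f"raw fit: r = {r_linear:.4f}")
print(f"log fit: r = {r_log:.4f}")
print(f"recovered rate = {recovered_rate:.4f}, recovered principal = {recovered_principal:.2f}")
```

The model is still linear in its two parameters; only the variable being fitted has been transformed.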

So I hope this has given you a bit of an introduction to ordinary linear regression.

It's a powerful technique that's often

useful to get a feel for what's happening in your data.

You need to be careful to ensure that your data are

reflective of the properties that are required for ordinary linear regression.

As we go forward in this course sequence,

we'll see other techniques that are more powerful and we'll see

ways to minimize the other issues that we might be concerned about,

such as bias and variance, and underfitting or overfitting models.

If you have any questions, let us know in the course forums, and good luck.