1:02

So what you should be able to do after this lesson

is understand the idea of the loss or cost function, and

how you use it to figure out the optimal model parameters.

You also should be able to compute a regression on multiple variables.

You should be able to explain the different statistical measures

that quantify the quality of a regression, and

we've seen some of these in previous modules.

You also should be able to utilize categorical variables, you should know

what they are and how to use them in building a machine learning model.

And you should be able to use both the scikit-learn and

statsmodels libraries to perform linear regression.

Now, the only thing to do in this particular lesson is to actually go

through the Introduction to Linear Regression notebook.

So let me change over to that.

2:02

This notebook will start by introducing the formalism of linear regression.

Then we'll talk about the loss or cost function and

how we need to minimize that in order to determine the optimal model parameters.

And then we'll talk about some of the steps with linear regression by

using the scikit-learn library.

We'll talk about categorical variables and how to use those with linear regression.

And finally, we'll talk about the statsmodels library.

We don't use the statsmodels library that often, but when we do,

it provides some powerful capabilities that I want to highlight.

So, first, we of course have our standard imports.

And then we are going to load some data and look at it.

In this case, we are going to use the tips dataset.

3:06

So this is just representative of the data we're going to be analyzing.

First, we're going to perform some basic feature extraction on this;

we're actually extracting the data from our DataFrame as x and y.

The x will be the value that is our independent variable and y

will be our dependent variable, so we'll be able to predict y given an x.

And to do this, we simply repeat what we did with our ordinary linear regression.

The simplest way to do that was to use SciPy's linregress function.

And this returns the slope and intercept as well as an r value, the p value, and

the slope's standard error. Of these, the most important were the slope, intercept,

and r value, which is a measure of the quality of the fit.

That's the Pearson correlation coefficient.

And so we display these.

So this is just giving you a feel for what we should be expecting the relationship to

be for this one feature, total_bill, to predict the tip feature.
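As a sketch of that step, here's how linregress returns those statistics. The data here is synthetic stand-in data, not the actual tips DataFrame; the 0.15 slope is just an assumed tipping rate for illustration.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-in for the tips data: tip roughly 15% of total_bill
rng = np.random.default_rng(0)
total_bill = rng.uniform(5, 50, size=100)
tip = 0.15 * total_bill + rng.normal(0, 0.5, size=100)

# linregress returns slope, intercept, r value, p value, and standard error
result = linregress(total_bill, tip)
print(f"slope={result.slope:.3f}, intercept={result.intercept:.3f}, "
      f"r={result.rvalue:.3f}")
```

The r value here is the Pearson correlation coefficient between the two variables, so values near 1 indicate a tight linear relationship.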

So, what are we doing?

In general, we want to create a function or model that is going to take multiple

input features, multiply them by some coefficients with a possible

intercept, along with some error terms that are the differences between the model and

the data for each data point.

That's our function and we're going to want to predict these values for,

in this case, y, the dependent variable.

4:41

To figure out those parameters, we have to define a cost function and

then we want to minimize that.

So what are the cost functions?

Usually what we do is we take the difference between the actual, or

label, information we have and the model value,

and do some sort of normalization on that.

Typically, what we'll use is the l2-norm,

which is shown here: the sum of (y_i - f(x_i)) squared.

The reason we square it is that way you're always summing up positive terms.

If you didn't square it, there might be a few that are negative and

some that are positive; they might cancel out and

you'd get zero even though your model wasn't a very good fit.

So by squaring it, we're going to actually try to

minimize the differences between our model and the data.
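The squared-difference idea can be written out directly; here's a minimal sketch with made-up numbers showing why the squaring matters:

```python
import numpy as np

def l2_cost(y, y_pred):
    """Sum of squared residuals: sum of (y_i - f(x_i))**2."""
    return np.sum((y - y_pred) ** 2)

y = np.array([1.0, 2.0, 3.0])
# A model that under-predicts one point and over-predicts another;
# without squaring, the errors (-0.5 and +0.5) would cancel to zero.
y_pred = np.array([1.5, 1.5, 3.0])
print(l2_cost(y, y_pred))  # 0.25 + 0.25 + 0.0 = 0.5
```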

5:29

So that's what we do. First, we're going to load one dataset;

rather than the tips, we're going to start with a different one.

In this case, it's going to be one of Anscombe's datasets. We do our l2-norm.

We make the plot.

And you can see, here it is.

Some of the data points lie right on the line.

Others show pretty big differences, right?

This E10, this epsilon 10 is pretty big, and epsilon 3 is pretty big.

But this was the fit that minimized the cost function.

We can look at cost functions and actually define what they are.

So here we're going to say, what is our cost function?

It's going to be the logarithm of the sum of squared differences between our actual values and

our model values.

In this case, we don't have an intercept term.

It's just the slope times the independent variable.

And then we apply that to our data and we make a plot.

So here you can see here is the coefficient, the single coefficient.

Here's the cost. We typically do this in

log space because the cost function can vary dramatically, and by taking the logarithm

of it you actually get to focus on the important aspects, right?

We don't want to be focusing out here in the wings where there's lots of

data points.

We really want to see where's that minimum.

As you can see, there it is, right?

The beta there is very close to 1.
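The slope-only cost scan can be sketched like this, assuming synthetic data with a true slope of 1 (the grid range and noise level are made-up choices for illustration):

```python
import numpy as np

# Synthetic data with a true slope near 1, no intercept
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.0 * x + rng.normal(0, 0.3, size=x.size)

# Scan candidate slopes and compute log of the sum of squared residuals
betas = np.linspace(0.5, 1.5, 201)
costs = np.log(np.array([np.sum((y - b * x) ** 2) for b in betas]))

# The minimum of the cost curve recovers the true slope
best = betas[np.argmin(costs)]
print(f"best beta ~ {best:.3f}")
```

Plotting `costs` against `betas` gives the cost curve from the notebook: the log keeps the large costs in the wings from visually swamping the minimum.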

In general, though, our cost function's going to be multi-dimensional,

not uni-dimensional, so we're going to want to look at something more complex.

Here we show the same idea, but now we've added an intercept term.

So we have our independent variable times our slope, plus our intercept, and

that gives our model, and we subtract the model value from the actual value.

And then we can plot this, and in this case it's going to be a 2D plot.

So you can see it here.

Here's our minimum.

Again, beta is very close to 0, alpha is pretty close to 0 too, not too large.

7:13

So that's a demonstration of what this cost function is.

And this is very important because this whole idea of finding that minimum is

the key aspect in all of machine learning.

You want to find the best model,

to do that you have to minimize the cost function, or the loss function.

And typically, we'll do this with gradient descent, but

there's a lot of other algorithms that come along to try to improve upon this.

And we will see more with gradient descent in future notebooks.

Now, let's get into doing linear regression with scikit-learn.

Again, as I showed you in the introduction of machine learning lesson,

using scikit-learn to perform machine learning is very simple.

We import our estimator, in this case, it's the linear regression estimator.

We create it, and we may have to pass in some hyperparameters when we do it.

In this case, we are showing none but

we can specify some like the fit_intercept or normalize.

And then we fit the model and then we can predict the model.

So let's go back to the tips dataset, we're going to use that.

So first, we're going to split our data into training and testing.

We then apply the scikit-learn library, and in this case we are going to make

a nice little plot first and then show the linear regression.

So here is our model, tip = 0.14 times total_bill.

And our model score is 39.9%, that's not very good, right?

Ideally, you would have 100% accuracy.

But this is our first time, so we are just going to look at this.

We've color coded training and

testing data a little bit differently, and the model looks somewhat reasonable.

It performs worse out here as the total_bill gets bigger.

There's some errors out here as well, but not too bad.
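A hedged sketch of that workflow, using synthetic stand-in data rather than the actual tips DataFrame (the 0.15 slope and noise level are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tips data
rng = np.random.default_rng(2)
total_bill = rng.uniform(5, 50, size=(200, 1))  # sklearn expects a 2D X
tip = 0.15 * total_bill[:, 0] + 1.0 + rng.normal(0, 1.0, size=200)

# Split into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    total_bill, tip, test_size=0.25, random_state=0)

# Fit without an intercept first (as in the notebook), then with one
for fit_int in (False, True):
    model = LinearRegression(fit_intercept=fit_int)
    model.fit(x_train, y_train)
    score = model.score(x_test, y_test)  # R^2 on the held-out test set
    print(f"fit_intercept={fit_int}: coef={model.coef_[0]:.3f}, "
          f"R^2={score:.3f}")
```

The `score` method reports R-squared on the test data, which is the "model score" percentage the lecture refers to.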

We can also look at the model residuals;

these are the differences between the actual and the predicted values.

And so you can see that here, and now you can see why that 39.9% score

wasn't closer to 100%: there is some serious scatter about this line.

There's nothing that's too bad,

we don't see any indication of some systematic trend, so, for instance,

it's not like all the values out here at total_bill are well above the zero line.

That would indicate that our model might be underfitting the data.

That's something we always want to worry about.

So that's a reasonable first guess, and we can try other things as well, so

we can say this time let's fit an intercept.

So now you notice our model changes slightly and

our accuracy goes up, that's good.

And you can see the rest.

So this notebook walks you through these different steps.

The next thing is multivariate, can we use more than one independent feature?

And so here we are taking total_bill and size and extracting them out.

And the model just walks through making these, again, showing the same things.
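A minimal sketch of the multivariate case, again on synthetic data; the coefficients and the role of a size column are assumptions standing in for the notebook's features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the two independent features
rng = np.random.default_rng(3)
n = 200
total_bill = rng.uniform(5, 50, size=n)
size = rng.integers(1, 7, size=n).astype(float)
tip = 0.1 * total_bill + 0.3 * size + rng.normal(0, 0.4, size=n)

# Stack both features into an (n, 2) matrix; the fit call is unchanged
X = np.column_stack([total_bill, size])
model = LinearRegression().fit(X, tip)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
```

The only real change from the single-variable case is the shape of X; scikit-learn fits one coefficient per column.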

The next step, though, is categorical features.

9:54

The previous independent features we looked at, total_bill and size,

they were numerical.

They had values and they could take any value.

On the other hand, some features are restricted.

They can be non-numerical, like gender (male or female), or smoker (yes or no).

They can also be restricted to nominal features,

where there's no real relationship between different categories.

A good example is the gender category, because it's either male or female,

and it's not like you can apply a function to go between the two.

Another type of categorical feature is an ordinal feature,

where there is a relationship between them.

So, for instance, if you're saying what are the results of the race?

Somebody got first, somebody got second, somebody got third; there actually is

an intrinsic relationship among those: first comes before the other two.

And the way we treat these can vary, and so you need to keep that in mind.

So the next part of this notebook talks about categorical features and

how to turn them into something that can actually be applied to machine learning.

Remember, machines understand numbers not text, so

if we have a categorical feature that in this case is four colors, red, blue,

yellow and green, how do we turn that into something that a machine can understand?

And one way to do that is to encode it as a number.

So that 0 is blue, 1 is green, 2 is red, and 3 is yellow.

But the problem with that is,

it implies a relationship, that blue comes before red.

And that may not be true.

So another way we could do this is something called one-hot encoding,

where we actually make multiple features.

So rather than saying, which color is it?

We say, was it red or not?

And was it blue or not?

And that's the idea.

So we expand the data to make a column,

in this case, a feature that says, first column, is it blue?

If it's a 1, it is blue, and in that case the other three are all 0.

On the other hand, the second feature is green, was it green?

If it is, then that feature is 1 and the others are 0, etc.

This is an important concept, and

it may seem strange because we've expanded the feature space.

But it's important because it makes it easier for

a machine learning model to learn this,

especially since there's no relationship between green, blue, yellow and red.

It's not like one of these colors necessarily comes before the others.
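A small sketch of both encodings using pandas; the color values mirror the lecture's example, and `get_dummies` is one common way to do one-hot encoding (the notebook may use scikit-learn's OneHotEncoder instead):

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "blue", "yellow", "green", "blue"]})

# Plain label encoding imposes an artificial order
# (alphabetical here: blue=0, green=1, red=2, yellow=3)
labels = colors["color"].astype("category").cat.codes
print(labels.tolist())

# One-hot encoding: one 0/1 column per category, no implied ordering
onehot = pd.get_dummies(colors["color"])
print(onehot)
```

Each row of the one-hot frame has exactly one "on" column, so no color is treated as coming before another.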

12:12

So that's an important point.

We could apply that and use the data to get a linear regression

with categorical features, and so that's what we are doing here.

We are creating a OneHotEncoder for the day that we are using.

And we are going to apply that to our linear regression and make predictions.

And you can see that now our model is using all of these features.

And the accuracy actually went down, and sometimes you're going to see that.

And you have to figure out, was it a good idea to use that feature or not?

And I'm going to want you to walk through lots of these different

things and see them.

The last thing you're going to look at is statsmodels;

statsmodels has two things that are really important that I want to demonstrate.

The first is the way you write a formula, a relationship.

This is something that actually was originally developed for

the R programming language, and Python brought it over into the statsmodels

interface, where we say the formula is: tip is dependent on total_bill.

13:27

We could also do the same thing but say -1;

this removes the intercept term, so we don't compute the intercept.

We'll be actually using this specific example with scikit-learn

in future notebooks as well.

So it's an interesting way of writing it, makes it a little easier, perhaps,

to understand what's going on.

So I wanted to be sure to demonstrate that.
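A minimal sketch of the formula interface, on synthetic data standing in for the tips dataset (the column names and 0.15 slope are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the tips data
rng = np.random.default_rng(4)
df = pd.DataFrame({"total_bill": rng.uniform(5, 50, size=100)})
df["tip"] = 0.15 * df["total_bill"] + 1.0 + rng.normal(0, 0.5, size=100)

# R-style formula: tip modeled as a function of total_bill (intercept included)
with_int = smf.ols("tip ~ total_bill", data=df).fit()
# Appending -1 to the formula drops the intercept term
no_int = smf.ols("tip ~ total_bill - 1", data=df).fit()

print(with_int.params)
print(no_int.params)
# with_int.summary() prints the full fit report (R^2, AIC, BIC, ...)
```

Calling `.summary()` on a fitted result gives the concise report the lecture mentions, with the correlation statistics and the AIC and BIC alongside the coefficients.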

We show here the relationship between these.

Here's the regression comparison for both the no intercept, and the intercept model.

And lastly, the second thing I wanted to demonstrate with statsmodels was this:

it provides a very concise way of presenting the summary results of the fit.

And this includes the Pearson correlation coefficient, a bunch of other statistics

as well, we'll talk about many of these in future lessons.

14:13

As well as things like the AIC and

BIC, which are metrics telling you how well the model fits.

So with that, I'm going to go ahead and end this particular lesson.

I hope you're excited to actually dive in and

start performing real machine learning, in this case, linear regression.

You'll be able to do linear regression with

normal numerical data as well as categorical data.

You'll also be able to perform multivariate linear regression and

to quantify the results with some particular metrics.

If you have any questions, let us know.

And good luck.

[MUSIC]