0:15
Hello and welcome to lesson three.
This lesson will introduce the concept of boosting, which is another form of
ensemble learning that combines the predictions of weak learners,
but this time using weights.
The idea is that we use these weights in an iterative process,
modifying them to emphasize the training examples whose predictions were inaccurate.
As we continue to refine the predictions,
these weights should help emphasize
the training data that matter most for making accurate predictions.
And thus, ideally, the model becomes more and more accurate over time.
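To make that reweighting idea concrete, here is a minimal sketch of a single boosting round in the style of AdaBoost; the function and variable names are just illustrative and are not taken from the notebook.

```python
import numpy as np

# Minimal sketch of one boosting round (AdaBoost-style reweighting).
# `weights` sums to 1; `y_true` and `y_pred` are arrays of class labels.
# Assumes the weak learner is better than chance (0 < weighted error < 0.5).
def boost_round(weights, y_true, y_pred):
    miss = (y_pred != y_true).astype(float)
    err = np.sum(weights * miss)              # weighted error of this weak learner
    alpha = 0.5 * np.log((1 - err) / err)     # how much say this learner gets
    weights = weights * np.exp(alpha * miss)  # up-weight the examples it got wrong
    return weights / weights.sum(), alpha
```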
So, by the end of this lesson,
you should be able to explain what boosting is, why you can
use it to get better results, and how it differs from bagging.
They're both ensemble learning techniques.
I want you to be able to explain them,
each independently and how they differ.
And then, finally, you should be able to apply boosting algorithms,
such as the gradient boosted decision tree,
by using the scikit-learn library.
The reading for this particular lesson is the boosting notebook.
We, of course, once again start with the decision tree because
that's what these ensemble learning algorithms are generally based on.
However, you can use other base estimators, as I showed in the boosting notebook;
the boosting classifier lets you do that.
What we're going to look at here is the idea of the boosting algorithm.
We're going to focus on the gradient tree
boosting for classification and then for regression.
Some things that we're going to look at are a little bit different.
We're going to see,
of course, the feature importance.
We're also going to introduce the concept of staged predictions.
That's where we're going to see,
at each stage, what the prediction is and how it's changing.
Then, for regression, we're going to look at the Auto-MPG Data again,
and we're going to look at the concept of partial dependence.
And the last thing, we're going to introduce another algorithm called AdaBoost.
We start with our introduction.
We then jump straight into
the cost and loss functions for gradient boosting and how the algorithm works,
whether for AdaBoost or for the gradient boosting algorithm.
We're going to see some of the hyperparameters that are important, and
then we're going to jump into using the gradient boosting classifier.
The start, of course, will be with the Iris dataset.
We'll see how this data looks,
how we do when we predict on it.
Right off the bat, our accuracy is pretty good.
And again, this was only with 50 percent of the total data for training.
So, we're getting pretty good accuracy even just using half of the data for training.
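As a rough sketch of that first step, assuming scikit-learn's bundled Iris data and a 50/50 split (the notebook's exact hyperparameters and random seed may differ):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold out half of the Iris data for testing, as in the lesson.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=23)

gbc = GradientBoostingClassifier(n_estimators=10, random_state=23)
gbc.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, gbc.predict(X_test)):.3f}")
```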
When we look at our metrics,
we see they're pretty good,
a little bit of misclassification.
An easy thing to do there would be to increase the training data,
which might remove that misclassification, but it's a better idea to try
changing the algorithm or its hyperparameters.
We can, of course, look at feature importance.
Here the difference with the boosted tree is that
the importance isn't measured the same way.
Obviously, if you look at these values,
they add up to over 200 percent.
The idea is to see which features are most important here,
not in terms of a percentage as we had before,
but in terms of their relative importances.
So here we see that the petal length is very important,
then the sepal width.
We knew that before. Those were always the features that were thought to be important.
But for gradient boosted decision trees,
the sepal width now is more important than the sepal length,
which is different from before.
So we see some variation in the importances.
And again, these are all relative to each other,
not in an absolute sense.
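A quick sketch of pulling those relative importances out of a fitted model, reusing the gbc classifier and Iris feature names from the sketch above:

```python
from sklearn.datasets import load_iris

# Pair each feature name with its relative importance and sort descending.
feature_names = load_iris().feature_names
ranked = sorted(zip(feature_names, gbc.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```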
We can look at the decision surface.
Again, it's based on decision trees.
So it should look pretty similar.
This is very simple.
Remember, we're continually improving
the performance of our algorithm; that's the idea of a gradient boosted decision tree.
So, there shouldn't be a lot of cuts with the base hyperparameters.
We can, of course, change those.
We can change the number of estimators, increasing
them to see how that changes the performance.
We start with the base number of estimators,
which should be 10, so 10 very weak learners.
As we increase to 50, we, of course,
get a more complex decision surface. At 100,
it hasn't changed much because, apparently,
there's little information to be gained by making more cuts.
We could, of course, increase that number of estimators up to
1,000 if you wanted to and see what happens.
And that's one of the things I ask you to do.
Try playing around with some of
these hyperparameters and see what changes in the decision surface.
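One simple way to experiment is a sketch like the one below, which just refits with different estimator counts and reports test accuracy, reusing the Iris split from earlier; drawing the decision surface itself relies on the notebook's plotting helper.

```python
# Refit with increasingly many weak learners and compare test accuracy.
for n in (10, 50, 100, 1000):
    model = GradientBoostingClassifier(n_estimators=n, random_state=23)
    model.fit(X_train, y_train)
    print(n, accuracy_score(y_test, model.predict(X_test)))
```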
We then go into the classification on a larger dataset.
In this case, we're going to apply it to the Adult Data.
We're going to use 50 percent of our data for training,
50 percent for testing,
and we're going to see how we're doing.
And as you see, with this "real-world" dataset
it's doing quite well straight out of the box.
So, you might be able to do really well if you start tuning these hyperparameters.
What I really want you to get out of this is that this is
another important, powerful algorithm that
uses the predictions of many weak learners to make a more accurate prediction.
Another thing that we want to talk about is the idea of staged prediction.
The idea with gradient boosted decision trees is that
we can see what the individual estimators' predictions are.
And so, we can see how the gradient boosted
tree is changing on its way to the final estimator.
So, for this, we can plot half of the total estimators and see what happens,
and we can use different metrics for this.
We can use the default accuracy score.
We could use F1 score.
And so, that's what this particular cell does.
As we increase the number of estimators,
so we start with zero and then we go one,
two, three, as we add more and more estimators,
how does our accuracy change?
And you can see this for both the test data and the training data.
And you see that it jumps up very quickly and then sort of levels off.
Now, this is useful because it tells you that once you get to about 40
or 45 estimators, you're not changing the performance much,
particularly on the test data.
And so, going out to 1,000 is unlikely to benefit your performance much,
but it would take a lot longer to compute.
That's the importance of staged prediction.
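Here is a sketch of how that stage-by-stage curve can be computed with scikit-learn's staged_predict, again reusing the fitted gbc and the Iris split from earlier; swap in f1_score if you prefer that metric.

```python
from sklearn.metrics import accuracy_score

# Score the ensemble after each additional estimator is added.
train_scores = [accuracy_score(y_train, pred)
                for pred in gbc.staged_predict(X_train)]
test_scores = [accuracy_score(y_test, pred)
               for pred in gbc.staged_predict(X_test)]

# The test curve typically rises quickly, then levels off.
for stage, score in enumerate(test_scores, start=1):
    print(stage, round(score, 3))
```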
Next, we can look at gradient tree boosting for regression,
and we'll apply it to the Auto-MPG Data.
We'll see what our results are.
And as you might expect, they're pretty good.
In this case, we are doing a single feature prediction,
and our mean absolute error is 2.25,
which is one of the better ones we've gotten.
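A hedged sketch of that regression step follows; the file name, column names, and split here are assumptions about how the Auto-MPG data is loaded, and the notebook may do it differently.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Assume a CSV with an 'mpg' target column and numeric predictor columns.
autos = pd.read_csv("auto-mpg.csv").dropna()   # hypothetical file name
X = autos.drop(columns=["mpg"])
y = autos["mpg"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=23)

gbr = GradientBoostingRegressor(random_state=23)
gbr.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, gbr.predict(X_test)))
```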
We can also, though, look at something called partial dependence.
And the idea here is,
with gradient tree boosting,
you can see the relationships between the features and the prediction.
Here we're not measuring their relative importance.
Instead, we're asking how the prediction depends
on the value of a feature, or on the relationship between two features.
So we're actually going to call a partial dependence method that we can use.
This will compute, given a model, the dependence of the prediction on one or two features.
And so, here we can see
the partial dependence of the regression prediction on the displacement.
And you can see that when the displacement is small,
there's a high dependence, a high positive value.
And as the displacement gets larger and larger, the dependence is lower.
What this is telling us is that small displacement engines,
which is what this feature is measuring,
have better fuel performance,
and big engines have worse.
And that should make some sense.
If you have a car with a really big engine,
it's going to suck a lot of gasoline.
It's going to have a lower performance.
It might go faster. It might accelerate faster,
but that's what you're going to have.
On the other hand, we can look at weight.
Same thing will hold with the car's weight.
A lighter car uses less fuel.
A heavier car will use more fuel.
So you're seeing here,
in this partial dependence plot,
the relationship between an individual feature
and the prediction that we're going after.
In this case, fuel performance.
Another one is acceleration.
Interestingly enough here,
only the very low accelerations actually have a positive value.
Otherwise, it's pretty much right around zero.
Maybe at the high end, it's a little low.
This is just telling you that if you accelerate really quickly,
you're using a little bit more fuel,
but, interestingly enough, that is not a very predictive relationship.
But we could also make a multi-dimensional partial dependence plot,
and that's what we have here.
We're looking at weight versus displacement.
And as expected,
small cars with small engines have better fuel performance, and you can see this.
These are six and four,
the actual values here in our contour.
And as we get out to heavier cars that have bigger engines,
the fuel performance is lower and lower,
and these are negative 3.8 and negative 5.9.
So, partial dependence is useful to show that relationship.
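Here is a sketch of generating those plots with scikit-learn's partial dependence tools (recent versions expose PartialDependenceDisplay; the notebook may use an older helper), reusing the gbr regressor and training frame from the Auto-MPG sketch above; the feature names assume the usual Auto-MPG column naming.

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# One-way partial dependence for three features, plus the two-way
# weight-vs-displacement contour discussed above.
features = ["displacement", "weight", "acceleration",
            ("weight", "displacement")]
PartialDependenceDisplay.from_estimator(gbr, X_train, features)
plt.tight_layout()
plt.show()
```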
Now, just like with the bagging algorithm,
we're going to introduce another type of algorithm here,
the AdaBoost algorithm, which stands for adaptive boosting.
Very similar to the gradient tree booster, we can use it for classification.
We can apply it straight out of the box,
and it does pretty well for the Adult Data.
We can also apply it to the Auto-MPG Data for regression.
Again, straight out of the box it does pretty well,
but we can tune these values to see if we can do even better.
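A minimal sketch of both AdaBoost variants straight out of the box: the Adult feature matrices here (X_train_adult and so on) are hypothetical names standing in for however the notebook prepares that data, and the regression part reuses the Auto-MPG split from the earlier sketch.

```python
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor

# Classification on the (already preprocessed, numeric) Adult data.
ada_clf = AdaBoostClassifier(n_estimators=50, random_state=23)
ada_clf.fit(X_train_adult, y_train_adult)      # hypothetical variable names
print("Accuracy:", ada_clf.score(X_test_adult, y_test_adult))

# Regression on the Auto-MPG split from the earlier sketch.
ada_reg = AdaBoostRegressor(n_estimators=50, random_state=23)
ada_reg.fit(X_train, y_train)
print("R^2:", ada_reg.score(X_test, y_test))
```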
So with that, I've introduced the concept of boosting.
I've introduced the gradient tree boosting algorithm and the AdaBoost algorithm.
We've talked about partial dependence and staged prediction.
These are new concepts, and a lot of material has come at you.
Please go through this carefully.
And if you have any questions,
let us know. And of course, good luck.