Now, let's talk about how we can reduce the variance of tree-based estimators,

while retaining most of their attractive features.

And this brings us to the topic of Ensemble Learning Methods.

First, let's talk a bit about the general idea of all ensemble methods.

They are based on the idea of relying on what is

called "The Wisdom of the Crowd" in modern parlance.

Let's assume that we have a number of different classifiers,

each trained on the same dataset.

Once we train them,

we can combine them and impose some rule on how

we compute the final class label for a data point in such a scheme.

For example, let's assume that we trained Logistic Regression, a Neural Network,

SVM Classifier, CART tree,

and possibly some other classifiers using the same data.

In the context of ensemble learning,

we call all such classifiers weak learners.

Let's note that the more different the algorithms used as weak learners are,

the more independent they will be.

Now, when presented with a new instance, X,

each weak learner will provide its own predicted label for this instance.

In this example, all predictors but the SVM

produce a prediction of class label one,

while the SVM predicts class two for the same data point.

The final predicted class of such ensemble of classifiers can

be decided based on the majority vote between weak classifiers.

In this example, the majority vote will produce the final predicted class label of one.
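
The voting rule just described can be sketched in a few lines. This is only an illustration: the learner names and their predicted labels below are made-up values that mirror the example, and `majority_vote` is a hypothetical helper, not part of any library.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label among the weak learners' predictions."""
    # most_common(1) returns [(label, count)]; on a tie the label seen first wins
    return Counter(labels).most_common(1)[0][0]

# Hypothetical predictions for one instance X (illustrative values only)
predictions = {
    "logistic_regression": 1,
    "neural_network": 1,
    "svm": 2,
    "cart_tree": 1,
}

final_label = majority_vote(predictions.values())
print(final_label)
```

Here three of the four learners vote for class one, so the single dissenting SVM vote does not change the ensemble's answer.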

It turns out that this procedure works better than just

picking the best classifier among all available weak learners.

This might look a bit puzzling at first sight,

but in fact there is nothing puzzling here.

This is just the law of large numbers at work.

And to see this,

let's consider probably the simplest possible ensemble method.

Let's assume that we have a biased coin that

comes up heads with a probability of 51 percent.

Let's assume that we toss such a coin 1000 times,

so we set N equal to 1000.

So, if we take a classifier that always predicts the next toss to come up heads,

its success rate will only be 51 percent.

That is, such a classifier is only barely better than a purely random guess.

However, if we look at all 1000 trials and take the majority vote,

then the probability of seeing a majority of heads will be about 73 percent,

as you can calculate using the cumulative binomial distribution as shown here.
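
If you want to check this number yourself, here is a minimal sketch of the cumulative binomial calculation. It assumes that "majority" means strictly more than half of the tosses, and the function name `prob_majority_heads` is a hypothetical one chosen for this illustration.

```python
from fractions import Fraction
from math import comb

def prob_majority_heads(n, p):
    """P(strictly more than n/2 heads in n independent tosses), P(heads) = p."""
    p = Fraction(p)          # exact arithmetic avoids float under/overflow
    k_min = n // 2 + 1       # smallest head count that is a strict majority
    total = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))
    return float(total)

p_heads = Fraction(51, 100)
print(prob_majority_heads(1, p_heads))     # a single toss
print(prob_majority_heads(1000, p_heads))  # majority over 1000 tosses, roughly 0.73
```

The same sum can be read as the accuracy of a majority vote among 1000 independent weak learners, each of which is right 51 percent of the time.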

So, as you can see in this simple example,

a majority vote can be used to make a much better,

stronger learner out of a bunch of weak learners.

The trick is simply the law of large numbers.

The key point is that all these weak learners should be independent of one another.

Now, taking such different estimators as logistic regression,

neural networks, and SVM can only produce a small number of weak learners.

It can also be time-consuming to implement vastly different estimators.

A more popular approach to ensemble learning is

to use weak learners obtained with the same algorithm.

To decorrelate them, that is, to make them as independent as possible,

they can be trained on random subsets of the original dataset.

And after that, a majority vote is applied to the output of these weak learners.

This is called bagging, which is short for bootstrap aggregation.
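
To make the procedure concrete, here is a small sketch of bagging in plain Python. Several things are assumptions made for illustration only: the toy one-feature dataset, the threshold "stump" used as the weak learner, and all function names. The bootstrap step draws rows with replacement, which is what "bootstrap" refers to.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement (one bootstrap replicate)."""
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """Hypothetical weak learner: the best single threshold on a 1-D feature."""
    best = None
    for threshold, _ in sample:
        for left, right in ((0, 1), (1, 0)):
            # Count misclassified rows for this threshold and label orientation
            errors = sum((left if x <= threshold else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def bagging_fit(data, n_learners, seed=0):
    """Train each weak learner on its own bootstrap replicate of the data."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_learners)]

def bagging_predict(learners, x):
    """Aggregate the weak learners' outputs by majority vote."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]

# Toy data (made up): the label is 1 when x > 5
data = [(float(x), int(x > 5)) for x in range(11)]
ensemble = bagging_fit(data, n_learners=25)
print(bagging_predict(ensemble, 1.0), bagging_predict(ensemble, 9.0))
```

Each stump sees a slightly different sample, so the individual thresholds differ, and the vote averages out those differences.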

This method was developed by Leo Breiman in 1996.

The same Leo Breiman a bit later built bagging into another model of his, Random Forest.

Random forest is one of the best performing algorithms

in all of machine learning and also one of the most popular ones.

Random forest is widely used in many industrial applications.

For example, it's used in the Microsoft Kinect sensor to perform body pose recognition.

Random forest works very similarly to bagging, but this time

the weak learners are decorrelated using

both randomly sampled features and random subsets of the training dataset,

while using bagging for predictions.

This produces a final estimator that has

both a good in-sample fit and low variance, and has good generalization ability.

A random forest algorithm has a number of hyper-parameters that can

be tuned to improve performance using the validation set or using cross-validation.

One of these hyper-parameters is the number of weak learners.

Also, as random forest combines individual trees,

it carries over all the hyper-parameters of a single tree.
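
As a rough sketch of what such tuning can look like, here is a cross-validated grid search with scikit-learn. The data is synthetic (it is not the bank failure dataset), and the grid values are arbitrary choices made for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the dataset used in this course)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

param_grid = {
    "n_estimators": [25, 100],      # number of weak learners (trees)
    "max_depth": [3, None],         # a single-tree hyper-parameter carried over
    "max_features": ["sqrt", 0.5],  # how many features are sampled at each split
}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The first grid entry is the forest-level hyper-parameter, while `max_depth` is an example of a setting inherited from the individual trees.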

Now, while random forest provides

one of the best classifiers available for many practical problems,

it can also be used as a nice tool for exploring the importance of different features.

The importance of a feature is calculated in

random forest as a by-product of the main algorithm.

It's given by the average depth at which

the given feature appears across all individual trees in the forest.

Simultaneously, the standard deviation of this quantity is calculated as well.
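
Here is a hedged sketch of how such importances and their spread across trees can be obtained with scikit-learn. Two caveats: the data and feature names are synthetic placeholders, and scikit-learn's `feature_importances_` is an impurity-based score rather than a literal average depth, so it is a related but not identical measure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Forest-level importance (normalized to sum to 1)
importances = forest.feature_importances_
# Spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

# Print features from most to least important
for i in np.argsort(importances)[::-1]:
    print(f"{feature_names[i]}: {importances[i]:.3f} +/- {std[i]:.3f}")
```

A large standard deviation relative to the importance itself suggests that the trees disagree about how useful that feature is.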

The graph on this slide shows you the importance

of different features in our bank failure problem.

The list of features is shown on

the right along with their importance and standard deviations.

The most important feature turns out to be the log of total assets.

The next most important feature is the so-called assessment base,

equal to average consolidated assets minus tangible equity.

The third and fourth places are taken by the ratio of

non-performing loans to total loans and the ratio of net income to total assets.

So, as you can see,

there are only four or five really important predictors

in this problem according to random forest.

You can compare this with results obtained with logistic regression,

which you did in your previous course.

With logistic regression, the importance of

different features is suggested by their p-values.

It turns out that these different assessments of importance of

different features are quite consistent for our banking data example.

One word on the pros and cons of random forest: as usual,

there is a flip side of the coin.

Random forest often gives very good fits to data

because it's so flexible and combines many basic trees.

But by the same token,

it loses simple interpretability.

There is no longer a single flowchart-like tree that could be visualized and interpreted.

And on this note,

we conclude our review of random forest

and move on to our final topic for this week,

Boosting Methods, but not before this video ends with a minute of reflection.