Hi there. Let's get started.

This lesson is about the random forest algorithm.

Random forest is a machine learning algorithm

which produces and ensemble of decision trees.

And this ensemble as a whole is more powerful than a single decision tree.

First of all, let us understand what the bootstrap is.

Bootstrapping is an algorithm which produces replicas of

a data set by doing random sampling with replacement.

This idea is essential for the random forest algorithm.

Consider a data set, Z,

for example, with five objects.

How does bootstrapping works?

At the first step,

you pick at random an object,

for example, the object number three.

Then you repeat it and pick the object number one.

Then the object number five. And so on.

Then possibly you can pick again the object number five,

because at each iteration you pick an object at random,

and there is no correlation with the previous step.

And finally, in our two example,

you pick the object number two.

After bootstrappingexam you have a new data set.

The size of this data set is the number of elements in the original data set.

But its content, as you can see, is slightly different.

Some objects may be missing and other objects

may be present several times, more than once.

Okay. The next method is bagging.

It was the second idea essential for understanding of the random forest algorithm.

Bagging is a shorthand for bootstrap aggregation

and it is a general method for averaging predictions of other algorithms,

not decision trees, but any other algorithm in general.

Bagging works because it reduces the variance of the prediction.

Let me explain it to you.

Here is a formal definition of the bagging algorithm.

You have a training set Z which consists of pairs XI, YI.

B is a number of iterations,

and you have a machine learning method M,

which can be of any nature.

For example, decision trees,

linear regression, neural networks,

support vector machine, anything.

What do you do? You repeat capital B iterations.

At each iteration, you draw a bootstrap sample,

G star B of size N,

from the training data.

Before we have already discussed

the algorithm that allows to produce the bootstrapped replicas.

Then you apply the machine learning method to this data set, G star B.

And finally, you obtain the model.

You will denote this model by f(x) star B.

This algorithm returns an ensemble of B models.

This is how training with bagging works.

The next question is how to do predictions with this ensemble.

In case of regression,

the prediction of the whole aggregate model is

just an average of all the single models F star B.

And in case of classification,

it is a majority vote of all their predictions.

What is the majority vote?

Consider an ensemble with five models,

and if three models vote for the class number one,

and two other models vote for the class number zero,

then the majority vote will be the class number one.

Surprisingly, but the bagging algorithm really works.

In most of situations with any machine learning method in the core,

the quality of such aggregated predictions will be better than of any single prediction.

Why does bagging works?

This phenomenon is based on

a very general principle which is called the bias variance trade off.

This topic is out of scope of our short course,

but I will explain it to you informally.

You can consider the training data set to be random by itself. Why is it so?

What is the training data set?

In the real situation,

the training data set may be a user behavior in Internet,

for example, web browsing,

using search engine, doing clicks on advertisement, and so on.

Other examples of training data sets are physical measurements.

For example, temperature, locations.

date, time, and so on.

And all these measurements are essentially stochastic.

If you can repeat the same experiment in the same conditions,

the measurements actually will be different because of the noise in measurements,

and since user behavior is essentially stochastic and not exactly predictable.

Now, you understand that the training data set itself is random.

In ideal situation, you should collect as much data as

possible to reduce this noise component of the data set.

But in fact, you cannot run the same experiment in

the same situation as it is impossible to invent a time machine,

but you can do a bootstrapping.

So bootstrapping is a method for

generating very similar replicas of the original data set.

What happens here if you train

a machine learning model on the bootstrapped replicas and average them?

Then the aggregated machine learning model will be more stable.

This happens because of averaging.

After averaging, the noisy parts of machine learning model will vanish out,

whereas stable and reliable parts will remain.

The quality of the average model will be better than any single model.

Okay. The summary of this lesson,

you have learned what the bootstrap algorithm and bagging algorithm are.

Now, you can improve the quality of almost any machine learning method by doing bagging.

However, it might be very time consuming for large data sets.