At a glance, Monte Carlo Tree Search is nothing but a family of

decision-time planning algorithms which may

be viewed as distant relatives of heuristic search.

Note that Monte Carlo tree search is

a planning algorithm and it requires a fair amount of computation time.

In particular, algorithms of the Monte Carlo tree search family rely heavily on Monte Carlo rollout estimates of the action-value function.

But usually, these algorithms do not throw such estimates away

after an action is made but preserve some of the estimates from state to state.

This slightly reduces the computation time needed for the algorithm to work well.

The main difference compared to

the previously discussed naive version of planning with Monte Carlo

estimates is the presence of two policies: the tree policy and the rollout policy.

The latter is already familiar to us and in practice,

it should be as good and as fast as possible.

The former, however, is not yet familiar to us.

The tree policy is the policy which actually determines the direction of the search while respecting the uncertainty of the estimates.

Let us now discover the main structure of

the Monte Carlo tree search algorithms and then discuss the details of the UCT algorithm,

the most popular representative of the Monte Carlo tree search family.

Monte Carlo tree search,

like the naive algorithm we have discussed previously,

uses Monte Carlo rollout to estimate the action values.

However, unlike the previously discussed alternatives,

Monte Carlo Tree search algorithms additionally build a tree.

This tree consists of the initial segments, that is, the states and actions, of trajectories that have previously obtained high returns.

In some sense, these beginnings of trajectories are the most interesting to

us because they are most likely to lead us to the best possible trajectory.

We will discuss how to build such a tree in a while.

For now, let's discuss the main scheme of the Monte Carlo tree search algorithm.

Please look at the diagram on the slide.

On this diagram, empty circles represent states,

and the filled circles correspond to actions.

A typical Monte Carlo tree search algorithm can be divided into four main phases: selection, expansion, simulation and backup.

To produce meaningful decisions,

these phases should be repeated as many times as possible, and when the time runs out, the algorithm is stopped and the best action for the current state, that is, the root of the tree, is selected.

All of these phases describe how a single rollout is performed.

At first, during the selection phase,

the rollout starts from the root of the tree,

that is, the current state, and descends the tree, selecting actions according to the tree policy.

Once such a rollout enters a leaf of the tree,

the expansion phase is launched.

During the expansion phase,

a new node or nodes which are directly adjacent

to the current leaf node are added to the tree.

For example, a new node may be the new state an agent finds itself in after making an action in the leaf state.

More typically, however, the new state is added to the tree together with all actions available in that state.

Then when the simulation exits the tree,

the rollout policy takes control.

All the subsequent actions till the very end of the episode are made

according to the rollout policy, which may be completely random, for example.

When the episode is over,

the return of the rollout becomes available.

And this return should be propagated back to each of the state-action pairs that were visited by the current rollout. This backup is performed by simply storing, in each such state-action node, the cumulative discounted return from that node till the end of the episode.
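The four phases just described can be sketched in code. Below is a minimal, illustrative Python sketch on a made-up two-step toy environment; every class and function name here is invented for illustration, and the tree policy shown simply anticipates the UCT rule covered later in the lecture.

```python
import math
import random

class ToyEnv:
    """A hypothetical two-step environment: at every step, action 1 yields
    reward 1 and action 0 yields reward 0; the episode ends after two steps."""
    def actions(self, s):
        return [0, 1]
    def is_terminal(self, s):
        return s >= 2
    def step(self, s, a):
        return s + 1, float(a), s + 1 >= 2

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> (child Node, immediate reward)
        self.visits = 0      # N(s): simulations that passed through this state
        self.n = {}          # N(s, a): simulations that took action a here
        self.q = {}          # Q(s, a): average return observed after (s, a)

def tree_policy(node, c=1.4):
    # UCT rule: unvisited actions score infinity, so each is tried at least once.
    def score(a):
        if node.n[a] == 0:
            return float("inf")
        return node.q[a] + c * math.sqrt(math.log(node.visits) / node.n[a])
    return max(node.children, key=score)

def rollout(env, state):
    # Simulation phase: finish the episode with a uniformly random policy.
    ret = 0.0
    while not env.is_terminal(state):
        state, r, _ = env.step(state, random.choice(env.actions(state)))
        ret += r
    return ret

def mcts_iteration(root, env):
    node, path = root, []
    # 1. Selection: descend the tree according to the tree policy.
    while node.children:
        a = tree_policy(node)
        path.append((node, a))
        node = node.children[a][0]
    # 2. Expansion: add the leaf state's actions (and resulting states) to the tree.
    if not env.is_terminal(node.state):
        for a in env.actions(node.state):
            nxt, r, _ = env.step(node.state, a)
            node.children[a] = (Node(nxt), r)
            node.n[a], node.q[a] = 0, 0.0
        a = tree_policy(node)
        path.append((node, a))
        node = node.children[a][0]
    # 3. Simulation: complete the rollout outside the tree.
    ret = rollout(env, node.state)
    # 4. Backup: propagate the return to every visited state-action pair.
    for parent, a in reversed(path):
        ret += parent.children[a][1]   # add the immediate reward on the way up
        parent.visits += 1
        parent.n[a] += 1
        parent.q[a] += (ret - parent.q[a]) / parent.n[a]

random.seed(0)
env, root = ToyEnv(), Node(0)
for _ in range(200):
    mcts_iteration(root, env)
best_action = max(root.children, key=lambda a: root.q[a])
```

After a couple of hundred iterations, the root statistics identify action 1, the one that actually earns reward, as the best choice in this toy environment.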

Well, the last two phases should be already familiar to you,

but several things are still unclear.

First and most importantly,

what does the tree policy look like?

Second, what are the strategies of choosing the action once the time is up?

Well, because Monte Carlo tree search is not a single algorithm but a family of algorithms,

there are plenty of different choices of the tree policy.

However, we are going to cover only one choice, mostly because of its effectiveness and popularity: the Upper Confidence Bounds for Trees algorithm, abbreviated as UCT.

What the tree policy should do is balance exploitation, that is, concentration on the directions of the search that seem most profitable, against exploration, which is needed mostly because of the uncertainty in the current estimates and because of the large search space.

Indeed, there are multiple sources of uncertainty in

our estimates such as stochasticity in the environment,

finite sample estimate of the return,

and random action choice of the rollout policy.

To effectively balance exploration and exploitation, there are a lot of approaches, but the simplest one is to treat each action selection as an independent multi-armed bandit problem.

This problem could be solved with many techniques.

For example, with the upper confidence bound algorithm known as UCB.

The application of the UCB algorithm as a tree policy is, in fact, what is called the UCT algorithm.

So, the tree policy in the UCT algorithm is defined as an argmax over all actions of an expression with two terms. The first term is an approximate action-value function, defined as the average Monte Carlo return gained by simulations after making action a in state s. This first term promotes exploitation because it favors actions which were previously shown to lead to large action values. The second term is the one that encourages exploration.

Let's see why. N of s in the numerator is the total number of simulations that have previously visited state s. N of s and a in the denominator is the total number of simulations that have made action a in state s. When action a is made in state s, the denominator increases, and it increases faster than the numerator because of the logarithm.

That effectively decreases the contribution

of the whole exploration term for this action.

On the other hand, incrementing N of s and a after making action a in

state s effectively increases the exploration values of all other actions in that state.

That is so because N of s increases every time an agent finds itself in

state s. But N of s and a increases only for the action that was committed.
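The two-term expression just described can be written out as a small helper. This is a hedged sketch: the function name and the particular value of the constant C are illustrative choices, not from the slide.

```python
import math

# The UCT score for one action: Q(s, a) plus C times the square root of
# ln N(s) divided by N(s, a).
def uct_score(q_sa, n_s, n_sa, c=1.4):
    if n_sa == 0:
        return float("inf")   # unvisited actions are always tried at least once
    return q_sa + c * math.sqrt(math.log(n_s) / n_sa)

base = uct_score(q_sa=0.5, n_s=10, n_sa=5)
# One more visit to this action: both N(s) and N(s, a) grow, and the score shrinks.
after_self = uct_score(q_sa=0.5, n_s=11, n_sa=6)
# One more visit to a *different* action: only N(s) grows, and the score rises.
after_other = uct_score(q_sa=0.5, n_s=11, n_sa=5)
```

Running the three calls confirms the behavior described above: visiting an action lowers its own exploration bonus, while visits to sibling actions raise it.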

This exploration exploitation balance has not only good theoretical properties,

but it is also very simple to implement and has proven to be very effective in practice.

Please note that the constant C in front of the second term can be used to increase or decrease the overall amount of exploration made by the tree policy.

Also, note that the exploration term ensures that each action is chosen at least once. That is so because whenever N of s and a is zero, the second term is infinite.

The last but certainly not the least topic we need to discuss about

Monte Carlo tree search is how to actually select

an action when planning is finished or interrupted.

Remember, an agent stands at the root of the tree built by Monte Carlo tree search and needs to choose an action from that state.

In fact, there are many possible strategies to select

such action while making use of the planning result.

The most straightforward one is to select

the action which has the biggest estimate of action-value function.

Despite being simple and usually effective,

this may not be the best strategy.

One case when it fails is the case of

very rare but also very large returns, outlier returns.

If such returns are possible in your environment,

you may benefit more from the robust strategy of selecting the most visited action,

that is, the one which has the greatest N of s and a.

You may also want to continue planning until the first two strategies select the same action.

This approach is called Max-robust strategy

and was shown to be particularly effective for the game of Go.

Another strategy is called the Secure strategy.

It is about choosing the action that maximizes the lower confidence bound.

More specifically, it maximizes the same expression as the tree policy, but with the plus sign in front of the second term changed to a minus.

That may be the way to go if you are solving a real-life task with some expensive robot as an agent. In that case, you may want to sacrifice some amount of reward in favor of safe trajectories, that is, the ones that will definitely not smash an expensive robot.
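As a hedged illustration, the strategies just named can be contrasted in code on a made-up set of root statistics; all helper names and numbers here are invented, not from the lecture.

```python
import math

# `stats` maps each action to a pair (Q(s, a), N(s, a)); n_s is N(s) at the root.

def max_child(stats):
    return max(stats, key=lambda a: stats[a][0])      # biggest action value

def robust_child(stats):
    return max(stats, key=lambda a: stats[a][1])      # most visited action

def max_robust_ready(stats):
    # Max-robust strategy: keep planning until both strategies agree.
    return max_child(stats) == robust_child(stats)

def secure_child(stats, n_s, c=1.4):
    # Same expression as the tree policy, with the plus changed to a minus.
    def lower_bound(a):
        q, n = stats[a]
        return q - c * math.sqrt(math.log(n_s) / n) if n > 0 else float("-inf")
    return max(stats, key=lower_bound)

# A rare outlier return has inflated Q of a barely visited action: max_child
# falls for it, while the robust and secure choices prefer the reliable one.
stats = {"risky": (3.0, 1), "steady": (1.0, 99)}
```

On these statistics, max_child picks "risky", robust_child and secure_child pick "steady", and max_robust_ready reports that planning should continue.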

Well, now let me conclude with the properties of Monte Carlo tree search.

Monte Carlo tree search is domain independent, that is, it does not require any hand-designed heuristics.

It performs an asymmetric search which means that it allows us

to focus computations on the most promising directions of the search.

Monte Carlo tree search is also what is called an anytime algorithm.

That means that you can stop the algorithm at any moment, and it will output the best decision it has found so far.

Another benefit of Monte Carlo tree search is that it saves a lot of computation time, because it preserves all the estimates of the subtree whose root is the state the agent finds itself in after committing an action.
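The subtree reuse just mentioned can be sketched as follows; the dict-based tree layout here is purely illustrative, and children are keyed by the observed next state to allow for stochastic environments.

```python
# Purely illustrative: each tree node is a dict holding its visit count and,
# per action, the child subtrees keyed by the next state that was observed.
def advance_root(tree, action, next_state):
    """After committing `action` and observing `next_state`, the matching
    subtree becomes the new root; its estimates survive, the rest is dropped."""
    return tree["children"][action][next_state]

old_root = {"visits": 100,
            "children": {"left":  {"s1": {"visits": 60, "children": {}}},
                         "right": {"s2": {"visits": 40, "children": {}}}}}
new_root = advance_root(old_root, "left", "s1")   # keeps the 60 stored visits
```

The next round of planning then starts from `new_root` with its statistics already warm, instead of from an empty tree.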

It is also incredibly simple and easy to implement.

However, its performance depends on the quality of the rollout policy.

And also Monte Carlo tree search may be very computationally intensive.

And the last one: Monte Carlo tree search is a great algorithm which is very important to know about.