The Brilliant Idea Behind Monte Carlo Tree Search
Before you read this article: Have you read my previous article on search algorithms?
In the world of complex searching algorithms, we often find ourselves navigating between two distinct ends of the spectrum: simplicity and complexity.
Most notably, (extremely) simple algorithms can be characterized as below:
- Breadth-first: exploring each level of nodes one by one
- Depth-first: exploiting a single solution path for as long as possible
The problem is that both simple approaches operate in a rigid manner, without having any adjustable parameters.
Now, consider the converse — a sophisticated algorithm that employs a finely tuned blend of intelligent heuristics and hyperparameters. Below is the list of required arguments for Simulated Annealing:
“In order to apply the simulated annealing method to a specific problem, one must specify the following parameters: the state space, the energy (goal) function `E()`, the candidate generator procedure `neighbour()`, the acceptance probability function `P()`, the annealing schedule `temperature()` AND the initial temperature `init_temp`.”
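To make this parameter burden concrete, here is a minimal sketch of what a generic simulated annealing driver could look like. All names and signatures are illustrative placeholders rather than any particular library's API; the point is that every callable you pass in is yet another heuristic to design and justify:

```python
import math
import random

def simulated_annealing(initial_state, energy, neighbour, acceptance, temperature,
                        init_temp, n_steps=10_000):
    """Generic simulated annealing loop: every argument is a tunable heuristic."""
    state = initial_state
    best, best_e = state, energy(state)
    for step in range(n_steps):
        temp = temperature(init_temp, step, n_steps)          # annealing schedule
        candidate = neighbour(state)                          # candidate generator
        e_old, e_new = energy(state), energy(candidate)
        if acceptance(e_old, e_new, temp) > random.random():  # acceptance probability P()
            state = candidate
            if e_new < best_e:
                best, best_e = candidate, e_new
    return best

# One common (but again, hand-picked) choice for the acceptance probability:
# always accept improvements, otherwise accept with probability exp(-dE / temp).
def metropolis(e_old, e_new, temp):
    return 1.0 if e_new < e_old else math.exp(-(e_new - e_old) / max(temp, 1e-12))
```

Every one of those callables hides further design decisions, which is exactly where the adaptability discussed below comes from.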
The unique strength of this approach lies in its adaptability. By fine-tuning the individual arguments assigned to each heuristic, such complex algorithms can be tailored to address specific scenarios effectively.
However, this adaptability comes at a cost — a lack of generalization. The algorithm’s fine-tuning may perform exceptionally well in one context but struggle in another.
This trade-off between adaptability and generalization presents its own set of challenges and potential problems. Explainability, for example, may be an issue: can you explain “why the energy function is chosen logarithmic” or “how the temperature relates to the energy”?
The Fundamental Challenge Behind All Complex Search Algorithms
Ockham’s Razor, a fundamental principle in philosophy, reminds us to favor simplicity when complexity doesn’t provide significant benefits.
Whether it’s temperature, energy levels, or learning rates, at its core, nearly every challenge in the world of complex search problems should boil down to a fundamental trade-off: exploitation vs. exploration.
These two main objectives can then be expressed simply as:
- max R: Exploitation involves diving deeper into the most promising solutions, thus maximizing the expected reward R.
- min H: Exploration involves giving each choice its fair share of consideration, treating every potential solution equally, and, in the process, minimizing entropy H and ignorance.
If we distill the problem to finding the right equilibrium between exploration and exploitation, we find that we only need to tune a single hyperparameter λ, referred to as the exploration weight:
max(R) + λ · min(H) = max(R) + λ · max(−H) = max(R − λ · H)
This single objective leads to the Upper Confidence Bound (UCB), a heuristic that keeps selecting the solutions that look most promising but haven't been explored much.
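For reference, here is a minimal sketch of the classic UCB1 score; the variable names are mine, and the exploration weight plays the role of the λ above:

```python
import math

def ucb1(total_reward: float, visits: int, parent_visits: int,
         exploration_weight: float = math.sqrt(2)) -> float:
    """UCB1: average reward plus an exploration bonus that shrinks with every visit."""
    if visits == 0:
        return float("inf")  # options that were never tried are selected first
    exploitation = total_reward / visits
    exploration = exploration_weight * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration
```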
For more on UCB in the context of multi-armed bandits, I recommend reading this article:
Fittingly, this strategy is employed by a powerful algorithm called Monte Carlo Tree Search (MCTS), which has revolutionized AI, particularly game playing and decision-making.
Originally designed for complex games such as chess and Go, MCTS has since found applications in various domains, from robotics to finance. In this article, we will take a detailed look at what MCTS is, how it works, and its real-life use cases.
The Advantages of MCTS
Before we dive into the inner workings of MCTS, it’s essential to understand why it was developed. MCTS was created to address three significant challenges:
- Complexity of Choices: In games like chess and Go, there are a vast number of possible moves and game states. This complexity makes it difficult for traditional search algorithms to make good decisions.
- Dealing with Uncertainty: Sometimes, we need to make decisions without having all the information we’d like. For instance, self-driving cars must make quick choices on the road without knowing exactly what other drivers will do. MCTS helps in such situations.
- Adaptability: Many advanced algorithms work well in specific situations but struggle in others. For example, a robot movement planning algorithm might not be suitable for financial risk prediction. MCTS was designed to be adaptable and work effectively in various domains. By simply adjusting hyperparameters and fine-tuning the balance between exploration and exploitation, this search algorithm can be seamlessly adapted to diverse domains and problem spaces.
How MCTS Works (In 4 Steps)
In general, this algorithm consists of a single loop, which can be meaningfully split into the 4 parts listed below:
- Selection: MCTS navigates the existing tree of game states to find the most promising node to explore. This is most typically done using a policy that balances exploration and exploitation. The widely-used Upper Confidence Bounds for Trees (UCT) formula helps in this process, ensuring that both unexplored nodes and promising candidates are considered.
- Expansion: Once selection reaches a leaf node (one that still has untried actions), the algorithm expands the tree by adding one or more child nodes that represent possible actions. This is a critical step, as it determines the range of actions the algorithm will consider in the future.
- Simulation: MCTS plays out the game from the current node to the end using random actions or a rollout policy. The outcome of these simulations provides an estimate of the node’s value. Repeating this process multiple times helps in building a more accurate estimate of the node’s worth.
- Backpropagation: The statistics of nodes in the tree are updated based on the results of the simulations. This phase ensures that the information about the value of different game states is propagated up the tree, allowing the algorithm to make more informed decisions in the future.
The iteration continues until a termination condition is satisfied, for example when a maximum number of iterations is reached or the current solution is good enough.
The MCTS+UCT implementation
Now, let’s dive into implementing a Monte Carlo Tree Search (MCTS) scenario, where we want to train a Reinforcement Learning model using Q-functions (click the link below to learn more):
Specifically, the goal is to learn the action(s) that maximize the expected reward in a tree of nodes, where each node is associated with an immediate reward.
At the beginning, with no prior information, the algorithm needs to collect some environment data before it can find the most profitable action. Luckily, the essential data can be gathered by defining an environment which lets all actors:
- perform an action,
- transition to a new state,
- receive new observations if the current state is private, and
- obtain rewards based on the action taken,
and repeat the whole cycle again.
To make this work across other use cases, we’ll create a bandit interface for simulating dynamic environments, requiring the following essential methods:
- `def reset(self) -> list[Y]`: Initialize the bandit environment to its initial state distribution and return the initial observations, if any.
- `def actions(self) -> list[A]`: Return the list of possible actions that can be performed in the current state of the bandit.
- `def step(self, a: A) -> tuple[float, list[Y], bool]`: Perform an action in the bandit environment and return the reward, the new observations, and a boolean flag indicating whether the current state is terminal.
- `def observe(self, s: S) -> list[Y]`: Derive observations from the actual private state of the bandit to assist in making informed decisions.
- `def fin(self) -> bool`: Return `True` if the entire simulation has ended; otherwise, return `False`.
When all of these methods are properly implemented (with `observe` being optional), the bandit interface can be effectively used to simulate an environment.
This helps us experiment with various algorithms, including MCTS, to make smart decisions within a simulated environment.
Adhering to this interface gives us a flexible framework for testing and developing decision-making algorithms in diverse scenarios, and for optimizing the balance between exploration and exploitation while navigating complex problem spaces.
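Under these assumptions, the interface could be sketched as a small abstract base class; the type variables Y (observation), A (action) and S (private state) are placeholders, and the interaction cycle listed earlier becomes a plain loop:

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

Y = TypeVar("Y")  # observation type
A = TypeVar("A")  # action type
S = TypeVar("S")  # (private) state type

class Bandit(ABC, Generic[Y, A, S]):
    """Minimal interface for a simulated, possibly partially observable environment."""

    @abstractmethod
    def reset(self) -> list[Y]:
        """Reset the environment and return the initial observations, if any."""

    @abstractmethod
    def actions(self) -> list[A]:
        """List the actions available in the current state."""

    @abstractmethod
    def step(self, a: A) -> tuple[float, list[Y], bool]:
        """Apply action a; return (reward, new observations, is_terminal)."""

    def observe(self, s: S) -> list[Y]:
        """Optional: derive observations from a private state."""
        return []

    @abstractmethod
    def fin(self) -> bool:
        """Return True once the entire simulation has ended."""

def run_episode(env: Bandit, policy) -> float:
    """Perform an action, transition, collect the reward -- and repeat the cycle."""
    env.reset()
    total = 0.0
    while not env.fin():
        a = policy(env.actions())                      # pick an action
        reward, observations, terminal = env.step(a)   # transition and observe
        total += reward                                # accumulate the obtained rewards
        if terminal:
            break
    return total
```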
See the MCTS implementation below, which learns the Q-function by exploring the state space and maximizes the expected reward by picking the most promising action:
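As a simplified, illustrative sketch (not the repository's exact code), an MCTS learner with UCT selection could be built on top of the Bandit interface above. It assumes the environment can be recreated with a make_env() factory and replayed from the root on every iteration, since the interface offers no way to copy a state:

```python
import math
import random

class Node:
    """One node of the search tree; its statistics define the Q-value estimate."""
    def __init__(self, action=None):
        self.action = action
        self.children = {}        # action -> Node
        self.visits = 0
        self.total_reward = 0.0

    def q(self):
        return self.total_reward / self.visits if self.visits else 0.0

def uct(parent, child, c):
    """UCT score: Q-value plus exploration bonus (unvisited children come first)."""
    if child.visits == 0:
        return float("inf")
    return child.q() + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(make_env, n_iter=1_000, c=math.sqrt(2), max_rollout=50):
    root = Node()
    for _ in range(n_iter):
        env = make_env()
        env.reset()
        node, path, ep_reward, done = root, [], 0.0, False

        # 1. Selection: descend through fully expanded nodes using UCT
        while not done and node.children and len(node.children) == len(env.actions()):
            node = max(node.children.values(), key=lambda ch: uct(node, ch, c))
            r, _, done = env.step(node.action)
            ep_reward += r
            path.append(node)

        # 2. Expansion: add a child for one untried action, if the episode continues
        if not done and not env.fin():
            untried = [a for a in env.actions() if a not in node.children]
            if untried:
                a = random.choice(untried)
                child = Node(action=a)
                node.children[a] = child
                r, _, done = env.step(a)
                ep_reward += r
                path.append(child)

        # 3. Simulation: random rollout until the episode (or a depth cap) ends
        depth = 0
        while not done and not env.fin() and depth < max_rollout:
            r, _, done = env.step(random.choice(env.actions()))
            ep_reward += r
            depth += 1

        # 4. Backpropagation: update visit counts and reward sums along the path
        root.visits += 1
        for n in path:
            n.visits += 1
            n.total_reward += ep_reward

    # The learned policy: the root action with the highest Q-value estimate
    return max(root.children.values(), key=Node.q).action
```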
The Key Role That UCB Plays In MCTS
In this context, the selection policy relies on the Upper Confidence Bound (UCB) heuristic, which adopts an optimistic approach by prioritizing the exploration of subtrees that show potential for high performance but have been insufficiently explored. Namely, it considers two key factors:
- Exploitation Score: Subtrees with higher expected rewards are favored over those with lower expected rewards. This encourages the exploitation of known promising paths within the search tree.
- Exploration Score: Subtrees that remain unexplored are given priority over those that have already been extensively explored. This ensures that less-charted territory is adequately explored.
When both factors present conflicting signals, the UCB heuristic steps in to resolve the trade-off. It assigns a score to each option, striking a balance between its past performance and the amount of exploration it has received.
Over time, as more and more data is gathered and the estimates are refined, the exploration term becomes more and more evenly distributed across the options, until there is not much left to explore.
Then, clearly, options that have been explored extensively and have underperformed are gradually ignored in favor of options with more reward potential.
This capability enables MCTS with a UCB selection policy to rapidly converge towards the best-performing options, accomplishing this in fewer iterations than many competing approaches. Makes sense, right?
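To see how this plays out with some (purely hypothetical) numbers, compare two options after 100 parent visits:

```python
import math

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    # same UCB1 score as sketched earlier, repeated so this snippet runs on its own
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Option A: mean reward 0.50, already visited 60 times
# Option B: mean reward 0.70, visited only 40 times
score_a = ucb1(total_reward=30.0, visits=60, parent_visits=100)  # ~0.89
score_b = ucb1(total_reward=28.0, visits=40, parent_visits=100)  # ~1.18
print(score_a, score_b)  # B is selected next: higher mean AND a bigger exploration bonus
```

The heavily explored underperformer A keeps losing the comparison, so it is gradually ignored, exactly as described above.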
Further Challenges
While MCTS has shown remarkable success, it is not without its challenges.
- High complexity can be a bottleneck in real-time use cases. Researchers are continuously working on optimizing and parallelizing MCTS algorithms to make them more efficient.
- Long rollouts: when an episode involves a large number of actions, the simulation phase can become excessively time-consuming, extending the time spent before backpropagation can even begin.
To cope with the second problem, a horizon-depth strategy can be employed: within a horizon of some depth (let's say k), the rollouts are performed exactly as before; beyond it, the evaluation value is estimated using a simple ML model, e.g. a polynomial regression model or a support vector classifier.
However, it's important to strike the right balance with this horizon depth. If it is set too high, it doesn't yield much of a performance boost; if it is set too low, it fails to benefit from genuine rollouts and may produce evaluation scores that look reasonable at first glance but are ultimately inaccurate.
Properly configuring the number of levels could significantly enhance the performance of MCTS while ensuring consistent and reliable outcomes.
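As an illustrative sketch of this idea (the featurize helper and the sklearn-style value_model are placeholders for whatever representation and regressor you choose), a depth-limited rollout could hand over to the learned estimator once the horizon k is exceeded:

```python
import random

def truncated_rollout(env, k, value_model=None, featurize=None):
    """Random rollout for at most k steps; beyond the horizon, fall back to a value estimate."""
    total, depth, done = 0.0, 0, False
    while not done and not env.fin() and depth < k:
        reward, _, done = env.step(random.choice(env.actions()))
        total += reward
        depth += 1
    if not done and not env.fin() and value_model is not None and featurize is not None:
        # e.g. a polynomial regression model trained on earlier, full-length rollouts
        total += float(value_model.predict([featurize(env)])[0])
    return total
```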
Final Words
Whether you're a game enthusiast or an AI researcher, understanding MCTS is essential for rounding out your knowledge of this field.
If you ask my opinion, it is the most basic implementation of machine attention: Picking whichever option gives the most promising vibes while reminding oneself to try other options.
It is, in fact, a very simple algorithm that relies on a simple philosophy: it is not the simple models with low accuracy that are dangerous, but the complex models with low explainability.
If you have serious doubts that a very sophisticated model might be flawed, it most probably is, and there is probably an easier solution to the problem.
Why bother skimming every map on Earth if all you want is to find the nearest lake to swim in?
The repository can be found here; I recommend running the experiment several times and checking the results in the logs to compare how different exploration weights behave.
If you have read until the end, feel free to share your feedback with me, and stay tuned for my next article!