Monte Carlo Tree Search with PUCT

MCTS vs Minimax

Unlike minimax, which exhaustively searches to a fixed depth with a uniform branching strategy, MCTS builds a search tree asymmetrically — spending more time on promising moves and less on obviously bad ones. There's no fixed search depth; instead, the algorithm runs a fixed number of simulations, each traversing from root to a leaf, evaluating, and backing up.

After $N$ simulations, the action policy at the root is proportional to visit counts: $\pi(a) \propto N(a)$. More visits to a child node means more confidence that it's a good move.

Four Phases

Each MCTS simulation has four phases:

  1. Select: Starting from the root, traverse down the tree by choosing the child with the highest UCB score at each node. This balances exploitation (high value) with exploration (low visit count, high prior).
  2. Evaluate: Run neural network inference on the leaf state to get policy logits and a value estimate. The logits are converted to a legal-only prior distribution with softmax (and root Dirichlet noise during self-play).
  3. Expand: After inference, expand the leaf by creating child nodes for all legal moves. Child priors come from the evaluated policy distribution.
  4. Backup: Propagate the value estimate from the leaf back to the root. At each level, the sign flips — the opponent's gain is our loss:

Source: alphazero/gomoku/pvmcts/treenode.py:217-235

```python
def backup(self, value_for_this_node: float) -> None:
    """
    Propagate a leaf value to the root (node owner perspective).

    Parameters
    ----------
    value_for_this_node : float
        Value from this node's current player perspective.
    """
    node = self
    current_perspective_value = value_for_this_node

    while node is not None:
        node.visit_count += 1
        node.value_sum += current_perspective_value

        # Parent is opponent; flip perspective on the way up.
        current_perspective_value = -current_perspective_value
        node = node.parent
```

In this codebase, Evaluate and Expand are implemented together in _evaluate_and_expand(...): inference runs first, then expansion uses the returned policy priors.
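Combined, the two phases reduce to a softmax over legal logits followed by child creation. The sketch below is illustrative, not the project's code: the function name, dictionary-based children, and node layout are assumptions; the real logic lives in _evaluate_and_expand(...).

```python
import math
from types import SimpleNamespace


def evaluate_and_expand(node, legal_moves, logits, value):
    """Hypothetical sketch of the combined Evaluate + Expand step."""
    # Softmax over legal actions only; illegal moves get no prior mass.
    legal_logits = [logits[a] for a in legal_moves]
    m = max(legal_logits)
    exps = [math.exp(x - m) for x in legal_logits]
    total = sum(exps)

    # Expand: one child per legal move, seeded with its prior probability.
    for action, e in zip(legal_moves, exps):
        node.children[action] = {"prior": e / total, "visit_count": 0, "value_sum": 0.0}

    # The value estimate is what gets backed up afterwards.
    return value


# Tiny demo: a 3-action position where action 1 is illegal.
root = SimpleNamespace(children={})
leaf_value = evaluate_and_expand(root, [0, 2], {0: 1.0, 1: 5.0, 2: 1.0}, 0.3)
```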

[Interactive diagram: MCTS — Four-Phase Simulation Loop (① Select ② Evaluate ③ Expand ④ Backup)]

PUCT Selection Formula

The selection step uses the PUCT (Predictor + UCB applied to Trees) formula:

$$UCB(s,a) = Q(s,a) + c_{\text{puct}} \cdot P(s,a) \cdot \frac{\sqrt{N(s)}}{1 + N(s,a)}$$

where:

  • $Q(s,a)$ = mean value of the child (average backed-up value)
  • $P(s,a)$ = prior probability from the neural network's policy head
  • $N(s)$ = visit count of the parent
  • $N(s,a)$ = visit count of the child
  • $c_{\text{puct}}$ = exploration constant (configurable; 2.0 is a common/default value in this project)

Why not plain UCT? Classical UCT uses an exploration bonus based only on visit counts (typically $\sqrt{\log N(s) / (1 + N(s,a))}$), so all moves start nearly equal before enough rollouts accumulate. On a 19×19 board (361 actions), that is too sample-inefficient. PUCT injects the network prior $P(s,a)$ into the exploration term, so search spends simulations on moves the policy already considers plausible while still exploring under-visited branches.
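A quick numeric comparison makes the difference concrete. The helper names below are hypothetical; only the two bonus formulas come from the text:

```python
import math


def uct_bonus(parent_n: int, child_n: int) -> float:
    # Classical UCT: exploration bonus depends only on visit counts.
    return math.sqrt(math.log(max(2, parent_n)) / (1 + child_n))


def puct_bonus(prior: float, parent_n: int, child_n: int) -> float:
    # PUCT: the network prior scales the exploration bonus.
    return prior * math.sqrt(parent_n) / (1 + child_n)


# Two unvisited children under a parent with 100 visits.
# UCT scores them identically; PUCT strongly favors the high-prior move.
same = uct_bonus(100, 0)              # identical for both children
plausible = puct_bonus(0.30, 100, 0)  # high-prior move
unlikely = puct_bonus(0.001, 100, 0)  # near-zero-prior move
```

Under UCT both unvisited moves would be tried before their counts diverge; under PUCT the 0.30-prior move gets hundreds of times the exploration pressure of the 0.001-prior one.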

This formula balances exploitation (high $Q$ — moves that have led to good outcomes) with exploration (high prior $P$, low visits $N(s,a)$ — moves the network thinks are promising but haven't been explored much yet). The neural network's policy prior guides the search toward promising moves before they've been visited enough to have reliable $Q$ values — this is what makes MCTS + neural network so much more powerful than MCTS with random rollouts.

Source: alphazero/gomoku/pvmcts/treenode.py:102-128

```python
def get_ucb(self, child: "TreeNode", c_puct: float) -> float:
    """
    Compute UCB score (parent perspective).

    U = c_puct * prior * sqrt(N_parent) / (1 + N_child)

    Parameters
    ----------
    child : TreeNode
        Child node to evaluate.
    c_puct : float
        Exploration constant.

    Returns
    -------
    float
        UCB score for the child from the parent's perspective.
    """
    q_value = -child.q_value

    parent_n = self.visit_count + self.pending_visits
    child_n = child.visit_count + child.pending_visits

    parent_sqrt = math.sqrt(max(1, parent_n))

    u_value = c_puct * child.prior * (parent_sqrt / (1 + child_n))
    return q_value + u_value
```

Note the sign flip: q_value = -child.q_value because the child stores value from the opponent's perspective. A high value for the child is bad for the parent.
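Putting it together, the select phase is just an argmax of this score over a node's children. The sketch below mirrors the get_ucb logic with a hypothetical dictionary-based node layout (no pending_visits, since virtual loss is out of scope here):

```python
import math


def ucb(parent_visits, child, c_puct=2.0):
    # Parent perspective: negate the child's mean value (sign flip),
    # then add the prior-weighted exploration bonus.
    n = child["visit_count"]
    q = -(child["value_sum"] / n) if n else 0.0
    u = c_puct * child["prior"] * math.sqrt(max(1, parent_visits)) / (1 + n)
    return q + u


def select_child(parent_visits, children, c_puct=2.0):
    # Select: descend to the child with the highest UCB score.
    return max(children, key=lambda a: ucb(parent_visits, children[a], c_puct))


children = {
    "a": {"prior": 0.6, "visit_count": 10, "value_sum": 4.0},  # well-visited, good for opponent
    "b": {"prior": 0.3, "visit_count": 0, "value_sum": 0.0},   # unvisited, decent prior
}
best = select_child(10, children)
```

Move "a" has been visited often and its backed-up value favors the opponent (so its parent-perspective $Q$ is negative), while unvisited "b" still carries the full exploration bonus, so the search tries "b" next.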

Dirichlet Noise

At the root node only, Dirichlet noise is added to the priors at the start of each search, after the root node is expanded:

$$P'(s,a) = (1 - \epsilon) \cdot P(s,a) + \epsilon \cdot \eta_a, \quad \eta \sim \text{Dir}(\alpha)$$

This ensures exploration during self-play training: even if the network strongly favors one move, the noise gives other moves a chance to be explored. Without noise, the search would converge to the network's current beliefs and never discover better moves.

In this codebase, Dirichlet is applied only at the root during self-play (add_noise=True), after legal-move masking + softmax, and only over legal actions (then renormalized). See alphazero/gomoku/pvmcts/search/engine.py (_masked_policy_from_logits, _safe_dirichlet_noise, _ensure_root_noise).

Parameters: $\epsilon = 0.25$ follows the AlphaZero paper. $\alpha = 0.15$ was chosen for this 19×19 board — higher than the $10 / \text{avg\_legal\_moves} \approx 0.028$ heuristic to maintain stronger exploration pressure during self-play. In evaluation and production, Dirichlet noise is disabled (dirichlet_epsilon = 0.0). Training-phase tuning details (including an experimental $\alpha = 0.03$ phase) are in Training History.
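The mixing step itself is a small convex combination over the legal-action priors. A minimal sketch, assuming plain Python lists (the function name is hypothetical, and stdlib Gamma draws stand in for a Dirichlet sampler):

```python
import random


def add_root_dirichlet(priors, alpha=0.15, epsilon=0.25, seed=None):
    """Illustrative sketch: mix Dir(alpha) noise into legal-action priors.

    Hypothetical helper; the project's version lives in engine.py.
    """
    rng = random.Random(seed)
    # Sample Dir(alpha) by normalizing independent Gamma(alpha, 1) draws.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas)
    noise = [g / total for g in gammas]
    # Convex mixture keeps the result a valid probability distribution.
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]


mixed = add_root_dirichlet([0.7, 0.2, 0.1], seed=0)
```

Because both the priors and the noise sum to 1, the mixture still sums to 1; with a small $\alpha$ the Gamma draws are spiky, so the noise concentrates on a few random moves rather than spreading evenly.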

Temperature-Based Action Selection

After MCTS completes, the root visit counts form a policy distribution. Temperature is applied to this visit-based policy (not to network logits). How that distribution is used depends on the training phase:

$$\pi(a) \propto N(a)^{1/\tau}$$

During the first 20 moves (configurable exploration_turns), temperature $\tau = 1.0$ — actions are sampled proportionally to visit counts, encouraging diverse opening play for the training data. After the exploration phase, $\tau \to 0$ (effectively argmax — pick the most-visited move), which produces the strongest deterministic play.

Implementation detail: temperature is applied in _apply_temperature (alphazero/gomoku/alphazero/agent.py) with stable guards (near-zero → argmax one-hot, otherwise exponentiate by $1/\tau$ and renormalize).
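That guard logic can be sketched as follows (an illustrative re-creation of the behavior just described, not the project's code):

```python
def apply_temperature(visit_counts, tau):
    """Hypothetical sketch: turn root visit counts into an action policy."""
    # Near-zero temperature: deterministic argmax one-hot.
    if tau < 1e-3:
        best = max(range(len(visit_counts)), key=visit_counts.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(visit_counts))]
    # Otherwise exponentiate by 1/tau and renormalize.
    powered = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(powered)
    return [p / total for p in powered]


greedy = apply_temperature([10, 30, 60], 0.0)   # one-hot on the most-visited move
sampled = apply_temperature([10, 30, 60], 1.0)  # proportional to visit counts
```

The argmax branch avoids computing $N(a)^{1/\tau}$ with a tiny $\tau$, which would otherwise overflow.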

During evaluation and production serving, this project uses deterministic settings (temperature=0, no Dirichlet noise) for strongest play.

Practical split of responsibilities:

  • Dirichlet noise perturbs root priors before/during search (exploration at search time).
  • Temperature reshapes the final root visit distribution after search (exploration at action-selection time).