The Policy-Value Network

Dual-Head Architecture

A single convolutional neural network processes the board state and produces two outputs simultaneously:

Policy head: 361 unnormalized logits (one score per action)
Value head: a scalar in $[-1, 1]$ predicting game outcome from the current player's perspective ( $+1$ win, $0$ draw, $-1$ loss)

This replaces both components of classical minimax: the evaluation function (value head) and move ordering (policy head). The combined loss function trains both heads jointly:

L = (z - v)^2 - \pi^T \log p + c\|\theta\|^2

where $z$ is the game outcome from the current player's perspective ( $+1$ win, $0$ draw, $-1$ loss), $v$ is the predicted value, $\pi$ is the MCTS policy (normalized visit counts), $p$ is the network's policy distribution, and $c\|\theta\|^2$ is L2 regularization. The first term pushes the value head to predict game outcomes; the second term pushes the policy head to match the MCTS-refined move distribution.

In implementation, the policy head outputs logits. During search, illegal moves are masked and softmax is applied to convert logits into priors. During training, cross-entropy is computed with log_softmax(out_policy). The L2 regularization term is applied through optimizer weight_decay in the training setup, rather than being added as a separate explicit term inside the trainer step.

ResNet Architecture

The network follows a shared-backbone, dual-head design:

Input and stem: The 13x19x19 state tensor is lifted into a 128-channel feature space by a 3x3 convolution, followed by batch normalization and ReLU.
Backbone: 12 residual blocks repeatedly refine shared features while preserving information flow through skip connections (x + F(x)).
Policy head (move selection): A lightweight 1x1 convolution and linear projection convert shared features into logits over all 361 actions.
Value head (position evaluation): A 1x1 convolution compresses shared features, then a two-layer MLP (361 -> 256 -> 1 for a 19x19 board) outputs a scalar through tanh, producing a value in [-1, 1].

In this implementation, policy_channels = 2 and value_channels = 1.

Source: alphazero/gomoku/model/policy_value_net.py:105-148

      # alphazero/gomoku/model/policy_value_net.py:105-148
self.start_block = nn.Sequential(
    nn.Conv2d(num_planes, num_hidden, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(num_hidden),
    nn.ReLU(),
)
self.res_blocks = nn.ModuleList([ResBlock(num_hidden) for _ in range(num_resblocks)])

self.policy_head = nn.Sequential(
    nn.Conv2d(num_hidden, policy_channels, kernel_size=1, bias=False),
    nn.BatchNorm2d(policy_channels),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(policy_channels * conv_out, game.action_size),
)
self.value_head = nn.Sequential(
    nn.Conv2d(num_hidden, value_channels, kernel_size=1, bias=False),
    nn.BatchNorm2d(value_channels),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(value_channels * conv_out, VALUE_HEAD_FC_UNITS),
    nn.ReLU(),
    nn.Linear(VALUE_HEAD_FC_UNITS, 1),
    nn.Tanh(),
)

def forward(self, x):
    x = self.start_block(x)
    for res_block in self.res_blocks:
        x = res_block(x)
    policy_logits = self.policy_head(x)
    value = self.value_head(x)
    return policy_logits, value

python

The original AlphaZero paper used 256 channels and 20 residual blocks with TPU-scale compute. 128 channels and 12 blocks is the balance point for the available GPU budget while still providing sufficient model capacity for 19x19 Gomoku.

- PolicyValueNet architecture diagram -

State Encoding

With the default history_length=5, the network sees a 13-channel tensor (8 + history_length), each channel being a 19x19 binary or scalar plane:

Channels	Description
0	Current player's stones (binary)
1	Opponent's stones (binary)
2	Empty positions (binary)
3	Last move indicator (one-hot)
4	Current player's capture score (ratio, clipped 0–1)
5	Opponent's capture score (ratio, clipped 0–1)
6	Color indicator (1.0 for Black, -1.0 for White)
7	Forbidden move mask (double-three positions, binary)
8–12	Move history (last 5 moves as one-hot planes)

Source: alphazero/gomoku/core/gomoku.py:398-463

      # alphazero/gomoku/core/gomoku.py:398-463 (get_encoded_state, simplified — intermediate variables inlined for readability)
features = np.zeros((batch_size, 8 + self.history_length, h, w), dtype=np.float32)

# Base occupancy planes
features[:, 0] = boards == next_players[:, None, None]
features[:, 1] = boards == (3 - next_players)[:, None, None]
features[:, 2] = boards == EMPTY_SPACE

# Last move one-hot plane
if valid_idx.size:
    features[valid_idx, 3, ys, xs] = 1.0

# Capture progress planes (normalized 0..1)
if self.enable_capture and self.capture_goal > 0:
    features[:, 4] = my_ratio[:, None, None]
    features[:, 5] = opp_ratio[:, None, None]

# Color plane (+1 black, -1 white)
features[:, 6] = color_vals[:, None, None]

# Forbidden points for current player (double-three)
if self.enable_doublethree:
    for b_idx in range(batch_size):
        empties = np.flatnonzero(boards[b_idx] == EMPTY_SPACE)
        xs, ys = bulk_index_to_xy(empties, self.col_count)
        for x, y in zip(xs, ys, strict=False):
            if detect_doublethree(
                boards[b_idx], int(x), int(y), int(next_players[b_idx]), self.row_count
            ):
                features[b_idx, 7, int(y), int(x)] = 1.0

# Move history planes (one-hot per past move)
hist_buffer = np.full((batch_size, self.history_length), -1, dtype=np.int32)
for idx, st in enumerate(states):
    if st.history:
        capped = st.history[: self.history_length]
        hist_buffer[idx, : len(capped)] = capped
start_ch = 8
for k in range(self.history_length):
    moves = hist_buffer[:, k]
    valid = moves >= 0
    if not np.any(valid):
        continue
    xs = moves[valid] % self.col_count
    ys = moves[valid] // self.col_count
    features[valid, start_ch + k, ys, xs] = 1.0

python

This encoding gives the network everything it needs: the current position, game-specific context (captures, forbidden moves), temporal context (move history), and player identity. The capture score channels are particularly important for this Gomoku variant — they tell the network how close each player is to the 5-pair capture win condition.

The get_encoded_state method supports batch encoding and has an optional native C++ encoding path for faster inference in production. D4 symmetry augmentation is applied in the training dataset pipeline (ReplayDataset), not inside get_encoded_state.

Prev Next