C++ Extensions: pybind11 for Performance

Why C++

Two performance bottlenecks emerged in the pure Python implementation:

  1. Double-three detection requires deep recursive rule checking, called for every legal move computation during MCTS expansion. On a 19x19 board with 200+ legal moves per position, Python function call overhead becomes significant.
  2. MCTS tree operations — selection, expansion, and backup — are called hundreds to thousands of times per move. Each operation involves pointer traversal, arithmetic, and dictionary lookups. Python loop overhead across thousands of iterations adds up.
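
The scale of that per-call overhead is easy to see with a standalone microbenchmark (illustrative only — this is not the project's benchmark, and absolute numbers vary by machine):

```python
import timeit

# Stand-in for a per-move rule check; the real check (double-three detection)
# does far more work, but still pays Python call overhead on every invocation.
def is_legal(move):
    return move % 3 != 0

moves = list(range(200))  # roughly the legal-move count on a 19x19 board

def per_move_calls():
    return [is_legal(m) for m in moves]   # one Python function call per move

def inlined():
    return [m % 3 != 0 for m in moves]    # identical work, no per-move calls

t_calls = timeit.timeit(per_move_calls, number=2000)
t_inline = timeit.timeit(inlined, number=2000)
# The gap between t_calls and t_inline is pure function-call overhead.
```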

The Python implementation remains the default and serves as the reference. C++ is optional: setting use_native: true in the config enables the native path. Both produce identical results — the C++ modules are drop-in replacements, not alternative implementations.
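
A flag-driven dispatch of this kind might look like the following sketch (the names PyEngine and load_engine are illustrative, not the project's actual API):

```python
# Hypothetical sketch of config-driven engine selection; PyEngine and
# load_engine are illustrative names, not the project's actual API.

class PyEngine:
    """Pure-Python reference implementation (always available)."""
    name = "python"

def load_engine(config: dict):
    """Return the native module when requested and importable, else the Python engine."""
    if config.get("use_native", False):
        try:
            from gomoku.cpp_ext import gomoku_cpp  # native drop-in replacement
            return gomoku_cpp
        except ImportError:
            pass  # fall back to the reference implementation
    return PyEngine()

engine = load_engine({"use_native": False})
```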

Two Modules

Two pybind11 modules provide the C++ acceleration:

renju_cpp: Low-level rules module. Contains CForbiddenPointFinder, which implements double-three detection — sourced from renju.se and adapted from the same implementation used in the minimax engine. It performs direction-based open-three counting on a padded board, handling the recursive edge cases that the earlier bitmask pattern matching got wrong.
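
The core idea — scan four directions through the candidate point on a padded board — can be illustrated with a deliberately simplified sketch. This is NOT the CForbiddenPointFinder logic, which also handles broken threes, overlines, and the recursive forbidden-point exclusions; only the exact pattern .XXX. is recognized here:

```python
# Simplified illustration of direction-based scanning on a padded board.
# NOT CForbiddenPointFinder: the real code recognizes many more shapes and
# applies recursive forbidden-point exclusions.
N, PAD = 19, 4
SIZE = N + 2 * PAD
EMPTY, BLACK, WALL = 0, 1, 3
DIRS = [(1, 0), (0, 1), (1, 1), (1, -1)]  # the four scan directions

def padded_board():
    """19x19 of EMPTY inside a 4-cell WALL border, so no bounds checks are needed."""
    b = [[WALL] * SIZE for _ in range(SIZE)]
    for y in range(PAD, PAD + N):
        for x in range(PAD, PAD + N):
            b[y][x] = EMPTY
    return b

def open_threes_through(b, x, y):
    """Count directions in which placing BLACK at (x, y) creates an open
    three of the exact shape .XXX. (a deliberately simplified pattern)."""
    b[y][x] = BLACK
    target = [EMPTY, BLACK, BLACK, BLACK, EMPTY]
    count = 0
    for dx, dy in DIRS:
        for off in range(-4, 1):  # every 5-cell window containing (x, y)
            cells = [b[y + (off + i) * dy][x + (off + i) * dx] for i in range(5)]
            if cells == target:
                count += 1
                break  # at most one open three counted per direction
    b[y][x] = EMPTY  # restore the board
    return count
```

Placing a stone that completes two such open threes at once returns 2 — the double-three shape that the forbidden-point rule is there to detect.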

gomoku_cpp: High-level game and search module containing two main classes:

  • GomokuCore mirrors the Python game engine — state management, move application, capture detection, win checking, and state encoding. Operates on flat arrays instead of numpy, avoiding Python-C++ boundary overhead for state manipulation.
  • MctsEngine is a native MCTS engine. Selection, expansion, and backup happen entirely in C++, with a Python callback for neural network inference. The callback interface supports three modes: synchronous single inference, batched synchronous inference, and asynchronous inference with Ray:

Source: alphazero/gomoku/pvmcts/search/sequential/cpp_strategy.py:56-71

```python
# alphazero/gomoku/pvmcts/search/sequential/cpp_strategy.py:56-71
def _run_async_mcts(self, engine, root, root_native_state,
                    sims, batch_size, noise_pending):
    inflight_refs = {}
    next_handle = 0

    def async_dispatcher(py_batch):
        nonlocal next_handle
        tensor = torch.from_numpy(py_batch).to(device=engine._inference_device)
        ref = engine.inference.infer_async(tensor)
        h = next_handle
        next_handle += 1
        inflight_refs[h] = ref
        return h  # C++ gets an opaque handle

    def async_checker(handles, timeout_s):
        target_refs = [inflight_refs[h] for h in handles]
        ready_refs, _ = ray.wait(target_refs, num_returns=len(target_refs),
                                 timeout=timeout_s)
        # ... match refs back to handles, return results
```

The async mode is a notable design choice: C++ drives the MCTS loop (selection, virtual loss, expansion), but delegates inference to Python callbacks that dispatch to Ray GPU actors. C++ gets back opaque integer handles and later polls for results — the same pipelined architecture as the pure-Python Ray engine, but with the tree operations in C++.
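
For contrast with the async path, the simplest of the three modes — a blocking synchronous callback — can be sketched as follows. The callback signature (batch of encoded states in, policy and value arrays out) is an assumption for illustration; the project's exact interface may differ:

```python
import numpy as np

# Sketch of the synchronous callback mode (signature assumed, not the
# project's exact interface): C++ hands Python a batch of encoded states
# and blocks until (policies, values) come back.
def sync_inference(batch: np.ndarray):
    n = batch.shape[0]
    # Deterministic stub in place of a neural network:
    policies = np.full((n, 361), 1.0 / 361, dtype=np.float32)  # uniform priors over 19x19
    values = np.zeros(n, dtype=np.float32)                      # neutral evaluations
    return policies, values

# engine.search(root, callback=sync_inference)  # hypothetical call site
p, v = sync_inference(np.zeros((4, 2, 19, 19), dtype=np.float32))
```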

Build System

The C++ extensions are built with scikit-build-core and CMake. pybind11 3.0.1 provides the Python binding layer; the code targets C++14. The build produces two importable .so modules that are installed as part of the Python package:

```bash
# Rebuild extensions
cd alphazero && pip install -e . -v --force-reinstall --no-cache-dir --no-deps
# Verify
python -c "from gomoku.cpp_ext import renju_cpp, gomoku_cpp; print('ok')"
```
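
A scikit-build-core project of this shape typically declares the build in pyproject.toml along these lines. This is an illustrative sketch, not the project's actual file — the package name, version, and pins are assumptions:

```toml
# Illustrative pyproject.toml sketch for a scikit-build-core + pybind11 build.
# Package name, version, and constraint values are assumptions.
[build-system]
requires = ["scikit-build-core", "pybind11"]
build-backend = "scikit_build_core.build"

[project]
name = "alphazero-gomoku"
version = "0.0.0"

[tool.scikit-build]
cmake.version = ">=3.18"
```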

Performance Impact

In the local benchmark harness, native MCTS shows about 13x search speedup by removing Python interpreter overhead from the hot loop (node traversal, UCB computation, tree mutation).

Benchmark method (source): alphazero/tests/native/test_native_play_test.py:52-103. It measures wall-clock search time for SequentialEngine at num_searches=200 on the same prepared position, comparing Python (use_native=False) vs C++ native (use_native=True) with the same deterministic inference stub.

Measured result (5 runs on 2026-02-20; AMD Ryzen 7 8845HS, Linux 6.17, Python 3.13.10): speedup ranged from 11.93x to 13.98x (average 13.04x), with Python search around 0.222s and native C++ search around 0.017s for 200 simulations.

In production (V4 live deployment on GCP c2d-standard-4, 4 vCPU / 16 GB), 200 MCTS simulations with native C++ typically produced around 2.5s response latency per move (workload-dependent). See deployment machine specs in About Project / Deployment. The server reconstructs native C++ state from the frontend WebSocket payload per request — no persistent native state is maintained between requests. This stateless design simplifies deployment and avoids state synchronization issues.
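
The per-request reconstruction can be sketched as follows (the function name and payload shape are illustrative assumptions, not the server's actual code):

```python
# Illustrative sketch of stateless per-request state reconstruction.
# Names and payload shape are assumptions, not the server's actual code.
def rebuild_and_search(payload: dict):
    """Rebuild game state from the WebSocket payload, then search from it."""
    moves = [tuple(m) for m in payload["moves"]]  # full move history per request
    state = []                     # stand-in for a fresh gomoku_cpp.GomokuCore()
    for mv in moves:
        state.append(mv)           # stand-in for core.apply_move(mv)
    return state                   # the real server would run MCTS from here

state = rebuild_and_search({"moves": [[3, 3], [4, 4], [3, 4]]})
```

Because every request carries the full move history, a crashed or rescheduled server process loses nothing — the trade-off is re-applying the history on each request, which is cheap next to the search itself.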