C++ Extensions: pybind11 for Performance
Why C++
Two performance bottlenecks emerged in the pure Python implementation:
- Double-three detection requires deep recursive rule checking, called for every legal move computation during MCTS expansion. On a 19x19 board with 200+ legal moves per position, Python function call overhead becomes significant.
- MCTS tree operations — selection, expansion, and backup — are called hundreds to thousands of times per move. Each operation involves pointer traversal, arithmetic, and dictionary lookups. Python loop overhead across thousands of iterations adds up.
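The selection step that dominates this loop can be sketched as a minimal pure-Python PUCT computation (illustrative only — the node representation, function name, and c_puct value are assumptions, not the project's actual code). Every simulation runs this loop once per tree level, which is exactly the interpreter overhead the C++ port removes:

```python
import math

def select_child(children, c_puct=1.5):
    """Pick the child maximizing the PUCT score Q + U.

    `children` is a list of dicts with visit count N, total value W,
    and prior P -- a simplified stand-in for real tree nodes.
    """
    total_n = sum(ch["N"] for ch in children) or 1
    best, best_score = None, -float("inf")
    for ch in children:  # runs thousands of times per move in pure Python
        q = ch["W"] / ch["N"] if ch["N"] else 0.0
        u = c_puct * ch["P"] * math.sqrt(total_n) / (1 + ch["N"])
        score = q + u
        if score > best_score:
            best, best_score = ch, score
    return best

children = [{"N": 10, "W": 6.0, "P": 0.4},
            {"N": 2, "W": 1.5, "P": 0.6}]
print(select_child(children) is children[1])  # True: higher Q and U
```

Each iteration is a handful of float operations plus dict lookups; in C++ the same step compiles to a few dozen instructions with no per-call overhead.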
The Python implementation remains the default and serves as the reference. C++ is optional: setting use_native: true in the config enables the native path. Both produce identical results — the C++ modules are drop-in replacements, not alternative implementations.
Two Modules
Two pybind11 modules provide the C++ acceleration:
renju_cpp: Low-level rules module. Contains CForbiddenPointFinder, which implements double-three detection — sourced from renju.se, the same implementation adapted for the minimax engine. It performs direction-based open-three counting on a padded board, handling the recursive edge cases that bitmask pattern matching got wrong.
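The padded-board, direction-scan idea can be illustrated with a small pure-Python sketch. This is a simplification (it only counts open threes of exactly three contiguous stones and ignores broken threes and the recursive forbidden-point exceptions the real rules require); all names here are illustrative, not the module's API:

```python
DIRS = [(1, 0), (0, 1), (1, 1), (1, -1)]
EMPTY, BLACK, BORDER = 0, 1, 3

def make_padded_board(n=15):
    # Border sentinels remove per-access bounds checks -- the reason
    # for scanning on a padded board rather than the raw n x n grid.
    board = [[BORDER] * (n + 2) for _ in range(n + 2)]
    for y in range(1, n + 1):
        for x in range(1, n + 1):
            board[y][x] = EMPTY
    return board

def count_open_threes(board, x, y, color=BLACK):
    """Count directions in which placing at (x, y) creates an open
    three (.XXX.). A count >= 2 would flag a double-three."""
    board[y][x] = color
    count = 0
    for dx, dy in DIRS:
        run = 1
        fx, fy = x + dx, y + dy          # extend forward
        while board[fy][fx] == color:
            run += 1
            fx, fy = fx + dx, fy + dy
        bx, by = x - dx, y - dy          # extend backward
        while board[by][bx] == color:
            run += 1
            bx, by = bx - dx, by - dy
        open_ends = board[fy][fx] == EMPTY and board[by][bx] == EMPTY
        if run == 3 and open_ends:
            count += 1
    board[y][x] = EMPTY                  # undo the trial placement
    return count

b = make_padded_board()
b[7][6] = b[7][8] = BLACK            # stones at (6,7) and (8,7)
print(count_open_threes(b, 7, 7))    # placing at (7,7) completes one open three
```

The sentinel border means every scan terminates without explicit bounds checks, which keeps the inner loop branch-light — the same trick the C++ code relies on.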
gomoku_cpp: High-level game and search module containing two main classes:
- GomokuCore mirrors the Python game engine — state management, move application, capture detection, win checking, and state encoding. It operates on flat arrays instead of numpy, avoiding Python-C++ boundary overhead for state manipulation.
- MctsEngine is a native MCTS engine. Selection, expansion, and backup happen entirely in C++, with a Python callback for neural network inference. The callback interface supports three modes: synchronous single inference, batched synchronous inference, and asynchronous inference with Ray:
Source: alphazero/gomoku/pvmcts/search/sequential/cpp_strategy.py:56-71
```python
# alphazero/gomoku/pvmcts/search/sequential/cpp_strategy.py:56-71
def _run_async_mcts(self, engine, root, root_native_state,
                    sims, batch_size, noise_pending):
    inflight_refs = {}
    next_handle = 0

    def async_dispatcher(py_batch):
        nonlocal next_handle
        tensor = torch.from_numpy(py_batch).to(device=engine._inference_device)
        ref = engine.inference.infer_async(tensor)
        h = next_handle
        next_handle += 1
        inflight_refs[h] = ref
        return h  # C++ gets an opaque handle

    def async_checker(handles, timeout_s):
        target_refs = [inflight_refs[h] for h in handles]
        ready_refs, _ = ray.wait(target_refs, num_returns=len(target_refs),
                                 timeout=timeout_s)
        # ... match refs back to handles, return results
```

The async mode is a notable design choice: C++ drives the MCTS loop (selection, virtual loss, expansion), but delegates inference to Python callbacks that dispatch to Ray GPU actors. C++ gets back opaque integer handles and later polls for results — the same pipelined architecture as the pure-Python Ray engine, but with the tree operations in C++.
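One way the checker can map ready refs back to their integer handles is a reverse lookup over the in-flight table. This is a sketch, not the project's actual code — plain strings stand in for Ray ObjectRefs (which are hashable, so the same pattern applies), and the function name is hypothetical:

```python
def match_ready(handles, inflight_refs, ready_refs):
    """Return the handles whose refs ray.wait reported ready, so the
    C++ side can collect those results by handle. Sketch only."""
    ready = set(ready_refs)  # ObjectRefs are hashable, so a set works
    return [h for h in handles if inflight_refs[h] in ready]

# Mock refs stand in for Ray ObjectRefs in this illustration.
inflight = {0: "ref-a", 1: "ref-b", 2: "ref-c"}
print(match_ready([0, 1, 2], inflight, ["ref-a", "ref-c"]))  # [0, 2]
```

Keeping the handle-to-ref table on the Python side means C++ never holds a Python object reference, which sidesteps GIL and lifetime questions at the language boundary.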
Build System
The C++ extensions are built with scikit-build-core and CMake. pybind11 3.0.1 provides the Python binding layer, and the code targets C++14. The build produces two importable .so modules that are installed as part of the Python package:
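With scikit-build-core, the CMake side of such a setup typically looks like the following sketch (file paths, target sources, and the install destination are assumptions, not the project's actual CMakeLists.txt):

```cmake
# CMakeLists.txt (sketch, assuming sources under src/)
cmake_minimum_required(VERSION 3.18)
project(gomoku_ext LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 14)

find_package(pybind11 CONFIG REQUIRED)

pybind11_add_module(renju_cpp src/renju_cpp.cpp)
pybind11_add_module(gomoku_cpp src/gomoku_cpp.cpp)

# Install next to the package so `from gomoku.cpp_ext import ...` works.
install(TARGETS renju_cpp gomoku_cpp DESTINATION gomoku/cpp_ext)
```

scikit-build-core invokes CMake during `pip install`, so the editable reinstall below is all that is needed after changing C++ sources.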
```bash
# Rebuild extensions
cd alphazero && pip install -e . -v --force-reinstall --no-cache-dir --no-deps

# Verify
python -c "from gomoku.cpp_ext import renju_cpp, gomoku_cpp; print('ok')"
```

Performance Impact
In the local benchmark harness, native MCTS shows about 13x search speedup by removing Python interpreter overhead from the hot loop (node traversal, UCB computation, tree mutation).
Benchmark method (source): alphazero/tests/native/test_native_play_test.py:52-103. It measures wall-clock search time for SequentialEngine at num_searches=200 on the same prepared position, comparing Python (use_native=False) vs C++ native (use_native=True) with the same deterministic inference stub.
Measured result (5 runs on 2026-02-20; AMD Ryzen 7 8845HS, Linux 6.17, Python 3.13.10): speedup ranged from 11.93x to 13.98x (average 13.04x), with Python search around 0.222s and native C++ search around 0.017s for 200 simulations.
In production (V4 live deployment on GCP c2d-standard-4, 4 vCPU / 16 GB), 200 MCTS simulations with native C++ typically produced around 2.5s response latency per move (workload-dependent). See deployment machine specs in About Project / Deployment. The server reconstructs native C++ state from the frontend WebSocket payload per request — no persistent native state is maintained between requests. This stateless design simplifies deployment and avoids state synchronization issues.
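The stateless per-request pattern can be sketched with a stub in place of the native class (GomokuCoreStub, handle_request, and the payload shape are hypothetical illustrations, not the server's actual code):

```python
class GomokuCoreStub:
    """Stand-in for the native gomoku_cpp.GomokuCore (illustration only)."""
    def __init__(self):
        self.moves = []

    def apply_move(self, mv):
        self.moves.append(mv)

def handle_request(payload):
    # Stateless pattern: rebuild the full game state from the client's
    # move history on every request; nothing is cached server-side.
    core = GomokuCoreStub()
    for mv in payload["moves"]:
        core.apply_move(mv)
    return core  # the real server would run MCTS from here

state = handle_request({"moves": [(7, 7), (8, 8)]})
print(len(state.moves))  # 2
```

Reconstruction cost is linear in game length and negligible next to 200 simulations of search, which is why trading it for a stateless server is a reasonable design choice here.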