Training History, Deployment, and Lessons Learned

Training Runs: V1 through V4

The current champion model is the product of four training runs on GCP, not one. Each run exposed a different failure mode that shaped the next attempt's hyperparameters. The configs were updated mid-run via the scheduled parameter system — sometimes daily — based on evaluation results from the arena.

Training was run on the Ray cluster profile defined in alphazero/infra/cluster/cluster_elo1800.yaml: one g2-standard-16 head node (CPU:16, GPU:1) plus ten c4-standard-8 CPU workers (CPU:8 each), for a total of 96 vCPUs and 1 GPU.

Authoritative iteration ledger:

Run              Completed iterations   Notes
V1               28                     Baseline run
V2               2                      Abandoned (overfitting)
V3               13                     Abandoned (underfitting)
V4               222                    Production run
Total (V1-V4)    265                    All completed training iterations

V1: Baseline (28 iterations, 6 promotions)

V1 established the baseline: 10 residual blocks, 100 MCTS simulations, LR=0.002, and 2 training epochs. The model learned — first promotion at iteration 6 with 82.5% win rate, and iteration 12 won 40-0 against the champion. But the evaluation used elo_k_factor: 32.0, which inflated the internal Elo to 3996 after just 28 iterations. This made progress tracking meaningless.
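To see why a high K-factor inflates ratings so quickly, here is a minimal sketch of the standard per-game Elo update. The starting rating and the back-to-back 40-0 sweep are illustrative, not values taken from the run logs:

```python
def elo_update(rating: float, opp_rating: float, score: float, k: float) -> float:
    """Standard Elo update after one game (score: 1.0 win, 0.5 draw, 0.0 loss)."""
    expected = 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))
    return rating + k * (score - expected)

def sweep_gain(k: float, games: int = 40, champ: float = 1500.0) -> float:
    """Rating gained by a challenger that starts level with the champion
    and wins every game of a fixed-count evaluation."""
    r = champ
    for _ in range(games):
        r = elo_update(r, champ, 1.0, k)
    return r - champ

print(f"K=32: +{sweep_gain(32):.0f} Elo from a single 40-0 sweep")
print(f"K=2:  +{sweep_gain(2):.0f} Elo from the same sweep")
```

A handful of lopsided evaluations at K=32 is enough to push an internal Elo into the thousands, which is why later runs dropped the K-factor to 2.0.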

V1 also revealed instability: iteration 12 swept 40-0, then iteration 14 lost 3-37. The model was oscillating rather than converging. The search budget (100 simulations) was too shallow for 19×19 — the policy network was getting noisy training signal from games where MCTS couldn't search deep enough to validate its own moves.

V2: Overfitting (2 iterations, 0 promotions — abandoned)

V2 tried to fix convergence by increasing training epochs from 2 to 10, reasoning that more gradient steps per iteration would extract more from each batch of self-play data. It also dropped the Elo K-factor to 2.0 to prevent inflation.

The model never promoted. After 2 iterations of 0 promotions, the run was killed. The problem was overfitting: 10 epochs on a small replay buffer memorized the training positions instead of learning generalizable patterns. The challenger would play well on positions it had seen, but fail against the champion on novel positions in the arena. V2 lasted less than a day.

V3: Underfitting (13 iterations, 0 promotions — abandoned)

V3 overcorrected: LR was dropped to 0.0002 (10× lower than V1/V2) with 5 epochs. The result was the opposite failure mode — the model couldn't learn at all. Every evaluation from iteration 1 through 13 was 0-40. The challenger lost every single game against the random-initialized champion.

At iteration 8, a mid-run config change (elo1800-v3-day1-5.yaml) boosted LR to 0.001, but by iteration 13 the model still hadn't produced a single promotion. V3 was abandoned.

V4: The Production Run (222 iterations, 14 config revisions)

V4 combined the lessons: keep LR=0.002 (V1's known-good value), keep 2 epochs (AlphaZero standard, avoid overfitting), keep K-factor=2.0 (V2's fix), and increase model capacity from 10 to 12 residual blocks. The Dirichlet alpha was also recalibrated from 0.3 to 0.15 — closer to the theoretical optimum of 10 / avg_legal_moves for a 19×19 board.
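The recalibration follows the common AlphaZero-style heuristic of scaling the Dirichlet concentration inversely with the number of legal moves. A quick sketch — the avg_legal_moves values below are illustrative assumptions, not measurements from this run:

```python
def dirichlet_alpha(avg_legal_moves: float) -> float:
    # Heuristic: alpha ≈ 10 / average number of legal moves per position.
    return 10.0 / avg_legal_moves

for avg in (361, 180):  # empty 19x19 board vs. a roughly half-full board
    print(f"avg_legal_moves={avg}: alpha ≈ {dirichlet_alpha(avg):.3f}")
```

By this yardstick 0.3 sits well above the heuristic range for 19×19, and 0.15 moves toward it without committing to the extreme low end.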

V4 started with a 4-iteration validation run (day0.5) to confirm the config worked before committing GCP budget. First promotion at iteration 3. From there, the run was managed across 8 training days with 14 config revisions:

Iteration indexing note: this document uses completed, 1-based iteration IDs. The final config (elo1800-v4-day9.yaml) set num_iterations: 252 as a target cap, but training was stopped at iteration 222 as further iterations showed diminishing returns.

Days 1–3 (iter 1–100): Ramp-up. MCTS searches scaled from 200 → 400 → 600 as the model improved — shallow search early (fast data collection) ramping to deeper search (higher quality). Learning rate held at 0.002 then dropped to 0.001. Temperature moved from 1.0 to 0.75, shifting from exploration toward exploitation. PER was enabled during the early ramp-up (day1~day3) and then turned off from day3.5 onward, returning to uniform replay sampling for a simpler and more stable training loop.

Day 4 (iter 101–110): Deep search. MCTS pushed to 1600 simulations. Self-play games dropped from 3000 to 1500 per iteration — fewer but higher-quality games. Opponent diversity was removed (prev_bot_ratio: 0.0) — the model was strong enough to learn purely from self-play. Win rates during this phase hit 100% on consecutive evaluations (iter 102, 104, 106, 110).

Day 5 (iter 111–137): Opening pursuit. MCTS reached 2400 simulations — the deepest search of the entire training run. Dirichlet noise increased from 0.25 to 0.35, exploration turns from 20 to 25. The goal was center opening play, which the model hadn't discovered. The replay buffer was shrunk from 500K to 400K to flush older data faster. This produced strong tactical play but the model continued opening on star points rather than center.

Day 6 (iter 138–150): Rule bug and recovery. Training had stopped improving, so I revisited the codebase end-to-end and found a double-three rule asymmetry: the C++ extension was enforcing the forbidden move rule for Black but not White. This meant all training data had been generated under incorrect rules. The fix is in the current codebase: legal-move generation now applies the double-three check for the current player (state.next_player) in the native path (alphazero/cpp/src/GomokuCore.cpp:257-271), and regression tests include White-turn forbidden-move coverage (alphazero/tests/core/gomoku_doublethree_color_test.py). After patching, the entire replay buffer was purged and training resumed from the current checkpoint with fresh self-play data under corrected rules.

The recovery config (day6-recover) raised temperature back to 1.10, increased Dirichlet noise to 0.42, raised LR to 0.001 (faster policy adaptation to the new rule), and expanded the buffer to 450K. Opponent diversity was reintroduced (random_bot: 0.08, prev_bot: 0.25) to prevent the model from overfitting to its own post-fix play patterns.

Days 6.5–7 (iter 151–208): Spiky Dirichlet experiment. To break star-point opening fixation, dirichlet_alpha was dropped from 0.15 to 0.03. With 361 legal moves, Dir(0.03) concentrates noise on 1–3 random actions instead of spreading it uniformly — when the spike lands on a center cell, it temporarily overrides the learned prior and forces MCTS to explore that move. Temperature was raised to 1.30, MCTS searches reduced to 800 (amplifying the noise's effect on PUCT), and 6 random opening turns were added. The buffer was shrunk to 250K for rapid flush of star-point data.
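The concentration effect is easy to verify empirically. The sketch below samples Dirichlet noise via normalized Gamma draws (stdlib only; the sample counts are arbitrary) and compares the largest noise component at the two alpha values:

```python
import random

def dirichlet(alpha: float, n: int, rng: random.Random) -> list[float]:
    """Sample Dir(alpha) over n actions as normalized Gamma(alpha, 1) draws."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(g)
    return [x / total for x in g]

rng = random.Random(0)
n = 361  # one entry per 19x19 board cell
for alpha in (0.15, 0.03):
    avg_max = sum(max(dirichlet(alpha, n, rng)) for _ in range(200)) / 200
    print(f"alpha={alpha}: mean largest noise component ≈ {avg_max:.2f}")
```

At alpha=0.03 most of the noise mass lands on a handful of actions per sample, which is exactly the "spike" the experiment relied on.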

Day 8 (iter 209–222): Taper. Three-phase convergence back to normal parameters: searches 1000 → 1400 → 1600, temperature 1.20 → 1.00 → 0.85, Dirichlet epsilon 0.45 → 0.38 → 0.32. The aggressive exploration period was over; the goal was to stabilize whatever opening diversity had been gained without losing tactical strength.
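Taken together, the day-by-day search budget reads as a step schedule over iterations. A sketch using only the phase boundaries documented above — the intra-day ramp steps of days 1–3 and day 8, and the day-6 recovery settings, are deliberately omitted, so intermediate values are approximate:

```python
import bisect

# (first iteration, num_searches) at each documented phase boundary.
SCHEDULE = [
    (1, 200),     # days 1-3: ramp-up start
    (101, 1600),  # day 4: deep search
    (111, 2400),  # day 5: opening pursuit
    (151, 800),   # days 6.5-7: spiky Dirichlet
    (209, 1000),  # day 8: taper start
]
STARTS = [start for start, _ in SCHEDULE]

def num_searches(iteration: int) -> int:
    """Look up the MCTS search budget in effect at a given (1-based) iteration."""
    idx = bisect.bisect_right(STARTS, iteration) - 1
    return SCHEDULE[max(idx, 0)][1]
```

This lookup pattern (sorted boundaries plus `bisect`) is the simplest way to express a piecewise-constant hyperparameter schedule without hand-writing range checks.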

V4 Evaluation and Replay Decisions

  • SPRT was implemented but deliberately disabled in V4. The run used fixed-count arena evaluations (40 games each time) to keep model comparisons quantitatively consistent across all config revisions. In evaluation_log.jsonl, recorded head-to-head game counts are uniformly 40, which made trend analysis and promotion auditing straightforward.
  • PER was used as an experiment, not a permanent default. It was active in the early phase, then disabled after day3.5 when the project prioritized stability and comparability over replay-sampling complexity. The run log reflects this phase split (priority_replay_active=true in early evaluations, then false afterward).
  • Details on PER activation behavior (including the start_iteration caveat) are documented in Training Pipeline.
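A fixed-count gate is simple to reason about. A hypothetical sketch — the 40-game count comes from the evaluation log, but the win-rate threshold here is an assumed illustration, not the run's actual gating criterion:

```python
def should_promote(challenger_wins: int, games: int = 40,
                   threshold: float = 0.55) -> bool:
    """Fixed-count promotion gate: promote when the challenger's win rate
    clears the bar. The 0.55 threshold is illustrative only."""
    return challenger_wins / games >= threshold

# Example: V1's first promotion came at an 82.5% win rate (33 of 40 games).
print(should_promote(33))  # → True
```

Because every evaluation uses the same game count, win rates are directly comparable across config revisions, which is what made trend analysis straightforward.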

What the Runs Taught

The progression from V1 to V4 is a study in hyperparameter sensitivity for RL training. The margin between "learning" and "not learning at all" was narrow: V1's LR=0.002 with 2 epochs worked; V2's same LR with 10 epochs caused overfitting; V3's LR=0.0002 with 5 epochs caused underfitting. Neither V2 nor V3 produced a single promotion — not one model in 15 combined iterations was stronger than a random-initialized network.

The day 6 rule bug is worth highlighting: the model trained for 135 iterations under subtly wrong rules, and the resulting play was still visually plausible. Without the replay data analysis that measured the contamination rate (1.8% of samples affected overall, but up to 10% in later iterations), the bug might never have been caught. This reinforced the unit-testing lesson from the refactoring — gameplay testing is not sufficient.

Deployment Architecture

The production system runs on GCP VMs using Container-Optimized OS (COS). Docker images are built and pushed to GCP Artifact Registry, and startup scripts on each VM pull the latest image and run the container.

Two backend VMs serve the game: one for the minimax engine, one for AlphaZero. A Cloudflare Worker handles path-based routing — /minimax/* routes to the minimax VM, /alphazero/* routes to the AlphaZero VM. This keeps the frontend unaware of backend topology; it just connects to the domain and the Worker handles the rest.

The deployment uses infrastructure-as-scripts rather than Terraform:

  • 01_setup.sh — one-time GCP project setup (Artifact Registry, IAM, networking, storage, quota checks)
  • 02_deploy.sh — build and push Docker images, update VM metadata, trigger startup scripts, verify containers via SSH
  • 03_deploy_cloudflare.sh — configure Cloudflare DNS A records and deploy the Worker

Production Serving Configuration

The AlphaZero production Docker image (Dockerfile.prod) uses Python 3.13-slim, builds C++ extensions at image build time, and includes torch-cpu (no CUDA overhead).

Source: alphazero/configs/deploy.yaml

```yaml
# alphazero/configs/deploy.yaml
model:
  num_hidden: 128
  num_resblocks: 12

mcts:
  C: 2.0
  num_searches: 200
  use_native: true    # Native C++ MCTS
  exploration_turns: 0
  dirichlet_epsilon: 0.0
  resign_threshold: 0.95
  resign_enabled: true
```

Key production decisions:

  • Native C++ MCTS enabled — local benchmark measured ~13x search-loop speedup at 200 simulations (Python SequentialEngine vs native path; see C++ Extensions).
  • 200 MCTS simulations per move — strong CPU play with typical latency around ~2.5s per move on the V4 c2d-standard-4 deployment VM
  • Zero exploration, zero noise — deterministic play at full strength
  • Configurable inference backend via ALPHAZERO_INFER_BACKEND — default is local PyTorch inference (local); onnx and onnx-int8 switch to ONNX Runtime (with int8 dynamic quantization for onnx-int8). ONNX artifacts are cached in ONNX_CACHE_DIR (default /tmp/onnx_cache).
  • Cgroup-aware thread management — the server auto-detects container CPU limits via cgroup v1/v2 and configures PyTorch threads accordingly:

Backend selection source: alphazero/server/engine.py:40-59

Cgroup detection source: alphazero/server/engine.py:230-255

```python
# alphazero/server/engine.py:230-255
# (module-level imports shown here for context)
import math
from pathlib import Path

@classmethod
def _detect_cpu_quota_limit(cls) -> int | None:
    # cgroup v2
    cpu_max = Path("/sys/fs/cgroup/cpu.max").read_text().strip()
    parts = cpu_max.split()
    if len(parts) >= 2 and parts[0] != "max":
        quota_us, period_us = int(parts[0]), int(parts[1])
        return max(1, int(math.ceil(quota_us / period_us)))
    # cgroup v1 fallback ...
```

Each WebSocket request includes the full board state. The server reconstructs native C++ state from the payload, runs MCTS, and returns the result — fully stateless between requests.

Known Limitations

The current model does not consistently achieve center opening play. In the AlphaZero paper, this strategic knowledge emerged from significantly more training iterations with higher compute budgets. The current limitation is a training budget constraint, not an architectural one — the system is capable of continued training.

The model does demonstrate strong mid-game and defensive play: blocking opponent threats, creating capture opportunities, and executing aggressive 3-3 patterns that force defensive responses. These behaviors emerged naturally from self-play without any domain-specific encoding.

Potential technical improvements include deeper ONNX benchmarking/tuning and model distillation to a smaller network for lower latency.

Lessons Learned

Unit testing catches what gameplay testing can't. The original implementation had data integrity issues — outcome values with wrong signs, PER priorities that were never updated — that were invisible during casual gameplay. The model would train, games would play, everything looked normal. But the training signal was corrupted. The complete refactoring that produced the current codebase started with building a proper test suite. Once the tests existed, bugs that had been silently degrading training for weeks were caught immediately. This was the single most impactful change.

Fix the foundation before optimizing. There are infinite optimization paths in an AlphaZero system: faster MCTS, better network architecture, curriculum learning, opening books. The temptation is to chase performance. But the biggest gains came from getting the fundamentals right — correct data pipeline, proper PER weight updates, reliable evaluation gating. A faster system that trains on corrupted data is worse than a slower system that trains correctly.

The gap between paper and implementation is real. Reading the AlphaZero paper gives you the algorithm. It doesn't give you the engineering: how to handle distributed state synchronization, how to debug a model that plays but doesn't improve, how to manage GPU costs on a budget, how to detect when a rule change mid-training has contaminated your replay buffer. This gap is where the real learning happened.

More compute unlocks more capability, but plateaus require patience. The model's progression was non-linear: long plateaus of seemingly no improvement, then sudden jumps in play quality (blocking patterns, aggressive tactics). Understanding that this is expected behavior of RL training — and having the discipline to keep running — is as important as the technical implementation.