AlphaZero Training and Config Guide

Training (Entry Point + Local Commands)

Training entry point:

python -m gomoku.scripts.train

CLI arguments:

  • --config (required): YAML config path
  • --mode (required): sequential | vectorize | mp | ray
  • --device (optional): auto | cpu (default: auto)

Requires Python 3.13+.
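For reference, the documented flags correspond to an argparse surface roughly like the sketch below. This is only an illustration of the CLI contract described above; the actual parser in gomoku.scripts.train may be implemented differently.

```python
import argparse

def parse_args(argv=None):
    # Mirrors the documented CLI surface; the real entry point may differ.
    parser = argparse.ArgumentParser(prog="gomoku.scripts.train")
    parser.add_argument("--config", required=True, help="YAML config path")
    parser.add_argument("--mode", required=True,
                        choices=["sequential", "vectorize", "mp", "ray"])
    parser.add_argument("--device", default="auto", choices=["auto", "cpu"])
    return parser.parse_args(argv)

# When --device is omitted, it falls back to "auto".
args = parse_args(["--config", "configs/config_alphazero_test.yaml",
                   "--mode", "sequential"])
```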

cd alphazero

# If needed
# python -m venv .venv
# source .venv/bin/activate

# CPU dependencies
pip install -e ".[torch-cpu]" --extra-index-url https://download.pytorch.org/whl/cpu

# Sequential
python -m gomoku.scripts.train --config configs/config_alphazero_test.yaml --mode sequential --device cpu

# Vectorized (single process, multiple game slots)
python -m gomoku.scripts.train --config configs/config_alphazero_vectorize_test.yaml --mode vectorize --device cpu

# Multiprocessing
python -m gomoku.scripts.train --config configs/config_alphazero_mp_test.yaml --mode mp --device cpu

# Ray
pip install -e ".[ray,torch-cpu]" --extra-index-url https://download.pytorch.org/whl/cpu
python -m gomoku.scripts.train --config configs/5x5_local_test.yaml --mode ray --device cpu


Output Paths and Resume Behavior

{paths.run_prefix}/
  {paths.run_id}/
    ckpt/
      iteration_0000.pt
      iteration_0000.pt.optim
      iteration_0001.pt
      iteration_0001.pt.optim
      ...
    replay/
      shard-iter0001-....parquet
      shard-iter0002-....parquet
      ...
    eval_logs/
      ...
    manifest.json

  • First run: create manifest.json and save the initial champion checkpoint
  • Resume: continue from manifest.json; restore optimizer state from *.optim when available
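The resume behavior can be sketched as below. This is a simplified illustration based on the layout above; find_resume_point is a hypothetical helper, and the trainer's actual manifest schema and checkpoint discovery may differ.

```python
import json
from pathlib import Path

def find_resume_point(run_dir):
    """Return (manifest, latest_ckpt, optim_path) for a run directory.

    Hypothetical sketch of the documented resume rules: no manifest means a
    first run; otherwise resume from the newest checkpoint and restore the
    optimizer state from its *.optim sibling when that file exists.
    """
    run_dir = Path(run_dir)
    manifest_path = run_dir / "manifest.json"
    if not manifest_path.exists():
        return None, None, None  # first run: caller creates manifest.json

    manifest = json.loads(manifest_path.read_text())
    ckpts = sorted((run_dir / "ckpt").glob("iteration_*.pt"))
    if not ckpts:
        return manifest, None, None
    latest = ckpts[-1]
    optim = latest.with_name(latest.name + ".optim")  # iteration_XXXX.pt.optim
    return manifest, latest, (optim if optim.exists() else None)
```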

Configuration Template

This template shows the most common fields. Real configs in alphazero/configs/ may include additional options such as priority_replay, opponent_rates, elo_k_factor, and resign controls (resign_threshold, resign_enabled, min_moves_before_resign).

Section quick guide:

  • board: Board size and game-rule toggles used by self-play and evaluation.
  • model: Neural network width/depth settings that control model capacity.
  • training: Iteration counts, optimization schedule, replay sampling, and data-loader behavior.
  • mcts: Search behavior during play. C is the UCB exploration weight; num_searches sets simulations per move; exploration_turns keeps early-game move selection exploratory; dirichlet_epsilon and dirichlet_alpha control the mix and shape of Dirichlet noise at the root; batch_infer_size sets the inference batch size.
  • evaluation: Promotion gate and periodic benchmark settings for challenger vs champion.
  • parallel: Worker/process counts for vectorize, mp, and local ray modes.
  • paths: Output location and run identifier (run_prefix, run_id, local vs GCS).
  • io: Replay shard sizing and local replay cache behavior.
  • runtime: Optional Ray actor CPU/GPU allocation and async self-play/inference topology.
# For schedulable config fields, you can use either fixed numeric values
# or scheduled params in the form: { until: ..., value: ... }.
# Example: learning_rate: 0.001  OR  learning_rate: [{ until: 20, value: 0.002 }, { until: 60, value: 0.001 }]
# Scheduled values must be a list covering all iterations up to num_iterations.

board:
  num_lines: 19
  enable_doublethree: true
  enable_capture: true
  capture_goal: 5
  gomoku_goal: 5
  history_length: 5 # 5 by default

model:
  num_hidden: 128
  num_resblocks: 12
  # num_planes / policy_channels / value_channels are fixed/derived in code

training:
  num_iterations: 60 # total iterations
  # note: for scheduled fields, the final `until` must match `num_iterations`
  num_selfplay_iterations:
    - { until: 20, value: 1200 }
    - { until: 40, value: 1800 }
    - { until: 60, value: 2400 }

  num_epochs: 2
  batch_size: 512

  learning_rate:
    - { until: 20, value: 0.0020 }
    - { until: 40, value: 0.0010 }
    - { until: 60, value: 0.0005 }

  weight_decay: 0.0001
  temperature:
    - { until: 20, value: 1.0 }
    - { until: 40, value: 0.7 }
    - { until: 60, value: 0.4 }

  replay_buffer_size: 500000
  min_samples_to_train: 10000
  random_play_ratio:
    - { until: 20, value: 0.03 }
    - { until: 40, value: 0.02 }
    - { until: 60, value: 0.01 }
  dataloader_num_workers: 4
  dataloader_prefetch_factor: 2
  enable_tf32: true
  use_channels_last: true

mcts:
  C: 2.0
  num_searches:
    - { until: 20, value: 400 }
    - { until: 40, value: 800 }
    - { until: 60, value: 1200 }
  exploration_turns: 20
  dirichlet_epsilon:
    - { until: 20, value: 0.25 }
    - { until: 40, value: 0.15 }
    - { until: 60, value: 0.05 }
  dirichlet_alpha: 0.3
  batch_infer_size: 32
  max_batch_wait_ms: 5
  min_batch_size: 1
  use_native: true

evaluation:
  num_eval_games: 40
  eval_every_iters: 2
  promotion_win_rate:
    - { until: 30, value: 0.55 }
    - { until: 60, value: 0.58 }
  num_baseline_games: 0
  blunder_threshold: 0.5
  initial_blunder_rate: 0.0
  initial_baseline_win_rate: 0.0
  blunder_increase_limit: 1.0
  baseline_wr_min: 0.0
  random_play_ratio: 0.0
  eval_num_searches:
    - { until: 30, value: 600 }
    - { until: 60, value: 900 }
  baseline_num_searches: 0
  use_sprt: false
  fast_eval:
    enabled: false
    num_games: 0
    num_searches: 0
    promote_threshold: 0.0
    reject_threshold: 0.0

parallel:
  num_parallel_games: 8 # only for 'vectorize' mode
  mp_num_workers: 4 # only for 'mp' mode
  ray_local_num_workers: 8 # only for 'ray' mode - local worker count

paths:
  use_gcs: false # use local filesystem
  run_prefix: runs # default when use_gcs=false
  run_id: exp_20260217 # run identifier
  # use_gcs=true example:
  # use_gcs: true
  # run_prefix: gomoku-prod-bucket # put your bucket name here (no gs:// prefix, no slash)
  # run_id: exp_20260217

io:
  initial_replay_shards: null
  initial_replay_iters: null
  max_samples_per_shard: 5000
  local_replay_cache: /tmp/gmk_replay_cache

runtime: null # null is valid for local/simple runs
# ray mode example (CPU-only local):
# runtime:
#   selfplay:
#     actor_num_cpus: 1.0
#     games_per_actor: 16
#     inflight_per_actor: 16
#   inference:
#     actor_num_gpus: 0.0
#     num_actors: 1
#     actor_num_cpus: 1.0
#     use_local_inference: true
#   evaluation:
#     num_workers: 2
#     actor_num_cpus: 1.0

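The scheduled-parameter form described in the template comments can be resolved per iteration with a small helper like this. It is an illustrative sketch, not the project's actual implementation; iterations are assumed to be 1-based, and schedule entries are assumed to be sorted by `until`.

```python
def resolve(param, iteration):
    """Resolve a schedulable config field for a given (1-based) iteration.

    `param` is either a fixed value or a list of {"until": N, "value": V}
    entries sorted by `until`, as in the template above. The last `until`
    must cover num_iterations, otherwise resolution fails.
    """
    if not isinstance(param, list):
        return param  # fixed numeric value
    for entry in param:
        if iteration <= entry["until"]:
            return entry["value"]
    raise ValueError(f"schedule does not cover iteration {iteration}")

# The learning_rate schedule from the template above:
lr = [{"until": 20, "value": 0.0020},
      {"until": 40, "value": 0.0010},
      {"until": 60, "value": 0.0005}]
```

For example, resolve(lr, 25) picks the second band (0.0010), while resolve(0.001, 25) returns the fixed value unchanged.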

Ray/GCP Cluster Files

  • alphazero/infra/cluster/cluster_elo1800.yaml
    • Ray cluster template used for GCP deployment.
  • alphazero/infra/cluster/restart_cluster.sh
    • Cluster restart/redeploy script.
    • Renders the cluster config and runs ray up.
    • All required GCP_* variables must be set in repo-root .env (the script uses :?required checks and exits early if any are missing).
    • For the full list, see the GCP_* entries in .env.example (including GCP_PROJECT, GCP_REGION, GCP_ZONE, GCP_REPO, GCP_CLUSTER_NAME, GCP_SSH_USER, GCP_CONTAINER_NAME, GCP_GPU_TAG, GCP_CPU_TAG, GCP_SA_NAME, GCP_HEAD_RESERVATION, GCP_SSH_PRIVATE_KEY, GCP_USER_EMAIL, GCP_BUCKET_NAME).
    • You can override with exported env vars before running the script.
# from repo root
# optional overrides (examples)
# export GCP_PROJECT=my-gcp-project
# export GCP_REGION=us-central1
# export DO_BUILD=true
# export DO_RESTART=true

bash alphazero/infra/cluster/restart_cluster.sh

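As an optional preflight check before running the script, you can verify the required variables yourself. This is a hedged sketch: the variable list mirrors the one documented above, and the script's own ${VAR:?required} checks remain the source of truth.

```python
import os

REQUIRED_GCP_VARS = [
    # Mirrors the GCP_* list documented above; adjust to match .env.example.
    "GCP_PROJECT", "GCP_REGION", "GCP_ZONE", "GCP_REPO", "GCP_CLUSTER_NAME",
    "GCP_SSH_USER", "GCP_CONTAINER_NAME", "GCP_GPU_TAG", "GCP_CPU_TAG",
    "GCP_SA_NAME", "GCP_HEAD_RESERVATION", "GCP_SSH_PRIVATE_KEY",
    "GCP_USER_EMAIL", "GCP_BUCKET_NAME",
]

def missing_vars(env=os.environ):
    """Return the required GCP_* variables that are unset or empty."""
    return [name for name in REQUIRED_GCP_VARS if not env.get(name)]
```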