Model Evaluation: Arena and SPRT

The Promotion Gate

After training produces a new model (the "challenger"), it must prove it's stronger than the current best (the "champion") in head-to-head matches before replacing it. This gate exists to prevent a failure mode: if a weaker challenger replaced the champion, all subsequent self-play games would use that weaker model — generating lower-quality training data. This compounds: worse data produces a worse next model, which produces even worse data. The arena evaluation prevents this regression loop.

Arena Evaluation

The arena runs 40 head-to-head games between challenger and champion, split evenly: 20 games with the challenger playing Black (Player 1), 20 with the challenger playing White (Player 2). This alternation ensures the evaluation isn't biased by first-move advantage.
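A color-balanced schedule of this kind is simple to construct. The sketch below is illustrative only (the function name and the "challenger"/"champion" labels are not from the project's code):

```python
def make_arena_schedule(num_games=40):
    """Alternate which side the challenger plays each game.

    Returns a list of (black, white) pairings; the challenger gets
    the first move (Black) in exactly half the games.
    """
    return [
        ("challenger", "champion") if g % 2 == 0 else ("champion", "challenger")
        for g in range(num_games)
    ]

pairings = make_arena_schedule(40)
challenger_as_black = sum(1 for black, _ in pairings if black == "challenger")
print(challenger_as_black)  # 20: half the games with the first-move advantage
```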

Arena evaluation settings are configurable per run (evaluation.* in YAML). In V4 (day7/day8), challenger-vs-champion arena games used 1000 MCTS simulations per move with 6 stochastic opening turns (eval_opening_turns = 6, eval_temperature = 1.0, eval_dirichlet_epsilon = 0.2), then continued with deterministic move selection. The 600-search setting applied to baseline matches (baseline_num_searches = 600), not the main challenger-vs-champion arena.
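Collected as a config fragment, those settings might be grouped like this (a sketch: the nesting and the `num_games`, `num_searches`, and `promotion_win_rate` key names are illustrative; only the `eval_*` and `baseline_num_searches` names are taken from the text above):

```yaml
evaluation:
  num_games: 40                # illustrative key: 20 per color
  num_searches: 1000           # illustrative key: MCTS simulations per arena move
  eval_opening_turns: 6        # stochastic opening turns
  eval_temperature: 1.0
  eval_dirichlet_epsilon: 0.2
  promotion_win_rate: 0.55     # illustrative key: promotion threshold
  baseline_num_searches: 600   # baseline matches, not the main arena
```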

The promotion threshold is a 55% win rate. The evaluation also tracks guardrails beyond raw win rate:

  • Baseline win rate: if configured, the challenger must also beat a baseline model (e.g., an early checkpoint) above a minimum threshold — preventing catastrophic forgetting
  • Blunder rate: tracks how often the challenger makes moves that drastically drop its own position evaluation — a proxy for tactical instability

All three conditions must pass for promotion:

Source: alphazero/gomoku/alphazero/eval/arena.py:363-367

```python
# alphazero/gomoku/alphazero/eval/arena.py:363-367
summary["promote"] = bool(
    summary["pass_promotion"]      # win_rate >= 55%
    and summary["pass_baseline"]   # baseline_wr >= minimum
    and summary["pass_blunder"]    # blunder_rate <= limit
)
```

SPRT: Early Stopping

SPRT (Sequential Probability Ratio Test) was implemented to terminate arena matches early once statistical significance is reached — no need to play all 40 games if the result is already clear.

The log-likelihood ratio after $n$ games with win rate $w$:

$$LLR = n \cdot \left[ w \cdot \ln\frac{p_1}{p_0} + (1 - w) \cdot \ln\frac{1 - p_1}{1 - p_0} \right]$$

where $p_0$ is the null-hypothesis win rate (challenger is not better) and $p_1$ is the alternative hypothesis (challenger is better). In the implementation, $n$ and $w$ are computed in two modes: ignore draws (`n = wins + losses`, `w = wins / n`) or include draws (`n = wins + losses + draws`, `w = (wins + 0.5 * draws) / n`). The 0.5 weight treats a draw as half a win, following the standard Elo convention.

Source: alphazero/gomoku/alphazero/eval/sprt.py:7-36

```python
# alphazero/gomoku/alphazero/eval/sprt.py:7-36
import math

def check_sprt(cfg, wins, losses, draws):
    ignore_draws = cfg.ignore_draws

    if ignore_draws:
        n = wins + losses
        wr = wins / n if n > 0 else 0.0
    else:
        n = wins + losses + draws
        wr = (wins + 0.5 * draws) / n if n > 0 else 0.0

    if n <= 0:
        return "continue"

    llr = n * (
        wr * math.log(cfg.p1 / cfg.p0 + 1e-12)
        + (1 - wr) * math.log((1 - cfg.p1) / (1 - cfg.p0) + 1e-12)
    )

    # Wald decision boundaries for error rates alpha (type I) and beta (type II)
    lower = math.log(cfg.beta / (1 - cfg.alpha))
    upper = math.log((1 - cfg.beta) / cfg.alpha)

    if llr > upper:
        return "accept_h1"
    if llr < lower:
        return "accept_h0"
    return "continue"
```

If $LLR$ exceeds the upper boundary: accept the challenger (promote). If it drops below the lower boundary: reject (keep champion). Otherwise: keep playing.
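To see the early stop in action, here is a self-contained sketch of the same test with an illustrative `SPRTConfig` (p0 = 0.50, p1 = 0.55, and 5% error rates are common choices; the epsilon guards are dropped here since p0 and p1 are fixed nonzero constants):

```python
import math
from dataclasses import dataclass

@dataclass
class SPRTConfig:
    p0: float = 0.50      # H0: challenger is no better than the champion
    p1: float = 0.55      # H1: challenger is better
    alpha: float = 0.05   # tolerated false-positive rate
    beta: float = 0.05    # tolerated false-negative rate
    ignore_draws: bool = True

def check_sprt(cfg, wins, losses, draws):
    n = wins + losses if cfg.ignore_draws else wins + losses + draws
    if n <= 0:
        return "continue"
    w = (wins if cfg.ignore_draws else wins + 0.5 * draws) / n
    llr = n * (
        w * math.log(cfg.p1 / cfg.p0)
        + (1 - w) * math.log((1 - cfg.p1) / (1 - cfg.p0))
    )
    lower = math.log(cfg.beta / (1 - cfg.alpha))  # ~ -2.94 with these settings
    upper = math.log((1 - cfg.beta) / cfg.alpha)  # ~ +2.94 with these settings
    if llr > upper:
        return "accept_h1"
    if llr < lower:
        return "accept_h0"
    return "continue"

cfg = SPRTConfig()
# A long win streak crosses the upper boundary before all 40 games are played.
print(check_sprt(cfg, wins=35, losses=0, draws=0))  # accept_h1
```

Note how close the hypotheses are (0.50 vs 0.55): even an unbroken win streak needs dozens of games to reach significance, which is why lopsided matches stop early but near-even ones run the full schedule.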

Note: SPRT was implemented and validated, but intentionally not used in the training that produced the current champion model. Fixed 40-game evaluation gave cleaner cross-iteration comparisons in evaluation logs, and in this cycle self-play consumed more wall-clock time than evaluation, so consistency mattered more than early-stop speedups.

Elo Tracking

Cumulative Elo is tracked across training iterations as an internal progress metric:

$$\Delta\text{Elo} = 400 \cdot \log_{10}\left(\frac{w}{1-w}\right)$$

where $w$ is the challenger's win rate against the champion. Each successful promotion adds a positive delta.
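For a feel of the scale, the formula can be evaluated directly (a hypothetical helper, not the project's code): the 55% promotion threshold corresponds to a gain of roughly +35 Elo per promotion.

```python
import math

def elo_delta(win_rate: float) -> float:
    """Elo gain implied by a win rate w against the current champion."""
    return 400 * math.log10(win_rate / (1 - win_rate))

print(round(elo_delta(0.55), 1))  # 34.9 -- about +35 Elo at the threshold
```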

Important caveat: this Elo is self-referential — the new champion inherits the old champion's score, and the number can only increase as long as promotions succeed. It is not calibrated against any external benchmark and should not be interpreted as an absolute skill rating. For qualitative play behavior, see Training History, Deployment, and Lessons Learned.