SPRT test 200825 wordfish dev vs Base

SPRT Tests for Chess Engines

Summary

Finished game 5000 (Wordfish base vs Wordfish 1.0 dev 260825): 1/2-1/2 {Draw by adjudication: SyzygyTB}
Score of Wordfish 1.0 dev 260825 vs Wordfish base: 1571 - 1448 - 1981  [0.512] 5000
...      Wordfish 1.0 dev 260825 playing White: 756 - 718 - 1026  [0.508] 2500
...      Wordfish 1.0 dev 260825 playing Black: 815 - 730 - 955  [0.517] 2500
...      White vs Black: 1486 - 1533 - 1981  [0.495] 5000
Elo difference: 8.5 +/- 7.5, LOS: 98.7 %, DrawRatio: 39.6 %
SPRT: llr 1.25 (42.5%), lbound -2.94, ubound 2.94

Bottom line

  • Estimated Elo gain: +8.5 Elo for Wordfish 1.0 dev 260825 vs Wordfish base, with reported uncertainty ±7.5 Elo.
  • Likelihood the dev is stronger (LOS): 98.7%.
  • SPRT status: inconclusive so far. The LLR of 1.25 has not crossed either decision bound (±2.94), so a textbook SPRT would keep running (chessprogramming.org).

How this squares with the numbers you gave

  • Match total: 5000 games, 1571–1448–1981 (W–L–D).
    The total score is 0.512, which corresponds to about +8.34 Elo via the logistic transform Elo ≈ 400·log10(p/(1−p)); the unrounded score of 0.5123 gives +8.55, which aligns with the reported +8.5 Elo.
  • By colour:
    • White: 0.508 over 2500 games.
    • Black: 0.517 over 2500 games.
      Both colours score above 50%, so the gain does not depend on colour (if anything, it is slightly larger as Black), but the effect size is still small.
  • Draw ratio: 39.6%, which is typical for balanced engine self-play and helps explain why many games are needed to pin down a small Elo signal.
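
The score-to-Elo conversion above, and an error margin close to the reported ±7.5, can be reproduced with a short sketch. It assumes the usual normal approximation on the per-game score (win = 1, draw = 0.5, loss = 0); the function names are illustrative, and the match runner's exact formula may differ in detail.

```python
import math

def elo_from_score(p: float) -> float:
    """Logistic transform: Elo difference implied by an expected score p."""
    return 400.0 * math.log10(p / (1.0 - p))

def elo_with_margin(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (elo, margin) from W/L/D using a normal approximation.

    The margin is half the width of the z-sigma score interval after
    mapping both ends to Elo (assumed method, not taken from the log).
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the outcome (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + losses * (0.0 - score) ** 2
           + draws * (0.5 - score) ** 2) / n
    se = math.sqrt(var / n)
    lo = elo_from_score(score - z * se)
    hi = elo_from_score(score + z * se)
    return elo_from_score(score), (hi - lo) / 2.0

elo, margin = elo_with_margin(1571, 1448, 1981)
print(f"Elo difference: {elo:.1f} +/- {margin:.1f}")
```

Because draws sit close to the mean score, they contribute almost nothing to the variance term, which is why the margin here is tighter than a pure win/loss binomial model would give.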

Interpreting LOS, LLR and the SPRT line

  • LOS 98.7% means that, given the observed data and the model, there is a very high probability that the dev build is stronger than the base (i.e., the Elo difference is > 0). LOS is a widely used “probability of superiority” metric in engine testing (official-stockfish.github.io, rustic-chess.org).
  • SPRT (Sequential Probability Ratio Test): a sequential test that updates a Log-Likelihood Ratio (LLR) after each batch of games and stops only when the LLR hits a bound (accept or reject). Your LLR = 1.25 is inside the bounds [−2.94, +2.94], so the test has not reached a decision. In other words: strong evidence of a small positive effect, but not yet enough to declare the patch “passed” under those specific SPRT settings (chessprogramming.org).
  • The bounds you quoted (±2.94) match the common setting α = β = 0.05: under Wald's approximation the bounds are ln(β/(1−α)) ≈ −2.94 and ln((1−β)/α) ≈ +2.94. They depend only on α and β; the (elo0, elo1) pair enters through the LLR itself. The details are described in the Stockfish/Fishtest documentation and the chess-programming literature (official-stockfish.github.io, chessprogramming.org).
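
Both the LOS figure and the SPRT bounds can be checked from the log with a few lines. This is a minimal sketch that assumes the commonly used erf-based LOS formula (draws ignored) and Wald's approximate bounds with α = β = 0.05; the function names are illustrative. The LLR itself is not recomputed, because it depends on the elo0/elo1 hypotheses, which the log does not show.

```python
import math

def los(wins: int, losses: int) -> float:
    """Likelihood of superiority from decisive games only (normal approximation)."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

def sprt_bounds(alpha: float, beta: float):
    """Wald's approximate SPRT decision bounds on the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper

lower, upper = sprt_bounds(0.05, 0.05)   # alpha = beta = 0.05 is an assumption
llr = 1.25                               # value taken from the SPRT line of the log

print(f"LOS: {100.0 * los(1571, 1448):.1f} %")
print(f"bounds: {lower:.2f}, {upper:.2f}")
print(f"progress toward upper bound: {100.0 * llr / upper:.1f} %")
```

The last line corresponds to the percentage printed next to the LLR in the log: the LLR expressed as a fraction of the upper bound.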

Practical conclusion

  • Treat this as a likely but modest gain: ≈ +8–9 Elo.
  • If your pass criterion is strict SPRT (hit the upper bound), keep the test running. If your workflow allows promoting patches with high LOS and a positive estimate even before the bound is crossed, this result already supports the dev version.

BayesElo-style posterior interval from your raw W/L/D:

  • MLE Elo estimate: +8.55 Elo
  • Posterior mean Elo (Jeffreys prior): +8.55 Elo
  • 95% credible interval: [−1.1, +18.2] Elo
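
The post does not state how this interval was computed. One model that lands close to the quoted figures, shown here purely as an assumption, treats each game as a trial whose success probability is the score (a draw counted as half a win), places a Jeffreys Beta(0.5, 0.5) prior on that probability, and maps a normal approximation of the Beta posterior to Elo. The function names are illustrative.

```python
import math

def elo_from_score(p: float) -> float:
    """Logistic transform: Elo difference implied by an expected score p."""
    return 400.0 * math.log10(p / (1.0 - p))

def beta_posterior_elo(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (mean_elo, lower_elo, upper_elo) under a Beta posterior on the score.

    Assumed model: draws are split as half a win and half a loss, and a
    Jeffreys prior adds 0.5 to each Beta parameter.
    """
    a = wins + 0.5 * draws + 0.5
    b = losses + 0.5 * draws + 0.5
    mean = a / (a + b)
    # Standard deviation of a Beta(a, b) distribution.
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))
    return (elo_from_score(mean),
            elo_from_score(mean - z * sd),
            elo_from_score(mean + z * sd))

mean_elo, lower_elo, upper_elo = beta_posterior_elo(1571, 1448, 1981)
print(f"posterior mean: {mean_elo:+.2f} Elo")
print(f"95% interval: [{lower_elo:+.1f}, {upper_elo:+.1f}] Elo")
```

Under this model a draw is as noisy as a coin flip between a win and a loss, so the interval comes out wider than the ±7.5 on the Elo line of the log, which accounts for the draw ratio.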
