Summary
Finished game 5000 (Wordfish base vs Wordfish 1.0 dev 260825): 1/2-1/2 {Draw by adjudication: SyzygyTB}
Score of Wordfish 1.0 dev 260825 vs Wordfish base: 1571 - 1448 - 1981 [0.512] 5000
... Wordfish 1.0 dev 260825 playing White: 756 - 718 - 1026 [0.508] 2500
... Wordfish 1.0 dev 260825 playing Black: 815 - 730 - 955 [0.517] 2500
... White vs Black: 1486 - 1533 - 1981 [0.495] 5000
Elo difference: 8.5 +/- 7.5, LOS: 98.7 %, DrawRatio: 39.6 %
SPRT: llr 1.25 (42.5%), lbound -2.94, ubound 2.94
Bottom line
- Estimated Elo gain: +8.5 Elo for Wordfish 1.0 dev 260825 vs Wordfish base, with reported uncertainty ±7.5 Elo.
- Likelihood the dev is stronger (LOS): 98.7% (reproduced in the quick check below).
- SPRT status: Inconclusive so far; the LLR of 1.25 has not crossed the decision bounds of ±2.94, so a textbook SPRT would keep running (chessprogramming.org).
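The LOS figure can be reproduced from the decisive-game counts alone. Below is a minimal sketch of the standard normal-approximation formula used by common engine-testing tools (draws cancel out of the win/loss difference); it is an illustration, not the tool's exact code:

```python
import math

def los(wins: int, losses: int) -> float:
    """Likelihood of superiority: P(true strength difference > 0).

    Normal approximation on the win/loss difference; draws drop out.
    """
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(f"LOS = {los(1571, 1448):.1%}")  # ~98.7 %, matching the reported value
```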
How this squares with the numbers you gave
- Match total: 5000 games → 1571–1448–1981 (W–L–D).
The total score is 0.512 (0.5123 before rounding), which corresponds to about +8.5 Elo via the logistic transform Elo ≈ 400·log10(p/(1−p)); that matches the reported +8.5 Elo. The conversion, including the per-colour splits, is spelled out in the sketch after this list.
- By colour:
- White: 0.508 over 2500 games.
- Black: 0.517 over 2500 games.
Both colours score above 0.5, so the gain looks consistent across colours (if anything, slightly larger as Black), but the effect size is small either way.
- Draw ratio: 39.6%, which is typical for balanced engine self-play and helps explain why many games are needed to pin down a small Elo signal.
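The score-to-Elo conversion and the per-colour splits can be checked with a few lines; this is a small sketch that uses only the counts from the match output above:

```python
import math

def elo_from_score(p: float) -> float:
    """Logistic (Elo) transform of an expected score p, 0 < p < 1."""
    return 400.0 * math.log10(p / (1.0 - p))

def score(wins: int, losses: int, draws: int) -> float:
    """Match score as a fraction of the available points."""
    return (wins + 0.5 * draws) / (wins + losses + draws)

results = {
    "overall": (1571, 1448, 1981),
    "as White": (756, 718, 1026),
    "as Black": (815, 730, 955),
}
for label, (w, l, d) in results.items():
    p = score(w, l, d)
    print(f"{label}: score {p:.4f} -> {elo_from_score(p):+.1f} Elo")
# overall ~+8.5, as White ~+5.3, as Black ~+11.8
```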
Interpreting LOS, LLR and the SPRT line
- LOS 98.7% means: given the observed data and the model, there’s a very high probability the dev build is stronger than the base (i.e., the Elo difference is > 0). LOS is a widely used “probability of superiority” metric in engine testing (official-stockfish.github.io, rustic-chess.org).
- SPRT (Sequential Probability Ratio Test): a sequential test that updates a Log-Likelihood Ratio (LLR) after each batch of games and stops only when the LLR hits a bound (accept or reject). Your LLR = 1.25 is inside the bounds [−2.94, +2.94], so the test has not reached a decision. In other words: strong evidence of a small positive effect, but not yet enough to declare the patch “passed” under those specific SPRT settings (chessprogramming.org).
- The bounds you quoted (±2.94) are consistent with common settings (e.g., α = β = 0.05 with some Elo0/Elo1 pair); the exact mapping from α, β and (elo0, elo1) to LLR bounds is described in the Stockfish/Fishtest documentation and the chess-programming literature (official-stockfish.github.io, chessprogramming.org), and is sketched below.
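For concreteness, here is a sketch of both pieces: the Wald bounds from (α, β), and a simplified trinomial LLR computed from W/L/D. The (elo0, elo1) pair below is a placeholder, since the hypotheses used for this test aren't shown, and frameworks such as Fishtest use a pentanomial GSPRT, so this simplified formula will not reproduce the reported LLR of 1.25 exactly:

```python
import math

def sprt_bounds(alpha: float = 0.05, beta: float = 0.05):
    """Wald SPRT decision bounds on the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))   # ~ -2.94 for alpha = beta = 0.05
    upper = math.log((1.0 - beta) / alpha)   # ~ +2.94
    return lower, upper

def llr_trinomial(wins: int, losses: int, draws: int, elo0: float, elo1: float) -> float:
    """Simplified LLR for H1 (elo1) vs H0 (elo0), normal approximation on W/L/D."""
    n = wins + losses + draws
    w, d = wins / n, draws / n
    s = w + 0.5 * d                               # observed mean score
    var = (w + 0.25 * d) - s * s                  # per-game score variance
    s0 = 1.0 / (1.0 + 10.0 ** (-elo0 / 400.0))    # expected score under H0
    s1 = 1.0 / (1.0 + 10.0 ** (-elo1 / 400.0))    # expected score under H1
    return n * (s1 - s0) * (2.0 * s - s0 - s1) / (2.0 * var)

print(sprt_bounds())                              # (-2.944..., 2.944...)
# Hypothetical (elo0, elo1), purely for illustration:
print(llr_trinomial(1571, 1448, 1981, elo0=0.0, elo1=5.0))
```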
Practical conclusion
- Treat this as a likely but modest gain: ≈ +8–9 Elo.
- If your pass criterion is strict SPRT (hit the upper bound), keep the test running. If your workflow allows promoting patches with high LOS and a positive estimate even before the bound is crossed, this result already supports the dev version.
BayesElo-style posterior interval from your raw W/L/D:
- MLE Elo estimate: ≈ +8.55 Elo
- Posterior mean Elo (Jeffreys prior): ≈ +8.55 Elo
- 95% credible interval: ≈ [+1.1, +16.0] Elo (consistent with the reported 8.5 ± 7.5 and the 98.7% LOS; a Monte Carlo sketch follows).
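One way to produce such an interval is a Monte Carlo sketch with a Jeffreys prior on the W/D/L probabilities: the prior is Dirichlet(1/2, 1/2, 1/2), so the posterior after observing (W, D, L) is Dirichlet(W+1/2, D+1/2, L+1/2). NumPy is assumed; this is an illustration, not the exact BayesElo computation:

```python
import numpy as np

W, D, L = 1571, 1981, 1448
rng = np.random.default_rng(0)

# Sample W/D/L probabilities from the Dirichlet posterior, map each sample
# to an expected score, then through the logistic Elo transform.
samples = rng.dirichlet([W + 0.5, D + 0.5, L + 0.5], size=200_000)
score = samples[:, 0] + 0.5 * samples[:, 1]
elo = 400.0 * np.log10(score / (1.0 - score))

print(f"posterior mean: {elo.mean():+.1f} Elo")                 # roughly +8.5 for these counts
print(f"95% credible interval: [{np.percentile(elo, 2.5):+.1f}, "
      f"{np.percentile(elo, 97.5):+.1f}] Elo")                  # roughly [+1, +16]
```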