Summary
Results of wordfish vs wordfish1 (10+0.1, 4t, 32MB, UHO_2024_8mvs_+085_+094.pgn):
Elo: -2.39 +/- 5.64, nElo: -4.78 +/- 11.29
LOS: 20.32 %, DrawRatio: 53.41 %, PairsRatio: 0.93
Games: 3636, Wins: 815, Losses: 840, Draws: 1981, Points: 1805.5 (49.66 %)
Ptnml(0-2): [8, 430, 971, 397, 12], WL/DD Ratio: 0.68
LLR: -2.95 (-100.1%) (-2.94, 2.94) [0.00, 10.00]
--------------------------------------------------
SPRT ([0.00, 10.00]) completed - H0 was accepted
Player: wordfish1
Timeouts: 1
Crashed: 0
Bottom line
- Estimated Elo difference (dev − old): −2.39 Elo with an error bar of ±5.64.
- Likelihood of Superiority (LOS): 20.32% (i.e., only a 1-in-5 chance the dev build is actually stronger).
- SPRT result: H0 accepted for bounds [0, +10]. In plain terms, the data favour “no improvement” (0 Elo) over a +10 Elo gain.
- Conclusion: At this time control and setup, Wordfish 1.0 dev is statistically indistinguishable from the old version, and the point estimate leans slightly negative. Any real difference is small: the 95% interval runs from roughly −8 to +3 Elo, so this sample does not settle even the sign.
How the numbers support that
- Summary statistics from the run:
  - Games: 3,636 (W/L/D = 815/840/1981; score 49.66%)
  - Elo: −2.39 ± 5.64 → the interval straddles 0, so neither a gain nor a loss is significant.
  - nElo: −4.78 ± 11.29 (same story on the normalised scale).
  - Draw ratio: 53.41%. This is the share of game pairs that finished 1–1 (971 of the 1,818 pairs in Ptnml), not the share of drawn games (1981 of 3,636, about 54.5%). It is perfectly normal at 10+0.1 with paired openings.
- LLR: −2.95 with bounds (−2.94, +2.94) for SPRT [0, +10] → the LLR crossed the lower boundary, so the test accepts H0 (“no ≥+10 Elo improvement”).
- Report line: “SPRT ([0.00, 10.00]) completed - H0 was accepted” confirms the decision.
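The headline figures can be cross-checked from the raw counts. The sketch below assumes the usual conventions (logistic Elo, a normal approximation on the pentanomial pair scores, and nElo defined per game); these are assumptions about how the test tool computes its output, not something the report states. Under them, the results should land close to the reported Elo, error bar, nElo, LOS, DrawRatio and PairsRatio.

```python
import math

# Figures copied from the report above.
wins, losses, draws = 815, 840, 1981
ptnml = [8, 430, 971, 397, 12]  # game pairs scoring 0, 0.5, 1, 1.5, 2 points


def score_to_elo(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


# Trinomial view: plain score and Elo from W/L/D.
games = wins + losses + draws
score = (wins + 0.5 * draws) / games
elo = score_to_elo(score)

# Pentanomial view: each pair is one observation, scored per game (0..1).
pairs = sum(ptnml)
pair_scores = [0.0, 0.25, 0.5, 0.75, 1.0]
mean = sum(n * s for n, s in zip(ptnml, pair_scores)) / pairs
var = sum(n * (s - mean) ** 2 for n, s in zip(ptnml, pair_scores)) / pairs
stderr = math.sqrt(var / pairs)

# 95% interval on the score, mapped through the Elo curve.
elo_lo = score_to_elo(mean - 1.96 * stderr)
elo_hi = score_to_elo(mean + 1.96 * stderr)
elo_err = (elo_hi - elo_lo) / 2

# LOS: probability that the true score exceeds 50% (normal approximation).
los = 0.5 * (1.0 + math.erf((mean - 0.5) / (stderr * math.sqrt(2.0))))

# Normalised Elo: score difference in units of per-game standard deviation.
nelo = (mean - 0.5) / math.sqrt(2.0 * var) * 800.0 / math.log(10.0)

draw_ratio = ptnml[2] / pairs  # pairs that finished 1-1
pairs_ratio = (ptnml[3] + ptnml[4]) / (ptnml[0] + ptnml[1])  # won / lost pairs

print(f"Elo {elo:+.2f} +/- {elo_err:.2f}, nElo {nelo:+.2f}, LOS {los:.2%}")
print(f"DrawRatio {draw_ratio:.2%}, PairsRatio {pairs_ratio:.2f}")
```

Because the error bar comes from the pair-level variance, it is the pentanomial counts, not W/L/D alone, that determine how wide the interval is.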
Interpreting SPRT & the LLR (why this is a “no-gain” result)
- SPRT sequentially accumulates evidence for H0 (no or negligible gain) vs H1 (a target gain, here +10 Elo). When the log-likelihood ratio (LLR) hits the lower bound, we accept H0; when it hits the upper bound, we accept H1. Your test hit the lower bound. (chessprogramming.org, tests.stockfishchess.org)
- The pentanomial model scores each opening pair as a unit, which accounts for the correlation between the two games of a pair and typically tightens the error bars; “normalised Elo” expresses the difference in units of the result's standard deviation. The Rustic notes and Fishtest stats pages document these conventions and what the bounds mean. (rustic-chess.org, tests.stockfishchess.org)
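The stopping rule itself is short enough to sketch. The ±2.94 bounds in the report are consistent with Wald's thresholds for alpha = beta = 0.05; those error rates are inferred from the bounds rather than stated in the output, and the function names here are illustrative only.

```python
import math


def sprt_bounds(alpha, beta):
    """Wald's approximate LLR thresholds for error rates alpha and beta."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def sprt_decision(llr, alpha=0.05, beta=0.05):
    """Return the SPRT outcome for the current LLR."""
    lower, upper = sprt_bounds(alpha, beta)
    if llr <= lower:
        return "accept H0"
    if llr >= upper:
        return "accept H1"
    return "continue"


lower, upper = sprt_bounds(0.05, 0.05)
print(f"bounds: ({lower:.2f}, {upper:.2f})")
print(sprt_decision(-2.95))  # the LLR from the report
```

This covers only the decision step; computing the LLR from the game results depends on the Elo model the test tool uses and is not reproduced here.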
Practical takeaways
- Treat this as no measurable improvement, with a possible small regression (point estimate about −2.4 Elo, well inside the ±5.64 error bar), at 10+0.1, 4 threads, 32 MB, UHO_2024_8mvs_+085_+094.
- If you want to detect smaller effects (e.g., ±2 Elo), either:
  - Tighten the SPRT to [0, +2] (same alpha/beta), or
  - Run a longer fixed-length test (substantially more games).
- In either case, prefer 1 thread, a large book of paired openings, and pentanomial statistics to reduce variance (common guidance in Fishtest/OpenBench). (tests.stockfishchess.org)
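To put a rough number on “substantially more games”, the error bar can be scaled by the square root of the game count. This is a back-of-the-envelope sketch: it assumes the per-game variance stays the same as in this run and that a ±2 Elo 95% error bar is the resolution you want; the function name is hypothetical.

```python
import math


def games_for_error_bar(games_played, error_now, error_target):
    """Scale the game count so the 95% error bar shrinks to error_target.

    Assumes the per-game variance is unchanged, so the error bar
    shrinks with the square root of the number of games.
    """
    return math.ceil(games_played * (error_now / error_target) ** 2)


# 3,636 games gave +/- 5.64 Elo; how many for +/- 2 Elo?
needed = games_for_error_bar(3636, 5.64, 2.0)
print(needed)
```

With these inputs the estimate is on the order of 29,000 games, roughly eight times the current sample. An SPRT with narrower bounds faces the same square-root cost, although its actual length depends on the true difference.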
Final verdict
Elo gain of Wordfish 1.0 dev vs the old version: approximately −2.4 Elo (not significant). The SPRT accepted H0 over the +10 Elo alternative, which supports no improvement under your test conditions.