Summary
Finished game 2000 (wordfish1 vs wordfish): 1/2-1/2 {Draw by 3-fold repetition}
--------------------------------------------------
Results of wordfish vs wordfish1 (10+0.1, 1t, 32MB, 8moves_v3.pgn):
Elo: 4.00 +/- 5.68, nElo: 10.71 +/- 15.23
LOS: 91.59 %, DrawRatio: 74.80 %, PairsRatio: 1.21
Games: 2000, Wins: 182, Losses: 159, Draws: 1659, Points: 1011.5 (50.58 %)
Ptnml(0-2): [5, 109, 748, 134, 4], WL/DD Ratio: 0.06
LLR: 0.39 (13.3%) (-2.94, 2.94) [0.00, 2.50]
What the result says (direction matters)
Your summary line, restated:
Results of wordfish vs wordfish1 (10+0.1, 1t, 32MB, 8moves_v3.pgn)
Elo: +4.00 ± 5.68, nElo: +10.71 ± 15.23, LOS: 91.59%
Games: 2000 (182-159-1659), Score: 50.58%
Ptnml(0-2): [5, 109, 748, 134, 4]; LLR: 0.39 (13.3%) against decision thresholds ±2.94, with Elo hypotheses [0.00, 2.50]
- As printed, the first engine (“wordfish”) is ahead of the second (“wordfish1”) by ~+4 Elo.
- If “wordfish1” is your Wordfish 1.0 dev and “wordfish” is the old/base version, then dev − base ≈ −4.0 Elo (i.e., dev is ~4 Elo weaker on this run).
- The error bar is larger than the estimate (±5.68), so the interval includes zero. In plain terms: no statistically significant difference—the data are compatible with anything from a small loss to a small gain.
LOS and LLR, decoded
- LOS 91.6% ≈ the probability that the first-named engine (wordfish) is genuinely the stronger one, given the observed results. That again points against the dev if dev = wordfish1.
- LLR = 0.39 with decision thresholds −2.94 / +2.94 (the usual α = β = 5% SPRT thresholds) means the sequential test is far from a decision: only ~13% of the way to either threshold. In other words, the SPRT is inconclusive at this sample size (chessprogramming.org).
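As a quick numeric check of both bullets, here is a minimal Python sketch. It assumes the printed ± on nElo is a 95% interval (so the standard error is the ± divided by 1.96), that LOS is the normal-CDF tail of the resulting z-score, and that the ±2.94 thresholds come from α = β = 0.05; the exact formulas your tool uses may differ slightly.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Figures copied from the summary above.
nelo, nelo_err95 = 10.71, 15.23   # normalised Elo and its printed +/- (assumed 95% CI)
llr, alpha, beta = 0.39, 0.05, 0.05

# LOS as the one-sided probability that the true difference is positive.
z = nelo / (nelo_err95 / 1.96)                      # ~1.38
print(f"LOS ~ {100 * normal_cdf(z):.1f} %")         # ~91.6 %, matching the printed 91.59 %

# SPRT decision thresholds for alpha = beta = 5 %.
upper = math.log((1 - beta) / alpha)                # ~ +2.944
lower = math.log(beta / (1 - alpha))                # ~ -2.944
print(f"thresholds ~ {lower:.2f} / +{upper:.2f}")

# Progress towards a decision: fraction of the way to the nearer threshold.
print(f"progress ~ {100 * llr / upper:.1f} %")      # ~13 %, in line with the printed 13.3 %
```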
Pentanomial line sanity-check
Your Ptnml(0-2) = [5, 109, 748, 134, 4] sums to 1000 opening pairs, which matches 2000 games. The bins count pair scores for the first-named engine in ascending order 0, 0.5, 1, 1.5, 2 (i.e., LL, LD, WL or DD, WD, WW), exactly the standard pentanomial pairing model. The huge middle bin (748 pairs, almost all double draws given the WL/DD ratio of 0.06) explains the slow information gain (Chess Stack Exchange; official-stockfish.github.io).
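To make the pairing arithmetic concrete, the sketch below rebuilds the headline numbers from nothing but the Ptnml counts. The normalised-Elo formula used here (per-game z-score scaled by 800/ln 10, with a 1.96 factor for the 95% error) is my reading of the usual pentanomial write-ups, not necessarily your tool’s exact code path, but it lands very close to the printed figures.

```python
import math

# Ptnml(0-2): pair counts for the first-named engine (wordfish),
# ordered by pair score 0, 0.5, 1, 1.5, 2.
ptnml = [5, 109, 748, 134, 4]
pair_scores = [0.0, 0.5, 1.0, 1.5, 2.0]

pairs = sum(ptnml)                        # 1000 pairs -> 2000 games
games = 2 * pairs
points = sum(c * s for c, s in zip(ptnml, pair_scores))   # 1011.5
score = points / games                                     # ~0.5058

# Per-game mean and variance, estimated from the paired (pentanomial) results.
per_game = [s / 2 for s in pair_scores]
mean = sum(c * x for c, x in zip(ptnml, per_game)) / pairs
var_pair = sum(c * (x - mean) ** 2 for c, x in zip(ptnml, per_game)) / pairs
se_mean = math.sqrt(var_pair / pairs)     # standard error of the mean score

# Normalised Elo: per-game z-score scaled by 800/ln(10) (assumed convention).
z = (mean - 0.5) / se_mean                # ~1.38
scale = 800 / math.log(10)                # ~347.4
nelo = z / math.sqrt(games) * scale            # ~ +10.7
nelo_err95 = 1.96 / math.sqrt(games) * scale   # ~ 15.2

print(f"pairs={pairs}, points={points}, score={100 * score:.3f} %")
print(f"nElo ~ {nelo:+.1f} +/- {nelo_err95:.1f}")
```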
Bottom line (dev vs old)
- Point estimate: −4 Elo for Wordfish 1.0 dev relative to the old version (assuming dev = wordfish1).
- Uncertainty (as reported): about ±5.7 Elo, so the plausible range is roughly [−9.7, +1.7] Elo for dev−old.
- Decision: No reliable Elo gain detected yet; if anything, the small tilt favours the old engine, but it’s not significant.
(If instead your naming is reversed—i.e., “wordfish” is the dev and “wordfish1” the old—then flip the sign: dev ≈ +4 Elo vs old, still not significant.)
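To make the arithmetic above explicit, here is a tiny check, assuming the printed ± can be read as a symmetric interval around the point estimate and that dev = wordfish1:

```python
# The reported +4.00 +/- 5.68 is for wordfish minus wordfish1; flip the sign for dev - old.
elo, err = 4.00, 5.68
dev_vs_old = -elo
lo, hi = dev_vs_old - err, dev_vs_old + err
print(f"dev - old ~ {dev_vs_old:+.1f} Elo, roughly [{lo:+.1f}, {hi:+.1f}]")   # [-9.7, +1.7]
print("interval includes zero -> not significant" if lo < 0.0 < hi else "interval excludes zero")
```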
Why this conclusion is sound
- Elo from score: 50.58% → ~+4 Elo via $400\log_{10}\frac{p}{1-p}$, matching the tool’s figure (worked through in the short snippet after this list).
- SPRT mechanics: the test compares H0 vs H1 by accumulating a log-likelihood ratio until it crosses one of the ± thresholds; your LLR of 0.39 is far from ±2.94, so neither “no gain” nor “≥2.5 Elo gain” is established (chessprogramming.org).
- Pentanomial statistics & “nElo”: modern engine tests use the pentanomial (game-pair) model and often report “normalised Elo” (nElo) to reduce the dependence on draw rate; your nElo of +10.7 ± 15.2 tells the same story: large uncertainty around a tiny edge (official-stockfish.github.io).
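Worked through, the score-to-Elo conversion in the first bullet looks like this (the inverse is included just to show the round trip):

```python
import math

def score_to_elo(p: float) -> float:
    """Logistic Elo difference implied by an average score p (0 < p < 1)."""
    return 400.0 * math.log10(p / (1.0 - p))

def elo_to_score(elo: float) -> float:
    """Inverse: expected score implied by a logistic Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

print(f"{score_to_elo(0.5058):+.2f} Elo")   # ~ +4.0, close to the printed +4.00
print(f"{100 * elo_to_score(4.0):.2f} %")   # ~ 50.58 %
```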
Practical tips to reach a decision sooner
- Use a more biased opening set (e.g., UHO_Lichess or Pohl) to cut the draw rate, which speeds up SPRT convergence (chessprogramming.org).
- Tighten the bounds to the gains you actually care about (e.g., [0, 2.0] or [0.5, 2.5]); wide bounds can terminate faster when the signal is clear, but they’re inappropriate if you want to detect small improvements reliably (chessprogramming.org). A rough expected-games estimate follows this list.
- Consider more games or shorter time controls if throughput is the bottleneck (mind the risk of evaluation noise).
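To put rough numbers on the bounds trade-off mentioned above, here is a back-of-envelope Wald estimate of the expected number of games when the true strength difference sits exactly on elo1. It assumes the [elo0, elo1] bounds are in normalised Elo (as modern Fishtest-style tests use) and the usual normal approximation for the per-game log-likelihood increment; treat it as an order-of-magnitude guide, not something your tool would report.

```python
import math

def expected_games(elo0: float, elo1: float, alpha: float = 0.05, beta: float = 0.05) -> float:
    """Rough Wald estimate of E[games] when the true normalised Elo equals elo1.

    The per-game expected LLR under H1 is approximately (t1 - t0)^2 / 2, where
    t = nElo * ln(10) / 800 is the per-game effect size (assumed convention).
    """
    upper = math.log((1 - beta) / alpha)    # ~ +2.944
    lower = math.log(beta / (1 - alpha))    # ~ -2.944
    t0 = elo0 * math.log(10) / 800
    t1 = elo1 * math.log(10) / 800
    per_game_llr = (t1 - t0) ** 2 / 2
    return ((1 - beta) * upper + beta * lower) / per_game_llr

# Narrowing the gap between elo0 and elo1 lengthens the test roughly quadratically.
for bounds in [(0.0, 2.5), (0.0, 2.0), (0.5, 2.5)]:
    print(bounds, f"~ {expected_games(*bounds):,.0f} games")
```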
One-line answer (what you asked for)
Wordfish 1.0 dev vs old: ≈ −4 Elo (or +4 Elo if your labels are swapped), not statistically significant; no proven Elo gain based on this 2000-game run.
References (for definitions/interpretation): Rustic’s SPRT overview and examples; Stockfish’s Fishtest maths (pentanomial, nElo, LLR/Bounds); CPW on SPRT.