Summary
Results of wordfish vs wordfish dev (10+0.1, 1t, 32MB, 8moves_v3.pgn):
Elo: 0.69 +/- 8.67, nElo: 1.73 +/- 21.53
LOS: 56.24 %, DrawRatio: 73.60 %, PairsRatio: 1.10
Games: 1000, Wins: 88, Losses: 86, Draws: 826, Points: 501.0 (50.10 %)
Ptnml(0-2): [7, 56, 368, 66, 3], WL/DD Ratio: 0.05
LLR: 0.01 (0.4%) (-2.94, 2.94) [0.00, 2.00]
Bottom line: about +0.7 Elo, which is statistically indistinguishable from 0 under these conditions. Match tools normally report from the first-named engine’s perspective, so if that holds here the +0.7 belongs to wordfish rather than wordfish 1.0 dev; either way it is within noise. Your own match output already says it: Elo: 0.69 ± 8.67 (nElo: 1.73 ± 21.53), LOS 56.24%, LLR 0.01 with SPRT bounds (−2.94, +2.94). That combination means the test is inconclusive, and you should treat the two engines as equal in strength. (Sources: official-stockfish.github.io, chessprogramming.org, talkchess.com)
Why that’s the correct interpretation
- Point estimate: 50.10% score over 1000 games ⇒ ~+0.69 Elo (the standard logistic mapping). Your tool reports the same.
- Uncertainty: ±8.67 Elo means the 95% interval spans roughly [−8.0, +9.4] Elo; since zero lies inside, there’s no significant gain. LOS 56% is only a hair above a coin-flip. (Source: chessprogramming.org)
- SPRT status: LLR = 0.01 is essentially at the origin and well inside the decision bounds ±2.94 (the usual values for α=β=5%). Because the LLR hasn’t hit either boundary, the SPRT has accepted neither H1 (a gain of about 2 Elo, the upper hypothesis in [0.00, 2.00]) nor H0 (no gain). In other words, keep testing if you want a formal pass/fail. (Sources: chessprogramming.org, talkchess.com)
- About nElo: Your nElo 1.73 ± 21.53 matches the story. Normalised Elo, as used on Fishtest, expresses the score difference in units of the per-game standard deviation, which makes results comparable across draw rates and opening books. Here it’s clearly not significant either. (Source: official-stockfish.github.io)
- Pentanomial / draw context: DrawRatio 73.6% with Ptnml(0–2) = [7, 56, 368, 66, 3] is typical of closely matched engines. With 368 of 500 game pairs in the middle bin, a small real difference moves the score only slightly, so it takes many games to separate it from noise. (Source: chessprogramming.org)
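The figures in the bullets above can be re-derived from the Ptnml counts alone. The sketch below is a rough cross-check, not your tool's actual code: it assumes the five bins hold game pairs scoring 0, 0.5, 1, 1.5 and 2 points, that the error bar is a 1.96-sigma interval on the pair mean, and that nElo and LOS use the usual pentanomial formulas. All variable names are made up for illustration.

```python
import math

# Pentanomial pair counts copied from the match output.
# Assumption: bins are pairs scoring 0, 0.5, 1, 1.5, 2 points.
ptnml = [7, 56, 368, 66, 3]
pairs = sum(ptnml)  # 500 pairs = 1000 games

# Per-game score represented by each bin (pair points divided by 2).
bin_scores = [0.0, 0.25, 0.5, 0.75, 1.0]

mean = sum(n * s for n, s in zip(ptnml, bin_scores)) / pairs
var = sum(n * (s - mean) ** 2 for n, s in zip(ptnml, bin_scores)) / pairs


def elo(score):
    """Standard logistic mapping from expected score to Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


# Standard error of the mean score, and a 95% half-width (1.96 sigma).
std_err = math.sqrt(var / pairs)
half_width = 1.96 * std_err

elo_est = elo(mean)
elo_err = (elo(mean + half_width) - elo(mean - half_width)) / 2.0

# Normalised Elo: score difference in units of per-game standard deviation.
nelo = (mean - 0.5) / math.sqrt(2.0 * var) * 800.0 / math.log(10)

# Likelihood of superiority: normal CDF of the z-score of the mean.
los = 0.5 * (1.0 + math.erf((mean - 0.5) / std_err / math.sqrt(2.0)))

draw_ratio = ptnml[2] / pairs
pairs_ratio = (ptnml[3] + ptnml[4]) / (ptnml[0] + ptnml[1])

print(f"Elo {elo_est:.2f} +/- {elo_err:.2f}, nElo {nelo:.2f}, LOS {los:.2%}")
print(f"DrawRatio {draw_ratio:.2%}, PairsRatio {pairs_ratio:.2f}")
```

Under those assumptions the values land close to the ones in your output (Elo about 0.69 ± 8.67, nElo about 1.73, LOS about 56%), which suggests the reported statistics come from the pair counts rather than from the plain win/loss/draw totals.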
Practical take-away
Under 10+0.1, 1 thread, 32 MB, 8-move book, your data do not justify claiming an Elo gain for wordfish 1.0 dev over the prior version. If you want a decisive SPRT result, the main lever is more games: your run already uses bounds of [0.00, 2.00], so the LLR simply needs a larger sample to reach ±2.94. If your tool supports it, consider expressing those bounds in Fishtest-style nElo (e.g., H0=0, H1≈+2 nElo with α=β=0.05). Higher-variety books and more games help when draw rates are ~70%+. (Sources: official-stockfish.github.io, rustic-chess.org)
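For reference, the ±2.94 limits follow directly from α and β via Wald's SPRT thresholds. This is a minimal sketch with a hypothetical helper name (`sprt_bounds`); it only shows where the bounds come from and how the current LLR compares, not how your tool computes the LLR itself.

```python
import math


def sprt_bounds(alpha, beta):
    """Wald's SPRT decision thresholds for the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    return lower, upper


lower, upper = sprt_bounds(alpha=0.05, beta=0.05)

llr = 0.01  # value reported in the match output
if llr >= upper:
    decision = "accept H1"
elif llr <= lower:
    decision = "accept H0"
else:
    decision = "continue testing"

print(f"bounds ({lower:.2f}, {upper:.2f}), LLR {llr:.2f}: {decision}")
```

With α=β=0.05 both thresholds equal ln(19) in magnitude, about 2.94, so an LLR of 0.01 falls in the "continue testing" region.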
Verdict: Treat wordfish 1.0 dev = wordfish (old) within error; no proven Elo gain from this run.