Summary
```
Finished game 10000 (Wordfish base vs Wordfish 1.0 dev 260825): 1-0 {White mates}
Score of Wordfish 1.0 dev 260825 vs Wordfish base: 3290 - 3144 - 3566 [0.507] 10000
...      Wordfish 1.0 dev 260825 playing White: 2489 - 753 - 1758 [0.674] 5000
...      Wordfish 1.0 dev 260825 playing Black: 801 - 2391 - 1808 [0.341] 5000
...      White vs Black: 4880 - 1554 - 3566 [0.666] 10000
Elo difference: 5.1 +/- 5.5, LOS: 96.6 %, DrawRatio: 35.7 %
SPRT: llr 1.23 (41.8%), lbound -2.94, ubound 2.94
```
Bottom line
- Point estimate: +5.1 Elo for Wordfish 1.0 dev vs Wordfish base at your TC/openings.
- Uncertainty: ±5.5 Elo (≈95% CI) → interval ≈ [–0.4, +10.6] Elo. Since 0 is inside the interval, the gain is not yet statistically confirmed.
- LOS (P(dev > base)): 96.6% — encouraging, but not an SPRT “pass” by Stockfish criteria.
- SPRT status: LLR = +1.23 with bounds [–2.94, +2.94] → the test is inconclusive and should continue until one bound is hit. (Chess Programming; talkchess.com)
How that lines up with your raw scores
You’ve got 3290–3144–3566 over 10,000 games → score 0.5073. Converting that score to Elo via the standard logistic model gives ~+5.07 Elo, which matches your tool’s +5.1. (Same method used by engine projects when quoting Elo from a match score; see Chess Programming.)
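As a quick check, here is a minimal sketch of that conversion in plain Python (nothing from the toolkit is assumed):

```python
import math

def score_to_elo(score: float) -> float:
    """Elo difference implied by a match score under the logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

wins, losses, draws = 3290, 3144, 3566
games = wins + losses + draws
score = (wins + 0.5 * draws) / games                          # 0.5073
print(f"score={score:.4f}, elo={score_to_elo(score):+.2f}")   # ~+5.07
```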
About the SPRT fields you posted
- LLR +1.23 (41.8%): positive LLR favours H1 (the “improvement” hypothesis), but you’re well short of +2.94, the typical accept bound; –2.94 is the reject bound. The run should keep going until one bound is crossed. (Chess Programming; talkchess.com)
- Stockfish-style setups often use normalised bounds like ±2.94 (roughly α ≈ β ≈ 5%). Rustic’s notes and the Stockfish docs explain this convention and why tests stop early only at these thresholds (see the sketch below). (Rustic Chess; Stockfish)
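Those bounds are just Wald’s SPRT thresholds for the chosen error rates; a minimal sketch, assuming α = β = 0.05:

```python
import math

alpha, beta = 0.05, 0.05                   # type I / type II error rates
lower = math.log(beta / (1.0 - alpha))     # reject H1 when LLR <= lower
upper = math.log((1.0 - beta) / alpha)     # accept H1 when LLR >= upper
print(f"lbound={lower:.2f}, ubound={upper:.2f}")   # -2.94, +2.94
```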
One red flag to check (colour bias)
Your colour split is extreme:
- Dev as White: 0.674 over 5000 games
- Dev as Black: 0.341 over 5000 games
- Overall White: 0.666 over 10k
That size of White advantage is unusual for properly paired openings (each position played with both colours). If your run wasn’t paired (or the book is heavily skewed), the Elo estimate can be biased. Make sure cutechess used reversed pairs (or a balanced book); otherwise rerun with pairing enabled. (Engine projects stress pairing for exactly this reason; see Rustic Chess.)
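If you want to verify the pairing directly from the PGN before rerunning anything, here is a minimal sketch, assuming python-chess is installed and the games were written to the PGN in playing order (so a pair is two consecutive games):

```python
import chess.pgn

def colour_reversal_rate(pgn_path: str) -> float:
    """Fraction of consecutive game pairs whose colours are swapped."""
    players = []
    with open(pgn_path, encoding="utf-8", errors="replace") as f:
        while True:
            game = chess.pgn.read_game(f)
            if game is None:
                break
            players.append((game.headers.get("White"), game.headers.get("Black")))
    pairs = list(zip(players[0::2], players[1::2]))
    if not pairs:
        return 0.0
    swapped = sum(1 for a, b in pairs if a[0] == b[1] and a[1] == b[0])
    return swapped / len(pairs)

print(colour_reversal_rate("/mnt/data/your_file.pgn"))
```

Anything well below 100% reversed pairs suggests the run was not colour-balanced.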
Verdict
- Treat the current result as “small likely gain, not yet confirmed”.
- Keep the test running (same bounds) until LLR ≥ +2.94 (accept) or ≤ –2.94 (reject). If you care about resolving tiny changes, lower the H1 target (e.g., H1 = +1 Elo), bearing in mind that a tighter gap between H0 and H1 needs more games to reach a bound (see the sketch below), and keep proper pairing either way.
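To see why tighter hypotheses take longer to resolve, compare the expected scores they imply; a back-of-the-envelope sketch (not part of the toolkit):

```python
def elo_to_score(elo: float) -> float:
    """Expected per-game score for an Elo difference under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

# The LLR drifts roughly with the square of the gap between the hypothesised
# scores (when the true strength sits near H1), so halving the gap can
# quadruple the games needed to reach a bound.
for h1 in (1.0, 5.0):
    gap = elo_to_score(h1) - elo_to_score(0.0)
    print(f"H1=+{h1:.0f} Elo -> score gap vs H0: {gap:.4f}")
```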
What I’ve delivered now
A ready-to-run Python module in the workspace: /mnt/data/sprt_recompute_toolkit.py
It parses PGN or CSV, identifies the dev engine by name, pairs games consecutively (the standard assumption), builds the pentanomial counts over the pair scores 0.0, 0.5, 1.0, 1.5, 2.0 from the dev’s perspective, estimates the draw rate, and computes the LLR for a pentanomial SPRT between your chosen hypotheses H0 and H1 (defaults: 0 vs +5 Elo).
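Under the hood, the LLR step is essentially the simplified GSPRT normal approximation; the sketch below is illustrative only (the `pentanomial_llr` name and the exact formulation are assumptions, and the toolkit’s code may differ in detail):

```python
def elo_to_score(elo: float) -> float:
    """Expected per-game score for an Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def pentanomial_llr(counts, elo0: float, elo1: float) -> float:
    """Approximate GSPRT LLR from pentanomial pair counts.

    counts: [n0, n1, n2, n3, n4] for pair scores 0.0, 0.5, 1.0, 1.5, 2.0
    (the dev's points over a reversed-colour game pair).
    Uses LLR ~= N * (s1 - s0) * (2*xbar - s0 - s1) / (2*var), all per pair.
    """
    pair_scores = (0.0, 0.5, 1.0, 1.5, 2.0)
    n_pairs = sum(counts)
    if n_pairs == 0:
        return 0.0
    xbar = sum(c * s for c, s in zip(counts, pair_scores)) / n_pairs
    var = sum(c * (s - xbar) ** 2 for c, s in zip(counts, pair_scores)) / n_pairs
    if var == 0.0:
        return 0.0
    s0 = 2.0 * elo_to_score(elo0)   # hypothesised pair score under H0
    s1 = 2.0 * elo_to_score(elo1)   # hypothesised pair score under H1
    return n_pairs * (s1 - s0) * (2.0 * xbar - s0 - s1) / (2.0 * var)
```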
It also prints a colour-reversal sanity check (how often the second game of each pair flips colours).
How to run it on your data
Upload your PGN or CSV to /mnt/data/ (left sidebar → Files). Then call:

```python
from sprt_recompute_toolkit import run_analysis

# Adjust engine names / H0 / H1 if needed
run_analysis("/mnt/data/your_file.pgn", "Wordfish 1.0 dev", "Wordfish base",
             0.0, 5.0)  # H0 = 0 Elo, H1 = +5 Elo
```
If your log is CSV, the file should have at least the headers White, Black, Result.
If your dev/base names differ slightly (e.g., “Wordfish 1.0 dev 260825”), pass the exact strings so the script knows which side is the dev in each game.
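For example, a call matching the full build name from your log might look like this (the CSV rows shown in the comment are purely illustrative):

```python
from sprt_recompute_toolkit import run_analysis

# Illustrative CSV layout (only the White, Black, Result headers are required):
#   White,Black,Result
#   Wordfish 1.0 dev 260825,Wordfish base,1-0
#   Wordfish base,Wordfish 1.0 dev 260825,1/2-1/2
#
# Pass the engine names exactly as they appear in the log:
run_analysis("/mnt/data/your_file.csv", "Wordfish 1.0 dev 260825", "Wordfish base",
             0.0, 5.0)  # H0 = 0 Elo, H1 = +5 Elo
```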
What you’ll get back
- Pentanomial counts: [n_0.0, n_0.5, n_1.0, n_1.5, n_2.0]
- Per-game draw rate
- Point-estimate Elo from the mean score (logistic model)
- LLR for your chosen SPRT hypotheses H0, H1
- % of pairs that are colour-reversed (to catch pairing/book bias)