Summary
Results of Revolution vs Revolution1 (10+0.1, 1t, 32MB, UHO_2024_8mvs_+085_+094.pgn):
Elo: 30.64 +/- 7.21, nElo: 56.04 +/- 13.12
LOS: 100.00 %, DrawRatio: 47.88 %, PairsRatio: 1.90
Games: 2694, Wins: 783, Losses: 546, Draws: 1365, Points: 1465.5 (54.40 %)
Ptnml(0-2): [14, 228, 645, 427, 33], WL/DD Ratio: 0.82
LLR: 2.95 (100.2%) (-2.94, 2.94) [0.00, 2.50]
--------------------------------------------------
SPRT ([0.00, 2.50]) completed - H1 was accepted
Finished match
What your numbers say (the short answer)
- Revolution 1.0 (dev) is ~+30.6 Elo stronger than the older Wordfish at 10+0.1, 1 thread, 32 MB, UHO_2024_8mvs_+085_+094.pgn.
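- Point estimate: +30.64 Elo ± 7.21. The reported ±7.21 is already a 95% confidence half-width, not a 1-σ error: it matches 1.96 standard errors computed from the pentanomial counts (standard error ≈ 3.68 Elo). So the 95% CI ≈ +23.4 to +37.9 Elo.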
- LOS = 100% (essentially certain superiority).
- SPRT: LLR = 2.95 with bounds (-2.94, 2.94), and H1 accepted for the test window [0.00, 2.50] → the test crossed the upper boundary.
So, yes—there is a real, material gain. At this time control and setup, expect roughly 54.4% score for the dev vs the old version (you observed 54.40%), which is exactly consistent with ~+30.6 Elo.
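The consistency between +30.6 Elo and a 54.4% score can be checked in a few lines. This is a minimal sketch of the standard logistic Elo conversion; the function names are illustrative and not taken from any testing tool.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Expected score (0..1) for a given Elo difference, logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_from_score(score: float) -> float:
    """Inverse conversion: Elo difference implied by a score fraction."""
    return 400.0 * math.log10(score / (1.0 - score))


if __name__ == "__main__":
    # Numbers from the match summary: 1465.5 points out of 2694 games.
    observed = 1465.5 / 2694
    print(f"observed score: {observed:.4f}")
    print(f"Elo implied by observed score: {elo_from_score(observed):.2f}")
    print(f"expected score at +30.64 Elo: {expected_score(30.64):.4f}")
```

Running both directions is a cheap way to confirm that the reported Elo and the reported score percentage describe the same result.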
How those fields map to strength
- Elo / score: The match scored 54.40% (1465.5/2694), which converts to Elo via the logistic model $\Delta = 400 \log_{10}\left(\frac{E}{1-E}\right)$. Plugging E = 0.544 gives Δ ≈ +30.65 Elo, matching your tool.
- nElo (56.04 ± 13.12): “Normalised Elo” rescales by the observed per-game variance (largely driven by draw rate). It’s useful within a given testing framework (e.g., deciding SPRT thresholds), but not portable as a public rating. Use the logistic Elo (+30.6) when you want to quote a gain. (Stockfish/Fishtest docs explain normalised Elo and why it differs from logistic Elo depending on draw rate.) (tests.stockfishchess.org)
- SPRT & LLR: Your test used an SPRT window [0, 2.5] (in nElo units). The LLR = 2.95 exceeded the upper boundary (ln 19 ≈ 2.94 at α = β = 0.05), so H1 was accepted: the observed results are roughly e^2.95 ≈ 19 times more likely under H1 (gain = 2.5) than under H0 (gain = 0). That confirms an improvement, but the SPRT by itself does not measure how large the gain is. That's exactly how SPRT is intended to work in chess engine testing. (rustic-chess.org, chessprogramming.org)
- LOS (Likelihood of Superiority): 100% means the probability the dev is stronger than the baseline is (within rounding) 1.0 under the fitted model. (chessprogramming.org)
- Ptnml / pentanomial: Your Ptnml(0-2): [14, 228, 645, 427, 33] are the five paired-game outcome bins used by the pentanomial model (0–2, ½–1½, 1–1, 1½–½, 2–0), which properly accounts for opening pairing effects and correlations. It provides better variance estimates than a simple trinomial model. (chessprogramming.org)
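The fields above can be reproduced from the pentanomial counts alone. The sketch below is an assumption about how such tools arrive at their numbers, not their actual source: it uses a delta-method confidence interval, a normal approximation for LOS, and normalises nElo by the per-game standard deviation (√2 times the per-pair standard deviation). These choices were picked because they reproduce the printed values; all names are illustrative.

```python
import math

# Pair counts from the match output, for pair scores 0, 0.5, 1, 1.5, 2.
PTNML = [14, 228, 645, 427, 33]
# The same bins expressed as a per-game score (pair score divided by 2).
BIN_VALUES = [0.0, 0.25, 0.5, 0.75, 1.0]


def pentanomial_stats(counts):
    """Score, Elo, 95% half-width, nElo and LOS from pentanomial counts."""
    pairs = sum(counts)
    mean = sum(c * v for c, v in zip(counts, BIN_VALUES)) / pairs
    var = sum(c * (v - mean) ** 2 for c, v in zip(counts, BIN_VALUES)) / pairs
    se = math.sqrt(var / pairs)  # standard error of the mean score

    elo = 400.0 * math.log10(mean / (1.0 - mean))
    # Delta method: d(Elo)/d(score) = 400 / (ln(10) * score * (1 - score)).
    elo_se = se * 400.0 / (math.log(10) * mean * (1.0 - mean))

    # Normalised Elo: score margin in units of per-game standard deviation.
    nelo = (mean - 0.5) / math.sqrt(2.0 * var) * 800.0 / math.log(10)

    # LOS under a normal approximation of the mean score.
    los = 0.5 * (1.0 + math.erf((mean - 0.5) / (se * math.sqrt(2.0))))

    return {
        "score": mean,
        "elo": elo,
        "elo_ci95": 1.96 * elo_se,
        "nelo": nelo,
        "los": los,
        "draw_ratio": counts[2] / pairs,
        "pairs_ratio": (counts[3] + counts[4]) / (counts[0] + counts[1]),
    }


stats = pentanomial_stats(PTNML)

if __name__ == "__main__":
    for name, value in stats.items():
        print(f"{name}: {value:.4f}")
```

With these counts the sketch lands on the reported Elo, ± value, nElo, DrawRatio and PairsRatio to within rounding. This also shows that DrawRatio and PairsRatio are computed over game pairs rather than single games.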
Context and caveats
- The +30.6 Elo estimate (95% CI roughly +23.4 to +37.9) is time-control and setup specific (10+0.1, 1t, 32 MB, this UHO suite). Elo gains typically shrink at longer TCs or with different books/hardware; external validity is always contextual. (This is a standard caution in engine testing practice.) (chessprogramming.org)
- Your draw ratio of ~47.9% is moderate. Note that this figure is the share of drawn game pairs (645 of 1347); at the single-game level, 1365 of 2694 games (~50.7%) were drawn. If you move to suites that raise draws, the per-game variance falls, so the same logistic Elo corresponds to a larger nElo. Fishtest's "normalised" view illustrates this effect. (tests.stockfishchess.org)
- SPRT mechanics: Accepting H1 here means the data favour H1 (gain = 2.5) over H0 (gain = 0) strongly enough to stop, with error rates α = β = 0.05. It does not mean the gain has been shown to exceed the window. It's a decision test; the magnitude is better read from the Elo estimate and its CI. (rustic-chess.org, chessprogramming.org)
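The stopping bounds quoted in the output follow directly from the error rates. Below is a minimal sketch of Wald's SPRT bounds and the resulting decision rule, assuming α = β = 0.05; the function names are illustrative and the LLR itself is taken from the match output rather than recomputed.

```python
import math


def sprt_bounds(alpha: float = 0.05, beta: float = 0.05):
    """Wald's log-likelihood-ratio stopping bounds (lower, upper)."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def sprt_decision(llr: float, alpha: float = 0.05, beta: float = 0.05) -> str:
    """Return the SPRT outcome for a given LLR value."""
    lower, upper = sprt_bounds(alpha, beta)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


if __name__ == "__main__":
    lower, upper = sprt_bounds()
    print(f"bounds: ({lower:.2f}, {upper:.2f})")
    print(f"decision at LLR = 2.95: {sprt_decision(2.95)}")
```

The upper bound is ln(0.95/0.05) = ln 19 ≈ 2.94, which is why an LLR of 2.95 ends the test in favour of H1.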
Bottom line
At 10+0.1 (1t, 32 MB, UHO_2024_8mvs_+085_+094):
Revolution 1.0 (dev) ≈ +30.6 Elo vs the old Wordfish, with a 95% CI ≈ +23.4 to +37.9 Elo (the reported ±7.21 is already the 95% half-width), and SPRT conclusively accepts H1. That's a solid, practically meaningful gain at this TC.
References for methodology: Rustic’s SPRT guide/results and Fishtest stats pages (SPRT boundaries, nElo/LLR, pentanomial model and LOS). (rustic-chess.org, tests.stockfishchess.org, chessprogramming.org)