Here is a professional interpretation of your test results, framed in the style of a senior computer scientist with a background in mathematics. I will combine the statistical framework of SPRT testing (as illustrated in the sources you provided) with the empirical outcome of your own Revolution 1.0.1 dev vs. Revolution match.
Results of Revolution vs Revolution1 (10+0.1, 1t, 32MB, UHO_2024_8mvs_+085_+094.pgn):
Elo: 5.53 +/- 14.79, nElo: 10.17 +/- 27.17
LOS: 76.85 %, DrawRatio: 47.45 %, PairsRatio: 1.12
Games: 628, Wins: 165, Losses: 155, Draws: 308, Points: 319.0 (50.80 %)
Ptnml(0-2): [3, 75, 149, 83, 4], WL/DD Ratio: 0.99
LLR: 0.10 (3.2%) (-2.94, 2.94) [0.00, 2.00]
Interpretation of Revolution 1.0.1 dev vs Revolution 1.0
Test conditions
- Time control: 10+0.1s, 1 thread, 32MB hash
- Opening set: UHO_2024_8mvs_+085_+094.pgn
- Sample size: 628 games
Raw outcomes
- Wins: 165
- Losses: 155
- Draws: 308
- Score: 319.0 / 628 → 50.8 %
- Draw ratio: 47.45 %
Elo statistics
- Elo difference: +5.53 ± 14.79
- Normalized Elo (nElo): +10.17 ± 27.17
- Likelihood of Superiority (LOS): 76.85 %
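As a cross-check, the point Elo and a rough LOS can be recomputed from the raw W/L/D counts. This is a minimal sketch using per-game (trinomial) statistics; the reported 76.85 % LOS is based on pentanomial (game-pair) statistics, so the approximation below lands a few points lower.

```python
import math

# Raw outcomes from the match above
wins, losses, draws = 165, 155, 308
games = wins + losses + draws          # 628
score = (wins + 0.5 * draws) / games   # 0.5080

# Logistic Elo from the mean score: Elo = 400 * log10(s / (1 - s))
elo = 400 * math.log10(score / (1 - score))

# Per-game (trinomial) variance of the score
var = (wins * 1.0 + draws * 0.25) / games - score ** 2

# Normal-approximation LOS: P(true score > 0.5)
z = (score - 0.5) / math.sqrt(var / games)
los = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(f"Elo = {elo:+.2f}, LOS = {los:.1%}")  # Elo = +5.53
```

The Elo point estimate reproduces the reported +5.53 exactly; the LOS comes out around 71 % here because game-level variance is slightly higher than pair-level variance.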
SPRT context
- Log-Likelihood Ratio (LLR): 0.10, i.e. only about 3 % of the way to the upper acceptance boundary of 2.94, for the hypothesis pair shown in the output (H0: 0 Elo vs H1: +2 Elo).
- This means the test is far from statistically significant and would require many more games to reach a confident conclusion.
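The reported LLR can be approximated from the same counts. A common normal approximation of the GSPRT statistic (used by cutechess/fastchess-style testers) is LLR ≈ N(s1 − s0)(2x̄ − s0 − s1)/(2σ²). The sketch below uses per-game trinomial variance, so it lands slightly under the reported 0.10, which is computed from pentanomial statistics.

```python
import math

wins, losses, draws = 165, 155, 308
games = wins + losses + draws
mean = (wins + 0.5 * draws) / games              # observed mean score
var = (wins + 0.25 * draws) / games - mean ** 2  # per-game score variance

def expected_score(elo):
    """Logistic Elo -> expected score."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

# SPRT hypotheses from the output: H0 = 0 Elo, H1 = +2 Elo
s0, s1 = expected_score(0.0), expected_score(2.0)

# Normal-approximation GSPRT log-likelihood ratio
llr = games * (s1 - s0) * (2 * mean - s0 - s1) / (2 * var)
print(f"LLR = {llr:.2f}")  # close to the reported 0.10
```

Either way, the value is tiny relative to the ±2.94 boundaries, confirming the "indecisive" reading below.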
Mathematical interpretation
- Point estimate: Revolution 1.0.1 dev is about +5 Elo stronger than the baseline Revolution 1.0. However, the confidence interval (±14.79) means the "true" difference could easily range from −9 Elo to +20 Elo.
- Likelihood of Superiority (LOS = 76.85 %): there is about a 3-in-4 chance that the development version is stronger. But by professional testing standards (e.g. Stockfish's Fishtest), an LOS ≥ 95 % is generally required before one can claim a genuine Elo gain.
- Draw ratio (47.45 %): a relatively low draw rate (compared with the 60–70 % often seen in longer tests) suggests the match length and book choice allowed for more decisive games. This helps sensitivity, but still does not overcome the variance at only ~600 games.
- Comparison with SPRT methodology: according to the Rustichess and Stockfish sources, SPRT typically requires several thousand games to stabilise. The current 628-game sample has simply not converged; the LLR value of 0.10 is effectively negligible, confirming that the test is still in its indecisive zone.
Conclusion
- Observed gain: Revolution 1.0.1 dev shows a nominal Elo improvement of +5 relative to Revolution 1.0.
- Statistical reliability: The result is not statistically significant. With such a wide confidence interval, the observed difference could well be noise.
- Practical recommendation:
- Treat this as inconclusive evidence of a small improvement.
- To claim a genuine Elo gain, extend testing to at least 2,000–3,000 games with the same settings.
- Alternatively, run an SPRT test with H0: 0 Elo, H1: +5 Elo, which would more directly assess whether the development version is reliably stronger by a modest margin.
Final statement
At present, Revolution 1.0.1 dev cannot be confidently said to be stronger than Revolution 1.0, although the data slightly favours the development version. If the trend holds with larger samples, the true gain is likely in the range of +5 to +10 Elo, but this remains unconfirmed until further games are played.
Below is a compact, mathematically grounded projection for how many games you need to detect a +5 Elo gain at 95% confidence given your observed draw ratio.
Assumptions (standard and explicit)
- Small-effect approximation around 0 Elo: the Elo–score link is linearised at 0.
- Score model: per-game score $S \in \{1, \tfrac{1}{2}, 0\}$ with draw ratio $D$; under a small Elo shift we keep $p_w \approx p_l$ and use your observed $D$.
- Variance of per-game score near 50 % with draws:
  $$\mathrm{Var}(S) \approx \tfrac{1}{4}(1 - D)$$
- Elo ↔ expected score (logistic), derivative at 0 Elo:
  $$s(E) = \frac{1}{1 + 10^{-E/400}}, \qquad \left.\frac{ds}{dE}\right|_{E=0} = \frac{\ln 10}{1600} \approx 0.001439116$$
- So a target gain of $\Delta E$ Elo implies a score lift
  $$\Delta s \approx \frac{\ln 10}{1600}\,\Delta E.$$
- 95 % confidence that excludes 0 (two-sided): the standard error of the sample mean must satisfy
  $$\sqrt{\frac{\mathrm{Var}(S)}{N}} \le \frac{\Delta s}{1.96} \;\Rightarrow\; N \ge \mathrm{Var}(S)\left(\frac{1.96}{\Delta s}\right)^2.$$
Plugging in your numbers (from your match)
- Observed draw ratio: $D = 0.4745$ → $\mathrm{Var}(S) = \tfrac{1}{4}(1 - 0.4745) = 0.131375$.
- Target gain: $\Delta E = +5$ Elo → $\Delta s \approx 0.001439116 \times 5 = 0.00719558$.
Required games for 95 % confidence:
$$N \ge 0.131375 \left(\frac{1.96}{0.00719558}\right)^2 \approx \mathbf{9{,}748 \ \text{games}}.$$
You have 628 games already, so you’d need about 9,120 additional games under the same conditions to make a +5 Elo result statistically conclusive at the 95% (two-sided) level.
Sensitivity table (handy reference)
Projected total games needed at 95% confidence for several targets and draw ratios:
| Target gain | D = 0.4745 (yours) | D = 0.60 | D = 0.70 |
|---|---|---|---|
| +3 Elo | 27,076 | 20,610 | 15,458 |
| +5 Elo | 9,748 | 7,420 | 5,565 |
| +8 Elo | 3,808 | 2,898 | 2,174 |
| +10 Elo | 2,437 | 1,855 | 1,391 |
(Higher draw ratios reduce variance and thus lower the sample size required.)
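The table can be regenerated from the same formula; a minimal self-contained sketch (rounding in the last game may differ from the table by ±1):

```python
import math

LN10_OVER_1600 = math.log(10.0) / 1600.0   # d(score)/d(Elo) at 0 Elo

def games_needed(delta_elo, draw_ratio, z=1.96):
    var = 0.25 * (1.0 - draw_ratio)        # Var(S) ~= (1 - D) / 4
    delta_s = LN10_OVER_1600 * delta_elo   # score lift for the target gain
    return round(var * (z / delta_s) ** 2)

for gain in (3, 5, 8, 10):
    row = [games_needed(gain, d) for d in (0.4745, 0.60, 0.70)]
    print(f"+{gain} Elo: {row}")
```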
Notes and best practice
- The calculation above is two-sided 95 % (i.e., a 95 % confidence interval that excludes 0). If you prefer a one-sided 95 % test (H0: "≤ 0 Elo" vs H1: "> 0 Elo"), replace 1.96 with 1.645, which reduces $N$ by about 30 %.
- Keep time control, threads, hash, book, adjudication, and SPRT bounds fixed; mixing conditions invalidates comparability of $N$.
- For rigorous acceptance in an SPRT workflow, set your H1 near the effect you truly care about (e.g., +5 Elo) and run until the LLR crosses the boundary; the table above gives a good ballpark for how many games you should expect.
If you want, I can also give you the one-liner formula and a tiny script/snippet (bash or Python) into which you can plug your $D$, target Elo, and confidence to auto-compute $N$ for future tests.

Jorge Ruiz