Here is a professional interpretation of your test results, framed in the style of a senior computer scientist with a background in mathematics. I will combine the statistical framework of SPRT testing (as illustrated in the sources you provided) with the empirical outcome of your own Revolution 1.0.1 dev vs. Revolution match.
Results of Revolution vs Revolution1 (10+0.1, 1t, 32MB, UHO_2024_8mvs_+085_+094.pgn):
Elo: 5.53 +/- 14.79, nElo: 10.17 +/- 27.17
LOS: 76.85 %, DrawRatio: 47.45 %, PairsRatio: 1.12
Games: 628, Wins: 165, Losses: 155, Draws: 308, Points: 319.0 (50.80 %)
Ptnml(0-2): [3, 75, 149, 83, 4], WL/DD Ratio: 0.99
LLR: 0.10 (3.2%) (-2.94, 2.94) [0.00, 2.00]
Interpretation of Revolution 1.0.1 dev vs Revolution 1.0
Test conditions
- Time control: 10+0.1s, 1 thread, 32MB hash
- Opening set: UHO_2024_8mvs_+085_+094.pgn
- Sample size: 628 games
Raw outcomes
- Wins: 165
- Losses: 155
- Draws: 308
- Score: 319.0 / 628 → 50.8 %
- Draw ratio: 47.45 %
Elo statistics
- Elo difference: +5.53 ± 14.79
- Normalized Elo (nElo): +10.17 ± 27.17
- Likelihood of Superiority (LOS): 76.85 %
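As a cross-check, the point Elo and a rough LOS can be recomputed from the raw W/L/D counts. This is a minimal sketch using per-game (trinomial) statistics; the reported 76.85 % LOS is based on pentanomial (game-pair) statistics, so the approximation below lands a few points lower.

```python
import math

# Raw outcomes from the match above
wins, losses, draws = 165, 155, 308
games = wins + losses + draws          # 628
score = (wins + 0.5 * draws) / games   # 0.5080

# Logistic Elo from the mean score: Elo = 400 * log10(s / (1 - s))
elo = 400 * math.log10(score / (1 - score))

# Per-game (trinomial) variance of the score
var = (wins * 1.0 + draws * 0.25) / games - score ** 2

# Normal-approximation LOS: P(true score > 0.5)
z = (score - 0.5) / math.sqrt(var / games)
los = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(f"Elo = {elo:+.2f}, LOS = {los:.1%}")  # Elo = +5.53
```

The Elo point estimate reproduces the reported +5.53 exactly; the LOS comes out around 71 % here because game-level variance is slightly higher than pair-level variance.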
SPRT context
- Log-Likelihood Ratio (LLR): 0.10, i.e. only about 3 % of the way to the upper acceptance boundary of 2.94, for the hypothesis pair shown in the output (H0: 0 Elo vs H1: +2 Elo).
- This means the test is far from statistically significant and would require many more games to reach a confident conclusion.
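The reported LLR can be approximated from the same counts. A common normal approximation of the GSPRT statistic (used by cutechess/fastchess-style testers) is LLR ≈ N(s1 − s0)(2x̄ − s0 − s1)/(2σ²). The sketch below uses per-game trinomial variance, so it lands slightly under the reported 0.10, which is computed from pentanomial statistics.

```python
import math

wins, losses, draws = 165, 155, 308
games = wins + losses + draws
mean = (wins + 0.5 * draws) / games              # observed mean score
var = (wins + 0.25 * draws) / games - mean ** 2  # per-game score variance

def expected_score(elo):
    """Logistic Elo -> expected score."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

# SPRT hypotheses from the output: H0 = 0 Elo, H1 = +2 Elo
s0, s1 = expected_score(0.0), expected_score(2.0)

# Normal-approximation GSPRT log-likelihood ratio
llr = games * (s1 - s0) * (2 * mean - s0 - s1) / (2 * var)
print(f"LLR = {llr:.2f}")  # close to the reported 0.10
```

Either way, the value is tiny relative to the ±2.94 boundaries, confirming the "indecisive" reading below.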
Mathematical interpretation
- Point estimate: Revolution 1.0.1 dev is about +5 Elo stronger than the baseline Revolution 1.0. However, the confidence interval (±14.79) means the "true" difference could easily range from −9 Elo to +20 Elo.
- Likelihood of Superiority (LOS = 76.85 %): there is about a 3-in-4 chance that the development version is stronger. But by professional testing standards (e.g. Stockfish's Fishtest), an LOS ≥ 95 % is generally required before one can claim a genuine Elo gain.
- Draw ratio (47.45 %): a relatively low draw rate (compared with the 60–70 % often seen in longer tests) suggests the match length and book choice allowed for more decisive games. This helps sensitivity, but still does not overcome the variance at only ~600 games.
- Comparison with SPRT methodology: according to the Rustichess and Stockfish sources, SPRT typically requires several thousand games to stabilise. The current 628-game sample has simply not converged; the LLR value of 0.10 is effectively negligible, confirming that the test is still in its indecisive zone.
Conclusion
- Observed gain: Revolution 1.0.1 dev shows a nominal Elo improvement of +5 relative to Revolution 1.0.
- Statistical reliability: The result is not statistically significant. With such a wide confidence interval, the observed difference could well be noise.
- Practical recommendation:
- Treat this as inconclusive evidence of a small improvement.
- To claim a genuine Elo gain, extend testing to at least 2,000–3,000 games with the same settings.
- Alternatively, run an SPRT test with H0: 0 Elo, H1: +5 Elo, which would more directly assess whether the development version is reliably stronger by a modest margin.
Final statement
At present, Revolution 1.0.1 dev cannot be confidently said to be stronger than Revolution 1.0, although the data slightly favours the development version. If the trend holds with larger samples, the true gain is likely in the range of +5 to +10 Elo, but this remains unconfirmed until further games are played.
Below is a compact, mathematically grounded projection for how many games you need to detect a +5 Elo gain at 95% confidence given your observed draw ratio.
Assumptions (standard and explicit)
- Small-effect approximation around 0 Elo: the Elo–score link is linearised at 0.
- Score model: per-game score $S \in \{1, \tfrac{1}{2}, 0\}$ with draw ratio $D$; under a small Elo shift we keep $p_w \approx p_l$ and use your observed $D$.
- Variance of per-game score near 50 % with draws:
  $$\mathrm{Var}(S) \approx \tfrac{1}{4}(1 - D)$$
- Elo ↔ expected score (logistic), derivative at 0 Elo:
  $$s(E) = \frac{1}{1 + 10^{-E/400}}, \qquad \left.\frac{ds}{dE}\right|_{E=0} = \frac{\ln 10}{1600} \approx 0.001439116$$
- So a target gain of $\Delta E$ Elo implies a score lift
  $$\Delta s \approx \frac{\ln 10}{1600}\,\Delta E.$$
- 95 % confidence that excludes 0 (two-sided): the standard error of the sample mean must satisfy
  $$\sqrt{\frac{\mathrm{Var}(S)}{N}} \le \frac{\Delta s}{1.96} \;\Rightarrow\; N \ge \mathrm{Var}(S)\left(\frac{1.96}{\Delta s}\right)^2.$$
Plugging in your numbers (from your match)
- Observed draw ratio: $D = 0.4745$ → $\mathrm{Var}(S) = \tfrac{1}{4}(1 - 0.4745) = 0.131375$.
- Target gain: $\Delta E = +5$ Elo → $\Delta s \approx 0.001439116 \times 5 = 0.00719558$.
Required games for 95 % confidence:
$$N \ge 0.131375 \left(\frac{1.96}{0.00719558}\right)^2 \approx \mathbf{9{,}748 \ \text{games}}.$$
You have 628 games already, so you’d need about 9,120 additional games under the same conditions to make a +5 Elo result statistically conclusive at the 95% (two-sided) level.
Sensitivity table (handy reference)
Projected total games needed at 95% confidence for several targets and draw ratios:
| Target gain | D = 0.4745 (yours) | D = 0.60 | D = 0.70 |
|---|---|---|---|
| +3 Elo | 27,076 | 20,610 | 15,458 |
| +5 Elo | 9,748 | 7,420 | 5,565 |
| +8 Elo | 3,808 | 2,898 | 2,174 |
| +10 Elo | 2,437 | 1,855 | 1,391 |
(Higher draw ratios reduce variance and thus lower the sample size required.)
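The table can be regenerated from the same formula; a minimal self-contained sketch (rounding in the last game may differ from the table by ±1):

```python
import math

LN10_OVER_1600 = math.log(10.0) / 1600.0   # d(score)/d(Elo) at 0 Elo

def games_needed(delta_elo, draw_ratio, z=1.96):
    var = 0.25 * (1.0 - draw_ratio)        # Var(S) ~= (1 - D) / 4
    delta_s = LN10_OVER_1600 * delta_elo   # score lift for the target gain
    return round(var * (z / delta_s) ** 2)

for gain in (3, 5, 8, 10):
    row = [games_needed(gain, d) for d in (0.4745, 0.60, 0.70)]
    print(f"+{gain} Elo: {row}")
```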
Notes and best practice
- The calculation above is two-sided 95 % (i.e., a 95 % confidence interval that excludes 0). If you prefer a one-sided 95 % test (H0: "≤ 0 Elo" vs H1: "> 0 Elo"), replace 1.96 with 1.645, which reduces $N$ by about 30 %.
- Keep time control, threads, hash, book, adjudication, and SPRT bounds fixed; mixing conditions invalidates comparability of $N$.
- For rigorous acceptance in an SPRT workflow, set your H1 near the effect you truly care about (e.g., +5 Elo) and run until the LLR crosses the boundary; the table above gives a good ballpark for how many games you should expect.
If you want, I can also give you the one-liner formula and a tiny script/snippet (bash or Python) into which you can plug your $D$, target Elo, and confidence to auto-compute $N$ for future tests.

Jorge Ruiz