SPRT Test: Step-by-Step Explanation (Interpreting Your Results)
1) Interpretation of Counts
Your summary data:
Score of Wordfish 1.0 dev 260825 vs Wordfish base: 804 - 805 - 1742 [0.500] 3351
...
Wordfish 1.0 dev playing White: 449 - 356 - 871 [0.528] 1676
Wordfish 1.0 dev playing Black: 355 - 449 - 871 [0.472] 1675
...
Elo difference: -0.1 ± 8.1, LOS: 49.0%, DrawRatio: 52.0%
SPRT: llr -2.95 (-100.2%), lbound -2.94, ubound 2.94 — H0 accepted
Totals: 3351 games (804 wins for dev, 805 wins for base, 1742 draws) → score of 1675/3351 ≈ 0.500 (statistical draw).
The white/black fractions mirror each other almost exactly: dev scores 449 - 356 as White and 355 - 449 as Black. This is the ordinary first-move advantage, which both engines enjoy equally, rather than a sign that dev handles one colour better. Globally, equilibrium is maintained.
DrawRatio ≈ 52% (high proportion of draws), which reduces information per game: draws provide minimal signal regarding strength differentials. (Note: Fishtest/Stockfish SPRT accounts for this.)
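The bracketed scores and the draw ratio can be re-derived from the raw counts as a sanity check. This is a minimal sketch: the tuples are copied from the summary above, and the helper names `score` and `draw_ratio` are illustrative, not part of any tool.

```python
# Counts copied from the match summary, as (wins, losses, draws) from dev's side.
overall = (804, 805, 1742)
as_white = (449, 356, 871)
as_black = (355, 449, 871)


def score(wins, losses, draws):
    """Fraction of points scored: a win counts 1, a draw 0.5, a loss 0."""
    games = wins + losses + draws
    return (wins + 0.5 * draws) / games


def draw_ratio(wins, losses, draws):
    """Share of games that ended in a draw."""
    return draws / (wins + losses + draws)


# The two colour lines must add up to the overall line.
assert tuple(w + b for w, b in zip(as_white, as_black)) == overall

print(f"overall score: {score(*overall):.3f}")       # 0.500
print(f"score as White: {score(*as_white):.3f}")     # 0.528
print(f"score as Black: {score(*as_black):.3f}")     # 0.472
print(f"draw ratio: {draw_ratio(*overall):.1%}")     # 52.0%
```

The White and Black scores sum to almost exactly 1, which is what a pure first-move effect looks like.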
2) Elo Estimate and Uncertainty
The value “Elo difference: −0.1 ± 8.1” denotes the point estimate and its error (likely a ∼95% confidence interval).
Mathematically, this implies: with 95% confidence, the true difference lies within [−0.1 − 8.1, −0.1 + 8.1] ≈ [−8.2, +8.0 Elo].
Given this wide interval, minor improvements of a few Elo points cannot be confirmed or ruled out.
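The point estimate and its margin can be approximated from the win/loss/draw counts. The sketch below assumes the logistic Elo model and a normal approximation with z = 1.96; the testing tool may use a slightly different method, so the margin is only expected to land close to the reported ±8.1. Function names are illustrative.

```python
import math


def elo_from_score(s):
    """Logistic Elo difference implied by an expected score s (0 < s < 1)."""
    return -400.0 * math.log10(1.0 / s - 1.0)


def elo_with_margin(wins, losses, draws, z=1.96):
    """Return (Elo estimate, half-width of the approximate 95% interval)."""
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1 - s) ** 2 + losses * s ** 2 + draws * (0.5 - s) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    low = elo_from_score(s - z * se)
    high = elo_from_score(s + z * se)
    return elo_from_score(s), (high - low) / 2


elo, margin = elo_with_margin(804, 805, 1742)
print(f"Elo difference: {elo:+.1f} +/- {margin:.1f}")
```

Because draws contribute almost nothing to the variance term, a high draw ratio narrows the interval in score terms, yet it also means fewer decisive games carry the signal.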
3) LOS = 49.0%
Likelihood of Superiority (LOS) ≈ 49% → effectively 50/50; the data favour neither version.
In essence: the estimated probability that the dev version is stronger than the base is about 49%, which is indistinguishable from a coin flip (a net deficit of one game across 1609 decisive games).
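The 49.0% figure can be reproduced with the usual normal approximation for LOS, in which only decisive games matter. This is a sketch under that assumption; the name `los` is illustrative.

```python
import math


def los(wins, losses):
    """Likelihood of superiority under a normal approximation.

    Draws cancel out: only the win/loss difference and the number of
    decisive games enter the formula.
    """
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


print(f"LOS: {los(804, 805):.1%}")  # 49.0%
```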
4) SPRT (Sequential Probability Ratio Test)
Typical SPRT parameters (e.g., elo0=0, elo1=5, alpha=0.05, beta=0.05) are assumed here; the reported bounds of ±2.94 match alpha = beta = 0.05, since ln(0.95/0.05) ≈ 2.94. Under this framework:
- H₀/H₁ are constructed to determine whether the new version exhibits a strength change beyond a specified margin.
- Operational interpretation: H₁ sought to demonstrate improvement exceeding the threshold (e.g., ≥ +5 Elo), while H₀ represents the null hypothesis.
- Your LLR = −2.95, with lbound ≈ −2.94. As the LLR fell below the lower bound, the SPRT stopped and accepted H₀. Practically, this means the data favour "no gain" (elo0) over "gain of elo1" strongly enough to end the test under the chosen parameters.
(Note: The precise semantics of “H₀ accepted” depend on implementation; the key takeaway is the lack of convincing improvement, with P(dev > base) ≈ 0.49.)
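The stopping bounds follow from Wald's thresholds and can be checked directly. The sketch below assumes alpha = beta = 0.05, which the report does not state but which reproduces its ±2.94 bounds; the function names are illustrative.

```python
import math


def sprt_bounds(alpha, beta):
    """Wald's log-likelihood-ratio thresholds for a sequential test."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def sprt_decision(llr, lower, upper):
    """Stop when the LLR crosses a bound, otherwise keep playing games."""
    if llr <= lower:
        return "H0 accepted"
    if llr >= upper:
        return "H1 accepted"
    return "continue"


lower, upper = sprt_bounds(alpha=0.05, beta=0.05)
llr = -2.95  # value copied from the report

print(f"lbound {lower:.2f}, ubound {upper:.2f}")  # lbound -2.94, ubound 2.94
print(f"progress: {llr / upper:.1%}")             # progress: -100.2%
print(sprt_decision(llr, lower, upper))           # H0 accepted
```

The −100.2% in the report is consistent with the LLR expressed as a fraction of the bound it crossed.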
5) Practical Implications
- True Elo gain: Point estimate = −0.1 Elo (effectively zero). Plausible range (∼95% CI): [−8.2, +8.0 Elo].
- Conclusion: No significant Elo gain is evidenced. Any actual improvement, if present, is likely <8 Elo or obscured by uncertainty.
- For competitive play, differences <5–10 Elo are negligible. Here, the data preclude claims of enhancement.
- SPRT outcome: The test accepted H₀ over H₁ (+5 Elo under the example parameters), so an improvement of that magnitude is unlikely.
6) Detecting Smaller Gains
Your data: N = 3351 games, CI half-width ≈ 8.1 Elo. Error scales ∝ 1/√𝑁. Thus:
| Target CI Half-Width | Required Games |
|---|---|
| ±5 Elo | 8,794 |
| ±3 Elo | 24,428 |
| ±2 Elo | 54,964 |
| ±1 Elo | ≈220,000 |

Calculation: \(N_{\text{needed}} = N_{\text{actual}} \cdot (8.1 / h)^2\), where \(h\) is the desired CI half-width.
These figures illustrate why engine tests often require tens of thousands of games to detect minor differences.
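The scaling rule is easy to apply to other targets. This sketch assumes the draw ratio and per-game variance stay the same as in the current match, so that the half-width shrinks purely as 1/√N; the constants are copied from the report and `games_needed` is an illustrative name.

```python
import math

N_ACTUAL = 3351          # games played so far
HALF_WIDTH_ACTUAL = 8.1  # reported CI half-width in Elo


def games_needed(target_half_width):
    """Games required for a given CI half-width, assuming error ~ 1/sqrt(N)."""
    ratio = HALF_WIDTH_ACTUAL / target_half_width
    return math.ceil(N_ACTUAL * ratio ** 2)


for h in (5, 3, 2, 1):
    print(f"±{h} Elo: {games_needed(h):,} games")
```

Rounding up rather than truncating gives one game more than the quoted figures for ±5, ±3 and ±2 Elo, a difference that is immaterial at this scale.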
7) Recommendations
- To demonstrate ≥10 Elo gains: Current data suffice. The ∼95% interval tops out at +8.0 Elo, so a gain of that size is effectively ruled out.
- To detect smaller gains (2–5 Elo): Increase games as calculated above.
- To confirm no relevant regression: The estimate −0.1 ± 8.1 rules out a regression larger than about 8 Elo, but not a smaller one.
- Optimise signal per game: Reduce draw ratio (e.g., via varied books/time controls) to improve information efficiency.
8) Sources
- SPRT methodology/implementation: Rustic SPRT Testing.
- Example SPRT reports: Rustic SPRT Results.
- Fishtest statistics framework: Stockfish Testing.
Executive Summary
With 3351 games, the point estimate is −0.1 Elo (CI ≈ ±8.1 Elo). SPRT found no evidence that the dev version is stronger (H₀ accepted).
Conclusion
No demonstrable Elo gain exists; any true effect is either <8 Elo or requires substantially more games to detect.
Formal Addendum: Further Analysis Options
Should you require:
- Detailed report: Exact intervals, standard errors, and CSV of results.
- Testing plan: Time control/openings to detect 3 Elo gains with minimal games.
- Bayesian simulation: Posterior distribution of Elo difference + graphical output (P(dev > base)).
Proposed Testing Strategy for Detecting 3–5 Elo Gains
Objective:
Design an efficient campaign to confirm/refute minor strength changes.
Key Recommendations
- SPRT Parameters: elo0=0, elo1=3 (for 3 Elo precision) or elo1=5 (practical balance). alpha=0.05, beta=0.05 (standard Type I/II error rates).
- Time Controls:
  - Rapid screening: tc=10+0.1 (high throughput, higher noise).
  - Primary testing: tc=30+0.1 (optimal balance for 3–5 Elo).
  - Validation: tc=60+0.6 (low noise, high cost).
- Experimental Design:
  - Use balanced openings (bookdepth=4) and -repeat for colour symmetry.
  - Fix hardware/OS conditions to minimise noise.
  - Adjudicate cautiously (e.g., -draw movenumber=50 movecount=5 score=5).
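- Game Requirements:
  - 3 Elo test (elo1=3): budget ~14,000 games.
  - 5 Elo test (elo1=5): budget ~4,800 games.
  - These are upper limits for a sequential test, which may stop earlier; they are not confidence-interval targets. By the 1/√N scaling in section 6, a fixed-sample interval of ±3 Elo needs about 24,400 games and ±5 Elo about 8,800.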
Sample cutechess-cli Command (3 Elo Test)
    # Assumes both engines speak UCI and that book.bin is a Polyglot book,
    # which cutechess-cli takes through the book=/bookdepth= engine options.
    cutechess-cli \
      -engine name=wordfish_dev cmd=./wordfish_dev.exe option.Threads=4 \
      -engine name=wordfish_base cmd=./wordfish_base.exe option.Threads=4 \
      -each proto=uci tc=30+0.1 book=book.bin bookdepth=4 \
      -repeat -concurrency 6 \
      -pgnout sprt_3elo.pgn -recover \
      -sprt elo0=0 elo1=3 alpha=0.05 beta=0.05 \
      -games 2 -rounds 7000
References
- SPRT theory: Wald, A. (1945). Sequential Tests of Statistical Hypotheses.
- Implementation: cutechess-cli documentation.
All statistical notation adheres to ISO 80000-2 conventions. British English spellings applied throughout (e.g., “colour”, “analyse”).