SPRT Test: Step-by-Step Explanation (Interpreting Your Results)

1) Interpretation of Counts
Your summary data:

Score of Wordfish 1.0 dev 260825 vs Wordfish base: 804 - 805 - 1742  [0.500] 3351  
...  
Wordfish 1.0 dev playing White: 449 - 356 - 871  [0.528] 1676  
Wordfish 1.0 dev playing Black: 355 - 449 - 871  [0.472] 1675  
...  
Elo difference: -0.1 ± 8.1, LOS: 49.0%, DrawRatio: 52.0%  
SPRT: llr -2.95 (-100.2%), lbound -2.94, ubound 2.94 — H0 accepted

Totals: 3351 games (804 wins for dev, 805 wins for base, 1742 draws) → net score of (804 + 0.5 × 1742) / 3351 ≈ 0.4999, which rounds to 0.500 (a statistical tie).

The per-colour scores (0.528 as White, 0.472 as Black) reflect the usual first-move advantage rather than an asymmetry in the dev version: base also scores ≈0.528 when it has White (449 wins to 355). The two halves cancel, so the overall result is level.

DrawRatio ≈ 52% (high proportion of draws), which reduces information per game: draws provide minimal signal regarding strength differentials. (Note: Fishtest/Stockfish SPRT accounts for this.)
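The figures quoted in this section can be re-derived from the raw counts. The sketch below is only an illustration: the helper name `score` is hypothetical, and the counts are copied from the summary above.

```python
def score(wins: int, losses: int, draws: int) -> float:
    """Average points per game, counting win = 1, draw = 0.5, loss = 0."""
    games = wins + losses + draws
    return (wins + 0.5 * draws) / games


# Counts from dev's point of view, copied from the log above.
total = score(804, 805, 1742)
dev_as_white = score(449, 356, 871)
dev_as_black = score(355, 449, 871)
base_as_white = 1.0 - dev_as_black  # base has White whenever dev has Black
draw_ratio = 1742 / (804 + 805 + 1742)

print(f"total score   {total:.4f}")
print(f"dev as White  {dev_as_white:.3f}")
print(f"dev as Black  {dev_as_black:.3f}")
print(f"base as White {base_as_white:.3f}")
print(f"draw ratio    {draw_ratio:.3f}")
```

Comparing `dev_as_white` with `base_as_white` is a quick way to check whether a per-colour gap belongs to one engine or is simply the first-move advantage shared by both.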

2) Elo Estimate and Uncertainty
The value “Elo difference: −0.1 ± 8.1” denotes the point estimate and its error (likely a ∼95% confidence interval).
Mathematically, this implies: with 95% confidence, the true difference lies within [−0.1 − 8.1, −0.1 + 8.1] ≈ [−8.2, +8.0 Elo].
Given this wide interval, minor improvements of a few Elo points cannot be confirmed or ruled out.
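As a rough cross-check of the estimate and its error, the sketch below converts the average score to a logistic Elo difference and builds a normal-approximation interval around it. The function names are hypothetical, z = 1.96 is an assumed 95% quantile, and the tool's own rounding may differ slightly from what this prints.

```python
import math


def elo_from_score(p: float) -> float:
    """Logistic Elo difference implied by an expected score p (0 < p < 1)."""
    return -400.0 * math.log10(1.0 / p - 1.0)


def elo_with_interval(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (estimate, lower, upper) in Elo using a normal approximation."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (wins * (1.0 - mean) ** 2
                + losses * (0.0 - mean) ** 2
                + draws * (0.5 - mean) ** 2) / n
    std_err = math.sqrt(variance / n)
    return (elo_from_score(mean),
            elo_from_score(mean - z * std_err),
            elo_from_score(mean + z * std_err))


estimate, low, high = elo_with_interval(804, 805, 1742)
print(f"Elo {estimate:+.1f}, interval [{low:+.1f}, {high:+.1f}]")
```

With the counts from the log this gives an estimate of about −0.1 Elo and a half-width close to 8 Elo, in line with the reported ±8.1.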

3) LOS = 49.0%
Likelihood of Superiority (LOS) ≈ 49% → effectively 50/50; the data do not favour either version.
In essence: the estimated probability that the dev version is the stronger engine is about 49%, which is indistinguishable from a coin flip.
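A commonly used formula for LOS depends only on the decisive games: LOS = ½ · (1 + erf((W − L) / √(2(W + L)))). The sketch below applies it to the counts above; it assumes this is the formula behind the reported value, which the log itself does not confirm.

```python
import math


def likelihood_of_superiority(wins: int, losses: int) -> float:
    """LOS from decisive games only (normal approximation); draws cancel out."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


los = likelihood_of_superiority(804, 805)
print(f"LOS = {los:.1%}")
```

A single extra loss out of 1609 decisive games moves the value only marginally below 50%, which is why the reported 49.0% carries no real signal.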

4) SPRT (Sequential Probability Ratio Test)
The log does not show the SPRT parameters; a typical configuration is elo0=0, elo1=5, alpha=0.05, beta=0.05. The reported bounds of ±2.94 match ln(0.95/0.05) ≈ 2.944, which is what alpha = beta = 0.05 produces. Under this framework:

H₀ states that the true difference is elo0 (e.g., 0 Elo); H₁ states that it is elo1 (e.g., +5 Elo).
Operational interpretation: the test keeps running until the log-likelihood ratio (LLR) shows that one hypothesis explains the data clearly better than the other.
Your LLR = −2.95, with lbound ≈ −2.94. Because the LLR fell below the lower bound, the SPRT stopped and accepted H₀. Practically, the results are far better explained by “no gain” than by a gain of elo1, so the dev version has not been shown to be stronger under the chosen parameters.

(Note: The precise semantics of “H₀ accepted” depend on implementation; the key takeaway is the lack of convincing improvement, with P(dev > base) ≈ 0.49.)
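To make the stopping rule concrete, the sketch below computes Wald's bounds from alpha and beta and a simple normal-approximation LLR from the game counts. This approximation is an assumption for illustration and is not necessarily the formula the testing tool uses; the function names and the elo0=0, elo1=5 values are likewise illustrative.

```python
import math


def sprt_bounds(alpha: float, beta: float):
    """Wald's stopping bounds for the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def expected_score(elo: float) -> float:
    """Expected score under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def llr_normal_approx(wins: int, losses: int, draws: int,
                      elo0: float, elo1: float) -> float:
    """Approximate LLR of H1 (elo1) against H0 (elo0) from W/L/D counts."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    variance = (wins * (1.0 - mean) ** 2
                + losses * mean ** 2
                + draws * (0.5 - mean) ** 2) / n
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * variance)


lower, upper = sprt_bounds(alpha=0.05, beta=0.05)
llr = llr_normal_approx(804, 805, 1742, elo0=0.0, elo1=5.0)
print(f"bounds [{lower:.2f}, {upper:.2f}], approximate LLR {llr:.2f}")
```

The bounds depend only on alpha and beta, whereas the LLR depends strongly on the assumed elo0 and elo1. Since the log does not state those values, this approximation need not reproduce the reported −2.95.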

5) Practical Implications

True Elo gain: Point estimate = −0.1 Elo (effectively zero). Plausible range (∼95% CI): [−8.2, +8.0 Elo].
Conclusion: No significant Elo gain is evidenced. Any actual improvement, if present, is likely <8 Elo or obscured by uncertainty.
For competitive play, differences <5–10 Elo are negligible. Here, the data preclude claims of enhancement.
SPRT outcome: The test rejected H₁ (a gain of elo1, +5 Elo in the example parameters). With beta=0.05, a true improvement of that size would be wrongly dismissed only about 5% of the time.

6) Detecting Smaller Gains
Your data: N = 3351 games, CI half-width ≈ 8.1 Elo. Error scales ∝ 1/√𝑁. Thus:

Target CI Half-Width	Required Games
±5 Elo	8,794
±3 Elo	24,428
±2 Elo	54,964
±1 Elo	≈220,000
Calculation: \(N_{\text{needed}} = N_{\text{actual}} \cdot (8.1 / h)^2\), where \(h\) is the desired CI half-width.
These figures illustrate why engine tests often require tens of thousands of games to detect minor differences.
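The scaling rule is easy to apply to other targets. The sketch below assumes, as the text does, that the half-width shrinks as 1/√N and that the draw ratio stays the same; the function name is hypothetical.

```python
def games_for_half_width(n_actual: int, h_actual: float, h_target: float) -> int:
    """Games needed for a CI half-width of h_target, assuming error ~ 1/sqrt(N)."""
    return round(n_actual * (h_actual / h_target) ** 2)


for target in (5.0, 3.0, 2.0, 1.0):
    needed = games_for_half_width(3351, 8.1, target)
    print(f"±{target:g} Elo: about {needed:,} games")
```

Results are rounded to the nearest whole game, so they can differ by one game from figures truncated by hand.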

7) Recommendations

To demonstrate ≥10 Elo gains: The current data already argue against a gain of this size, since +10 Elo lies outside the ≈95% interval [−8.2, +8.0].
To detect smaller gains (2–5 Elo): Increase games as calculated above.
To confirm no relevant regression: The estimate −0.1±8.1 makes a regression larger than about 8 Elo unlikely; smaller regressions cannot be excluded.
Optimise signal per game: Reduce draw ratio (e.g., via varied books/time controls) to improve information efficiency.

8) Sources

SPRT methodology/implementation: Rustic SPRT Testing.
Example SPRT reports: Rustic SPRT Results.
Fishtest statistics framework: Stockfish Testing.

Executive Summary

With 3351 games, the point estimate is −0.1 Elo (CI ≈ ±8.1 Elo). SPRT found no evidence that the dev version is stronger (H₀ accepted).

Conclusion

No Elo gain has been demonstrated; any true effect is probably below about 8 Elo and would need substantially more games to detect.

Formal Addendum: Further Analysis Options

Should you require:

Detailed report: Exact intervals, standard errors, and CSV of results.
Testing plan: Time control/openings to detect 3 Elo gains with minimal games.
Bayesian simulation: Posterior distribution of Elo difference + graphical output (P(dev > base)).
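As a taste of the Bayesian option, the sketch below estimates P(dev > base) by sampling from a posterior over the win and loss probabilities. It assumes a flat Dirichlet prior, uses only the standard library, and the function name, sample count, and seed are arbitrary choices for illustration.

```python
import random


def prob_dev_stronger(wins: int, losses: int,
                      samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(p_win > p_loss) under a flat Dirichlet prior.

    A Dirichlet sample is a set of independent gamma draws divided by their
    sum. The sum (and the draw component) cancels when comparing the win and
    loss probabilities, so only two gamma draws are needed per sample.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        p_win = rng.gammavariate(wins + 1, 1.0)
        p_loss = rng.gammavariate(losses + 1, 1.0)
        if p_win > p_loss:
            hits += 1
    return hits / samples


posterior = prob_dev_stronger(804, 805)
print(f"P(dev > base) is roughly {posterior:.2f}")
```

Under these assumptions the result should land close to the reported LOS, since both quantities are driven by the near-equal win and loss counts.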

Proposed Testing Strategy for Detecting 3–5 Elo Gains
Objective:

Design an efficient campaign to confirm/refute minor strength changes.

Key Recommendations

SPRT Parameters:

elo0=0, elo1=3 (for 3 Elo precision) or elo1=5 (practical balance).
alpha=0.05, beta=0.05 (standard Type I/II error rates).

Time Controls:

Rapid screening: tc=10+0.1 (high throughput, higher noise).
Primary testing: tc=30+0.1 (optimal balance for 3–5 Elo).
Validation: tc=60+0.6 (low noise, high cost).

Experimental Design:

Use balanced openings (bookdepth=4) and -repeat for colour symmetry.
Fix hardware/OS conditions to minimise noise.
Adjudicate cautiously (e.g., -draw movenumber=50 movecount=5 score=5).

Game Requirements:

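3 Elo detection: ~24,400 games for a CI of ±3 Elo (from the 1/√N scaling in section 6).
5 Elo detection: ~8,800 games for a CI of ±5 Elo.
An SPRT stops as soon as the LLR crosses a bound, so its actual length varies from run to run; the sample command below caps the run at 14,000 games (-rounds 7000 × -games 2).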

Sample cutechess-cli Command (3 Elo Test)

# -openings expects a PGN or EPD file, so the Polyglot book (book.bin) is passed per engine via book=.
# proto=uci assumes both builds speak UCI.
cutechess-cli \
  -engine name=wordfish_dev cmd=./wordfish_dev.exe option.Threads=4 \
  -engine name=wordfish_base cmd=./wordfish_base.exe option.Threads=4 \
  -each proto=uci tc=30+0.1 book=book.bin bookdepth=4 \
  -repeat -concurrency 6 \
  -pgnout sprt_3elo.pgn -recover \
  -sprt elo0=0 elo1=3 alpha=0.05 beta=0.05 \
  -games 2 -rounds 7000

References

SPRT theory: Wald, A. (1945). Sequential Tests of Statistical Hypotheses.
Implementation: cutechess-cli documentation.


Test 1. 10s+0.1s

Test 2