
SPRT TEST dev Wordfish


Score of Wordfish 1.0 dev 260825 vs Wordfish base: 4003 - 4003 - 5994  [0.500] 14000
...      Wordfish 1.0 dev 260825 playing White: 2249 - 1754 - 2997  [0.535] 7000
...      Wordfish 1.0 dev 260825 playing Black: 1754 - 2249 - 2997  [0.465] 7000
...      White vs Black: 4498 - 3508 - 5994  [0.535] 14000
Elo difference: 0.0 +/- 4.3, LOS: 50.0 %, DrawRatio: 42.8 %
SPRT: llr -0.913 (-31.0%), lbound -2.94, ubound 2.94

Direct summary: There is no measurable Elo gain.
Your match shows Elo diff = +0.0 ± 4.3 with LOS = 50%, and the LLR of −0.913 lies well inside the SPRT bounds [−2.94, +2.94]. With the current data, Wordfish 1.0 dev and Wordfish base play at the same strength to within roughly ±4 Elo, and the test is inconclusive (neither an accepted improvement nor an accepted regression).

How to read your numbers

  • Score 4003–4003–5994 (0.500, 14,000 games): Equal wins and losses → estimator at 0 Elo.
  • ±4.3 Elo: The 95% error margin as cutechess-cli reports it. The interval is narrow but still includes both small losses and small gains (≈ −4.3 to +4.3).
  • LOS 50%: Probability that dev is stronger than base is coin-flip; no signal.
  • SPRT LLR −0.913 with bounds ±2.94: The statistic is between the thresholds, so the SPRT does not “decide” yet. With typical parameters (α=β=0.05; elo0=0; elo1≈2), this means “keep playing” until a boundary is crossed or the maximum game limit is reached.
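
The score, Elo difference and error margin above can be recomputed from the raw win/loss/draw counts. The following is a minimal sketch that assumes the usual trinomial model and the logistic Elo formula; the function names are illustrative and are not taken from cutechess-cli, whose own rounding may differ in the last digit. For 4003 - 4003 - 5994 it should land close to the reported 0.0 ± 4.3.

```python
import math
from statistics import NormalDist


def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return 400.0 * math.log10(score / (1.0 - score))


def elo_summary(wins: int, losses: int, draws: int, confidence: float = 0.95):
    """Return (score, elo, margin) for a win/loss/draw record."""
    n = wins + losses + draws
    win_rate, loss_rate, draw_rate = wins / n, losses / n, draws / n
    score = win_rate + 0.5 * draw_rate
    # Variance of a single game result (1, 0 or 0.5) around the mean score.
    var = (win_rate * (1.0 - score) ** 2
           + loss_rate * (0.0 - score) ** 2
           + draw_rate * (0.5 - score) ** 2)
    stderr = math.sqrt(var / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    low = elo_from_score(score - z * stderr)
    high = elo_from_score(score + z * stderr)
    return score, elo_from_score(score), (high - low) / 2.0


score, elo, margin = elo_summary(4003, 4003, 5994)
print(f"score={score:.3f}  elo={elo:.1f}  margin={margin:.2f}")
```

The margin is the half-width of the interval obtained by converting the lower and upper score limits to Elo, which is why it is quoted as a single ± figure.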

Colour notes (for context)

  • White 0.535 vs Black 0.465: Reflects the inherent advantage of moving first; symmetric and does not affect the overall conclusion.
  • DrawRatio 42.8%: Normal for fast/medium TCs; influences variance and thus how many games are needed for the SPRT to settle.
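
To see how the draw ratio feeds into the margin, here is a rough sketch for the special case of a 50% score, where every decisive game deviates from the mean by half a point. It uses a linear approximation of the Elo curve around 50%, and the helper name is made up for this illustration.

```python
import math

# Slope of the logistic Elo curve at a 50% score, in Elo per unit of score.
ELO_PER_SCORE_AT_EVEN = 1600.0 / math.log(10)


def margin_at_even_score(games: int, draw_ratio: float, z: float = 1.96) -> float:
    """Approximate 95% Elo margin for a match whose score is exactly 50%."""
    per_game_var = (1.0 - draw_ratio) / 4.0  # only decisive games add variance
    return z * math.sqrt(per_game_var / games) * ELO_PER_SCORE_AT_EVEN


for draw_ratio in (0.30, 0.428, 0.60):
    margin = margin_at_even_score(14000, draw_ratio)
    print(f"draw ratio {draw_ratio:.3f}: about +/- {margin:.2f} Elo")
```

Under these assumptions, a higher draw ratio gives a narrower margin for the same number of games.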

What I would do to close the case

  • Continue the SPRT until the LLR crosses +2.94 (accept improvement) or −2.94 (accept no-improvement), or raise the game cap if still undecided.
  • If you only want to know “is there ≥X Elo gain?”, set elo1 = X (e.g., 2 Elo) with α=β=0.05; the test will stop once there is sufficient evidence.
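
The ±2.94 bounds follow directly from α and β through Wald's standard SPRT thresholds. A minimal sketch of the stopping rule, with illustrative function names:

```python
import math


def sprt_bounds(alpha: float, beta: float):
    """Wald's lower and upper LLR thresholds for error rates alpha and beta."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def sprt_decision(llr: float, alpha: float = 0.05, beta: float = 0.05) -> str:
    """Return 'accept H1', 'accept H0' or 'continue' for a given LLR."""
    lower, upper = sprt_bounds(alpha, beta)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


lower, upper = sprt_bounds(0.05, 0.05)
print(f"bounds: {lower:.2f} .. {upper:.2f}")
print(sprt_decision(-0.913))
```

With α = β = 0.05 both bounds equal ln(19) ≈ 2.94 in absolute value, so an LLR of −0.913 means the test keeps running.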

Final verdict: With 14,000 games, Wordfish 1.0 dev = Wordfish base (0 ± 4.3 Elo). There is no statistical signal of improvement or regression; the SPRT result says to either continue testing or accept “no measurable change”.

Download games

Test 3

@echo off

REM set openings to use
set opening=8moves_v3
REM set the window title
title Test wordfish base vs wordfish 1.0 260825 

REM cutechess-cli setup
@cutechess-cli.exe ^
  -event "Test wordfish base vs wordfish 1.0 260825" -site "HP Proliant DL360P Gen8 Server" ^
  -engine conf="Wordfish 1.0 dev 260825" ^
  -engine conf="Wordfish base" ^
  -tournament round-robin ^
  -each tc=30+0.1 timemargin=10000 option.Hash=32 option.Threads=1 -tb C:\Syzygy -tbpieces 5 ^
  -openings file=..\Openings\PGN\%opening%.pgn format=pgn order=random ^
  -sprt elo0=0 elo1=2 alpha=0.05 beta=0.05 ^
  -concurrency 4 -rounds 10000 -games 2 -maxmoves 200 ^
  -repeat 2 -recover ^
  -pgnout ..\Games\testbase.pgn fi ^
  -ratinginterval 10
pause
Score of Wordfish 1.0 dev 260825 vs Wordfish base: 5023 - 5023 - 9954  [0.500] 20000
...      Wordfish 1.0 dev 260825 playing White: 2692 - 2331 - 4977  [0.518] 10000
...      Wordfish 1.0 dev 260825 playing Black: 2331 - 2692 - 4977  [0.482] 10000
...      White vs Black: 5384 - 4662 - 9954  [0.518] 20000
Elo difference: 0.0 +/- 3.4, LOS: 50.0 %, DrawRatio: 49.8 %
SPRT: llr -0.66 (-22.4%), lbound -2.94, ubound 2.94
  • Headline outcome: No measurable Elo gain.
    Elo difference = +0.0 ± 3.4, where ±3.4 is the 95% error margin that cutechess-cli prints. The interval [−3.4, +3.4] straddles zero, so the observed effect is statistically indistinguishable from nil.
  • Likelihood of Superiority (LOS): 50.0% — exactly coin-flip territory; neither engine is credibly stronger given these data.
  • SPRT status: LLR = −0.66 with bounds [−2.94, +2.94] → the sequential test is inconclusive; it has not crossed either boundary (accept-null or accept-alt). In SPRT/GSPRT terms, you should continue the test if your stopping rule requires boundary crossing. For background on how SPRT/GSPRT and LLR bounds work in engine testing, see Rustic’s SPRT notes and the Stockfish Fishtest maths docs (rustic-chess.org, official-stockfish.github.io).
  • Symmetry checks:
    White: 0.518 over 10,000 games; Black: 0.482 over 10,000 games → exactly complementary, as expected in a balanced self-play protocol. Draw ratio ≈ 49.8%, also typical at your TC; modern test frameworks often use the pentanomial model to handle paired openings and reduce variance (official-stockfish.github.io).
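
For a feel of where an LLR like −0.66 comes from, here is a sketch of the normal-approximation (GSPRT-style) log-likelihood ratio on win/loss/draw counts with logistic Elo. That cutechess-cli computes its LLR this way is an assumption, since the formula depends on the version in use; the function names are illustrative. With elo0=0 and elo1=2, as in the batch file above, it gives a value close to the reported one for 5023 - 5023 - 9954.

```python
def expected_score(elo: float) -> float:
    """Expected score for a given logistic Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def llr_normal_approx(wins: int, losses: int, draws: int,
                      elo0: float, elo1: float) -> float:
    """Approximate LLR of H1 (elo1) against H0 (elo0) from W/L/D counts."""
    n = wins + losses + draws
    win_rate, loss_rate, draw_rate = wins / n, losses / n, draws / n
    score = win_rate + 0.5 * draw_rate
    # Variance of one game result, then of the mean score over n games.
    var = (win_rate * (1.0 - score) ** 2
           + loss_rate * score ** 2
           + draw_rate * (0.5 - score) ** 2)
    var_mean = var / n
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * var_mean)


llr = llr_normal_approx(5023, 5023, 9954, elo0=0.0, elo1=2.0)
print(f"llr = {llr:.2f}")
```

A score of exactly 50% sits closer to H0 than to H1, so the LLR drifts slowly towards the lower bound as games accumulate.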

Bottom line

Wordfish 1.0 dev vs Wordfish base: ~+0 Elo (no detectable gain). With a 95% margin of ±3.4 Elo, any true difference most likely lies within a few Elo of zero, and a gain as large as +5 Elo is already implausible. To prove a smaller positive effect (say +2 Elo, the elo1 of this run) you’d need substantially more games or tighter variance (paired openings, pentanomial analysis, consistent hardware), as standardised in Stockfish’s Fishtest methodology.

Download games

Test 4

Score of Wordfish 1.0 dev 260825 vs Wordfish base: 431 - 476 - 513  [0.484] 1420
...      Wordfish 1.0 dev 260825 playing White: 223 - 224 - 263  [0.499] 710
...      Wordfish 1.0 dev 260825 playing Black: 208 - 252 - 250  [0.469] 710
...      White vs Black: 475 - 432 - 513  [0.515] 1420
Elo difference: -11.0 +/- 14.4, LOS: 6.8 %, DrawRatio: 36.1 %
SPRT: llr -2.95 (-100.2%), lbound -2.94, ubound 2.94 - H0 was accepted

Short answer (direct conclusion)

Based on the match you pasted, Wordfish 1.0 dev is not stronger than Wordfish base. The point estimate shows the dev is weaker by 11 Elo (−11.0), but the result is statistically inconclusive because the 95% confidence interval crosses zero (−25.4 to +3.4 Elo). The likelihood that the dev is actually stronger is very small: LOS ≈ 6.8%. (Sources: Stockfish Testing Framework; Rustic Chess)


What the numbers you gave mean (step-by-step)

  1. Raw score
    431 - 476 - 513 [0.484] 1420 means: Wordfish 1.0 dev scored 431 wins, 476 losses, 513 draws against the base, for 1,420 games total. The bracketed value 0.484 is the point score (points / games) = (431 + 0.5×513) / 1420 ≈ 0.484. (Rustic Chess)
  2. Reported Elo difference
    You supplied Elo difference: -11.0 +/- 14.4. That is the point estimate −11.0 Elo with a 95% confidence interval half-width of 14.4 Elo. Converting that to a 95% CI gives: CI(95%) = −11.0 ± 14.4 = [−25.4, +3.4] Elo. Because the CI includes 0, the match does not demonstrate a statistically significant improvement. (Stockfish Testing Framework)
  3. LOS (Likelihood of Superiority)
    LOS: 6.8 % indicates the estimated probability that the dev is stronger than the base is about 6.8%, i.e. very unlikely. That is consistent with a negative point estimate and the CI concentrated on the negative side. (Stockfish Testing Framework)
  4. SPRT outcome and interpretation
    You report SPRT: llr -2.95 (-100.2%), lbound -2.94, ubound 2.94 - H0 was accepted. The SPRT weighs two hypotheses (see the Rustic / cutechess documentation): H0 (null), tied to elo0, and H1 (alternative), tied to elo1. Here the log-likelihood ratio fell to −2.95, just past the lower bound of −2.94, so the test stopped and accepted H0: the dev version is not convincingly stronger than the base. In plain terms, the SPRT stopped early in favour of the hypothesis that the dev is not an improvement. What H0 and H1 mean in Elo terms depends on the elo0/elo1 values used for this particular run, so check the parameters printed with the run before reading more into “H0 accepted”. (Rustic Chess; Stockfish Testing Framework)
  5. Draw ratio
    DrawRatio: 36.1% is the proportion of games that ended in a draw. It affects the variance per game and therefore how many games are required to resolve small Elo differences: more draws lower the per-game variance, so the Elo margin is narrower for a fixed number of games, and fewer draws widen it. (Stockfish Testing Framework)
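
The LOS figure can be checked from wins and losses alone. The sketch below uses the formula commonly quoted in the engine-testing community, in which draws carry no information about which side is stronger; the function name is illustrative.

```python
import math


def likelihood_of_superiority(wins: int, losses: int) -> float:
    """Probability that the first engine is stronger, ignoring draws."""
    decisive = wins + losses
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))


los = likelihood_of_superiority(431, 476)
print(f"LOS = {100.0 * los:.1f} %")
```

For 431 wins and 476 losses this comes out just under 7%, in line with the 6.8 % in the match output, and equal wins and losses give exactly 50%.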

Numeric summary (concise)

Conclusion: there is no evidence of an Elo gain for Wordfish 1.0 dev; if anything the point estimate suggests a small loss (≈11 Elo), but the interval is wide enough that a small improvement (up to ≈+3.4 Elo) cannot be completely ruled out.


Practical implications and recommended next steps

  1. Do not conclude the dev is stronger. The most honest interpretation is: inconclusive; point estimate slightly worse. Acting as if the dev is stronger would be unjustified.
  2. If you want to resolve ±5 Elo reliably you need far more games.
    From the reported CI half-width (14.4 Elo at 95%), we can estimate the standard error: SE ≈ 14.4 / 1.96 ≈ 7.35 Elo. To reach a 95% half-width of ±5 Elo, SE must be ≈ 2.55 Elo (5 / 1.96). Because variance scales roughly inversely with the number of games, the required sample multiplier is ≈ (7.35 / 2.55)² ≈ 8.3. With 1,420 games now, you would therefore need roughly 1,420 × 8.3 ≈ 11,800 games in total to shrink the uncertainty to that level. That is a practical estimate; the true number depends on draw rate and the exact rating model used. (Stockfish Testing Framework)
  3. Alternative: change testing parameters to increase sensitivity
    • Use a smaller elo1 (narrower alternative) in SPRT to detect smaller deltas, but be aware that this increases required games. (Rustic Chess)
    • Reduce noise by using longer time controls, lower concurrency per machine, or fixed opening sets; but these choices change the interpretation vs real-play conditions. (Rustic Chess; Stockfish Testing Framework)
  4. If you care about practical strength, rerun a longer SPRT with the same parameters (alpha=0.05 / beta=0.05, elo0/elo1 matching your target) or switch to an ordinary long round-robin (fixed number of games) with enough games to bring SE down.
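
The games estimate in step 2 boils down to a square-law scaling. Here is a small sketch, assuming the margin shrinks with the square root of the number of games and that the draw rate stays roughly the same; the function name is hypothetical.

```python
import math


def games_needed(current_games: int, current_margin: float,
                 target_margin: float) -> int:
    """Total games needed to shrink a 95% Elo margin to target_margin."""
    multiplier = (current_margin / target_margin) ** 2
    return math.ceil(current_games * multiplier)


total = games_needed(1420, 14.4, 5.0)
print(f"about {total} games for a +/- 5 Elo margin")
```

Halving the margin costs four times the games, which is why resolving differences of a few Elo quickly runs into tens of thousands of games.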

Jorge Ruiz


connoisseur of both chess and anthropology, a combination that reflects his deep intellectual curiosity and passion for understanding both the art of strategic
