SPRT test 190825

Summary (raw lines you gave)

Score of Wordfish 1.0 dev 260825 vs Wordfish base: 3225 - 3068 - 3707  [0.508] 10000
...      Wordfish 1.0 dev 260825 playing White: 1830 - 1328 - 1842  [0.550] 5000
...      Wordfish 1.0 dev 260825 playing Black: 1395 - 1740 - 1865  [0.466] 5000
...      White vs Black: 3570 - 2723 - 3707  [0.542] 10000
Elo difference: 5.5 +/- 5.4, LOS: 97.6 %, DrawRatio: 37.1 %
SPRT: llr 1.94 (66.1%), lbound -2.94, ubound 2.94

Below I explain exactly what each line means, how the SPRT decision rule is applied, how to (approximately) compute the pentanomial vector from your aggregated numbers, and how to judge whether the reported Elo gain is reliable.

All core statistical concepts used by Fishtest (pentanomial model, SPRT / GSPRT, the meaning of LLR and bounds) are documented in the Fishtest maths pages and the ChessProgramming reference — I’ll cite those where needed. (Stockfish, Chessprogramming)

How to read the page

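3225 - 3068 - 3707 are the aggregate wins - losses - draws for Wordfish 1.0 dev across 10 000 games (they sum to 10 000). The per-colour lines use the same order, and the "White vs Black" line counts White wins, Black wins and draws regardless of engine.
The bracketed number ([0.508]) is the tested engine's score fraction, (wins + draws/2) / games = (3225 + 3707/2) / 10 000 ≈ 0.508. The Elo estimate is derived from this score through the logistic formula.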
Elo difference: 5.5 +/- 5.4 is the point estimate and its (approx.) uncertainty. If the confidence interval includes 0, the change is not statistically established. (Stockfish)
LOS = likelihood of superiority (probability that the tested engine is stronger than the baseline). It is related to, but not identical with, the SPRT decision — treat it as an additional summary measure. (Chessprogramming)
DrawRatio is draws/games (here 3707 / 10 000 = 37.1%).
SPRT: llr 1.94 (66.1%), lbound -2.94, ubound 2.94 shows:
- LLR = log-likelihood ratio accumulated by the SPRT/GSPRT for H0 vs H1 (see below),
- lbound and ubound are the negative and positive decision thresholds computed from your elo0/elo1, alpha, beta settings (the command you used had -sprt elo0=0 elo1=5 alpha=0.05 beta=0.05, which gives bounds ≈ ±2.944).
- Decision rule: if LLR >= ubound accept H1 (a gain at least the H1 threshold); if LLR <= lbound accept H0 (no gain); otherwise the test is inconclusive / ongoing. (GitHub, Chessprogramming)
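The bounds and the decision rule above can be sketched in a few lines. This is a minimal illustration using Wald's standard threshold formulas; the function names `sprt_bounds` and `sprt_decision` are hypothetical and not taken from cutechess-cli or Fishtest.

```python
import math

def sprt_bounds(alpha, beta):
    """Wald's SPRT thresholds on the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    return lower, upper

def sprt_decision(llr, lower, upper):
    """Apply the three-way decision rule to the current LLR."""
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"

lower, upper = sprt_bounds(alpha=0.05, beta=0.05)
print(round(lower, 3), round(upper, 3))    # -2.944 2.944
print(sprt_decision(1.94, lower, upper))   # continue
```

With alpha = beta = 0.05 both thresholds equal ±ln(19), which is where the ±2.94 in the report comes from.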

Immediate verdict for your numbers: LLR = 1.94 lies between -2.94 and +2.94, therefore the SPRT has accepted neither H1 nor H0 and the test is inconclusive. The reported Elo estimate 5.5 ± 5.4 corresponds to an interval of roughly 0.1 to 10.9 Elo, which only barely excludes zero. That interval is a fixed-sample summary, not the stopping criterion, so this run alone does not establish the gain under the configured SPRT. (Stockfish)

What the SPRT lines mean (short technical note)

Hypotheses (from your -sprt):
- H0: true Elo difference = elo0 (here 0)
- H1: true Elo difference = elo1 (here +5)
Alpha / Beta control type-I and type-II error (both 0.05 in your run). From these the SPRT sets decision bounds a = ln(β / (1 - α)) and b = ln((1 - β) / α), which are ≈ -2.944 and +2.944 here. The test accumulates the log-likelihood ratio (LLR). When LLR crosses a bound the test stops and accepts the corresponding hypothesis. The formulas and approximations used by Fishtest are explained in the Fishtest maths notes and the SPRT references. (Stockfish, Chessprogramming)
LLR = 1.94: that is the current log-likelihood ratio in favour of H1. Because 1.94 < +2.94, H1 is not yet accepted; because 1.94 > -2.94, H0 is not accepted either. The match remains inconclusive under the configured thresholds.
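To get a feel for where an LLR of this size comes from, here is a rough sketch based on the normal (GSPRT-style) approximation applied to the aggregate win/loss/draw totals. It is an assumption that this approximation is adequate here; the tool that produced the report may use a different formula (for example a BayesElo-based one), so the result should only be expected to land near the reported value. The helper names are hypothetical.

```python
def expected_score(elo):
    """Logistic expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def llr_normal_approx(wins, losses, draws, elo0, elo1):
    """Approximate LLR of H1 (elo1) vs H0 (elo0) from W/L/D totals."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n               # observed score per game
    var = (wins + 0.25 * draws) / n - mean ** 2   # per-game score variance
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)

llr = llr_normal_approx(wins=3225, losses=3068, draws=3707, elo0=0, elo1=5)
print(f"{llr:.2f}")  # about 1.95, close to the reported 1.94
```

The small gap to the reported figure is expected, since this sketch ignores pairing and uses a plain trinomial variance.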

Why the page also prints “Elo difference: 5.5 +/- 5.4” though the SPRT is inconclusive

The report prints a point estimate of Elo computed from the observed score, alongside the LLR and SPRT status. The two answer different questions: the ± figure is a fixed-sample error margin on the estimate, while the SPRT asks whether the accumulated evidence favours H1 (+5 Elo) over H0 (0 Elo) strongly enough at the chosen α and β. Here the margin (±5.4) is almost as large as the estimate itself, so the interval (about 0.1 to 10.9 Elo) only just clears zero and the effect is weakly supported at best. The SPRT conclusion is the authoritative termination decision: you need LLR ≥ ubound to claim H1. (Stockfish)
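The Elo, error margin and LOS figures can be reproduced approximately from the aggregate totals. The sketch below assumes a trinomial (win/loss/draw) model with a 95% normal interval and the common erf-based LOS formula; the function names are hypothetical and the exact method of the reporting tool is not confirmed by the source.

```python
import math

def elo_from_score(score):
    """Invert the logistic expected-score formula."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_summary(wins, losses, draws, z=1.959964):
    """Return (elo, half_width, (low, high), los) from W/L/D totals."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean ** 2
    margin = z * math.sqrt(var / n)              # 95% margin on the score
    low = elo_from_score(mean - margin)
    high = elo_from_score(mean + margin)
    # Likelihood of superiority: draws carry no information here
    los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))
    return elo_from_score(mean), (high - low) / 2.0, (low, high), los

elo, half_width, (low, high), los = elo_summary(3225, 3068, 3707)
print(f"Elo {elo:.1f} +/- {half_width:.1f}, interval [{low:.1f}, {high:.1f}], LOS {100 * los:.1f} %")
```

For this run the lower end of the interval comes out slightly above zero, which is consistent with an LOS just over 97.5%.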

Pentanomial: what it is and why it matters

Fishtest uses a pentanomial model for paired-games testing. When matches are played with paired openings (each opening twice with colours swapped), the pair outcomes are modelled with 5 categories (pair score for the tested engine): 0, 0.5, 1.0, 1.5, 2.0 (points per opening pair). This captures correlation and opening bias better than a simple win/draw/loss trinomial and gives more realistic error estimates and SPRT behaviour. (Stockfish, Chessprogramming)
Important: the exact pentanomial counts come from pairing the two games of each opening. If you want the true pentanomial vector you must read the results per opening pair (i.e. which opening produced W/D/L on each side). Aggregate W/D/L by colour alone is not sufficient to reconstruct the exact pentanomial counts without the per-pair data. Fishtest’s own “Pentanomial” numbers are computed from the paired results uploaded by workers. (Stockfish)

Practical: compute an (approximate) pentanomial from your aggregated White/Black totals

You gave per-colour totals (5000 White games, 5000 Black games). If you assume independence between results in the two games of each opening (this is an approximation — it ignores pairing correlations or opening bias), you can estimate the expected pentanomial frequencies by multiplying the marginal probabilities. The maths (straightforward) is:

Let (for the tested engine)

$p_W^{(w)}$, $p_D^{(w)}$, $p_L^{(w)}$ = win, draw and loss probabilities when playing White (from White totals),
$p_W^{(b)}$, $p_D^{(b)}$, $p_L^{(b)}$ = win, draw and loss probabilities when playing Black (from Black totals),
number of pairs = $N_{pairs}$ (here 5 000 pairs because you had 10 000 games with 2 games per opening).

Then the expected pentanomial frequencies are:

$p_{2}$ (2.0 pts) $= p_W^{(w)} \cdot p_W^{(b)} \cdot N_{pairs}$
$p_{1.5}$ (1.5 pts) $= \big( p_W^{(w)} \cdot p_D^{(b)} + p_D^{(w)} \cdot p_W^{(b)} \big) \cdot N_{pairs}$
$p_{1}$ (1.0 pts) $= \big( p_W^{(w)} \cdot p_L^{(b)} + p_L^{(w)} \cdot p_W^{(b)} + p_D^{(w)} \cdot p_D^{(b)} \big) \cdot N_{pairs}$
$p_{0.5}$ (0.5 pts) $= \big( p_D^{(w)} \cdot p_L^{(b)} + p_L^{(w)} \cdot p_D^{(b)} \big) \cdot N_{pairs}$
$p_{0}$ (0.0 pts) $= p_L^{(w)} \cdot p_L^{(b)} \cdot N_{pairs}$

(This product approximation is exactly what you get if the two games of a pair are independent draws from the marginal distributions.) The pentanomial model used by Fishtest is the formal model for the actual paired data; the product-approximation is only an estimator when you don’t have per-pair data. (Chessprogramming)

Applying that to your numbers (step-by-step)

From your White totals (5000 games, printed as wins - losses - draws = 1830 - 1328 - 1842):

$p_W^{(w)} = 1830/5000 = 0.3660$
$p_L^{(w)} = 1328/5000 = 0.2656$
$p_D^{(w)} = 1842/5000 = 0.3684$

From your Black totals (5000 games, wins - losses - draws = 1395 - 1740 - 1865):

$p_W^{(b)} = 1395/5000 = 0.2790$
$p_L^{(b)} = 1740/5000 = 0.3480$
$p_D^{(b)} = 1865/5000 = 0.3730$

Number of pairs $N_{pairs} = 5000$.

Compute expected pentanomial (multiply and scale by 5000):

$p_{2}$ (score 2.0) ≈ 0.3660 × 0.2790 × 5000 ≈ 511
$p_{1.5}$ (1.5) ≈ (0.3660×0.3730 + 0.3684×0.2790) × 5000 ≈ 1197
$p_{1}$ (1.0) ≈ (0.3660×0.3480 + 0.2656×0.2790 + 0.3684×0.3730) × 5000 ≈ 1694
$p_{0.5}$ (0.5) ≈ (0.3684×0.3480 + 0.2656×0.3730) × 5000 ≈ 1136
$p_{0}$ (0.0) ≈ 0.2656 × 0.3480 × 5000 ≈ 462

(Rounded to nearest whole number; they sum to 5 000 pairs.) These are the expected pentanomial counts under the independence approximation. If you feed the real per-pair counts into Fishtest, it may produce slightly (or noticeably) different pentanomial numbers because of pair correlation or opening bias. (Chessprogramming)
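The independence approximation can be checked with a short script. This is a sketch under the stated assumption that the two games of a pair are independent; the totals are passed in the wins - losses - draws order used by the score lines, and `expected_pentanomial` is a hypothetical helper name.

```python
def expected_pentanomial(white_wld, black_wld):
    """Expected pair counts [p0, p0.5, p1, p1.5, p2] under independence.

    Each argument is (wins, losses, draws) for the tested engine with that
    colour. Assumes one White game and one Black game per opening pair.
    """
    n_pairs = sum(white_wld)
    ww, wl, wd = (x / sum(white_wld) for x in white_wld)
    bw, bl, bd = (x / sum(black_wld) for x in black_wld)
    freqs = [
        wl * bl,                        # 0.0: loss + loss
        wd * bl + wl * bd,              # 0.5: draw + loss
        ww * bl + wl * bw + wd * bd,    # 1.0: win + loss, or two draws
        ww * bd + wd * bw,              # 1.5: win + draw
        ww * bw,                        # 2.0: win + win
    ]
    return [f * n_pairs for f in freqs]

counts = expected_pentanomial((1830, 1328, 1842), (1395, 1740, 1865))
print([round(c) for c in counts])  # [462, 1136, 1694, 1197, 511]
```

A quick sanity check on any such vector: the pair scores weighted by these counts should add up to the engine's total points (5078.5 here), which ties the estimate back to the [0.508] score.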

Worked numbers (compact)

 pentanomial (N_pairs=5000):
 [p0, p0.5, p1, p1.5, p2]  ≈  [462, 1136, 1694, 1197, 511]

How to truly obtain the exact pentanomial used by Fishtest

Download the per-pair game results (Fishtest stores results per opening pair; workers upload the two games for each opening). The site / your PGN will show the opening pair associations.
For each opening, compute the pair score for Wordfish 1.0 dev (0, 0.5, 1, 1.5, 2) and tally the counts. Those five counts are the true pentanomial vector; they are what Fishtest uses in its pentanomial statistics. Fishtest’s web UI shows the pentanomial column for exactly this reason. The Fishtest docs and the pentanomial code repositories explain the details. (Stockfish, GitHub)

If you want, I can provide a small Python script you can run locally that reads a PGN or a pair-indexed CSV and outputs the pentanomial counts (I can include an example snippet in this article).
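A minimal sketch of such a script follows. It assumes a hypothetical pair-indexed CSV with columns `pair_id` and `score` (the tested engine's score in each game: 0, 0.5 or 1); that layout is an illustration only, not a format produced by Fishtest or cutechess-cli. Extracting pair associations from a PGN is left out because it depends on how the openings were scheduled.

```python
import csv
import io
from collections import defaultdict

def tally_pentanomial(rows):
    """Count pair scores 0, 0.5, 1, 1.5, 2 from (pair_id, score) rows."""
    pair_scores = defaultdict(list)
    for pair_id, score in rows:
        pair_scores[pair_id].append(float(score))
    counts = [0, 0, 0, 0, 0]
    for scores in pair_scores.values():
        if len(scores) != 2:
            continue  # skip incomplete pairs rather than guess
        counts[int(round(2 * sum(scores)))] += 1
    return counts

# Hypothetical sample data: three complete pairs and one unfinished pair
sample_csv = """pair_id,score
1,1
1,0.5
2,0
2,0
3,0.5
3,0.5
4,1
"""
reader = csv.DictReader(io.StringIO(sample_csv))
pair_counts = tally_pentanomial((row["pair_id"], row["score"]) for row in reader)
print(pair_counts)  # [1, 0, 1, 1, 0]
```

To use it on real data, replace `io.StringIO(sample_csv)` with an open file handle for your own export.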

Final interpretation & recommended steps

Current conclusion: The SPRT is inconclusive. LLR = 1.94 sits between the bounds -2.94 and +2.94. The point Elo estimate 5.5 ± 5.4 gives an interval of about 0.1 to 10.9 Elo, which only just excludes zero, so the evidence is suggestive but not strong enough to accept H1 under the configured test. (This is the correct conservative conclusion.) (Stockfish)
Why you might still see LOS = 97.6% while SPRT is inconclusive: LOS and LLR answer different questions. LOS estimates the probability that the tested engine is stronger than the baseline at all (Elo difference > 0), whereas the SPRT weighs H1 (+5 Elo) against H0 (0 Elo) and stops only when the LLR crosses a bound set by α and β. A positive point estimate with a wide error margin can therefore give a high LOS while the sequential LLR has not yet reached the decision threshold. (Chessprogramming)
If you want a decisive answer: continue the test (more pairs) or run a second stage with tighter constraints / more games. Alternatively, use the true pentanomial counts (pairwise results) to re-compute the SPRT/GSPRT values (that is precisely what Fishtest does internally). Using more pairs reduces the ±CI and moves the LLR as more evidence accumulates. (Stockfish)
To reproduce locally: your cutechess-cli command already uses -sprt elo0=0 elo1=5 alpha=0.05 beta=0.05, which is correct for a standard two-threshold SPRT. Increasing -rounds lets the test run longer, giving the LLR more opportunity to reach a bound. Reducing alpha/beta lowers the error rates but widens the SPRT bounds and lengthens the expected runtime. See the cutechess docs for -sprt details. (GitHub)
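Recomputing an LLR from pentanomial counts, as suggested above, can be sketched with the same normal approximation applied per pair instead of per game. This is only an illustration on the logistic Elo scale: Fishtest's actual GSPRT implementation is more elaborate, and the counts used here are the independence-approximation estimates for this run, not real per-pair data. The function names are hypothetical.

```python
def expected_score(elo):
    """Logistic expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def llr_from_pentanomial(counts, elo0, elo1):
    """Approximate LLR from pair counts [p0, p0.5, p1, p1.5, p2]."""
    n_pairs = sum(counts)
    game_scores = [0.0, 0.25, 0.5, 0.75, 1.0]   # pair score divided by 2
    mean = sum(c * s for c, s in zip(counts, game_scores)) / n_pairs
    var = sum(c * (s - mean) ** 2 for c, s in zip(counts, game_scores)) / n_pairs
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n_pairs * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)

pair_llr = llr_from_pentanomial([462, 1136, 1694, 1197, 511], elo0=0, elo1=5)
print(f"{pair_llr:.2f}")
```

With real paired data the per-pair variance usually differs from the independence estimate, so the LLR computed this way can move in either direction relative to the trinomial figure.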

Pentanomial

 pentanomial (N_pairs=5000):
 [p0, p0.5, p1, p1.5, p2]  ≈  [462, 1136, 1694, 1197, 511]

Ordo

SPRT TEST WORDFISH 1.0 DEV VS WORDFISH BASE

Name	ELO	POINTS	PLAYED	(%)
Wordfish 1.0 dev 260825	0.0	5078.5	10000	51
Wordfish base	-5.5	4921.5	10000	49

SUITE LICHESSBOOK.EPD

References (useful links)

Fishtest — Statistical Methods and Algorithms (pentanomial & GSPRT explanations). (Stockfish)
ChessProgramming — Match statistics and SPRT discussion (derivations, formulas, LLR snippet). (Chessprogramming)
cutechess-cli documentation — -sprt usage. (GitHub)
Pentanomial implementation / simulators (useful if you want to reproduce the Fishtest maths locally). (GitHub)

Download games