Summary (raw lines you gave)
Score of Wordfish 1.0 dev 260825 vs Wordfish base: 3225 - 3068 - 3707 [0.508] 10000
... Wordfish 1.0 dev 260825 playing White: 1830 - 1328 - 1842 [0.550] 5000
... Wordfish 1.0 dev 260825 playing Black: 1395 - 1740 - 1865 [0.466] 5000
... White vs Black: 3570 - 2723 - 3707 [0.542] 10000
Elo difference: 5.5 +/- 5.4, LOS: 97.6 %, DrawRatio: 37.1 %
SPRT: llr 1.94 (66.1%), lbound -2.94, ubound 2.94
Below I explain exactly what each line means, how the SPRT decision rule is applied, how to (approximately) compute the pentanomial vector from your aggregated numbers, and how to judge whether the reported Elo gain is reliable.
All core statistical concepts used by Fishtest (pentanomial model, SPRT / GSPRT, the meaning of LLR and bounds) are documented in the Fishtest maths pages and the ChessProgramming reference — I’ll cite those where needed. (Stockfish, Chessprogramming)
How to read the page
- 3225 - 3068 - 3707 are the aggregate wins - losses - draws for Wordfish 1.0 dev across 10 000 games (they sum to 10 000). cutechess-cli prints scores in win - loss - draw order; the DrawRatio line confirms this, since 3707 / 10 000 = 37.1 %.
- The number in brackets ([0.508]) is the empirical score fraction, (wins + draws / 2) / games = (3225 + 1853.5) / 10 000 ≈ 0.508. The Elo estimate is this score passed through the logistic Elo formula.
- Elo difference: 5.5 +/- 5.4 is the point estimate and its approximate uncertainty. If the confidence interval includes 0, the change is not statistically established; here the interval is roughly 0.1 to 10.9, so it excludes 0 only marginally. (Stockfish)
- LOS = likelihood of superiority (probability that the tested engine is stronger than the baseline). It is related to, but not identical with, the SPRT decision; treat it as an additional summary measure. (Chessprogramming)
- DrawRatio is draws/games (here 37.1 %).
- SPRT: llr 1.94 (66.1%), lbound -2.94, ubound 2.94 shows:
  - LLR = log-likelihood ratio accumulated by the SPRT/GSPRT for H0 vs H1 (see below),
  - lbound and ubound are the negative and positive decision thresholds computed from your elo0/elo1, alpha, beta settings (the command you used had -sprt elo0=0 elo1=5 alpha=0.05 beta=0.05, which gives bounds ≈ ±2.944).
- Decision rule: if LLR >= ubound accept H1 (a gain at least the H1 threshold); if LLR <= lbound accept H0 (no gain); otherwise the test is inconclusive / ongoing. (GitHub, Chessprogramming)
Immediate verdict for your numbers: LLR = 1.94 lies between -2.94 and +2.94, therefore the SPRT has accepted neither H1 nor H0 and the test is inconclusive. The reported Elo estimate 5.5 ± 5.4 gives an interval of roughly 0.1 to 10.9, which excludes zero only marginally, so this run alone is weak evidence of an Elo gain and does not meet the SPRT criterion. (Stockfish)
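The summary figures can be re-derived from the three aggregate counts. The sketch below is an illustration, not cutechess-cli's source: it assumes a trinomial variance, a 95 % normal interval, and the common LOS formula based on wins and losses only. The function name `match_summary` is made up for this example.

```python
import math

def elo_from_score(score: float) -> float:
    """Logistic Elo difference that corresponds to an expected score."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def match_summary(wins: int, losses: int, draws: int) -> dict:
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / games
    half_width = 1.959964 * math.sqrt(var / games)  # assumed 95 % interval
    elo_lo = elo_from_score(score - half_width)
    elo_hi = elo_from_score(score + half_width)
    # LOS from decisive games only (draws carry no information about sign).
    los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))
    return {
        "score": score,
        "elo": elo_from_score(score),
        "elo_lo": elo_lo,
        "elo_hi": elo_hi,
        "margin": (elo_hi - elo_lo) / 2.0,
        "los": los,
        "draw_ratio": draws / games,
    }

# Counts in cutechess-cli order: wins, losses, draws.
summary = match_summary(wins=3225, losses=3068, draws=3707)
for key, value in summary.items():
    print(f"{key:>10}: {value:.4f}")
```

Run on the totals above, this lands close to the reported 5.5 ± 5.4 Elo and 97.6 % LOS, which also confirms that the middle number in the score line is the loss count.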
What the SPRT lines mean (short technical note)
- Hypotheses (from your -sprt settings):
  - H0: true Elo difference = elo0 (here 0)
  - H1: true Elo difference = elo1 (here +5)
- Alpha / Beta control type-I and type-II error (both 0.05 in your run). From these the SPRT sets decision bounds a and b (≈ -2.944 and +2.944 here). The test accumulates the log-likelihood ratio (LLR). When LLR crosses a bound the test stops and accepts the corresponding hypothesis. The formulas and approximations used by Fishtest are explained in the Fishtest maths notes and the SPRT references. (Stockfish, Chessprogramming)
- LLR = 1.94: that is the current log-likelihood ratio in favour of H1. Because 1.94 < +2.94, H1 is not yet accepted; because 1.94 > -2.94, H0 is not accepted either. The match remains inconclusive under the configured thresholds.
Why the page also prints “Elo difference: 5.5 +/- 5.4” though the SPRT is inconclusive
The point estimate of Elo is computed directly from the observed score, while the LLR and SPRT status come from the sequential test (Fishtest uses the GSPRT approximations / normal-drift approximations described in the Fishtest maths doc). The two answer different questions. Here the ±5.4 margin leaves an interval of roughly 0.1 to 10.9, which excludes 0 only barely, and an interval read off partway through a sequential test does not carry its nominal guarantee because the stopping point was not fixed in advance. The SPRT conclusion is the authoritative termination decision: you need LLR ≥ ubound to claim H1. (Stockfish)
Pentanomial: what it is and why it matters
- Fishtest uses a pentanomial model for paired-games testing. When matches are played with paired openings (each opening twice with colours swapped), the pair outcomes are modelled with 5 categories (pair score for the tested engine): 0, 0.5, 1.0, 1.5, 2.0 (points per opening pair). This captures correlation and opening bias better than a simple win/draw/loss trinomial and gives more realistic error estimates and SPRT behaviour. (Stockfish, Chessprogramming)
- Important: the exact pentanomial counts come from pairing the two games of each opening. If you want the true pentanomial vector you must read the results per opening pair (i.e. which opening produced W/D/L on each side). Aggregate W/D/L by colour alone is not sufficient to reconstruct the exact pentanomial counts without the per-pair data. Fishtest’s own “Pentanomial” numbers are computed from the paired results uploaded by workers. (Stockfish)
Practical: compute an (approximate) pentanomial from your aggregated White/Black totals
You gave per-colour totals (5000 White games, 5000 Black games). If you assume independence between results in the two games of each opening (this is an approximation — it ignores pairing correlations or opening bias), you can estimate the expected pentanomial frequencies by multiplying the marginal probabilities. The maths (straightforward) is:
Let (for the tested engine)
- $p_W^{(w)}$, $p_D^{(w)}$, $p_L^{(w)}$ = probabilities when playing White (from White totals),
- $p_W^{(b)}$, $p_D^{(b)}$, $p_L^{(b)}$ = probabilities when playing Black (from Black totals),
- number of pairs = $N_{pairs}$ (here 5 000 pairs because you had 10 000 games with 2 games per opening).
Then the expected pentanomial frequencies are:
- p2 (2.0 pts) $= p_W^{(w)} \cdot p_W^{(b)} \cdot N_{pairs}$
- p1.5 (1.5 pts) $= \big( p_W^{(w)} \cdot p_D^{(b)} + p_D^{(w)} \cdot p_W^{(b)} \big) \cdot N_{pairs}$
- p1 (1.0 pts) $= \big( p_W^{(w)} \cdot p_L^{(b)} + p_L^{(w)} \cdot p_W^{(b)} + p_D^{(w)} \cdot p_D^{(b)} \big) \cdot N_{pairs}$
- p0.5 (0.5 pts) $= \big( p_D^{(w)} \cdot p_L^{(b)} + p_L^{(w)} \cdot p_D^{(b)} \big) \cdot N_{pairs}$
- p0 (0.0 pts) $= p_L^{(w)} \cdot p_L^{(b)} \cdot N_{pairs}$
(This product approximation is exactly what you get if the two games of a pair are independent draws from the marginal distributions.) The pentanomial model used by Fishtest is the formal model for the actual paired data; the product-approximation is only an estimator when you don’t have per-pair data. (Chessprogramming)
Applying that to your numbers (step-by-step)
From your White totals (5000 games: 1830 wins, 1328 losses, 1842 draws, reading the line in cutechess-cli's win - loss - draw order):
- $p_W^{(w)} = 1830/5000 = 0.3660$
- $p_D^{(w)} = 1842/5000 = 0.3684$
- $p_L^{(w)} = 1328/5000 = 0.2656$
From your Black totals (5000 games: 1395 wins, 1740 losses, 1865 draws):
- $p_W^{(b)} = 1395/5000 = 0.2790$
- $p_D^{(b)} = 1865/5000 = 0.3730$
- $p_L^{(b)} = 1740/5000 = 0.3480$
Number of pairs $N_{pairs} = 5000$.
Compute expected pentanomial (multiply and scale by 5000):
- p₂ (score 2.0) ≈ 0.3660 × 0.2790 × 5000 ≈ 511
- p₁.₅ (1.5) ≈ (0.3660×0.3730 + 0.3684×0.2790) × 5000 ≈ 1197
- p₁ (1.0) ≈ (0.3660×0.3480 + 0.2656×0.2790 + 0.3684×0.3730) × 5000 ≈ 1694
- p₀.₅ (0.5) ≈ (0.3684×0.3480 + 0.2656×0.3730) × 5000 ≈ 1136
- p₀ (0.0) ≈ 0.2656 × 0.3480 × 5000 ≈ 462
(Rounded to nearest whole number; they sum to 5 000 pairs.) As a check, these counts give (0.5×1136 + 1694 + 1.5×1197 + 2×511) / 10 000 ≈ 0.508, matching the reported score. These are the expected pentanomial counts under the independence approximation. If you feed the real per-pair counts into Fishtest, it may produce slightly (or noticeably) different pentanomial numbers because of pair correlation or opening bias. (Chessprogramming)
Worked numbers (compact)
pentanomial (N_pairs=5000):
[p0, p0.5, p1, p1.5, p2] ≈ [462, 1136, 1694, 1197, 511]
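The product approximation is easy to script. The sketch below assumes the per-colour totals are given as (wins, losses, draws), which is the order cutechess-cli prints, and that the two games of a pair are independent; the function name `approx_pentanomial` is chosen for this example only.

```python
from itertools import product

def approx_pentanomial(white_wld, black_wld):
    """Expected pair counts for scores 0, 0.5, 1, 1.5, 2 under independence.

    white_wld / black_wld: (wins, losses, draws) of the tested engine
    when playing White / Black.
    """
    def probs(wins, losses, draws):
        n = wins + losses + draws
        # Map the points scored in one game to its empirical probability.
        return {1.0: wins / n, 0.5: draws / n, 0.0: losses / n}

    p_white = probs(*white_wld)
    p_black = probs(*black_wld)
    n_pairs = sum(white_wld)  # one White game per opening pair

    counts = {0.0: 0.0, 0.5: 0.0, 1.0: 0.0, 1.5: 0.0, 2.0: 0.0}
    for (pts_w, pw), (pts_b, pb) in product(p_white.items(), p_black.items()):
        counts[pts_w + pts_b] += pw * pb * n_pairs
    return [counts[s] for s in (0.0, 0.5, 1.0, 1.5, 2.0)]

estimate = approx_pentanomial(white_wld=(1830, 1328, 1842),
                              black_wld=(1395, 1740, 1865))
print([round(x) for x in estimate])
```

The result is only an estimate: real paired data usually shows fewer 0 and 2 outcomes than independence predicts when openings are biased towards one colour.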
How to truly obtain the exact pentanomial used by Fishtest
- Download the per-pair game results (Fishtest stores results per opening pair; workers upload the two games for each opening). The site / your PGN will show the opening pair associations.
- For each opening, compute the pair score for Wordfish 1.0 dev (0, 0.5, 1, 1.5, 2) and tally the counts. Those five counts are the true pentanomial vector; they are what Fishtest uses in its pentanomial statistics. Fishtest’s web UI shows the pentanomial column for exactly this reason. The Fishtest docs and the pentanomial code repositories explain the details. (Stockfish, GitHub)
If you want, I can provide a small Python script you can run locally that reads a PGN or a pair-indexed CSV and outputs the pentanomial counts (I can include an example snippet in this article).
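A minimal sketch of such a tally, assuming the per-pair results have already been extracted into a list of two-letter outcomes from the tested engine's point of view ("W", "D" or "L" for each game of the pair). The input format and the name `tally_pentanomial` are hypothetical; parsing a real PGN into pairs depends on how your tool orders and tags the games.

```python
from collections import Counter

# Points scored by the tested engine in a single game.
RESULT_POINTS = {"W": 1.0, "D": 0.5, "L": 0.0}

def tally_pentanomial(pairs):
    """Count opening pairs by total pair score 0, 0.5, 1, 1.5, 2.

    pairs: iterable of (first_game, second_game) results, each "W", "D" or "L",
    both from the tested engine's perspective.
    """
    counts = Counter(RESULT_POINTS[a] + RESULT_POINTS[b] for a, b in pairs)
    return [counts[score] for score in (0.0, 0.5, 1.0, 1.5, 2.0)]

# Hypothetical data: six opening pairs.
example_pairs = [("W", "D"), ("D", "D"), ("L", "W"),
                 ("W", "W"), ("L", "D"), ("L", "L")]
print(tally_pentanomial(example_pairs))
```

Note that a win plus a loss and two draws both land in the 1.0 bucket, which is exactly the information the pentanomial model keeps and the per-colour totals lose.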
Final interpretation & recommended steps
- Current conclusion: The SPRT is inconclusive. LLR = 1.94 sits between the bounds -2.94 and +2.94. The point Elo estimate 5.5 ± 5.4 has an interval (roughly 0.1 to 10.9) that excludes zero only marginally, so the evidence is suggestive but not strong enough to claim a gain under the configured test. (This is the correct conservative conclusion.) (Stockfish)
- Why you might still see LOS = 97.6% while SPRT is inconclusive: LOS and LLR are related but different statistics. LOS only asks whether the true difference is above 0, while the SPRT weighs elo0 against elo1 under the chosen α/β and decides when to terminate. Mixed signals can occur when the point estimate is positive but the sequential LLR has not yet crossed the decision threshold. (Chessprogramming)
- If you want a decisive answer: continue the test (more pairs) or run a second stage with tighter constraints / more games. Alternatively, use the true pentanomial counts (pairwise results) to re-compute the SPRT/GSPRT values (that is precisely what Fishtest does internally). Using more pairs reduces the ±CI and moves the LLR as more evidence accumulates. (Stockfish)
- To reproduce locally: your cutechess-cli command already uses -sprt elo0=0 elo1=5 alpha=0.05 beta=0.05, which is correct for a standard two-threshold SPRT. To give the test room to reach a decision you can increase -rounds (number of opening lines). Reducing alpha/beta gives stricter error control, but note that it widens the SPRT bounds and lengthens the expected runtime. See the cutechess docs for -sprt details. (GitHub)
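Re-computing the LLR from pentanomial counts can be sketched with the same normal approximation, applied per pair instead of per game. This is an illustration under stated assumptions: a logistic Elo model, pair scores rescaled to the 0..1 range, and the independence-based estimate [462, 1136, 1694, 1197, 511] as stand-in input, because the true per-pair counts are not available here. Fishtest's actual implementation may differ in detail.

```python
def expected_score(elo: float) -> float:
    """Logistic expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def pentanomial_llr(counts, elo0: float, elo1: float) -> float:
    """Approximate LLR from pair counts for pair scores 0, 0.5, 1, 1.5, 2."""
    n_pairs = sum(counts)
    # Average score per game within a pair: 0, 0.25, 0.5, 0.75, 1.
    levels = (0.0, 0.25, 0.5, 0.75, 1.0)
    mean = sum(x * c for x, c in zip(levels, counts)) / n_pairs
    var = sum(c * (x - mean) ** 2 for x, c in zip(levels, counts)) / n_pairs
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n_pairs * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)

# Stand-in counts (independence estimate, not real paired data).
approx_counts = [462, 1136, 1694, 1197, 511]
print(f"approx pentanomial LLR: {pentanomial_llr(approx_counts, 0.0, 5.0):.2f}")
```

With real paired results the variance per pair is typically smaller than the independence estimate suggests, which would push the LLR further from zero for the same score.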
Pentanomial
pentanomial (N_pairs=5000):
[p0, p0.5, p1, p1.5, p2] ≈ [462, 1136, 1694, 1197, 511]
Ordo
SPRT TEST WORDFISH 1.0 DEV VS WORDFISH BASE
Name | ELO | POINTS | PLAYED | (%) |
---|---|---|---|---|
Wordfish 1.0 dev 260825 | 0.0 | 5078.5 | 10000 | 51 |
Wordfish base | -5.5 | 4921.5 | 10000 | 49 |
References (useful links)
- Fishtest — Statistical Methods and Algorithms (pentanomial & GSPRT explanations). (Stockfish)
- ChessProgramming — Match statistics and SPRT discussion (derivations, formulas, LLR snippet). (Chessprogramming)
- cutechess-cli documentation: -sprt usage. (GitHub)
- Pentanomial implementation / simulators (useful if you want to reproduce the Fishtest maths locally). (GitHub)