Summary
Finished game 2000 (wordfish1 vs wordfish): 1/2-1/2 {Draw by 3-fold repetition}
--------------------------------------------------
Results of wordfish vs wordfish1 (10+0.1, 1t, 32MB, 8moves_v3.pgn):
Elo: 4.00 +/- 5.68, nElo: 10.71 +/- 15.23
LOS: 91.59 %, DrawRatio: 74.80 %, PairsRatio: 1.21
Games: 2000, Wins: 182, Losses: 159, Draws: 1659, Points: 1011.5 (50.58 %)
Ptnml(0-2): [5, 109, 748, 134, 4], WL/DD Ratio: 0.06
LLR: 0.39 (13.3%) (-2.94, 2.94) [0.00, 2.50]
What the result says (direction matters)
Your summary line, restated:
Results of wordfish vs wordfish1 (10+0.1, 1t, 32MB, 8moves_v3.pgn)
Elo: +4.00 ± 5.68, nElo: +10.71 ± 15.23, LOS: 91.59%
Games: 2000 (182-159-1659), Score: 50.58%
Ptnml(0-2): [5, 109, 748, 134, 4]; LLR: 0.39 (13.3%) against decision thresholds ±2.94, with Elo hypotheses [0.00, 2.50]
- As printed, the first engine (“wordfish”) is ahead of the second (“wordfish1”) by ~+4 Elo.
- If “wordfish1” is your Wordfish 1.0 dev and “wordfish” is the old/base version, then dev − base ≈ −4.0 Elo (i.e., dev is ~4 Elo weaker on this run).
- The error bar is larger than the estimate (±5.68), so the interval includes zero. In plain terms: no statistically significant difference—the data are compatible with anything from a small loss to a small gain.
LOS and LLR, decoded
- LOS 91.6% ≈ the probability that the first-named engine (wordfish) is genuinely the stronger one, given the observed results. That again points against the dev if dev = wordfish1.
- LLR = 0.39 with decision thresholds −2.94 / +2.94 (the usual α = β = 5% SPRT thresholds) means the sequential test is far from a decision: only ~13% of the way to either threshold. In other words, the SPRT is inconclusive at this sample size (chessprogramming.org).
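As a quick numeric check of both bullets, here is a minimal Python sketch. It assumes the printed ± on nElo is a 95% interval (so the standard error is the ± divided by 1.96), that LOS is the normal-CDF tail of the resulting z-score, and that the ±2.94 thresholds come from α = β = 0.05; the exact formulas your tool uses may differ slightly.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Figures copied from the summary above.
nelo, nelo_err95 = 10.71, 15.23   # normalised Elo and its printed +/- (assumed 95% CI)
llr, alpha, beta = 0.39, 0.05, 0.05

# LOS as the one-sided probability that the true difference is positive.
z = nelo / (nelo_err95 / 1.96)                      # ~1.38
print(f"LOS ~ {100 * normal_cdf(z):.1f} %")         # ~91.6 %, matching the printed 91.59 %

# SPRT decision thresholds for alpha = beta = 5 %.
upper = math.log((1 - beta) / alpha)                # ~ +2.944
lower = math.log(beta / (1 - alpha))                # ~ -2.944
print(f"thresholds ~ {lower:.2f} / +{upper:.2f}")

# Progress towards a decision: fraction of the way to the nearer threshold.
print(f"progress ~ {100 * llr / upper:.1f} %")      # ~13 %, in line with the printed 13.3 %
```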
Pentanomial line sanity-check
Your Ptnml(0-2) = [5, 109, 748, 134, 4] sums to 1000 opening pairs, which matches 2000 games. The bins count pair scores for the first-named engine in ascending order 0, 0.5, 1, 1.5, 2 (i.e., LL, LD, WL or DD, WD, WW), exactly the standard pentanomial pairing model. The huge middle bin (748 pairs, almost all double draws given the WL/DD ratio of 0.06) explains the slow information gain (Chess Stack Exchange; official-stockfish.github.io).
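To make the pairing arithmetic concrete, the sketch below rebuilds the headline numbers from nothing but the Ptnml counts. The normalised-Elo formula used here (per-game z-score scaled by 800/ln 10, with a 1.96 factor for the 95% error) is my reading of the usual pentanomial write-ups, not necessarily your tool’s exact code path, but it lands very close to the printed figures.

```python
import math

# Ptnml(0-2): pair counts for the first-named engine (wordfish),
# ordered by pair score 0, 0.5, 1, 1.5, 2.
ptnml = [5, 109, 748, 134, 4]
pair_scores = [0.0, 0.5, 1.0, 1.5, 2.0]

pairs = sum(ptnml)                        # 1000 pairs -> 2000 games
games = 2 * pairs
points = sum(c * s for c, s in zip(ptnml, pair_scores))   # 1011.5
score = points / games                                     # ~0.5058

# Per-game mean and variance, estimated from the paired (pentanomial) results.
per_game = [s / 2 for s in pair_scores]
mean = sum(c * x for c, x in zip(ptnml, per_game)) / pairs
var_pair = sum(c * (x - mean) ** 2 for c, x in zip(ptnml, per_game)) / pairs
se_mean = math.sqrt(var_pair / pairs)     # standard error of the mean score

# Normalised Elo: per-game z-score scaled by 800/ln(10) (assumed convention).
z = (mean - 0.5) / se_mean                # ~1.38
scale = 800 / math.log(10)                # ~347.4
nelo = z / math.sqrt(games) * scale            # ~ +10.7
nelo_err95 = 1.96 / math.sqrt(games) * scale   # ~ 15.2

print(f"pairs={pairs}, points={points}, score={100 * score:.3f} %")
print(f"nElo ~ {nelo:+.1f} +/- {nelo_err95:.1f}")
```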
Bottom line (dev vs old)
- Point estimate: −4 Elo for Wordfish 1.0 dev relative to the old version (assuming dev = wordfish1).
- Uncertainty (as reported): about ±5.7 Elo, so the plausible range is roughly [−9.7, +1.7] Elo for dev−old.
- Decision: No reliable Elo gain detected yet; if anything, the small tilt favours the old engine, but it’s not significant.
(If instead your naming is reversed—i.e., “wordfish” is the dev and “wordfish1” the old—then flip the sign: dev ≈ +4 Elo vs old, still not significant.)
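To make the arithmetic above explicit, here is a tiny check, assuming the printed ± can be read as a symmetric interval around the point estimate and that dev = wordfish1:

```python
# The reported +4.00 +/- 5.68 is for wordfish minus wordfish1; flip the sign for dev - old.
elo, err = 4.00, 5.68
dev_vs_old = -elo
lo, hi = dev_vs_old - err, dev_vs_old + err
print(f"dev - old ~ {dev_vs_old:+.1f} Elo, roughly [{lo:+.1f}, {hi:+.1f}]")   # [-9.7, +1.7]
print("interval includes zero -> not significant" if lo < 0.0 < hi else "interval excludes zero")
```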
Why this conclusion is sound
- Elo from score: 50.58% → ~+4 Elo via $400\log_{10}\frac{p}{1-p}$, matching the tool’s figure (worked through in the short snippet after this list).
- SPRT mechanics: the test compares H0 vs H1 by accumulating a log-likelihood ratio until it crosses one of the ± thresholds; your LLR of 0.39 is far from ±2.94, so neither “no gain” nor “≥2.5 Elo gain” is established (chessprogramming.org).
- Pentanomial statistics & “nElo”: modern engine tests use the pentanomial (game-pair) model and often report “normalised Elo” (nElo) to reduce the dependence on draw rate; your nElo of +10.7 ± 15.2 tells the same story: large uncertainty around a tiny edge (official-stockfish.github.io).
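Worked through, the score-to-Elo conversion in the first bullet looks like this (the inverse is included just to show the round trip):

```python
import math

def score_to_elo(p: float) -> float:
    """Logistic Elo difference implied by an average score p (0 < p < 1)."""
    return 400.0 * math.log10(p / (1.0 - p))

def elo_to_score(elo: float) -> float:
    """Inverse: expected score implied by a logistic Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

print(f"{score_to_elo(0.5058):+.2f} Elo")   # ~ +4.0, close to the printed +4.00
print(f"{100 * elo_to_score(4.0):.2f} %")   # ~ 50.58 %
```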
Practical tips to reach a decision sooner
- Use a more biased opening set (e.g., UHO_Lichess or Pohl) to cut the draw rate, which speeds up SPRT convergence (chessprogramming.org).
- Tighten the bounds to the gains you actually care about (e.g., [0, 2.0] or [0.5, 2.5]); wide bounds can terminate faster when the signal is clear, but they’re inappropriate if you want to detect small improvements reliably (chessprogramming.org). A rough expected-games estimate follows this list.
- Consider more games or shorter time controls if throughput is the bottleneck (mind the risk of evaluation noise).
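To put rough numbers on the bounds trade-off mentioned above, here is a back-of-envelope Wald estimate of the expected number of games when the true strength difference sits exactly on elo1. It assumes the [elo0, elo1] bounds are in normalised Elo (as modern Fishtest-style tests use) and the usual normal approximation for the per-game log-likelihood increment; treat it as an order-of-magnitude guide, not something your tool would report.

```python
import math

def expected_games(elo0: float, elo1: float, alpha: float = 0.05, beta: float = 0.05) -> float:
    """Rough Wald estimate of E[games] when the true normalised Elo equals elo1.

    The per-game expected LLR under H1 is approximately (t1 - t0)^2 / 2, where
    t = nElo * ln(10) / 800 is the per-game effect size (assumed convention).
    """
    upper = math.log((1 - beta) / alpha)    # ~ +2.944
    lower = math.log(beta / (1 - alpha))    # ~ -2.944
    t0 = elo0 * math.log(10) / 800
    t1 = elo1 * math.log(10) / 800
    per_game_llr = (t1 - t0) ** 2 / 2
    return ((1 - beta) * upper + beta * lower) / per_game_llr

# Narrowing the gap between elo0 and elo1 lengthens the test roughly quadratically.
for bounds in [(0.0, 2.5), (0.0, 2.0), (0.5, 2.5)]:
    print(bounds, f"~ {expected_games(*bounds):,.0f} games")
```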
One-line answer (what you asked for)
Wordfish 1.0 dev vs old: ≈ −4 Elo (or +4 Elo if your labels are swapped), not statistically significant; no proven Elo gain based on this 2000-game run.
References (for definitions/interpretation): Rustic’s SPRT overview and examples; Stockfish’s Fishtest maths (pentanomial, nElo, LLR/Bounds); CPW on SPRT.