2. fastchess test sprt wordfish base vs Wordfish dev

Summary

Finished game 2000 (wordfish1 vs wordfish): 1/2-1/2 {Draw by 3-fold repetition}
--------------------------------------------------
Results of wordfish vs wordfish1 (10+0.1, 1t, 32MB, 8moves_v3.pgn):
Elo: 4.00 +/- 5.68, nElo: 10.71 +/- 15.23
LOS: 91.59 %, DrawRatio: 74.80 %, PairsRatio: 1.21
Games: 2000, Wins: 182, Losses: 159, Draws: 1659, Points: 1011.5 (50.58 %)
Ptnml(0-2): [5, 109, 748, 134, 4], WL/DD Ratio: 0.06
LLR: 0.39 (13.3%) (-2.94, 2.94) [0.00, 2.50]

What the result says (direction matters)

Your summary line is:

Results of wordfish vs wordfish1 (10+0.1, 1t, 32MB, 8moves_v3.pgn)
Elo: +4.00 ± 5.68, nElo: +10.71 ± 15.23, LOS: 91.59%
Games: 2000 (182-159-1659), Score: 50.58%
Ptnml(0-2): [5, 109, 748, 134, 4]; LLR: 0.39 (13.3%) with bounds ±2.94 and targets [0.0, 2.5]

  • As printed, the first engine (“wordfish”) is ahead of the second (“wordfish1”) by ~+4 Elo.
  • If “wordfish1” is your Wordfish 1.0 dev and “wordfish” is the old/base version, then dev − base ≈ −4.0 Elo (i.e., dev is ~4 Elo weaker on this run).
  • The error bar (±5.68) is larger than the estimate, so the interval includes zero. In plain terms: no statistically significant difference; the data are compatible with anything from about −1.7 to +9.7 Elo for the first engine, i.e., from a small loss to a small gain.
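
As a rough cross-check, the Elo estimate and its error bar can be recomputed from the Ptnml line alone. The sketch below assumes a normal approximation on the per-pair scores and a 1.96 multiplier for a 95% interval. Both are assumptions about how the tool works, not something the output states, and the helper name `score_to_elo` is made up for illustration.

```python
import math

# Pair counts from the Ptnml(0-2) line: pairs scoring 0, 0.5, 1, 1.5, 2
# points for the first-named engine.
ptnml = [5, 109, 748, 134, 4]


def score_to_elo(p):
    """Logistic Elo difference implied by an expected score p (hypothetical helper)."""
    return 400 * math.log10(p / (1 - p))


pairs = sum(ptnml)
# Per-game score of each pair outcome: 0, 0.25, 0.5, 0.75, 1.
values = [i / 4 for i in range(5)]
mean = sum(n * v for n, v in zip(ptnml, values)) / pairs
var = sum(n * (v - mean) ** 2 for n, v in zip(ptnml, values)) / pairs

# ASSUMPTION: 95% interval on the score via a normal approximation,
# then mapped through the logistic Elo curve.
half_width = 1.96 * math.sqrt(var / pairs)
elo = score_to_elo(mean)
elo_lo = score_to_elo(mean - half_width)
elo_hi = score_to_elo(mean + half_width)
error = (elo_hi - elo_lo) / 2

print(f"Elo: {elo:.2f} +/- {error:.2f} (interval {elo_lo:.2f} to {elo_hi:.2f})")
```

Under these assumptions the numbers should land very close to the reported 4.00 +/- 5.68, with an interval that straddles zero.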

LOS and LLR, decoded

  • LOS 91.6% is the likelihood of superiority: roughly the probability that the first engine (wordfish) is the stronger one, given the observed results. That again points against the dev if dev = wordfish1.
  • LLR = 0.39 with boundaries −2.94 / +2.94 (the usual α = β = 5% SPRT bounds) means the sequential test is far from a decision: it has covered only ~13% of the distance to the upper boundary (0.39 / 2.94). In other words, the SPRT is inconclusive at this sample size. chessprogramming.org
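
The three quantities above can be approximated from the Ptnml line, as a sketch rather than a reimplementation of the tool. It assumes α = β = 0.05, a normal approximation for LOS, and that the [0.00, 2.50] targets are expressed in normalised Elo. That last point is a guess that happens to reproduce the printed LLR, not something the output confirms, and the LLR formula is a simple normal approximation that may differ from the tool's exact calculation.

```python
import math

ptnml = [5, 109, 748, 134, 4]   # pairs scoring 0, 0.5, 1, 1.5, 2 points
alpha = beta = 0.05             # ASSUMED error rates
elo0, elo1 = 0.0, 2.5           # hypotheses from the LLR line

# Wald's SPRT stopping bounds.
lower = math.log(beta / (1 - alpha))
upper = math.log((1 - beta) / alpha)

pairs = sum(ptnml)
values = [i / 4 for i in range(5)]   # per-game score of each pair outcome
mean = sum(n * v for n, v in zip(ptnml, values)) / pairs
var = sum(n * (v - mean) ** 2 for n, v in zip(ptnml, values)) / pairs

# LOS: normal-approximation probability that the true score exceeds 50%.
z = (mean - 0.5) / math.sqrt(var / pairs)
los = 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Approximate LLR, ASSUMING elo0/elo1 are normalised-Elo values.
scale = 800 / math.log(10)
t_hat = (mean - 0.5) / math.sqrt(var)   # observed effect in sigma units
t0 = elo0 * math.sqrt(2) / scale
t1 = elo1 * math.sqrt(2) / scale
llr = pairs * (t_hat * (t1 - t0) - (t1 ** 2 - t0 ** 2) / 2)

print(f"bounds: ({lower:.2f}, {upper:.2f})")
print(f"LOS: {100 * los:.2f} %")
print(f"LLR: {llr:.2f} ({100 * llr / upper:.1f}% of the upper bound)")
```

If the assumptions hold, the bounds come out at about ±2.94, LOS near 91.6% and LLR near 0.39, in line with the printed summary.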

Pentanomial line sanity-check

Your Ptnml(0-2) = [5, 109, 748, 134, 4] sums to 1000 paired openings, which matches 2000 games. The bins are pair scores for the first-named engine, listed from 0 to 2 points: 0-2, 0.5-1.5, 1-1, 1.5-0.5, 2-0, which is the standard pentanomial pairing model. Weighting the bins by those scores gives 1011.5 points, matching the Points figure. The large middle bin (748 pairs scoring 1-1) explains the slow information gain. Chess Stack Exchange; official-stockfish.github.io
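
The same sanity check takes a few lines of arithmetic, using only numbers from the summary:

```python
# Pair counts from the Ptnml(0-2) line and the points each bin is worth
# to the first-named engine.
ptnml = [5, 109, 748, 134, 4]
pair_points = [0.0, 0.5, 1.0, 1.5, 2.0]

pairs = sum(ptnml)
games = 2 * pairs
points = sum(n * p for n, p in zip(ptnml, pair_points))

print(f"pairs={pairs}, games={games}, points={points}")
```

This should report 1000 pairs, 2000 games and 1011.5 points, the same totals as the Games and Points fields.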

Bottom line (dev vs old)

  • Point estimate: −4 Elo for Wordfish 1.0 dev relative to the old version (assuming dev = wordfish1).
  • Uncertainty (as reported): about ±5.7 Elo, so the plausible range is roughly [−9.7, +1.7] Elo for dev−old.
  • Decision: No reliable Elo gain detected yet; if anything, the small tilt favours the old engine, but it’s not significant.

(If instead your naming is reversed—i.e., “wordfish” is the dev and “wordfish1” the old—then flip the sign: dev ≈ +4 Elo vs old, still not significant.)

Why this conclusion is sound

  • Elo from score: 50.58% → ~+4 Elo (via $400\log_{10}\frac{p}{1-p}$), matching the tool’s figure.
  • SPRT mechanics: tests compare H0 vs H1 using the log-likelihood ratio until crossing ± bounds; your LLR 0.39 is far from ±2.94, so neither “no gain” nor “≥2.5 Elo gain” is established. chessprogramming.org
  • Pentanomial statistics & “nElo”: fast tests for engines use the pentanomial model and often report “normalised Elo” (nElo) to reduce dependency on draw rate; your nElo +10.7 ± 15.2 tells the same story: big uncertainty around a tiny edge. official-stockfish.github.io
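
For the nElo figure, a minimal sketch follows. It assumes the pentanomial definition of normalised Elo (score excess divided by the per-game standard deviation, scaled by 800 / ln 10) and a 1.96 multiplier for the error bar. Treat the exact formula as an assumption about the tool, not a confirmed implementation detail.

```python
import math

ptnml = [5, 109, 748, 134, 4]        # pairs scoring 0, 0.5, 1, 1.5, 2 points
pairs = sum(ptnml)
values = [i / 4 for i in range(5)]   # per-game score of each pair outcome
mean = sum(n * v for n, v in zip(ptnml, values)) / pairs
var = sum(n * (v - mean) ** 2 for n, v in zip(ptnml, values)) / pairs

scale = 800 / math.log(10)
# ASSUMPTION: per-game sigma is sqrt(2 * pair variance) in the pentanomial model.
per_game_sigma = math.sqrt(2 * var)
nelo = (mean - 0.5) / per_game_sigma * scale
nelo_error = 1.96 * math.sqrt(var / pairs) / per_game_sigma * scale

print(f"nElo: {nelo:.2f} +/- {nelo_error:.2f}")
```

With these assumptions the result should sit close to the reported 10.71 +/- 15.23. Note that the error bar depends only on the number of pairs, which is why nElo is convenient for comparing runs with different draw rates.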

Practical tips to reach a decision sooner

  • Use a more biased opening set (e.g., UHO_Lichess or Pohl) to cut draw rate, which speeds up SPRT convergence. chessprogramming.org
  • Pick bounds that match the gain you actually care about. Narrow bounds (e.g., [0, 2.0] or [0.5, 2.5]) detect small improvements more reliably but need more games; wider bounds terminate faster when the signal is clear, but they’re inappropriate if you want to detect small improvements. chessprogramming.org
  • Consider more games or shorter time controls if throughput is the bottleneck (mind the risk of evaluation noise).

One-line answer (what you asked for)

Wordfish 1.0 dev vs old: ≈ −4 Elo (or +4 Elo if your labels are swapped), not statistically significant; no proven Elo gain based on this 2000-game run.

References (for definitions/interpretation): Rustic’s SPRT overview and examples; Stockfish’s Fishtest maths (pentanomial, nElo, LLR/Bounds); CPW on SPRT.
