SPRT TEST REVOLUTION
revolution_090925_v2.20_avx vs Revolution
SPRT completed — H₀ accepted (no improvement).
Time control: 60+0.1; Openings: UHO_Lichess_4852_v1.epd
Elo ± err: -32.01 ± 15.19
nElo ± err: -60.71 ± 28.62
LOS: 0.00%
Games: 566
Score: 123-175 (268 draws)
Draw ratio: 46.64%
Pairs ratio: 0.51
LLR (A, B): -2.99 (A = -2.94, B = 2.94)
Timeouts / Crashes: 1 / 0
Notes: Wald bounds reflect α and β shown by the tester; LLR point is the final value at n = games.
revolution_090925_v2.20_avx vs Revolution
SPRT completed — H₀ accepted (no improvement).
Time control: 10+0.1; Openings: UHO_Lichess_4852_v1.epd
Elo ± err: -59.50 ± 20.56
nElo ± err: -108.79 ± 36.82
LOS: 0.00%
Games: 342
Score: 56-114 (172 draws)
Draw ratio: 40.35%
Pairs ratio: 0.32
LLR (A, B): -2.98 (A = -2.94, B = 2.94)
Timeouts / Crashes: 1 / 0
Notes: Wald bounds reflect α and β shown by the tester; LLR point is the final value at n = games.
Here’s a clear read of both SPRT runs and what they mean for the dev engine.
Executive summary
- 60+0.1 (UHO_Lichess_4852_v1.epd)
LLR = −2.99 < A = −2.94 ⇒ SPRT completed: H₀ accepted.
Estimated strength: −32.0 ± 15.2 Elo over 566 games; LOS ≈ 0%.
→ The dev build is statistically weaker than the baseline at long TC.
- 10+0.1 (UHO_Lichess_4852_v1.epd)
LLR = −2.98 < A = −2.94 ⇒ SPRT completed: H₀ accepted.
Estimated strength: −59.5 ± 20.6 Elo over 342 games; LOS ≈ 0%.
→ The dev build is even weaker at fast TC.
(H₀ = “no improvement or worse than baseline”. Crossing the lower Wald bound A means we accept H₀ and stop.)
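The thresholds quoted above follow from Wald's standard formulas, A = ln(β / (1 − α)) and B = ln((1 − β) / α). The sketch below is only an illustration, assuming α = β = 0.05 as the tables suggest; the function names `wald_bounds` and `sprt_decision` are hypothetical and not part of any tester.

```python
import math


def wald_bounds(alpha: float, beta: float) -> tuple[float, float]:
    """Return (A, B), the lower and upper LLR stopping thresholds."""
    lower = math.log(beta / (1.0 - alpha))   # crossing it accepts H0
    upper = math.log((1.0 - beta) / alpha)   # crossing it accepts H1
    return lower, upper


def sprt_decision(llr: float, alpha: float = 0.05, beta: float = 0.05) -> str:
    """Apply the stopping rule to a given log-likelihood ratio."""
    lower, upper = wald_bounds(alpha, beta)
    if llr <= lower:
        return "accept H0"
    if llr >= upper:
        return "accept H1"
    return "continue"


if __name__ == "__main__":
    a, b = wald_bounds(0.05, 0.05)
    print(f"A = {a:.2f}, B = {b:.2f}")  # A = -2.94, B = 2.94
    # Final LLR values reported for the 60+0.1 and 10+0.1 runs.
    for llr in (-2.99, -2.98):
        print(llr, sprt_decision(llr))
```

Both reported LLR values fall just below A, which is why each run stopped with H₀ accepted.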
What the statistics are saying
- Both tests independently reject the improvement hypothesis, and both Elo estimates are clearly negative, so the evidence points to a regression. In each case, the LLR dropped below the lower Wald threshold (A ≈ −2.94 for α=β=0.05), which is the formal stopping rule for “no improvement.”
- Effect size:
- Long TC shows a moderate loss (~−32 Elo).
- Fast TC shows a large loss (~−60 Elo).
This pattern (worse at faster time controls) often points to time-management sensitivity, evaluation instability at shallow search, or a branching factor increase that hurts when nodes are scarce.
- Uncertainty: The “±” figures are the test’s reported error bars. Both intervals lie entirely below 0 Elo (about −47.2 to −16.8 at 60+0.1 and −80.1 to −38.9 at 10+0.1), consistent with LOS ≈ 0% (the chance that the dev build is actually stronger is essentially nil).
- Draw rates: As expected, the shorter TC has a lower reported draw ratio (40.35% vs. 46.64%), which tends to amplify measured Elo gaps and speed up LLR movement. That aligns with the larger regression at 10+0.1.
- Quality checks: The logs indicate no crash bias and only minimal timeouts (both runs noted 1 timeout, 0 crashes), and you already enforced NOEXP (experience disabled and cleaned), so the comparison is fair and reproducible.
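The Elo point estimates can be cross-checked from the raw scores with the usual logistic model, Elo = −400 · log10(1/s − 1), where s is the score fraction. The sketch below assumes the "Score" line lists wins-losses from the dev build's point of view; `elo_from_wld` is a hypothetical helper name. It does not recompute the error bars, which are taken as reported by the tester.

```python
import math


def elo_from_wld(wins: int, losses: int, draws: int) -> float:
    """Logistic Elo difference implied by a win/loss/draw record."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games  # score fraction in [0, 1]
    return -400.0 * math.log10(1.0 / score - 1.0)


# (wins, losses, draws, reported error bar) copied from the two result tables.
RUNS = {
    "60+0.1": (123, 175, 268, 15.19),
    "10+0.1": (56, 114, 172, 20.56),
}

if __name__ == "__main__":
    for tc, (w, l, d, err) in RUNS.items():
        elo = elo_from_wld(w, l, d)
        # Interval built from the tester's own error bar, not recomputed here.
        print(f"{tc}: {elo:.2f} Elo, interval [{elo - err:.2f}, {elo + err:.2f}]")
```

Under that assumption the results agree with the reported −32.01 and −59.50 Elo to within rounding, and both intervals stay below zero.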
Practical interpretation
- The dev build regresses across both time controls, with stronger negative impact at fast TC. If your release criteria require non-negative Elo at α=β=0.05, this fails at both TCs and should be treated as a blocker for promotion.
- If you care more about longer controls (e.g., engine vs. engine tournaments at 60+0.1), the −32 Elo is still well beyond noise and unacceptable for a merge.
Likely causes to investigate
- Time-management / overhead: At 10+0.1, even small mistakes on Move Overhead, nodestime, or latency can cost points. Consider testing Move Overhead a bit higher to rule out flag-losses disguised as strength loss (you currently use the engine’s default).
- Evaluation or pruning tuned for deeper search: Changes that help only at depth N+2 can hurt at shallow depths; that often shows up as worse fast-TC Elo.
- Search instability / node inflation: If the dev change increases branching or harms pruning order, fast time controls suffer first.
- Book interaction: You used UHO_Lichess_4852_v1.epd; if a regression is opening-sensitive, it will look worse with more decisive suites (lower draw rate). A quick cross-check with the older UHO PGN set can help isolate this.
Recommended next steps
- Stop the current dev line (both TCs failed the SPRT at α=β=0.05).
- Triage with faster gates: run a quick screen at α=β=0.10, H₁=+10 Elo to iterate on fixes rapidly; promote to α=β=0.05 only once the sign is positive.
- Targeted A/Bs:
- Re-test at 15+0.1 and 30+0.1 to map the TC sensitivity curve.
- Try a slightly larger Move Overhead on 10+0.1 to rule out time losses.
- If the patch touches eval or pruning, run component ablations (toggle the new term/pruning and re-SPRT).
- Sanity duplicates: Run a duplicate of the 10+0.1 test with a different randomized opening seed to confirm the magnitude (should still cross A quickly if real).
- Keep NOEXP policy on: You’ve already neutralized experience files; retain that in all confirmatory tests.
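To get a feel for how the quick α=β=0.10 screen suggested above would behave, the sketch below evaluates a common normal approximation of the SPRT log-likelihood ratio on a win/loss/draw record. Everything here is an assumption for illustration: H₀ = 0 Elo is not stated in the source (only H₁ = +10 is), the trinomial model ignores game pairs, and the function names are hypothetical. It will not reproduce the tester's LLR, whose hypotheses and statistical model are not given.

```python
import math


def expected_score(elo: float) -> float:
    """Logistic expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def llr_normal_approx(wins: int, losses: int, draws: int,
                      elo0: float, elo1: float) -> float:
    """Approximate LLR of H1 (elo1) against H0 (elo0) from game results."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score under the observed W/D/L frequencies.
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * mean ** 2) / n
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def wald_bounds(alpha: float, beta: float) -> tuple[float, float]:
    """Lower and upper LLR stopping thresholds."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)


if __name__ == "__main__":
    lower, upper = wald_bounds(0.10, 0.10)
    # Record of the 10+0.1 run, hypotheses assumed to be 0 vs +10 Elo.
    llr = llr_normal_approx(56, 114, 172, elo0=0.0, elo1=10.0)
    print(f"bounds = ({lower:.2f}, {upper:.2f}), approx LLR = {llr:.2f}")
```

With the looser bounds (about ±2.20), a record like the 10+0.1 run would already sit below the lower threshold under these assumptions, so such a screen should reject a similarly weak patch with fewer games.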

Jorge Ruiz
Philologist and lover of African social anthropology