STC LTC SPRT Test revolution

SPRT TEST REVOLUTION

revolution_090925_v2.20_avx vs Revolution

SPRT completed — H₀ accepted (no improvement).

Time control: 60+0.1; Openings: UHO_Lichess_4852_v1.epd

Elo ± err: -32.01 ± 15.19
nElo ± err: -60.71 ± 28.62
LOS: 0.00%
Games: 566
Score: 123-175 (268 draws)
Draw ratio: 46.64%
Pairs ratio: 0.51
LLR (A, B): -2.99 (A = -2.94, B = 2.94)
Timeouts / Crashes: 1 / 0

Notes: Wald bounds reflect α and β shown by the tester; LLR point is the final value at n = games.

revolution_090925_v2.20_avx vs Revolution

SPRT completed — H₀ accepted (no improvement).

Time control: 10+0.1; Openings: UHO_Lichess_4852_v1.epd

Elo ± err: -59.50 ± 20.56
nElo ± err: -108.79 ± 36.82
LOS: 0.00%
Games: 342
Score: 56-114 (172 draws)
Draw ratio: 40.35%
Pairs ratio: 0.32
LLR (A, B): -2.98 (A = -2.94, B = 2.94)
Timeouts / Crashes: 1 / 0

Notes: Wald bounds reflect α and β shown by the tester; LLR point is the final value at n = games.

Here’s a clear read of both SPRT runs and what they mean for the dev engine.

Executive summary

60+0.1 (UHO_Lichess_4852_v1.epd):
- LLR = −2.99 < A = −2.94 ⇒ SPRT completed: H₀ accepted.
- Estimated strength: −32.0 ± 15.2 Elo over 566 games; LOS ≈ 0%.
- → The dev build is statistically weaker than the baseline at long TC.

10+0.1 (UHO_Lichess_4852_v1.epd):
- LLR = −2.98 < A = −2.94 ⇒ SPRT completed: H₀ accepted.
- Estimated strength: −59.5 ± 20.6 Elo over 342 games; LOS ≈ 0%.
- → The dev build is even weaker at fast TC.

(H₀ = “no improvement or worse than baseline”. Crossing the lower Wald bound A means we accept H₀ and stop.)
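The stopping rule can be checked by hand. The following is a minimal sketch, assuming the standard Wald formulas for the SPRT bounds; the function names `wald_bounds` and `sprt_decision` are illustrative and do not come from the tester. It reproduces the A = −2.94 and B = 2.94 thresholds for α = β = 0.05 and applies them to the two final LLR values reported above.

```python
import math


def wald_bounds(alpha: float, beta: float) -> tuple[float, float]:
    """Return the lower (A) and upper (B) Wald thresholds for the LLR."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def sprt_decision(llr: float, alpha: float = 0.05, beta: float = 0.05) -> str:
    """Classify an LLR value against the Wald thresholds."""
    lower, upper = wald_bounds(alpha, beta)
    if llr <= lower:
        return "accept H0"
    if llr >= upper:
        return "accept H1"
    return "continue"


if __name__ == "__main__":
    lower, upper = wald_bounds(0.05, 0.05)
    print(f"A = {lower:.2f}, B = {upper:.2f}")
    # Final LLR values taken from the two runs in this report.
    for tc, llr in (("60+0.1", -2.99), ("10+0.1", -2.98)):
        print(tc, sprt_decision(llr))
```

With looser error rates the thresholds move closer to zero: `wald_bounds(0.10, 0.10)` gives roughly ±2.20, so a test at α = β = 0.10 typically stops after fewer games, at the cost of more wrong decisions.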

What the statistics are saying

Both tests independently conclude a regression. In each case, the Wald LLR dropped below the lower threshold (A ≈ −2.94 for α=β=0.05), which is the formal stopping rule for “no improvement.”
Effect size:
- Long TC shows a moderate loss (~−32 Elo).
- Fast TC shows a large loss (~−60 Elo).
  This pattern (worse at faster time controls) often points to time-management sensitivity, evaluation instability at shallow search, or a branching factor increase that hurts when nodes are scarce.
Uncertainty: The “±” figures are the test’s reported error bars. Both intervals lie entirely below 0 Elo (roughly −47 to −17 at 60+0.1 and −80 to −39 at 10+0.1), consistent with LOS ≈ 0% (the chance that the dev build is actually stronger is essentially nil).
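Draw rates: The shorter TC has the lower reported draw ratio (40.35% vs. 46.64%), which tends to amplify measured Elo gaps and speed up LLR movement. That is consistent with the larger regression at 10+0.1. Note that the raw share of drawn games is about 50% at 10+0.1 (172 of 342) and about 47% at 60+0.1 (268 of 566), so the reported ratio is evidently not a simple per-game count.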
Quality checks: The logs indicate no crash bias and only minimal timeouts (both runs noted 1 timeout, 0 crashes), and you already enforced NOEXP (experience disabled and cleaned), so the comparison is fair and reproducible.
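The Elo figures can be cross-checked directly from the win/loss/draw counts. This is a small sketch assuming the usual logistic Elo model; the helper name `elo_from_wld` is hypothetical, and the error bars are copied from the reports, not recomputed. Under that assumption the results match the reported −32.01 and −59.50 to within rounding.

```python
import math


def elo_from_wld(wins: int, losses: int, draws: int) -> float:
    """Convert a win/loss/draw record into a logistic Elo difference."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games  # fraction of points scored
    return -400.0 * math.log10(1.0 / score - 1.0)


# Records and error bars as reported for the two runs (dev build's view).
runs = {
    "60+0.1": {"wld": (123, 175, 268), "err": 15.19},
    "10+0.1": {"wld": (56, 114, 172), "err": 20.56},
}

if __name__ == "__main__":
    for tc, run in runs.items():
        elo = elo_from_wld(*run["wld"])
        low, high = elo - run["err"], elo + run["err"]
        verdict = "entirely below 0" if high < 0 else "includes 0"
        print(f"{tc}: Elo {elo:+.2f}, interval [{low:+.2f}, {high:+.2f}] ({verdict})")
```

The sketch only verifies the point estimates and where the reported intervals sit relative to zero. It does not reproduce nElo or the LLR, which depend on the tester's own model.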

Practical interpretation

The dev build regresses across both time controls, with stronger negative impact at fast TC. If your release criteria require non-negative Elo at α=β=0.05, this fails at both TCs and should be treated as a blocker for promotion.
If you care more about longer controls (e.g., engine vs. engine tournaments at 60+0.1), the −32 Elo is still well beyond noise and unacceptable for a merge.

Likely causes to investigate

Time-management / overhead: At 10+0.1, even small mistakes on Move Overhead, nodestime, or latency can cost points. Consider testing Move Overhead a bit higher to rule out flag-losses disguised as strength loss (you currently use the engine’s default).
Evaluation or pruning tuned for deeper search: Changes that help only at depth N+2 can hurt at shallow depths; that often shows up as worse fast-TC Elo.
Search instability / node inflation: If the dev change increases branching or harms pruning order, fast time controls suffer first.
Book interaction: You used UHO_Lichess_4852_v1.epd; if a regression is opening-sensitive, it will look worse with more decisive suites (lower draw rate). A quick cross-check with the older UHO PGN set can help isolate this.

Recommended next steps

Stop the current dev line (both TCs failed the SPRT at α=β=0.05).
Triage with faster gates: run a quick screen at α=β=0.10, H₁=+10 Elo to iterate on fixes rapidly; promote to α=β=0.05 only once the sign is positive.
Targeted A/Bs:
- Re-test at 15+0.1 and 30+0.1 to map the TC sensitivity curve.
- Try a slightly larger Move Overhead on 10+0.1 to rule out time losses.
- If the patch touches eval or pruning, run component ablations (toggle the new term/pruning and re-SPRT).
Sanity duplicates: Run a duplicate of the 10+0.1 test with a different randomized opening seed to confirm the magnitude (should still cross A quickly if real).
Keep NOEXP policy on: You’ve already neutralized experience files; retain that in all confirmatory tests.

Jorge Ruiz

Philologist and enthusiast of African social anthropology
