
STC LTC SPRT Test revolution

revolution_090925_v2.20_avx vs Revolution
SPRT completed — H₀ accepted (no improvement).
Time control: 60+0.1; Openings: UHO_Lichess_4852_v1.epd
Elo ± err: −32.01 ± 15.19
nElo ± err: −60.71 ± 28.62
LOS: 0.00%
Games: 566
Score: 123-175 (268 draws)
Draw ratio: 46.64%
Pairs ratio: 0.51
LLR (A, B): −2.99 (A = −2.94, B = 2.94)
Timeouts / Crashes: 1 / 0
[Charts: LLR vs bounds; Elo ± error]
Notes: Wald bounds reflect α and β shown by the tester; LLR point is the final value at n = games.
revolution_090925_v2.20_avx vs Revolution
SPRT completed — H₀ accepted (no improvement).
Time control: 10+0.1; Openings: UHO_Lichess_4852_v1.epd
Elo ± err: −59.50 ± 20.56
nElo ± err: −108.79 ± 36.82
LOS: 0.00%
Games: 342
Score: 56-114 (172 draws)
Draw ratio: 40.35%
Pairs ratio: 0.32
LLR (A, B): −2.98 (A = −2.94, B = 2.94)
Timeouts / Crashes: 1 / 0
[Charts: LLR vs bounds; Elo ± error]
Notes: Wald bounds reflect α and β shown by the tester; LLR point is the final value at n = games.
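
As a quick sanity check on the two tables, the reported Elo figures can be reproduced from the win-loss-draw records alone. The sketch below assumes the usual logistic Elo model and that the Score line reads wins-losses from the dev build's point of view; the function name is illustrative and not part of any tester. It does not reproduce the error bars, which depend on the tester's own variance model.

```python
import math

def elo_from_score(wins: int, losses: int, draws: int) -> float:
    """Logistic Elo difference implied by a W-L-D record (hypothetical helper)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games  # fraction of available points scored
    return -400.0 * math.log10(1.0 / score - 1.0)

# W-L-D records copied from the two result tables.
for label, record in (("60+0.1", (123, 175, 268)), ("10+0.1", (56, 114, 172))):
    print(f"{label}: {elo_from_score(*record):+.2f} Elo")
```

Under those assumptions the values come out at about −32.0 and −59.5 Elo, matching the tables.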

Here’s a clear read of both SPRT runs and what they mean for the dev engine.

Executive summary

  • 60+0.1 (UHO_Lichess_4852_v1.epd)
    LLR = −2.99 < A = −2.94 ⇒ SPRT completed: H₀ accepted.
    Estimated strength: −32.0 ± 15.2 Elo over 566 games; LOS ≈ 0%.
    → The dev build is statistically weaker than the baseline at long TC.
  • 10+0.1 (UHO_Lichess_4852_v1.epd)
    LLR = −2.98 < A = −2.94 ⇒ SPRT completed: H₀ accepted.
    Estimated strength: −59.5 ± 20.6 Elo over 342 games; LOS ≈ 0%.
    → The dev build is even weaker at fast TC.

(H₀ = “no improvement or worse than baseline”. Crossing the lower Wald bound A means we accept H₀ and stop.)
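
The stopping rule can be written down in a few lines. This is a minimal sketch of Wald's standard bounds, assuming α = β = 0.05 as implied by the ±2.94 limits in the tables; the function names are illustrative only.

```python
import math

def wald_bounds(alpha: float, beta: float) -> tuple[float, float]:
    """Lower (A) and upper (B) LLR thresholds for a Wald SPRT."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper

def sprt_decision(llr: float, alpha: float = 0.05, beta: float = 0.05) -> str:
    """Return the SPRT verdict for a given log-likelihood ratio."""
    lower, upper = wald_bounds(alpha, beta)
    if llr <= lower:
        return "accept H0"
    if llr >= upper:
        return "accept H1"
    return "continue"

lower, upper = wald_bounds(0.05, 0.05)
print(f"A = {lower:.2f}, B = {upper:.2f}")
print("60+0.1:", sprt_decision(-2.99))
print("10+0.1:", sprt_decision(-2.98))
```

Both final LLR values fall below A, so each run ends with H₀ accepted.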


What the statistics are saying

  • Both tests independently accept H₀. In each case, the Wald LLR dropped below the lower threshold (A ≈ −2.94 for α=β=0.05), which is the formal stopping rule for “no improvement.” The Elo estimates then show that the result is a regression, not merely a lack of improvement.
  • Effect size:
    • Long TC shows a moderate loss (about −32 Elo).
    • Fast TC shows a large loss (about −60 Elo).
      This pattern (worse at faster time controls) often points to time-management sensitivity, evaluation instability at shallow search, or a branching factor increase that hurts when nodes are scarce.
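  • Uncertainty: The “±” figures are the tester’s reported error bars. Applying them to the point estimates gives −47.2 to −16.8 Elo at 60+0.1 and −80.1 to −38.9 Elo at 10+0.1, so both intervals lie entirely below 0 Elo. That is consistent with the reported LOS of 0.00% (the chance that the dev build is actually stronger is essentially nil).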
  • Draw rates: As expected, the shorter TC has a lower draw rate, which tends to amplify measured Elo gaps and speeds up LLR movement. That aligns with the larger magnitude regression at 10+0.1.
  • Quality checks: The logs indicate no crash bias and only minimal timeouts (both runs noted 1 timeout, 0 crashes), and you already enforced NOEXP (experience disabled and cleaned), so the comparison is fair and reproducible.
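
The uncertainty point can be checked numerically. The sketch below takes the reported error bars at face value to form the intervals, then estimates LOS with the common wins-versus-losses normal approximation. That approximation is an assumption here: the tester may derive its LOS from game pairs, so the numbers need not match the reported 0.00% exactly.

```python
import math

def elo_interval(elo: float, err: float) -> tuple[float, float]:
    """Interval implied by a reported 'Elo ± err' figure."""
    return elo - err, elo + err

def los_from_wl(wins: int, losses: int) -> float:
    """Likelihood of superiority from decisive games only (normal approximation)."""
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# (Elo, err, wins, losses) copied from the two result tables.
runs = {"60+0.1": (-32.01, 15.19, 123, 175), "10+0.1": (-59.50, 20.56, 56, 114)}
for label, (elo, err, wins, losses) in runs.items():
    low, high = elo_interval(elo, err)
    print(f"{label}: [{low:.1f}, {high:.1f}] Elo, LOS ~ {los_from_wl(wins, losses):.4%}")
```

In both runs the upper end of the interval stays below zero and the approximate LOS is well under 1%.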

Practical interpretation

  • The dev build regresses across both time controls, with stronger negative impact at fast TC. If your release criteria require non-negative Elo at α=β=0.05, this fails at both TCs and should be treated as a blocker for promotion.
  • If you care more about longer controls (e.g., engine vs. engine tournaments at 60+0.1), the −32 Elo is still well beyond noise and unacceptable for a merge.

Likely causes to investigate

  1. Time-management / overhead: At 10+0.1, even small mistakes on Move Overhead, nodestime, or latency can cost points. Consider testing Move Overhead a bit higher to rule out flag-losses disguised as strength loss (you currently use the engine’s default).
  2. Evaluation or pruning tuned for deeper search: Changes that help only at depth N+2 can hurt at shallow depths; that often shows up as worse fast-TC Elo.
  3. Search instability / node inflation: If the dev change increases branching or harms pruning order, fast time controls suffer first.
  4. Book interaction: You used UHO_Lichess_4852_v1.epd; if a regression is opening-sensitive, it will look worse with more decisive suites (lower draw rate). A quick cross-check with the older UHO PGN set can help isolate this.

Recommended next steps

  • Stop the current dev line (both TCs failed the SPRT at α=β=0.05).
  • Triage with faster gates: run a quick screen at α=β=0.10, H₁=+10 Elo to iterate on fixes rapidly; promote to α=β=0.05 only once the sign is positive.
  • Targeted A/Bs:
    • Re-test at 15+0.1 and 30+0.1 to map the TC sensitivity curve.
    • Try a slightly larger Move Overhead on 10+0.1 to rule out time losses.
    • If the patch touches eval or pruning, run component ablations (toggle the new term/pruning and re-SPRT).
  • Sanity duplicates: Run a duplicate of the 10+0.1 test with a different randomized opening seed to confirm the magnitude (should still cross A quickly if real).
  • Keep NOEXP policy on: You’ve already neutralized experience files; retain that in all confirmatory tests.
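
To get a feel for the faster gate, the existing 10+0.1 record can be replayed through a rough LLR estimate with α=β=0.10. This is only a sketch under stated assumptions: H₀ = 0 Elo is assumed (the text only gives H₁ = +10), the LLR uses a simple normal approximation on win/draw/loss counts, and all function names are hypothetical. A real tester may use a pair-based or normalized-Elo model, so its LLR will differ.

```python
import math

def expected_score(elo: float) -> float:
    """Expected score under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def approx_llr(wins: int, losses: int, draws: int, elo0: float, elo1: float) -> float:
    """Normal-approximation LLR for H1 (elo1) against H0 (elo0) from W-L-D counts."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    variance = (wins + 0.25 * draws) / games - score ** 2  # per-game score variance
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return games * (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * variance)

def screen(wins: int, losses: int, draws: int,
           elo0: float = 0.0, elo1: float = 10.0,
           alpha: float = 0.10, beta: float = 0.10) -> str:
    """Verdict of the quick screening gate for a given record."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = approx_llr(wins, losses, draws, elo0, elo1)
    if llr <= lower:
        return "fail"
    if llr >= upper:
        return "pass"
    return "keep playing"

# 10+0.1 record from the result table: 56 wins, 114 losses, 172 draws.
print(screen(56, 114, 172))
```

The looser bounds (about ±2.20 instead of ±2.94) are what make the screen faster, at the cost of a higher error rate, which is why a pass still needs confirmation at α=β=0.05.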

Jorge Ruiz

Philologist and enthusiast of African social anthropology
