SPRT TEST Revolution dev versus Baseline

Snapshot of the current match

Score (10+0.1, 1T, 32MB, UHO_2024_8mvs_+085_+094): 18–15–27 (W–L–D, 60 games), 52.5% for DEV.
Elo: +17.4 ± 49.5 (not statistically significant).
LOS: 75.6% (suggestive, but not conclusive).
Draw rate: 46.7% (healthier, closer to parity conditions than earlier).
LLR: 0.04 — effectively “no verdict” yet; far from any SPRT boundary.
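As a quick sanity check on the snapshot, the score percentage and the Elo point estimate can be recomputed from the raw result. The sketch below assumes the 18–15–27 triple is ordered wins–losses–draws from DEV's side (the ordering that reproduces 52.5%) and uses the standard logistic Elo formula. It does not try to reproduce the error margin or LOS, since those depend on the statistical model the match tool uses. The function names are illustrative only.

```python
import math

def score_fraction(wins, losses, draws):
    """Fraction of points scored: a win counts 1, a draw 0.5."""
    games = wins + losses + draws
    return (wins + 0.5 * draws) / games

def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return 400.0 * math.log10(score / (1.0 - score))

# Assumed ordering: wins, losses, draws for DEV.
wins, losses, draws = 18, 15, 27
score = score_fraction(wins, losses, draws)
elo = elo_from_score(score)
print(f"score = {score:.1%}, Elo = {elo:+.1f}")  # score = 52.5%, Elo = +17.4
```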

How this compares to earlier runs (regression from the beginning)

Early tests: DEV was badly negative vs the base (often −90 to −150 Elo) and showed a catastrophic collapse as Black (e.g., near-zero win rate with Black), heavily distorted by White-biased books and some option mismatches.
Mid-stage fixes: After aligning time management (defaults, no MinThink/SlowMover hacks), enforcing color-pairing per line, and cleaning up UCI mismatches, results moved toward rough parity but remained volatile; many runs were short and white-skewed.
Now: DEV is slightly ahead (+17 Elo), with wide error bars. The draw rate rose vs earlier (where it hovered ~30–40%), which usually indicates better comparability and fewer “free points” from adjudication/over-pruning. The Black collapse signal is no longer obvious in this small sample, but with only 60 games you can’t call it fixed.

What to take from this

Direction of travel: from clearly worse → about even / marginally better.
Confidence: still low due to the small sample (n=60) and ±50 Elo uncertainty.
SPRT status: LLR ~0 means keep playing; you’re nowhere near accept/reject thresholds.
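To see how far an LLR of 0.04 is from a decision, the accept/reject thresholds can be computed with Wald's standard SPRT bounds. The report does not state the alpha and beta used for this run, so the 0.05/0.05 values below are an assumption chosen only for illustration, and the function names are hypothetical.

```python
import math

def sprt_bounds(alpha, beta):
    """Wald's SPRT decision bounds on the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))   # at or below: accept H0
    upper = math.log((1.0 - beta) / alpha)   # at or above: accept H1
    return lower, upper

def sprt_status(llr, alpha=0.05, beta=0.05):
    """Classify an LLR value; alpha and beta are assumed, not from the run."""
    lower, upper = sprt_bounds(alpha, beta)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"

lower, upper = sprt_bounds(0.05, 0.05)
print(f"bounds: [{lower:.2f}, {upper:.2f}]")  # bounds: [-2.94, 2.94]
print(sprt_status(0.04))                      # continue
```

Under these assumed error rates the bounds sit near ±2.94, so an LLR of 0.04 is essentially at the starting point of the test.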

Recommended next steps (quick, practical)

Grow the sample to at least 400–800 games at the same settings before judging (the LLR will move, and the confidence interval should shrink to roughly ±20–25 Elo).
Color sanity: continue to track per-color scores; if possible, also run an original + mirrored suite pair and combine, to fully remove residual book bias.
Hold conditions fixed: 1 thread, 32 MB, same book, no extra time options, Ponder off, MultiPV=1.
Watch these indicators:
- Draw rate (should stabilize),
- White vs Black split (if Black dips again, we revisit SEE/TT/NMP gates),
- Time losses and illegal option warnings in logs.
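The draw-rate and White-vs-Black indicators above can be tracked with a small tally over the game results. This is a minimal sketch: the `(dev_color, result)` record format and the `color_split` helper are hypothetical, the sample data is made up for illustration and is not from the match, and in practice the records would be extracted from the PGN output of whatever match runner is in use.

```python
POINTS = {"win": 1.0, "draw": 0.5, "loss": 0.0}

def color_split(games):
    """Summarize DEV's score per color and the overall draw rate.

    games: iterable of (dev_color, result) pairs, where dev_color is
    "white" or "black" and result is "win", "draw" or "loss" from
    DEV's point of view.
    """
    points = {"white": 0.0, "black": 0.0}
    counts = {"white": 0, "black": 0}
    draws = 0
    for dev_color, result in games:
        points[dev_color] += POINTS[result]
        counts[dev_color] += 1
        if result == "draw":
            draws += 1
    total = counts["white"] + counts["black"]
    return {
        "white_score": points["white"] / counts["white"] if counts["white"] else None,
        "black_score": points["black"] / counts["black"] if counts["black"] else None,
        "draw_rate": draws / total if total else None,
    }

# Made-up sample, for illustration only.
sample = [("white", "win"), ("black", "draw"), ("white", "draw"), ("black", "loss")]
summary = color_split(sample)
print(summary)
```

A large, persistent gap between `white_score` and `black_score` across paired openings would be the signal to revisit the Black-side issue mentioned above.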

Bottom line: today’s result is encouraging, a marked improvement over the early negative regressions, but not yet decisive. Keep the current setup and extend the run so the statistics can speak clearly.
