Snapshot of the current match
- Score (10+0.1, 1T, 32MB, UHO_2024_8mvs_+085_+094): 18–15–27 (60 games), 52.5% for DEV.
- Elo: +17.4 ± 49.5 (not statistically significant).
- LOS: 75.6% (suggestive, but not conclusive).
- Draw rate: 46.7% (healthier, closer to parity conditions than earlier).
- LLR: 0.04 — effectively “no verdict” yet; far from any SPRT boundary.
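As a quick sanity check, the score percentage and the Elo estimate can be reproduced from the raw result. The sketch below assumes the 18–15–27 line is in win–loss–draw order from DEV's point of view (the only order consistent with 52.5%) and uses the standard logistic Elo conversion; the function name is illustrative. It does not try to reproduce the error bar or LOS, since those depend on how the match tool models variance.

```python
import math

def score_and_elo(wins, losses, draws):
    """Return (score fraction, logistic Elo difference) for a W/L/D result."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Logistic Elo model; undefined for a 0% or 100% score.
    elo = 400.0 * math.log10(score / (1.0 - score))
    return score, elo

score, elo = score_and_elo(wins=18, losses=15, draws=27)
print(f"score = {score:.1%}, Elo = {elo:+.1f}")  # score = 52.5%, Elo = +17.4
```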
How this compares to earlier runs (progression since the first runs)
- Early tests: DEV was badly negative vs the base (often −90 to −150 Elo) and showed a catastrophic collapse as Black (e.g., near-zero win rate with Black), heavily distorted by White-biased books and some option mismatches.
- Mid-stage fixes: After aligning time management (defaults, no MinThink/SlowMover hacks), enforcing color-pairing per line, and cleaning UCI mismatches, results moved toward rough parity but were still volatile; many runs were short and White-skewed.
- Now: DEV is slightly ahead (+17 Elo), with wide error bars. The draw rate has risen from the ~30–40% seen in earlier runs, which usually indicates better comparability and fewer “free points” from adjudication/over-pruning. The Black collapse signal is no longer obvious in this small sample, but with only 60 games you can’t call it fixed.
What to take from this
- Direction of travel: from clearly worse → about even / marginally better.
- Confidence: still low due to the small sample (n=60) and ±50 Elo uncertainty.
- SPRT status: LLR ~0 means keep playing; you’re nowhere near accept/reject thresholds.
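To put "nowhere near the thresholds" in numbers, here is a minimal sketch of Wald's SPRT stopping bounds. The error rates alpha = beta = 0.05 are an assumption (a common choice in engine testing); the settings used for this match are not stated above, so substitute the real ones. Function names are illustrative.

```python
import math

def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's approximate LLR bounds; alpha/beta defaults are assumed, not from the match."""
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    return lower, upper

def sprt_status(llr, alpha=0.05, beta=0.05):
    lower, upper = sprt_bounds(alpha, beta)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"

lower, upper = sprt_bounds()
print(f"bounds = ({lower:.2f}, {upper:.2f}), status = {sprt_status(0.04)}")
# bounds = (-2.94, 2.94), status = continue
```

Under those assumed error rates, an LLR of 0.04 sits almost exactly in the middle of the interval.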
Recommended next steps (quick, practical)
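- Grow the sample to at least 400–800 games at the same settings before judging. The LLR will move, and if per-game variance stays similar the error margin should narrow roughly as 1/√n, from ±49.5 Elo at 60 games to about ±19 Elo at 400 games and ±14 Elo at 800.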
- Color sanity: continue to track per-color scores; if possible, also run an original + mirrored suite pair and combine, to fully remove residual book bias.
- Hold conditions fixed: 1 thread, 32 MB, same book, no extra time options, Ponder off, MultiPV=1.
- Watch these indicators:
  - Draw rate (should stabilize),
  - White vs Black split (if Black dips again, we revisit SEE/TT/NMP gates),
  - Time losses and illegal option warnings in logs.
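For planning how long to extend the run, a rough back-of-envelope projection of the error margin is sketched below. It assumes the per-game variance stays about what it was over the first 60 games, so the margin scales with 1/sqrt(n); the real figure will drift as the draw rate and win/loss balance settle. The function name is hypothetical.

```python
import math

def projected_margin(margin_now, games_now, games_target):
    """Scale an Elo error margin by 1/sqrt(n), assuming constant per-game variance."""
    return margin_now * math.sqrt(games_now / games_target)

# Starting point taken from the snapshot: ±49.5 Elo after 60 games.
for n in (400, 800):
    print(n, round(projected_margin(49.5, 60, n), 1))
# 400 19.2
# 800 13.6
```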
Bottom line: today’s result is encouraging—a marked improvement over the early negative regressions—but not yet decisive. Keep the current setup and extend the run so the statistics can speak clearly.
