Overall score: Wordfish 1.0 dev 260825 vs Wordfish base: 3649 - 3582 - 2769 [0.503] 10000 games
Wordfish 1.0 dev playing White: 2569 - 1033 - 1398 [0.654] 5000 games
Wordfish 1.0 dev playing Black: 1080 - 2549 - 1371 [0.353] 5000 games
White vs Black: 5118 - 2113 - 2769 [0.650] 10000 games
Elo difference: 2.3 +/- 5.8, LOS: 78.5 %, DrawRatio: 27.7 %
SPRT: llr 0.309 (10.5%), lbound -2.94, ubound 2.94
Step 1: Understanding the numbers
- Overall score (3649-3582-2769) is reported as wins-losses-draws for Wordfish 1.0 dev vs Wordfish base across 10,000 games.
- Wins: 3649
- Losses: 3582
- Draws: 2769
- Score fraction: 0.503 → slightly above 50%, meaning the dev version scored slightly better in this sample.
- Colour breakdown:
- White: 2569 wins, 1033 losses, 1398 draws → score fraction 0.654.
- Black: 1080 wins, 2549 losses, 1371 draws → score fraction 0.353.
- White vs Black combined: White won 5118 games, Black won 2113, and 2769 were drawn → White scored 0.650 across all games, regardless of which engine had White.
- SPRT (Sequential Probability Ratio Test) data:
- llr = 0.309 → the log-likelihood ratio is small, about 10.5% of the way to the upper bound.
- lbound = -2.94, ubound = 2.94 → the LLR is well within both bounds, so the test is inconclusive so far, though the positive LLR leans slightly toward the dev version.
- Elo difference:
- Mean: +2.3 Elo
- Uncertainty: ±5.8 Elo
- Likelihood of superiority: LOS 78.5%
- Draw ratio: 27.7%
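The score fractions and the draw ratio can be re-derived from the raw counts. The following is a minimal sketch, assuming the three numbers on each result line are wins, losses and draws in that order (this ordering reproduces the bracketed values and the 27.7% draw ratio); the function names are illustrative and not part of any tool.

```python
def score_fraction(wins, losses, draws):
    """Points scored per game: a win counts 1, a draw 0.5, a loss 0."""
    games = wins + losses + draws
    return (wins + 0.5 * draws) / games


def draw_ratio(wins, losses, draws):
    """Share of games that ended in a draw."""
    return draws / (wins + losses + draws)


# Counts taken from the result lines, read as (wins, losses, draws).
overall = (3649, 3582, 2769)
dev_as_white = (2569, 1033, 1398)
dev_as_black = (1080, 2549, 1371)

for label, result in [("overall", overall),
                      ("dev as White", dev_as_white),
                      ("dev as Black", dev_as_black)]:
    print(f"{label}: {score_fraction(*result):.3f}")

print(f"draw ratio: {draw_ratio(*overall):.1%}")
```

Under that reading, the three score fractions round to 0.503, 0.654 and 0.353, matching the log.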
Step 2: Interpretation
- Overall strength:
- Wordfish 1.0 dev is slightly stronger than the base version.
- +2.3 Elo is very modest; given the ±5.8 Elo error margin, the confidence interval (roughly -3.5 to +8.1 Elo) includes zero, so the dev version might not be stronger in a strict statistical sense.
- Colour asymmetry:
- The dev version scored 0.654 with White and 0.353 with Black.
- The base version shows nearly the same split: its White games are the dev version's Black games, so it scored 1 - 0.353 = 0.647 with White.
- This suggests the asymmetry comes mainly from the test conditions (White scores 0.650 overall), not from the dev changes; the dev version's edge is only the small gap between 0.654 and 0.647.
- Draw ratio & SPRT:
- Draw ratio 27.7% is relatively low, meaning many decisive games → higher variance per game.
- SPRT LLR 0.309 lies between the lower and upper bounds (-2.94 and 2.94) → the sequential test has not crossed either threshold, so the result is suggestive but not statistically conclusive.
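The SPRT stopping rule can be sketched as follows. This is a hedged illustration of Wald's standard bounds, assuming error rates alpha = beta = 0.05 (an assumption, chosen because it reproduces the ±2.94 bounds in the log); the hypotheses being tested are not stated in the output, and the helper names are hypothetical.

```python
import math


def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's SPRT decision bounds for the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def sprt_status(llr, lower, upper):
    """Return the test decision for the current LLR."""
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


lower, upper = sprt_bounds()   # assumed alpha = beta = 0.05
llr = 0.309                    # value reported in the log

print(f"bounds: {lower:.2f}, {upper:.2f}")
print(f"status: {sprt_status(llr, lower, upper)}")
# Progress toward the upper bound, using the rounded bound from the log.
print(f"progress: {llr / 2.94:.1%}")
```

With these inputs the status is "continue", which is why the test is described as inconclusive.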
Step 3: Elo Gain Estimation
- The provided Elo difference: +2.3 ± 5.8
- This means:
- Expected gain for Wordfish 1.0 dev vs base: 2.3 Elo
- Error margin (consistent with a 95% confidence interval for these counts): 5.8 Elo
- Likelihood of superiority (LOS 78.5%): moderate; not definitive, but indicates slight improvement.
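The Elo figures can be reproduced from the win/loss/draw counts. The sketch below uses the usual logistic Elo conversion, a normal-approximation 95% interval on the score, and the common erf-based LOS formula; these are assumptions about how the numbers were produced that happen to reproduce the reported values, not a statement of what the testing tool does internally.

```python
import math


def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, losses, draws, z=1.959964):
    """Elo estimate and half-width of a z-sigma interval (default 95%)."""
    games = wins + losses + draws
    w, l, d = wins / games, losses / games, draws / games
    score = w + 0.5 * d
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (w * (1.0 - score) ** 2
                + l * (0.0 - score) ** 2
                + d * (0.5 - score) ** 2)
    std_error = math.sqrt(variance / games)
    upper = elo_from_score(score + z * std_error)
    lower = elo_from_score(score - z * std_error)
    return elo_from_score(score), (upper - lower) / 2.0


def likelihood_of_superiority(wins, losses):
    """Probability that the first engine is stronger, ignoring draws."""
    return 0.5 * (1.0 + math.erf((wins - losses)
                                 / math.sqrt(2.0 * (wins + losses))))


elo, margin = elo_with_margin(3649, 3582, 2769)
los = likelihood_of_superiority(3649, 3582)
print(f"Elo difference: {elo:.1f} +/- {margin:.1f}, LOS: {los:.1%}")
```

Setting `z=1.0` instead gives a margin of about 3 Elo, so the reported ±5.8 is not a single standard deviation.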
Interpretation in practical terms:
- The dev version is marginally stronger, but this improvement is very small and might not be reliably detectable in short tournament runs.
- White scores about 65% for both engines, so the colour split largely cancels out in the paired comparison and does not explain the gain.
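One way to check whether the colour split reflects the engine change or the test conditions is to compare each engine's score as White. This is a small sketch, assuming the counts are wins-losses-draws from the dev version's point of view, so that the base version's White games are the dev version's Black games with wins and losses swapped; the names are illustrative.

```python
def score_fraction(wins, losses, draws):
    """Points scored per game: a win counts 1, a draw 0.5, a loss 0."""
    return (wins + 0.5 * draws) / (wins + losses + draws)


def opponent_view(result):
    """Same games seen from the other side: wins and losses swap."""
    wins, losses, draws = result
    return (losses, wins, draws)


dev_as_white = (2569, 1033, 1398)
dev_as_black = (1080, 2549, 1371)
base_as_white = opponent_view(dev_as_black)

dev_white_score = score_fraction(*dev_as_white)
base_white_score = score_fraction(*base_as_white)

print(f"dev as White:  {dev_white_score:.3f}")
print(f"base as White: {base_white_score:.3f}")
print(f"gap:           {dev_white_score - base_white_score:+.3f}")
```

Both engines score close to 0.65 with White, and the small gap between them is what carries over into the overall 0.503 score.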
Step 4: Summary Table
Metric | Value | Interpretation |
---|---|---|
Overall score fraction | 0.503 | Slight advantage for dev |
Elo difference | +2.3 ± 5.8 | Marginal, within statistical noise |
LOS (likelihood of superiority) | 78.5% | Moderate confidence |
Draw ratio | 27.7% | Low draws, many decisive games |
White score fraction | 0.654 | Dev with White; base scores 0.647 with White |
Black score fraction | 0.353 | Dev with Black; mirrors the general White advantage |
SPRT llr | 0.309 | Well within bounds, test inconclusive |
✅ Conclusion
- Wordfish 1.0 dev shows a slight Elo improvement of +2.3 over Wordfish base.
- The improvement is not statistically robust due to ±5.8 Elo uncertainty.
- Both engines score about 65% with White (0.654 for dev, 0.647 for base), so the colour split reflects the test conditions; the dev version's net gain is the small difference between the two.
- SPRT confirms the result is suggestive but not decisive; further testing would be required to confirm a true performance increase.
In short, Wordfish 1.0 dev is marginally stronger, with a small estimated gain of roughly +2 Elo. The statistical evidence leans toward an improvement but is not conclusive.