Guide: SPRT Tests for Chess Engines with cutechess-cli

Mastering Statistical Validation with cutechess-cli

1. Introduction: Why SPRT?

In chess engine development, validating strength improvements is critical. The Sequential Probability Ratio Test (SPRT) offers a statistically rigorous way to stop a test as soon as the results are conclusive, which saves CPU time. Developed by Abraham Wald during World War II, SPRT is now standard in projects like Stockfish (see Stockfish Testing Dashboard).

Key Advantages:

Efficiency: Stops tests when “enough evidence” is gathered (e.g., up to 5x faster than fixed-game tests, depending on how far the true Elo difference lies from the tested hypotheses).
Accuracy: Controls Type I/II errors (false positives/negatives).
Adaptability: Works for small Elo changes (e.g., ±5 Elo).

2. Decoding Your SPRT Output

Let’s dissect a sample cutechess-cli result summary:

Elo difference: 0.0 +/- 29.1,  
LOS: 50.0 %,  
DrawRatio: 32.4 %  
SPRT: llr -0.228 (-7.7%), lbound -2.94, ubound 2.94

Elo difference: 0.0 ± 29.1
The result is inconclusive rather than proof of equality: the 95% confidence interval for the true Elo difference runs from -29.1 to +29.1.
LOS (Likelihood of Superiority): 50.0%
Indicates neither engine is favoured. A LOS above 95% is typically treated as significant.
DrawRatio: 32.4%
The share of games that ended in a draw. It reflects game dynamics and affects how many games SPRT needs to reach a decision.
SPRT Parameters:
llr (Log-Likelihood Ratio): -0.228
Measures the evidence for H1 relative to H0 (null hypothesis: “no Elo change”). Negative values favour H0; positive values favour H1.
Bounds: lbound = -2.94, ubound = 2.94
These values correspond to α=0.05 (5% false positive) and β=0.05 (5% false negative).
Progress: -7.7%
The llr expressed as a fraction of the bound it is moving toward: -0.228 / 2.944 ≈ -7.7%, so the test is 7.7% of the way toward accepting H0. If llr reaches -2.94, H0 is accepted; if it reaches +2.94, H1 (e.g., “Elo +10”) is accepted.

Interpretation:
The test leans toward concluding no Elo change, but more games are needed for certainty.
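
The progress percentage can be reproduced with a few lines of Python. This is a minimal sketch, assuming the percentage is simply llr divided by the bound it is heading toward; the function name `sprt_progress` is made up for illustration and is not part of cutechess-cli.

```python
def sprt_progress(llr: float, lbound: float, ubound: float) -> float:
    """Return llr as a signed percentage of the bound it is moving toward.

    Positive values mean progress toward ubound (accept H1),
    negative values mean progress toward lbound (accept H0).
    """
    if llr >= 0:
        return 100.0 * llr / ubound
    # lbound is negative, so llr / lbound is positive; negate to keep the sign of llr.
    return -100.0 * llr / lbound


# Values from the sample output above; 2.944 is ln(0.95 / 0.05) before rounding.
print(f"{sprt_progress(-0.228, -2.944, 2.944):.1f}%")
```

A value near -100% or +100% means the test is about to stop; a value near 0% means the results so far barely discriminate between the two hypotheses.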

3. SPRT Setup in cutechess-cli: Step-by-Step Tutorial

Prerequisites

cutechess-cli: Download here.
Engines: Old vs. new versions (e.g., my_engine_v1 vs my_engine_v2).
Opening Book: e.g., noob_3moves.epd (balanced openings).

Command Template

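```bash
# New is listed first: cutechess-cli's SPRT tests whether the first engine
# is stronger than the second.
# -repeat plays each opening twice with colours swapped (pairs with -games 2).
# format=epd is needed because -openings defaults to PGN input.
cutechess-cli \
  -engine name=New cmd=./my_engine_v2 \
  -engine name=Old cmd=./my_engine_v1 \
  -each proto=uci tc=60+0.6 \
  -games 2 -rounds 500 -repeat \
  -sprt elo0=0 elo1=5 alpha=0.05 beta=0.05 \
  -draw movenumber=40 movecount=8 score=20 \
  -openings file=noob_3moves.epd format=epd order=random \
  -concurrency 8 \
  -pgnout results.pgn
```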

###### **Parameter Breakdown**  
| **Flag**          | **Value**       | **Purpose**                                                                 |  
|--------------------|-----------------|-----------------------------------------------------------------------------|  
| `tc`               | `60+0.6`        | Base 60s + 0.6s increment per move.                                         |  
| `sprt elo0 elo1`   | `elo0=0 elo1=5` | Tests H0: "ΔElo = 0" vs. H1: "ΔElo = 5". Choose `elo1` as the target gain. |  
| `draw`             | `movenumber=40...` | Adjusts draw detection (critical for high-draw engines).                 |  
| `concurrency`      | `8`             | Number of games played in parallel (keep at or below the number of CPU cores). |  
| `pgnout`           | `results.pgn`   | Saves games for analysis.                                                   |  

###### **Choosing `elo0` and `elo1`**  
- **Sensitivity Trade-off:**  
  - `elo1=5`: Detects tiny changes (slower test).  
  - `elo1=15`: Faster but misses small improvements.  
  **Recommendation:** Start with `elo1=10` for most updates.  
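
To see why small `elo1` values make for slow tests, it helps to convert Elo differences into expected scores. The sketch below uses the standard logistic Elo formula; the function name `expected_score` is illustrative only.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score (0..1) for a player who is elo_diff Elo stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Compare the three elo1 candidates discussed above.
for elo1 in (5, 10, 15):
    print(f"elo1={elo1:>2}: expected score {expected_score(elo1):.4f}")
```

A 5 Elo gain moves the expected score from 50% to only about 50.7%, which is why separating it from zero takes so many games.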

###### **Time Control Tips**  
- **LTC (Long Time Control):** `tc=180+1.8` for reliable results.  
- **STC (Short Time Control):** `tc=10+0.1` for rapid iteration.  

---

#### **4. Interpreting Real-Time Output**  
During execution, cutechess-cli logs:

Finished game 42 (New vs Old): 1/2-1/2 {Draw by adjudication}  
Score of New vs Old: 10 - 8 - 24 [0.524]  
SPRT: llr 1.23 (41.8%), lbound -2.94, ubound 2.94

- **Score:** 10 wins, 8 losses, 24 draws → 22/42 points (52.4%).  
- **llr 1.23 (41.8%)**: The LLR has covered 41.8% of the distance to the upper bound, i.e. toward accepting H1 ("Elo +5").  
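
The score arithmetic is easy to check by hand or in code. The following is a small sketch using the standard logistic Elo formula; the helper names `score_fraction` and `elo_from_score` are illustrative and do not come from cutechess-cli.

```python
import math


def score_fraction(wins: int, losses: int, draws: int) -> float:
    """Points scored divided by games played (a draw counts as half a point)."""
    games = wins + losses + draws
    if games == 0:
        raise ValueError("no games played")
    return (wins + 0.5 * draws) / games


def elo_from_score(score: float) -> float:
    """Invert the logistic Elo formula; undefined for scores of exactly 0 or 1."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


s = score_fraction(wins=10, losses=8, draws=24)
print(f"score: {s:.3f}")
print(f"Elo estimate: {elo_from_score(s):+.1f}")
```

With only 42 games the Elo estimate carries a very wide error margin, which is exactly why the SPRT line, not the raw score, decides when to stop.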

**Decision Tree:**  
- If `llr ≥ ubound (2.94)`: Accept H1 (new engine stronger).  
- If `llr ≤ lbound (-2.94)`: Accept H0 (no change).  
- Else: Continue testing.  
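
The decision tree above can be sketched as a small parser for the SPRT status line. The regular expression assumes the line looks exactly like the samples in this guide; the function name `sprt_decision` is hypothetical.

```python
import re

# Matches e.g. "SPRT: llr -0.228 (-7.7%), lbound -2.94, ubound 2.94".
# The percentage in parentheses is skipped because it is derived from llr.
SPRT_LINE = re.compile(
    r"SPRT: llr (?P<llr>-?\d+(?:\.\d+)?) \([^)]*\), "
    r"lbound (?P<lbound>-?\d+(?:\.\d+)?), "
    r"ubound (?P<ubound>-?\d+(?:\.\d+)?)"
)


def sprt_decision(line: str) -> str:
    """Return 'accept H1', 'accept H0', or 'continue' for one SPRT line."""
    match = SPRT_LINE.search(line)
    if match is None:
        raise ValueError(f"not an SPRT status line: {line!r}")
    llr = float(match["llr"])
    lbound = float(match["lbound"])
    ubound = float(match["ubound"])
    if llr >= ubound:
        return "accept H1"
    if llr <= lbound:
        return "accept H0"
    return "continue"


print(sprt_decision("SPRT: llr -0.228 (-7.7%), lbound -2.94, ubound 2.94"))
```

cutechess-cli stops the match on its own when a bound is crossed, so a parser like this is mainly useful for dashboards or log post-processing.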

---

#### **5. Best Practices & Pitfalls**  

##### **Optimal Settings**  
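- **Draw Adjudication:**  
  The `-draw` option takes `movenumber`, `movecount`, and `score`, not a draw percentage. `-draw movenumber=40 movecount=8 score=20` adjudicates a draw once at least 40 full moves have been played and both engines have reported scores within 20 centipawns of zero for 8 consecutive moves. If drawish games are being adjudicated prematurely, raise `movenumber` or `movecount`.  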
- **Bounds Rigor:**  
  Use `alpha=0.01` and `beta=0.05` for critical updates:

-sprt elo0=0 elo1=5 alpha=0.01 beta=0.05

##### **Common Mistakes**  
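1. **Too Aggressive `elo1`:**  
   Avoid `elo1=20` for small patches: the test finishes quickly, but a genuine gain of only a few Elo will usually end with H0 accepted.  
2. **Insufficient Games:**  
   Let the test run until `llr` crosses a bound instead of stopping it by hand. A test with `elo1=5` typically needs on the order of 1k–10k games or more, so make sure the `-rounds` × `-games` cap is high enough.  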
3. **Biased Openings:**  
   Use balanced books (e.g., [Noomen 2023](https://www.chessprogramming.org/Openings_Book)).  

##### **Stockfish SPRT Example**  
Stockfish tests often use:

-sprt elo0=0 elo1=4 alpha=0.05 beta=0.05

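With α = β = 0.05, a patch with no real gain is wrongly accepted at most about 5% of the time, and a patch truly worth **4 Elo** is wrongly rejected at most about 5% of the time.  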

---

#### **6. Advanced: The Math Behind SPRT**  
SPRT calculates the **log-likelihood ratio (LLR)** after each game:

LLR = log( [P(Results | H1)] / [P(Results | H0)] )

Where:  
- `H0`: Null hypothesis (e.g., ΔElo = 0).  
- `H1`: Alternative hypothesis (e.g., ΔElo = 5).  
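
As an illustration of the ratio above, here is a simplified sketch for win/draw/loss counts. It is not cutechess-cli's actual implementation: it assumes the draw probability equals the observed draw ratio under both hypotheses and that each hypothesis fixes the expected score through the logistic Elo formula. All function names are made up for this example.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Logistic Elo formula: expected score for a given Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def outcome_probs(elo: float, draw_ratio: float) -> tuple:
    """(win, draw, loss) probabilities under the simplifying assumptions above."""
    score = expected_score(elo)
    return score - draw_ratio / 2.0, draw_ratio, 1.0 - score - draw_ratio / 2.0


def llr(wins: int, draws: int, losses: int, elo0: float, elo1: float) -> float:
    """log(P(results | H1) / P(results | H0)) for a win/draw/loss tally."""
    games = wins + draws + losses
    if games == 0:
        return 0.0
    draw_ratio = draws / games
    p0 = outcome_probs(elo0, draw_ratio)
    p1 = outcome_probs(elo1, draw_ratio)
    total = 0.0
    for count, q0, q1 in zip((wins, draws, losses), p0, p1):
        if count == 0:
            continue  # outcomes that never occurred contribute nothing
        if q0 <= 0.0 or q1 <= 0.0:
            raise ValueError("draw ratio too high for this simplified model")
        total += count * math.log(q1 / q0)
    return total


# Hypothetical tally: 300 wins, 450 draws, 250 losses, testing H0: 0 Elo vs H1: 5 Elo.
print(f"{llr(300, 450, 250, elo0=0, elo1=5):.3f}")
```

Because the draw term cancels in this simplified model, only wins and losses move the LLR: more wins than losses push it toward `ubound`, and the reverse pushes it toward `lbound`.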

The bounds `[lbound, ubound]` are derived from:

```
lbound = ln(β / (1-α))
ubound = ln((1-β) / α)
```

With α = β = 0.05: `lbound = ln(0.05/0.95) ≈ -2.94` and `ubound = ln(0.95/0.05) ≈ 2.94`.
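
The two formulas translate directly into Python. This short sketch (the function name `sprt_bounds` is illustrative) prints the bounds for the default α = β = 0.05 and for the stricter α = 0.01, β = 0.05 setting suggested in the best-practices section.

```python
import math


def sprt_bounds(alpha: float, beta: float) -> tuple:
    """Return (lbound, ubound) for the given false-positive and false-negative rates."""
    lbound = math.log(beta / (1.0 - alpha))
    ubound = math.log((1.0 - beta) / alpha)
    return lbound, ubound


for alpha, beta in ((0.05, 0.05), (0.01, 0.05)):
    lbound, ubound = sprt_bounds(alpha, beta)
    print(f"alpha={alpha}, beta={beta}: lbound={lbound:.2f}, ubound={ubound:.2f}")
```

Lowering α raises only the upper bound, so the test demands more evidence before accepting H1 while the threshold for accepting H0 barely moves.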

7. Tools & Resources

SPRT Calculator: Rustic Chess SPRT Simulator
Theory Deep Dive: CPW SPRT Guide

Conclusion

SPRT transforms engine testing from a “fixed-game gamble” to a dynamic, evidence-driven process. By integrating this guide, you’ve mastered:
✅ Configuring SPRT in cutechess-cli.
✅ Interpreting Elo/LOS/llr metrics.
✅ Avoiding statistical pitfalls.

Final Tip: Start with STC (tc=10+0.1) for quick feedback, then validate with LTC before deploying changes.

“In chess, as in science, not all that glitters is Elo.”
― Adapted from Peter Leko.

Feedback? Share your SPRT results with me for personalised analysis! 🚀

Jorge Ruiz
