SSSE3-Optimised Builds of PlentyChess: Design and Performance of the Codex-Assisted Integration
PlentyChess is a state-of-the-art UCI chess engine whose evaluation is based on an efficiently updated neural network (NNUE). Its network is trained on billions of self-generated positions and the engine currently ranks among the very top engines on multiple public rating lists. (GitHub)
The work carried out with Codex focuses on exploiting SSSE3-capable x86-64 processors by producing a dedicated SSSE3 binary and ensuring that PlentyChess’ NNUE kernels fully benefit from this instruction set. The result is a configuration that is significantly faster than the generic build on CPUs that support SSSE3 but lack newer extensions such as AVX2—typical of many Ivy Bridge-class Xeon systems (e.g. Xeon E5-2697 v2, which supports SSSE3 and AVX, but not AVX2). (valid.x86.fr)
1. Background: Instruction Sets and PlentyChess’ SIMD Layer
Supplemental Streaming SIMD Extensions 3 (SSSE3) is the fourth iteration of Intel’s SSE technology. It adds, among other operations, new multiply-and-add instructions (notably PMADDUBSW) that are particularly well suited for dot products on packed 8-bit and 16-bit data. (Wikipedia) NNUE evaluation, which relies heavily on dot products over quantised weights and activations, can therefore benefit strongly from SSSE3.
PlentyChess already organises its evaluation around a portable SIMD abstraction layer defined in src/simd.h. This header maps high-level vector types (e.g. VecI8, VecI16, VecF) to concrete hardware types (__m128i, __m256i, __m512i, or NEON types), and implements vector primitives such as saturated adds, min/max, and fused multiply–add. For x86-64, the header provides separate paths for AVX-512, AVX2 and a more general SSE/SSSE3 configuration. (GitHub)
Codex’ task was essentially to bridge this SIMD infrastructure with a robust build pipeline that can (a) generate an SSSE3-targeted binary on demand, and (b) guarantee that the NNUE code takes the SSSE3 fast path on such builds.
2. Build-System Integration of the SSSE3 Target
2.1. Architecture detection and arch=ssse3
The central entry point is PlentyChess’ Makefile. It uses a parameter arch to steer compilation. When arch is not explicitly set, the Makefile auto-detects CPU capabilities by querying the compiler with -march=native and parsing the predefined macros, on both Unix and Windows (the latter via detect_flags.bat). (GitHub)
The detection logic computes booleans like HAS_SSSE3, HAS_FMA, HAS_AVX2, HAS_AVX512, and then selects the “best” architecture:
avx512vnni → avx512 → avx2 → fma → ssse3 → generic
Codex’ work ensures that the ssse3 branch is fully wired:
- When arch=ssse3, the Makefile now sets CXXFLAGS := ... -DARCH_X86 -mssse3 and CFLAGS := ... -mssse3. (GitHub)
- This guarantees that the compiler defines __SSSE3__ and that all x86 SIMD code is compiled with SSSE3 enabled, while still keeping the generic SSE2 baseline for portability.
From a user’s perspective, this allows two complementary modes:
- Automatic selection:
  make -j   # auto-detects CPU; picks ssse3 if available
- Explicit SSSE3 build, for example when cross-compiling or when you want to ensure a conservative target for a cluster of similar machines:
  make arch=ssse3 EXE=PlentyChess-ssse3 -j
On Windows/MinGW, the same logic is available and integrated with static linking and (where available) the LLD linker, ensuring reproducible SSSE3 binaries for UCI testing. (GitHub)
2.2. Compatibility with PGO and LTO
The Makefile also contains a profile-guided optimisation (PGO) pipeline: for Clang it uses -fprofile-instr-generate and -fprofile-instr-use, and for GCC it falls back to -fprofile-generate / -fprofile-use. In both cases, PGO is layered on top of the chosen arch, so an SSSE3 build can be profile-trained and recompiled for the exact instruction set. (GitHub)
Additionally, link-time optimisation (LTO) and, where possible, the LLD linker are enabled for non-macOS platforms. This combination—SSSE3 targeting + PGO + LTO—forms the basis for the speed-up that the Codex-assisted pipeline aims to exploit.
3. SSSE3-Accelerated NNUE Evaluation
The NNUE inference code in src/nnue.cpp is organised into several phases: accumulation of feature activations, pairwise interaction computation, sparse propagation through the first hidden layer, and then fully connected layers up to the final scalar evaluation. (GitHub) Codex’ changes mainly leverage the SIMD infrastructure to accelerate the two most expensive stages for an SSSE3 build.
3.1. Fast non-zero detection and sparse L1 propagation
The first optimisation concerns the detection of non-zero activations in the large L1 layer. PlentyChess maintains a precomputed lookup table nnzLookup[256][8] that encodes the indices of non-zero bytes for each possible 8-bit mask. (GitHub) In an SSSE3/AVX2/AVX-512 build, the code uses vector comparisons and this lookup to compute the indices of active neurons in blocks, rather than scanning the whole layer.
In the SSSE3 configuration (ARCH_X86 plus __SSSE3__), the engine uses 128-bit XMM registers and instructions such as _mm_cmpgt_epi32 and _mm_movemask_ps (through the vecNNZ abstraction) to build a bitmask of non-zero elements, then expands that mask into indices with the lookup table. This replaces a scalar element-by-element scan with vectorised operations that handle 16 bytes at a time. (GitHub)
3.2. SSSE3 dot products using maddubs and madd
The second—and more critical—optimisation is the dot-product computation between 8-bit activations and 8-bit weights. In the scalar fallback, this is implemented as nested loops that multiply bytes and accumulate into 32-bit integers. In the SSSE3 path, Codex leverages the dpbusdEpi32 / dpbusdEpi32x2 abstractions in simd.h. On x86 with SSSE3, these are implemented via _mm_maddubs_epi16 followed by _mm_madd_epi16 and additions. (GitHub)
In effect, every call to dpbusdEpi32 computes a packed dot product of 16 byte pairs and accumulates the results into four 32-bit lanes using just a handful of instructions, dramatically increasing throughput. The NNUE code calls these primitives in a tight loop over the active L1 indices:
- For most of the layer, it processes two feature indices at a time with dpbusdEpi32x2, doubling instruction-level parallelism.
- For a possible trailing index, it falls back to dpbusdEpi32.
This pattern is encoded in the SSSE3 branch of NNUE::evaluate and is only compiled when __SSSE3__ (or stronger SIMD) is defined, which is precisely what the SSSE3 build guarantees. (GitHub)
3.3. Vectorised floating-point layers
Subsequent layers operate on 32-bit floating-point values. For SSSE3-class CPUs, PlentyChess uses the generic SSE path in simd.h (128-bit vectors of four floats) together with vectorised implementations of activation and reduction:
- Conversions from 32-bit integers to floats via _mm_cvtepi32_ps.
- Clamping and squaring of activations using _mm_min_ps, _mm_max_ps and, when available, _mm_fmadd_ps via FMA. (GitHub)
Although these are not unique to SSSE3, the combined effect of vectorised integer and floating-point code leads to a much higher instruction throughput than a purely scalar implementation.
4. Practical Build Usage
In practical terms, Codex’ integration means that on an SSSE3-capable but AVX2-less system—such as a dual-socket Xeon E5-2697 v2 server—you can now build a fully optimised PlentyChess binary with commands of the following form:
# Development build, auto-detected arch (will choose ssse3 here)
make -j
# Explicit SSSE3 release with PGO (two-stage build):
make profile-build arch=ssse3 EXE=PlentyChess-ssse3 -j
On Windows/MinGW, the same targets are available with the appropriate CXX setting (e.g. CXX=clang++), and the build system will still auto-detect __SSSE3__ via detect_flags.bat. (GitHub)
5. Performance and Expected Gains
The PlentyChess project itself reports that its optimised binaries (with instruction-set specific builds plus PGO/LTO) can be several percent faster than earlier generic releases—on the order of ~5 % on Linux and up to ~10 % on Windows for some versions. (GitHub) The Codex-assisted SSSE3 configuration is in line with this philosophy:
- Compared with a purely “generic” SSE2 build, an SSSE3 build activates the sparse NNZ logic and the maddubs-based dot-product kernels, which significantly reduce the number of scalar integer operations.
- Because these optimisations live behind PlentyChess’ SIMD abstraction and are combined with PGO/LTO, the SSSE3 binary remains stable and portable across all SSSE3-capable CPUs without sacrificing maintainability.
While exact speed-ups depend on compiler, OS, and hardware, the qualitative effect is consistent: more nodes per second and better utilisation of the NNUE evaluation pipeline on mid-generation x86-64 processors.
6. Conclusion
In summary, Codex’ work on PlentyChess has:
- Exposed a robust SSSE3 build target through the Makefile, with automatic CPU feature detection and an explicit arch=ssse3 override.
- Ensured that SSSE3 builds trigger the vectorised NNUE code paths, in particular the fast non-zero detection and dot-product kernels based on maddubs/madd.
- Integrated these changes with PGO and LTO, so that the SSSE3 binary benefits from both microarchitectural tuning and profile-guided optimisation.
For clusters built around SSSE3-capable, AVX2-less hardware, this configuration offers a technically clean and practically effective way to run PlentyChess at a significantly higher speed while retaining full compatibility with its upstream codebase and release workflow.
