@KierenP
Contributor

KierenP commented Aug 16, 2025

We can use the _mm512_mask_compressstoreu_epi16 intrinsic, which performs the compress and the store in one instruction rather than two. It's part of the same AVX512_VBMI2 instruction set.
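
A minimal sketch of the two forms being compared, not the actual Stockfish code: the function names and the 32-lane uint16_t layout are assumptions made for illustration. The scalar function spells out what a compress-store does; the intrinsic versions are only compiled when the compiler targets AVX512_VBMI2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX512VBMI2__)
#include <immintrin.h>
#endif

// Scalar reference (hypothetical helper): copy the lanes of `in` whose mask
// bit is set to `out`, packed contiguously, and return how many were written.
inline std::size_t compress_store_scalar(const std::uint16_t* in,
                                         std::uint32_t mask,
                                         std::uint16_t* out) {
    assert(in != nullptr && out != nullptr);
    std::size_t n = 0;
    for (int i = 0; i < 32; ++i)
        if (mask & (1u << i))
            out[n++] = in[i];
    return n;
}

#if defined(__AVX512VBMI2__)
// Portable population count, used to report how many lanes were selected.
inline std::size_t popcount32(std::uint32_t m) {
    std::size_t n = 0;
    for (; m != 0; m &= m - 1)
        ++n;
    return n;
}

// Two-step form: compress inside the register, then a full-width store.
// This writes all 64 bytes, so `out` must have room for 32 elements.
inline std::size_t compress_then_store(const std::uint16_t* in,
                                       std::uint32_t mask,
                                       std::uint16_t* out) {
    const __m512i v      = _mm512_loadu_si512(in);
    const __m512i packed = _mm512_maskz_compress_epi16(mask, v);
    _mm512_storeu_si512(out, packed);
    return popcount32(mask);
}

// Fused form: one intrinsic compresses and stores only the selected lanes.
inline std::size_t compress_store_fused(const std::uint16_t* in,
                                        std::uint32_t mask,
                                        std::uint16_t* out) {
    const __m512i v = _mm512_loadu_si512(in);
    _mm512_mask_compressstoreu_epi16(out, mask, v);
    return popcount32(mask);
}
#endif
```

Both intrinsic versions leave the same packed prefix in `out`; they differ in how many instructions are issued and in whether the bytes past the packed prefix are overwritten.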

Baseline:

$ ./stockfish_master.exe speedtest
Stockfish dev-20250816-169737a9 by the Stockfish developers (see AUTHORS file)
info string Using 32 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish dev-20250816-169737a9
Compiled by                : g++ (GNUC) 15.1.0 on MinGW64
Compilation architecture   : x86-64-avx512icl
Compilation settings       : 64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.1.0
Large pages                : no
User invocation            : speedtest
Filled invocation          : speedtest 32 4096 150
Available processors       : 0-31
Thread count               : 32
Thread binding             : none
TT size [MiB]              : 4096
Hash max, avg [per mille]  :
    single search          : 48, 23
    single game            : 647, 456
Total nodes searched       : 5322766982
Total search time [s]      : 153.515
Nodes/second               : 34672618

With my change:

$ ./stockfish.exe speedtest
Stockfish dev-20250816-169737a9 by the Stockfish developers (see AUTHORS file)
info string Using 32 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish dev-20250816-169737a9
Compiled by                : g++ (GNUC) 15.1.0 on MinGW64
Compilation architecture   : x86-64-avx512icl
Compilation settings       : 64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.1.0
Large pages                : no
User invocation            : speedtest
Filled invocation          : speedtest 32 4096 150
Available processors       : 0-31
Thread count               : 32
Thread binding             : none
TT size [MiB]              : 4096
Hash max, avg [per mille]  :
    single search          : 47, 22
    single game            : 655, 448
Total nodes searched       : 5403131641
Total search time [s]      : 153.519
Nodes/second               : 35195198

Speedup: 1.51%
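
For reference, the figure follows from the two Nodes/second lines above. A small sketch of the arithmetic, assuming the speedup is defined as the relative change in nodes per second (the helper name is hypothetical):

```cpp
#include <cstdint>

// Relative change in nodes/second, in percent (hypothetical helper).
constexpr double speedup_percent(std::uint64_t baseline_nps,
                                 std::uint64_t patched_nps) {
    return 100.0
         * (static_cast<double>(patched_nps) - static_cast<double>(baseline_nps))
         / static_cast<double>(baseline_nps);
}

// Nodes/second copied from the baseline and patched speedtest runs.
constexpr double kSpeedup = speedup_percent(34672618ULL, 35195198ULL);
static_assert(kSpeedup > 1.50 && kSpeedup < 1.52,
              "consistent with the quoted 1.51%");
```

This is a single pair of runs, so the number is only as reliable as one speedtest comparison can be.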

Bench: 2996176

@mstembera
Contributor

See #6153 (comment)

@mstembera
Contributor

We should probably put a comment here just like

// Avoid _mm512_mask_compressstoreu_epi16() as it's 256 uOps on Zen4

@KierenP
Contributor Author

KierenP commented Aug 17, 2025

Ah thanks, I didn't realise. That being said, it is a speedup on Zen5 so in the future maybe we can have a target taking advantage of this.

@Disservin
Member

How much of a speedup is this for Zen5? If it is only minor, I'd simply avoid this and not introduce a new target or start parsing cpuid.

@KierenP
Contributor Author

KierenP commented Aug 28, 2025

It's 1.5% (as seen in the above speedtest comparison). That being said, the difference would be bigger if I also applied _mm512_mask_compressstoreu_epi16 in the NNUE inference.

This PR can probably be closed.
