Skip to content

Conversation

@anematode
Copy link
Contributor

@anematode anematode commented Nov 3, 2025

As usual... non-functional speedup.

Failed yellow on Fishtest:

LLR: -2.95 (-2.94,2.94) <0.00,2.00>
Total: 219616 W: 57165 L: 57105 D: 105346
Ptnml(0-2): 732, 24277, 59720, 24357, 722 

vondele and I chatted on the Discord and agreed that this is probably still a good change even if it doesn't gain on fishtest, given that it gains on at least some systems, and it would be insane for it to regress anywhere. That being said, it might be a good idea to get more benchmarks to understand the impact of this change....

Background/explanation

fetch_add and friends must produce a locking instruction because they guarantee consistency in the presence of multiple writers. Because only one thread ever writes to nodes and tbHits, we may use relaxed load and store to emit regular loads and stores (on both x86-64 and arm64), while avoiding formal 👻 UB 👻 and miscounting nodes.

Credit must go to Arseniy Surkov of Reckless (@codedeliveryservice) for uncovering this performance gotcha.

(I'm honestly shocked that this patch works—I guess uncontended atomics are much more expensive than I thought!)

Other data

Benchmark results from vondele's many-core machine:

==== master ====
1 Nodes/second : 294269736
2 Nodes/second : 295217629
Average (over 2):  294743682
==== goose ====
1 Nodes/second : 299003603
2 Nodes/second : 298111320
Average (over 2):  298557461

indicating it scales fine. Interestingly I get only a minuscule speedup on my Zen 5 machine (+0.1%, P>0.99). I can't imagine a case where it'd ever regress though, and fishtest doesn't lie.

@anematode anematode changed the title Use non-atomic operations for nodes count and tbHits Use non-locking operations for nodes count and tbHits Nov 3, 2025
@ddobbelaere
Copy link
Contributor

Why does only one thread update those counters? What's the point for using atomics anyway then, to just read them in another thread? Sorry for the potentially stupid question.

@vondele
Copy link
Member

vondele commented Nov 3, 2025

because two different threads read them... they have to be treated as atomics.

@ddobbelaere
Copy link
Contributor

ddobbelaere commented Nov 3, 2025

I see, they are local counters of the worker thread itself (and not global ones) which are accumulated elsewhere. Makes perfect sense, sorry for the noise.

rlefko added a commit to rlefko/Stockfish that referenced this pull request Nov 6, 2025
rlefko added a commit to rlefko/Stockfish that referenced this pull request Nov 6, 2025
@vondele
Copy link
Member

vondele commented Nov 13, 2025

will need rebase

@vondele
Copy link
Member

vondele commented Nov 14, 2025

will need rebase

@anematode FYI

@vondele vondele added the to be merged Will be merged shortly label Nov 17, 2025
@vondele vondele closed this in b083049 Nov 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

simplification to be merged Will be merged shortly

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants