-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[libc++] Improve performance of std::atomic_flag on Windows #163524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[libc++] Improve performance of std::atomic_flag on Windows #163524
Conversation
On Windows 8 and above, the WaitOnAddress, WakeByAddressSingle and WakeByAddressAll functions allow efficient implementation of the C++20 wait and notify features of std::atomic_flag. These Windows functions have never been made use of in libc++, leading to very poor performance of these features on Windows platforms, as they are implemented using a spin loop with backoff, rather than using any OS thread signalling whatsoever. This change implements the use of these OS functions where available, falling back to the original implementation on Windows versions prior to 8. Fixes llvm#127221
|
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
DavidSpickett
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a libcxx expert but happened to see this on Discord.
Please include links in the PR description to the Windows documentation for these API calls. My memory is that Microsoft are pretty good at listing when and where they're available.
|
I allowed the CI to run as well. There is a Windows build there and that's probably Windows 10, >=8 for sure. |
|
I observe that just substituting
Though taking advantage of varying sizes seems complicated, so maybe not worth doing. |
What is the overhead of dispatch? Perhaps it should align with std::chrono, only doing so when _WIN32_WINNT < _WIN32_WINNT_WIN8? |
|
@llvm/pr-subscribers-libcxx Author: Roger Sanders (RogerSanders) ChangesOn Windows 8 and above, the WaitOnAddress, WakeByAddressSingle and WakeByAddressAll functions allow efficient implementation of the C++20 wait and notify features of std::atomic_flag. These Windows functions have never been made use of in libc++, leading to very poor performance of these features on Windows platforms, as they are implemented using a spin loop with backoff, rather than using any OS thread signalling whatsoever. This change implements the use of these OS functions where available, falling back to the original implementation on Windows versions prior to 8. Relevant API docs from Microsoft: Fixes #127221 Full diff: https://github.com/llvm/llvm-project/pull/163524.diff 1 Files Affected:
diff --git a/libcxx/src/atomic.cpp b/libcxx/src/atomic.cpp
index b214ba1fd11c0..4001e48f1ec25 100644
--- a/libcxx/src/atomic.cpp
+++ b/libcxx/src/atomic.cpp
@@ -41,6 +41,10 @@
// OpenBSD has no indirect syscalls
# define _LIBCPP_FUTEX(...) futex(__VA_ARGS__)
+#elif defined(_WIN32)
+
+# include <windows.h>
+
#else // <- Add other operating systems here
// Baseline needs no new headers
@@ -101,6 +105,40 @@ static void __libcpp_platform_wake_by_address(__cxx_atomic_contention_t const vo
_umtx_op(const_cast<__cxx_atomic_contention_t*>(__ptr), UMTX_OP_WAKE, __notify_one ? 1 : INT_MAX, nullptr, nullptr);
}
+#elif defined(_WIN32)
+
+static void
+__libcpp_platform_wait_on_address(__cxx_atomic_contention_t const volatile* __ptr, __cxx_contention_t __val) {
+ // WaitOnAddress was added in Windows 8 (build 9200)
+ static auto __pWaitOnAddress = reinterpret_cast<BOOL(WINAPI*)(volatile void*, PVOID, SIZE_T, DWORD)>(
+ GetProcAddress(GetModuleHandleW(L"api-ms-win-core-synch-l1-2-0.dll"), "WaitOnAddress"));
+ if (__pWaitOnAddress != nullptr) {
+ __pWaitOnAddress(const_cast<__cxx_atomic_contention_t*>(__ptr), &__val, sizeof(__val), INFINITE);
+ } else {
+ __libcpp_thread_poll_with_backoff(
+ [=]() -> bool { return !__cxx_nonatomic_compare_equal(__cxx_atomic_load(__ptr, memory_order_relaxed), __val); },
+ __libcpp_timed_backoff_policy());
+ }
+}
+
+static void __libcpp_platform_wake_by_address(__cxx_atomic_contention_t const volatile* __ptr, bool __notify_one) {
+ if (__notify_one) {
+ // WakeByAddressSingle was added in Windows 8 (build 9200)
+ static auto __pWakeByAddressSingle = reinterpret_cast<void(WINAPI*)(PVOID)>(
+ GetProcAddress(GetModuleHandleW(L"api-ms-win-core-synch-l1-2-0.dll"), "WakeByAddressSingle"));
+ if (__pWakeByAddressSingle != nullptr) {
+ __pWakeByAddressSingle(const_cast<__cxx_atomic_contention_t*>(__ptr));
+ }
+ } else {
+ // WakeByAddressAll was added in Windows 8 (build 9200)
+ static auto __pWakeByAddressAll = reinterpret_cast<void(WINAPI*)(PVOID)>(
+ GetProcAddress(GetModuleHandleW(L"api-ms-win-core-synch-l1-2-0.dll"), "WakeByAddressAll"));
+ if (__pWakeByAddressAll != nullptr) {
+ __pWakeByAddressAll(const_cast<__cxx_atomic_contention_t*>(__ptr));
+ }
+ }
+}
+
#else // <- Add other operating systems here
// Baseline is just a timed backoff
|
The overhead is effectively zero - GetProcAddress is used internally to resolve statically linked exported functions at runtime by LoadLibrary, and just like that, we're doing it once per module and caching the result in a static function pointer, all we did is shift the timing, so overhead isn't a concern. If we were to try and do the same thing as std::chrono, the real problem is the linking - there's a separate Synchronization.lib to pull in for these exports, we don't get them in kernel32.lib. If libc++ was being linked statically, that'd push the dependency down to the linking module I believe, unless there's a way around that being used in libc++ currently that I'm not aware of. Happy to do things differently if there's an established way to do this already, but keeping it all in one place and using GetProcAddress feels simplest to me right now. |
Sorry, missed your comment earlier. That's a good point, although we'd still need a loop above WaitOnAddress checking the value, since as per C++20 the wait functions on atomics must not unblock spuriously, while WaitOnAddress can, so it's not quite a direct implementation, but still pretty close. I'd personally put this in the realm of a possible future enhancement under a different job though. 99% of the value is obtained by this simpler change, although it is a pity std::atomic_flag in particular still uses a surrogate address rather than using the actual address directly. Addressing this would involve an ABI change though, whereas this change is fully ABI compatible with the existing implementation. |
huixie90
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'd still need a loop above WaitOnAddress checking the value, since as per C++20 the wait functions on atomics must not unblock spuriously, while WaitOnAddress can
we already do that. Please check the header atomic_sync.h for implementation of ‘atomic/hatch/barrier’s wait function. Basically , we check the predicate (not binary equal) in a while loop. If wait time < 4us , we spin, otherwise, we call platform dependent code (eg futex wait on Linux ). So it already handles spurious wake up in the platform code.
WaitOnAddress can wait on 1, 2, 4, or 8 bytes in both x86-64 and 32-bit x86, so can implement most atomic::wait operations directly.
I would suggest to get familiar how Libc++ dispatch the wait based on the type. see contention_t.h. Unfortunately the current trunk behaviour is not optimal. The way it works is that, each platform defines a contention_t, (eg int32_t on Linux) if the atomic value type happens to be the same, it goes to the happy path and call platform wait directly. Otherwise, if the type does not match exactly, eg uint32_t, it will go through a global contention table, and using the proxy atomic to call the platform code. Which is ineffient. Unfortunately allowing dispatch based on the size is an ABI break (in the sense that it changed the side effect and the post condition of a function). In short, I would highly recommend to implement efficiently from the very beginning for windows.
I have a refactoring of the whole thing in progress
#161086
I would suggest wait until we land that first, so you can get better wait strategy from the very beginning, without needing to fighting with the ABI macros
I'm aware of the spin loop, I was using that point to illustrate why WaitOnAddress couldn't implement the functionality "directly". I did look into the contention_t type and saw the ABI limitations here, and prepared this patch deliberately and specifically to solve the main issue without causing an ABI break, believing (perhaps incorrectly) that it would be simpler and easier to get such a fix into the actual distribution in a timely manner. I ran into this break in a real project, and I'd like to see a pathway to get the fix deployed quickly if possible, even into a minor update for an existing major version if possible, so that this project (and others in similar situations) could benefit from the issue here being addressed in the near term. When I was making this change, I could see the potential for a more major ABI-breaking change here, which exposes the ability to wait on memory addresses of different sizes directly. This sounds like what you've done, and it would certainly be a better end result overall. Even without that though, a more targeted fix would extract 99% of the value. The current implementation is very poor on Windows, and I was seeing a 3x real world performance degradation with a simple case of synchronizing two threads under the current implementation - crippling for my purposes. This change brings that back into comparable timing with the MSVC STL implementation. Your changes will deal with some more edge cases better, and shave off a few extra cycles, but most of the value can be obtained with a simpler change up front. I suppose what I'm saying is, I see the value in the change you're advancing, but I disagree, I think maybe the order should be reversed here? What about putting my simple change through first, which is targeted, ABI compatible, and already passing all tests and ready to go (perhaps with one or two minor commenting/naming changes as raised by other reviewers), merge it into maintenance branches for current releases if agreed, and your change comes in and builds on top of it for the next major release, adding support for waiting on addresses of 1, 2, or 4 bytes directly, without a surrogate address being required? |
|
Since this is a new performance improvement and not a regression or fix in any way, I don't think this will get back-ported to previous releases. Every change has the potential for breaking things, and the aim of the point releases is to remove as many bugs as possible. New features should only be added in major releases unless there is a really compelling reason otherwise. |
If that's the policy so be it. I'd argue the performance here might be bad enough to consider it a defect. I was seeing my program run 3x slower vs MSVC STL, and I only had two threads attempting to synchronize using a pair of std::atomic_flag members. This fix brings it back to rough parity - a 3x speedup. That was my motivation to submit this. I can live with it being a major update though. I've already had to mitigate it on the application side. |
philnik777
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with the changes, but let's wait for others to comment.
mstorsjo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No objections from me, looks reasonable.
huixie90
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this change is not an ABI break and we can make incremental improvement. I am fine with the change.
| // WaitOnAddress was added in Windows 8 (build 9200) | ||
| static auto wait_on_address = reinterpret_cast<BOOL(WINAPI*)(volatile void*, PVOID, SIZE_T, DWORD)>( | ||
| GetProcAddress(GetModuleHandleW(L"api-ms-win-core-synch-l1-2-0.dll"), "WaitOnAddress")); | ||
| if (wait_on_address != nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way (e.g. a macro) to check if the function is available at compile time? we don't do this sort of check at runtime on other systems
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know which OS the resulting binary will be run on, so no, no such thing, except by dropping support for windows 7 completely.
Which may be a reasonable move - ms-stl already did - but that would require a project-wide RFC, not just a mundane-looking PR to an obscure module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libc++ seems to have dropped Windows 7 support (for modern toolchains where _WIN32_WINNT > _WIN32_WINNT_WIN8 ), but avoiding the dependency on synchronization.lib does sound very meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libc++ seems to have dropped Windows 7 support
That's not true. Libc++ does support Windows 7 - but when building, _WIN32_WINNT indicates the lowest version you require at runtime.
So if your toolchain defaults to a higher version, you need to manually define _WIN32_WINNT to the lowest version you want to support. llvm-mingw builds with _WIN32_WINNT set to Windows 7.
So for things like this, like the example in chrono.cpp, we can choose to either load symbols dynamically, if targeting an older version, or link directly, if we're targeting a new enough version. But in this case, the PR author pointed out that it would require linking in another import library if we'd link directly, which both is extra work on the CMake level, and also translates to extra work for users who statically link in libc++. So as the overhead is near-zero with only loading it the first time it is called, the PR author thought it was simplest to always go with the dynamic loading approach, which sounds reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not true. Libc++ does support Windows 7 - but when building,
_WIN32_WINNTindicates the lowest version you require at runtime.
Yes, that's exactly what I meant. Considering that both MinGW and Windows SDK default to _WIN32_WINNT_WIN10.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if they currently aren't using the latest LLVM, I don't think that logic should be extended so that they can't update to the latest one, for things they are working on. Whether the next release is ready for shipping right now or not seems beside the point - as long as that branch is worked on, and is built for a Win7 baseline, IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, considering that version 4.0 will not be released anytime soon, absolutely NO SUPPORT for it from the VideoLAN Team before its release, and that VideoLAN also provides GCC builds, I suspect that there are likely few or no end users who will be affected by LLVM's decision to drop Windows 7 support. In this case, LLVM's downstream users should perhaps reconsider their platform support policies rather than passing the burden onto LLVM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we figure out how to get synchronization.a into libc++.a? If no, we need GetProcAddress even on win11, and there is no additional support burden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can rely on llvm-lib, it's very simple, but if we have to be compatible with ar, it becomes very ugly.
if(LIBCXX_ENABLE_STATIC AND MINGW)
find_library(libsynchronization synchronization REQUIRED)
add_custom_command(
TARGET cxx_static
POST_BUILD
COMMAND llvm-lib /out:$<TARGET_FILE:cxx_static> $<TARGET_FILE:cxx_static> ${libsynchronization}
)
endif()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VLC 4.0 will target Windows 7 as minimal, yes.
VLC 4.0 will be LLVM-focused now, this being the default on macOS/iOS/visionOS, Android, and all Windows versions. Only some Linux versions will stay on GCC, because it is the default on some distributions.
ldionne
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting changes since I have a question.
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
| // WaitOnAddress was added in Windows 8 (build 9200) | ||
| static auto wait_on_address = reinterpret_cast<BOOL(WINAPI*)(volatile void*, PVOID, SIZE_T, DWORD)>( | ||
| GetProcAddress(GetModuleHandleW(L"api-ms-win-core-synch-l1-2-0.dll"), "WaitOnAddress")); | ||
| if (wait_on_address != nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this seems reasonably straightforward again.
Regarding "supports and not supports at the same time" - this is pretty much the same as how it works on macOS; if you do a default build of things, it sets the minimum deployment target to the latest/current version, but if you want to you can build with -mmacosx-version-min=<old> to target older things. If targeting an older version, potentially unavailable symbols are linked to weakly and one can make conditionals to check whether they exist before calling them.
And regarding bumping the minimum supported version on Windows; we have VLC as one fairly major user, who still are targeting Windows 7 in their latest version.
libcxx/src/atomic.cpp
Outdated
| // Attempt to locate the function in the API and return the result to the caller. Note that the NULL return from this | ||
| // method is documented as being interchangeable with nullptr. | ||
| // https://devblogs.microsoft.com/oldnewthing/20180307-00/?p=98175 | ||
| return GetProcAddress(module_handle.get(), function_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a unique_ptr anymore, so this .get() doesn't belong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I accidentally did a build test of that change on a clean repo, not my modified one. I've pushed a fix.
ldionne
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @mstorsjo Can we update the documentation to mention the versions of Windows that we support?
Sure, I can try to do that at some point. |
|
@RogerSanders Congratulations on having your first Pull Request (PR) merged into the LLVM Project! Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR. Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues. How to do this, and the rest of the post-merge process, is covered in detail here. If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again. If you don't get any reports, no action is required from you. Your changes are working as expected, well done! |
On Windows 8 and above, the WaitOnAddress, WakeByAddressSingle and WakeByAddressAll functions allow efficient implementation of the C++20 wait and notify features of std::atomic_flag. These Windows functions have never been made use of in libc++, leading to very poor performance of these features on Windows platforms, as they are implemented using a spin loop with backoff, rather than using any OS thread signalling whatsoever. This change implements the use of these OS functions where available, falling back to the original implementation on Windows versions prior to 8.
Relevant API docs from Microsoft:
https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitonaddress
https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-wakebyaddresssingle
https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-wakebyaddressall
Fixes #127221