Fix issue #418. #419
Conversation
To avoid entering the parking state too frequently under cache contention, this change adds a 1 ms sleep, retried 4 times, after the existing 'spin()' phase and before parking. Signed-off-by: Ping Zhao <[email protected]>
The benchmark numbers here are actually very misleading. Essentially, this benchmark measures how fast a thread can repeatedly acquire and release a lock. The best performance is obtained when only 1 thread is accessing the lock memory, so there are no cache conflicts, no need to fall back to the slow path, etc. This is the number you get when running the benchmark with only 1 thread.

When running the benchmark with multiple threads, you get the best performance in the same way: by ensuring a single thread can repeatedly lock/unlock without interference from other threads. The easiest way to achieve this is to put all other threads to sleep. The longer the sleep, the better the benchmark numbers.

The problem is that this only makes the benchmark look good on paper, and actually hurts real applications. Making a thread sleep for 1ms (~3000000 CPU cycles) if it misses a lock can have very noticeable impacts on application performance, especially for timing-sensitive code such as rendering or audio. Instead, it is better to park the thread so that it can be woken up by the OS as soon as the lock is free.
The Mutex is designed for multi-threaded data access; ideally a program would avoid data sharing entirely, but in practice some critical structures must be shared among multiple threads. Based on our experience, this lock benchmark effectively represents scenarios of lock contention, and the given case reflects Mutex behavior under high contention. While reducing the number of threads gives better results, the issue persists: performance remains lower than std::Mutex. I chose this case to better highlight the problem.

The performance drop of parking_lot::Mutex relative to std::Mutex occurs in the slow path. Before entering the parking state there should be a smooth transition, and sleeping appears to be a suitable way to transition between spin_wait, yield, and parking. The logic remains consistent with before: the slow path still attempts spin_wait and yield first. Only if those fail, and before entering the parking state, does it try sleeping to make the transition smoother. This is because waking a parked thread takes the OS significantly longer than returning from a short sleep. For that reason it won't hurt timing-sensitive programs; it helps them transition smoothly under high lock contention.
This change also fixes issue #338. Before:
After:
We have a video/audio streaming application built on tokio (which enables parking_lot by default). With parking_lot enabled, when we use wrk to benchmark the HTTP output of the streaming application, the application is CPU-bottlenecked, whereas after disabling parking_lot we can reach the numbers we anticipated. The benchmark results for parking_lot are very poor on our server (AMD EPYC 7502, Rocky Linux 9.3, kernel 5.14.0-362.8.1.el9_3.x86_64).
@dista Can you open a separate issue for this? Also, would you be able to test some modifications of parking_lot on your application to see how it affects CPU usage?