
ReentrantLock: wakeup a single task on unlock and add a short spin #56814

Merged

Conversation

andrebsguedes
Contributor

@andrebsguedes andrebsguedes commented Dec 12, 2024

I propose a change in the implementation of the ReentrantLock to improve its overall throughput for short critical sections and fix the quadratic wake-up behavior where each unlock schedules all waiting tasks on the lock's wait queue.

This implementation follows the same principles as the Mutex in the parking_lot Rust crate, which is based on WebKit's WTF::ParkingLot class. Only the basic working principle is implemented here; further improvements such as eventual fairness will be proposed separately.

The gist of the change is that we add one extra state to the lock, essentially going from:

0x0 => The lock is not locked
0x1 => The lock is locked by exactly one task. No other task is waiting for it.
0x2 => The lock is locked and some other task tried to lock but failed (conflict)

To:

# PARKED_BIT | LOCKED_BIT | Description
#     0      |     0      | The lock is not locked, nor is anyone waiting for it.
# -----------+------------+------------------------------------------------------------------
#     0      |     1      | The lock is locked by exactly one task. No other task is
#            |            | waiting for it.
# -----------+------------+------------------------------------------------------------------
#     1      |     0      | The lock is not locked. One or more tasks are parked.
# -----------+------------+------------------------------------------------------------------
#     1      |     1      | The lock is locked by exactly one task. One or more tasks are
#            |            | parked waiting for the lock to become available.
#            |            | In this state, PARKED_BIT is only ever cleared when the cond_wait lock
#            |            | is held (i.e. on unlock). This ensures that
#            |            | we never end up in a situation where there are parked tasks but
#            |            | PARKED_BIT is not set (which would result in those tasks
#            |            | potentially never getting woken up).

In the current implementation we must schedule all waiting tasks in order to cause a conflict (state 0x2), because on unlock we only notify tasks if the lock is in the conflict state. As a result, with high contention and a short critical section, the tasks are effectively spinning in the scheduler queue.

With the extra state, the proposed implementation has enough information to know whether there are other tasks to notify, which means we can always wake one task at a time while preserving the optimized path of not notifying when no tasks are waiting. To improve throughput for short critical sections, we also introduce a bounded amount of spinning before attempting to park.
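
To make the wake-one idea concrete, here is a minimal sketch of what the unlock path looks like under this scheme. This is not the actual base/lock.jl code: the type, constants, and ordering of operations are simplified for illustration, and reentrancy is ignored.

```julia
# Toy illustration of the two-bit scheme (names are invented; see the
# state table above for the meaning of the bits).
const LOCKED_BIT = 0x01
const PARKED_BIT = 0x02

mutable struct ToyLock
    @atomic state::UInt8
    const cond_wait::Threads.Condition  # parked tasks wait here
    ToyLock() = new(0x00, Threads.Condition())
end

function unlock_sketch!(l::ToyLock)
    # Fast path: nobody is parked, so just clear LOCKED_BIT and return
    # without ever touching cond_wait (no notification at all).
    (; success) = @atomicreplace l.state LOCKED_BIT => 0x00
    success && return nothing
    # Slow path: PARKED_BIT is set. Hold cond_wait's internal lock so the
    # bit and the wait queue stay consistent, then wake exactly ONE task.
    lock(l.cond_wait)
    try
        # Keep PARKED_BIT only if tasks remain parked after waking one;
        # this is why the bit is only cleared while cond_wait is held.
        # (Peeks at an internal field; fine for a sketch.)
        remaining = length(l.cond_wait.waitq) - 1
        @atomic l.state = remaining > 0 ? PARKED_BIT : 0x00
        notify(l.cond_wait; all=false)  # wake a single waiter, not all of them
    finally
        unlock(l.cond_wait)
    end
    return nothing
end
```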

Results

Not spinning on the scheduler queue greatly reduces the CPU utilization of the following example:

function example()
    lock = ReentrantLock()
    @sync begin
        for i in 1:10000
            Threads.@spawn begin
                @lock lock begin
                    sleep(0.001)
                end
            end
        end
    end
end


@time example()

Current:

28.890623 seconds (101.65 k allocations: 7.646 MiB, 0.25% compilation time)

[screenshot: CPU utilization, current implementation]

Proposed:

22.806669 seconds (101.65 k allocations: 7.814 MiB, 0.35% compilation time)

[screenshot: CPU utilization, proposed implementation]

In a micro-benchmark where 8 threads contend for a single lock with a very short critical section we see a ~2x improvement.
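
The exact harness behind these numbers is not included in the PR; a minimal micro-benchmark of this shape could look like the sketch below (illustrative names and a fixed run duration, with one counter per task, summed at the end).

```julia
# Rough sketch of a lock-contention micro-benchmark: `ntasks` tasks hammer
# a single ReentrantLock with a very short critical section for a fixed
# duration, then we report how many acquisitions each task managed.
function contended_lock_benchmark(; ntasks::Int = 8, duration::Float64 = 5.0)
    lk = ReentrantLock()
    counts = zeros(Int, ntasks)
    stop = Threads.Atomic{Bool}(false)
    @sync begin
        for i in 1:ntasks
            Threads.@spawn begin
                n = 0
                while !stop[]
                    @lock lk begin
                        n += 1          # very short critical section
                    end
                end
                counts[i] = n           # each task writes its own slot
            end
        end
        sleep(duration)
        stop[] = true
    end
    println(counts)
    println("Total iterations: ", sum(counts))
    return counts
end
```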

Current:

8-element Vector{Int64}:
 6258688
 5373952
 6651904
 6389760
 6586368
 3899392
 5177344
 5505024
Total iterations: 45842432

Proposed:

8-element Vector{Int64}:
 12320768
 12976128
 10354688
 12845056
  7503872
 13598720
 13860864
 11993088
Total iterations: 95453184

EDIT: Originally, in the uncontended scenario the extra bookkeeping caused a ~10% throughput reduction. I have since reverted _trylock to the simple case to recover the uncontended throughput, and now both implementations are in the same ballpark (without hurting the numbers above).

In the uncontended scenario:

Current:

Total iterations: 236748800

Proposed:

Total iterations: 237699072

Closes #56182

@oscardssmith oscardssmith added the performance (Must go faster) and multithreading (Base.Threads and related functionality) labels Dec 12, 2024
@kpamnany kpamnany requested a review from vtjnash December 12, 2024 17:13
base/lock.jl Outdated
# Instead, the backoff is simply capped at a maximum value. This can be
# used to improve throughput in `compare_exchange` loops that have high
# contention.
@inline function spin(iteration::Int)
Member

is this better than just calling yield?

Contributor Author

In this case we do not want to leave the core, but instead just busy-wait in-core for a small number of iterations before we attempt the compare_exchange again. This way, if the critical section of the lock is short enough, we have a chance to acquire the lock without paying for an OS thread context switch (or a Julia scheduler task switch, if you mean Base.yield).
This is the same strategy employed by Rust here.
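
For reference, the kind of in-core backoff being described is roughly the following sketch. The names and the cap are illustrative, not the PR's actual code, and (as discussed further down) the PR ultimately switched to a bounded number of Base.yield calls instead.

```julia
# Sketch of a capped, exponential, in-core backoff for a compare_exchange
# retry loop (in the spirit of the Rust strategy linked above).
const MAX_PAUSE_HINTS = 64  # illustrative cap

@inline function spin_backoff(iteration::Int)
    # 2, 4, 8, ... pause hints per retry, capped so we never spin unboundedly.
    n = min(1 << min(iteration, 6), MAX_PAUSE_HINTS)
    for _ in 1:n
        ccall(:jl_cpu_pause, Cvoid, ())  # CPU pause hint; stays on the core
    end
end

# Typical use:
#   iteration = 0
#   while !trylock_fastpath(lock)   # e.g. a failed compare_exchange
#       spin_backoff(iteration += 1)
#   end
```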

Member

I meant Base.yield. I think we're in a different situation than Rust, since we have M:N threading and a user-mode scheduler, which Rust doesn't.

Contributor Author

Yes, I am aware of the differences but the one that matters most in this case is that Julia has a cooperative scheduler. This means we have a tradeoff between micro-contention throughput (where we want to stay in-core) and being nice to the other tasks (by calling Base.yield).

So I wrote a benchmark that measures the total amount of forward progress made both by the tasks participating in locking and by other unrelated tasks, to see where we reach a good balance in this tradeoff. It turns out that yielding to the Julia scheduler a bounded number of times works well and does not suffer from the pathological case of always spinning in the scheduler (the first example in the PR description).
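
In code, this "limited scheduler spin" amounts to something like the sketch below. The callbacks and the MAX_SPIN_ITERS constant are illustrative placeholders, not the identifiers actually used in the PR.

```julia
# Sketch of bounded scheduler spinning before parking: the acquiring task
# yields to the Julia scheduler a limited number of times, and only then
# parks on the lock's condition variable (setting PARKED_BIT).
const MAX_SPIN_ITERS = 40  # illustrative bound

function acquire_with_spin(trylock_once::Function, park::Function)
    iteration = 0
    while !trylock_once()          # e.g. a compare_exchange on the state byte
        if iteration < MAX_SPIN_ITERS
            iteration += 1
            Base.yield()           # stay cooperative: let other tasks run
        else
            park()                 # set PARKED_BIT and wait to be woken
            iteration = 0          # woken up: resume spinning briefly
        end
    end
    return nothing
end
```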

I will update the code and post benchmark results soon-ish (my daughter arrived yesterday!).

Thanks @gbaraldi for also working on benchmarks; maybe I can contribute this new benchmark to your LockBench.jl.

Member

@andrebsguedes It would be lovely to add more benchmarks to it. Having a suite of benchmarks that stress locks in different ways would be great.

@adienes
Contributor

adienes commented Dec 12, 2024

(this PR is fantastically written! professional, comprehensive, and easy to follow 👏👏)

Member

@vchuravy vchuravy left a comment

Thanks! Very well written PR.

@gbaraldi
Member

Another point of comparison could be https://github.com/kprotty/usync, which is just a "normal" lock.

@gbaraldi
Member

I wrote https://github.com/gbaraldi/LockBench.jl and there it seems to be a good upgrade.

@gbaraldi
Member

Master: [benchmark plot]
PR: [benchmark plot]

@andrebsguedes
Contributor Author

andrebsguedes commented Dec 24, 2024

I updated the proposal to reflect the changes that came from also tuning for fairness towards non-locking tasks.

The following benchmark introduces competing tasks that perform a unit of work and call Base.yield in a loop to understand how unrelated tasks are affected by locking tasks.
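
(The harness itself is not reproduced here; a simplified sketch of what such a benchmark could look like follows, with illustrative names and raw counts instead of the per-task rates reported below.)

```julia
# Sketch of the mixed benchmark: `nlock` tasks repeatedly take a single
# ReentrantLock with a very short critical section while `nyield` unrelated
# tasks do a unit of work and call Base.yield in a loop. Everything runs
# for a fixed duration and the per-task counts are reported at the end.
function locking_vs_yielding(; nlock::Int = 8, nyield::Int = 8, duration::Float64 = 5.0)
    lk = ReentrantLock()
    lock_counts = zeros(Int, nlock)
    yield_counts = zeros(Int, nyield)
    stop = Threads.Atomic{Bool}(false)
    @sync begin
        for i in 1:nlock
            Threads.@spawn begin
                n = 0
                while !stop[]
                    @lock lk n += 1        # short critical section
                end
                lock_counts[i] = n
            end
        end
        for i in 1:nyield
            Threads.@spawn begin
                n = 0
                while !stop[]
                    n += 1                 # the "unit of work"
                    Base.yield()           # hand control back to the scheduler
                end
                yield_counts[i] = n
            end
        end
        sleep(duration)
        stop[] = true
    end
    println("Locking:  ", lock_counts)
    println("Yielding: ", yield_counts)
    println("Total useful work: ", sum(lock_counts) + sum(yield_counts))
end
```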

Current implementation (thundering herd; 8 threads, 8 tasks locking, 8 tasks yielding, short critical section)

Locking:
  488.60 kHz
  488.60 kHz
  482.49 kHz
  486.56 kHz
  482.49 kHz
  486.56 kHz
  484.52 kHz
  488.60 kHz
Yielding:
  348.12 kHz
  348.12 kHz
  348.12 kHz
  348.12 kHz
  348.12 kHz
  348.12 kHz
  348.12 kHz
  348.12 kHz
Total useful work: 10625024

Initial proposal (limited CPU spin, 8 threads, 8 tasks locking, 8 tasks yielding, short critical section)

Locking:
  916.67 kHz
  937.04 kHz
  910.56 kHz
  916.67 kHz
  918.71 kHz
  918.71 kHz
  926.86 kHz
  928.90 kHz
Yielding:
  38.70 kHz
  38.70 kHz
  38.70 kHz
  40.74 kHz
  38.70 kHz
  40.74 kHz
  40.74 kHz
  38.70 kHz
Total useful work: 15144960

New proposal (limited scheduler spin, 8 threads, 8 tasks locking, 8 tasks yielding, short critical section)

Locking:
  737.73 kHz
  749.96 kHz
  731.62 kHz
  739.77 kHz
  729.58 kHz
  735.69 kHz
  733.66 kHz
  735.69 kHz
Yielding:
  401.47 kHz
  403.51 kHz
  399.43 kHz
  403.51 kHz
  401.47 kHz
  403.51 kHz
  401.47 kHz
  401.47 kHz
Total useful work: 15077376

Some observations:

  • The current implementation wastes more cycles in the scheduler as it is the one with the least amount of useful work.
  • The initial proposal has the best micro-contention throughput but to achieve this it favors the locking tasks too much.
  • The new proposal strikes the best balance between the locking and the yielding tasks while still being an improvement in micro-contention throughput and in the amount of useful work performed.
  • The new proposal still performs as well as the previous one in the simple sleep example, which means the limited spin in the scheduler is benign.

Contributor

@kpamnany kpamnany left a comment

This looks good to me. The new benchmark is a good idea. @andrebsguedes: if you don't have enough time to add it to the lock benchmark suite, please share it with @gbaraldi to do so if/when he gets time!

Nice work!

@kpamnany
Contributor

@andrebsguedes: CI failures seem related: first, second.

@kpamnany
Contributor

CI shows only upload failures.

@vtjnash: this is good to go IMO. Can you take a look?

@kpamnany kpamnany merged commit 4b2f4d9 into JuliaLang:master Dec 30, 2024
6 of 7 checks passed
@ancapdev
Contributor

We run Julia on very high core count machines and I'm pleased to see progress on scalability issues like this. Would the same kind of pattern be applicable to Channel condition notifications? Channels currently wake all waiters on put, which can lead to massive scheduler churn (at the OS level) to wake threads that have no work available.

kpamnany pushed a commit to RelationalAI/julia that referenced this pull request Dec 30, 2024
@kpamnany
Contributor

kpamnany commented Dec 30, 2024

We will take a look at those as well. From a quick look, this fix should also address Channel puts?

stevengj pushed a commit that referenced this pull request Jan 2, 2025
@ancapdev
Contributor

ancapdev commented Jan 2, 2025

We will take a look at those as well. From a quick look, this fix should also address Channel puts?

I'm not following how, could you explain, please?

@oscardssmith
Member

oscardssmith commented Jan 2, 2025

Channel uses a ReentrantLock, so fixing this fixes Channel put.

@ancapdev
Contributor

ancapdev commented Jan 2, 2025

Channel uses a ReentrantLock, fixing this fixes channel put.

For the lock, yes, but it notifies all waiters of the condition variable (https://github.com/JuliaLang/julia/blob/v1.11.2/base/channels.jl#L386):

# notify all, since some of the waiters may be on a "fetch" call.
notify(c.cond_take, nothing, true, false)

@oscardssmith
Member

wait, what the hell is this doing? why would we notify waiters of the lock before we unlock? Isn't that just going to cause a bunch of contention for no reason?

@ancapdev
Contributor

ancapdev commented Jan 3, 2025

wait, what the hell is this doing? why would we notify waiters of the lock before we unlock? Isn't that just going to cause a bunch of contention for no reason?

https://en.wikipedia.org/wiki/Monitor_(synchronization)#Condition_variables_2

kpamnany added a commit to RelationalAI/julia that referenced this pull request Jan 6, 2025