i#6938 sched migrate: Separate run queue per output #6985

Merged: 19 commits, Sep 17, 2024

Conversation

@derekbruening (Contributor) commented Sep 13, 2024:

Removes the global runqueue and global sched_lock_, replacing them with per-output runqueues. Each runqueue's lock lives inside a new struct, input_queue_t, which clearly delineates what the lock protects. The unscheduled queue remains global and has its own lock as another input_queue_t. The output fields .active and .cur_time are now atomics, as they are accessed from other outputs yet are separate from the queue and its mutex.
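The lock-inside-the-queue pattern described above can be sketched in a few lines; the struct and field names follow the PR text, but the bodies are illustrative simplifications, not the scheduler's actual code:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>

// A queue bundled with the mutex that protects it, making the lock's
// protection scope explicit (element type simplified to int here).
struct input_queue_t {
    std::mutex lock;       // Protects only 'queue'.
    std::deque<int> queue; // Runqueue of input (trace stream) indices.
};

// Per-output state: fields read by other outputs sit outside the mutex
// as atomics, since they are separate from the queue the mutex protects.
struct output_info_t {
    input_queue_t runqueue;
    std::atomic<bool> active{true};
    std::atomic<uint64_t> cur_time{0};
};
```

Bundling the mutex with the data it guards makes it harder to forget which lock covers which field, which matters once there is one runqueue (and one lock) per output.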

Narrows the runqueue lock usage, avoiding holding locks across the
larger functions. Establishes a lock ordering convention: input >
output > unsched.

The removal of the global sched_lock_ avoids the lock contention seen on fast analyzers (the original design targeted heavyweight simulators). On a large internal trace with hundreds of threads on >100 cores, 41% of acquisition attempts on the global queue's lock collided:

    [scheduler] Schedule lock acquired     :  72674364
    [scheduler] Schedule lock contended    :  30144911

With separate runqueues, fewer than 1 in 10,000 attempts collide:

    [scheduler] Stats for output #0
    <...>
    [scheduler]   Runqueue lock acquired             :  34594996
    [scheduler]   Runqueue lock contended            :        29
    [scheduler] Stats for output #1
    <...>
    [scheduler]   Runqueue lock acquired             :  51130763
    [scheduler]   Runqueue lock contended            :        41
    <...>
    [scheduler]   Runqueue lock acquired             :  46305755
    [scheduler]   Runqueue lock contended            :        44
    [scheduler] Unscheduled queue lock acquired      :     27834
    [scheduler] Unscheduled queue lock contended     :       273
    $ egrep 'contend' OUT | awk '{n+=$NF}END{ print n}'
    11528
    $ egrep 'acq' OUT | awk '{n+=$NF}END{ print n}'
    6814820713
    (gdb) p 11528/6814820713.*100
    $1 = 0.00016916072315753086

Before an output goes idle, it attempts to steal work from another output's runqueue. A new input option controls the migration threshold, to avoid moving inputs between outputs too frequently. The stealing is done inside eof_or_idle(), which now returns a new internal status code, STATUS_STOLE, so the various callers know to read the next record.
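A minimal sketch of that steal-before-idle flow (the STATUS_STOLE name comes from the PR; the helper function, its parameters, and the threshold check are hypothetical simplifications):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Illustrative status codes; STATUS_STOLE mirrors the PR's new internal code.
enum status_t { STATUS_OK, STATUS_IDLE, STATUS_STOLE };

// Before going idle, scan other outputs' runqueues and take one input,
// but only if it has been on its current output for at least
// 'migration_threshold' time units, to avoid migrating too frequently.
status_t
try_steal(std::vector<std::deque<int>> &runqueues, int my_output,
          const std::vector<int> &time_on_output, int migration_threshold)
{
    for (std::size_t i = 0; i < runqueues.size(); ++i) {
        if (static_cast<int>(i) == my_output)
            continue;
        std::deque<int> &victim = runqueues[i];
        if (!victim.empty() &&
            time_on_output[victim.back()] >= migration_threshold) {
            runqueues[my_output].push_back(victim.back());
            victim.pop_back();
            // A caller seeing STATUS_STOLE must go read the next record
            // rather than report idle.
            return STATUS_STOLE;
        }
    }
    return STATUS_IDLE;
}
```

The distinct return code is the key design point: callers of eof_or_idle() previously assumed "no work" meant idle or EOF, so a new status is needed to tell them work just arrived.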

Adds periodic rebalancing, with the period controlled by another new input option. Adds flexible_queue_t::back() so rebalancing does not take from the front of the queues.
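A toy rebalancing pass in that spirit might look like the following; a plain std::deque stands in for flexible_queue_t, and the even-out-to-the-average policy is an assumption for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Move inputs from over-full runqueues to under-full ones until each is
// near the average size.  Taking from the back (cf. the new
// flexible_queue_t::back()) leaves the inputs about to run undisturbed.
void
rebalance(std::vector<std::deque<int>> &runqueues)
{
    std::size_t total = 0;
    for (const auto &q : runqueues)
        total += q.size();
    // Ceiling of the average: no queue needs more than this.
    std::size_t target = (total + runqueues.size() - 1) / runqueues.size();
    for (auto &src : runqueues) {
        for (auto &dst : runqueues) {
            if (&src == &dst)
                continue;
            while (src.size() > target && dst.size() < target) {
                dst.push_back(src.back());
                src.pop_back();
            }
        }
    }
}
```

Pulling from the back rather than the front is why back() was added: the front entries are the ones the donor output is about to schedule, so disturbing them would churn the near-term schedule.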

Updates the path where an output goes inactive, promoting everything unscheduled, to use the new rebalancing.

Makes output_info_t.active atomic as it is read by other outputs during stealing and rebalancing.

Adds statistics on the stealing and rebalancing instances.

Updates all of the unit tests, many of which now have different resulting schedules.

Adds a new unit test targeting queue rebalancing.

Tested under ThreadSanitizer for race detection on a relatively large trace on 90 cores.

Issue: #6938

Refactors the scheduler's time-oriented options to be based on
simulated microseconds, rather than being unitless and each requiring
separate scaling for the simulator's clock.

Deprecates these scheduler_options_t fields, replacing them with new versions:
+ quantum_duration => quantum_duration_{instructions,us}
+ block_time_scale => block_time_multiplier
+ block_time_max => block_time_max_us

Adds a new "time_units_per_us" which is the single place where a
simulator sets the relationship between the value passed to "cur_time"
in next_record() and simulated microseconds.  The aforementioned
fields are all compared to cur_time multiplied by time_units_per_us.
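The conversion described above can be sketched as follows; the field names follow the PR text, but the struct and helper are illustrative, not the real scheduler API:

```cpp
#include <cassert>
#include <cstdint>

// One place relates the simulator's cur_time units to simulated time.
struct sched_time_options_t {
    // How many next_record() cur_time units equal one simulated
    // microsecond, e.g. 1000. if the simulator passes nanoseconds.
    double time_units_per_us = 1000.;
    uint64_t quantum_duration_us = 5000; // Quantum in simulated microseconds.
};

// A time-based option is compared against elapsed cur_time after
// converting the option's microseconds into cur_time units.
bool
quantum_expired(const sched_time_options_t &opts, uint64_t start_time,
                uint64_t cur_time)
{
    return (cur_time - start_time) >= static_cast<uint64_t>(
               opts.quantum_duration_us * opts.time_units_per_us);
}
```

With a single conversion factor, each new time-based option only needs to be expressed in microseconds, instead of every option carrying its own clock-dependent scaling.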

This is a prelude to adding yet more time-based options for the
forthcoming scheduler additions with migration thresholds and
rebalance periods.

Adds legacy support for binary compatibility.  Recompiling will result
in error messages prompting an update to the new fields.
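One common way such legacy support is wired up is a one-time mapping at initialization; this is purely a sketch under that assumption, and the real mapping and defaults live in the scheduler and may differ:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative options struct with one deprecated field and its replacement.
struct options_t {
    uint64_t quantum_duration = 0;    // Deprecated: unitless legacy field.
    uint64_t quantum_duration_us = 0; // New: simulated microseconds.
    double time_units_per_us = 1.;
};

// At init time, if only the deprecated field was set (e.g. by a binary
// built against the old struct), translate it into the new field.
void
apply_legacy_fields(options_t &opts)
{
    if (opts.quantum_duration != 0 && opts.quantum_duration_us == 0) {
        opts.quantum_duration_us = static_cast<uint64_t>(
            opts.quantum_duration / opts.time_units_per_us);
    }
}
```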

Adds a unit test of legacy support.

Issue: #6938
@brettcoon (Contributor) left a comment:
I'm still working through the changes, but I'm not going to finish today so I'll share what I have so far.

@derekbruening (Contributor, Author):

Note that this depends on #6980 which maybe you missed

@brettcoon (Contributor) left a comment:

More comments and questions.

Base automatically changed from i6938-time-units to master September 17, 2024 00:15
@brettcoon (Contributor) left a comment:

Thanks for the clarifications.

@derekbruening (Contributor, Author):

The a64 failure is the before-a-bundle assert (#3320).

@derekbruening derekbruening merged commit f1b2d54 into master Sep 17, 2024
16 of 17 checks passed
@derekbruening derekbruening deleted the i6938-per-core-runqueue branch September 17, 2024 15:17
derekbruening added a commit that referenced this pull request Sep 18, 2024
Fixes clang-tidy complaints from PR #6985 about a misspelling,
mismatched parameter names, and missing includes, and removes a stale
comment.

Issue: #6938
derekbruening added a commit that referenced this pull request Sep 21, 2024
PR #6985 removed the global sched_lock_ which used to synchronize
access to other output's record_index field, leaving a data
race (affecting only the rough timing across outputs where an output
waits if gets ahead of other outputs' timestamps).
Since only the index and not the recorded contents need
synchronization, we use an atomic rather than a mutex to coordinate.

Tested by running an internal test under ThreadSanitizer where a race
is reported without this fix and no race with the fix.

Issue: #6938
derekbruening added a commit that referenced this pull request Sep 26, 2024
Removes a duplicate option -sched_time_per_us which was accidentally
added by PR #6980 separately from the similar -sched_time_units_per_us
from PR #6985.

Issue: #6938
derekbruening added a commit that referenced this pull request Oct 22, 2024
Fixes a deadlock on replay introduced by PR #6985.  If a recorded
schedule has back-to-back entries with the same input (with a gap
where another thread runs that input in between of course) and it has
to wait for that other thread for its 2nd stint with that input, it
will deadlock as it holds the input's lock while it tries to get the
lock in set_cur_input().  That 2nd lock acquisition is what was added
by PR #6985: that's the source of the regression.

Tested on a 12-thread threadsig trace which hangs without the fix and
succeeds with the fix.  It is too large to turn into a convenient
regression test.

Issue: #7044