i#6938 sched migrate: Separate run queue per output #6985

Merged: 19 commits, Sep 17, 2024

Conversation

@derekbruening (Contributor) commented Sep 13, 2024:

Removes the global runqueue and global sched_lock_, replacing them with per-output runqueues. Each runqueue's lock lives inside a new struct, input_queue_t, which clearly delineates what the lock protects. The unscheduled queue remains global and has its own lock as another input_queue_t. The output fields .active and .cur_time are now atomics, as they are accessed from other outputs yet are separate from the queue and its mutex.
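The lock-inside-the-queue pattern described above can be sketched in a few lines; the struct and field names follow the PR text, but the bodies are illustrative simplifications, not the scheduler's actual code:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>

// A queue bundled with the mutex that protects it, making the lock's
// protection scope explicit (element type simplified to int here).
struct input_queue_t {
    std::mutex lock;       // Protects only 'queue'.
    std::deque<int> queue; // Runqueue of input (trace stream) indices.
};

// Per-output state: fields read by other outputs sit outside the mutex
// as atomics, since they are separate from the queue the mutex protects.
struct output_info_t {
    input_queue_t runqueue;
    std::atomic<bool> active{true};
    std::atomic<uint64_t> cur_time{0};
};
```

Bundling the mutex with the data it guards makes it harder to forget which lock covers which field, which matters once there is one runqueue (and one lock) per output.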

Narrows the runqueue lock usage, avoiding holding locks across the
larger functions. Establishes a lock ordering convention: input >
output > unsched.

The removal of the global sched_lock_ avoids the lock contention seen on fast analyzers (the original design targeted heavyweight simulators). On a large internal trace with hundreds of threads on >100 cores, 41% of acquisition attempts on the global queue's lock collided:

    [scheduler] Schedule lock acquired     :  72674364
    [scheduler] Schedule lock contended    :  30144911

With separate runqueues, fewer than 1 in 10,000 attempts collide:

    [scheduler] Stats for output #0
    <...>
    [scheduler]   Runqueue lock acquired             :  34594996
    [scheduler]   Runqueue lock contended            :        29
    [scheduler] Stats for output #1
    <...>
    [scheduler]   Runqueue lock acquired             :  51130763
    [scheduler]   Runqueue lock contended            :        41
    <...>
    [scheduler]   Runqueue lock acquired             :  46305755
    [scheduler]   Runqueue lock contended            :        44
    [scheduler] Unscheduled queue lock acquired      :     27834
    [scheduler] Unscheduled queue lock contended     :       273
    $ egrep 'contend' OUT | awk '{n+=$NF}END{ print n}'
    11528
    $ egrep 'acq' OUT | awk '{n+=$NF}END{ print n}'
    6814820713
    (gdb) p 11528/6814820713.*100
    $1 = 0.00016916072315753086

Before an output goes idle, it attempts to steal work from another output's runqueue. A new input option controls the migration threshold, to avoid moving inputs between outputs too frequently. The stealing is done inside eof_or_idle(), which now returns a new internal status code, STATUS_STOLE, so the various callers know to read the next record.
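A minimal sketch of that steal-before-idle flow (the STATUS_STOLE name comes from the PR; the helper function, its parameters, and the threshold check are hypothetical simplifications):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Illustrative status codes; STATUS_STOLE mirrors the PR's new internal code.
enum status_t { STATUS_OK, STATUS_IDLE, STATUS_STOLE };

// Before going idle, scan other outputs' runqueues and take one input,
// but only if it has been on its current output for at least
// 'migration_threshold' time units, to avoid migrating too frequently.
status_t
try_steal(std::vector<std::deque<int>> &runqueues, int my_output,
          const std::vector<int> &time_on_output, int migration_threshold)
{
    for (std::size_t i = 0; i < runqueues.size(); ++i) {
        if (static_cast<int>(i) == my_output)
            continue;
        std::deque<int> &victim = runqueues[i];
        if (!victim.empty() &&
            time_on_output[victim.back()] >= migration_threshold) {
            runqueues[my_output].push_back(victim.back());
            victim.pop_back();
            // A caller seeing STATUS_STOLE must go read the next record
            // rather than report idle.
            return STATUS_STOLE;
        }
    }
    return STATUS_IDLE;
}
```

The distinct return code is the key design point: callers of eof_or_idle() previously assumed "no work" meant idle or EOF, so a new status is needed to tell them work just arrived.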

Adds periodic rebalancing, with the period controlled by another new input option. Adds flexible_queue_t::back() so rebalancing does not take from the front of the queues.
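A toy rebalancing pass in that spirit might look like the following; a plain std::deque stands in for flexible_queue_t, and the even-out-to-the-average policy is an assumption for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Move inputs from over-full runqueues to under-full ones until each is
// near the average size.  Taking from the back (cf. the new
// flexible_queue_t::back()) leaves the inputs about to run undisturbed.
void
rebalance(std::vector<std::deque<int>> &runqueues)
{
    std::size_t total = 0;
    for (const auto &q : runqueues)
        total += q.size();
    // Ceiling of the average: no queue needs more than this.
    std::size_t target = (total + runqueues.size() - 1) / runqueues.size();
    for (auto &src : runqueues) {
        for (auto &dst : runqueues) {
            if (&src == &dst)
                continue;
            while (src.size() > target && dst.size() < target) {
                dst.push_back(src.back());
                src.pop_back();
            }
        }
    }
}
```

Pulling from the back rather than the front is why back() was added: the front entries are the ones the donor output is about to schedule, so disturbing them would churn the near-term schedule.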

Updates the path where an output goes inactive, promoting everything unscheduled, to use the new rebalancing.

Makes output_info_t.active atomic as it is read by other outputs during stealing and rebalancing.

Adds statistics on the stealing and rebalancing instances.

Updates all of the unit tests, many of which now have different resulting schedules.

Adds a new unit test targeting queue rebalancing.

Tested under ThreadSanitizer for race detection on a relatively large trace on 90 cores.

Issue: #6938

Refactors the scheduler's time-oriented options to be based on
simulated microseconds, rather than being unitless and each requiring
separate scaling for the simulator's clock.

Deprecates these scheduler_options_t fields, replacing them with new versions:
+ quantum_duration => quantum_duration_{instructions,us}
+ block_time_scale => block_time_multiplier
+ block_time_max => block_time_max_us

Adds a new "time_units_per_us" which is the single place where a
simulator sets the relationship between the value passed to "cur_time"
in next_record() and simulated microseconds.  The aforementioned
fields are all compared to cur_time multiplied by time_units_per_us.
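The conversion described above can be sketched as follows; the field names follow the PR text, but the struct and helper are illustrative, not the real scheduler API:

```cpp
#include <cassert>
#include <cstdint>

// One place relates the simulator's cur_time units to simulated time.
struct sched_time_options_t {
    // How many next_record() cur_time units equal one simulated
    // microsecond, e.g. 1000. if the simulator passes nanoseconds.
    double time_units_per_us = 1000.;
    uint64_t quantum_duration_us = 5000; // Quantum in simulated microseconds.
};

// A time-based option is compared against elapsed cur_time after
// converting the option's microseconds into cur_time units.
bool
quantum_expired(const sched_time_options_t &opts, uint64_t start_time,
                uint64_t cur_time)
{
    return (cur_time - start_time) >= static_cast<uint64_t>(
               opts.quantum_duration_us * opts.time_units_per_us);
}
```

With a single conversion factor, each new time-based option only needs to be expressed in microseconds, instead of every option carrying its own clock-dependent scaling.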

This is a prelude to adding yet more time-based options for the
forthcoming scheduler additions with migration thresholds and
rebalance periods.

Adds legacy support for binary compatibility.  Recompiling will result
in error messages prompting an update to the new fields.
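One common way such legacy support is wired up is a one-time mapping at initialization; this is purely a sketch under that assumption, and the real mapping and defaults live in the scheduler and may differ:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative options struct with one deprecated field and its replacement.
struct options_t {
    uint64_t quantum_duration = 0;    // Deprecated: unitless legacy field.
    uint64_t quantum_duration_us = 0; // New: simulated microseconds.
    double time_units_per_us = 1.;
};

// At init time, if only the deprecated field was set (e.g. by a binary
// built against the old struct), translate it into the new field.
void
apply_legacy_fields(options_t &opts)
{
    if (opts.quantum_duration != 0 && opts.quantum_duration_us == 0) {
        opts.quantum_duration_us = static_cast<uint64_t>(
            opts.quantum_duration / opts.time_units_per_us);
    }
}
```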

Adds a unit test of legacy support.

Issue: #6938
@brettcoon (Contributor) left a comment:
I'm still working through the changes, but I'm not going to finish today so I'll share what I have so far.

@derekbruening (Contributor, Author):

Note that this depends on #6980 which maybe you missed

@brettcoon (Contributor) left a comment:

More comments and questions.

Base automatically changed from i6938-time-units to master September 17, 2024 00:15
@brettcoon (Contributor) left a comment:

Thanks for the clarifications.

@derekbruening (Contributor, Author):

The a64 failure is the before-a-bundle assert (#3320).

@derekbruening derekbruening merged commit f1b2d54 into master Sep 17, 2024
16 of 17 checks passed
@derekbruening derekbruening deleted the i6938-per-core-runqueue branch September 17, 2024 15:17
derekbruening added a commit that referenced this pull request Sep 18, 2024
Fixes clang-tidy complaints from PR #6985 about a misspelling,
mismatched parameter names, and missing includes, and removes a stale
comment.

Issue: #6938
derekbruening added a commit that referenced this pull request Sep 21, 2024
PR #6985 removed the global sched_lock_ which used to synchronize
access to other output's record_index field, leaving a data
race (affecting only the rough timing across outputs where an output
waits if gets ahead of other outputs' timestamps).
Since only the index and not the recorded contents need
synchronization, we use an atomic rather than a mutex to coordinate.

Tested by running an internal test under ThreadSanitizer where a race
is reported without this fix and no race with the fix.

Issue: #6938
derekbruening added a commit that referenced this pull request Sep 26, 2024
Removes a duplicate option -sched_time_per_us which was accidentally
added by PR #6980 separately from the similar -sched_time_units_per_us
from PR #6985.

Issue: #6938
derekbruening added a commit that referenced this pull request Oct 22, 2024
Fixes a deadlock on replay introduced by PR #6985.  If a recorded
schedule has back-to-back entries with the same input (with a gap
where another thread runs that input in between of course) and it has
to wait for that other thread for its 2nd stint with that input, it
will deadlock as it holds the input's lock while it tries to get the
lock in set_cur_input().  That 2nd lock acquisition is what was added
by PR #6985: that's the source of the regression.

Tested on a 12-thread threadsig trace which hangs without the fix and
succeeds with the fix.  It is too large to turn into a convenient
regression test.

Issue: #7044