i#6938 sched migrate: Separate run queue per output (#6985)
Removes the global runqueue and global sched_lock_, replacing with
per-output runqueues which each have a lock inside a new struct
input_queue_t which clearly delineates what the lock protects. The
unscheduled queue remains global and has its own lock as another
input_queue_t. The output fields .active and .cur_time are now atomics,
as they are accessed from other outputs yet are separate from the queue
and its mutex.

Makes the runqueue lock usage narrow, avoiding holding locks across
the larger functions.  Establishes a lock ordering convention: input >
output > unsched.
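
As a minimal sketch of the two ideas above (not the scheduler's actual
code), the struct below keeps a mutex next to the data it protects, and
the helper takes two locks in a fixed order. The names queue_with_lock_t
and move_to_unscheduled are hypothetical, and the sketch assumes that
"output > unsched" means the output lock is acquired before the
unscheduled lock.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical stand-in for the commit's input_queue_t: the mutex and the
// data it protects live side by side in one struct.
struct queue_with_lock_t {
    std::mutex lock;
    std::deque<int> inputs; // Input ids; guarded by lock.
};

// Hypothetical helper: moves the front input of an output's runqueue to the
// unscheduled queue. Assumed ordering: the output lock is always taken
// before the unscheduled lock, so two threads cannot deadlock by taking
// the same pair of locks in opposite orders.
inline bool
move_to_unscheduled(queue_with_lock_t &output_queue, queue_with_lock_t &unscheduled)
{
    std::lock_guard<std::mutex> output_guard(output_queue.lock);
    if (output_queue.inputs.empty())
        return false;
    std::lock_guard<std::mutex> unsched_guard(unscheduled.lock);
    unscheduled.inputs.push_back(output_queue.inputs.front());
    output_queue.inputs.pop_front();
    return true;
}
```

Any code path that needs both locks would follow the same order; a path
that takes them in reverse would break the convention.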

The removal of the global sched_lock_ avoids the lock contention seen on
fast analyzers (the original design targeted heavyweight simulators). On
a large internal trace with hundreds of threads on >100
cores we were seeing 41% of lock attempts collide with
the global queue:
```
    [scheduler] Schedule lock acquired     :  72674364
    [scheduler] Schedule lock contended    :  30144911
```
With separate runqueues we see fewer than 2 in a million collide
(about 0.00017%):
```
    [scheduler] Stats for output #0
    <...>
    [scheduler]   Runqueue lock acquired             :  34594996
    [scheduler]   Runqueue lock contended            :        29
    [scheduler] Stats for output #1
    <...>
    [scheduler]   Runqueue lock acquired             :  51130763
    [scheduler]   Runqueue lock contended            :        41
    <...>
    [scheduler]   Runqueue lock acquired             :  46305755
    [scheduler]   Runqueue lock contended            :        44
    [scheduler] Unscheduled queue lock acquired      :     27834
    [scheduler] Unscheduled queue lock contended     :       273
    $ egrep 'contend' OUT | awk '{n+=$NF}END{ print n}'
    11528
    $ egrep 'acq' OUT | awk '{n+=$NF}END{ print n}'
    6814820713
    (gdb) p 11528/6814820713.*100
    $1 = 0.00016916072315753086
```
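
The percentages quoted above can be rechecked with a small helper. This
is only an illustrative sketch; contention_percent is a hypothetical
name and is not part of the scheduler.

```cpp
#include <cassert>
#include <cstdint>

// Returns the percentage of lock acquisitions that were contended.
// Assumption: every contended attempt is also counted as an acquisition,
// as is the case for the counts shown in the output above.
inline double
contention_percent(uint64_t contended, uint64_t acquired)
{
    assert(contended <= acquired);
    if (acquired == 0)
        return 0.0;
    return 100.0 * static_cast<double>(contended) / static_cast<double>(acquired);
}
```

With the global lock counts, contention_percent(30144911, 72674364)
gives roughly 41.5; with the summed per-queue counts,
contention_percent(11528, 6814820713) gives roughly 0.00017, matching
the gdb calculation.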

Before an output goes idle, it attempts to steal work from another
output's runqueue. A new input option, -sched_migration_threshold_us,
sets the migration threshold to avoid moving jobs too frequently. The
stealing is done inside eof_or_idle(), which now returns a new internal
status code STATUS_STOLE so the various callers can be sure to read the
next record.
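
A simplified sketch of threshold-gated stealing follows. It is not the
code in scheduler.cpp, whose diff is not shown here: ready_input_t,
output_queue_t and try_steal are hypothetical names, and the sketch
assumes an input is eligible once at least the threshold has elapsed
since it last ran.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical record of a ready input and when it last ran.
struct ready_input_t {
    int id;
    uint64_t last_run_time_us;
};

// Hypothetical per-output runqueue with its own lock.
struct output_queue_t {
    std::mutex lock;
    std::deque<ready_input_t> ready;
};

// Tries to steal one input for the idle output `self` from any other
// output. Only one victim lock is held at a time, and an input is taken
// only if it has not run for at least migration_threshold_us.
inline bool
try_steal(std::vector<output_queue_t> &outputs, std::size_t self, uint64_t now_us,
          uint64_t migration_threshold_us, ready_input_t &stolen)
{
    assert(self < outputs.size());
    for (std::size_t i = 0; i < outputs.size(); ++i) {
        if (i == self)
            continue;
        std::lock_guard<std::mutex> guard(outputs[i].lock);
        std::deque<ready_input_t> &queue = outputs[i].ready;
        for (auto it = queue.begin(); it != queue.end(); ++it) {
            if (now_us >= it->last_run_time_us &&
                now_us - it->last_run_time_us >= migration_threshold_us) {
                stolen = *it;
                queue.erase(it);
                return true;
            }
        }
    }
    return false;
}
```

A caller that gets true back would go on to read the next record from
the stolen input, which is the role the commit message gives to
STATUS_STOLE.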

Adds a periodic rebalancing with a period set by another new input
option, -sched_rebalance_period_us. Adds flexible_queue_t::back() so
that rebalancing does not take from the front of the queues.
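
The idea of rebalancing from the back can be sketched as below. This is
an assumption-laden illustration rather than the scheduler's algorithm:
rebalance is a hypothetical function, locking is omitted for brevity,
and plain std::deque stands in for flexible_queue_t.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Moves inputs from the back of overloaded queues to underloaded queues
// until no queue holds more than ceil(total / count) inputs. Taking from
// the back leaves the entries at the front of each queue where they are.
inline void
rebalance(std::vector<std::deque<int>> &queues)
{
    assert(!queues.empty());
    std::size_t total = 0;
    for (const auto &queue : queues)
        total += queue.size();
    const std::size_t target = (total + queues.size() - 1) / queues.size();
    std::vector<int> surplus;
    for (auto &queue : queues) {
        while (queue.size() > target) {
            surplus.push_back(queue.back());
            queue.pop_back();
        }
    }
    for (auto &queue : queues) {
        while (queue.size() < target && !surplus.empty()) {
            queue.push_back(surplus.back());
            surplus.pop_back();
        }
    }
}
```

For example, three queues holding 6, 0 and 0 inputs end up holding 2
each, and the first queue keeps its original two front entries.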

Updates an output going inactive and promoting everything-unscheduled to
use the new rebalancing.

Makes output_info_t.active atomic as it is read by other outputs during
stealing and rebalancing.

Adds statistics on the stealing and rebalancing instances.

Updates all of the unit tests, many of which now have different
resulting schedules.

Adds a new unit test targeting queue rebalancing.

Tested under ThreadSanitizer for race detection on a relatively large
trace on 90 cores.

Issue: #6938
derekbruening authored Sep 17, 2024
1 parent 3cda8c8 commit f1b2d54
Showing 8 changed files with 1,118 additions and 409 deletions.
3 changes: 3 additions & 0 deletions clients/drcachesim/analyzer_multi.cpp
@@ -564,6 +564,9 @@ analyzer_multi_tmpl_t<RecordType, ReaderType>::init_dynamic_schedule()
sched_ops.blocking_switch_threshold = op_sched_blocking_switch_us.get_value();
sched_ops.block_time_multiplier = op_sched_block_scale.get_value();
sched_ops.block_time_max_us = op_sched_block_max_us.get_value();
sched_ops.migration_threshold_us = op_sched_migration_threshold_us.get_value();
sched_ops.rebalance_period_us = op_sched_rebalance_period_us.get_value();
sched_ops.time_units_per_us = op_sched_time_units_per_us.get_value();
sched_ops.randomize_next_input = op_sched_randomize.get_value();
sched_ops.honor_direct_switches = !op_sched_disable_direct_switches.get_value();
#ifdef HAS_ZIP
9 changes: 9 additions & 0 deletions clients/drcachesim/common/memtrace_stream.h
@@ -98,6 +98,15 @@ class memtrace_stream_t {
* i.e., the number of input migrations to this core.
*/
SCHED_STAT_MIGRATIONS,
/**
* Counts the number of times this output's runqueue became empty and it took
* work from another output's runqueue.
*/
SCHED_STAT_RUNQUEUE_STEALS,
/**
* Counts the number of output runqueue rebalances triggered by this output.
*/
SCHED_STAT_RUNQUEUE_REBALANCES,
/** Count of statistic types. */
SCHED_STAT_TYPE_COUNT,
};
12 changes: 12 additions & 0 deletions clients/drcachesim/common/options.cpp
@@ -1002,6 +1002,18 @@ droption_t<double> op_sched_time_units_per_us(
"of the -sched_*_us values as it converts wall-clock time into the simulated "
"microseconds measured by those options.");

droption_t<uint64_t> op_sched_migration_threshold_us(
DROPTION_SCOPE_ALL, "sched_migration_threshold_us", 500,
"Time in simulated microseconds before an input can be migrated across cores",
"The minimum time in simulated microseconds that must have elapsed since an input "
"last ran on a core before it can be migrated to another core.");

droption_t<uint64_t> op_sched_rebalance_period_us(
DROPTION_SCOPE_ALL, "sched_rebalance_period_us", 1500000,
"Period in microseconds at which core run queues are load-balanced",
"The period in simulated microseconds at which per-core run queues are re-balanced "
"to redistribute load.");

// Schedule_stats options.
droption_t<uint64_t>
op_schedule_stats_print_every(DROPTION_SCOPE_ALL, "schedule_stats_print_every",
3 changes: 3 additions & 0 deletions clients/drcachesim/common/options.h
@@ -215,6 +215,9 @@ extern dynamorio::droption::droption_t<std::string> op_cpu_schedule_file;
extern dynamorio::droption::droption_t<std::string> op_sched_switch_file;
extern dynamorio::droption::droption_t<bool> op_sched_randomize;
extern dynamorio::droption::droption_t<bool> op_sched_disable_direct_switches;
extern dynamorio::droption::droption_t<uint64_t> op_sched_migration_threshold_us;
extern dynamorio::droption::droption_t<uint64_t> op_sched_rebalance_period_us;
extern dynamorio::droption::droption_t<double> op_sched_time_units_per_us;
extern dynamorio::droption::droption_t<uint64_t> op_schedule_stats_print_every;
extern dynamorio::droption::droption_t<std::string> op_syscall_template_file;
extern dynamorio::droption::droption_t<uint64_t> op_filter_stop_timestamp;
9 changes: 9 additions & 0 deletions clients/drcachesim/scheduler/flexible_queue.h
@@ -108,6 +108,15 @@ class flexible_queue_t {
return entries_[rand_gen_() % size()]; // Undefined if empty.
}

// Returns an entry from the back -- or at least not from the front; it's not
// guaranteed to be the lowest priority, just not the highest.
T
back()
{
assert(!empty());
return entries_.back();
}

bool
empty() const
{
881 changes: 651 additions & 230 deletions clients/drcachesim/scheduler/scheduler.cpp

Large diffs are not rendered by default.

273 changes: 180 additions & 93 deletions clients/drcachesim/scheduler/scheduler.h

Large diffs are not rendered by default.

337 changes: 251 additions & 86 deletions clients/drcachesim/tests/scheduler_unit_tests.cpp

Large diffs are not rendered by default.
