i#6938 sched migrate: Separate run queue per output #6985
Conversation
Refactors the scheduler's time-oriented options to be based on simulated microseconds rather than being unitless and each requiring separate scaling by the simulator's clock. Deprecates these scheduler_options_t fields, replacing them with new versions:

+ quantum_duration => quantum_duration_{instructions,us}
+ block_time_scale => block_time_multiplier
+ block_time_max => block_time_max_us

Adds a new "time_units_per_us" which is the single place where a simulator sets the relationship between the value passed to "cur_time" in next_record() and simulated microseconds. The aforementioned fields are all compared to cur_time multiplied by time_units_per_us. This is a prelude to adding yet more time-based options for the forthcoming scheduler additions with migration thresholds and rebalance periods.

Adds legacy support for binary compatibility. Recompiling will result in error messages prompting an update to the new fields. Adds a unit test of legacy support.

Issue: #6938
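For illustration only, a minimal sketch of how a simulator might adopt the new microsecond-based fields, assuming the drmemtrace scheduler_t API; run_one_output(), workloads, and simulator_clock() are placeholder names and the option values are arbitrary:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

#include "scheduler.h"

using namespace dynamorio::drmemtrace;

uint64_t simulator_clock(); // Assumed: the simulator's own clock source.

void
run_one_output(std::vector<scheduler_t::input_workload_t> &workloads, int output_count)
{
    scheduler_t scheduler;
    scheduler_t::scheduler_options_t sched_ops(scheduler_t::MAP_TO_ANY_OUTPUT,
                                               scheduler_t::DEPENDENCY_TIMESTAMPS,
                                               scheduler_t::SCHEDULER_DEFAULTS);
    // The single place to declare the simulated-time unit: here the value
    // passed as cur_time ticks twice per simulated microsecond (assumed ratio).
    sched_ops.time_units_per_us = 2.;
    // New microsecond-based fields replacing the deprecated unitless ones.
    sched_ops.quantum_duration_us = 5000;  // Was quantum_duration.
    sched_ops.block_time_multiplier = 0.1; // Was block_time_scale.
    sched_ops.block_time_max_us = 250000;  // Was block_time_max.
    if (scheduler.init(workloads, output_count, std::move(sched_ops)) !=
        scheduler_t::STATUS_SUCCESS)
        return; // Handle the error in real code.
    // Each output passes its clock to next_record(); the scheduler converts it
    // via time_units_per_us to compare against the *_us option values.
    auto *stream = scheduler.get_stream(/*ordinal=*/0);
    memref_t record;
    while (stream->next_record(record, simulator_clock()) == scheduler_t::STATUS_OK) {
        // Process the record.
    }
}
```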
Removes the global runqueue and global sched_lock_, replacing with per-output runqueues which each have a lock inside a new struct input_queue_t which clearly delineates what the lock protects. The unscheduled queue remains global and has its own lock as another input_queue_t. The output fields .active and .cur_time are now atomics, as they are accessed from other outputs yet are separate from the queue and its mutex.

Makes the runqueue lock usage narrow, avoiding holding locks across the larger functions. Establishes a lock ordering convention: input > output > unsched.

The removal of the global sched_lock_ avoids the lock contention seen on fast analyzers (the original design targeted heavyweight simulators). On a large internal trace with hundreds of threads on >100 cores we were seeing 41% of lock attempts collide with the global queue:

```
[scheduler] Schedule lock acquired : 72674364
[scheduler] Schedule lock contended : 30144911
```

With separate runqueues we see < 1 in 10,000 collide:

```
[scheduler] Stats for output #0
<...>
[scheduler] Runqueue lock acquired : 34594996
[scheduler] Runqueue lock contended : 29
[scheduler] Stats for output #1
<...>
[scheduler] Runqueue lock acquired : 51130763
[scheduler] Runqueue lock contended : 41
<...>
[scheduler] Runqueue lock acquired : 46305755
[scheduler] Runqueue lock contended : 44
[scheduler] Unscheduled queue lock acquired : 27834
[scheduler] Unscheduled queue lock contended : 273
$ egrep 'contend' OUT | awk '{n+=$NF}END{ print n}'
11528
$ egrep 'acq' OUT | awk '{n+=$NF}END{ print n}'
6814820713
(gdb) p 11528/6814820713.*100
$1 = 0.00016916072315753086
```

Before an output goes idle, it attempts to steal work from another output's runqueue. A new input option is added controlling the migration threshold to avoid moving jobs too frequently. The stealing is done inside eof_or_idle(), which now returns a new internal status code STATUS_STOLE so the various callers can be sure to read the next record.

Adds a periodic rebalancing with a period equal to another new input option. Adds flexible_queue_t::back() so rebalancing does not take from the front of the queues. Updates an output going inactive and promoting everything-unscheduled to use the new rebalancing.

Makes output_info_t.active atomic as it is read by other outputs during stealing and rebalancing.

Adds statistics on the stealing and rebalancing instances.

Updates all of the unit tests, many of which now have different resulting schedules. Adds a new unit test targeting queue rebalancing.

Issue: #6938
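For illustration only (not the PR's actual code), a per-output runqueue with its own narrow lock, atomic cross-output fields, and an idle-time steal might be shaped like this; the helper names are hypothetical:

```cpp
#include <atomic>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical shape of the design: each output owns a runqueue whose lock
// guards only the queue itself, while fields read by other outputs are atomics.
struct input_queue_t {
    std::mutex lock;
    std::deque<int> queue; // Ordinals of inputs ready to run.
};

struct output_info_t {
    input_queue_t ready;               // This output's private runqueue.
    std::atomic<bool> active{true};    // Read by other outputs when stealing.
    std::atomic<uint64_t> cur_time{0}; // Read by other outputs; no lock needed.
};

// Before going idle, try to take one input from a peer's runqueue. Locks are
// held only for the brief queue operation, never across larger functions.
bool
try_steal(std::vector<output_info_t> &outputs, int my_ordinal, int &stolen_input)
{
    for (int peer = 0; peer < static_cast<int>(outputs.size()); ++peer) {
        if (peer == my_ordinal ||
            !outputs[peer].active.load(std::memory_order_acquire))
            continue;
        std::lock_guard<std::mutex> guard(outputs[peer].ready.lock);
        if (outputs[peer].ready.queue.empty())
            continue;
        stolen_input = outputs[peer].ready.queue.back();
        outputs[peer].ready.queue.pop_back();
        return true; // The caller then reads the next record (cf. STATUS_STOLE).
    }
    return false;
}
```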
I'm still working through the changes, but I'm not going to finish today so I'll share what I have so far.
Note that this depends on #6980, which maybe you missed.
More comments and questions.
…he new time_units_per_us (tested w/ that at 1 before)
Thanks for the clarifications.
The a64 failure is the before-a-bundle assert #3320.
PR #6985 removed the global sched_lock_ which used to synchronize access to other output's record_index field, leaving a data race (affecting only the rough timing across outputs where an output waits if gets ahead of other outputs' timestamps). Since only the index and not the recorded contents need synchronization, we use an atomic rather than a mutex to coordinate. Tested by running an internal test under ThreadSanitizer where a race is reported without this fix and no race with the fix. Issue: #6938
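A rough sketch of that shape of fix, with hypothetical names: the recorded contents are written and read only by the owning output, so only the published index needs to be atomic:

```cpp
#include <atomic>
#include <vector>

// Hypothetical sketch: entries are appended only by the owning output, but the
// count of published entries is read by other outputs, so the index becomes an
// atomic instead of relying on the removed global sched_lock_.
struct schedule_entry_t { /* recorded segment data */ };

struct output_record_t {
    std::vector<schedule_entry_t> record; // Written only by the owning output.
    std::atomic<int> record_index{0};     // Read by any output without a lock.
};

void
append_entry(output_record_t &out, const schedule_entry_t &entry)
{
    out.record.push_back(entry);
    // Release pairs with the acquire load below so a reader that sees the new
    // index value has a consistent view.
    out.record_index.store(static_cast<int>(out.record.size()),
                           std::memory_order_release);
}

int
peek_other_output_index(const output_record_t &out)
{
    return out.record_index.load(std::memory_order_acquire);
}
```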
Fixes a deadlock on replay introduced by PR #6985. If a recorded schedule has back-to-back entries with the same input (with a gap where another thread runs that input in between of course) and it has to wait for that other thread for its 2nd stint with that input, it will deadlock as it holds the input's lock while it tries to get the lock in set_cur_input(). That 2nd lock acquisition is what was added by PR #6985: that's the source of the regression. Tested on a 12-thread threadsig trace which hangs without the fix and succeeds with the fix. It is too large to turn into a convenient regression test. Issue: #7044
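The underlying hazard is the classic hold-and-wait pattern; a generic illustration (not the scheduler's code) of avoiding it by keeping the first critical section narrow and releasing before blocking:

```cpp
#include <mutex>

// Generic illustration with hypothetical names: blocking on a resource the
// peer thread can only provide after it acquires input_lock deadlocks if we
// are still holding input_lock ourselves.
std::mutex input_lock;

void
risky_wait(std::mutex &lock_needed_by_peer)
{
    std::lock_guard<std::mutex> hold(input_lock);
    // Deadlock risk: the peer must take input_lock before it can release
    // lock_needed_by_peer, but we hold input_lock while we wait here.
    std::lock_guard<std::mutex> wait(lock_needed_by_peer);
}

void
safer_wait(std::mutex &lock_needed_by_peer)
{
    {
        std::lock_guard<std::mutex> hold(input_lock); // Narrow critical section.
    }
    // Block only after releasing input_lock so the peer can make progress.
    std::lock_guard<std::mutex> wait(lock_needed_by_peer);
}
```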
Removes the global runqueue and global sched_lock_, replacing with per-output runqueues which each have a lock inside a new struct input_queue_t which clearly delineates what the lock protects. The unscheduled queue remains global and has its own lock as another input_queue_t. The output fields .active and .cur_time are now atomics, as they are accessed from other outputs yet are separate from the queue and its mutex.
Makes the runqueue lock usage narrow, avoiding holding locks across the larger functions. Establishes a lock ordering convention: input > output > unsched.
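A hypothetical sketch of honoring that ordering when an input lock and an output's runqueue lock are both needed (none of these type names are the PR's actual code):

```cpp
#include <deque>
#include <mutex>

// Hypothetical sketch of the convention: take the input's lock first, then the
// output's runqueue lock, then (only if needed) the unscheduled queue's lock,
// and never in the reverse order.
struct input_info_t {
    std::mutex lock;
    int ordinal = 0;
};
struct locked_queue_t {
    std::mutex lock;
    std::deque<int> queue;
};

void
enqueue_ready_input(input_info_t &input, locked_queue_t &output_runqueue)
{
    std::lock_guard<std::mutex> input_guard(input.lock);           // 1st: input.
    std::lock_guard<std::mutex> queue_guard(output_runqueue.lock); // 2nd: output.
    output_runqueue.queue.push_back(input.ordinal);
}
```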
The removal of the global sched_lock_ avoids the lock contention seen on fast analyzers (the original design targeted heavyweight simulators). On a large internal trace with hundreds of threads on >100 cores we were seeing 41% of lock attempts collide with the global queue; with separate runqueues we see fewer than 1 in 10,000 collide (see the lock statistics above).
Before an output goes idle, it attempts to steal work from another output's runqueue. A new input option is added controlling the migration threshold to avoid moving jobs too frequently. The stealing is done inside eof_or_idle() which now returns a new internal status code STATUS_STOLE so the various callers can be sure to read the next record.
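As a purely illustrative sketch (not the scheduler's actual code), the caller-side pattern that STATUS_STOLE enables might look like this, where both helpers are assumed stand-ins:

```cpp
// Illustrative only: STATUS_STOLE is internal, letting the record-reading path
// retry instead of reporting idle or EOF when eof_or_idle() stole work.
enum stream_status_t { STATUS_OK, STATUS_EOF, STATUS_IDLE, STATUS_STOLE };

stream_status_t
eof_or_idle(); // Assumed helper: may steal from a peer and return STATUS_STOLE.
stream_status_t
read_from_current_input(); // Assumed helper: reads the current input's next record.

stream_status_t
next_record_for_output()
{
    while (true) {
        stream_status_t status = read_from_current_input();
        if (status != STATUS_EOF && status != STATUS_IDLE)
            return status;
        status = eof_or_idle();
        if (status == STATUS_STOLE)
            continue; // A stolen input is now ours: go read its next record.
        return status; // Genuinely at EOF or idle.
    }
}
```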
Adds a periodic rebalancing with a period equal to another new input option. Adds flexible_queue_t::back() so rebalancing does not take from the front of the queues.
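A hypothetical rebalance sketch, taking from the back of over-full queues (via something like flexible_queue_t::back()) so each output's imminent front-of-queue work is undisturbed:

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical sketch: periodically move inputs from the back of queues that
// are above the average length onto queues below it, taking from the back so
// each output's soon-to-run front entries stay put.
struct runqueue_t {
    std::mutex lock;
    std::deque<int> queue;
};

void
rebalance(std::vector<runqueue_t> &queues)
{
    if (queues.empty())
        return;
    size_t total = 0;
    for (auto &rq : queues) {
        std::lock_guard<std::mutex> guard(rq.lock);
        total += rq.queue.size();
    }
    const size_t target = total / queues.size();
    std::vector<int> moved;
    for (auto &rq : queues) {
        std::lock_guard<std::mutex> guard(rq.lock);
        while (rq.queue.size() > target) {
            moved.push_back(rq.queue.back()); // Take from the back, not the front.
            rq.queue.pop_back();
        }
    }
    size_t next = 0;
    for (auto &rq : queues) {
        std::lock_guard<std::mutex> guard(rq.lock);
        while (rq.queue.size() < target && next < moved.size())
            rq.queue.push_back(moved[next++]);
    }
    // Sketch simplification: any remainder lands on the first queue.
    std::lock_guard<std::mutex> guard(queues.front().lock);
    while (next < moved.size())
        queues.front().queue.push_back(moved[next++]);
}
```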
Updates an output going inactive and promoting everything-unscheduled to use the new rebalancing.
Makes output_info_t.active atomic as it is read by other outputs during stealing and rebalancing.
Adds statistics on the stealing and rebalancing instances.
Updates all of the unit tests, many of which now have different resulting schedules.
Adds a new unit test targeting queue rebalancing.
Tested under ThreadSanitizer for race detection on a relatively large trace on 90 cores.
Issue: #6938