-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-4881 sched: CPU relaxing on idle #4332
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
4cf26e9
to
e80b097
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
e80b097
to
56e7814
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
56e7814
to
80e48bb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
80e48bb
to
3d89b19
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
The patch hasn't been fully tested yet, but I think you may have an early review if you have time. Thanks. |
Scheduler will put CPU relaxing for a short interval when the xstream is idle. The default relax mode is that scheduler calls sleep("short_timeout") on idle, if the cart progress interface is able to relax CPU (depends on network provider), env var SCHED_RELAX_MODE=net could be set to make the scheduler calls crt_progress("short_timeout") to realx while waiting on request. Following changes are made to support this feature: - Introduced sched_cond_wait(), it's called by ULTs being blocked on a condition for a long time (like RDB daemons), so that scheduler will be able to know how many ULTs are blocked on a non-immediate event. - Introduced sched_create_thread/task() to bump 'busy timestamp' once a ULT is created for certain xstream, so that scheduler is able to know if the xstream is busy recently. - Intrdouced DSS_ULT_FL_PERIODIC flag, it'll be used by the thread creation caller who doesn't want 'busy timestamp' being bumped. It's mainly for EC aggregation periodical collective calls. - Necessary changes to DTX and EC aggregation ULTs to avoid keeping the xstream 'busy'. - Added few env variables: 1. SCHED_PRIO_DISABLED to disable scheduler priority queue. 2. SCHED_RELAX_DISABLED to disable CPU relaxing on idle. 3. SCHED_RELAX_INTVL to configure the short relaxing interval. (in milli-seconds, 50 msecs by default) 4. SCHED_STATS_INTVL to periodically print relaxing stats on console (in seconds). 5. SCHED_RELAX_MODE to change relax mode, "sleep" is the default mode, set to any other string for "block cart progress" mode. Signed-off-by: Niu Yawei <[email protected]>
3d89b19
to
fd670ac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Merge remote-tracking branch 'github/master' into DAOS-4881-relax Conflicts: src/dtx/dtx_common.c Signed-off-by: Niu Yawei <[email protected]>
Shrink the default relax internval from 50ms to 1ms. Signed-off-by: Niu Yawei <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
This had hit some assertion failures and deadlocked in CI overnight so I killed the job given the length of the queue. |
Not sure why the test is cancelled. Already finished tests have some failures, some are from DAOS-6544, others are test timeout, no obvious errors in the log, not sure if they are caused by DAOS-6544 as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
@ashleypittman do you know what are the assertion failures? I didn't find them. |
Test stage Build on Leap 15 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4332/8/execution/node/389/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4332/8/execution/node/346/log |
Test stage Build RPM on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4332/8/execution/node/306/log |
The DTX batched commit ULT will sleep during relax model. Under such case, after new DTX is added into the DTX CoS cache, the leader will check whehter the count of cached committable DTX exceeds the batched commit threshold, if yes, wakeup the batched commit ULT. Signed-off-by: Fan Yong <[email protected]> Co-authored-by: Niu Yawei <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Merge remote-tracking branch 'github/master' into DAOS-4881-relax Conflicts: src/dtx/dtx_common.c Signed-off-by: Niu Yawei <[email protected]>
This has conflicts that need to be resolved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Reviewers, the patch passed CI, but got a conflict, I just fixed the conflict and did another push. Please review it again. Thanks. |
Merge remote-tracking branch 'github/master' into DAOS-4881-relax Conflicts: src/iosrv/srv.c Signed-off-by: Niu Yawei <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might also want to document the new env vars (what's the right place? doc/admin/env_variables.md?) later.
@daos-stack/daos-gatekeeper All tests passed, but three new NLT errors showed up after merging PR 4490. Could we land it and solve the NLT errors in follow-on patch? |
The NLT error is "dss_module_key_get() Assertion 'dtls != NULL' failed" and if landed as-is would cause all master builds and PRs to fail. |
Fix a merge error. Signed-off-by: Niu Yawei <[email protected]>
@ashleypittman oh, right, that's caused by a merge conflict. Thanks. sigh, I have to push a commit again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Niu, I have restarted your build with a higher priority, running build 19 now. Build 18 should be canceled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
I just had a second look, the other two are in master anyway as of https://build.hpdd.intel.com/job/daos-stack/job/daos/job/master/3744/ this PR must have been merged with master and built before the relevent master build completed to be selected as a reference. A simple re-trigger or rebuilding of just the stage should have been enough to pass. |
Scheduler will put CPU relaxing for a short interval when the
xstream is idle. The default relax mode is that scheduler calls
sleep("short_timeout") on idle, if the cart progress interface
is able to relax CPU (depends on network provider), env var
SCHED_RELAX_MODE=net could be set to make the scheduler calls
crt_progress("short_timeout") to realx while waiting on request.
Following changes are made to support this feature:
Introduced sched_cond_wait(), it's called by ULTs being blocked
on a condition for a long time (like RDB daemons), so that
scheduler will be able to know how many ULTs are blocked on a
non-immediate event.
Introduced sched_create_thread/task() to bump 'busy timestamp'
once a ULT is created for certain xstream, so that scheduler is
able to know if the xstream is busy recently.
Intrdouced DSS_ULT_FL_PERIODIC flag, it'll be used by the thread
creation caller who doesn't want 'busy timestamp' being bumped.
It's mainly for EC aggregation periodical collective calls.
Necessary changes to DTX and EC aggregation ULTs to avoid keeping
the xstream 'busy'.
Added few env variables:
(in milli-seconds, 1 msec by default)
console (in seconds).
mode, set to any other string for "block cart progress" mode.
Signed-off-by: Niu Yawei [email protected]