Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-4881 sched: CPU relaxing on idle #4332

Merged
merged 17 commits into from
Feb 4, 2021
Merged

DAOS-4881 sched: CPU relaxing on idle #4332

merged 17 commits into from
Feb 4, 2021

Conversation

NiuYawei
Copy link
Contributor

@NiuYawei NiuYawei commented Jan 19, 2021

Scheduler will put CPU relaxing for a short interval when the
xstream is idle. The default relax mode is that scheduler calls
sleep("short_timeout") on idle, if the cart progress interface
is able to relax CPU (depends on network provider), env var
SCHED_RELAX_MODE=net could be set to make the scheduler calls
crt_progress("short_timeout") to realx while waiting on request.

Following changes are made to support this feature:

  • Introduced sched_cond_wait(), it's called by ULTs being blocked
    on a condition for a long time (like RDB daemons), so that
    scheduler will be able to know how many ULTs are blocked on a
    non-immediate event.

  • Introduced sched_create_thread/task() to bump 'busy timestamp'
    once a ULT is created for certain xstream, so that scheduler is
    able to know if the xstream is busy recently.

  • Intrdouced DSS_ULT_FL_PERIODIC flag, it'll be used by the thread
    creation caller who doesn't want 'busy timestamp' being bumped.
    It's mainly for EC aggregation periodical collective calls.

  • Necessary changes to DTX and EC aggregation ULTs to avoid keeping
    the xstream 'busy'.

  • Added few env variables:

    1. SCHED_PRIO_DISABLED to disable scheduler priority queue.
    2. SCHED_RELAX_DISABLED to disable CPU relaxing on idle.
    3. SCHED_RELAX_INTVL to configure the short relaxing interval.
      (in milli-seconds, 1 msec by default)
    4. SCHED_STATS_INTVL to periodically print relaxing stats on
      console (in seconds).
    5. SCHED_RELAX_MODE to change relax mode, "sleep" is the default
      mode, set to any other string for "block cart progress" mode.

Signed-off-by: Niu Yawei [email protected]

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@NiuYawei
Copy link
Contributor Author

The patch hasn't been fully tested yet, but I think you may have an early review if you have time. Thanks.

Scheduler will put CPU relaxing for a short interval when the
xstream is idle. The default relax mode is that scheduler calls
sleep("short_timeout") on idle, if the cart progress interface
is able to relax CPU (depends on network provider), env var
SCHED_RELAX_MODE=net could be set to make the scheduler calls
crt_progress("short_timeout") to realx while waiting on request.

Following changes are made to support this feature:

- Introduced sched_cond_wait(), it's called by ULTs being blocked
  on a condition for a long time (like RDB daemons), so that
  scheduler will be able to know how many ULTs are blocked on a
  non-immediate event.

- Introduced sched_create_thread/task() to bump 'busy timestamp'
  once a ULT is created for certain xstream, so that scheduler is
  able to know if the xstream is busy recently.

- Intrdouced DSS_ULT_FL_PERIODIC flag, it'll be used by the thread
  creation caller who doesn't want 'busy timestamp' being bumped.
  It's mainly for EC aggregation periodical collective calls.

- Necessary changes to DTX and EC aggregation ULTs to avoid keeping
  the xstream 'busy'.

- Added few env variables:
  1. SCHED_PRIO_DISABLED to disable scheduler priority queue.
  2. SCHED_RELAX_DISABLED to disable CPU relaxing on idle.
  3. SCHED_RELAX_INTVL to configure the short relaxing interval.
     (in milli-seconds, 50 msecs by default)
  4. SCHED_STATS_INTVL to periodically print relaxing stats on
     console (in seconds).
  5. SCHED_RELAX_MODE to change relax mode, "sleep" is the default
     mode, set to any other string for "block cart progress" mode.

Signed-off-by: Niu Yawei <[email protected]>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Merge remote-tracking branch 'github/master' into DAOS-4881-relax

Conflicts:
	src/dtx/dtx_common.c

Signed-off-by: Niu Yawei <[email protected]>
Shrink the default relax internval from 50ms to 1ms.

Signed-off-by: Niu Yawei <[email protected]>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@ashleypittman
Copy link
Contributor

This had hit some assertion failures and deadlocked in CI overnight so I killed the job given the length of the queue.

@NiuYawei
Copy link
Contributor Author

Not sure why the test is cancelled. Already finished tests have some failures, some are from DAOS-6544, others are test timeout, no obvious errors in the log, not sure if they are caused by DAOS-6544 as well.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@NiuYawei
Copy link
Contributor Author

This had hit some assertion failures and deadlocked in CI overnight so I killed the job given the length of the queue.

@ashleypittman do you know what are the assertion failures? I didn't find them.

@daosbuild1
Copy link
Collaborator

Test stage Build on Leap 15 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4332/8/execution/node/389/log

@daosbuild1
Copy link
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4332/8/execution/node/346/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4332/8/execution/node/306/log

The DTX batched commit ULT will sleep during relax model.
Under such case, after new DTX is added into the DTX CoS
cache, the leader will check whehter the count of cached
committable DTX exceeds the batched commit threshold, if
yes, wakeup the batched commit ULT.

Signed-off-by: Fan Yong <[email protected]>
Co-authored-by: Niu Yawei <[email protected]>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Merge remote-tracking branch 'github/master' into DAOS-4881-relax

Conflicts:
	src/dtx/dtx_common.c

Signed-off-by: Niu Yawei <[email protected]>
@ashleypittman
Copy link
Contributor

This has conflicts that need to be resolved.

@NiuYawei NiuYawei requested review from liw and wangdi1 February 4, 2021 00:45
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@NiuYawei
Copy link
Contributor Author

NiuYawei commented Feb 4, 2021

Reviewers, the patch passed CI, but got a conflict, I just fixed the conflict and did another push. Please review it again. Thanks.

Merge remote-tracking branch 'github/master' into DAOS-4881-relax

Conflicts:
	src/iosrv/srv.c

Signed-off-by: Niu Yawei <[email protected]>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

liw
liw previously approved these changes Feb 4, 2021
Copy link
Contributor

@liw liw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might also want to document the new env vars (what's the right place? doc/admin/env_variables.md?) later.

wangdi1
wangdi1 previously approved these changes Feb 4, 2021
@NiuYawei NiuYawei requested a review from a team February 4, 2021 15:15
@NiuYawei
Copy link
Contributor Author

NiuYawei commented Feb 4, 2021

@daos-stack/daos-gatekeeper All tests passed, but three new NLT errors showed up after merging PR 4490. Could we land it and solve the NLT errors in follow-on patch?

@ashleypittman
Copy link
Contributor

The NLT error is "dss_module_key_get() Assertion 'dtls != NULL' failed" and if landed as-is would cause all master builds and PRs to fail.

Fix a merge error.

Signed-off-by: Niu Yawei <[email protected]>
@NiuYawei
Copy link
Contributor Author

NiuYawei commented Feb 4, 2021

@ashleypittman oh, right, that's caused by a merge conflict. Thanks. sigh, I have to push a commit again.

@NiuYawei NiuYawei dismissed stale reviews from wangdi1 and liw via 9eaf376 February 4, 2021 16:04
@NiuYawei NiuYawei requested review from liw and wangdi1 February 4, 2021 16:05
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@sylviachanoiyee
Copy link
Contributor

sylviachanoiyee commented Feb 4, 2021

Niu, I have restarted your build with a higher priority, running build 19 now. Build 18 should be canceled.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@ashleypittman
Copy link
Contributor

@daos-stack/daos-gatekeeper All tests passed, but three new NLT errors showed up after merging PR 4490. Could we land it and solve the NLT errors in follow-on patch?

I just had a second look, the other two are in master anyway as of https://build.hpdd.intel.com/job/daos-stack/job/daos/job/master/3744/ this PR must have been merged with master and built before the relevent master build completed to be selected as a reference. A simple re-trigger or rebuilding of just the stage should have been enough to pass.

@jolivier23 jolivier23 merged commit fbf5746 into master Feb 4, 2021
@jolivier23 jolivier23 deleted the DAOS-4881-relax branch February 4, 2021 19:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

9 participants