Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-10206 config: set up extra systemd options #8593

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

johannlombardi
Copy link
Contributor

Add several options to the systemd unit file of the daos_server to:

  • increase the scheduling priority over other Linux processes
  • bind the control plane to core #0 to avoid noises on the engine
  • reduce the OOM score to kill other processes before the daos_server
  • dump the content of pmem to core files

Signed-off-by: Johann Lombardi [email protected]

Add several options to the systemd unit file of the daos_server to:
- increase the scheduling priority over other Linux processes
- bind the control plane to core #0 to avoid noises on the engine
- reduce the OOM score to kill other processes before the daos_server
- dump the content of pmem to core files

Signed-off-by: Johann Lombardi <[email protected]>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@@ -13,12 +13,24 @@ RuntimeDirectoryMode=0755
ExecStart=/usr/bin/daos_server start
StandardOutput=journal
StandardError=journal
StartLimitBurst=5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What situation do you want to address with this setting ?? Do I correctly understand that this means we will tolerate up to 5 service restarts within StartLimitIntervalSec/DefaultStartLimitIntervalSec=10s before to decide stopping restart attempts ?? Well, my experience with daos_server service startup failures is that there is something wrong that needs to be investigated, so I would better set StartLimitBurst=1 and raise a little bit StartLimitIntervalSec to something like 120s. But I am sure we will had to schedule a specific meeting about this topic where everybody can have a different idea !!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only moved this one around and did not change the value. Not sure who added it. @mjmac any idea?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Well, git blame shows that @dpquigl added that line. It looks like StartLimitIntervalSec is set to 60s up in the unit block above this, so this is saying that daos_server can only attempt to restart 5 times within 60s, if I understand correctly.

I do tend to agree that there probably isn't much value in having the service attempt to restart multiple times, as errors which prevent successful startup (bad config, missing hardware, etc) aren't likely to resolve themselves.

@dpquigl: Was there a specific rationale behind choosing that value? Should we just revert to the default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my side, everything seems fine at the exception of this parameter StartLimitIntervalSec . Is there any feedback on the motivation to choose this value ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was set the way it was because without it we were flooding the logs with error messages if DAOS wasn't configured correctly and systemd had it enabled.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

knard38
knard38 previously approved these changes May 5, 2022
Copy link
Contributor

@knard38 knard38 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine for me.

bfaccini
bfaccini previously approved these changes May 9, 2022
@@ -4,21 +4,40 @@ StartLimitIntervalSec=60
Wants=network-online.target
After=network-online.target

StartLimitIntervalSec=60
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks redundant with line #3 in fact

Signed-off-by: Johann Lombardi <[email protected]>
@johannlombardi johannlombardi dismissed stale reviews from bfaccini and knard38 via a36c66e June 29, 2022 09:12
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Copy link
Contributor

@knard38 knard38 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@johannlombardi johannlombardi requested a review from a team July 5, 2022 11:34
Copy link
Contributor

@jolivier23 jolivier23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider merging latest since it's quite old?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

7 participants