Skip to content
This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

fix: Restart=always docker systemd service #3758

Merged
merged 2 commits into from
Aug 28, 2020

Conversation

jackfrancis
Copy link
Member

@jackfrancis jackfrancis commented Aug 27, 2020

Reason for Change:

This PR ensures that Restart=always is configured for the docker systemd service, and that the included monitor script for both docker and kubelet has a fail-safe "start the systemd service" if the systemd job spec doesn't do so itself.

Issue Fixed:

Requirements:

Notes:

DavidParks8
DavidParks8 previously approved these changes Aug 27, 2020
@jackfrancis
Copy link
Member Author

@xuto2 @yangl900 FYI, this pathology may exist for AKS node systemd definitions as well

@codecov
Copy link

codecov bot commented Aug 27, 2020

Codecov Report

Merging #3758 into master will increase coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #3758   +/-   ##
=======================================
  Coverage   73.19%   73.20%           
=======================================
  Files         148      148           
  Lines       25367    25372    +5     
=======================================
+ Hits        18568    18573    +5     
  Misses       5663     5663           
  Partials     1136     1136           
Impacted Files Coverage Δ
pkg/engine/templates_generated.go 53.42% <ø> (ø)
pkg/api/addons.go 97.80% <0.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b05d7a8...f7ce667. Read the comment docs.

devigned
devigned previously approved these changes Aug 27, 2020
Copy link
Member

@devigned devigned left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@xuto2
Copy link
Contributor

xuto2 commented Aug 27, 2020

/lgtm.

Thanks @jackfrancis for sharing!

mboersma
mboersma previously approved these changes Aug 27, 2020
Copy link
Member

@mboersma mboersma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@jackfrancis jackfrancis dismissed stale reviews from mboersma, devigned, and DavidParks8 via f7ce667 August 27, 2020 21:30
@acs-bot acs-bot removed the lgtm label Aug 27, 2020
@acs-bot
Copy link

acs-bot commented Aug 27, 2020

New changes are detected. LGTM label has been removed.

@acs-bot acs-bot added size/S and removed size/XS labels Aug 27, 2020
@acs-bot
Copy link

acs-bot commented Aug 27, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [jackfrancis,mboersma]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis
Copy link
Member Author

After more investigation, it appears that Restart=always is already present for the docker systemd service:

$ grep Restart /lib/systemd/system/docker.service
RestartSec=2
Restart=always

(Though not via aks-engine-delivered config).

So, we should still not just rely upon that being there, and explicitly declare that config (like we're doing in this PR). However, the fact that we're still seeing terminally stopped docker systemd services in the wild suggests that we aren't as resilient as we want to in terms of guaranteeing the availability of the docker systemd service. These docs suggest that Restart=always in fact has limits:

https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=

The key statement is:

As exceptions to the setting above, the service will not be restarted if the exit code or signal is specified in RestartPreventExitStatus= (see below) or the service is stopped with systemctl stop or an equivalent operation.

We aren't specifying any non-qualifying exit codes for docker:

$ sudo grep -R "RestartPreventExitStatus" /etc/systemd/
/etc/systemd/system/sshd.service:RestartPreventExitStatus=255
/etc/systemd/system/multi-user.target.wants/snapd.service:RestartPreventExitStatus=42
/etc/systemd/system/multi-user.target.wants/ssh.service:RestartPreventExitStatus=255

So there must be some systemctl stop-equivalent events that occasionally happen. In order to restart docker after that, I've added systemctl start logic to the script that the pre-existing docker-monitor service uses. I think with this combination, we'll be in better shape to deal with docker systemd service events.

@jackfrancis jackfrancis merged commit 5e22312 into Azure:master Aug 28, 2020
penggu pushed a commit to penggu/aks-engine that referenced this pull request Oct 28, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants