systemd timers can result in changed services not being restarted reliably #49415
Did you have
enabled? Perhaps leaving took too long and systemd bailed out? If you can tell me whether you had this option enabled or not, I can try to reproduce the issue.
It should be
edit: Nope
Related: #33172 You also seem to have opened that issue :). I think the
@arianvp No, I have no
I have:
Also, I think that if systemd tried to stop consul at all, even via
The difference to #33172 is that there it at least says |
I just noticed
This is a string that
Further research: What is certain, is that whatever caused the edit: Scrap this part below. I just realised that in the logs that failed to have
`SIGTERM` causes a `consul server` to exit with exit code `1`, which (I think?) could trigger the `Restart`, and thus a race condition occurs between `systemctl stop` and `Restart`, causing the stop job to be canceled.
Though I am a bit skeptical about this hypothesis, since you don't have `Stopping consul.service ...` in your logs.
Whilst non-servers gracefully exit on
I think the fix here is to make the systemd unit send
This should prevent the race condition from occurring. I haven't been able to trigger the race condition on my local machine though, so I might be wrong.
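If the intended fix is sending a gentler stop signal, a minimal NixOS sketch could look like this. The exact signal named in this comment was lost, so the choice of SIGINT is an assumption, based on Consul agents performing a graceful leave on SIGINT:

```nix
# Sketch only, not the fix actually proposed in this thread: ask systemd to
# stop consul with SIGINT (on which consul leaves gracefully) instead of the
# default SIGTERM, and give the leave some time before escalation to SIGKILL.
systemd.services.consul.serviceConfig = {
  KillSignal = "SIGINT";
  TimeoutStopSec = "30s";  # value is an arbitrary example
};
```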
I can actually reproduce this part. If I just type
Afterwards
However,
From the doc regarding
I'm not sure but maybe setting
Related: The NixOS consul systemd service sets I asked about that on 9ce0c1cb7113#r23138805 but never got an answer from @wkennington.
I have filed one systemd issue that seems to be at least related: systemd/systemd#10626

I don't think the timer issue explains the full situation, because of the machines from the issue description where consul didn't get restarted, only 1 had a systemd timer installed.
I am getting more and more convinced that the
In particular, I suspect that the old-version problem has more to do with
I have done the following experiment, which supports this: I've started modifying
adding more and more
After a deploy with
That is even though
I had
And in this case, no
Trying this experiment more times, I also observe deploys where
I wrote a script to ease debugging this automatically, but encountered another systemd issue that stopped my script from being effective, because the systemd timer randomly stopped firing at all:
After adding the
My script does nothing special, just adding more
OK, so some more info: I have an rsync job with a systemd timer; the rsync job service
Involved are:
See https://gist.github.com/nh2/be476d7ce3466ea5b92dfbc39b235770 for their contents.
It seems like at least for one of the "2 out of 7 machines" (the other one doesn't run a timer, so there it's totally unclear to me so far), the
@arianvp and I know from observing systemd (systemd/systemd#10626) that a timer with
I suspect that what happens is this:
I haven't been able to reproduce the problem if my timer uses
Still racy, but an apparent workaround in my situation. It is still unclear whether this also explains the same issue on the other machine (likely not), and if #33172 has some similar situation.
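The exact timer option named in this comment did not survive extraction. Assuming the workaround was switching from `OnUnitInactiveSec` to a calendar-based trigger, the timer would look roughly like this (the unit name `rsync-job` is hypothetical):

```nix
# Hypothetical sketch: a calendar-based trigger instead of OnUnitInactiveSec,
# which avoids scheduling the timer relative to the unit's own state changes.
systemd.timers.rsync-job = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnCalendar = "*:*:0/30";  # every 30 seconds, as an example interval
    AccuracySec = "1s";
  };
};
```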
Yea, it really seems to be a race condition in there. If I set
This doesn't seem like a great solution though; magic sleeps will start breaking when other participants in the race are slower some day.
Here's my quick script btw:

```python
#!/usr/bin/env python3
import argparse
import subprocess

parser = argparse.ArgumentParser(description='Redeploy consul repeatedly and check whether the running process picked up the change.')
parser.add_argument('-d', '--deployment', type=str, help='deployment name')
parser.add_argument('--machine-name', type=str, help='nixops machine name to --include')
parser.add_argument('--ssh-host', type=str, help='SSH connection string to connect to for running ps')
args = parser.parse_args()

deployment = args.deployment
machine_name = args.machine_name
ssh_host = args.ssh_host

consul_nix = "../../nix-channel/nixpkgs/nixos/modules/services/networking/consul.nix"
line_no = 182 - 1


def update_consul_file():
    # Append another ` -ui` flag just before the closing quote of the
    # target line, so that every call changes the unit file.
    with open(consul_nix) as f:
        consul_nix_lines = f.read().split('\n')
    line = consul_nix_lines[line_no]
    consul_nix_lines[line_no] = line[:len(line) - 1] + ' -ui"'
    with open(consul_nix, 'w') as f:
        f.writelines('\n'.join(consul_nix_lines))


def deploy():
    subprocess.check_call(['./ops', 'deploy', '-d', deployment, '--include', machine_name])


def count_ui():
    # Count how many `-ui` flags the running consul agent was started with.
    ps_out = subprocess.check_output(['ssh', ssh_host, 'ps aux | grep "[c]onsul agent"'], universal_newlines=True)
    return len([word for word in ps_out.split() if word == '-ui'])


update_consul_file()
deploy()
initial_count = count = count_ui()

while True:
    update_consul_file()
    deploy()
    new_count = count_ui()
    print(new_count)
    if new_count == count:
        print("-ui count failed to change!")
        print("initial_count:", initial_count)
        print("new_count:", new_count)
        break
    count = new_count
```

I use it like
@nh2 so was this an issue with timers, or a problem in the consul module?
This is a problem with timers giving a conflicting job queue on the systemd-side of things every once in a while. We opened an upstream issue for this. systemd/systemd#10626 I'm not sure if there's anything for us to fix here on the NixOS side of things. I'd be OK with closing.
Given that NixOS forces Consul to be restarted (even if it is a server), we can adapt the consul module to prevent this. Set
But yeah, this is very service-specific and looks quite like a hack.
@danbst I don't quite understand. My original issue was that NixOS does NOT restart Consul when I expect it, making rolling upgrades unreliable. The systemd issue systemd/systemd#10626 where I commented by now indicates that even Facebook is hitting this problem in systemd btw.
@nh2 you are right 🤦♂️. What would be the best solution, do you think?
@danbst I am not sure, but until anybody from systemd has commented there, I don't think it's clear that this is not a NixOS issue, and we should probably reopen.
We ran into the same problem at Awake Security. I found the root cause here and I have a minimal reproduction. First off, here is the reproducing NixOS configuration (for EC2, which you can tweak to your desired physical configuration):

```nix
let
  # To reproduce the bug:
  #
  # * Deploy this configuration "as is"
  # * Then increment "FOO" to "1"
  # * Then redeploy the configuration
  #
  # If the reproduction succeeds the `restart-child` service should still be
  # erroneously displaying `0` instead of `1`
  FOO = "0";

in
{ imports = [ <nixpkgs/nixos/modules/virtualisation/amazon-image.nix> ];

  ec2.hvm = true;

  # This is the service that the `switch-to-configuration` script will fail to
  # update correctly. This service prints the value of the `FOO` environment
  # variable and will always lag behind by one deploy due to being prematurely
  # restarted in the middle of a `switch-to-configuration` step.
  systemd.services.restart-child = {
    enable = true;
    wantedBy = [ "restart-parent.service" ];
    environment = { inherit FOO; };
    script = ''
      while true; do echo "$FOO"; sleep 1; done
    '';
  };

  # This service is responsible for restarting `restart-child` every second,
  # since the bug is triggered by prematurely restarting a service that is
  # being upgraded by NixOS.
  systemd.services.restart-parent = {
    enable = true;
    wantedBy = [ "multi-user.target" ];
    script = "\n";
  };

  systemd.timers.restart-parent = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      Type = "oneshot";
      OnUnitInactiveSec = "1s";
      AccuracySec = "1s";
    };
  };

  # This service adds a delay to the `systemctl stop …` step of the
  # `switch-to-configuration` script so that the `restart-parent` service has
  # enough time to restart the `restart-child` service.
  systemd.services.delay-stop = {
    enable = true;
    wantedBy = [ "multi-user.target" ];
    environment = { inherit FOO; };
    script = ''
      while true; do sleep 1; done
    '';
    # The purpose of this `preStop` delay is to give the `restart-parent`
    # service a chance to restart the `restart-child` service before
    # the `switch-to-configuration` script can run `systemctl daemon-reload`
    preStop = ''
      sleep 10
    '';
  };
}
```

Here's the explanation of what triggers the bug:
This leads me to conclude that this is a NixOS bug and not a
@Gabriel439 Great investigation! Restarting makes sense to me. But would we not also have to do
My understanding is that
So it seems that the "real" stop can only be after
@nh2: Yeah, I like the idea of just moving the
Moving the stop to after the
@arianvp: Why would moving
After the
The current order is:
If you now move the stop after the daemon-reload, the wrong
If you move it after the daemon-reload, there is no difference between
Personally I wouldn't mind deprecating (or better documenting) this functionality, as its semantics are a bit confusing (#49528) and I never found a service where
Edit: I think a good example where things break is scripted networking, where it's important that the
@nh2 @Gabriel439 would setting
Maybe we close this by documenting this better? "If you want transactional stops & starts, please set stopIfChanged = false". "If you want to sacrifice transactionality for the use case where ExecStop is in the old generation and ExecStart is in the new, please leave it at the default."
Maybe also defaulting to
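`stopIfChanged` is an existing per-service NixOS option, so the documentation suggestion above would amount to something like the following (using consul as an example service):

```nix
# With stopIfChanged = false the unit is restarted after daemon-reload
# instead of being stopped before it, so the old generation's ExecStop is
# not used and the stop+start happens as one restart against the new unit.
systemd.services.consul.stopIfChanged = false;
```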
@arianvp: Individually setting Something that would fix things for us would be to add a custom
Edit: This was probably wrong, I just didn't notice that my service restarted and failed. Would still be nice for somebody in the know to confirm how it will behave.
@nh2 the restart behaviour comes from
I don't fully understand what you are saying here. systemd services restart in isolation unless
What you're suggesting is also exactly how
I commented more in detail on how to possibly address this in #49528 (comment). The issue mentioned here is an instance of the "bug" mentioned there (only reloading, even though a restart would be necessary). Let's move the discussion over there, closing here.
I keep encountering this; posting a concrete situation again, this time with the
Here we see again:
Cannot we just replace
The service would still be
But I believe that's what @Gabriella439 meant in #49415 (comment): To replace
So, concretely, I propose (nixpkgs/nixos/modules/system/activation/switch-to-configuration.pl, lines 944 to 946 in f7ae5ea):

```diff
-    print STDERR "starting the following units: ", join(", ", @units_to_start_filtered), "\n"
+    print STDERR "starting (via restart, see #49415) the following units: ", join(", ", @units_to_start_filtered), "\n"
 }
-system("$new_systemd/bin/systemctl", "start", "--", sort(keys(%units_to_start))) == 0 or $res = 4;
+system("$new_systemd/bin/systemctl", "restart", "--", sort(keys(%units_to_start))) == 0 or $res = 4;
```

Edit: This doesn't work trivially, it makes the system hang because it restarts the network. Why? Because in this commit the logic was introduced that what's started isn't actually what's printed:

```perl
print STDERR "starting the following units: ", join(", ", @units_to_start_filtered), "\n"
#                                                         ^ _filtered here
system("$new_systemd/bin/systemctl", "start", "--", sort(keys(%units_to_start))) == 0 or $res = 4;
#                                                              ^ NO _filtered here
```
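A tiny self-contained sketch (with hypothetical unit names) of the printed-vs-acted-on mismatch described in the edit above: the log message uses the filtered list, but `systemctl` receives the unfiltered set, so naively swapping `start` for `restart` also restarts units that were deliberately hidden from the log:

```python
# Hypothetical unit names; the real filtering logic in
# switch-to-configuration.pl is more involved.
units_to_start = {"consul.service", "network-setup.service"}

# Units excluded from the log message (e.g. units always (re)started anyway):
hidden_units = {"network-setup.service"}

units_to_start_filtered = sorted(u for u in units_to_start if u not in hidden_units)

print("printed:", units_to_start_filtered)   # printed: ['consul.service']
print("acted on:", sorted(units_to_start))   # acted on: ['consul.service', 'network-setup.service']
```

So a `start` → `restart` change silently applies to more units than the user is told about, which is how the naive patch ends up restarting the network.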
I am reopening this issue to
@flokli That isn't quite correct, at least not as written. In here,
I am re-opening this issue because I believe we can fix this issue --
I have created a PR that implements
Please help test!
Fixes NixOS#49415. See the added comment for a detailed description. See also the related NixOS#49528.
The original title of this issue was
It was renamed after the root cause was identified to be systemd timers in #49415 (comment) and after.
Issue description
I just deployed the upgrade from Consul 0.9.3 -> 1.3.0 (#49165) to my 18.03 server cluster (in my case, I upgraded from 1.0.6).
The machines run identical configurations deployed by `nixops`, and deploying should have stopped all old consul processes and launched the new version. Yet I found that on 2 out of 7 machines, the old consul version was still running.

On a node where the deployment worked, the `journalctl` looks like this:

Note the presence of `Stopping consul.service...`.

On a node where it didn't work, it looks like this:

Note the absence of `Stopping consul.service...`.

This is despite `nixops` having the following output:

Observe how:

* `stopping the following units: ..., consul.service` was printed
* `Stopping consul.service` didn't occur.

Possibly related is the systemd warning `consul.service: Current command vanished from the unit file`.