
Nomad failed to start after node reboot #5584

Closed · pashinin opened this issue Apr 19, 2019 · 7 comments

pashinin commented Apr 19, 2019

Nomad version

0.9.0

Operating system and Environment details

Arch Linux, Debian 9

Issue

Every time I reboot the node, Nomad fails to start afterwards.

Maybe related: #4748

Reproduction steps

Just reboot the node.

Nomad Server logs

At first, there is nothing unusual in the logs:

systemd[1]: Started Nomad.
nomad[641]: ==> Loaded configuration from /etc/nomad/config.hcl
nomad[641]: ==> Starting Nomad agent...

But after I manually restart Nomad (sudo systemctl restart nomad), I see:

systemd[1]: Stopped Nomad.
systemd[1]: nomad.service: Found left-over process 1703 (nomad) in control group while starting unit. Ignoring.
systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
systemd[1]: Started Nomad.
nomad[5365]: ==> Loaded configuration from /etc/nomad/config.hcl
nomad[5365]: ==> Starting Nomad agent...

Nomad doesn't actually start. I need to delete the alloc and client dirs before it will start.
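
For reference, the workaround looks roughly like this (a sketch; the paths assume data_dir = "/var/lib/nomad", so adjust to your config, and note that this wipes all local client state, including running allocations):

sudo systemctl stop nomad
# Assumed path: data_dir = "/var/lib/nomad" — check your config.hcl
sudo rm -rf /var/lib/nomad/client /var/lib/nomad/alloc
sudo systemctl start nomad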

My thoughts

I think it is somehow connected with systemd. On Debian 9, systemd waited 1m 30s before actually rebooting, which I think gave Nomad time to shut down cleanly. On Arch Linux there is no such wait (the reboot is almost instant), and Nomad fails to start after every reboot.

Maybe some unmounting problem... I don't know.

As soon as I delete the alloc and client dirs, Nomad starts perfectly.

pashinin changed the title from "Nomad failed to start after node restart" to "Nomad failed to start after node reboot" on Apr 19, 2019
@schmichael (Member)

If you're using Docker, Nomad's initial failure to start after rebooting is due to #5566 and fixed by #5568.

The subsequent failure is likely due to a docker_logger process still existing after the initial nomad process has exited, but I'm unsure.

Can you share your systemd unit file and the output of systemctl status nomad after reboot?
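
Something like the following should capture all of that (systemctl cat includes any drop-in files, such as your env.conf):

systemctl cat nomad        # unit file plus drop-ins
systemctl status nomad     # current state and recent log lines
journalctl -u nomad -b     # full unit log for the current boot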

@pashinin (Author)

systemctl status nomad says it's ok:

● nomad.service - Nomad
   Loaded: loaded (/etc/systemd/system/nomad.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/nomad.service.d
           └─env.conf
   Active: active (running) since Fri 2019-04-19 21:03:07 MSK; 4min 15s ago
     Docs: https://nomadproject.io/docs/
 Main PID: 649 (nomad)
    Tasks: 53 (limit: 4915)
   Memory: 145.5M
   CGroup: /system.slice/nomad.service
           ├─ 649 /usr/local/bin/nomad agent -config /etc/nomad
           └─1746 /usr/local/bin/nomad docker_logger

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

But nomad status gives:

Error querying jobs: Get http://127.0.0.1:4646/v1/jobs: dial tcp 127.0.0.1:4646: connect: connection refused
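
One way to confirm the agent never bound its HTTP port (a sketch, assuming the default port 4646):

ss -tlnp | grep 4646                             # anything listening on the HTTP port?
curl -s http://127.0.0.1:4646/v1/agent/health    # Nomad's agent health endpoint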

Systemd unit file:

[Unit]
Description=Nomad
Documentation=https://nomadproject.io/docs/
Wants=network-online.target
After=network-online.target

# If you are running Consul, please uncomment following Wants/After configs.
# Assuming your Consul service unit name is "consul"
Wants=consul.service
After=consul.service

[Service]
EnvironmentFile=/etc/systemd/system/nomad.service.d/env.conf
KillMode=process
KillSignal=SIGINT
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
Restart=on-failure
RestartSec=2
StartLimitBurst=3
#StartLimitIntervalSec=10

[Install]
WantedBy=multi-user.target
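
Note: KillMode=process in the unit above tells systemd to signal only the main nomad process on stop, leaving children such as nomad docker_logger in the cgroup; that would explain the "Found left-over process ... in control group" warning in the earlier logs. To see what survives a stop (standard systemd/procps tooling):

systemd-cgls -u nomad.service    # processes still in the unit's cgroup
pgrep -af docker_logger          # any stale docker_logger processes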

@pashinin (Author)

pashinin commented Apr 19, 2019

journalctl -b -1 (previous boot):

nomad[9878]: ==> Caught signal: interrupt
nomad[9878]:     2019-04-19T19:30:12.233+0300 [INFO ] agent: requesting shutdown
nomad[9878]:     2019-04-19T19:30:12.233+0300 [INFO ] client: shutting down
systemd[1]: Stopping Nomad...
nomad[9878]:     2019-04-19T19:30:14.554+0300 [INFO ] client.plugin: shutting down plugin manager: plugin-type=device
nomad[9878]:     2019-04-19T19:30:14.652+0300 [INFO ] client.plugin: plugin manager finished: plugin-type=device
nomad[9878]:     2019-04-19T19:30:14.652+0300 [INFO ] client.plugin: shutting down plugin manager: plugin-type=driver
nomad[9878]:     2019-04-19T19:30:15.837+0300 [INFO ] client.plugin: plugin manager finished: plugin-type=driver
nomad[9878]:     2019-04-19T19:30:16.812+0300 [INFO ] nomad: shutting down server
nomad[9878]:     2019-04-19T19:30:16.812+0300 [WARN ] nomad: serf: Shutdown without a Leave
nomad[9878]:     2019-04-19T19:30:16.839+0300 [INFO ] nomad: raft: aborting pipeline replication to peer {Voter 10.254.239.5:4647 10.254.239.5:4647}
nomad[9878]:     2019-04-19T19:30:16.839+0300 [INFO ] nomad: raft: aborting pipeline replication to peer {Voter 10.254.239.1:4647 10.254.239.1:4647}
nomad[9878]:     2019-04-19T19:30:17.358+0300 [INFO ] agent: shutdown complete
systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: nomad.service: Failed with result 'exit-code'.
systemd[1]: Stopped Nomad.

journalctl -b -0 (current boot):

systemd[1]: Started Nomad.
nomad[649]: ==> Loaded configuration from /etc/nomad/config.hcl
nomad[649]: ==> Starting Nomad agent...

Existing processes:

3164 root       24   4 2715M 76100 39136 S  0.0  0.2  0:00.37 /usr/local/bin/nomad agent -config /etc/nomad
...
1747 root       21   1 1809M 40036 25764 S  0.0  0.1  0:00.00 /usr/local/bin/nomad docker_logger
...

@zonnie

zonnie commented Apr 20, 2019

Something similar is happening to me... when I restart the Nomad client, it simply hangs:

==> Loaded configuration from /etc/nomad/client.conf
==> Starting Nomad agent...

Only removing Nomad's state directories solves this. For me this is actually new in 0.9.0; before that, it happened only rarely, due to corrupted state.

@pashinin (Author)

@zonnie Yes, it looks like 0.9.0 only. I was wrong to think it was the same problem as corrupted state; I will fix the description. It happens on every reboot and does not depend on the OS.

By the way, nomad server members (run on other nodes) reports this node as alive, but running nomad status on the failed node itself gives an error: Error querying jobs: Get http://127.0.0.1:4646/v1/jobs: dial tcp 127.0.0.1:4646: connect: connection refused

@pashinin (Author)

Looks like it is #5566.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Nov 24, 2022