
Nomad failed to start after node reboot #5584

Closed · pashinin opened this issue Apr 19, 2019 · 7 comments

pashinin commented Apr 19, 2019

Nomad version

0.9.0

Operating system and Environment details

Arch Linux, Debian 9

Issue

Every time I reboot the node, Nomad fails to start afterwards.

Maybe related: #4748

Reproduction steps

Just reboot the node.

Nomad Server logs

At first, there is nothing unusual in the logs:

systemd[1]: Started Nomad.
nomad[641]: ==> Loaded configuration from /etc/nomad/config.hcl
nomad[641]: ==> Starting Nomad agent...

But after I manually restart Nomad (sudo systemctl restart nomad), I see:

systemd[1]: Stopped Nomad.
systemd[1]: nomad.service: Found left-over process 1703 (nomad) in control group while starting unit. Ignoring.
systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
systemd[1]: Started Nomad.
nomad[5365]: ==> Loaded configuration from /etc/nomad/config.hcl
nomad[5365]: ==> Starting Nomad agent...

Nomad doesn't actually start. I need to delete the alloc and client dirs before it will start.
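
For reference, the workaround looks roughly like this (a sketch; the paths assume data_dir = "/var/lib/nomad", so adjust to your config, and note that this wipes all local client state, including running allocations):

sudo systemctl stop nomad
# Assumed path: data_dir = "/var/lib/nomad" — check your config.hcl
sudo rm -rf /var/lib/nomad/client /var/lib/nomad/alloc
sudo systemctl start nomad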

My thoughts

I think it is somehow connected with systemd. On Debian 9, systemd waited 1m 30s before actually rebooting, which I think gave Nomad time to shut down cleanly. On Arch Linux there is no such wait (the reboot is almost instant), and Nomad fails to start after every reboot.

Maybe some unmounting problem... I don't know.

As soon as I delete the alloc and client dirs, Nomad starts perfectly.

pashinin changed the title from "Nomad failed to start after node restart" to "Nomad failed to start after node reboot" on Apr 19, 2019
@schmichael (Member)

If you're using Docker, Nomad's initial failure to start after rebooting is due to #5566 and fixed by #5568.

The subsequent failure is likely due to a docker_logger process still existing after the initial nomad process has exited, but I'm unsure.

Can you share your systemd unit file and the output of systemctl status nomad after reboot?
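
Something like the following should capture all of that (systemctl cat includes any drop-in files, such as your env.conf):

systemctl cat nomad        # unit file plus drop-ins
systemctl status nomad     # current state and recent log lines
journalctl -u nomad -b     # full unit log for the current boot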

@pashinin (Author)

systemctl status nomad says it's ok:

● nomad.service - Nomad
   Loaded: loaded (/etc/systemd/system/nomad.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/nomad.service.d
           └─env.conf
   Active: active (running) since Fri 2019-04-19 21:03:07 MSK; 4min 15s ago
     Docs: https://nomadproject.io/docs/
 Main PID: 649 (nomad)
    Tasks: 53 (limit: 4915)
   Memory: 145.5M
   CGroup: /system.slice/nomad.service
           ├─ 649 /usr/local/bin/nomad agent -config /etc/nomad
           └─1746 /usr/local/bin/nomad docker_logger

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

But nomad status gives:

Error querying jobs: Get http://127.0.0.1:4646/v1/jobs: dial tcp 127.0.0.1:4646: connect: connection refused
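
One way to confirm the agent never bound its HTTP port (a sketch, assuming the default port 4646):

ss -tlnp | grep 4646                             # anything listening on the HTTP port?
curl -s http://127.0.0.1:4646/v1/agent/health    # Nomad's agent health endpoint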

Systemd unit file:

[Unit]
Description=Nomad
Documentation=https://nomadproject.io/docs/
Wants=network-online.target
After=network-online.target

# If you are running Consul, please uncomment following Wants/After configs.
# Assuming your Consul service unit name is "consul"
Wants=consul.service
After=consul.service

[Service]
EnvironmentFile=/etc/systemd/system/nomad.service.d/env.conf
KillMode=process
KillSignal=SIGINT
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
Restart=on-failure
RestartSec=2
StartLimitBurst=3
#StartLimitIntervalSec=10

[Install]
WantedBy=multi-user.target
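
Note: KillMode=process in the unit above tells systemd to signal only the main nomad process on stop, leaving children such as nomad docker_logger in the cgroup; that would explain the "Found left-over process ... in control group" warning in the earlier logs. To see what survives a stop (standard systemd/procps tooling):

systemd-cgls -u nomad.service    # processes still in the unit's cgroup
pgrep -af docker_logger          # any stale docker_logger processes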

@pashinin (Author)

pashinin commented Apr 19, 2019

journalctl -b -1 (previous boot):

nomad[9878]: ==> Caught signal: interrupt
nomad[9878]:     2019-04-19T19:30:12.233+0300 [INFO ] agent: requesting shutdown
nomad[9878]:     2019-04-19T19:30:12.233+0300 [INFO ] client: shutting down
systemd[1]: Stopping Nomad...
nomad[9878]:     2019-04-19T19:30:14.554+0300 [INFO ] client.plugin: shutting down plugin manager: plugin-type=device
nomad[9878]:     2019-04-19T19:30:14.652+0300 [INFO ] client.plugin: plugin manager finished: plugin-type=device
nomad[9878]:     2019-04-19T19:30:14.652+0300 [INFO ] client.plugin: shutting down plugin manager: plugin-type=driver
nomad[9878]:     2019-04-19T19:30:15.837+0300 [INFO ] client.plugin: plugin manager finished: plugin-type=driver
nomad[9878]:     2019-04-19T19:30:16.812+0300 [INFO ] nomad: shutting down server
nomad[9878]:     2019-04-19T19:30:16.812+0300 [WARN ] nomad: serf: Shutdown without a Leave
nomad[9878]:     2019-04-19T19:30:16.839+0300 [INFO ] nomad: raft: aborting pipeline replication to peer {Voter 10.254.239.5:4647 10.254.239.5:4647}
nomad[9878]:     2019-04-19T19:30:16.839+0300 [INFO ] nomad: raft: aborting pipeline replication to peer {Voter 10.254.239.1:4647 10.254.239.1:4647}
nomad[9878]:     2019-04-19T19:30:17.358+0300 [INFO ] agent: shutdown complete
systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: nomad.service: Failed with result 'exit-code'.
systemd[1]: Stopped Nomad.

journalctl -b -0 (current boot):

systemd[1]: Started Nomad.
nomad[649]: ==> Loaded configuration from /etc/nomad/config.hcl
nomad[649]: ==> Starting Nomad agent...

Existing processes:

3164 root       24   4 2715M 76100 39136 S  0.0  0.2  0:00.37 /usr/local/bin/nomad agent -config /etc/nomad
...
1747 root       21   1 1809M 40036 25764 S  0.0  0.1  0:00.00 /usr/local/bin/nomad docker_logger
...

@zonnie

zonnie commented Apr 20, 2019

Something similar is happening to me... when I restart the Nomad client, it simply hangs:

==> Loaded configuration from /etc/nomad/client.conf
==> Starting Nomad agent...

Only removing Nomad's state directories solves this. For me this is actually new in 0.9.0; before that, it happened only rarely, due to corrupted state.

@pashinin (Author)

@zonnie Yes, it looks like 0.9.0 only. I was wrong to think it was the same problem as corrupted state; I will fix the description. It happens on every reboot and does not depend on the OS.

By the way, nomad server members (run on other nodes) reports this node as alive, but running nomad status on the failed node itself gives an error: Error querying jobs: Get http://127.0.0.1:4646/v1/jobs: dial tcp 127.0.0.1:4646: connect: connection refused

@pashinin (Author)

Looks like it is #5566.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Nov 24, 2022