
Dead service after Nomad cluster restart #5919

Closed
jozef-slezak opened this issue Jul 4, 2019 · 17 comments

Comments

@jozef-slezak

We are restarting a Nomad cluster (3 Nomad servers and tens of Nomad clients, all physical machines). This is the same test scenario as in #5917 and #5908, but a different bug report.

Please check if this is related to #5669

Nomad version

0.9.3

Operating system and Environment details

Linux CentOS

Issue

Service is not running after Nomad cluster restart.

Reproduction steps

Restart all Nomad servers and clients (sudo reboot).
Check the Nomad console - sometimes we see DEAD services that were running before the restart.

[screenshot]

@jozef-slezak
Author

jozef-slezak commented Jul 5, 2019

I was able to reproduce this behavior also in a small three-node cluster in AWS.
Just stop all Nomad servers at the same time using sudo systemctl stop nomad and then start them at the same time.
We have 6 jobs running. We can easily reproduce that after a few restarts at least one job will be DEAD (even though it was running before the stop).

Job reschedule policy (JSON from the API):

"ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 0,
        "Delay": 30000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 3600000000000,
        "Unlimited": true
      },
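For reference, the JSON above corresponds to roughly this reschedule stanza in HCL. This is my conversion, not taken from the job file; the nanosecond values are translated into human-readable durations:

```hcl
# Sketch of the equivalent HCL reschedule stanza (conversion is mine):
reschedule {
  attempts       = 0
  interval       = "0s"     # attempts/interval are unused when unlimited = true
  delay          = "30s"    # 30000000000 ns
  delay_function = "exponential"
  max_delay      = "1h"     # 3600000000000 ns
  unlimited      = true
}
```

With unlimited = true, Nomad should keep rescheduling indefinitely with exponentially growing delays capped at max_delay, which is why the job staying DEAD looks like a bug.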

I assume that Nomad did not continue rescheduling after the failure (I waited 30 minutes):

[screenshot]

@jozef-slezak
Author

jozef-slezak commented Jul 5, 2019

I am attaching the Nomad configs used in the small three-node cluster in AWS (one config per node).

addresses = {
  http = "0.0.0.0"
}
advertise = {
  http = "172.31.34.128"
  rpc = "172.31.34.128"
  serf = "172.31.34.128"
}
bind_addr = "172.31.34.128"
client = {
  enabled = true
  network_interface = "eth0"
  options = {
    driver.raw_exec.enable = 1
    driver.raw_exec.no_cgroups = 1
  }
}
data_dir = "/var/lib/nomad"
datacenter = "dc1"
disable_update_check = true
log_level = "INFO"
consul {
  server_auto_join    = false
}
server = {
  bootstrap_expect = 3
  enabled = true
  encrypt = "xxx"
  server_join {
    retry_join = [ "172.31.39.40", "172.31.36.243" ]
    retry_max = 0 # infinite
    retry_interval = "3s"
  }
}

addresses = {
  http = "0.0.0.0"
}
advertise = {
  http = "172.31.39.40"
  rpc = "172.31.39.40"
  serf = "172.31.39.40"
}
bind_addr = "172.31.39.40"
client = {
  enabled = true
  network_interface = "eth0"
  options = {
    driver.raw_exec.enable = 1
    driver.raw_exec.no_cgroups = 1
  }
}
data_dir = "/var/lib/nomad"
datacenter = "dc1"
disable_update_check = true
log_level = "INFO"
consul {
  server_auto_join    = false
}
server = {
  bootstrap_expect = 3
  enabled = true
  encrypt = "xxx"
  server_join {
    retry_join = [ "172.31.34.128", "172.31.36.243" ]
    retry_max = 0 # infinite
    retry_interval = "3s"
  }
}

addresses = {
  http = "0.0.0.0"
}
advertise = {
  http = "172.31.36.243"
  rpc = "172.31.36.243"
  serf = "172.31.36.243"
}
bind_addr = "172.31.36.243"
client = {
  enabled = true
  network_interface = "eth0"
  options = {
    driver.raw_exec.enable = 1
    driver.raw_exec.no_cgroups = 1
  }
}
data_dir = "/var/lib/nomad"
datacenter = "dc1"
disable_update_check = true
log_level = "INFO"
consul {
  server_auto_join    = false
}
server = {
  bootstrap_expect = 3
  enabled = true
  encrypt = "xxx"

  server_join {
    retry_join = [ "172.31.34.128", "172.31.39.40" ]
    retry_max = 0 # infinite
    retry_interval = "3s"
  }
}

@jozef-slezak
Author

jozef-slezak commented Jul 5, 2019

Attaching a few more screenshots. Could a driver bug cause the job not to be rescheduled?

[screenshots]

@preetapan
Contributor

@jozef-slezak What does the output of nomad job status -evals look like for the job that is in this state?

@jozef-slezak
Author

jozef-slezak commented Jul 5, 2019

ID        Type     Priority  Status   Submit Date
id1       service  50        running  2019-07-03T13:42:51+02:00
activemq  service  50        running  2019-07-03T13:42:54+02:00
id3       service  50        running  2019-07-03T13:43:00+02:00
id4       service  50        running  2019-07-03T13:43:04+02:00
id5       service  50        pending  2019-07-03T13:43:08+02:00
id6       service  50        dead     2019-07-03T13:43:20+02:00

By the way, when this happens I have to start the DEAD job manually (I just click the Start button in the console). I would expect Nomad to simply continue retrying on its own.

The PENDING job shows Queued=1. I tried stopping and starting the job, but it did not help. After restarting the Nomad agent itself it recovered and is RUNNING now. It should also recover automatically; maybe Nomad could restart the driver/plugin process.

Could you please at least suggest a workaround (for example a watchdog, or restarting the process via systemd) until a fix is available?
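Until a fix lands, a watchdog along these lines could be scripted. This is a minimal, untested sketch: the /etc/nomad-jobs directory and the one-spec-file-per-job layout are assumptions, and it only re-submits jobs whose status is reported as dead:

```shell
#!/bin/sh
# Watchdog sketch: re-submit jobs that Nomad reports as dead.
# ASSUMPTIONS: one job spec per file under /etc/nomad-jobs, file name = job ID.

# is_dead OUTPUT: succeed when `nomad job status` output reports Status = dead
is_dead() {
    printf '%s\n' "$1" | grep -Eq '^Status[[:space:]]*=[[:space:]]*dead$'
}

# Scan every job spec and re-run the dead ones.
watchdog() {
    for spec in /etc/nomad-jobs/*.hcl; do
        job=$(basename "$spec" .hcl)
        out=$(nomad job status -short "$job" 2>/dev/null) || continue
        if is_dead "$out"; then
            nomad job run "$spec"    # re-submit the dead job
        fi
    done
}
```

Running watchdog from cron or a systemd timer every minute or so would be a stopgap only, not a substitute for a fix.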

@jozef-slezak
Author

Could you please run the same test that @cgbaker ran in #5917?

Repeatedly restart the three-node cluster until you reproduce this behavior (a DEAD job after restart):

  • stop all Nomad nodes (systemctl stop nomad)
  • wait for them to stop
  • start all Nomad nodes (systemctl start nomad)
  • wait for them to start, then check job status

@cgbaker
Contributor

cgbaker commented Jul 8, 2019

I wasn't able to reproduce this with separate servers and clients. I did see some trouble that may be resolved in #5906. I will try to repro with that build.

@jozef-slezak
Author

@cgbaker, this issue is hard to reproduce. I believe it is related to the "Error was unrecoverable" message (see the screenshots above).

@jozef-slezak
Author

jozef-slezak commented Jul 9, 2019

Today, I was able to simulate this buggy behavior on a single node (just by using sudo systemctl stop nomad and sudo systemctl start nomad). Please check the evals below.

> nomad job status -evals id6
ID            = id1
Name          = id1
Submit Date   = 2019-04-23T11:01:40+02:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
id6    1       0         0        0       0         0

Evaluations
ID  Priority  Triggered By  Status  Placement Failures

Allocations
No allocations placed


> free
              total        used        free      shared  buff/cache   available
Mem:        5895704     2205248     2666648       21780     1023808     3413816
Swap:       5435388           0     5435388

@cgbaker
Contributor

cgbaker commented Jul 9, 2019

I tried to reproduce this using 2f55f78b21a5e55ab122f2c1e1ed1ec21fde9566 and with KillMode=process in the systemd unit, but I was unable to. @jozef-slezak , do you have KillMode=process configured?

Update: disregard that; I had the wrong Nomad binary. I am still curious whether you have the Nomad systemd service configured with KillMode=process.

@cgbaker
Contributor

cgbaker commented Jul 9, 2019

@jozef-slezak: I attempted to repro this on a single node (client+server) with the latest Nomad 0eae387a96d6482cece2a8aa51f4aa8d8616549 (specifically, a build that includes #5906). I repeatedly stopped and started the Nomad agent using systemctl stop/start, but no new allocations were ever created and the associated docker task was never interrupted.

@jozef-slezak
Author

Thanks for your cooperation. Tomorrow I will check the KillMode setting. We are using the raw_exec driver instead of the Docker driver. This issue is about a dead process that was never even started, not an interrupted one. I should probably submit a test script and try with the latest Nomad.

@jozef-slezak
Author

We have KillMode=control-group. @cgbaker, what is your opinion, please? Please also compare sudo systemctl stop nomad with sudo reboot.

@cgbaker
Contributor

cgbaker commented Jul 10, 2019

I will do more testing with either reboot or with restart/KillMode=process. For the record, we recommend configuring Nomad's systemd unit with KillMode=process.
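For reference, the relevant fragment of a systemd unit following that recommendation might look like the sketch below. The ExecStart path and the extra settings are illustrative, not taken from this thread; the point of KillMode=process is that systemd only signals the Nomad agent itself, so task child processes survive an agent restart:

```ini
# /etc/systemd/system/nomad.service (fragment, illustrative)
[Service]
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
KillMode=process          # signal only the agent, not task child processes
KillSignal=SIGINT
Restart=on-failure
```

After editing the unit, systemctl daemon-reload is needed before the change takes effect.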

@stale

stale bot commented Oct 8, 2019

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@stale

stale bot commented Nov 7, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 17, 2022