
Dead service after Nomad cluster restart #5919

Closed
jozef-slezak opened this issue Jul 4, 2019 · 17 comments

Comments

@jozef-slezak

We are restarting a Nomad cluster (3 Nomad servers and tens of Nomad clients, all physical machines). This is the same test scenario as in #5917 and #5908, but a different bug report.

Please check if this is related to #5669

Nomad version

0.9.3

Operating system and Environment details

Linux CentOS

Issue

Service is not running after Nomad cluster restart.

Reproduction steps

Restart all Nomad servers and clients (sudo reboot).
Check the Nomad console - sometimes we see DEAD services that were running before the restart.

[screenshot]

@jozef-slezak
Author

jozef-slezak commented Jul 5, 2019

I was able to reproduce this behavior also in a small three-node cluster in AWS.
Just stop all Nomad servers at the same time using sudo systemctl stop nomad and then start them at the same time.
We have 6 jobs running. We can easily reproduce that after a few restarts at least one job will be DEAD (even though it was running before the stop).

Job reschedule policy (JSON from the API):

"ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 0,
        "Delay": 30000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 3600000000000,
        "Unlimited": true
      },
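For reference, the JSON above corresponds to roughly this reschedule stanza in HCL. This is my conversion, not taken from the job file; the nanosecond values are translated into human-readable durations:

```hcl
# Sketch of the equivalent HCL reschedule stanza (conversion is mine):
reschedule {
  attempts       = 0
  interval       = "0s"     # attempts/interval are unused when unlimited = true
  delay          = "30s"    # 30000000000 ns
  delay_function = "exponential"
  max_delay      = "1h"     # 3600000000000 ns
  unlimited      = true
}
```

With unlimited = true, Nomad should keep rescheduling indefinitely with exponentially growing delays capped at max_delay, which is why the job staying DEAD looks like a bug.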

I assume that Nomad did not continue rescheduling after the failure (I waited 30 minutes):

[screenshot]

@jozef-slezak
Author

jozef-slezak commented Jul 5, 2019

I am attaching the Nomad configs used in the small three-node cluster in AWS (one config per node).

addresses = {
  http = "0.0.0.0"
}
advertise = {
  http = "172.31.34.128"
  rpc = "172.31.34.128"
  serf = "172.31.34.128"
}
bind_addr = "172.31.34.128"
client = {
  enabled = true
  network_interface = "eth0"
  options = {
    driver.raw_exec.enable = 1
    driver.raw_exec.no_cgroups = 1
  }
}
data_dir = "/var/lib/nomad"
datacenter = "dc1"
disable_update_check = true
log_level = "INFO"
consul {
  server_auto_join    = false
}
server = {
  bootstrap_expect = 3
  enabled = true
  encrypt = "xxx"
  server_join {
    retry_join = [ "172.31.39.40", "172.31.36.243" ]
    retry_max = 0 # infinite
    retry_interval = "3s"
  }
}

addresses = {
  http = "0.0.0.0"
}
advertise = {
  http = "172.31.39.40"
  rpc = "172.31.39.40"
  serf = "172.31.39.40"
}
bind_addr = "172.31.39.40"
client = {
  enabled = true
  network_interface = "eth0"
  options = {
    driver.raw_exec.enable = 1
    driver.raw_exec.no_cgroups = 1
  }
}
data_dir = "/var/lib/nomad"
datacenter = "dc1"
disable_update_check = true
log_level = "INFO"
consul {
  server_auto_join    = false
}
server = {
  bootstrap_expect = 3
  enabled = true
  encrypt = "xxx"
  server_join {
    retry_join = [ "172.31.34.128", "172.31.36.243" ]
    retry_max = 0 # infinite
    retry_interval = "3s"
  }
}

addresses = {
  http = "0.0.0.0"
}
advertise = {
  http = "172.31.36.243"
  rpc = "172.31.36.243"
  serf = "172.31.36.243"
}
bind_addr = "172.31.36.243"
client = {
  enabled = true
  network_interface = "eth0"
  options = {
    driver.raw_exec.enable = 1
    driver.raw_exec.no_cgroups = 1
  }
}
data_dir = "/var/lib/nomad"
datacenter = "dc1"
disable_update_check = true
log_level = "INFO"
consul {
  server_auto_join    = false
}
server = {
  bootstrap_expect = 3
  enabled = true
  encrypt = "xxx"

  server_join {
    retry_join = [ "172.31.34.128", "172.31.39.40" ]
    retry_max = 0 # infinite
    retry_interval = "3s"
  }
}

@jozef-slezak
Author

jozef-slezak commented Jul 5, 2019

Attaching a few more screenshots. Could a driver bug cause the job not to be rescheduled?

[screenshots]

@preetapan
Contributor

@jozef-slezak What does the output of nomad job status -evals look like for the job that is in this state?

@jozef-slezak
Author

jozef-slezak commented Jul 5, 2019

ID        Type     Priority  Status   Submit Date
id1       service  50        running  2019-07-03T13:42:51+02:00
activemq  service  50        running  2019-07-03T13:42:54+02:00
id3       service  50        running  2019-07-03T13:43:00+02:00
id4       service  50        running  2019-07-03T13:43:04+02:00
id5       service  50        pending  2019-07-03T13:43:08+02:00
id6       service  50        dead     2019-07-03T13:43:20+02:00

By the way, when this happens I have to start the DEAD job manually (I just click the Start button in the console). I would expect Nomad to simply continue retrying on its own.

The PENDING job shows Queued=1. I tried stopping and starting the job, but it did not help. After restarting the Nomad agent itself it recovered and is RUNNING now. It should also recover automatically; maybe Nomad could restart the driver/plugin process.

Could you please at least suggest a workaround (for example a watchdog, or restarting the process via systemd) until a fix is available?
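Until a fix lands, a watchdog along these lines could be scripted. This is a minimal, untested sketch: the /etc/nomad-jobs directory and the one-spec-file-per-job layout are assumptions, and it only re-submits jobs whose status is reported as dead:

```shell
#!/bin/sh
# Watchdog sketch: re-submit jobs that Nomad reports as dead.
# ASSUMPTIONS: one job spec per file under /etc/nomad-jobs, file name = job ID.

# is_dead OUTPUT: succeed when `nomad job status` output reports Status = dead
is_dead() {
    printf '%s\n' "$1" | grep -Eq '^Status[[:space:]]*=[[:space:]]*dead$'
}

# Scan every job spec and re-run the dead ones.
watchdog() {
    for spec in /etc/nomad-jobs/*.hcl; do
        job=$(basename "$spec" .hcl)
        out=$(nomad job status -short "$job" 2>/dev/null) || continue
        if is_dead "$out"; then
            nomad job run "$spec"    # re-submit the dead job
        fi
    done
}
```

Running watchdog from cron or a systemd timer every minute or so would be a stopgap only, not a substitute for a fix.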

@jozef-slezak
Author

Could you please run the same test that @cgbaker ran in #5917?

Repeatedly restart the three-node cluster until you reproduce this behavior (a DEAD job after restart):

  • stop all Nomad nodes (systemctl stop nomad)
  • wait for them to stop
  • start all Nomad nodes (systemctl start nomad)
  • wait for them to start, then check job status

@cgbaker
Contributor

cgbaker commented Jul 8, 2019

I wasn't able to reproduce this with separate servers and clients. I did see some trouble that may be resolved in #5906. I will try to repro with that build.

@jozef-slezak
Author

@cgbaker, this issue is hard to reproduce. I believe it is related to the "Error was unrecoverable" message (see the screenshots above).

@jozef-slezak
Author

jozef-slezak commented Jul 9, 2019

Today, I was able to simulate this buggy behavior on a single node (just by using sudo systemctl stop nomad and sudo systemctl start nomad). Please check the evals below.

> nomad job status -evals id6
ID            = id1
Name          = id1
Submit Date   = 2019-04-23T11:01:40+02:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
id6    1       0         0        0       0         0

Evaluations
ID  Priority  Triggered By  Status  Placement Failures

Allocations
No allocations placed


> free
              total        used        free      shared  buff/cache   available
Mem:        5895704     2205248     2666648       21780     1023808     3413816
Swap:       5435388           0     5435388

@cgbaker
Contributor

cgbaker commented Jul 9, 2019

I tried to reproduce this using 2f55f78b21a5e55ab122f2c1e1ed1ec21fde9566 and with KillMode=process in the systemd unit, but I was unable to. @jozef-slezak , do you have KillMode=process configured?

Update: disregard that; I had the wrong Nomad binary. I am still curious whether you have the Nomad systemd service configured with KillMode=process.

@cgbaker
Contributor

cgbaker commented Jul 9, 2019

@jozef-slezak: I attempted to repro this on a single node (client+server) with the latest Nomad 0eae387a96d6482cece2a8aa51f4aa8d8616549 (specifically, a build that includes #5906). I repeatedly stopped and started the Nomad agent using systemctl stop/start, but no new allocations were ever created and the associated docker task was never interrupted.

@jozef-slezak
Author

Thanks for your cooperation. Tomorrow I will check the KillMode setting. We are using the raw_exec driver instead of the Docker driver. This issue is about a dead process that was never even started, not an interrupted one. I should probably submit a test script and try with the latest Nomad.

@jozef-slezak
Author

We have KillMode=control-group. @cgbaker, what is your opinion, please? Please also compare sudo systemctl stop nomad with sudo reboot.

@cgbaker
Contributor

cgbaker commented Jul 10, 2019

I will do more testing with either reboot or with restart/KillMode=process. For the record, we recommend configuring Nomad's systemd unit with KillMode=process.
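For reference, the relevant fragment of a systemd unit following that recommendation might look like the sketch below. The ExecStart path and the extra settings are illustrative, not taken from this thread; the point of KillMode=process is that systemd only signals the Nomad agent itself, so task child processes survive an agent restart:

```ini
# /etc/systemd/system/nomad.service (fragment, illustrative)
[Service]
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
KillMode=process          # signal only the agent, not task child processes
KillSignal=SIGINT
Restart=on-failure
```

After editing the unit, systemctl daemon-reload is needed before the change takes effect.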

@stale

stale bot commented Oct 8, 2019

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@stale

stale bot commented Nov 7, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 17, 2022