nomad restarting services after lost state even with restart {attempts=0} #6212

Closed
cbnorman opened this issue Aug 26, 2019 · 9 comments

@cbnorman

cbnorman commented Aug 26, 2019

Nomad version

Nomad v0.9.4 (a81aa84)

Operating system and Environment details

debian 9

Issue

A small number of our services run in a remote datacenter using the raw_exec driver. That datacenter is connected to our Nomad cluster via a dedicated cloud connection. Most of the services are stateful and require a controlled shutdown to avoid data loss. We have therefore configured the jobs with:

job "test" {
  datacenters = ["dc1"]
  group "test" {
    count = 1
    restart {
      attempts = 0
      mode = "fail"
    }
    reschedule {
      attempts  = 0
      unlimited = false
    }
    constraint {
      attribute = "${node.unique.name}"
      value     = "backend"
    }
    task "test-task" {
      driver = "raw_exec"
      config {
        command = "/usr/bin/python3"
        args = [ "/root/test.py" ]
      }
      env {
      }
    }
  }
}

We have noticed that if the clients in the datacenter are disconnected from the servers in the cloud, all services go into a lost state and continue to run locally, which is great. The problem is that when the clients reconnect to the servers, all the jobs are restarted.

here is the nomad status for a test job:

ID                  = 43d0b726
Eval ID             = 6906e42e
Name                = test.test[0]
Node ID             = 405655e5
Node Name           = backend
Job ID              = test
Job Version         = 824633979968
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = stop
Desired Description = alloc is lost since its node is down
Created             = 1h9m ago
Modified            = 50m4s ago

Task "test-task" is "dead"
Task Resources
CPU        Memory          Disk     Addresses
0/100 MHz  36 MiB/300 MiB  300 MiB

Task Events:
Started At     = 2019-08-26T12:59:15Z
Finished At    = 2019-08-26T13:18:16Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-08-26T13:18:16Z  Killed      Task successfully killed
2019-08-26T13:18:16Z  Terminated  Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Canceled desc = grpc: the client connection is closing"
2019-08-26T13:18:16Z  Killing     Sent interrupt. Waiting 5s before force killing
2019-08-26T12:59:15Z  Started     Task started by client
2019-08-26T12:59:15Z  Task Setup  Building Task Directory
2019-08-26T12:59:15Z  Received    Task received by client

here are the logs from the client:

: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection"
Aug 26 13:18:16 backend nomad[30733]:     2019-08-26T13:18:16.450Z [WARN ] client: missed heartbeat: req_latency=7.317776ms heartbeat_ttl=17.674206831s since_last_heartbeat=2m1.800974251s
Aug 26 13:18:16 backend nomad[30733]:     2019-08-26T13:18:16.462Z [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=43d0b726-61a0-a176-9f54-2cea667a906f driver=raw_exec task_name=test-task error="rpc error: code = Unavailable desc = transport is closing"
Aug 26 13:18:16 backend nomad[30733]:     2019-08-26T13:18:16.462Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=43d0b726-61a0-a176-9f54-2cea667a906f task=test-task error="rpc error: code = Unavailable desc = transport is closing"
Aug 26 13:18:16 backend nomad[30733]:     2019-08-26T13:18:16.487Z [INFO ] client.gc: marking allocation for GC: alloc_id=43d0b726-61a0-a176-9f54-2cea667a906f
Aug 26 13:18:16 backend nomad[30733]:     2019-08-26T13:18:16.490Z [ERROR] client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=43d0b726-61a0-a176-9f54-2cea667a906f task=test-task error="read |0: file already closed"
Aug 26 13:18:16 backend nomad[30733]:     2019-08-26T13:18:16.507Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=b3ab5bff-dae6-4df6-7211-3379067abc1a task=test-task @module=logmon path=/opt/nomad/alloc/b3ab5bff-dae6-4df6-7211-3379067abc1a/alloc/logs/.test-task.stdout.fifo timestamp=2019-08-26T13:18:16.507Z
Aug 26 13:18:16 backend nomad[30733]:     2019-08-26T13:18:16.508Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=b3ab5bff-dae6-4df6-7211-3379067abc1a task=test-task @module=logmon path=/opt/nomad/alloc/b3ab5bff-dae6-4df6-7211-3379067abc1a/alloc/logs/.test-task.stderr.fifo timestamp=2019-08-26T13:18:16.508Z
Aug 26 13:18:16 backend nomad[30733]:     2019-08-26T13:18:16.517Z [INFO ] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/usr/bin/python3 Args:[/root/test.py]}"

Is there any way to completely stop Nomad from restarting a job? As mentioned, the job functions fine while disconnected; it's only on reconnection to the master that Nomad decides to restart it, even though the job has no restart configured.
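For context on the "Sent interrupt. Waiting 5s before force killing" event above: a task that needs a controlled shutdown has to react to that interrupt itself within the kill window. Below is a minimal sketch, assuming a Python task along the lines of /root/test.py, whose real contents are not shown in this issue. The names `install_handlers`, `flush_state`, and `run`, and the polling loop, are hypothetical placeholders. This does not address the restart on reconnect; it only illustrates what "controlled shutdown" could look like on the task side.

```python
import signal
import threading

# Hypothetical stand-in for /root/test.py; the real script is not part of this issue.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only record the request, the main loop does the cleanup."""
    stop_requested.set()


def install_handlers():
    """Register handlers for the interrupt seen in the task events, plus SIGTERM."""
    signal.signal(signal.SIGINT, request_stop)
    signal.signal(signal.SIGTERM, request_stop)


def flush_state(log):
    """Placeholder for whatever must be persisted before the process exits."""
    log.append("state flushed")


def run(log, poll_interval=0.1):
    """Work until a stop is requested, then flush state and return exit code 0."""
    install_handlers()
    while not stop_requested.is_set():
        # Stand-in for real work; wakes up early once the event is set.
        stop_requested.wait(poll_interval)
    flush_state(log)
    return 0
```

A script would end with something like `raise SystemExit(run([]))`. The cleanup has to finish before the force kill, which the event log above shows arriving 5s after the interrupt.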

@preetapan
Contributor

@cbnorman This isn't currently possible, because a lost node means the server doesn't have accurate information on the status of the running task. We have been talking internally about use cases like yours, where the allocation should not be marked as failed even if its node goes lost, because the user has explicitly opted out of that behavior. We'll take your use case into consideration; we usually like to gather evidence and use cases before prioritizing a feature like this.

@tgross
Member

tgross commented Apr 8, 2020

We've got another report of this in #7607, and it looks like we'll need to take this case into account in #2185 which we'll be implementing as a requirement for CSI.

@tgross tgross added this to the 0.11.1 milestone Apr 9, 2020
@tgross tgross modified the milestones: 0.11.1, 0.11.2 Apr 22, 2020
@tgross tgross removed this from the 0.11.2 milestone May 5, 2020
@narendrapatel

We are facing similar issues in one of our test environments. Our jobs are placed according to static constraints set on the nodes. For example, Job A runs on nodes whose "app" constraint has the value A. Say we have 2 nodes for A and the job is deployed on them. After a heartbeat failure, Nomad reschedules the allocation, and the replacement lands on the same node once it re-registers after a successful heartbeat. The node then still has the old alloc running while the Nomad client starts the newly rescheduled replacement allocation. Because of this we get an "address already in use" port conflict in our logs and our app goes into a failed state. Maybe the "Stop After Client Disconnect" parameter of the group stanza could have helped stop the old alloc, but since the heartbeat timeout period (a few seconds) is relatively short and the stop process takes some time, it leaves us no chance to stop it before Nomad places a replacement.
For this situation, can increasing the heartbeat grace period be a temporary solution?
Nomad version in use: 0.10.2

@tgross
Member

tgross commented Feb 16, 2021

Hi @narendrapatel!

For this situation, can increasing the heartbeat grace period be a temporary solution?

That's a reasonable workaround if you expect to see some intermittent networking issues with the clients, but that's probably worth spending some time debugging in your environment as well.

@tgross tgross changed the title [question] nomad restarting services after lost state even with restart {attempts=0} nomad restarting services after lost state even with restart {attempts=0} Feb 16, 2021
@narendrapatel

narendrapatel commented Feb 22, 2021

Hi @tgross

but that's probably worth spending some time debugging in your environment as well.

Our network team has confirmed that the test environment will have network issues due to some throttling constraints. I have analyzed some heartbeat timeout logs from the agents and think extending the grace period to 90s should be a fair configuration for now. We are also checking whether we can add some alerting around this. In addition, we are increasing the open file limits for the Nomad servers, as we found some increase in usage there. Is there anything more that I can add? Can you guide me here if you have some pointers, or if I am missing something?
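As an illustration of that kind of log analysis, here is a rough sketch that pulls the `since_last_heartbeat` value out of "missed heartbeat" client log lines and lists the gaps above a candidate threshold. The sample line is the one posted at the top of this issue (from a different environment), and the helper names are hypothetical. The sketch only compares raw gaps to a number; how the servers combine the heartbeat TTL with the grace period is not modeled here.

```python
import re

# Seconds per unit for Go-style duration strings such as "2m1.800974251s".
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                 "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
_COMPONENT = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|h|m|s)")
_GAP = re.compile(r"missed heartbeat:.*\bsince_last_heartbeat=(\S+)")

# Log line copied from the client logs earlier in this issue (prefix trimmed).
SAMPLE = ("2019-08-26T13:18:16.450Z [WARN ] client: missed heartbeat: "
          "req_latency=7.317776ms heartbeat_ttl=17.674206831s "
          "since_last_heartbeat=2m1.800974251s")


def go_duration_seconds(text):
    """Convert a duration like '2m1.800974251s' or '7.317776ms' to seconds."""
    parts = _COMPONENT.findall(text)
    # Reject input that the component pattern does not cover completely.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"not a recognised duration: {text!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def heartbeat_gaps(lines):
    """Return since_last_heartbeat in seconds for each 'missed heartbeat' line."""
    gaps = []
    for line in lines:
        match = _GAP.search(line)
        if match:
            gaps.append(go_duration_seconds(match.group(1)))
    return gaps


def gaps_exceeding(lines, threshold_seconds):
    """Return the gaps that are longer than the candidate threshold."""
    return [gap for gap in heartbeat_gaps(lines) if gap > threshold_seconds]
```

Feeding the agent logs through `gaps_exceeding(lines, 90.0)` would show which missed heartbeats were longer than 90 seconds.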

Also, a suggestion: can we get the Nomad leader to ask the client to push details of all the allocations running on it before scheduling an allocation there? This could be used to avoid re-running an allocation on a client that came back after a missed heartbeat and already has the allocation running. If the running allocation does not match our latest job specification, then we can stop the allocation and schedule a new one if required.
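To make the suggestion concrete, the decision could look roughly like the sketch below. This is purely illustrative and is not how Nomad's scheduler works; `ReportedAlloc`, `reconcile`, and all field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedAlloc:
    """What a reconnecting client might report for one allocation (hypothetical)."""
    alloc_id: str
    job_id: str
    job_version: int
    running: bool


def reconcile(reported, job_id, wanted_version, wanted_count):
    """Return (keep, stop, to_place) for one job on a reconnecting client.

    keep: IDs of running allocations that already match the wanted job version.
    stop: IDs of running allocations that are outdated or surplus.
    to_place: number of new allocations still needed.
    """
    keep, stop = [], []
    for alloc in reported:
        if alloc.job_id != job_id or not alloc.running:
            continue  # other jobs and already-stopped allocations need no action
        if alloc.job_version == wanted_version and len(keep) < wanted_count:
            keep.append(alloc.alloc_id)
        else:
            stop.append(alloc.alloc_id)
    return keep, stop, wanted_count - len(keep)
```

With one matching allocation still running and a wanted count of 1, nothing would be stopped or placed, which is the outcome being asked for here.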

@tgross
Member

tgross commented Feb 22, 2021

Is there anything more that I can add? Can you guide me here if you have some pointers, or if I am missing something?

Definitely take a look at the production requirements docs if you haven't already.

Also, a suggestion: can we get the Nomad leader to ask the client to push details of all the allocations running on it before scheduling an allocation there? This could be used to avoid re-running an allocation on a client that came back after a missed heartbeat and already has the allocation running. If the running allocation does not match our latest job specification, then we can stop the allocation and schedule a new one if required.

Depending on why this is important to you, you may want to look at stop_after_client_disconnect but if that doesn't meet your needs please open a new issue for the feature request you're making here.

@narendrapatel

Definitely take a look at the production requirements docs if you haven't already.

@tgross Yes, we meet the given requirements, except that this is a test environment and can have some network latency on and off.

you may want to look at stop_after_client_disconnect but if that doesn't meet your needs please open a new issue for the feature request you're making here.

We already checked that setting, but unfortunately the time between the heartbeat miss and re-registration is very short, and the process stop takes some time.
For now, to handle this, we've configured a check of the labeled ports from the task definition on task startup. If they are free, the task proceeds; otherwise it waits for the shutdown delay period + 30s (for jitter) and then re-checks. If that succeeds, it proceeds with startup; otherwise it exits with code 1.
This, plus the increased grace period (90s) and monitoring for heartbeat timeouts, should help us avoid situations where a new task is started while a previous one is still running. The new task will keep failing until the previous one shuts down, and then the new allocation's startup can proceed.
However, to be very safe, we would prefer that Nomad not attempt a restart when the client comes back after a heartbeat miss and has an allocation matching the job spec. I'll file a FR. Thanks! Appreciate the help :)
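The startup port check described above could be sketched along these lines. This is a minimal illustration, not the actual wrapper used in that environment. The function names are hypothetical, the list of ports is assumed to be collected by the caller from the task definition, and the probe only covers TCP on one bind address.

```python
import socket
import time


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
    return True


def wait_for_ports(ports, shutdown_delay, jitter=30.0, host="0.0.0.0",
                   sleep=time.sleep):
    """Return exit code 0 if all ports are free, re-checking once after a wait.

    The wait is shutdown_delay + jitter seconds, matching the description
    above; the return value 1 is meant to be used as the process exit code.
    """
    if all(port_is_free(port, host) for port in ports):
        return 0
    sleep(shutdown_delay + jitter)
    if all(port_is_free(port, host) for port in ports):
        return 0
    return 1
```

A wrapper would call `wait_for_ports(...)` before launching the real process and exit with the returned code when it is non-zero. The `sleep` parameter exists only so the wait can be replaced in tests.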

@tgross
Member

tgross commented Mar 3, 2021

Ok, going to close this issue in lieu of the one you'll open.

@tgross tgross closed this as completed Mar 3, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2022