Nomad server failover vs. task restart - Consul DNS #5908

Closed
jozef-slezak opened this issue Jul 1, 2019 · 7 comments

Comments

@jozef-slezak

Some of our jobs/tasks (running on Nomad clients on different machines than the Nomad servers) are restarted when a rebooted Nomad server rejoins the cluster. I would expect defensive behavior here. Please minimize restarts, or correct me if I am wrong.
Is this behavior also related to #5669?

Nomad version

0.9.1

Operating system and Environment details

CentOS Linux

Issue

Unexpected job restarts when a Nomad server comes back (3-node cluster).

Reproduction steps

  1. Start 3-node cluster
  2. Submit job
  3. Reboot one Nomad server (sudo reboot)
  4. Observe that some allocations of the same job/task were restarted
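
One way to make step 4 checkable is to compare per-allocation restart counts taken before and after the server reboot. The following is a minimal sketch, not part of the original report. It assumes each snapshot is the parsed JSON list returned by Nomad's job allocations endpoint, and that each entry carries an `ID` and a `TaskStates` map whose values have a `Restarts` counter; those field names are an assumption about the response shape, and the helper names `restart_counts` and `restarted_allocs` are hypothetical.

```python
from typing import Dict, List


def restart_counts(allocs: List[dict]) -> Dict[str, int]:
    """Map allocation ID to the total number of task restarts in one snapshot.

    Assumes each alloc dict has "ID" and an optional "TaskStates" map whose
    values carry a "Restarts" integer (assumed field names).
    """
    counts = {}
    for alloc in allocs:
        task_states = alloc.get("TaskStates") or {}
        counts[alloc["ID"]] = sum(
            state.get("Restarts", 0) for state in task_states.values()
        )
    return counts


def restarted_allocs(before: List[dict], after: List[dict]) -> List[str]:
    """Return sorted IDs of allocations whose restart count grew between snapshots."""
    old = restart_counts(before)
    new = restart_counts(after)
    return sorted(
        alloc_id
        for alloc_id, count in new.items()
        if alloc_id in old and count > old[alloc_id]
    )
```

Allocations that appear only in the second snapshot are ignored here; those would indicate a reschedule rather than an in-place restart and may be worth listing separately.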

Job file (if appropriate)

Tested with both service and system jobs

@angrycub
Contributor

angrycub commented Jul 1, 2019

You might be encountering #5654. The fix is included in Nomad 0.9.2; I would encourage you to test with the current version of Nomad and see if your issue is resolved.

Hope this helps!

@jozef-slezak
Author

Thank you for your reply. We are upgrading to 0.9.3 in our test environment at the moment.

@tantra35
Contributor

tantra35 commented Jul 2, 2019

@jozef-slezak Can you confirm that 0.9.3 fixes your issue?

@jozef-slezak
Author

jozef-slezak commented Jul 3, 2019

Hello, it definitely helped to upgrade to 0.9.3 (originally from 0.9.1) but we observed three more issues.

@cgbaker, could you please repeat the cluster restart test and wait until it breaks? The DNS entry will be missing. (I believe I also saw a missing Consul entry before; in that case we worked around it by stopping the alloc, which caused a reschedule and a correct registration in Consul/DNS.)

I was able to reproduce this behavior using https://github.com/hashicorp/nomad/files/3364021/11afd990a7b76a9909c2b0328f17381ad3d27bef.zip during the testing described in #5921 (comment). Nomad was able to fail over some of the jobs (Count=1) to different machines, but the Consul integration needs to be improved for the case where only 2 of 3 nodes are running: the Consul check is stale (it still points to the failed machine), and therefore there is no DNS entry. Are there any TTLs for Consul entries?
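
To check whether a stale check is what hides the DNS entry, it may help to look at the health data Consul holds for the service. The following is a rough sketch under stated assumptions, not something taken from this thread: it assumes the input is the parsed JSON list returned by Consul's service health endpoint, with `Node`, `Service`, and `Checks` keys per entry, and it assumes the default DNS behavior of omitting instances that have any check in the `critical` state. The function name `dns_visible_instances` is hypothetical.

```python
from typing import List, Tuple


def dns_visible_instances(health_entries: List[dict]) -> List[Tuple[str, str, int]]:
    """Return (node, address, port) for instances without a critical check.

    Assumes each entry has "Node", "Service", and "Checks" keys (assumed
    response shape), and that instances with any "critical" check are the
    ones left out of DNS answers.
    """
    visible = []
    for entry in health_entries:
        checks = entry.get("Checks") or []
        if any(check.get("Status") == "critical" for check in checks):
            continue  # an instance with a failing check is skipped
        node = entry["Node"]
        service = entry["Service"]
        # Fall back to the node address when the service has none of its own.
        address = service.get("Address") or node.get("Address", "")
        visible.append((node["Node"], address, service["Port"]))
    return visible
```

If this returns an empty list while the task is actually running on another machine, the registration is probably still tied to the failed node, which would match the missing DNS entry described above.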

@jozef-slezak changed the title from "Nomad server failover vs. task restart" to "Nomad server failover vs. task restart - Consul DNS" on Jul 8, 2019
@stale

stale bot commented Oct 6, 2019

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@stale

stale bot commented Nov 5, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@stale stale bot closed this as completed Nov 5, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 17, 2022