Script checks in Consul stuck at critical state "TTL expired" if node was restarted #6332

Closed
a-zagaevskiy opened this issue Sep 16, 2019 · 6 comments

@a-zagaevskiy
Contributor

Nomad version

Nomad v0.9.5

Operating system and Environment details

Ubuntu 18.04.1 LTS, running a cluster that consists of a few Docker containers with Nomad and Consul agents inside them.

Issue

Script checks that Nomad registers in Consul stay in a non-running state after the node has been restarted. This issue looks very similar to an older one: #1636

Reproduction steps

All you need to do is restart the node.

@a-zagaevskiy
Contributor Author

One of the possible fixes: #6333

@tgross
Member

tgross commented Sep 16, 2019

Hi @AlexanderZagaevskiy and thanks for this report!

The handling of script checks was refactored in the 0.10.0-beta and that should resolve this problem. The lifecycle of the script check is now tied directly to the task runner for its task, so when we restore the task we'll restore the script check as well.

It looks like the fix in #6333 was opened against the master branch. But that branch already has the 0.10.0-beta work I described above. If you don't want to wait till 0.10.0 for the fix, would you be willing to fork from the v0.9.5 tag instead?

If you do that, you'll see we already have the logic you've added a bit further down at client.go#L587-L594 but -- as you've seen -- that doesn't help us on client restart. We won't want to start the checks twice, and we won't want to start script checks before they're registered at client.go#L580, but I'd be happy to help you out with that in a new PR!

@a-zagaevskiy
Contributor Author

@tgross You are absolutely right. PR #6333 was created by mistake: I applied a patch written for v0.9.5 onto the current master branch. The patch does fix the issue for me when it is applied to Nomad v0.9.5.

Great to read that the described issue will most likely be fixed in the upcoming 0.10.0. So, is it worth opening a PR to fix it for 0.9.5?

@tgross
Member

tgross commented Sep 16, 2019

So, is it worth opening a PR to fix it for 0.9.5?

We are going to be cutting a 0.9.6 bugfix release as well, so it would be great to have your PR in for those folks who won't be ready to go directly to 0.10.0.

@tgross
Member

tgross commented Oct 29, 2019

Closing this as mentioned in #6351 (comment)

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 17, 2022