
Avoid de-registering slowly restored services #5837

Merged
merged 2 commits into master from b-consul-restore-sync-2 on Jul 17, 2019
Conversation

@notnoop (Contributor) commented Jun 14, 2019

When a Nomad client restarts or is upgraded, Nomad restores state for running
tasks and starts the sync loop. If the sync loop runs early, it may
deregister services from Consul prematurely, even when Consul has the
running service marked as healthy.

This is not ideal, as re-registering the service means potentially
waiting a whole service health check interval before the service is
declared healthy again.

We attempt to mitigate this by introducing an initialization probation
period. During this time, we only deregister services and checks that
were explicitly deregistered, and leave unrecognized ones alone. This
serves as a grace period for restoration to complete, or for operators to
recover should they realize they restarted the client with the wrong
Nomad data directory.
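
As a rough illustration of the probation idea, a minimal sketch in Go (the field and method names here are hypothetical, not the PR's actual identifiers):

```go
package consul

import "time"

// Hypothetical sketch of the probation period; illustrative only,
// not Nomad's actual implementation.
type ServiceClient struct {
	// probationExpiry marks the end of the initialization window
	// that begins when the Nomad client starts.
	probationExpiry time.Time

	// explicitlyDeregistered tracks service IDs Nomad itself asked
	// to remove; these are always safe to deregister.
	explicitlyDeregistered map[string]bool
}

// shouldRemove reports whether a service found in Consul but unknown
// to Nomad may be deregistered during a sync pass.
func (c *ServiceClient) shouldRemove(serviceID string) bool {
	// Explicit removals are honored at any time.
	if c.explicitlyDeregistered[serviceID] {
		return true
	}
	// During probation, leave unrecognized services alone: they may
	// belong to tasks whose state is still being restored.
	if time.Now().Before(c.probationExpiry) {
		return false
	}
	// After probation, unknown services are treated as stale.
	return true
}
```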

I explored changing the order in which the Consul client's syncing goroutine starts, but given that task Consul registrations happen in restore/run goroutines, making the Consul agent wait until all restored services have initialized correctly would be a drastic change to the APIs.

@notnoop (Contributor, Author) commented Jun 14, 2019

FWIW, I had an alternative implementation in #5838, but I like this one more as I believe it's a bit simpler and conveys the problem more clearly (that we are only concerned about startup, not the steady state of the system).

@schmichael (Member) left a comment

This seems like a fine fix for an ugly bug.

Alternate approach

I think I'd prefer the approach of delaying go ServiceClient.Run() until after Client.restoreState() has returned (since restoration is synchronous), but that would also require moving RegisterTask calls into TaskRunner.Restore() for running tasks. I think that approach is optimal, but the changes are not as nicely encapsulated as they are in this PR's approach.

It would also require the server to manually call go ServiceClient.Run() itself, which complicates the dev agent, where you need to share a single ServiceClient. A quick check in the server to see if the client is enabled before starting Run would be sufficient -- or Run could detect multiple calls and no-op on subsequent ones.
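
A minimal sketch of that "no-op on duplicate calls" guard using sync.Once (illustrative only, not code from this PR):

```go
package consul

import "sync"

// Sketch of the suggested guard; not the PR's actual code.
type ServiceClient struct {
	runOnce sync.Once
}

// Run executes the sync loop and is meant to be started with
// `go sc.Run()`. The sync.Once guard turns duplicate calls into
// no-ops, so in the dev agent, where client and server share one
// ServiceClient, both sides can safely call it.
func (c *ServiceClient) Run() {
	c.runOnce.Do(c.syncLoop)
}

func (c *ServiceClient) syncLoop() {
	// ... periodically reconcile registrations with the Consul agent ...
}
```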

I guess we'd probably want to add a TaskRunner RestoreHook to encapsulate this since all of the service registration code is currently in a Poststart hook... that makes this alternate approach even more complicated to implement unfortunately.

This would also delay registering the Nomad service itself until the agent was closer to a working state, but I can't think of any reason that would be a significant benefit.

That being said, I'd rather have this fixed well than keep worrying about the approach, and this is done and tested! So feel free to ship as is.

Two review threads on command/agent/consul/client.go (outdated, resolved)
@notnoop (Contributor, Author) commented Jul 16, 2019

I think I'd prefer the approach of delaying go ServiceClient.Run() until after Client.restoreState() has returned (since restoration is synchronous)

I considered this approach but found it too complicated, as you said. Though state restore is synchronous, TR.Run and consequently the consul service hooks run in goroutines, so it's not sufficient to simply reorder function calls. We would need to add some gating/WaitGroups to track when all task service hooks have finished executing after state restore, before we invoke service synchronization, requiring substantial changes to the client/TR/AR APIs as well as the agent code. A rough sketch of that gating follows.
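
For illustration only, the kind of gating being described might look like this (the types and method names are hypothetical, not Nomad's actual client/TR/AR APIs):

```go
package client

import "sync"

// Hypothetical sketch of the gating described above; the types and
// method names are illustrative, not Nomad's actual APIs.

type Allocation struct{} // restored task state

// restoreAndRegister stands in for TaskRunner restore/run, which
// re-register the task's services with Consul.
func (a *Allocation) restoreAndRegister() {}

type Client struct {
	restoreWG sync.WaitGroup
	startSync func() // starts the consul ServiceClient sync loop
}

// restoreState launches each task's restore in its own goroutine, as
// Nomad does, and tracks completion with the WaitGroup.
func (c *Client) restoreState(allocs []*Allocation) {
	for _, alloc := range allocs {
		c.restoreWG.Add(1)
		go func(a *Allocation) {
			defer c.restoreWG.Done()
			a.restoreAndRegister()
		}(alloc)
	}
}

// start gates the Consul sync loop on all restores finishing.
func (c *Client) start(allocs []*Allocation) {
	c.restoreState(allocs)
	go func() {
		c.restoreWG.Wait() // wait for every task service hook
		c.startSync()
	}()
}
```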

@notnoop notnoop merged commit 15caf5c into master Jul 17, 2019
@notnoop notnoop deleted the b-consul-restore-sync-2 branch July 17, 2019 04:02
notnoop pushed a commit that referenced this pull request Jul 18, 2019
github-actions bot commented Feb 7, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Feb 7, 2023