-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid de-registering slowly restored services #5837
Conversation
When a nomad client restarts/upgraded, nomad restores state from running task and starts the sync loop. If sync loop runs early, it may deregister services from Consul prematurely even when Consul has the running service as healthy. This is not ideal, as re-registering the service means potentially waiting a whole service health check interval before declaring the service healthy. We attempt to mitigate this by introducing an initialization probation period. During this time, we only deregister services and checks that were explicitly deregistered, and leave unrecognized ones alone. This serves as a grace period for restoring to complete, or for operators to restore should they recognize they restored with the wrong nomad data directory.
FWIW, I had alternative implementation in #5838 - but I like this one more as I believe it's a bit simpler and convey the problem clearly (that we are only concerned about start up but not steady state of system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a fine fix for an ugly bug.
Alternate approach
I think I'd prefer the approach of delaying go ServiceClient.Run()
until after Client.restoreState()
has returned (since restoration is synchronous), but that would also require moving RegisterTask calls into TaskRunner.Restore()
for running tasks. I think that approach is optimal, but the changes are not as nicely encapsulated as they are in this PR's approach.
It would also require the server to manually call go ServiceClient.Run()
itself which complicates the dev agent where you need to share a single ServiceClient. A quick check in the server to see if the client is enabled before starting Run would be sufficient -- or Run could detect multiple calls and noop on subsequent ones.
I guess we'd probably want to add a TaskRunner RestoreHook to encapsulate this since all of the service registration code is currently in a Poststart hook... that makes this alternate approach even more complicated to implement unfortunately.
This would also delay registering the Nomad service itself until the agent was closer to a working state, but I can't think of any reason that would be a significant benefit.
That being said I'd rather have this fixed well then worry about the approach, and this is done and tested! So feel free to ship as is.
I considered this approach but I found it too complicated as you said. Though state restore is synchronise, |
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
When a nomad client restarts/upgraded, nomad restores state from running
task and starts the sync loop. If sync loop runs early, it may
deregister services from Consul prematurely even when Consul has the
running service as healthy.
This is not ideal, as re-registering the service means potentially
waiting a whole service health check interval before declaring the
service healthy.
We attempt to mitigate this by introducing an initialization probation
period. During this time, we only deregister services and checks that
were explicitly deregistered, and leave unrecognized ones alone. This
serves as a grace period for restoring to complete, or for operators to
restore should they recognize they restored with the wrong nomad data
directory.
I explored changing the order of starting up the consul client syncing goroutine, but given that tasks consul registration happen in restore/run goroutines, it's a significant drastic change to APIs to make consul agent wait until all restored services initialized correctly.