Group shutdown_delay #6746

Merged — 4 commits, Dec 18, 2019
Conversation

@drewbailey (Contributor):

Adds a shutdown_delay option to the group stanza to allow time between deregistering a Consul service and the task group shutting down.
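For illustration, a minimal sketch of what setting the new field looks like on the internal structs, assuming the field lands as shown in the diff later in this thread (the group name and delay value here are hypothetical):

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/nomad/nomad/structs"
)

func main() {
	// Hypothetical example: a group that waits 10s between Consul
	// deregistration and task shutdown.
	delay := 10 * time.Second
	tg := &structs.TaskGroup{
		Name:          "web",
		ShutdownDelay: &delay, // pointer: nil means "unset"
	}
	fmt.Println(tg.Name, *tg.ShutdownDelay)
}
```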

@nickethier (Member) left a comment:

Great start!

Review threads (resolved):
- client/allocrunner/groupservice_hook.go
- client/allocrunner/interfaces/runner_lifecycle.go (2 threads)
@drewbailey force-pushed the f-shutdown-delay-tg branch 4 times, most recently from 598cf73 to 91d8faa on November 25, 2019 at 17:39
@drewbailey marked this pull request as ready for review on November 25, 2019 at 17:46
@drewbailey requested a review from schmichael on December 2, 2019 at 16:25
Review threads (resolved):
- nomad/structs/structs.go
- client/allocrunner/interfaces/runner_lifecycle.go
@schmichael (Member) left a comment:

How does deregistering a service out from underneath Consul Connect affect Envoy? I'm concerned Envoy could detect the deregistration and immediately stop proxying, which would defeat the purpose of shutdown_delay for Connect tasks.

We'd also still need to open an issue to cover the case where shutdown_delay is used without a service.

Alternatively we could propagate the group shutdown_delay to each task and make a new delay prekill hook for tasks that runs after the service hook (sketched below). I think that addresses all of the use cases in #6704 and allows users to set either a group delay that applies to all tasks or a per-task delay for fine-grained control (e.g. kill the log shipper last).

It also avoids conflating shutdown delay hooks with services which is the source of the original issue: we coupled the logic instead of making them distinct hooks.
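A minimal sketch of that alternative, assuming a hypothetical hook interface (the names TaskPreKillHook and shutdownDelayHook are illustrative, not Nomad's actual API):

```go
package taskrunner

import (
	"context"
	"time"
)

// TaskPreKillHook is a hypothetical interface for hooks that run
// before a task receives its kill signal.
type TaskPreKillHook interface {
	Name() string
	PreKill(ctx context.Context) error
}

// shutdownDelayHook only sleeps; it stays distinct from the service
// hook and would be ordered after it, so Consul deregistration has
// already happened when the delay starts.
type shutdownDelayHook struct {
	delay time.Duration
}

func (h *shutdownDelayHook) Name() string { return "shutdown_delay" }

func (h *shutdownDelayHook) PreKill(ctx context.Context) error {
	select {
	case <-time.After(h.delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```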

Don't forget docs for the new field.

@drewbailey (Contributor, Author):

The way this is currently implemented will delay shutdown regardless of whether there are registered group services, so it should cover the case linked in #6704.

I'll look into how it affects Envoy. It's implemented the same way individual task shutdown_delays deregister from Consul.

@jippi (Contributor) commented Dec 3, 2019:

@drewbailey @nickethier could we make group->shutdown_delay default to max(task->*->shutdown_delay)? :) Seems to be what you would want in 9.9/10 cases.
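A minimal sketch of the proposed default, under the assumption that the group value falls back to the longest task-level delay (defaultGroupShutdownDelay is a hypothetical helper, not Nomad code):

```go
package main

import (
	"fmt"
	"time"
)

// defaultGroupShutdownDelay returns the longest task-level delay,
// the fallback jippi proposes when the group value is unset.
func defaultGroupShutdownDelay(taskDelays []time.Duration) time.Duration {
	var longest time.Duration
	for _, d := range taskDelays {
		if d > longest {
			longest = d
		}
	}
	return longest
}

func main() {
	delays := []time.Duration{3 * time.Second, 10 * time.Second}
	fmt.Println(defaultGroupShutdownDelay(delays)) // 10s
}
```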

@djenriquez:

How does deregistering a service out from underneath Consul Connect affect Envoy? I'm concerned Envoy could detect the deregistration and immediately stop proxying, which would defeat the purpose of shutdown_delay for Connect tasks.

-- @schmichael

@schmichael I'm confused, I thought the purpose of the shutdown_delay was to stop all new requests from hitting the allocations represented by the Consul nodes immediately, and allow in-flight requests to complete. Wouldn't you want it to deregister and stop proxying immediately even for Connect tasks?

@schmichael (Member):

The way this is currently implemented will delay shutdown regardless of whether there are registered group services, so it should cover the case linked in #6704.

-- @drewbailey

Right, sorry I wasn't clear. I meant to say this implementation doesn't fix the bug where a task's shutdown_delay is ignored if the task lacks a service. We need to file a new issue/PR for that if this PR closes the original issue.

How does deregistering a service out from underneath Consul Connect affect Envoy? I'm concerned Envoy could detect the deregistration and immediately stop proxying, which would defeat the purpose of shutdown_delay for Connect tasks.

@schmichael I'm confused, I thought the purpose of the shutdown_delay was to stop all new requests from hitting the allocations represented by the Consul nodes immediately, and allow in-flight requests to complete. Wouldn't you want it to deregister and stop proxying immediately even for Connect tasks?

-- @djenriquez

Yes, but we need to ensure Envoy won't immediately stop proxying in-flight requests. I've never tested what effect deregistering a service has on requests in flight in Envoy.

@drewbailey @nickethier could we make group->shutdown_delay default to max(task->*->shutdown_delay)? :) Seems to be what you would want in 9.9/10 cases.

-- @jippi

As this is currently implemented, I think that would cause a double delay: first at the group level, then at each individual task's level. I also don't think we propagate parameters "up" stanzas, so this seems like potentially surprising behavior.

Which brings up a good point: as implemented, the group and task shutdown delays are completely distinct. When an allocation is stopping for whatever reason, first the group delay applies, then the task delays, then tasks are killed (see the sketch below).

That may be what users want, but we need to document it carefully, since settings with the same name at different levels usually just propagate down from job -> group -> task.
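A toy illustration of that sequencing, assuming a hypothetical 5s group delay and 3s task delay:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	groupDelay := 5 * time.Second // group-level shutdown_delay
	taskDelay := 3 * time.Second  // task-level shutdown_delay

	// The two delays are distinct and sequential, not merged: the
	// group delay elapses once, then each task waits out its own
	// delay before being killed.
	fmt.Println(groupDelay + taskDelay) // 8s, not max(5s, 3s)
}
```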

@drewbailey force-pushed the f-shutdown-delay-tg branch 2 times, most recently from b70e5ec to e055b40 on December 4, 2019 at 15:45
@djenriquez:

Any updates on this PR?

@drewbailey added this to the 0.10.3 milestone on Dec 13, 2019
@schmichael (Member) left a comment:

Looks great! Just a couple comments, but nothing I need to re-review. Remember the changelog entry (can be a follow-up PR).

Review threads: client/allocrunner/groupservice_hook.go (2 threads, one resolved)
nomad/structs/structs.go:

```diff
@@ -4736,6 +4740,8 @@ type TaskGroup struct {

 	// Volumes is a map of volumes that have been requested by the task group.
 	Volumes map[string]*VolumeRequest
+
+	ShutdownDelay *time.Duration
```
@schmichael (Member):

This doesn't need to be a pointer as there's no difference between nil and 0.

@drewbailey (Contributor, Author):

This needed to be a pointer so that a job diff knew whether it was set or not; otherwise I was getting:

```
=== RUN   TestJobDiff
--- FAIL: TestJobDiff (0.00s)
    diff_test.go:1196: case 19: got:
        Job "" (Edited):
        Group "bam" (Added):
        "Count" (Added): "" => "1"
        "ShutdownDelay" (Added): "" => "0"
```

Review threads (resolved): website/source/docs/job-specification/group.html.md (2 threads)
Commits:
- copy struct values
- ensure groupserviceHook implements RunnerPreKillhook
- run deregister first
- test that shutdown times are delayed
- move magic number into variable
- more explicit test case, remove select statement
- update docs, address pr comments
- ensure pointer is not nil
- use pointer for diff tests, set vs unset
@github-actions (bot):

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions locked as resolved and limited conversation to collaborators on Jan 23, 2023