RFC: server: add mechanism for graceful shutdown. #3307
Conversation
This is useful for cases when the hot-restart-version is incompatible and the old server needs to stop accepting new connections, drain established connections, and close all shared resources that are needed to start a new Envoy. Signed-off-by: Greg Greenway <[email protected]>
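For orientation, here is a rough sketch of what an admin-triggered graceful shutdown could tie together, assembled from the pieces discussed in this PR. The handler name, signature, and response text are illustrative and not necessarily the exact code in the diff; only the calls on server_ echo names that appear in the PR itself.

Http::Code AdminImpl::handlerGracefulShutdown(absl::string_view, Http::HeaderMap&,
                                              Buffer::Instance& response) {
  // Start the asynchronous drain of established connections (see the draining docs).
  server_.drainListeners();
  // Release listening sockets and other shared resources so a new Envoy can bind them.
  server_.listenerManager().closeListeners();
  // Arm a timer that finally exits this process once the parent-shutdown window passes.
  const std::chrono::seconds shutdown_time(server_.options().parentShutdownTime());
  graceful_shutdown_timer_ = server_.dispatcher().createTimer([this]() { server_.shutdown(); });
  graceful_shutdown_timer_->enableTimer(shutdown_time);
  response.add("starting graceful shutdown\n");
  return Http::Code::OK;
}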
I'd like to get some feedback on this before I spend any time polishing it. Should there be a separate admin call for each of the operations that are performed here, so they can be composed differently? There have been various requests for something along these lines. Does this help with those? If not, could it help via tweaks or small changes to this?
@ggreenway for my understanding, why would someone use this vs […]?
@mattklein123 The goal isn't just to drain the traffic. The goal is to be able to start the new envoy while the old one is draining, so all the shared resources need to be closed.
I see. Since […]?
@ggreenway I think this could be used to help with #2802 (and the general case of having gRPC connections sourced at Envoy terminate with a graceful GOAWAY).
@mattklein123 using […] There's some convenience in making the entire thing fire-and-forget, as that will simplify init scripts. But I could just as easily launch bash with […]. I'm tempted to make each of those operations a separate admin command. Some users may want to close the admin port before starting the new envoy, whereas some may not be starting a new envoy at all but still want a graceful shutdown, so they don't want to close admin.
FWIW, this is basically what we do. We have wrapper run scripts that do various things.
I think my main concern is confusion over which admin commands to use. It seems reasonable to provide the requested shutdown functionality, but it's not clear to me if they should be different commands, params to /healthcheck/fail, etc. Thoughts?
I don't think this should be a param to […]. But we could have an option to […]. Also, I just realized one more quirk: because this shuts down the admin port, killing on a timer is needed to stop the server at the end, because we can't reach […]. I think I might make shutting down the admin port optional, defaulting to false, because if you do that accidentally you've shot yourself in the foot pretty badly.
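A standalone sketch of how that opt-in could look (the parameter name close_admin is hypothetical, not something this PR defines): the admin listener is only closed when the caller explicitly asks for it, so the default behaviour never locks you out of admin.

#include <string>

// Hypothetical: only close the admin socket when the request carries
// close_admin=true; absent or false leaves the admin interface running.
static bool shouldCloseAdmin(const std::string& path_and_query) {
  return path_and_query.find("close_admin=true") != std::string::npos;
}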
I think we should also disallow adding new listeners via LDS after we've closed listeners, so the management server doesn't need to be aware of the shutdown status. Thoughts?
@ggreenway I think the thing I'm confused about is why you would want the admin server to be shutdown. I.e., why wouldn't you just hot restart in that case? I get why someone might want to do a listener close operation depending on how their deployment works re: health checking or additional load balancers, but the full restart while the other one keeps running is kind of confusing to me.
Yeah, I would probably shut down LDS at that point. I guess the question is if you want to accept changes to existing listeners while not adding new ones, though that sounds complicated.
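A minimal standalone sketch of the guard being discussed (types and names are illustrative, not Envoy's real listener manager API): once shutdown has started, adds and updates coming from LDS are rejected, so the management server never needs to learn that this Envoy is going away.

#include <string>
#include <vector>

class ListenerManagerSketch {
public:
  // Called when the graceful shutdown begins (stop/close listeners).
  void stopListeners() { stopped_ = true; }

  // LDS adds/updates are rejected once shutdown has begun.
  bool addOrUpdateListener(const std::string& name, bool added_via_api) {
    if (stopped_ && added_via_api) {
      return false;
    }
    names_.push_back(name);
    return true;
  }

private:
  bool stopped_{false};
  std::vector<std::string> names_;
};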
This feature helps deal with the case when the hot-restart-version has changed. I want to stop old-envoy and start new-envoy. To start new-envoy, old-envoy needs to release all shared resources.
I see. The way we deal with this at Lyft is just to operationalize around it. Basically we do a rolling restart in which we: […]
IMO letting the old process keep going without monitoring it is potentially not great from an ops perspective and I wouldn't personally recommend it. I'm also not sure if it's worth the engineering effort to build the full close (including admin) into Envoy itself. With that said, I don't feel that strongly about it. (Another thing to point out is that the hot restart version changes so infrequently that IMO it's not a big deal if you drop a few extra connections that you wouldn't otherwise drop.)
Some of the things I deal with involve very long-lived connections, so I want the "sleep N" to be a really long time. But I don't want to wait that long to start the new envoy. I agree that un-monitored envoy isn't a great idea, but given that this is very rare (due to infrequently-changing hot-restart-version) this is the best tradeoff I've found for my operational needs. My other idea on the admin port is to use a file-based UDS, and just move the old one to a different path before starting new-envoy.
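As a concrete illustration of that idea (the socket paths are made up): the old process's admin UDS can simply be renamed out of the way before the new process starts, so new-envoy binds the canonical path while the old server's admin interface stays reachable at the moved path.

#include <cstdio>

int main() {
  // Hypothetical paths: move the old admin UDS aside before starting new-envoy.
  return std::rename("/var/run/envoy/admin.sock", "/var/run/envoy/admin.sock.old");
}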
Yup fair enough. I think if we implement we just need to be super clear in the docs what to use when from an ops perspective, and hopefully have as little overlap as possible between the commands.
Having some problems understanding your logic here:
If during drainListeners() we are pulling backlog-queued connections off of the listener, and that listener continues to fill with new connections, how will we reach closeListeners()? This doesn't seem to solve the underlying issue in #2920 of leaving listeners open and accepting connections in the three cases which claim to "stop listening" (libevent, under the hood, makes this claim about evconnlistener_disable(), but it certainly continues listening and simply stops accepting connections; connections up to the backlog value have already been established and may have accepted some data).
@wrowe drainListeners() does not synchronously drain listeners, it starts the process documented at https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/draining.html. The goal of this PR wasn't to fix #2920; I'm figuring out if that can also be fixed though. The description at the top of the PR describes the problem I'm trying to address.
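For readers less familiar with the drain machinery, a simplified standalone model of what "starting the drain process" means (this is not Envoy's actual DrainManager, just an illustration of why the call returns immediately): draining flips state, and then over the configured drain window each connection is asked, at convenient points, whether it should now be closed, with the probability of "yes" ramping toward certainty.

#include <chrono>
#include <cstdlib>

class DrainSketch {
public:
  explicit DrainSketch(std::chrono::seconds drain_time) : drain_time_(drain_time) {}

  // Called by the admin handler / drainListeners(); nothing is torn down synchronously.
  void startDrainSequence() {
    start_ = std::chrono::steady_clock::now();
    draining_ = true;
  }

  // Consulted per connection (end of a request, idle, etc.). The chance of
  // answering "close now" grows linearly over the drain window.
  bool drainClose() const {
    if (!draining_) {
      return false;
    }
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    if (elapsed >= drain_time_) {
      return true;
    }
    const double p = std::chrono::duration<double>(elapsed).count() /
                     std::chrono::duration<double>(drain_time_).count();
    return (std::rand() / static_cast<double>(RAND_MAX)) < p;
  }

private:
  const std::chrono::seconds drain_time_;
  std::chrono::steady_clock::time_point start_{};
  bool draining_{false};
};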
Just a couple of paranoid comments; it's possible the design prevents the concerns I was fabricating :)
const std::chrono::seconds shutdown_time(server_.options().parentShutdownTime());
ENVOY_LOG(info, "Starting graceful shutdown. Server will exit in {} seconds.",
          shutdown_time.count());
graceful_shutdown_timer_ = server_.dispatcher().createTimer([this]() { server_.shutdown(); });
It looks like if AdminImpl is destructed before the timeout is reached, the call to server_.shutdown() will crash. Can we cancel the timer in the destructor if non-null, and null the timer here?
Or is this not necessary due to the way the server_.dispatcher() works?
The timer is held in a unique_ptr. If it is destructed, the timer is cancelled/unscheduled.
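A tiny standalone model of that lifetime argument (names are illustrative, not Envoy's Event::Timer): because the timer is owned by a unique_ptr member, destroying the owner destroys, and thereby cancels, the timer first, so the lambda's captured `this` can never be invoked afterwards.

#include <functional>
#include <memory>

// Illustrative types only.
struct TimerSketch {
  std::function<void()> callback;
  bool armed{false};
  ~TimerSketch() { armed = false; }  // destruction unschedules the callback
  void enable() { armed = true; }
  void fireIfArmed() { if (armed) { callback(); } }
};

struct AdminSketch {
  std::unique_ptr<TimerSketch> graceful_shutdown_timer_;

  void startShutdownTimer(std::function<void()> shutdown) {
    graceful_shutdown_timer_ = std::make_unique<TimerSketch>();
    graceful_shutdown_timer_->callback = std::move(shutdown);
    graceful_shutdown_timer_->enable();
  }
  // When AdminSketch is destroyed, graceful_shutdown_timer_ is destroyed first,
  // so its callback (which may capture `this`) can never run afterwards.
};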
@@ -622,6 +622,20 @@ void ListenerManagerImpl::stopListeners() {
  }
}

void ListenerManagerImpl::closeListeners() {
  for (auto& listener : active_listeners_) {
    listener->close();
Does this ultimately mutate active_listeners_? When does a previously active listener get removed from the collection?
Listeners are only removed when LDS removes them. This call does not mutate the vector (or list?), only the elements inside.
close() maybe should be renamed to closeSocket() to make it clearer.
Signed-off-by: Greg Greenway <[email protected]>
@ggreenway FYI in #3340 I moved LDS API into the listener manager, which might make it easier for you to shut down the LDS provider as part of your change.
This seems reasonable to me, code-wise; I learned more about the awesome layers of SW in Envoy's codebase.
However I don't have a good feel for the operational tradeoffs as discussed. Happy to review further if desired; I assume the next step is to get CI green?
Yeah, I got busy with other things. I'll try to get back to this sometime soon. Still need to write tests for this.
I've gotten busy on other things. Closing this until I have time to re-visit.
@ggreenway What is your status on this PR? It is getting higher priority on the Istio side; do you mind if I take it over?
@lizan Yes that's fine if you want to take it over.
This is useful for cases when the hot-restart-version is incompatible
and the old server needs to stop accepting new connections, drain
established connections, and close all shared resources that are
needed to start a new Envoy.
Signed-off-by: Greg Greenway [email protected]
Risk Level: Medium
Testing: TODO
Docs Changes: TODO
Release Notes: TODO
[Optional Fixes #Issue]