-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Potential race in hot restart protocol #550
Comments
Since we don’t support SIGHUP inside Envoy itself, but rely on raw RPC communication (to deal with containers), there is no way to synchronize the restarts. I think the best I can do is to fail the new process and exit with 1 (or something) if the old process is still initializing and not ready for restart. Would this work for you? What kind of indication do you want that Envoy is not ready to be restarted? Theoretically we might be able to block the new process until the old one is ready, but if you keep restarting every 250ms, this will end up not working either, so we would still need the immediate fail path, so I would rather just do that. |
Retries are better in this situation I think, so failing fast in the new/child Envoy while the old/parent Envoy is still initializing sounds good to me. I can run a retry loop on top with exponential delays to wait until the parent Envoy is ready to gracefully shutdown. This is much better than what happens now when both child and parents instances die with errors. The length of restarts depends should probably >= time to fully initialize. Since I'm using *DS, I guess this also depends how long it takes to establish connections to the discovery services? |
Right, it's possible for full initialization to take some time. (Possibly even several seconds depending on the configuration). I could make it so that we deal with all of the edge cases and enable restart in the middle, but I don't think it's worth it. Do you want a specific error code for the exit in this case? I think you probably need to differentiate between "not ready to restart" and some other error. |
In my case, I can treat it as a transient error, so a special error code is necessary. I would prefer error code for invalid config file though, since that's a permanent error. |
…en-libc (envoyproxy#550) Signed-off-by: Shikugawa <[email protected]>
Description: add metrics service extension Risk Level: low Signed-off-by: Jose Nino <[email protected]> Signed-off-by: JP Simard <[email protected]>
Description: add metrics service extension Risk Level: low Signed-off-by: Jose Nino <[email protected]> Signed-off-by: JP Simard <[email protected]>
At high frequency of restarts (250ms), Envoy restart protocol seems to hit a race condition and takes down both child and parent proxies. See https://gist.github.com/kyessenov/0e446c8c0e838c9bedfdf5bc70af5fa8. Here we have 3 epochs, epoch 1 and epoch 2 hit a communication failure and both crashed. Epoch 0 exited gracefully.
The text was updated successfully, but these errors were encountered: