
Potential race in hot restart protocol #550

Closed
kyessenov opened this issue Mar 9, 2017 · 4 comments · Fixed by #553
@kyessenov
Contributor

At a high restart frequency (every 250ms), the Envoy hot restart protocol seems to hit a race condition and take down both the child and parent proxies. See https://gist.github.com/kyessenov/0e446c8c0e838c9bedfdf5bc70af5fa8. Here we have 3 epochs: epoch 1 and epoch 2 hit a communication failure and both crashed, while epoch 0 exited gracefully.

@mattklein123 mattklein123 added this to the 1.3.0 milestone Mar 9, 2017
@mattklein123 mattklein123 self-assigned this Mar 9, 2017
@mattklein123
Member

Since we don’t support SIGHUP inside Envoy itself, but rely on raw RPC communication (to deal with containers), there is no way to synchronize the restarts. I think the best I can do is to fail the new process and exit with 1 (or something) if the old process is still initializing and not ready for restart.

Would this work for you? What kind of indication do you want that Envoy is not ready to be restarted?

Theoretically we might be able to block the new process until the old one is ready, but if you keep restarting every 250ms that will end up not working either. We would still need the immediate-fail path anyway, so I would rather just do that.

@kyessenov
Contributor Author

Retries are better in this situation I think, so failing fast in the new/child Envoy while the old/parent Envoy is still initializing sounds good to me. I can run a retry loop on top with exponential delays to wait until the parent Envoy is ready to gracefully shut down. This is much better than what happens now, when both the child and parent instances die with errors.

The interval between restarts should probably be >= the time to fully initialize. Since I'm using *DS, I guess this also depends on how long it takes to establish connections to the discovery services?
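The retry loop with exponential delays described above could look something like the sketch below. Everything here is hypothetical: the `EXIT_PARENT_NOT_READY` value and the wrapper itself are placeholders, since the thread had not yet settled on an actual exit code.

```python
import random
import subprocess
import time

# Hypothetical exit code meaning "parent Envoy is still initializing and
# not ready for hot restart" -- the actual value was undecided in this thread.
EXIT_PARENT_NOT_READY = 37

def restart_with_backoff(cmd, max_attempts=6, base_delay=0.25):
    """Launch the new (child) Envoy, retrying with exponential backoff
    plus jitter while the parent reports it is not yet ready."""
    delay = base_delay
    for _ in range(max_attempts):
        rc = subprocess.call(cmd)
        if rc != EXIT_PARENT_NOT_READY:
            return rc  # 0 on success, or a permanent error: stop retrying
        # Transient failure: parent still initializing, back off and retry.
        time.sleep(delay + random.uniform(0, delay / 2))
        delay *= 2
    return EXIT_PARENT_NOT_READY
```

The backoff naturally enforces the point above: after a couple of failed attempts, the delay exceeds the parent's initialization time, so the loop stops hammering a parent that is not ready yet.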

@mattklein123
Member

Right, it's possible for full initialization to take some time (possibly even several seconds, depending on the configuration). I could make it so that we deal with all of the edge cases and enable restart in the middle, but I don't think it's worth it. Do you want a specific error code for the exit in this case? I think you probably need to differentiate between "not ready to restart" and some other error.

@kyessenov
Contributor Author

In my case, I can treat "not ready to restart" as a transient error, so a special error code is necessary. I would prefer a distinct error code for an invalid config file though, since that's a permanent error.
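The transient/permanent split suggested above might look like this in the restarting supervisor. The specific code values are hypothetical placeholders; the thread only establishes that "not ready to restart" must be distinguishable from an invalid-config failure.

```python
# Hypothetical exit codes for illustration; the thread only agrees that
# "not ready to restart" (transient, retryable) must be distinguishable
# from "invalid config" (permanent, not retryable).
EXIT_OK = 0
EXIT_INVALID_CONFIG = 1   # permanent: fix the config, do not retry
EXIT_NOT_READY = 37       # transient: parent still initializing, retry later

def should_retry(exit_code: int) -> bool:
    """Only the transient 'parent not ready' failure warrants a retry."""
    return exit_code == EXIT_NOT_READY
```

With this split, a retry loop backs off only on the transient code and surfaces the permanent one immediately to the operator.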

lambdai pushed a commit to lambdai/envoy-dai that referenced this issue Jul 21, 2020
jpsim pushed a commit that referenced this issue Nov 28, 2022
Description: add metrics service extension
Risk Level: low

Signed-off-by: Jose Nino <[email protected]>
Signed-off-by: JP Simard <[email protected]>