
Potential race in hot restart protocol #550

Closed
kyessenov opened this issue Mar 9, 2017 · 4 comments · Fixed by #553
@kyessenov
Contributor

At a high restart frequency (every 250ms), the Envoy hot restart protocol seems to hit a race condition and take down both the child and parent proxies. See https://gist.github.com/kyessenov/0e446c8c0e838c9bedfdf5bc70af5fa8. Here we have 3 epochs: epoch 1 and epoch 2 hit a communication failure and both crashed, while epoch 0 exited gracefully.

@mattklein123 mattklein123 added this to the 1.3.0 milestone Mar 9, 2017
@mattklein123 mattklein123 self-assigned this Mar 9, 2017
@mattklein123
Member

Since we don’t support SIGHUP inside Envoy itself, but rely on raw RPC communication (to deal with containers), there is no way to synchronize the restarts. I think the best I can do is to fail the new process and exit with 1 (or something) if the old process is still initializing and not ready for restart.

Would this work for you? What kind of indication do you want that Envoy is not ready to be restarted?

Theoretically we might be able to block the new process until the old one is ready, but if you keep restarting every 250ms that will end up not working either. We would still need the immediate-fail path anyway, so I would rather just do that.

@kyessenov
Contributor Author

Retries are better in this situation I think, so failing fast in the new/child Envoy while the old/parent Envoy is still initializing sounds good to me. I can run a retry loop on top with exponential delays to wait until the parent Envoy is ready to gracefully shut down. This is much better than what happens now, when both the child and parent instances die with errors.

The interval between restarts should probably be >= the time to fully initialize. Since I'm using *DS, I guess this also depends on how long it takes to establish connections to the discovery services?
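The retry loop with exponential delays described above could look something like the sketch below. Everything here is hypothetical: the `EXIT_PARENT_NOT_READY` value and the wrapper itself are placeholders, since the thread had not yet settled on an actual exit code.

```python
import random
import subprocess
import time

# Hypothetical exit code meaning "parent Envoy is still initializing and
# not ready for hot restart" -- the actual value was undecided in this thread.
EXIT_PARENT_NOT_READY = 37

def restart_with_backoff(cmd, max_attempts=6, base_delay=0.25):
    """Launch the new (child) Envoy, retrying with exponential backoff
    plus jitter while the parent reports it is not yet ready."""
    delay = base_delay
    for _ in range(max_attempts):
        rc = subprocess.call(cmd)
        if rc != EXIT_PARENT_NOT_READY:
            return rc  # 0 on success, or a permanent error: stop retrying
        # Transient failure: parent still initializing, back off and retry.
        time.sleep(delay + random.uniform(0, delay / 2))
        delay *= 2
    return EXIT_PARENT_NOT_READY
```

The backoff naturally enforces the point above: after a couple of failed attempts, the delay exceeds the parent's initialization time, so the loop stops hammering a parent that is not ready yet.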

@mattklein123
Member

Right, it's possible for full initialization to take some time (possibly even several seconds, depending on the configuration). I could make it so that we deal with all of the edge cases and enable restart in the middle, but I don't think it's worth it. Do you want a specific error code for the exit in this case? I think you probably need to differentiate between "not ready to restart" and some other error.

@kyessenov
Contributor Author

In my case, I can treat "not ready to restart" as a transient error, so a special error code is necessary. I would prefer a distinct error code for an invalid config file though, since that's a permanent error.
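The transient/permanent split suggested above might look like this in the restarting supervisor. The specific code values are hypothetical placeholders; the thread only establishes that "not ready to restart" must be distinguishable from an invalid-config failure.

```python
# Hypothetical exit codes for illustration; the thread only agrees that
# "not ready to restart" (transient, retryable) must be distinguishable
# from "invalid config" (permanent, not retryable).
EXIT_OK = 0
EXIT_INVALID_CONFIG = 1   # permanent: fix the config, do not retry
EXIT_NOT_READY = 37       # transient: parent still initializing, retry later

def should_retry(exit_code: int) -> bool:
    """Only the transient 'parent not ready' failure warrants a retry."""
    return exit_code == EXIT_NOT_READY
```

With this split, a retry loop backs off only on the transient code and surfaces the permanent one immediately to the operator.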

lambdai pushed a commit to lambdai/envoy-dai that referenced this issue Jul 21, 2020
jpsim pushed a commit that referenced this issue Nov 28, 2022
Description: add metrics service extension
Risk Level: low

Signed-off-by: Jose Nino <[email protected]>
Signed-off-by: JP Simard <[email protected]>