Remote endpoint supervisor backoff delay overflow #527
Comments
That was not the intended design I was going for with #463, but at the time it was very difficult to implement a test for the scenarios outlined there. It seems there is a bug with it now anyway.
The particular scenario I had was a cluster where remotes could die permanently and/or be replaced with a machine on a different IP address. The important thing was that the remote did not retry endlessly, since endless retries previously brought down the endpoint supervisor if the remote never recovered. The problem was made worse because retrying the connection was very CPU intensive.
My concern is that changing it back at this point is going to bring back the problems I was seeing. But I can also see that there is a bug that is not consistent with the intended design behind #463.
My preference would be to fix the bug that causes the exponential retry to continue indefinitely, rather than dumping to the dead letter queue as outlined in #464. But I don't have the bandwidth to fix it right now.
I understand if you need to fix it in the meantime, but perhaps the longer-term solution is to refactor it so we can each plug in our desired strategy to suit our needs, and have some tests to ensure that the strategies work. I am happy to contribute to this again when I have the bandwidth.
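A minimal sketch of what such a pluggable strategy could look like, assuming a simple interface keyed on the failure count; the names below (IEndpointBackoffStrategy, CappedFixedBackoff) are hypothetical and not part of the existing Proto.Actor API:

```csharp
using System;

// Hypothetical shape for a pluggable endpoint backoff strategy.
// The interface and class names are illustrative, not the actual Proto.Actor API.
public interface IEndpointBackoffStrategy
{
    // Returns the delay before the next reconnect attempt,
    // or null to stop retrying (e.g. hand pending messages to dead letters).
    TimeSpan? NextDelay(int failureCount);
}

// Fixed delay with a retry limit, matching the "don't retry endlessly" scenario above.
public sealed class CappedFixedBackoff : IEndpointBackoffStrategy
{
    private readonly TimeSpan _delay;
    private readonly int _maxRetries;

    public CappedFixedBackoff(TimeSpan delay, int maxRetries)
    {
        _delay = delay;
        _maxRetries = maxRetries;
    }

    public TimeSpan? NextDelay(int failureCount) =>
        failureCount > _maxRetries ? (TimeSpan?)null : _delay;
}
```

The endpoint supervisor would then call NextDelay on whichever implementation is configured, so the fixed, exponential, and give-up behaviours discussed here could each be tested in isolation.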
The thing is that it won't be increasing indefinitely if it gets calculated from the failure count, as I did in the PR. The code there is taken from the exponential backoff strategy and it seems correct. The current code keeps increasing the back-off timeout even after recovering from a transient network failure. In clustered scenarios, where one node can be replaced by another, it works fine as long as the cluster provider properly tracks changes in the cluster. I discovered this particular issue while running a long test of a simple cluster using my new Kubernetes provider (going to contribs when I finish testing it).
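For reference, here is a minimal sketch of one way to derive the delay from the failure count instead of doubling a stored field; the formula, the cap, and the noise range are assumptions for illustration, not the exact code from the PR:

```csharp
using System;

public static class Backoff
{
    private static readonly Random Random = new Random();

    // Derives the delay from the current failure count rather than mutating
    // a _backoff field, so the delay naturally resets when the count resets
    // after a successful reconnect. The concrete values are illustrative.
    public static TimeSpan NextDelay(int failureCount)
    {
        const double baseMs = 100;                     // initial backoff window
        var exponent = Math.Min(failureCount, 10);     // cap growth well below int.MaxValue milliseconds
        var delayMs = baseMs * Math.Pow(2, exponent);
        var noiseMs = Random.Next(0, 500);             // jitter ("keeping the noise")
        return TimeSpan.FromMilliseconds(delayMs + noiseMs);
    }
}
```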
Yeah, I agree that the best way would be to plug in a strategy. Currently, the strategy is baked in; the failure handling code is too different from the normal strategy. My aim is to get this particular issue fixed, since right now it just doesn't work if the networking isn't very stable (Kubernetes experience).
Thanks for the clarification. That sounds like a good fix.
Ok, I merged it and now I am closing this issue. Let's see how it will work.
The _backoff field gets its initial value in the constructor. Then it gets doubled when handling failures, endlessly. At some point it gets over int32.MaxValue, which causes the delay to crash the supervisor; that effectively takes down the endpoint, and the cluster node goes down.
Overall, I don't like the idea of doubling the backoff. If we want exponential retries, we need to make that explicit, using the backoff strategy. For now, I'd rather use a fixed backoff value, maybe keeping the noise. The issue is that the endpoint supervisor implements the strategy itself, and it's fixed.
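The standalone snippet below illustrates the failure mode described above; it is not the actual endpoint supervisor code, and the starting value of 100 ms is an assumption. Doubling a millisecond backoff on every failure silently wraps past int.MaxValue after a few dozen failures, and a negative delay is rejected by the wait call.

```csharp
using System;

class BackoffOverflowDemo
{
    static void Main()
    {
        var backoffMs = 100;
        for (var failure = 1; failure <= 26; failure++)
        {
            backoffMs *= 2; // unchecked int overflow: goes negative around the 25th doubling
            Console.WriteLine($"failure {failure}: backoff = {backoffMs} ms");
        }
        // Passing such a negative value to Task.Delay throws ArgumentOutOfRangeException,
        // which is the kind of crash described in this issue.
    }
}
```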