Remote endpoint supervisor backoff delay overflow #527
Comments
That was not the intended design I was going for with #463, but at the time it was very difficult to implement a test for the scenarios outlined there. It seems there is a bug with it now anyway.
The particular scenario I had was a cluster where remotes could die permanently and/or be replaced with a machine on a different IP address. The important thing was that the remote did not retry endlessly, since endless retries previously brought down the endpoint supervisor if the remote never recovered. The problem was made worse because retrying the connection was very CPU intensive.
My concern is that changing it back at this point is going to bring back the problems I was seeing. But I can also see that there is a bug that is not consistent with the intended design behind #463.
My preference would be to fix the bug that causes the exponential retry to continue indefinitely, rather than dumping to the dead letter queue as outlined in #464. But I don't have the bandwidth to fix it right now.
I understand if you need to fix it in the meantime, but perhaps the longer-term solution is to refactor it so we can each plug in our desired strategy to suit our needs, and have some tests to ensure that the strategies work. I am happy to contribute to this again when I have the bandwidth.
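A minimal sketch of what such a pluggable strategy could look like, assuming a simple interface keyed on the failure count; the names below (IEndpointBackoffStrategy, CappedFixedBackoff) are hypothetical and not part of the existing Proto.Actor API:

```csharp
using System;

// Hypothetical shape for a pluggable endpoint backoff strategy.
// The interface and class names are illustrative, not the actual Proto.Actor API.
public interface IEndpointBackoffStrategy
{
    // Returns the delay before the next reconnect attempt,
    // or null to stop retrying (e.g. hand pending messages to dead letters).
    TimeSpan? NextDelay(int failureCount);
}

// Fixed delay with a retry limit, matching the "don't retry endlessly" scenario above.
public sealed class CappedFixedBackoff : IEndpointBackoffStrategy
{
    private readonly TimeSpan _delay;
    private readonly int _maxRetries;

    public CappedFixedBackoff(TimeSpan delay, int maxRetries)
    {
        _delay = delay;
        _maxRetries = maxRetries;
    }

    public TimeSpan? NextDelay(int failureCount) =>
        failureCount > _maxRetries ? (TimeSpan?)null : _delay;
}
```

The endpoint supervisor would then call NextDelay on whichever implementation is configured, so the fixed, exponential, and give-up behaviours discussed here could each be tested in isolation.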
The thing is that it won't be increasing indefinitely if it gets calculated from the failure count, as I did in the PR. The code there is taken from the exponential backoff strategy and it seems correct. The current code keeps increasing the back-off timeout even after recovering from a transient network failure. In clustered scenarios, where one node can be replaced by another, it works fine as long as the cluster provider properly tracks changes in the cluster. I discovered this particular issue while running a long test of a simple cluster using my new Kubernetes provider (going to contribs when I finish testing it).
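For reference, here is a minimal sketch of one way to derive the delay from the failure count instead of doubling a stored field; the formula, the cap, and the noise range are assumptions for illustration, not the exact code from the PR:

```csharp
using System;

public static class Backoff
{
    private static readonly Random Random = new Random();

    // Derives the delay from the current failure count rather than mutating
    // a _backoff field, so the delay naturally resets when the count resets
    // after a successful reconnect. The concrete values are illustrative.
    public static TimeSpan NextDelay(int failureCount)
    {
        const double baseMs = 100;                     // initial backoff window
        var exponent = Math.Min(failureCount, 10);     // cap growth well below int.MaxValue milliseconds
        var delayMs = baseMs * Math.Pow(2, exponent);
        var noiseMs = Random.Next(0, 500);             // jitter ("keeping the noise")
        return TimeSpan.FromMilliseconds(delayMs + noiseMs);
    }
}
```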
Yeah, I agree that the best way would be to plug in a strategy. Currently, the strategy is baked in; the failure handling code is too different from the normal strategy. My aim is to get this particular issue fixed, since right now it just doesn't work if the networking isn't very stable (Kubernetes experience).
Thanks for the clarification. That sounds like a good fix.
Ok, I merged it and now I am closing this issue. Let's see how it will work.
The _backoff field gets its initial value in the constructor. Then it gets doubled when handling failures, endlessly. At some point it gets over int32.MaxValue, which causes the delay to crash the supervisor; that effectively takes down the endpoint, and the cluster node goes down.
Overall, I don't like the idea of doubling the backoff. If we want exponential retries, we need to make that explicit, using the backoff strategy. For now, I'd rather use a fixed backoff value, maybe keeping the noise. The issue is that the endpoint supervisor implements the strategy itself, and it's fixed.
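The standalone snippet below illustrates the failure mode described above; it is not the actual endpoint supervisor code, and the starting value of 100 ms is an assumption. Doubling a millisecond backoff on every failure silently wraps past int.MaxValue after a few dozen failures, and a negative delay is rejected by the wait call.

```csharp
using System;

class BackoffOverflowDemo
{
    static void Main()
    {
        var backoffMs = 100;
        for (var failure = 1; failure <= 26; failure++)
        {
            backoffMs *= 2; // unchecked int overflow: goes negative around the 25th doubling
            Console.WriteLine($"failure {failure}: backoff = {backoffMs} ms");
        }
        // Passing such a negative value to Task.Delay throws ArgumentOutOfRangeException,
        // which is the kind of crash described in this issue.
    }
}
```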