
Remote endpoint supervisor backoff delay overflow #527

Closed
alexeyzimarev opened this issue Jul 5, 2020 · 5 comments

Comments

@alexeyzimarev (Member) commented Jul 5, 2020

The _backoff field gets its initial value in the constructor and is then doubled every time a failure is handled, with no cap. At some point it exceeds int32.MaxValue, the delay crashes the supervisor, which effectively takes down the endpoint, and the cluster node goes down.

Overall, I don't like the idea of doubling the backoff. If we want exponential retries, we should make that explicit by using the backoff strategy. For now, I would rather use a fixed backoff value, perhaps keeping the noise. The problem is that the endpoint supervisor implements the strategy itself, and that implementation is hard-coded.
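
To make the failure mode concrete, here is a minimal sketch of the unbounded doubling (the identifiers and the initial value are illustrative, not the actual Proto.Remote code):

```csharp
using System;

class BackoffOverflowDemo
{
    static void Main()
    {
        // Hypothetical initial value; the real field is set in the constructor.
        long backoffMs = 100;

        for (var failureCount = 1; ; failureCount++)
        {
            backoffMs *= 2; // doubled on every failure, never reset or capped

            if (backoffMs > int.MaxValue)
            {
                Console.WriteLine(
                    $"After {failureCount} failures the delay ({backoffMs} ms) " +
                    "no longer fits in an Int32; scheduling a delay that large " +
                    "is what crashes the supervisor.");
                break;
            }
        }
    }
}
```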

@sudsy (Contributor) commented Jul 5, 2020 via email

@alexeyzimarev (Member, Author)

The thing is that it won't increase indefinitely if it is calculated from the failure count, as I did in the PR. The code there is taken from the exponential backoff strategy and seems correct. The current code keeps increasing the back-off timeout even after recovering from a transient network failure.
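
Roughly, deriving the delay from the failure count means something like the following (a sketch only; the names, the cap, and the noise range are illustrative, not the exact code from the PR or from the exponential backoff strategy):

```csharp
using System;

class EndpointBackoffSketch
{
    private readonly Random _random = new Random();
    private const int BaseMs = 100;   // hypothetical initial backoff
    private const int MaxMs = 5_000;  // hypothetical upper bound

    // The delay is computed from the current failure count on every call and
    // bounded by MaxMs, so it cannot overflow, and it drops back down once the
    // failure count is reset after the endpoint recovers.
    public TimeSpan NextDelay(int failureCount)
    {
        var exponential = Math.Min(MaxMs, BaseMs * Math.Pow(2, failureCount));
        var noise = _random.Next(0, 100); // keep the noise so restarts don't synchronize
        return TimeSpan.FromMilliseconds(exponential + noise);
    }
}
```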

In clustered scenarios, where one node can be replaced by another, this works fine as long as the cluster provider properly tracks changes in the cluster. I discovered this particular issue while running a long test of a simple cluster with my new Kubernetes provider (it will go to contribs once I finish testing it).

@alexeyzimarev (Member, Author)

Yeah, I agree that the best way would be to plug in a strategy. Currently, the strategy is baked in, and the failure-handling code differs too much from the normal strategy. My aim is to get this particular issue fixed, since right now it just doesn't work if the networking isn't very stable (as I've seen on Kubernetes).

@sudsy (Contributor) commented Jul 5, 2020 via email

@alexeyzimarev (Member, Author)

OK, I merged it and I am now closing this issue. Let's see how it works.
