Remove hard coded connect handshake timeouts #4176
Conversation
Thanks @fjetter, this looks better than my attempt (apologies for never finishing that up). A few comments.
Tests fail because I changed the way exceptions are reraised. Will need to change this logic again.
@pentschev and I have been looking at these issues as well. @pentschev, are you interested in testing out this PR for our UCX case? If not, I can check it out.
@quasiben I tested this now and unfortunately it doesn't resolve our issues, but this particular piece of code seems to be causing trouble in different situations, which is a bit worrying. I'm still not certain where the problem is on our side, but it seems like it's not really related to a timeout for us; we mostly see connections getting closed during read/write even if we increase the timeout.
Thank you for testing @pentschev
After implementing the suggestion of @jcrist to not include the handshakes in the retries, I added a test checking for "slow handshakes" and stumbled over the listener timeouts as well. I completely removed the timeouts for the handshake in the listener since I figured it is fine to enforce timeouts on the connector side, but I'm not entirely sure about this. If we require a timeout there as well, we'll need to configure it somehow, since 1s is not enough.
With the most recent changes I've also seen
I still had an issue in the code where I encountered negative backoffs. That might've been the cause of the failing builds. The retry logic itself didn't change otherwise. I think the important change in the last commits was to remove the timeout from the listeners, but I'm not sure if it is safe to not have any there. Other than this, I'm also wondering what to do with the
What you have here is fine. I added a
active_exception = exc
# FullJitter see https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
upper_cap = min(time_left(), backoff_base * (2 ** attempt))
We should also bump the intermediate_cap by some fraction, in case the initial size is too small. As it is right now, no connect attempt can last longer than timeout/5. Perhaps 1.5x it every attempt?
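A quick sketch of what that would look like (the 10s timeout and the loop are illustrative, not the PR's actual retry code):
```python
# Illustrative only: a per-attempt cap that starts at timeout / 5 and is
# multiplied by 1.5 on every retry, as suggested above.
timeout = 10.0                 # assumed total connect timeout in seconds
intermediate_cap = timeout / 5

for attempt in range(5):
    print(f"attempt {attempt}: per-attempt cap ~ {intermediate_cap:.2f}s")
    intermediate_cap *= 1.5
# attempt 0: ~2.00s, 1: ~3.00s, 2: ~4.50s, 3: ~6.75s, 4: ~10.12s
```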
I don't have a good feeling about this intermediate cap, but increasing it by a factor each attempt should be a safe default. I'll add the x1.5.
Maybe I'll add a smaller factor? Otherwise, we'd effectively limit ourselves to three tries:
In [1]: sum([0.2 * 1.5 ** attempt for attempt in range(3)])
Out[1]: 0.95
As I said, I don't have a good feeling about how important these intermediate caps really are.
I agree on this, running into DNS race conditions feels like the sign of larger problems elsewhere, but for now we should at least match the existing behavior. My goal with increasing the value here is that, depending on the value of distributed.comm.timeouts.connect, no intermediate_cap may be large enough to complete if it's set at 1/5 the timeout. I'm not too worried about limiting to 3 attempts (note this is only true if the timeout is hit each time, not some other error), as more attempts than that is likely a sign of deeper issues. 1.5 or 1.25 both seem fine, I wouldn't want to go lower than that.
# FullJitter see https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
upper_cap = min(time_left(), backoff_base * (2 ** attempt))
backoff = random.uniform(0, upper_cap)
The backoff should get progressively larger each attempt; as it is, this is adding a random sleep with progressively larger ranges (but 0 is still valid). I liked the old algorithm better: roughly 1.5x the previous backoff with some jitter. Why did you make this change? (Edit: followed the link to the AWS post - the algorithm(s) there look fine, but there are some bugs in this implementation of them.)
The AWS blog post does not include timeouts, which is why I slightly changed the logic, but I wouldn't classify it as a bug.
I modeled this on the FullJitter approach, which is defined by
sleep = random_between(0, min(cap, base * 2 ** attempt))
where I chose the remaining time left as the cap. The reason I chose this as the cap is that I don't want the coroutine to block unnecessarily long if there is no chance for it to complete anyhow.
In a previous iteration I was doing it the other way round, namely
sleep = min(cap, random_between(0, base * 2 ** attempt))
which is slightly different, but what both have in common is that they do not produce progressively larger backoffs: only the average expected backoff increases with each attempt, while 0 is theoretically still possible on every attempt. Only the EqualJitter algorithm guarantees that zero is never chosen, but it performs slightly worse than the others.
Would you prefer 1. over 2. or do I have an error in my logic?
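To make the two variants concrete, a small sketch with made-up numbers (random_between spelled out via random.uniform):
```python
import random

base = 0.01   # illustrative backoff base
cap = 0.1     # illustrative remaining time budget
attempt = 6   # base * 2 ** attempt = 0.64, i.e. well above the cap

# 1. FullJitter with the remaining time as cap: the cap bounds the range we
#    draw from, so the sleep never exceeds the time left.
sleep_1 = random.uniform(0, min(cap, base * 2 ** attempt))

# 2. Draw first, clamp afterwards: for late attempts this mostly collapses to
#    sleeping exactly `cap` rather than a random value below it.
sleep_2 = min(cap, random.uniform(0, base * 2 ** attempt))

print(sleep_1, sleep_2)
```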
> The AWS blog post does not include timeouts which is why I slightly changed the logic
But the calculation here is still for a backoff in attempts, not timeouts (so it seems the blogpost should apply as written?).
I missed the FullJitter option, which does look like what you've implemented here. For connect failures though, I think we do want to ensure some amount of backoff is used in case the server is still starting up or unavailable for other reasons. Both the "Equal Jitter" and "Decorrelated Jitter" options should (IIUC) provide a guarantee of non-zero backoff times (Decorrelated looking slightly better), but we could also use what you have here with a non-zero min (perhaps 0.05 or something small). What you have here seems fine too though on further thought, thanks for the explanation.
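A rough sketch of the non-zero floor idea (the function name is made up and 0.05 is just the value floated above):
```python
import random

def backoff_with_floor(time_left, backoff_base, attempt, floor=0.05):
    """FullJitter bounded by the remaining time, but never less than `floor`.

    The floor itself is clamped to `time_left` so the sleep still cannot
    exceed the remaining budget (see the timeout discussion below).
    """
    upper_cap = min(time_left, backoff_base * (2 ** attempt))
    return min(time_left, max(floor, random.uniform(0, upper_cap)))
```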
> (so it seems the blogpost should apply as written?)
There is one scenario where the new backoff would breach the timeout
(numbers only for clarity)
- Initial timeout maybe 5s
- We're in attempt 10 with about 100ms remaining
- New backoff would calculate to 200ms
- We'd wait and try to connect again but the 11th attempt will then have a timeout of zero, i.e. it is guaranteed to fail
-> We'd wait for 5.1s, i.e. longer than the configured timeout
I introduced the cap since it would at least give a marginal chance of another try with a 100ms timeout, and it would not breach the total amount of time waited. The analysis in the blog post assumes that we'll retry indefinitely (or up to a fixed number of max attempts) until we're successful, but we're in a slightly different scenario here.
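The same scenario in numbers (all values are the illustrative ones from the list above):
```python
timeout = 5.0        # configured connect timeout
time_left = 0.1      # ~100ms remaining when attempt 10 starts its backoff
blog_backoff = 0.2   # what the uncapped formula could yield here

# Uncapped: we sleep past the deadline, the 11th attempt gets a zero budget,
# and the total wait breaches the configured timeout.
total_uncapped = (timeout - time_left) + blog_backoff               # 5.1 > 5.0

# Capped: the backoff is drawn from [0, min(time_left, blog_backoff)], so at
# worst we stop exactly at the deadline and usually keep a little budget for
# one more (marginal) attempt.
worst_case_capped = (timeout - time_left) + min(time_left, blog_backoff)  # 5.0
print(total_uncapped, worst_case_capped)
```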
"Decorrelated Jitter" options should (IIUC) provide a guarantee of non-zero backoff times (Decorrelated looking slightly better)
Correct, this was a false statement of mine. The decorrelated jitter is always bounded below by the base in this example.
Finally, I'm wondering if the default connect timeout should be increased. We currently have 10s as the default connect timeout, but I guess this was set at a time when we had a simpler retry mechanism (without intermediate capping) and no handshakes. Considering that multiple people encountered the
I'd be fine increasing the default to something higher (perhaps 30s?), but don't think that necessarily needs to be done here. Unless others think otherwise, I think we should get this fix in but leave the timeout the same.
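For anyone who needs a larger value without waiting for a default change, the timeout under discussion is the distributed.comm.timeouts.connect configuration key and can be overridden through dask's configuration, for example (30s is just the value floated above):
```python
import dask

# Override the 10s default connect timeout for this process only; the same
# key can also be set in a dask config YAML file or via the corresponding
# DASK_* environment variable.
dask.config.set({"distributed.comm.timeouts.connect": "30s"})
```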
This generally looks good to me, but there's a test failure at
Yes, I'm looking into the tests; I currently suspect the intermediate_cap is too small and am trying to increase it. I'm also considering removing the cap entirely after the initial failure, but am waiting for the build. However, I don't think the failure is actually caused by this change, only amplified by it. I noticed an awful lot of error logs for errors which are apparently retried and which may be responsible for an overall flakiness of the system at the moment (#4199).
How's this looking @fjetter? Does #4200 / #4199 need to be resolved before this can be merged? For reference, we're planning to backport this fix and issue a release, ideally today, but we can push if it isn't ready. Does this build in any way on #4200, so that it would need to be backported too, or are they likely independent?
Thanks. #4204 is testing out this diff against 2.30.x. If CI passes there, then we can be confident that the test failures here are unrelated. We plan to merge this to master, cherry-pick & backport it to 2.30.x, and then release 2.30.1.
Hello - I see that 2.30.1 is in the changelog (https://distributed.dask.org/en/latest/changelog.html) but not available on PyPI yet. We are blocked on the connect timeout fix and were wondering when it would be available on PyPI. Thanks!
The 2.30.1 release isn't out yet, see dask/community#105 for more info.
Tests have passed over in #4204, so I'm going to merge this PR and include it in the 2.30.1 release. Thanks @fjetter (and @jcrist @quasiben @pentschev for reviewing)!
(cherry picked from commit 4205280)
This code section, in particular the handshake (#4019), is causing issues for me in "high load" scenarios. I see broken connections popping up once I reach roughly 200 workers, and these errors tear down the entire system. From the exceptions I cannot infer whether the handshake runs into a timeout or whether it is completely broken (mostly because our internal user reports are not as thorough as I'd like them to be, but I'm investigating).
Whenever the handshake fails, it is raised as a CommClosedError which is, unfortunately, not an EnvironmentError. Therefore, what I wanted to change is the exception type upon which we retry, to be more inclusive.
Then I had a look at the code, was really confused about the retry behaviour and the individual timeouts, and started to (subjectively) simplify this piece of code and write a test for it. The test is a bit messy but works; I'd appreciate suggestions on how it can be implemented more cleanly.
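For illustration, the kind of retry behaviour meant here looks roughly like this (the helper name, attempt count and structure are made up, not the actual code in this diff):
```python
from distributed.comm.core import CommClosedError

async def connect_with_retry(connect_once, attempts=5):
    """Illustrative retry loop: retry on more than just EnvironmentError."""
    last_exc = None
    for _ in range(attempts):
        try:
            return await connect_once()
        except (EnvironmentError, CommClosedError) as exc:
            # Previously only EnvironmentError was retried; handshake failures
            # surface as CommClosedError and would escape the retry logic.
            last_exc = exc
    raise last_exc
```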
W.r.t the implementation, I am open for suggestions and can revert anything/everything/nothing depending on the feedback here.
Here is a quick dump of my thoughts on this:
- Retries should not be limited to EnvironmentErrors but should cover more inclusive error classes (at least the CommClosed we raise ourselves).

In case anybody wonders, with the chosen base of 0.01 this results in the (non-randomised) backoffs sketched below. With the jitter, this probably allows for a significantly larger maximum number of retries.
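A quick reconstruction of that non-randomised schedule, assuming the plain base * 2 ** attempt growth from the diff:
```python
base = 0.01
print([base * 2 ** attempt for attempt in range(8)])
# roughly: 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28 (seconds),
# before jitter and before capping by the remaining time
```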
There is currently an alternative fix for this section open, see #4167 cc @jcrist