Flaky tests: OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s
#6731
All of these tests are commonly failing with
This should have been obvious, but I just realized it now: every one of these tests involves running the scheduler in a subprocess with some form of

Also, looking at the stderr output of most tests, they all show
Current theory is that this is actually really simple:
We can reproduce a similar-looking error by trying to launch a Client when there's no scheduler available at all:

    def test_wat(loop, requires_default_ports):
        with Client(f"127.0.0.1:{Scheduler.default_port}", loop=loop) as c:
            pass
We can see from the test duration that it actually does wait the full 5s; this isn't one of those cases where the error message claims a timeout happened, but it was actually a CancelledError or something that got converted into a TimeoutError.

FYI, the reason for the 5s timeout, rather than default 30s connect timeout, is that the

So in theory, just using a longer timeout would probably make these tests pass most of the time. But I'm going to try to figure out why scheduler startup is so slow first. Ideally they would not all take 10s each, even on CI runners.
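To illustrate why a client pointed at a port with no listener spends the whole timeout before failing, here is a minimal standard-library sketch of a retry-until-deadline connect loop. It is not distributed's actual connect code; `free_port` and `connect_with_deadline` are hypothetical helpers, and the retry delay is an arbitrary assumption.

```python
import socket
import time


def free_port() -> int:
    """Ask the OS for a local port that currently has no listener (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def connect_with_deadline(
    host: str, port: int, timeout: float, retry_delay: float = 0.05
) -> socket.socket:
    """Retry a TCP connect until it succeeds or `timeout` seconds have elapsed.

    Illustrative only: a refused connection fails immediately, so the loop
    keeps retrying and the caller only sees an error once the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise OSError(
                f"Timed out trying to connect to tcp://{host}:{port} after {timeout} s"
            )
        try:
            return socket.create_connection((host, port), timeout=remaining)
        except OSError:
            # Usually ConnectionRefusedError: nothing is listening yet.
            # Sleep briefly (never past the deadline) and try again.
            time.sleep(min(retry_delay, max(deadline - time.monotonic(), 0.0)))
```

Under these assumptions, a scheduler that starts listening 4.9s into a 5s budget still connects, while one that needs 5.1s produces the timeout error, which is why startup time close to the limit shows up as flakiness rather than as a consistent failure.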
In gjoseph92#4, I ran just the

Comparing profiles between flaky tests that failed and ones that didn't on a particular run, there wasn't an obvious difference. I expected something major to stand out, but they looked pretty similar.
In the second one, you can see the

So it seems like slow imports are actually the issue, and they're taking so long that normal variability is putting us right up against our 5s timeout.

Slow imports

Overall:
So with 4s of imports before useful work can happen, a 5s connection timeout indeed seems likely to fail sometimes 😁 Some imports that stood out:
None of these seem particularly avoidable (numpy will always be there #5729, urllib3/jinja/toolz/yaml/etc probably have to happen). A couple small things:
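For anyone wanting to check import cost on their own machine, one option is CPython's `-X importtime` flag (available since Python 3.7). The sketch below is not how the profiles in this thread were collected; `import_times` is a hypothetical helper that runs the import in a fresh subprocess and parses the per-module report from stderr.

```python
import subprocess
import sys


def import_times(module: str) -> list[tuple[str, int]]:
    """Return (module name, cumulative microseconds) pairs, slowest first.

    Runs `python -X importtime -c "import <module>"` in a fresh interpreter
    so that modules already cached in this process do not hide the cost.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    rows = []
    for line in proc.stderr.splitlines():
        if not line.startswith("import time:"):
            continue
        # Row format: "import time: <self us> | <cumulative us> | <module name>"
        parts = line[len("import time:"):].split("|")
        if len(parts) != 3:
            continue
        try:
            cumulative_us = int(parts[1])
        except ValueError:
            continue  # the header row has text instead of numbers
        rows.append((parts[2].strip(), cumulative_us))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling `import_times("distributed")[:10]` in an environment where distributed is installed would list the ten most expensive imports by cumulative time; the numbers vary between machines and runs, so they are best compared against the connect timeout rather than read as absolutes.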
A larger change that @graingert proposed was to wait to set up the dashboard (aka import bokeh and pandas) until after we'd started listening for client connections. That way, we can be doing both at once (ish). The issue is that a number of the tests are assuming that once a client is connected, the dashboard must be up. We could refactor this, but I don't really want to deal with it. I don't imagine that it would make much difference for users in real life anyway.

What should we do

After all that, I think we should just make

The 5s connect timeout arguably makes sense for async or

I did some archaeology and

The main question to me is, should we:
Waaait I think this flakiness was introduced in #6231 @graingert. Prior to that commit, the That change added the Prior to that, fixtures that set the 5s timeout (by calling
Another way to say that is: should this distributed/distributed/utils_test.py Lines 1882 to 1909 in 4f6960a
This is basically option 3 in dask#6731 (comment). I can't think of a justification why this timeout should be set globally. All the other things in there are necessary to make things run more reasonably in tests. The timeout is the opposite; there's nothing about CI that should make us think connections will be faster.
Catch-all issue for tests failing like
OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s
while the client is trying to connect to the scheduler.

TODO: link other flaky-test issues related to this.