Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CLI Scheduler Tests #6502

Merged
merged 4 commits into from
Jun 8, 2022
Merged

Fix CLI Scheduler Tests #6502

merged 4 commits into from
Jun 8, 2022

Conversation

quasiben
Copy link
Member

@quasiben quasiben commented Jun 3, 2022

In https://dask.org/distributed/test_report.html there are flaky tests for:

  • test_dashboard_port_zero
  • test_dashboard_non_standard_ports

This PR attempts to resolve that flakiness by waiting for the scheduler to fully come up before creating the client. In one failed output we can see that the scheduler is up but the Client times out

@github-actions
Copy link
Contributor

github-actions bot commented Jun 3, 2022

Unit Test Results

       15 files  ±    0         15 suites  ±0   6h 16m 48s ⏱️ + 8m 48s
  2 853 tests ±    0    2 772 ✔️ +    2       81 💤  -   1  0  - 1 
21 586 runs  +449  20 554 ✔️ +365  1 032 💤 +85  0  - 1 

Results for commit 46230a8. ± Comparison against base commit 3468601.

♻️ This comment has been updated with latest results.

@quasiben
Copy link
Member Author

quasiben commented Jun 6, 2022

I ran this 1000 times locally and saw no issues.

@quasiben
Copy link
Member Author

quasiben commented Jun 7, 2022

Planning to merge EOD

@fjetter fjetter merged commit 781af78 into dask:main Jun 8, 2022
@quasiben quasiben deleted the fix-cli-tests branch June 8, 2022 13:30
@gjoseph92
Copy link
Collaborator

I ran this 1000 times locally and saw no issues.

Did you run main 1000 times locally and see issues? I was never able to reproduce the actual issue we're seeing in CI (subprocess.TimeoutExpired error) locally.

We've already been working on this test and others #6395. See #6483.

In one failed output we can see that the scheduler is up but the Client times out

In this case, you also see

2022-06-02 06:11:40,414 - distributed.scheduler - INFO - Receive client connection: Client-dccea0ee-e23a-11ec-bafa-005056bd5fdc
...
2022-06-02 06:11:48,470 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 8.07s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

So

  1. The client did connect, therefore trying to wait for the scheduler to come up by parsing logs is unnecessary
  2. As you can see on this line, the timeout occurred when the client was waiting for the scheduler to respond, after it had already connected
  3. The unresponsiveness in the scheduler is likely the problem

If you look at the test report, every failure of test_dashboard_port_zero is a subprocess.TimeoutExpired, not an asyncio.exceptions.TimeoutError like the case this test is trying to fix (and as I explained, I don't believe the changes here would fix that problem).

As discussed in #6483, piping stdout and parsing logs is better to avoid when it's not strictly necessary for the test.

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Jun 9, 2022
crusaderky pushed a commit that referenced this pull request Jun 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants