
test_nanny_worker_port_range hangs on Windows #5925

Closed
crusaderky opened this issue Mar 10, 2022 · 4 comments · Fixed by #5956
@crusaderky (Collaborator)

test_nanny_worker_port_range has started deterministically hanging on Windows.
Since it hangs until pytest-timeout kills it, it brings down the whole test suite.

Last successful run: March 9th, 6 AM
First failing run: March 9th, 10 AM

Third-party dependencies were not updated; only dask and distributed themselves changed:

< dask                      2022.2.1+17.g2baf7f47          pypi_0    pypi
> dask                      2022.2.1+18.gd88b3442          pypi_0    pypi
< distributed               2022.2.1+9.gde94b408           dev_0    <develop>
> distributed               2022.2.1+10.ge1e43858           dev_0    <develop>

This regression seems to have been caused by #5897; see the latest build on that PR (which predates March 9th).

@crusaderky (Collaborator, Author) commented Mar 17, 2022

Confirmed that the issue is caused by the one-line change to the logging template.
dask-worker successfully connects to the scheduler (all 3 spawned worker processes) and then simply gives no answer to the connection attempt from Scheduler.broadcast. Nothing on stderr/stdout; it's as if the connection never reached the worker. It makes no sense.

I'll investigate further. ETA tomorrow for either a fix or a revert.

@crusaderky (Collaborator, Author)

The issue was that the added timestamps caused the Popen stderr pipe to fill up, and nothing was flushing it. As a consequence, the dask-worker subprocess got stuck indefinitely.
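
A minimal sketch of the mechanism (my own illustration, not the actual test or the #5956 fix): a child process whose stderr is captured with a pipe blocks as soon as it has written more than the OS pipe buffer holds, if the parent never reads from that pipe.

import subprocess
import sys
import time

# Hypothetical child: writes lots of log-like output to stderr, the way the
# added timestamps inflate every dask-worker log line.
child_code = r"""
import sys
for i in range(100_000):
    sys.stderr.write(f"2022-03-09 10:00:00,000 - distributed.worker - INFO - line {i}\n")
"""

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stderr=subprocess.PIPE,  # parent captures stderr but never reads it
)

time.sleep(5)
# The child is still alive here: its stderr writes block once the pipe
# buffer fills up, so it never finishes on its own.
print("child still running:", proc.poll() is None)

# Draining the pipe unblocks the child and lets it run to completion.
proc.communicate()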

@gjoseph92 (Collaborator)

> added timestamps caused the popen stderr pipe to fill up and there was nothing flushing it

@crusaderky why was this only showing up on Windows? Could it actually happen anywhere, and Windows was just the canary in the coal mine?

@crusaderky (Collaborator, Author)

It was happening on Windows because the pipe buffer size is OS-specific, and it turns out that on Windows it's slightly smaller.
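
For what it's worth, a common way to sidestep this class of deadlock regardless of the OS-specific buffer size is to keep draining the child's pipe from a background thread (or to hand it off with communicate()). This is a generic sketch of that pattern, not necessarily the fix that landed in #5956:

import subprocess
import sys
import threading

def drain(pipe, sink):
    # Keep reading lines so the child can never block on a full pipe.
    for line in iter(pipe.readline, b""):
        sink.append(line)
    pipe.close()

# Hypothetical child that writes ~8 MB to stderr, far more than any pipe buffer.
proc = subprocess.Popen(
    [
        sys.executable,
        "-c",
        "import sys\nfor _ in range(100_000): sys.stderr.write('x' * 80 + '\\n')",
    ],
    stderr=subprocess.PIPE,
)

captured = []
reader = threading.Thread(target=drain, args=(proc.stderr, captured), daemon=True)
reader.start()

proc.wait()    # completes even though stderr vastly exceeds the pipe buffer
reader.join()
print(f"captured {len(captured)} stderr lines")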
