distributed/cli/tests/test_dask_scheduler.py::test_dashboard_port_zero
#6395
Tests like #6395 will fail (timeout) because a log statement doesn't get printed, but since you never get to see what _was_ printed, CI failures are hard to debug. Adds a `wait_for_log_line` helper that tees the output to stdout, so you can at least see what happened.
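Purely as illustration, a minimal sketch of what such a tee-ing helper could look like (hypothetical signature and log line; the actual helper added by this PR may differ):

```python
import subprocess
import sys


def wait_for_log_line(match: bytes, stream):
    """Read `stream` line by line until a line containing `match` appears.

    Every line read is echoed to our stdout, so if the expected line never
    shows up and the test times out, the CI log still shows what the
    subprocess actually printed.
    """
    while True:
        line = stream.readline()
        if not line:
            raise RuntimeError(f"stream closed before {match!r} was seen")
        sys.stdout.buffer.write(line)
        sys.stdout.buffer.flush()
        if match in line:
            return line


# Hypothetical usage: wait for the scheduler to announce its dashboard
# proc = subprocess.Popen(
#     ["dask-scheduler", "--dashboard-address", ":0"], stderr=subprocess.PIPE
# )
# wait_for_log_line(b"dashboard at", proc.stderr)
```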
We'll see what's happening in CI with #6461, but I'm wondering if this is just the typical thing where port 8787 is already in use.
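If the busy-port theory needs checking, a purely illustrative sketch (not part of the test suite) to see whether 8787 is taken, versus letting the OS pick a free port as the test does with port 0:

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


print("8787 free:", port_is_free(8787))
# Passing port 0 instead lets the OS choose any unused port, which is the
# behaviour test_dashboard_port_zero exercises for the dashboard.
```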
With #6461 in, we'll wait for a week to see if these failures still show up.
It doesn't seem to be effective: https://github.com/dask/distributed/runs/6825736211?check_suite_focus=true
Still an issue... https://github.com/fjetter/distributed/runs/7023391512?check_suite_focus=true shows a log message pattern like the following
and the test fails with a "client cannot connect" after 5s, tearing the test down. I suspect this is where the ~5s delay between the two info messages above comes from. So either one of the handlers is blocking before it reaches the http.proxy module, or something earlier in init is responsible.
I'm currently investigating the above "scheduler does not start up" error a bit more closely on the branch https://github.com/fjetter/distributed/tree/run_test_dashboard_port_zero_until_it_fails. So far:
After this, I could observe two things, so far
A couple of observations here
Note: Even though we have this additional debug/error information, the test is still failing with a
Next curiosity: we're only executing Python 3.8 and Python 3.9 jobs for OSX (`.github/workflows/tests.yaml`, lines 25 to 31 in c992f80).
I got the same on my branch as well: https://github.com/fjetter/distributed/blob/b36cc0cf44f60c4491953ee9801e41b33b9788d4/.github/workflows/tests.yaml#L25-L31. However, if I inspect the actually installed versions, I can see that the Python 3.8 job is using Python 3.9! (A quick way to confirm this is sketched after this comment.)
Edit: For some reason I cannot find the original test report anymore. I was suspecting an improper or buggy mamba solve but couldn't verify due to lack of logs.
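For reference, a trivial way to confirm which interpreter a job actually got when the matrix entry and the solved environment disagree (the commented assertion is only an example):

```python
import sys

print(sys.version)           # full version string of the running interpreter
print(sys.version_info[:3])  # e.g. (3, 9, 13) even though the job says 3.8

# A CI step could fail fast on a bad solve, e.g.:
# assert sys.version_info[:2] == (3, 8), sys.version
```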
I could reproduce the "stuck during pyarrow import" failure: https://github.com/fjetter/distributed/runs/7178677135?check_suite_focus=true (faulthandler output reads bottom to top, i.e. the first line is the most recent frame). We're clearly stuck in pyarrow init. (A sketch of how such a dump can be produced follows below.)
This one happened on 3.10 OSX. Possibly we are seeing two different errors on different Python versions / OSes. The line the pyarrow import is stuck on is indeed the initialization of the cython module (pyarrow 7.0.0).
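For context, a hedged sketch of how such a stack dump can be produced: faulthandler can be armed with a timer so that a hanging import prints every thread's stack (most recent call first), similar in spirit to what the test-timeout machinery emits. The 60s value is arbitrary:

```python
import faulthandler
import sys

# Arm a watchdog: if we are still running after 60s, dump the stack of every
# thread to stderr (most recent call first) and exit.  A hanging import then
# shows exactly which module-level line it is stuck on.
faulthandler.dump_traceback_later(60, exit=True, file=sys.stderr)

import pyarrow  # noqa: F401  (the import that was observed to hang)

faulthandler.cancel_dump_traceback_later()  # disarm once the import succeeded
```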
cc @graingert
Just had a debugging session with @graingert. What we currently think is happening:
It's still unclear why pyarrow is locking up. I am currently running two more CI runs
Edit: the pyarrow==6.0.1 run also ran into the same problem: https://github.com/fjetter/distributed/runs/7184426045?check_suite_focus=true
Turns out pyarrow is not the culprit. When running the same job without pyarrow, the import is equally stuck, this time in pandas hashing. Maybe cython??
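One way to narrow down where the import time actually goes (a debugging sketch only, not something for the test suite):

```python
# From a shell inside the failing environment, CPython can break the cost
# down per module:
#   python -X importtime -c "import pandas" 2> importtime.log
#
# Or, time the suspects individually:
import time

for mod in ("pyarrow", "pandas"):
    start = time.perf_counter()
    __import__(mod)
    print(f"import {mod}: {time.perf_counter() - start:.2f}s")
```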
Ok, so the event loop is closed because the timeout bubbles up as a keyboard interrupt (minimal illustration after this comment).
Now, why are all the cython imports so slow?? I started a test run with an ancient cython version (I don't know how compatible this is / how this depends on the version the libs were compiled with). I'm continuing these tests on https://github.com/fjetter/distributed/tree/try_no_pyarrow
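On the closed-event-loop point above: a minimal, dask-free illustration of how a timeout delivered as a KeyboardInterrupt during startup leaves a closed loop behind, so anything touching the loop afterwards fails with "Event loop is closed":

```python
import asyncio

loop = asyncio.new_event_loop()


def fake_timeout():
    # Pretend a test timeout arrives as SIGINT while startup is still running.
    raise KeyboardInterrupt


loop.call_soon(fake_timeout)
try:
    loop.run_forever()
except KeyboardInterrupt:
    pass
finally:
    loop.close()  # normal cleanup closes the loop

# Any later interaction with the same loop now fails:
try:
    loop.call_soon(print, "too late")
except RuntimeError as exc:
    print(exc)  # "Event loop is closed"
```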
That's weird, it should be the wait_for_signals handler
Ah
Well, that still doesn't explain why the import is locking up.
Ran an old cython version (0.29.24). Still locking up.
This test is evolving into a frequent offender.
Short test report (failures on main)
https://github.com/dask/distributed/actions/runs/2353756567
https://github.com/dask/distributed/actions/runs/2345977494
Failure on #6371
https://github.com/dask/distributed/runs/6517029607?check_suite_focus=true