In-memory realization of dask-based objects can hang process when using multi-threaded dask config and high thread count #19
Comments
Thanks for the detailed write-up! Just to broadly summarize, it seems like the current takeaways are (please correct me if I'm wrong):
This one is definitely a head scratcher... It's possible that the locks used in the logging module to ensure thread-safety are somehow preventing or resolving an issue with threads locking each other, but that's way beyond my understanding. I'm fine with leaving this for now, as it seems to be an isolated issue with a simple workaround.
Yes, I think this is a good synopsis of where we are.
Ooo, good find. I'm sure I don't understand either.
@grovduck I just got an indefinite hang trying to compute:

```python
import timeit

from sknnr import GNNRegressor
from sknnr_spatial import wrap
from sknnr_spatial.datasets import load_swo_ecoplot

if __name__ == "__main__":
    X_img, X, y = load_swo_ecoplot(as_dataset=True, large_rasters=True)
    est = wrap(GNNRegressor()).fit(X, y)

    print(f"predict: {timeit.timeit(lambda: est.predict(X_img).compute(), number=1):.03f}s")
    print(f"kneighbors: {timeit.timeit(lambda: [x.compute() for x in est.kneighbors(X_img)], number=1):.03f}s")
```

Interestingly, I got a warning that I haven't seen before:
That warning led me to this thread, where one of the Dask maintainers suggested that Dask and BLAS (which I believe is used internally by …) …

I was able to solve my hanging with this suggestion to use …

I'm curious if you've ever noticed that OpenBLAS warning on your end?

EDIT: I don't want to derail this, but I'm also getting OpenBLAS crashes on the …
@aazuspan, running your above code with the addition of the context manager …

I then switched back to the …

So it succeeded without issues or warnings on runs 1-5 and 7, threw the warning (but completed) on the sixth run, and hung on the eighth run. So even though I don't remember it, it's possible that this warning cropped up during my initial testing, although I would have thought I would have mentioned it.

EDIT: Substituting your context manager (…
Thanks for the details @grovduck! It's nice to know this isn't isolated to my machine, but I'm at a loss for why this is just now popping up for both of us, and why it appears inconsistently.
Great, that strongly suggests this is all related to threading and BLAS (somehow). I'd suggest we leave this open for now and I'll dive a little deeper into the research to try and figure out what's happening and why, but if the …

EDIT: Dask best practices recommend limiting OpenBLAS threads to 1 prior to parallelizing NumPy operations, to avoid performance issues. The NumPy docs also call out OpenBLAS parallelization as a potential issue when parallelizing with Dask or sklearn:

EDIT 2: I've never encountered OpenBLAS warnings or errors on my Linux machine, but did find that limiting OpenBLAS to 1 thread reduced prediction time from ~44s to ~23s.
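For reference, a minimal sketch of two common ways to cap OpenBLAS threads before Dask parallelizes NumPy work; the specific calls here are illustrative rather than the exact workaround used above (the environment variables generally need to be set before NumPy first loads BLAS):

```python
import os

# Option 1: cap BLAS/OpenMP threads via environment variables
# (must be set before NumPy/BLAS is imported to take effect).
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

# Option 2: cap BLAS threads at runtime with threadpoolctl.
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1, user_api="blas"):
    ...  # run the dask-backed prediction here
```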
When running example code with `sknnr-spatial` based on the suggested workflow from @aazuspan in #18, I'm encountering an issue with the Python process hanging. This only seems to occur when using the multi-threaded (default) version of `dask`. The single-threaded and multi-process versions of `dask` work as expected, as does the local cluster version.

To reproduce the issue on my machine (with 36 threads available), I can run the following code:
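A sketch of that reproduction, mirroring the script @aazuspan posted earlier in this thread (the exact code may differ slightly):

```python
from sknnr import GNNRegressor
from sknnr_spatial import wrap
from sknnr_spatial.datasets import load_swo_ecoplot

if __name__ == "__main__":
    # Uses dask's default multi-threaded scheduler.
    X_img, X, y = load_swo_ecoplot(as_dataset=True, large_rasters=True)
    est = wrap(GNNRegressor()).fit(X, y)

    # Realizing the dask-backed prediction in memory is where the hang occurs.
    est.predict(X_img).compute()
```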
I am running this from the project's default hatch configuration. This does not hang every time, but consistently hangs about once every 4-5 runs. When the process hangs, I can see that the Python process is still running, typically consuming a high amount of CPU but at a static level of RAM (typically about 250MB). I can use "End Task" to kill the process.
Attempted Fixes
Working with @aazuspan, we have tried a number of things to resolve this issue.
Create xarray dataset from scratch rather than reading from disk

We thought that the issue might be related to the way that the `xarray` dataset was being created from disk. To test this, we created a synthetic `xarray.Dataset` from scratch.

The script did not hang for 15 repeated attempts.
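A sketch of the kind of synthetic, dask-backed dataset used for this test (the variable names, shapes, and chunk sizes here are illustrative, not the actual test code):

```python
import dask.array as da
import numpy as np
import xarray as xr

# Build a small dask-backed Dataset with a few "band" variables, mimicking a raster stack.
shape = (1024, 1024)
data_vars = {
    f"band_{i}": (("y", "x"), da.random.random(shape, chunks=(256, 256)))
    for i in range(4)
}
X_img = xr.Dataset(data_vars, coords={"y": np.arange(shape[0]), "x": np.arange(shape[1])})

# Realize it in memory, the step that hangs with the on-disk data.
X_img.compute()
```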
Compute the X_img on its own

For this test, we read from disk, but we only test whether the `X_img` can be "computed" without hanging.

The script did not hang for 15 repeated attempts.
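A sketch of that test, assuming the same `load_swo_ecoplot` loader used in the reproduction:

```python
from sknnr_spatial.datasets import load_swo_ecoplot

if __name__ == "__main__":
    X_img, X, y = load_swo_ecoplot(as_dataset=True, large_rasters=True)

    # Only realize the image itself; no estimator is involved.
    X_img.compute()
```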
Use an estimator other than GNNRegressor

We can use a different predictor to see if the issue is specific to the `GNNRegressor`. We tried using `sknnr.EuclideanKNNRegressor`.

The script hung on the first attempt.
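A sketch of the substitution; only the estimator changes relative to the reproduction script:

```python
from sknnr import EuclideanKNNRegressor
from sknnr_spatial import wrap
from sknnr_spatial.datasets import load_swo_ecoplot

if __name__ == "__main__":
    X_img, X, y = load_swo_ecoplot(as_dataset=True, large_rasters=True)

    # Swap GNNRegressor for EuclideanKNNRegressor; everything else is unchanged.
    est = wrap(EuclideanKNNRegressor()).fit(X, y)
    est.predict(X_img).compute()
```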
Disable file locks when using rioxarray.open_rasterio

Based on a rioxarray issue submitted by Tom Augspurger, I tried to set `lock=False` (per Tom's fix) when reading the raster data. This meant modifying `datasets._base` on line 48 to add the `lock=False` argument.

From:

to:

This did not resolve the issue, as the script hung on the first attempt.
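A sketch of the kind of change described; the exact call in `datasets._base` isn't reproduced here, so the file path and other arguments are illustrative:

```python
import rioxarray

# Before: lazy read with the default per-file lock.
raster = rioxarray.open_rasterio("raster.tif", chunks=True)

# After: disable the lock, per the suggested fix.
raster = rioxarray.open_rasterio("raster.tif", chunks=True, lock=False)
```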
Reduce the number of threads
On a whim, I reduced the number of threads used in the calculation from 36 to 10 (chosen arbitrarily).
When the number of threads is set to 10, the script does not hang for 15 repeated attempts.
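A sketch of one way to cap the thread count for a single computation (the original test may have configured this differently; the array here just stands in for the real prediction):

```python
import dask.array as da

# A stand-in dask computation; the real test computed the sknnr-spatial prediction.
x = da.random.random((4096, 4096), chunks=(512, 512))

# Cap the threaded scheduler at 10 workers instead of one per available CPU thread.
x.sum().compute(scheduler="threads", num_workers=10)
```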
Introduced logging
In perhaps the most bizarre "fix" of all, I added logging to the script to see if I could get any information about where the script was hanging:
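A sketch of the kind of logging that was added; the exact messages and placement are illustrative, not the actual script:

```python
import logging
import threading

from sknnr import GNNRegressor
from sknnr_spatial import wrap
from sknnr_spatial.datasets import load_swo_ecoplot

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(threadName)s %(message)s",
)
logger = logging.getLogger(__name__)

if __name__ == "__main__":
    logger.info("Loading data")
    X_img, X, y = load_swo_ecoplot(as_dataset=True, large_rasters=True)

    logger.info("Fitting estimator")
    est = wrap(GNNRegressor()).fit(X, y)

    logger.info("Predicting (active threads: %d)", threading.active_count())
    est.predict(X_img).compute()
    logger.info("Done")
```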
For reasons I can't explain, when I added logging, the script did not hang for 15 repeated attempts. I can also tell from the logging output that all 36 threads are being used.
This one may be a total red herring, though.
Status
Unfortunately, this issue appears to only occur on machines with a high number of available threads. @aazuspan is not able to reproduce this issue in any configuration, which makes it difficult to test.
Workarounds
Luckily, using either of these `dask` scheduler configurations appears to work correctly:

- `dask.config.set(scheduler="single-threaded")`
- `dask.config.set(scheduler="processes")`
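For reference, either workaround can also be scoped to a single computation with a context manager rather than set globally; a minimal sketch with a stand-in array:

```python
import dask
import dask.array as da

x = da.random.random((2048, 2048), chunks=(512, 512))

# Scope the scheduler change to just this computation.
with dask.config.set(scheduler="processes"):
    x.mean().compute()
```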