[BUG] repartition failing on multiple-workers #2321
Is this issue deterministic? I just ran the example above on a DGX with 16GB GPUs and I can't reproduce that.
For completeness, I tested this on a nightly build.
@pentschev, can you provide the exact versions you are on, because I am on nightly too (I just updated everything)? This seems to be non-deterministic for me. Also, this might be unrelated to OOM; I have updated the heading to reflect that.

```bash
%%bash
conda list | grep 'cudf'
conda list | grep 'dask'
```
@VibhuJawa here's a list of what I'm using:
I still couldn't reproduce this after some 3-4 attempts just now.
The only major difference I see is that you're using CUDA 10; I'm not sure if that is of big concern. Also, I just updated everything, so maybe your cudf build is from yesterday.
Alright, I was able to reproduce this, and it was my mistake that I couldn't before (I was only running the first two blocks of code and missed the third one). This is directly related to the issue in rapidsai/dask-cuda#57: what happens here is that the worker crashes due to the amount of data. I discussed this offline with @VibhuJawa, and for such pipelines we have two options:
Both will incur overhead, and this may depend heavily on the algorithm in question, but I tend to believe option 1 will perform better. I have been working on benchmarking alternatives for spilling memory in rapidsai/dask-cuda#92, but the outlook isn't great. Besides the cost of copying the memory to host, there's also a cost in serializing that memory. To give an idea of the current status, spilling to host currently has a bandwidth of about 550 MB/s, in contrast to the roughly 6 GB/s we can achieve with unpinned memory. I expect to be able to speed up serialization, but the actual spilling bandwidth will certainly be < 6 GB/s.
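For reference, a minimal sketch (not from the thread) of turning on the device-to-host spilling discussed in rapidsai/dask-cuda#57 via dask-cuda's `device_memory_limit`; the worker count and threshold here are assumptions:

```python
# Minimal sketch: start a dask-cuda cluster that spills device memory to host
# once a worker's GPU memory usage crosses a threshold.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    n_workers=4,                 # one worker per GPU
    device_memory_limit="10GB",  # assumed threshold for a 16GB card; tune as needed
)
client = Client(cluster)
```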
Apologies for the late follow-up. Just to be clear, I changed the number of rows to just
Why do you want only one partition per worker? This seems possibly inefficient. For example, what happens if a few of the partitions end up on the same worker? Dask makes no guarantees about behavior here.
So say I have a 16GB card and 7GB of memory per worker spread among lots of partitions. Then I move the data around so that I can repartition it. Let's say that in a pessimistic case most pieces of data have to move, so now I have 14GB of memory occupied per worker. Then, just for a moment, I need to copy 7GB of that data into a single 7GB dataframe before I can release the 7GB of smaller dataframes. Now I have, briefly, 21GB of data on my 16GB cards. Perhaps this explains the problem?
Dask's scheduling heuristics aren't good at playing very close to the line like this. I recommend keeping many small partitions. Things work more smoothly. If you need giant chunks for some reason, then Dask may not be a good fit. This approach is more common in MPI workloads.
Is there a general estimate on how close we can be to the limit? Usually in host-based workflows there is a sweet spot for the number of partitions (not too many, not too few) that gives the best performance.
Our general recommendations are here: https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions. Note that in your case I'm assuming you're running with one thread rather than 10, as in that example; you should have space for a few times more chunks in memory than you have concurrent tasks. My understanding from @kkraus14 was that performance benefits drop off a bit once chunks are large enough to saturate all of the thread blocks, and the estimate I was given was that this is likely to occur in the few-hundred-MB range. I don't know what's optimal, though; it sounds like you have evidence against this.
In your case, I'd recommend chunks in the gigabyte range, unless there is some very pressing need to go larger.
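As a rough illustration of sizing partitions toward that gigabyte range (not from the thread; the `ddf` handle and the 1GB target are assumptions):

```python
# Hypothetical sketch: aim for ~1GB partitions instead of one giant partition per worker.
# `ddf` is assumed to be an existing dask_cudf (or dask) dataframe.
target_bytes = 1_000_000_000                               # ~1GB per partition
total_bytes = ddf.memory_usage(deep=True).sum().compute()  # in-memory footprint of the dataframe
npartitions = max(1, int(total_bytes // target_bytes))

ddf = ddf.repartition(npartitions=npartitions)
```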
I wanted to repartition this because I was sending it to Dask-XGBoost to train, which concatenates the partitions that end up on each worker. See: https://github.com/rapidsai/dask-xgboost/blob/dask-cudf/dask_xgboost/core.py#L67-L69. @mt-jones, can you please confirm whether we can prevent the memory spike we were seeing by doing this, or is it completely unrelated?
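Not the linked code verbatim, but a minimal sketch of that per-worker concatenation pattern and why it briefly spikes device memory (the `parts` list is a stand-in for the partitions that landed on one worker):

```python
# Illustrative sketch: combining a worker's many small cudf partitions into one
# frame before training. While cudf.concat runs, the small frames and the
# combined frame coexist, so device memory briefly holds roughly 2x the data.
import numpy as np
import cudf

parts = [
    cudf.DataFrame({"x": np.random.random(1_000_000)})  # stand-ins for local partitions
    for _ in range(8)
]
local_df = cudf.concat(parts)  # transient peak: all of `parts` plus `local_df`
del parts                      # the small frames are only freed after the concat finishes
```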
Yup, that explains it, but can we not side-step that problem by making the config changes below?
On the performance aspects of large chunks vs. small chunks, my experience is similar to @ayushdg's. I ran some performance tests on a workflow yesterday, which I can share on side channels, that show the performance drop with decreasing chunk size.
I do not immediately see how those changes would avoid the situation I'm talking about above. The bit that does this in XGBoost intentionally looks at the data that is already on the workers and concatenates that. That should remove the extra 7GB from moving data around. This is unrelated to the config settings you mention.
FWIW I think that it'd be great to show such a benchmark in a public place like a GitHub issue if you're able to make something that is shareable.
Okay, let me try to run the same operations on some dummy data and try to make something shareable.
Also, I suspect that the conversation will be able to engage more people if it doesn't include Dask. It would be interesting to see how cudf operations react as data sizes get larger. If you can show performance degradation with small chunks with just cudf, then you'll have a lot of people who can engage on the problem. If Dask is involved, then it's probably just a couple of us (most of whom are saturated).
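If it helps kick that off, here is a rough sketch of the kind of cudf-only benchmark being suggested; the operation, row counts, and column names are arbitrary placeholders:

```python
# Hypothetical sketch: time one cudf operation across increasing chunk sizes to see
# whether small chunks underutilize the GPU.
import time
import numpy as np
import cudf

for nrows in (100_000, 1_000_000, 10_000_000, 100_000_000):
    df = cudf.DataFrame({
        "key": np.random.randint(0, 1000, size=nrows),
        "val": np.random.random(nrows),
    })
    start = time.time()
    df.groupby("key")["val"].mean()
    print(f"{nrows:>12,} rows: {time.time() - start:.4f} s")
```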
This is resolved as of 0.14.
Describe the bug
I am running into issues when I try to repartition on multiple workers (>= 4) even though I have enough memory; I am at just 43% capacity. This seems to work fine if I have just 2 workers.
Steps/Code to reproduce bug
Helper Functions
Create dataframe
nvidia-smi output
Repartition df
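(The original code blocks did not survive extraction; below is a hypothetical reconstruction of the kind of steps described above: build a dask_cudf dataframe across the workers, then repartition it down to one partition per worker. Cluster size, row counts, and column names are assumptions.)

```python
# Hypothetical reconstruction -- not the original reproduction code.
import numpy as np
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(n_workers=4)   # the failure was seen with >= 4 workers
client = Client(cluster)

# Build a dataframe with many small partitions and spread it across the workers.
df = cudf.DataFrame({
    "key": np.random.randint(0, 1000, size=50_000_000),
    "val": np.random.random(50_000_000),
})
ddf = dask_cudf.from_cudf(df, npartitions=64).persist()

# Collapse to one partition per worker -- the step where the workers crashed.
n_workers = len(client.scheduler_info()["workers"])
ddf = ddf.repartition(npartitions=n_workers).persist()
```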
Error Trace
Environment overview (please complete the following information)
Edit: Added nvidia-smi output.
Edit 2: Removed OOM from the heading as this seems to be unrelated. Sorry for the confusion.