Dask-cudf multi partition merge slows down with ucx
#402
Comments
Adding the dask performance reports for ucx/tcp. The notable issue in the task graph is that for TCP there is more communication but the workers are fully engaged (no white space). For UCX, there is less communication but the workers are often not actively working on a particular task (white space). |
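For reference, a minimal sketch of how such a report can be generated with Dask's `performance_report` context manager (the cluster and dataframes below are placeholders, not the original workload):

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, performance_report, wait

client = Client()  # in the real runs this points at the UCX- or TCP-backed cluster

left = dd.from_pandas(pd.DataFrame({"key": range(1000), "x": range(1000)}), npartitions=4)
right = dd.from_pandas(pd.DataFrame({"key": range(1000), "y": range(1000)}), npartitions=4)

# Everything executed inside the block is captured in an HTML report containing
# the task stream and worker profiles discussed above.
with performance_report(filename="merge-report.html"):
    merged = left.merge(right, on="key").persist()
    wait(merged)
```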
Thanks Vibhu for filing and Ben for including the profiling plot. It looks like the read time on the worker is taking ~1.5x longer in the UCX case than the TCP case. This may be helped by releasing the GIL during these operations. ( #391 ) We also are spending a lot of time in generating the |
We are also spending more time working in the UCX report (3-4x the time). I'm going to be raising a similar issue with a reproducible example for groupby-aggregation behavior, today |
Ah was looking at the gaps initially based on Ben's comment. Should I be looking at something else? |
I just wanted to highlight that the actual time spent doing compute is materially different, which may or may not be related, but naively feels like a potential contributor to the slowdown. |
Meaning the overall runtime? If so, then I think Ben has the right idea to look at gaps. If not, some clarity on the compute time you are looking at would be helpful 🙂 |
Yes, sorry for being unclear 😄 From the worker profile in the linked performance report, we spend 130 total seconds doing "compute" with UCX across the 16 GPUs. We only spend 40 total seconds with TCP across the 16 GPUs.
|
No worries. Thanks Nick! Just wanted to make sure we are focusing on the pain point 😉 Yeah that's what I was looking at above, Ben. Agree this is what we should focus on. |
I want to note that these issues could be a symptom of poor work stealing; we are still digging into this idea. |
That may be true. There is also some cost in building Numba arrays, which we are doing. I've done some work to speed this up. Am still trying to figure out the right way to hook this into what we are doing here. In particular please see this rough benchmark. |
Here are some timings I gathered with this code:
As we can see above, openucx/ucx#4646 helps (and if cherry-picked to v1.7.0, matches performance of master) but TCP still outperforms UCX. @quasiben and I explored the |
In looking more thoroughly through the worker profile, there seems to be a fair amount of time spent managing CUDA contexts when using UCX. @kkraus14 do you happen to have any insights here? |
Yeah that's what I was trying to get at with this comment. Am still trying to find the right way to hook Dask into higher level RMM/dask-cuda. PR ( rapidsai/rmm#264 ) is a start at that. |
Also after a fairly long rabbit hole earlier today, I think |
thanks @jakirkham for looking into this |
Have added PR ( rapidsai/rmm#268 ) to RMM. This should make it easier to directly copy from a device pointer to a host buffer without going through a bunch of Numba machinery first. Am hoping this will be useful in cuDF and dask-cuda where we can benefit from removing some of this overhead. |
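As a rough sketch of the host/device copy pattern in question (based on the `DeviceBuffer` interface as I understand it, not the exact code in that PR):

```python
import rmm

# Allocate a device buffer through RMM (served from the pool when one is configured).
dbuf = rmm.DeviceBuffer(size=1000)

# Copy the raw device memory back to the host as bytes, without wrapping the
# pointer in a Numba device array first.
host_bytes = dbuf.tobytes()

# And the reverse direction: build a device buffer directly from host memory.
dbuf2 = rmm.DeviceBuffer.to_device(host_bytes)
```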
I set up two dask-cuda-workers manually and ran with
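The exact flags used are not shown above; as an illustrative (not exact) equivalent, the same kind of UCX-enabled setup can be expressed with `dask_cuda.LocalCUDACluster`:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Illustrative settings only; the actual worker configuration for this run is not shown.
cluster = LocalCUDACluster(
    protocol="ucx",            # use UCX instead of TCP for worker<->worker transfers
    enable_tcp_over_ucx=True,
    enable_nvlink=True,
    enable_infiniband=False,
    rmm_pool_size="24GB",      # pre-create an RMM memory pool on each worker
)
client = Client(cluster)
```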
We are spending a lot of time with I believe this is known to both @jakirkham and @kkraus14 |
Thanks for doing that Ben! 😄 This coming from Numba makes sense. Though it's nice to have the additional detail provided here. Just to make sure we are not missing anything, do we know of any other situations (outside of Numba array creation) where Edit: Should add we are hoping PR ( rapidsai/rmm#268 ) paves a short path for us to dispense with this overhead. |
I checked cudf/rmm/ucx and I didn't see https://github.com/rapidsai/rmm/blob/branch-0.12/python/rmm/rmm.py#L142-L164 which calls: |
FWIW I tried adding That said, I wouldn't be surprised if we still see a fair amount of time spent in Numba due to legacy RMM allocation methods. So am planning on pushing on PR ( rapidsai/rmm#268 ) more tomorrow in the hopes of quickly integrating this into cuDF and dask-cuda for further testing. |
One thing that confuses me about the flame graph above (and hopefully someone can help answer this 🙂): it appears that creating an `rmm.DeviceBuffer` with the pool enabled is quite fast, particularly compared to a Numba device array:

```python
In [1]: import rmm

In [2]: rmm.reinitialize(pool_allocator=True,
   ...:                  initial_pool_size=int(2 * 2**30))
Out[2]: 0

In [3]: %timeit rmm.DeviceBuffer(size=50_000_000)
360 ns ± 1.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: import cupy

In [5]: %timeit cupy.empty((50_000_000,), dtype="u1")
2.15 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [6]: import numba.cuda

In [7]: %timeit numba.cuda.device_array((50_000_000,), dtype="u1")
1.39 ms ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
|
Of course if the pool were not actually enabled, that would present a problem:

```python
In [1]: import rmm

In [2]: rmm.reinitialize(pool_allocator=False)
Out[2]: 0

In [3]: %timeit rmm.DeviceBuffer(size=50_000_000)
1.12 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
|
Note that profiling |
As for
|
@quasiben also mentioned to me that we had ~81k calls to |
Wrote a quick test:

```python
import rmm
import time

rmm.reinitialize(pool_allocator=True, initial_pool_size=int(2**34))

num_allocations = 10000
allocations = [None] * num_allocations

# Time allocating many small buffers from the pool.
start = time.time()
for i in range(num_allocations):
    allocations[i] = rmm.DeviceBuffer(size=100)
end = time.time()
time_result = (end - start) * 1000
print(f"Time taken for allocating {num_allocations} buffers: {time_result}ms")

# Time freeing them again.
start = time.time()
for i in range(num_allocations):
    allocations[i] = None
end = time.time()
time_result = (end - start) * 1000
print(f"Time taken for freeing {num_allocations} buffers: {time_result}ms")
```

Results:
|
IDK if this would help, but one could add a Barring that one might rename the |
Adding some plots below to hopefully give more context. This is for one worker, but other workers look similar. The first plot shows a histogram of the number of allocations for a particular number of bytes. The second plot shows how many allocations are alive over time, in "time steps" (i.e. each point where an allocation occurs). |
This looks like we're keeping the number of allocations to a very reasonable amount. |
Any chance of seeing where on that timeline you see the slowdown? |
My guess (though will try to back this up with data) is it lines up with where we see the very large spikes. The first half is mostly loading data and the second half is performing the merge. This is also where all of the data transfer is happening. |
Looking over all of these the mean is ~0.5MB, the standard deviation ~1MB, the median is ~0.25MB, and the max ~5MB. The standard deviation and mean seem to be pretty large based on these infrequent, large allocations. |
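A minimal sketch of how such summary statistics might be computed, assuming the per-allocation sizes were recorded in a list (the name `allocation_sizes` and the values below are made up):

```python
import numpy as np

# Placeholder values standing in for the sizes (in bytes) of every allocation
# observed on one worker.
allocation_sizes = np.array([256, 1024, 250_000, 500_000, 5_000_000, 2048])

print("mean:  ", allocation_sizes.mean())
print("std:   ", allocation_sizes.std())
print("median:", np.median(allocation_sizes))
print("max:   ", allocation_sizes.max())
```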
Is this example representative of important workloads? I am surprised that the largest allocation in a big data application is 5MB. What is the peak memory usage in bytes? |
For the last histogram, I'd like to see a more detailed breakdown of that bottom bin. Is it mostly <1KB allocations like in the first histogram? Or are they closer to 500KB? |
It is representative of this particular problem, but granted this is on a small scale dataset. I'm hoping we'll see larger allocations for larger problems. However, there are a bunch of small allocations regardless, which probably happen in CUB when it's doing some DtoH<->HtoD transfers, probably transferring some 32 bit counters or something similar. |
@jakirkham you asked for reports from Azure, here they are: TCP |
It looks like UCX + NVLink + InfiniBand is a bit faster, but is experiencing the memory allocation problem much worse than UCX + NVLink. It seems memory allocation is taking roughly twice as much time in the former than the latter. Though I would expect both perform the same number of allocations. Why should one take so much longer? 🤔 |
At this point I think we should start getting some NSight profiles and looking at them to see where we're spending the time. If we need to add additional nvtx ranges into cuDF / RMM we can. |
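As an illustration of the kind of annotation being suggested, using the `nvtx` Python package (the range names here are made up):

```python
import nvtx

# Named NVTX ranges show up on the Nsight Systems timeline, making it easier to
# attribute time inside a larger workflow.
with nvtx.annotate("load-partitions", color="green"):
    pass  # e.g. read/construct the dataframes

with nvtx.annotate("merge", color="blue"):
    pass  # e.g. run the dask-cudf merge
```

The profile itself would then be captured with something like `nsys profile python workflow.py` and opened in the Nsight Systems UI.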
Keith, before you spend more time on this: I found the cause for the slowdown with nv peer mem enabled -- repeated IB registrations. I've shared a patch which avoids repeated IB registrations. With this patch, Peter is able to see the slowdown go away. I think Peter will add more details here soon.
|
Which slowdown? The overall reason for this issue? Or the slowdown of UCX + NVLink + IB vs. UCX + NVLink? |
I think there is some confusion, let me attempt to clarify:
Finally, there has been a general trend of assuming that a large part of the slowdown comes from memory allocation. While that may be true, I think we need to be extra careful when interpreting our profile reports, as all of them are limited in not including all the information we need (i.e., we don't have any profile that includes C++, CUDA and Python information all at once). For instance, I think one such assumption is coming from the fact that While I'm not saying the above is actually happening, we need to be careful to identify exactly what's going on; since we don't have the full picture yet, we may be looking at the wrong piece of the puzzle. |
Thanks much for the summary, @pentschev |
Have rerun using the same setup as before. One small tweak was to bump up the RMM pool by about 2GB (so 28GB total). The main difference here is that TCP now avoids pickling the data as it uses the Here are the overall runtimes extracted from the profile plots:
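One way to apply that pool size across all workers after they start, shown only as a sketch (dask-cuda's `rmm_pool_size` option at worker startup is the other route; the scheduler address is a placeholder):

```python
import rmm
from dask.distributed import Client

client = Client("scheduler-address:8786")  # placeholder address

def set_rmm_pool():
    # Recreate the RMM pool on each worker with a 28 GB initial size.
    rmm.reinitialize(pool_allocator=True, initial_pool_size=28 * 2**30)

client.run(set_rmm_pool)
```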
Here are the profiles I got for the DGX-1 and DGX-2 using TCP and UCX: |
One small tweak was to bump up the RMM pool by about 2GB (so 28GB total).
From the issue description this seems like a slow down for |
It's worth noting those numbers come from the Dask Performance Reports themselves, not |
So the previous reports did include the creation time; updated times for the merge are:
With setting |
So there may be issues with my build or my environment generally. It would be good to get others to replicate this and see what they get. However, this is what I'm seeing using the suballocator work on a DGX-1. This uses
|
So the timings here include both creation and merging; can you try profiling the creation and the merge separately, as the |
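A rough sketch of timing the two phases separately, using `persist` and `wait` so the inputs are fully created before the merge is measured (the local cluster and dataframes below are placeholders):

```python
import time
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, wait

client = Client()  # placeholder; the real runs use the GPU cluster described above

left = dd.from_pandas(pd.DataFrame({"key": range(10_000), "x": range(10_000)}), npartitions=8)
right = dd.from_pandas(pd.DataFrame({"key": range(10_000), "y": range(10_000)}), npartitions=8)

# Materialize the inputs first so creation time is measured on its own.
t0 = time.perf_counter()
left, right = client.persist([left, right])
wait([left, right])
t1 = time.perf_counter()
print(f"creation: {t1 - t0:.2f}s")

# Only the merge (and its shuffle/communication) is timed here.
merged = left.merge(right, on="key").persist()
wait(merged)
t2 = time.perf_counter()
print(f"merge: {t2 - t1:.2f}s")
```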
I'm going to investigate and retest today. There still seem to be some weird things here that I've seen in previous runs (despite all new packages). Maybe it's time to purge Miniconda and reinstall everything from scratch? |
I checked how things are looking currently, and below are the results I got with RAPIDS 0.20 and UCX 1.9 on a DGX-2: TCP:
UCX + NVLink:
It seems like there has been a regression in performance with TCP, and UCX is outperforming TCP now (as we would expect). From the UCX-Py side, I think the original issue here has been resolved over the past year, do you think we can close this @VibhuJawa? It may be worth investigating the potential regression in this merge workflow, but at this point I don't think there's anything to be done on the UCX front. |
Thanks for following up on this, Peter. Agree that we should close this on the UCX-Py side as there is no action item left on that front. Might be worth exploring the TCP slowdown though, as the 2x slowdown seems concerning (CC: @randerzander / @beckernick). |
Dask-cudf multi partition merge slows down with `ucx`

Dask-cudf merge seems to slow down with `ucx`. Wall time: 15.4 seconds on `tcp` vs 37.8 s on `ucx` (exp-01). In the attached example we see a slow down with `ucx` vs just using `tcp`.

Wall Times on exp-01:
UCX Time
TCP times
Repro Code:
Helper Function to create distributed dask-cudf frame
RMM Setup:
Merge Code:
The slow down happens on the merge step.
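Purely as a hypothetical illustration of the kind of multi-partition merge being timed (sizes, column names, and partition counts are made up, not the original repro code):

```python
import numpy as np
import cudf
import dask_cudf

n = 1_000_000

# Two cudf dataframes sharing a key column, split into multiple partitions each.
left = dask_cudf.from_cudf(
    cudf.DataFrame({"key": np.random.randint(0, 10_000, n), "x": np.random.random(n)}),
    npartitions=16,
)
right = dask_cudf.from_cudf(
    cudf.DataFrame({"key": np.random.randint(0, 10_000, n), "y": np.random.random(n)}),
    npartitions=16,
)

# The multi-partition merge triggers a shuffle; the shuffle's communication is
# where the tcp-vs-ucx difference shows up.
merged = left.merge(right, on="key").persist()
```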
Additional Context:
There has been discussion about this on our internal Slack channel; please see it for more context.