[FEA] Better CUDF/Nvstrings Spill over to Disk/Memory #99
Comments
On the Dask side, knowing how much memory is used isn't sufficient. Dask still can't spill that memory to host, since it has no control over it.
I see two options here:
I've just opened rapidsai/rmm#112 with a proposal on option 2. Note that this would be a major change, and will probably take some time (assuming the proposal is acceptable).
Can't we do the same as we currently do for other cuDF data types and CuPy arrays? We don't need to manage the GPU memory. We just need to be able to copy it to host. (I think?)
I think that you just need to implement
My apologies for the delay here, by the way.
No, the problem here regards the memory on the C++ side, which is not at all exposed to Python; Dask can't serialize what it doesn't know about.
To be clear, Dask isn't serializing things here; cuDF is. Is it possible for cuDF or nvstrings to return something like a Numba device array with the relevant bytes, along with perhaps a small Python dictionary with metadata if necessary?
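To make that suggestion concrete, here is a minimal sketch (hypothetical helper names and buffer layout, not cuDF's or nvstrings' actual API) of what "a Numba device array with the relevant bytes plus a small metadata dict" could look like, and how those bytes could then be copied to host for spilling:

```python
import numpy as np
from numba import cuda

def expose_string_column(chars_dev, offsets_dev, null_count):
    # Hypothetical helper: describe a string column as a small metadata
    # dict ("header") plus a list of raw device buffers ("frames").
    header = {
        "type": "string-column",
        "null_count": null_count,
        "dtypes": [str(chars_dev.dtype), str(offsets_dev.dtype)],
    }
    frames = [chars_dev, offsets_dev]
    return header, frames

def spill_frames_to_host(header, frames):
    # Copying each device frame to a host (NumPy) array is all a
    # scheduler needs in order to move the column out of GPU memory.
    return header, [f.copy_to_host() for f in frames]

# Example usage (requires a GPU):
chars = cuda.to_device(np.frombuffer(b"helloworld", dtype=np.uint8).copy())
offsets = cuda.to_device(np.array([0, 5, 10], dtype=np.int32))
header, frames = expose_string_column(chars, offsets, null_count=0)
header, host_frames = spill_frames_to_host(header, frames)
```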
What do you mean by "return"? Do you mean returning something at the end of each cuDF call, or giving access to memory buffers to Python? If you mean the former, there will still be memory used for temporary computations that won't ever be known by Dask, and thus won't be spilled (which is what we have today). For the latter, that's what the generic solution I'm proposing in rapidsai/rmm#112 is: permitting exposure of internal C++ memory to Python. IMO, the way we expose it (via a Numba array or otherwise) is irrelevant at this time.
I mean having a
cc @kkraus14
Yes, and what I'm saying is that if it runs out of memory before that is returned, there's nothing Dask can do, and this is the case today.
That's also fine, I just thought it would be useful to have a starting point for the discussion. I don't know if my RMM proposal is the best solution, but it is probably one solution.
That's fine with me. That's the case today anyway. I don't think that this has much to do with the concrete request here of supporting nvstrings in the same way that we support other data types.
Indeed, that's the case today, and it limits the usability for @VibhuJawa.
Admittedly, I don't know much about nvstrings, but it seems to take a device pointer as input (chunk of a timeseries) and return some other device pointer. I think this is already generally supported, since a cuDF series/chunk is a Numba device array. The problem still seems to me that the footprint of nvstrings is too big, and a potential solution for that would be allowing the spilling of memory internally used by the C++ implementation.
Perhaps. I get the sense that @VibhuJawa's question is "what should I be asking for in order to get the same spill-to-disk behavior that I have with cudf, but with string columns?" I think that the answer to this question is: you should ask for cudf's
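On the Dask side, "supporting nvstrings in the same way as other data types" would come down to registering it with distributed's serialization dispatch. A hedged sketch, assuming the `dask_serialize`/`dask_deserialize` registries from `distributed.protocol` and a hypothetical `StringColumn` wrapper (not an actual cuDF class):

```python
import numpy as np
from numba import cuda
from distributed.protocol import dask_serialize, dask_deserialize

class StringColumn:
    # Hypothetical wrapper around a device buffer holding string bytes.
    def __init__(self, chars_dev):
        self.chars_dev = chars_dev

@dask_serialize.register(StringColumn)
def serialize_string_column(col):
    # Copy the device bytes to host so the object can be spilled to
    # host memory or disk like any other registered type.
    header = {"dtype": str(col.chars_dev.dtype)}
    frames = [col.chars_dev.copy_to_host().tobytes()]
    return header, frames

@dask_deserialize.register(StringColumn)
def deserialize_string_column(header, frames):
    # Rebuild the device buffer from the host bytes when it is needed
    # on the GPU again.
    host = np.frombuffer(frames[0], dtype=header["dtype"]).copy()
    return StringColumn(cuda.to_device(host))
```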
Thanks for the responses to this thread @mrocklin, @pentschev. Sincerest apologies for not being clear enough regarding my question.
@mrocklin, I believe
@pentschev, thanks for clarifying this; this was essentially the root of my confusion. I wrongly assumed that
I think this is a major problem too, as we have non-trivial intermittent memory spikes (2x in some cases), which makes spilling over much more difficult with strings. Current issues: in general, to make spill-over work in these workflows, we have to:
We can also look at changing the default
Both of these have considerable performance implications and also make usability difficult, as one has to fiddle with both of them to tune performance or even to make things work in some cases.
FWIW, I feel this is a step in the right direction to start a discussion around this topic. CC: @randerzander
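The exact settings being tweaked aren't spelled out above, but as an illustration of the kind of knobs involved (assuming dask-cuda's `LocalCUDACluster` and its `device_memory_limit` parameter), lowering the spill threshold to leave headroom for temporary allocations might look like:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Spill tracked objects from device to host once a worker's GPU usage
# crosses this limit. Setting it well below the physical GPU size leaves
# headroom for intermittent 2x spikes from temporary allocations inside
# nvstrings/C++ calls that Dask cannot see or spill itself, at the cost
# of spilling earlier and more often.
cluster = LocalCUDACluster(device_memory_limit="16GB")
client = Client(cluster)
```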
Could you try configuring RMM to use managed memory and see how that works? You would use it before running your workflow. For a managed memory pool, you can do:
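A minimal sketch of enabling RMM managed memory from Python, assuming the `rmm.reinitialize` API (illustrative; not necessarily the exact snippet the commenter had in mind):

```python
import rmm

# Use CUDA managed (unified) memory so device allocations can be
# oversubscribed and paged by the driver; run this before cuDF/nvstrings
# allocate any GPU memory.
rmm.reinitialize(managed_memory=True)

# Or, for a pool allocator backed by managed memory:
rmm.reinitialize(
    managed_memory=True,
    pool_allocator=True,
    initial_pool_size=2 << 30,  # 2 GiB starting pool; adjust as needed
)
```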
Admittedly this was done a few weeks back, but it is worth noting that we improved serialization of cuDF objects with PR (rapidsai/cudf#4101) and nvstrings objects specifically with PR (rapidsai/cudf#4111). This avoids acquiring things like CUDA contexts, which slowed this down previously.
[FEA] Better CUDF/Nvstrings Spill over to Disk/Memory
We still have workflows that are limited by spill over with `cudf`, as it currently only works for limited workflows. An example where spill over fails is: #65 (comment)

According to #65 (comment), we need more changes on the `cudf` side to support this. We now expose the device memory used with `nvstrings`: rapidsai/custrings#395

Can you please list the changes we still require, so that we can track and get them completed ASAP to unblock these workflows and enable better spill over?
CC: @pentschev
CC: @randerzander