
[FEA] Better CUDF/Nvstrings Spill over to Disk/Memory #99

Closed
VibhuJawa opened this issue Jul 30, 2019 · 17 comments

@VibhuJawa
Member

VibhuJawa commented Jul 30, 2019

[FEA] Better CUDF/Nvstrings Spill over to Disk/Memory

We still have workflows that are limited by spill-over support in cudf, as it currently only works for a limited set of workflows.

An example where spill-over fails: #65 (comment)

According to #65 (comment), we need more changes on the cudf side to support this.

We now expose the device memory used with nvstrings: rapidsai/custrings#395

Can you please list the changes we still require, so that we can track them and get them completed ASAP to unblock these workflows and enable better spill-over?

CC: @pentschev
CC: @randerzander

@pentschev
Member

We now expose the device memory used with nvstrings

On the Dask side, knowing how much memory is used isn't sufficient; Dask still can't spill that memory to host, since it has no control over it.

Can you please list the changes we still require, so that we can track them and get them completed ASAP to unblock these workflows and enable better spill-over?

I see two options here:

  1. Libraries should take care of spilling the memory themselves (may be less robust, since libraries won't know about memory used by other libraries); or
  2. Expose internal C++ memory to Python and allow Python libraries (such as Dask) to take control of that memory. This is certainly a major change to RAPIDS libraries in general.

I've just opened rapidsai/rmm#112 with a proposal on option 2. Note that this would be a major change and will probably take some time (assuming the proposal is acceptable).

@mrocklin
Contributor

mrocklin commented Aug 2, 2019

Can't we do the same as we currently do for other cudf datatypes and cupy arrays?

We don't need to manage the GPU memory. We just need to be able to copy it to host. (I think?)
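
As an illustration of the kind of copy being referred to here, a minimal sketch using a plain Numba device array as a stand-in for data Dask can already see (illustrative only; not the nvstrings internals under discussion):

import numpy as np
from numba import cuda

# Device-side data that Dask knows about, e.g. the buffer backing a numeric column.
host_data = np.arange(1_000_000, dtype=np.float64)
device_array = cuda.to_device(host_data)   # lives in GPU memory

# "Spilling" is then just a device-to-host copy that Dask can trigger...
spilled = device_array.copy_to_host()      # a plain NumPy array on the host

# ...and un-spilling is the reverse copy when the data is needed again.
restored = cuda.to_device(spilled)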


@mrocklin
Contributor

mrocklin commented Aug 2, 2019

Can you please list the changes we still require, so that we can track them and get them completed ASAP to unblock these workflows and enable better spill-over?

I think that you just need to implement serialize/deserialize methods on these objects. These methods already exist on current cudf objects like Series, Column, Buffer, and so on, so there might be good models for you there. I think that @quasiben has also been trying to clean this code up a bit.
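
As a rough sketch of that serialize/deserialize convention (a small metadata header plus a list of frames), here is a hypothetical StringColumnLike stand-in; the real cuDF/nvstrings implementation would put device buffers in the frames rather than host bytes:

class StringColumnLike:
    """Hypothetical stand-in for an nvstrings-backed string column."""

    def __init__(self, payload, count):
        self.payload = payload   # imagine this being device memory
        self.count = count       # number of strings

    def serialize(self):
        # Convention: a small metadata header plus a list of frames;
        # for real cuDF objects the frames would be device buffers.
        header = {"type": "StringColumnLike", "count": self.count}
        frames = [self.payload]
        return header, frames

    @classmethod
    def deserialize(cls, header, frames):
        return cls(frames[0], header["count"])


col = StringColumnLike(b"spill\x00over", count=2)
header, frames = col.serialize()
restored = StringColumnLike.deserialize(header, frames)
assert restored.count == col.count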

@mrocklin
Contributor

mrocklin commented Aug 2, 2019

so that we can track them and get them completed ASAP

My apologies for the delay here, by the way.

@pentschev
Member

No, the problem here regards the memory on the C++ side, which is not at all exposed to Python; Dask can't serialize what it doesn't know about.

@mrocklin
Contributor

mrocklin commented Aug 2, 2019

To be clear, Dask isn't serializing things here, cuDF is.

Is it possible for cuDF or nvstrings to return something like a Numba device array with the relevant bytes, along with perhaps a small Python dictionary with metadata if necessary?

@pentschev
Member

Is it possible for cuDF or nvstrings to return something like a Numba device array with the relevant bytes, along with perhaps a small Python dictionary with metadata if necessary?

What do you mean by "return"? Do you mean returning something at the end of each cuDF call, or giving Python access to the memory buffers? If you mean the former, there will still be memory used for temporary computations that won't ever be known by Dask, and thus won't be spilled (which is what we have today). If you mean the latter, that's what the generic solution I'm proposing in rapidsai/rmm#112 is: permitting exposure of internal C++ memory to Python.

IMO, the way we expose it (via a Numba array or otherwise) is irrelevant at this time.

@mrocklin
Contributor

mrocklin commented Aug 2, 2019

I mean having a .serialize method on nvstrings objects that returns a header and device-bytes objects, similar to how the other cudf .serialize methods do this. I'm inclined to let the cudf devs think about the right way to do this. I get the sense that they understand the development constraints here better than I do.

@mrocklin
Contributor

mrocklin commented Aug 2, 2019

cc @kkraus14

@pentschev
Member

I mean having a .serialize method on nvstrings objects that returns a header and device-bytes objects, similar to how the other cudf .serialize methods do this.

Yes, and what I'm saying is that if it runs out of memory before that is returned, there's nothing Dask can do, and this is the case today.

I'm inclined to let the cudf devs think about the right way to do this. I get the sense that they understand the development constraints here better than I do.

That's also fine, I just thought it would be useful to have a starting point for the discussion. I don't know if my RMM proposal is the best solution, but it is probably one solution.

@mrocklin
Contributor

mrocklin commented Aug 2, 2019

Yes, and what I'm saying is that if it runs out of memory before that is returned, there's nothing Dask can do, and this is the case today.

That's fine with me. That's the case today anyway. I don't think that this has much to do with the concrete request here of supporting nvstrings in the same way that we support other data types.

@pentschev
Member

That's fine with me. That's the case today anyway.

Indeed, that's the case today, and it limits the usability for @VibhuJawa.

I don't think that this has much to do with the concrete request here of supporting nvstrings in the same way that we support other data types.

Admittedly, I don't know much about nvstrings, but it seems to take a device pointer as input (a chunk of a timeseries) and return some other device pointer. I think this is already generally supported, since a cuDF series/chunk is a Numba device array. The problem still seems to me that the memory footprint of nvstrings is too big, and a potential solution for that would be allowing the spilling of memory internally used by the C++ implementation.

@mrocklin
Contributor

mrocklin commented Aug 2, 2019

Perhaps. I get the sense that @VibhuJawa's question is "what should I be asking for in order to get the same spill-to-disk behavior that I have with cudf, but with string columns?"

I think that the answer to this question is: you should ask for cudf's serialize/deserialize methods to work with string columns. I think that cudf devs will understand this request and will handle it from there.

@VibhuJawa
Member Author

VibhuJawa commented Aug 3, 2019

Thanks for the responses to this thread, @mrocklin, @pentschev.

Sincerest apologies for not being clear enough regarding my question.

Perhaps. I get the sense that @VibhuJawa's question is "what should I be asking for in order to get the same spill-to-disk behavior that I have with cudf, but with string columns?"

@mrocklin, I believe the serialize/deserialize methods already work with string columns, see link.
I should have been clearer in my request, apologies.

On the Dask side, knowing how much memory is used isn't sufficient; Dask still can't spill that memory to host, since it has no control over it.

@pentschev, thanks for clarifying this; this was essentially the root of my confusion. I wrongly assumed that Dask could spill memory to host when it knew how much memory was being used.

The problem still seems to me that the memory footprint of nvstrings is too big, and a potential solution for that would be allowing the spilling of memory internally used by the C++ implementation.

I think this is a major problem too, as we have non-trivial intermittent memory spikes (2x in some cases), which makes spilling much more difficult with strings.

Current issues:

In general, to make spill-over work in these workflows, we have to:

  • Reduce the device memory limit by quite a bit.

    In both examples, we have to set 10,000 MiB on a 16,000 MiB card (about 62 percent); see the cluster sketch after this list.

    We can also look at changing the default spill fraction for CUDA workers, but this might have performance implications for non-cuDF users (CuPy, etc.).

  • Use much smaller chunk sizes.

Both of these have considerable performance implications and hurt usability, since one has to fiddle with both of them to tune for performance, or even to make a workflow run at all in some cases.
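
For reference, a minimal sketch of where that device memory limit would be set, assuming a dask-cuda LocalCUDACluster; the 10,000 MiB value mirrors the examples above, everything else is illustrative:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Lowering device_memory_limit well below the card's 16 GiB makes the worker
# start spilling device objects to host earlier, which is the workaround above.
cluster = LocalCUDACluster(device_memory_limit="10000MiB")
client = Client(cluster)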

I've just opened rapidsai/rmm#112 with a proposal on option 2. Note that this would be a major change and will probably take some time (assuming the proposal is acceptable).

FWIW , I feel this is a step in the right direction to start a discussion around this topic.

CC: @randerzander

@jrhemstad

jrhemstad commented Aug 6, 2019

Could you try configuring RMM to use managed memory and see how that works?

See: https://github.com/rapidsai/rmm/blob/dfa8740883735e57bc9ebb95ed56a1321141a8b0/README.md#handling-rmm-options-in-python-code

You would use

from librmm_cffi import librmm_config as rmm_cfg
rmm_cfg.use_managed_memory = True

before importing cudf.

For a managed memory pool, you can do:

from librmm_cffi import librmm_config as rmm_cfg
rmm_cfg.use_managed_memory = True
rmm_cfg.use_pool_allocator = True
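
Putting it together, a minimal end-to-end sketch of that ordering (configure RMM first, then import cudf); the small DataFrame at the end is only illustrative, to show cudf coming up with string data after the managed-memory configuration:

# RMM must be configured before cudf is imported so cudf picks up the settings.
from librmm_cffi import librmm_config as rmm_cfg

rmm_cfg.use_managed_memory = True   # allocate via CUDA managed (unified) memory
rmm_cfg.use_pool_allocator = True   # optional: use a pool of managed memory

import cudf  # intentionally imported after the RMM configuration

gdf = cudf.DataFrame({"s": ["spill", "over", "test"]})
print(gdf)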

@pentschev
Member

This ended up being mostly a duplicate of #57, therefore I'm closing it in favor of that one. Please feel free to reopen if there's still anything here that is not covered by #57.

@jakirkham
Member

Admittedly this was done a few weeks back, but it is worth noting that we improved serialization of cuDF objects with PR rapidsai/cudf#4101 and of nvstrings objects specifically with PR rapidsai/cudf#4111. This avoids acquiring things like CUDA contexts, which slowed this down previously.
