
Adding functionality for spilling to packed device memory #601

Closed
charlesbluca opened this issue May 10, 2021 · 11 comments
Labels
1 - On Deck (to be worked on next), improvement (improvement/enhancement to an existing function), question (further information is requested)

Comments

@charlesbluca
Member

I am currently working on a Python API for cuDF pack/unpack (rapidsai/cudf#8153), with the goal being to add functionality here to spill to packed device memory.

Currently, this API allows for something like:

```python
gdf = cudf.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
packed = pack(gdf)         # returns a PackedColumns object
unpacked = unpack(packed)  # returns a cudf DataFrame
```

Is there anything else we would want in this API to open it up for use here?

cc @madsbk @pentschev

@pentschev
Member

pentschev commented May 10, 2021

From a "standard" spilling perspective, things work today as follows: Device<->Host<->Disk. To add this to that workflow, I think it would be as simple as adding a new layer: (Pack/Unpack)<->Device<->Host<->Disk. There's also JIT spilling, which is considerably more complex and which @madsbk would be best suited to comment on, but I believe the new (Pack/Unpack) layer could be added to that workflow in the same way as with standard spilling: (Pack/Unpack)<->(JIT Spill/Unspill)<->... .

Of course, all of the above is meant as an initial implementation for evaluating the functional aspects. We would probably want to evaluate whether combining some of these layers would lead to improved performance and a simpler codebase, too.

EDIT: To answer your specific question, I feel that a simple pack/unpack interface as you proposed is the best choice, if implementing it that way is doable in practice.
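The layered chain described above can be sketched with plain-Python stand-ins. This is only an illustration of the idea, assuming each layer is a function that moves data one step down the (Pack/Unpack)<->Device<->Host<->Disk hierarchy; these are not Dask-CUDA's real spill classes:

```python
# Illustrative stand-ins for the spill layers; real layers would move
# actual GPU/host/disk buffers rather than wrapping objects in tuples.
def pack_device(obj):
    return ("packed", obj)    # proposed new (Pack/Unpack) layer

def device_to_host(obj):
    return ("host", obj)      # existing Device<->Host layer

def host_to_disk(obj):
    return ("disk", obj)      # existing Host<->Disk layer

# The chain order mirrors (Pack/Unpack)<->Device<->Host<->Disk.
SPILL_CHAIN = [pack_device, device_to_host, host_to_disk]

def spill(obj, depth):
    # Apply successive layers as memory pressure increases:
    # depth=1 packs in device memory, depth=3 goes all the way to disk.
    for layer in SPILL_CHAIN[:depth]:
        obj = layer(obj)
    return obj
```

Adding the packing step as just another link in the chain is what keeps the change simple: the existing Device<->Host<->Disk layers stay untouched.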

@charlesbluca
Member Author

That idea of a new layer makes sense - I seem to recall from the last stand-up the possibility of spilling to packed memory first, i.e. something like unpacked dev<->packed dev<->host<->disk. However, I think for now it would make more sense to implement it the way you've laid out.

My biggest concern here is the return type of host_to_device(), since I'd imagine that's what we'd want to pass into some hypothetical device_to_packed() function. Currently, the C++ API for pack only accepts table_views as input, meaning we are limited to packing cuDF DataFrames. Should we perform a check in this packed spilling function to ensure that the object we receive as input is a DataFrame, or would the C++ API need to be expanded to allow this functionality at all?
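The type-check option could look something like the sketch below. Everything here is hypothetical: `device_to_packed()` does not exist, and `DataFrame`/`pack` are plain-Python placeholders for `cudf.DataFrame` and the proposed `pack()`:

```python
class DataFrame:
    # Stand-in for cudf.DataFrame, just to make the sketch runnable.
    def __init__(self, data):
        self.data = data

def pack(df):
    # Stand-in for the proposed pack(); the real one returns PackedColumns.
    return ("packed", df.data)

def device_to_packed(obj):
    # Since the C++ pack API only accepts table_views (i.e. DataFrames),
    # guard the input type before attempting to pack.
    if not isinstance(obj, DataFrame):
        raise TypeError(
            f"pack() only supports DataFrames, got {type(obj).__name__}"
        )
    return pack(obj)
```

The alternative, expanding the C++ API to cover non-DataFrame objects, would avoid this guard but is a much larger change.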

@jakirkham
Member

What about adding this directly to cudf.DataFrame? It already implements serialize/deserialize, which Dask uses.

@charlesbluca
Member Author

I'm not sure I understand - do you mean implement host<->device spilling for DataFrames on cuDF's end, or device<->packed?

@jakirkham
Member

I'm referring to these methods. Ideally, packing/unpacking would be done there, and Dask wouldn't need to know anything about it.

@charlesbluca
Member Author

Ah, thanks for the clarification - that makes sense. Would that mean we wouldn't have an unpacked device layer, and that spilling DataFrames to/from device would always pack/unpack? If so, I'm wondering whether we would want to keep the ability to serialize without packing by adding separate functions for it (i.e. serialize_packed/deserialize_packed).

@jakirkham
Member

Yeah, this came up in rapidsai/cudf#7601. Basically, we might want a config for cuDF (rapidsai/cudf#5311), which could enable/disable packing.
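A config toggle like that could look roughly as follows. This is a hedged sketch only: the option name `"spill.pack"` and the `serialize`/`_pack` helpers are made up for illustration, and the real cuDF config (rapidsai/cudf#5311) may take a different shape:

```python
# Hypothetical config dict; the real cuDF options module may differ.
CONFIG = {"spill.pack": True}

def _pack(frames):
    # Stand-in for cudf's pack(): collapse many buffers into one
    # contiguous blob, which is the point of packed spilling.
    return [b"".join(frames)]

def serialize(frames):
    # Pack into a single contiguous frame only when the config enables it;
    # otherwise return the frames unchanged, preserving today's behavior.
    if CONFIG["spill.pack"]:
        return {"packed": True}, _pack(frames)
    return {"packed": False}, list(frames)
```

Keeping the flag in serialize() itself means Dask (and pickling) pick up packing automatically, without any Dask-CUDA changes.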

@pentschev
Member

I agree with @jakirkham's idea too. Having this in cuDF itself is a much better alternative: it lets Dask-CUDA abstract away that change, and lets pure Dask users enjoy that compression mechanism as well.

@charlesbluca
Member Author

charlesbluca commented May 10, 2021

Same here. One question: is the serialization/deserialization of DataFrames used outside of Dask contexts? I think the config module for cuDF is a good idea, but I'm trying to gauge whether this pack/unpack option could go into Dask's config module along with the RMM/UCX options.

If it is used outside of Dask, then it should definitely go in a cuDF-specific configuration.

@jakirkham
Member

cuDF uses these functions when pickling objects as well, which is not Dask-specific. So it probably makes more sense to have the config somewhere cuDF users can control it.

@pentschev added the 1 - On Deck, improvement, and question labels on May 20, 2021
@charlesbluca
Member Author

Closing this as the solution forward here is making changes to cuDF 🙂
