
Adding functionality for spilling to packed device memory #601

Closed
charlesbluca opened this issue May 10, 2021 · 11 comments
Labels
1 - On Deck (to be worked on next), improvement (improvement/enhancement to an existing function), question (further information is requested)

Comments

@charlesbluca
Member

I am currently working on a Python API for cuDF pack/unpack (rapidsai/cudf#8153), with the goal being to add functionality here to spill to packed device memory.

Currently, this API allows for something like:

```python
gdf = cudf.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
packed = pack(gdf)         # returns a PackedColumns object
unpacked = unpack(packed)  # returns a cudf DataFrame
```

Is there anything else we would want in this API to open it up for use here?

cc @madsbk @pentschev

@pentschev
Member

pentschev commented May 10, 2021

From a "standard" spilling perspective, things work today as follows: Device<->Host<->Disk. To add this to that workflow, I think it would be as simple as adding a new layer: (Pack/Unpack)<->Device<->Host<->Disk. There's also JIT spilling, which is considerably more complex and which @madsbk would be best suited to comment on, but I believe the new (Pack/Unpack) layer could be added to that workflow in the same way as with standard spilling: (Pack/Unpack)<->(JIT Spill/Unspill)<->... .

Of course, all of the above is meant as an initial implementation for evaluating the functional aspects. We would probably want to evaluate whether combining some of these layers would lead to improved performance and a simpler codebase, too.

EDIT: To answer your specific question, I feel that a simple pack/unpack interface as you proposed is the best choice, if implementing it that way is doable in practice.
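The layered chain described above can be sketched with plain-Python stand-ins. This is only an illustration of the idea, assuming each layer is a function that moves data one step down the (Pack/Unpack)<->Device<->Host<->Disk hierarchy; these are not Dask-CUDA's real spill classes:

```python
# Illustrative stand-ins for the spill layers; real layers would move
# actual GPU/host/disk buffers rather than wrapping objects in tuples.
def pack_device(obj):
    return ("packed", obj)    # proposed new (Pack/Unpack) layer

def device_to_host(obj):
    return ("host", obj)      # existing Device<->Host layer

def host_to_disk(obj):
    return ("disk", obj)      # existing Host<->Disk layer

# The chain order mirrors (Pack/Unpack)<->Device<->Host<->Disk.
SPILL_CHAIN = [pack_device, device_to_host, host_to_disk]

def spill(obj, depth):
    # Apply successive layers as memory pressure increases:
    # depth=1 packs in device memory, depth=3 goes all the way to disk.
    for layer in SPILL_CHAIN[:depth]:
        obj = layer(obj)
    return obj
```

Adding the packing step as just another link in the chain is what keeps the change simple: the existing Device<->Host<->Disk layers stay untouched.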

@charlesbluca
Member Author

That idea of a new layer makes sense - I seem to recall from the last stand-up the possibility of spilling to packed memory first, i.e. something like unpacked dev<->packed dev<->host<->disk. However, I think for now it would make more sense to implement it the way you've laid out.

My biggest concern here is the return type of host_to_device(), since I'd imagine that's what we'd want to pass into some hypothetical device_to_packed() function. Currently, the C++ API for pack only accepts table_views as input, meaning we are limited to packing cuDF DataFrames. Should we perform a check in this packed spilling function to ensure that the object we receive as input is a DataFrame, or would the C++ API need to be expanded to allow this functionality at all?
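The type-check option could look something like the sketch below. Everything here is hypothetical: `device_to_packed()` does not exist, and `DataFrame`/`pack` are plain-Python placeholders for `cudf.DataFrame` and the proposed `pack()`:

```python
class DataFrame:
    # Stand-in for cudf.DataFrame, just to make the sketch runnable.
    def __init__(self, data):
        self.data = data

def pack(df):
    # Stand-in for the proposed pack(); the real one returns PackedColumns.
    return ("packed", df.data)

def device_to_packed(obj):
    # Since the C++ pack API only accepts table_views (i.e. DataFrames),
    # guard the input type before attempting to pack.
    if not isinstance(obj, DataFrame):
        raise TypeError(
            f"pack() only supports DataFrames, got {type(obj).__name__}"
        )
    return pack(obj)
```

The alternative, expanding the C++ API to cover non-DataFrame objects, would avoid this guard but is a much larger change.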

@jakirkham
Member

What about adding this directly to cudf.DataFrame? It already implements serialize/deserialize, which Dask uses.

@charlesbluca
Member Author

I'm not sure I understand - do you mean implement host<->device spilling for DataFrames on cuDF's end, or device<->packed?

@jakirkham
Member

I'm referring to these methods. Ideally, packing/unpacking would be done there, and Dask wouldn't need to know anything about it.

@charlesbluca
Member Author

Ah, thanks for the clarification - that makes sense. Would that mean we wouldn't have an unpacked device layer, and that spilling DataFrames to/from device would always pack/unpack? If so, I'm wondering whether we would want to keep the ability to serialize without packing by adding separate functions for it (i.e. serialize_packed/deserialize_packed).

@jakirkham
Member

Yeah, this came up in rapidsai/cudf#7601. Basically, we might want a config for cuDF (rapidsai/cudf#5311), which could enable/disable packing.
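A config toggle like that could look roughly as follows. This is a hedged sketch only: the option name `"spill.pack"` and the `serialize`/`_pack` helpers are made up for illustration, and the real cuDF config (rapidsai/cudf#5311) may take a different shape:

```python
# Hypothetical config dict; the real cuDF options module may differ.
CONFIG = {"spill.pack": True}

def _pack(frames):
    # Stand-in for cudf's pack(): collapse many buffers into one
    # contiguous blob, which is the point of packed spilling.
    return [b"".join(frames)]

def serialize(frames):
    # Pack into a single contiguous frame only when the config enables it;
    # otherwise return the frames unchanged, preserving today's behavior.
    if CONFIG["spill.pack"]:
        return {"packed": True}, _pack(frames)
    return {"packed": False}, list(frames)
```

Keeping the flag in serialize() itself means Dask (and pickling) pick up packing automatically, without any Dask-CUDA changes.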

@pentschev
Member

I agree with @jakirkham's idea too. Having this in cuDF itself is a much better alternative: it lets Dask-CUDA abstract away that change, and lets pure Dask users enjoy that compression mechanism as well.

@charlesbluca
Member Author

charlesbluca commented May 10, 2021

Same here. One question: is the serialization/deserialization of DataFrames used outside of Dask contexts? I think the config module for cuDF is a good idea, but I'm trying to gauge whether this pack/unpack option could go into Dask's config module along with the RMM/UCX options.

If it is used outside of Dask, then it should definitely go in a cuDF-specific configuration.

@jakirkham
Member

cuDF uses these functions when pickling objects as well, which is not Dask-specific. So it probably makes more sense to have the config somewhere cuDF users can control it.

@pentschev added the 1 - On Deck, improvement, and question labels on May 20, 2021
@charlesbluca
Member Author

Closing this as the solution forward here is making changes to cuDF 🙂
