[FEA] Implement chunked_pack #13180

Closed
4 tasks done
abellina opened this issue Apr 20, 2023 · 0 comments
Labels
feature request New feature or request Spark Functionality that helps Spark RAPIDS


abellina commented Apr 20, 2023

In spark-rapids, we use cudf::pack (in reality we call cudf::contiguous_split with empty splits) to lay out a cudf table, with its various data, validity, and offsets buffers and potential children, into a contiguous buffer described by metadata produced by cuDF. This has been a key operation for us to turn a table into something that we can trivially move to host or disk (spill). Because tables can be quite large, we have found that requiring them to be contiguous in memory adds quite a bit of memory pressure and fragments the memory pool. Some tables can be GBs in size, so in order to satisfy the allocation needed for pack to return a contiguous buffer, we can actually introduce even more spill.
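To make the memory cost concrete, here is a minimal host-only sketch of what a pack-style operation does: several separate buffers are copied into one contiguous allocation, and a small metadata record notes where each one landed. This is not the cuDF implementation. `packed_sketch` and `pack_sketch` are hypothetical names, plain `std::vector` stands in for device memory, and the alignment padding a real pack would apply is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a packed table: one contiguous buffer plus
// metadata describing where each source buffer starts and how long it is.
struct packed_sketch {
  std::vector<std::size_t> offsets;  // start of each source buffer in `data`
  std::vector<std::size_t> sizes;    // size in bytes of each source buffer
  std::vector<std::uint8_t> data;    // the single contiguous allocation
};

// Copies every source buffer (data, validity, offsets, ...) back to back.
// Assumption: no alignment padding between buffers, unlike a real pack.
packed_sketch pack_sketch(std::vector<std::vector<std::uint8_t>> const& buffers)
{
  packed_sketch out;
  std::size_t total = 0;
  for (auto const& b : buffers) {
    out.offsets.push_back(total);
    out.sizes.push_back(b.size());
    total += b.size();
  }

  // This is the allocation the issue is about: it is as large as the
  // whole table and must be satisfied in one piece.
  out.data.reserve(total);
  for (auto const& b : buffers) {
    out.data.insert(out.data.end(), b.begin(), b.end());
  }
  assert(out.data.size() == total);
  return out;
}
```

The single `reserve(total)` call is the point of the sketch: for a multi-GB table, that one request is what adds pressure to the pool.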

Lately we have been making more use of the spill framework as we are now able to retry some cuDF operations: NVIDIA/spark-rapids#7252. We would like to be able to turn cuDF tables into "spillable tables" without requiring a premeditated pack, instead performing the pack in chunks when it is actually needed. This means we want to call pack with a bounce buffer, ensuring that no large allocations happen during this process.

After the first invocation of pack, the bounce buffer contents are copied to host (where an allocation equal to the original contiguous size is waiting to be filled). We then keep calling pack iteratively, not unlike the other chunked interfaces in cuDF.
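The loop described above can be sketched on the host as follows. This is only an illustration of the control flow under stated assumptions, not the proposed cuDF interface: `chunked_packer_sketch`, `pack_to_host_in_chunks`, and the `has_next`/`next` method names are hypothetical, `std::vector` stands in for both device and host memory, and alignment padding is ignored. The only working memory besides the final host destination is the fixed-size bounce buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical chunked packer: emits the contiguous layout of `buffers`
// a piece at a time, never producing more than `bounce_size` bytes per call.
class chunked_packer_sketch {
 public:
  chunked_packer_sketch(std::vector<std::vector<std::uint8_t>> buffers,
                        std::size_t bounce_size)
    : buffers_(std::move(buffers)), bounce_size_(bounce_size)
  {
    assert(bounce_size_ > 0);
    for (auto const& b : buffers_) { total_size_ += b.size(); }
  }

  // Size of the full contiguous result, known before any copying happens.
  std::size_t total_contiguous_size() const { return total_size_; }

  bool has_next() const { return emitted_ < total_size_; }

  // Fills the front of `bounce` with the next chunk; returns bytes written.
  std::size_t next(std::vector<std::uint8_t>& bounce)
  {
    assert(bounce.size() >= bounce_size_);
    std::size_t written = 0;
    while (written < bounce_size_ && buf_idx_ < buffers_.size()) {
      auto const& src = buffers_[buf_idx_];
      std::size_t n = std::min(bounce_size_ - written, src.size() - buf_off_);
      std::copy_n(src.begin() + buf_off_, n, bounce.begin() + written);
      written += n;
      buf_off_ += n;
      // Move on once the current source buffer is exhausted.
      if (buf_off_ == src.size()) {
        ++buf_idx_;
        buf_off_ = 0;
      }
    }
    emitted_ += written;
    return written;
  }

 private:
  std::vector<std::vector<std::uint8_t>> buffers_;
  std::size_t bounce_size_;
  std::size_t total_size_ = 0;
  std::size_t emitted_    = 0;
  std::size_t buf_idx_    = 0;  // source buffer currently being copied
  std::size_t buf_off_    = 0;  // read position inside that buffer
};

// Drives the packer: each chunk lands in the bounce buffer and is then
// copied into a host allocation sized for the whole contiguous result.
std::vector<std::uint8_t> pack_to_host_in_chunks(
  std::vector<std::vector<std::uint8_t>> buffers, std::size_t bounce_size)
{
  chunked_packer_sketch packer(std::move(buffers), bounce_size);
  std::vector<std::uint8_t> host(packer.total_contiguous_size());
  std::vector<std::uint8_t> bounce(bounce_size);  // reused for every chunk
  std::size_t host_off = 0;
  while (packer.has_next()) {
    std::size_t n = packer.next(bounce);
    std::copy_n(bounce.begin(), n, host.begin() + host_off);
    host_off += n;
  }
  return host;
}
```

In a real GPU setting the inner copy would be a device-side gather into the bounce buffer and the outer copy a device-to-host transfer, but the shape of the loop would be the same.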

I've been working on this together with @nvdbaranec. I am going to post a series of PRs that get us there, but overall here's the plan:

@abellina abellina added feature request New feature or request Needs Triage Need team to review and classify labels Apr 20, 2023
@abellina abellina self-assigned this Apr 20, 2023
@abellina abellina added the Spark Functionality that helps Spark RAPIDS label Apr 20, 2023
rapids-bot bot pushed a commit that referenced this issue May 19, 2023
This PR introduced some unused arguments that are causing compilation errors in nvcc 11.5 #13180. Taking care of that here.

@davidwendt found these in his local nvcc 11.5 build

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #13387
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024