
[FEA] Investigate libcudf features needed to support struct schema pruning during loads #510

Closed
jlowe opened this issue Aug 4, 2020 · 2 comments
Labels
feature request New feature or request P1 Nice to have for release

Comments

jlowe (Contributor) commented Aug 4, 2020

Is your feature request related to a problem? Please describe.
Spark supports schema pruning (see #463), where the schema required by a query is used to prune the struct fields that must be loaded, saving precious distributed filesystem I/O bandwidth and avoiding file-format decode work on unnecessary data.

Describe the solution you'd like
We need to investigate how this will be exposed to the RAPIDS plugin and what, if any, extra features are required from libcudf to enable pruning of nested struct fields that are unused by the query schema.

cc: @nvdbaranec

jlowe added the `feature request` (New feature or request) and `P1` (Nice to have for release) labels Aug 4, 2020
revans2 (Collaborator) commented Sep 8, 2020

From what I have read, we should be able to filter out unwanted blocks when we rewrite the file into an in-memory buffer. We would also need to rewrite the footer metadata so it no longer includes the columns we don't care about.
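A toy sketch of the footer-rewrite idea. The real Parquet footer is a Thrift-encoded `FileMetaData` structure; here it is modeled as plain dicts keyed by each column chunk's dotted `path_in_schema`, just to show how chunks under unwanted struct fields would be dropped. The function name and dict layout are illustrative, not plugin or libcudf APIs.

```python
def prune_footer(footer, wanted_paths):
    """Return a copy of the footer keeping only column chunks whose
    path_in_schema is one of the wanted dotted paths or nested under one."""
    def keep(path):
        return any(path == w or path.startswith(w + ".") for w in wanted_paths)
    return {
        "schema": [c for c in footer["schema"] if keep(c)],
        "row_groups": [
            {"columns": [col for col in rg["columns"]
                         if keep(col["path_in_schema"])]}
            for rg in footer["row_groups"]
        ],
    }

# A struct column `person` with three leaves, one row group.
footer = {
    "schema": ["person.name", "person.age", "person.address.zip"],
    "row_groups": [{"columns": [
        {"path_in_schema": "person.name", "offset": 4},
        {"path_in_schema": "person.age", "offset": 128},
        {"path_in_schema": "person.address.zip", "offset": 256},
    ]}],
}

pruned = prune_footer(footer, ["person.name"])
print(pruned["schema"])                         # ['person.name']
print(len(pruned["row_groups"][0]["columns"]))  # 1
```

Handing a footer rewritten like this to a reader means the reader never even sees the pruned column chunks, so no reader-side changes are needed.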

The main issue arises if/when we want cudf to read the file directly. In that case, cudf will need an API that lets us pass in some kind of read schema so it can skip the blocks that are not needed.

We will also need this for ORC at some point.
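A hedged sketch of what a "read schema" argument could mean for the reader: intersect the file's nested schema with the schema the query needs, so unneeded struct fields are never decoded. Schemas are modeled as nested dicts (structs) with leaf type names as strings; none of this reflects actual libcudf signatures.

```python
def prune_schema(file_schema, read_schema):
    """Keep only the parts of file_schema that read_schema asks for.
    A leaf in read_schema (anything that is not a dict) means
    'take this field with the file's type'."""
    if not isinstance(read_schema, dict):
        return file_schema
    return {
        name: prune_schema(file_schema[name], sub)
        for name, sub in read_schema.items()
        if name in file_schema
    }

file_schema = {
    "person": {"name": "string", "age": "int32",
               "address": {"street": "string", "zip": "string"}},
    "score": "float64",
}
# The query only touches person.name and person.address.zip.
read_schema = {"person": {"name": None, "address": {"zip": None}}}

print(prune_schema(file_schema, read_schema))
# {'person': {'name': 'string', 'address': {'zip': 'string'}}}
```

The same intersection would apply to ORC, since its schema is also a nested tree of struct fields.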

@jlowe Not sure if this is enough of an investigation, or if we need to file a follow-on issue for this?

jlowe (Contributor, Author) commented Sep 8, 2020

> The main issue with this would be if/when we want to have cudf read the file directly. In those cases we are going to need cudf to have an API that lets us pass in some kind of a read schema, so it can skip the blocks that are not needed.

I believe @nvdbaranec has been thinking about this and may be able to comment more on libcudf's plans for pruning struct schemas during parquet load.

> Not sure if this is enough of an investigation

I think we're good. For the short-term, we have the luxury of being able to manipulate the footer to reflect what we want to load.

jlowe closed this as completed Sep 8, 2020