
[FEA] Investigate libcudf features needed to support struct schema pruning during loads #510

Closed
jlowe opened this issue Aug 4, 2020 · 2 comments
Labels
feature request New feature or request P1 Nice to have for release

Comments

jlowe (Contributor) commented Aug 4, 2020

Is your feature request related to a problem? Please describe.
Spark supports schema pruning (see #463), where the schema required by a query is used to prune the struct fields that must be loaded, saving precious distributed filesystem I/O bandwidth and avoiding file-format decode work on unnecessary data.

Describe the solution you'd like
We need to investigate how this will be exposed to the RAPIDS plugin and what, if any, extra features are required from libcudf to enable pruning of nested struct fields that are unused by the query schema.

cc: @nvdbaranec

jlowe added the `feature request` (New feature or request) and `P1` (Nice to have for release) labels Aug 4, 2020
revans2 (Collaborator) commented Sep 8, 2020

From what I have read, we should be able to filter out unwanted blocks when we rewrite the file into an in-memory buffer. We would also need to rewrite the footer metadata so it no longer includes the columns we don't care about.
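A toy sketch of the footer-rewrite idea. The real Parquet footer is a Thrift-encoded `FileMetaData` structure; here it is modeled as plain dicts keyed by each column chunk's dotted `path_in_schema`, just to show how chunks under unwanted struct fields would be dropped. The function name and dict layout are illustrative, not plugin or libcudf APIs.

```python
def prune_footer(footer, wanted_paths):
    """Return a copy of the footer keeping only column chunks whose
    path_in_schema is one of the wanted dotted paths or nested under one."""
    def keep(path):
        return any(path == w or path.startswith(w + ".") for w in wanted_paths)
    return {
        "schema": [c for c in footer["schema"] if keep(c)],
        "row_groups": [
            {"columns": [col for col in rg["columns"]
                         if keep(col["path_in_schema"])]}
            for rg in footer["row_groups"]
        ],
    }

# A struct column `person` with three leaves, one row group.
footer = {
    "schema": ["person.name", "person.age", "person.address.zip"],
    "row_groups": [{"columns": [
        {"path_in_schema": "person.name", "offset": 4},
        {"path_in_schema": "person.age", "offset": 128},
        {"path_in_schema": "person.address.zip", "offset": 256},
    ]}],
}

pruned = prune_footer(footer, ["person.name"])
print(pruned["schema"])                         # ['person.name']
print(len(pruned["row_groups"][0]["columns"]))  # 1
```

Handing a footer rewritten like this to a reader means the reader never even sees the pruned column chunks, so no reader-side changes are needed.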

The main issue arises if/when we want cudf to read the file directly. In that case, cudf will need an API that lets us pass in some kind of read schema so it can skip the blocks that are not needed.

We will also need this for ORC at some point.
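A hedged sketch of what a "read schema" argument could mean for the reader: intersect the file's nested schema with the schema the query needs, so unneeded struct fields are never decoded. Schemas are modeled as nested dicts (structs) with leaf type names as strings; none of this reflects actual libcudf signatures.

```python
def prune_schema(file_schema, read_schema):
    """Keep only the parts of file_schema that read_schema asks for.
    A leaf in read_schema (anything that is not a dict) means
    'take this field with the file's type'."""
    if not isinstance(read_schema, dict):
        return file_schema
    return {
        name: prune_schema(file_schema[name], sub)
        for name, sub in read_schema.items()
        if name in file_schema
    }

file_schema = {
    "person": {"name": "string", "age": "int32",
               "address": {"street": "string", "zip": "string"}},
    "score": "float64",
}
# The query only touches person.name and person.address.zip.
read_schema = {"person": {"name": None, "address": {"zip": None}}}

print(prune_schema(file_schema, read_schema))
# {'person': {'name': 'string', 'address': {'zip': 'string'}}}
```

The same intersection would apply to ORC, since its schema is also a nested tree of struct fields.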

@jlowe Not sure if this is enough of an investigation, or if we need to file a follow-on issue for this?

jlowe (Contributor, Author) commented Sep 8, 2020

> The main issue with this would be if/when we want to have cudf read the file directly. In those cases we are going to need cudf to have an API that lets us pass in some kind of a read schema, so it can skip the blocks that are not needed.

I believe @nvdbaranec has been thinking about this and may be able to comment more on libcudf's plans for pruning struct schemas during parquet load.

> Not sure if this is enough of an investigation

I think we're good. For the short-term, we have the luxury of being able to manipulate the footer to reflect what we want to load.

jlowe closed this as completed Sep 8, 2020