Is your feature request related to a problem? Please describe.
Spark supports schema pruning (see #463), where the schema required by a query can prune the fields loaded from a struct, saving precious distributed filesystem I/O bandwidth and avoiding file format decode work on unnecessary data.
Describe the solution you'd like
We need to investigate how this will be exposed to the RAPIDS plugin and what, if any, extra features are required from libcudf to enable pruning of nested struct fields that are unused by the query schema.
From what I have read, we should be able to filter out blocks we don't want from being read when we rewrite the file into an in-memory buffer. We would also need to rewrite the footer metadata to exclude the columns we don't care about.
The main issue arises if/when we want cudf to read the file directly. In that case we will need cudf to provide an API that lets us pass in some kind of read schema, so it can skip the blocks that are not needed.
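To illustrate the read-schema idea being discussed (this is not a cudf API; the function and schema representation below are invented purely for illustration), pruning a nested struct schema down to the fields a query actually reads might look like:

```python
# Hypothetical sketch of read-schema pruning for nested structs.
# Schemas are modeled as dicts mapping field name -> nested dict
# (for struct fields) or None (for leaf columns). Not a cudf API.

def prune_schema(full, read):
    """Return the subset of `full` whose fields appear in `read`."""
    pruned = {}
    for name, subfields in read.items():
        if name not in full:
            continue  # field absent from the file; a reader would fill nulls
        if isinstance(full[name], dict) and isinstance(subfields, dict):
            # Recurse into nested struct fields.
            pruned[name] = prune_schema(full[name], subfields)
        else:
            pruned[name] = full[name]
    return pruned

# File schema: top-level column `id` plus a struct `addr` with three fields.
file_schema = {"id": None, "addr": {"street": None, "city": None, "zip": None}}
# The query only touches addr.city, so addr.street and addr.zip can be skipped.
read_schema = {"addr": {"city": None}}

print(prune_schema(file_schema, read_schema))  # → {'addr': {'city': None}}
```

In a real reader, the pruned schema would then drive which column chunks are fetched and which entries survive in the rewritten footer metadata.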
We also need this for ORC at some point.
@jlowe Not sure if this is enough of an investigation, or if we need to file a follow-on issue for this?
The main issue arises if/when we want cudf to read the file directly. In that case we will need cudf to provide an API that lets us pass in some kind of read schema, so it can skip the blocks that are not needed.
I believe @nvdbaranec has been thinking about this and may be able to comment more on libcudf's plans for pruning struct schemas during parquet load.
Not sure if this is enough of an investigation
I think we're good. For the short-term, we have the luxury of being able to manipulate the footer to reflect what we want to load.
cc: @nvdbaranec