-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prune Non-referenced Fields from Nested RowTypes #23074
Prune Non-referenced Fields from Nested RowTypes #23074
Conversation
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
This set of changes prunes nested RowTypes to only the fields that are actually dereferenced in the users' projections. The Parquet implementation already solves for this, but it works on it's own abstractions so it's not fit for use in the other Hive formats. I believe this approach could be adopted by the Parquet PageSource as well, thereby simplifying, but I don't want to bite that off now. I believe the approach will work for Avro as well, but the PageSource isn't plumbing the inferred reader schema down to the type resolver: it is just passing the selected columns from the writer schema as both reader and writer. I added a test that proves it works well for OpenXJson because it is simple to mock data for it and it supports position-based deserialization: a JSON Array into a Row.
cdb4cd9
to
ddb2f0a
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
This reverts commit b96dc9e.
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua |
I'm going to Close this for now. I believe there is value to a general and more complete schema pruning. Will revisit when there is either more general interest, or I have a more pressing need and/or data. |
This set of changes prunes nested RowTypes to only the fields that are
actually referenced in the users' projections.
The Parquet implementation already solves for this, but it works on
it's own abstractions so it's not fit for use in the other Hive
formats. I believe this approach could be adopted by the Parquet
PageSource as well, thereby simplifying, but I don't want to bite that
off now.
I believe the approach will work for Avro as well, but the PageSource
isn't plumbing the inferred reader schema down to the type resolver:
it is just passing the selected columns from the writer schema as both
reader and writer.
I added a test that proves it works well for OpenXJson because it
is simple to mock data for it and it supports position-based
deserialization: a JSON Array into a Row. That, along with the changes
to the SEQUENCE format, reflect that this approach should work for
any implementation's needs.
Description
I discovered this while starting an implementation for the Amazon Ion
format. I was curious about
projectBaseColumns
, what it did and howit did it. I was surprised to find that the complete structures were being
materialized. I wanted to fix it in a way that would also for Ion, while
learning about the relevant parts of the Trino codebase.
I found a clustering of prior issues and believe there have been some
related threads in the Trino Slack workspace.
I didn't try to tackle anything that touched the optimizer, just took as a
given what the Hive connectors get today. I realize that means that
pruning based on array position or involving functions is out of scope.
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: