Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prune Non-referenced Fields from Nested RowTypes #23074

Closed

Conversation

rmarrowstone
Copy link

@rmarrowstone rmarrowstone commented Aug 19, 2024

This set of changes prunes nested RowTypes to only the fields that are
actually referenced in the users' projections.

The Parquet implementation already solves for this, but it works on
it's own abstractions so it's not fit for use in the other Hive
formats. I believe this approach could be adopted by the Parquet
PageSource as well, thereby simplifying, but I don't want to bite that
off now.

I believe the approach will work for Avro as well, but the PageSource
isn't plumbing the inferred reader schema down to the type resolver:
it is just passing the selected columns from the writer schema as both
reader and writer.

I added a test that proves it works well for OpenXJson because it
is simple to mock data for it and it supports position-based
deserialization: a JSON Array into a Row. That, along with the changes
to the SEQUENCE format, reflect that this approach should work for
any implementation's needs.

Description

I discovered this while starting an implementation for the Amazon Ion
format. I was curious about projectBaseColumns, what it did and how
it did it. I was surprised to find that the complete structures were being
materialized. I wanted to fix it in a way that would also for Ion, while
learning about the relevant parts of the Trino codebase.

I found a clustering of prior issues and believe there have been some
related threads in the Trino Slack workspace.

I didn't try to tackle anything that touched the optimizer, just took as a
given what the Hive connectors get today. I realize that means that
pruning based on array position or involving functions is out of scope.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

Copy link

cla-bot bot commented Aug 19, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added the hive Hive connector label Aug 19, 2024
This set of changes prunes nested RowTypes to only the fields that are
actually dereferenced in the users' projections.

The Parquet implementation already solves for this, but it works on
it's own abstractions so it's not fit for use in the other Hive
formats. I believe this approach could be adopted by the Parquet
PageSource as well, thereby simplifying, but I don't want to bite that
off now.

I believe the approach will work for Avro as well, but the PageSource
isn't plumbing the inferred reader schema down to the type resolver:
it is just passing the selected columns from the writer schema as both
reader and writer.

I added a test that proves it works well for OpenXJson because it
is simple to mock data for it and it supports position-based
deserialization: a JSON Array into a Row.
@rmarrowstone rmarrowstone force-pushed the prune-nested-projections branch from cdb4cd9 to ddb2f0a Compare August 20, 2024 18:50
Copy link

cla-bot bot commented Aug 20, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Aug 20, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Aug 20, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@rmarrowstone rmarrowstone marked this pull request as ready for review August 20, 2024 23:16
Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Sep 11, 2024
@rmarrowstone
Copy link
Author

I'm going to Close this for now. I believe there is value to a general and more complete schema pruning. Will revisit when there is either more general interest, or I have a more pressing need and/or data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hive Hive connector stale
Development

Successfully merging this pull request may close these issues.

1 participant