Implement predicate push down for parquet dereference column in Iceberg #17133

Merged

Conversation

leetcode-1533
Contributor

@leetcode-1533 leetcode-1533 commented Apr 20, 2023

Description

From https://trino.io/blog/2020/08/14/dereference-pushdown.html: "Another future improvement will be the pushdown of predicates on subfields for data stored in Parquet format. Although the pruning of nested fields occurs with Parquet, the predicates are not yet pushed down into the reader."

This PR enables the Parquet page source to use statistics for nested fields in the Iceberg connector.
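As a rough illustration of what using statistics for nested fields means, the sketch below checks a predicate on a dereferenced field such as `col1Row.a` against per-row-group min/max and null-count statistics of the corresponding leaf column. All types and method names here (`LeafStats`, `RowGroup`, `canSkipForEquals`, `canSkipForIsNull`) are hypothetical and are not Trino or Parquet classes; the sketch also assumes min/max values are present whenever statistics exist.

```java
import java.util.Map;

// Hypothetical sketch: these types are illustrative, not Trino or Parquet classes.
final class NestedFieldPruningSketch
{
    // Statistics for one leaf column chunk; assumes min/max are present.
    record LeafStats(long min, long max, long nullCount) {}

    // Row group statistics keyed by dotted leaf path, for example "col1Row.a".
    record RowGroup(Map<String, LeafStats> statsByLeafPath) {}

    // Returns true when no row in the row group can satisfy "leafPath = value".
    static boolean canSkipForEquals(RowGroup rowGroup, String leafPath, long value)
    {
        LeafStats stats = rowGroup.statsByLeafPath().get(leafPath);
        if (stats == null) {
            // Without statistics the row group must be read.
            return false;
        }
        return value < stats.min() || value > stats.max();
    }

    // Returns true when no row in the row group can satisfy "leafPath IS NULL".
    static boolean canSkipForIsNull(RowGroup rowGroup, String leafPath)
    {
        LeafStats stats = rowGroup.statsByLeafPath().get(leafPath);
        return stats != null && stats.nullCount() == 0;
    }

    public static void main(String[] args)
    {
        RowGroup rowGroup = new RowGroup(Map.of("col1Row.a", new LeafStats(100, 200, 0)));
        System.out.println(canSkipForEquals(rowGroup, "col1Row.a", 21));
        System.out.println(canSkipForEquals(rowGroup, "col1Row.a", 150));
        System.out.println(canSkipForIsNull(rowGroup, "col1Row.a"));
    }
}
```

The key point is that the predicate is keyed by the leaf path of the nested field rather than by the top-level ROW column, so the reader can consult the leaf column's statistics directly.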

Additional context and related issues

Related ORC commit: 5069a55
Fixes #9928
Hive change PR: #15163

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Improve performance of queries with filters on fields in ROW type columns stored in Parquet files.

@cla-bot cla-bot bot added the cla-signed label Apr 20, 2023
@github-actions github-actions bot added the hive (Hive connector), iceberg (Iceberg connector), and tests:hive labels Apr 20, 2023
@leetcode-1533 leetcode-1533 force-pushed the iceberg_dereferenceparquet_2 branch from d296d47 to b9ae609 Compare April 28, 2023 01:43
@leetcode-1533 leetcode-1533 changed the title from "Iceberg dereferenceparquet 2" to "Implement predicate push down for parquet dereference column in Iceberg" Apr 28, 2023
@leetcode-1533 leetcode-1533 force-pushed the iceberg_dereferenceparquet_2 branch from b9ae609 to bf77929 Compare April 28, 2023 01:50
@leetcode-1533 leetcode-1533 marked this pull request as ready for review April 28, 2023 01:50
@leetcode-1533
Contributor Author

Fixing product test: "2023-04-28T05:52:04.9369644Z tests | 2023-04-28 11:37:04 INFO: FAILURE / io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testIdBasedFieldMapping [PARQUET, 2] (Groups: iceberg_jdbc, profile_specific_tests, iceberg, iceberg_rest) took 3.7 seconds"

@leetcode-1533 leetcode-1533 force-pushed the iceberg_dereferenceparquet_2 branch from bf77929 to 2323605 Compare April 30, 2023 06:52
@leetcode-1533
Contributor Author

I fixed the bug and resubmitted the PR for another check.

@leetcode-1533 leetcode-1533 force-pushed the iceberg_dereferenceparquet_2 branch 2 times, most recently from 1194880 to 234a076 Compare April 30, 2023 23:02
// 21 is a value within [2, 20000] but is an odd number, so it won't be discarded by the Iceberg table's statistics.
// At the same time, 21 is not within the bounds of any row group, so it can be discarded by Parquet's row group statistics.
assertNoDataRead("SELECT * FROM " + tableName + " WHERE col1Row.a = 21");
assertNoDataRead("SELECT * FROM " + tableName + " WHERE col1Row.a IS NULL");
Contributor

@findinpath findinpath May 2, 2023

assertNoDataRead is currently a bit misleading because it builds on processedInputDataSize and not physicalInputDataSize.

However, this change relieves the engine of additional computation once the data leaves the Parquet reader. Very good catch @leetcode-1533

No change requested in this PR
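To illustrate the reasoning in the test comment about the value 21, here is a minimal sketch of two-level pruning: coarse file-level bounds cannot rule the value out, while finer row group bounds can. The row group ranges below are hypothetical and chosen only to reproduce the situation the comment describes; `Range`, `enclosing`, and `rowGroupsToRead` are illustrative names, not Trino, Iceberg, or Parquet APIs.

```java
import java.util.List;

// Hypothetical sketch: illustrative types only, with made-up row group bounds.
final class TwoLevelPruningSketch
{
    // Inclusive min/max bounds for a single column.
    record Range(long min, long max)
    {
        boolean mayContain(long value)
        {
            return min <= value && value <= max;
        }
    }

    // File-level bounds modeled as the envelope of all row group bounds.
    static Range enclosing(List<Range> rowGroups)
    {
        long min = rowGroups.stream().mapToLong(Range::min).min().orElseThrow();
        long max = rowGroups.stream().mapToLong(Range::max).max().orElseThrow();
        return new Range(min, max);
    }

    // Counts the row groups whose bounds cannot exclude the value.
    static long rowGroupsToRead(List<Range> rowGroups, long value)
    {
        return rowGroups.stream().filter(range -> range.mayContain(value)).count();
    }

    public static void main(String[] args)
    {
        // Assumed row group bounds for col1Row.a; 21 falls in the gap between them.
        List<Range> rowGroups = List.of(new Range(2, 20), new Range(22, 20000));
        Range fileBounds = enclosing(rowGroups);

        System.out.println("File-level bounds keep the file: " + fileBounds.mayContain(21));
        System.out.println("Row groups to read for 21: " + rowGroupsToRead(rowGroups, 21));
    }
}
```

Under these assumed bounds the file survives table-level pruning, but every row group is skipped, which is the case the `col1Row.a = 21` assertion is meant to exercise.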

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Jan 16, 2024
@mosabua
Member

mosabua commented Jan 16, 2024

👋 @leetcode-1533 @findinpath @findepi - this PR has become inactive. If you're still interested in working on it, please let us know.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

// predicate domain
IcebergColumnHandle projectedColumn = new IcebergColumnHandle(
new ColumnIdentity(
5,
Contributor

Why 5?

@raunaqmorarka raunaqmorarka force-pushed the iceberg_dereferenceparquet_2 branch from 77e5049 to 38c58a3 Compare January 18, 2024 12:24
@raunaqmorarka raunaqmorarka merged commit 3a67a0a into trinodb:master Jan 18, 2024
48 checks passed
@github-actions github-actions bot added this to the 437 milestone Jan 18, 2024
@findinpath
Contributor

Thank you @leetcode-1533 for this contribution.

@mosabua
Member

mosabua commented Jan 18, 2024

Thanks @raunaqmorarka and @findinpath for finishing it up

Labels
cla-signed, hive (Hive connector), iceberg (Iceberg connector), stale
Development

Successfully merging this pull request may close these issues.

Predicate pushdown for nested fields in Parquet reader
4 participants