Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement predicate push down for parquet dereference column #15163

Merged
merged 5 commits into from
Apr 25, 2023

Conversation

leetcode-1533
Copy link
Contributor

@leetcode-1533 leetcode-1533 commented Nov 23, 2022

Description

From https://trino.io/blog/2020/08/14/dereference-pushdown.html: "Another future improvement will be the pushdown of predicates on subfields for data stored in Parquet format. Although the pruning of nested fields occurs with Parquet, the predicates are not yet pushed down into the reader."

This PR enables Parquet page source to use statistics for nested fields.

Additional context and related issues

Related ORC commit: 5069a55
#9928

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Hive
* Improve performance of queries with selective filters on primitive fields in ROW type columns. ({issue}`15163`)

@cla-bot cla-bot bot added the cla-signed label Nov 23, 2022
@leetcode-1533 leetcode-1533 marked this pull request as ready for review November 23, 2022 08:53
@leetcode-1533
Copy link
Contributor Author

@raunaqmorarka and @phd3 Hi, can you help to take a look?

@leetcode-1533
Copy link
Contributor Author

I had waited 10 days before getting code reviewed..

@leetcode-1533
Copy link
Contributor Author

hi, the commit message has been reorganized, please take a further look.

@findepi
Copy link
Member

findepi commented Dec 13, 2022

@phd3 does this add nested field predicate pushdown for Iceberg (potentially answering #8759 a better way)?

cc @alexjo2144

@raunaqmorarka
Copy link
Member

@leetcode-1533 PTAL at #15388

@phd3
Copy link
Member

phd3 commented Dec 15, 2022

@findepi this makes use of unenforced predicates on "subfields" which are still primitive - on the workers for file level pruning. IIUC #8759 is about struct/map/list columns which seems orthogonal.

@findepi
Copy link
Member

findepi commented Dec 19, 2022

IIUC #8759 is about struct/map/list columns which seems orthogonal.

orthogonal -- in some sense, yes
maybe we don't need #8759 at all; we surely need this one

@raunaqmorarka
Copy link
Member

@leetcode-1533 can you please rebase this on latest master

@leetcode-1533
Copy link
Contributor Author

hi, I re-based to * [2023-02-09] [11942a4f91] | Support REFRESH MATERIALIZED VIEW when storage differs in schema {{Piotr Findeisen}} (origin/master, origin/HEAD)

And have changed the unit test in #15388

@github-actions github-actions bot added the hive Hive connector label Mar 18, 2023
@leetcode-1533
Copy link
Contributor Author

hi, it has been rebased with the latest master branch

@leetcode-1533
Copy link
Contributor Author

For dereferenceSubFieldTypes(), I am actually leaning toward @alexjo2144's suggestion, i.e. respecting useColumnName option also when dereferencing subtypes(though it is not a 'column' name, but rather a subtype's name) Would like know community's opinion on this.

@leetcode-1533
Copy link
Contributor Author

Current status:
When dereferencing top column name, respect “useColumnName” option(when calling getBaseColumnParquetType())
When dereferencing nested field name, only use columnName, implemented at dereferenceSubFieldTypes().

After chatting with @raunaqmorarka , I did not implement: “respect “useColumnName” option when dereferencing nested columns”, since we agreed that: "this change is not necessary for the nested columns predicate pushdown, Unless it is a pre-requisite for your change, I would prefer to avoid changing that”.

dereferenceSubFieldTypes() is extracted from getColumnType() method currently nothing has changed. My change just built on top of the dereferenceSubFieldTypes().

To highlight that the variable is after dereferencing.
@leetcode-1533 leetcode-1533 force-pushed the dereferenceparquet branch 2 times, most recently from f669a2a to dabc54d Compare April 21, 2023 18:34
Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % minor comment

@leetcode-1533 leetcode-1533 force-pushed the dereferenceparquet branch 2 times, most recently from 350dc56 to ba5bffe Compare April 24, 2023 18:52
@leetcode-1533
Copy link
Contributor Author

hello, can you help to merge in this.. So I could rebase the iceberg related changes..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants