Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply the dereference pushdown at the physical level on parquet in Iceberg #17387

Conversation

findinpath
Copy link
Contributor

@findinpath findinpath commented May 8, 2023

Description

Fixes Parquet dereference pushdown on the physical level on #17156

ORC still open

Implementation overview for dereference pushdown at physical level in Parquet

PoC is the implementation for Hive

public static Optional<MessageType> getParquetMessageType(List<HiveColumnHandle> columns, boolean useColumnNames, MessageType fileSchema)
{
Optional<MessageType> message = projectSufficientColumns(columns)
.map(projection -> projection.get().stream()
.map(HiveColumnHandle.class::cast)
.collect(toUnmodifiableList()))
.orElse(columns).stream()
.filter(column -> column.getColumnType() == REGULAR)
.map(column -> getColumnType(column, fileSchema, useColumnNames))
.filter(Optional::isPresent)
.map(Optional::get)
.map(type -> new MessageType(fileSchema.getName(), type))
.reduce(MessageType::union);
return message;
}

What happens in the code:

  • only sufficient columns (if doing SELECT parent.child, child only parent field will actually be selected) will be selected
  • for each column, only the necessary information will be selected from the file schema sent to parquet reader

If nested.nested1level1.field4 gets selected and the column type hierarchy looks like this:

    nested
            field 1
            field 2
            nested1level1
                      field3
                      field4
            nested2level1
                      field5
                      field6 

for creating the paquet schema, there will be used the parquet type having the following hierarchy:

    nested
            nested1level1
                      field4
  • the types are all unioned together to create a parquet Message that corresponds to only to the nested columns which are actually selected from the nested rows.

Additional context and related issues

Corresponding tests which check also that the dereference pushdown is effective on the storage layer exist already in BaseConnectorTest. See 6a4e483e#diff-6be05909e810c0224c1951c4102cad6256e9e088feb91e4c260756bfde89b6d5

Lookup for usage of

SUPPORTS_DEREFERENCE_PUSHDOWN(SUPPORTS_ROW_TYPE),

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Improve performance of reading ROW types from parquet files. ({issue}`17387`)

@cla-bot cla-bot bot added the cla-signed label May 8, 2023
@github-actions github-actions bot added the iceberg Iceberg connector label May 8, 2023
@findinpath findinpath changed the title Apply the dereference pushdown at the physical level in Iceberg Apply the dereference pushdown at the physical level on parquet in Iceberg May 9, 2023
@findinpath findinpath force-pushed the findinpath/iceberg-dereference-physical-pushdown branch 2 times, most recently from 1dccc86 to 616281e Compare May 9, 2023 05:24
@findinpath findinpath force-pushed the findinpath/iceberg-dereference-physical-pushdown branch from 616281e to 6da2aaa Compare May 9, 2023 05:32
@ebyhr
Copy link
Member

ebyhr commented May 9, 2023

What's the difference from #17133?

@findinpath findinpath marked this pull request as ready for review May 9, 2023 09:13
@findinpath
Copy link
Contributor Author

findinpath commented May 9, 2023

What's the difference from #17133?

@ebyhr this PR provides a way to dramatically reduce the amount of Parquet data read from the physical storage while selecting sub-fields from nested fields. See #17145 for details.

#17133 is another very useful improvement for being able to push down filters on Parquet so that entire row group get skipped for reading. cc @leetcode-1533

@krvikash krvikash self-requested a review May 10, 2023 15:14
@findinpath
Copy link
Contributor Author

Rebasing on master to ensure that the changes still work as expected.

@findinpath findinpath force-pushed the findinpath/iceberg-dereference-physical-pushdown branch from 6da2aaa to 163369d Compare June 19, 2023 16:22
@findinpath findinpath force-pushed the findinpath/iceberg-dereference-physical-pushdown branch 2 times, most recently from 255d46c to 20b6616 Compare June 30, 2023 13:11
@findinpath findinpath force-pushed the findinpath/iceberg-dereference-physical-pushdown branch from 20b6616 to a33650c Compare June 30, 2023 13:29
@findinpath findinpath force-pushed the findinpath/iceberg-dereference-physical-pushdown branch from a33650c to 61d8472 Compare July 1, 2023 05:08
Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % minor comments

@findinpath findinpath force-pushed the findinpath/iceberg-dereference-physical-pushdown branch from 61d8472 to b5bd8d5 Compare July 7, 2023 21:04
@findinpath findinpath requested a review from raunaqmorarka July 7, 2023 21:12
@findinpath findinpath force-pushed the findinpath/iceberg-dereference-physical-pushdown branch from b5bd8d5 to 8bc9215 Compare July 7, 2023 21:42
@findinpath findinpath force-pushed the findinpath/iceberg-dereference-physical-pushdown branch from 8bc9215 to e134272 Compare July 7, 2023 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants