Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exploit iceberg row-count metadata #74

Merged
merged 6 commits into from
Oct 15, 2024

Conversation

peterboncz
Copy link

@peterboncz peterboncz commented Oct 9, 2024

Examine the row-counts in the manifest and count how many rows are in the (existing or added) data resp. deletion files.

Use these two counts to pass the new named_parameter "explicit_cardinality" into the generated parquet_scans.

Note that because iceberg is passing in an explicit schema into the parquet_scan, it does not open any file during bind. That means the parquet_scans it generates did not have any cardinality information during query optimization.

That can cause rather bad query plans. This PR (pb/explicit-iceberg-cardinality) fixes that. Note that this PR needs the DuckDB PR pb/explicit-parquet-cardinality that adds the "explicit_cardinality" named_parameter to parquet_scan.

peter added 2 commits October 9, 2024 14:09
Examine the row-counts in the manifest and count how many rows are in the (existing or added) data an deletion files.

Use these two counts to pass the new named_parameter "explicit_cardinality" into the generated parquet_scans.

Note that because iceberg is passing in an explicit schema into the parquet_scan, it does not open any file during
bind. That means the parquet_scans it generates did not have any cardinality information during query optimization.

That can cause rather bad query plans. This PR (pb/explicit-iceberg-cardinality) fixes that. Note that this PR needs
the DuckDB PR pb/exicit-parquet-cardinality that adds the "explicit_cardinality" named_parameter to parquet_scan.
@peterboncz
Copy link
Author

This CI is expected to fail (unknown named_parameter 'explicit_cardinality') until duckdb/duckdb#14292 would have been merged

@peterboncz peterboncz marked this pull request as draft October 11, 2024 17:59
@peterboncz peterboncz marked this pull request as ready for review October 11, 2024 17:59
@peterboncz peterboncz marked this pull request as draft October 11, 2024 18:00
@peterboncz peterboncz marked this pull request as ready for review October 11, 2024 18:00
@peterboncz peterboncz marked this pull request as draft October 14, 2024 12:29
@peterboncz peterboncz marked this pull request as ready for review October 14, 2024 12:29
peter added 3 commits October 14, 2024 14:30
- add test that checks that there are cardinalities in the generated parquet_scans
Copy link
Collaborator

@samansmink samansmink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

@samansmink samansmink merged commit d62d91d into duckdb:main Oct 15, 2024
16 checks passed
mike-luabase pushed a commit to definite-app/duckdb_iceberg that referenced this pull request Oct 27, 2024
…rdinality

exploit iceberg row-count metadata
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants