Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add config to reject Iceberg queries without partition pruning #20118

Closed

Conversation

okayhooni
Copy link
Contributor

@okayhooni okayhooni commented Dec 14, 2023

Description

We adopt the new catalog option iceberg.query-partition-filter-required introduced by #17263 (since v430)
But, we realized this option cannot prevent all the full scan queries with some edge cases.

We migrated some tables from Hive format to Iceberg format, and Trino users in our company usually submit queries to those new Iceberg format tables, just simply ported with cast(date(log_ts) as varchar) from their existing queries to legacy Hive format table (w/ string-typed date partition field, ex: log_dt="2023-12-14"), like below.

select *
from iceberg.very_big_size_table_containing_many_years_data_migrated_from_hive_format
where cast(date(log_ts) as varchar) = '2023-12-14' 
limit 1000 

Although the queries like above end up with trying full-scan on the table, but those queries passed the validation-checks of iceberg.query-partition-filter-required=true. and I found that validation logic allow the case of partitioning field in just constraint columns of query plans.

How about adding more strict constraint option iceberg.query-partition-pruning-required to prevent those edge cases and ensure partition-pruning on the query plan..?

I tried some queries to test this new option working well, but honestly.. I cannot make sure, there is no side-effect.

@zhangminglei , Could you review this minor update to your nice contribution..?

This looks like just a reverse issue case of #12925, fixed with #13567 by @findepi

If there is more fancy way to cover these edge cases, please fix it.. or tell me that alternative solution..!

Related PR

Copy link

cla-bot bot commented Dec 14, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added docs iceberg Iceberg connector labels Dec 14, 2023
Copy link

cla-bot bot commented Dec 15, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@okayhooni
Copy link
Contributor Author

okayhooni commented Dec 15, 2023

Previous commit cannot prevent all the issue queries like below.

  • example table's partition structure: (DATE(log_ts), event_type)
  • we intended to force our Trino users to submit queries to this table with log_ts partition filter, due to heavy query loads.
select *
from iceberg.very_big_size_table_containing_many_years_data_migrated_from_hive_format
where cast(date(log_ts) as varchar) = '2023-12-14' and event_type = 'Click'
limit 1000 

So, I added other options to force filtering on the specific partition fields.

  • Trino catalog level: iceberg.query-partition-filter-required-common-fields property
  • Each table level(w/ TBLPROPERTY): trino.query-partition-filter-required.fields property

Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@mosabua
Copy link
Member

mosabua commented Jan 11, 2024

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Jan 11, 2024
Copy link

cla-bot bot commented Jan 11, 2024

The cla-bot has been summoned, and re-checked this pull request!

@mosabua
Copy link
Member

mosabua commented Jan 11, 2024

@electrum @martint @dain and @findepi .. please chime in here. It seems like this config options to narrow down what queries are allowed is spreading .. this PR adds more again.

@github-actions github-actions bot removed the stale label Jan 11, 2024
Copy link

github-actions bot commented Feb 2, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Feb 2, 2024
@mosabua
Copy link
Member

mosabua commented Feb 2, 2024

@raunaqmorarka you merged a related PR .. could you take a look .. also do we really want to go down this approach of more and more policies for query shape?

@raunaqmorarka
Copy link
Member

For the provided example of

select *
from iceberg.very_big_size_table_containing_many_years_data_migrated_from_hive_format
where cast(date(log_ts) as varchar) = '2023-12-14' 
limit 1000 

Partition pruning should take place in iceberg using the code at


as long as log_ts is a partitioning column.
Can you verify that this code is not performing pruning in your case and check why it's not working ?

@github-actions github-actions bot removed the stale label Feb 5, 2024
@okayhooni
Copy link
Contributor Author

okayhooni commented Feb 6, 2024

@mosabua @raunaqmorarka
Thank you for reviewing this trivial PR..!

We currently deploy Trino based on version 433, and those custom codes have worked well in our production Trino clusters, as expected.
(But, I guess it can be handled more gracefully with updated logic on the latest version of Trino.)

On the vanilla version 433, when the queries like below were submitted, the log_ts partition column is categorized just to table.getConstraintColumns() after query-planning, and not used to partition pruning.

select *
from iceberg.very_big_size_table_containing_many_years_data_migrated_from_hive_format
where cast(date(log_ts) as varchar) = '2023-12-14' 
limit 1000 

I have observed partition pruning is utilized only if log_ts is on the enforcedPredicate or unenforcedPredicate of query plan, after doing experiments with several test queries.

Honestly, I didn't check the root cause, why it's not working, but it looks like just a reverse issue case of #12925, fixed with #13567 by @findepi

Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Feb 28, 2024
Copy link

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants