Optional check for query partition filter for Hudi #19906

krvikash · 2023-11-27T10:11:10Z

Description

For Hudi partitioned tables, we should reject the table scan produced by the planner when the query does not filter on a partition field.

Add an option to enforce that a filter on a partition key be present in the query. This can be enabled by setting the
hudi.query-partition-filter-required config property or the query_partition_filter_required session property
to true.
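As a rough illustration of the check described above, here is a minimal sketch in plain Java. It is not the code from this PR: the class name `PartitionFilterCheckSketch`, the method `validate`, the use of plain strings for columns, and the error message are all assumptions made for the example.

```java
import java.util.Set;

// Hypothetical, simplified model of the check; not the actual Trino code.
final class PartitionFilterCheckSketch {
    private PartitionFilterCheckSketch() {}

    /**
     * Throws if the check is enabled, the table is partitioned, and none of the
     * columns constrained by the query is a partition column.
     */
    static void validate(boolean filterRequired, Set<String> partitionColumns, Set<String> constraintColumns) {
        if (!filterRequired || partitionColumns.isEmpty()) {
            return; // check disabled, or the table is not partitioned
        }
        boolean hasPartitionFilter = constraintColumns.stream().anyMatch(partitionColumns::contains);
        if (!hasPartitionFilter) {
            // The message text is an assumption made for this sketch.
            throw new IllegalStateException("Filter required on at least one partition column: " + partitionColumns);
        }
    }

    public static void main(String[] args) {
        // Accepted: the query constrains the partition column "dt".
        validate(true, Set.of("dt"), Set.of("id", "dt"));
        try {
            // Rejected: only the non-partition column "id" is constrained.
            validate(true, Set.of("dt"), Set.of("id"));
        }
        catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

With the property left at `false`, the same call returns without checking anything, so existing queries keep working.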

Additional context and related issues

Implementation for the Delta Lake connector #18345
cc @marcinsbd

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hudi
* Add option to enforce that a filter on a partition key be present in the query. This can be enabled by setting the 
  ``hudi.query-partition-filter-required`` config property or the ``query_partition_filter_required`` session property 
  to ``true``. ({issue}`issuenumber`)

codope · 2023-11-27T10:53:25Z

docs/src/main/sphinx/connector/hudi.md

+  - Set to `true` to force a query to use a partition filter. You can use the
+    `query_partition_filter_required` catalog session property for temporary,
+    catalog specific use.
+  - `false`


Thanks for the contribution, @krvikash. I have yet to review the full code, but just for my understanding: why should the default be false? Is the plan produced an unoptimized one if the query uses a partition filter? It would be helpful if you could paste the plans for a simple query with a partition filter, with and without this config.

See https://trino.io/docs/current/connector/delta-lake.html where delta.query-partition-filter-required is set to false by default as well.

Hi @codope, this PR aims to mandate the inclusion of a partition filter in SELECT queries. This prevents accidental execution of SELECT * queries on tables containing a substantial amount of data. Notably, this enforcement won't alter the query plan.

Thanks for the clarification. I am still not sure why the default is false if it's a good thing?

We prefer not to enforce this unless explicitly instructed to do so, following the approach of Hive, Delta, and Iceberg.

But enabling the delta.query-partition-filter-required property by default would fail some queries that do not meet the partition column requirement. That would be unexpected for existing users.

Exactly. This config avoids running potentially long-running queries (on larger tables), but enabling it by default might restrict the set of queries that can be executed on Trino for smaller tables as well.

@codope does this answer your question?

Yes, thanks @krvikash. I think we should document the full behavior, if possible with an example. Users should be aware of the side effects of enabling this config.

@codope Thanks for the review. Added the behavior to the docs. Please take a look.

The other part is that, depending on the expressions used in the predicate, we may or may not be able to understand that the partitions are actually filtered, leading to false positives too.

krvikash · 2023-11-27T13:28:33Z

(some cosmetic change)

plugin/trino-hudi/src/test/java/io/trino/plugin/hudi/TestHudiSmokeTest.java

plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiMetadata.java

plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiTableHandle.java

Praveen2112 · 2023-11-28T08:36:44Z

plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiMetadata.java

+                        // When there is some predicate which could not be translated into tuple domain,
+                        // such as in cases with complex conditions like 'id = 1 OR name = 'INDIA'.
+                        constraint.getPredicateColumns().stream()
+                                .flatMap(Collection::stream)
+                                .map(HiveColumnHandle.class::cast))


Does Hudi use Constraint#predicate? If a query has a complex expression like id = 1 OR name = 'INDIA', the connector could access the expression via Constraint#predicate, i.e. the connector could pass in the partition values and check the result of the expression, so that it could either pick or reject a partition. I don't see any usages of Constraint#predicate here. If the predicate is not used, then let's not add those columns to constraintColumns, as they do not participate in the partition pushdown. WDYT?

cc: @codope I'm not a Hudi expert. Please correct me if I missed something.
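The idea of passing partition values to the predicate so that each partition is either picked or rejected can be sketched as follows. This is only an illustration under stated assumptions: `PartitionPruningSketch`, `prune`, and the representation of a partition as a `Map<String, String>` are hypothetical and do not correspond to the Trino SPI or to the Hudi connector code.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical model: a partition is a map from partition column name to value.
final class PartitionPruningSketch {
    private PartitionPruningSketch() {}

    /** Keeps only the partitions whose values satisfy the given predicate. */
    static List<Map<String, String>> prune(
            List<Map<String, String>> partitions,
            Predicate<Map<String, String>> predicate) {
        return partitions.stream()
                .filter(predicate)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, String>> partitions = List.of(
                Map.of("dt", "2021-12-09"),
                Map.of("dt", "2021-12-10"));

        // A predicate over the partition column only, evaluated per partition.
        Predicate<Map<String, String>> predicate =
                p -> p.get("dt").equals("2021-12-09") || p.get("dt").equals("2021-12-11");

        System.out.println(prune(partitions, predicate));
    }
}
```

A predicate such as id = 1 OR name = 'INDIA' also refers to non-partition columns, so it cannot be decided from partition values alone; the sketch covers only the case where the expression depends on partition columns.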

I compared the behavior of Hive and Hudi when query_partition_filter_required is enabled. A partition column used in the query satisfies the query_partition_filter_required requirement even when the column doesn't contribute to the pushdown.

Hive:

id --> Non-Part column
ds --> Part column
Query --> "EXPLAIN SELECT * FROM test_required_partition_filter WHERE id = 1 OR ds ='INDIA'"

MaterializedResult{rows=[[Trino version: testversion
Fragment 0 [SOURCE]
    Output layout: [id, a, b, ds]
    Output partitioning: SINGLE []
    Output[columnNames = [id, a, b, ds]]
    │   Layout: [id:integer, a:varchar, b:varchar, ds:varchar]
    │   Estimates: {rows: 1 (22B), cpu: 0, memory: 0B, network: 0B}
    └─ ScanFilter[table = hive:tpch:test_required_partition_filter, filterPredicate = (("id" = 1) OR ("ds" = VARCHAR 'INDIA'))]
           Layout: [id:integer, a:varchar, b:varchar, ds:varchar]
           Estimates: {rows: 1 (22B), cpu: 22, memory: 0B, network: 0B}/{rows: 1 (22B), cpu: 22, memory: 0B, network: 0B}
           a := a:string:REGULAR
           b := b:string:REGULAR
           id := id:int:REGULAR
           ds := ds:string:PARTITION_KEY
               :: [[1]]
]], types=[varchar(768)], setSessionProperties={}, resetSessionProperties=[]}

Hudi:

id --> Non-Part column
dt --> Part column
Query --> "EXPLAIN SELECT name FROM " + HUDI_COW_PT_TBL + " WHERE id = 1 OR dt = '2021-12-09'"

MaterializedResult{rows=[[Trino version: testversion
Fragment 0 [SOURCE]
    Output layout: [name]
    Output partitioning: SINGLE []
    Output[columnNames = [name]]
    │   Layout: [name:varchar]
    │   Estimates: {rows: ? (?), cpu: 0, memory: 0B, network: 0B}
    └─ ScanFilterProject[table = hudi:tests.hudi_cow_pt_tbl, filterPredicate = (("id" = BIGINT '1') OR ("dt" = VARCHAR '2021-12-09'))]
           Layout: [name:varchar]
           Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
           dt := dt:string:PARTITION_KEY
           name := name:string:REGULAR
           id := id:bigint:REGULAR
]], types=[varchar(686)], setSessionProperties={}, resetSessionProperties=[]}

@Praveen2112 Yes, you were right: Constraint#predicate is not utilized during split generation, which means all the data files will be read in cases where the predicate does not get translated to a TupleDomain. In this case Constraint#predicateColumns does not need to be part of constraintColumns.
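To make the difference concrete, here is a small sketch of the two ways of collecting constraint columns for a query like id = 1 OR dt = '2021-12-09'. It is a simplified model, not the HudiMetadata code: `ConstraintColumnsSketch`, `collect`, the `includePredicateColumns` flag, and the use of strings in place of column handles are assumptions made for the example. The premise that such an OR expression leaves no per-column domains is taken from the discussion above.

```java
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical, simplified model; plain strings stand in for column handles.
final class ConstraintColumnsSketch {
    private ConstraintColumnsSketch() {}

    /**
     * Collects the columns that count as "filtered" for the partition filter check.
     * domainColumns: columns with a domain in the translated tuple domain.
     * predicateColumns: columns referenced by the untranslated predicate, if any.
     */
    static Set<String> collect(
            Set<String> domainColumns,
            Optional<Set<String>> predicateColumns,
            boolean includePredicateColumns) {
        Stream<String> fromPredicate = includePredicateColumns
                ? predicateColumns.stream().flatMap(Set::stream)
                : Stream.empty();
        return Stream.concat(domainColumns.stream(), fromPredicate)
                .collect(Collectors.toUnmodifiableSet());
    }

    public static void main(String[] args) {
        // id = 1 OR dt = '2021-12-09': nothing is translated into per-column domains.
        Set<String> domainColumns = Set.of();
        Optional<Set<String>> predicateColumns = Optional.of(Set.of("id", "dt"));

        // Including predicate columns makes "dt" look filtered although no files are pruned.
        System.out.println(collect(domainColumns, predicateColumns, true).contains("dt"));
        // Excluding them leaves no constraint columns, so the check would reject the query.
        System.out.println(collect(domainColumns, predicateColumns, false).contains("dt"));
    }
}
```

Under these assumptions, excluding the predicate columns keeps the check aligned with what split generation can actually prune.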

krvikash · 2023-12-06T09:02:07Z

(rebased and resolved conflicts)

The change is to make the schema sync with hudi_cow_pt_tbl.

codope

@krvikash Thanks for addressing other comments. Code changes look good. Left two more comments for clarification.
cc @yihua

codope · 2023-12-14T04:09:22Z

docs/src/main/sphinx/connector/hudi.md

+  - Set to `true` to force a query to use a partition filter. You can use the
+    `query_partition_filter_required` catalog session property for temporary,
+    catalog specific use.
+  - `false`


Thanks for the clarification. I am still not sure why the default is false if it's a good thing?

codope · 2023-12-14T04:20:13Z

plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiMetadata.java

+
+        // TODO Since the constraint#predicate isn't utilized during split generation. So,
+        //  Let's not add constraint#predicateColumns to newConstraintColumns.
+        Set<HiveColumnHandle> newConstraintColumns = Stream.concat(


Regarding

Constraint#predicate is not getting utilized during split generation that means all the data files will be read for such cases where predicate does not get translated to TupleDomain.

It could have been a miss in the first implementation, probably due to lack of column stats index. What do we lose if we include Constraint#predicate in newConstraintColumns? I think we should still include it in the metadata layer. Using it in the split manager layer can be fixed in another PR.

What do we lose if we include Constraint#predicate in newConstraintColumns?

Thanks @codope for the review. Since we do not filter out data files based on Constraint#predicate during split generation, all the data files would still be read even when query-partition-filter-required is enforced, so the check would pass as a false alarm.

I have added a TODO comment in the HudiMetadata class. Once Hudi uses Constraint#predicate, we can include Constraint#predicateColumns in newConstraintColumns.

Praveen2112

LGTM. @mosabua Can you please look at the docs-related change?

Praveen2112 · 2023-12-19T09:22:40Z

docs/src/main/sphinx/connector/hudi.md

+  - Set to `true` to force a query to use a partition filter. You can use the
+    `query_partition_filter_required` catalog session property for temporary,
+    catalog specific use.
+  - `false`


But enabling the delta.query-partition-filter-required property by default would fail some queries that do not meet the partition column requirement. That would be unexpected for existing users.

Exactly. This config avoids running potentially long-running queries (on larger tables), but enabling it by default might restrict the set of queries that can be executed on Trino for smaller tables as well.

docs/src/main/sphinx/connector/hudi.md

krvikash · 2023-12-21T04:02:56Z

Thanks, @codope, @mosabua, @Praveen2112, @marcinsbd for the review. Addressed comments.

mosabua

I approve of the docs changes, but overall I feel like this is a brittle feature. It should definitely be disabled by default and not relied upon too much when enabled.

Praveen2112 · 2023-12-22T05:48:44Z

I agree with you. In the future we could move it to resource groups. This check allows us to restrict long-running queries beforehand.

cla-bot bot added the cla-signed label Nov 27, 2023

github-actions bot added the docs and hudi (Hudi connector) labels Nov 27, 2023

krvikash requested review from Praveen2112, codope, findinpath and marcinsbd November 27, 2023 10:14

krvikash force-pushed the krvikash/hudi-optional-check-for-query-partition-filter branch from cca8320 to 54b391a Compare November 27, 2023 10:36

codope reviewed Nov 27, 2023

krvikash force-pushed the krvikash/hudi-optional-check-for-query-partition-filter branch from 54b391a to e67badc Compare November 27, 2023 13:28

Praveen2112 reviewed Nov 28, 2023

krvikash force-pushed the krvikash/hudi-optional-check-for-query-partition-filter branch from e67badc to 5430982 Compare November 28, 2023 08:35

Praveen2112 reviewed Nov 28, 2023

krvikash force-pushed the krvikash/hudi-optional-check-for-query-partition-filter branch from 5430982 to b1aad68 Compare November 28, 2023 08:42

krvikash force-pushed the krvikash/hudi-optional-check-for-query-partition-filter branch from b1aad68 to 7bd0d25 Compare December 6, 2023 09:01

Update hudi test resource for hudi_non_part_cow table

7ccf278

The change is to make the schema sync with hudi_cow_pt_tbl.

krvikash force-pushed the krvikash/hudi-optional-check-for-query-partition-filter branch from 7bd0d25 to 0db58b7 Compare December 13, 2023 17:00

codope reviewed Dec 14, 2023

marcinsbd approved these changes Dec 14, 2023

Praveen2112 approved these changes Dec 19, 2023

marcinsbd approved these changes Dec 19, 2023

mosabua requested changes Dec 19, 2023

krvikash force-pushed the krvikash/hudi-optional-check-for-query-partition-filter branch 2 times, most recently from 88162f4 to 76d000d Compare December 20, 2023 07:27

krvikash requested review from mosabua and codope December 20, 2023 07:30

mosabua reviewed Dec 20, 2023

mosabua reviewed Dec 20, 2023

codope approved these changes Dec 21, 2023

krvikash force-pushed the krvikash/hudi-optional-check-for-query-partition-filter branch from 76d000d to 686939f Compare December 21, 2023 04:01

Optional check for query partition filter for Hudi

fe2ef1b

krvikash force-pushed the krvikash/hudi-optional-check-for-query-partition-filter branch from 686939f to fe2ef1b Compare December 21, 2023 04:51

mosabua approved these changes Dec 21, 2023

Praveen2112 merged commit 1afaa52 into trinodb:master Dec 22, 2023
21 checks passed

github-actions bot added this to the 436 milestone Dec 22, 2023

krvikash deleted the krvikash/hudi-optional-check-for-query-partition-filter branch December 22, 2023 05:53

colebow mentioned this pull request Jan 9, 2024

Add Trino 436 release notes #20166

Merged
