Add partitioning push down #23432
Conversation
@@ -99,7 +99,8 @@ public Optional<ConnectorBucketNodeMap> getBucketNodeMapping(ConnectorTransactio
public ToIntFunction<ConnectorSplit> getSplitBucketFunction(
        ConnectorTransactionHandle transactionHandle,
        ConnectorSession session,
        ConnectorPartitioningHandle partitioningHandle)
        ConnectorPartitioningHandle partitioningHandle,
        int bucketCount)
{
    return value -> ((HiveSplit) value).getReadBucketNumber()
Shouldn't this validate that bucketCount matches the table bucketing?
No, the ConnectorPartitioningHandle represents a logical partitioning, and is not specific to any table. This is a common misconception. Additionally, Hive supports mismatched bucket execution if the counts differ by a power of two.
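To make the power-of-two point concrete, here is a minimal, hypothetical sketch (not Trino code), assuming Hive's usual bucket = hash % bucketCount assignment; the class and method names are made up for illustration:

```java
// Hypothetical sketch (not Trino code): Hive assigns a row to bucket = hash % bucketCount.
// When two bucket counts differ by a power of two, the smaller count divides the larger one,
// so the coarser bucket is fully determined by the finer bucket and the two layouts stay compatible.
public final class MismatchedBucketSketch
{
    private MismatchedBucketSketch() {}

    static int bucketForHash(int hash, int bucketCount)
    {
        return Math.floorMod(hash, bucketCount);
    }

    public static void main(String[] args)
    {
        int largeCount = 32;
        int smallCount = 8; // differs from 32 by a power of two, so 8 divides 32

        for (int hash = 0; hash < 1000; hash++) {
            int largeBucket = bucketForHash(hash, largeCount);
            int smallBucket = bucketForHash(hash, smallCount);
            // hash % 8 == (hash % 32) % 8 because 8 divides 32
            if (smallBucket != Math.floorMod(largeBucket, smallCount)) {
                throw new IllegalStateException("buckets do not line up");
            }
        }
        System.out.println("bucket assignments line up when counts differ by a power of two");
    }
}
```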
Maybe we can add some of the above to the code comment in ConnectorPartitioningHandle?
@@ -162,35 +164,40 @@ private NodePartitionMap getNodePartitioningMap(
    requireNonNull(partitioningHandle, "partitioningHandle is null");

    if (partitioningHandle.getConnectorHandle() instanceof SystemPartitioningHandle) {
        return systemNodePartitionMap(session, partitioningHandle, systemPartitioningCache, partitionCount);
        return new NodePartitionMap(systemBucketToNode(session, partitioningHandle, systemPartitioningCache, partitionCount), _ -> {
nit: getNodePartitioningMap is used in both the scan and insert paths, but currently NodePartitioningManager is making assumptions around partitionCount when it's absent. cc @marton-bod
I don't follow. This is no different than the behavior of the code we have today. Also all existing tests pass.
core/trino-main/src/main/java/io/trino/sql/planner/NodePartitioningManager.java
@@ -81,18 +75,7 @@ public Optional<ConnectorBucketNodeMap> getBucketNodeMapping(ConnectorTransactio
    if (!handle.isUsePartitionedBucketing()) {
        return Optional.of(createBucketNodeMap(handle.getBucketCount()));
    }

    // Allocate a fixed number of buckets. Trino will assign consecutive buckets
@raunaqmorarka could you run insert benchmarks on this change? I think it should be fine as https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/sql/planner/NodePartitioningManager.java#L264 provides enough buckets for global and local parallelism.
The new code is just moving this decision from the Hive connector to the core engine.
core/trino-main/src/main/java/io/trino/sql/planner/optimizations/AddExchanges.java
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergTableHandle.java
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergConfig.java
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
core/trino-main/src/main/java/io/trino/sql/planner/optimizations/AddExchanges.java
return metadata.applyPartitioning(session, node.getTable(), partitioningProperties.getPartitioning().map(Partitioning::getHandle), partitionColumns)
        .map(node::withTableHandle)
        // Activate the partitioning if it passes the rules defined in DetermineTableScanNodePartitioning
        .map(newNode -> DetermineTableScanNodePartitioning.setUseConnectorNodePartitioning(
Why doesn't it return node in case setUseConnectorNodePartitioning doesn't set withUseConnectorNodePartitioning(true)?
I think that would be more complex, but it also ends up capturing the information that we could have used the partitioning but the optimizer decided not to use it. I don't think this shows up in the output today, but it could be added.
@Config("iceberg.bucket-execution") | ||
@ConfigDescription("Enable bucket-aware execution: use physical bucketing information to optimize queries") | ||
public IcebergConfig setBucketExecutionEnabled(boolean bucketExecutionEnabled) |
Note that it won't work anyway because DetermineTableScanNodePartitioning uses:
int numberOfBuckets = bucketNodeMap.map(ConnectorBucketNodeMap::getBucketCount)
.orElseGet(() -> nodePartitioningManager.getNodeCount(context.getSession(), partitioning.partitioningHandle()));
It's essential for the connector to be able to provide the bucket count to the engine.
If it's just scanning one partition, then it doesn't make sense to funnel that through a single node (unless the partition is super small itself).
The Iceberg and Hive code both only handle the pushdown if hash bucketing is used. This is not used for value-based partitioning.
I think you miss the point. CBO (DetermineTableScanNodePartitioning) decides whether to use connector-provided bucketing or not. DetermineTableScanNodePartitioning won't use connector bucketing if it doesn't know how many "buckets" there are.
Yes, but in AddExchanges we perform pushdown and if it works, we then directly call DetermineTableScanNodePartitioning so it can decide if we should actually use the bucketing from the connector. So both systems must decide that the partitioning should be used for execution.
@dain DetermineTableScanNodePartitioning will never choose to use Iceberg partitioning in the current code, because it doesn't know the partition count.
I assure you it does. When you have a hash bucket function it uses that number. You can pull the branch and run it yourself.
When you have a hash bucket function it uses that number
What is the number? Is it artificial or does it actually represent how many partitions there are? What happens if only a single partition is scanned?
The code in io.trino.sql.planner.iterative.rule.DetermineTableScanNodePartitioning#apply is:
int numberOfBuckets = bucketNodeMap.map(ConnectorBucketNodeMap::getBucketCount)
.orElseGet(() -> nodePartitioningManager.getNodeCount(context.getSession(), partitioning.getPartitioningHandle()));
int numberOfTasks = max(taskCountEstimator.estimateSourceDistributedTaskCount(context.getSession()), 1);
return Result.ofPlanNode(node
.withUseConnectorNodePartitioning((double) numberOfBuckets / numberOfTasks >= getTableScanNodePartitioningMinBucketToTaskRatio(context.getSession())));
If the number of buckets is some arbitrary big number, but only a single partition is scanned, then such a query won't utilize the cluster properly (it will essentially utilize a single node).
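For concreteness, here is a small sketch of the check quoted above with made-up numbers; the task count and the min bucket-to-task ratio below are assumptions for illustration, not Trino defaults:

```java
// Hypothetical numbers plugged into the decision quoted from DetermineTableScanNodePartitioning.
public final class BucketToTaskRatioSketch
{
    private BucketToTaskRatioSketch() {}

    public static void main(String[] args)
    {
        int numberOfBuckets = 13;          // e.g. from bucket(key1, 13)
        int numberOfTasks = 20;            // assumed estimate of source-distributed tasks
        double minBucketToTaskRatio = 0.5; // assumed session property value

        // 13 / 20 = 0.65 >= 0.5, so connector node partitioning would be enabled.
        boolean useConnectorNodePartitioning =
                (double) numberOfBuckets / numberOfTasks >= minBucketToTaskRatio;
        System.out.println("use connector partitioning: " + useConnectorNodePartitioning);

        // With only 2 buckets the ratio is 0.1, the rule would set
        // withUseConnectorNodePartitioning(false), and the scan stays fully parallel.
        System.out.println("with 2 buckets: " + (2.0 / numberOfTasks >= minBucketToTaskRatio));
    }
}
```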
When you have partitioning that looks like:
partitioning = ARRAY['bucket(key1, 13)', 'bucket(key2, 17)']
If you group on key1, you get 13 buckets, and 17 for key2.
The answer to the question about the number of buckets actually scanned is not affected by my code at all, so whatever happened before still happens. I believe (but don't know) you would likely end up with a constant for the partition function argument, and there is logic that disables the partitioning in that case. Regardless, this code actually improves this in more cases, because partitioning is only enabled when the query needs it. Otherwise the table is read as if it were not partitioned.
@dain Does your PR also enable DetermineTableScanNodePartitioning for Iceberg partitioning = ARRAY['c1', 'c2']?
If so, then it's a new behavior which would lead to lack of query parallelism for single- or two-partition scans.
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua
Adding stale-ignore and performance labels. Also cc @martint
    return bucketExecutionEnabled;
}

@Config("iceberg.bucket-execution")
I'd like to see a test in this commit highlighting what the 🥩 of the PR is actually about.
You are right. I forgot to commit the test I copied from the Hive connector for this. It is now in this commit in BaseIcebergConnectorTest.testBucketedSelect()
Can we do the last 2 commits as a separate PR? I think the rest can land already, and it would potentially allow more people to review the Iceberg bits.
core/trino-spi/src/main/java/io/trino/spi/connector/ConnectorNodePartitioningProvider.java
.../java/io/trino/execution/scheduler/faulttolerant/FaultTolerantPartitioningSchemeFactory.java
default Optional<ConnectorBucketNodeMap> getBucketNodeMapping(
        ConnectorTransactionHandle transactionHandle,
        ConnectorSession session,
        ConnectorPartitioningHandle partitioningHandle)
Formatting change in a separate commit?
 * <p>
 * If the ConnectorPartitioningHandle of two tables are equal, the tables are guaranteed
 * to have the same partitioning scheme across nodes, and the engine may use a colocated join.
 */
public interface ConnectorPartitioningHandle
{
    default boolean isSingleNode()
Can we also add comments for isSingleNode and isCoordinatorOnly?
These are only implemented in SystemPartitioningHandle and I'm going to remove them eventually.
List<ColumnHandle> partitionColumns;
if (partitioningProperties.getPartitioning().isPresent()) {
    Partitioning partitioning = partitioningProperties.getPartitioning().get();
    // constant partitioning values cannot be pushed into the table scan
When we have a combination of constants and partitioning columns, are we not able to take advantage of partitioning?
Yep. This is a pre-existing constraint; this logic came from somewhere else.
I'm not sure that it's relevant here, but I have seen a plan like the one below.
Fragment 1 [mmdsbx:HivePartitioningHandle{buckets=32, hiveTypes=[string, int]}]
CPU: 1.71m, Scheduled: 4.40m, Blocked 8.73m (Input: 44.35s, Output: 51.48s), Input: 19435177 rows (2.30GB); per task: avg.: 19435177.00 std.dev.: 0.00, Output: 14982018 rows (5.32GB)
Output layout: [root_guid, guid, lvl1_root_guid, concat, concat_4, redundant, cycle, excluded]
Output partitioning: SINGLE []
└─ Aggregate[type = FINAL, keys = [root_guid, guid, lvl1_root_guid, concat, concat_4, redundant, cycle, excluded], hash = [$hashvalue]]
│ Layout: [root_guid:varchar, guid:varchar, lvl1_root_guid:varchar, concat:array(varchar), concat_4:varchar, redundant:boolean, cycle:boolean, excluded:boolean, $hashvalue:bigint]
└─ LocalExchange[partitioning = HASH, hashColumn = [$hashvalue], arguments = ["root_guid", "guid", "lvl1_root_guid", "concat", "concat_4", "redundant", "cycle", "excluded"]]
│ Layout: [root_guid:varchar, guid:varchar, lvl1_root_guid:varchar, concat:array(varchar), concat_4:varchar, redundant:boolean, cycle:boolean, excluded:boolean, $hashvalue:bigint]
└─ Aggregate[type = PARTIAL, keys = [root_guid, guid, lvl1_root_guid, concat, concat_4, redundant, cycle, excluded], hash = [$hashvalue_9]]
│ Layout: [root_guid:varchar, guid:varchar, lvl1_root_guid:varchar, concat:array(varchar), concat_4:varchar, redundant:boolean, cycle:boolean, excluded:boolean, $hashvalue_9:bigint]
└─ InnerJoin[criteria = ("guid" = "root_guid_0"), hash = [$hashvalue_5, $hashvalue_6], distribution = PARTITIONED]
│ Layout: [root_guid:varchar, guid:varchar, cycle:boolean, excluded:boolean, redundant:boolean, xpath:array(varchar), lvl1_root_guid:varchar]
│ Distribution: PARTITIONED
│ maySkipOutputDuplicates = true
│ dynamicFilterAssignments = {root_guid_0 -> #df_631}
├─ ScanFilterProject[table = mmdsbx:stg_gbt_top_down_expanded_2 buckets=32, filterPredicate = ("level" = 3), dynamicFilters = {"guid" = #df_631}]
│ Layout: [root_guid:varchar, guid:varchar, cycle:boolean, excluded:boolean, redundant:boolean, xpath:array(varchar), $hashvalue_5:bigint]
│ $hashvalue_5 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("guid"), 0))
│ excluded := excluded:boolean:REGULAR
│ xpath := xpath:array<string>:REGULAR
│ level := level:int:REGULAR
│ guid := guid:string:REGULAR
│ root_guid := root_guid:string:REGULAR
│ cycle := cycle:boolean:REGULAR
│ redundant := redundant:boolean:REGULAR
│ Input: 11886091 rows (1.72GB), Filtered: 44.99%
│ Dynamic filters:
│ - df_631, ALL, collection time=564.27ms
└─ LocalExchange[partitioning = HASH, hashColumn = [$hashvalue_6], arguments = ["root_guid_0"]]
│ Layout: [root_guid_0:varchar, lvl1_root_guid:varchar, $hashvalue_6:bigint]
└─ RemoteSource[sourceFragmentIds = [2]]
Layout: [root_guid_0:varchar, lvl1_root_guid:varchar, $hashvalue_7:bigint]
Fragment 2 [SOURCE]
CPU: 3.27s, Scheduled: 7.72s, Blocked 0.00ns (Input: 0.00ns, Output: 0.00ns), Input: 7549086 rows (532.75MB); per task: avg.: 7549086.00 std.dev.: 0.00, Output: 7549086 rows (597.55MB)
Output layout: [root_guid_0, lvl1_root_guid, $hashvalue_8]
Output partitioning: mmdsbx:HivePartitioningHandle{buckets=32, hiveTypes=[string, int]} [root_guid_0, integer(3)]
ScanProject[table = mmdsbx:stg_connections_wo_duplicates]
Layout: [root_guid_0:varchar, lvl1_root_guid:varchar, $hashvalue_8:bigint]
$hashvalue_8 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("root_guid_0"), 0))
lvl1_root_guid := lvl1_root_guid:string:REGULAR
root_guid_0 := root_guid:string:REGULAR
Input: 7549086 rows (532.75MB), Filtered: 0.00%
Both tables are bucketed by the string column guid. The one on the probe side also has bucketing by another column which has a predicate on it and somehow we were able to create an output partitioning of bucketed column + constant on the build side scan.
The query plan above is not taking advantage of partitioning on the build side. The probe side is pre-partitioned, so you avoid one repartitioning step. You can see this because it is not using a colocated join.
I will improve the comment here. This restriction only applies when the pushdown has a required partitioning function. Our APIs only support pushing down column references and not constants. We could modify the pushdown to support variables and constants in the future. It is even hard for me to think about this. It is asking the table if it can be partitioned on a function where one of the values is constant. This might be possible to implement but would take a lot of care.
BTW, I manually tested this and you get the same behavior.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTablePartitioning.java
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveSessionProperties.java
        false,
        Optional.empty(),
        ImmutableSet.of(),
        Optional.of(false));
}

private Optional<IcebergTablePartitioning> getTablePartitioning(ConnectorSession session, Table icebergTable)
{
    if (!isBucketExecutionEnabled(session) || icebergTable.specs().size() != 1) {
Could you add a TODO about handling schema evolution related cases if that is something feasible to handle later on?
That would only be reasonable in very limited cases. I don't think it is worth a TODO.
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
@@ -168,6 +173,16 @@ public String getPartitionSpecJson()
    return partitionSpecJson;
}

/**
 * Trino (stack) values of the partition columns. The values are the result of evaluating
What does (stack) mean here?
Stack values are long, double, boolean, and Object, whereas the actual values might be something like float.
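As a concrete, hypothetical illustration of the long/float distinction, assuming the usual trick of carrying a float's bit pattern inside a long (the class name below is made up):

```java
import static java.lang.Float.floatToRawIntBits;
import static java.lang.Float.intBitsToFloat;

// Illustrative only: a float partition value carried as a long "stack" value.
public final class StackValueSketch
{
    private StackValueSketch() {}

    public static void main(String[] args)
    {
        float actualValue = 3.25f;

        // Stack representation: the float's bit pattern widened into a long.
        long stackValue = floatToRawIntBits(actualValue);

        // Recovering the actual value from the stack value.
        float recovered = intBitsToFloat((int) stackValue);

        System.out.println(stackValue + " -> " + recovered); // prints 1078984704 -> 3.25
    }
}
```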
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java
        partitioningColumns.stream().map(IcebergColumnHandle.class::cast).collect(toImmutableList()),
        newPartitionStructFields.build()))));
}
I'm assuming the lack of a getCommonPartitioningHandle implementation here just means that bucket count mismatch cases are not handled yet?
Mismatch is not supported. There is a comment in the applyPartitioning method.
Check boolean nullsAndAnyReplicated field before more complex fields
Add partitioning push down to table scan which connector can use to activate optional partitioning, or choose between multiple partitioning strategies. This replaces the existing Metadata makeCompatiblePartitioning method used exclusively by Hive
Add support for pushing plan partitioning into Iceberg when Iceberg tables use hash bucket partitioning. This enables co-located joins, which can be significantly more efficient. Additionally, since Iceberg supports multiple independent partitioning functions, a table can effectively have multiple distributions, which makes the optimization more effective. This feature can be controlled with the iceberg.bucket-execution configuration property and the bucket_execution_enabled session property.
This improves cache hit rate for file system caching
Description
Add partitioning push down to table scan which a connector can use to activate optional partitioning, or choose between multiple partitioning strategies. This replaces the existing Metadata makeCompatiblePartitioning method, used exclusively by Hive, with a more generic applyPartitioning method.
Hive has been updated to the new system, and now only applies bucketed execution when it is actually used in the coordinator. This can improve performance when parallelism is limited by the bucketing and the bucketing isn't necessary for the query. Additionally, mismatched bucket execution (support for joins between tables where the bucket counts differ by a power of two) in Hive is activated by default. I believe this was disabled before because we did not have a system to automatically disable bucket execution when the bucket count is small compared to the number of nodes.
Iceberg has been updated to support bucketed execution as well. This applies the same optimizations available to Hive, which allows the engine to eliminate unnecessary redistribution of tables. Additionally, since Iceberg supports multiple independent partitioning functions, a table can effectively have multiple distributions, which makes the optimization even more effective. Iceberg bucket execution can be controlled with the iceberg.bucket-execution configuration property and the bucket_execution_enabled session property.
Finally, for bucketed tables without a fixed node assignment, the connector can request a stable node distribution across queries. This is implemented in Hive and Iceberg and improves the cache hit rate for file system caching. The implementation is a simple Rendezvous Hashing (Highest Random Weight) algorithm.
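As a rough sketch of the idea (a generic Rendezvous/HRW assignment, not the actual Trino implementation; the names and the hash mix are illustrative): every node is scored per bucket by hashing the (bucket, node) pair, the highest score wins, and the winner stays the same across queries as long as the node set does not change.

```java
import java.util.List;

// Generic HRW sketch (not the Trino implementation): stable bucket-to-node assignment.
public final class RendezvousHashingSketch
{
    private RendezvousHashingSketch() {}

    static String nodeForBucket(int bucket, List<String> nodes)
    {
        String best = null;
        long bestScore = Long.MIN_VALUE;
        for (String node : nodes) {
            // Any stable, well-mixed hash of (bucket, node) works; this one is illustrative.
            long score = mix(bucket * 31L + node.hashCode());
            if (score > bestScore) {
                bestScore = score;
                best = node;
            }
        }
        return best;
    }

    // SplitMix64-style finalizer to spread the combined hash.
    static long mix(long z)
    {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    public static void main(String[] args)
    {
        List<String> nodes = List.of("node-1", "node-2", "node-3");
        for (int bucket = 0; bucket < 8; bucket++) {
            // The same bucket maps to the same node across queries, which helps cache hit rates.
            System.out.println("bucket " + bucket + " -> " + nodeForBucket(bucket, nodes));
        }
    }
}
```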
Follow-up Work
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text: