
Multi channel group by hash first batch of optimizations #12072

Conversation

skrzypo987
Member

Description

Some simple optimizations for multi channel group by hash.

Is this change a fix, improvement, new feature, refactoring, or other?

improvement & refactor

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core query engine

How would you describe this change to a non-technical end user or system administrator?

[slow] -> improvement -> [less slow]

Related issues, pull requests, and links

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Improve performance of aggregation operator

@cla-bot cla-bot bot added the cla-signed label Apr 21, 2022
@skrzypo987 skrzypo987 added the WIP label Apr 21, 2022
@skrzypo987
Member Author

Labeled as WIP as some tests are still failing. This is due to the decreased memory footprint.

@skrzypo987 skrzypo987 removed the WIP label Apr 26, 2022
@skrzypo987 skrzypo987 requested review from radek-kondziolka and removed request for radek-kondziolka April 26, 2022 05:24
@@ -16,7 +16,6 @@
import com.google.common.annotations.VisibleForTesting;
Member

Do you have microbenchmarks for this? The same approach could be applied to PagesIndex and join.

Member Author

I updated the commit message. Some cases offer substantial gains.

Member

I updated the commit message. Some cases offer substantial gains.

I don't think the commit message has microbenchmark results.

Nice find BTW!

@@ -140,23 +131,21 @@ public MultiChannelGroupByHash(
PagesHashStrategyFactory pagesHashStrategyFactory = joinCompiler.compilePagesHashStrategyFactory(this.types, outputChannels.build());
hashStrategy = pagesHashStrategyFactory.createPagesHashStrategy(this.channelBuilders, this.precomputedHashChannel);

startNewPage();
pageBuilder = new PageBuilder(types);
Member

@sopel39 sopel39 Apr 26, 2022

Since the hash table can only store 2^30 positions, the structure can be flattened to a single huge page.

The retained size of a single large page will probably be greater than if multiple pages are used (because we use currentPageBuilder.newPageBuilderLike(), which is adaptive).

Is the cost of resizing the page negligible? What if the page consists of long strings?

Member Author

The retained size of a single large page will probably be greater than if multiple pages are used

That is true. However, the size of the hash table and other primitive arrays shrinks from ~37 bytes/row to ~9 bytes/row, which gives us some headroom.
Just to play with the numbers some more:
A block is resized by a factor of 1.5, so on average 25% additional space is retained. The wasted space of the multiple-page approach is negligible for a big enough number of groups.
So there is a breakeven at an average row size of (37 - 9) / 0.25 = 112 B.
For rows smaller than that we use less memory, but for bigger ones we use more. There is a tradeoff here and we need to choose whether we accept it.
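Spelling out that arithmetic (a quick sanity check, taking the ~37 B and ~9 B per-row figures above at face value):

$$\text{breakeven row size} = \frac{37\,\mathrm{B} - 9\,\mathrm{B}}{(1.5 - 1)/2} = \frac{28\,\mathrm{B}}{0.25} = 112\,\mathrm{B}$$

Rows below 112 B come out ahead under the single-page layout; rows above it pay more in resize slack.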

Is the cost of resizing the page negligible?

I'd guess it is no bigger than the cost of handling and allocating multiple pages, especially in the multi-row case.

What if the page consists of long strings?

Aggregating by text is not necessarily the smartest thing. You can make a query fail in a much simpler way.

Member

That is true. However, the size of the hash table and other primitive arrays shrinks from ~37 bytes/row to ~9 bytes/row, which gives us some headroom.

That is only a valid assumption for primitive arrays.

I'd guess it is no bigger than the cost of handling and allocating multiple pages, especially in the multi-row case.

It also depends on not having wide rows.

Aggregating by text is not necessarily the smartest thing. You can make a query fail in a much simpler way.

There are already such aggregations in our favorite benchmark. We also synthetically insert deduplicating aggregations, or aggregations for correlated subqueries, which store entire rows. Customers sometimes have very wide rows (e.g. hundreds of columns).

That's why I'm asking whether a similar effect could be obtained by increasing the page size, which by default is pretty low (9f4dbe8#r858629278). It should reduce memory overhead, but also allow the 1-level addressing to fit into the CPU cache.

Member Author

increasing page size

I don't quite follow.
Having 2-level addressing with just a single page is still 2-level addressing. You still need to make two lookups per operation to get to the data.
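To make the "two lookups" point concrete (field names as in the existing code shown in the diff hunks below):

// lookup 1: translate group id -> encoded address
long address = groupAddressByGroupId.get(groupId);
int blockIndex = decodeSliceIndex(address);
int position = decodePosition(address);
// lookup 2: only now can the row be read from the page at blockIndex.
// A bigger page shrinks the page list, but this indirection remains.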

private final List<ObjectArrayList<Block>> channelBuilders;
private final Optional<Integer> inputHashChannel;
private final HashGenerator hashGenerator;
private final OptionalInt precomputedHashChannel;
private final boolean processDictionary;
private PageBuilder currentPageBuilder;

private long completedPagesMemorySize;
Member

Would increasing the max page size in PageBuilder achieve similar results?

Member Author

I don't think so. The gain is IMO the result of the removal of two-level addressing.

@skrzypo987 skrzypo987 force-pushed the skrzypo/067-multi-channel-group-by-hash-first-optimizations branch from 597f788 to 9f4dbe8 Compare April 27, 2022 06:28
Member Author

@skrzypo987 skrzypo987 left a comment

Remarks addressed.
I dropped the last two commits:
- load factor -> 0.25
- compact rawHashByHashPosition
The microbenchmark results were inconsistent. I will return to them after this is merged.


skrzypo987 added 5 commits April 28, 2022 15:08
For simplicity and a tiny performance gain.
To make them shorter and consistent with BigintGroupByHash.
Temporary output data used to be kept in a list of variable-size pages. That
required using a two-level address to point at an exact row. Since the hash table
can only store 2^30 positions, the number of groups can be stored in a 4-byte
int variable, given that the page size is fixed. This way we can get rid of
some data structures used to map between the group id and the output row position,
skipping some calculations and reducing the memory footprint from 8+8+4+1 bytes
per hash bucket to 4+1 bytes, which improves memory locality.
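As a sketch, with a fixed power-of-two page size the group id doubles as the address (constant names as in the diff hunks below):

private static final int OUTPUT_PAGE_BITS = 14; // 16k positions per page
private static final int OUTPUT_PAGE_SIZE = 1 << OUTPUT_PAGE_BITS;
private static final int OUTPUT_PAGE_MASK = OUTPUT_PAGE_SIZE - 1;

// group id -> (page, position) with no lookup table in between
int blockIndex = groupId >> OUTPUT_PAGE_BITS; // which values page
int position = groupId & OUTPUT_PAGE_MASK;    // row within that page

This leaves an int group id (4 B) plus a byte raw hash (1 B) per hash bucket, instead of two 8-byte addresses, an int, and a byte.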
Previously the hash table capacity was checked for every row to see whether a rehash
was needed. Now the input page is split into batches, it is assumed that every
row in a batch will create a new group (which is rarely the case), and rehashing
is done in advance, before the batch is processed.
This may slightly increase the memory footprint for a small number of groups, but
there is a tiny performance gain, as the capacity is not checked for every row.
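A rough sketch of that batching idea (method and field names here are illustrative, not the exact ones from the patch):

// Reserve worst-case capacity once per batch instead of checking per row,
// assuming every row in the batch creates a new group.
private void processBatch(Page page, int startPosition, int batchSize)
{
    // rehash up front if the worst case would cross the fill threshold
    ensureCapacity(groupCount + batchSize);
    for (int i = startPosition; i < startPosition + batchSize; i++) {
        putIfAbsent(i, page); // no capacity check inside the loop
    }
}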
@skrzypo987 skrzypo987 force-pushed the skrzypo/067-multi-channel-group-by-hash-first-optimizations branch from 9f4dbe8 to b7de749 Compare April 29, 2022 06:13
@skrzypo987
Member Author

Again some memory-management tests are failing; I'll look into that.

Member

@sopel39 sopel39 left a comment

lgtm % comments


@@ -62,6 +59,9 @@
private static final float FILL_RATIO = 0.75f;
// Max (page value count / cumulative dictionary size) to trigger the low cardinality case
private static final double SMALL_DICTIONARIES_MAX_CARDINALITY_RATIO = .25;
private static final int OUTPUT_PAGE_BITS = 14; // 16k positions
Member

So the page now always has 16k positions, even with a single row? Or do we start with an almost-empty single builder?

@@ -62,6 +59,9 @@
private static final float FILL_RATIO = 0.75f;
// Max (page value count / cumulative dictionary size) to trigger the low cardinality case
private static final double SMALL_DICTIONARIES_MAX_CARDINALITY_RATIO = .25;
private static final int OUTPUT_PAGE_BITS = 14; // 16k positions
private static final int OUTPUT_PAGE_SIZE = 1 << OUTPUT_PAGE_BITS;
private static final int OUTPUT_PAGE_MASK = OUTPUT_PAGE_SIZE - 1;
Member

It's not really an output page, but rather a VALUES_PAGE or GROUP_VALUES_PAGE. These pages keep the values of the grouping channels. Note that we already have the io.trino.operator.GroupByHash#appendValuesTo method, which adds rows from these pages.

Member

renamed to VALUES_PAGE

long address = groupAddressByGroupId.get(groupId);
int blockIndex = decodeSliceIndex(address);
int position = decodePosition(address);
int blockIndex = groupId >> OUTPUT_PAGE_BITS;
Member

I think you should either use >>> (unsigned shift) or mask the result. Please add a test for the edge case (large group id) for this method if it's possible to do so.

Member

groupId is never negative, so both shifts work the same way (they extend with 0).
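A tiny illustration of that point (plain Java, not from the patch):

int groupId = (1 << 30) - 1;                 // largest possible group id, still non-negative
assert (groupId >> 14) == (groupId >>> 14);  // identical while the sign bit is clear
// they only diverge for negative values:
// -1 >> 14 == -1, while -1 >>> 14 == 262143 (18 one-bits)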

long address = groupAddressByGroupId.get(groupId);
int blockIndex = decodeSliceIndex(address);
int position = decodePosition(address);
int blockIndex = groupId >> OUTPUT_PAGE_BITS;
Member

ditto

@@ -365,7 +350,7 @@ private void startNewPage()
currentPageBuilder = currentPageBuilder.newPageBuilderLike();
}
else {
currentPageBuilder = new PageBuilder(types);
currentPageBuilder = new PageBuilder(OUTPUT_PAGE_SIZE, Integer.MAX_VALUE, types);
Member

Why? Why not start with a minimal builder for the first page, so that we don't have to allocate 16k positions and waste memory?

Member Author

It doesn't seem to have an impact on perf. I reverted it.

pre size to OUTPUT_PAGE_SIZE
Benchmark                                   (channelCount)  (dataType)  (groupCount)  (hashEnabled)  Mode  Cnt    Score   Error  Units
BenchmarkGroupByHash.groupByHashPreCompute               2      BIGINT          8000           true  avgt   20   33.467 ± 2.076  ns/op
BenchmarkGroupByHash.groupByHashPreCompute               2      BIGINT       3000000           true  avgt   20  234.790 ± 2.115  ns/op

no pre size
Benchmark                                   (channelCount)  (dataType)  (groupCount)  (hashEnabled)  Mode  Cnt    Score   Error  Units
BenchmarkGroupByHash.groupByHashPreCompute               2      BIGINT          8000           true  avgt   20   34.916 ± 1.100  ns/op
BenchmarkGroupByHash.groupByHashPreCompute               2      BIGINT       3000000           true  avgt   20  233.863 ± 6.016  ns/op

byte[] rawHashes = new byte[newCapacity];
Arrays.fill(newKey, -1);
int[] newValue = new int[newCapacity];
Member

better would be newGroupIdByHash

{
int sliceIndex = decodeSliceIndex(sliceAddress);
int position = decodePosition(sliceAddress);
int blockIndex = groupId >> OUTPUT_PAGE_BITS;
Member

ditto about shift

{
if (rawHashByHashPosition[hashPosition] != rawHash) {
return false;
}
return hashStrategy.positionNotDistinctFromRow(decodeSliceIndex(address), decodePosition(address), position, page, hashChannels);
int blockIndex = groupId >> OUTPUT_PAGE_BITS;
Member

ditto

@@ -576,7 +578,6 @@ public void setProcessed(int position, int groupId)
implements Work<Void>
{
private final Page page;

Member

undo

@sopel39
Member

sopel39 commented Apr 29, 2022

@skrzypo987 do you have micro/macro benchmark results?

@@ -365,7 +350,7 @@ private void startNewPage()
currentPageBuilder = currentPageBuilder.newPageBuilderLike();
Member

@sopel39 sopel39 Apr 29, 2022

For a fixed width type you can assume here that there will be OUTPUT_PAGE_SIZE positions (type check on instanceof FixedWidthType).
For a variable length type (instanceof VariableWidthType) you can also assume OUTPUT_PAGE_SIZE positions, but additionally account for some skew via io.trino.spi.block.BlockUtil#calculateBlockResetBytes. You can then use io.trino.spi.type.Type#createBlockBuilder(io.trino.spi.block.BlockBuilderStatus, int, int) to construct a builder of the right size.

This should save memory compared to the existing code. I wouldn't be surprised to see aggregation memory usage drop by 10-20%.
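A rough sketch of that suggestion (the helper name is hypothetical, the 32-byte average for variable-width values is a placeholder, and the calculateBlockResetBytes-based skew handling is left out):

import io.trino.spi.block.BlockBuilder;
import io.trino.spi.type.FixedWidthType;
import io.trino.spi.type.Type;

private static BlockBuilder createSizedBlockBuilder(Type type)
{
    if (type instanceof FixedWidthType fixedWidthType) {
        // fixed width: the bytes per entry are known exactly up front
        return type.createBlockBuilder(null, OUTPUT_PAGE_SIZE, fixedWidthType.getFixedSize());
    }
    // variable width: bytes per entry is a guess (32 B here); real code could
    // refine this from observed data via BlockUtil#calculateBlockResetBytes
    return type.createBlockBuilder(null, OUTPUT_PAGE_SIZE, 32);
}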

Member

Doing this properly is not that straightforward. currentPageBuilder is a PageBuilder, not a single BlockBuilder, so this would have to be implemented in the PageBuilder.
Also, there is no direct access to the number of bytes in the VariableWidthBlockBuilder. We could use getSizeInBytes, but the result also contains the size of the valueIsNull and offsets arrays.
IMO this looks like a separate PR. WDYT?

Member

@lukasz-stec lukasz-stec left a comment

I addressed most comments, but at the moment I don't have access to the source branch repo, so I put the changes here.
Gonna push them here once I get access.


@@ -80,12 +80,9 @@
private int hashCapacity;
private int maxFill;
private int mask;
private long[] groupAddressByHash;
private int[] groupIdsByHash;
Member

Comment added; take a look to see if it makes sense to you.


@lukasz-stec
Member

I addressed most comments, but at the moment I don't have access to the source branch repo, so I put the changes [here](https://github.com/starburstdata/trino/tree/skrzypo-copy/067-multi-channel-group-by-hash-first-optimizations). Gonna push them here once I get access.

@skrzypo987 is out without internet access so I forked this here + applied changes mentioned + fixed tests + improved memory accounting

@sopel39
Member

sopel39 commented May 19, 2022

Closed in favor of #12336

@sopel39 sopel39 closed this May 19, 2022