-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flatten decimal accumulator #9640
Conversation
792a0b1
to
334974d
Compare
benchmark results
cpu improvements: tpcds/q51, tpcds/q47, tpcds/q67 and some others |
...n/src/main/java/io/trino/operator/aggregation/state/LongDecimalWithOverflowStateFactory.java
Show resolved
Hide resolved
...n/src/main/java/io/trino/operator/aggregation/state/LongDecimalWithOverflowStateFactory.java
Outdated
Show resolved
Hide resolved
...n/src/main/java/io/trino/operator/aggregation/state/LongDecimalWithOverflowStateFactory.java
Outdated
Show resolved
Hide resolved
...n/src/main/java/io/trino/operator/aggregation/state/LongDecimalWithOverflowStateFactory.java
Outdated
Show resolved
Hide resolved
...n/src/main/java/io/trino/operator/aggregation/state/LongDecimalWithOverflowStateFactory.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Some minor comments/questions.
This causes regression when there is small number of groups. Need to figure out how to combine best of two worlds. |
334974d
to
8b35b2c
Compare
8b35b2c
to
28ebbf9
Compare
newer benchmarks:
|
28ebbf9
to
d38a999
Compare
This reduces benchmark noise from accumulator initialization
d38a999
to
7cc3e1c
Compare
new results
|
core/trino-spi/src/test/java/io/trino/spi/type/TestUnscaledDecimal128Arithmetic.java
Show resolved
Hide resolved
core/trino-spi/src/test/java/io/trino/spi/type/TestUnscaledDecimal128Arithmetic.java
Show resolved
Hide resolved
|
||
long intermediateResult = leftHigh + rightHigh + overflow; | ||
long z1 = intermediateResult & (~SIGN_LONG_MASK); | ||
pack(z0, z1, resultNegative, result, resultOffset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you do a preparatory refactor so argument order for pack
which operators on Slices and long arrays is the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wdtm?
they are same:
public static void pack(long low, long high, boolean negative, Slice result, int resultOffset)
public static void pack(long low, long high, boolean negative, long[] result, int resultOffset)
public static void pack(long low, long high, boolean negative, Slice result)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends:
private static void pack(Slice decimal, long low, long high, boolean negative)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand. You want to do what with that pack
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this signature:
private static void pack(Slice decimal, long low, long high, boolean negative)
decimal
is effectively a result.
So to be in line with other methods you listed above, decimal
should be passed as last argument.
Of course you cannot just move it there because you will get conflict with public static void pack(long low, long high, boolean negative, Slice result)
.
But this raises question why do we have two methods? What is the semantics difference? I cannot answer this question quickly just by looking at the method. Maybe naming should more descriptive (packWithSomethin
)?
cc: @ksobolew
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I see. Probably pack
which check for negative 0 should be called differently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can see these two methods are identical, apart from the check for negative zero I added recently. It's just that one calls setNegativeLong
while the other inlines it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would merge them, personally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can have less code without significant performance degradation - i am all for it.
public static byte[] toByteArray(long value, byte[] result, int offset) | ||
{ | ||
// copied from Guava Longs#toByteArray | ||
for (int i = 7; i >= 0; i--) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make sense to manually unroll this loop? I guess no. But you may check.
cc: @skrzypo987
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, in a context where unscaledDecimalToBigInteger
is used (io.trino.operator.aggregation.DecimalAverageAggregation#average
), unrolling woundm't make much difference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Account for SingleLongDecimalWithOverflowAndLongState instance size
This improves decimal aggregations performance BEFORE: Benchmark (function) (groupCount) (type) Mode Cnt Score Error Units BenchmarkDecimalAggregation.benchmark sum 10 LONG avgt 10 11,782 ± 0,511 ns/op BenchmarkDecimalAggregation.benchmark sum 10000 LONG avgt 10 20,122 ± 0,773 ns/op BenchmarkDecimalAggregation.benchmarkEvaluateFinal sum 10000 LONG avgt 10 727254,852 ± 21119,807 ns/op BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate sum 10 LONG avgt 10 11,628 ± 0,175 ns/op BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate sum 10000 LONG avgt 10 20,417 ± 0,594 ns/op AFTER Benchmark (function) (groupCount) (type) Mode Cnt Score Error Units BenchmarkDecimalAggregation.benchmark sum 10 LONG avgt 10 5,488 ± 0,466 ns/op BenchmarkDecimalAggregation.benchmark sum 10000 LONG avgt 10 6,471 ± 0,664 ns/op BenchmarkDecimalAggregation.benchmarkEvaluateFinal sum 10000 LONG avgt 10 552729,926 ± 23429,821 ns/op BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate sum 10 LONG avgt 10 5,236 ± 0,279 ns/op BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate sum 10000 LONG avgt 10 6,462 ± 0,323 ns/opfixup
7cc3e1c
to
b524793
Compare
Decimal avg results:
|
This improves decimal aggregations performance.