Flatten decimal accumulator #9640

sopel39 · 2021-10-14T11:38:04Z

This improves decimal aggregation performance.

after
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score      Error  Units
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum          1000    LONG  avgt   10   54239,529 ± 1091,320  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum          1000    LONG  avgt   10  140978,104 ± 4354,546  ns/op

before
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score      Error  Units
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum          1000    LONG  avgt   10   69347,207 ± 1895,007  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum          1000    LONG  avgt   10  185009,592 ± 3245,543  ns/op
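
For readers skimming the numbers, here is a minimal sketch of what a "flattened" grouped accumulator could look like: one `long[]` holding two longs per group instead of one object per group. The class name, method names, and the two's-complement arithmetic are all assumptions made for illustration. The real code in this PR goes through `UnscaledDecimal128Arithmetic`, which keeps the sign in the top bit of the high word rather than using two's complement.

```java
import java.util.Arrays;

// Hypothetical sketch: names and layout are assumptions, not the PR's actual classes.
final class FlatDecimalSumSketch
{
    // Two longs per group: [2 * groupId] = low 64 bits, [2 * groupId + 1] = high 64 bits.
    private long[] values = new long[0];

    void ensureCapacity(int groupCount)
    {
        if (values.length < groupCount * 2) {
            values = Arrays.copyOf(values, groupCount * 2);
        }
    }

    // Adds a signed 64-bit value to the 128-bit two's-complement sum of the group.
    void add(int groupId, long value)
    {
        int offset = groupId * 2;
        long low = values[offset];
        long newLow = low + value;
        // Unsigned overflow of the low word means a carry into the high word.
        long carry = Long.compareUnsigned(newLow, low) < 0 ? 1 : 0;
        values[offset] = newLow;
        // value >> 63 is the sign extension of value (0 or -1).
        values[offset + 1] += (value >> 63) + carry;
    }

    long low(int groupId)
    {
        return values[groupId * 2];
    }

    long high(int groupId)
    {
        return values[groupId * 2 + 1];
    }
}
```

The point of the layout is that updating a group touches two adjacent array slots and allocates nothing per group.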

sopel39 · 2021-10-15T08:45:01Z

benchmark results

	TPCH wall time	TPC-DS wall time	TPCH CPU time	TPC-DS CPU time
before	382.06925	1133.92525	25784.70	64565.54750
after	357.58375	1093.73525	25589.85	62462.05225

CPU improvements: tpcds/q51, tpcds/q47, tpcds/q67, and some others

...n/src/main/java/io/trino/operator/aggregation/state/LongDecimalWithOverflowStateFactory.java

losipiuk

LGTM. Some minor comments/questions.

sopel39 · 2021-10-15T10:53:34Z

This causes a regression when there is a small number of groups. Need to figure out how to combine the best of both worlds.

sopel39 · 2021-10-15T21:24:04Z

newer benchmarks:

before:
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score       Error  Units
BenchmarkDecimalAggregation.benchmark                             sum            10    LONG  avgt   10      11,782 ±     0,511  ns/op
BenchmarkDecimalAggregation.benchmark                             sum         10000    LONG  avgt   10      20,122 ±     0,773  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum            10    LONG  avgt   10    1043,483 ±   100,551  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum         10000    LONG  avgt   10  727254,852 ± 21119,807  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum            10    LONG  avgt   10      11,628 ±     0,175  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum         10000    LONG  avgt   10      20,417 ±     0,594  ns/op

after

Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score       Error  Units
BenchmarkDecimalAggregation.benchmark                             sum            10    LONG  avgt   10       5,488 ±     0,466  ns/op
BenchmarkDecimalAggregation.benchmark                             sum         10000    LONG  avgt   10       6,471 ±     0,664  ns/op
# overhead of accumulator initialization for dumping 10 values
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum            10    LONG  avgt   10    1563,146 ±   393,037  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum         10000    LONG  avgt   10  552729,926 ± 23429,821  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum            10    LONG  avgt   10       5,236 ±     0,279  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum         10000    LONG  avgt   10       6,462 ±     0,323  ns/op

This reduces benchmark noise from accumulator initialization

sopel39 · 2021-10-18T11:14:02Z

new results

	TPCH wall time	TPC-DS wall time	TPCH CPU time	TPC-DS CPU time
before	382.06925	1133.92525	25784.7	64565.5475
after	341.46475	949.67400	24894.6	61884.5610

core/trino-spi/src/test/java/io/trino/spi/type/TestUnscaledDecimal128Arithmetic.java

losipiuk · 2021-10-18T11:34:54Z

core/trino-spi/src/main/java/io/trino/spi/type/UnscaledDecimal128Arithmetic.java

+
+        long intermediateResult = leftHigh + rightHigh + overflow;
+        long z1 = intermediateResult & (~SIGN_LONG_MASK);
+        pack(z0, z1, resultNegative, result, resultOffset);


nit: can you do a preparatory refactor so that the argument order for the pack methods, which operate on Slices and long arrays, is the same?

wdym?

They are the same:

public static void pack(long low, long high, boolean negative, Slice result, int resultOffset)
public static void pack(long low, long high, boolean negative, long[] result, int resultOffset)
public static void pack(long low, long high, boolean negative, Slice result)

Depends:

private static void pack(Slice decimal, long low, long high, boolean negative)

I still don't understand. What do you want to do with that pack method?

In this signature:

private static void pack(Slice decimal, long low, long high, boolean negative)

decimal is effectively a result.

So to be in line with the other methods you listed above, decimal should be passed as the last argument.
Of course you cannot just move it there, because it would conflict with public static void pack(long low, long high, boolean negative, Slice result).
But this raises the question: why do we have two methods? What is the semantic difference? I cannot answer that quickly just by looking at the method. Maybe the naming should be more descriptive (packWithSomething)?

cc: @ksobolew

Now I see. The pack that checks for negative 0 should probably be named differently.

From what I can see these two methods are identical, apart from the check for negative zero I added recently. It's just that one calls setNegativeLong while the other inlines it.

I would merge them, personally

If we can have less code without significant performance degradation, I am all for it.
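
To make the merge idea concrete, here is a rough sketch of a single `pack` that keeps the "result last" argument order and handles negative zero in one place. It is only an illustration. The class name is made up, `SIGN_LONG_MASK` is assumed to be the top bit of the high word, and normalising negative zero to positive zero is an assumed behaviour, not necessarily what the existing check does.

```java
// Hypothetical sketch of a merged pack; not the actual UnscaledDecimal128Arithmetic code.
final class PackSketch
{
    // Assumption: the sign is stored in the most significant bit of the high word.
    private static final long SIGN_LONG_MASK = 1L << 63;

    private PackSketch() {}

    // Inputs first, destination last, matching the public overloads quoted above.
    static void pack(long low, long high, boolean negative, long[] result, int resultOffset)
    {
        // Assumed negative-zero handling: never set the sign bit on a zero magnitude.
        boolean storeNegative = negative && (low != 0 || high != 0);
        result[resultOffset] = low;
        result[resultOffset + 1] = high | (storeNegative ? SIGN_LONG_MASK : 0);
    }

    static boolean isNegative(long[] decimal, int offset)
    {
        return (decimal[offset + 1] & SIGN_LONG_MASK) != 0;
    }
}
```

With one method owning the sign handling, the Slice variants could delegate to the same logic instead of one calling setNegativeLong and the other inlining it.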

losipiuk · 2021-10-18T11:36:03Z

core/trino-spi/src/main/java/io/trino/spi/type/UnscaledDecimal128Arithmetic.java

+    public static byte[] toByteArray(long value, byte[] result, int offset)
+    {
+        // copied from Guava Longs#toByteArray
+        for (int i = 7; i >= 0; i--) {


would it make sense to manually unroll this loop? I guess no. But you may check.
cc: @skrzypo987

Currently, in the context where unscaledDecimalToBigInteger is used (io.trino.operator.aggregation.DecimalAverageAggregation#average), unrolling wouldn't make much difference.
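
For reference, this is roughly what the two variants would look like side by side. The loop body is assumed to follow Guava's `Longs#toByteArray` (big-endian, least significant byte written last), adjusted for the `offset` parameter. The class and method names are made up for the comparison.

```java
// Hypothetical comparison of the looped and manually unrolled conversions.
final class ToByteArraySketch
{
    private ToByteArraySketch() {}

    // Loop form, assumed to mirror Guava's Longs#toByteArray with an added offset.
    static byte[] toByteArrayLoop(long value, byte[] result, int offset)
    {
        for (int i = 7; i >= 0; i--) {
            result[offset + i] = (byte) (value & 0xffL);
            value >>= 8;
        }
        return result;
    }

    // Manually unrolled form; writes the same big-endian bytes.
    static byte[] toByteArrayUnrolled(long value, byte[] result, int offset)
    {
        result[offset] = (byte) (value >> 56);
        result[offset + 1] = (byte) (value >> 48);
        result[offset + 2] = (byte) (value >> 40);
        result[offset + 3] = (byte) (value >> 32);
        result[offset + 4] = (byte) (value >> 24);
        result[offset + 5] = (byte) (value >> 16);
        result[offset + 6] = (byte) (value >> 8);
        result[offset + 7] = (byte) value;
        return result;
    }
}
```

Both produce identical output, so the choice would come down to a JMH measurement. The JIT may well unroll a fixed eight-iteration loop on its own.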

losipiuk

LGTM

Account for SingleLongDecimalWithOverflowAndLongState instance size

This improves decimal aggregations performance

BEFORE:
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score       Error  Units
BenchmarkDecimalAggregation.benchmark                             sum            10    LONG  avgt   10      11,782 ±     0,511  ns/op
BenchmarkDecimalAggregation.benchmark                             sum         10000    LONG  avgt   10      20,122 ±     0,773  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum         10000    LONG  avgt   10  727254,852 ± 21119,807  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum            10    LONG  avgt   10      11,628 ±     0,175  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum         10000    LONG  avgt   10      20,417 ±     0,594  ns/op

AFTER
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score       Error  Units
BenchmarkDecimalAggregation.benchmark                             sum            10    LONG  avgt   10       5,488 ±     0,466  ns/op
BenchmarkDecimalAggregation.benchmark                             sum         10000    LONG  avgt   10       6,471 ±     0,664  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum         10000    LONG  avgt   10  552729,926 ± 23429,821  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum            10    LONG  avgt   10       5,236 ±     0,279  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum         10000    LONG  avgt   10       6,462 ±     0,323  ns/op

fixup

sopel39 · 2021-10-19T09:00:50Z

Decimal avg results:

BEFORE
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score      Error  Units
BenchmarkDecimalAggregation.benchmark                             avg            10    LONG  avgt   10      16,118 ±    0,495  ns/op
BenchmarkDecimalAggregation.benchmark                             avg          1000    LONG  avgt   10      17,451 ±    0,387  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                avg            10    LONG  avgt   10    2462,136 ±   34,402  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                avg          1000    LONG  avgt   10  134229,269 ± 5554,034  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         avg            10    LONG  avgt   10      14,911 ±    0,284  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         avg          1000    LONG  avgt   10      20,239 ±    0,809  ns/op

AFTER
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt      Score      Error  Units
BenchmarkDecimalAggregation.benchmark                             avg            10    LONG  avgt   10      6,316 ±    0,112  ns/op
BenchmarkDecimalAggregation.benchmark                             avg          1000    LONG  avgt   10      6,734 ±    0,260  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                avg            10    LONG  avgt   10   2506,539 ±  128,118  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                avg          1000    LONG  avgt   10  99831,602 ± 3184,846  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         avg            10    LONG  avgt   10      6,423 ±    0,113  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         avg          1000    LONG  avgt   10      6,501 ±    0,129  ns/op

sopel39 requested review from martint and losipiuk October 14, 2021 11:38

sopel39 requested a review from skrzypo987 October 14, 2021 11:38

sopel39 force-pushed the ks/improve_aggregation branch 5 times, most recently from 792a0b1 to 334974d October 14, 2021 20:38