Flatten decimal accumulator #9640

Merged
merged 6 commits into from
Oct 18, 2021

Conversation

sopel39
Member

@sopel39 sopel39 commented Oct 14, 2021

This improves the performance of decimal aggregations.

after
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score      Error  Units
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum          1000    LONG  avgt   10   54239,529 ± 1091,320  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum          1000    LONG  avgt   10  140978,104 ± 4354,546  ns/op

before
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score      Error  Units
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum          1000    LONG  avgt   10   69347,207 ± 1895,007  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum          1000    LONG  avgt   10  185009,592 ± 3245,543  ns/op
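
For readers unfamiliar with the idea, here is a minimal sketch of what "flattening" an accumulator can look like: per-group 128-bit sums are kept in one shared `long[]` (two words per group) instead of one object per group. The class `FlatDecimalSumSketch` and its methods are hypothetical, and it uses plain two's complement rather than the representation in this PR, so read it as an illustration only.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical illustration, not the code from this PR.
final class FlatDecimalSumSketch {
    // Two longs per group: [low word, high word], 128-bit two's complement (an assumption).
    private long[] values = new long[2 * 16];

    void ensureCapacity(int groupCount) {
        if (values.length < groupCount * 2) {
            values = Arrays.copyOf(values, Math.max(groupCount * 2, values.length * 2));
        }
    }

    // Adds a signed 64-bit value to the 128-bit sum of the given group.
    void add(int groupId, long value) {
        int offset = groupId * 2;
        long low = values[offset];
        long newLow = low + value;
        // Unsigned wrap-around of the low word produces a carry into the high word.
        long carry = Long.compareUnsigned(newLow, low) < 0 ? 1 : 0;
        // Sign-extend the addend: a negative value contributes -1 to the high word.
        long signExtension = value < 0 ? -1 : 0;
        values[offset] = newLow;
        values[offset + 1] += carry + signExtension;
    }

    BigInteger sum(int groupId) {
        int offset = groupId * 2;
        BigInteger high = BigInteger.valueOf(values[offset + 1]).shiftLeft(64);
        BigInteger low = new BigInteger(Long.toUnsignedString(values[offset]));
        return high.add(low);
    }
}
```

Reads and writes for a group touch two adjacent array slots rather than going through a per-group object, which is the kind of change the title refers to.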

@sopel39 sopel39 requested review from martint and losipiuk October 14, 2021 11:38
@cla-bot cla-bot bot added the cla-signed label Oct 14, 2021
@sopel39 sopel39 requested a review from skrzypo987 October 14, 2021 11:38
@sopel39 sopel39 force-pushed the ks/improve_aggregation branch 5 times, most recently from 792a0b1 to 334974d Compare October 14, 2021 20:38
@sopel39
Member Author

sopel39 commented Oct 15, 2021

benchmark results

        TPCH wall time  TPC-DS wall time  TPCH CPU time  TPC-DS CPU time
before       382.06925        1133.92525       25784.70      64565.54750
after        357.58375        1093.73525       25589.85      62462.05225

CPU improvements: tpcds/q51, tpcds/q47, tpcds/q67 and some others

Member

@losipiuk losipiuk left a comment

LGTM. Some minor comments/questions.

@sopel39
Member Author

sopel39 commented Oct 15, 2021

This causes a regression when there is a small number of groups. Need to figure out how to combine the best of both worlds.

@sopel39 sopel39 force-pushed the ks/improve_aggregation branch from 334974d to 8b35b2c Compare October 15, 2021 20:57
@sopel39 sopel39 added the WIP label Oct 15, 2021
@sopel39 sopel39 force-pushed the ks/improve_aggregation branch from 8b35b2c to 28ebbf9 Compare October 15, 2021 21:23
@sopel39
Member Author

sopel39 commented Oct 15, 2021

newer benchmarks:

before:
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score       Error  Units
BenchmarkDecimalAggregation.benchmark                             sum            10    LONG  avgt   10      11,782 ±     0,511  ns/op
BenchmarkDecimalAggregation.benchmark                             sum         10000    LONG  avgt   10      20,122 ±     0,773  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum            10    LONG  avgt   10    1043,483 ±   100,551  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum         10000    LONG  avgt   10  727254,852 ± 21119,807  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum            10    LONG  avgt   10      11,628 ±     0,175  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum         10000    LONG  avgt   10      20,417 ±     0,594  ns/op

after

Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score       Error  Units
BenchmarkDecimalAggregation.benchmark                             sum            10    LONG  avgt   10       5,488 ±     0,466  ns/op
BenchmarkDecimalAggregation.benchmark                             sum         10000    LONG  avgt   10       6,471 ±     0,664  ns/op
# overhead of accumulator initialization for dumping 10 values
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum            10    LONG  avgt   10    1563,146 ±   393,037  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum         10000    LONG  avgt   10  552729,926 ± 23429,821  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum            10    LONG  avgt   10       5,236 ±     0,279  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum         10000    LONG  avgt   10       6,462 ±     0,323  ns/op

@sopel39 sopel39 force-pushed the ks/improve_aggregation branch from 28ebbf9 to d38a999 Compare October 17, 2021 19:18
@sopel39 sopel39 force-pushed the ks/improve_aggregation branch from d38a999 to 7cc3e1c Compare October 18, 2021 11:12
@sopel39 sopel39 removed the WIP label Oct 18, 2021
@sopel39 sopel39 requested a review from losipiuk October 18, 2021 11:12
@sopel39
Member Author

sopel39 commented Oct 18, 2021

new results

        TPCH wall time  TPC-DS wall time  TPCH CPU time  TPC-DS CPU time
before       382.06925        1133.92525        25784.7       64565.5475
after        341.46475         949.67400        24894.6       61884.5610


long intermediateResult = leftHigh + rightHigh + overflow;
long z1 = intermediateResult & (~SIGN_LONG_MASK);
pack(z0, z1, resultNegative, result, resultOffset);
Member

nit: can you do a preparatory refactor so that the argument order of the pack overloads that operate on Slices and on long arrays is the same?

Member Author

wdym?

they are the same:

    public static void pack(long low, long high, boolean negative, Slice result, int resultOffset)
    public static void pack(long low, long high, boolean negative, long[] result, int resultOffset)
    public static void pack(long low, long high, boolean negative, Slice result)

Member

Depends:

private static void pack(Slice decimal, long low, long high, boolean negative)

Member Author

I still don't understand. You want to do what with that pack method?

Member

In this signature:

private static void pack(Slice decimal, long low, long high, boolean negative)

decimal is effectively a result.

So to be in line with the other methods you listed above, decimal should be passed as the last argument.
Of course you cannot just move it there, because you would get a conflict with public static void pack(long low, long high, boolean negative, Slice result).
But this raises the question of why we have two methods. What is the semantic difference? I cannot answer this quickly just by looking at the method. Maybe the naming should be more descriptive (packWithSomething)?

cc: @ksobolew

Member Author

Now I see. The pack that checks for negative zero should probably be named differently.

Contributor

From what I can see these two methods are identical, apart from the check for negative zero I added recently. It's just that one calls setNegativeLong while the other inlines it.

Contributor

I would merge them, personally

Member

If we can have less code without significant performance degradation, I am all for it.
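
To make the thread concrete, a merged pack could look roughly like the sketch below: inputs first, result last (matching the public overloads quoted above), with the negative-zero check folded in. `PackSketch` and `isNegative` are hypothetical names, and the layout (low word first, sign in the top bit of the high word) is an assumption for illustration, not a copy of the real method.

```java
// Hypothetical illustration, not the code from this PR.
final class PackSketch {
    // Assumption: the sign is stored in the top bit of the high word.
    private static final long SIGN_LONG_MASK = 1L << 63;

    // Argument order follows the public overloads: values first, destination last.
    static void pack(long low, long high, boolean negative, long[] result, int resultOffset) {
        // Never store "negative zero": a zero magnitude always gets a clear sign bit.
        boolean zero = low == 0 && high == 0;
        result[resultOffset] = low;
        result[resultOffset + 1] = high | ((negative && !zero) ? SIGN_LONG_MASK : 0L);
    }

    static boolean isNegative(long[] decimal, int offset) {
        return (decimal[offset + 1] & SIGN_LONG_MASK) != 0;
    }
}
```

With a single method there is one place that decides how negative zero is handled, so callers no longer need to know which variant performs the check.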

public static byte[] toByteArray(long value, byte[] result, int offset)
{
// copied from Guava Longs#toByteArray
for (int i = 7; i >= 0; i--) {
Member

Would it make sense to manually unroll this loop? I guess not, but you may want to check.
cc: @skrzypo987

Member Author

Currently, in the context where unscaledDecimalToBigInteger is used (io.trino.operator.aggregation.DecimalAverageAggregation#average), unrolling wouldn't make much difference
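
For reference, the two variants being compared would look roughly like this. It is a sketch: the class name is hypothetical, and the loop body assumes the same big-endian byte order as Guava's Longs#toByteArray with the offset simply added to the index. Whether the unrolled form is actually faster would have to be measured (for example with JMH), since the JIT may unroll a fixed-count loop on its own.

```java
// Hypothetical illustration, not the code from this PR.
final class ToByteArraySketch {
    // Loop form: writes the 8 bytes of value in big-endian order starting at offset.
    static byte[] toByteArrayLoop(long value, byte[] result, int offset) {
        for (int i = 7; i >= 0; i--) {
            result[offset + i] = (byte) (value & 0xffL);
            value >>= 8;
        }
        return result;
    }

    // Manually unrolled form producing the same bytes.
    static byte[] toByteArrayUnrolled(long value, byte[] result, int offset) {
        result[offset] = (byte) (value >> 56);
        result[offset + 1] = (byte) (value >> 48);
        result[offset + 2] = (byte) (value >> 40);
        result[offset + 3] = (byte) (value >> 32);
        result[offset + 4] = (byte) (value >> 24);
        result[offset + 5] = (byte) (value >> 16);
        result[offset + 6] = (byte) (value >> 8);
        result[offset + 7] = (byte) value;
        return result;
    }
}
```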

Member

@losipiuk losipiuk left a comment

LGTM

Account for SingleLongDecimalWithOverflowAndLongState instance size
This improves decimal aggregations performance

BEFORE:
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score       Error  Units
BenchmarkDecimalAggregation.benchmark                             sum            10    LONG  avgt   10      11,782 ±     0,511  ns/op
BenchmarkDecimalAggregation.benchmark                             sum         10000    LONG  avgt   10      20,122 ±     0,773  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum         10000    LONG  avgt   10  727254,852 ± 21119,807  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum            10    LONG  avgt   10      11,628 ±     0,175  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum         10000    LONG  avgt   10      20,417 ±     0,594  ns/op

AFTER
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score       Error  Units
BenchmarkDecimalAggregation.benchmark                             sum            10    LONG  avgt   10       5,488 ±     0,466  ns/op
BenchmarkDecimalAggregation.benchmark                             sum         10000    LONG  avgt   10       6,471 ±     0,664  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                sum         10000    LONG  avgt   10  552729,926 ± 23429,821  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum            10    LONG  avgt   10       5,236 ±     0,279  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         sum         10000    LONG  avgt   10       6,462 ±     0,323  ns/op
fixup
@sopel39 sopel39 force-pushed the ks/improve_aggregation branch from 7cc3e1c to b524793 Compare October 18, 2021 11:57
@sopel39 sopel39 merged commit 18cb242 into trinodb:master Oct 18, 2021
@sopel39 sopel39 deleted the ks/improve_aggregation branch October 18, 2021 13:43
@sopel39 sopel39 mentioned this pull request Oct 18, 2021
12 tasks
@github-actions github-actions bot added this to the 364 milestone Oct 18, 2021
@sopel39
Member Author

sopel39 commented Oct 19, 2021

Decimal avg results:

BEFORE
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt       Score      Error  Units
BenchmarkDecimalAggregation.benchmark                             avg            10    LONG  avgt   10      16,118 ±    0,495  ns/op
BenchmarkDecimalAggregation.benchmark                             avg          1000    LONG  avgt   10      17,451 ±    0,387  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                avg            10    LONG  avgt   10    2462,136 ±   34,402  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                avg          1000    LONG  avgt   10  134229,269 ± 5554,034  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         avg            10    LONG  avgt   10      14,911 ±    0,284  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         avg          1000    LONG  avgt   10      20,239 ±    0,809  ns/op

AFTER
Benchmark                                                  (function)  (groupCount)  (type)  Mode  Cnt      Score      Error  Units
BenchmarkDecimalAggregation.benchmark                             avg            10    LONG  avgt   10      6,316 ±    0,112  ns/op
BenchmarkDecimalAggregation.benchmark                             avg          1000    LONG  avgt   10      6,734 ±    0,260  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                avg            10    LONG  avgt   10   2506,539 ±  128,118  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateFinal                avg          1000    LONG  avgt   10  99831,602 ± 3184,846  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         avg            10    LONG  avgt   10      6,423 ±    0,113  ns/op
BenchmarkDecimalAggregation.benchmarkEvaluateIntermediate         avg          1000    LONG  avgt   10      6,501 ±    0,129  ns/op
