Fix long decimal partial aggregation below join #21083

Merged

Conversation

@findepi (Member) commented Mar 14, 2024

Fixes #21099

@findepi requested review from dain and sopel39 on March 14, 2024 15:49
@cla-bot bot added the cla-signed label on Mar 14, 2024
@findepi (Member Author) commented Mar 14, 2024

According to @lukasz-stec's analysis, this may fix a bug related to partial aggregations, but I don't know how to reproduce that bug.

@lukasz-stec (Member) left a review comment:

lgtm % comments

Review code context:

-    Slice slice = variableWidthBlock.getRawSlice();
-    int sliceOffset = variableWidthBlock.getRawSliceOffset(index);
-    int sliceLength = variableWidthBlock.getSliceLength(index);
+    Slice slice = VARBINARY.getSlice(block, index);
Member:
This allocates a Slice instance per position, and @raunaqmorarka changed that via #18868.
If we want to be closer to the original performance, you could get access to the raw slice as

VariableWidthBlock valueBlock = (VariableWidthBlock) block.getUnderlyingValueBlock();
Slice slice = valueBlock.getRawSlice();

instead of VariableWidthBlock variableWidthBlock = (VariableWidthBlock) block;, plus get the right index:

index = block.getUnderlyingValuePosition(index)

block.getUnderlyingValueBlock and block.getUnderlyingValuePosition can still be virtual calls, so there is a potential regression here anyway.
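
For illustration, a minimal sketch that puts the two snippets above together (the class and method names are made up, and the final read is illustrative; the actual byte layout is whatever the serializer wrote):

import io.airlift.slice.Slice;
import io.trino.spi.block.Block;
import io.trino.spi.block.VariableWidthBlock;

final class RawSliceReadSketch
{
    private RawSliceReadSketch() {}

    // Unwrap dictionary/RLE wrappers once, then read through the raw slice of the
    // underlying VariableWidthBlock, so no per-position Slice is allocated.
    static long readFirstLong(Block block, int index)
    {
        int position = block.getUnderlyingValuePosition(index);
        VariableWidthBlock valueBlock = (VariableWidthBlock) block.getUnderlyingValueBlock();
        Slice raw = valueBlock.getRawSlice();
        int offset = valueBlock.getRawSliceOffset(position);
        int length = valueBlock.getSliceLength(position); // length of this position's value, if bounds checks are wanted
        return raw.getLong(offset); // illustrative: read the first serialized long
    }
}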

Member:

I like this solution. I also would not worry about the object allocation.

Member:

There was a pretty clear performance impact of object allocation here: #18868 (comment).
We should at least benchmark it before deciding either way.

Member:

Honestly, if allocating a single Slice (which is basically a record) is 10% of CPU, I'd expect that benchmark is broken, given how much other stuff is happening in an aggregation query.

Member:

Anyway, if you actually care, my initial suggestion was:

index = block.getUnderlyingValuePosition(index);
block = block.getUnderlyingValueBlock();

@findepi (Member Author):

> block.getUnderlyingValueBlock and block.getUnderlyingValuePosition can still be virtual calls so there is potential regression here anyway.

good point.

Review code context:

-    int sliceOffset = variableWidthBlock.getRawSliceOffset(index);
-    int sliceLength = variableWidthBlock.getSliceLength(index);
+    Slice slice = VARBINARY.getSlice(block, index);
Member:

The same comment about Slice allocation applies here.

Review code context:

-    Slice slice = variableWidthBlock.getRawSlice();
-    int sliceOffset = variableWidthBlock.getRawSliceOffset(index);
-    int sliceLength = variableWidthBlock.getSliceLength(index);
+    Slice slice = VARBINARY.getSlice(block, index);
Member:

As this may be hard to reproduce with a query, I would at least add a unit test that deserialization also works for Dictionary and RLE blocks.
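
For reference, a minimal sketch of what such a test could check; this is not the patch that was attached later, and it uses VARBINARY as a stand-in for a serialized state, exercising only the read-through-wrapper mechanism:

import io.airlift.slice.Slice;
import io.trino.spi.block.Block;
import io.trino.spi.block.BlockBuilder;
import io.trino.spi.block.DictionaryBlock;
import io.trino.spi.block.RunLengthEncodedBlock;
import org.junit.jupiter.api.Test;

import static io.airlift.slice.Slices.utf8Slice;
import static io.trino.spi.type.VarbinaryType.VARBINARY;
import static org.assertj.core.api.Assertions.assertThat;

class TestReadThroughWrappedBlocks
{
    @Test
    void testDictionaryAndRleBlocks()
    {
        // Two-position VARBINARY block standing in for serialized aggregation states
        BlockBuilder builder = VARBINARY.createBlockBuilder(null, 2);
        VARBINARY.writeSlice(builder, utf8Slice("first state"));
        VARBINARY.writeSlice(builder, utf8Slice("second state"));
        Block values = builder.build();

        // Dictionary block whose ids point back into the underlying values
        Block dictionary = DictionaryBlock.create(3, values, new int[] {1, 0, 1});
        assertThat(VARBINARY.getSlice(dictionary, 0)).isEqualTo(VARBINARY.getSlice(values, 1));
        assertThat(VARBINARY.getSlice(dictionary, 1)).isEqualTo(VARBINARY.getSlice(values, 0));

        // RLE block repeating a single underlying value three times
        Block rle = RunLengthEncodedBlock.create(values.getRegion(0, 1), 3);
        assertThat(VARBINARY.getSlice(rle, 2)).isEqualTo(VARBINARY.getSlice(values, 0));
    }
}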

@findepi (Member Author):

Yeah, I don't know what the contract is for state serializers; that's why I didn't add any test.

Do we have an example I should follow?

Review code context:

-    Slice slice = variableWidthBlock.getRawSlice();
-    int sliceOffset = variableWidthBlock.getRawSliceOffset(index);
-    int sliceLength = variableWidthBlock.getSliceLength(index);
+    Slice slice = VARBINARY.getSlice(block, index);
Member:

I like this solution. I also would not worry about the object allocation.

@sopel39 (Member) left a review comment:
There is also:

  • LongDecimalWithOverflowStateSerializer
  • @lukasz-stec comments

An alternative approach is to handle dictionaries in StateCompiler#generateDeserialize, as that would prevent polymorphic calls and extra indirection.
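
As a rough illustration of that alternative, a plain-Java sketch of the wrapper resolution a dictionary-aware generated deserialize could perform before the type-specific read (the real StateCompiler emits bytecode; the class and record names here are made up):

import io.trino.spi.block.Block;
import io.trino.spi.block.DictionaryBlock;
import io.trino.spi.block.RunLengthEncodedBlock;

final class WrapperResolutionSketch
{
    private WrapperResolutionSketch() {}

    record Resolved(Block block, int position) {}

    // Resolve dictionary/RLE wrappers once per row, so the subsequent read path
    // can work on the underlying value block without polymorphic calls.
    static Resolved resolve(Block block, int position)
    {
        if (block instanceof RunLengthEncodedBlock rle) {
            return new Resolved(rle.getValue(), 0);
        }
        if (block instanceof DictionaryBlock dictionary) {
            return new Resolved(dictionary.getDictionary(), dictionary.getId(position));
        }
        return new Resolved(block, position);
    }
}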

@findepi (Member Author) commented Mar 14, 2024

@lukasz-stec can you help me with testing this?
It would especially be good to have a query-based regression test that uncovers the necessity for state serializers to handle dictionary blocks.

@lukasz-stec (Member):

> @lukasz-stec can you help me with testing this? It would especially be good to have a query-based regression test that uncovers the necessity for state serializers to handle dict blocks.

@findepi I don't know how to reproduce this with a query. It may be impossible in the current state. It requires the partial and the final (or intermediate) aggregation to be separated by an operation that can produce dictionary blocks for the aggregation state; a join is one such potential operation.

Here are unit tests that cover this though.
Test_dict_and_rle_blocks_deserialization.patch

@findepi changed the title from "Long deserializers code cleanup" to "Fix long decimal partial aggregation below join" on Mar 15, 2024
@findepi (Member Author) commented Mar 15, 2024

> There is also:
>
>   • LongDecimalWithOverflowStateSerializer

@sopel39 this is the first class I fixed. Is there some other state serializer class I missed?
I searched for (VariableWidthBlock) in the remaining state serializers, but didn't find any.

@findepi (Member Author) commented Mar 15, 2024

> I don't know how to reproduce this with a query. It may be impossible in the current state.

I was able to: #21099

@findepi (Member Author) commented Mar 15, 2024

> Here are unit tests that cover this though.
> Test_dict_and_rle_blocks_deserialization.patch

thanks!

@findepi force-pushed the findepi/long-deserializers-code-cleanup-f9dbc2 branch from 6860d9f to 112788b on March 15, 2024 09:07
@findepi (Member Author) commented Mar 15, 2024

I added the unit test from @lukasz-stec, thank you.
I added a query-based regression test (from #21099 (comment)).

There is an open discussion about how to write this code in the most performant way, and there seems to be disagreement about how it should look. I propose we merge the code, since it is an obvious improvement over the current (broken) state.
Alternatively, we do not merge this code, but then I would ask you, @sopel39 @raunaqmorarka @lukasz-stec, to take it over.

PTAL

@sopel39 (Member) commented Mar 15, 2024

@findepi let's use @dain's approach (#21083 (comment)) if you don't plan to handle dictionaries in StateCompiler#generateDeserialize.

@findepi (Member Author) commented Mar 15, 2024

@lukasz-stec had concerns. @lukasz-stec will you be OK with that approach?

Co-authored-by: Dain Sundstrom <[email protected]>
Co-authored-by: Lukasz Stec <[email protected]>
@lukasz-stec (Member):

> @lukasz-stec had concerns. @lukasz-stec will you be OK with that approach?

Yes, the only concern was a potential virtual call, and (a) it is not that bad, and (b) it will happen only in rare conditions, if at all, as it needs all three block types to be present (value, dictionary, RLE).

- use `Long.BYTES * n` offsets as in the serializer code (easier to
  correlate the two)
- use same control flow pattern in the two similar classes
@findepi force-pushed the findepi/long-deserializers-code-cleanup-f9dbc2 branch from 112788b to 7043e20 on March 15, 2024 09:39
@findepi (Member Author) commented Mar 15, 2024

updated, PTAL

@sopel39 (Member) left a review comment:

awesome! thanks

@findepi merged commit e035562 into trinodb:master on Mar 15, 2024
95 checks passed
@findepi deleted the findepi/long-deserializers-code-cleanup-f9dbc2 branch on March 15, 2024 13:00
@github-actions bot added this to the 443 milestone on Mar 15, 2024
@mosabua (Member) commented Mar 21, 2024

Since this fixes an issue, I added a release notes entry.

Development

Successfully merging this pull request may close these issues.

Partial aggregation over long decimals fails if pushed below join
6 participants