Fix long decimal partial aggregation below join #21083
According to @lukasz-stec's analysis, this may fix a bug related to partial aggregations, but I don't know how to reproduce that bug.
lgtm % comments
Slice slice = variableWidthBlock.getRawSlice();
int sliceOffset = variableWidthBlock.getRawSliceOffset(index);
int sliceLength = variableWidthBlock.getSliceLength(index);
Slice slice = VARBINARY.getSlice(block, index);
This allocates a Slice instance per position; @raunaqmorarka changed that in #18868.
If we want to stay closer to the original performance, you could access the raw slice through the underlying value block:
VariableWidthBlock valueBlock = (VariableWidthBlock) block.getUnderlyingValueBlock();
Slice slice = valueBlock.getRawSlice();
instead of VariableWidthBlock variableWidthBlock = (VariableWidthBlock) block;
and translate the index accordingly:
index = block.getUnderlyingValuePosition(index);
`block.getUnderlyingValueBlock` and `block.getUnderlyingValuePosition` can still be virtual calls, so there is a potential regression here anyway.
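To make the raw-slice suggestion concrete, here is a self-contained sketch. `Block`, `VariableWidthBlock`, `DictionaryBlock`, and `LongDecimalReader` are hypothetical, heavily simplified stand-ins (not the Trino SPI; only the methods named in this comment are modeled), and the 16-byte little-endian (high, low) layout is an assumption for illustration. The read goes through the underlying value block and position, so it works for a flat block and a dictionary-wrapped one alike, and it copies the two longs straight out of the shared array instead of allocating a per-position view.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for the Trino Block SPI: only the two unwrapping methods are modeled.
interface Block {
    Block getUnderlyingValueBlock();

    int getUnderlyingValuePosition(int position);
}

// Flat block: all values share one raw byte array and are addressed by offsets.
final class VariableWidthBlock implements Block {
    private final byte[] rawSlice;
    private final int[] offsets; // value i spans [offsets[i], offsets[i + 1])

    VariableWidthBlock(byte[] rawSlice, int[] offsets) {
        this.rawSlice = rawSlice;
        this.offsets = offsets;
    }

    // Test helper: packs (high, low) pairs as 16-byte little-endian values (assumed layout).
    static VariableWidthBlock ofLongDecimals(long... highLowPairs) {
        int count = highLowPairs.length / 2;
        ByteBuffer buffer = ByteBuffer.allocate(count * 2 * Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        int[] offsets = new int[count + 1];
        for (int i = 0; i < count; i++) {
            buffer.putLong(highLowPairs[2 * i]).putLong(highLowPairs[2 * i + 1]);
            offsets[i + 1] = (i + 1) * 2 * Long.BYTES;
        }
        return new VariableWidthBlock(buffer.array(), offsets);
    }

    byte[] getRawSlice() { return rawSlice; }

    int getRawSliceOffset(int position) { return offsets[position]; }

    @Override
    public Block getUnderlyingValueBlock() { return this; }

    @Override
    public int getUnderlyingValuePosition(int position) { return position; }
}

// Wrapper that maps each of its positions to a position in another block.
final class DictionaryBlock implements Block {
    private final Block dictionary;
    private final int[] ids;

    DictionaryBlock(Block dictionary, int[] ids) {
        this.dictionary = dictionary;
        this.ids = ids;
    }

    @Override
    public Block getUnderlyingValueBlock() { return dictionary.getUnderlyingValueBlock(); }

    @Override
    public int getUnderlyingValuePosition(int position) {
        return dictionary.getUnderlyingValuePosition(ids[position]);
    }
}

final class LongDecimalReader {
    // Copies {high, low} for one position into out, reading directly from the shared raw array;
    // no per-position view object is allocated.
    static void read(Block block, int index, long[] out) {
        VariableWidthBlock valueBlock = (VariableWidthBlock) block.getUnderlyingValueBlock();
        int offset = valueBlock.getRawSliceOffset(block.getUnderlyingValuePosition(index));
        byte[] raw = valueBlock.getRawSlice();
        out[0] = getLongLittleEndian(raw, offset);
        out[1] = getLongLittleEndian(raw, offset + Long.BYTES);
    }

    // Assembles a little-endian long: bytes[offset] is the least significant byte.
    private static long getLongLittleEndian(byte[] bytes, int offset) {
        long value = 0;
        for (int i = Long.BYTES - 1; i >= 0; i--) {
            value = (value << 8) | (bytes[offset + i] & 0xFFL);
        }
        return value;
    }
}
```

Passing a reusable `long[]` keeps the read path allocation-free; in real code the destination would be the aggregation state itself.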
I like this solution. I also would not worry about the object allocation.
There was a pretty clear performance impact from object allocation here: #18868 (comment)
We should at least benchmark it before deciding either way.
Honestly, if allocating a single Slice (which is basically a record) takes 10% of CPU, I'd suspect the benchmark is broken, given how much other work happens in an aggregation query.
Anyway, if you actually care, my initial suggestion was:
index = block.getUnderlyingValuePosition(index);
block = block.getUnderlyingValueBlock();
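The order of these two assignments matters: the position has to be translated while `block` still refers to the wrapper, because the underlying value block maps every position to itself. A minimal sketch with hypothetical stand-in types (not the Trino SPI) contrasting the suggested order with the swapped one:

```java
// Hypothetical stand-ins for the Trino Block SPI, reduced to the two unwrapping methods.
interface Block {
    Block getUnderlyingValueBlock();

    int getUnderlyingValuePosition(int position);
}

// Flat value block: every position maps to itself.
final class LongArrayBlock implements Block {
    private final long[] values;

    LongArrayBlock(long... values) {
        this.values = values;
    }

    long getLong(int position) {
        return values[position];
    }

    @Override
    public Block getUnderlyingValueBlock() {
        return this;
    }

    @Override
    public int getUnderlyingValuePosition(int position) {
        return position;
    }
}

// Dictionary wrapper: position i refers to position ids[i] of the dictionary.
final class DictionaryBlock implements Block {
    private final Block dictionary;
    private final int[] ids;

    DictionaryBlock(Block dictionary, int[] ids) {
        this.dictionary = dictionary;
        this.ids = ids;
    }

    @Override
    public Block getUnderlyingValueBlock() {
        return dictionary.getUnderlyingValueBlock();
    }

    @Override
    public int getUnderlyingValuePosition(int position) {
        return dictionary.getUnderlyingValuePosition(ids[position]);
    }
}

final class Unwrap {
    // Suggested order: translate the position first, then replace the block.
    static long read(Block block, int index) {
        index = block.getUnderlyingValuePosition(index);
        block = block.getUnderlyingValueBlock();
        return ((LongArrayBlock) block).getLong(index);
    }

    // Swapped order (wrong): once block is the value block, the translation is the identity,
    // so the dictionary ids are silently ignored.
    static long readSwapped(Block block, int index) {
        block = block.getUnderlyingValueBlock();
        index = block.getUnderlyingValuePosition(index);
        return ((LongArrayBlock) block).getLong(index);
    }
}
```

With values {10, 20, 30} wrapped by a dictionary with ids {2, 0}, `read(dict, 0)` returns 30, while `readSwapped(dict, 0)` returns 10.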
> `block.getUnderlyingValueBlock` and `block.getUnderlyingValuePosition` can still be virtual calls, so there is a potential regression here anyway.

Good point.
int sliceOffset = variableWidthBlock.getRawSliceOffset(index);
int sliceLength = variableWidthBlock.getSliceLength(index);
Slice slice = VARBINARY.getSlice(block, index);
The same comment about Slice allocation applies here.
Slice slice = variableWidthBlock.getRawSlice();
int sliceOffset = variableWidthBlock.getRawSliceOffset(index);
int sliceLength = variableWidthBlock.getSliceLength(index);
Slice slice = VARBINARY.getSlice(block, index);
As this may be hard to reproduce with a query, I would at least add a unit test verifying that deserialize also works for dictionary and RLE blocks.
Yeah, I don't know what the contract is for state serializers; that's why I didn't add any test.
Do we have an example I should follow?
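A rough sketch of the kind of test suggested above: deserialize the same data from a flat block, a dictionary block, and an RLE block, and check that every position yields the expected state. All types here, including `deserialize`, are hypothetical stand-ins rather than Trino's SPI or its test utilities; the deserializer unwraps the block before reading, as the fix does.

```java
import java.util.Arrays;

// Hypothetical stand-ins for the Trino Block SPI (not the real interfaces).
interface Block {
    int getPositionCount();

    Block getUnderlyingValueBlock();

    int getUnderlyingValuePosition(int position);
}

// Flat block of long decimals stored as (high, low) pairs.
final class LongDecimalBlock implements Block {
    private final long[] highLow;

    LongDecimalBlock(long... highLow) { this.highLow = highLow; }

    long high(int position) { return highLow[2 * position]; }

    long low(int position) { return highLow[2 * position + 1]; }

    @Override public int getPositionCount() { return highLow.length / 2; }

    @Override public Block getUnderlyingValueBlock() { return this; }

    @Override public int getUnderlyingValuePosition(int position) { return position; }
}

// Dictionary wrapper: position i refers to position ids[i] of the dictionary.
final class DictionaryBlock implements Block {
    private final Block dictionary;
    private final int[] ids;

    DictionaryBlock(Block dictionary, int[] ids) { this.dictionary = dictionary; this.ids = ids; }

    @Override public int getPositionCount() { return ids.length; }

    @Override public Block getUnderlyingValueBlock() { return dictionary.getUnderlyingValueBlock(); }

    @Override public int getUnderlyingValuePosition(int position) { return dictionary.getUnderlyingValuePosition(ids[position]); }
}

// Run-length encoded: a single-position block repeated positionCount times.
final class RunLengthEncodedBlock implements Block {
    private final Block value;
    private final int positionCount;

    RunLengthEncodedBlock(Block value, int positionCount) { this.value = value; this.positionCount = positionCount; }

    @Override public int getPositionCount() { return positionCount; }

    @Override public Block getUnderlyingValueBlock() { return value.getUnderlyingValueBlock(); }

    @Override public int getUnderlyingValuePosition(int position) { return value.getUnderlyingValuePosition(0); }
}

final class DeserializeEncodingsCheck {
    // Hypothetical deserializer: unwraps the block before reading, like the fixed code.
    static void deserialize(Block block, int index, long[] state) {
        int position = block.getUnderlyingValuePosition(index);
        LongDecimalBlock values = (LongDecimalBlock) block.getUnderlyingValueBlock();
        state[0] = values.high(position);
        state[1] = values.low(position);
    }

    // Deserializes every position and fails if the flattened (high, low) results differ from expected.
    static void assertDeserializes(Block block, long... expected) {
        long[] actual = new long[2 * block.getPositionCount()];
        long[] state = new long[2];
        for (int i = 0; i < block.getPositionCount(); i++) {
            deserialize(block, i, state);
            actual[2 * i] = state[0];
            actual[2 * i + 1] = state[1];
        }
        if (!Arrays.equals(actual, expected)) {
            throw new AssertionError("expected " + Arrays.toString(expected) + " but got " + Arrays.toString(actual));
        }
    }
}
```

Wrapping a one-position dictionary inside the RLE block also exercises nested unwrapping, which a plain cast to the flat block type would not survive.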
There is also:
- LongDecimalWithOverflowStateSerializer
- @lukasz-stec's comments

An alternative approach is to handle dictionaries in StateCompiler#generateDeserialize, as it would avoid polymorphic calls and extra indirection.
@lukasz-stec can you help me with testing this?
@findepi I don't know how to reproduce this using a query. It may be impossible in the current state. It requires the partial and final (or intermediate) aggregations to be separated by an operation that can produce dictionary blocks for the aggregation state. Join is one such operation. Here are unit tests that cover this, though.
@sopel39 this is the first class I fixed. Is there some other state serializer class I missed?
I was able to reproduce it: #21099
Thanks!
Force-pushed from 6860d9f to 112788b.
I added a unit test from @lukasz-stec, thank you. There is an open discussion about how to write the code in the most performant way, and there seems to be disagreement about what it should look like. I propose we merge the code, since it's an obvious improvement over the current (broken) state. PTAL
@findepi let's use @dain's approach:
@lukasz-stec had concerns. @lukasz-stec, would you be OK with that approach?
Co-authored-by: Dain Sundstrom <[email protected]>
Co-authored-by: Lukasz Stec <[email protected]>
Yes, my only concern was a potential virtual call, and (a) it is not that bad, and (b) it will happen only in rare conditions, if at all, as it needs all three block types (value, dictionary, RLE) to show up at the same call site.
- use `Long.BYTES * n` offsets as in the serializer code (easier to correlate the two)
- use the same control flow pattern in the two similar classes
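A sketch of why matching `Long.BYTES * n` offsets help: when the writer and the reader spell out the same offsets, each field can be checked against its counterpart at a glance. The three-long layout (high, low, overflow) and the class names are hypothetical; this is not the actual LongDecimalWithOverflowStateSerializer format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical state: a 128-bit unscaled value split into two longs plus an overflow counter.
final class DecimalWithOverflowState {
    long high;
    long low;
    long overflow;
}

// Hypothetical codec: serialize and deserialize use the same Long.BYTES * n offsets,
// so field n is visibly written and read at the same position.
final class DecimalWithOverflowCodec {
    static final int SIZE = Long.BYTES * 3;

    static byte[] serialize(DecimalWithOverflowState state) {
        ByteBuffer buffer = ByteBuffer.allocate(SIZE).order(ByteOrder.LITTLE_ENDIAN);
        buffer.putLong(Long.BYTES * 0, state.high);
        buffer.putLong(Long.BYTES * 1, state.low);
        buffer.putLong(Long.BYTES * 2, state.overflow);
        return buffer.array();
    }

    // offset is where this entry starts inside a larger byte array.
    static void deserialize(byte[] bytes, int offset, DecimalWithOverflowState state) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        state.high = buffer.getLong(offset + Long.BYTES * 0);
        state.low = buffer.getLong(offset + Long.BYTES * 1);
        state.overflow = buffer.getLong(offset + Long.BYTES * 2);
    }
}
```

Writing `Long.BYTES * 0` instead of `0` is redundant for the compiler but keeps the field index explicit on both sides.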
Force-pushed from 112788b to 7043e20.
Updated, PTAL.
Awesome, thanks!
Since this fixes an issue, I added a release notes entry.
Fixes #21099