-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce overheads related to DictionaryBlock#getSizeInBytes() #10970
Reduce overheads related to DictionaryBlock#getSizeInBytes() #10970
Conversation
d2878b6
to
533f5f0
Compare
Benchmark results Before
After
|
805d041
to
d6d3860
Compare
1c8e4cd
to
d56413d
Compare
core/trino-spi/src/main/java/io/trino/spi/block/AbstractMapBlock.java
Outdated
Show resolved
Hide resolved
9064e23
to
105467f
Compare
@skrzypo987 think you'll have a chance to take a full review pass on this PR sometime soon? |
I'll take a look today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems ok from my perspective. If you want to get it merged soon ping some other maintainer since @sopel39 is out this week.
@dain maybe you have a minute to take a look and decide whether to merge it while Karol is out? |
@@ -103,6 +106,12 @@ public Block getRegion(int position, int length) | |||
getRawElementBlock()); | |||
} | |||
|
|||
@Override | |||
public OptionalInt fixedSizeInBytesPerPosition() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe just getSizeInBytesPerPosition
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the "fixed" adds extra information here and makes it more clear (to me) that this method is only for blocks where the size in bytes is not dependent on the value at any given position, but that's just my opinion- if you feel strongly then I'm willing to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantics is was kinda natural for me even with shorter name based on the fact that single number is returned and is guarded by Optional
to handle case when we cannot compute it. But I do not feel strongly. it may stay.
} | ||
else if (rawElementBlock instanceof RunLengthEncodedBlock) { | ||
// RLE blocks don't have fixed size per position, but accept null for the positions array | ||
elementsSizeInBytes = rawElementBlock.getPositionsSizeInBytes(null, countSelectedPositionsFromOffsets(positions, offsets, offsetBase)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are exploiting the fact that rawElementBlock is of specific class do we need to compute countSelectedPositionsFromOffsets
. This arg is ignored by RunLengthEncodedBlock.getPositionsSizeInBytes
anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. I left this in because I was planning on making RunLengthEncodedBlock#getPositionsSizeInBytes
do something like return selectedPositions > 0 ? value.getSizeInBytes() : 0;
but that broke a lot of tests and it didn't seem worth the headache- so I instead added tests that assert the opposite.
It doesn't quite feel right for AbstractArrayBlock
to special case the handling of RunLengthEncodedBlock
by calling RunLengthEncodedBlock#getSizeInBytes
, but it certainly would reduce the overhead involved.
I'm open to doing whatever you think makes the most sense here.
core/trino-spi/src/main/java/io/trino/spi/block/DictionaryBlock.java
Outdated
Show resolved
Hide resolved
dcbbe4d
to
d992eb0
Compare
Adds a positionCount argument to Block#getPositionSizeInBytes and adds a new method: Block#fixedSizeInBytesPerPosition() to reduce the overhead associated with calculating DictionaryBlock size in bytes when the underlying dictionary size in bytes can be calculated without specific information about which positions are referenced.
d992eb0
to
79ea564
Compare
More complete benchmark results were easier to produce in the Presto equivalent PR, and links to those results can be found in my comment there. |
@pettyjamesm The JMH benchmarks look nice! I have a question still. Did you have a chance to verify how much the change impacts end-to-end query execution? how much are we gaining in CPU/wall time over some verbose benchmark (e.g. tpc-h/ds), or over your production queries? |
I haven't had a chance to put these changes into our production environment, and in TPCDS / TPCH queries I wouldn't expect much of a measurable improvement at the macro level. The biggest gains are going to be in queries that make heavy use of |
Nice. Thanks. |
Description
Reduces the overhead of calculating
DictionaryBlock#getSizeInBytes()
by making two changes to theBlock
interface definition:selectedPositionCount
toBlock#getPositionsSizeInBytes(boolean[] positions, int selectedPositionCount)
. Callers to the previous methodgetPositionsSizeInBytes(boolean[] positions)
could trivially calculate and pass the total count of the positions selected, and failing to do so required the called blocks to count the number oftrue
values in the positions array- sometimes repeatedly in the case of eg:RowBlock
.Block#fixedSizeInBytesPerPosition()
that allows blocks to describe whether theirgetSizeInBytes()
can be calculated directly from the position count, ie: the size is not variable based on the specific positions selectedAlso includes a change to eagerly populate the unique position count and size in bytes on the results of
DictionaryBlock#getPositions
when the result is not compact, since the selected positions array may have already been created and would otherwise have to be reconstructed in a subsequent call toDictionaryBlock#getSizeInBytes()
.General information
Is this change a fix, improvement, new feature, refactoring, or other?
This is an improvement that reduces the overhead for common scenarios involving
DictionaryBlock#getSizeInBytes()
Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)
This change affects the SPI interface of blocks and refactors the implementations of
DictionaryBlock#getPositions
andBlock#getPositionsSizeInBytes
to leverage the new information available.How would you describe this change to a non-technical end user or system administrator?
Specifically describing this change to a non-technical audience should not be necessary
Related issues, pull requests, and links
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: