-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document domain-compaction-threshold description #12965
Changes from 1 commit
e5b7891
106c2c8
8168f5d
252c72c
6a05b47
9f88671
6a38289
043d7c2
829881f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -136,10 +136,10 @@ connector. | |
- Description | ||
- Default | ||
* - ``delta.domain-compaction-threshold`` | ||
- Sets the number of transactions to act as threshold. Once reached the | ||
connector initiates compaction of the underlying files and the delta | ||
files. A higher compaction threshold means reading less data from the | ||
underlying data source, but a higher memory and network consumption. | ||
- Sets the number of transactions to act as a threshold. After reaching | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have no idea what "transactions" this refers to. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't either. Is there someone who can clarify what that represents, or if there is some other term we should be using? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Per the discussion above it seems like this is actually talking about a predicate count |
||
the threshold, the connector initiates compacting a large IN or OR | ||
clause into a min-max range predicate for pushdown into an ORC or | ||
Parquet reader. | ||
tlblessing marked this conversation as resolved.
Show resolved
Hide resolved
tlblessing marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- 100 | ||
* - ``delta.max-outstanding-splits`` | ||
- The target number of buffered splits for each table scan in a query, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"as a threshold during query planning."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This description seems wrong - IIRC the compaction threshold basically says that if you have a large predicate (WHERE or IN, doesn't apply to
OR
) then it can be simplified into smaller predicates (the implementation detail doesn't matter - may vary across connectors) but only if the number of predicates is < domain-compaction-threshold.e.g.
IN (1, 2, 3, 4)
withdomain_compaction_threshold=2
would not get optimised. But withdomain-compaction-threshold=10
it would get optimised.cc: @findepi please correct if I understand wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for user facing docs I as a user would be happy with:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we must not be specific about that because e.g. sometimes OR can be replaced with IN.
The opposite.
domain_compaction_threshold=2
=> predicate simplifies to equivalent ofBETWEEN 1 and 4
with
domain_compaction_threshold=10
it's kept as is.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks updated the "Also for user facing docs I as a user would be happy with:" to match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested rewording of your proposed wording, for style. Should start with description then move onto justification:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I came up with based on everyone's suggestions:
Minimum size of query predicates above which Trino starts compaction. Some databases perform poorly when a large list of predicates is pushed down to the data source. For optimization in that situation, Trino can compact the large predicates. When necessary, adjust the threshold to ensure a balance between performance and pushdown.
Please let me know if this is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The confusing thing, to me at least, is that it's not the minimum size for compaction but a maximum number of predicates that are compacted. Per @hashhar 's example,
IN (1, 2, 3, 4)
is not compacted if this property is set to 2, but it is if the property is set to 10. If that's the case, then the word "threshold" is unfortunately very misleading and we need to word around it somehow.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's confusing to me too based on the examples and sizes discussed. The default value is 100. @findepi said it was the opposite behavior actually than what @hashhar 's statement. Hashar's description as a minimum threshold makes some sense to me, and threshold is part of the config property's name. I'll wait for another round of reviews. Thank you all for your guidance.