Document domain-compaction-threshold description #12965
Conversation
Thanks for improving the documentation. Please avoid posting internal Jira links in this repository.
Some suggestions to help simplify, if I'm understanding the property correctly.
connector initiates compaction of the underlying files and the delta
files. A higher compaction threshold means reading less data from the
underlying data source, but a higher memory and network consumption.
- Sets the number of transactions to act as a threshold. After reaching
"as a threshold during query planning."
This description seems wrong. IIRC the compaction threshold basically says that if you have a large predicate (WHERE or IN; doesn't apply to OR), then it can be simplified into smaller predicates (the implementation detail doesn't matter and may vary across connectors), but only if the number of predicates is < domain-compaction-threshold.
e.g. IN (1, 2, 3, 4) with domain_compaction_threshold=2 would not get optimised, but with domain-compaction-threshold=10 it would get optimised.
cc: @findepi please correct if I understand wrong.
Also, for user-facing docs, I as a user would be happy with:
Some databases perform poorly when a large list of predicates is pushed down to the data source. To optimise such situations, Trino can compact the large predicates. To ensure a balance of performance and pushdown, the minimum size of predicates above which Trino will start compaction can be controlled using domain-compaction-threshold.
"if you have a large predicate (WHERE or IN, doesn't apply to OR)"
We must not be specific about that, because e.g. sometimes OR can be replaced with IN.
"e.g. IN (1, 2, 3, 4) with domain_compaction_threshold=2 would not get optimised. But with domain-compaction-threshold=10 it would get optimised."
The opposite: with domain_compaction_threshold=2 the predicate simplifies to the equivalent of BETWEEN 1 AND 4; with domain_compaction_threshold=10 it's kept as is.
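The corrected semantics can be illustrated with a small sketch. This is a hypothetical simplification, not Trino's actual implementation (the function and tuple shapes here are invented for illustration): a discrete value set is collapsed to its min/max range once it grows past the threshold, and kept as-is otherwise.

```python
# Illustrative sketch of domain compaction (NOT Trino's real code;
# names and return shapes are made up for this example).

def compact_domain(values, domain_compaction_threshold):
    """Return the predicate form that would be pushed to the data source."""
    values = sorted(values)
    if len(values) > domain_compaction_threshold:
        # Too many discrete values: simplify to a single range predicate,
        # roughly equivalent to BETWEEN min AND max.
        return ("range", values[0], values[-1])
    # Small enough: keep the exact IN-list.
    return ("in", values)

# IN (1, 2, 3, 4) with threshold 2 simplifies to BETWEEN 1 AND 4:
print(compact_domain([1, 2, 3, 4], 2))   # ('range', 1, 4)
# With threshold 10 the IN-list is kept as is:
print(compact_domain([1, 2, 3, 4], 10))  # ('in', [1, 2, 3, 4])
```

This matches the correction above: a smaller threshold triggers compaction sooner, it does not prevent it.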
Thanks, updated the "Also for user facing docs I as a user would be happy with:" wording to match.
Suggested rewording of your proposed wording, for style. It should start with the description, then move on to the justification:
Maximum number of query predicates that Trino compacts for the data source. Some databases perform poorly when a large list of predicates is pushed down to the data source, so for optimization Trino can compact large predicates up to this threshold.
This is what I came up with based on everyone's suggestions:
Minimum size of query predicates above which Trino starts compaction. Some databases perform poorly when a large list of predicates is pushed down to the data source. For optimization in that situation, Trino can compact the large predicates. When necessary, adjust the threshold to ensure a balance between performance and pushdown.
Please let me know if this is correct.
The confusing thing, to me at least, is that it's not the minimum size for compaction but a maximum number of predicates that are compacted. Per @hashhar's example, IN (1, 2, 3, 4) is not compacted if this property is set to 2, but it is if the property is set to 10. If that's the case, then the word "threshold" is unfortunately very misleading and we need to word around it somehow.
It's confusing to me too, based on the examples and sizes discussed. The default value is 100. @findepi said the behavior is actually the opposite of @hashhar's statement. @hashhar's description as a minimum threshold makes some sense to me, and "threshold" is part of the config property's name. I'll wait for another round of reviews. Thank you all for your guidance.
connector initiates compaction of the underlying files and the delta
files. A higher compaction threshold means reading less data from the
underlying data source, but a higher memory and network consumption.
- Sets the number of transactions to act as a threshold. After reaching
I have no idea what "transactions" this refers to.
I don't either. Is there someone who can clarify what that represents, or if there is some other term we should be using?
Per the discussion above, it seems like this is actually talking about a predicate count.
files. A higher compaction threshold means reading less data from the
underlying data source, but a higher memory and network consumption.
- Minimum size of query predicates above which Trino starts compaction.
Some databases perform poorly when a large list of predicates is pushed
We should make this explanation about Delta Lake and not "some databases". Also, with Delta Lake, what does predicate pushdown even mean?
We should link to https://trino.io/docs/current/optimizer/pushdown.html#predicate-pushdown and maybe expand that explanation so it covers what happens in an object storage system, e.g. is there any processing happening in Delta Lake? I don't think so.
With Delta Lake/Hive/Iceberg, predicate pushdown means using predicates inside ORC and Parquet readers to eliminate portions of the data based on min/max indexes maintained in those files. It is processing of metadata that takes place in the connector code running on Trino workers.
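The min/max pruning described above can be sketched as follows. This is a hypothetical simplification of what ORC/Parquet readers do with file-level statistics (the function name and arguments are invented for illustration): a row group can be skipped when its recorded min/max range cannot overlap the predicate's range.

```python
# Hypothetical sketch of min/max statistics pruning (not actual reader code).

def can_skip_row_group(stats_min, stats_max, pred_min, pred_max):
    """True if no row in the group can satisfy pred_min <= x <= pred_max,
    given the group's recorded min/max statistics."""
    return stats_max < pred_min or stats_min > pred_max

# Predicate x BETWEEN 50 AND 60 against two row groups:
print(can_skip_row_group(0, 40, 50, 60))   # True  -> group skipped entirely
print(can_skip_row_group(45, 90, 50, 60))  # False -> group must be read
```

This is why compacting a large IN-list into a range still allows effective pruning: the range's bounds are all the reader needs to compare against the stored statistics.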
"We should make this explanation about Delta Lake and not "some databases". Also, with Delta Lake, what does predicate pushdown even mean?"
@mosabua I had edited the suggested description from @hashhar above (#12965 (comment)). Will revise again.
"We should link to https://trino.io/docs/current/optimizer/pushdown.html#predicate-pushdown and maybe expand that explanation so it covers what happens in an object storage system, e.g. is there any processing happening in Delta Lake? I don't think so."
Based on @raunaqmorarka's comment after yours, processing is not happening in Delta Lake but rather in the connector code. His description was deemed too specific by @findepi, so I'm trying to craft a suitable description.
I think a subsequent ticket should be logged to expand the predicate pushdown section in the future. We are not linking to it at this time, as discussed.
The PR mention by Cole was just a typo.
fine-tune merge conflict
You need to squash all those commits @tlblessing, and all the unrelated grammar changes need to either be undone or moved to a separate commit.
Closing this PR per @mosabua's comment #12965 (comment). Please see the separate PR for the config description only: #13039. Will create another ticket/PR for the copy edits.
Description
The description of the domain compaction threshold config property was incorrect and confusing as explained in the linked DOC ticket.
Delta Lake Connector
The domain-compaction-threshold property compacts domains rather than files prior to predicate pushdown, and has nothing to do with the optimize run that compacts files.
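For reference, the property is set in the catalog properties file. A hedged example (the file name is illustrative; 100 is the default value mentioned in the review discussion above):

```properties
# etc/catalog/example.properties (illustrative file name)
# Value sets with up to this many discrete values are pushed down as-is;
# larger sets are compacted (e.g. an IN-list becomes a range).
domain-compaction-threshold=100
```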
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
(x) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(x) No release notes entries required.