Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document domain-compaction-threshold description #12965

Closed
wants to merge 9 commits into from
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions docs/src/main/sphinx/connector/delta-lake.rst
Original file line number Diff line number Diff line change
Expand Up @@ -136,10 +136,10 @@ connector.
- Description
- Default
* - ``delta.domain-compaction-threshold``
- Sets the number of transactions to act as threshold. Once reached the
connector initiates compaction of the underlying files and the delta
files. A higher compaction threshold means reading less data from the
underlying data source, but a higher memory and network consumption.
- Sets the number of transactions to act as a threshold. After reaching
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"as a threshold during query planning."

Copy link
Member

@hashhar hashhar Jun 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description seems wrong - IIRC the compaction threshold basically says that if you have a large predicate (WHERE or IN, doesn't apply to OR) then it can be simplified into smaller predicates (the implementation detail doesn't matter - may vary across connectors) but only if the number of predicates is < domain-compaction-threshold.

e.g. IN (1, 2, 3, 4) with domain_compaction_threshold=2 would not get optimised. But with domain-compaction-threshold=10 it would get optimised.

cc: @findepi please correct if I understand wrong.

Copy link
Member

@hashhar hashhar Jun 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for user facing docs I as a user would be happy with:

Some databases perform poorly when a large list of predicates is pushed down to the data source. To optimise such situations Trino can compact the large predicates. To ensure a balance of performance and pushdown the minimum size of predicates above which Trino will start compaction can be controlled using domain-compaction-threshold.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you have a large predicate (WHERE or IN, doesn't apply to OR)

we must not be specific about that because e.g. sometimes OR can be replaced with IN.

e.g. IN (1, 2, 3, 4) with domain_compaction_threshold=2 would not get optimised. But with domain-compaction-threshold=10 it would get optimised.

The opposite. domain_compaction_threshold=2 => predicate simplifies to equivalent of BETWEEN 1 and 4
with domain_compaction_threshold=10 it's kept as is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks updated the "Also for user facing docs I as a user would be happy with:" to match.

Copy link
Contributor

@jhlodin jhlodin Jun 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested rewording of your proposed wording, for style. Should start with description then move onto justification:

Maximum number of query predicates that Trino compacts for the data source. Some databases perform poorly when a large list of predicates is pushed down to the data source, so for optimization Trino can compact large predicates up to this threshold.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I came up with based on everyone's suggestions:

Minimum size of query predicates above which Trino starts compaction. Some databases perform poorly when a large list of predicates is pushed down to the data source. For optimization in that situation, Trino can compact the large predicates. When necessary, adjust the threshold to ensure a balance between performance and pushdown.

Please let me know if this is correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The confusing thing, to me at least, is that it's not the minimum size for compaction but a maximum number of predicates that are compacted. Per @hashhar 's example, IN (1, 2, 3, 4) is not compacted if this property is set to 2, but it is if the property is set to 10. If that's the case, then the word "threshold" is unfortunately very misleading and we need to word around it somehow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's confusing to me too based on the examples and sizes discussed. The default value is 100. @findepi said it was the opposite behavior actually than what @hashhar 's statement. Hashar's description as a minimum threshold makes some sense to me, and threshold is part of the config property's name. I'll wait for another round of reviews. Thank you all for your guidance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what "transactions" this refers to.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't either. Is there someone who can clarify what that represents, or if there is some other term we should be using?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the discussion above it seems like this is actually talking about a predicate count

the threshold, the connector initiates compacting a large IN or OR
clause into a min-max range predicate for pushdown into an ORC or
Parquet reader.
tlblessing marked this conversation as resolved.
Show resolved Hide resolved
tlblessing marked this conversation as resolved.
Show resolved Hide resolved
- 100
* - ``delta.max-outstanding-splits``
- The target number of buffered splits for each table scan in a query,
Expand Down