Document domain-compaction-threshold description #12965

Closed
wants to merge 9 commits

Conversation

@tlblessing (Member) commented Jun 23, 2022

Description

Is this change a fix, improvement, new feature, refactoring, or other?

The description of the domain compaction threshold config property was incorrect and confusing as explained in the linked DOC ticket.

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Delta Lake Connector

How would you describe this change to a non-technical end user or system administrator?

The domain-compaction-threshold property compacts domains, not files, prior to predicate pushdown; it has nothing to do with the OPTIMIZE run that compacts files.

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
(x) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.

@cla-bot added the cla-signed label Jun 23, 2022
@github-actions added the docs label Jun 23, 2022
@ebyhr changed the title from "DOC-3436 domain-compaction-threshold description" to "Document domain-compaction-threshold description" Jun 23, 2022
@ebyhr (Member) commented Jun 23, 2022

Thanks for improving the documentation. Please avoid posting internal Jira links in this repository.

@mosabua requested a review from raunaqmorarka June 23, 2022 23:11
@jhlodin (Contributor) left a comment

Some suggestions to help simplify, if I'm understanding the property correctly

connector initiates compaction of the underlying files and the delta
files. A higher compaction threshold means reading less data from the
underlying data source, but a higher memory and network consumption.
- Sets the number of transactions to act as a threshold. After reaching
A Contributor commented:

"as a threshold during query planning."

@hashhar (Member) commented Jun 24, 2022:

This description seems wrong - IIRC the compaction threshold basically says that if you have a large predicate (WHERE or IN, doesn't apply to OR) then it can be simplified into smaller predicates (the implementation detail doesn't matter - may vary across connectors) but only if the number of predicates is < domain-compaction-threshold.

e.g. IN (1, 2, 3, 4) with domain_compaction_threshold=2 would not get optimised. But with domain-compaction-threshold=10 it would get optimised.

cc: @findepi please correct if I understand wrong.

@hashhar (Member) commented Jun 24, 2022:

Also for user facing docs I as a user would be happy with:

Some databases perform poorly when a large list of predicates is pushed down to the data source. To optimise such situations Trino can compact the large predicates. To ensure a balance of performance and pushdown the minimum size of predicates above which Trino will start compaction can be controlled using domain-compaction-threshold.
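
For illustration, that control might look like the following; the catalog name delta and the exact session property spelling are assumptions here, not something confirmed in this thread:

```sql
-- Hypothetical: adjust the threshold for one session to balance
-- pushdown precision against planning overhead.
SET SESSION delta.domain_compaction_threshold = 200;
```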

@findepi (Member) commented:

> if you have a large predicate (WHERE or IN, doesn't apply to OR)

We must not be specific about that, because e.g. sometimes OR can be replaced with IN.

> e.g. IN (1, 2, 3, 4) with domain_compaction_threshold=2 would not get optimised. But with domain-compaction-threshold=10 it would get optimised.

The opposite: with domain_compaction_threshold=2 the predicate simplifies to the equivalent of BETWEEN 1 AND 4; with domain_compaction_threshold=10 it is kept as is.
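
In SQL terms, that correction works out as follows; the table and column names are hypothetical, and the compacted form is an implementation detail that may vary across connectors:

```sql
-- With domain_compaction_threshold = 2, a four-value list exceeds the
-- threshold, so the domain is compacted into its enclosing range:
SELECT * FROM orders WHERE orderkey IN (1, 2, 3, 4);
-- is pushed down as the equivalent of:
SELECT * FROM orders WHERE orderkey BETWEEN 1 AND 4;

-- With domain_compaction_threshold = 10, the same list is below the
-- threshold and is pushed down unchanged.
```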

@hashhar (Member) commented:

Thanks, I updated the "Also for user facing docs I as a user would be happy with:" comment to match.

@jhlodin (Contributor) commented Jun 24, 2022:

Suggested rewording of your proposed wording, for style. It should start with the description, then move on to the justification:

Maximum number of query predicates that Trino compacts for the data source. Some databases perform poorly when a large list of predicates is pushed down to the data source, so for optimization Trino can compact large predicates up to this threshold.

@tlblessing (Member, Author) commented:

This is what I came up with based on everyone's suggestions:

Minimum size of query predicates above which Trino starts compaction. Some databases perform poorly when a large list of predicates is pushed down to the data source. For optimization in that situation, Trino can compact the large predicates. When necessary, adjust the threshold to ensure a balance between performance and pushdown.

Please let me know if this is correct.

A Contributor commented:

The confusing thing, to me at least, is that it's not the minimum size for compaction but a maximum number of predicates that are compacted. Per @hashhar's example, IN (1, 2, 3, 4) is not compacted if this property is set to 2, but it is if the property is set to 10. If that's the case, then the word "threshold" is unfortunately very misleading, and we need to word around it somehow.

@tlblessing (Member, Author) commented:

It's confusing to me too, based on the examples and sizes discussed. The default value is 100. @findepi said the behavior is actually the opposite of what @hashhar stated. @hashhar's description of it as a minimum threshold makes some sense to me, and "threshold" is part of the config property's name. I'll wait for another round of reviews. Thank you all for your guidance.

docs/src/main/sphinx/connector/delta-lake.rst (outdated)
@hashhar requested review from findinpath and findepi and removed the request for raunaqmorarka June 24, 2022 06:01
connector initiates compaction of the underlying files and the delta
files. A higher compaction threshold means reading less data from the
underlying data source, but a higher memory and network consumption.
- Sets the number of transactions to act as a threshold. After reaching
A Member commented:

I have no idea what "transactions" this refers to.

@tlblessing (Member, Author) commented:

I don't either. Is there someone who can clarify what that represents, or if there is some other term we should be using?

A Contributor commented:

Per the discussion above, it seems like this is actually talking about a predicate count.

files. A higher compaction threshold means reading less data from the
underlying data source, but a higher memory and network consumption.
- Minimum size of query predicates above which Trino starts compaction.
Some databases perform poorly when a large list of predicates is pushed
@mosabua (Member) commented Jun 24, 2022:

We should make this explanation about Delta Lake and not "some databases". Also, with Delta Lake, what does predicate pushdown even mean?

We should link to https://trino.io/docs/current/optimizer/pushdown.html#predicate-pushdown and maybe expand that explanation so it covers what happens in an object storage system. E.g., is there any processing happening in Delta Lake? I don't think so.

@raunaqmorarka (Member) commented:

With Delta Lake/Hive/Iceberg, predicate pushdown means using predicates inside the ORC and Parquet readers to eliminate portions of the data based on min/max indexes maintained in those files. It is processing of metadata that takes place in the connector code running on Trino workers.
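
For illustration, this is the kind of selective predicate that benefits; the table and column names are hypothetical:

```sql
-- The readers compare this predicate against the min/max statistics
-- kept per file and per row group; any file or row group whose
-- [min, max] range for event_date cannot contain the value is skipped
-- without its data being read.
SELECT count(*)
FROM events
WHERE event_date = DATE '2022-06-24';
```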

@tlblessing (Member, Author) commented Jun 27, 2022:

> We should make this explanation about Delta Lake and not "some databases". Also, with Delta Lake, what does predicate pushdown even mean?

@mosabua I had edited the suggested description from @hashhar above (#12965 (comment)). I will revise again.

> We should link to https://trino.io/docs/current/optimizer/pushdown.html#predicate-pushdown and maybe expand that explanation so it covers what happens in an object storage system. E.g., is there any processing happening in Delta Lake? I don't think so.

Based on @raunaqmorarka's comment after yours, the processing is not happening in Delta Lake but rather in the connector code. His description was deemed too specific by @findepi, so I'm trying to craft a suitable description.

I think a subsequent ticket should be logged to expand the predicate pushdown section in the future. We are not linking to it at this time, as discussed.

@tlblessing (Member, Author) commented:

The PR mention by Cole was just a typo.

Commit: fine-tune merge conflict
@mosabua (Member) commented Jun 27, 2022

You need to squash all those commits, @tlblessing, and all the unrelated grammar changes need to either be undone or moved to a separate commit.

@tlblessing (Member, Author) commented Jun 29, 2022

Closing this PR per @mosabua's comment (#12965 (comment)). Please see the separate PR for the config description only: #13039. I will create another ticket/PR for the copy edits.
