Optionally reject queries with non-partition predicates on Iceberg partitioned tables #17239
Comments
@zhangminglei we were also working in the same direction (static analysis of queries plus partition-level checks), thanks for raising the issue and the PR. Here is the thread about introducing a general QuerySentry/GateKeeper kind of feature that can work across connectors (most use cases would be generic, a few would be connector-specific): https://trinodb.slack.com/archives/CP1MUNEUX/p1676398477999809 Let us know if we can help in any way!
Thanks @zhangminglei for opening this up. We were also just thinking about implementing this interface, as otherwise we don't have any guardrails against full-table scans. I think what you're proposing in the PR (disallowing queries which have no partition predicates) is the most obvious low-hanging fruit and the most beneficial. As a next step, I would think of the following:
Sorry for the late response, it was the holidays. Thanks @marton-bod and @osscm for participating in this issue! Extending only validateScan would benefit only the Iceberg connector. Of course, for this PR I think it is enough to focus on that point, since this is a PR for Iceberg. Enforcing predicates on sort order fields is a very good suggestion. Currently we allow non-sort-order fields to be used in predicates. This is not friendly for filtering, because we may have to read every data file if the data distribution is bad. If we want to add this, we can make the required predicate fields a configurable option. I have implemented some independent rules, such as Do you guys already have some implementation about
No, we have not started any implementation yet, we are only toying with the idea at the moment. We are mostly interested in optimizing Iceberg queries, so the priority would be to use the validateScan method, but of course the ideal scenario would be to create a connector-agnostic solution (though it comes with more complexity).
Nice. Is this available in open source or somewhere on github? I was thinking about doing something similar but on the connector/validateScan level, e.g. if users could define a custom rule list and pass them in conforming to an interface:
Yeah, I would either start working on this using a separate config flag like
No, I've only implemented it locally and haven't uploaded it to GitHub. The rules I implemented are very, very simple, but they work :) . For example, the
I would prefer the ValidationRule-style approach because it is more flexible, for example it allows passing a set of rules for a scan. But it seems that this can only be used to verify table scans. Things that have nothing to do with a scan, such as prohibiting cross joins, are not possible this way, because prohibiting cross joins has nothing to do with connectors.
I simply extended the default implementation of the
If we bind ValidationRule and validateScan together, the rules can only do things related to table scans. However, I think this is acceptable. What do you think, @marton-bod?
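To make the ValidationRule idea discussed above a bit more concrete, here is a minimal sketch of what a pluggable rule list could look like. None of these types come from Trino or from the PR: `ScanInfo`, `ValidationRule`, `RequireFilterOnAnyOf` and `ScanValidator` are hypothetical names, and the scan is reduced to a table name plus the set of columns used in predicates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical, simplified description of a table scan (not a Trino type).
final class ScanInfo {
    final String tableName;
    final Set<String> predicateColumns;

    ScanInfo(String tableName, Set<String> predicateColumns) {
        this.tableName = tableName;
        this.predicateColumns = predicateColumns;
    }
}

// Hypothetical rule contract: an empty result means the scan is acceptable.
interface ValidationRule {
    Optional<String> validate(ScanInfo scan);
}

// Example rule: the query must filter on at least one of the given columns
// (for instance partition columns or sort-order columns).
final class RequireFilterOnAnyOf implements ValidationRule {
    private final Set<String> requiredColumns;

    RequireFilterOnAnyOf(Set<String> requiredColumns) {
        this.requiredColumns = requiredColumns;
    }

    @Override
    public Optional<String> validate(ScanInfo scan) {
        for (String column : scan.predicateColumns) {
            if (requiredColumns.contains(column)) {
                return Optional.empty();
            }
        }
        return Optional.of("Scan of " + scan.tableName + " must filter on one of " + requiredColumns);
    }
}

// Applies every configured rule and collects the violation messages.
final class ScanValidator {
    private final List<ValidationRule> rules;

    ScanValidator(List<ValidationRule> rules) {
        this.rules = new ArrayList<>(rules);
    }

    List<String> violations(ScanInfo scan) {
        List<String> messages = new ArrayList<>();
        for (ValidationRule rule : rules) {
            rule.validate(scan).ifPresent(messages::add);
        }
        return messages;
    }
}
```

Under this assumption, a connector's validateScan would build a `ScanInfo` from its table handle, run the validator, and fail the query if any violation comes back. As noted above, rules of this shape only see a single table scan, so something like prohibiting cross joins would not fit here.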
@zhangminglei I would also prefer the ValidationRule approach as it provides the most flexibility for the cluster administrators. I think it's acceptable to make it available on the connector level.
Yes, this is what I had in mind, although I think the
@findepi @alexjo2144 Do you think it's worth exploring the above idea?
@marton-bod I think your idea would look like the following pseudocode?
Thanks @zhangminglei. We would also like an option to run these rules in either strict mode or warn mode, so that users can start in warn mode if they want to pilot a rule and monitor it.
This may be out of scope for this issue, but I am raising it here so that there is a trail and it can result in a separate issue. I wonder whether it is possible to also optionally provide And/Or also provide a way to configure the schemas/tables to which this rule will be applied.
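The two suggestions above (a strict/warn switch and limiting a rule to certain tables) could be sketched roughly like this. It is only an illustration: `EnforcementMode` and `RuleEnforcer` are made-up names, table scoping is a plain set of qualified table names, and warnings are kept in a list where a real implementation would presumably log them or surface them to the client.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical switch: STRICT fails the query, WARN only records the violation.
enum EnforcementMode { STRICT, WARN }

// Hypothetical helper that decides what to do with a rule violation.
final class RuleEnforcer {
    private final EnforcementMode mode;
    private final Set<String> enforcedTables; // empty set means "all tables"
    private final List<String> warnings = new ArrayList<>();

    RuleEnforcer(EnforcementMode mode, Set<String> enforcedTables) {
        this.mode = mode;
        this.enforcedTables = new HashSet<>(enforcedTables);
    }

    void report(String tableName, String violation) {
        // Tables outside the configured scope are never checked.
        if (!enforcedTables.isEmpty() && !enforcedTables.contains(tableName)) {
            return;
        }
        if (mode == EnforcementMode.STRICT) {
            throw new IllegalStateException(tableName + ": " + violation);
        }
        warnings.add(tableName + ": " + violation);
    }

    // Returns a copy so callers cannot modify the recorded warnings.
    List<String> warnings() {
        return new ArrayList<>(warnings);
    }
}
```

Running a rule in WARN mode first gives a way to see how many existing queries would be rejected before switching it to STRICT.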
For Iceberg partitioned tables, we should reject the table scan produced by the planner when the query has no predicate on a partition field. I will submit an MR in the coming days.
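A minimal sketch of the check described above, under simplifying assumptions: the partition fields and the columns referenced by the query's predicates are given as plain sets of column names, and `PartitionPredicateCheck` is a hypothetical class, not code from the connector. Partition transforms are ignored here, so a partition field is identified by its source column name.

```java
import java.util.Set;

// Hypothetical standalone check; the real connector would work on its own handles.
final class PartitionPredicateCheck {
    private PartitionPredicateCheck() {}

    // True if at least one predicate column is also a partition column.
    static boolean hasPartitionPredicate(Set<String> partitionColumns, Set<String> predicateColumns) {
        for (String column : predicateColumns) {
            if (partitionColumns.contains(column)) {
                return true;
            }
        }
        return false;
    }

    // Rejects scans of partitioned tables that do not filter on any partition column.
    static void validate(String tableName, Set<String> partitionColumns, Set<String> predicateColumns) {
        if (partitionColumns.isEmpty()) {
            return; // unpartitioned tables are not affected
        }
        if (!hasPartitionPredicate(partitionColumns, predicateColumns)) {
            throw new IllegalArgumentException(
                    "Query on partitioned table " + tableName + " has no predicate on a partition field");
        }
    }
}
```

Since the behavior is meant to be optional, a call like `validate(...)` would sit behind a configuration switch so that existing queries keep working by default.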