Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally reject queries with non-partition predicates on Iceberg partitioned tables #17239

Closed
rice668 opened this issue Apr 26, 2023 · 10 comments · Fixed by #17263
Closed

Optionally reject queries with non-partition predicates on Iceberg partitioned tables #17239

rice668 opened this issue Apr 26, 2023 · 10 comments · Fixed by #17263
Labels
enhancement New feature or request

Comments

@rice668
Copy link
Contributor

rice668 commented Apr 26, 2023

For iceberg partitioned tables, we should reject the table scan produced by the planner when the query does not have partition field. I will give a MR these days.

@osscm
Copy link
Contributor

osscm commented Apr 27, 2023

@zhangminglei we were also working in the same direction (to do static analysis of the queries and also do partition level checks), thanks for raising issue and the PR.

Here is the thread, which was to introduce in general a QuerySentry/GateKeeper kind of feature which can work across connectors (for generic use cases and few use-cases will be specific to the connectors): https://trinodb.slack.com/archives/CP1MUNEUX/p1676398477999809

Let us know if we can help in anyways!

@marton-bod
Copy link
Contributor

Thanks @zhangminglei for opening this up. We were also just thinking about implementing this interface as otherwise we don't have any guardrails against full-table scans. I think what you're proposing in the PR (disallowing queries which have no partition predicates) is the most obvious low-hanging fruit and the most beneficial. As a next step of direction, I would think of the following:

  • Implementing the equivalent of "hive.query-partition-filter-required-schemas in Iceberg connector to make this more flexible
  • I'm thinking about extending the validateScan logic with additional custom rules. These rules could be pluginable/user-supplied in some way. E.g. disallowing select * queries, enforcing predicates on sort order fields if there's any defined, etc. Having the ability for cluster admins to reject certain query types to save cluster resources (and making this flexible as users have different data layouts, query patterns and resourcing needs) would give value I think. Of course, the validateScan extension would only benefit the Iceberg connector - if we wanted to take a step further to create a connector-agnostic solution for this, it would have to be a QueryGateKeeper component mentioned by @osscm

@rice668
Copy link
Contributor Author

rice668 commented May 9, 2023

Sorry for later response since the holidays, thanks @marton-bod and @osscm for participating in this issue! If only doing a development extension to validateScan would only benefit the iceberg connector. Of course, in this PR, I think it is enough for us to only focus on this point, after all, this is a PR for iceberg.

Enforcing predicates on sort order fields is a very good suggestion. Currently we allow non-sorted order fields to be used in predicates. This is not friendly for filtering because we may have to read every data file if the data distribution is bad. If we want to do this function, we can make it a configurable predicate field option.

I have implemented some independent rules , such as ForbidCrossJoin, OrderByFullTable which implements Rule. Because this has nothing to do with specific connectors, I didn't put these rule codes in this PR. These rules are for the execution plan tree and have nothing to do with specific connectors. I do not implement full table scan disallowing select * queries.

Do you guys already have some implementation about QueryGateKeeper ? Can you be more specific? For example, how is full table scan not allowed to be defined in the interface.

@marton-bod
Copy link
Contributor

marton-bod commented May 9, 2023

Do you guys already have some implementation about QueryGateKeeper

No, we have not started any implementation yet, only toying with the idea at the moment. We are mostly interested in optimizing Iceberg queries, so the priority would be to use the validateScan method, but of course the ideal scenario would be to create the connector-agnostic solution (though it comes with more compelxity)

I have implemented some independent rules

Nice. Is this available in open source or somewhere on github? I was thinking about doing something similar but on the connector/validateScan level, e.g. if users could define a custom rule list and pass them in conforming to an interface:

ValidationRule {
  boolean accept(ConnectorSession session, ConnectorTableHandle tableHandle);
}

Enforcing predicates on sort order fields is a very good suggestion

Yeah, I would either start working on this using a separate config flag like iceberg.query-sort-column-filter-required, or if there's support for the above ValidationRule style approach, we can just use that and create a rule for the sort key predicate check as well.
@osscm what's your take on this?

@rice668
Copy link
Contributor Author

rice668 commented May 10, 2023

Is this available in open source or somewhere on github?

No, I've only implemented it locally and haven't uploaded it to github, but the rules I implemented are very, very simple but it works :) . For example, the ForbidCrossJoin code is as follows, OrderByFullTable is also similar to ForbidCrossJoin, making judgments on SortNode.

/**
 * Queries that optimized to CrossJoin should not be forbidden.
 */
public class ForbidCrossJoin
        implements Rule<JoinNode>
{
    @Override
    public boolean isEnabled(Session session)
    {
        return forbidCrossJoin(session);
    }

    @Override
    public Pattern<JoinNode> getPattern()
    {
        return Patterns.join();
    }

    @Override
    public Result apply(JoinNode node, Captures captures, Context context)
    {
        if (node.isCrossJoin()) {
            throw new TrinoException(StandardErrorCode.QUERY_REJECTED, "Cross join is not allowed," +
                    " if you actually want query run cross join, please set session forbid_cross_join to false.");
        }
        return Result.empty();
    }
}

or if there's support for the above ValidationRule style approach

I might prefer ValidationRule style approach because it's more flexible, such as passing a set of rules for scan. But it seems that this can only be used to verify the table scan. Things that have nothing to do with scan, such as prohibiting cross join, are impossible. Because prohibiting cross join has nothing to do with connectors.

e.g. if users could define a custom rule list and pass them in conforming to an interface:

I simply extended the default implementation of the validateScan interface. Please help to see if it is consistent with what you think?

public interface ValidationRule {
    boolean accept(ConnectorSession session, ConnectorTableHandle tableHandle);
}

default void validateScan(ConnectorSession session, ConnectorTableHandle handle, List<ValidationRule> rules) {
    for (ValidationRule rule : rules) {
        if (!rule.accept(session, handle)) {
            throw new IllegalStateException("Scan rejected due to custom validation rule: " + rule);
        }
    }
}

If we bind ValidationRule and validateScan together, in fact, this rule can only do things related to table scan. However I think this is acceptable. What do you think ? @marton-bod

@marton-bod
Copy link
Contributor

@zhangminglei I would also prefer the ValidationRule approach as it provides the most flexibility for the cluster administrators. I think it's acceptable to make it available on the connector level.

Please help to see if it is consistent with what you think?

Yes, this is what I had in mind, although I think the List<ValidationRule> rules would need to be injected into IcebergMetadata as a field using a ValidationRuleProvider and we should not change the interface signature.

@marton-bod
Copy link
Contributor

@findepi @alexjo2144 Do you think it's worth exploring the above idea?

@rice668
Copy link
Contributor Author

rice668 commented May 11, 2023

@marton-bod I think your idea should be like the following pseudo code ?

public interface ValidationRuleProvider {
    List<ValidationRule> getValidationRules();
}


public class IcebergMetadata
        implements ConnectorMetadata {
    // ... other ...

    private final ValidationRuleProvider validationRuleProvider;

    public IcebergMetadata(
            TypeManager typeManager,
            JsonCodec<CommitTaskData> commitTaskCodec,
            TrinoCatalog catalog,
            TrinoFileSystemFactory fileSystemFactory,
            ValidationRuleProvider validationRuleProvider) {
        // ... other inits ...
        this.validationRuleProvider = requireNonNull(validationRuleProvider, "validationRuleProvider is null");
    }

    // ... other methods ...

    public void validateScan(ConnectorSession session, ConnectorTableHandle handle) {
        List<ValidationRule> rules = validationRuleProvider.getValidationRules();
        for (ValidationRule rule : rules) {
            if (!rule.accept(session, handle)) {
                throw new IllegalStateException("Scan rejected due to custom validation rule: " + rule);
            }
        }
    }
}

@findepi findepi changed the title IcebergMetadata should implement validateScan interface Optionally reject queries with non-partition predicates on Iceberg partitioned tables May 12, 2023
@osscm
Copy link
Contributor

osscm commented Jun 7, 2023

@marton-bod I think your idea should be like the following pseudo code ?

public interface ValidationRuleProvider {
    List<ValidationRule> getValidationRules();
}


public class IcebergMetadata
        implements ConnectorMetadata {
    // ... other ...

    private final ValidationRuleProvider validationRuleProvider;

    public IcebergMetadata(
            TypeManager typeManager,
            JsonCodec<CommitTaskData> commitTaskCodec,
            TrinoCatalog catalog,
            TrinoFileSystemFactory fileSystemFactory,
            ValidationRuleProvider validationRuleProvider) {
        // ... other inits ...
        this.validationRuleProvider = requireNonNull(validationRuleProvider, "validationRuleProvider is null");
    }

    // ... other methods ...

    public void validateScan(ConnectorSession session, ConnectorTableHandle handle) {
        List<ValidationRule> rules = validationRuleProvider.getValidationRules();
        for (ValidationRule rule : rules) {
            if (!rule.accept(session, handle)) {
                throw new IllegalStateException("Scan rejected due to custom validation rule: " + rule);
            }
        }
    }
}

Thanks @zhangminglei
Yeah, something like this is in our mind, where users will have control to say what kind of queries they want to disable or enable.

We also want to give option to have strict mode or warn mode for these rules, so that user can also run this in warn mode, if they want to pilot it and monitor it.
We will create a separate issue for this. Very soon, we are looking to add these two features to Trino.
So, please let us know if you need any help to get the current PR in.

@osscm
Copy link
Contributor

osscm commented Jun 7, 2023

May be not scope of this issue, but discussing so that will have trail and can result into another issue.

Wondering is it possible to also optionally provide Warning instead of rejecting the query.
I think for Druid queries it is happening.

And/Or also provide a way to configure schema/tables on which this rule will be applied.
I think it will be useful to the use-cases where user still wants to explore data without providing partitions.

cc @findepi @alexjo2144 @marton-bod

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Development

Successfully merging a pull request may close this issue.

4 participants