Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Predicate Pushdown of Partition Key Inconsistent with Specified Filter in SQL #12538

Closed
MichelleArk opened this issue Mar 27, 2019 · 2 comments
Closed
Assignees
Labels

Comments

@MichelleArk
Copy link

Filtering on a partition key in a SQL query does not always produce a plan with a table scan constrained to scan the minimum number of partitions required for the query.

Example using the orders fixture:

WITH cte AS (
    SELECT *, CAST(orderkey AS varchar) as orderkey_string 
    FROM orders
) 

SELECT * 
FROM cte 
WHERE (orderstatus = 'F' OR orderstatus = 'P') AND orderkey_string = '2'

Produces the following plan:

 - Output[orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, clerk, shippriority, comment, orderkey_string] => [orderkey:bigint, custkey:bigint, orderstatus:varchar(1), totalprice:double, orderdate:date, orderpriority:varchar(15), clerk:varchar(15), shippriority:integer, comment:varchar(79), expr_8:varchar]
        Cost: ?, Output: ? rows (?B)
        orderkey_string := expr_8
    - RemoteExchange[GATHER] => [orderkey:bigint, custkey:bigint, orderstatus:varchar(1), totalprice:double, orderdate:date, orderpriority:varchar(15), clerk:varchar(15), shippriority:integer, comment:varchar(79), expr_8:varchar]
            Cost: ?, Output: ? rows (?B)
        - Filter[filterPredicate = (("orderstatus" = 'F') OR ("orderstatus" = 'P'))] => [orderkey:bigint, custkey:bigint, orderstatus:varchar(1), totalprice:double, orderdate:date, orderpriority:varchar(15), clerk:varchar(15), shippriority:integer, comment:varchar(79), expr_8:varchar]
                Cost: ?, Output: ? rows (?B)
            - ScanFilterProject[table = local:orders:sf0.01, filterPredicate = (CAST("orderkey" AS varchar) = CAST('2' AS varchar))] => [orderkey:bigint, custkey:bigint, orderstatus:varchar(1), totalprice:double, orderdate:date, orderpriority:varchar(15), clerk:varchar(15), shippriority:integer, comment:varchar(79), expr_8:varchar]
                    Cost: ?, Output: ? rows (?B)
                    expr_8 := CAST("orderkey" AS varchar)
                    clerk := tpch:clerk
                    orderkey := tpch:orderkey
                    orderstatus := tpch:orderstatus
                        :: [[F], [O], [P]]
                    totalprice := tpch:totalprice
                    custkey := tpch:custkey
                    comment := tpch:comment
                    orderdate := tpch:orderdate
                    shippriority := tpch:shippriority
                    orderpriority := tpch:orderpriority

Notice from bolded line in the plan, all the possible values (F, O, P) of orderstatus are chosen in the ScanFilterProject node. Then, a Filter node above it does the actual filtering on orderstatus specified in the query (just F and P).

@nayeemzen and I have been digging into why this occurs and it seems related to the way predicates are pushed down from a Project. The following 3 conditions seem to trigger the suboptimal partition selection in the ScanFilterProject node.

  1. Query involves a subquery (ex: CTE or view)
  2. The subquery contains a non-identity expression that is referenced in the main query (ex: CAST(orderkey AS varchar) as orderkey_string)
  3. More than 1 reference to the same partition key within a clause (ex: (orderstatus = 'F' OR orderstatus = 'P')

From tracing through how the plan is optimized for the example query, we've seen that the PredicatePushDown optimizer identifies the (orderstatus = 'F' OR orderstatus = 'P') clause as a non-inlining candidate, which excludes it from being pushed down to the table scan. Instead an additional filter node is created for the (orderstatus = 'F' OR orderstatus = 'P') clause.

This issue seems related to the this change: #10860.
Specifically, constraining the number of references to a symbol within a given clause to be 1 is what flags the clause in the example query as non-inlining candidate.

@wenleix
Copy link
Contributor

wenleix commented Mar 28, 2019

@MichelleArk : A similar issue has been discussed in #11265 . Here is the conclusion:

  • This behavior, in fact, conforms to the SQL Standard
  • However, our current behavior on subquery is not consistent (some subquery can still have predicate pushdown). We need to have consistent behavior, and probably have this controlled by a session property.

Let me know if you have any further questions :)

@wenleix wenleix self-assigned this Mar 28, 2019
@stale
Copy link

stale bot commented Jun 22, 2021

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants