Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transform multiple ORs into a single SQL IN #34507

Open
roji opened this issue Aug 22, 2024 · 8 comments
Open

Transform multiple ORs into a single SQL IN #34507

roji opened this issue Aug 22, 2024 · 8 comments

Comments

@roji
Copy link
Member

roji commented Aug 22, 2024

We've recently been focusing a bit on reducing duplication of expressions in our generated SQL, especially around null semantics. For example, translating to "x IS NOT DISTINCT FROM y" instead of x = y OR (x IS NULL AND y IS NULL) would avoid duplicating x and y in the expression (#29624); this is valuable especially when x/y aren't simple columns/parameters/scalars, but rather complex arbitrary expressions which may be expensive to evaluate. As another example, #12634 tracks transforming x >= y AND x <= z to x BETWEEN y AND z.

We could do the same by transforming multiple disjunctions into a single SQL IN (i.e. x = 3 OR x = 4 becomes x IN (3, 4); this would allow evaluating x only once, at least in some databases. This is essentially the same idea as #12634 for BETWEEN, and could probably even be implemented at the same time.

Note that the same caveats apply here as for BETWEEN - impure expressions should in theory not be collapsed together (though we don't currently handle such aspects in general), and identifying duplicated expression requires deep comparison, which can be expensive (#34149 would fix that).

/cc @ranma42

@ranma42
Copy link
Contributor

ranma42 commented Aug 22, 2024

Some other things to consider:

  • we should be careful about NULL behavior (the transformation is always safe if we do it after the nullability semantics transformation; otherwise it requires handling the NULL case)
  • as a first step we might want to restrict the cases we handle to comparisons against constants and/or parameters (aka avoid decisions for x = y OR x = z OR y = z)
  • it looks like in several cases we want to inspect several clauses in AND or OR together (this is more of a thought for a could-be intermediate representation); note that these operations are slightly different in C# and SQL (C# has guaranteed short-circuit semantics; SQL makes no such guarantee); in the case of SQL semantics we might be able to re-order and de-duplicate as we wish (which is basically the no-side-effects assumptions above)

@roji
Copy link
Member Author

roji commented Aug 22, 2024

we should be careful about NULL behavior (the transformation is always safe if we do it after the nullability semantics transformation; otherwise it requires handling the NULL case)

I'm not completely sure about this...

Thinking about it, this is actually something we may even want to do in preprocessing, as a universal optimization (not SQL-specific), where we'd transform comparisons in the pre-translated LINQ expression tree (a == b || a == c -> new[] { b, c }.Contains(a)). One problem with doing it like this in pre-processing, is that we can't cache the hashcode for optimizaing the necessary deep comparison.

But even if we do it in post-processing, what exact problem do you see with NULLs here? Isn't the semantics of multiple ORs identical to the semantics of SQL IN even before SqlNullabilityProcessor?

@ranma42
Copy link
Contributor

ranma42 commented Aug 22, 2024

we should be careful about NULL behavior (the transformation is always safe if we do it after the nullability semantics transformation; otherwise it requires handling the NULL case)

I'm not completely sure about this...

Thinking about it, this is actually something we may even want to do in preprocessing, as a universal optimization (not SQL-specific), where we'd transform comparisons in the pre-translated LINQ expression tree (a == b || a == c -> new[] { b, c }.Contains(a)). One problem with doing it like this in pre-processing, is that we can't cache the hashcode for optimizaing the necessary deep comparison.

Why can't we cache that? Is it only because of the SELECT mutability? Or are you thinking about issues with nullable parameters?

But even if we do it in post-processing, what exact problem do you see with NULLs here? Isn't the semantics of multiple ORs identical to the semantics of SQL IN even before SqlNullabilityProcessor?

I believe that the issue is the usual one: equality (and consequently inclusion) change meaning over time.
Specifically, a == b || a == null succeeds in C# for a that is null, while a IN (b, NULL) is not satisfied (in SQL) if a is NULL.
OTOH a == b || a == c and new[] { b, c }.Contains(a) are equivalent in C#, so it might not be a problem.
Regardless, we should ensure that new[] { b, c, null }.Contains(a) is converted to a IN (b, c) OR a IS NULL in SQL (well... we could have some tricks to avoid duplicating a in this case: IIUC COALESCE(a IN (b, c), TRUE) would work as intended)

@roji
Copy link
Member Author

roji commented Aug 22, 2024

Why can't we cache that? Is it only because of the SELECT mutability? Or are you thinking about issues with nullable parameters?

I just mean that in preprocessing (which happens before we translate to SQL, there's no SELECT or anything yet), we're still dealing with the built-in LINQ expression node types, e.g. BinaryExpression as opposed to our SqlBinaryExpression; and we can't modify that expression in order to do the caching...

I believe that the issue is the usual one: equality (and consequently inclusion) change meaning over time.
Specifically, a == b || a == null succeeds in C# for a that is null, while a IN (b, NULL) is not satisfied (in SQL) if a is NULL.

So again, assuming we do the transformation in post-processing - before SqlNullabilityProcessor - we'd be transforming WHERE a = 1 OR a = 2 (as nested SqlBinaryExpressions) to WHERE a IN (1, 2). SqlNullabilityProcessor would later run on this and perform the regular nullability expansion for InExpression (see answer below). So I think everything should be fine here.

In fact, the transformation would also pick up our IS NULL representation (which is represented via SqlUnaryExpression), and integrate that into the resulting InExpression as well, so from WHERE a = 1 OR a IS NULL you'd get WHERE as IN (1, NULL), which would then get expanded out in SqlNullabilityProcessor (unless UseRelationalNulls is on).

Regardless, we should ensure that new[] { b, c, null }.Contains(a) is converted to a IN (b, c) OR a IS NULL in SQL (well... we could have some tricks to avoid duplicating a in this case: IIUC COALESCE(a IN (b, c), TRUE) would work as intended)

This should already be the case - I worked on doing full nullability semantics for InExpression in EF Core 8.0, it's probably one of the more complex parts of SqlNullabilityProcessor. There may be bugs there (or opportunities for improvement) but that's orthogonal to this issue, I think.

@roji
Copy link
Member Author

roji commented Oct 9, 2024

Note that we already have an inferior version of this proposal implemented today in SqlExpressionSimplifyingExpressionVisitor (except that it only works on ColumnExpressions, and only for equality comparisons directly within the same OR node). Note also that we have a bug where we incorrectly deduplicate values if they're equal in .NET; but equality in the database may evaluate differently (e.g. case-insensitive collation), so we shouldn't do that (see #34862 (comment)).

@roji
Copy link
Member Author

roji commented Oct 10, 2024

Following an internal conversation, the following shows that SQL Server PostgreSQL doesn't automatically

CREATE TABLE Data(Id INTEGER NOT NULL PRIMARY KEY, Foo int NOT NULL);
EXPLAIN SELECT * FROM Data WHERE Foo IN (3, 4);
EXPLAIN SELECT * FROM Data WHERE Foo = 3 OR Foo = 4;

This gives the following two plans:

+----------------------------------------------------+
|QUERY PLAN                                          |
+----------------------------------------------------+
|Seq Scan on data  (cost=0.00..38.25 rows=23 width=8)|
|  Filter: (foo = ANY ('{3,4}'::integer[]))          |
+----------------------------------------------------+

+----------------------------------------------------+
|QUERY PLAN                                          |
+----------------------------------------------------+
|Seq Scan on data  (cost=0.00..43.90 rows=23 width=8)|
|  Filter: ((foo = 3) OR (foo = 4))                  |
+----------------------------------------------------+

In other words, on PostgreSQL, IN isn't executed in the same way as the equivalent decomposed ORs, and is slightly more efficient.

One possible objection against doing collapsing multiplle ORs to a single IN in EF, is that the user has a means of expressing IN directly, via LINQ Contains, and we generally don't make an effort to optimize badly-written LINQ queries. This is in contrast to the proposed BETWEEN optimization (#12634), where there's no C# way to express it. A counter-argument would be that EF itself (rather than the user) could internally produce the multiple ORs, at which point this optimization is useful, though that might be a bit far-fetched.

@ranma42
Copy link
Contributor

ranma42 commented Oct 14, 2024

Following an internal conversation, the following shows that SQL Server doesn't automatically

[...]

In other words, on PostgreSQL, IN isn't executed in the same way as the equivalent decomposed ORs, and is slightly more efficient.

Is this PostgreSQL or SqlServer?

One possible objection against doing collapsing multiplle ORs to a single IN in EF, is that the user has a means of expressing IN directly, via LINQ Contains, and we generally don't make an effort to optimize badly-written LINQ queries. This is in contrast to the proposed BETWEEN optimization (#12634), where there's no C# way to express it. A counter-argument would be that EF itself (rather than the user) could internally produce the multiple ORs, at which point this optimization is useful, though that might be a bit far-fetched.

Another case that cannot easily be expressed right now is the "row value equality" (like in #26822 with equals) but I am not even sure if that is supposed to be included in this issue 😅

@roji
Copy link
Member Author

roji commented Oct 14, 2024

Is this PostgreSQL or SqlServer?

Sorry, PostgreSQL - corrected the above.

Another case that cannot easily be expressed right now is the "row value equality"

I think that one should be possible (but isn't currently supported) via new ValueTuple(x,y).Equals(new ValueTuple(a,b). Here I really doubt there's any advantage in the row value comparison compared to the decomposed multiple equalities, though...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants