Transform multiple ORs into a single SQL IN #34507
Comments
Some other things to consider:
I'm not completely sure about this... Thinking about it, this is actually something we may even want to do in preprocessing, as a universal optimization (not SQL-specific), where we'd transform comparisons in the pre-translated LINQ expression tree ( But even if we do it in post-processing, what exact problem do you see with NULLs here? Aren't the semantics of multiple ORs identical to the semantics of SQL IN even before SqlNullabilityProcessor?
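The NULL question above can be checked empirically. The following is a minimal sketch, assuming SQLite (via Python's built-in sqlite3 module) as a stand-in database; the thread itself concerns EF Core providers, and the nullable Foo column and the sample rows are assumptions made only for this illustration. It evaluates the OR form and the IN form as raw three-valued predicates, i.e. without any null-semantics compensation applied.

```python
import sqlite3

# In-memory SQLite database; Foo is nullable here (an assumption for this
# sketch) so that the NULL behavior of both predicate forms can be observed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data(Id INTEGER NOT NULL PRIMARY KEY, Foo int)")
conn.executemany(
    "INSERT INTO Data(Id, Foo) VALUES (?, ?)",
    [(1, 3), (2, 5), (3, None)],
)


def evaluate(predicate):
    """Return {Id: predicate value}; the value is 1, 0, or None (SQL NULL)."""
    rows = conn.execute(
        f"SELECT Id, ({predicate}) FROM Data ORDER BY Id"
    ).fetchall()
    return dict(rows)


# Plain constants on the right-hand side.
or_form = evaluate("Foo = 3 OR Foo = 4")
in_form = evaluate("Foo IN (3, 4)")

# A NULL among the compared values.
or_null = evaluate("Foo = 3 OR Foo = NULL")
in_null = evaluate("Foo IN (3, NULL)")

print(or_form, in_form)
print(or_null, in_null)
```

In this sketch the two forms agree row by row, including the rows where the result is NULL rather than true or false. That only speaks to SQLite; other databases would need to be checked separately.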
Why can't we cache that? Is it only because of the SELECT mutability? Or are you thinking about issues with nullable parameters?
I believe that the issue is the usual one: equality (and consequently inclusion) changes meaning over time.
I just mean that in preprocessing (which happens before we translate to SQL, there's no SELECT or anything yet), we're still dealing with the built-in LINQ expression node types, e.g. BinaryExpression as opposed to our SqlBinaryExpression; and we can't modify that expression in order to do the caching...
So again, assuming we do the transformation in post-processing - before SqlNullabilityProcessor - we'd be transforming In fact, the transformation would also pick up our IS NULL representation (which is represented via SqlUnaryExpression), and integrate that into the resulting InExpression as well, so from
This should already be the case - I worked on doing full nullability semantics for InExpression in EF Core 8.0, it's probably one of the more complex parts of SqlNullabilityProcessor. There may be bugs there (or opportunities for improvement) but that's orthogonal to this issue, I think.
Note that we already have an inferior version of this proposal implemented today in SqlExpressionSimplifyingExpressionVisitor (except that it only works on ColumnExpressions, and only for equality comparisons directly within the same OR node). Note also that we have a bug where we incorrectly deduplicate values if they're equal in .NET; but equality in the database may evaluate differently (e.g. case-insensitive collation), so we shouldn't do that (see #34862 (comment)).
Following an internal conversation, the following shows that
CREATE TABLE Data(Id INTEGER NOT NULL PRIMARY KEY, Foo int NOT NULL);
EXPLAIN SELECT * FROM Data WHERE Foo IN (3, 4);
EXPLAIN SELECT * FROM Data WHERE Foo = 3 OR Foo = 4;
This gives the following two plans:
In other words, on PostgreSQL, IN isn't executed in the same way as the equivalent decomposed ORs, and is slightly more efficient. One possible objection to collapsing multiple ORs into a single IN in EF is that the user already has a means of expressing IN directly, via LINQ Contains, and we generally don't make an effort to optimize badly-written LINQ queries. This is in contrast to the proposed BETWEEN optimization (#12634), where there's no C# way to express it. A counter-argument would be that EF itself (rather than the user) could internally produce the multiple ORs, at which point this optimization is useful, though that might be a bit far-fetched.
Is this PostgreSQL or SqlServer?
Another case that cannot easily be expressed right now is the "row value equality" (like in #26822 with equals) but I am not even sure if that is supposed to be included in this issue 😅
Sorry, PostgreSQL - corrected the above.
I think that one should be possible (but isn't currently supported) via |
We've recently been focusing a bit on reducing duplication of expressions in our generated SQL, especially around null semantics. For example, translating to "x IS NOT DISTINCT FROM y" instead of x = y OR (x IS NULL AND y IS NULL) would avoid duplicating x and y in the expression (#29624); this is valuable especially when x/y aren't simple columns/parameters/scalars, but rather complex arbitrary expressions which may be expensive to evaluate. As another example, #12634 tracks transforming x >= y AND x <= z to x BETWEEN y AND z.
We could do the same by transforming multiple disjunctions into a single SQL IN (i.e. x = 3 OR x = 4 becomes x IN (3, 4)); this would allow evaluating x only once, at least in some databases. This is essentially the same idea as #12634 for BETWEEN, and could probably even be implemented at the same time.
Note that the same caveats apply here as for BETWEEN - impure expressions should in theory not be collapsed together (though we don't currently handle such aspects in general), and identifying duplicated expressions requires deep comparison, which can be expensive (#34149 would fix that).
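The proposed rewrite can be sketched on a toy expression tree. This is a language-neutral illustration in Python, not EF Core code: the node types (Column, Constant, Equal, Or, In) and the collapse_ors function are hypothetical names invented for this sketch. It assumes all expressions are pure, relies on structural equality of the left operand (the deep comparison mentioned above), and deliberately does not deduplicate the compared values, since equality in the database may differ from equality in the host language.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical, minimal expression nodes; frozen dataclasses give us
# structural equality and hashing, standing in for "deep comparison".
@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Constant:
    value: object


@dataclass(frozen=True)
class Equal:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class In:
    item: "Expr"
    values: Tuple["Expr", ...]


Expr = Union[Column, Constant, Equal, Or, In]


def flatten_or(expr):
    """Return the operands of a (possibly nested) OR chain, left to right."""
    if isinstance(expr, Or):
        return flatten_or(expr.left) + flatten_or(expr.right)
    return [expr]


def collapse_ors(expr):
    """Rewrite `x = a OR x = b OR ...` into `x IN (a, b, ...)`.

    Equalities are grouped by their left operand. Operands that are not
    equality comparisons are kept and OR-ed back after the grouped ones,
    so operand order may change (assumed safe for pure expressions).
    """
    if not isinstance(expr, Or):
        return expr

    groups = {}  # left operand -> compared values, in first-seen order
    others = []
    for operand in flatten_or(expr):
        if isinstance(operand, Equal):
            groups.setdefault(operand.left, []).append(operand.right)
        else:
            others.append(operand)

    rewritten = []
    for item, values in groups.items():
        if len(values) == 1:
            rewritten.append(Equal(item, values[0]))  # nothing to collapse
        else:
            rewritten.append(In(item, tuple(values)))  # values not deduplicated
    rewritten.extend(others)

    # Re-assemble the remaining operands into a left-leaning OR chain.
    result = rewritten[0]
    for operand in rewritten[1:]:
        result = Or(result, operand)
    return result


x = Column("x")
print(collapse_ors(Or(Equal(x, Constant(3)), Equal(x, Constant(4)))))
```

Unlike the existing simplification described earlier in the thread, this sketch groups equalities across nested OR nodes and on any left operand, not just columns. It only matches equalities written with the shared expression on the left, and it does not recurse into non-OR children.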
/cc @ranma42