Allow users to express SQL row value comparison syntax #26822
Comments
This is supported by PostgreSQL, SQLite, and MySQL (though your linked article claims that it isn't index compatible - that may have changed); it's not supported by SQL Server. I've just been looking into this recently, with regard to keyset pagination (see this deck, slides 22-23). First, since row value comparison ( Note that in the PostgreSQL provider, I recently implemented pattern matching over the decomposed version, identifying it and generating row value comparison in SQL instead (see npgsql/efcore.pg#2111, this was mostly a fun little project without any specific benefits in mind). This means that the C# side stays ugly, but the SQL becomes terser/nicer. We could promote this to relational, but I do think that should only be done if there's a tangible benefit to doing so (beyond terser SQL). |
I've investigated this, and yes I'm sure of at least one perf benefit. In PostgreSQL, I created a table with a million records, and I have an index on created + id. I'm doing a simple seek + limit on the 2 columns. Here's the execution plan for a logical condition ( There's an Index Scan with a filter operation. In contrast, here's the execution plan when using row value instead:
The difference stems from the fact that no db will be able to realize that it can use an access predicate on the 1st column when using the logical condition, but it understands exactly what access predicates to use when using row value. One way to force all dbs to recognize this without using row value is to add a redundant clause as follows:
The first line above results in the db using an access predicate before filtering:
I'm not sure if this exactly matches in perf a db implemented row value for 2 columns, but it definitely won't match it for more than 2 columns (it'll be harder to keep generating these access predicate hints in the logic conditions properly then). (There's one alternative form to do this without using a redundant clause, but it's much much harder for a human to understand.) My test table is extremely simple (which is also why the filter had nothing to do after the access predicate), I'm sure the perf benefit will be more apparent in the real world. I want to emphasize again how much easier it is to form the row value syntax than a generalized logical condition when we're dealing with more than a few columns. Granted, it's rare to have a query doing this over more than 2-3 columns, but anyway, the generalized condition expression for this is:
It's very easy to get wrong (and a horror to maintain). This is great. Trying to translate into row value from a db function instead of pattern matching will obviously be much easier too. Also, when using efcore.pg, I'm wondering (haven't checked much of the code) if this will pick up the generated condition expressions I dynamically form in my package, since there's the additional optimization I have with the access predicate clause which might throw your pattern matcher off. (But with a quick look, it seems that it needs a very specific form when there are 3 columns, which wouldn't work with the generalized logical condition, and no support for more than that?). In conclusion, I do think it has performance benefits as the db can optimize the execution because the intent is clear, and it would still be much easier to write in C# (even if uglier than others) than forming the error-prone logical condition ourselves. I think there's also a lot to gain if the EFCore db provider that's being used automatically either chooses to translate it to row value when supported and applicable, or forms the logical condition otherwise (for reference, I do this here). |
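To make the generalized logical condition discussed above concrete, here is a minimal sketch in Python with the standard-library `sqlite3` module. It is not the code from the package or from efcore.pg. The function name `keyset_condition`, its parameters, the `:p0`-style parameter names, and the table `t` are all made up for illustration. It emits the row value form when a hypothetical `supports_row_values` flag is set, and otherwise the expanded OR/AND form, optionally with the redundant first-column clause. It assumes a SQLite build with row value support (3.15 or later) and trusted column names.

```python
import sqlite3


def keyset_condition(columns, op=">", supports_row_values=True, hint_first_column=False):
    """Build a WHERE fragment comparing `columns` to parameters :p0, :p1, ...

    Hypothetical helper: column names are interpolated as-is, so they must
    come from trusted code, never from user input.
    """
    if op not in (">", "<"):
        raise ValueError("op must be '>' or '<'")
    params = [f":p{i}" for i in range(len(columns))]
    if supports_row_values:
        return f"({', '.join(columns)}) {op} ({', '.join(params)})"
    # Expanded form: one OR branch per column. Branch i pins columns 0..i-1
    # with equality and applies the strict comparison to column i.
    branches = []
    for i, col in enumerate(columns):
        terms = [f"{c} = {p}" for c, p in zip(columns[:i], params[:i])]
        terms.append(f"{col} {op} {params[i]}")
        branches.append("(" + " AND ".join(terms) + ")")
    condition = "(" + " OR ".join(branches) + ")"
    if hint_first_column:
        # Redundant non-strict bound on the first column (the access predicate hint).
        condition = f"{columns[0]} {op}= {params[0]} AND {condition}"
    return condition


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT, b INT, c INT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(a, b, c) for a in range(3) for b in range(3) for c in range(3)],
)
values = {"p0": 1, "p1": 1, "p2": 1}


def run(condition):
    sql = f"SELECT a, b, c FROM t WHERE {condition} ORDER BY a, b, c"
    return conn.execute(sql, values).fetchall()


cols = ["a", "b", "c"]
row_value = run(keyset_condition(cols))
expanded = run(keyset_condition(cols, supports_row_values=False))
hinted = run(keyset_condition(cols, supports_row_values=False, hint_first_column=True))
print(row_value == expanded == hinted)
```

All three forms select the same rows here. Whether they get the same query plan is a separate question that this sketch does not measure.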
@mrahhal thank you for the details and for the perf investigation! I got slightly different-looking query plans in my test (see full details below), but the end result is the same; row values do indeed improve performance in PostgreSQL. This shows once again that it's always worth investigating generating simpler SQL.
Perf investigation
Database setup
CREATE TABLE products
(
id INT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
sort1 INT,
sort2 INT,
name TEXT
);
CREATE INDEX IX_name ON products(sort1, sort2);
DO $$BEGIN
FOR i IN 1..5000000 LOOP
INSERT INTO products (sort1, sort2, name) VALUES (i % 10, i % 5, 'Product' || i);
END LOOP;
END$$;
With row value comparison
EXPLAIN SELECT id, name FROM products WHERE (sort1, sort2) > (5, 2);
Resulting plan:
Without row value comparison
EXPLAIN SELECT id, name FROM products WHERE sort1 > 5 OR sort1 = 5 AND sort2 > 2;
Resulting plan:
On the question of how users can express this in C#...
In any case, I've opened dotnet/EntityFramework.Docs#3582 to track adding docs on good pagination practices. Thanks again for opening this issue! |
Here's my own code for the table I created in my perf investigation, for completeness.
CREATE TABLE t1 (
id SERIAL PRIMARY KEY,
created timestamp with time zone
);
CREATE INDEX IX_created ON t1 (
created
);
CREATE INDEX IX_created_id ON t1 (
created,
id
);
-- Generates about 1.5 million records
INSERT INTO t1 (created)
SELECT g.created FROM generate_series(
timestamp with time zone '2019-01-01',
timestamp with time zone '2021-11-17',
'1 minute'
) AS g (created);
And the particular queries (originally from the EF log when I was testing inside a sample app, so forgive the verboseness):
-- Logical condition
SELECT "u"."id", "u"."created"
FROM t1 AS "u"
WHERE ("u"."created" < '2020-11-24 18:51:23.831772') OR (("u"."created" = '2020-11-24 18:51:23.831772') AND ("u"."id" < 21))
ORDER BY "u"."created" DESC, "u"."id" DESC
LIMIT 20
-- Logical condition with an added clause to hint for access predicate
SELECT "u"."id", "u"."created"
FROM t1 AS "u"
WHERE ("u"."created" <= '2020-11-24 18:51:23.831772') AND (("u"."created" < '2020-11-24 18:51:23.831772') OR (("u"."created" = '2020-11-24 18:51:23.831772') AND ("u"."id" < 21)))
ORDER BY "u"."created" DESC, "u"."id" DESC
LIMIT 20
-- Row value
SELECT "u"."id", "u"."created"
FROM t1 AS "u"
WHERE ("u"."created", "u"."id") < ('2020-11-24 18:51:23.831772', 21)
ORDER BY "u"."created" DESC, "u"."id" DESC
LIMIT 20
Now that you mention this, it would be great having this in C# itself with tuples... But otherwise, I really like this one you mentioned:
ctx.Blogs.Where(b => EF.Functions.GreaterThan(new[] { b.Column1, b.Column2 }, new[] { 1, 2 }))
Better than what I proposed, looks great and very readable. The tuple overload is a good idea too, otherwise maybe always having it
What do you think about EFCore translating it to conditional logic (as I do in my package) in such a case? I think that's a pretty valid alternative for row value (still efficient + supported in all dbs) and would save consumers, especially package authors who want to use this and don't have control over the provider, from having to worry about whether it'll be supported or not. Otherwise, there will need to be some discoverability method to know beforehand whether we can use it and then use an alternative. And thanks for the doc issue! |
Yep, there's definitely no efficiency issue here. I do suspect that in 90% of cases, users will have 2, maybe 3 keys at most, so a shorter value tuple construct would be a bit nicer... But definitely not critical.
Absolutely. Assuming the team agrees, the thing I have in mind is a general, relational translator which produces a new RowValueExpression, which then gets generated in SQL. SQL Server (and possibly other databases which don't support rowsets) would instead opt into translating to the decomposed conditional logic variant instead. Since it's always possible to translate this and the question is only whether to choose between a more efficient variant and a less efficient one, that should be an internal detail controlled by the provider. |
Awesome! If possible I'd love to work on this when a design is decided. |
Design decision: we think it's a good idea to introduce this as a building block. Here's what this would introduce:
@smitpatel does all the above sound right? If so, @mrahhal we'd be happy to accept a PR for this. /cc @michaelstaib, once this is implemented, you probably want to implement your paging support with this. |
@roji Thanks for the details. Will give this a go soon and keep you updated. |
Some more examples of row value usage across databases, to help inform our decision on how to design the support:
PostgreSQL allows referring to an actual table row as a row value, comparing it to another literal row value:
CREATE TABLE foo (bar1 INT, bar2 INT);
INSERT INTO foo VALUES (1, 2), (3, 4);
SELECT * FROM foo WHERE foo = (1, 2);
PostgreSQL also has composite types, like a UDT (class/struct type definition in the database). The literal for those is also a row value:
CREATE TYPE complex AS (r INT, i INT);
CREATE TABLE bar (id INT, complex complex);
INSERT INTO bar (id, complex) VALUES (1, (1, 2)), (2, (3, 4));
SELECT * FROM bar WHERE complex = (1, 2);
However, PostgreSQL also has 1st-class arrays:
CREATE TABLE baz (id INT, arr INT[]);
INSERT INTO baz (id, arr) VALUES (1, ARRAY[1, 2]), (2, ARRAY[3, 4]);
SELECT * FROM baz WHERE arr = ARRAY[1, 2];
So expression tree-wise, I think row values and arrays probably need to be two different node types (row values in relational, arrays in PG only).
Note that SQLite also supports various cool things with row values:
SELECT * FROM info WHERE (year,month,day) BETWEEN (2015,9,12) AND (2016,9,12);
SELECT ordid, prodid, qty
FROM item
WHERE (prodid, qty) IN (SELECT prodid, qty FROM item WHERE ordid = 365);
UPDATE tab1 SET (a,b)=(b,a);
MySQL/MariaDB also supports at least some forms of row values:
CREATE TABLE foo (bar1 INT, bar2 INT);
SELECT * FROM foo WHERE (bar1, bar2) = (3, 4);
SELECT * FROM foo WHERE (bar1, bar2) > (3, 4);
SELECT * FROM foo WHERE (bar1, bar2) IN ((1, 2), (3, 4)); |
Since this is being given more time in design, I can't help but think about how C# tuples would have been perfect to represent row value in several places, if there was a way to give them comparison operators that is (only equality operators are defined at the moment). This isn't possible without custom operator overloading (extension everything), tracked here: dotnet/csharplang#192. It would have allowed representing all kinds of row value you want to support cleanly in C#:
.Where(b => (b.Column1, b.Column2) > (1, 2))
// instead of
.Where(b => EF.Functions.GreaterThan(new[] { b.Column1, b.Column2 }, new[] { 1, 2 }))
There doesn't seem to be progress at all in this proposal (dotnet/csharplang#192) unfortunately, but it might be good to keep it in mind. |
@mrahhal you're right that value tuple comparison would have been great here - but I think this is trickier than it looks:
To summarize, I don't think this would be very feasible with extension operators; ideally this would simply be a native capability in C#. However, I'm not aware of many uses for tuple comparisons beyond pagination, which is more of a database thing. In other words, I'm not sure how useful this would be as a C# feature beyond helping us express SQL in EF Core LINQ queries. |
Yes there are a lot of tricky aspects to it. There's also no good namespace to add these operator extensions to since a lot of EF consumer code is tied to IQueryable and not to an EF Core namespace. As for the tuple types, I thought it'll be feasible to overload some of the common scalar types for 1-3 arguments, but yeah not very pretty. In any case this isn't doable for now yeah. |
Note that this was done for PostgreSQL in npgsql/efcore.pg#2350, using ITuple (ValueTuple/Tuple) to represent a row value. There's nothing PG-specific in that implementation, so if we want to, we should be able to bring at least some of it over to relational. Note that since SQL Server doesn't support row value comparisons, we could have a post-processing step that converts row value comparisons to the expanded form (i.e. instead of |
Looks good! And I agree that using the currently uglier value tuples is better than arrays. Hopefully they get added to expression trees soon.
I would also want to consider generating some redundant clauses that help get us closer to row value perf (as discussed here. I implement a simple optimization on the 1st column here). Although I haven't actually done perf comparisons with large data in the real world. This is purely based on the observed query plans. |
Good suggestion - we should definitely keep this in mind if/when we get around to implementing this for SQL Server (and analyze the perf to make sure). Beyond the perf aspect, hopefully this improvement (currently in PG only) helps people adopt keyset pagination as it removes the need to deal with the complex expanded comparisons... |
Hey @roji. You removed the blocked tag, so do you think I can start working on this? I'll take your work in pg for reference and then implement the expansion part depending on the provider (we can discuss any possible additional optimizations after). |
Introducing my PG work into EF Core's relational layer would include considerable work, and @smitpatel would have to devote quite a bit of time to reviewing etc. So I think it's up to him to decide if it's a good idea to work on this. |
Perhaps this is a dumb question, but what about pagination where we're sorting one column in ASC order and another column in DESC order (e.g. Or do most pagination systems just not support this? Sorry, I know this is kind of tangential to the topic, but I've been implementing LINQ-based keyset pagination for an arbitrary number of columns and have been struggling with this 😄 |
@Bosch-Eli-Black not a dumb question at all... AFAIK this isn't something you can do with row value comparison (and it seems to be generally a pretty exotic thing with pagination systems...) |
Yes this isn't something you can do with row value, I do support it in MR.EntityFrameworkCore.KeysetPagination though. |
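For mixed sort directions, the expanded logical condition still works if each column gets its own comparison operator. The following is a minimal sketch of that idea in Python with the standard-library `sqlite3` module. It is not taken from MR.EntityFrameworkCore.KeysetPagination. The helper `mixed_keyset_condition`, the `:p0`-style parameter names, and the table `t` are assumptions for illustration, and column names are assumed to be trusted.

```python
import sqlite3


def mixed_keyset_condition(keys):
    """keys: list of (column, direction) pairs, direction being 'ASC' or 'DESC'.

    Hypothetical helper: ASC columns seek forward with '>', DESC columns with '<'.
    """
    branches = []
    for i, (col, direction) in enumerate(keys):
        op = ">" if direction == "ASC" else "<"
        # Pin every earlier column with equality, then compare this column.
        terms = [f"{c} = :p{j}" for j, (c, _) in enumerate(keys[:i])]
        terms.append(f"{col} {op} :p{i}")
        branches.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(branches)


keys = [("a", "ASC"), ("b", "DESC")]
order_by = ", ".join(f"{c} {d}" for c, d in keys)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT, b INT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(a, b) for a in range(3) for b in range(3)],
)

all_rows = conn.execute(f"SELECT a, b FROM t ORDER BY {order_by}").fetchall()
last_seen = all_rows[3]  # pretend the previous page ended on the 4th row

page = conn.execute(
    f"SELECT a, b FROM t WHERE {mixed_keyset_condition(keys)} ORDER BY {order_by}",
    {"p0": last_seen[0], "p1": last_seen[1]},
).fetchall()
print(page == all_rows[4:])
```

The seek picks up exactly the rows that follow `last_seen` in the mixed ordering, which a single row value comparison cannot express.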
@roji Thanks 🙂 I was trying to make my pagination system be general-purpose, but maybe supporting something this exotic is a bit too general purpose 😅 @mrahhal Thanks! I've starred your repo. After taking another look at our requirements, I've decided to go with offset pagination, though, in part because it was easier to implement but also in part because our company's style guide mandates that users should be able to navigate to arbitrary pages of results 🙂 |
@Bosch-Eli-Black you can still implement a hybrid system where clicking "next/prev page" uses row set navigation, and only random-access jumping to arbitrary pages uses offset navigation. |
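A rough sketch of such a hybrid, in Python with the standard-library `sqlite3` module rather than EF Core. The table `posts`, the functions `next_page` and `jump_to_page`, and the page size are all made up for illustration, and a SQLite build with row value support (3.15 or later) is assumed. "Next page" seeks past the last row shown, while jumping to an arbitrary page falls back to OFFSET.

```python
import sqlite3

PAGE_SIZE = 3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, created INT)")
# Ten rows; `created` has duplicates, so `id` is needed as a tie-breaker.
conn.executemany(
    "INSERT INTO posts (id, created) VALUES (?, ?)",
    [(i, i // 2) for i in range(1, 11)],
)


def next_page(last_seen):
    """Keyset step: rows strictly after the (created, id) of the last row shown."""
    return conn.execute(
        "SELECT created, id FROM posts WHERE (created, id) > (?, ?) "
        "ORDER BY created, id LIMIT ?",
        (*last_seen, PAGE_SIZE),
    ).fetchall()


def jump_to_page(page_number):
    """Offset step: random access to a 1-based page number."""
    return conn.execute(
        "SELECT created, id FROM posts ORDER BY created, id LIMIT ? OFFSET ?",
        (PAGE_SIZE, (page_number - 1) * PAGE_SIZE),
    ).fetchall()


page2 = jump_to_page(2)                  # user jumps straight to page 2
page3_via_keyset = next_page(page2[-1])  # then clicks "next"
print(page3_via_keyset == jump_to_page(3))
```

Both paths must use the same ORDER BY, with a unique tie-breaker column, for the pages to line up.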
@roji Ooooh, hadn't thought of that! Thanks! 🙂 |
Hello, I just want to dump this code here just in case someone needs keyset pagination using EF Core 6 (strictly for PostgreSQL) with the row value comparison syntax (for the perf benefit), since unfortunately the Show code:
|
you are working on this? |
Row value is part of the SQL standard (though not supported by every database in all places), and it allows using the following syntax (for example in a where clause):
This is in particular a highly valuable syntax when doing keyset/seek/cursor pagination, and being able to use it with databases that support it will be a big win for seek pagination in particular, but also composite conditions that can gain from this syntax in general.
The row value syntax above is equivalent, in theory, to something like this:
But, for one, this logic condition becomes more complex the more columns you have. In addition, the db implementation of row value will usually be much more efficient than this (there are a few things you can do that can help the db engine in the above condition, but it doesn't look nice or readable).
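As a quick sanity check of that equivalence for two columns, here is a small Python sketch. The names `a`, `b`, `x`, `y` are placeholders chosen for illustration. Python tuples compare lexicographically, which matches SQL row value comparison when no NULLs are involved, so they can stand in for the row value form. NULL handling is not modeled here.

```python
from itertools import product


def row_value_gt(a, b, x, y):
    # Stand-in for SQL's (a, b) > (x, y): lexicographic tuple comparison.
    return (a, b) > (x, y)


def expanded_gt(a, b, x, y):
    # The hand-written logical condition: a > x OR (a = x AND b > y).
    return a > x or (a == x and b > y)


# Brute-force every combination over a small integer domain.
mismatches = [
    combo
    for combo in product(range(4), repeat=4)
    if row_value_gt(*combo) != expanded_gt(*combo)
]
print(len(mismatches))
```

The two forms agree on every combination in this domain. The difference between them is in readability and in how well a database can plan them, not in the result.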
More info about row value here: https://use-the-index-luke.com/sql/partial-results/fetch-next-page#sb-row-values
My use case is a package I worked on that fully implements keyset pagination with EF Core, but does so by building the expressions so that it translates to the logical condition syntax above. I could use a row value translator.
For a related issue to keyset pagination see this: #9115
Proposed solution
What I would love to see is a db function that EF Core understands and translates into the row value syntax for db providers that support it. Something like this:
One thing I'm not sure of is if there's a system in place that would allow a consumer to know if the current database provider supports this syntax or not. Because I want to be able to fall back to my implementation in such a case.
If this seems like an interesting and feasible addition, I'm willing to work on it.