Add commonly required Enumerable methods (DistinctBy, ExceptBy, AsChunked, ...) #14753
👍, as I had to implement many of the …By methods myself.
These seem like two distinct proposals that should be two distinct issues. I agree with both proposals.
If you're chunking, you need to (or at least should) implement the …
I believe the set operations are for convenience, but it's a significant convenience.
And you can use …
@svick - Okay. Hopefully they've implemented a reasonable hash code for their …

@omariom - in such a case the set operations could be overloaded, but … You're right about the use of "Chunk" vs "Partition", although we'd almost certainly want a custom partitioner for such a type too…
I missed that. Suffix "By" is needed.
@Clockwork-Muse IntersectBy should be fixed now.
Yes, it returns the first item. For compat reasons this can never be changed now. It should simply be documented. I hereby request that as well. GroupBy instead of chunking is super tedious, slow, non-order-preserving (at least without additional work) and non-streaming.
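For reference, a common shape of that GroupBy workaround (a hedged sketch; `items` and `chunkSize` are placeholders):

```csharp
using System.Linq;

var items = Enumerable.Range(0, 1000); // any sequence
int chunkSize = 100;

// Chunking via GroupBy: the index bookkeeping restores order, but GroupBy
// still buffers the entire sequence before yielding anything, unlike a
// streaming chunk operator would.
var chunks = items
    .Select((item, index) => (item, index))
    .GroupBy(x => x.index / chunkSize, x => x.item);
```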
I've done (at least some) tests and stubbed out PLINQ's side. What have I missed? From a behavior standpoint, the set operations can actually be implemented via a delegated equality comparer that you could then use like …
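A minimal sketch of such a delegated comparer, assuming a hypothetical KeyEqualityComparer type (this is not the actual stubbed code):

```csharp
using System;
using System.Collections.Generic;

// Delegates Equals/GetHashCode to a key selected from each element, so the
// existing set operations (Distinct, Except, ...) become key-based.
public sealed class KeyEqualityComparer<TSource, TKey> : IEqualityComparer<TSource>
{
    private readonly Func<TSource, TKey> _keySelector;
    private readonly IEqualityComparer<TKey> _keyComparer;

    public KeyEqualityComparer(Func<TSource, TKey> keySelector,
                               IEqualityComparer<TKey> keyComparer = null)
    {
        _keySelector = keySelector;
        _keyComparer = keyComparer ?? EqualityComparer<TKey>.Default;
    }

    public bool Equals(TSource x, TSource y) =>
        _keyComparer.Equals(_keySelector(x), _keySelector(y));

    public int GetHashCode(TSource obj) =>
        _keyComparer.GetHashCode(_keySelector(obj));
}

// Usage: source.Distinct(new KeyEqualityComparer<Person, string>(p => p.Name))
```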
I've stubbed out only one, completely lazy, version of … I'll continue to update the tests as I get related PRs in (I have some other, ongoing work). If I run out of other stuff to do, I might try to implement this stuff.
I would object to the name … I do not know of another name that I would find acceptable.
@GSPP - I don't care too much about the names, so they can/will certainly be changed. I'd like to wait until I have a more definite direction…
If these are to be added (and I'm not sure whether they should be), can we make it a requirement that they are supported for both IEnumerable<T> and IQueryable<T>?
And so on. Therefore … An alternative form of the above would have:
This avoids the more expensive linq-to-objects approach of sorting in memory before taking the first element when an in-memory collection is accessed as a queryable, at the cost of adding an extra check.
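A hedged sketch of that alternative form, using a hypothetical MaxBy as the example (the Enumerable-side MaxBy is assumed to exist as its counterpart):

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

public static class QueryableSketch
{
    public static TSource MaxBy<TSource, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector)
    {
        // Extra check: the in-memory EnumerableQuery provider defers to the
        // enumerable implementation instead of sorting in memory.
        if (source.Provider is EnumerableQuery<TSource>)
            return source.AsEnumerable().MaxBy(keySelector.Compile());

        // Other providers get a full-fidelity expression they already understand.
        return source.OrderByDescending(keySelector).First();
    }
}
```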
I think all IQueryable expressions should go through the same process. No fallback methods. Represent queries with full fidelity and let the provider handle it. EF can easily be extended to implement the solution that you mentioned. Query providers should throw exceptions for anything they do not understand.
In that case, I think the additions would be more trouble than they're worth. Taking … as an example:

Approach people take in the current situation: either use …

Approach people will take if …: use …

Approach people will take if …: use …

(Of course, we could provide it just on enumerables, but then people would likely use it and effectively be doing ….) At the same time, I doubt there would be many implementations that wouldn't result in the same behaviour as if …. These are meant to be convenience methods. A method that throws on some providers and not others isn't very convenient. As it is, lack of support for …
What about an 'Extended LINQ' NuGet package out-of-band?
Fixes #2146, if it is indeed considered desirable functionality. Adds MaxBy, MinBy, DistinctBy, ExceptBy, IntersectBy, UnionBy, and Chunk methods to both Queryable (extension methods applied to IQueryable<T>) and Enumerable (extension methods applied to IEnumerable<T>). The methods are fully consistent with each other, giving the same result as observed from the outside, but by different means. The Queryable methods are all delivered in terms of existing methods, and so will be supported by existing providers. In particular, care is taken to ensure that the new methods that don't use IComparer<T> don't call into Queryable methods that do, as those are relatively less well supported. The Queryable methods test for the query provider being EnumerableQuery and in that case pass to the Enumerable version, which is better optimised for that provider. Note that this introduces a project dependency of System.Linq.Queryable.csproj and System.Linq.Queryable.Tests.csproj on System.Linq.csproj.
Well, there's an implementation. With the set of operations that this gives, the lack of a …
For the version when called on …
Good ideas. Just one remark about:
Extremely large chunks, such as 1 billion elements, should be fully supported in a streaming fashion. A possible use case would be streaming data into multiple big files, each of which is supposed to stay below a certain size (such as 2GB). Automatic materialization of chunks when used in a non-streaming way is a booby-trap, but it might be worth it to get the two proposed APIs to unify into one. Nice idea. I'll probably add that to my own helper-methods library and scrap the existing two methods :) AsChunkedEager has an important use case: improving the efficiency of PLINQ.
This cuts PLINQ overhead by 1000x and makes small work items more feasible. I'm concerned that this would stop working if chunk materialization is lazy, because concurrent materialization is not thread-safe. Maybe we need an …
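A sketch of the PLINQ pattern being described, assuming the eager chunking method exists (written here against Chunk, the name this shape eventually shipped under in .NET 6):

```csharp
using System.Collections.Generic;
using System.Linq;

IEnumerable<int> source = Enumerable.Range(0, 1_000_000);
int Process(int x) => x * 2; // stand-in for a small per-item work item

var results = source
    .Chunk(1000)     // one PLINQ work item per 1000 elements,
    .AsParallel()    // amortizing the per-item scheduling overhead
    .SelectMany(chunk => chunk.Select(Process))
    .ToList();
```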
The question isn't so much one of lazy vs eager (my approach is lazy after all, but its caching means that something that makes eager assumptions from the outside won't realise it, and the extra work done isn't a lot) as caching vs non-caching (with caching being inherent in an eager approach, and optional in a lazy one). I think the most important thing is that things "just work" by default, without one having to worry about whether one can repeat enumerating a chunk, etc., and with things like … I would say that if a lazy, non-caching approach were desired, then it should still check that it is starting at the correct point: such a pure streamed approach would have to work on …
I just noticed that …
That way the caller can make himself a completely lazy stream (…). Just an idea.
Not outside of linq-to-objects they can't. If we want this to be a LINQ method, then the IEnumerable case is just one possible provider. You can't convert …
Anyway. See JonHanna/corefx@fed4174 for something on the fully-streamed approach.
Back to the more general question. Now there's something to look at: does it belong in .NET Core?
In my experience Min, Max & DistinctBy are more useful than GroupBy. I don't see any big downside. Maintaining useful code is a very small issue. Also, I don't think expanding the API with these methods will make LINQ that much more overwhelming for beginners.
I am very strongly in favor of having this built-in and available everywhere. On Stack Overflow we have a lot of questions that would make use of these features. We always have to answer with complicated workarounds or repeat the same code 100 times. There is demand for these things. Stack Overflow is a great place to find evidence for framework deficiencies. Maybe the team should answer 10 questions per day.
Looking at some of the SO questions you mention, I see that there are a few NuGets already. (I even submitted a patch to one, called morelinq, but completely forgot about it in the meantime, because I never used it but was hoping to add stuff of my own that I did use, so I could use it instead.) I think this argues against a NuGet: we're just competing with the existing ones, and while I might think them less than ideal in some regards (e.g. taking the enumerable-only approach I argued against above), they're there and they're working; if we want a NuGet we should probably contribute to one of those instead. The existence of the NuGets offers arguments both for and against inclusion here. Against: …
For: …
On balance I'm becoming less sceptical now. I was pretty much 50/50 on whether they should be added to Enumerable and Queryable directly, but am beginning to think they'd be a good addition.
@hackcraft - Couldn't do …
@Clockwork-Muse can you clarify? Is this referring to the use @GSPP says chunked would have for PLinq or something else? |
@hackcraft - the problem is that you couldn't have a PLINQ Chunk extension method unless corefx provides it: ParallelQuery has one constructor that takes an internal-only type, so third parties can't provide custom implementations. You can cheat by providing wrappers for existing ones, so you could do most stuff that way, but Chunk would have to go the GroupBy route if you tried (it can't be lazy).
Perhaps stddev makes sense; it comes up every now and then.
It's worth noting that while a fully optimal …
@JonHanna MaxBy() reads better also.
@JonHanna, an interesting and valid point. Do you have a rough estimate for how much slower that would be? In any case, MaxBy is far quicker to type and documents what you mean.
Another thought: this design of MaxBy does not support multiple criteria, right? Sorting does. Maybe we should have … In my private utility library I have chosen the overload approach.
The main thing that would make a well-optimised …
I imagine that producing the first element of a sort is still quite expensive despite being O(1). In fact, the whole sequence must be materialized, which is O(N) space, whereas MaxBy is always O(1) space. Looking at …, MaxBy implementations should be specialized to the number of criteria for small numbers of criteria (1-3?). That potential disadvantage can be solved.
QuickSelect is only used for ElementAt/OrDefault. First/OrDefault and Last/OrDefault use a Θ(n) worst-case path that is O(1) in space. They're essentially MaxBy and MinBy. The only extra overhead is creating the comparer used.
(That extra overhead is proportional to the number of potential tiebreak cases introduced by ThenBy steps, which multi-key MaxBy would add to MaxBy.)
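To make the comparison concrete, a minimal streaming MaxBy sketch with the Θ(n) time / O(1) space behaviour described above (a hypothetical shape, not the corefx implementation):

```csharp
using System;
using System.Collections.Generic;

public static class MaxBySketch
{
    public static TSource MaxBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer = null)
    {
        comparer ??= Comparer<TKey>.Default;
        using var e = source.GetEnumerator();
        if (!e.MoveNext())
            throw new InvalidOperationException("Sequence contains no elements");
        TSource best = e.Current;         // single pass; only the current
        TKey bestKey = keySelector(best); // best element is retained
        while (e.MoveNext())
        {
            TKey key = keySelector(e.Current);
            if (comparer.Compare(key, bestKey) > 0)
            {
                best = e.Current;
                bestKey = key;
            }
        }
        return best;
    }
}
```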
It is not the goal of Linq to support all possible scenarios. Any additional API introduces costs of ongoing support and additional size. As I see it, there are two parts to this proposal: …
"Chunking" seems to be a special case scenario, that would not come up very often and when it would come up, a simple solution may not be sufficient. |
I think that at least … You didn't address … I don't think chunking is a special case. I think APIs (especially web APIs) that accept a limited number of inputs are fairly common¹, and chunking is the natural solution to that. I'm not sure what the right solution regarding laziness is, but even the simple solution of returning a lazy enumerable of eager enumerables should be enough for the common uses.

¹ Specifically, I encountered this when making a SQL Server query that exceeded some limit on the query size, and also when using the MediaWiki API to request information about multiple articles based on their ID, where the limit is 50 articles per query.
Chunking is so common. All business apps that I have worked on needed this. Processing data in batches is often required. Many Stack Overflow questions have a simple answer provided a chunking primitive is available. I often see ad-hoc chunking code and it's nasty. Most of the time, batching should look like: …

Or: …

The way to query many rows by some key from an RDBMS is: …

Trivial.
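The three snippets above did not survive extraction; what follows is a hedged reconstruction of the shapes being described, using the proposal's AsChunked name (items, keys, InsertBatch, ProcessBatchAsync, and db are hypothetical placeholders):

```csharp
// Bulk-insert 1000 rows at a time:
foreach (var batch in items.AsChunked(1000))
    InsertBatch(batch);

// Or one async operation per batch:
var tasks = items.AsChunked(100).Select(ProcessBatchAsync).ToList();

// Query many rows by key without exceeding RDBMS parameter limits:
foreach (var keyBatch in keys.AsChunked(500))
    results.AddRange(db.Orders.Where(o => keyBatch.Contains(o.CustomerId)));
```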
I think this really should be supported in the core library; it's really common. Actually, the LINQ …:

```csharp
var minValue = items.Min((item) => item.value);
// same
var minValue = items.MinBy((item) => item.value).value;
```

It's a little more verbose with this specific task, but …
Please reopen this.
We have Distinct, Except, Intersect, Union, etc. Often it is required, though, to base the set operation on some key. OrderBy can do that for example.
I request that the following methods be added: …
The set operations need to deal with duplicate keys in a sensible way. I have added to that list my idea for how to do that in a sane and well-defined way.
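For instance, a minimal DistinctBy sketch showing one sensible duplicate-key policy, keeping the first element seen for each key (a hypothetical shape, not a committed design):

```csharp
using System;
using System.Collections.Generic;

public static class DistinctBySketch
{
    public static IEnumerable<TSource> DistinctBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector)
    {
        var seen = new HashSet<TKey>();
        foreach (var item in source)
            if (seen.Add(keySelector(item))) // true only for a key's first occurrence
                yield return item;
    }
}
```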
Another commonly required helper method is one that splits a sequence into chunks of a fixed size:
This is useful for many things, for example: …
This requirement comes up on Stack Overflow a lot and there is no built-in way to do it.
The chunks should be drawn lazily. In fact the items of each chunk should be lazy as well. This allows for very big chunks such as 1 billion items per chunk.
Maybe we need two variants:
```csharp
IEnumerable<IEnumerable<T>> AsChunkedLazy
IEnumerable<T[]> AsChunkedEager
```
The second one would be easier to handle for consumers. The first one requires more disciplined use, because each chunk must be enumerated at most once and the chunks must be enumerated in order. This is required for efficient streaming operation.
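A minimal sketch of the eager variant under the proposed name (the shape that eventually shipped in .NET 6 as Enumerable.Chunk is essentially this):

```csharp
using System;
using System.Collections.Generic;

public static class ChunkSketch
{
    public static IEnumerable<T[]> AsChunkedEager<T>(this IEnumerable<T> source, int size)
    {
        if (size < 1)
            throw new ArgumentOutOfRangeException(nameof(size));
        var buffer = new List<T>(size);
        foreach (var item in source)
        {
            buffer.Add(item);
            if (buffer.Count == size)
            {
                yield return buffer.ToArray(); // each chunk fully materialized
                buffer.Clear();
            }
        }
        if (buffer.Count > 0)
            yield return buffer.ToArray();     // trailing partial chunk
    }
}
```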