
FirstOrDefault after OrderBy invokes predicate for every element (breaking change) #31554

Closed
mnmr opened this issue Nov 23, 2019 · 81 comments · Fixed by #36643
Labels
area-System.Linq untriaged New issue has not been triaged by the area owner

Comments

@mnmr

mnmr commented Nov 23, 2019

We have a bit of code to reserve an account from an available pool, which looks like this:

var account = accounts.OrderBy(x => x.UsageCount).FirstOrDefault(x => x.TryReserve(token));

After porting our code from .NET Framework to .NET Core, this now invokes the predicate method for every item in the list. In practice, this code now reserves ALL accounts and then returns the first.

The problem is with the OrderedEnumerable.TryGetFirst method, which is invoked from the FirstOrDefault extension (presumably as a performance optimization). It orders items lazily and skips elements that do not match the predicate, which therefore has the side-effect of invoking the predicate once per element.

Given how ubiquitous .OrderBy(...).FirstOrDefault(...) is in code bases everywhere, I am surprised that this change in behavior was acceptable. While it's true that you should be careful with predicates that have side effects, coalescing the two operations into one significantly changes what the code does (it is no longer ordering followed by filtering, as the code reads, but an amalgam of the two). Additionally, it's subtle and may go undetected because the code compiles just fine: the worst kind of breaking change.
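The two behaviors can be sketched roughly like this. Account and TryReserve are hypothetical stand-ins mirroring the snippet above, and the comments describe the merged path only in outline; this is not the actual OrderedEnumerable.TryGetFirst implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Account
{
    public int UsageCount;
    public bool Reserved;

    // Hypothetical stand-in for the side-effecting predicate in the report.
    public bool TryReserve(object token) { Reserved = true; return true; }
}

static class Demo
{
    static void Main()
    {
        List<Account> accounts = Enumerable.Range(0, 5)
            .Select(i => new Account { UsageCount = i })
            .ToList();

        // .NET Framework: the sort completes first, then the predicate runs
        // element by element in sorted order until it returns true — here,
        // exactly once, so exactly one account gets reserved.
        //
        // Affected .NET Core versions: OrderedEnumerable.TryGetFirst scans
        // the source and invokes the predicate on each element while
        // selecting the minimal matching one, so TryReserve fires for
        // every account.
        Account first = accounts
            .OrderBy(x => x.UsageCount)
            .FirstOrDefault(x => x.TryReserve(token: null));

        Console.WriteLine(accounts.Count(a => a.Reserved));
    }
}
```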

@stephentoub

@GrabYourPitchforks
Member

Previous discussion, for context:
https://github.com/dotnet/corefx/issues/2400
dotnet/corefx#2401

@msloan

msloan commented Dec 24, 2019

Reading through that discussion is a little terrifying. This significant breaking change is glossed over as merely not being "a 'pure' optimization".

Is there any comprehensive list of breaking changes that includes this one? I'd like our department to move to Core, but it now seems like an unwise move if changes like these have been made silently.

Any chance of opening back up the original discussion that led to this change? @stephentoub @VSadov

@danmoseley
Member

cc @JonHanna

@danmoseley
Member

@stephentoub thoughts about this, now you're back?

cc @PriyaPurkayastha

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@stephentoub
Member

stephentoub commented May 13, 2020

I'm hesitant to change this.

I understand the concern, but having to maintain the exact number of times a predicate is invoked in something like LINQ is incredibly limiting. Consider OrderBy itself: saying we must invoke its comparer function as it was in the past means we couldn't make any changes to the sort algorithm at all ever. I realize that's not the exact case being discussed here, but it's similar in nature: the concern is that we're invoking a predicate differently than we did in the past.

The change has also been around for years, since the beginning of .NET Core, and even with the extensive use cited as motivation, this is the only complaint I've heard about it.

On top of all that, the perf implications are significant. This change effectively made something that was O(N log N) or O(N^2) into something that's O(N). Reverting it now could absolutely negatively impact code.

I'd strongly recommend not using side-effecting functions with LINQ.

@mnmr
Author

mnmr commented May 13, 2020

saying we must invoke its function as it was in the past means we couldn't make any changes to the sort algorithm at all ever

I'm hesitant to object here because I know you're a lot smarter than me, but I don't see why the predicate passed to First(OrDefault) should have any influence on the ability to change the sort algorithm in OrderBy.

the concern is that we're invoking a predicate differently than we did in the past.

My personal concern is about readability. LINQ used to be a sequential method chain, where you could reliably consider each step separately. With this change, you can no longer decipher what will happen purely by reading the code - you must know the implementation as well.

The change has also been around for years, since the beginning of .NET Core, and even with the extensive use cited as motivation, this is the only complaint I've heard about it.

The danger of silently breaking existing behavior is that people may not have discovered the problem, or won't discover it until after it has caused them trouble.

On top of all that, the perf implications are significant. This change effectively made something that was O(N^2) into something that's O(N). Reverting it now could absolutely negatively impact code.

Even if there is some gain to be made by combining the two (as the current implementation does), it should rightly be a new method (e.g. FirstWithOrderBy) instead of a replacement (because the behavior is different, breaking and non-intuitive).

I'd strongly recommend not using side-effecting functions with LINQ.

This was always true but the code is succinct and readable, which to a lot of folks matters more than purity. I expect there are thousands of custom ForEach implementations for the exact same reason.

Please reconsider.

@msloan

msloan commented May 13, 2020

The change has also been around for years, since the beginning of .NET Core, and even with the extensive use cited as motivation, this is the only complaint I've heard about it.

Many companies haven't switched over to .NET Core yet (mine still hasn't). I expect as more people transition over there will be more complaints and bugs silently introduced.

@stephentoub Do you think it would be reasonable to provide a list of breaking changes people should expect (such as this one) to aid in their decision for when and how to switch to .NET Core?

@danmoseley
Member

There is a breaking change list -- e.g., this one is for .NET Framework -> .NET Core.

The process to add to it is to open an issue with this template: https://github.com/dotnet/docs/issues/new?template=dotnet-breaking-change.md

@stephentoub
Member

stephentoub commented May 13, 2020

I don't see why the predicate passed to First(OrDefault) should have any influence on the ability to change the sort algorithm in OrderBy.

I was referring to the predicate passed to OrderBy itself. It shouldn't be any more or less special than other predicates in LINQ.

LINQ used to be a sequential method chain, where you could reliably consider each step separately.

All sorts of LINQ implementations do optimizations that involve rearranging the query, e.g. Entity Framework, and more generally most any LINQ implementation based on queryables.

The danger of silently breaking existing behavior is that people may not have discovered the problem, or won't discover it until after it has caused them trouble.

This is true for practically any bug fix, functional, performance, or otherwise. Developers can take dependencies on all sorts of things that weren't intended, weren't documented, were intended as an implementation detail... something behaved one way and they took a dependency on it behaving that way.

I expect there are thousands of custom ForEach implementations for the exact same reason.

It's also why ForEach has been proposed numerous times and has never been added to Enumerable: Enumerable isn't meant to be used with side-effecting callbacks.

Please reconsider.

That's what we're doing by having this discussion.

@stephentoub
Member

All that said, while I'm personally less motivated by the side-effecting case, it is probably worth looking at it again from a performance perspective. FirstOrDefault with no predicate is hard to argue with from a perf perspective. FirstOrDefault with a predicate now ends up invoking the predicate N times whereas previously it was likely to invoke it a small number of times, and if that predicate performed a non-trivial computation, it could easily outweigh the gains from a small-to-medium size input being sorted.

@terrajobst
Contributor

terrajobst commented May 13, 2020

I agree with @stephentoub and said this earlier:

Looks like that change was included since .NET Core 1.0, changing that now would have issues for folks already on .NET Core.

Given that behavior is in the product for so long and this is the first time I’ve heard about it, I’m not sure it’s that impactful to be honest.

On top of the arguments that were made I'd like to add this:

  • This current behavior impacts people when moving from .NET Framework to .NET Core.
  • The ask is now to change a behavior that exists since .NET Core 1.0, which would impact people going from an older version of .NET Core to a newer version.

Generally speaking, moving between different frameworks is always a breaking change, while we try hard to make moving to a later version of the same framework a smooth experience with as few breaks as possible, ideally none.

And the argument why that is a breaking change in .NET Core is for the same reason: if the predicate has side effects it can alter the behavior of your program. However, even if we were to accept that as a less impactful breaking change it would still result in a perf regression, which can be significant. Like @stephentoub I find myself not eager to accept that fallout.

On top of that, there is the question of what guarantees we want to provide for LINQ. It seems indeed too limiting to say that the order of operations and the number of calls of provided callbacks can't change.

@VSadov
Member

VSadov commented May 13, 2020

Yes. We can think about differences between the "no predicate" and "with predicate" behavior. But it is probably an overriding fact that the linear-scan behavior has been in the framework for so long, since Core 1.0.

So, in the case where side-effecting/counting behavior is expected from FirstOrDefault, is there a simple workaround?

@stephentoub
Member

So, in the case where side-effecting/counting behavior is expected from FirstOrDefault, is there a simple workaround?

OrderBy(...).ToArray().FirstOrDefault(...)

?
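Applied to the snippet from the original report (reusing the reporter's accounts, token, and side-effecting TryReserve), the workaround looks like this:

```csharp
// Materializing the sorted results with ToArray()/ToList() breaks the
// OrderBy/FirstOrDefault merge: the sort finishes first, and the predicate
// then runs in sorted order only until its first match.
var account = accounts
    .OrderBy(x => x.UsageCount)
    .ToArray()
    .FirstOrDefault(x => x.TryReserve(token));
```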

@mnmr
Author

mnmr commented May 13, 2020

  • The ask is now to change a behavior that exists since .NET Core 1.0, which would impact people going from an older version of .NET Core to a newer version.

I don't believe this is correct. .NET Core developers cannot use lambdas with side-effects precisely because there is no guarantee how many times they will be invoked. Adding back that guarantee will thus not break existing code.

That said, you could have code in a unit test that relied on the lambda being called a specific number of times for a given data set, but that would be brittle for many other reasons as well (such as the sort algorithm changing).

@mnmr
Author

mnmr commented May 13, 2020

I feel I should add that there has never been an assumption that the lambda passed to OrderBy could (safely) have side-effects, as sorting by nature needs to invoke the lambda many times.

The case here is that the lambda passed to FirstOrDefault has side-effects, and this is triggered by the preceding OrderBy clause because of the current optimized implementation that merges the two operations.

@terrajobst
Contributor

  • The ask is now to change a behavior that exists since .NET Core 1.0, which would impact people going from an older version of .NET Core to a newer version.

I don't believe this is correct. .NET Core developers cannot use lambdas with side-effects precisely because there is no guarantee how many times they will be invoked. Adding back that guarantee will thus not break existing code.

My point is that code often relies on side effects by accident. People test their code in the context of a given framework, and that assumption might break later. Within .NET Framework we wouldn't have made this change in either direction, because it's likely breaking someone. On .NET Core I can see us making that change in either direction, which is why I said we might convince ourselves that it's an acceptable breaking change to undo the optimization. Breaking changes need to be properly motivated and significantly improving perf is a good justification but so is breaking less code.

It seems your arguments are:

  1. It broke my code so it's likely affecting other people's code too. Given the current behavior shipped four years ago, the number can't be that high or we would have heard about it more often.

  2. It's not intuitive. I can buy that the behavior can be surprising, but we need to consider the larger picture here. You're asking us to change LINQ so that the predicate passed to FirstOrDefault() is only executed once. What about all the other methods that accept lambdas? Why is FirstOrDefault() special? If we want a consistent set of rules that customers can reason about then we'd have to document each and every method and disallow any optimization that would shuffle operators around. I don't think we'll want that. If anything, we'd want more flexibility. Other people have asked for the compiler to "see through" lambdas and create more efficient code by lowering things into for loops and inline lambdas. Looking at the other examples of LINQ (EF, or more generally query plans in SQL engines) the norm is that queries aren't an implementation spec but a declarative description. Preserving side effects seems to me to be too limiting to be a good design choice here.

@msloan

msloan commented May 13, 2020

The primary cost here, to me, is that these semantics will likely be surprising to 99%+ of developers. I just hope the decision makers here take this into account if they decide to value the potential performance gain higher.

@mnmr
Author

mnmr commented May 13, 2020

On .NET Core I can see us making that change in either direction, which is why I said we might convince ourselves that it's an acceptable breaking change to undo the optimization.

What particular implementation detail could .NET Core developers rely on here that would make reverting a breaking change?

The only thing I can think of is if you have code that relied on the lambda passed to First being called at least once for every item, despite only returning the first element. But if I had code like that I would want you to break it for me.

Breaking changes need to be properly motivated and significantly improving perf is a good justification but so is breaking less code.

As Stephen said: the additional performance gains from special-case handling of OrderBy(..).First(..) could very well end up being lost if the lambda passed to First has any non-trivial code in it. So the net gain in performance is questionable. I have no objections to optimizations made purely to the OrderBy implementation.

Why is FirstOrDefault() special?

But that's my point exactly. Why have special-case handling for First-and-friends that behaves differently from (a) what it used to be and (b) what is intuitive behavior from looking at the code.

Injecting ToList/ToArray between the two operations, as the current workaround, should really be a no-op (except for the expected performance hit).

If we want a consistent set of rules that customers can reason about then we'd have to document each and every method and disallow any optimization that would shuffle operators around.

The ask here is only that terminating expressions that do not visit all items "by nature" (so pretty much only First and FirstOrDefault) also behave as such.

For IQueryable there is an implicit understanding that you are writing LINQ expressions that are then translated into something else, e.g. SQL. I don't think it makes sense to consider them under the same umbrella as plain LINQ, where there is no expression tree translation involved (nor do users expect one to be).

@terrajobst
Contributor

terrajobst commented May 13, 2020

What particular implementation detail could .NET Core developers rely on here that would make reverting a breaking change?

...OrderBy(...).FirstOrDefault(func) will currently call the lambda for all elements. It's possible that evaluating this lambda has side effects, for the elements it's executed on, that code happens to rely on. When we no longer do this, code might, for example, throw a NullReferenceException because it didn't properly check an underlying field for null; before, the bug wasn't observed because the lambda happened to trigger the initialization.

Am I making stuff up? Yes. But have I seen bug reports like that? For sure. My point isn't that this is such a big deal that we wouldn't do it. I'm just pointing out that it has very real potential of breaking customer code too. The problem usually isn't "will correct customer code continue to work". It's "will an app that worked before now no longer work". In .NET Framework this bar prevented us from doing lots of things. In .NET Core we're more willing to accept breaks, especially when they can't be observed in correctly written code.
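A contrived sketch of that kind of accidental dependency (all names here are made up for illustration):

```csharp
// Hypothetical: the predicate initializes a field as a side effect.
// With the merged OrderBy/FirstOrDefault, the lambda runs for every item,
// so every Details field ends up non-null by accident. If the merge were
// reverted, items after the first match would keep a null Details and the
// later loop would throw a NullReferenceException.
var match = items
    .OrderBy(i => i.Priority)
    .FirstOrDefault(i => (i.Details = LoadDetails(i)).IsCandidate);

foreach (var item in items)
    Console.WriteLine(item.Details.Name); // silently relies on the side effect
```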

But the last part is the important piece here: what amounts to correct code is often in the eye of the beholder. I think what we're saying is that relying on side effects in lambdas passed to LINQ is incorrect. And to a certain extent that's your argument as well (when you asked that question above).

But that's my point exactly. Why have special-case handling for First-and-friends that behaves differently from (a) what it used to be and (b) what is intuitive behavior from looking at the code.

Plain and simple: perf. Most perf optimizations are achieved by dissolving abstraction boundaries in order to shortcut. This optimization is no different from that.

The ask here is only that terminating expressions that do not visit all items "by nature" (so pretty much only First and FirstOrDefault) also behave as such.

Now we're talking. That answers my question of why you think FirstOrDefault is being special. I actually think that's an interesting point.

@maxkatz6
Contributor

It would be useful to write a Roslyn analyzer that can detect side-effecting operations in LINQ. But I don't believe that is possible right now, given the limited information available from the compiler.

@mnmr
Author

mnmr commented May 13, 2020

I think what we're saying is that relying on side effects in lambdas passed to LINQ is incorrect. And to a certain extent that's your argument as well (when you asked that question above).

I'm aware that side effects are to be avoided, but there are rare cases where it is hard to achieve succinctness without them. Indeed, sometimes the side effect is all you care about (e.g. send an email for each element), but I'm sure that discussion (.ForEach) has been had in the past many times over.

Most perf optimizations are achieved by dissolving abstraction boundaries in order to shortcut. This optimization is no different from that.

There is truth to this, but it is more complicated than that. I do not expect wild code transformations to happen in the name of performance for methods that take a Func argument, unlike methods that take an Expression&lt;Func&gt;.

Now we're talking. That answers my question of why you think FirstOrDefault is being special. I actually think that's an interesting point.

The implication of this is that the scope of the reversal is quite limited and thus unlikely to significantly affect what optimizations can be applied to LINQ expressions as a whole.

@smitpatel
Contributor

Even though EF uses queryables, it has preserved the ordering of operators wherever reordering could cause side effects: for an OrderBy().Where() that could be optimized, EF still generates a subquery with the OrderBy. There are also many rearrangements that do happen, but since everything is translated to SQL, they cannot cause side effects through client functions; for example, source.Where().Join() would apply the Join before the predicate.

@FransBouma
Contributor

Two things are merged into a single operation, for no other reason than performance. This is a slippery slope and not a good thing IMHO.

Say I have a .netstandard 2.0 library with code like:

var x = someEnumerable.OrderBy(a=>a.Field)
		.FirstOrDefault(b=>SomeClass.SomeVeryExpensivePredicateMethod(b));

Say the someEnumerable has 1000 elements and ordering it takes 1 ms on netfx (I know that's dead slow; it's for simplicity). The SomeClass.SomeVeryExpensivePredicateMethod(b) call takes 100 ms. On netfx this whole operation takes 1 ms + 100 ms = 101 ms.

Running my .NET Standard library in an app on .NET Core, however, will now take 1000 × 100 ms + 1 ms, which is over 100 seconds.

No side-effecting method used; a read-only predicate method that does nothing with the argument passed in. Same IL. This is the problem. You merged two distinct methods for an optimization, and that merge does have a side effect. The optimization is understandable, and for a long time no one filed an issue about it (there might have been people running into this; remember, for every person filing an issue there are numerous others who ran into the same problem but didn't file one).

Of course there are workarounds, but that's not the point. Linq to objects has a clear design: enumerable.ExtensionMethodA().ExtensionMethodB()....ExtensionMethodZ(); where one extension method works on the sequence passed in. Here, you merged two of these into one. Not in the public API but under the hood, for performance. While understandable, this isn't right. Like in my example where the same IL works fine on netfx but performs terrible on .netcore due to this optimization.

As a library author of .netstandard libraries I need to be able to rely on APIs having the same net effect, like Linq to objects' extension methods. Mind you, I target netstandard APIs here, and the behavior changes based on the platform I run on. That's a red flag.

The only way to keep the optimization is to push it into Roslyn: if the argument of FirstOrDefault isn't a method call, call the optimized method; otherwise keep the 'old' behavior as-is. Only then can you avoid behavior changes and terrible performance dips (like I showed above) in e.g. .NET Standard targeting libraries.

(What surprises me a bit is that this is still a debate. Two distinct things were merged for performance (while good API design dictates: every method should do one thing and one thing only), and it has side effects. You then do the right thing and un-merge them to avoid the side effect.)

@taisbak

taisbak commented May 14, 2020

So, in the case where side-effecting/counting behavior is expected from FirstOrDefault, is there a simple workaround?

How about ...:

OrderBy(...).SkipWhile(...).FirstOrDefault()

?
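Note that the predicate has to be negated, since SkipWhile skips elements while its condition holds. For the snippet from the original report (with its hypothetical side-effecting TryReserve), that would be:

```csharp
// SkipWhile is a plain streaming operator over the fully sorted sequence,
// so TryReserve runs on each element, in sorted order, only until it first
// returns true — matching the old netfx behavior of the original query.
var account = accounts
    .OrderBy(x => x.UsageCount)
    .SkipWhile(x => !x.TryReserve(token))
    .FirstOrDefault();
```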

@JesperTreetop
Contributor

JesperTreetop commented May 14, 2020

The ask here is only that terminating expressions that do not visit all items "by nature" (so pretty much only First and FirstOrDefault) also behave as such.

Now we're talking. That answers my question of why you think FirstOrDefault is being special. I actually think that's an interesting point.

I fully agree with this point. It's reasonable to infer, from a lazily evaluated sequence, that something which grabs the first element stops when it has the first element and this behavior has been the actual behavior of most or all LINQ operators of this kind since LINQ was introduced. This ticket is frightening because it's as if the && operator sometimes stopped being short-circuiting due to certain optimizations. FirstOrDefault and its ilk are even called LINQ operators, and this turns into "spooky action at a distance" when things jump from one place to another in the pipeline.

If it was AsParallel()..., you would have opted into a different behavior (PLINQ) which undoes those expectations. Here, no such behavior is desired, and there's a long history of the behavior since .NET 3.5. If this behavior is kept, there really needs to be a readily available analyzer flagging it: this is a landmine. I think the compatibility concerns should be reconsidered, both because this is so close to a bug (the change in behavior wasn't announced) and because Microsoft is now heavily pushing everyone to move to .NET Core/.NET 5, so the compatibility concerns go at least as strongly the other way. To say nothing of the major confusion .NET Standard libraries will be in.

@poke
Contributor

poke commented May 14, 2020

I’m really on the fence with this. On one side, I really like performance improvements and gladly accept a few oddities with them. On the other hand, as some examples here show, these performance improvements are really dependent on your use case: if the OrderBy is expensive, then this change makes the query perform better; but if the FirstOrDefault predicate is expensive, then that simply isn’t true.

What bothers me most is that it conflicts with the very common concept that is used to explain LINQ: Every LINQ method operates independently in a pipeline where one method passes items to the next. In that world, FirstOrDefault() could be written like this (simplified):

public static T FirstOrDefault<T>(this IEnumerable<T> source, Func<T, bool> predicate)
{
    foreach (var item in source)
        if (predicate(item))
            return item;
    return default(T);
}

In fact, writing such simplified implementations yourself is a very common way to learn LINQ and is often used to teach the idea that each method operates on its own.

But this optimization now breaks that assumption: the combination OrderBy().FirstOrDefault() now works differently, in a way that is difficult to comprehend and not intuitive when compared to the other parts. I agree with others here that the expectation for IQueryable<> is that the provider will happily combine the query so that it can work with it most efficiently; after all, I’m writing a declarative query there. But I don’t think the same is necessarily true for IEnumerable<>. I feel like this optimization should rather be a new OrderedFirstOrDefault that makes it clear that this is a combined operation.

As for why nobody ever said anything about this in four years: I think it’s simply because you have to actively look for this to find out about this behavior. Unless you actually hit a performance issue or have avoidable side effects, you simply won’t see this changed behavior. Yes, in those cases it likely doesn’t matter either way, but if you do hit this issue, then I think the current implementation is not really intuitive.

@stephentoub
Member

Two things are merged into a single operation, for no other reason than performance. This is a slippery slope and not a good thing IMHO.

Things like this have been done since the beginning of LINQ. Look, for example, at the reference source implementation of Where and Select; it optimizes Where.Where and Where.Select by merging their implementations, with a subsequent operator detecting the previous one and causing the combined operation to be implemented differently.

I very much disagree that such merging shouldn't be done. The real question is when such optimizations are done and they impact observable behavior. Note that there are such optimizations as well since the beginning of LINQ, where the behavior is observably different from the naive foreach-over-the-source behavior. Consider:

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        IEnumerable<int> list = new MyList() { 1 };
        foreach (int result in list.Select(i => i))
        {
            Console.WriteLine(result);
        }
    }
}

class MyList : List<int>, IEnumerable<int>
{
    IEnumerator<int> IEnumerable<int>.GetEnumerator()
    {
        Console.WriteLine("did this print?");
        yield return 42;
    }
}

My custom enumerable implementation provides its own GetEnumerator implementation, so a foreach-over-the-source implementation would end up printing "did this print?" and "42". But try it and you'll see it print "1", because Select special-cases List<T> and ignores its IEnumerable<T> implementation.

My point is, in all of these cases, there's often a lot more ambiguity and gray area than the strict purist view cited provides, and we collectively need to make a judgement call about the tradeoffs.

.FirstOrDefault(b=>SomeClass.SomeVeryExpensivePredicateMethod(b));

This is what I raised earlier as a valid concern:
#31554 (comment)

At the same time, there's nothing in the documentation that says "FirstOrDefault" is only going to execute the delegate until it finds the first. It says that it returns "default(TSource) if source is empty or if no element passes the test specified by predicate; otherwise, the first element in source that passes the test specified by predicate." So forget OrderBy for a moment; a perfectly conforming implementation of that method alone could execute the delegate for every value and then return the first that passed. Is that a good idea from a performance perspective? Maybe, maybe not.
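In other words, even a deliberately eager sketch like this one (hypothetical name, not the real implementation) would satisfy that documented contract:

```csharp
// Evaluates the predicate for every element up front, then returns the
// first element that passed — observably different from short-circuiting,
// yet consistent with the documented return value.
public static TSource EagerFirstOrDefault<TSource>(
    this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    var matches = source.Where(predicate).ToList(); // runs predicate on all
    return matches.Count > 0 ? matches[0] : default(TSource);
}
```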

There might have been people running into this, remember: for every person filing an issue, there are numerous others who ran into the same problem but didn't file an issue

And for every person filing an issue, there are many more not filing an issue because they're not facing one and benefiting from the improved performance the change enabled for them.

Mind you, I target netstandard APIs here, and the behavior changes based on the platform I run on. That's a red flag.

This is an argument against ever taking a bug fix, ever making a performance improvement, ever evolving .NET Core. We do not maintain bug-for-bug compatibility with .NET Framework nor with previous releases of .NET Core. We do want to document places where behavior changes meaningfully so as to help developers migrate and upgrade, but that doesn't mean we don't make such changes; it means we're thoughtful about them, the risks, and the rewards.

@JesperTreetop
Contributor

JesperTreetop commented May 14, 2020

I very much disagree that such merging shouldn't be done. The real question is when such optimizations are done and they impact observable behavior. Note that there are such optimizations as well since the beginning of LINQ, where the behavior is observably different from the naive foreach-over-the-source behavior.

This is fair and the observable behavior is the important part. When .Any() checks if it's a list and whether its Count is larger than zero, that's a great optimization because it does the obvious thing without impacting the user's expectations or the method's contract. Same with various classes to merge where+select pipeline stages, and so on.

I'm personally far from a purist because the purist would say "this is a declarative function that's only well-defined for non-side-effecting functions, and if you have side-effecting functions you deserve whatever madness comes your way".

The bit that worries me isn't that an optimization was made, but that I can't find any case where the post-optimization behavior is what anyone who uses a side-effecting function would ever expect. In other words: I think it's hard to find code that uses side effects that safeguards against this change, and very easy to find uses where the stop-immediately behavior is assumed. Is that part of the method's contract? Maybe not explicitly. The iteration variable in C# foreach statements not being part of the loop's scope was correct according to the language specification too, but it broke people's brains and indirectly caused people to write bugs in a similar way.

Consider also that Roslyn now has code fixes to turn loops into LINQ and LINQ into loops. That strengthens the already strong assumption that LINQ's observable behavior is "like loops": that xyz.OrderBy(...).FirstOrDefault(predicate) is equal to foreach (var z in xyz.OrderBy(...)) { if (predicate(z)) { return z; } }.

(As additional appeal to this being "in the water", if maybe not strictly "on the books", Jon Skeet's EduLINQ reimplementation literally has a test checking for this.)

@juliusfriedman
Contributor

juliusfriedman commented May 18, 2020

I personally think this should be fixed, e.g. only enumerate the sequence until the first item is found. LastOrDefault should be fixed as well if it exhibits any of these traits, although I imagine you will find the logic works correctly there... What makes FirstOrDefault hard to fix is that people are relying on the bug?!?....

...

I have used OrderBy.First(...) or OrderByDescending.Last(...) in a few of my recent .NET Core projects (without reviewing for this bug). I also have some uses of .OrderBy(...).ThenBy(...).FirstOrDefault(), a few of .OrderByDescending().FirstOrDefault(), and a few of OrderBy().FirstOrDefault().

  1. I hope the overload without the predicate works as expected
  2. I hope First, Last and LastOrDefault do not exhibit this bug.
  3. I use it in situations where an Enumerable is usually produced, but you can skip the lookup if you already have what has to be enumerated, e.g. by passing it to the function. In those cases the user either received an IEnumerable<T> or a Func<IEnumerable<T>>. Just think about the implications: if I call OrderBy or OrderByDescending and pass the resulting Enumerable to something, how will they know to avoid my OrderBy, OrderByDescending, etc.? (Please do not suggest checking whether the passed value is IOrderedEnumerable; it never mattered before, e.g. pre-.NET Core.)

This must be fixed for those reasons alone right?

@Nirmal4G

I personally like the optimization, as it assumes the predicate should not change the state of the object being compared. But I do get that there are people who want to do it like this...

var account = accounts.OrderBy(x => x.UsageCount).FirstOrDefault(x => x.TryReserve(token));

But this is what you get, more or less...

var account = accounts.OrderBy(x => x.UsageCount).Where(x => x.TryReserve(token)).FirstOrDefault();
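That equivalence can be observed by counting predicate invocations (a sketch with made-up data; the exact count depends on the runtime version — the optimized OrderedEnumerable.TryGetFirst path consults the predicate for every element, while the classic path stops at the first match in sorted order):

```csharp
using System;
using System.Linq;

int[] items = { 3, 1, 2 };
int calls = 0;

// Sorted order is 1, 2, 3; the first element >= 2 is 2 either way.
int first = items.OrderBy(x => x)
                 .FirstOrDefault(x => { calls++; return x >= 2; });

// The result is identical on every runtime; only the number of
// predicate calls differs (2 on the classic path, 3 on the fused path).
Console.WriteLine($"first={first}, predicate calls={calls}");
```

That call-count difference is invisible for pure predicates and is exactly what a side-effecting predicate like TryReserve observes.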

If I were writing this, I would do it like this:

var account = accounts.OrderBy(x => x.UsageCount).FirstOrDefault(x => x.IsReservable(token));

try
{
  account.Reserve(token);
  doStuff();
}
catch
{
  throwError();
}

No sort optimizations can be done (except for the overload without the predicate), unless we set the precedent that the predicate cannot change the state of the object being compared.

@jakobbotsch
Member

@Nirmal4G such code can simply be written using a foreach loop:

foreach (var account in accounts.OrderBy(x => x.UsageCount))
{
  if (account.TryReserve(token))
    return account;
}

Due to the side effect this is probably how I would prefer the code to look anyway.

@Nirmal4G

Nirmal4G commented May 18, 2020

@jakobbotsch As I understand from reading the query, @mnmr is trying to reserve the token on the first account with the least usage.

But as he says, the ordering should be done before First/Last runs the predicate on each item, so that the predicate (no matter what it is doing to the item) only runs until it succeeds (that internal success is what changes the state). Realistically, I agree with him. But what if failure causes the state change? Then we run into a problem where the object state changes on every item until the predicate succeeds, and we end up with the same problem as above.

So, keep in mind that no optimizations should be performed unless you can do transformations on the call chain and ensure that the result will be the same. Which means that the predicate should never change the state of the object being compared.

@mnmr
Author

mnmr commented May 18, 2020

But what if failure causes the state change? Then we run into a problem where the object state changes on every item until the predicate succeeds, and we end up with the same problem as above.

That would in my mind result in First being a while loop that runs until it succeeds, which makes no sense at all. Unless you mean the delegate passed to OrderBy, which no-one is arguing should safely allow for side-effects. It’d be pure madness to sort by a self-modifying property.

So, keep in mind that no optimizations can be performed, unless you can do transformations on the call chain and ensure that result will be the same.

You can optimize the entire call chain, except the final step (if it has a predicate). And you can optimize each step independently too. So not much of a limitation really.

Which means that the predicate should never change the state of the object being compared.

The key term here being should rather than must. But again, sounds like you are considering the OrderBy delegate, which is not really what this issue is about.

@jzabroski
Contributor

@jzabroski I remember code contracts, and well understand the difference between an object and a function ;-)

I also keep seeing the notion that developers are using LINQ "wrong".

This might sound confusing but I don't care about opinions on wrong or right. I want to first define first principles, and then define what is wrong or right. I don't want "FirstOrDefault sounds like it stops at the first match vs. consumes the whole sequence, so it should work that way based on the name we chose", because that tends to cause unforeseen problems later on.

What I am trying to arrive at is "algebraic laws" for LINQ - or PEMDAS for LINQ. It seems most of us are voting that FirstOrDefault should not have a distributive property, and that OrderBy.FirstOrDefault disallows composing FirstOrDefault with OrderBy by distributing the FirstOrDefault aggregate over the OrderBy operator. But then, what about other aggregate operators?
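The law in question can be made concrete: for a pure predicate p and key selector k, filtering before or after a stable sort picks the same element, so the two compositions below agree (a sketch with made-up data):

```csharp
using System;
using System.Linq;

int[] src = { 5, 3, 8, 2, 7 };
Func<int, bool> p = x => x % 2 == 1;  // a pure predicate: odd numbers
Func<int, int> k = x => x;            // key selector: identity

// Both pick the minimal-by-key element satisfying p.
int a = src.OrderBy(k).FirstOrDefault(p);
int b = src.Where(p).OrderBy(k).FirstOrDefault();

Console.WriteLine($"{a} == {b}");  // → 3 == 3
```

With a pure predicate only the number and order of predicate invocations can differ between the two forms, which is exactly the observable difference a side-effecting predicate is sensitive to.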

And the hard part here is that your example I quote below is only relevant if we allow for impure predicates. (e.g., I imagine @mnmr's TryReserve function might be used to TryReserveHotelRoom or TryReserveSomeEmbeddedHospitalDevicePort).

There's no protection from the compiler, nothing. Imagine some delegate in there that deleted the first customer in a table from the yielded sequence and then returned true. Worked on NET Framework. Run on Core: Boom. And you won't even know it until that code is run - maybe weeks or even months after the migration.

I agree, this logic makes porting code from Linq2Sql or EF6 to EFCore very difficult, especially given all the different permutations even EFCore has gone through with respect to when an IQueryable becomes an IEnumerable (e.g., in EFCore 3.x, a ".Include" creates a cartesian product query in the db).

@isaacabraham

@jzabroski

What I am trying to arrive at is "algebraic laws" for LINQ - or PEMDAS for LINQ. It seems most of us are voting that FirstOrDefault should not have a distributive property, and that OrderBy.FirstOrDefault disallows composing FirstOrDefault with OrderBy by distributing the FirstOrDefault aggregate over the OrderBy operator. But then, what about other aggregate operators?

Realistically, whilst that sounds a really worthy goal to me, I think the time to do this has long gone - probably around 2006/7 when LINQ was being worked on - leaving aside the fact that C#'s type system has no real way of detecting pure or impure functions.

The priority, I feel, should simply be ensuring non-breaking changes to users that migrate from Framework to Core so that they fall into the pit of success. IMHO. I'll stop here though.

@jzabroski
Contributor

Realistically, whilst that sounds a really worthy goal to me, I think the time to do this has long gone - probably around 2006/7 when LINQ was being worked on - leaving aside the fact that C#'s type system has no real way of detecting pure or impure functions.

Except that when I searched the documentation, I found this gem: https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/refactoring-into-pure-functions#standard-query-operators

Standard Query Operators

An important characteristic of the standard query operators is that they are implemented as pure functions.

It therefore implies that the external predicates passed to those overloads should also be pure, and it teaches you how to think in terms of pure functions.

Look, I actually would revert this optimization if it were up to me. But I also don't think that it's fair to constantly review public APIs used by 7 million developers. It's better to get it right, and move on.

A hack would be to add Stephen's optimizations into a separate OOB nuget package called System.Linq.Pure or similar, and leave System.Linq as having legacy .NET Framework behavior without this optimization.

Another possible hack would be an annotation on FirstOrDefault that lets the developer declare if the predicate is pure or impure.

I'm not giving up on the idea we should have our cake and eat it, too.

@Nirmal4G

Nirmal4G commented May 18, 2020

sounds like you are considering the OrderBy delegate

NO, I'm most certainly not.

The key term here being should rather than must

That was my intent. Sorry for the bad English usage.

@jzabroski

What I am trying to arrive it as "algebraic laws" for LINQ

I agree. First of all, predicates must be pure. That should solve most of our problems. C#/IL needs pure (const) members! IMHO.

.NET Core is a new platform after all. We should assume by default that all logic has been touched at some point.

@jzabroski
Contributor

jzabroski commented May 18, 2020

@jzabroski

What I am trying to arrive it as "algebraic laws" for LINQ

I agree. First of all, predicates must be pure. That should solve most of our problems. C#/IL needs pure (const) members! IMHO.

So, @stephentoub can I ask you to discuss with @terrajobst, @danmosemsft and others the idea of temporarily moving your optimization into a System.Linq.Pure OOB package until CIL has verified pure members? Is that a reasonable compromise? Then it becomes a breaking change in the .NET Core release notes for people who want to use System.Linq, pointing them to System.Linq.Pure instead. That also lets people choose their footgun/destiny. (Edit: I guess the havoc this might cause is in all transitive dependencies downstream from packages that switch, as well as Roslyn's find-missing-packages feature possibly getting confused.)

Given Rust has const functions and as engineers we keep up with the Joneses, it doesn't seem like a stretch to imagine at some point in the next 10 years this happens for CIL/C#.

@Nirmal4G

Nirmal4G commented May 18, 2020

We really need mathematically verified implementations as a separate framework package, with the compiler/runtime modified to support them at compile time and run time. The whole purpose of .NET Core was to allow breaking changes, so I would not mind if you guys make .NET mathematically verified with each major version until it completely is.

@JesperTreetop
Contributor

JesperTreetop commented May 18, 2020

An important characteristic of the standard query operators is that they are implemented as pure functions.

It therefore implies that the external predicates passed to those overloads should also be pure, and it teaches you how to think in terms of pure functions.

I don't think that follows. I think it just means that they themselves work in terms of pure functions. Builders, which look like LINQ operators when they're written in a "fluent" style, often don't, and just mask setting flags or fields on some underlying object, but the LINQ operator call itself should be pure in that it returns a new object, and if you have saved an old object, it should not be changed. I.e.: var x = seq.Where(d => d % 2 == 0); var y = x.Select(d => d * d); should allow using x afterwards without the Select having made destructive changes to a fused WhereSelect optimization operator, because you have no idea where the result of the Where operator went. For all you know it's reused in hundreds of programmatically generated queries. (This does not apply to this issue's contested change, just to the quote above and what it means in general.)
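That operator-level purity is easy to demonstrate: building a second query on top of an existing one does not disturb the first, even if the library fuses the Where+Select pipeline internally.

```csharp
using System;
using System.Linq;

int[] seq = { 1, 2, 3, 4, 5 };
var x = seq.Where(d => d % 2 == 0);
var y = x.Select(d => d * d);

// Building and enumerating y (possibly via a fused WhereSelect
// iterator) must not mutate x; both stay independently reusable.
Console.WriteLine(string.Join(",", y));  // → 4,16
Console.WriteLine(string.Join(",", x));  // → 2,4
```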

I would be completely onboard with a restriction to pure functions in predicates and selectors (input functions to LINQ operators) as long as it could be documented and verified by the compiler and the type system. That would allow more optimizations safely, and it would let us know what could and what couldn't be done. If there'd been error messages from the start, this would never have been an issue. But with such a restriction not being in place, it's dangerous for the implementation to suddenly change the assumptions about what the code does now that code has been written for 12 years with a different set of assumptions.

The two-phase solution listed in an earlier answer assumes that you can split the operation into two steps. Maybe that's impossible because both steps need to be taken atomically, or because they're handled by code you don't control.

At the root of this, it comes down to:

var x = [...].FirstOrDefault(impure function);

There's 12 years of programmers writing that instead of:

foreach ([...]) {
    if ([impure function]) {
        break;
    }
}

When both solutions are available, people often choose the shorter one. If the first one had been ruled out by a more capable system, this would never have been possible. (Also, people really like the shorter code: GitHub and NuGet are both full of projects that implement .ForEach(...) extension methods for IEnumerable to avoid using foreach directly. I can't say I agree with that personally, but that's where it comes from.)

@juliusfriedman
Contributor

I personally think this should be fixed, e.g. only enumerate the sequence until the first item is found. LastOrDefault should be fixed as well if it exhibits any of these traits, although I imagine you will find the logic works correctly there... What makes FirstOrDefault hard to fix is that people are relying on the bug?!?....

...

I have used OrderBy.First(...) or OrderByDescending.Last(...) in a few of my recent .NET Core projects (without reviewing for this bug). I also have some uses of .OrderBy(...).ThenBy(...).FirstOrDefault(), a few of .OrderByDescending().FirstOrDefault(), and a few of OrderBy().FirstOrDefault().

  1. I hope the overload without the predicate works as expected
  2. I hope First, Last and LastOrDefault do not exhibit this bug.
  3. I use it in situations where an Enumerable is usually produced, but you can skip the lookup if you already have what has to be enumerated, e.g. by passing it to the function. In those cases the user either received an IEnumerable<T> or a Func<IEnumerable<T>>. Just think about the implications: if I call OrderBy or OrderByDescending and pass the resulting Enumerable to something, how will they know to avoid my OrderBy, OrderByDescending, etc.? (Please do not suggest checking whether the passed value is IOrderedEnumerable; it never mattered before, e.g. pre-.NET Core.)

This must be fixed for those reasons alone right?

@stephentoub since you closed this issue I was wondering if you can confirm the above?

@stephentoub
Member

since you closed this issue I was wondering if you can confirm the above?

GitHub closed it when I merged #36643. What is it specifically you're questioning?

@juliusfriedman
Contributor

juliusfriedman commented May 18, 2020

I quoted my questions directly above your response...

  1. Does LastOrDefault(predicate) exhibit this same issue?
  2. Does LastOrDefault() or FirstOrDefault() exhibit this same issue?
  3. Does First() or Last() exhibit this same issue?
  4. Does First(predicate) or Last(predicate) exhibit this same issue?
  5. I assume OrderByDescending is also affected in the same way.

@stephentoub
Member

stephentoub commented May 18, 2020

Does LastOrDefault(predicate) exhibit this same issue?

FirstOrDefault(predicate) in .NET Framework iterated from the beginning of the enumerable and stopped the first time the predicate passed. That's not the case for LastOrDefault, so I don't know what the question is that you're asking.

Does LastOrDefault() or FirstOrDefault() exhibit this same issue?

The concern with this issue was the invocations of the predicate. I don't know what issue you're referring to when discussing the overloads of FirstOrDefault and LastOrDefault that don't take a predicate.

Does First() or Last() exhibit this same issue?

Same as above.

Does First(predicate) or Last(predicate) exhibit this same issue?

First(predicate), yes. Last(predicate), see my first answer above.

I assume OrderByDescending is also affected in the same way.

Yes.

@JesperTreetop
Contributor

By the way, thanks to everyone at Microsoft for listening here. I understand that it can be frustrating when optimizations are undone by people writing code that can be viewed as "misbehaving", especially when the changes have been in effect for several versions now. At the end of the day I think the right choice was made to keep a behavior that was at least consistent and free of "pockets of different behavior" that people need to pay attention to.

I have enormous respect for @stephentoub for what he has accomplished over the years, from TPL, CDS and TPL Dataflow to all manner of multi-threaded and concurrent library work and now a general eye on performance across all of the BCL, and I understand that it can't have been very pleasant to have people descend on you in the way this thread unfolded - you have my deepest gratitude for all your work over the years and .NET would be a much less hospitable environment without technologies you have been involved with and been one of the public faces for. Thanks also to @terrajobst who posted the thread, which is how I found this.

@jzabroski
Contributor

Agree, I aspire to be as good an engineer as @stephentoub .

@juliusfriedman
Contributor

juliusfriedman commented May 18, 2020

Just to be sure: to work around this, one must write their own First/FirstOrDefault(predicate) methods so they can be sure enumeration stops at the first item for which the predicate matches?

E.g. as @poke has shown above?

public static T First<T>(this IEnumerable<T> source, Predicate<T> predicate)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (predicate == null) throw new ArgumentNullException(nameof(predicate));
    foreach (var item in source)
        if (predicate(item))
            return item;
    throw new InvalidOperationException("Sequence contains no matching element");
}

public static T FirstOrDefault<T>(this IEnumerable<T> source, Predicate<T> predicate)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (predicate == null) throw new ArgumentNullException(nameof(predicate));
    foreach (var item in source)
        if (predicate(item))
            return item;
    return default(T);
}

And this is not the case in .NET Framework, and only really needed when an OrderedEnumerable is encountered as the source?

@jzabroski
Contributor

@juliusfriedman Are you asking about portability across netstandard2.0/netstandard2.1 libraries where the end-user might be on netcoreapp2.1/netcoreapp3.1 LTS? The simplest workaround is to call ToList() or ToArray() between the OrderBy and FirstOrDefault. The drawback to the workaround is extra intermediary allocations. And yes, the reason it works is that the result is no longer the OrderedEnumerable type the optimization keys off of. Effectively, you go from "linq land" to "object land".
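Applied to a query shaped like the one in the opening post, the workaround looks like this (a sketch; Account/TryReserve are hypothetical, so anonymous objects and a counter stand in for them):

```csharp
using System;
using System.Linq;

var accounts = new[]
{
    new { Name = "a", UsageCount = 2 },
    new { Name = "b", UsageCount = 1 },
};
int calls = 0;
Func<string, bool> tryReserve = _ => { calls++; return true; };

// ToList() materializes the sorted sequence into a plain List<T>, so
// the FirstOrDefault that follows takes the ordinary stop-at-first
// path instead of OrderedEnumerable.TryGetFirst.
var account = accounts.OrderBy(x => x.UsageCount)
                      .ToList()
                      .FirstOrDefault(x => tryReserve(x.Name));

Console.WriteLine($"{account?.Name}, calls={calls}");  // → b, calls=1
```

Only the least-used account is "reserved"; the allocation of the intermediate list is the price paid for that guarantee.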

@JesperTreetop
Contributor

JesperTreetop commented May 18, 2020

Another way of opting out of the optimization without full-on ToList or ToArray is to create a cloaking enumerable wrapper which just passes through all enumeration (at the cost of one allocation):

public class CloakedEnumerable<T> : IEnumerable<T> {
    private readonly IEnumerable<T> _inner;

    public CloakedEnumerable(IEnumerable<T> inner) {
        _inner = inner;
    }

    public IEnumerator<T> GetEnumerator() {
        return _inner.GetEnumerator();
    }

    System.Collections.IEnumerator IEnumerable.GetEnumerator() {
        return _inner.GetEnumerator();
    }
}

public static class CloakedEnumerableExtensions {
    public static IEnumerable<T> AsCloakedEnumerable<T>(this IEnumerable<T> sequence) {
        return new CloakedEnumerable<T>(sequence);
    }
}

// later...
sequence.OrderBy(x => x).FirstOrDefault(x => x % 42 == 0) // does not opt out
sequence.OrderBy(x => x).AsCloakedEnumerable().FirstOrDefault(x => x % 42 == 0) // opts out

Note that this needs to be done after all subsequent ThenBy ordering is done since it doesn't implement IOrderedEnumerable.

AsEnumerable() is a tempting alternative but just casts to IEnumerable<T>, and doesn't prevent finding out the identity of the class that the optimization is based on.

@danmoseley
Member

Logged breaking change as dotnet/docs#19683

@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020