Exploration: What would it take to achieve cost free Linq #2482

YairHalberstadt · 2019-05-03T07:59:05Z

YairHalberstadt
May 3, 2019
Collaborator

Introduction

rust and C++ are languages which attempt to provide cost free abstractions. This has two meanings:

Pay for play. A feature will not cost you anything unless you use it.
If you do use a feature, the generated code should be no worse than that which you would have written yourself.

C# does not attempt to do either of these, but instead tries to make sure that the value of any abstraction justifies it's costs, and that these costs are kept to a minimum. Unfortunately this can mean that some of C#s features are best avoided in high performance scenarios.

One example of such a feature is Linq, one of C#'s 'killer features'.

In many (the vast majority of?) cases a Linq query is consumed immediately, and could in theory be converted to an equavelent foreach loop. Of the remainder, it is very common for the query not to escape the method in which it was declared.

For example

int num = 1;
var result = list.Select(x => x.GetFoo(num)).ToList();

Is equavelent to

int num = 1;
var result = new List<Foo>(list.Count);
foreach(var x in list) result.Add(x.GetFoo(num));

However there are a number of associated costs with the Linq version.

Any captured locals have to be stored in a display class, which must be allocated.
The lambda must be converted to a delegate, which leads to another allocation.
The Select function returns an enumerable, which must be allocated.
ToList() allocates 2 enumerators, one from the original list, one from the selectEnumerable.
Each iteration of the enumerable makes 4 virtual calls (2 to MoveNext, 2 to Current). These calls are slower than non-virtual calls, and cannot be inlined, which prevents other optimisations.

As a result a number of projects (such as Roslyn) ban Linq in any places where performance is necessary.

The aim here is to discuss how these costs might be mitigated, and what language features might be necessary.

This is not a feature request, or a proposal. It's merely the beginnings of an investigation into how a number of disparate language features could come together to take C# into avenues which are currently impossible. Perhaps it will inform C#s long term road plan, perhaps not, but I've found it interesting to explore nonetheless, and I hope you will too.

Avoiding allocating the enumerator, and avoiding virtual dispatch.

Firstly we add the following interface:

public interface IEnumerable<T, TEnumerator> where TEnumerator : struct, IEnumerator<T>
{
   TEnumerator GetEnumerator();
}

Secondly we change the signature of Select to this:

public static SelectEnumerable<TEnumerable, TEnumerator, TSource, TResult> Select<TEnumerable, TEnumerator, TSource, TResult>(this TEnumerable source, Func<TSource, TResult> func) where TEnumerable : struct, IEnumerable<TSource, TEnumerator> where TEnumerator : struct, IEnumerator<TSource>;

public struct SelectEnumerable<TEnumerable, TEnumerator, TSource, TResult> : IEnumerable<TResult, SelectEnumerator<TEnumerator, TSource, TResult>> where TEnumerable : struct, IEnumerable<TSource, TEnumerator> where TEnumerator : struct, IEnumerator<TSource> {...}

public struct SelectEnumerator<TEnumerator, TSource, TResult> : IEnumerator<TResult> where TEnumerator : IEnumerator<TSource> {...}

Full Code is avaliable on Sharplab.

Since, the CLR is guaranteed not to box structs if used as type parameters, if other Linq methods are similarly defined, the methods will all accept struct enumerables, get struct enumerators, and return struct enumerables. This means that no enumerables or enumerators will need to be allocated. Also the JIT should devirtualise all these methods. In theory, given a powerful enough JIT, the generated machine code for iterating over the collections should be as efficient as if it had been directly hand written (not including the costs associated with the lambda). See here for an example of the generated asm, and what the equivalent would look like if it were hand written.

Now this signature looks utterly hideous, but to the consumer they should ideally never need to worry about this, due to type inference and type parameter inference. Unfortunately #741 will have to be implemented first.

The producer will have to work with these hideous signatures, but with a little practice this isn't so hard, and as it's only library writers that will have to so, so this is less problematic. I do have some ideas for using something akin to rust's impl trait to make things significantly easier for those library writers. Once the ideas are more fully formed I may write them up.

There are however 3 pain points that the consumer will have to deal with.

In order to use this with a non-struct enumerable, you would have to create a struct enumerable wrapper around it.
No enumerables currently implement the IEnumerable<T, TEnumerator> interface.
If a single type parameter can't be inferred you would have to write all of them, even if Improved overload inference from generic type parameters/constraints #741 were implemented.

The first and second problems could be solved using shapes/extension everything.

The third problem can be solved through partial type inference, for which a championed proposal already exists: #1349.

avoiding allocating the lambda and the display class.

The first step to avoid allocating the display class, is obviously to make it a struct. The problem with this is that when passing the struct to Select a copy is made, meaning that if the lambda mutates any locals the effect will not be seen outside the lambda.

We could pass the struct by ref, but then we wouldn't be able to store the struct by ref on the returned SelectEnumerator.

What we could however do is make the display class a ref struct. Then just like a Span can store a ref through the ByReference JIT intrinsic, the display class would store the locals by ref using ByReference. The compiler already knows how to make sure a stackalloced Span is never returned from the method it is declared in, and it would similarly make sure the display class is never returned from the method it is returned in.

Since the display class is a ref struct, the returned enumerable from Select has to be a ref struct, in order to store display classes. This means it can't implement an interface. However we've already discussed above how we could replace IEnumerable with the shape SEnumerable<T, TEnumerator>, and return structs, so this should not be a problem.

We should create the following shape:

public shape SFunc<TParam, TResult>
{
     TResult Invoke(TParam param);
}

We then change the signature of Select to:

public static SelectEnumerable<TEnumerable, TEnumerator, TSource, TResult, TFunc> Select<TEnumerable, TEnumerator, TSource, TResult, TFunc>(this TEnumerable source, TFunc func) where TEnumerable : struct, IEnumerable<TSource, TEnumerator> where TEnumerator : struct, IEnumerator<TSource> where TFunc : ref struct, SFunc<TSource, TResult>;

public ref struct SelectEnumerable<TEnumerable, TEnumerator, TSource, TResult, TFunc> : IEnumerable<TResult, SelectEnumerator<TEnumerator, TSource, TResult>> where TEnumerable : struct, IEnumerable<TSource, TEnumerator> where TEnumerator : struct, IEnumerator<TSource> where TFunc : ref struct, SFunc<TSource, TResult> {...}

public ref struct SelectEnumerator<TEnumerator, TSource, TResult, TFunc> : IEnumerator<TResult> where TEnumerator : IEnumerator<TSource> where TFunc : ref struct, SFunc<TSource, TResult> {...}

And introduce the following conversion from a lambda to an SFunc:

var a = 42;
var selected = source.Select(x => x + a);

...
// becomes
...

private struct Params { public int a; }

private ref struct DisplayClass : SFunc<int, int>
{
    public DisplayClass(ByReference<Params> params) => _params = params;
    private readonly ByReference<Params> _params;
    public int Invoke(int param) => param + _params.a;
}

...
var params = new Params { a = 42 };
var function = new DisplayClass(ref params);
var selected = source.Select(function);

This leaves us with a fully allocation free Select statement, which in theory could be reduced by a powerful enough JITter to almost exactly the same code as we would have coded ourselves.

Issues

ref structs cannot be used as type parameters.

This is because ref structs cannot be stored on the heap, and we can't guarantee how a type parameter is used.

I would suggest adding a new constraint ref struct. A type parameter with this constraint must be treated as if it were a ref struct (i.e. cannot be put on the heap). This should allow any type, including both ref structs and normal structs, to be used as a type argument for that parameter. Note that this is opposite to normal constraints, which decrease the number of types that can be used as type arguments, this constraint increases the number of types that can be used.

This has been discussed already at #1148

ByReference is internal to CoreFX.

I don't know the best way to deal with this. Exposing it publicly as an unspeakable type might be considered. However this would allow anyone to write unsafe IL without needing permission to run unsafe code. I imagine someone from the runtime could shed more light on this.

You can not return a SelectEnumerable from the method where it is created.

The compiler would have to prevent this, as doing so would be unsafe, as the Params struct would no longer exist, and so we would be storing a ref to essentially random memory. The compiler already knows how to do this for a stack alloced span, so this should not be tricky to implement. However it does somewhat limit the number of places this cost-free linq can be used.

Like everything, this can be solved with another layer of indirection (by implementing a sort of COW, but one which updates all references to point to the copy rather than the original).

Essentially we would store locals in a Params struct, and then store a reference to that on the stack. We only ever access locals through that reference. We then store a reference to that reference in the display class. We can then box the underlying Params struct when necessary, and update the ref to that struct so that it points to the boxed struct, rather than the original. Since all access to the Params struct was through that reference, everything which had access to the original Params struct, now automatically accesses the boxed version. Now all our ref structs can safely be replaced with normal struct alternatives, since they are storing refs to items on the heap, not the stack.

This might sound confusing, and I would like to show you the code for it. However C# has no way of even writing this concept right now, so I won't bother as it would only be confusing. I believe IL should be capable of this, possibly with the addition of a couple of runtime helpers. The most difficult bit is that we end up storing internal pointers on the heap, which may adversely impact garbage collection performance. In theory though, since they will only ever point to a boxed struct, the GC could learn to recognise this scenario, and optimise for it.

Instead I will show you what it will look like to the consumer:

var a = 42;
var selected = source.Select(x => x + a);
return selected.ToNonRefStruct();

...

public ref struct SelectEnumerable<TEnumerable, TEnumerator, TSource, TResult, TFunc> : IEnumerable<TResult, SelectEnumerator<TEnumerator, TSource, TResult>> where TEnumerable : struct, IEnumerable<TSource, TEnumerator> where TEnumerator : struct, IEnumerator<TSource> where TFunc : ref struct, SFunc<TSource, TResult> 
{
    ...
    public NonRefStructSelectEnumerable<TEnumerable, TEnumerator, TSource, TResult, TFunc> ToNonRefStruct() => new NonRefStructSelectEnumerable<TEnumerable, TEnumerator, TSource, TResult, TFunc>(func.ToNonRefStruct());
}

public struct NonRefStructSelectEnumerable<TEnumerable, TEnumerator, TSource, TResult, TFunc> : IEnumerable<TResult, SelectEnumerator<TEnumerator, TSource, TResult>> where TEnumerable : struct, IEnumerable<TSource, TEnumerator> where TEnumerator : struct, IEnumerator<TSource> where TFunc : SFunc<TSource, TResult> {...}

Whether all this is worth it is a different question. I think the most sensible approach is to say that if you need to heap allocate your captured locals you should use ordinary Linq.

Summary and Conclusion

In this proposal we looked at what it would take to create an allocation free, virtual method call free, fully inlinable version of Linq.

We came up with what I would consider to be a pretty workable proposal, at least for the consumer of the library. As I said, I do have some other ideas about making life easier for the library writer.

Let's list the language and runtime changes that would be required to make this all work:

Shapes (Exploration: Shapes and Extensions #164)
Improved type parameter inference (Improved overload inference from generic type parameters/constraints #741, Champion: "Partial Type Inference" #1349)
Expose ByReference outside CoreFX
Create a lambda conversion to a suitable struct allocated func, as described above
Allow ref structs to be used as type parameters (generic constraint: where T : ref struct #1148)
yield return should be usable from a method returning an IEnumerable<T, TEnumerator>

Championed proposals already exist for 1, 2 and 5 and I think all seem reasonably likely to happen, given enough resources. 6 is also easily enough doable once all the other parts are in place.

The trickiest parts of this are probably 3 and 4. They allow for the creation of allocation free function types which capture variables.

Is it all worth it?

Linq is extremely well established, and is used across the vast majority of C# codebases.

This proposal seeks to create an alternative to Linq which is trickier to use, and more limited in general. It's only advantage is it's improved performance. That's a tough sell...

Most C# projects don't require a level of performance that makes this tradeoff worthwhile. As such an alloc free Linq library would probably be provided as a seperate nuget package, rather than be included in .Net Standard.

Is it worth making so many language changes for such a niche feature?

To that I will make the same arguments as were made for Span. Most people will never have to use an alloc free Linq library. But everyone will use libraries that do make use of this.

That argument isn't as powerful here, since Span allowed for writing code that previously just could not be written, whereas this only allows for the simplification of code which is currently perfectly writable.

However allocation free function types would allow for writing code which currently cannot be written, or at least not without significant pain. For example there are many cases where Roslyn would benefit from using generic visitors which take a func, but at the moment it hand crafts each visitor to avoid allocating the func. Even when the overhead of using a func is considered worth it, great pains are taken to make sure the function does not capture any locals, and instead all locals are passed in as parameters. For an example, take a look at the VisitTypeSymbol extension method.

Once allocation free function types exist, having Linq make use of them is a much smaller task.

So here's the questions I would be interested if people could answer:

Do you avoid using Linq or Functions due to performance concerns?
How many of the subfeatures that are required to make cost free linq work, would you find individually useful, even without the grand prize of cost free linq?
Would you use the cost free version of Linq if it was made available as a nuget package?
Have I made any mistakes here? Is there something I've missed, or will this idea not work for whatever reason?

svick · 2019-05-03T08:46:06Z

svick
May 3, 2019
Collaborator

I definitely think that turning LINQ into a zero cost abstraction would be amazing and it is worth investigating how to get there.

Though most of the proposed solution is about avoiding heap allocations for temporary objects, even though doing that is already being worked on in CoreCLR (https://github.com/dotnet/coreclr/issues/20253) and that version does not require any language support or using unwieldy deeply nested generic types. Another necessary feature is devirtualization, which is also being worked on (https://github.com/dotnet/coreclr/issues/9908).

So, my question is: isn't a better approach to invest more into CoreCLR? I think doing that could in theory achieve all benefits of this proposal, with none of its drawbacks. And it's even better, in that it would improve the performance of all code, not just the code that has been specifically tuned. Though I believe getting CoreCLR there would require a significant amount of work, possibly much more than implementing this proposal.

2 replies

jimmytokens Sep 22, 2020

The people working on CoreCLR are actually closer to this goal than you might think. I believe investing in CoreCLR would be a more effective and time efficient solution.

YairHalberstadt Sep 22, 2020
Collaborator Author

@jimmytokens do you have any links showing this? From what I've seen, they've actually given up in the main effort that could achieve this as infeasible.

YairHalberstadt · 2019-05-03T09:07:50Z

YairHalberstadt
May 3, 2019
Collaborator Author

@svick
I definitely agree with you, that it would be ideal if CoreCLR could optimise this all for us without our help.

However I'm a bit worried by a comment @HaloFour said he heard Brian Goetz made. He said Brian Goetz described JVM escape analysis as a failed feature. Given how much hotspot has invested into stack allocating objects, I think relying on the JIT to do all the work for us may be overly optimistic.

0 replies

svick · 2019-05-03T09:35:21Z

svick
May 3, 2019
Collaborator

He said Brian Goetz described JVM escape analysis as a failed feature.

If that's the case, then I think it would be very valuable to understand what made it fail. Maybe there's a way CLR can avoid making the same mistakes? Or maybe it fails at the code patterns commonly encountered in Java, but it would work fine for LINQ?

0 replies

HaloFour · 2019-05-03T12:26:26Z

HaloFour
May 3, 2019

@svick @YairHalberstadt

It was a comment he made at a recent conference where he was discussing Java evolution/futures, including Valhalla which seeks to add proper value types (or "inline" types) to the JRE and language. He didn't go into any detail and I didn't think to ask the question. The videos for the conference haven't yet been posted.

It would be awesome if the CoreCLR team could implement escape analysis in such a way that idiomatic LINQ would require zero-allocations.

0 replies

orthoxerox · 2019-05-03T12:30:32Z

orthoxerox
May 3, 2019

I see two other potential enablers, but both are runtime-related:

tracing in the JIT, LuaJIT is great for optimizing away closures and function calls, but this is a huge change to the runtime
double inlining, where the JIT is smart enough to recognize a pattern where method A calls method B with a lambda that is defined in method A, so it can inline method B into method A and the lambda into the new method A. IIRC, that's what Kotlin does (but at compile time).

0 replies

svick · 2019-05-03T12:41:48Z

svick
May 3, 2019
Collaborator

@orthoxerox

double inlining

Since invoking a lambda/delegate is effectively the same thing as invoking a virtual method (you treat the delegate's Method property as if it was a method table pointer), isn't that a subset of what devirtualization can do?

0 replies

YairHalberstadt · 2019-05-03T12:47:16Z

YairHalberstadt
May 3, 2019
Collaborator Author

@erozenfeld
I know you have been working on stack allocation of objects.
Do you have any idea what it would take to avoid any allocations when using Linq? Is this completely impossible, possible but very unlikely, or a goal worth working towards?
Thanks for your input.

0 replies

YairHalberstadt · 2019-05-03T14:29:09Z

YairHalberstadt
May 3, 2019
Collaborator Author

If you take a look at https://github.com/dotnet/corefx/blob/master/src/System.Linq/src/System/Linq/Select.cs,
A different enumerable is returned from Select depending on the type of the source. This means that it will probably be impossible to stack allocate the enumerable returned from Select, since it's impossible to know how much space it will require on the stack.

1 reply

CodingMadness Mar 19, 2022

You just allocate the same amount of storage like the calling sequence before hand?

svick · 2019-05-03T17:53:15Z

svick
May 3, 2019
Collaborator

@YairHalberstadt That does make it harder, but I don't think it makes it impossible.

For example, if the input is statically known to be List<T>, the JIT could inline Select, figure out that all the checks like source is Iterator<TSource> or source is TSource[] are guaranteed to be false, while source is List<TSource> is guarateed to be true, and thus the returned object has to be SelectListIterator.

0 replies

YairHalberstadt · 2019-05-03T17:57:43Z

YairHalberstadt
May 3, 2019
Collaborator Author

@svick
But in the case you don't know the type statically, not only will you not be able to stack allocate the returned enumerable, you wouldn't be able to stack allocate any chained Linq methods either.

0 replies

john-h-k · 2019-05-03T18:10:30Z

john-h-k
May 3, 2019

@YairHalberstadt

I don't know the best way to deal with this. Exposing it publicly as an unspeakable type might be considered. However this would allow anyone to write unsafe IL without needing permission to run unsafe code. I imagine someone from the runtime could shed more light on this.

Currently we expose stuff like Unsafe class so that should not be an issue

0 replies

svick · 2019-05-03T18:33:46Z

svick
May 3, 2019
Collaborator

@YairHalberstadt

But in the case you don't know the type statically, not only will you not be able to stack allocate the returned enumerable

Are you sure? I think it should be possible to allocate different number of bytes from the stack in different branches.

And even if that was true, that's just the current implementation. If that implementation prevented optimizations, it could be changed.

you wouldn't be able to stack allocate any chained Linq methods either.

I think it could still be possible. For example, if all the possible returned types inherited from Iterator<T>, then the following method in the chain could safely choose its Iterator<T> variant. Though this wouldn't always work (sometimes the returned types are too heterogeneous), but again, the implementation could be changed to make it work. Also, I don't know if something like this is actually feasible, but it sounds possible to me.

0 replies

john-h-k · 2019-05-03T18:34:24Z

john-h-k
May 3, 2019

Are you sure? I think it should be possible to allocate different number of bytes from the stack in different branches.

It is

0 replies

orthoxerox · 2019-05-03T20:46:29Z

orthoxerox
May 3, 2019

@svick Maybe I don't know what devirtualization is, but I understand it as getting rid of vtable lookup. Double inlining means quite literally inlining the code. Of course, it works better with eager evaluation, as generator functions cause syntactic diabetes and can't really be inlined. Here's what Kotlin does:

public inline fun <T, R> Array<out T>.map(transform: (T) -> R): List<R> {
    return mapTo(ArrayList<R>(size), transform)
}

public inline fun <T, R, C : MutableCollection<in R>> Array<out T>.mapTo(destination: C, transform: (T) -> R): C {
    for (item in this)
        destination.add(transform(item))
    return destination
}

So when you write

var result = source.map { it * 2 }

the compiler actually rewrites it into something like:

val tmp = ArrayList<Int>(source.size)
for (item in tmp) {
    tmp.add(item * 2)
}
var result = tmp

Of course, as C# has no idea what's inside Enumerable.Select (and it's usually a reference assembly anyway), it cannot do this at compile time.

0 replies

scalablecory · 2019-05-03T20:52:24Z

scalablecory
May 3, 2019

Some thoughts:

Even if not specifically for LINQ, I think a way to do allocationless Func<> lambdas is a fantastic feature that will enable more JIT-time optimization, the likes of which we see in C++.

I like this LINQ focus a lot, though. Not just due to the changes outlined here, but because turning LINQ from a convenience feature into a performance feature has great ramifications down the road: other optimizations that we've thought of but decided not to do (ordered enumerables and merge joins, etc.) suddenly become more worthwhile.

If we could make this happen for standard LINQ automatically, without any special syntax changes, it would be a killer feature. Otherwise it is still a useful feature but maybe not immediately useful enough to be part of corefx.

If we could make it happen for sub-queries, e.g. seamlessly slip in and out of IEnumerable and allocationless depending on what overloads exist, it would be even better. This would provide compatibility with all the existing code out there.

0 replies

CyrusNajmabadi · 2020-08-07T05:38:37Z

CyrusNajmabadi
Aug 7, 2020
Collaborator

Aren't delegates statically compiled by .NET also?

What do you mean by "statically compiled". I'm not sure what it means here, or what relationship it would have to do with delegates not being allocated in the heap.

0 replies

tonygiang · 2020-08-07T06:25:33Z

tonygiang
Aug 7, 2020

Aren't delegates statically compiled by .NET also?

What do you mean by "statically compiled". I'm not sure what it means here, or what relationship it would have to do with delegates not being allocated in the heap.

Statically compiled means the compiler would rewrite your delegate variable into a function as if it was that way to begin with. Let's say we have this delegate declaration:

public delegate int IntReturnInt(int x);

And you have this somewhere in our code:

void SomeFunction()
{
    IntReturnInt Increment = x => x + 1;
    var result = Increment(someInt);
}

Static compilation would rewrite that into:

int compiler_generated_name_Increment(int x) { return x + 1; }

void SomeFunction()
{
    var result = compiler_generated_name_Increment(someInt);
}

No runtime allocation.

0 replies

CyrusNajmabadi · 2020-08-07T07:42:01Z

CyrusNajmabadi
Aug 7, 2020
Collaborator

Statically compiled means the compiler would rewrite your delegate variable into a function as if it was that way to begin with.

No. What gets generated here is more like:

static IntReturnInt  s_cachedLambda = new IntReturnInt(actual_function);
...
var result = s_cachedLambda(someInt);

As to what the runtime does here, you'd have to check with it. It's potentially possible it could attempt to optimize this further.

Note: none of this really has to do with 'static compilation'. there are dynamic languages/compilers that will perform the above optimzations, and there are static languages/compilers that will not.

0 replies

tonygiang · 2020-08-07T07:55:36Z

tonygiang
Aug 7, 2020

Statically compiled means the compiler would rewrite your delegate variable into a function as if it was that way to begin with.

No. What gets generated here is more like:
static IntReturnInt  s_cachedLambda = new IntReturnInt(actual_function);
...
var result = s_cachedLambda(someInt);
As to what the runtime does here, you'd have to check with it. It's potentially possible it could attempt to optimize this further.

Note: none of this really has to do with 'static compilation'. there are dynamic languages/compilers that will perform the above optimzations, and there are static languages/compilers that will not.

Then you could just say .NET doesn't have static compilation.

0 replies

CyrusNajmabadi · 2020-08-07T08:16:35Z

CyrusNajmabadi
Aug 7, 2020
Collaborator

Then you could just say .NET doesn't have static compilation.

I think this is a matter of you using terms in way that makes sense to you, but doesn't follow normal nomenclature here.

Regardless, this tangent doesn't seem to be going anywhere. if you have further questions you might want to consider taking them to our gitter.im/dotnet/csharplang channel, or discord.gg/csharp (#roslyn channel).

0 replies

canton7 · 2020-08-07T08:16:58Z

canton7
Aug 7, 2020

It does compile the lambda to a method, but it then needs a delegate instance pointing to that method, which is allocated at runtime. I recommend playing around with https://sharplab.io to get a feel for how the compiler works.

0 replies

manofstick · 2020-10-29T06:53:47Z

manofstick
Oct 29, 2020

NOTE No 2: Hmmm. This message is becoming quite overloaded... Anyway, now I'm just confused (read previous NOTE below), because I have created a second test which follows a Select().Where().Max() pattern and ValueLinq is running in half (!) the time of the handcoded version. I'm assuming I have done something stupid, but otherwise I just need someone who can analyse assembly to help me understand. Anyway, this will be the last additional edit in this comment; from now on I'll document things where they should be, over at the ValueLinq repo

NOTE: The code mentioned here has now been pushed up to the here. I did find an error with the hand-coded version (it was doing some additional casting between int and doubles) which has been rectified. This meant that the baseline was slower than it should of been. Hence ValueLinq mentioned here was not ~10% slower, but rather ~100% slower. This is still very good (really!), but obviously a bit of a disappointment. What's the old saying? Don't count your chickens before they hatch! Benchmark.net results here

OK, I'm getting a little in front of myself in that what I have here isn't in my repo yet (but I'll give it a little scrub and have it up by this weekend), but I'm somewhat excited by progress and thought I would share my results at this current stage...

Anyway, I'm taking a crack at writing a value-based Linq which you can peruse here. (There is some precedent here and here), but I'm approaching it somewhat differently, merging in some of my ideas that I explored with a previous Linq implementation here (boy, what a glutton for punishment writing two versions of Linq!!!)

Anyway, my goal is that it should be (almost) as simple as changing from using System.Linq to using Cistern.ValueLinq (I say almost as sometimes you need to cast to IEnumerable<> depending on how you are using it...)

But, as an extension to the "standard" Linq functionality I have being playing a bit with some value-type based pure functions (the ones I'm using here lifed from @reegeek's StructLinq) in order to test how close we can get to matching straight code. So let's start with that:

        [Benchmark]
        public double Flatout()
        {
            var total = 0.0;
            foreach (var x in _ints)
            {
                if ((x & 1) == 0)
                    total += x * 2;
            }
            return total;
        }

OK; Pretty simple, the linq transform of this is:

        [Benchmark(Baseline = true)]
        public double Linq() =>
            _ints
            .Where(x => (x & 1) == 0)
            .Select(x => x * 2)
            .Sum();

And, the Cistern.ValueLinq looks exactly the same:

        [Benchmark]
        public double CisternValueLinq_normal() =>
            _ints
            .Where(x => (x & 1) == 0)
            .Select(x => x * 2)
            .Sum();

Now _ints here is an IEnumerable<int> which means that the poor Flatout benchmark doesn't get any loop goodness, so I'll give it a helping hand by having another version with a cast to the underlying array that it actually is, but obviously the "normal" code now has to have this additional knowledge that the Linqs all have wrapped up under the hood:

        [Benchmark]
        public double Flatout_cast_to_array()
        {
            var total = 0.0;
            foreach (var x in (int[])_ints)
            {
                if ((x & 1) == 0)
                    total += x * 2;
            }
            return total;
        }

OK, now we have a version which which uses the value-type structs to perform the actions. These are pure functions. They could copy some state around, but as per discussions above and in other forums this can lead to some perverse outcomes. Anyway, lets show them:

        struct DoubleAnInt : IFunc<int, int> { public int Invoke(int t) => t * 2; } 
        struct FilterEvenInts : IFunc<int, bool> { public bool Invoke(int t) => (t & 1) == 0; }

And then we can use them in my new Linq as follows

        public double CisternValueLinq_struct() =>
            _ints
            .Where(new FilterEvenInts()) // ug, sugar please
            .Select(new DoubleAnInt(), default(int)) // ug, sugar please + better type inference...
            .Sum();

Which once again touches on the type inference issue that has been mentioned in the preceding messages.

Now, the following doesn't exist, but I'm imagining you could have an additional syntax something like the following, where a >=> would create a value type IFunc, and allow a layer of type-inference to do some magic. But hey, one can but dream:

        [Benchmark]
        public double CisternValueLinq_struct_in_dream_land() =>
            _ints
            .Where(t >=> t * 2)
            .Select(t >=> (t & 1) == 0)
            .Sum();

And finally a "nothing up my sleave" version that seemly switches between the value type representation and the IEnumerable<> for trivial interop with existing code. It also shows that segments created in different parts can all be just put together with an Aggregating function, and hence all the magic that ties together everything at runtime.

        [Benchmark]
        public double CisternValueLinq_struct_nothing_up_my_sleve()
        {
            IEnumerable<int> collection = GetCollection();
            IEnumerable<int> withWhere = AddWhere(collection);
            IEnumerable<int> andSelect = AddSelect(withWhere);
            
            var result = andSelect.Sum();

            return result;

            IEnumerable<int> GetCollection() => _ints;
            IEnumerable<int> AddWhere(IEnumerable<int> stuff) => stuff.Where(new FilterEvenInts());
            IEnumerable<int> AddSelect(IEnumerable<int> stuff) => stuff.Select(new DoubleAnInt(), default(int));
        }

So now the drum-roll for how this runs!

Method	Length	Mean	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
CisternValueLinq_normal	1	93.307 ns	0.76	0.01	-	-	-	-
CisternValueLinq_struct	1	95.048 ns	0.78	0.01	-	-	-	-
..._nothing_up_my_sleve	1	156.451 ns	1.28	0.02	0.0153	-	-	64 B
Flatout	1	24.405 ns	0.20	0.00	0.0076	-	-	32 B
Flatout_cast_to_array	1	8.534 ns	0.07	0.00	-	-	-	-
Linq	1	122.235 ns	1.00	0.00	0.0248	-	-	104 B

CisternValueLinq_normal	100	841.937 ns	0.95	0.03	-	-	-	-
CisternValueLinq_struct	100	230.915 ns	0.26	0.01	-	-	-	-
..._nothing_up_my_sleve	100	293.547 ns	0.33	0.01	0.0153	-	-	64 B
Flatout	100	704.422 ns	0.79	0.03	0.0076	-	-	32 B
Flatout_cast_to_array	100	132.444 ns	0.15	0.00	-	-	-	-
Linq	100	886.122 ns	1.00	0.00	0.0248	-	-	104 B

CisternValueLinq_normal	1000000	7,717,081.944 ns	1.03	0.04	-	-	-	-
CisternValueLinq_struct	1000000	1,421,075.820 ns	0.19	0.01	-	-	-	-
..._nothing_up_my_sleve	1000000	1,413,621.875 ns	0.19	0.00	-	-	-	68 B
Flatout	1000000	6,875,806.944 ns	0.92	0.04	-	-	-	32 B
Flatout_cast_to_array	1000000	1,253,509.401 ns	0.17	0.00	-	-	-	3 B
Linq	1000000	7,469,367.057 ns	1.00	0.00	-	-	-	104 B

Some comments:

Benchmark run on .net core 3.1
CisternValueLinq isn't doing any special handling for this particular pattern (as opposed to System.Linq which is specifically optimized for Array->Where->Select)
For large collections it's only about 10% slower than the handcoded version! (hardcoded 17% of linq baseline, valueline at at 19% of linq baseline)
Cistern.ValueLinq doesn't allocate at all for "standard" use. The "nothing up my sleve" version shows that it can be seamlessly merged back and forth via IEnumerable<> (albeit at the expense of allocations, and a little extra startup initialization time, as shown in the length 1 array)
Looking at the single element we see ~90ns difference between this Linqs generated code and the handcoded loop, looking at the 100 elements we also see about ~100ns difference. So the compilers generated code for the Linq and handcoded must be pretty close (I would be very interested in someone with the appropriate skill set analysing the generated assembly)

So what's the catch?

Well, this thing is generics heavy. I mean really heavy. So you're going to have a longer JIT compilation phase, as well as code bloat.

Anyway, I'll polish up what I've got, so you can run it for yourselves (if you don't believe me :-P) But I'm tired and I've got some real work to do.

1 reply

SamPruden Jun 28, 2021

Do you avoid using Linq or Functions due to performance concerns?

Absolutely and routinely. Game development is an area where avoiding these things is standard, but I avoid them on any hot path in anything. I do often write a Linq version as an easy first pass and then optimise later, so there would definitely be value in fast Linq!

Realtime applications in particular care about avoiding allocation, because GC can cause a stutter. It's typical to have a strict zero temporary allocations per frame policy in games.

How many of the subfeatures that are required to make cost free linq work, would you find individually useful, even without the grand prize of cost free linq?

Cost free lambdas would be amazing in themselves.

Would you use the cost free version of Linq if it was made available as a nuget package?

Without question.

Have I made any mistakes here? Is there something I've missed, or will this idea not work for whatever reason?

Some of this discussion focuses on runtime optimisations that would be able to realise these speedups in many cases. This would be great for speeding up many applications, but I'm not sure that it would be useful to me when I'm writing specifically high performance code.

In high performance code I stick to constructs that have guaranteed performance characteristics. I would be wary of using a feature that the spec says allocates, but where the Jitter will probably save me. That's not good enough because it doesn't bake the performance characteristics into the code. I could use a profiler to verify that it does what I want, but I would be worried about future changes in my project, libraries, or runtime causing regressions. I wouldn't be able to offer this fast code as a library with performance guarantees, because there may be some reason the Jitter won't optimise it at their callsite.

Features like Span<T> are perfect for this. They make a clear guarantee that's easy to understand.

I would welcome Linq code getting faster however it happens, but in order to use it in a fast path I would need these well documented and understandable guarantees that are expressed in code.

The same is true of lambdas. Without a guarantee that a lambda is allocation free, I wouldn't use it on the hot path. If you introduced something like ValueAction that could offer that guarantee it would have my attention.

acaly · 2021-07-05T17:43:27Z

acaly
Jul 5, 2021

This would be a great feature to have.

The most convenient way from the user side is to change from using delegates to using interfaces, and ask the compiler to generate a struct from the lambda expression, implementing that interface. In that case, I think it is related to some extent with those proposals of anonymous type (or local type) declaration, for example, #4301 (comment).

0 replies

hez2010 · 2021-08-26T07:19:15Z

hez2010
Aug 26, 2021

I think this would benefit from escape analysis. See dotnet/runtime#11192
The allocation of delegates can definitely be eliminated if they're proved not escaping.

1 reply

alexrp Aug 26, 2021

This was already discussed here: #2482 (comment)

Blealtan · 2021-08-26T09:20:36Z

Blealtan
Aug 26, 2021

Another thought: instead of creating a Shape (the SFunc) and suffer from the incompatibility between delegates and shape, why not allow delegates to become a generic constraint just like interfaces? Along with allowing lambda expressions placed at such calling parameters, it will be possible to have the synthetic anonymous type being passed as generic parameters.

void InvokeAction<A>(A a) where A : Action => a();

InvokeAction(() => Console.WriteLine("From inside lambda"));

18 replies

hez2010 Aug 27, 2021

But these only take effect when the Linq methods are generic. Without the generic argument I don't think the JIT can (and should) inline too much.

It's not correct when you play generics with reference types due to shared generics, see dotnet/runtime#56562.
If you want the method to be inlined, you need to enable PGO. But if you enable PGO, non-generic methods with interface call will also be inlined because of guarded devirtualization. So this doesn't make a difference.

Blealtan Aug 27, 2021

But these only take effect when the Linq methods are generic. Without the generic argument I don't think the JIT can (and should) inline too much.

It's not correct when you play generics with reference types due to shared generics, see dotnet/runtime#56562.
If you want the method to be inlined, you need to enable PGO. But if you enable PGO, non-generic methods with interface call will also be inlined because of guarded devirtualization. So this doesn't make a difference.

@hez2010 I see acaly's point is the closure needs to be a ref struct instead of a class. In such cases, it will be generics with value types, instead of reference types. ref struct closures seems to solve both problems here.

In conclusion, we need to be able to use ref-structs as generic arguments in order to chain allocation-free Linq methods.

@acaly I'm now totally convinced. Will ref structs only allowed in generic methods and other ref structs? Being generic argument of normal generic classes might lead to lifetime problems, since it may have a field of the ref struct type.

Blealtan Aug 27, 2021

Also, if heading to a ref struct lambda, there are already proposals #1060 and #1862. What do you think of these ones?

acaly Aug 27, 2021

Will ref structs only allowed in generic methods and other ref structs? Being generic argument of normal generic classes might lead to lifetime problems, since it may have a field of the ref struct type.

I guess it's another hard question. #1148

Also related to #3452

acaly Aug 27, 2021

ref struct closures seems to solve both problems here.

I am not sure whether closures themselves must be ref-structs. The only restriction to ref-structs is about copying to heap, but it does not forbid copying to stack. dotnet/runtime#50389 is more related, but that original proposal was using attribute+analyzer, so it will be harder if other features like this need to depend on it.

huoyaoyuan · 2021-08-27T16:33:35Z

huoyaoyuan
Aug 27, 2021

I feel Kotlin-like approach very necessary. Recently CoreLib removed a bunch of usages of Linq to remove the dependency of System.Linq assembly in some workloads. If inlined, it can be used in such scenarios again.

1 reply

Blealtan Aug 28, 2021

The problem is Kotlin-like inline only provides the opportunity to manipulate functions, not types. Linq returns a lazy-evaluated Enumerable. Unless the inline method is allowed to:

create a new type inside it;
have an argument-dependent type,

it will not help to build a zero-overhead Linq.

Eli-Black-Work · 2022-10-06T05:28:11Z

Eli-Black-Work
Oct 6, 2022

It feels like this could almost be achieved via source generators, now that those have been released 🙂

10 replies

Eli-Black-Work Oct 7, 2022

@NN--- In that case, couldn't the source generator produce vectorized code?

NN--- Oct 7, 2022

It is possible if you know the type statically.
LINQ code has dynamic code dispatch depending on the actual type.

Eli-Black-Work Oct 11, 2022

@NN--- In this case, I think we'd usually know the type statically, since we could examine the source code for it 🙂

NN--- Oct 11, 2022

Once. You have method receiving IEnumerable,you have lost the static type.

Eli-Black-Work Oct 11, 2022

@NN--- Yes, that's true 🙂 However, this would allow people to use "cost free" LINQ in code that's similar to the code that I posted (where the type is known).

mattwar · 2022-10-09T21:52:30Z

mattwar
Oct 9, 2022
Collaborator

The original LINQ preview compiler could inline the LINQ queries if they were created and used inside a foreach statement, creating no lambdas or query objects. The originating collection might still allocate an enumerator though if it was not one of the simple types like array or list. We decided not to go with this feature in the actual release.

1 reply

Eli-Black-Work Oct 11, 2022

What was the reason for not going through with it?

jmarolf · 2022-10-11T16:01:18Z

jmarolf
Oct 11, 2022

As I see it there are two problems with linq today:

enumeration allocates and is generally slower that a foreach loop
lambdas allocation display classes

I think solving (1) first is the way to go and this is the best path I can see for this to move forward:

C# team implements existential types
runtime team extends IEnumerable and the like to include additional existential type information
C# compiler is updated (if necessary) to recognize these new types of patterns in foreach loops
The JIT is updated (if necessary) to ensure that it properly inlines these at the call sites.

This would get us everything we need except allocating delegates on the stack and enumerating a linq expression should have equivalent performance to constructing a for loop over the same sequence.

The Select function returns an enumerable, which must be allocated.

By having a generic type, it should be possible to avoid the allocation entirely.

Each iteration of the enumerable makes 4 virtual calls (2 to MoveNext, 2 to Current). These calls are slower than non-virtual calls, and cannot be inlined, which prevents other optimisations.

ToList() allocates 2 enumerators, one from the original list, one from the selectEnumerable.

This is an implementation detail of ToList but, again, having the generic type would give the JIT the information it needs to inline things

Why existential types?

I do not think the ergonomics of types with all these additional generic parameters are good enough to ship. Also, they are really just an implementation detail. Existential types allow us to pass this important information to the JIT compiler without asking the user to see it everywhere.

runtime / C# catch-22

We currently have the problem that existential types are not a priority for the language design team right now. Part of that is because Shapes/Roles needs more investigation, but the other part is because there is no point in implementing existential types unless the runtime thinks they can actually consume them. It's unclear to me what kinds of breaks the runtime team is willing to accept on IEnumerable. Changing anything on a type that fundamental is going to break something, it doesn't even matter if the change is technically breaking. At scale all changes are breaking. To move forward we would need to understand what sort of changes could actually be incorporated and then use that to drive the design of language feature.

Delegates

This would not help with display class allocation, but as you point out in the original post that is a much thornier issue. If you return the enumerable then a display class must be allocated. It would be very odd to tell someone that this is not allowed:

public static IEnumerable<T> GetAnnotatedNodes<T>(this SyntaxNode node, SyntaxAnnotation syntaxAnnotation) where T : SyntaxNode
    => node.GetAnnotatedNodesAndTokens(syntaxAnnotation).Select(static n => n.AsNode()).OfType<T>();

However, for cases where the delegate does not escape the function you can use local functions to avoid and allocation:

var tokenMap = pairs
    .GroupBy(GetSyntaxNode, GetSyntaxAnnotation).ToDictionary(GetKey, GetAnnotations);
return root.ReplaceNodes(tokenMap.Keys, (o, n) => o.WithAdditionalAnnotations(tokenMap[o]));

static SyntaxNode GetSyntaxNode(Tuple<SyntaxNode, SyntaxAnnotation> pair) => pair.Item1;
static SyntaxAnnotation GetSyntaxAnnotation(Tuple<SyntaxNode, SyntaxAnnotation> pair) => pair.Item2;
static SyntaxNode GetKey(IGrouping<SyntaxNode,SyntaxAnnotation> group) => group.Key;
static SyntaxAnnotation[] GetAnnotations(IGrouping<SyntaxNode,SyntaxAnnotation> group) => group.ToArray();

Considering that you can use static on lambdas to avoid capturing and you have local functions this problem is less of a concern for me since there are workarounds (as opposed to IEnumerable performance which has none). Stack-based delegates that are passed by-ref feels like an entirely separate feature to me. One that we may do, but it can be done at a later stage.

What can you do about it?

Well dear reader, benchmarking would be very helpful. All of these changes can be done today, just with ugly interface declarations. If folks share the performance improvements they see in their codebases with a change like this it would be great evidence that this needs to be prioritized.

0 replies

neon-sunset · 2023-11-10T19:19:38Z

neon-sunset
Nov 10, 2023

Realistically, Roslyn could simply port/adopt/merge https://github.com/dubiousconst282/DistIL and finally step away from the policy of almost never optimizing IL. Roslyn has sufficient knowledge to unroll, inline and otherwise lower many, many LINQ forms by doing even a simple form of escape analysis (LINQ consumed in place? can be completely optimized away).

The performance benefits this would bring are significant and will enable no-compromise terse syntax making C# competitive with newer languages which have zero/low-cost compile-time checked iterator expressions.

3 replies

CyrusNajmabadi Nov 10, 2023
Collaborator

Roslyn doesn't know the impl of any of hte linq methods, and the spec is very explicit that no assumptions are made about it either. The strong intent is that linq is a purely syntactic translation, and that the impls of those calls we translate to can do anything. This would heavily undo that core design principle that guided linq from the beginning. It's not impossible to do. But it woudl be a large hill to climb to go back on all of that.

neon-sunset Dec 1, 2023

Apologies, I meant the CoreLib LINQ specifically. Third-party libraries of today can already achieve significant improvements by using struct iterators where with DynamicPGO under certain conditions compiler is able to produce proper open-coded loops.

However, for CoreLib LINQ to achieve the same without ever significantly changing its implementation details or without IL rewriting, the JIT/ILC would need to be able to do all of the below:

Object escape analysis and stack allocation and then applying existing struct promotion opts
Unconditionally inlining anonymous delegates directly consumed at their declaration
Recognizing anonymous delegate declaration pattern and optimizing away its instantiation
Proving that the closure does not escape and transforming it into a struct
Significantly increasing inlining budget and making inlining much more aggressive

Given how prevalently LINQ is used, do you think taking a shortcut to improve 90% of user-written LINQ code instead is worth it? (without introducing breaking changes)

NN--- Dec 1, 2023

@neon-sunset Each your point deserves a separate proposal optimization.

For instance even without achieving cost free LINQ, lambda escape analysis and non-GC object for capture can improve a lot of code without any change.

Exploration: What would it take to achieve cost free Linq #2482

YairHalberstadt May 3, 2019 Collaborator

Introduction

Avoiding allocating the enumerator, and avoiding virtual dispatch.

avoiding allocating the lambda and the display class.

Issues

Summary and Conclusion

Is it all worth it?

Replies: 57 comments · 38 replies

svick May 3, 2019 Collaborator

YairHalberstadt Sep 22, 2020 Collaborator Author

YairHalberstadt May 3, 2019 Collaborator Author

svick May 3, 2019 Collaborator

svick May 3, 2019 Collaborator

YairHalberstadt May 3, 2019 Collaborator Author

YairHalberstadt May 3, 2019 Collaborator Author

svick May 3, 2019 Collaborator

YairHalberstadt May 3, 2019 Collaborator Author

svick May 3, 2019 Collaborator

CyrusNajmabadi Aug 7, 2020 Collaborator

CyrusNajmabadi Aug 7, 2020 Collaborator

CyrusNajmabadi Aug 7, 2020 Collaborator

mattwar Oct 9, 2022 Collaborator

Why existential types?

runtime / C# catch-22

Delegates

What can you do about it?

CyrusNajmabadi Nov 10, 2023 Collaborator

YairHalberstadt
May 3, 2019
Collaborator

Replies: 57 comments 38 replies

svick
May 3, 2019
Collaborator

YairHalberstadt Sep 22, 2020
Collaborator Author

YairHalberstadt
May 3, 2019
Collaborator Author

svick
May 3, 2019
Collaborator

svick
May 3, 2019
Collaborator

YairHalberstadt
May 3, 2019
Collaborator Author

YairHalberstadt
May 3, 2019
Collaborator Author

svick
May 3, 2019
Collaborator

YairHalberstadt
May 3, 2019
Collaborator Author

svick
May 3, 2019
Collaborator

CyrusNajmabadi
Aug 7, 2020
Collaborator

CyrusNajmabadi
Aug 7, 2020
Collaborator

CyrusNajmabadi
Aug 7, 2020
Collaborator

mattwar
Oct 9, 2022
Collaborator

CyrusNajmabadi Nov 10, 2023
Collaborator