
Performance decrease when using interfaces #7291

Closed
Thealexbarney opened this issue Jan 25, 2017 · 79 comments
Assignees
Labels
area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
question: Answer questions and provide assistance, not an issue with source code or documentation.
Milestone

Comments

@Thealexbarney

Thealexbarney commented Jan 25, 2017

The general guidelines I see for writing C# usually say something like, "Accept as general of an object as you can for your parameters." For example, instead of requiring a List as a parameter when you don't actually use List-specific methods, accept an IList, or even an IEnumerable if you can. This way you're not tied to a specific implementation; that's the basic concept of interfaces.

This works well in theory, but in practice I've found it's usually a bad idea when it comes to interfaces like IList, ICollection, or IEnumerable. In many of my projects, changing methods' parameter types from arrays or Lists to interfaces such as IList results in a 2-4x increase in execution time for the program.

Because interface calls are virtual, the JIT doesn't inline them, and indexer access time is greatly increased. I was curious just how big the performance decrease was, so I put together a little demo program to benchmark it, and the results were worse than I expected:

Times are in milliseconds

18       Array
218      Array as IList
26       List
223      List as IList
19       ArrayWrapper
188      ArrayWrapper as IList

Using an IList resulted in about an order of magnitude longer execution time than not using an IList. I wondered if some of this was due to the extra stuff the CLR does when accessing an Array or List as an IList, so I made a simple wrapper class around an array. This resulted in some improvement, but not much.

This virtually negates the advantage of something like accepting an IList as a parameter instead of a T[] to decouple the code from any specific implementation. A new developer might expect that passing an object to a method accepting an IList would result in the same performance as if the method accepted that object's concrete type.

Is there any way to lessen this kind of performance loss? Or is it unavoidable without changing the JIT to be able to inline virtual calls?

category:cq
theme:devirtualization
skill-level:expert
cost:large
impact:large

@mikedn
Contributor

mikedn commented Jan 25, 2017

The general guidelines that I see for writing in C# usually say something like, "Accept as general of an object as you can for your parameters."

Such guidelines need to be taken with a grain of salt. They usually make sense in public APIs but beyond that collection interfaces should be used only when needed. It makes no sense whatsoever to have a private method taking IList<T> when all its callers use List<T>.
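A minimal sketch of that point (hypothetical helper, not from the thread): when every caller of a private method already holds a List<int>, taking the concrete type lets the JIT see non-virtual calls.

```csharp
using System.Collections.Generic;

static class Totals
{
    // Hypothetical private helper: every caller holds a List<int>,
    // so taking the concrete type lets the JIT inline Count and the
    // indexer instead of emitting interface dispatches.
    internal static int Sum(List<int> list)
    {
        int result = 0;
        int count = list.Count;          // non-virtual, inlinable
        for (int i = 0; i < count; i++)
            result += list[i];           // non-virtual, inlinable
        return result;
    }
}
```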

Is there any way to lessen this kind of performance loss? Or is it unavoidable without changing the JIT to be able to inline virtual calls?

Not really. JIT improvements could help here and there but in the general case there's not much that can be done.

@benaadams
Member

benaadams commented Jan 25, 2017

@Thealexbarney you can optimise specific cases with tight loops by casting; for example, changing your SumIList method to the following would allow the inlines when ilist is a List<int>:

private static int SumIList(IList<int> ilist)
{
    int result = 0;

    var list = ilist as List<int>;
    if (list != null)
    {
        var length = list.Count;
        for (int i = 0; i < length; i++)
            result += list[i];
    }
    else
    {
        var length = ilist.Count;
        for (int i = 0; i < length; i++)
            result += ilist[i];
    }

    return result;
}

In your example you are also doing a double interface dispatch per loop iteration (for both Count and the item) and comparing that against a fully inlined, bounds-check-eliminated loop. The time should halve if you cache the count when going via the interface.
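A sketch of the distinction (method names are illustrative): the naive loop makes two interface calls per iteration, while hoisting the count leaves only one.

```csharp
using System.Collections.Generic;

static class InterfaceLoops
{
    // Two interface calls per iteration: get_Count and get_Item.
    internal static int SumNaive(IList<int> ilist)
    {
        int result = 0;
        for (int i = 0; i < ilist.Count; i++)
            result += ilist[i];
        return result;
    }

    // One interface call per iteration: get_Count is hoisted out.
    internal static int SumCachedCount(IList<int> ilist)
    {
        int result = 0;
        int count = ilist.Count;
        for (int i = 0; i < count; i++)
            result += ilist[i];
        return result;
    }
}
```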

@jkotas
Member

jkotas commented Jan 25, 2017

Abstractions have performance costs. Pretty much all runtime abstractions - interfaces, generics, async, linq, ... - have extra performance costs compared to their less abstract equivalents. This overhead does not matter for most of the code and it is often worth paying to get the productivity benefits that these abstractions provide. However, one has to be careful about this overhead for the performance critical parts.

@vancem wrote some great articles on this topic years ago. Links to the archived copies can be found here.

@vancem
Contributor

vancem commented Jan 25, 2017

As Jan points out, abstractions often have costs. This is part of what makes design hard, because you have to make tradeoffs. We have problems with people going 'overboard' both ways. We often have people designing things without any concern for perf, often with really bad consequences. More rarely, we have people who try to optimize 'everything' and end up making hard-to-maintain code.

What we really want is to take advantage of a 90%-10% dynamic that typically happens with software. It is typically the case that WELL UNDER 10% of your code is on high-volume (HOT) code paths. Moreover, it is NOT THAT hard to anticipate this. In this code, there is a good chance you need to NOT be as abstract (e.g. using arrays instead of IList), and think carefully about the APIs (make them chunky, so that they are not called often to do little work). The other 90% of your code you can skew toward all the good software abstractions that make the code understandable, reusable, and maintainable.

In the article Jan mentions above I go into more detail, but it is not rocket science; it is all about exploiting the 90-10% characteristic. People who design general-purpose libraries have it hardest, since they really don't know which scenarios are the most important to optimize.

For what it is worth...

@Thealexbarney
Author

Yes, abstractions have costs, but a 10x increase in execution time for this example doesn't seem very reasonable to me. I'm not very knowledgeable about the inner workings of the CLR's code generation, so somebody correct me if I'm wrong, but isn't one of the advantages of compiling at run time the fact that the JIT compiler knows the type behind the interface and can optimize accordingly?

I've written code across various projects that only operated on an object's indexer and length. We wanted to be able to run Arrays, Lists, and other types through this code. Using an IList seemed like a good solution, but wasn't feasible for our project because of the performance issues.

I was curious how the JVM handled this, so I did a similar benchmark in Java testing a method that took an interface as a parameter. Going through an interface there didn't have much, if any, noticeable overhead compared to using a plain array. Looking at the compiled assembly confirmed that the interface calls were being inlined and optimized. It looks like when a new type goes through the interface, the method is JIT compiled for the new base type. This doesn't happen with the CLR.

I guess what I'm trying to ask is why is this still a problem after years of the CLR being developed? I can't speak for others, but I've personally run into it a decent number of times. Is there something about the way the CLR is designed that makes it difficult to improve on? Is it not worth the effort to improve? Or maybe there's some other reason that I'm overlooking.

@jkotas
Member

jkotas commented Jan 26, 2017

Yes, it would be nice for the CLR to have the optimizations you are asking about. They are not easy to build, but it is certainly doable with enough effort. They have not been built yet because the people working on the CLR did not find them to be the best bang for the buck when choosing areas to improve so far.

@fatalerrorx

I agree with @Thealexbarney that a 10x increase is unreasonable. I very often use IReadOnlyList instead of List for safety and code correctness.

@mattwarren
Contributor

If anyone is interested in understanding why interface calls are more expensive, this BOTR page is an interesting read. Despite the title, it is about interface dispatch; from the page:

Although it is possible for Virtual stub dispatching (VSD) to dispatch both virtual instance and interface method calls, it is currently used only for interface dispatch.

@NickStrupat

NickStrupat commented Jan 26, 2017

Hi all, @fatalerrorx, @benaadams

Correct me if I'm wrong, but don't generic constraints alleviate this pressure to overload/specialize in a lot of cases?

private static int SumIList<TList>(TList list) where TList : IList<int>
{
    int result = 0;
    if (list != null)
    {
        int length = list.Count;
        for (int i = 0; i < length; i++)
            result += list[i];
    }
    return result;
}

The JIT already knows to compile a different function for each type used, right?

@mikedn
Contributor

mikedn commented Jan 26, 2017

Correct me if I'm wrong, but don't generic constraints alleviate this pressure to overload in a lot of cases?

Your example code still uses IList<int>.

The JIT already knows to compile a different function for each type used, right?

This only happens when TList is a struct type. If it's a reference type then the same code is used for arrays, List<int> etc.

@vancem
Contributor

vancem commented Jan 26, 2017

I don't want to sound too defensive about the optimizations the .NET runtime does, but I do want to clarify two things that are important in deciding how important various optimizations are.

Yes, abstractions have costs, but a 10x increase in execution time for this example doesn't seem very reasonable to me

I think you tested the worst case, since the operation being done was literally one instruction. Thus I do not think it is fair to use the 10x as the representative increase. Of those cases where you are doing 'flyweight' things in hot loops that actually have an impact on overall performance, how problematic would it have been to simply use arrays?

I was curious how the JVM handled this

First, I would like to confirm that the Java experiment you did was 'representative' of your real code. What is going on is code specialization, and every system has its limits. There is a good chance that your microbenchmark lives within the limits (the whole function was inlined), but in more typical code, it is problematic (the methods are bigger and you have to decide which are worth it). Certainly 'hotspot' style dynamic re-compilation can do this (and that may be what is happening in your case), but that is a non-trivial investment.

My main point, however, is that you should be careful about over-extrapolating microbenchmarks. If the benchmark REALLY represents your hot paths, then I can't argue with that, but real-world code has a surprising tendency to be either not as bad (as in the first case above) or not as good (if the Java JIT 'falls off' its optimization for your real code, as in the second case).

This is the kind of tradeoff we have to make when the .NET Runtime decides how to invest in optimization. What we have found is that in real-world code the performance problems are only very rarely related to JIT code-quality issues of the type you suggest (but they do turn up in microbenchmarks all the time, which is what is happening here).

Anyway, what would be more useful to us is not this microbenchmark, but examples of 'hot kernels' in your code where you find this to be an actual problem. That would certainly add weight for us to do some optimization of those scenarios.

@NickStrupat

@mikedn, wow, I assumed it did that work for us. I must have misinterpreted something I read somewhere. I just took @Thealexbarney's demo program and tried my change; it's the same as not using generics. I'm kind of disappointed.

@NickStrupat

Now it makes sense why the LINQ operators don't use this (non-)"trick". I wish that had been added to the JIT compiler so the LINQ implementation could use that optimization.

@ThatRendle

Regarding the generic constraint comment (@NickStrupat) there are areas where using that could avoid (a) the interface invocation tax and (b) the struct-boxing tax incurred when an instance is used via its interface, particularly for things like IEnumerable and IEnumerator, which work really well as structs.
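A minimal sketch of the struct-enumerator idea (hypothetical method, assuming the caller passes the concrete enumerator type): constraining on TEnumerator keeps List<int>.Enumerator, which is a struct, unboxed, so its MoveNext and Current can be specialized and inlined.

```csharp
using System.Collections.Generic;

static class EnumeratorSum
{
    // TEnumerator is the concrete (possibly struct) enumerator type, so a
    // struct enumerator is not boxed to IEnumerator<int> on the way in.
    internal static int Sum<TEnumerator>(TEnumerator e)
        where TEnumerator : IEnumerator<int>
    {
        int sum = 0;
        while (e.MoveNext())
            sum += e.Current;
        return sum;
    }
}

// Usage: Sum(list.GetEnumerator()) infers TEnumerator = List<int>.Enumerator
// (a struct), avoiding the boxing that enumerating via IEnumerable<int> incurs.
```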

@NickStrupat

NickStrupat commented Jan 27, 2017

@markrendle absolutely. What I'm a bit disappointed about is that JIT doesn't emit a specialized method for classes as well. It could emit a method where all of the calls to the interface methods are resolved to call the methods on the class (at least if that class is sealed so it knows there won't be polymorphic behaviour).

I just tested @Thealexbarney 's demo program with an added SealedArrayWrapper and having all tests call a generic version of the Sum... method. The results are the same sealed or not.

Developing a library which reads the method body as IL and emits a DynamicMethod with the modified IL for sealed classes could be a workable solution.

@vancem
Contributor

vancem commented Jan 27, 2017

Adding @cmckinsey

@NickStrupat

NickStrupat commented Jan 27, 2017

So in summary, RyuJIT does not take advantage of the optimization opportunity to output specialized native code for generic methods taking an interface/class-constrained parameter where T is a sealed class, despite value types receiving this optimization. The specialized code would in theory call the method directly instead of making a virtual-table look-up call, assuming this is how it works for value types(?).

Is that correct? Does anyone have expertise on what changes that would require?

Scenarios like this, where high-performance abstractions can be achieved by combining a sealed derived class with generic methods in hot loops, could be considered very valuable for people with their eyes on the "performance as a feature" aspect of .NET development.

@mikedn
Contributor

mikedn commented Jan 27, 2017

So in summary, RyuJIT does not take advantage of the optimization opportunity to output specialized native code for generic methods taking an interface/class-constrained parameter where T is a sealed class, despite value types receiving this optimization. The specialized code would in theory call the method directly instead of making a virtual-table look-up call, assuming this is how it works for value types

It works that way for value types because there aren't any reasonable alternatives (e.g. the code for List<double> and List<int> has to be different because the two value types have different sizes and require different instructions; the only way to avoid this is to box those values like Java generics do).

Doing the same thing indiscriminately in the case of reference types would result in a code explosion and in many cases it would be useless or worse, have adverse effects. If such specialization is done then it should be done only when needed and that makes things more difficult. The JIT needs to somehow figure which code paths are worth optimizing either at runtime (like Java supposedly does) or by having profiling information collected in advance (e.g. via tools like mpgo). And each approach has its pros and cons (e.g. the Java approach wouldn't probably work in AOT runtimes such as .NET Native and CoreRT).

@benaadams
Member

benaadams commented Jan 27, 2017

@NickStrupat the caller can choose to force the implementation for the generic by creating a non-generic struct wrapper

struct ListWrapper : IList<int>
{
    private List<int> _list;
    public ListWrapper(List<int> list)
    {
        _list = list;
    } 
    public int Count => _list.Count;
    //...
}

var result = SumIList(new ListWrapper(list));

It should also give you the benefits of inlining back

@Thealexbarney
Author

Thealexbarney commented Jan 28, 2017

@vancem The microbenchmark is not completely representative of the hot spots in our code, but it's not too far off. I run into this performance decrease most severely when working with code that uses interfaces such as IList or IReadOnlyList.

Here's one example of a possible situation. Let's say someone's writing some code that does some sort of data processing, and the functions to call this code accept an Array. This implementation is working fine, and performance is good.

Now this person needs to pass a portion of a large array to this function. They have two main options: create a new, smaller array, allocating memory in the process, or create an object (similar to ArraySegment or Span) that abstracts access to the original array, requiring duplicated code to handle the object and incurring a slight increase in execution time, but without a large memory allocation.

I won't go into any more detail, but there are a few more memory abstractions that this person uses in different situations, and a common theme is that they can all implement IList. Currently there are multiple versions of the code that have to be maintained; if a single version accepted an IList, maintenance would be easier and the code base smaller. However, making the original code accept an IList instead of an Array results in about a 2-4x increase in execution time, depending on the situation.
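A hypothetical sketch of the kind of slice abstraction described above (the name and shape are illustrative, not from the thread): a wrapper exposing a window over an existing array through IReadOnlyList<T>, so methods written against the interface accept arrays, lists, and slices alike, at the cost of an interface dispatch per element access.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical slice wrapper: a view over part of an array, with no copy.
public sealed class ArraySlice<T> : IReadOnlyList<T>
{
    private readonly T[] _array;
    private readonly int _offset;

    public ArraySlice(T[] array, int offset, int count)
    {
        if ((uint)offset > (uint)array.Length ||
            (uint)count > (uint)(array.Length - offset))
            throw new ArgumentOutOfRangeException();
        _array = array;
        _offset = offset;
        Count = count;
    }

    public int Count { get; }

    // Each access through IReadOnlyList<T> is an interface call,
    // which is exactly the overhead measured in this thread.
    public T this[int index] => _array[_offset + index];

    public IEnumerator<T> GetEnumerator()
    {
        for (int i = 0; i < Count; i++)
            yield return this[i];
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```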

I'm not that great at creating scenarios, but I hope this shows one situation where a developer could run into this issue.

@NickStrupat

@mikedn I'm not suggesting indiscriminately doing the same thing for reference types. Only for sealed classes.

@benaadams The wrapper struct can actually be generic and receive the same inlining treatment.

@mikedn
Contributor

mikedn commented Jan 31, 2017

I'm not suggesting indiscriminately doing the same thing for reference types. Only for sealed classes.

How does doing it only for sealed classes help? List<T> and most collection classes aren't sealed anyway. The only commonly used "collection class" that is sealed is T[]. But even then wholesale generic method specialization doesn't necessarily solve this issue. The JIT needs to somehow figure out which specialization to call and at compile time this is only possible if the actual type of the collection is known at the callsite. As in:

void foo() {
    IList<int> nums = new int[42];
    ...
    // the compiler knows that nums is int[] so it could call
    // a specialized version of SumIList
    SumIList(nums);
}

Otherwise the best a compiler can do is something along the lines of @benaadams' first example. And that kind of approach needs to be driven by profiling information, otherwise you risk ending up with a larger and slower program.

@NickStrupat

@mikedn I think we've each been talking about slightly different scenarios, so I'll try to specify again.

If the calling code knows the IList<T> is T[] or some derived and sealed class of any IList<T>, the following method could and should perform the same "inlining" as value-type IList<T>s already receive.

private static int SumIList<TList>(TList list) where TList : IList<int>
{
    int result = 0;
    if (list != null)
    {
        int length = list.Count;
        for (int i = 0; i < length; i++)
            result += list[i];
    }
    return result;
}

If TList is a value-type, we get the same performance as if SumIList were overloaded specifically for TList, but not if TList is a sealed class.

Currently, as @benaadams points out, we need to make a value-type wrapper to get the "inlining". If someone has a List<T> or some other standard IList<T> collection, they need to make a struct wrapper for that particular class.

@benaadams
Member

@NickStrupat as @mikedn points out, if that was done wholesale it would cause a binary-size explosion, especially as it would then start inlining code into these functions. It would be worthwhile in hot paths but disadvantageous in all paths.

There is a proposal in Roslyn to indicate that you want specific handling for various generic types: dotnet/roslyn#15822 <-- Shameless self-promotion 😉

@NickStrupat

@benaadams, @mikedn, I don't know how many other ways I can point this out... I'm not talking about doing it wholesale.

I'm talking about doing it only for sealed classes where TList is the sealed class. In this scenario, the JIT compiler knows just as much about that type as if it were a struct. Only if it's sealed is this a good idea. Since the vast majority of classes are unsealed, there would be no code explosion in most cases.

I fully recognize and understand that generating different native code for every and any T would result in an undesirable code explosion à la C++ class templates. Again, I'm not proposing that.

@benaadams, generic specialization proposals get a +1 from me! I loved designing C++ libs which took advantage of that.

@benaadams
Member

benaadams commented Feb 1, 2017

@NickStrupat sealed is only relevant for virtual methods; otherwise the JIT already has all the type information. It would also cause code bloat unnecessarily everywhere a sealed type was used, rather than just in hot paths. E.g., since string is a sealed type, creating a Dictionary<string, object> would generate fully unique assembly for that dictionary, as would Dictionary<string, string> and List<string>, etc., but there is likely no reason to do so.

Regarding non-generic devirtualization for sealed types that is being looked at as part of dotnet/coreclr#9230; though that is different than this issue.

@mikedn
Contributor

mikedn commented Feb 1, 2017

I think we've each been talking about slightly different scenarios, so I'll try to specify again.

I don't think so. Let's try again:

If the calling code knows the IList is T[] or some derived and sealed class of any IList, the following method could and should perform the same "inlining" as value-type ILists already receive.

Considering that doing this will invariably increase code size, one can wonder why the JIT should be doing this in the absence of any indication that the code is performance sensitive. Does the fact that TList happens to be sealed imply that SumIList is performance sensitive? Clearly not.

If TList is a value-type, we get the same performance as if SumIList were overloaded specifically for TList, but not if TList is a sealed class.

And as I stated before, value types work the way they work because there aren't any reasonable alternatives, not because the JIT is doing an optimization. If anything, the JIT could do better in the area of value types by avoiding generating separate code for similar structs (e.g. 2 structs that both contain a single int field)

@benaadams
Member

benaadams commented Feb 2, 2017

After rechecking my assumptions, I take it back, alas a concrete struct wrapper doesn't inline via a generic. https://gist.github.com/benaadams/ba3c008a328f02c75663213cd9affa65

// incorrect numbers

The interface calls are slower, but closer to a non-inlined call (as above)

@benaadams
Member

benaadams commented Feb 2, 2017

Ahh... had it wrong; they do inline. The wrappers were classes, not structs. Updated the code; the results are:

 31      Array
237      Array as IList
 30      ArrayWrapper
149      ArrayWrapper Non-inlined        ** Non-inlined direct call
 45      List
207      List as IList
220      ArrayWrapper as IList
235      Array as Generic                ** via constraint
206      List as Generic                 ** via constraint
 34      ArrayWrapper as Generic         ** via constraint

@NickStrupat

Considering that doing this will invariably increase code size, one can wonder why the JIT should be doing this in the absence of any indication that the code is performance sensitive.

@mikedn Generics in general already cause this "code explosion" relative to non-generic types/collections. What are the chances that the current implementation of generics and their surrounding code base (LINQ, etc.) represents exactly the right balance of native code size vs. performance? There is no real reason other than performance to generate the specializations at all; the type arguments could just be used for type safety, wrapping a non-generic implementation with casting. Clearly the performance gains were considered more important than the potentially large generated code size.

Does the fact that TList happens to be sealed imply that SumIList is performance sensitive? Clearly not.

Currently the mechanism for achieving the inlining is "wrap with a struct", which is no less cryptic than "seal your class". That said, I would definitely be interested in some fine-grained control of which inlined specializations are generated.

And as I stated before, value types work the way they work because there aren't any reasonable alternatives, not because the JIT is doing an optimization.

There is definitely at least one reasonable alternative. You could just box the value type wrapper like many parts of the .NET BCL already/still do.

If anything, the JIT could do better in the area of value types by avoiding generating separate code for similar structs (e.g. 2 structs that both contain a single int field)

That would be very cool. So the easy case would be if the fields match exactly?

@AndyAyersMS AndyAyersMS added this to the 6.0.0 milestone Nov 3, 2020
@AndyAyersMS AndyAyersMS self-assigned this Nov 3, 2020
@bravequickcleverfibreyarn

bravequickcleverfibreyarn commented Nov 11, 2020

The List<T> indexer isn't fast in any case, so using it in a performance-critical test that focuses on virtual calls is a little awkward. (Array vs. List Performance)

@Thealexbarney You pointed out a JVM capability that makes these things fast, but you could get some of the way there using an abstract class:

public abstract class AC<T>
{
    public abstract T this[int index] { get; set; }
}

public class PerfAbstractedArray<T> : AC<T>
{
    private readonly T[] valuesMadeToArray;

    public PerfAbstractedArray(T[] valuesMadeToArray)
    {
        this.valuesMadeToArray = valuesMadeToArray;
    }

    public override T this[int index] 
    {
        get => valuesMadeToArray[index];
        set => valuesMadeToArray[index] = value;
    }
}

public class PerfAbstractedList<T> : AC<T>
{
    private readonly List<T> valuesMadeToList;

    public PerfAbstractedList(List<T> valuesMadeToList)
    {
        this.valuesMadeToList = valuesMadeToList;
    }

    public override T this[int index]
    {
        get => valuesMadeToList[index];
        set => valuesMadeToList[index] = value;
    }
}

From my point of view, framework guidelines are fit for a particular purpose, so if you are interested in performance you are going to be a tuner all the time, even in the future. Others have said it well: there are trade-offs, costs, and priorities. On the other hand, performance gained automatically, with no research required, doesn't hurt anyone.

@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@AndyAyersMS
Member

Things continue to look good with recent builds and TieredPGO (and TC_QuickJitForLoops). The tests linked above do a fair amount of work per iteration (sqrt), so it is possible with PGO to get close to the direct-call perf.

If the work payload is cut down to something like an integer sum reduction, we can see the residual costs more clearly

            for (int i = 0; i < length; i++)
            {
                r += a[i] + b[i];
            }

and here the PGO version is about 2x slower.

Method           Mean      Error     StdDev
RunInterface     25.71 us  0.476 us  0.446 us
RunNoInterface   11.33 us  0.224 us  0.210 us

BDN results can be a bit inconsistent at times, however. I am not sure why.

;; no-interface inner loop (both PGO and no PGO)
G_M44880_IG03:              ;; offset=0015H
       443BD1               cmp      r10d, ecx
       7338                 jae      SHORT G_M44880_IG07
						;; bbWeight=10000.90 PerfScore 12501.13
G_M44880_IG04:              ;; offset=001AH
       4C8B5A08             mov      r11, gword ptr [rdx+8]
       453B5308             cmp      r10d, dword ptr [r11+8]
       7334                 jae      SHORT G_M44880_IG08
       4963F2               movsxd   rsi, r10d
       458B5CB310           mov      r11d, dword ptr [r11+4*rsi+16]
       453B5010             cmp      r10d, dword ptr [r8+16]
       7320                 jae      SHORT G_M44880_IG07
						;; bbWeight=10000.90 PerfScore 102509.25
G_M44880_IG05:              ;; offset=0032H
       4103C3               add      eax, r11d
       4D8B5808             mov      r11, gword ptr [r8+8]
       453B5308             cmp      r10d, dword ptr [r11+8]
       7319                 jae      SHORT G_M44880_IG08
       410344B310           add      eax, dword ptr [r11+4*rsi+16]
       41FFC2               inc      r10d
       453BD1               cmp      r10d, r9d
       7CC9                 jl       SHORT G_M44880_IG03
;; interface dot-product inner loop (no PGO)

G_M18393_IG03:              ;; offset=0030H
       488BCE               mov      rcx, rsi
       418BD6               mov      edx, r14d
       49BBF004A443FC7F0000 mov      r11, 0x7FFC43A404F0
       FF158A4CC8FF         call     [IList`1:get_Item(int):int:this]
       448BF8               mov      r15d, eax
       488BCF               mov      rcx, rdi
       418BD6               mov      edx, r14d
       49BBF804A443FC7F0000 mov      r11, 0x7FFC43A404F8
       FF15794CC8FF         call     [IList`1:get_Item(int):int:this]
       4103EF               add      ebp, r15d
       03E8                 add      ebp, eax
       41FFC6               inc      r14d
       443BF3               cmp      r14d, ebx
       7CC4                 jl       SHORT G_M18393_IG03
;; interface dot-product inner loop (PGO)

G_M18393_IG04:              ;; offset=0039H
       48B8D016D343FC7F0000 mov      rax, 0x7FFC43D316D0    ;; test type of "a"
       483BD8               cmp      rbx, rax
       0F8591000000         jne      G_M18393_IG13
       488BC7               mov      rax, rdi
       443B7810             cmp      r15d, dword ptr [rax+16]    ;; length check
       0F83A2000000         jae      G_M18393_IG14
						;; bbWeight=10006.84 PerfScore 47532.50
G_M18393_IG05:              ;; offset=0059H
       488B4708             mov      rax, gword ptr [rdi+8]
       443B7808             cmp      r15d, dword ptr [rax+8]    ;; bounds check
       0F83B5000000         jae      G_M18393_IG16
       4963CF               movsxd   rcx, r15d
       448B648810           mov      r12d, dword ptr [rax+4*rcx+16]    ;; fetch element "a"
						;; bbWeight=10006.84 PerfScore 72549.60
G_M18393_IG06:              ;; offset=006FH
       48B8D016D343FC7F0000 mov      rax, 0x7FFC43D316D0     ;; test type of "b"
       483906               cmp      qword ptr [rsi], rax
       0F857F000000         jne      G_M18393_IG15
						;; bbWeight=10006.84 PerfScore 32522.24
G_M18393_IG07:              ;; offset=0082H
       488BC6               mov      rax, rsi
       443B7810             cmp      r15d, dword ptr [rax+16]     ;; length check
       7370                 jae      SHORT G_M18393_IG14
						;; bbWeight=10006.84 PerfScore 32522.24
G_M18393_IG08:              ;; offset=008BH
       488B4608             mov      rax, gword ptr [rsi+8]
       443B7808             cmp      r15d, dword ptr [rax+8]     ;; bounds check
       0F8383000000         jae      G_M18393_IG16
       4963CF               movsxd   rcx, r15d
       448B6C8810           mov      r13d, dword ptr [rax+4*rcx+16]    ;; fetch element b
						;; bbWeight=10006.84 PerfScore 72549.60
G_M18393_IG09:              ;; offset=00A1H
       4503F4               add      r14d, r12d
       4503F5               add      r14d, r13d
       41FFC7               inc      r15d
       443BFD               cmp      r15d, ebp
       7C8A                 jl       SHORT G_M18393_IG04

;; cold blocks 13 and 15 (not shown) make the residual interface calls if the type tests fail.

Compared with the no-interface version, the extra overhead in the PGO version comes from the two type checks and the slightly more expensive test against a.Count. That, plus the higher branch density (though all but the last branch should be not-taken) and larger code, likely explains the perf cost.

Note that in the non-interface version a.Count() is treated as a loop invariant; it is not treated that way in the optimized PGO loop (possibly because this would be a dependent LICM hoist, à la #35735?).

Also, in the PGO loop above we don't hoist the type fetch for b: because of the potential exceptions (that is, if b is null), we don't want to check its type too early. This is an example where loop peeling (while costly) would unblock CSEs, as all those first-iteration exceptions would happen outside the loop. But it would mean a lot of duplicated code.

As an alternative with less code duplication we could clone / unswitch on the type tests... that is instead of producing

for (...)
{
     if (a.GetType() == typeof(...))
     ...
     if (b.GetType() == typeof(...))
     ...
}

we'd produce

if ((a.GetType() == typeof(...)) && (b != null) && (b.GetType() == typeof(...)))
{
     for (...)    //  no type checks or interface calls
}
else
{
    for (...)   // interface calls
}
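At the source level, the unswitched version would look roughly like this hand-written sketch (not actual JIT output; the `Sum` method and the `int[]` fast path are illustrative assumptions, and the JIT would use exact type tests rather than `is`):

```csharp
using System;
using System.Collections.Generic;

// Hand-written sketch of the unswitching idea: hoist the type tests out
// of the loop and run a devirtualized fast path when both arguments are
// really int[] (arrays implement IList<int>).
static int Sum(IList<int> a, IList<int> b)
{
    int total = 0;
    if (a is int[] aArr && b is int[] bArr)
    {
        // Fast loop: no per-iteration type checks or interface calls.
        for (int i = 0; i < aArr.Length; i++)
            total += aArr[i] + bArr[i];
    }
    else
    {
        // Slow loop: residual interface calls.
        for (int i = 0; i < a.Count; i++)
            total += a[i] + b[i];
    }
    return total;
}

Console.WriteLine(Sum(new[] { 1, 2, 3 }, new[] { 4, 5, 6 })); // fast path: 21
```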

I'm going to keep this open for now but don't expect much to change for the remainder of 6.0, so will move to future.

cc @BruceForstall

@AndyAyersMS AndyAyersMS removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Apr 9, 2021
@AndyAyersMS AndyAyersMS modified the milestones: 6.0.0, Future Apr 9, 2021
@bravequickcleverfibreyarn

bravequickcleverfibreyarn commented Apr 10, 2021

@Thealexbarney,

I made up some simple quick tests using a tight loop as you designed it. It seems that, at least on my configuration, I get about ⅓ better performance with the abstract-class design (compared to the interface design). I guess that's not enough for your case. 👎🏻

public abstract class MyCollectionBase {
	public abstract int Length { get; }
	public abstract int this [int i] { get; }
}
public interface IMyCollection {
	public int Length { get; }
	public int this [int i] { get; }
}
public class MyCollection : MyCollectionBase, IMyCollection {
	int[] ints;
	public MyCollection(int[] ints) => this.ints = ints;
	public override int Length => ints.Length;
	public override int this [int i] => ints[i];
}

I get even better results for generics: circa ½.

public abstract class MyCollectionBase<T> {
    public abstract int Length { get; }
    public abstract T this [int i] { get; }
}
public interface IMyCollection<T> {
	public int Length { get; }
	public T this [int i] { get; }
}
public class MyCollectionOfT<T> : MyCollectionBase<T>, IMyCollection<T> {
	T[] array;
	public MyCollectionOfT(T[] array) => this.array = array;
	public override int Length => array.Length;
	public override T this [int i] => array[i];
}

… still, the burden is 5× for a virtual method in the generics case (while an interface call takes 10×).

@benaadams
Member

benaadams commented Apr 12, 2021

Running the LinqBenchmarks at https://github.com/NetFabric/LinqBenchmarks; the improvement for IEnumerable<int> is very significant:

EnumerableInt32Select

Method EnvironmentVariables Runtime Count Mean Ratio Allocated
ForeachLoop - .NET 5.0 1000 4,114.66 ns 1.00 40 B
ForeachLoop TC_QuickJitForLoops=1,TieredPGO=1 .NET 6.0 1000 3,109.91 ns 0.76 40 B
Linq - .NET 5.0 1000 10,036.04 ns 2.44 96 B
Linq TC_QuickJitForLoops=1,TieredPGO=1 .NET 6.0 1000 5,429.17 ns 1.32 96 B
LinqAF - .NET 5.0 1000 6,956.42 ns 1.69 40 B
LinqAF TC_QuickJitForLoops=1,TieredPGO=1 .NET 6.0 1000 4,926.36 ns 1.20 40 B
StructLinq - .NET 5.0 1000 5,805.81 ns 1.41 64 B
StructLinq TC_QuickJitForLoops=1,TieredPGO=1 .NET 6.0 1000 3,631.02 ns 0.88 64 B
StructLinq_IFunction - .NET 5.0 1000 4,140.51 ns 1.01 40 B
StructLinq_IFunction TC_QuickJitForLoops=1,TieredPGO=1 .NET 6.0 1000 2,867.47 ns 0.70 40 B
Hyperlinq - .NET 5.0 1000 5,916.27 ns 1.44 40 B
Hyperlinq TC_QuickJitForLoops=1,TieredPGO=1 .NET 6.0 1000 3,845.47 ns 0.93 40 B
Hyperlinq_IFunction - .NET 5.0 1000 4,128.19 ns 1.00 40 B
Hyperlinq_IFunction TC_QuickJitForLoops=1,TieredPGO=1 .NET 6.0 1000 3,115.01 ns 0.76 40 B

@AndyAyersMS
Member

@benaadams thanks for the ongoing interest in PGO!

FWIW I am starting to look at impact of PGO on various TechEmpower scenarios. Some seem to speed up nicely (plaintext/mvc), others not so much (json)... suspect we're not seeing anywhere near the full potential yet.

@benaadams
Member

various TechEmpower scenarios. Some seem to speed up nicely (plaintext/mvc), others not so much (json)...

.NET 6.0 + PGO allows "Ben" (essentially a Kestrel stack) to pull ahead of the lighter .NET 5.0 "platform" on the TE database tests 😎

[image: TE database benchmark results]

suspect we're not seeing anywhere near the full potential yet.

Love to hear it 💜

@AndyAyersMS
Member

Json 😢

[image: TE JSON benchmark results]

@benaadams
Member

Json 😢

Isn't like for like; for Json "platform" is using 3 additional options:

  • DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS = 1
  • UnsafePreferInlineScheduling = true
  • WaitForDataBeforeAllocatingBuffer = false

Whereas "Ben" is just using the vanilla defaults; and implementation-wise its closer to "middleware" than "platform" (was mostly a demonstration that the ASP.NET Core stack was layered well enough that you could build a stack on top of Kestrel without having to "buy-in" to its opinionated Dependency Injection and Hosting model)

@AndyAyersMS
Member

So that's good news, then? Is this with recent builds?

@benaadams
Member

So that's good news, then?

Yes. It's fairly miraculous that it overtook "platform" in the DB Single Query; I expected it to be a little faster than "middleware", but... 😲

Is this with recent builds?

I'm using the public releases, so I assume it's preview 3?

Docker settings are

FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS runtime
ENV COMPlus_ReadyToRun 0
ENV COMPlus_TC_QuickJitForLoops 1
ENV COMPlus_TieredPGO 1

Though it might not be necessary to switch off ReadyToRun anymore?

@AndyAyersMS
Member

My hope is that we can eventually drop the need for COMPlus_ReadyToRun=0 without losing too much perf. Hard to be sure just yet how things will play out. Each preview should be better than the one before.

For 6p4 COMPlus_ReadyToRun=0 should be less important for apps that spend a good deal of time in core framework code, as we aim to have static PGO data for key prejitted methods.

By 6p6 we plan to have static PGO data flowing into ASP.NET assemblies as well, and maybe by then we can ease off COMPlus_ReadyToRun=0 for TE scenarios.

With current main, the gap on TE between R2R disabled and enabled is still pretty large, at least in my measurements. For example, plaintext/mvc is 18% faster with the options above, but only 1% faster if R2R is left enabled.

I need to do more investigation into what all needs fixing and where these numbers should be; that's going to be my focus for the next few weeks.

@AndyAyersMS
Member

cc @davidfowl since Julie says you want to know what we're up to over here.

@mangod9
Member

mangod9 commented Apr 13, 2021

Why is COMPlus_ReadyToRun=0 required? Shouldn't tiering replace the R2R methods?

@benaadams
Member

why is COMPlus_ReadyToRun=0 required? Shouldnt Tiering replace the R2R methods?

Lack of PGO data for the R2R methods; switching it off runs them at Tier0 which generates the PGO data for Tier1

@HyVong007

Using an interface vs. using the specific type (class/struct) always has tradeoffs. Can a method of a class that implements an interface be inlined? That is, when I use the interface, will the JIT compiler inline my method if the object behind the interface is my class?
If I use the attribute System.Runtime.CompilerServices.MethodImpl(MethodImplOptions.AggressiveInlining) on the method, will it be inlined?

What are the criteria for a method to be inlined? Is it always inlined?

Sorry for my bad English.

@AndyAyersMS
Member

@HyVong007 thanks for the questions. Let me know if the information below helps.

Devirtualization

Generally if the jit can determine the type of the this object at an interface call, it can devirtualize and then potentially inline. There are two main mechanisms for determining types:

  • deduce the type from flow analysis within a method
  • enable PGO, have that observe the possible types for this, and then test for the most likely type when rejitting or in a future run of the process.

Last I looked, flow analysis can enable devirtualization and inlining in a relatively small fraction of interface call cases (say no more than 10%). Success here requires that there be some "organic" evidence of type identity (constructor call or type test) upstream from the interface site. Inlining can help bring together the needed information, but currently the inlining heuristics do not include increased devirtualization potential as part of their evaluation. That may change soon (see e.g. #53670).

PGO is quite effective at devirtualizing interface calls; most studies I have done show upwards of 80% of interface call sites have one dominant implementing class for this.
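The PGO transformation (guarded devirtualization) is roughly equivalent to the following hand-written pattern (a sketch; the `First` method and the `List<int>` guess are illustrative assumptions, and the JIT actually emits an exact type test against the profiled class rather than an `is` check):

```csharp
using System;
using System.Collections.Generic;

// Guarded devirtualization, written by hand: test for the type the
// profile says is most likely, call it directly (so it can inline),
// and fall back to the interface call otherwise.
static int First(IList<int> list)
{
    if (list is List<int> likely)   // guard for the profiled type
        return likely[0];           // direct call; indexer can inline
    return list[0];                 // residual interface call
}

Console.WriteLine(First(new List<int> { 42 })); // fast path: 42
Console.WriteLine(First(new[] { 7 }));          // fallback path: 7
```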

Inlining

The inlining heuristics are complicated and difficult to summarize concisely. Roughly speaking a method will be inlined if:

  • there is a direct call to the method, OR the jit can devirtualize an interface or virtual call, AND
    • the method being invoked is small (16 bytes of IL or fewer), OR
    • the method being invoked is marked with AggressiveInlining, OR
    • the method is medium sized (17 to ~100 bytes of IL) and the inline heuristics determine the inline is worthwhile
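For reference, the AggressiveInlining attribute mentioned above is applied like this (a minimal sketch; the attribute only relaxes the size heuristics, and the call site still has to be direct or devirtualized before inlining is possible):

```csharp
using System;
using System.Runtime.CompilerServices;

var acc = new Accumulator();
for (int i = 1; i <= 3; i++)
    acc.Add(i);
Console.WriteLine(acc.Total); // prints 6

class Accumulator
{
    private int _total;

    // Bypasses the JIT's IL-size heuristics for this method. It does not
    // force inlining of virtual/interface calls; those must be
    // devirtualized first.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Add(int value) => _total += value;

    public int Total => _total;
}
```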

@HyVong007

HyVong007 commented Jun 8, 2021

OK, thank you, I understand now. And here are some examples I found; which cases will be inlined? Thanks so much!

public int a { get; set; } // 1

private int backingField;
public int b => backingField; // 2

public int c // 3
{
    get => backingField;
    set => backingField = value;
}

public int Method() => a; // 4

public int Method2() => a > 0 ? a : b; // 5

public int Method3(int i, float f, string s, char c) => a * 2 + 5 - 7 + b; // 6

@tanveerbadar

A long time has passed since this thread was opened. I am wondering how big of a difference still remains with all the improvements going into the JIT, like PGO, OSR, tiered compilation, GDV, the frozen object heap, and what-not.

@maximilien-noal

First run: .NET 8.
Second run: .NET 7

C:\Users\max>cd source && cd repos && mkdir perftest && cd perftest && dotnet new console
The template "Console App" was created successfully.

Processing post-creation actions... Please wait.
Restoring C:\Users\max\source\repos\perftest\perftest.csproj:
  Determining projects to restore...
  Restored C:\Users\max\source\repos\perftest\perftest.csproj (in 493 ms).
Restore succeeded.

C:\Users\max\source\repos\perftest>notepad program.cs

C:\Users\max\source\repos\perftest>dotnet run -c Release
Direct: 1019ms
As base class: 1003ms
As interface: 1082ms

C:\Users\max\source\repos\perftest>notepad perftest.csproj

C:\Users\max\source\repos\perftest>dotnet run -c Release
Direct: 1048ms
As base class: 2329ms
As interface: 3888ms 

Seems pretty good to me.

Test program:


namespace PerfTest;

using System.Diagnostics;

internal static class Program {
    public static void Main(string[] args) {
        var sw = new Stopwatch();
        const int iterations = int.MaxValue;

        var direct = new Tester();
        ISlowThingsDown asInterface = direct;
        var asBaseClass = asInterface as MyAbstractBaseClass;

        sw.Start();
        for (int i = 0; i < iterations; i++) {
            direct.Property = i;
        }
        Console.WriteLine($"Direct: {sw.ElapsedMilliseconds}ms");
        sw.Restart();
        for (int i = 0; i < iterations; i++) {
            asBaseClass!.Property = i;
        }
        Console.WriteLine($"As base class: {sw.ElapsedMilliseconds}ms");
        sw.Restart();
        for (int i = 0; i < iterations; i++) {
            asInterface.Property = i;
        }
        Console.WriteLine($"As interface: {sw.ElapsedMilliseconds}ms");
        sw.Stop();
    }
}

public interface IAmBaseInterface {
    public long Property { set; }
}

public interface ISlowThingsDown : IAmBaseInterface {
}

public class Tester : MyAbstractBaseClass, ISlowThingsDown {
}

public abstract class MyAbstractBaseClass : IAmBaseInterface {
    public long Property { get; set; }
}

Maybe it's possible to measure this with BenchmarkDotNet instead?

This is left as an exercise to the reader.
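A BenchmarkDotNet version might look roughly like this (a sketch, assuming the BenchmarkDotNet NuGet package is referenced; the types mirror the test program above):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class DispatchBenchmarks
{
    private const int Iterations = 1_000_000;

    private readonly Tester _direct = new();
    private MyAbstractBaseClass _asBaseClass = null!;
    private ISlowThingsDown _asInterface = null!;

    [GlobalSetup]
    public void Setup()
    {
        _asBaseClass = _direct;
        _asInterface = _direct;
    }

    [Benchmark(Baseline = true)]
    public void Direct()
    {
        for (int i = 0; i < Iterations; i++) _direct.Property = i;
    }

    [Benchmark]
    public void AsBaseClass()
    {
        for (int i = 0; i < Iterations; i++) _asBaseClass.Property = i;
    }

    [Benchmark]
    public void AsInterface()
    {
        for (int i = 0; i < Iterations; i++) _asInterface.Property = i;
    }
}

public interface IAmBaseInterface { long Property { set; } }
public interface ISlowThingsDown : IAmBaseInterface { }
public abstract class MyAbstractBaseClass : IAmBaseInterface
{
    public long Property { get; set; }
}
public class Tester : MyAbstractBaseClass, ISlowThingsDown { }

internal static class Program
{
    private static void Main() => BenchmarkRunner.Run<DispatchBenchmarks>();
}
```

BenchmarkDotNet handles warmup, tiering, and statistical analysis, which removes most of the pitfalls of a hand-rolled Stopwatch loop.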

@jkotas
Member

jkotas commented Dec 20, 2023

Your micro-benchmark showcases profile guided devirtualization and inlining. Yes, we made a lot of progress improving various cases discussed here and we will continue the improvements.

I do not see anything concrete actionable left in this issue. Closing.

@jkotas jkotas closed this as completed Dec 20, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Jan 20, 2024