Performance decrease when using interfaces #7291
Such guidelines need to be taken with a grain of salt. They usually make sense in public APIs, but beyond that, collection interfaces should be used only when needed. It makes no sense whatsoever to have a private method taking an interface when the concrete type is known at every call site.
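To make that tradeoff concrete, here is a minimal sketch (the `Totals` type and its method names are mine, not from this thread): the public surface accepts the abstraction, while the hot path is steered to the concrete type.

```csharp
using System;
using System.Collections.Generic;

static class Totals
{
    // Public surface: accept the abstraction for flexibility.
    public static int Sum(IReadOnlyList<int> values)
    {
        // Steer known concrete types to the fast path.
        return values is int[] array ? SumCore(array) : SumSlow(values);
    }

    // Private hot path: works directly on the concrete type.
    private static int SumCore(int[] array)
    {
        int result = 0;
        for (int i = 0; i < array.Length; i++)
            result += array[i];
        return result;
    }

    // Fallback: interface dispatch for everything else.
    private static int SumSlow(IReadOnlyList<int> values)
    {
        int result = 0;
        for (int i = 0; i < values.Count; i++)
            result += values[i];
        return result;
    }
}
```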
Not really. JIT improvements could help here and there, but in the general case there's not much that can be done.
@Thealexbarney you can optimise specific cases with tight loops by casting; for example changing your `SumIList`:

```csharp
private static int SumIList(IList<int> ilist)
{
    int result = 0;
    var list = ilist as List<int>;
    if (list != null)
    {
        var length = list.Count;
        for (int i = 0; i < length; i++)
            result += list[i];
    }
    else
    {
        var length = ilist.Count;
        for (int i = 0; i < length; i++)
            result += ilist[i];
    }
    return result;
}
```

In your example you are also doing a double interface dispatch per loop iteration (for both Count and the item) and comparing that against a fully inlined, bounds-check-eliminated loop. The time should halve if you cache the count when going via the interface.
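The cache-the-count suggestion, as a sketch (method name is mine): hoisting `Count` out of the loop leaves one interface dispatch per iteration instead of two.

```csharp
using System.Collections.Generic;

static class Sums
{
    public static int SumCachedCount(IList<int> ilist)
    {
        int result = 0;
        int length = ilist.Count;   // hoisted: Count is dispatched once, not per iteration
        for (int i = 0; i < length; i++)
            result += ilist[i];     // one interface call per iteration remains
        return result;
    }
}
```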
Abstractions have performance costs. Pretty much all runtime abstractions - interfaces, generics, async, LINQ, ... - have extra performance costs compared to their less abstract equivalents. This overhead does not matter for most code, and it is often worth paying to get the productivity benefits that these abstractions provide. However, one has to be careful about this overhead in the performance-critical parts. @vancem wrote a great article on this topic years ago. Links to the archived copies can be found here.
As Jan points out, abstractions often have costs. This is part of what makes design hard, because you have to make tradeoffs. We have problems with people going 'overboard' both ways. We often have people designing things without any concern for perf, often with really bad consequences. More rarely, we have people who try to optimize 'everything' and end up making hard-to-maintain code. What we really want is to take advantage of a 90%-10% dynamic that typically happens with software. It is typically the case that WELL UNDER 10% of your code is on high-volume (HOT) code paths. Moreover, it is NOT THAT hard to anticipate this. In that code, there is a good chance you need to NOT be as abstract (e.g. using arrays instead of IList), and to think carefully about the APIs (make them chunky, so that they are not called often to do little work). The other 90% of your code you can skew toward all the good software abstractions that make the code understandable, reusable and maintainable. In the article Jan mentions above I go into more detail, but it is not rocket science; it is all about exploiting the 90-10% characteristic. People who design general-purpose libraries have the hardest time, since they really don't know which scenarios are the most important to optimize. For what it is worth...
Yes, abstractions have costs, but a 10x increase in execution time for this example doesn't seem very reasonable to me. I'm not very knowledgeable about the inner workings of the CLR's code generation, so somebody correct me if I'm wrong, but isn't one of the advantages of compiling at run time the fact that the JIT compiler knows the type behind the interface, and can optimize accordingly? I've written code across various projects that only operated on an object's indexer and length. We wanted to be able to run arrays, Lists, and other types through this code. Using an IList seemed like a good solution, but wasn't feasible for our project because of the performance issues. I was curious how the JVM handled this, so I did a similar benchmark in Java testing a method that took an interface as a parameter. Going through an interface there didn't have much, if any, noticeable overhead compared to using a plain array. Looking at the compiled assembly confirmed that the interface calls were being inlined and optimized. It looks like when a new type goes through the interface, the method is JIT compiled for the new base type. This doesn't happen with the CLR. I guess what I'm trying to ask is: why is this still a problem after years of the CLR being developed? I can't speak for others, but I've personally run into it a decent number of times. Is there something about the way the CLR is designed that makes it difficult to improve on? Is it not worth the effort to improve? Or maybe there's some other reason that I'm overlooking.
Yes, it would be nice for the CLR to have the optimizations you are asking about. They are not easy to build, but it is certainly doable with enough effort. They have not been built yet because the people working on the CLR did not find them to be the best bang for the buck when choosing areas to improve so far.
I agree with @Thealexbarney; a 10x increase is unreasonable. I very often use IReadOnlyList instead of List for safety and code correctness.
If anyone is interested in understanding 'why interface calls are more expensive', this BOTR page is an interesting read. Despite the title, it is about interface dispatch. From the page:
Hi all, @fatalerrorx, @benaadams. Correct me if I'm wrong, but don't generic constraints alleviate this pressure to overload/specialize in a lot of cases?

```csharp
private static int SumIList<TList>(TList list) where TList : IList<int>
{
    int result = 0;
    if (list != null)
    {
        int length = list.Count;
        for (int i = 0; i < length; i++)
            result += list[i];
    }
    return result;
}
```

The JIT already knows to compile a different function for each type used, right?
Your example code still uses interface calls.
That specialization only happens when the type argument is a value type; for reference types the JIT compiles a single shared body that still dispatches through the interface.
I don't want to sound too defensive about the optimizations the .NET runtime does, but I do want to clarify two things that are important in deciding how important various optimizations are.
I think you tested the worst case, since the operation being done was literally one instruction. Thus I do think it is not fair to use the 10x as the representative increase. Of those cases where you are doing 'flyweight' things in hot loops that actually have an impact on overall performance, how problematic would it have been to simply use arrays?
First, I would like to confirm that the Java experiment you did was 'representative' of your real code. What is going on is code specialization, and every system has its limits. There is a good chance that your microbenchmark lives within the limits (the whole function was inlined), but in more typical code, it is problematic (the methods are bigger and you have to decide which are worth it). Certainly 'hotspot' style dynamic re-compilation can do this (and that may be what is happening in your case), but that is a non-trivial investment. My main point, however, is that you should be careful about over-extrapolation of microbenchmarks. If the benchmark REALLY represents your hot paths, then I can't argue with that, but real-world code has a surprising tendency to be either not as bad (as in the first case above) or not as good (if the Java JIT 'falls off' its optimization for your real code, as in the second case). This is the kind of tradeoff we have to make when the .NET Runtime decides how to invest in optimization. What we have found is that in real-world code the performance problems are only very rarely related to JIT code quality issues of the type you suggest (but they do turn up in microbenchmarks all the time, which is what is happening here). Anyway, what would be more useful to us is not this microbenchmark, but examples of 'hot kernels' in your code where you find this to be an actual problem. That would certainly add weight for us to do some optimization of those scenarios.
@mikedn, wow, I assumed it did that work for us. I must have misinterpreted something I read somewhere. I just took @Thealexbarney's demo program and tried my change; it performed the same as not using generics. I'm kind of disappointed.
Now it makes sense why the LINQ operators don't use this (non-)"trick". I wish they had added that to the JIT compiler so the LINQ implementation could use that optimization.
Regarding the generic constraint comment (@NickStrupat): there are areas where using that could avoid (a) the interface invocation tax and (b) the struct-boxing tax incurred when an instance is used via its interface, particularly for things like IEnumerable and IEnumerator, which work really well as structs.
@markrendle absolutely. What I'm a bit disappointed about is that the JIT doesn't emit a specialized method for classes as well. It could emit a method where all of the calls to the interface methods are resolved to call the methods on the class (at least if that class is sealed, so it knows there won't be polymorphic behaviour). I just tested @Thealexbarney's demo program with an added SealedArrayWrapper and had all tests call a generic version of the sum method; the results were the same. Developing a library which reads the method body as IL and emits a specialized version per type could be a workaround.
Adding @cmckinsey |
So in summary, RyuJIT does not take advantage of the optimization opportunity to output specialized native executable code for generic methods taking an interface/class-constrained parameter when the type argument is a reference type. Is that correct? Does anyone have expertise on what changes that would require? Scenarios like this, where high-performance abstractions can be achieved through a sealed derived class combined with generic methods in hot loops or elsewhere, could be considered very valuable for people with their eyes on the "performance as a feature" aspect of .NET development.
It works that way for value types because there aren't any reasonable alternatives (the generated code differs for each value type, so it cannot be shared). Doing the same thing indiscriminately in the case of reference types would result in a code explosion, and in many cases it would be useless or worse, have adverse effects. If such specialization is done, then it should be done only when needed, and that makes things more difficult. The JIT needs to somehow figure out which code paths are worth optimizing, either at runtime (like Java supposedly does) or by having profiling information collected in advance (e.g. via tools like mpgo). And each approach has its pros and cons (e.g. the Java approach probably wouldn't work in AOT runtimes such as .NET Native and CoreRT).
@NickStrupat the caller can choose to force the implementation for the generic by creating a non-generic struct wrapper:

```csharp
struct ListWrapper : IList<int>
{
    private List<int> _list;

    public ListWrapper(List<int> list)
    {
        _list = list;
    }

    public int Count => _list.Count;
    //...
}

var result = SumIList(new ListWrapper(list));
```

It should also give you the benefits of inlining back.
@vancem The microbenchmark is not completely representative of the hot spots in our code, but it's not too far off. I run into this performance decrease most severely when working with code that uses interfaces such as IList or IReadOnlyList. Here's one example of a possible situation. Let's say someone is writing code that does some sort of data processing, and the functions to call this code accept an array. This implementation is working fine, and performance is good. Now this person needs to pass in a portion of a large array to this function. They have two main options: create a new, smaller array, allocating memory in the process, or create an object (similar to ArraySegment or Span) that abstracts the access to the original array, requiring duplicated code to handle the object and incurring a slight increase in execution time, but without a large memory allocation. I won't go into any more detail, but there are a few more memory abstractions that this person uses in different situations, and a common theme between all of them is that they can all implement IList. Currently there are multiple versions of the code that have to be maintained, but if a single version of the code accepted an IList, ease of maintenance and the size of the code base would be better. However, making the original code accept an IList instead of an array results in about a 2-4x increase in execution time, depending on the situation. I'm not that great at creating scenarios, but I hope this shows one situation where a developer could run into this issue.
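The "portion of a large array" option described above can be sketched as a hypothetical view type (`ArrayView` is my name, similar in spirit to `ArraySegment<T>`) that implements `IList<int>`, so a single `IList`-taking routine can handle whole arrays and windows alike:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical fixed-size window over an array (no range validation, for brevity).
sealed class ArrayView : IList<int>
{
    private readonly int[] _array;
    private readonly int _offset, _count;

    public ArrayView(int[] array, int offset, int count)
    {
        _array = array; _offset = offset; _count = count;
    }

    public int this[int index]
    {
        get => _array[_offset + index];
        set => _array[_offset + index] = value;
    }

    public int Count => _count;
    public bool IsReadOnly => false;

    public int IndexOf(int item)
    {
        int i = Array.IndexOf(_array, item, _offset, _count);
        return i < 0 ? -1 : i - _offset;   // translate back to view-relative index
    }

    public bool Contains(int item) => IndexOf(item) >= 0;
    public void CopyTo(int[] dest, int index) => Array.Copy(_array, _offset, dest, index, _count);

    public IEnumerator<int> GetEnumerator()
    {
        for (int i = 0; i < _count; i++) yield return _array[_offset + i];
    }
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    // The view has a fixed size; structural mutations are not supported.
    public void Add(int item) => throw new NotSupportedException();
    public void Clear() => throw new NotSupportedException();
    public void Insert(int index, int item) => throw new NotSupportedException();
    public bool Remove(int item) => throw new NotSupportedException();
    public void RemoveAt(int index) => throw new NotSupportedException();
}
```

Every indexer access on the view goes through interface dispatch when consumed as `IList<int>`, which is exactly where the 2-4x slowdown shows up.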
@mikedn I'm not suggesting indiscriminately doing the same thing for reference types. Only for sealed classes. @benaadams The wrapper struct can actually be generic and receive the same inlining treatment. |
How does doing it only for sealed classes help?

```csharp
void foo() {
    IList<int> nums = new int[42];
    ...
    // the compiler knows that nums is int[] so it could call
    // a specialized version of SumIList
    SumIList(nums);
}
```

Otherwise the best a compiler can do is something along the lines of @benaadams' first example. And that kind of approach needs to be driven by profiling information, otherwise you risk ending up with a larger and slower program.
@mikedn I think we've each been talking about slightly different scenarios, so I'll try to specify again. If the calling code knows the concrete type, the JIT could specialize this:

```csharp
private static int SumIList<TList>(TList list) where TList : IList<int>
{
    int result = 0;
    if (list != null)
    {
        int length = list.Count;
        for (int i = 0; i < length; i++)
            result += list[i];
    }
    return result;
}
```

If `TList` is a sealed class, there is no possibility of polymorphic behaviour, so the interface calls could be resolved directly. Currently, as @benaadams points out, we need to make a value-type wrapper to get the "inlining". If someone has a sealed class, the JIT already has the information it would need.
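The value-type wrapper workaround under discussion, as a self-contained sketch (`ArrayWrapper` and `Summer` are my names): because the wrapper is a struct, `SumIList<ArrayWrapper>` gets its own specialized native code, and the constrained `Count`/indexer calls can be resolved directly and inlined rather than dispatched through the interface.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical value-type wrapper over an array.
readonly struct ArrayWrapper : IList<int>
{
    private readonly int[] _array;
    public ArrayWrapper(int[] array) => _array = array;

    // The two members the hot loop actually uses.
    public int Count => _array.Length;
    public int this[int index]
    {
        get => _array[index];
        set => _array[index] = value;
    }

    // Remaining IList<int> members, kept out of the hot path.
    public bool IsReadOnly => false;
    public int IndexOf(int item) => Array.IndexOf(_array, item);
    public bool Contains(int item) => IndexOf(item) >= 0;
    public void CopyTo(int[] dest, int index) => _array.CopyTo(dest, index);
    public IEnumerator<int> GetEnumerator() => ((IEnumerable<int>)_array).GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
    public void Add(int item) => throw new NotSupportedException();
    public void Clear() => throw new NotSupportedException();
    public void Insert(int index, int item) => throw new NotSupportedException();
    public bool Remove(int item) => throw new NotSupportedException();
    public void RemoveAt(int index) => throw new NotSupportedException();
}

static class Summer
{
    // Specialized per value-type TList; shared for reference-type TList.
    public static int SumIList<TList>(TList list) where TList : IList<int>
    {
        int result = 0;
        int length = list.Count;
        for (int i = 0; i < length; i++)
            result += list[i];
        return result;
    }
}
```

Usage: `Summer.SumIList(new ArrayWrapper(data))` should perform close to the raw-array loop, while `Summer.SumIList<IList<int>>(data)` pays the interface dispatch cost.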
@NickStrupat as @mikedn points out, if that was done wholesale it would cause a binary-size explosion, especially as it would then start inlining code into these functions. It would be worthwhile in hot paths but disadvantageous in all paths. There is a proposal in Roslyn to indicate that you want specific handling for various generic types: dotnet/roslyn#15822 <-- Shameless self-promotion 😉
@benaadams, @mikedn, I don't know how many other ways I can point this out... I'm not talking about doing it wholesale. I'm talking about doing it only for sealed classes, where the final type is known. I fully recognize and understand that generating different native code methods for every and any type would be a problem. @benaadams, generic specialization proposals get a +1 from me! I loved designing C++ libs which took advantage of that.
@NickStrupat sealed is only relevant for virtual methods; otherwise the JIT already has all the type information. It would also cause unnecessary code bloat everywhere a sealed type was used rather than just in hot paths; e.g. since string is a sealed type, creating a specialization for every generic use of string would bloat the output. Regarding non-generic devirtualization for sealed types, that is being looked at as part of dotnet/coreclr#9230, though that is different from this issue.
I don't think so. Let's try again:
Considering that doing this will invariably increase code size, one can wonder why the JIT should be doing it in the absence of any indication that the code is performance sensitive. Does the fact that a class is sealed provide such an indication? Not really.
And as I stated before, value types work the way they work because there aren't any reasonable alternatives, not because the JIT is doing an optimization. If anything, the JIT could do better in the area of value types by avoiding generating separate code for similar structs (e.g. 2 structs that both contain a single field of the same type).
After rechecking my assumptions, I take it back; alas, a concrete struct wrapper doesn't inline via a generic. https://gist.github.com/benaadams/ba3c008a328f02c75663213cd9affa65
The interface calls are slower, but closer to a non-inlined call (as above).
Ahh... had it wrong; they do inline. The wrappers were classes, not structs. Updated code and results are in the gist above.
@mikedn Generics in general already cause this "code explosion" (relative to non-generic types/collections). What are the chances that the current implementation of generics and their surrounding code base (LINQ, etc.) represents exactly the right balance of native code size vs performance? There is no real reason other than performance to generate the specializations at all. The type arguments could just be used for type safety, wrapping a non-generic implementation with casting. Clearly the performance gains were considered more important than the potentially large generated code size.
Currently the mechanism for achieving the inlining is "wrap with a struct", which is no less cryptic than "seal your class". That said, I would definitely be interested in some fine-grained control over which inlined specializations are generated.
There is definitely at least one reasonable alternative: you could just box the value-type wrapper, like many parts of the .NET BCL already/still do.
That would be very cool. So the easy case would be if the fields match exactly?
@Thealexbarney you pointed out JVM capabilities that make these things fast, but you can get much of the same by using an abstract class instead of an interface.
From my point of view, the framework guidelines fit a particular purpose, so if you are interested in performance you are going to be tuning all the time, even in the future. As others have said, there are trade-offs, costs, and priorities. On the other hand, performance gained automatically, with no research required, doesn't hurt anyone.
Things continue to look good with recent builds and TieredPGO (and TC_QuickJitForLoops). The tests linked above do a fair amount of work per iteration. If the work payload is cut down to something like an integer sum reduction, we can see the residual costs more clearly:

```csharp
for (int i = 0; i < length; i++)
{
    r += a[i] + b[i];
}
```

and here the PGO version is about 2x slower.
BDN results can be a bit inconsistent at times, however. I am not sure why.

```asm
;; no-interface inner loop (both PGO and no PGO)

G_M44880_IG03:  ;; offset=0015H
443BD1     cmp r10d, ecx
7338       jae SHORT G_M44880_IG07
;; bbWeight=10000.90 PerfScore 12501.13
G_M44880_IG04:  ;; offset=001AH
4C8B5A08   mov r11, gword ptr [rdx+8]
453B5308   cmp r10d, dword ptr [r11+8]
7334       jae SHORT G_M44880_IG08
4963F2     movsxd rsi, r10d
458B5CB310 mov r11d, dword ptr [r11+4*rsi+16]
453B5010   cmp r10d, dword ptr [r8+16]
7320       jae SHORT G_M44880_IG07
;; bbWeight=10000.90 PerfScore 102509.25
G_M44880_IG05:  ;; offset=0032H
4103C3     add eax, r11d
4D8B5808   mov r11, gword ptr [r8+8]
453B5308   cmp r10d, dword ptr [r11+8]
7319       jae SHORT G_M44880_IG08
410344B310 add eax, dword ptr [r11+4*rsi+16]
41FFC2     inc r10d
453BD1     cmp r10d, r9d
7CC9       jl SHORT G_M44880_IG03
```

```asm
;; interface dot-product inner loop (no PGO)

G_M18393_IG03:  ;; offset=0030H
488BCE               mov rcx, rsi
418BD6               mov edx, r14d
49BBF004A443FC7F0000 mov r11, 0x7FFC43A404F0
FF158A4CC8FF         call [IList`1:get_Item(int):int:this]
448BF8               mov r15d, eax
488BCF               mov rcx, rdi
418BD6               mov edx, r14d
49BBF804A443FC7F0000 mov r11, 0x7FFC43A404F8
FF15794CC8FF         call [IList`1:get_Item(int):int:this]
4103EF               add ebp, r15d
03E8                 add ebp, eax
41FFC6               inc r14d
443BF3               cmp r14d, ebx
7CC4                 jl SHORT G_M18393_IG03
```

```asm
;; interface dot-product inner loop (PGO)

G_M18393_IG04:  ;; offset=0039H
48B8D016D343FC7F0000 mov rax, 0x7FFC43D316D0          ;; test type of "a"
483BD8               cmp rbx, rax
0F8591000000         jne G_M18393_IG13
488BC7               mov rax, rdi
443B7810             cmp r15d, dword ptr [rax+16]     ;; length check
0F83A2000000         jae G_M18393_IG14
;; bbWeight=10006.84 PerfScore 47532.50
G_M18393_IG05:  ;; offset=0059H
488B4708             mov rax, gword ptr [rdi+8]
443B7808             cmp r15d, dword ptr [rax+8]      ;; bounds check
0F83B5000000         jae G_M18393_IG16
4963CF               movsxd rcx, r15d
448B648810           mov r12d, dword ptr [rax+4*rcx+16]  ;; fetch element "a"
;; bbWeight=10006.84 PerfScore 72549.60
G_M18393_IG06:  ;; offset=006FH
48B8D016D343FC7F0000 mov rax, 0x7FFC43D316D0          ;; test type of "b"
483906               cmp qword ptr [rsi], rax
0F857F000000         jne G_M18393_IG15
;; bbWeight=10006.84 PerfScore 32522.24
G_M18393_IG07:  ;; offset=0082H
488BC6               mov rax, rsi
443B7810             cmp r15d, dword ptr [rax+16]     ;; length check
7370                 jae SHORT G_M18393_IG14
;; bbWeight=10006.84 PerfScore 32522.24
G_M18393_IG08:  ;; offset=008BH
488B4608             mov rax, gword ptr [rsi+8]
443B7808             cmp r15d, dword ptr [rax+8]      ;; bounds check
0F8383000000         jae G_M18393_IG16
4963CF               movsxd rcx, r15d
448B6C8810           mov r13d, dword ptr [rax+4*rcx+16]  ;; fetch element "b"
;; bbWeight=10006.84 PerfScore 72549.60
G_M18393_IG09:  ;; offset=00A1H
4503F4               add r14d, r12d
4503F5               add r14d, r13d
41FFC7               inc r15d
443BFD               cmp r15d, ebp
7C8A                 jl SHORT G_M18393_IG04

;; cold blocks 13 and 15 (not shown) make the residual interface calls if the type tests fail.
```

Compared with the no-interface version, the extra overhead in the PGO version comes from the two type checks and the slightly more expensive type test for "b" (a memory compare rather than a register compare). Note in the non-interface version the bounds checks are still present. Also in the PGO loop above we don't hoist the loop-invariant method-table constant out of the loop.

As an alternative with less code duplication we could clone / unswitch on the type tests... that is, instead of producing

```csharp
for (...)
{
    if (a.GetType() == typeof(...))
        ...
    if (b.GetType() == typeof(...))
        ...
}
```

we'd produce

```csharp
if ((a.GetType() == typeof(...)) && (b != null) && (b.GetType() == typeof(...)))
{
    for (...) // no type checks or interface calls
}
else
{
    for (...) // interface calls
}
```

I'm going to keep this open for now, but I don't expect much to change for the remainder of 6.0, so will move to future.
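The unswitching described above can also be done by hand today (a sketch of the idea, not the JIT's actual transformation; the `DotProduct` name is mine): test the concrete types once, outside the loop, and fall back to the interface loop otherwise.

```csharp
using System.Collections.Generic;

static class DotProduct
{
    public static int Sum(IList<int> a, IList<int> b, int length)
    {
        // Hoisted type tests: one check per call instead of per iteration.
        if (a is int[] aa && b is int[] bb)
        {
            int fast = 0;
            for (int i = 0; i < length; i++)
                fast += aa[i] + bb[i];   // direct array access, no interface dispatch
            return fast;
        }

        int slow = 0;
        for (int i = 0; i < length; i++)
            slow += a[i] + b[i];         // interface-call fallback
        return slow;
    }
}
```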
I made up some simple quick tests using a tight loop as you designed it. It seems that, at least on my configuration, I get about a 1/3 improvement with an abstract-class design (compared to the interface design). I guess that is not enough for your case. 👎🏻
Even better results with generics: circa 1/2.
... still the burden is 5x for a virtual method in the generics case (while the interface takes 10x).
Running the LinqBenchmarks at https://github.com/NetFabric/LinqBenchmarks; the improvement for
@benaadams thanks for the ongoing interest in PGO! FWIW I am starting to look at impact of PGO on various TechEmpower scenarios. Some seem to speed up nicely (plaintext/mvc), others not so much (json)... suspect we're not seeing anywhere near the full potential yet. |
.NET 6.0 + PGO allows "Ben" (essentially a Kestrel stack) to pull ahead of the lighter .NET 5.0 "platform" on the TE database tests 😎
Love to hear it 💜 |
Isn't like for like; for Json, "platform" is using 3 additional options:
Whereas "Ben" just uses the vanilla defaults; and implementation-wise it's closer to "middleware" than "platform" (it was mostly a demonstration that the ASP.NET Core stack is layered well enough that you can build a stack on top of Kestrel without having to "buy in" to its opinionated Dependency Injection and Hosting model)
So that's good news, then? Is this with recent builds? |
Yes. It's fairly miraculous that it overtook "platform" in the DB Single Query test; I expected it to be a little faster than "middleware", but... 😲
It is using the public releases, so I assume it is preview 3? Docker settings are:
Though it might not need to switch off ReadyToRun anymore?
My hope is that we can eventually drop the need for these settings; for 6p4 they are still needed. By 6p6 we plan to have static PGO data flowing into ASP.NET assemblies as well, and maybe by then we can ease off disabling ReadyToRun. With current main, the gap between enabling R2R and disabling it on TE is still pretty large, at least in my measurements. For example, plaintext/mvc is 18% faster with the options above, but only 1% faster if R2R is left enabled. I need to do more investigation into what needs fixing and where these numbers should be; that's going to be my focus for the next few weeks.
cc @davidfowl since Julie says you want to know what we're up to over here. |
why is disabling ReadyToRun needed here?
Lack of PGO data for the R2R methods; switching it off runs them at Tier0, which generates the PGO data for Tier1.
Using an interface vs using the specific type (class/struct) always has tradeoffs. So, can the JIT inline a method of a class which implements an interface? When I use the interface, will the JIT compiler inline my method if the runtime type is my class? What are the criteria for a method to be inlined? Is it always inlined? Sorry for my bad English.
@HyVong007 thanks for the questions. Let me know if the information below helps.

**Devirtualization**

Generally, if the JIT can determine the exact type of an object, it can devirtualize interface calls made on it and possibly inline them.

Last I looked, flow analysis can enable devirtualization and inlining in a relatively small fraction of interface call cases (say no more than 10%). Success here requires that there be some "organic" evidence of type identity (constructor call or type test) upstream from the interface call site. Inlining can help bring together the needed information, but currently the inlining heuristics do not include increased devirtualization potential as part of their evaluation. That may change soon (see e.g. #53670). PGO is quite effective at devirtualizing interface calls; most studies I have done show upwards of 80% of interface call sites have one dominant implementing class.

**Inlining**

The inlining heuristics are complicated and difficult to summarize concisely. Roughly speaking, a method will be inlined if:
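The "organic evidence of type identity" case can be sketched like this (names are mine): when the constructor call is visible upstream of the interface call, the JIT can know the exact type and devirtualize; when the parameter type is opaque, it cannot (without PGO).

```csharp
using System.Collections.Generic;

static class Example
{
    public static int CountViaInterface()
    {
        // The constructor call makes the concrete type (List<int>) known here,
        // so the interface call below is a devirtualization candidate.
        IList<int> list = new List<int> { 1, 2, 3 };
        return list.Count;
    }

    public static int CountOpaque(IList<int> list)
    {
        // Here the concrete type is unknown at JIT time; without PGO this
        // remains a true interface call.
        return list.Count;
    }
}
```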
OK, thank you, I understand something now. And here are some examples I found; which cases will be inlined? Thanks so much!
A long time has passed since this thread was opened. I am wondering how big of a difference still remains, with all the improvements that have gone into the JIT like PGO, OSR, tiered compilation, GDV, frozen object heap, and what-not.
Seems pretty good to me. Test program:
Maybe it's possible to measure this with BenchmarkDotNet instead? This is left as an exercise for the reader.
Your micro-benchmark showcases profile-guided devirtualization and inlining. Yes, we made a lot of progress improving the various cases discussed here, and we will continue the improvements. I do not see anything concrete and actionable left in this issue. Closing.
The general guidelines that I see for writing in C# usually say something like, "Accept as general of an object as you can for your parameters." For example, instead of requiring a List as a parameter when you don't actually use List specific methods, accept an IList, or even an IEnumerable if you can. This way you're not tied to a specific implementation. The basic concept of interfaces.
This works well in theory, but in practice, I've found that it's usually a bad idea when it comes to interfaces like IList, ICollection, or IEnumerable. In many of my projects, I've found that changing methods' parameter types from arrays or Lists to interfaces such as IList results in a 2-4x increase in execution time for the program.
Because interfaces use virtual calls, which the JIT doesn't inline, indexer access time is greatly increased. I was curious just how big the performance decrease was, so I put together a little demo program to try to benchmark this, and the results were worse than I expected:
Using an IList resulted in about an order of magnitude longer execution time than not using an IList. I wondered if some of this was due to the extra stuff the CLR does when accessing an Array or List as an IList, so I made a simple wrapper class around an array. This resulted in some improvement, but not much.
This virtually negates the advantage of something like accepting an IList as a parameter instead of a T[] to decouple the code from any specific implementation. A new developer might expect that passing an object to a method accepting an IList would result in the same performance as if the method accepted that object's concrete type.
Is there any way to lessen this kind of performance loss? Or is it unavoidable without changing the JIT to be able to inline virtual calls?
category:cq
theme:devirtualization
skill-level:expert
cost:large
impact:large