Resolve AOT compilation issues #1737
Conversation
Attempt to resolve the AOT warning in `DelegatingComponent`.
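For context, the warning arises from infinite generic recursion during Native AOT compilation (see #1732 and the discussion below). A hypothetical minimal reproduction of that shape, not Polly's actual code:

```csharp
using System;

static class GenericCycleDemo
{
    // Each recursive call would instantiate Execute with a larger tuple type:
    // TState, then (TState, int), then ((TState, int), int), and so on.
    // A JIT only compiles the instantiations that actually run, but an
    // ahead-of-time compiler that must enumerate every instantiation up front
    // cannot terminate, so it reports a generic-cycle warning/error instead.
    public static TResult Execute<TResult, TState>(Func<TState, TResult> callback, TState state)
    {
        if (Environment.TickCount == -1) // practically never true; keeps the branch statically reachable
        {
            return Execute<TResult, (TState, int)>(t => callback(t.Item1), (state, 0));
        }

        return callback(state);
    }
}
```

The `DelegatingComponent` in the benchmark later in this thread has the same shape: each wrap packs the previous `TState` into a new tuple type.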
I'll just re-state what I wrote in martincostello@393b2db#r131086685. The allocation either matters or it doesn't.

If it doesn't matter, this looks very similar to the problem that Rx had, which got fixed with a simple change like this: dotnet/reactive#1116. This could be a similar fix (just drop the `static` from the lambda).

If the allocation matters, it's going to matter for PublishAot too. The profiling/etc. workflows are different between native and non-native deployments, and having drastically different perf characteristics (assuming the allocation matters here) is not great; it will be hard to troubleshoot. Getting rid of the allocation would likely require shaking up more than just `DelegatingComponent`.

My gut feel is that the allocation doesn't matter.
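For reference, a minimal sketch, independent of Polly's internals, of why dropping `static` introduces a per-call allocation: a lambda that captures locals forces the compiler to allocate a closure object, while a `static` lambda cannot capture, so its delegate can be cached.

```csharp
using System;

static class ClosureDemo
{
    // Captures 'offset': the compiler emits a closure class that is
    // allocated on each call, analogous to the regression discussed here.
    static Func<int, int> MakeAdder(int offset) => x => x + offset;

    // A 'static' lambda cannot capture, so the compiler caches a single
    // delegate instance and the call site allocates nothing per call.
    static readonly Func<int, int> Increment = static x => x + 1;
}
```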
Thanks for the information. A core part of the original design was to make things allocation free, which is why I was trying to keep it to zero. However, it sounds like achieving that while resolving the AOT issue isn't going to be possible without significant refactoring. It might be the case that we just accept that this code path needs to regress slightly (and we do that simple thing of dropping the `static`).

I might have a bit of a further play around today to see if there's a different approach to the higher level of what the code is doing that would indirectly solve the issue.

As I mentioned in the original issue, do you have a sample project you can point me to for testing AoT compatibility here going forwards? I could pretty much only repro the issue by putting the exact code into an AoT app. We could do with something that's more realistic and representative of user code, but still uncovers the original problem.
cc @agocke
Some results for comparison on my laptop based on the benchmark in #1743. The Mean value seems quite noisy, so I'd just focus on the Allocated column.
Polly 8.0.0
With boxing via wrapper:

```csharp
internal override ValueTask<Outcome<TResult>> ExecuteCore<TResult, TState>(
    Func<ResilienceContext, TState, ValueTask<Outcome<TResult>>> callback,
    ResilienceContext context,
    TState state)
{
    return _component.ExecuteCore(
        static (context, wrapper) =>
        {
            // Recover the strongly-typed callback and state from the object-typed wrapper.
            var callback = (Func<ResilienceContext, TState, ValueTask<Outcome<TResult>>>)wrapper.Callback;
            var state = (TState)wrapper.State;

            if (context.CancellationToken.IsCancellationRequested)
            {
                return Outcome.FromExceptionAsValueTask<TResult>(new OperationCanceledException(context.CancellationToken).TrySetStackTrace());
            }

            return wrapper.Next.ExecuteCore(callback, context, state);
        },
        context,
        new StateWrapper(Next!, callback, state!)); // boxes 'state' when TState is a value type
}

public override ValueTask DisposeAsync() => default;

private readonly struct StateWrapper
{
    public StateWrapper(PipelineComponent next, object callback, object state)
    {
        Next = next;
        Callback = callback;
        State = state;
    }

    public readonly PipelineComponent Next;
    public readonly object Callback;
    public readonly object State;
}
```

Removing `static` from the lambda:
Method | Mean | Error | StdDev | Gen0 | Allocated |
---|---|---|---|---|---|
CompositeComponent_ExecuteCore | 54.23 ns | 1.147 ns | 1.570 ns | 0.0101 | 96 B |
```csharp
internal override ValueTask<Outcome<TResult>> ExecuteCore<TResult, TState>(
    Func<ResilienceContext, TState, ValueTask<Outcome<TResult>>> callback,
    ResilienceContext context,
    TState state)
{
    return _component.ExecuteCore(
        // Not 'static': the lambda captures 'callback' and 'this' (via Next),
        // so a closure is allocated on every execution.
        (context, state) =>
        {
            if (context.CancellationToken.IsCancellationRequested)
            {
                return Outcome.FromExceptionAsValueTask<TResult>(new OperationCanceledException(context.CancellationToken).TrySetStackTrace());
            }

            return Next!.ExecuteCore(callback, context, state);
        },
        context,
        state);
}
```
Is there a real-world application this could be tested with? Microbenchmarks are good supporting evidence, but they fail to capture other aspects, e.g. what proportion the 24 bytes is compared to allocations in callbacks that do actual/real work, or how this changes the time spent JITting and loading new types (this reduces AOT compilation time, but also means less JITting outside AOT). It's possible this change is simply a noise-level regression in allocations, a noise-level improvement in startup time/working set with a JIT, and an improvement for AOT.
Testing for AOT compatibility is done by the analyzers. The fact that they don't find generic cycles is an analyzer bug; I filed it as dotnet/runtime#94131. I thought we already had a bug for it, but I couldn't find one. Generic cycles are so rare that this hasn't been a huge priority in the past (the compiler even used to crash on them; we knew of the 3-4 libraries that had problems with it in the past, but not much else ever got reported), so Polly is just unlucky. One shouldn't really need to do anything else besides enabling the analyzers and ensuring all external dependencies also run with the analyzers. The low-tech way to test AOT compatibility outside the analyzers (which one really shouldn't need, because the analyzers are the testing story) is:
1. Create a new console app and add a reference to the library.
2. Enable `PublishAot` in the project.
3. Add a `<TrimmerRootAssembly>` item for the library.
4. Publish the app and review the AOT compiler's warnings.

Step 3 ensures the compiler compiles everything in the library, not just the parts a particular app happens to call.
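A sketch of what such a test app can look like, assuming the .csproj sets `<PublishAot>true</PublishAot>` and roots the library with `<TrimmerRootAssembly Include="Polly.Core" />` (names here are illustrative):

```csharp
// Program.cs of a hypothetical AOT-compatibility smoke-test app. Rooting the
// assembly via TrimmerRootAssembly is what drives the full analysis; the code
// below only keeps the app observable and the package reference alive.
using System;
using Polly;

ResiliencePipeline pipeline = new ResiliencePipelineBuilder().Build();
pipeline.Execute(static () => Console.WriteLine("AOT smoke test executed."));
```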
@davidfowl Are you (or someone you were reporting on behalf of) able to run this change through whatever app you found the issue with originally to see if we can get some better data on the net gain or loss with this change if it resolves the AOT issue?
@MichalStrehovsky OK cool, I didn't realise that was an issue (or that it was a very rare type of problem). I misunderstood it to be "by design" that it wasn't detected by the analysers, which is why I thought we needed a proper full repro project of our own to detect the issue. I'll put together a console app like you've described to check that it detects the original issue; if it does, then we can make that part of our CI until the analysis bug is fixed.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```diff
@@            Coverage Diff            @@
##             main    #1737     +/-  ##
==========================================
+ Coverage   84.52%   84.53%   +0.01%
==========================================
  Files         306      307       +1
  Lines        6772     6777       +5
  Branches     1042     1043       +1
==========================================
+ Hits         5724     5729       +5
  Misses        839      839
  Partials      209      209
```

Flags with carried forward coverage won't be shown.
The analyzers would ideally catch everything, but there are a few cases where it's very hard for them to see through all the possible situations. In those cases, running the AOT compiler is the best way to catch the problems. Luckily, as Michal detailed above, it's pretty easy to create a simple console app for this by following those instructions.
Sums it up pretty well. This is worth a docs page, though. https://learn.microsoft.com/en-us/dotnet/core/deploying/trimming/prepare-libraries-for-trimming?pivots=dotnet-8-0 is our page for trimming, but we should have one for Native AOT too.
This was found using the new resilience libraries in a branch of eShop. I was trying to AOT one of the services and several warnings showed up. Do you want me to verify the breakage or performance (I think this scenario causes breakage)?
Are you going to file a docs issue?
@davidfowl The perf impact of this fix, if you could, please. I opened #1745 earlier, which repros the bug you found; if I do some rebasing later and incorporate it into this PR, the new test should pass, confirming the AoT issue is resolved.
(I also need to re-run our own benchmarks)
Force-pushed a60ec9b to 106ca8a
Based on these benchmarks there is a clear regression in both execution time and allocations when `static` is removed from the lambda. Additionally, if we went ahead with this, we could no longer claim to have a zero-allocation API. I would say let's keep both branches and just have an extra allocation with AOT. In the future we can try to remove it too; don't know how, though 🤨.
Excluding a flash of inspiration or a significant refactor, the options so far appear to be:
Personally I'm erring towards option 3. I'm going to revert the code back to dd2fd54 locally and see what the benchmarks give me.
I prefer option 3 too. While we introduce an extra branch for AOT, this change is limited to a very small percentage of the code base. Note that you can do the check once, when building the pipeline, and create and chain the appropriate component instance there.
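A minimal sketch of that suggestion, with assumed names (`AotDelegatingComponent` and the factory are hypothetical): evaluate `RuntimeFeature.IsDynamicCodeSupported` once while composing the pipeline and chain the matching component, instead of branching on every execution.

```csharp
using System.Runtime.CompilerServices;

internal static class ComponentFactory
{
    // Hypothetical factory: 'DelegatingComponent' keeps the zero-allocation
    // generic path (safe when a JIT is present), while the assumed
    // 'AotDelegatingComponent' uses the boxing wrapper that Native AOT
    // can compile without generic recursion.
    public static PipelineComponent CreateDelegating(PipelineComponent inner) =>
        RuntimeFeature.IsDynamicCodeSupported
            ? new DelegatingComponent(inner)
            : new AotDelegatingComponent(inner);
}
```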
I'm about to push up some changes I've made to re-introduce the branch (just waiting for the mutation tests); after that I'll try the approach of having two different delegating strategy types, as that should be a lot simpler 👍.
Force-pushed 106ca8a to 671b450
I've tried splitting the code by type (rather than with a method) in three different approaches, but they all still hit the original issue: the AoT compiler tries to include the version with the generic recursion and then fails to compile.
Can't we just say "we know better" and suppress the warning then?
Unsure; it looked like an error rather than a warning coming from a single line, and to be honest all three of the things I tried looked more complicated than this or had more duplication. I would suggest we leave it like this for now, if this is the approach we want (branch, with an allocation regression for AoT), and we can try to refactor it further after shipping a patch release that fixes things for AoT.
If this API is documented as zero-allocation, it should be called out that it will not deliver on that promise with AOT.
For this, I'll just go back to what I said about generic virtual methods earlier. If `ExecuteCore` wasn't a generic virtual method, but simply took an `object` and boxed/unboxed things as needed, the benchmark results would look different. The generic virtual method approach avoids the boxing/unboxing, but then burns all the saved time figuring out where to dispatch the method calls. What this PR does to the AOT version is the "worst of both worlds": we still do a generic virtual method call, but also box. Here's what "best of both worlds" looks like (either no boxing but generic virtual, or boxing and no generic virtual):
The version that boxes comes out ahead by 13%, solely because we didn't burn CPU time dispatching generic virtual methods. (Notice the standard deviation is higher for the Object version, likely due to counting the time spent in GCs, but the boxing version still comes out ahead with plenty of buffer.)

This might not be applicable to this problem, or there may be other constraints, but I wanted to call this out because sometimes people pay too much attention to small, short-lived allocations and miss the whole picture. BDN doesn't do a good job of showing where the costs shifted: with generic virtual methods, the costs shift to dispatch time and extra JIT time (we need to JIT a new native method for each unique T), which translates to more working set and worse startup time. Small, short-lived allocations are cheap to clean up.

The whole benchmark backing the table, in case I made a mistake somewhere. The gist is that a generic virtual method call has a comparable CPU cost to an object allocation on average:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<TestPipelines>();

[MemoryDiagnoser]
[SimpleJob(RuntimeMoniker.Net80)]
public class TestPipelines
{
    PipelineComponent p = new DelegatingComponent(new IdentityComponent(), new IdentityComponent());

    [Benchmark]
    public int Generic() => p.ExecuteCore<int, int>(state => state + 1, 0);

    [Benchmark]
    public int Object() => (int)p.ExecuteCore(state => (object)((int)state + 1), (object)0);
}

abstract class PipelineComponent
{
    public abstract TResult ExecuteCore<TResult, TState>(Func<TState, TResult> callback, TState state);

    public abstract object ExecuteCore(Func<object, object> callback, object state);
}

class IdentityComponent : PipelineComponent
{
    public override TResult ExecuteCore<TResult, TState>(Func<TState, TResult> callback, TState state)
        => callback(state);

    public override object ExecuteCore(Func<object, object> callback, object state)
        => callback(state);
}

class DelegatingComponent : PipelineComponent
{
    PipelineComponent _component;
    PipelineComponent _next;

    public DelegatingComponent(PipelineComponent component, PipelineComponent next)
    {
        _component = component;
        _next = next;
    }

    public override TResult ExecuteCore<TResult, TState>(Func<TState, TResult> callback, TState state)
    {
        return _component.ExecuteCore(static (state) =>
        {
            return state._next.ExecuteCore(state.callback, state.state);
        },
        (_next, callback, state));
    }

    public override object ExecuteCore(Func<object, object> callback, object state)
        => _component.ExecuteCore(static (object state) =>
        {
            (PipelineComponent next, Func<object, object> callback, object state_) =
                ((PipelineComponent, Func<object, object>, object))state;
            return next.ExecuteCore(callback, state_);
        },
        (_next, callback, state));
}
```
The options I can see here so far are:
I think it's better to unblock the scenario for AoT (the current behaviour is effectively a bug, because we always wanted to support it: #1110) for .NET 8, which is option 2. It's also worth noting that the performance of Polly v8 with the new API surface is in general much better than Polly v7, so any users moving to .NET 8 and the new resilience features from dotnet/extensions are going to see improvements with or without AoT, even if AoT performance for Polly specifically is slightly worse than without. In other words, a user adopting the new Polly v8 APIs and .NET 8 won't miss something they didn't yet have. It's not ideal, but coming up with a perfect solution (if one is even possible) isn't going to be quick, so if we want to enable Polly with AoT for .NET 8 it seems we're going to have to make some sort of compromise.
Searching the code for "zero" and "allocation", all I can find is references to us trying to avoid/minimise/reduce them. I think the only explicit "documentation" of zero allocations is the benchmark results checked into the repo that happen to be zero. I don't think we've ever made a cast-iron guarantee that the new Polly v8 API is allocation-free in 100% of scenarios.
While supporting AOT is important, we should also recognise that the majority of .NET users simply won't be using it in the near future. I am all for supporting AOT, even if it comes with some extra allocation, and we can do this while still supporting non-AOT scenarios with no perf or allocation degradation. Regarding virtual generic methods, at this point IMHO I would stick with the current design. If I read the numbers correctly, changing it would give us around ~3 ns of improvement per invocation. That improvement would be dwarfed by the overhead and ceremony required by each resilience strategy (see the perf of a standard pipeline of 5 strategies, which is around ~2,000 ns per execution, making it roughly a 0.15% difference). On top of that, Polly will be used in the context of async IO operations, so we can multiply all the numbers by 10x (at minimum). So my vote is to merge this PR, resolve the AOT issues for 8.0.0, and keep an eye out for future improvements.
Anyway, thanks @MichalStrehovsky for all the context. I didn't realize that generic virtual calls have such overhead.
Force-pushed 1b93ce2 to 5e3c076
A potential fix for App-vNext#1732 using `RuntimeFeature.IsDynamicCodeSupported` to guard against the allocation regression (via boxing) that avoiding the infinite generic recursion causes.
Simplify the proposed fix by removing the AoT and non-AoT branching and just using the AoT fix for all cases.
Add a console project that validates whether the new-in-v8 Polly assemblies are AoT compatible. Relates to App-vNext#1732.
- Re-introduce branching for handling AoT and non-AoT to remove the allocation impact for non-AoT scenarios.
- Refactor the code to allow for unit testing both branches and increase code re-use.
Mark the new Polly v8 assemblies as AoT compatible.
Turns out that the issue is detected without the additional hint of explicitly exercising the problematic code.
Move `DelegatingComponent` out from being a nested type.
Resolves peer review feedback, and also seems to make it faster.
Add a micro-benchmark for `DelegatingComponent` code path for non-AoT and AoT.
Update CHANGELOG to reference this PR.
Force-pushed b85d45c to 4a39d1f
A potential fix for #1732 using `RuntimeFeature.IsDynamicCodeSupported` to guard against an allocation regression while still avoiding the infinite generic recursion the current code in Polly 8.0.0 causes.

This is just a starting point for discussion on what the best fix and/or compromises are to support using Polly with AoT-compiled applications. I'm very much out of my comfort zone when it comes to AoT, so this was just something I came up with that resolves the issue at hand. It does trade off either increased code size (by using the guard) or, without the guard, an allocation regression of 24 bytes per invocation (original thread).
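As a rough illustration of the guard described above (the helper names on the right-hand side are assumed for the sketch, not necessarily the PR's actual members):

```csharp
using System.Runtime.CompilerServices;

// Sketch only: 'ExecuteComponent' (generic, zero-allocation) and
// 'ExecuteComponentAot' (boxes callback/state into an object-typed
// wrapper) are assumed helper names illustrating the two paths.
internal override ValueTask<Outcome<TResult>> ExecuteCore<TResult, TState>(
    Func<ResilienceContext, TState, ValueTask<Outcome<TResult>>> callback,
    ResilienceContext context,
    TState state)
{
    return RuntimeFeature.IsDynamicCodeSupported
        ? ExecuteComponent(callback, context, state)     // JIT: no allocation
        : ExecuteComponentAot(callback, context, state); // AOT: boxes, but compiles
}
```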
💯% open to suggestions for a better solution.
/cc @davidfowl @MichalStrehovsky @eerhardt @martintmk