PlaintextMVC benchmark is slow on arm64 #60166
Do we know that it is actually slower, as opposed to something else causing that lambda to be invoked many more times? |
AFAIR the Docker image defined in TE's repo for Actix is not working for ARM64. Will try again. |
It's not necessary to run Actix specifically; anything fast and other than .NET, run on both x64 and arm64, would be enough to see the difference |
Could you please take a look at the GC info tab in PerfView, which shows the GC stats, and share it here for x64 vs. arm64? |
@jkotas while I'm re-opening the traces (it takes quite some time on my PC) here are the perf counters ( |
NB: these are markdown-formatted and also correctly spaced, so you can either paste them as-is or wrap them in triple backticks. Stating the obvious: also, can you use |
I think the GC is clearly part of the problem. Gen0 GCs are firing 4x more frequently per second while throughput is 8x lower, which means the GC rate per unit of work is effectively ~30x higher (4 × 8 = 32). @dotnet/gc should take a look. |
Notice that the counters show Gen0 size of |
It would be good to figure out the computed Gen0 budget for this arm64 machine. |
Plaintext is not allocating and never triggers a GC, so you don't get any stats. |
It explains why it's so fast then 🙂 |
Let me know if you want me to check it (I'll need some pointers on what to look at); I can modify the GC and send it there via crank |
Is it possible to capture a GCCollectOnly trace? It's described here. |
@EgorBo to get the trace requested by @Maoni0 you will need to switch to a managed trace, since it's running on Linux. Use Also, you might need to reduce the length of the benchmark with |
I sent these traces via zip to CLR GC Core just in case. |
With the data from @EgorBo and @sebastienros I got to the culprit. On arm64 we are reading the cache size this way
and on this particular arm64 machine there's no entry for the L3 cache (it only has index0/1, which are the data/instruction L1 cache sizes, and index2, which is the L2 cache). Since we take the largest, which is the 256k L2 cache size, and return 3x that, we get 768k, and the gen0 min budget is calculated as 5/8 of this, which is 480k, which is of course tiny. If folks know of a way to get the L3 cache size programmatically on Linux in this case, I'm all ears. |
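For reference, the sysfs layout described above can be inspected directly. Below is a small C# sketch (illustration only; the runtime's actual lookup lives in the GC's native code) that enumerates the Linux cache entries and reproduces the budget arithmetic from the comment, i.e. 3x the largest reported cache and 5/8 of that for the gen0 min budget:

```csharp
using System;
using System.IO;

class CacheProbe
{
    static void Main()
    {
        long maxKB = 0;
        // Each indexN directory describes one cache; on the machine above
        // only index0/1 (L1 data/instruction) and index2 (L2) are present.
        foreach (var dir in Directory.EnumerateDirectories(
            "/sys/devices/system/cpu/cpu0/cache", "index*"))
        {
            string level = File.ReadAllText(Path.Combine(dir, "level")).Trim();
            string size = File.ReadAllText(Path.Combine(dir, "size")).Trim(); // e.g. "256K"
            Console.WriteLine($"{Path.GetFileName(dir)}: L{level} = {size}");
            if (size.EndsWith("K"))
                maxKB = Math.Max(maxKB, long.Parse(size.TrimEnd('K')));
        }
        // With no L3 entry the largest size found is the L2, so:
        Console.WriteLine($"largest = {maxKB}K, 3x = {maxKB * 3}K, gen0 min budget = {maxKB * 3 * 5 / 8}K");
    }
}
```

On the machine in question this would print 256K, 768K, and 480K, matching the numbers above.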
@EgorBo based on how many benchmarks you must have run, I believe I should ask for some extra fans to be installed in the machine |
I did a run today for the Json MVC benchmark. For arm64, I collected using The max working set, GC heap size, and size of memory committed by the GC are almost 2.5x higher on arm64. At the same time, Gen2 and LOH sizes are almost 3x smaller on arm64. If I'm reading the gc-collect trace correctly, we did just 2 rounds of GC on arm64 vs. 17 rounds on x64. Here is the x64 GC info: I also noticed that the thread pool queue length and items/second are 2x smaller on arm64. @mangod9 or @Maoni0 - any idea why, and what should be the next steps to investigate? |
Are there differences in available memory on the two machines? We will need to compare profiles across the two to investigate further. |
@sebastienros - could you please confirm?
x64 has 28 cores vs. 32 cores on arm64. Given that, what can we conclude about the thread pool queue length and items/second differences? |
ARM:
INTEL:
Also a reminder that this ARM machine doesn't report L3 the way dotnet is reading it. |
The higher memory on the ARM machine probably explains the low GC count. I assume the container running the test can be restricted to 32 GB for a better comparison?
Yeah we are aware of that issue, hence Kunal is testing with |
Yes, that will give us a better apples-to-apples comparison. |
@sebastienros suggested passing arguments to limit CPU/memory, so I used |
You may want to run some basic GC and threadpool perf tests to isolate the problem; there may be more than one. For example, what is the allocation rate (B/sec) for a test like this:
|
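The snippet itself was not preserved in this copy of the thread; as a stand-in, here is a minimal sketch of that kind of allocation-rate test, using GC.GetAllocatedBytesForCurrentThread to report bytes allocated per second:

```csharp
using System;
using System.Diagnostics;

class AllocRate
{
    static object sink; // keeps the JIT from optimizing the allocations away

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        long before = GC.GetAllocatedBytesForCurrentThread();

        for (int i = 0; i < 100_000_000; i++)
            sink = new byte[32]; // small, short-lived objects, as in the web workload

        long allocated = GC.GetAllocatedBytesForCurrentThread() - before;
        Console.WriteLine($"{allocated / sw.Elapsed.TotalSeconds / 1e9:F2} GB/sec allocated");
    }
}
```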
If I am reading the traces correctly, it seems that GC pause time on arm64 is 5x that of x64. Both scenarios were run using flags x64: arm64: (in addition to the above flags, also added @Maoni0 - can you suggest any further steps to track this down? |
Does it also collect more? What are the survived bytes for these GCs? You could also share the traces. |
I will email them to you. |
cc @AntonLapounov @janvorli in case they have seen this before. Could you please share the traces with us? |
Done over the email. |
Update: On @Maoni0's suggestion, we came up with a concentrated repro without needing the threadpool/Thread.Sleep. With it we can see the slowness in allocation rate and the regression in max pause time.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApp15
{
    public class GC_Test
    {
        static int depth = 25;

        public class Node
        {
            public Node left;
            public Node right;
        }

        // Builds a full binary tree of the given depth (~2^depth nodes),
        // creating a large live object graph for the GC to trace.
        static void Populate(int iDepth, Node thisNode)
        {
            if (iDepth <= 0)
                return;

            iDepth--;
            thisNode.left = new Node();
            thisNode.right = new Node();
            Populate(iDepth, thisNode.left);
            Populate(iDepth, thisNode.right);
        }

        public static void Main(string[] args)
        {
            Console.WriteLine("Started");
            var stopwatch = Stopwatch.StartNew();

            // Allocation phase: build the tree.
            Node root = new Node();
            Populate(depth, root);

            // Collection phase: force 100 full GCs over the live graph.
            for (int i = 0; i < 100; i++)
            {
                GC.Collect();
            }

            Console.WriteLine($"Finished in {stopwatch.ElapsedMilliseconds}ms");
        }
    }
}
```
Below, the screenshots show GCStats and the call stacks for both x64 and arm64. Some observations:
x64: arm64: |
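To separate the two reported effects (allocation-rate slowness vs. the max-pause regression), one option is to time the Populate phase and the GC.Collect loop independently. This variant is not from the thread; it is a sketch reusing the repro's Node/Populate definitions and its existing usings:

```csharp
// Hypothetical variant of the repro's Main: report allocation time and
// per-collection pause times separately.
public static void Main(string[] args)
{
    var sw = Stopwatch.StartNew();
    Node root = new Node();
    Populate(depth, root);
    Console.WriteLine($"Populate (allocation): {sw.ElapsedMilliseconds}ms");

    var pauses = new List<double>();
    for (int i = 0; i < 100; i++)
    {
        sw.Restart();
        GC.Collect();
        pauses.Add(sw.Elapsed.TotalMilliseconds);
    }
    Console.WriteLine($"GC.Collect: max {pauses.Max():F1}ms, avg {pauses.Average():F1}ms");
}
```

If arm64 regresses mainly in the first number, the problem is on the allocation path; if mainly in the second, it is in the collector itself.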
One of the factors that influences the GC allocation rate is memory bandwidth. It may be a good idea to measure the memory bandwidth on the two machines: write a simple micro-benchmark that clears a 100MB buffer (no GC involved).
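A minimal sketch of such a benchmark, assuming Array.Clear's write loop is an acceptable proxy for raw memory-write bandwidth:

```csharp
using System;
using System.Diagnostics;

class Bandwidth
{
    static void Main()
    {
        // Allocate once up front so the timed loop does no GC work at all.
        var buffer = new byte[100 * 1024 * 1024];
        const int iterations = 100;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            Array.Clear(buffer, 0, buffer.Length); // pure sequential memory writes
        sw.Stop();

        double gbPerSec = (double)buffer.Length * iterations / sw.Elapsed.TotalSeconds / 1e9;
        Console.WriteLine($"{gbPerSec:F2} GB/sec write bandwidth");
    }
}
```

Comparing this number between the x64 and arm64 machines would show whether raw memory bandwidth can account for the allocation-rate gap.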
I do not think that this explanation makes sense. The benchmark you shared allocates the same amount on both arm64 and x64. I think the different number of GCs between arm64 and x64 is caused by the GC setting the allocation budgets differently. The allocation budgets are computed from the cache size, so the difference is most likely caused by different (or incorrectly computed) cache sizes between arm64 and x64. There was a discussion about the buggy cache size computation earlier in this thread. Are you using any env. variables to compensate for it? |
Yes. I am using the environment variable on arm64 to set the cache size. I am also setting the memory and CPU to be the same for x64/arm64.
|
@kunalspathak How is it possible that |
Honestly I am not sure how those numbers are collected and reported. Currently I am working with @Maoni0 to gather traces for https://github.com/dotnet/performance/tree/main/src/benchmarks/gc/src/exec/GCPerfSim. |
@AntonLapounov that's just the generation size after a GC; if there's no pinning, it could easily be 24 bytes (which is just the minimum object size) |
Closing as completed |
The PlaintextMVC benchmark should benefit a lot from PGO (namely, from guarded devirtualization and inlining), and it does benefit from it on all x64 platforms (Linux, Windows, Intel, AMD, etc.) - up to +40% more RPS. Unfortunately, that's not the case for arm64, where there is no difference between DynamicPGO and Default. Moreover, the benchmark is 7-8x slower on arm64 in comparison with x64-dynamicpgo (while I'd expect it to be only 1.5-2x slower).
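(For context: guarded devirtualization rewrites a profiled virtual call into a type check plus a direct call. The sketch below is schematic, with invented names, showing the source-level equivalent of what the JIT emits.)

```csharp
// Schematic illustration only; names are made up for this example.
interface IHandler { void Handle(); }
sealed class FastHandler : IHandler { public void Handle() { /* hot path */ } }

static class GuardedDevirtExample
{
    public static void Dispatch(IHandler h)
    {
        // Dynamic PGO observed that this call site almost always sees
        // FastHandler, so the JIT guards on that type:
        if (h is FastHandler f)
            f.Handle();   // direct call, now eligible for inlining
        else
            h.Handle();   // cold fallback: ordinary virtual dispatch
    }
}
```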
It looks to me that on arm64 it's bound by JIT_New:
while on x64 it looks like this:
Namely, this call-site ("drilled") (https://github.com/dotnet/aspnetcore/blob/a0b950fc51c43289ab3c6dbea15926d32f3556cc/src/Mvc/Mvc.Core/src/Routing/ControllerRequestDelegateFactory.cs#L68-L101):
Arm64:
same call-site on x64:
Flamegraph for arm64 (two JIT_New frames are highlighted) for the first thread:
x64:
Does it ring a bell for anyone (e.g. JIT_NewS_MP_FastPortable is not used, some GC feature is not implemented for arm64, some allocation tracker/profiler is enabled, etc.)?
/cc @dotnet/jit-contrib @Maoni0 @jkotas @davidwrighton
I can re-run the benchmark with any modifications in jit/vm/gc you suggest.
Steps to get the native traces:
Arm64-Linux:
x64-Linux:
Powerbi link: https://aka.ms/aspnet/benchmarks (open "PGO" page there)