PlaintextMVC benchmark is slow on arm64 #60166
Do we know that it is actually slower, as opposed to something else causing that lambda to be invoked many more times? |
AFAIR the Docker image defined in TE's repo for Actix is not working for ARM64. Will try again. |
It's not necessary to run Actix specifically; anything fast and other than .NET, run on both x64 and arm64, would be enough to see the difference |
Could you please take a look at the GC info tab in PerfView, which shows the GC stats, and share it here for x64 vs. arm64? |
@jkotas while I'm re-opening the traces (it takes quite some time on my PC) here are the perf counters ( |
NB: these are markdown-formatted and also correctly spaced, so you can either paste them as-is or wrap them in triple backticks. Stating the obvious: also, can you use |
I think the GC is clearly part of the problem. Gen0 GCs are firing 4x more frequently per second while throughput is 8x lower, which means the GC rate per unit of work is effectively ~30x higher (4 × 8 = 32). @dotnet/gc should take a look. |
Notice that the counters show Gen0 size of |
It would be good to figure out the computed Gen0 budget for this arm64 machine. |
Plaintext is not allocating and never triggers a GC, so you don't get any stats. |
It explains why it's so fast then 🙂 |
Let me know if you want me to check it (I'll need some pointers on what to look at); I can modify the GC and send it there via crank |
Is it possible to capture a GCCollectOnly trace? It's described here. |
@EgorBo to get the trace requested by @Maoni0 you will need to switch to a managed trace, since it's running on Linux. Use Also, you might need to reduce the length of the benchmark with |
I sent these traces via zip to CLR GC Core just in case. |
With the data from @EgorBo and @sebastienros I got to the culprit. On arm64 we are reading the cache size this way
and on this particular arm64 machine there's no entry for the L3 cache (it only has index0/1, which are the data/instruction L1 cache sizes, and index2, which is the L2 cache). Since we take the largest, which is the 256k L2 cache size, and return 3x that, we get 768k, and the gen0 min budget is calculated as 5/8 of this, which is 480k, which is of course tiny. If folks know of a way to get the L3 cache size programmatically on Linux in this case, I'm all ears. |
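For reference, the sysfs layout described above can be inspected directly. Below is a small C# sketch (illustration only; the runtime's actual lookup lives in the GC's native code) that enumerates the Linux cache entries and reproduces the budget arithmetic from the comment, i.e. 3x the largest reported cache and 5/8 of that for the gen0 min budget:

```csharp
using System;
using System.IO;

class CacheProbe
{
    static void Main()
    {
        long maxKB = 0;
        // Each indexN directory describes one cache; on the machine above
        // only index0/1 (L1 data/instruction) and index2 (L2) are present.
        foreach (var dir in Directory.EnumerateDirectories(
            "/sys/devices/system/cpu/cpu0/cache", "index*"))
        {
            string level = File.ReadAllText(Path.Combine(dir, "level")).Trim();
            string size = File.ReadAllText(Path.Combine(dir, "size")).Trim(); // e.g. "256K"
            Console.WriteLine($"{Path.GetFileName(dir)}: L{level} = {size}");
            if (size.EndsWith("K"))
                maxKB = Math.Max(maxKB, long.Parse(size.TrimEnd('K')));
        }
        // With no L3 entry the largest size found is the L2, so:
        Console.WriteLine($"largest = {maxKB}K, 3x = {maxKB * 3}K, gen0 min budget = {maxKB * 3 * 5 / 8}K");
    }
}
```

On the machine in question this would print 256K, 768K, and 480K, matching the numbers above.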
@EgorBo based on how many benchmarks you must have run, I believe I should ask for some extra fans to be installed in the machine |
I did a run today for the Json MVC benchmark. For arm64, I collected using The max working set, GC heap size, and size of memory committed by the GC are almost 2.5x higher on arm64. At the same time, Gen2 and LOH sizes are almost 3x smaller on arm64. If I'm reading the gc-collect trace correctly, we did just 2 rounds of GC on arm64 vs. 17 rounds on x64. Here is the x64 GC info: I also noticed that the thread pool queue length and items/second are 2x smaller on arm64. @mangod9 or @Maoni0 - any idea why, and what should be the next steps to investigate? |
Are there differences in available memory on the two machines? We will need to compare profiles across the two to investigate further. |
@sebastienros - could you please confirm?
x64 has 28 cores vs. 32 cores on arm64. Given that, what can we conclude about the thread pool queue length and items/second differences? |
ARM:
INTEL:
Also a reminder that this ARM machine doesn't report L3 the way dotnet is reading it. |
The higher memory on the ARM machine probably explains the low GC count. I assume the container running the test can be restricted to 32 GB for a better comparison?
Yeah we are aware of that issue, hence Kunal is testing with |
Yes, that will give us a better apples-to-apples comparison. |
@sebastienros suggested passing arguments to limit CPU/memory, so I used |
You may want to run some basic GC and threadpool perf tests to isolate the problem; there may be more than one. For example, what is the allocation rate (B/sec) for a test like this:
|
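The snippet itself was not preserved in this copy of the thread; as a stand-in, here is a minimal sketch of that kind of allocation-rate test, using GC.GetAllocatedBytesForCurrentThread to report bytes allocated per second:

```csharp
using System;
using System.Diagnostics;

class AllocRate
{
    static object sink; // keeps the JIT from optimizing the allocations away

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        long before = GC.GetAllocatedBytesForCurrentThread();

        for (int i = 0; i < 100_000_000; i++)
            sink = new byte[32]; // small, short-lived objects, as in the web workload

        long allocated = GC.GetAllocatedBytesForCurrentThread() - before;
        Console.WriteLine($"{allocated / sw.Elapsed.TotalSeconds / 1e9:F2} GB/sec allocated");
    }
}
```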
If I am reading the traces correctly, it seems that GC pause time on arm64 is 5x that of x64. Both scenarios were run using flags x64: arm64: (in addition to the above flags, also added @Maoni0 - can you suggest any further steps to track this down? |
Does it also collect more? What are the survived bytes for these GCs? You could also share the traces. |
I will email them to you. |
cc @AntonLapounov @janvorli in case they have seen this before. Could you please share the traces with us? |
Done over the email. |
Update: On @Maoni0's suggestion, we came up with a concentrated repro without needing the threadpool/Thread.Sleep. With it we can see the slowness in allocation rate and the regression in max pause time.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApp15
{
    public class GC_Test
    {
        static int depth = 25;

        public class Node
        {
            public Node left;
            public Node right;
        }

        // Builds a full binary tree of the given depth (~2^depth nodes),
        // creating a large live object graph for the GC to trace.
        static void Populate(int iDepth, Node thisNode)
        {
            if (iDepth <= 0)
                return;

            iDepth--;
            thisNode.left = new Node();
            thisNode.right = new Node();
            Populate(iDepth, thisNode.left);
            Populate(iDepth, thisNode.right);
        }

        public static void Main(string[] args)
        {
            Console.WriteLine("Started");
            var stopwatch = Stopwatch.StartNew();

            // Allocation phase: build the tree.
            Node root = new Node();
            Populate(depth, root);

            // Collection phase: force 100 full GCs over the live graph.
            for (int i = 0; i < 100; i++)
            {
                GC.Collect();
            }

            Console.WriteLine($"Finished in {stopwatch.ElapsedMilliseconds}ms");
        }
    }
}
```
Below, the screenshots show GCStats and the call stacks for both x64 and arm64. Some observations:
x64: arm64: |
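To separate the two reported effects (allocation-rate slowness vs. the max-pause regression), one option is to time the Populate phase and the GC.Collect loop independently. This variant is not from the thread; it is a sketch reusing the repro's Node/Populate definitions and its existing usings:

```csharp
// Hypothetical variant of the repro's Main: report allocation time and
// per-collection pause times separately.
public static void Main(string[] args)
{
    var sw = Stopwatch.StartNew();
    Node root = new Node();
    Populate(depth, root);
    Console.WriteLine($"Populate (allocation): {sw.ElapsedMilliseconds}ms");

    var pauses = new List<double>();
    for (int i = 0; i < 100; i++)
    {
        sw.Restart();
        GC.Collect();
        pauses.Add(sw.Elapsed.TotalMilliseconds);
    }
    Console.WriteLine($"GC.Collect: max {pauses.Max():F1}ms, avg {pauses.Average():F1}ms");
}
```

If arm64 regresses mainly in the first number, the problem is on the allocation path; if mainly in the second, it is in the collector itself.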
One of the factors that influences the GC allocation rate is memory bandwidth. It may be a good idea to measure the memory bandwidth on the two machines: write a simple micro-benchmark that clears a 100MB buffer (no GC involved).
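A minimal sketch of such a benchmark, assuming Array.Clear's write loop is an acceptable proxy for raw memory-write bandwidth:

```csharp
using System;
using System.Diagnostics;

class Bandwidth
{
    static void Main()
    {
        // Allocate once up front so the timed loop does no GC work at all.
        var buffer = new byte[100 * 1024 * 1024];
        const int iterations = 100;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            Array.Clear(buffer, 0, buffer.Length); // pure sequential memory writes
        sw.Stop();

        double gbPerSec = (double)buffer.Length * iterations / sw.Elapsed.TotalSeconds / 1e9;
        Console.WriteLine($"{gbPerSec:F2} GB/sec write bandwidth");
    }
}
```

Comparing this number between the x64 and arm64 machines would show whether raw memory bandwidth can account for the allocation-rate gap.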
I do not think that this explanation makes sense. The benchmark you shared allocates the same amount on both arm64 and x64. I think the different number of GCs between arm64 and x64 is caused by the GC setting the allocation budgets differently. The allocation budgets are computed from the cache size, so the difference is most likely caused by different (or incorrectly computed) cache sizes between arm64 and x64. There was a discussion about the buggy cache size computation earlier in this thread. Are you using any env. variables to compensate for it? |
Yes. I am using the environment variable on arm64 to set the cache size. I am also setting the memory and CPU to be the same for x64/arm64.
|
@kunalspathak How is it possible that |
Honestly I am not sure how those numbers are collected and reported. Currently I am working with @Maoni0 to gather traces for https://github.com/dotnet/performance/tree/main/src/benchmarks/gc/src/exec/GCPerfSim. |
@AntonLapounov that's just the generation size after a GC; if there's no pinning, it could easily be 24 bytes (which is just the minimum object size) |
Closing as completed |
The PlaintextMVC benchmark should benefit a lot from PGO (namely, from guarded devirtualization and inlining), and it does benefit from it on all x64 platforms (Linux, Windows, Intel, AMD, etc.) - up to +40% more RPS. Unfortunately, that's not the case for arm64, where there is no difference between DynamicPGO and Default. Moreover, the benchmark is 7-8x slower on arm64 in comparison with x64-dynamicpgo (while I'd expect it to be only 1.5-2x slower).
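(For context: guarded devirtualization rewrites a profiled virtual call into a type check plus a direct call. The sketch below is schematic, with invented names, showing the source-level equivalent of what the JIT emits.)

```csharp
// Schematic illustration only; names are made up for this example.
interface IHandler { void Handle(); }
sealed class FastHandler : IHandler { public void Handle() { /* hot path */ } }

static class GuardedDevirtExample
{
    public static void Dispatch(IHandler h)
    {
        // Dynamic PGO observed that this call site almost always sees
        // FastHandler, so the JIT guards on that type:
        if (h is FastHandler f)
            f.Handle();   // direct call, now eligible for inlining
        else
            h.Handle();   // cold fallback: ordinary virtual dispatch
    }
}
```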
It looks to me that on arm64 it's bound by JIT_New:
while on x64 it looks like this:
Namely, this call-site ("drilled") (https://github.com/dotnet/aspnetcore/blob/a0b950fc51c43289ab3c6dbea15926d32f3556cc/src/Mvc/Mvc.Core/src/Routing/ControllerRequestDelegateFactory.cs#L68-L101):
Arm64:
same call-site on x64:
Flamegraph for arm64 (two JIT_New frames are highlighted) for the first thread:
x64:
Does it ring a bell for anyone (e.g. JIT_NewS_MP_FastPortable is not used, some GC feature is not implemented for arm64, some allocation tracker/profiler is enabled, etc.)?
/cc @dotnet/jit-contrib @Maoni0 @jkotas @davidwrighton
I can re-run the benchmark with any modifications in jit/vm/gc you suggest.
Steps to get the native traces:
Arm64-Linux:
x64-Linux:
Powerbi link: https://aka.ms/aspnet/benchmarks (open "PGO" page there)