
JIT: consider optimizing small fixed-sized stackalloc #7113

Closed
AndyAyersMS opened this issue Dec 8, 2016 · 15 comments · Fixed by dotnet/coreclr#14623
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) · enhancement (Product code improvement that does NOT require public API changes/additions) · optimization · tenet-performance (Performance related issue)
Milestone
2.1.0

Comments

@AndyAyersMS
Member

In some cases small fixed-sized stackallocs could be turned into regular local vars, which (if this transformation could be anticipated) would unblock inlining. See related discussion in #7109.
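
For illustration, here's a minimal sketch (the method and its names are invented, not taken from any real library) of the kind of method this transformation would help:

class StackallocSketch
{
    // A small, constant-sized stackalloc like this could in principle be treated by
    // the JIT as an ordinary local buffer; removing the localloc would also remove
    // the current blocker to inlining this method.
    unsafe static int Checksum3(byte a, byte b, byte c)
    {
        byte* buffer = stackalloc byte[3]; // size is a JIT-time constant, well under 32 bytes
        buffer[0] = a;
        buffer[1] = b;
        buffer[2] = c;
        return buffer[0] ^ buffer[1] ^ buffer[2];
    }
}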

@benaadams feel free to add pointers to examples you've found.

@AndyAyersMS
Member Author

I did a quick prototype of this, converting any fixed-sized stackalloc of 32 bytes or less, and the only hits in our normal set of code-size assemblies (which includes corefx and roslyn) were in some interop stubs in the core library. The diffs look good, and the JIT changes are small and likely low risk, but I'd like a bit more motivation before pushing further.

@benaadams if there's an easy way to drop the latest JIT into an ASP.NET run, I can try it there and see if it hits more often or in more interesting places. Last time we looked at doing this it was nontrivial, though. I could also backport to 1.1 pretty easily, I think, if that helps.

@benaadams
Member

For motivation: I imagine that with Span working as an "array" over stackalloc'd memory, the pattern will become more widespread. Pipelines, Formatting and Encoding use it in corefxlab; however, some of those stackallocs are variable-sized and others are large.

FormatInt64 is a pretty widely used function in ASP.NET and stackallocs a constant 20, but it doesn't look very inlinable anyway 😞
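
As a rough sketch of the shape being described (this is not the actual Kestrel/ASP.NET code, just an invented formatter with the same constant-20 stackalloc):

using System.Text;

static class Int64FormatterSketch
{
    // Invented sketch: formats an Int64 as ASCII digits into a constant 20-byte
    // stack buffer ("-9223372036854775808" is 20 characters).
    public static unsafe string FormatInt64Sketch(long value)
    {
        const int MaxChars = 20;
        byte* buffer = stackalloc byte[MaxChars];
        int pos = MaxChars;
        bool negative = value < 0;
        ulong remaining = negative ? unchecked((ulong)(-value)) : (ulong)value;
        do
        {
            buffer[--pos] = (byte)('0' + (int)(remaining % 10));
            remaining /= 10;
        } while (remaining != 0);
        if (negative)
        {
            buffer[--pos] = (byte)'-';
        }
        return Encoding.ASCII.GetString(buffer + pos, MaxChars - pos);
    }
}

A fixed 20-byte buffer like the one in this sketch would fit under the 32-byte threshold from the prototype above.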

GenerateConnectionId (and the identical GenerateRequestId and GenerateGuidString) stackallocs 13, but again doesn't look very inlinable; and I think they are lazily initialized.

To drop the latest JIT into ASP.NET, I compile a project as

{
  "version": "1.1.0-*",
  "dependencies": {
    "Microsoft.AspNetCore.Server.Kestrel": "1.2.0-*"
  },
  "buildOptions": {
    "emitEntryPoint": true
  },
  "frameworks": {
    "netcoreapp1.1": {
      "dependencies": {
        "Microsoft.NETCore.App": {
          "version": "1.1.0-*"
        }
      }
    }
  },
  "publishOptions": {
    "include": [
      "hosting.json"
    ]
  },
  "runtimes": {
    "win7-x64": {}
  }
}

dotnet publish -c Release; then copy bin\Product\Windows_NT.x64.Release into the bin\Release\netcoreapp1.1\win7-x64\publish folder, plus clrjit from the Checked build.

At this point it fails with a System.IO.TextInfo "can't find method" error, so also copy the files from corefx\bin\AnyOS.AnyCPU.Release\System.IO into the publish folder.

Unfortunately that only covers start-up; I think there are some other missing methods, as it won't send a response back - but I'm looking into it...

@benaadams
Member

A simple app for plaintext testing:

using System.IO;
using System.Text;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Logging;

public class Startup
{
	private static readonly byte[] _helloWorldPayload = Encoding.UTF8.GetBytes("Hello, World!");

	public void Configure(IApplicationBuilder app, ILoggerFactory loggerFactory)
	{
		//loggerFactory.AddConsole(LogLevel.Information);

		app.Run((context) =>
		{
			var response = context.Response;
			response.StatusCode = 200;
			response.ContentType = "text/plain";
			response.Headers["Content-Length"] = "13";
			return response.Body.WriteAsync(_helloWorldPayload, 0, _helloWorldPayload.Length);
		});
	}

	public static void Main(string[] args)
	{
		var host = new WebHostBuilder()
			.UseKestrel()
			.UseUrls("http://localhost:5001/")
			.UseContentRoot(Directory.GetCurrentDirectory())
			.UseStartup<Startup>()
			.Build();

		host.Run();
	}
}

Though I don't think it will hit any stackallocs

@benaadams
Member

Also needed from corefx: Windows_NT.AnyCPU.Release/System.Private.Uri

@AndyAyersMS
Member Author

Haven't tried this on ASP.NET yet, but I did run the change through the desktop SPMI collection (a database of roughly 2M methods we use to validate aspects of desktop codegen). No hits for this outside of IL stubs and a handful of stackalloc-specific test cases.

However, the size savings from the stubs alone might make this worth doing...

@AndyAyersMS
Member Author

JIT source changes here, for future reference.

@Drawaes
Contributor

Drawaes commented Dec 22, 2016

Forget that; the stack allocations will be too big for 32 bytes.

@benaadams
Member

I think the stackallocs in SslStream are 48 bytes on x86 and 64 bytes on x64 (4 x sizeof(SecBuffer)) - though I don't think Encrypt and Decrypt are inlinable.
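
For context, here's a rough sketch of where those per-architecture sizes would come from; the field layout below is an assumption for illustration, not the actual interop declaration:

using System;
using System.Runtime.InteropServices;

// Assumed layout for illustration; the real SecBuffer is the SSPI interop struct.
[StructLayout(LayoutKind.Sequential)]
struct SecBufferSketch
{
    public int cbBuffer;    // 4 bytes
    public int BufferType;  // 4 bytes
    public IntPtr pvBuffer; // 4 bytes on x86, 8 bytes (8-aligned) on x64
}

// sizeof(SecBufferSketch) comes out to 12 on x86 and 16 on x64, so a stackalloc of
// four of them is 48 or 64 bytes - over the 32-byte threshold being prototyped here.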

@AndyAyersMS
Member Author

AndyAyersMS commented Oct 19, 2017

Revived here: master...AndyAyersMS:ReviveOptLocalloc

Some thoughts:

  • If the inline call site is in a loop, the inline could lead to unbounded growth of the caller's stack, and potential stack overflows (see the sketch after this list). There is currently no "unstackalloc" we can pend to the end of the inlined code.
  • It might be interesting to see if an "unstackalloc" could be created. Assuming we had such a thing, we'd have to watch for the possibility of locally caught exceptions, sort of like what we do for unpinning, to ensure that the unstackalloc could not be bypassed. More likely we would disallow inlines of stackalloc methods requiring unstackallocs inside try regions.
  • If the inline call site is conditional and somewhat rare, adding a stackalloc to a method via inlining could bloat the prolog/epilog of the method, potentially making it slower overall.
  • The inliner's stack model (used in the IL prescan) is quite crude, and while it can in some cases detect constant sized stackallocs, it doesn't know the value of the constants.
  • This argues for allowing methods with stackallocs as inline candidates without any up-front screening and then rejecting them later if they turn out to be over the size threshold or are called within loops; or maybe for screening out in-loop candidates in the prescan and then rejecting variable-sized and too-large candidates during importation. There is always some throughput (TP) risk in creating new candidates that might be rejected during importation, as the JIT does a fair amount of speculative work and then tosses it out. But see the next point.
  • The current optimization still only fires in IL stubs in corelib; these currently never get inlined. I may port these changes to desktop, where I have a wider variety of methods to experiment on, and see if that is still true. But it could be one of those chicken-and-egg things, where if small fixed-sized cases were efficient and would inline, they'd get used more often.
  • Allowing large fixed stackallocs to convert to local buffers may force the JIT to use large offsets to access other locals and so cause code size bloat and CQ issues. If we eventually want to enable this for larger allocation sizes, we will also need to look at reordering the stack to put the buffers farther from the frame or stack pointer, since buffer access tends to be buffer-base-relative, and local access frame-pointer (or stack-pointer) relative.
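
As referenced in the first bullet, here is a hypothetical illustration of the in-loop concern; it assumes the callee's localloc is not converted and so survives inlining as a localloc:

class LoopInlineSketch
{
    unsafe static int Use8()
    {
        byte* p = stackalloc byte[8];
        p[0] = 1;
        return p[0];
    }

    static int SumInLoop(int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
        {
            // Called normally, each Use8 frame releases its 8 bytes when it returns.
            // If Use8 were inlined here with the localloc kept as a localloc, the
            // allocation would repeat every iteration and the caller's frame would
            // grow with n, since there is no "unstackalloc" to pend after the inline body.
            sum += Use8();
        }
        return sum;
    }
}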

@benaadams
Member

There's also the new C# 7.2 safe Span stackalloc, which might make it more widespread.

e.g. in IPAddressExtensions

Span<byte> addressBytes = stackalloc byte[IPAddressParserStatics.IPv6AddressBytes];
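
For completeness, a self-contained sketch of that C# 7.2 shape (an invented method, not the corefx code); the allocation is small and fixed, so it is exactly the kind of site the prototype could convert:

using System;

class SpanStackallocSketch
{
    static int Checksum()
    {
        // Safe form: no unsafe context needed; the Span wraps the 16-byte stack buffer.
        Span<byte> buffer = stackalloc byte[16];
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = (byte)(i * 3);
        }
        int sum = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            sum += buffer[i];
        }
        return sum;
    }
}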

@AndyAyersMS
Member Author

Realized that there's often a cast between the fixed-size constant and the localloc. Fixing that, we see things firing in more cases:

Total bytes of diff: -3822 (-0.02 % of base)
    diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file improvements by size (bytes):
       -3456 : System.Private.CoreLib.dasm (-0.10 % of base)
        -102 : System.Private.DataContractSerialization.dasm (-0.01 % of base)
         -89 : System.Net.Primitives.dasm (-0.16 % of base)
         -83 : System.Private.Uri.dasm (-0.11 % of base)
         -31 : System.Console.dasm (-0.07 % of base)
8 total files with size differences (8 improved, 0 regressed), 122 unchanged.
Top method improvements by size (bytes):
       -1110 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(long,long):int:this (47 methods)
        -382 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT(ref):int:this (17 methods)
        -339 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT():ref:this (68 methods)
        -256 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(int,long,long):int:this (17 methods)
        -140 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoCOM(int,ref,long):int:this (5 methods)
53 total methods with size differences (53 improved, 0 regressed), 142974 unchanged.

(This is with the 32-byte threshold.)

Updated my fork to support inlining of these methods, e.g.

using System;

class L
{
    unsafe static int Use4()
    {
        byte* i = stackalloc byte[4];
        i[2] = 50;
        return i[2] * 2;
    }

    unsafe static int Use(int x)
    {
        byte* i = stackalloc byte[x];
        i[1] = 50;
        return i[1] * 2;
    }

    public static int Main()
    {
        int v0 = Use4();
        int v1 = Use(10);
        int v2 = Use(100);
        int v3 = Use(v0);
        int v4 = 0;
        int v5 = 0;
        int v6 = 0;

        for (int i = 0; i < 7; i++)
        {
            v5 += Use4();
            v5 += Use(4);
            v6 += Use(v0);
        }

        return v0 + v1 + v2 + v3 + v4 + v5 + v6 - 2400;
    }
}
Inlines into 06000003 L:Main():int
  [1 IL=0000 TR=000001 06000001] [below ALWAYS_INLINE size] L:Use4():int
  [2 IL=0008 TR=000008 06000002] [below ALWAYS_INLINE size] L:Use(int):int
  [0 IL=0016 TR=000016 06000002] [FAILED: localloc size too large] L:Use(int):int
  [0 IL=0023 TR=000024 06000002] [FAILED: localloc size unknown] L:Use(int):int
  [3 IL=0045 TR=000055 06000001] [below ALWAYS_INLINE size] L:Use4():int
  [4 IL=0056 TR=000064 06000002] [below ALWAYS_INLINE size] L:Use(int):int
  [0 IL=0067 TR=000074 06000002] [FAILED: localloc size unknown] L:Use(int):int

@AndyAyersMS
Member Author

Tried setting the threshold to 128 and 256. The latter gets a few more cases but we also start to see code size regressions because of larger stack offsets.

This optimization often allows RSP-based addressing for locals in methods that were using RBP-based addressing before. The potential downside is that if the stackalloc local ends up closer to RSP than the other locals, their offsets grow, and we only get 128 bytes of cheap (one-byte displacement) local offsets.

Accesses to the stackalloc local tend to be relative to its base so its position on the frame is less critical.

I'm going to see if there is any way to put the stackalloc local closer to the root of the frame; if that's possible it should avoid code size increases.
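
To make the offset concern concrete, here's a hypothetical example; the comments describe the expected x64 layout under the assumptions above, not measured output:

class FrameOffsetSketch
{
    unsafe static int BigFixedAlloc(byte seed)
    {
        // If this 256-byte stackalloc were converted to a local buffer placed nearest
        // RSP, other stack-resident locals would land at RSP offsets of 256 or more,
        // needing 4-byte displacements instead of the 1-byte ones available up to +127.
        byte* scratch = stackalloc byte[256];
        int a = seed * 3;
        int b = seed * 5;
        scratch[0] = (byte)a;
        scratch[1] = (byte)b;
        return scratch[0] + scratch[1] + a + b;
    }
}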

@AndyAyersMS
Member Author

AndyAyersMS commented Oct 20, 2017

Looks like over in lvaAssignVirtualFrameOffsetsToLocals we could indeed influence frame layout, as the locals introduced by stackalloc conversion are marked as "unsafe buffers" (since they're accessed via pointer math); however, there are lots of other considerations that feed into layout. Given the smallish number of fixed-size cases over 32 bytes, perhaps it's best to just stick with 32 for now.

AndyAyersMS referenced this issue in dotnet/coreclr Nov 1, 2017
Optimize fixed sized locallocs of 32 bytes or less to use local buffers,
if the localloc is not in a loop.

Also "optimize" the degenerate 0 byte case.

Allow inline candidates containing localloc, but fail inlining if any
of a candidate's locallocs do not convert to local buffers.

The 32 byte size threshold was arrived at empirically; larger values did
not enable many more cases and started seeing size bloat because of
larger stack offsets.

We can revise this threshold if we are willing to reorder locals and see
fixed sized cases larger than 32 bytes.

Closes #8542.

Also add a missing handler for the case where the callsite is in a try
region; this was an oversight.
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 27, 2020