
JIT: consider optimizing small fixed-sized stackalloc #7113

Closed
AndyAyersMS opened this issue Dec 8, 2016 · 15 comments · Fixed by dotnet/coreclr#14623
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) · enhancement (Product code improvement that does NOT require public API changes/additions) · optimization · tenet-performance (Performance related issue)
Milestone
2.1.0

Comments

@AndyAyersMS
Member

In some cases small fixed-sized stackallocs could be turned into regular local vars, which (if this transformation could be anticipated) would unblock inlining. See related discussion in #7109.
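
For illustration, here's a minimal sketch (the method and its names are invented, not taken from any real library) of the kind of method this transformation would help:

class StackallocSketch
{
    // A small, constant-sized stackalloc like this could in principle be treated by
    // the JIT as an ordinary local buffer; removing the localloc would also remove
    // the current blocker to inlining this method.
    unsafe static int Checksum3(byte a, byte b, byte c)
    {
        byte* buffer = stackalloc byte[3]; // size is a JIT-time constant, well under 32 bytes
        buffer[0] = a;
        buffer[1] = b;
        buffer[2] = c;
        return buffer[0] ^ buffer[1] ^ buffer[2];
    }
}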

@benaadams feel free to add pointers to examples you've found.

@AndyAyersMS
Member Author

I did a quick prototype of this, converting any fixed-sized stackalloc of 32 bytes or less, and the only hits in our normal set of code-size assemblies (which includes corefx and roslyn) were in some interop stubs in the core library. The diffs look good, and the JIT changes are small and likely low risk, but I'd like a bit more motivation before pushing further.

@benaadams if there's an easy way to drop the latest JIT into an ASP.NET run, I can try it there and see if it hits more often or in more interesting places. Last time we looked at doing this it was nontrivial, though. I could also backport to 1.1 pretty easily, I think, if that helps.

@benaadams
Member

For motivation: I imagine that with Span working as an "array" over stackalloc'd memory, the pattern will become more widespread. Pipelines, Formatting and Encoding use it in corefxlab; however, some of those stackallocs are variable-sized and others are large.

FormatInt64 is a pretty widely used function in ASP.NET and stackallocs a constant 20, but it doesn't look very inlinable anyway 😞
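
As a rough sketch of the shape being described (this is not the actual Kestrel/ASP.NET code, just an invented formatter with the same constant-20 stackalloc):

using System.Text;

static class Int64FormatterSketch
{
    // Invented sketch: formats an Int64 as ASCII digits into a constant 20-byte
    // stack buffer ("-9223372036854775808" is 20 characters).
    public static unsafe string FormatInt64Sketch(long value)
    {
        const int MaxChars = 20;
        byte* buffer = stackalloc byte[MaxChars];
        int pos = MaxChars;
        bool negative = value < 0;
        ulong remaining = negative ? unchecked((ulong)(-value)) : (ulong)value;
        do
        {
            buffer[--pos] = (byte)('0' + (int)(remaining % 10));
            remaining /= 10;
        } while (remaining != 0);
        if (negative)
        {
            buffer[--pos] = (byte)'-';
        }
        return Encoding.ASCII.GetString(buffer + pos, MaxChars - pos);
    }
}

A fixed 20-byte buffer like the one in this sketch would fit under the 32-byte threshold from the prototype above.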

GenerateConnectionId (and the identical GenerateRequestId and GenerateGuidString) stackallocs 13, but again doesn't look very inlinable; and I think they are lazily initialized.

To drop the latest JIT into ASP.NET, I compile a project as

{
  "version": "1.1.0-*",
  "dependencies": {
    "Microsoft.AspNetCore.Server.Kestrel": "1.2.0-*"
  },
  "buildOptions": {
    "emitEntryPoint": true
  },
  "frameworks": {
    "netcoreapp1.1": {
      "dependencies": {
        "Microsoft.NETCore.App": {
          "version": "1.1.0-*"
        }
      }
    }
  },
  "publishOptions": {
    "include": [
      "hosting.json"
    ]
  },
  "runtimes": {
    "win7-x64": {}
  }
}

dotnet publish -c Release; then copy bin\Product\Windows_NT.x64.Release into the bin\Release\netcoreapp1.1\win7-x64\publish folder, plus clrjit from the Checked build.

At this point it fails with a System.IO.TextInfo "can't find method" error, so also copy the files from corefx\bin\AnyOS.AnyCPU.Release\System.IO into the publish folder.

Unfortunately that only covers start-up; I think there are some other missing methods, as it won't send a response back - but I'm looking into it...

@benaadams
Member

A simple app for plaintext testing:

using System.IO;
using System.Text;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Logging;

public class Startup
{
	private static readonly byte[] _helloWorldPayload = Encoding.UTF8.GetBytes("Hello, World!");

	public void Configure(IApplicationBuilder app, ILoggerFactory loggerFactory)
	{
		//loggerFactory.AddConsole(LogLevel.Information);

		app.Run((context) =>
		{
			var response = context.Response;
			response.StatusCode = 200;
			response.ContentType = "text/plain";
			response.Headers["Content-Length"] = "13";
			return response.Body.WriteAsync(_helloWorldPayload, 0, _helloWorldPayload.Length);
		});
	}

	public static void Main(string[] args)
	{
		var host = new WebHostBuilder()
			.UseKestrel()
			.UseUrls("http://localhost:5001/")
			.UseContentRoot(Directory.GetCurrentDirectory())
			.UseStartup<Startup>()
			.Build();

		host.Run();
	}
}

Though I don't think it will hit any stackallocs

@benaadams
Member

Also needed from corefx: Windows_NT.AnyCPU.Release/System.Private.Uri

@AndyAyersMS
Member Author

Haven't tried this on ASP.NET yet, but I did run the change through the desktop SPMI collection (a database of roughly 2M methods we use to validate aspects of desktop codegen). No hits for this outside of IL stubs and a handful of stackalloc-specific test cases.

However, the size savings from the stubs alone might make this worth doing...

@AndyAyersMS
Member Author

JIT source changes here, for future reference.

@Drawaes
Contributor

Drawaes commented Dec 22, 2016

Forget that; the stack allocations will be too big for 32 bytes.

@benaadams
Member

I think the stackallocs in SslStream are 48 bytes on x86 and 64 bytes on x64 (4 x sizeof(SecBuffer)) - though I don't think Encrypt and Decrypt are inlinable.
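
For context, here's a rough sketch of where those per-architecture sizes would come from; the field layout below is an assumption for illustration, not the actual interop declaration:

using System;
using System.Runtime.InteropServices;

// Assumed layout for illustration; the real SecBuffer is the SSPI interop struct.
[StructLayout(LayoutKind.Sequential)]
struct SecBufferSketch
{
    public int cbBuffer;    // 4 bytes
    public int BufferType;  // 4 bytes
    public IntPtr pvBuffer; // 4 bytes on x86, 8 bytes (8-aligned) on x64
}

// sizeof(SecBufferSketch) comes out to 12 on x86 and 16 on x64, so a stackalloc of
// four of them is 48 or 64 bytes - over the 32-byte threshold being prototyped here.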

@AndyAyersMS
Member Author

AndyAyersMS commented Oct 19, 2017

Revived here: master...AndyAyersMS:ReviveOptLocalloc

Some thoughts:

  • If the inline call site is in a loop, the inline could lead to unbounded growth of the caller's stack, and potential stack overflows (see the sketch after this list). There is currently no "unstackalloc" we can pend to the end of the inlined code.
  • It might be interesting to see if an "unstackalloc" could be created. Assuming we had such a thing, we'd have to watch for the possibility of locally caught exceptions, sort of like what we do for unpinning, to ensure that the unstackalloc could not be bypassed. More likely we would disallow inlines of stackalloc methods requiring unstackallocs inside try regions.
  • If the inline call site is conditional and somewhat rare, adding a stackalloc to a method via inlining could bloat the prolog/epilog of the method, potentially making it slower overall.
  • The inliner's stack model (used in the IL prescan) is quite crude, and while it can in some cases detect constant sized stackallocs, it doesn't know the value of the constants.
  • This argues for allowing methods with stackallocs as inline candidates without any up-front screening and then rejecting them later if they turn out to be over the size threshold or are called within loops; or maybe for screening out in-loop candidates in the prescan and then rejecting variable-sized and too-large candidates during importation. There is always some throughput (TP) risk in creating new candidates that might be rejected during importation, as the JIT does a fair amount of speculative work and then tosses it out. But see the next point.
  • The current optimization still only fires in IL stubs in corelib; these currently never get inlined. I may port these changes to desktop, where I have a wider variety of methods to experiment on, and see if that is still true. But it could be one of those chicken-and-egg things, where if small fixed-sized cases were efficient and would inline, they'd get used more often.
  • Allowing large fixed stackallocs to convert to local buffers may force the JIT to use large offsets to access other locals and so cause code size bloat and CQ issues. If we eventually want to enable this for larger allocation sizes, we will also need to look at reordering the stack to put the buffers farther from the frame or stack pointer, since buffer access tends to be buffer-base-relative, and local access frame-pointer (or stack-pointer) relative.
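
As referenced in the first bullet, here is a hypothetical illustration of the in-loop concern; it assumes the callee's localloc is not converted and so survives inlining as a localloc:

class LoopInlineSketch
{
    unsafe static int Use8()
    {
        byte* p = stackalloc byte[8];
        p[0] = 1;
        return p[0];
    }

    static int SumInLoop(int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
        {
            // Called normally, each Use8 frame releases its 8 bytes when it returns.
            // If Use8 were inlined here with the localloc kept as a localloc, the
            // allocation would repeat every iteration and the caller's frame would
            // grow with n, since there is no "unstackalloc" to pend after the inline body.
            sum += Use8();
        }
        return sum;
    }
}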

@benaadams
Member

There's also the new C# 7.2 safe Span stackalloc, which might make it more widespread.

e.g. in IPAddressExtensions

Span<byte> addressBytes = stackalloc byte[IPAddressParserStatics.IPv6AddressBytes];
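
For completeness, a self-contained sketch of that C# 7.2 shape (an invented method, not the corefx code); the allocation is small and fixed, so it is exactly the kind of site the prototype could convert:

using System;

class SpanStackallocSketch
{
    static int Checksum()
    {
        // Safe form: no unsafe context needed; the Span wraps the 16-byte stack buffer.
        Span<byte> buffer = stackalloc byte[16];
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = (byte)(i * 3);
        }
        int sum = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            sum += buffer[i];
        }
        return sum;
    }
}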

@AndyAyersMS
Member Author

Realized that there's often a cast between the fixed-size constant and the localloc. Fixing that, we see things firing in more cases:

Total bytes of diff: -3822 (-0.02 % of base)
    diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file improvements by size (bytes):
       -3456 : System.Private.CoreLib.dasm (-0.10 % of base)
        -102 : System.Private.DataContractSerialization.dasm (-0.01 % of base)
         -89 : System.Net.Primitives.dasm (-0.16 % of base)
         -83 : System.Private.Uri.dasm (-0.11 % of base)
         -31 : System.Console.dasm (-0.07 % of base)
8 total files with size differences (8 improved, 0 regressed), 122 unchanged.
Top method improvements by size (bytes):
       -1110 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(long,long):int:this (47 methods)
        -382 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT(ref):int:this (17 methods)
        -339 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT():ref:this (68 methods)
        -256 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(int,long,long):int:this (17 methods)
        -140 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoCOM(int,ref,long):int:this (5 methods)
53 total methods with size differences (53 improved, 0 regressed), 142974 unchanged.

(This is with the 32-byte threshold.)

Updated my fork to support inlining of these methods, e.g.

using System;

class L
{
    unsafe static int Use4()
    {
        byte* i = stackalloc byte[4];
        i[2] = 50;
        return i[2] * 2;
    }

    unsafe static int Use(int x)
    {
        byte* i = stackalloc byte[x];
        i[1] = 50;
        return i[1] * 2;
    }

    public static int Main()
    {
        int v0 = Use4();
        int v1 = Use(10);
        int v2 = Use(100);
        int v3 = Use(v0);
        int v4 = 0;
        int v5 = 0;
        int v6 = 0;

        for (int i = 0; i < 7; i++)
        {
            v5 += Use4();
            v5 += Use(4);
            v6 += Use(v0);
        }

        return v0 + v1 + v2 + v3 + v4 + v5 + v6 - 2400;
    }
}
Inlines into 06000003 L:Main():int
  [1 IL=0000 TR=000001 06000001] [below ALWAYS_INLINE size] L:Use4():int
  [2 IL=0008 TR=000008 06000002] [below ALWAYS_INLINE size] L:Use(int):int
  [0 IL=0016 TR=000016 06000002] [FAILED: localloc size too large] L:Use(int):int
  [0 IL=0023 TR=000024 06000002] [FAILED: localloc size unknown] L:Use(int):int
  [3 IL=0045 TR=000055 06000001] [below ALWAYS_INLINE size] L:Use4():int
  [4 IL=0056 TR=000064 06000002] [below ALWAYS_INLINE size] L:Use(int):int
  [0 IL=0067 TR=000074 06000002] [FAILED: localloc size unknown] L:Use(int):int

@AndyAyersMS
Member Author

Tried setting the threshold to 128 and 256. The latter gets a few more cases but we also start to see code size regressions because of larger stack offsets.

This optimization often allows RSP-based addressing for locals in methods that were using RBP-based addressing before. The potential downside is that if the stackalloc local ends up closer to RSP than the other locals, their offsets grow, and we only get 128 bytes of cheap (one-byte displacement) local offsets.

Accesses to the stackalloc local tend to be relative to its base so its position on the frame is less critical.

I'm going to see if there is any way to put the stackalloc local closer to the root of the frame; if that's possible it should avoid code size increases.
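
To make the offset concern concrete, here's a hypothetical example; the comments describe the expected x64 layout under the assumptions above, not measured output:

class FrameOffsetSketch
{
    unsafe static int BigFixedAlloc(byte seed)
    {
        // If this 256-byte stackalloc were converted to a local buffer placed nearest
        // RSP, other stack-resident locals would land at RSP offsets of 256 or more,
        // needing 4-byte displacements instead of the 1-byte ones available up to +127.
        byte* scratch = stackalloc byte[256];
        int a = seed * 3;
        int b = seed * 5;
        scratch[0] = (byte)a;
        scratch[1] = (byte)b;
        return scratch[0] + scratch[1] + a + b;
    }
}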

@AndyAyersMS
Member Author

AndyAyersMS commented Oct 20, 2017

Looks like over in lvaAssignVirtualFrameOffsetsToLocals we could indeed influence frame layout, as the locals introduced by stackalloc conversion are marked as "unsafe buffers" (since they're accessed via pointer math); however, there are lots of other considerations that feed into layout. Given the smallish number of fixed-size cases over 32 bytes, perhaps it's best to just stick with 32 for now.

AndyAyersMS referenced this issue in dotnet/coreclr Nov 1, 2017
Optimize fixed sized locallocs of 32 bytes or less to use local buffers,
if the localloc is not in a loop.

Also "optimize" the degenerate 0 byte case.

Allow inline candidates containing localloc, but fail inlining if any
of a candidate's locallocs do not convert to local buffers.

The 32 byte size threshold was arrived at empirically; larger values did
not enable many more cases and started seeing size bloat because of
larger stack offsets.

We can revise this threshold if we are willing to reorder locals and see
fixed sized cases larger than 32 bytes.

Closes #8542.

Also add a missing handler for the case where the callsite is in a try
region; this was an oversight.
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 27, 2020