RyuJit: Functions with stackalloc won't inline #7109

Open
benaadams opened this issue Dec 8, 2016 · 22 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), enhancement (Product code improvement that does NOT require public API changes/additions), optimization, tenet-performance (Performance related issue)
Milestone
Future

Comments

@benaadams (Member)

Functions using stackalloc will fail to inline, with the reason given as "unknown opcode".

Discovered this as part of PR davidfowl/Channels#145 (I probably should have raised it then).

Example function that will fail to inline, even with aggressive inlining:

public unsafe bool TrySliceTo(byte b1, byte b2, out ReadableBuffer slice, out ReadCursor cursor)
{
   byte* twoBytes = stackalloc byte[2];
   twoBytes[0] = b1;
   twoBytes[1] = b2;
   return TrySliceTo(new Span<byte>(twoBytes, 2), out slice, out cursor);
}

Changed function that will inline successfully, without aggressive inlining:

public unsafe bool TrySliceTo(byte b1, byte b2, out ReadableBuffer slice, out ReadCursor cursor)
{
   // use address of ushort rather than stackalloc as the inliner won't inline functions with stackalloc
   ushort twoBytes;
   byte* byteArray = (byte*)&twoBytes;
   byteArray[0] = b1;
   byteArray[1] = b2;
   return TrySliceTo(new Span<byte>(byteArray, 2), out slice, out cursor);
}

/cc @AndyAyersMS @JosephTremoulet
category:cq
theme:inlining
skill-level:expert
cost:medium

@benaadams (Member, Author) commented Dec 8, 2016

@BruceForstall was this fixed by "RyuJIT: improve CQ for amd64 localloc" https://github.com/dotnet/coreclr/issues/6261 (PR "Clean up localloc implementation", dotnet/coreclr#6276)?

Though that might be entirely a change that doesn't affect inlining?

@benaadams (Member, Author) commented Dec 8, 2016

This may also have been a factor in changing the code path for dotnet/coreclr#6141 from stackalloc to uint, for similar reasons, as discovered in benchmarking by @jamesqo (the localloc issues referenced above may also have contributed).

@AndyAyersMS (Member)

There are challenges inlining methods with stackalloc, primarily because there is no unstackalloc. Inlining such methods can convert cases that do not overflow the stack into cases that do.

C/C++ compilers have similar limitations when inlining methods that use alloca.

We could consider supporting this for aggressive inline methods, perhaps.
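
For illustration, a minimal sketch of the hazard; the Checksum/Run names and sizes here are made up, not taken from the PR above:

static class StackallocInlineHazard
{
    // Callee: reserves 256 bytes of stack per call. As a real call, the space is
    // reclaimed when the method returns.
    static unsafe int Checksum(byte seed)
    {
        byte* buf = stackalloc byte[256];
        for (int i = 0; i < 256; i++)
            buf[i] = (byte)(seed + i);

        int sum = 0;
        for (int i = 0; i < 256; i++)
            sum += buf[i];
        return sum;
    }

    // Caller: if Checksum were naively inlined here, there is no "unstackalloc",
    // so every iteration would reserve another 256 bytes in this frame, and a
    // long-running loop could overflow the stack.
    static int Run(int iterations)
    {
        int total = 0;
        for (int i = 0; i < iterations; i++)
            total += Checksum((byte)i);
        return total;
    }
}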

@benaadams (Member, Author) commented Dec 8, 2016

What about fixed-size/const stackallocs vs. variable-size ones? Maybe with a threshold?

There are quite a lot of fixed-size allocs (also variable ones) in coreclr and corefx.

@benaadams (Member, Author)

primarily because there is no unstackalloc.

Ahh... so the SP will advance, but there is no way to bring it back? So scoping with the SP can only be controlled via function calls, rather than, say, in an if/while block etc.

So it could get quite messy if a stackalloc function was inlined into the top of a while loop, for example?

@AndyAyersMS (Member)

Yes, the problematic cases are ones where the call site is in a loop. Even if the stackalloc is fixed size, we won't know the number of loop iterations during inlining, and so would not be able to bound the potential stack growth.

The jit could potentially convert small fixed-size stackallocs into new locals; that might enable inlining the small fixed-size cases.

@benaadams (Member, Author) commented Dec 8, 2016

Could loop + stackalloc block the inline, but without a loop it's fine?

Though I suppose eventually it will hit a loop when all the inlines collapse/fold in,

e.g. the caller of the caller of the caller has a loop.

Trickier than it first appears 😦

@AndyAyersMS (Member)

Maybe for small allocations, yes.

The interesting cases for perf likely involve amortizing the cost of small fixed-size stackallocs in loops, like the example you gave above: davidfowl/Channels#145. If the inlinee does a variable-size or large stackalloc, it presumably will touch most of the storage, so the stackalloc overhead is probably not as significant.

I'd like to see the jit convert small fixed-size stackallocs into regular local vars first. Then we could consider anticipating this in the inliner and allowing such methods to be inlined.
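
As a rough source-level picture of that conversion (the JIT would do this internally; the Buffer32 struct and FillAndSum method below are illustrative stand-ins, not a real API):

using System;
using System.Runtime.InteropServices;

static class StackallocAsLocal
{
    // 32 bytes of inline storage with a statically known size.
    [StructLayout(LayoutKind.Sequential, Size = 32)]
    struct Buffer32 { }

    static unsafe int FillAndSum()
    {
        // Equivalent storage to `byte* buf = stackalloc byte[32];`, but expressed as
        // an ordinary local: the frame size is fixed at JIT time and there is no
        // localloc to block inlining.
        Buffer32 storage = default;
        var buf = new Span<byte>(&storage, 32);

        for (int i = 0; i < buf.Length; i++)
            buf[i] = (byte)i;

        int sum = 0;
        foreach (byte b in buf)
            sum += b;
        return sum;
    }
}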

@benaadams (Member, Author) commented Dec 8, 2016

@AndyAyersMS could you raise an issue for that? I have no idea how to phrase it/get the concept across 😄

@AndyAyersMS (Member)

Sure, dotnet/coreclr#8542.

@AndyAyersMS (Member)

Not sure why I didn't auto-link to this when I fixed at least some cases of it: dotnet/coreclr#14623.

Fixed-size stackallocs of 32 bytes or less that are not in loops will no longer block inlining.
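
For illustration, assuming the 32-byte cut-off described above (the method names are made up):

static class InlineThresholdExamples
{
    // Fixed-size, 32 bytes, not inside a loop: per the change above, this
    // stackalloc no longer blocks the method from being inlined.
    static unsafe byte FirstOf32(byte seed)
    {
        byte* buf = stackalloc byte[32];
        buf[0] = seed;
        return buf[0];
    }

    // 64 bytes exceeds the stated 32-byte limit, so this still blocks inlining.
    static unsafe byte FirstOf64(byte seed)
    {
        byte* buf = stackalloc byte[64];
        buf[0] = seed;
        return buf[0];
    }
}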

@benaadams (Member, Author)

@AndyAyersMS since skipping init locals is a thing now (applied to coreclr and some of corefx), and is proposed as a method-level SkipLocalsInitAttribute for C# 7.3 (dotnet/roslyn#24723), is it worth considering expanding the fixed-size stackalloc threshold when no zeroing will occur (i.e. no .locals init)?
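
For reference, a usage sketch of that attribute as it eventually shipped (System.Runtime.CompilerServices.SkipLocalsInitAttribute); the method below is illustrative, and applying the attribute requires AllowUnsafeBlocks in the project:

using System.Runtime.CompilerServices;

static class SkipLocalsInitExample
{
    // With [SkipLocalsInit] the compiler omits the `.locals init` flag, so the
    // 64-byte buffer is not zeroed on entry; it must be fully written before
    // it is read.
    [SkipLocalsInit]
    static unsafe int Sum(byte seed)
    {
        byte* buf = stackalloc byte[64];
        for (int i = 0; i < 64; i++)
            buf[i] = (byte)(seed + i);

        int sum = 0;
        for (int i = 0; i < 64; i++)
            sum += buf[i];
        return sum;
    }
}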

@AndyAyersMS (Member)

Skipping init locals helps a normal stackalloc roughly as much as it helps a stackalloc converted to a fixed buffer.

My reluctance to push the value higher (see notes in dotnet/coreclr#14623) was twofold -- not many cases in the 32...512 range, and for larger sizes the jit starts using the larger encoding for most on-frame accesses.

There are things the jit could do to address the need for large offsets. It could rearrange the frame and try to put the most frequently accessed locals in places where they can be reached via small offsets, bias the frame pointer so that addressing can take advantage of the full -128...127 range, or introduce a fixed or temporary second frame pointer that is located near the frequently used locals. But the jit does not know these tricks yet.

With the recent increase in using span over stackalloc we should probably remeasure the frequency stats, and maybe that would justify increasing the conversion limit. But we'd need to keep an eye on code size.

@tannergooding (Member)

@AndyAyersMS, it would seem like it might be possible for stackalloc in inlined methods to also be able to "unalloc".

That is, instead of folding the stack allocation into the parent method's frame, you would move the stack pointer such that the code generated would be:

  • stackalloc
  • inlined code
  • unalloc

I could think of a few places where this might be undesirable, but there are also a number of places where opting into this behavior would be beneficial.

@benaadams (Member, Author)

With the recent increase in using span over stackalloc we should probably remeasure the frequency stats, and maybe that would justify increasing the conversion limit.

I'm seeing a jump in what's considered a good cut-off for stackalloc, due to the reduced cost from the illink init-locals removal: e.g. https://github.com/dotnet/corefx/pull/27204/files

Though that method probably isn't a good example of one to inline 😄

@benaadams (Member, Author)

you would move the stack pointer such that the code generated would be:

For non-init-locals only, I assume? Otherwise you pull in a bunch of zeroing code as well.

@tannergooding (Member)

For non-init-locals only, I assume? Otherwise you pull in a bunch of zeroing code as well.

Yeah, that would probably be better 😄

@AndyAyersMS (Member)

Undoing stackalloc has always been something of a pipe dream -- I don't know of any native compilers that do it either.

Aside from getting the mechanical aspects right, the undo point creates a barrier to code motion that can be hard for the compiler to model and reason about (similar to what happens with pinning). And there is an EH aspect to consider too... we'd likely need to induce a try/finally to ensure that the undo always happened.

@benaadams (Member, Author)

And there is an EH aspect to consider too... we'd likely need to induce a try/finally to ensure that the undo always happened.

At which point it's probably getting ugly for an inline candidate?

@AndyAyersMS (Member)

Yes. When EH gets involved the jit becomes very conservative (as you've seen firsthand with the async cases).

@AndyAyersMS (Member)

Looks like

private const int CharStackBufferSize = 32;

is somewhat popular in corefx formatting (a char[32] buffer is 64 bytes), so maybe 64 is now an interesting limit.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@AndyAyersMS (Member)

Simple cases of this are now handled (small, fixed-size, not in loops).

More general handling is not yet a priority, so this will stay in the Future milestone.

@AndyAyersMS AndyAyersMS removed the JitUntriaged CLR JIT issues needing additional triage label Nov 24, 2020