This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

JIT: convert fixed-sized locallocs to locals, enable inlining #14623

Merged
merged 1 commit into from
Nov 1, 2017

AndyAyersMS
Member

Optimize fixed sized locallocs of 32 bytes or less to use local buffers.
Also "optimize" the degenerate 0 byte case.

Allow inline candidates containing localloc, but fail inlining if any
of a candidate's locallocs do not convert to local buffers.

The 32 byte size threshold was arrived at empirically; larger values did
not enable many more cases, and we started seeing size bloat because of
larger stack offsets.

We can revise this threshold if we are willing to reorder locals and see
fixed sized cases larger than 32 bytes.

Closes #8542.

Also add the missing handler for the case where the callsite is in a try
region; this was an oversight.
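
The policy described above can be sketched roughly as follows. This is a minimal illustration, not the JIT's actual code: `LocallocRequest`, `LocallocKind`, `classifyLocalloc`, `canInline`, and `kMaxLocallocToLocalSize` are hypothetical names, and only the 32 byte threshold comes from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; only the 32 byte threshold is from the PR text.
constexpr std::size_t kMaxLocallocToLocalSize = 32;

struct LocallocRequest
{
    bool        isConstantSize; // true when the size is known at jit time
    std::size_t size;           // meaningful only when isConstantSize is true
};

enum class LocallocKind
{
    ZeroSize,    // degenerate 0 byte request
    LocalBuffer, // fixed size at or under the threshold: becomes a local
    StackAlloc   // variable sized or too large: stays a real localloc
};

inline LocallocKind classifyLocalloc(const LocallocRequest& request)
{
    if (!request.isConstantSize)
    {
        return LocallocKind::StackAlloc;
    }
    if (request.size == 0)
    {
        return LocallocKind::ZeroSize;
    }
    if (request.size <= kMaxLocallocToLocalSize)
    {
        return LocallocKind::LocalBuffer;
    }
    return LocallocKind::StackAlloc;
}

// An inline candidate stays viable only if every localloc in it converts.
inline bool canInline(const std::vector<LocallocRequest>& locallocs)
{
    for (const LocallocRequest& request : locallocs)
    {
        if (classifyLocalloc(request) == LocallocKind::StackAlloc)
        {
            return false;
        }
    }
    return true;
}
```

Under these assumptions, a candidate with a 16 byte and a 32 byte localloc would remain inlinable, while one with a 33 byte or variable sized localloc would not.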

@AndyAyersMS
Member Author

@briansull PTAL
cc @dotnet/jit-contrib

Jit-diffs stats:

Total bytes of diff: -3822 (-0.02 % of base)
    diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file improvements by size (bytes):
       -3456 : System.Private.CoreLib.dasm (-0.10 % of base)
        -102 : System.Private.DataContractSerialization.dasm (-0.01 % of base)
         -89 : System.Net.Primitives.dasm (-0.16 % of base)
         -83 : System.Private.Uri.dasm (-0.11 % of base)
         -31 : System.Console.dasm (-0.07 % of base)
8 total files with size differences (8 improved, 0 regressed), 122 unchanged.
Top method improvements by size (bytes):
       -1110 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(long,long):int:this (47 methods)
        -382 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT(ref):int:this (17 methods)
        -339 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT():ref:this (68 methods)
        -256 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(int,long,long):int:this (17 methods)
        -140 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoCOM(int,ref,long):int:this (5 methods)
53 total methods with size differences (53 improved, 0 regressed), 142974 unchanged.

@briansull

It seems like the tests aren't happy with this change.

@AndyAyersMS
Member Author

Ah, the new test case needs /unsafe. Let me fix it...

{
// Get the size threshold for local conversion
ssize_t maxSize = DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE;
INDEBUG(maxSize = JitConfig.JitStackAllocToLocalSize(););

I don't like this syntax here.

Could you expand it to the #ifdef DEBUG form?

Member Author

Sure, will do.
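
For readers unfamiliar with the request, the expanded form would look roughly like the sketch below. The stand-ins are assumptions made so the snippet compiles on its own: `StubJitConfig` and its return value are made up, `long` replaces `ssize_t`, the wrapper function `getMaxLocallocToLocalSize` is hypothetical, and the default of 32 is taken from the PR description rather than from the actual header.

```cpp
#include <cassert>

// Assumed default; the PR description quotes a 32 byte threshold.
constexpr long DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE = 32;

// Stand-in for the jit's config object; the value returned here is made up.
struct StubJitConfig
{
    long JitStackAllocToLocalSize() const
    {
        return 16;
    }
};

inline long getMaxLocallocToLocalSize(const StubJitConfig& config)
{
    (void)config; // unused when DEBUG is not defined

    // Get the size threshold for local conversion
    long maxSize = DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE;

#ifdef DEBUG
    // Debug builds may override the default through configuration.
    maxSize = config.JitStackAllocToLocalSize();
#endif // DEBUG

    assert(maxSize >= 0);
    return maxSize;
}
```

The behavior is the same as the one-line `INDEBUG(...)` form; the difference is only that the debug-only assignment is visible as an ordinary statement.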

@AndyAyersMS
Member Author

Hmm, still some test issues to sort out.

@AndyAyersMS
Member Author

Grr, switches with fallthroughs. Not the first time I've been burned. Hopefully that takes care of the test issues.

@briansull briansull left a comment

Looks Good

@AndyAyersMS
Member Author

Still a few issues out there, building x86 locally now to repro.

@AndyAyersMS
Member Author

AndyAyersMS commented Oct 20, 2017

In one of the failing cases (jit\jit64\opt\localloc\call04_small_il) there's a method with a localloc and then an explicit tail call, and the localloc address is passed to the callee. The expectation is that the localloc memory will survive the tail call.

My first gloss through Ecma-335 says this is an invalid program, in particular III.2.4 says:

The tail. prefix shall immediately precede a call, calli, or callvirt instruction. It indicates that the current method’s stack frame is no longer required and thus can be removed before the call instruction is executed.

I am going to look at the other failures first before I decide what to do about this one.

@AndyAyersMS
Member Author

The other x86 failures are jit asserts about the GS cookie offset. So likely my change is breaking some jit internal consistency issue by just setting the locals as unsafe buffers; probably something else has to change too. Will debug. May hit this on arm32 too.

The call04_small case doesn't fail on x64 because for that target the jit bails on explicit tail calls in methods that need GS protection.

@AndyAyersMS
Member Author

Since the jit bails on explicit tail calls when localloc is used, it seems reasonable to do the same if one is used but then optimized.

For the assert cases, it looks like we also need to set compGSReorderStackLayout when we mark the buffer as lvIsUnsafeBuffer for x86, but this causes bad code size diffs on x64, so need to think about it more.

@jkotas
Member

jkotas commented Oct 21, 2017

Do you need to also worry about local alloc inside a loop? They should not be converted this way.

@AndyAyersMS
Member Author

Inlining is restricted to methods with convertible fixed-sized locallocs. If the call site is in a loop, the local buffer introduced into the inlinee by localloc conversion will get reused each iteration (and zeroed first if the inlinee has initlocals). So stack will not grow.

We don't allow inlining of large fixed or variable sized locallocs in this change, in part because of worries about stack growth. I left a bunch of notes on the topic over in the linked issue #8542.

@AndyAyersMS
Member Author

Still have a few x86 issues to sort out, related to whether or not the converted buffer is considered an unsafe buffer.

@jkotas
Member

jkotas commented Oct 22, 2017

This is what I meant:

Expected result: 45
Actual result w/ your change: infinite loop

using System;

unsafe class Program
{
    struct Element
    {
        public Element* Next;
        public int Value;
    }

    static int foo(int n)
    {
        Element* root = null;
        for (int i = 0; i < n; i++)
        {
            byte* pb = stackalloc byte[16];
            Element* p = (Element*)pb;
            p->Value = i;
            p->Next = root;
            root = p;
        }

        int sum = 0;
        while (root != null)
        {
            sum += root->Value;
            root = root->Next;
        }
        return sum;
    }

    static void Main(string[] args)
    {
        Console.WriteLine(foo(10));
    }
}

@AndyAyersMS
Member Author

Thanks. Updated to avoid conversion in such cases.
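
To make the failure mode in the repro concrete, here is a small C++ simulation. It is only an analogy under stated assumptions: `std::vector` storage stands in for distinct per-iteration stack allocations, a single `Element` local stands in for a localloc naively converted to one local buffer, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Element
{
    Element* next;
    int      value;
};

// Each iteration gets fresh storage, as a real localloc in a loop provides.
inline int sumWithFreshStorage(int n)
{
    std::vector<Element> storage(static_cast<std::size_t>(n));
    Element*             root = nullptr;
    for (int i = 0; i < n; i++)
    {
        Element* p = &storage[static_cast<std::size_t>(i)];
        p->value   = i;
        p->next    = root;
        root       = p;
    }

    int sum = 0;
    while (root != nullptr)
    {
        sum += root->value;
        root = root->next;
    }
    return sum;
}

// Every iteration reuses one buffer, as a single converted local would.
// Returns true if the list ends up linking back to itself.
inline bool reuseCreatesCycle(int n)
{
    Element  shared = {nullptr, 0};
    Element* root   = nullptr;
    for (int i = 0; i < n; i++)
    {
        Element* p = &shared; // same address on every iteration
        p->value   = i;
        p->next    = root;    // from the second iteration on, root == p
        root       = p;
    }
    return (root != nullptr) && (root->next == root);
}
```

With fresh storage, `sumWithFreshStorage(10)` yields 45, matching the expected result in the repro. With a reused buffer the node's `next` pointer refers to the node itself from the second iteration on, which is why the summing loop in the repro never terminates.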

@AndyAyersMS
Member Author

Most recent test failures seem unrelated. Still expecting some x86 failures though.

@dotnet-bot retest this please

@AndyAyersMS
Member Author

Ah, should read the failures more closely... the new test needed a tweak.

@AndyAyersMS
Member Author

Now marking the buffer as unsafe; we still get some size wins overall.

Total bytes of diff: -2085 (-0.01 % of base)
    diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file improvements by size (bytes):
       -1693 : System.Private.CoreLib.dasm (-0.05 % of base)
        -119 : System.Private.DataContractSerialization.dasm (-0.02 % of base)
         -80 : System.Private.Uri.dasm (-0.10 % of base)
         -71 : System.Net.Primitives.dasm (-0.12 % of base)
         -45 : System.Private.Xml.dasm (0.00 % of base)
8 total files with size differences (8 improved, 0 regressed), 122 unchanged.
Top method regessions by size (bytes):
         200 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoCOM(int,ref,long):int:this (5 methods)
         184 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT(ref):int:this (17 methods)
          50 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT(ref):this (6 methods)
          45 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT(int,ref):int:this (3 methods)
          44 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoCOM(int,ref,int,byref):this (1 methods)
Top method improvements by size (bytes):
       -1110 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(long,long):int:this (47 methods)
        -267 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT():ref:this (68 methods)
        -190 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(int,long,long):int:this (17 methods)
        -140 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_COMtoCLR(int,long,long):int:this (6 methods)
        -102 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(int,int,long,long):int:this (3 methods)
53 total methods with size differences (42 improved, 11 regressed), 142787 unchanged.

@AndyAyersMS
Member Author

@briansull PTAL one more time if you don't mind.

@briansull briansull left a comment

Looks Good

@@ -22958,6 +22975,7 @@ void Compiler::fgInsertInlineeBlocks(InlineInfo* pInlineInfo)
compLongUsed |= InlineeCompiler->compLongUsed;
compFloatingPointUsed |= InlineeCompiler->compFloatingPointUsed;
compLocallocUsed |= InlineeCompiler->compLocallocUsed;
compLocallocOptimized |= InlineeCompiler->compLocallocOptimized;

Future refactoring opportunity:
We should place all of these values into a single struct so that we can perform a single struct operation
from InlineeCompiler to the this pointer to update all of them at once.
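
One way the suggested refactoring might look is sketched below. The struct and function names (`MethodUsageFlags`, `mergeInlineeFlags`) are hypothetical, and the field names only loosely mirror the `comp*` flags in the hunk. Because the existing code uses `|=`, the sketch merges with an OR rather than a plain assignment, which would clear flags the root method had already set.

```cpp
#include <cassert>

// Hypothetical grouping of the per-method "feature was seen" flags.
struct MethodUsageFlags
{
    bool longUsed          = false;
    bool floatingPointUsed = false;
    bool locallocUsed      = false;
    bool locallocOptimized = false;

    // OR-merge: a flag stays set if either side had it set.
    MethodUsageFlags& operator|=(const MethodUsageFlags& other)
    {
        longUsed          = longUsed || other.longUsed;
        floatingPointUsed = floatingPointUsed || other.floatingPointUsed;
        locallocUsed      = locallocUsed || other.locallocUsed;
        locallocOptimized = locallocOptimized || other.locallocOptimized;
        return *this;
    }
};

// Fold the inlinee's flags into the root method's flags in one step.
inline void mergeInlineeFlags(MethodUsageFlags& root, const MethodUsageFlags& inlinee)
{
    root |= inlinee;
}
```

A new flag would then need to be added in one place (the struct and its merge operator) instead of at every site that propagates inlinee state.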

Optimize fixed sized locallocs of 32 bytes or less to use local buffers,
if the localloc is not in a loop.

Also "optimize" the degenerate 0 byte case.

Allow inline candidates containing localloc, but fail inlining if any
of a candidate's locallocs do not convert to local buffers.

The 32 byte size threshold was arrived at empirically; larger values did
not enable many more cases, and we started seeing size bloat because of
larger stack offsets.

We can revise this threshold if we are willing to reorder locals and see
fixed sized cases larger than 32 bytes.

Closes #8542.

Also add the missing handler for the case where the callsite is in a try
region; this was an oversight.
@AndyAyersMS
Member Author

Fixed formatting issue and rebased.

@AndyAyersMS
Member Author

@dotnet/jit-contrib formatter build seems to be busted, perhaps from some recent change. Any ideas?

diff.cs(22,31): error CS8137: Cannot define a class or member that utilizes tuples because the compiler required type 'System.Runtime.CompilerServices.TupleElementNamesAttribute' cannot be found. Are you missing a reference? [D:\j\workspace\x64_windows_n---616f30f2\jitutils\src\jit-diff\jit-diff.csproj

@AndyAyersMS
Member Author

Looks like tests\scripts\format.py is loading up a 1.0 vintage CLI to build the jitutils tools for the formatting leg. Seems unlikely to work now that we've updated the tools to netcoreapp2.0.

@AndyAyersMS
Member Author

Hah, it worked. Will PR those parts separately.

@AndyAyersMS
Member Author

Hmm, some error trying to clean up after the windows formatter ran and didn't find issues:

18:52:58 Deleting D:\j\workspace\x64_windows_n---616f30f2\dotnetcli-jitutils
18:52:58 Traceback (most recent call last):
18:52:58   File "tests\scripts\format.py", line 248, in <module>
18:52:58     return_code = main(sys.argv[1:])
18:52:58   File "tests\scripts\format.py", line 231, in main
18:52:58     shutil.rmtree(dotnetcliPath, onerror=del_rw)
18:52:58   File "C:\Python27\lib\shutil.py", line 247, in rmtree
18:52:58     rmtree(fullname, ignore_errors, onerror)
18:52:58   File "C:\Python27\lib\shutil.py", line 247, in rmtree
18:52:58     rmtree(fullname, ignore_errors, onerror)
18:52:58   File "C:\Python27\lib\shutil.py", line 247, in rmtree
18:52:58     rmtree(fullname, ignore_errors, onerror)
18:52:58   File "C:\Python27\lib\shutil.py", line 247, in rmtree
18:52:58     rmtree(fullname, ignore_errors, onerror)
18:52:58   File "C:\Python27\lib\shutil.py", line 252, in rmtree
18:52:58     onerror(os.remove, fullname, sys.exc_info())
18:52:58   File "tests\scripts\format.py", line 29, in del_rw
18:52:58     os.chmod(name, 0651)
18:52:58 WindowsError: [Error 3] The system cannot find the path specified: 'D:\\j\\workspace\\x64_windows_n---616f30f2\\dotnetcli-jitutils\\sdk\\NuGetFallbackFolder\\runtime.opensuse.13.2-x64.runtime.native.system.security.cryptography.openssl\\4.3.0\\runtime.opensuse.13.2-x64.runtime.native.system.security.cryptography.openssl.4.3.0.nupkg.sha51

Not sure if this is stale bits on the test machine or something else.

@AndyAyersMS
Member Author

Let me see if it happens again. Link to previous failure.

@dotnet-bot retest Windows_NT x64 Innerloop Formatting

@AndyAyersMS
Member Author

Ah, the file name is > 260 characters.

@AndyAyersMS
Member Author

Reverted the experimental commits now that the formatting is fixed.

@AndyAyersMS
Member Author

Tizen failure likely unrelated. Seems like a lot of these tests are failing intermittently now.

@RussKeldorph eyeballing looks like recently about 1 in 3 test runs fail. Haven't drilled in but would guess these are mostly random failures. Can we up the priority of the retry logic that was discussed in #6298?


Successfully merging this pull request may close these issues.

JIT: consider optimizing small fixed-sized stackalloc