Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Continue updating previous generic dictionaries after expansion #40355

Merged
merged 9 commits into from
Aug 7, 2020

Conversation

trylek
Copy link
Member

@trylek trylek commented Aug 5, 2020

Initial simplistic implementation based on my current understanding
of the code. I may be easily missing some extra places in the code
that need fixing. For now I've kept the instrumentation in
genericdict.cpp.

Fixes: #40298

Thanks

Tomas

@trylek trylek added this to the 5.0.0 milestone Aug 5, 2020
@trylek trylek requested a review from jkotas August 5, 2020 01:22

// Allocate from the high frequence heap of the correct domain
S_SIZE_T allocSize = safe_cbMT;
allocSize += cbOptional;
allocSize += cbIMap;
allocSize += cbPerInst;
allocSize += cbInstAndDict;
allocSize += cbInstAndDictPlusBackPointer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The starting dictionary does not really need to have the back pointer. It is always going to be NULL. You can tell that it is the starting dictionary by comparing actual dictionary size with GetNumInitialSlots

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, methodtablebuilder seems to have its own calculation of the initial dictionary size around

DWORD estNumTypeSlots = (numMethodsAdjusted * nTypeFactorBy2 + 2) / 3;

In my local debugging, type dictionaries usually had initial size equal to 0x200; can you please clarify what the check should look like in this case?

I was originally trying to mark the initial directory with zero bit set in the size but for now I rolled it back as an optimization that can be addressed later once the basic algorithmic change is shown to work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I finally understand what you meant - using the initial slot count in the DictionaryLayout for the check. Fixed in 3rd commit, thanks for the suggestion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I must be still missing something here, apparently it still doesn't work. The code at

infoSize = DictionaryLayout::GetDictionarySizeFromLayout(methodInst.GetNumArgs(), pDL);

apparently allocates an "initial dictionary" with size equal to "current method dictionary layout size" which is different from the "initial slot count" on the dictionary layout. With the back pointers everything seems to be working so I hope this is the last hurdle I need help with.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be solved by moving the accounting for the back-pointer into GetDictionarySizeFromLayout:

...
    if (pDictLayout != NULL)
    {
        WORD slots = VolatileLoadWithoutBarrier(&pDictLayout->m_numSlots);

        bytes += sizeof(TADDR);                  // Slot for dictionary size
        bytes += slots * sizeof(TADDR);       // Slots for dictionary slots based on a dictionary layout

        if (pDictLayout->m_numInitialSlots < slots)
        {
            bytes += sizeof(PTR_Dictionary);  // Dictionary back-pointer for expanded dictionaries
        }
    }

And adding an extra condition for exiting the patching loop on null back-pointers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great, will do, thank you!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, it's got an interesting wrinkle - we still need two different methods otherwise we'd change the existing semantics of the "size slot"; I have basically kept your idea but I had to introduce a pair of functions, GetDictionarySlotSizeFromLayout vs. GetDictionaryAllocSizeFromLayout where the former is the value of the "size slot" and the latter is the "total number of bytes" including the back pointer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 5th commit (using the two methods). It's somewhat uglier than what you originally proposed but it's still relatively well encapsulated I believe.

@jkotas
Copy link
Member

jkotas commented Aug 5, 2020

This will also need a fix in the JIT to reduce how much CSE the JIT does for dictionary entries. The problem is that the Dictionary* is CSEed in the JITed code, and the JITed code never sees the updated Dictionary*. The code for that is in Compiler::impRuntimeLookupToTree. We need to stop setting GTF_IND_INVARIANT on the last indirection when sizeOffset != CORINFO_NO_SIZE_CHECK.

@trylek
Copy link
Member Author

trylek commented Aug 5, 2020

Adding JIT folks for reviewing the JIT change, please advise me if there are any special tests I should run and / or whether I should produce some JIT diffs from this change. Locally it seems to be passing and fixing the regression. I'm also not clear on what to do with the "regression test" - I can probably hardcode some sufficiently large timeout margin into it so that it fails upon exceeding it; alternatively we can somehow incorporate it in our perf test system.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JIT changes seem ok modulo some formatting.

This will lead to asm diffs; it would be good to characterize these.

@@ -2117,11 +2117,17 @@ GenTree* Compiler::impRuntimeLookupToTree(CORINFO_RESOLVED_TOKEN* pResolvedToken
nullptr DEBUGARG("impRuntimeLookup indirectOffset"));
}

// The last indirection could be subject to a size check (dynamic dictionary expansion)
bool isLastIndirectionWithSizeCheck = (i == pRuntimeLookup->indirections - 1 && pRuntimeLookup->sizeOffset != CORINFO_NO_SIZE_CHECK);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually fully parenthesize:

Suggested change
bool isLastIndirectionWithSizeCheck = (i == pRuntimeLookup->indirections - 1 && pRuntimeLookup->sizeOffset != CORINFO_NO_SIZE_CHECK);
bool isLastIndirectionWithSizeCheck = ((i == pRuntimeLookup->indirections - 1) && (pRuntimeLookup->sizeOffset != CORINFO_NO_SIZE_CHECK));

Also, will likely require clang-formatting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have fixed the formatting. I'll look into clang formatting and providing the assembly diffs as the next step.

@AndyAyersMS
Copy link
Member

Testing is tricky as this isn't something BDN will normally capture.

I wonder if you can engineer a test as follows:

  • a baseline version that is a generic method over some ref T that's called once and loops, and within the loop always refers to various types dependent on T, eg if (t is List<T>) and so on. IIRC having 4 or 5 of these guarantees a slow lookup for at least one.
  • an alternate version that outlines the body of the loop into a non-inlined method with the same sequence of type references.

Without dictionary updates the baseline version should always have slow iterations as it will always refer to the initial dictionary. The alternate version should eventually get fast ones as it will refer to the expanded dictionary. So you can compare the times for the last N iterations of each; they should be similar if the bug is fixed.

//
//static
DWORD
DictionaryLayout::GetDictionarySizeFromLayout(
DictionaryLayout::GetDictionarySlotSizeFromLayout(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make it one method with DWORD* ppSlotSize argument to make sure that you always get consistent snapshot for slot size and alloc size.

This method is sometimes called without locks where one thread is reading the layout and other thread is expanding it. It is why I had Volatile... in my code snippet. Having one method to return both alloc and slot sizes will avoid bugs with somebody accidentally changes the order in which GetDictionarySlotSizeFromLayout and GetDictionaryAllocSizeFromLayout are called.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, sorry about not realizing this, fixed in 6th commit.

@trylek trylek force-pushed the DictionaryExpansion branch from 02d93a9 to eb9a243 Compare August 6, 2020 00:56
Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Once it builds...

@jkotas
Copy link
Member

jkotas commented Aug 6, 2020

probably too long source line

It is best to download the .patch file from the CI artifacts and apply it.

@trylek
Copy link
Member Author

trylek commented Aug 6, 2020

I saw in the patch that it was just this one line split so I applied it manually. I absolutely agree that for larger diffs it would be much easier to apply the patch mechanically.

@AndyAyersMS
Copy link
Member

FYI, if you have jitutils tools available, there is a jit-format command that will do formatting for you; you can then do this locally rather than waiting for the CI to generate the patch. It should also be possible to reformat from visual studio or vscode as they should pick up the jit's .clang-format, though I haven't tried these.

@trylek
Copy link
Member Author

trylek commented Aug 6, 2020

OK, I have finally managed to produce the jit-diff for the ExpansionPerf test. As expected, the only difference is in the Test.TestDict method.

Baseline: https://gist.github.com/trylek/644d79473dcc802a7a9729580c5e7a4e
Diff: https://gist.github.com/trylek/e1e65fc9777efa6a416498d23697b9c4

JIT-diff console output:

D:\git\runtime5>jit-diff diff --core_root D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root --assembly D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Loader\classloader\DictionaryExpansion\ExpansionPerf\ExpansionPerf.dll --base D:\git\runtime3\artifacts\bin\CoreCLR\Windows_NT.x64.Checked --diff D:\git\runtime5\artifacts\bin\CoreCLR\Windows_NT.x64.Checked --arch x64 --build Checked --pmi
Using --output D:\git\runtime5\artifacts\diffs
Beginning PMI CodeSize Diffs for D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Loader\classloader\DictionaryExpansion\ExpansionPerf\ExpansionPerf.dll
\ Finished 1/1 Base 1/1 Diff [3.6 sec]
Completed PMI CodeSize Diffs for D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Loader\classloader\DictionaryExpansion\ExpansionPerf\ExpansionPerf.dll in 3.57s
Diffs (if any) can be viewed by comparing: D:\git\runtime5\artifacts\diffs\dasmset_6\base D:\git\runtime5\artifacts\diffs\dasmset_6\diff
Analyzing CodeSize diffs...
Found 1 files with textual diffs.
PMI CodeSize Diffs for D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Loader\classloader\DictionaryExpansion\ExpansionPerf\ExpansionPerf.dll for x64 default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: -7 (-0.20% of base)
    diff is an improvement.
Top file improvements (bytes):
          -7 : ExpansionPerf.dasm (-0.20% of base)
1 total files with Code Size differences (1 improved, 0 regressed), 0 unchanged.
Top method improvements (bytes):
          -7 (-0.26% of base) : ExpansionPerf.dasm - Test:TestDict(IDictionary`2,Func`2) (7 methods)
Top method improvements (percentages):
          -7 (-0.26% of base) : ExpansionPerf.dasm - Test:TestDict(IDictionary`2,Func`2) (7 methods)
1 total methods with Code Size differences (1 improved, 0 regressed), 20 unchanged.
Completed analysis in 0.30s

At the first glance, the minute improvement is caused by the fact that the suppressed common subexpression elimination stops storing the Dictionary * in a register and instead repeatedly loads it using an indirect MOV, thus freeing the one register for other optimizations. For example:

Baseline:

G_M27032_IG30:
       movsxd   rcx, r12d
       mov      r13, gword ptr [rbx+8*rcx+16]
       mov      rcx, rbp
       cmp      qword ptr [rcx+8], 56
       jle      SHORT G_M27032_IG32

Diff:

G_M27032_IG30:
       movsxd   rcx, r15d
       mov      r12, gword ptr [rbx+8*rcx+16]
       mov      rcx, qword ptr [rsi+16]
       cmp      qword ptr [rcx+8], 56
       jle      SHORT G_M27032_IG32

@AndyAyersMS - please let me know if you find this summary info sufficient or suggest what else should I do, you know I'm not familiar with this process.

@trylek trylek changed the title WIP: Continue updating previous generic dictionaries after expansion Continue updating previous generic dictionaries after expansion Aug 6, 2020
@AndyAyersMS
Copy link
Member

I was looking for a broader assessment over all the framework assemblies (pass -f instead of --assembly <...>)

@trylek
Copy link
Member Author

trylek commented Aug 6, 2020

@AndyAyersMS - thanks for clarifying. This is the summary; please let me what other artifacts or data you're interested in.

D:\git\runtime5>jit-diff diff --core_root D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root -f --base D:\git\runtime3\artifacts\bin\CoreCLR\Windows_NT.x64.Checked --diff D:\git\runtime5\artifacts\bin\CoreCLR\Windows_NT.x64.Checked --arch x64 --build Checked --pmi
Using --output D:\git\runtime5\artifacts\diffs
Beginning PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies
\ Finished 136/268 Base 0/268 Diff [140.2 sec]Error running D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\corerun.exe on D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\FSharp.Core.dll
1 errors compiling set.
Dasm command "jit-dasm-pmi --platform D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root --corerun D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\corerun.exe --jit D:\git\runtime3\artifacts\bin\CoreCLR\Windows_NT.x64.Checked\clrjit.dll --nocopy --output D:\git\runtime5\artifacts\diffs\dasmset_7\base D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\FSharp.Core.dll" returned with 1 failures
/ Finished 268/268 Base 261/268 Diff [398.8 sec]Error running D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\corerun.exe on D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\FSharp.Core.dll
1 errors compiling set.
Dasm command "jit-dasm-pmi --platform D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root --corerun D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\corerun.exe --jit D:\git\runtime5\artifacts\bin\CoreCLR\Windows_NT.x64.Checked\clrjit.dll --nocopy --output D:\git\runtime5\artifacts\diffs\dasmset_7\diff D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\FSharp.Core.dll" returned with 1 failures
| Finished 268/268 Base 268/268 Diff [446.5 sec]
Dasm commands returned 1 base failures, 1 diff failures.
Completed PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies in 446.68s
Diffs (if any) can be viewed by comparing: D:\git\runtime5\artifacts\diffs\dasmset_7\base D:\git\runtime5\artifacts\diffs\dasmset_7\diff
Analyzing CodeSize diffs...
Found 36 files with textual diffs.
PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: 1304 (0.00% of base)
    diff is a regression.
Top file regressions (bytes):
         487 : FSharp.Core.dasm (0.03% of base)
         193 : System.Linq.dasm (0.02% of base)
         188 : Microsoft.CodeAnalysis.dasm (0.01% of base)
         138 : System.Private.CoreLib.dasm (0.00% of base)
         112 : CommandLine.dasm (0.02% of base)
          93 : System.Threading.Tasks.Dataflow.dasm (0.01% of base)
          35 : System.Linq.Expressions.dasm (0.00% of base)
          35 : System.Threading.Channels.dasm (0.02% of base)
          28 : System.Collections.Immutable.dasm (0.00% of base)
          27 : System.Composition.Hosting.dasm (0.03% of base)
          18 : Microsoft.Extensions.Options.dasm (0.08% of base)
          18 : System.Memory.dasm (0.01% of base)
          14 : Microsoft.Extensions.Caching.Abstractions.dasm (0.06% of base)
          12 : Newtonsoft.Json.dasm (0.00% of base)
          12 : xunit.core.dasm (0.02% of base)
          11 : System.Private.DataContractSerialization.dasm (0.00% of base)
          10 : System.Private.Xml.dasm (0.00% of base)
           9 : System.Collections.Concurrent.dasm (0.00% of base)
           9 : xunit.assert.dasm (0.01% of base)
           9 : xunit.execution.dotnet.dasm (0.00% of base)
Top file improvements (bytes):
        -163 : System.Linq.Parallel.dasm (-0.01% of base)
         -10 : System.Private.Xml.Linq.dasm (-0.01% of base)
          -4 : System.Reflection.Metadata.dasm (-0.00% of base)
          -4 : System.Threading.Tasks.Parallel.dasm (-0.00% of base)
          -3 : System.Diagnostics.Process.dasm (-0.00% of base)
          -3 : System.Net.Http.dasm (-0.00% of base)
          -3 : System.Net.Http.Json.dasm (-0.01% of base)
          -2 : Microsoft.CodeAnalysis.CSharp.dasm (-0.00% of base)
36 total files with Code Size differences (8 improved, 28 regressed), 232 unchanged.
Top method regressions (bytes):
         137 ( 0.32% of base) : Microsoft.CodeAnalysis.dasm - ArrayBuilder`1:ToDictionary(Func`2,IEqualityComparer`1):Dictionary`2:this (49 methods)
          66 ( 0.39% of base) : CommandLine.dasm - <>c__DisplayClass0_0`1:b__7():ParserResult`1:this (7 methods)
          52 ( 0.21% of base) : System.Linq.Parallel.dasm - DistinctQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods)
          49 ( 0.21% of base) : FSharp.Core.dasm - FSharpMailboxProcessor`1:PostAndTryAsyncReply(FSharpFunc`2,FSharpOption`1):FSharpAsync`1:this (49 methods)
          43 ( 0.93% of base) : System.Private.CoreLib.dasm - ReferenceTypeHelper`1:GetPropertyGetter(PropertyInfo):Func`2:this
          37 ( 4.12% of base) : System.Threading.Tasks.Dataflow.dasm - ObserversState:ProcessItemAsync(__Canon):Task:this
          35 ( 0.16% of base) : FSharp.Core.dasm - FSharpMailboxProcessor`1:PostAndAsyncReply(FSharpFunc`2,FSharpOption`1):FSharpAsync`1:this (49 methods)
          35 ( 0.13% of base) : Microsoft.CodeAnalysis.dasm - AsyncQueue`1:WithCancellation(Task`1,CancellationToken):Task`1 (49 methods)
          33 ( 0.11% of base) : System.Linq.Parallel.dasm - TakeOrSkipWhileQueryOperator`1:WrapHelper(PartitionedStream`2,IPartitionedStreamRecipient`1,QuerySettings):this (49 methods)
          31 ( 0.96% of base) : FSharp.Core.dasm - Parallel@1240-3:Invoke(int,FSharpAsync`1):Unit:this (7 methods)
          30 ( 2.52% of base) : System.Linq.dasm - AppendPrepend1Iterator`1:GetCount(bool):int:this (7 methods)
          28 ( 0.25% of base) : FSharp.Core.dasm - FSharpSet`1:Map(FSharpFunc`2):FSharpSet`1:this (49 methods)
          26 ( 0.60% of base) : FSharp.Core.dasm - StartChild@1649-1:Invoke(CancellationToken):FSharpAsync`1:this (7 methods)
          26 ( 0.59% of base) : System.Linq.Parallel.dasm - ElementAtQueryOperator`1:Aggregate(byref,bool):bool:this (7 methods)
          23 ( 0.26% of base) : System.Private.CoreLib.dasm - TlsOverPerCoreLockedStacksArrayPool`1:Trim():bool:this (9 methods)
          22 ( 0.41% of base) : System.Linq.dasm - Enumerable:Select(IEnumerable`1,Func`2):IEnumerable`1 (7 methods)
          21 ( 0.07% of base) : System.Linq.Parallel.dasm - ElementAtQueryOperator`1:WrapPartitionedStream(PartitionedStream`2,IPartitionedStreamRecipient`1,bool,QuerySettings):this (49 methods)
          20 ( 0.72% of base) : FSharp.Core.dasm - TryScan@233:Invoke(Unit):FSharpAsync`1:this (7 methods)
          20 ( 0.26% of base) : System.Linq.dasm - Enumerable:SequenceEqual(IEnumerable`1,IEnumerable`1,IEqualityComparer`1):bool (7 methods)
          20 ( 0.45% of base) : System.Linq.dasm - Enumerable:ToDictionary(IEnumerable`1,Func`2,IEqualityComparer`1):Dictionary`2 (7 methods)
Top method improvements (bytes):
        -154 (-0.58% of base) : System.Linq.Parallel.dasm - IndexedWhereQueryOperator`1:WrapPartitionedStream(PartitionedStream`2,IPartitionedStreamRecipient`1,bool,QuerySettings):this (49 methods)
         -94 (-0.26% of base) : System.Linq.Parallel.dasm - ExceptQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods)
         -94 (-0.23% of base) : System.Linq.Parallel.dasm - IntersectQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods)
         -35 (-0.76% of base) : FSharp.Core.dasm - Choice@1291-1:Invoke(AsyncActivation`1):AsyncReturn:this (7 methods)
         -28 (-0.12% of base) : System.Linq.Parallel.dasm - MergeExecutor`1:Execute(PartitionedStream`2,bool,int,TaskScheduler,bool,CancellationState,int):MergeExecutor`1 (49 methods)
         -14 (-0.21% of base) : CommandLine.dasm - EnumerableExtensions:FoldImpl(IEnumerable`1,int,Func`2,Func`3,Func`4,Func`5):long (7 methods)
         -14 (-0.17% of base) : System.Linq.Parallel.dasm - ParallelEnumerable:SequenceEqual(ParallelQuery`1,ParallelQuery`1,IEqualityComparer`1):bool (7 methods)
         -13 (-1.52% of base) : System.Linq.dasm - Enumerable:TryGetLast(IEnumerable`1,Func`2,byref):__Canon
         -13 (-0.03% of base) : System.Linq.Parallel.dasm - UnionQueryOperator`1:WrapPartitionedStreamFixedBothTypes(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,int,CancellationToken):this (49 methods)
         -10 (-0.21% of base) : System.Collections.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods)
         -10 (-0.21% of base) : System.Linq.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods)
         -10 (-0.63% of base) : System.Private.CoreLib.dasm - RuntimeHelpers:GetSubArray(ref,Range):ref (7 methods)
         -10 (-0.21% of base) : System.Private.Xml.Linq.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods)
          -9 (-0.06% of base) : FSharp.Core.dasm - SetTreeModule:compareStacks(IComparer`1,FSharpList`1,FSharpList`1):int (7 methods)
          -9 (-0.24% of base) : System.Reflection.Metadata.dasm - CustomAttributeDecoder`1:DecodeNamedArgumentType(byref,bool):ArgumentTypeInfo:this (7 methods)
          -7 (-0.18% of base) : Microsoft.CodeAnalysis.dasm - EnumerableExtensions:Do(IEnumerable`1,Action`1):IEnumerable`1 (7 methods)
          -7 (-0.62% of base) : Microsoft.CodeAnalysis.dasm - d__3`1:<>m__Finally1():this (7 methods)
          -6 (-0.16% of base) : System.Linq.Parallel.dasm - DefaultMergeHelper`2:.ctor(PartitionedStream`2,bool,int,TaskScheduler,CancellationState,int):this (7 methods)
          -6 (-0.29% of base) : System.Linq.Parallel.dasm - PartitionedDataSource`1:MakePartitions(IEnumerator`1,int):ref (7 methods)
          -4 (-0.12% of base) : FSharp.Core.dasm - ExtraTopLevelOperators:dictRefType(IEnumerable`1):DictImpl`3 (7 methods)
Top method regressions (percentages):
           6 ( 5.56% of base) : System.Collections.dasm - <>c__DisplayClass34_0:b__0(Node):bool:this (7 methods)
           8 ( 4.12% of base) : System.Private.CoreLib.dasm - Table:Insert(__Canon,__Canon):this
          37 ( 4.12% of base) : System.Threading.Tasks.Dataflow.dasm - ObserversState:ProcessItemAsync(__Canon):Task:this
          30 ( 2.52% of base) : System.Linq.dasm - AppendPrepend1Iterator`1:GetCount(bool):int:this (7 methods)
           4 ( 2.07% of base) : FSharp.Core.dasm - TryScan@243-1:Invoke(__Canon):FSharpAsync`1:this
           4 ( 2.07% of base) : FSharp.Core.dasm - scan@215-2:Invoke(__Canon):FSharpAsync`1:this
           4 ( 2.07% of base) : FSharp.Core.dasm - scanNoTimeout@228-2:Invoke(__Canon):FSharpAsync`1:this
           4 ( 2.07% of base) : FSharp.Core.dasm - PostAndTryAsyncReply@398-6:Invoke(__Canon):FSharpAsync`1:this
          15 ( 1.84% of base) : System.Threading.Channels.dasm - ChannelReader`1:g__ReadAsyncCore|8_0(CancellationToken):ValueTask`1:this (7 methods)
           3 ( 1.68% of base) : System.Reflection.Metadata.dasm - PEHeaders:GetContainingSectionIndex(int):int:this
          10 ( 1.64% of base) : System.Linq.dasm - Enumerable:TryGetFirst(IEnumerable`1,byref):__Canon
           8 ( 1.56% of base) : System.Linq.dasm - Enumerable:SingleOrDefault(IEnumerable`1):__Canon
           9 ( 1.39% of base) : System.Linq.dasm - Enumerable:TryGetLast(IEnumerable`1,byref):__Canon
           4 ( 1.37% of base) : Microsoft.Extensions.Options.ConfigurationExtensions.dasm - NamedConfigureFromConfigurationOptions`1:.ctor(String,IConfiguration,Action`1):this
          14 ( 1.33% of base) : System.Linq.Expressions.dasm - UpdateDelegates:UpdateAndExecuteVoid2(CallSite,__Canon,long)
          19 ( 1.32% of base) : Microsoft.Extensions.Options.dasm - OptionsFactory`1:Create(String):__Canon:this
           2 ( 1.28% of base) : System.Private.CoreLib.dasm - Vector256`1:Equals(Vector256`1):bool:this (6 methods)
          12 ( 1.27% of base) : FSharp.Core.dasm - TryReceive@278:Invoke(Unit):FSharpAsync`1:this (7 methods)
          12 ( 1.27% of base) : System.Linq.dasm - Buffer`1:.ctor(IEnumerable`1):this (7 methods)
          10 ( 1.23% of base) : FSharp.Core.dasm - StartChildAsTask@1166-1:Invoke(CancellationToken):FSharpAsync`1:this (7 methods)
Top method improvements (percentages):
         -13 (-1.52% of base) : System.Linq.dasm - Enumerable:TryGetLast(IEnumerable`1,Func`2,byref):__Canon
         -35 (-0.76% of base) : FSharp.Core.dasm - Choice@1291-1:Invoke(AsyncActivation`1):AsyncReturn:this (7 methods)
         -10 (-0.63% of base) : System.Private.CoreLib.dasm - RuntimeHelpers:GetSubArray(ref,Range):ref (7 methods)
          -7 (-0.62% of base) : Microsoft.CodeAnalysis.dasm - d__3`1:<>m__Finally1():this (7 methods)
        -154 (-0.58% of base) : System.Linq.Parallel.dasm - IndexedWhereQueryOperator`1:WrapPartitionedStream(PartitionedStream`2,IPartitionedStreamRecipient`1,bool,QuerySettings):this (49 methods)
          -3 (-0.40% of base) : Microsoft.Extensions.Options.dasm - OptionsMonitor`1:.ctor(IOptionsFactory`1,IEnumerable`1,IOptionsMonitorCache`1):this
          -6 (-0.29% of base) : System.Linq.Parallel.dasm - PartitionedDataSource`1:MakePartitions(IEnumerator`1,int):ref (7 methods)
         -94 (-0.26% of base) : System.Linq.Parallel.dasm - ExceptQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods)
          -9 (-0.24% of base) : System.Reflection.Metadata.dasm - CustomAttributeDecoder`1:DecodeNamedArgumentType(byref,bool):ArgumentTypeInfo:this (7 methods)
         -94 (-0.23% of base) : System.Linq.Parallel.dasm - IntersectQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods)
          -2 (-0.22% of base) : FSharp.Core.dasm - AsyncPrimitives:RunSynchronouslyInAnotherThread(CancellationToken,FSharpAsync`1,FSharpOption`1):__Canon
         -14 (-0.21% of base) : CommandLine.dasm - EnumerableExtensions:FoldImpl(IEnumerable`1,int,Func`2,Func`3,Func`4,Func`5):long (7 methods)
         -10 (-0.21% of base) : System.Collections.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods)
         -10 (-0.21% of base) : System.Linq.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods)
         -10 (-0.21% of base) : System.Private.Xml.Linq.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods)
          -7 (-0.18% of base) : Microsoft.CodeAnalysis.dasm - EnumerableExtensions:Do(IEnumerable`1,Action`1):IEnumerable`1 (7 methods)
         -14 (-0.17% of base) : System.Linq.Parallel.dasm - ParallelEnumerable:SequenceEqual(ParallelQuery`1,ParallelQuery`1,IEqualityComparer`1):bool (7 methods)
          -3 (-0.17% of base) : FSharp.Core.dasm - AsyncPrimitives:StartAsTask(CancellationToken,FSharpAsync`1,FSharpOption`1):Task`1 (7 methods)
          -6 (-0.16% of base) : System.Linq.Parallel.dasm - DefaultMergeHelper`2:.ctor(PartitionedStream`2,bool,int,TaskScheduler,CancellationState,int):this (7 methods)
          -3 (-0.13% of base) : FSharp.Core.dasm - FSharpAsync:StartImmediateAsTask(FSharpAsync`1,FSharpOption`1):Task`1 (7 methods)
221 total methods with Code Size differences (37 improved, 184 regressed), 253375 unchanged.
Completed analysis in 35.03s

@AndyAyersMS
Copy link
Member

Thanks, this is good. Wanted to confirm that the effects weren't widespread, and that there weren't any real outsized diffs.

@trylek
Copy link
Member Author

trylek commented Aug 7, 2020

@AndyAyersMS - I've spent some time trying to modify the ExpansionPerf test as you suggested. While I clearly see that without the fix the test is about 10 times slower than with it, I'm unable to measure any major difference between the "inlined" and "non-inlined" version you proposed above. It seems to me that, with the CSE fix, even the "inlined" version repeatedly fetches the dictionary pointer during the loop so that it silently picks up the dictionary expansion and switches over to the fast path. This is how one of the "is List<T>"-like constructs looks like in the jit-diff disassembly:

       mov      r15, qword ptr [rsi+16]
       mov      rcx, r15
       mov      r12, qword ptr [rcx+8]
       cmp      r12, 48
       jle      SHORT G_M59274_IG25
						;; bbWeight=4    PerfScore 22.00
G_M59274_IG24:
       mov      rcx, qword ptr [rcx+48]
       test     rcx, rcx
       je       SHORT G_M59274_IG25
       jmp      SHORT G_M59274_IG26
						;; bbWeight=1.60 PerfScore 8.40
G_M59274_IG25:
       mov      rcx, rsi
       mov      rdx, 0xD1FFAB1E
       call     CORINFO_HELP_RUNTIMEHANDLE_METHOD
       mov      rcx, rax
						;; bbWeight=0.40 PerfScore 0.70
G_M59274_IG26:
       mov      rdx, r14
       call     CORINFO_HELP_ISINSTANCEOFCLASS
       test     rax, rax
       je       SHORT G_M59274_IG28
						;; bbWeight=4    PerfScore 10.00
G_M59274_IG27:
       inc      ebx
						;; bbWeight=2    PerfScore 0.50
G_M59274_IG28:

For now I plan to remove the ExpansionPerf test from the PR so that I can retest and merge in the functional change to give it some bake time; I'll follow up on the test as a subsequent step because the original repro does seem to have the characteristics we need - i.e. running the entire first pass on the "slow resolution path" and only then switching over to the new dictionary - but it will likely take me a bit to figure out the exact formulation of the test so that it retains these characteristics.

trylek added 8 commits August 7, 2020 19:24
Initial simplistic implementation based on my current understanding
of the code. I may be easily missing some extra places in the code
that need fixing. For now I've kept the instrumentation in
genericdict.cpp.

Thanks

Tomas
Also put back the original number of iterations (1M); I reduced it
to 100K in my local testing to let me run the test on checked runtime.

Thanks

Tomas
@trylek trylek force-pushed the DictionaryExpansion branch from dfab03e to ce0c608 Compare August 7, 2020 17:25
@trylek trylek merged commit d1c7946 into dotnet:master Aug 7, 2020
@trylek trylek deleted the DictionaryExpansion branch August 7, 2020 23:17
Jacksondr5 pushed a commit to Jacksondr5/runtime that referenced this pull request Aug 10, 2020
…et#40355)

Fix perf regression after introduction of expandable generic
dictionaries. When we expand a generic dictionary, we should back
propagate the change to previously allocated shorter versions of
the same dictionary, otherwise already running JITted code may
continue referring to the "old version" of the dictionary implying
repeated slow path generic lookups.

Thanks

Tomas
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5x performance regression from 3.1 to 5.0 due to generic dictionary CSE
3 participants