Continue updating previous generic dictionaries after expansion #40355

trylek · 2020-08-05T01:22:20Z

Initial simplistic implementation based on my current understanding
of the code. I may be easily missing some extra places in the code
that need fixing. For now I've kept the instrumentation in
genericdict.cpp.

Fixes: #40298

Thanks

Tomas

jkotas · 2020-08-05T02:01:57Z

src/coreclr/src/vm/generics.cpp


    // Allocate from the high frequence heap of the correct domain
    S_SIZE_T allocSize = safe_cbMT;
    allocSize += cbOptional;
    allocSize += cbIMap;
    allocSize += cbPerInst;
-    allocSize += cbInstAndDict;
+    allocSize += cbInstAndDictPlusBackPointer;


Nit: The starting dictionary does not really need to have the back pointer. It is always going to be NULL. You can tell that it is the starting dictionary by comparing actual dictionary size with GetNumInitialSlots

Hmm, methodtablebuilder seems to have its own calculation of the initial dictionary size around

runtime/src/coreclr/src/vm/methodtablebuilder.cpp

Line 7081 in 5fd9958

DWORD estNumTypeSlots = (numMethodsAdjusted * nTypeFactorBy2 + 2) / 3;

In my local debugging, type dictionaries usually had initial size equal to 0x200; can you please clarify what the check should look like in this case?

I was originally trying to mark the initial directory with zero bit set in the size but for now I rolled it back as an optimization that can be addressed later once the basic algorithmic change is shown to work.

OK, I think I finally understand what you meant - using the initial slot count in the DictionaryLayout for the check. Fixed in 3rd commit, thanks for the suggestion.

Hmm, I must be still missing something here, apparently it still doesn't work. The code at

runtime/src/coreclr/src/vm/genmeth.cpp

Line 396 in 06963ea

infoSize = DictionaryLayout::GetDictionarySizeFromLayout(methodInst.GetNumArgs(), pDL);

apparently allocates an "initial dictionary" with size equal to "current method dictionary layout size" which is different from the "initial slot count" on the dictionary layout. With the back pointers everything seems to be working so I hope this is the last hurdle I need help with.

I think this can be solved by moving the accounting for the back-pointer into GetDictionarySizeFromLayout:

... if (pDictLayout != NULL) { WORD slots = VolatileLoadWithoutBarrier(&pDictLayout->m_numSlots); bytes += sizeof(TADDR); // Slot for dictionary size bytes += slots * sizeof(TADDR); // Slots for dictionary slots based on a dictionary layout if (pDictLayout->m_numInitialSlots < slots) { bytes += sizeof(PTR_Dictionary); // Dictionary back-pointer for expanded dictionaries } }

And adding an extra condition for exiting the patching loop on null back-pointers.

Sounds great, will do, thank you!

Hehe, it's got an interesting wrinkle - we still need two different methods otherwise we'd change the existing semantics of the "size slot"; I have basically kept your idea but I had to introduce a pair of functions, GetDictionarySlotSizeFromLayout vs. GetDictionaryAllocSizeFromLayout where the former is the value of the "size slot" and the latter is the "total number of bytes" including the back pointer.

Fixed in 5th commit (using the two methods). It's somewhat uglier than what you originally proposed but it's still relatively well encapsulated I believe.

jkotas · 2020-08-05T02:50:36Z

This will also need a fix in the JIT to reduce how much CSE the JIT does for dictionary entries. The problem is that the Dictionary* is CSEed in the JITed code, and the JITed code never sees the updated Dictionary*. The code for that is in Compiler::impRuntimeLookupToTree. We need to stop setting GTF_IND_INVARIANT on the last indirection when sizeOffset != CORINFO_NO_SIZE_CHECK.

trylek · 2020-08-05T10:39:04Z

Adding JIT folks for reviewing the JIT change, please advise me if there are any special tests I should run and / or whether I should produce some JIT diffs from this change. Locally it seems to be passing and fixing the regression. I'm also not clear on what to do with the "regression test" - I can probably hardcode some sufficiently large timeout margin into it so that it fails upon exceeding it; alternatively we can somehow incorporate it in our perf test system.

AndyAyersMS

JIT changes seem ok modulo some formatting.

This will lead to asm diffs; it would be good to characterize these.

AndyAyersMS · 2020-08-05T22:01:44Z

src/coreclr/src/jit/importer.cpp

@@ -2117,11 +2117,17 @@ GenTree* Compiler::impRuntimeLookupToTree(CORINFO_RESOLVED_TOKEN* pResolvedToken
                                      nullptr DEBUGARG("impRuntimeLookup indirectOffset"));
        }

+        // The last indirection could be subject to a size check (dynamic dictionary expansion)
+        bool isLastIndirectionWithSizeCheck = (i == pRuntimeLookup->indirections - 1 && pRuntimeLookup->sizeOffset != CORINFO_NO_SIZE_CHECK);


We usually fully parenthesize:

Suggested change

bool isLastIndirectionWithSizeCheck = (i == pRuntimeLookup->indirections - 1 && pRuntimeLookup->sizeOffset != CORINFO_NO_SIZE_CHECK);

bool isLastIndirectionWithSizeCheck = ((i == pRuntimeLookup->indirections - 1) && (pRuntimeLookup->sizeOffset != CORINFO_NO_SIZE_CHECK));

Also, will likely require clang-formatting.

I have fixed the formatting. I'll look into clang formatting and providing the assembly diffs as the next step.

AndyAyersMS · 2020-08-05T22:22:11Z

Testing is tricky as this isn't something BDN will normally capture.

I wonder if you can engineer a test as follows:

a baseline version that is a generic method over some ref T that's called once and loops, and within the loop always refers to various types dependent on T, eg if (t is List<T>) and so on. IIRC having 4 or 5 of these guarantees a slow lookup for at least one.
an alternate version that outlines the body of the loop into a non-inlined method with the same sequence of type references.

Without dictionary updates the baseline version should always have slow iterations as it will always refer to the initial dictionary. The alternate version should eventually get fast ones as it will refer to the expanded dictionary. So you can compare the times for the last N iterations of each; they should be similar if the bug is fixed.

jkotas · 2020-08-05T23:04:59Z

src/coreclr/src/vm/genericdict.cpp

 // 
 //static
 DWORD 
-DictionaryLayout::GetDictionarySizeFromLayout(
+DictionaryLayout::GetDictionarySlotSizeFromLayout(


I would make it one method with DWORD* ppSlotSize argument to make sure that you always get consistent snapshot for slot size and alloc size.

This method is sometimes called without locks where one thread is reading the layout and other thread is expanding it. It is why I had Volatile... in my code snippet. Having one method to return both alloc and slot sizes will avoid bugs with somebody accidentally changes the order in which GetDictionarySlotSizeFromLayout and GetDictionaryAllocSizeFromLayout are called.

Good point, sorry about not realizing this, fixed in 6th commit.

jkotas

LGTM. Once it builds...

jkotas · 2020-08-06T17:02:29Z

probably too long source line

It is best to download the .patch file from the CI artifacts and apply it.

trylek · 2020-08-06T17:20:05Z

I saw in the patch that it was just this one line split so I applied it manually. I absolutely agree that for larger diffs it would be much easier to apply the patch mechanically.

AndyAyersMS · 2020-08-06T17:59:27Z

FYI, if you have jitutils tools available, there is a jit-format command that will do formatting for you; you can then do this locally rather than waiting for the CI to generate the patch. It should also be possible to reformat from visual studio or vscode as they should pick up the jit's .clang-format, though I haven't tried these.

trylek · 2020-08-06T20:25:08Z

OK, I have finally managed to produce the jit-diff for the ExpansionPerf test. As expected, the only difference is in the Test.TestDict method.

Baseline: https://gist.github.com/trylek/644d79473dcc802a7a9729580c5e7a4e
Diff: https://gist.github.com/trylek/e1e65fc9777efa6a416498d23697b9c4

JIT-diff console output:

D:\git\runtime5>jit-diff diff --core_root D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root --assembly D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Loader\classloader\DictionaryExpansion\ExpansionPerf\ExpansionPerf.dll --base D:\git\runtime3\artifacts\bin\CoreCLR\Windows_NT.x64.Checked --diff D:\git\runtime5\artifacts\bin\CoreCLR\Windows_NT.x64.Checked --arch x64 --build Checked --pmi
Using --output D:\git\runtime5\artifacts\diffs
Beginning PMI CodeSize Diffs for D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Loader\classloader\DictionaryExpansion\ExpansionPerf\ExpansionPerf.dll
\ Finished 1/1 Base 1/1 Diff [3.6 sec]
Completed PMI CodeSize Diffs for D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Loader\classloader\DictionaryExpansion\ExpansionPerf\ExpansionPerf.dll in 3.57s
Diffs (if any) can be viewed by comparing: D:\git\runtime5\artifacts\diffs\dasmset_6\base D:\git\runtime5\artifacts\diffs\dasmset_6\diff
Analyzing CodeSize diffs...
Found 1 files with textual diffs.
PMI CodeSize Diffs for D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Loader\classloader\DictionaryExpansion\ExpansionPerf\ExpansionPerf.dll for x64 default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: -7 (-0.20% of base)
    diff is an improvement.
Top file improvements (bytes):
          -7 : ExpansionPerf.dasm (-0.20% of base)
1 total files with Code Size differences (1 improved, 0 regressed), 0 unchanged.
Top method improvements (bytes):
          -7 (-0.26% of base) : ExpansionPerf.dasm - Test:TestDict(IDictionary`2,Func`2) (7 methods)
Top method improvements (percentages):
          -7 (-0.26% of base) : ExpansionPerf.dasm - Test:TestDict(IDictionary`2,Func`2) (7 methods)
1 total methods with Code Size differences (1 improved, 0 regressed), 20 unchanged.
Completed analysis in 0.30s

At the first glance, the minute improvement is caused by the fact that the suppressed common subexpression elimination stops storing the Dictionary * in a register and instead repeatedly loads it using an indirect MOV, thus freeing the one register for other optimizations. For example:

Baseline:

G_M27032_IG30:
       movsxd   rcx, r12d
       mov      r13, gword ptr [rbx+8*rcx+16]
       mov      rcx, rbp
       cmp      qword ptr [rcx+8], 56
       jle      SHORT G_M27032_IG32

Diff:

G_M27032_IG30:
       movsxd   rcx, r15d
       mov      r12, gword ptr [rbx+8*rcx+16]
       mov      rcx, qword ptr [rsi+16]
       cmp      qword ptr [rcx+8], 56
       jle      SHORT G_M27032_IG32

@AndyAyersMS - please let me know if you find this summary info sufficient or suggest what else should I do, you know I'm not familiar with this process.

AndyAyersMS · 2020-08-06T21:20:45Z

I was looking for a broader assessment over all the framework assemblies (pass -f instead of --assembly <...>)

trylek · 2020-08-06T21:59:09Z

@AndyAyersMS - thanks for clarifying. This is the summary; please let me what other artifacts or data you're interested in.

D:\git\runtime5>jit-diff diff --core_root D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root -f --base D:\git\runtime3\artifacts\bin\CoreCLR\Windows_NT.x64.Checked --diff D:\git\runtime5\artifacts\bin\CoreCLR\Windows_NT.x64.Checked --arch x64 --build Checked --pmi
Using --output D:\git\runtime5\artifacts\diffs
Beginning PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies
\ Finished 136/268 Base 0/268 Diff [140.2 sec]Error running D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\corerun.exe on D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\FSharp.Core.dll
1 errors compiling set.
Dasm command "jit-dasm-pmi --platform D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root --corerun D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\corerun.exe --jit D:\git\runtime3\artifacts\bin\CoreCLR\Windows_NT.x64.Checked\clrjit.dll --nocopy --output D:\git\runtime5\artifacts\diffs\dasmset_7\base D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\FSharp.Core.dll" returned with 1 failures
/ Finished 268/268 Base 261/268 Diff [398.8 sec]Error running D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\corerun.exe on D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\FSharp.Core.dll
1 errors compiling set.
Dasm command "jit-dasm-pmi --platform D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root --corerun D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\corerun.exe --jit D:\git\runtime5\artifacts\bin\CoreCLR\Windows_NT.x64.Checked\clrjit.dll --nocopy --output D:\git\runtime5\artifacts\diffs\dasmset_7\diff D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\FSharp.Core.dll" returned with 1 failures
| Finished 268/268 Base 268/268 Diff [446.5 sec]
Dasm commands returned 1 base failures, 1 diff failures.
Completed PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies in 446.68s
Diffs (if any) can be viewed by comparing: D:\git\runtime5\artifacts\diffs\dasmset_7\base D:\git\runtime5\artifacts\diffs\dasmset_7\diff
Analyzing CodeSize diffs...
Found 36 files with textual diffs.
PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: 1304 (0.00% of base)
    diff is a regression.
Top file regressions (bytes):
         487 : FSharp.Core.dasm (0.03% of base)
         193 : System.Linq.dasm (0.02% of base)
         188 : Microsoft.CodeAnalysis.dasm (0.01% of base)
         138 : System.Private.CoreLib.dasm (0.00% of base)
         112 : CommandLine.dasm (0.02% of base)
          93 : System.Threading.Tasks.Dataflow.dasm (0.01% of base)
          35 : System.Linq.Expressions.dasm (0.00% of base)
          35 : System.Threading.Channels.dasm (0.02% of base)
          28 : System.Collections.Immutable.dasm (0.00% of base)
          27 : System.Composition.Hosting.dasm (0.03% of base)
          18 : Microsoft.Extensions.Options.dasm (0.08% of base)
          18 : System.Memory.dasm (0.01% of base)
          14 : Microsoft.Extensions.Caching.Abstractions.dasm (0.06% of base)
          12 : Newtonsoft.Json.dasm (0.00% of base)
          12 : xunit.core.dasm (0.02% of base)
          11 : System.Private.DataContractSerialization.dasm (0.00% of base)
          10 : System.Private.Xml.dasm (0.00% of base)
           9 : System.Collections.Concurrent.dasm (0.00% of base)
           9 : xunit.assert.dasm (0.01% of base)
           9 : xunit.execution.dotnet.dasm (0.00% of base)
Top file improvements (bytes):
        -163 : System.Linq.Parallel.dasm (-0.01% of base)
         -10 : System.Private.Xml.Linq.dasm (-0.01% of base)
          -4 : System.Reflection.Metadata.dasm (-0.00% of base)
          -4 : System.Threading.Tasks.Parallel.dasm (-0.00% of base)
          -3 : System.Diagnostics.Process.dasm (-0.00% of base)
          -3 : System.Net.Http.dasm (-0.00% of base)
          -3 : System.Net.Http.Json.dasm (-0.01% of base)
          -2 : Microsoft.CodeAnalysis.CSharp.dasm (-0.00% of base)
36 total files with Code Size differences (8 improved, 28 regressed), 232 unchanged.
Top method regressions (bytes):
         137 ( 0.32% of base) : Microsoft.CodeAnalysis.dasm - ArrayBuilder`1:ToDictionary(Func`2,IEqualityComparer`1):Dictionary`2:this (49 methods)
          66 ( 0.39% of base) : CommandLine.dasm - <>c__DisplayClass0_0`1:b__7():ParserResult`1:this (7 methods)
          52 ( 0.21% of base) : System.Linq.Parallel.dasm - DistinctQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods)
          49 ( 0.21% of base) : FSharp.Core.dasm - FSharpMailboxProcessor`1:PostAndTryAsyncReply(FSharpFunc`2,FSharpOption`1):FSharpAsync`1:this (49 methods)
          43 ( 0.93% of base) : System.Private.CoreLib.dasm - ReferenceTypeHelper`1:GetPropertyGetter(PropertyInfo):Func`2:this
          37 ( 4.12% of base) : System.Threading.Tasks.Dataflow.dasm - ObserversState:ProcessItemAsync(__Canon):Task:this
          35 ( 0.16% of base) : FSharp.Core.dasm - FSharpMailboxProcessor`1:PostAndAsyncReply(FSharpFunc`2,FSharpOption`1):FSharpAsync`1:this (49 methods)
          35 ( 0.13% of base) : Microsoft.CodeAnalysis.dasm - AsyncQueue`1:WithCancellation(Task`1,CancellationToken):Task`1 (49 methods)
          33 ( 0.11% of base) : System.Linq.Parallel.dasm - TakeOrSkipWhileQueryOperator`1:WrapHelper(PartitionedStream`2,IPartitionedStreamRecipient`1,QuerySettings):this (49 methods)
          31 ( 0.96% of base) : FSharp.Core.dasm - Parallel@1240-3:Invoke(int,FSharpAsync`1):Unit:this (7 methods)
          30 ( 2.52% of base) : System.Linq.dasm - AppendPrepend1Iterator`1:GetCount(bool):int:this (7 methods)
          28 ( 0.25% of base) : FSharp.Core.dasm - FSharpSet`1:Map(FSharpFunc`2):FSharpSet`1:this (49 methods)
          26 ( 0.60% of base) : FSharp.Core.dasm - StartChild@1649-1:Invoke(CancellationToken):FSharpAsync`1:this (7 methods)
          26 ( 0.59% of base) : System.Linq.Parallel.dasm - ElementAtQueryOperator`1:Aggregate(byref,bool):bool:this (7 methods)
          23 ( 0.26% of base) : System.Private.CoreLib.dasm - TlsOverPerCoreLockedStacksArrayPool`1:Trim():bool:this (9 methods)
          22 ( 0.41% of base) : System.Linq.dasm - Enumerable:Select(IEnumerable`1,Func`2):IEnumerable`1 (7 methods)
          21 ( 0.07% of base) : System.Linq.Parallel.dasm - ElementAtQueryOperator`1:WrapPartitionedStream(PartitionedStream`2,IPartitionedStreamRecipient`1,bool,QuerySettings):this (49 methods)
          20 ( 0.72% of base) : FSharp.Core.dasm - TryScan@233:Invoke(Unit):FSharpAsync`1:this (7 methods)
          20 ( 0.26% of base) : System.Linq.dasm - Enumerable:SequenceEqual(IEnumerable`1,IEnumerable`1,IEqualityComparer`1):bool (7 methods)
          20 ( 0.45% of base) : System.Linq.dasm - Enumerable:ToDictionary(IEnumerable`1,Func`2,IEqualityComparer`1):Dictionary`2 (7 methods)
Top method improvements (bytes):
        -154 (-0.58% of base) : System.Linq.Parallel.dasm - IndexedWhereQueryOperator`1:WrapPartitionedStream(PartitionedStream`2,IPartitionedStreamRecipient`1,bool,QuerySettings):this (49 methods)
         -94 (-0.26% of base) : System.Linq.Parallel.dasm - ExceptQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods)
         -94 (-0.23% of base) : System.Linq.Parallel.dasm - IntersectQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods)
         -35 (-0.76% of base) : FSharp.Core.dasm - Choice@1291-1:Invoke(AsyncActivation`1):AsyncReturn:this (7 methods)
         -28 (-0.12% of base) : System.Linq.Parallel.dasm - MergeExecutor`1:Execute(PartitionedStream`2,bool,int,TaskScheduler,bool,CancellationState,int):MergeExecutor`1 (49 methods)
         -14 (-0.21% of base) : CommandLine.dasm - EnumerableExtensions:FoldImpl(IEnumerable`1,int,Func`2,Func`3,Func`4,Func`5):long (7 methods)
         -14 (-0.17% of base) : System.Linq.Parallel.dasm - ParallelEnumerable:SequenceEqual(ParallelQuery`1,ParallelQuery`1,IEqualityComparer`1):bool (7 methods)
         -13 (-1.52% of base) : System.Linq.dasm - Enumerable:TryGetLast(IEnumerable`1,Func`2,byref):__Canon
         -13 (-0.03% of base) : System.Linq.Parallel.dasm - UnionQueryOperator`1:WrapPartitionedStreamFixedBothTypes(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,int,CancellationToken):this (49 methods)
         -10 (-0.21% of base) : System.Collections.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods)
         -10 (-0.21% of base) : System.Linq.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods)
         -10 (-0.63% of base) : System.Private.CoreLib.dasm - RuntimeHelpers:GetSubArray(ref,Range):ref (7 methods)
         -10 (-0.21% of base) : System.Private.Xml.Linq.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods)
          -9 (-0.06% of base) : FSharp.Core.dasm - SetTreeModule:compareStacks(IComparer`1,FSharpList`1,FSharpList`1):int (7 methods)
          -9 (-0.24% of base) : System.Reflection.Metadata.dasm - CustomAttributeDecoder`1:DecodeNamedArgumentType(byref,bool):ArgumentTypeInfo:this (7 methods)
          -7 (-0.18% of base) : Microsoft.CodeAnalysis.dasm - EnumerableExtensions:Do(IEnumerable`1,Action`1):IEnumerable`1 (7 methods)
          -7 (-0.62% of base) : Microsoft.CodeAnalysis.dasm - d__3`1:<>m__Finally1():this (7 methods)
          -6 (-0.16% of base) : System.Linq.Parallel.dasm - DefaultMergeHelper`2:.ctor(PartitionedStream`2,bool,int,TaskScheduler,CancellationState,int):this (7 methods)
          -6 (-0.29% of base) : System.Linq.Parallel.dasm - PartitionedDataSource`1:MakePartitions(IEnumerator`1,int):ref (7 methods)
          -4 (-0.12% of base) : FSharp.Core.dasm - ExtraTopLevelOperators:dictRefType(IEnumerable`1):DictImpl`3 (7 methods)
Top method regressions (percentages):
           6 ( 5.56% of base) : System.Collections.dasm - <>c__DisplayClass34_0:b__0(Node):bool:this (7 methods)
           8 ( 4.12% of base) : System.Private.CoreLib.dasm - Table:Insert(__Canon,__Canon):this
          37 ( 4.12% of base) : System.Threading.Tasks.Dataflow.dasm - ObserversState:ProcessItemAsync(__Canon):Task:this
          30 ( 2.52% of base) : System.Linq.dasm - AppendPrepend1Iterator`1:GetCount(bool):int:this (7 methods)
           4 ( 2.07% of base) : FSharp.Core.dasm - TryScan@243-1:Invoke(__Canon):FSharpAsync`1:this
           4 ( 2.07% of base) : FSharp.Core.dasm - scan@215-2:Invoke(__Canon):FSharpAsync`1:this
           4 ( 2.07% of base) : FSharp.Core.dasm - scanNoTimeout@228-2:Invoke(__Canon):FSharpAsync`1:this
           4 ( 2.07% of base) : FSharp.Core.dasm - PostAndTryAsyncReply@398-6:Invoke(__Canon):FSharpAsync`1:this
          15 ( 1.84% of base) : System.Threading.Channels.dasm - ChannelReader`1:g__ReadAsyncCore|8_0(CancellationToken):ValueTask`1:this (7 methods)
           3 ( 1.68% of base) : System.Reflection.Metadata.dasm - PEHeaders:GetContainingSectionIndex(int):int:this
          10 ( 1.64% of base) : System.Linq.dasm - Enumerable:TryGetFirst(IEnumerable`1,byref):__Canon
           8 ( 1.56% of base) : System.Linq.dasm - Enumerable:SingleOrDefault(IEnumerable`1):__Canon
           9 ( 1.39% of base) : System.Linq.dasm - Enumerable:TryGetLast(IEnumerable`1,byref):__Canon
           4 ( 1.37% of base) : Microsoft.Extensions.Options.ConfigurationExtensions.dasm - NamedConfigureFromConfigurationOptions`1:.ctor(String,IConfiguration,Action`1):this
          14 ( 1.33% of base) : System.Linq.Expressions.dasm - UpdateDelegates:UpdateAndExecuteVoid2(CallSite,__Canon,long)
          19 ( 1.32% of base) : Microsoft.Extensions.Options.dasm - OptionsFactory`1:Create(String):__Canon:this
           2 ( 1.28% of base) : System.Private.CoreLib.dasm - Vector256`1:Equals(Vector256`1):bool:this (6 methods)
          12 ( 1.27% of base) : FSharp.Core.dasm - TryReceive@278:Invoke(Unit):FSharpAsync`1:this (7 methods)
          12 ( 1.27% of base) : System.Linq.dasm - Buffer`1:.ctor(IEnumerable`1):this (7 methods)
          10 ( 1.23% of base) : FSharp.Core.dasm - StartChildAsTask@1166-1:Invoke(CancellationToken):FSharpAsync`1:this (7 methods)
Top method improvements (percentages):
         -13 (-1.52% of base) : System.Linq.dasm - Enumerable:TryGetLast(IEnumerable`1,Func`2,byref):__Canon
         -35 (-0.76% of base) : FSharp.Core.dasm - Choice@1291-1:Invoke(AsyncActivation`1):AsyncReturn:this (7 methods)
         -10 (-0.63% of base) : System.Private.CoreLib.dasm - RuntimeHelpers:GetSubArray(ref,Range):ref (7 methods)
          -7 (-0.62% of base) : Microsoft.CodeAnalysis.dasm - d__3`1:<>m__Finally1():this (7 methods)
        -154 (-0.58% of base) : System.Linq.Parallel.dasm - IndexedWhereQueryOperator`1:WrapPartitionedStream(PartitionedStream`2,IPartitionedStreamRecipient`1,bool,QuerySettings):this (49 methods)
          -3 (-0.40% of base) : Microsoft.Extensions.Options.dasm - OptionsMonitor`1:.ctor(IOptionsFactory`1,IEnumerable`1,IOptionsMonitorCache`1):this
          -6 (-0.29% of base) : System.Linq.Parallel.dasm - PartitionedDataSource`1:MakePartitions(IEnumerator`1,int):ref (7 methods)
         -94 (-0.26% of base) : System.Linq.Parallel.dasm - ExceptQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods)
          -9 (-0.24% of base) : System.Reflection.Metadata.dasm - CustomAttributeDecoder`1:DecodeNamedArgumentType(byref,bool):ArgumentTypeInfo:this (7 methods)
         -94 (-0.23% of base) : System.Linq.Parallel.dasm - IntersectQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods)
          -2 (-0.22% of base) : FSharp.Core.dasm - AsyncPrimitives:RunSynchronouslyInAnotherThread(CancellationToken,FSharpAsync`1,FSharpOption`1):__Canon
         -14 (-0.21% of base) : CommandLine.dasm - EnumerableExtensions:FoldImpl(IEnumerable`1,int,Func`2,Func`3,Func`4,Func`5):long (7 methods)
         -10 (-0.21% of base) : System.Collections.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods)
         -10 (-0.21% of base) : System.Linq.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods)
         -10 (-0.21% of base) : System.Private.Xml.Linq.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods)
          -7 (-0.18% of base) : Microsoft.CodeAnalysis.dasm - EnumerableExtensions:Do(IEnumerable`1,Action`1):IEnumerable`1 (7 methods)
         -14 (-0.17% of base) : System.Linq.Parallel.dasm - ParallelEnumerable:SequenceEqual(ParallelQuery`1,ParallelQuery`1,IEqualityComparer`1):bool (7 methods)
          -3 (-0.17% of base) : FSharp.Core.dasm - AsyncPrimitives:StartAsTask(CancellationToken,FSharpAsync`1,FSharpOption`1):Task`1 (7 methods)
          -6 (-0.16% of base) : System.Linq.Parallel.dasm - DefaultMergeHelper`2:.ctor(PartitionedStream`2,bool,int,TaskScheduler,CancellationState,int):this (7 methods)
          -3 (-0.13% of base) : FSharp.Core.dasm - FSharpAsync:StartImmediateAsTask(FSharpAsync`1,FSharpOption`1):Task`1 (7 methods)
221 total methods with Code Size differences (37 improved, 184 regressed), 253375 unchanged.
Completed analysis in 35.03s

AndyAyersMS · 2020-08-06T22:34:57Z

Thanks, this is good. Wanted to confirm that the effects weren't widespread, and that there weren't any real outsized diffs.

trylek · 2020-08-07T17:22:06Z

@AndyAyersMS - I've spent some time trying to modify the ExpansionPerf test as you suggested. While I clearly see that without the fix the test is about 10 times slower than with it, I'm unable to measure any major difference between the "inlined" and "non-inlined" version you proposed above. It seems to me that, with the CSE fix, even the "inlined" version repeatedly fetches the dictionary pointer during the loop so that it silently picks up the dictionary expansion and switches over to the fast path. This is how one of the "is List<T>"-like constructs looks like in the jit-diff disassembly:

       mov      r15, qword ptr [rsi+16]
       mov      rcx, r15
       mov      r12, qword ptr [rcx+8]
       cmp      r12, 48
       jle      SHORT G_M59274_IG25
						;; bbWeight=4    PerfScore 22.00
G_M59274_IG24:
       mov      rcx, qword ptr [rcx+48]
       test     rcx, rcx
       je       SHORT G_M59274_IG25
       jmp      SHORT G_M59274_IG26
						;; bbWeight=1.60 PerfScore 8.40
G_M59274_IG25:
       mov      rcx, rsi
       mov      rdx, 0xD1FFAB1E
       call     CORINFO_HELP_RUNTIMEHANDLE_METHOD
       mov      rcx, rax
						;; bbWeight=0.40 PerfScore 0.70
G_M59274_IG26:
       mov      rdx, r14
       call     CORINFO_HELP_ISINSTANCEOFCLASS
       test     rax, rax
       je       SHORT G_M59274_IG28
						;; bbWeight=4    PerfScore 10.00
G_M59274_IG27:
       inc      ebx
						;; bbWeight=2    PerfScore 0.50
G_M59274_IG28:

For now I plan to remove the ExpansionPerf test from the PR so that I can retest and merge in the functional change to give it some bake time; I'll follow up on the test as a subsequent step because the original repro does seem to have the characteristics we need - i.e. running the entire first pass on the "slow resolution path" and only then switching over to the new dictionary - but it will likely take me a bit to figure out the exact formulation of the test so that it retains these characteristics.

Initial simplistic implementation based on my current understanding of the code. I may be easily missing some extra places in the code that need fixing. For now I've kept the instrumentation in genericdict.cpp. Thanks Tomas

Also put back the original number of iterations (1M); I reduced it to 100K in my local testing to let me run the test on checked runtime. Thanks Tomas

…et#40355) Fix perf regression after introduction of expandable generic dictionaries. When we expand a generic dictionary, we should back propagate the change to previously allocated shorter versions of the same dictionary, otherwise already running JITted code may continue referring to the "old version" of the dictionary implying repeated slow path generic lookups. Thanks Tomas

trylek added the area-VM-coreclr label Aug 5, 2020

trylek added this to the 5.0.0 milestone Aug 5, 2020

trylek requested a review from jkotas August 5, 2020 01:22

jkotas reviewed Aug 5, 2020

View reviewed changes

trylek requested review from sandreenko and AndyAyersMS August 5, 2020 10:39

AndyAyersMS reviewed Aug 5, 2020

View reviewed changes

jkotas reviewed Aug 5, 2020

View reviewed changes

trylek force-pushed the DictionaryExpansion branch from 02d93a9 to eb9a243 Compare August 6, 2020 00:56

jkotas approved these changes Aug 6, 2020

View reviewed changes

trylek changed the title ~~WIP: Continue updating previous generic dictionaries after expansion~~ Continue updating previous generic dictionaries after expansion Aug 6, 2020

trylek added 8 commits August 7, 2020 19:24

Fix setting of GTF_IND_INVARIANT flag per JanK's suggestion

7275832

Remove instrumentation; fix initial slot count per JanK's suggestion

37e74c9

Also put back the original number of iterations (1M); I reduced it to 100K in my local testing to let me run the test on checked runtime. Thanks Tomas

Revert superfluous back pointer on initial dictionaries

3516ecf

Address JanK's and Andy's feedback

8d348a6

Fix race condition per JanK's advice (just one DictionarySize method)

6d65a3c

Fix typo in assertion check

133abec

Fix formatting issue (probably too long source line)

b2700e3

Remove the incomplete ExpansionPerf test from the change

ce0c608

trylek force-pushed the DictionaryExpansion branch from dfab03e to ce0c608 Compare August 7, 2020 17:25

trylek merged commit d1c7946 into dotnet:master Aug 7, 2020

trylek deleted the DictionaryExpansion branch August 7, 2020 23:17

danmoseley mentioned this pull request Aug 14, 2020

[Perf -23%] System.Collections.IndexerSet<String>.Span #40856

Closed

ghost locked as resolved and limited conversation to collaborators Dec 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Continue updating previous generic dictionaries after expansion #40355

Continue updating previous generic dictionaries after expansion #40355

trylek commented Aug 5, 2020 •

edited

Loading

jkotas Aug 5, 2020

trylek Aug 5, 2020

trylek Aug 5, 2020

trylek Aug 5, 2020

jkotas Aug 5, 2020

trylek Aug 5, 2020

trylek Aug 5, 2020

trylek Aug 5, 2020

jkotas commented Aug 5, 2020

trylek commented Aug 5, 2020

AndyAyersMS left a comment

AndyAyersMS Aug 5, 2020

trylek Aug 5, 2020

AndyAyersMS commented Aug 5, 2020

jkotas Aug 5, 2020

trylek Aug 6, 2020

jkotas left a comment

jkotas commented Aug 6, 2020

trylek commented Aug 6, 2020

AndyAyersMS commented Aug 6, 2020

trylek commented Aug 6, 2020

AndyAyersMS commented Aug 6, 2020

trylek commented Aug 6, 2020

AndyAyersMS commented Aug 6, 2020

trylek commented Aug 7, 2020 •

edited

Loading

	bool isLastIndirectionWithSizeCheck = (i == pRuntimeLookup->indirections - 1 && pRuntimeLookup->sizeOffset != CORINFO_NO_SIZE_CHECK);
	bool isLastIndirectionWithSizeCheck = ((i == pRuntimeLookup->indirections - 1) && (pRuntimeLookup->sizeOffset != CORINFO_NO_SIZE_CHECK));

Continue updating previous generic dictionaries after expansion #40355

Continue updating previous generic dictionaries after expansion #40355

Conversation

trylek commented Aug 5, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkotas commented Aug 5, 2020

trylek commented Aug 5, 2020

AndyAyersMS left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AndyAyersMS commented Aug 5, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkotas left a comment

Choose a reason for hiding this comment

jkotas commented Aug 6, 2020

trylek commented Aug 6, 2020

AndyAyersMS commented Aug 6, 2020

trylek commented Aug 6, 2020

AndyAyersMS commented Aug 6, 2020

trylek commented Aug 6, 2020

AndyAyersMS commented Aug 6, 2020

trylek commented Aug 7, 2020 • edited Loading

trylek commented Aug 5, 2020 •

edited

Loading

trylek commented Aug 7, 2020 •

edited

Loading