-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Continue updating previous generic dictionaries after expansion #40355
Conversation
src/coreclr/src/vm/generics.cpp
Outdated
|
||
// Allocate from the high frequence heap of the correct domain | ||
S_SIZE_T allocSize = safe_cbMT; | ||
allocSize += cbOptional; | ||
allocSize += cbIMap; | ||
allocSize += cbPerInst; | ||
allocSize += cbInstAndDict; | ||
allocSize += cbInstAndDictPlusBackPointer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: The starting dictionary does not really need to have the back pointer. It is always going to be NULL. You can tell that it is the starting dictionary by comparing actual dictionary size with GetNumInitialSlots
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, methodtablebuilder seems to have its own calculation of the initial dictionary size around
runtime/src/coreclr/src/vm/methodtablebuilder.cpp
Line 7081 in 5fd9958
DWORD estNumTypeSlots = (numMethodsAdjusted * nTypeFactorBy2 + 2) / 3; |
In my local debugging, type dictionaries usually had initial size equal to 0x200; can you please clarify what the check should look like in this case?
I was originally trying to mark the initial directory with zero bit set in the size but for now I rolled it back as an optimization that can be addressed later once the basic algorithmic change is shown to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think I finally understand what you meant - using the initial slot count in the DictionaryLayout for the check. Fixed in 3rd commit, thanks for the suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I must be still missing something here, apparently it still doesn't work. The code at
runtime/src/coreclr/src/vm/genmeth.cpp
Line 396 in 06963ea
infoSize = DictionaryLayout::GetDictionarySizeFromLayout(methodInst.GetNumArgs(), pDL); |
apparently allocates an "initial dictionary" with size equal to "current method dictionary layout size" which is different from the "initial slot count" on the dictionary layout. With the back pointers everything seems to be working so I hope this is the last hurdle I need help with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be solved by moving the accounting for the back-pointer into GetDictionarySizeFromLayout
:
...
if (pDictLayout != NULL)
{
WORD slots = VolatileLoadWithoutBarrier(&pDictLayout->m_numSlots);
bytes += sizeof(TADDR); // Slot for dictionary size
bytes += slots * sizeof(TADDR); // Slots for dictionary slots based on a dictionary layout
if (pDictLayout->m_numInitialSlots < slots)
{
bytes += sizeof(PTR_Dictionary); // Dictionary back-pointer for expanded dictionaries
}
}
And adding an extra condition for exiting the patching loop on null back-pointers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds great, will do, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hehe, it's got an interesting wrinkle - we still need two different methods otherwise we'd change the existing semantics of the "size slot"; I have basically kept your idea but I had to introduce a pair of functions, GetDictionarySlotSizeFromLayout
vs. GetDictionaryAllocSizeFromLayout
where the former is the value of the "size slot" and the latter is the "total number of bytes" including the back pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 5th commit (using the two methods). It's somewhat uglier than what you originally proposed but it's still relatively well encapsulated I believe.
This will also need a fix in the JIT to reduce how much CSE the JIT does for dictionary entries. The problem is that the |
Adding JIT folks for reviewing the JIT change, please advise me if there are any special tests I should run and / or whether I should produce some JIT diffs from this change. Locally it seems to be passing and fixing the regression. I'm also not clear on what to do with the "regression test" - I can probably hardcode some sufficiently large timeout margin into it so that it fails upon exceeding it; alternatively we can somehow incorporate it in our perf test system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JIT changes seem ok modulo some formatting.
This will lead to asm diffs; it would be good to characterize these.
src/coreclr/src/jit/importer.cpp
Outdated
@@ -2117,11 +2117,17 @@ GenTree* Compiler::impRuntimeLookupToTree(CORINFO_RESOLVED_TOKEN* pResolvedToken | |||
nullptr DEBUGARG("impRuntimeLookup indirectOffset")); | |||
} | |||
|
|||
// The last indirection could be subject to a size check (dynamic dictionary expansion) | |||
bool isLastIndirectionWithSizeCheck = (i == pRuntimeLookup->indirections - 1 && pRuntimeLookup->sizeOffset != CORINFO_NO_SIZE_CHECK); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually fully parenthesize:
bool isLastIndirectionWithSizeCheck = (i == pRuntimeLookup->indirections - 1 && pRuntimeLookup->sizeOffset != CORINFO_NO_SIZE_CHECK); | |
bool isLastIndirectionWithSizeCheck = ((i == pRuntimeLookup->indirections - 1) && (pRuntimeLookup->sizeOffset != CORINFO_NO_SIZE_CHECK)); |
Also, will likely require clang-formatting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have fixed the formatting. I'll look into clang formatting and providing the assembly diffs as the next step.
Testing is tricky as this isn't something BDN will normally capture. I wonder if you can engineer a test as follows:
Without dictionary updates the baseline version should always have slow iterations as it will always refer to the initial dictionary. The alternate version should eventually get fast ones as it will refer to the expanded dictionary. So you can compare the times for the last N iterations of each; they should be similar if the bug is fixed. |
src/coreclr/src/vm/genericdict.cpp
Outdated
// | ||
//static | ||
DWORD | ||
DictionaryLayout::GetDictionarySizeFromLayout( | ||
DictionaryLayout::GetDictionarySlotSizeFromLayout( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make it one method with DWORD* ppSlotSize
argument to make sure that you always get consistent snapshot for slot size and alloc size.
This method is sometimes called without locks where one thread is reading the layout and other thread is expanding it. It is why I had Volatile...
in my code snippet. Having one method to return both alloc and slot sizes will avoid bugs with somebody accidentally changes the order in which GetDictionarySlotSizeFromLayout
and GetDictionaryAllocSizeFromLayout
are called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, sorry about not realizing this, fixed in 6th commit.
02d93a9
to
eb9a243
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Once it builds...
It is best to download the .patch file from the CI artifacts and apply it. |
I saw in the patch that it was just this one line split so I applied it manually. I absolutely agree that for larger diffs it would be much easier to apply the patch mechanically. |
FYI, if you have jitutils tools available, there is a |
OK, I have finally managed to produce the jit-diff for the ExpansionPerf test. As expected, the only difference is in the Test.TestDict method. Baseline: https://gist.github.com/trylek/644d79473dcc802a7a9729580c5e7a4e JIT-diff console output: D:\git\runtime5>jit-diff diff --core_root D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root --assembly D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Loader\classloader\DictionaryExpansion\ExpansionPerf\ExpansionPerf.dll --base D:\git\runtime3\artifacts\bin\CoreCLR\Windows_NT.x64.Checked --diff D:\git\runtime5\artifacts\bin\CoreCLR\Windows_NT.x64.Checked --arch x64 --build Checked --pmi Using --output D:\git\runtime5\artifacts\diffs Beginning PMI CodeSize Diffs for D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Loader\classloader\DictionaryExpansion\ExpansionPerf\ExpansionPerf.dll \ Finished 1/1 Base 1/1 Diff [3.6 sec] Completed PMI CodeSize Diffs for D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Loader\classloader\DictionaryExpansion\ExpansionPerf\ExpansionPerf.dll in 3.57s Diffs (if any) can be viewed by comparing: D:\git\runtime5\artifacts\diffs\dasmset_6\base D:\git\runtime5\artifacts\diffs\dasmset_6\diff Analyzing CodeSize diffs... Found 1 files with textual diffs. PMI CodeSize Diffs for D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Loader\classloader\DictionaryExpansion\ExpansionPerf\ExpansionPerf.dll for x64 default jit Summary of Code Size diffs: (Lower is better) Total bytes of diff: -7 (-0.20% of base) diff is an improvement. Top file improvements (bytes): -7 : ExpansionPerf.dasm (-0.20% of base) 1 total files with Code Size differences (1 improved, 0 regressed), 0 unchanged. Top method improvements (bytes): -7 (-0.26% of base) : ExpansionPerf.dasm - Test:TestDict(IDictionary`2,Func`2) (7 methods) Top method improvements (percentages): -7 (-0.26% of base) : ExpansionPerf.dasm - Test:TestDict(IDictionary`2,Func`2) (7 methods) 1 total methods with Code Size differences (1 improved, 0 regressed), 20 unchanged. Completed analysis in 0.30s At the first glance, the minute improvement is caused by the fact that the suppressed common subexpression elimination stops storing the Dictionary * in a register and instead repeatedly loads it using an indirect MOV, thus freeing the one register for other optimizations. For example: Baseline: G_M27032_IG30: movsxd rcx, r12d mov r13, gword ptr [rbx+8*rcx+16] mov rcx, rbp cmp qword ptr [rcx+8], 56 jle SHORT G_M27032_IG32 Diff: G_M27032_IG30: movsxd rcx, r15d mov r12, gword ptr [rbx+8*rcx+16] mov rcx, qword ptr [rsi+16] cmp qword ptr [rcx+8], 56 jle SHORT G_M27032_IG32 @AndyAyersMS - please let me know if you find this summary info sufficient or suggest what else should I do, you know I'm not familiar with this process. |
I was looking for a broader assessment over all the framework assemblies (pass - |
@AndyAyersMS - thanks for clarifying. This is the summary; please let me what other artifacts or data you're interested in. D:\git\runtime5>jit-diff diff --core_root D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root -f --base D:\git\runtime3\artifacts\bin\CoreCLR\Windows_NT.x64.Checked --diff D:\git\runtime5\artifacts\bin\CoreCLR\Windows_NT.x64.Checked --arch x64 --build Checked --pmi Using --output D:\git\runtime5\artifacts\diffs Beginning PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies \ Finished 136/268 Base 0/268 Diff [140.2 sec]Error running D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\corerun.exe on D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\FSharp.Core.dll 1 errors compiling set. Dasm command "jit-dasm-pmi --platform D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root --corerun D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\corerun.exe --jit D:\git\runtime3\artifacts\bin\CoreCLR\Windows_NT.x64.Checked\clrjit.dll --nocopy --output D:\git\runtime5\artifacts\diffs\dasmset_7\base D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\FSharp.Core.dll" returned with 1 failures / Finished 268/268 Base 261/268 Diff [398.8 sec]Error running D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\corerun.exe on D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\FSharp.Core.dll 1 errors compiling set. Dasm command "jit-dasm-pmi --platform D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root --corerun D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\corerun.exe --jit D:\git\runtime5\artifacts\bin\CoreCLR\Windows_NT.x64.Checked\clrjit.dll --nocopy --output D:\git\runtime5\artifacts\diffs\dasmset_7\diff D:\git\runtime5\artifacts\tests\CoreCLR\Windows_NT.x64.Release\Tests\Core_Root\FSharp.Core.dll" returned with 1 failures | Finished 268/268 Base 268/268 Diff [446.5 sec] Dasm commands returned 1 base failures, 1 diff failures. Completed PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies in 446.68s Diffs (if any) can be viewed by comparing: D:\git\runtime5\artifacts\diffs\dasmset_7\base D:\git\runtime5\artifacts\diffs\dasmset_7\diff Analyzing CodeSize diffs... Found 36 files with textual diffs. PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit Summary of Code Size diffs: (Lower is better) Total bytes of diff: 1304 (0.00% of base) diff is a regression. Top file regressions (bytes): 487 : FSharp.Core.dasm (0.03% of base) 193 : System.Linq.dasm (0.02% of base) 188 : Microsoft.CodeAnalysis.dasm (0.01% of base) 138 : System.Private.CoreLib.dasm (0.00% of base) 112 : CommandLine.dasm (0.02% of base) 93 : System.Threading.Tasks.Dataflow.dasm (0.01% of base) 35 : System.Linq.Expressions.dasm (0.00% of base) 35 : System.Threading.Channels.dasm (0.02% of base) 28 : System.Collections.Immutable.dasm (0.00% of base) 27 : System.Composition.Hosting.dasm (0.03% of base) 18 : Microsoft.Extensions.Options.dasm (0.08% of base) 18 : System.Memory.dasm (0.01% of base) 14 : Microsoft.Extensions.Caching.Abstractions.dasm (0.06% of base) 12 : Newtonsoft.Json.dasm (0.00% of base) 12 : xunit.core.dasm (0.02% of base) 11 : System.Private.DataContractSerialization.dasm (0.00% of base) 10 : System.Private.Xml.dasm (0.00% of base) 9 : System.Collections.Concurrent.dasm (0.00% of base) 9 : xunit.assert.dasm (0.01% of base) 9 : xunit.execution.dotnet.dasm (0.00% of base) Top file improvements (bytes): -163 : System.Linq.Parallel.dasm (-0.01% of base) -10 : System.Private.Xml.Linq.dasm (-0.01% of base) -4 : System.Reflection.Metadata.dasm (-0.00% of base) -4 : System.Threading.Tasks.Parallel.dasm (-0.00% of base) -3 : System.Diagnostics.Process.dasm (-0.00% of base) -3 : System.Net.Http.dasm (-0.00% of base) -3 : System.Net.Http.Json.dasm (-0.01% of base) -2 : Microsoft.CodeAnalysis.CSharp.dasm (-0.00% of base) 36 total files with Code Size differences (8 improved, 28 regressed), 232 unchanged. Top method regressions (bytes): 137 ( 0.32% of base) : Microsoft.CodeAnalysis.dasm - ArrayBuilder`1:ToDictionary(Func`2,IEqualityComparer`1):Dictionary`2:this (49 methods) 66 ( 0.39% of base) : CommandLine.dasm - <>c__DisplayClass0_0`1:b__7():ParserResult`1:this (7 methods) 52 ( 0.21% of base) : System.Linq.Parallel.dasm - DistinctQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods) 49 ( 0.21% of base) : FSharp.Core.dasm - FSharpMailboxProcessor`1:PostAndTryAsyncReply(FSharpFunc`2,FSharpOption`1):FSharpAsync`1:this (49 methods) 43 ( 0.93% of base) : System.Private.CoreLib.dasm - ReferenceTypeHelper`1:GetPropertyGetter(PropertyInfo):Func`2:this 37 ( 4.12% of base) : System.Threading.Tasks.Dataflow.dasm - ObserversState:ProcessItemAsync(__Canon):Task:this 35 ( 0.16% of base) : FSharp.Core.dasm - FSharpMailboxProcessor`1:PostAndAsyncReply(FSharpFunc`2,FSharpOption`1):FSharpAsync`1:this (49 methods) 35 ( 0.13% of base) : Microsoft.CodeAnalysis.dasm - AsyncQueue`1:WithCancellation(Task`1,CancellationToken):Task`1 (49 methods) 33 ( 0.11% of base) : System.Linq.Parallel.dasm - TakeOrSkipWhileQueryOperator`1:WrapHelper(PartitionedStream`2,IPartitionedStreamRecipient`1,QuerySettings):this (49 methods) 31 ( 0.96% of base) : FSharp.Core.dasm - Parallel@1240-3:Invoke(int,FSharpAsync`1):Unit:this (7 methods) 30 ( 2.52% of base) : System.Linq.dasm - AppendPrepend1Iterator`1:GetCount(bool):int:this (7 methods) 28 ( 0.25% of base) : FSharp.Core.dasm - FSharpSet`1:Map(FSharpFunc`2):FSharpSet`1:this (49 methods) 26 ( 0.60% of base) : FSharp.Core.dasm - StartChild@1649-1:Invoke(CancellationToken):FSharpAsync`1:this (7 methods) 26 ( 0.59% of base) : System.Linq.Parallel.dasm - ElementAtQueryOperator`1:Aggregate(byref,bool):bool:this (7 methods) 23 ( 0.26% of base) : System.Private.CoreLib.dasm - TlsOverPerCoreLockedStacksArrayPool`1:Trim():bool:this (9 methods) 22 ( 0.41% of base) : System.Linq.dasm - Enumerable:Select(IEnumerable`1,Func`2):IEnumerable`1 (7 methods) 21 ( 0.07% of base) : System.Linq.Parallel.dasm - ElementAtQueryOperator`1:WrapPartitionedStream(PartitionedStream`2,IPartitionedStreamRecipient`1,bool,QuerySettings):this (49 methods) 20 ( 0.72% of base) : FSharp.Core.dasm - TryScan@233:Invoke(Unit):FSharpAsync`1:this (7 methods) 20 ( 0.26% of base) : System.Linq.dasm - Enumerable:SequenceEqual(IEnumerable`1,IEnumerable`1,IEqualityComparer`1):bool (7 methods) 20 ( 0.45% of base) : System.Linq.dasm - Enumerable:ToDictionary(IEnumerable`1,Func`2,IEqualityComparer`1):Dictionary`2 (7 methods) Top method improvements (bytes): -154 (-0.58% of base) : System.Linq.Parallel.dasm - IndexedWhereQueryOperator`1:WrapPartitionedStream(PartitionedStream`2,IPartitionedStreamRecipient`1,bool,QuerySettings):this (49 methods) -94 (-0.26% of base) : System.Linq.Parallel.dasm - ExceptQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods) -94 (-0.23% of base) : System.Linq.Parallel.dasm - IntersectQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods) -35 (-0.76% of base) : FSharp.Core.dasm - Choice@1291-1:Invoke(AsyncActivation`1):AsyncReturn:this (7 methods) -28 (-0.12% of base) : System.Linq.Parallel.dasm - MergeExecutor`1:Execute(PartitionedStream`2,bool,int,TaskScheduler,bool,CancellationState,int):MergeExecutor`1 (49 methods) -14 (-0.21% of base) : CommandLine.dasm - EnumerableExtensions:FoldImpl(IEnumerable`1,int,Func`2,Func`3,Func`4,Func`5):long (7 methods) -14 (-0.17% of base) : System.Linq.Parallel.dasm - ParallelEnumerable:SequenceEqual(ParallelQuery`1,ParallelQuery`1,IEqualityComparer`1):bool (7 methods) -13 (-1.52% of base) : System.Linq.dasm - Enumerable:TryGetLast(IEnumerable`1,Func`2,byref):__Canon -13 (-0.03% of base) : System.Linq.Parallel.dasm - UnionQueryOperator`1:WrapPartitionedStreamFixedBothTypes(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,int,CancellationToken):this (49 methods) -10 (-0.21% of base) : System.Collections.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods) -10 (-0.21% of base) : System.Linq.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods) -10 (-0.63% of base) : System.Private.CoreLib.dasm - RuntimeHelpers:GetSubArray(ref,Range):ref (7 methods) -10 (-0.21% of base) : System.Private.Xml.Linq.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods) -9 (-0.06% of base) : FSharp.Core.dasm - SetTreeModule:compareStacks(IComparer`1,FSharpList`1,FSharpList`1):int (7 methods) -9 (-0.24% of base) : System.Reflection.Metadata.dasm - CustomAttributeDecoder`1:DecodeNamedArgumentType(byref,bool):ArgumentTypeInfo:this (7 methods) -7 (-0.18% of base) : Microsoft.CodeAnalysis.dasm - EnumerableExtensions:Do(IEnumerable`1,Action`1):IEnumerable`1 (7 methods) -7 (-0.62% of base) : Microsoft.CodeAnalysis.dasm - d__3`1:<>m__Finally1():this (7 methods) -6 (-0.16% of base) : System.Linq.Parallel.dasm - DefaultMergeHelper`2:.ctor(PartitionedStream`2,bool,int,TaskScheduler,CancellationState,int):this (7 methods) -6 (-0.29% of base) : System.Linq.Parallel.dasm - PartitionedDataSource`1:MakePartitions(IEnumerator`1,int):ref (7 methods) -4 (-0.12% of base) : FSharp.Core.dasm - ExtraTopLevelOperators:dictRefType(IEnumerable`1):DictImpl`3 (7 methods) Top method regressions (percentages): 6 ( 5.56% of base) : System.Collections.dasm - <>c__DisplayClass34_0:b__0(Node):bool:this (7 methods) 8 ( 4.12% of base) : System.Private.CoreLib.dasm - Table:Insert(__Canon,__Canon):this 37 ( 4.12% of base) : System.Threading.Tasks.Dataflow.dasm - ObserversState:ProcessItemAsync(__Canon):Task:this 30 ( 2.52% of base) : System.Linq.dasm - AppendPrepend1Iterator`1:GetCount(bool):int:this (7 methods) 4 ( 2.07% of base) : FSharp.Core.dasm - TryScan@243-1:Invoke(__Canon):FSharpAsync`1:this 4 ( 2.07% of base) : FSharp.Core.dasm - scan@215-2:Invoke(__Canon):FSharpAsync`1:this 4 ( 2.07% of base) : FSharp.Core.dasm - scanNoTimeout@228-2:Invoke(__Canon):FSharpAsync`1:this 4 ( 2.07% of base) : FSharp.Core.dasm - PostAndTryAsyncReply@398-6:Invoke(__Canon):FSharpAsync`1:this 15 ( 1.84% of base) : System.Threading.Channels.dasm - ChannelReader`1:g__ReadAsyncCore|8_0(CancellationToken):ValueTask`1:this (7 methods) 3 ( 1.68% of base) : System.Reflection.Metadata.dasm - PEHeaders:GetContainingSectionIndex(int):int:this 10 ( 1.64% of base) : System.Linq.dasm - Enumerable:TryGetFirst(IEnumerable`1,byref):__Canon 8 ( 1.56% of base) : System.Linq.dasm - Enumerable:SingleOrDefault(IEnumerable`1):__Canon 9 ( 1.39% of base) : System.Linq.dasm - Enumerable:TryGetLast(IEnumerable`1,byref):__Canon 4 ( 1.37% of base) : Microsoft.Extensions.Options.ConfigurationExtensions.dasm - NamedConfigureFromConfigurationOptions`1:.ctor(String,IConfiguration,Action`1):this 14 ( 1.33% of base) : System.Linq.Expressions.dasm - UpdateDelegates:UpdateAndExecuteVoid2(CallSite,__Canon,long) 19 ( 1.32% of base) : Microsoft.Extensions.Options.dasm - OptionsFactory`1:Create(String):__Canon:this 2 ( 1.28% of base) : System.Private.CoreLib.dasm - Vector256`1:Equals(Vector256`1):bool:this (6 methods) 12 ( 1.27% of base) : FSharp.Core.dasm - TryReceive@278:Invoke(Unit):FSharpAsync`1:this (7 methods) 12 ( 1.27% of base) : System.Linq.dasm - Buffer`1:.ctor(IEnumerable`1):this (7 methods) 10 ( 1.23% of base) : FSharp.Core.dasm - StartChildAsTask@1166-1:Invoke(CancellationToken):FSharpAsync`1:this (7 methods) Top method improvements (percentages): -13 (-1.52% of base) : System.Linq.dasm - Enumerable:TryGetLast(IEnumerable`1,Func`2,byref):__Canon -35 (-0.76% of base) : FSharp.Core.dasm - Choice@1291-1:Invoke(AsyncActivation`1):AsyncReturn:this (7 methods) -10 (-0.63% of base) : System.Private.CoreLib.dasm - RuntimeHelpers:GetSubArray(ref,Range):ref (7 methods) -7 (-0.62% of base) : Microsoft.CodeAnalysis.dasm - d__3`1:<>m__Finally1():this (7 methods) -154 (-0.58% of base) : System.Linq.Parallel.dasm - IndexedWhereQueryOperator`1:WrapPartitionedStream(PartitionedStream`2,IPartitionedStreamRecipient`1,bool,QuerySettings):this (49 methods) -3 (-0.40% of base) : Microsoft.Extensions.Options.dasm - OptionsMonitor`1:.ctor(IOptionsFactory`1,IEnumerable`1,IOptionsMonitorCache`1):this -6 (-0.29% of base) : System.Linq.Parallel.dasm - PartitionedDataSource`1:MakePartitions(IEnumerator`1,int):ref (7 methods) -94 (-0.26% of base) : System.Linq.Parallel.dasm - ExceptQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods) -9 (-0.24% of base) : System.Reflection.Metadata.dasm - CustomAttributeDecoder`1:DecodeNamedArgumentType(byref,bool):ArgumentTypeInfo:this (7 methods) -94 (-0.23% of base) : System.Linq.Parallel.dasm - IntersectQueryOperator`1:WrapPartitionedStreamHelper(PartitionedStream`2,PartitionedStream`2,IPartitionedStreamRecipient`1,CancellationToken):this (49 methods) -2 (-0.22% of base) : FSharp.Core.dasm - AsyncPrimitives:RunSynchronouslyInAnotherThread(CancellationToken,FSharpAsync`1,FSharpOption`1):__Canon -14 (-0.21% of base) : CommandLine.dasm - EnumerableExtensions:FoldImpl(IEnumerable`1,int,Func`2,Func`3,Func`4,Func`5):long (7 methods) -10 (-0.21% of base) : System.Collections.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods) -10 (-0.21% of base) : System.Linq.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods) -10 (-0.21% of base) : System.Private.Xml.Linq.dasm - EnumerableHelpers:ToArray(IEnumerable`1,byref):ref (7 methods) -7 (-0.18% of base) : Microsoft.CodeAnalysis.dasm - EnumerableExtensions:Do(IEnumerable`1,Action`1):IEnumerable`1 (7 methods) -14 (-0.17% of base) : System.Linq.Parallel.dasm - ParallelEnumerable:SequenceEqual(ParallelQuery`1,ParallelQuery`1,IEqualityComparer`1):bool (7 methods) -3 (-0.17% of base) : FSharp.Core.dasm - AsyncPrimitives:StartAsTask(CancellationToken,FSharpAsync`1,FSharpOption`1):Task`1 (7 methods) -6 (-0.16% of base) : System.Linq.Parallel.dasm - DefaultMergeHelper`2:.ctor(PartitionedStream`2,bool,int,TaskScheduler,CancellationState,int):this (7 methods) -3 (-0.13% of base) : FSharp.Core.dasm - FSharpAsync:StartImmediateAsTask(FSharpAsync`1,FSharpOption`1):Task`1 (7 methods) 221 total methods with Code Size differences (37 improved, 184 regressed), 253375 unchanged. Completed analysis in 35.03s |
Thanks, this is good. Wanted to confirm that the effects weren't widespread, and that there weren't any real outsized diffs. |
@AndyAyersMS - I've spent some time trying to modify the ExpansionPerf test as you suggested. While I clearly see that without the fix the test is about 10 times slower than with it, I'm unable to measure any major difference between the "inlined" and "non-inlined" version you proposed above. It seems to me that, with the CSE fix, even the "inlined" version repeatedly fetches the dictionary pointer during the loop so that it silently picks up the dictionary expansion and switches over to the fast path. This is how one of the "is List<T>"-like constructs looks like in the jit-diff disassembly: mov r15, qword ptr [rsi+16] mov rcx, r15 mov r12, qword ptr [rcx+8] cmp r12, 48 jle SHORT G_M59274_IG25 ;; bbWeight=4 PerfScore 22.00 G_M59274_IG24: mov rcx, qword ptr [rcx+48] test rcx, rcx je SHORT G_M59274_IG25 jmp SHORT G_M59274_IG26 ;; bbWeight=1.60 PerfScore 8.40 G_M59274_IG25: mov rcx, rsi mov rdx, 0xD1FFAB1E call CORINFO_HELP_RUNTIMEHANDLE_METHOD mov rcx, rax ;; bbWeight=0.40 PerfScore 0.70 G_M59274_IG26: mov rdx, r14 call CORINFO_HELP_ISINSTANCEOFCLASS test rax, rax je SHORT G_M59274_IG28 ;; bbWeight=4 PerfScore 10.00 G_M59274_IG27: inc ebx ;; bbWeight=2 PerfScore 0.50 G_M59274_IG28: For now I plan to remove the ExpansionPerf test from the PR so that I can retest and merge in the functional change to give it some bake time; I'll follow up on the test as a subsequent step because the original repro does seem to have the characteristics we need - i.e. running the entire first pass on the "slow resolution path" and only then switching over to the new dictionary - but it will likely take me a bit to figure out the exact formulation of the test so that it retains these characteristics. |
Initial simplistic implementation based on my current understanding of the code. I may be easily missing some extra places in the code that need fixing. For now I've kept the instrumentation in genericdict.cpp. Thanks Tomas
Also put back the original number of iterations (1M); I reduced it to 100K in my local testing to let me run the test on checked runtime. Thanks Tomas
dfab03e
to
ce0c608
Compare
…et#40355) Fix perf regression after introduction of expandable generic dictionaries. When we expand a generic dictionary, we should back propagate the change to previously allocated shorter versions of the same dictionary, otherwise already running JITted code may continue referring to the "old version" of the dictionary implying repeated slow path generic lookups. Thanks Tomas
Initial simplistic implementation based on my current understanding
of the code. I may be easily missing some extra places in the code
that need fixing. For now I've kept the instrumentation in
genericdict.cpp.
Fixes: #40298
Thanks
Tomas