-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial version of class profiling for PGO #45133
Conversation
Add support to the jit and runtime so that PGO can determine the distribution of classes at virtual and indirect call sites. Use this information when jitting to enable guarded devirtualization, if there is a suitably likely class to guess for. Enable by setting: ``` COMPlus_TieredCompilation=1 COMPlus_TieredPGO=1 COMPlus_JitClassProfiling=1 COMPlus_JitEnableGuardedDevirtualization=1 ``` impact can be enhanced by also setting ``` COMPlus_TC_QuickJitForLoops=1 ``` to allow more methods to pass through Tier0.
cc @dotnet/jit-contrib @davidwrighton Jit and jit interface changes should be fairly close to their final form. Runtime side will eventually be replaced as we build out more of the PGO story over there. But this is a start. Still missing: crossgen2 "support", and some way to invalidate class profile entries for unloadable classes. |
src/coreclr/src/ToolBox/superpmi/superpmi-shared/methodcontext.h
Outdated
Show resolved
Hide resolved
Not surprisingly, crossgen2 is unhappy; I need to stub out |
src/coreclr/src/vm/jithelpers.cpp
Outdated
VALIDATEOBJECTREF(objRef); | ||
|
||
ICorJitInfo::ClassProfile* const classProfile = (ICorJitInfo::ClassProfile*) tableAddress; | ||
const int count = classProfile->Count++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to use volatile and/or interlocked operations?
In !_DEBUG
, it is a valid C/C++ optimization to get rid of the count
local and re-fetch it below, ie transform the line below to classProfile->ClassTable[classProfile->Count]
. I would potentially lead to buffer overruns or skipped slots in ClassProfile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it should be volatile so it's just read once. Likewise for s_rng
.
It is OK to drop updates because of races, and we're trying to keep overhead to a minimum, so no need to interlock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments/questions
@@ -623,6 +623,7 @@ enum CorInfoHelpFunc | |||
CORINFO_HELP_STACK_PROBE, // Probes each page of the allocated stack frame | |||
|
|||
CORINFO_HELP_PATCHPOINT, // Notify runtime that code has reached a patchpoint | |||
CORINFO_HELP_CLASSPROFILE, // Update class profile for a call site |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to change the JIT-EE GUID
src/coreclr/src/inc/corjit.h
Outdated
@@ -251,6 +251,14 @@ class ICorJitInfo : public ICorDynamicInfo | |||
UINT32 ExecutionCount; | |||
}; | |||
|
|||
struct ClassProfile | |||
{ | |||
enum { SIZE = 8, SAMPLE_INTERVAL = 32 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to comment this structure and its fields. E.g., what is SAMPLE_INTERVAL
? What is ClassProfile
used for?
Is UINT32 big enough for Count
(and ExecutionCount
above)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add some description.
For count overflow I currently don't have great answers; it should be rare (we'd need methods that spent a lot of cycles in Tier0) but obviously it's not impossible. Right now overflow won't cause any crashes but it might lead to nonsensical profile data.
src/coreclr/src/inc/corjit.h
Outdated
CORINFO_METHOD_HANDLE ftnHnd, | ||
CORINFO_CLASS_HANDLE baseHnd, | ||
UINT32 ilOffset, | ||
UINT32 * pLikelihood, // estimated likelihood of the class (0...100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarify this is a decimal percentage? i.e. 100 = 100% certainty.
src/coreclr/src/inc/corjit.h
Outdated
CORINFO_METHOD_HANDLE ftnHnd, | ||
CORINFO_CLASS_HANDLE baseHnd, | ||
UINT32 ilOffset, | ||
UINT32 * pLikelihood, // estimated likelihood of the class (0...100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarify these are OUT params
@@ -993,6 +993,15 @@ HRESULT getMethodBlockCounts(CORINFO_METHOD_HANDLE ftnHnd, | |||
BlockCounts** pBlockCounts, | |||
UINT32 * pNumRuns); | |||
|
|||
// Get the likely implementing class for a virtual call or interface call made by ftnHnd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should be "almost diffable" with icorjitinfo.h, so add ClassProfile
struct as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented out, like BlockCounts is ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented out, like BlockCounts is ?
If necessary. I can't remember why BlockCounts is commented out; if it's required or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if #45044 is merged, it will be unnecessary.
{ | ||
enum { | ||
SIZE = 8, | ||
SAMPLE_INTERVAL = 32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these values need to be powers of two? For example could SIZE be 7 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They can be any positive value provided N >= S.
const unsigned count = *pCount++; | ||
const unsigned S = ICorJitInfo::ClassProfile::SIZE; | ||
const unsigned N = ICorJitInfo::ClassProfile::SAMPLE_INTERVAL; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert(N >= S); // SAMPLE_INTERVAL must be >= SIZE.
and possibly assert that N is a power of two
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N doesn't need to a power of 2, though it probably will be. in practice.
// | ||
if ((x % N) < S) | ||
{ | ||
unsigned i = x % S; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a divide by an unsigned constant the compiler would use a reciprocal multiply. So this could be written using modulus and we wouldn't require S to be a power of two. I think that the replacement of values in the ClassTable[] would also be more random when N is not divisible by S.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is already written using a modulus...?
In the most general form we should actually be checking (x % count) < S
so the divisor would be variable. I'm experimenting with using a fixed window size here to bias the sample somewhat towards more recent observations.
Using a prime S
(say) is an interesting idea, but S
needs to be very small, so perhaps 7 or 11 might be worth a try.
@@ -5233,6 +5234,74 @@ void JIT_Patchpoint(int* counter, int ilOffset) | |||
|
|||
#endif // FEATURE_ON_STACK_REPLACEMENT | |||
|
|||
HCIMPL2(void, JIT_ClassProfile, Object *obj, void* tableAddress) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an method header comment staying that this is a profiling/instrumentation method and needs to be optimized to minimize the cost of runtime profiling.
We might want to explicitly tell the compiler to optimize this method (in retail builds) as we might not normally gather any C++ PGO data for this case if this feature it isn't enabled by default.
// intentionally simple so we can have multithreaded | ||
// access w/o tearing state. | ||
// | ||
static volatile unsigned s_rng = 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If C++ supports thread local statics this would be a good use case.
Alternatively the runtime has support for per thread data, I believe.
Otherwise each thread has to grab the cache line for writing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The class profile data is shared so I'm not sure there's much to be gained by having per-thread RNG state.
struct HistogramEntry | ||
{ | ||
CORINFO_CLASS_HANDLE m_mt; | ||
unsigned m_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document the meanings of these fields:
m_count is the number of occurrences of m_mt in ClassTable[]
// Yep, found data. See if there is a suitable class profile. | ||
// | ||
// This bit is currently somewhat hacky ... we scan the records, the count records come | ||
// first and are in increasing IL offset order. Class profiles have inverted IL offsets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly is an "inverted" IL offset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set some flags bits in the high part of the IL offset so the runtime side can distinguish class profiles from regular block counts.
This is a temporary measure needed until the runtime side evolves to understand there are different kinds of profile data.
// Need to make sure this is even divisor | ||
// | ||
j += sizeof(ICorJitInfo::ClassProfile) / sizeof(ICorJitInfo::BlockCounts); | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we need to make sure this is an even divisor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the underlying data is allocated in BlockCount increments, we need to make sure ClassProfile is a multiple, so that this update to j
gets us to the next ClassProfile record.
@@ -342,6 +479,166 @@ HRESULT PgoManager::getMethodBlockCounts(MethodDesc* pMD, unsigned ilSize, UINT3 | |||
return E_NOTIMPL; | |||
} | |||
|
|||
CORINFO_CLASS_HANDLE PgoManager::getLikelyClass(MethodDesc* pMD, unsigned ilSize, unsigned ilOffset, UINT32* pLikelihood, UINT32* pNumberOfClasses) | |||
{ | |||
*pLikelihood = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pLikelyhood couls also be a 32-bit float.
As it currently stands is is an integer percentage, where 100 means 100% and 33 means 33%
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need that sort of precision; our histograms are approximate.
CORINFO_CLASS_HANDLE baseHnd, | ||
UINT32 ilOffset, | ||
UINT32 * pLikelihood, | ||
UINT32 * pNumberOfClasses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pLikelyhood couls also be a 32-bit float.
As it currently stands it is an integer percentage, where 100 means 100% and 33 means 33%
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same goes here, histogram data is approximate.
What mostly matters is that for monomorphic types we get it right and predict high likelihood. For other cases the payoff is smaller.
… type to work with shared classes (like __Canon). Also note we can reach this stage for both virtual and interface calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Good
I was looking at #4489 and realized it was a good test case for GDV via TieredPGO. Drilling in, I realized the logic I had for determining when to guess for a class wasn't as general as it could be and was missing out on guessing for calls dispatched through some shared classes (notably, So I've updated that part of
resulting assembly code is: ; Assembly listing for method Runtime4489:UseInterfaceCalls():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; partially interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 15837975
; invoked as altjit
; Final local variable assignments
;
; V00 this [V00,T00] ( 4, 4 ) ref -> rcx this class-hnd
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T02] ( 2, 4 ) ref -> rcx ld-addr-op class-hnd "Inlining Arg"
; V03 tmp2 [V03,T01] ( 3, 4 ) ref -> rcx "guarded devirt this temp"
;* V04 tmp3 [V04 ] ( 0, 0 ) ref -> zero-ref class-hnd exact "guarded devirt this exact temp"
; V05 tmp4 [V05,T03] ( 2, 4 ) ref -> r11 class-hnd "Inlining Arg"
;
; Lcl frame size = 40
G_M18455_IG01: ;; offset=0000H
4883EC28 sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M18455_IG02: ;; offset=0004H
4C8B5928 mov r11, gword ptr [rcx+40]
488B4910 mov rcx, gword ptr [rcx+16]
45391B cmp dword ptr [r11], r11d
49BBF0684EC0F87F0000 mov r11, 0x7FF8C04E68F0
4C3919 cmp qword ptr [rcx], r11 // GDV check
7517 jne SHORT G_M18455_IG04
48B918B54DC0F87F0000 mov rcx, 0x7FF8C04DB518 // correct guess
4533DB xor r11d, r11d
448919 mov dword ptr [rcx], r11d
FF01 inc dword ptr [rcx]
;; bbWeight=1 PerfScore 15.75
G_M18455_IG03: ;; offset=0030H
4883C428 add rsp, 40
C3 ret
;; bbWeight=1 PerfScore 1.25
G_M18455_IG04: ;; offset=0035H
49BB580521C0F87F0000 mov r11, 0x7FF8C0210558 // incorrect guess
48B8580521C0F87F0000 mov rax, 0x7FF8C0210558
FF10 call qword ptr [rax]ICalls:Execute():this
EBE3 jmp SHORT G_M18455_IG03
;; bbWeight=0 PerfScore 0.00
I also verified I can drop a checked clrjit.dll next to the release one (calling it say clrjit.chk.dll) and invoke it selectively via the altjit mechanism; very handy for experimenting without having to swap jits around (and, drive all this via BDN). |
@DrewScoggins , @adamsitnik - As we discussed this in the past, Andy confirmed that it works, so something we can try to get the disassembly more reliably? |
Failure is known #45168. Mono build is a timeout (unrelated). |
Add support to the jit and runtime so that PGO can determine the distribution of
classes at virtual and indirect call sites.
Use this information when jitting to enable guarded devirtualization, if there
is a suitably likely class to guess for.
Enable by setting:
impact can be enhanced by also setting
to allow more methods to pass through Tier0.