Consider adding a cache of well known tokens to the JIT #50426
CC @AndyAyersMS, @EgorBo. I had brought this idea up a couple of times on Discord and decided to log a formal issue. I think it will largely handle the most critical scenarios in a mostly pay-for-play manner.
On a meta point, there has also been previous discussion around concepts like "jit hints", and this would allow such work to be prototyped since the helper method would be well known during importation.
An example of such a heuristic: EgorBo@3810c21
It was originally rejected due to the "resolveToken" call, but it can be reconsidered once we implement that cache of tokens ^.
It would be better, I suspect, to implement a cache like this on the runtime side of the jit interface, as the runtime is more aware of various complications (collectible assemblies and such), has means for managing caches that might need periodic trimming or cleaning, and is able to use locks and other synchronization mechanisms that wouldn't be appropriate in the jit. If that's the case, then what we're talking about seems like a "lightweight" version of resolveToken that has limited resolution abilities, and keeps a cache with …
Given the above, it would be interesting to sketch what the jit interface API might look like. Note that if the aim of this is to improve inlining, we'll often be unable to make strong immediate inferences with the information we get back. Suppose during the IL scan this new API lets the jit know a particular call returns true. Connecting that with potential "savings" from being able to partially import the method is not easy, as we don't know how the return value influences future computation and don't understand the control flow, even if we can tell that value influences a branch.

So instead what we'd need to do is simply note that the IL contains "known return value" calls that we suspect are conveying valuable bits of information, and then use those notes to boost the viability of the method as an inline candidate. Then we'd revisit the method at the next stage of the inlinee evaluation pipeline, where we'd import the method for real, build IR, optimize based on that, and verify that good things indeed happened (or more likely, just assume based on the hints that good things would happen).
@AndyAyersMS, that makes sense to me. I believe the most important information for the purpose of inlining is … I think, therefore, that the surface area for the JIT/EE interface might be relatively simple, such as:
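As a purely hypothetical sketch (the interface name, method names, types, and signatures below are illustrative assumptions rather than the actual surface proposed), such an API might pair a registration call with a cheap lookup call:

```csharp
using System;

// Hypothetical sketch only: IJitEeTokenCache, WellKnownMethodId, and these signatures are
// illustrative assumptions, not the actual JIT/EE interface being proposed.
enum WellKnownMethodId { None, IsaIsSupported, TypeEqualityCheck, StringLength }

interface IJitEeTokenCache
{
    // Called when a call token is resolved (e.g. during tier 0 importation) and turns out
    // to map to a well-known method.
    void RegisterWellKnownToken(IntPtr moduleHandle, uint token, WellKnownMethodId id);

    // Cheap lookup for the inliner's IL pre-scan: no token resolution is performed, so a
    // false result can mean either "never resolved yet" or "resolved, but not well known".
    bool TryGetWellKnownToken(IntPtr moduleHandle, uint token, out WellKnownMethodId id);
}
```

The important property in this sketch is that the lookup never triggers token resolution itself; it only reports what some earlier compilation already resolved.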
We would then flow the resolved token down into …
Say, for example, that there is a …
Similar heuristics can happen for …
Having the cache store and return the …
Should the first argument be just …?
We may want to look at this as a bag of cached key/value pairs attached to …
The JIT can define as many keys as it needs, and the JIT owns the interpretation of the value for a given key; e.g. it can be intrinsic IDs or cached observations about inlinees. The VM side would be free to flush the cache at will, e.g. when it becomes too big or when unloading occurs.
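A minimal sketch of what such a general-purpose bag could look like, assuming hypothetical names, a plain integer payload, and a module handle as the attachment point (all of which are illustrative choices, not the design described above):

```csharp
using System;

// Hypothetical sketch of the general key/value cache described above; the names, the module
// handle as the attachment point, and the plain long payload are assumptions for illustration.
interface IJitScratchCache
{
    // The jit owns the key space and the meaning of the value; the VM only stores it and may
    // drop entries at any time (size pressure, unload, etc.), so a miss is always possible.
    void SetValue(IntPtr moduleHandle, uint key, long value);
    bool TryGetValue(IntPtr moduleHandle, uint key, out long value);
}
```

Because the VM may flush entries at will, the jit could only treat such a cache as an optimization hint, never as a source of truth.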
There are merits to both token-based lookup and handle-based cache lookup. Tanner's proposal is aimed at providing a cheap form of token resolution for use by the very early stages of inline candidate evaluation, where we're just scanning the IL stream, so that we can recognize some calls as being of special interest (like the …).

A more general handle-keyed cache could be used to amortize the costs of more detailed later stages of analysis, e.g. how much of an inlinee actually ends up getting imported, given that some of the calls it makes resolve to known constants at jit time. Longer term, it could also be used by an AOT jit querying into interprocedural analysis data, where the shape of the data will likely be dictated by an earlier analysis performed by the jit.
It certainly could be, and that would be more extensible overall. That being said, I was hoping to attack a more specific issue first, and one which I think may be better suited to an independent API rather than a "you can cache any data you want" feature.

Namely, metadata tokens are unique per module and are somewhat "expensive" to resolve, so we don't want to do that when doing an "is it profitable to inline this" check. This is often fine because most calls aren't "interesting" and won't have a significant impact on inlining. If tracked on the VM side, this would effectively require a lazily allocated …

Unlike a general-purpose "cache any information you want" API, this API is known to be small and bounded, so there wouldn't need to be a huge concern around flushing it, since the overhead will never be more than a handful of entries per module and most modules will likely be empty or only contain a couple of entries.
The expensive part of …

If you would like to have a raw API that can just attach information to a token, the API should explicitly take just …
If you do not find the token in the cache, how are you going to tell whether you have not seen this token yet vs. you have seen the token and it is not a well-known method?
The intent was that …

In tier 0, it is likely there would be no entries in this cache, and such entries wouldn't be used anyway due to the lack of inlining or other optimizations happening. However, methods that are rejitted at tier 1 would have had these self-same tokens already cached (from the tier 0 pass) if they were intrinsic IDs.

The inliner has to do a basic pass over the opcodes in a method it is determining inline profitability for, and so it has the current …

The input does not strictly need to be a …
I am sorry, but I do not see how this can work. I think you are saying that you can get from …

BTW: We have done experiments on caching …
My understanding is that metadata tokens are effectively unique 32-bit unsigned integers per "module" (each is scoped to a specific "metadata binary"). This means that in the context of a single …

This means that once something in the module has already resolved a call token to a well known …

```csharp
using System.Collections.Generic;

class Module
{
    // NamedIntrinsic mirrors the jit's enum of recognized intrinsics.
    private Dictionary<uint, NamedIntrinsic> _cachedIntrinsics;

    public void RegisterWellKnownIntrinsic(uint token, NamedIntrinsic intrinsic)
    {
        if (_cachedIntrinsics is null)
        {
            _cachedIntrinsics = new Dictionary<uint, NamedIntrinsic>();
        }
        _cachedIntrinsics[token] = intrinsic;
    }

    public bool TryGetWellKnownIntrinsic(uint token, out NamedIntrinsic intrinsic)
    {
        intrinsic = NamedIntrinsic.NI_Illegal;
        return (_cachedIntrinsics is not null) && _cachedIntrinsics.TryGetValue(token, out intrinsic);
    }
}
```

So, during tier 0 when an "interesting" named intrinsic is found, …

During tier 1 when inlining is happening, the inliner would use a sibling JIT/EE interface method to ask "is this token ID in this module a well known named intrinsic"; that method effectively returns …
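To make the intended flow a bit more concrete, here is a hedged sketch reusing the names from the pseudocode above; the `NI_SSE41_IsSupported` value and the helper methods are hypothetical, and the JIT/EE plumbing that would carry these calls is not shown:

```csharp
static class TokenCacheFlow
{
    // Tier 0: importing the method resolved this call token to a well-known intrinsic,
    // so record the mapping for the module (NI_SSE41_IsSupported is an assumed enum member).
    static void OnTier0TokenResolved(Module module, uint callToken)
    {
        module.RegisterWellKnownIntrinsic(callToken, NamedIntrinsic.NI_SSE41_IsSupported);
    }

    // Tier 1: the inliner's cheap IL pre-scan checks the same token without a full
    // resolveToken; a hit means the call can be treated as a jit-time constant.
    static bool IsKnownConstantCall(Module module, uint callToken)
    {
        return module.TryGetWellKnownIntrinsic(callToken, out _);
    }
}
```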
Right, once something in the module resolves the token, it is also possible to get raw …
I do not think we would want to be spending time doing this during tier 0.
It is not guaranteed that the cache for a given token has been populated. The token can be on a path that has not been executed yet, or the cache could have been flushed in the meantime. So this check would be returning false negatives. I do not think we would be OK with the unpredictable performance that this would introduce.
I wasn't aware we had this. I think the downside is that it requires an additional resolution on the handle to get the corresponding …
We are already resolving these tokens in tier 0, particularly for HWIntrinsics as it is a strict requirement for them to be handled by the JIT. There are some that aren't resolved, but I think those are likely less interesting.
My hope is that populating the cache is largely handled by tiered jitting. Unless a user is utilizing this code in a method marked …

For modules that are using generic specialization (e.g. …
The same is true for any cache-like setup we have that allows that cache to be flushed. I think it comes down to us saying we do or don't care that there are cases with …

My proposal is to do this in a limited scenario that allows us to keep small, pay-for-play caches of well known tokens.
To find out whether the method is an Intrinsic, you have to get its …
I disagree. A real cache that handles cache misses properly would be fully deterministic.
Seems like we could experiment here easily enough. We can have the inlinee IL scan resolve all call tokens when we're at Tier1, and see how much JIT throughput that costs us. The hypothesis is that these resolutions will be quick, as all the necessary loading and instantiation will have happened earlier, so the additional costs won't be that high. A fair number of methods bypass Tier0 today, so it may be that something more nuanced is needed...
Right, just the first time. After you have resolved that once for a module, then only the token is needed to map back to a …
My understanding is that we couldn't use a proper cache here. The issue is that token resolution can be expensive, and we don't want to be doing that, even with a cache, for every token when determining inlining profitability for a given method. So I think I'm not connecting something here, and I'm failing to see how some "real cache" that handles cache misses by fully resolving a token will save cost here.
Right, resolving the token for the first time is expensive. The solutions you have in mind seem to have non-predictable characteristics. We would see a number of cases where the inlining behavior depends on what methods were JITed before and which tokens happened to be left in the cache as a side effect. I do not consider such a solution to be a viable option.
Right, I was just trying to propose something to work around needing to resolve the tokens in the inline profitability phase, because I have been told and have seen many times that doing this in the inliner was effectively a non-starter due to how it impacts JIT throughput. If we believe we can do this without it negatively impacting throughput, or that the negative impact is justified for Tier 1 methods, then that is all the better and would also allow broader heuristics (user-defined properties that simply return a …).

However, as Andy mentioned, there are a fair number of methods that bypass Tier0 today. It would be nice if these could also be handled, either by ensuring we do the same resolutions under …
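For reference, one way a method bypasses Tier0 today is `MethodImplOptions.AggressiveOptimization`, which sends it straight to the optimizing jit, so a cache populated only during tier 0 compilations would never have seen its tokens. A minimal (hypothetical) example:

```csharp
using System.Runtime.CompilerServices;

static class Example
{
    // Hypothetical method: AggressiveOptimization bypasses tiered compilation, so there is
    // no tier 0 pass that could have populated a token cache before the optimizing jit runs.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static int Double(int x) => x * 2;
}
```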
I did a very minimal prototype here: https://github.com/tannergooding/runtime/tree/resolve-tokens-inline. This prototype effectively just updates …

I then profiled (using AMD uProf) …

In the scheme of things, I think the most "costly" aspect of resolving tokens for improving inlining heuristics is likely going to be the downstream effects if it causes something to inline that wouldn't have been inlined without the change.
Andy pointed out on Discord that I'll need to run a live benchmark rather than SPMI as originally suggested. Will report those results when done.
After discussing with @AndyAyersMS something that would work correctly (since SPMI caches tokens), I ended up testing …

This results in the following flame graph: …

Most of the time is still spent in the backend, but the numbers are actually more measurable for the impact now: …
So the data seems to agree that with TC enabled there is minimal extra cost to trying to resolve inlinee tokens early during Tier1, as most tokens we run across at Tier1 have already been resolved and re-resolving is fairly cheap. It still remains to be seen how to best realize a benefit from this extra information. I suspect recognizing the constant-valued methods will give a small improvement, but more will be needed. I don't think it is feasible to build a flow graph model during the IL scan, so I'd still be tempted to just add enough observational hints to encourage the jit to proceed with inlining, and then possibly implement a backout scheme if it turns out the cost of these inlinees that we expect to simplify is too high. While it's tempting to say that any use of HW intrinsics merits much more aggressive inlining, we still have to have some kind of limits in place.
Definitely. I think the "simplest" heuristics are likely based on some basic counts such as: …
There are definitely cases where a user could write checks in a fashion that makes us think it's a profitable method with many dead code paths when it actually isn't, but I expect those will be rare, especially in methods marked …
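As a toy sketch only, with invented counts, weights, and cap (none of this is a real jit heuristic), a count-based boost could look like:

```csharp
using System;

static class InlineHints
{
    // Toy illustration: each call site believed to fold to a jit-time constant suggests dead
    // code will be eliminated, so it nudges the profitability score upward, with an upper
    // bound so heavy intrinsic usage alone cannot force arbitrarily large inlinees.
    static int ComputeInlineObservationBoost(int isSupportedChecks, int typeEqualityChecks, int constantReturningCalls)
    {
        int boost = (isSupportedChecks * 3) + (typeEqualityChecks * 3) + constantReturningCalls;
        return Math.Min(boost, 10);
    }
}
```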
@AndyAyersMS, do you think a PR with the minimal prototype I gave would be appropriate? That would allow iterative changes like the one @EgorBo tried with …

It could be placed behind a …
Yes.
#50675 is merged. This allows us to resolve tokens for tier1 and prejit inline observations. This provides minimal improved support for …
Today, the JIT uses a set of heuristics to determine whether or not a method is profitable to inline. However, because tokens are unique per module and because of the cost associated with resolving tokens, the JIT (when inlining) can't easily see when certain code patterns will result in constants or otherwise dead code elimination. This can hinder the JIT's ability to correctly inline methods with things like `Isa.IsSupported`, `if (typeof(T) == typeof(...))` checks, or even `string.Length`.

As such, I propose we create a cache (effectively a `Dictionary<TokenId, MethodInfo>` or similar, per module) of well known methods (i.e. certain `NamedIntrinsic`s) so the inliner can correctly determine if the inlining is profitable. This table could be built up during Tier 0 and would be available during Tier 1 so that the inliner can correctly see when a call is constant and do the appropriate dead code elimination and size adjustments.
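To illustrate the kinds of patterns being referred to, here is a hedged example; the helpers are hypothetical, and the exact folding depends on the target machine and jit:

```csharp
using System.Runtime.Intrinsics.X86;

static class Examples
{
    // Hypothetical helpers for illustration. In both cases the guarding condition is a
    // jit-time constant, so most of the body is dead code after inlining, but the inliner's
    // cheap IL pre-scan cannot tell without resolving the tokens for the calls involved.

    public static int SizeOf<T>()
    {
        if (typeof(T) == typeof(int)) return 4;
        if (typeof(T) == typeof(long)) return 8;
        return -1;
    }

    public static bool HasAcceleratedPath()
    {
        // Sse41.IsSupported evaluates to a constant true/false during jitting.
        return Sse41.IsSupported;
    }
}
```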