JIT: consider optimizing small fixed-sized stackalloc #7113
Comments
I did a quick prototype of this, converting any fixed-size stackalloc of 32 bytes or less, and the only hits in our normal set of code size assemblies (which includes corefx and roslyn) were in some interop stubs in the core library. The diffs look good and the jit changes are small and likely low risk, but I'd like to have a bit more motivation before pushing further. @benaadams if there's an easy way to drop the latest jit into an aspnet run, I can try it there and see if it hits more often or in more interesting places. Last time we looked at doing this it was nontrivial, though. I could also backport to 1.1 pretty easily, I think, if that helps.
For motivation, I imagine with Span working as an "array" over FormatInt64 is a pretty widely used function in AspNet and GenerateConnectionId (and the identical GenerateRequestId and GenerateGuidString) is stackalloc'd 13 but again doesn't look greatly inlinable; and I think they are lazy-init'ed. To drop latest Jit into aspnet; I compile a project as
At this point it fails with System.IO.TextInfo can't find method error so also copy files from Unfortunately that only covers start-up; I think there are some other missing methods; as it won't send a response back - but am looking into it...
Two examples I'm definitely aware of where stackalloc has been avoided due to performance are and https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Char.cs#L932-L936
Simple app for plaintext testing is:

using System.IO;
using System.Text;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Logging;

public class Startup
{
    // Pre-encoded response body; 13 bytes, matching the Content-Length below.
    private static readonly byte[] _helloWorldPayload = Encoding.UTF8.GetBytes("Hello, World!");

    public void Configure(IApplicationBuilder app, ILoggerFactory loggerFactory)
    {
        //loggerFactory.AddConsole(LogLevel.Information);
        app.Run((context) =>
        {
            var response = context.Response;
            response.StatusCode = 200;
            response.ContentType = "text/plain";
            response.Headers["Content-Length"] = "13";
            return response.Body.WriteAsync(_helloWorldPayload, 0, _helloWorldPayload.Length);
        });
    }

    public static void Main(string[] args)
    {
        var host = new WebHostBuilder()
            .UseKestrel()
            .UseUrls("http://localhost:5001/")
            .UseContentRoot(Directory.GetCurrentDirectory())
            .UseStartup<Startup>()
            .Build();
        host.Run();
    }
}

Though I don't think it will hit any stackallocs.
Also need from corefx
Haven't tried this on asp.net yet, but I did run the change through the desktop SPMI collection (a database of roughly 2M methods we use to validate aspects of desktop codegen). No hits for this outside of IL stubs and a handful of stackalloc-specific test cases. However, the size savings from stubs alone might make this worth doing.
Jit source changes here for future reference.
Try an example of asp.net with SslStream?
Forget that; the stack allocations will be too big for 32 bytes.
I think the stackallocs in SslStream are 48 bytes x86 and 64 bytes x64 (4 x
Revived here: master...AndyAyersMS:ReviveOptLocalloc Some thoughts:
There's also the new C# 7.2 safe Span stackalloc, which might make it more widespread, e.g. in IPAddressExtensions:
Span<byte> addressBytes = stackalloc byte[IPAddressParserStatics.IPv6AddressBytes];
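A minimal sketch of the safe Span form with a small constant size, the shape this issue is about. The class and method names are hypothetical, and whether a given JIT build actually turns this buffer into an ordinary local is an assumption, not something the code can show.

```csharp
using System;

static class SpanStackallocSketch
{
    // Hypothetical helper: fills a 16-byte stack buffer and sums it.
    // No 'unsafe' keyword is needed because the result is held in a Span<byte>.
    public static int FillAndSum()
    {
        Span<byte> buffer = stackalloc byte[16]; // constant size, under 32 bytes

        // stackalloc memory should not be assumed zeroed, so write every element.
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = (byte)i;
        }

        int sum = 0;
        foreach (byte b in buffer)
        {
            sum += b;
        }
        return sum; // 0 + 1 + ... + 15
    }
}
```

Because the buffer never escapes the method and its size is a compile-time constant, it is the kind of allocation the threshold discussion in this thread applies to.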
Realized that there's often a cast between the fixed size and the localloc. Fixing that we see things firing in more cases:
(this is with 32 bytes). Updated my fork to support inlining of these methods. E.g.:

using System;

class L
{
    // Fixed-size stackalloc: candidate for conversion to a local buffer.
    unsafe static int Use4()
    {
        byte* i = stackalloc byte[4];
        i[2] = 50;
        return i[2] * 2;
    }

    // Size depends on the argument: only fixed-size if inlined with a constant.
    unsafe static int Use(int x)
    {
        byte* i = stackalloc byte[x];
        i[1] = 50;
        return i[1] * 2;
    }

    public static int Main()
    {
        int v0 = Use4();
        int v1 = Use(10);
        int v2 = Use(100);
        int v3 = Use(v0);
        int v4 = 0;
        int v5 = 0;
        int v6 = 0;
        for (int i = 0; i < 7; i++)
        {
            v5 += Use4();
            v5 += Use(4);
            v6 += Use(v0);
        }
        // Every call returns 100, so the sum is 2500 and Main returns 100.
        return v0 + v1 + v2 + v3 + v4 + v5 + v6 - 2400;
    }
}
Tried setting the threshold to 128 and 256. The latter gets a few more cases, but we also start to see code size regressions because of larger stack offsets. This optimization often allows RSP-based addressing for locals in methods that were using RBP-based addressing before. The potential downside is that if the stackalloc local ends up closer to RSP than the other locals, then their offsets grow, and we only get 128 bytes of cheap local offsets. Accesses to the stackalloc local tend to be relative to its base, so its position on the frame is less critical. I'm going to see if there is any way to put the stackalloc local closer to the root of the frame; if that's possible it should avoid code size increases.
Looks like over in
Optimize fixed sized locallocs of 32 bytes or less to use local buffers. Also "optimize" the degenerate 0 byte case. Allow inline candidates containing localloc, but fail inlining if any of a candidate's locallocs do not convert to local buffers. The 32 byte size threshold was arrived at empirically; larger values did not enable many more cases and started showing code size bloat because of larger stack offsets. We can revise this if we are willing to reorder locals and see other cases where it might apply. Closes #8542. Also add the missing handler for the case where the callsite is in a try region; this was an oversight.
Optimize fixed sized locallocs of 32 bytes or less to use local buffers. Also "optimize" the degenerate 0 byte case. Allow inline candidates containing localloc, but fail inlining if any of a candidate's locallocs do not convert to local buffers. The 32 byte size threshold was arrived at empirically; larger values did not enable many more cases and started showing code size bloat because of larger stack offsets. We can revise this threshold if we are willing to reorder locals and see fixed sized cases larger than 32 bytes. Closes #8542. Also add the missing handler for the case where the callsite is in a try region; this was an oversight.
Optimize fixed sized locallocs of 32 bytes or less to use local buffers, if the localloc is not in a loop. Also "optimize" the degenerate 0 byte case. Allow inline candidates containing localloc, but fail inlining if any of a candidate's locallocs do not convert to local buffers. The 32 byte size threshold was arrived at empirically; larger values did not enable many more cases and started showing code size bloat because of larger stack offsets. We can revise this threshold if we are willing to reorder locals and see fixed sized cases larger than 32 bytes. Closes #8542. Also add the missing handler for the case where the callsite is in a try region; this was an oversight.
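To make the conditions in that commit message concrete, here is a rough sketch of three stackalloc shapes and how the description would seem to classify them. The class and method names are hypothetical, the classification in the comments is my reading of the commit message rather than verified JIT output, and the code needs to be compiled with unsafe code enabled.

```csharp
using System;

static unsafe class LocallocCases
{
    // Fixed size, 32 bytes or less, not in a loop:
    // per the commit message this should be a candidate for a local buffer.
    public static int FixedSmall()
    {
        byte* p = stackalloc byte[16];
        p[0] = 7;
        return p[0];
    }

    // Size only known at run time (n must be at least 1 here):
    // presumably stays a true localloc unless inlining makes n a small constant.
    public static int VariableSize(int n)
    {
        byte* p = stackalloc byte[n];
        p[0] = 7;
        return p[0];
    }

    // Fixed size but executed inside a loop:
    // excluded by the "not in a loop" condition, since each iteration
    // performs its own allocation.
    public static int InLoop(int iterations)
    {
        int total = 0;
        for (int i = 0; i < iterations; i++)
        {
            byte* p = stackalloc byte[8];
            p[0] = 1;
            total += p[0];
        }
        return total;
    }
}
```

All three methods behave the same whether or not the conversion happens; the difference would only show up in the generated code and in whether the method can be inlined.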
In some cases small fixed-sized stackallocs could be turned into regular local vars, which (if this transformation could be anticipated) would unblock inlining. See related discussion in #7109. @benaadams feel free to add pointers to examples you've found.