Double Vector128 for SpanHelpers.IndexOf(byte,byte,int) on ARM64 #66993
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory

Issue Details

This PR adds a "double-Vector128" path for ARM64 to IndexOf(byte), a sort of alternative to the AVX2 path on x86. A quick benchmark:

```csharp
public IEnumerable<byte[]> TestData()
{
    yield return File.ReadAllBytes(@"/Users/egorbo/prj/runtime/THIRD-PARTY-NOTICES.TXT");
}

[Benchmark]
[ArgumentsSource(nameof(TestData))]
public int IndexOf_NotFound(byte[] data) => data.AsSpan().IndexOf((byte)222);
```
It's twice as fast for large inputs. I tested small inputs with various cases and didn't notice any penalty from this change, e.g.: https://gist.github.com/EgorBo/69ac4df69a2c1cd8def04e0614d86f76

I'm leaving the opportunity to port the whole thing to cross-plat vectors for someone else; my quick attempt had worse numbers with Vector128.ExtractMostSignificantBits vs TryFindFirstMatchedLane, which is used now to calculate positions in the comparison result for arm64.
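For illustration, here is a minimal sketch of what an interleaved "double-Vector128" inner loop can look like. This is not the PR's actual code: the class/method names (`Sketch`, `IndexOfDoubleVector128`) are made up, and it assumes the .NET 7 `Vector128.LoadUnsafe`/`ExtractMostSignificantBits` helpers. The key idea is that the two loads/compares per iteration have no dependency on each other and only the combined result is branched on, so hardware with enough execution ports can pipeline the two chains.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class Sketch
{
    // Sketch only: the real SpanHelpers code also deals with alignment, uses a
    // vectorized tail, and locates the matched lane differently on arm64.
    public static int IndexOfDoubleVector128(ReadOnlySpan<byte> span, byte value)
    {
        ref byte start = ref MemoryMarshal.GetReference(span);
        nuint length = (nuint)span.Length;
        nuint i = 0;

        if (Vector128.IsHardwareAccelerated && span.Length >= 2 * Vector128<byte>.Count)
        {
            Vector128<byte> target = Vector128.Create(value);
            nuint lastBlock = length - (nuint)(2 * Vector128<byte>.Count);

            while (i <= lastBlock)
            {
                // Two independent compares: no data dependency between them.
                Vector128<byte> eq0 = Vector128.Equals(Vector128.LoadUnsafe(ref start, i), target);
                Vector128<byte> eq1 = Vector128.Equals(
                    Vector128.LoadUnsafe(ref start, i + (nuint)Vector128<byte>.Count), target);

                // Single branch on the combined result.
                if ((eq0 | eq1) != Vector128<byte>.Zero)
                {
                    uint mask0 = eq0.ExtractMostSignificantBits();
                    if (mask0 != 0)
                        return (int)i + BitOperations.TrailingZeroCount(mask0);
                    return (int)i + Vector128<byte>.Count
                                  + BitOperations.TrailingZeroCount(eq1.ExtractMostSignificantBits());
                }

                i += (nuint)(2 * Vector128<byte>.Count);
            }
        }

        // Scalar tail for simplicity; the real code handles the tail with vectors.
        for (; i < length; i++)
            if (Unsafe.Add(ref start, i) == value)
                return (int)i;

        return -1;
    }
}
```

The sketch locates the match with `ExtractMostSignificantBits` only for simplicity; as the description above notes, the actual arm64 code uses `TryFindFirstMatchedLane` because the mask-extraction approach benchmarked worse there. On an in-order or narrow core the two compare chains simply execute back to back, which is why the discussion below focuses on whether little/efficiency cores actually benefit.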
The numbers look good, but I'm still concerned that we don't have a way to test and show the impact of changes like this on efficiency/little cores (which may not have pipelining support), and that we don't have a way to reliably show the power-consumption cost and its impact for mobile scenarios (iOS/Android).
Do you have some numbers/data showing what the perf difference was here and where the codegen inefficiency was? My guess is that it's the "missing opt" around specially handling …
Here is my change: EgorBo@ef13f72. The benchmark from the PR's description performs at ~
Happy to test it once we get an API/instructions on how to do it. I spent some time trying to hack my M1, but with no luck (tried ProcessorAffinity) 🙁
Is it possible to JIT Vector256 usage down to such usage of Vector128, basically emulating the former on the latter? I'd like to avoid lots of such manual emulation if at all possible... it's yet one more thing developers will need to think about in what's already a complex web of decisions, another code path to maintain, etc.
Yes, but it's more work to enable and isn't a thing that most devs should do by default. There is a lot of nuance around unrolling vs. interleaving (this PR is interleaving two iterations), and the latter generally depends on hardware pipelining to actually be more efficient/better. It's why I called out the concerns around perf on other hardware and specifically around "little" or "efficient" scenarios. Just getting better numbers on some hardware doesn't necessarily mean a change is good/worthwhile. We notably don't have the setup to do more extensive testing or to automatically validate, over time, the scenarios I called out. But they are still important scenarios to consider and may vary based on whether we are considering "Server" (including Cloud) vs. "Desktop" vs. "Mobile".
I agree it looks bad, but I noticed that IndexOf(byte) and two IndexOfAny overloads are quite frequent guests in TE traces, mostly due to HTTP header parsing and … As for the suggestion to expand Vector256: yeah, I agree with Tanner that it's going to be complicated to properly pipeline stuff; we don't have an instruction-selection phase in the JIT yet.
Are enough headers long enough to benefit from this for it to move the needle?
The promise of Vector128/256 was that developers wouldn't need to worry about cross-platform differences; this suggests that's not entirely the case. And now devs need to choose between …
And that's still generally the case, and a place where you and I disagreed on what is good or not. I still hold the opinion that simply targeting Vector128, even if it potentially leaves perf on the table, is the best default choice for devs wanting something that is simple, extensible, and generally "just works". It has the least overall number of considerations. Devs who want additional perf can then consider also using …
The perf here isn't bad; it's just that this is a "perf critical" scenario for an area that the BCL itself cares about in relation to TechEmpower, so it qualifies under the second consideration of "want/need additional perf, so loop unrolling or similar is worth considering".
I think SpanHelpers is definitely not something anybody should ever use as an example; it contains a lot of hacks 😄 e.g. relying on the fact that an aligned load won't cross a page boundary so we can load more than needed, the actual manual data alignment, …
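As a hedged illustration of the page-boundary trick mentioned above (the general idea only, not the actual SpanHelpers code; the names `AlignmentSketch`/`LoadVectorContainingLastByte` are made up): a 16-byte load from a 16-byte-aligned address lies entirely within one page (assuming the page size is a multiple of 16 bytes, which holds on mainstream platforms), so loading the aligned vector that contains the buffer's last byte cannot fault even though it may cover a few bytes past the logical end.

```csharp
using System.Runtime.Intrinsics;

static unsafe class AlignmentSketch
{
    // Hypothetical helper for illustration only; assumes length > 0.
    public static Vector128<byte> LoadVectorContainingLastByte(byte* start, int length)
    {
        // Round the address of the last valid byte down to a 16-byte boundary.
        // The resulting 16-byte load stays within a single page, so it is safe
        // even though some of the loaded bytes lie past 'start + length'.
        byte* lastAligned = (byte*)((nuint)(start + length - 1) & ~(nuint)15);
        return Vector128.Load(lastAligned);
    }
}
```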
Browsers typically send ~1000 bytes of headers; in TE benchmarks those are ~200 bytes.
If the search being done is looking for a newline, total header length doesn't really matter; what matters is the length of individual header lines. Does this PR show benefits for TE or other ASP.NET benchmarks?
Not from my vantage point. And this isn't about default guidance. We created Vector128 and Vector256 purportedly so that folks could optimize both cases without necessarily being able to test on all platforms. This PR shows that someone who cares about the extra 2x (which, if the guidance is to start with a 128-bit code path and add a 256-bit code path where beneficial, is pretty much anyone who writes that 256-bit path) is now not going to be satisfied with ARM64 perf, which is largely the reason these helpers were added. I get that it's complex, and based on previous interactions I'm expecting a large response citing efficiency cores and downclocking and unknown future hardware and so on... that doesn't change the fact that, from where I stand, this isn't living up to the promise if devs need to do what's being done in this PR.
This is a scenario-specific optimization, not the default for most code. And even the code in this PR is "suboptimal" for various platforms; we could likely get another 2-4x increase on top of this for a 200-1000 byte payload. Doing that probably isn't worth the investment/effort and long-term maintenance required, however. There is always going to be more perf you can eke out, so it's a question of what is a good enough default for users who just want to get good perf without thinking too much about what is the best perf. The xplat helper APIs available from Vector128/256 are for that scenario and allow you to much more easily transition into getting the better perf when and if that becomes relevant for your scenario. The BCL itself is special in that a lot of scenarios are relevant for a lot of our customers, and so we have more places that have these additional micro-optimization paths. There are likewise cases that aren't as important and which don't get the same level of care and micro-optimization.
There's a method for which it was important enough to have both 128-bit and 256-bit code paths (it's implemented today using intrinsics, but I expect the plan is to switch it over to using …). Such methods, where a dev wants to get the extra (up to) 2x by supporting 256-bit SIMD, are not niche. And yet we're on a path where the …
What would that investment and maintenance be? 2-4x is huge if IndexOf at all lengths is something we care about.
The core libraries are special because they're used in many different scenarios with many different constraints and requirements. But while they're special in that regard, that doesn't mean they're unique. There are plenty of libraries that fit those same criteria, that don't know how much data will be fed through them, and that care about performance. I'm not sure why that's up for debate.
Again, this is a scenario-specific optimization, and one that, when it was put up, I immediately called out as having questionable impact on other ARM64 hardware.
There is a distinct difference between doing …
This remains true. You check …
The code is set up in the JIT in a way that allows most of the code that supports VectorXXX to be shared. So supporting just Vector128, vs. both Vector128 and Vector256, or extending that to support Vector512, 1024, 2048, or almost any other size in the future, is negligible overall. These types and performance-oriented paths are applicable to multiple platforms. Not every size will be applicable to "all" platforms except for Vector128, where if SIMD acceleration exists, it can essentially be guaranteed you will end up having Vector128 support (there are a lot of complex and historical reasons for this that aren't worth getting into here).

Even within a single platform/architecture, there can be multiple ISAs and instructions that are used to support a given operation. There can be multiple options available for doing something, and rationalizing/keeping track of the "best way" is difficult. This allows most of that logic to be codified in the JIT, so not everyone needs to spend 3-6 months (or even several years) learning all the SIMD secrets. Users, for the majority case, don't have to consider things like "I need to do a comparison so I should do …".

The JIT is able to take care of most of this for you by specializing the helper methods to do the right thing, and will do so over time, giving users "free perf" in a majority of cases/scenarios. Even if you are targeting Vector256 and Vector256 happens to only be supported on x86/x64, this is very much still a simplification and a vast improvement over what typically needed to be maintained.

There are then still cases where you have operations that have no good cross-platform representation, and cases where you need or want the extra tuning/perf/control. In those cases you have the ability to use the "raw" intrinsics entirely, or to use a mix of helpers and selectively light up with the "raw" intrinsics for the cases where you can specialize even more for a given platform/scenario.
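To make the "helpers codify the best way" point concrete, here is a hedged sketch (the names `XplatSketch`/`ContainsByte` are assumptions, not BCL code) of what plain cross-platform helper usage looks like. The author expresses the comparison once; the JIT is responsible for lowering it to whatever each ISA prefers (e.g. a compare-and-movemask sequence on x64, a compare plus a mask-extraction sequence on Arm64), without the author picking instructions.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class XplatSketch
{
    // Illustrative only; shape and names are assumptions, not BCL code.
    public static bool ContainsByte(ReadOnlySpan<byte> span, byte value)
    {
        if (!Vector128.IsHardwareAccelerated || span.Length < Vector128<byte>.Count)
        {
            // Scalar fallback for tiny inputs or platforms without SIMD enabled.
            foreach (byte b in span)
                if (b == value)
                    return true;
            return false;
        }

        Vector128<byte> target = Vector128.Create(value);
        ref byte start = ref MemoryMarshal.GetReference(span);
        nuint lastVector = (nuint)(span.Length - Vector128<byte>.Count);

        nuint i = 0;
        for (; i < lastVector; i += (nuint)Vector128<byte>.Count)
            if (Vector128.EqualsAny(Vector128.LoadUnsafe(ref start, i), target))
                return true;

        // Final, possibly overlapping vector covers the tail without a scalar loop.
        return Vector128.EqualsAny(Vector128.LoadUnsafe(ref start, lastVector), target);
    }
}
```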
Code that is several times the length and which does various unsafe tricks/hacks to deal with alignment, to handle loading/storing memory that is just going to repeatedly trash the cache, to vectorize the tail end of the loops rather than falling back to scalar, etc.

2-4x can be big, yes. It can also be relatively nothing and not worthwhile. Going from …

The code would be overall harder to understand, less readable, and less maintainable, all for something which likely doesn't provide benefit to most users/use-cases. Users who are working with "big data" and for whom it is a bottleneck then have a few options available, including finding an existing SIMD library designed to work with "big data" or rolling their own. For them it would be worthwhile to introduce the additional complexity and maintenance costs for their specialized scenario. It does not, IMO, make sense for the entire BCL to go and make all SIMD paths that much more complex for a scenario that isn't the 90% use case.
Yes, and other special libraries may likewise want to do something a bit more complex.

In .NET 6 we basically had a cliff: you either had "simple scalar logic" or you "manually SIMD everything" for all platforms you care about. This was a great place to be in for devs who were familiar with SIMD and wanted to take advantage and light up support for their scenarios, and it put .NET as a thing that could actively compete with C/C++, Rust, and other low-level languages that give you this kind of control. However, being unfamiliar with a platform, not wanting to keep up to date with changes and newer instructions over time, and other factors all played into making this difficult for people who didn't have the right tools/experience to accelerate their own scenarios.

In .NET 7, with the introduction of the cross-platform helpers, we now have a nice and accessible ramp. SIMD acceleration is now something that almost anyone can reasonably add to their codebase, and they can pick and choose what level is right for them and their code.

Level 1 is a very good place to be in and will itself provide approx. 2-16x improvements (depending on the underlying element type) for a lot of code. It will apply to the broadest set of platforms, and while it may not give the best performance everywhere, it gives you good performance with relatively little corresponding effort. It will be supported effectively everywhere because 128-bit SIMD support is the standard baseline. The only places it really won't light up are places that don't have any SIMD support (WASM, which is getting SIMD support in the near future) or where the JIT hasn't enabled SIMD support yet (Arm32 and x86 Unix). There is a reasonable expectation here that if you get increased performance on one platform/architecture, a similar increase will be visible on all platforms/architectures. Likewise, you can reasonably expect perf to "automagically" increase over time for newer hardware based on the JIT adding support and optimizing for it.

Level 2 is a step up. It starts getting into cases where you need to have more considerations around data sizes, alignments, and the fact that it may not be supported everywhere. Using …

Level 2 cont. There is then a different variation of what might be considered "level 2" where you also start considering things like pipelining (what this PR is doing). This can likewise provide an approx. 2-3x improvement over level 1, but it can also provide no improvement or worse performance. What you get depends on the hardware you are running against and what it supports. An Alder Lake power core supports "dual dispatch" for most Vector128 and Vector256 instructions (based on publicly available instruction timing info). An Alder Lake efficiency core supports "triple dispatch" for most Vector128 instructions and "single dispatch" for most Vector256 instructions (based on publicly available instruction timing info). While the efficiency cores support 3x throughput, they will still not typically outperform the 2x throughput of the power cores because they are clocked lower and may have longer instruction latency costs. In the cases where only "single dispatch" exists for a given instruction/port, it may still be pipelinable with a subsequent instruction that executes on a different port and which is not dependent on the previous instruction. For example …

Level 3 is then again a step up where you start getting the best performance by utilizing the "raw intrinsics" and codifying support for individual platforms.
In such a scenario you become responsible for everything the JIT was managing for you before. You have to deal with ISA differences, add support for new hardware as it comes online, validate that the behavior is correct in the various platform-specific edge cases, etc. Most users and most code won't want to be here. It generally requires extensive knowledge and the ability/desire to maintain such code over time. It is still important to have because it provides the baseline on which the other layers are built, and because there are users and libraries where having this kind of control is desirable/necessary, but it is ultimately a rare scenario.

There are then some other in-betweens involving mixing of helpers and "raw intrinsics", or that get into the complexity of dealing with non-temporal data, dealing with partial SIMD operations, etc.

We are, all in all, in a much better place, and I expect that the BCL will use a mix of Level 1 and Level 2 depending on scenario. We have many functions today that are only …
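As a rough sketch of the Level 1 vs. Level 2 shape described above (assumed code, not a BCL implementation; `TieringSketch`/`CountMatches` are illustrative names): the Vector256 branch only lights up where it is accelerated and the input is large enough, and everything else falls back to the Vector128 path or scalar code.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class TieringSketch
{
    // Counts occurrences of 'value'; names and structure are illustrative only.
    public static int CountMatches(ReadOnlySpan<byte> span, byte value)
    {
        ref byte start = ref MemoryMarshal.GetReference(span);
        nuint length = (nuint)span.Length;
        nuint i = 0;
        int count = 0;

        if (Vector256.IsHardwareAccelerated && span.Length >= Vector256<byte>.Count)
        {
            // "Level 2": wider vectors where they are supported.
            Vector256<byte> target = Vector256.Create(value);
            for (; i <= length - (nuint)Vector256<byte>.Count; i += (nuint)Vector256<byte>.Count)
                count += BitOperations.PopCount(
                    Vector256.Equals(Vector256.LoadUnsafe(ref start, i), target).ExtractMostSignificantBits());
        }
        else if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<byte>.Count)
        {
            // "Level 1": the 128-bit baseline that is effectively available everywhere.
            Vector128<byte> target = Vector128.Create(value);
            for (; i <= length - (nuint)Vector128<byte>.Count; i += (nuint)Vector128<byte>.Count)
                count += BitOperations.PopCount(
                    Vector128.Equals(Vector128.LoadUnsafe(ref start, i), target).ExtractMostSignificantBits());
        }

        // Scalar tail for whatever is left.
        for (; i < length; i++)
            if (Unsafe.Add(ref start, i) == value)
                count++;

        return count;
    }
}
```

Note that the two vector branches are structurally identical and only differ in width, which is the "negligible extra maintenance" point made earlier; the per-ISA instruction selection stays inside the JIT.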
I think this boils it down quite well (at least for the current state), and from what I see in the community this information should be shared, so that a lot of confusion can be eliminated beforehand and .NET doesn't become too "confusing".
There is a plan for me to write up some guidance documentation. Just not an agreement on what that guidance should be.
Thanks for the write-ups and discussions, my 50 cents: …
Thank you.
I'm going to close this PR and file an issue with some of my findings/research.