Finalizing the instruction set #343
Comments
I think this is a good idea, but as part of this I think we should also come up with and make public an overall strategy for how incremental extensions will be made to the (fixed-width) instruction set. Having such a strategy will make it easier to draw the line for the MVP.
@lars-t-hansen, that's a good idea. My assumption was that we would lean heavily on the normal proposal process for follow-on extensions, but the more we sketch out a framework for how we will evaluate and group post-MVP instructions now, the more streamlined those follow-on proposals will be. Did you have specific thoughts about what our strategy should be? I think that one of the things that has worked out really well in this proposal has been having clear technical guidelines (e.g. portability, determinism) to focus development and prevent feature creep. Follow-on extensions to this proposal (i.e. not radically different proposals like scalable vectors or machine-specific instructions) will be easier to scope if they each focus on doing one thing well, either by focusing on a particular use case (e.g. neural networks) or a single instruction family (e.g. fma).
@tlively, I think grouping instructions thematically is going to be very helpful and I support that. We need to be careful about how we define the groups so that we don't get into another (bulk memory, reference types, threads) quagmire, but maybe that's not too hard with these low-level instructions. Part of our SIMD strategy will likely have to do with dealing with machine-specific corner cases, though. IIRC both arm64 and x64 have support for 8-bit operations but differ in some cases. Also IIRC, the QFMA class of instructions had similar issues. These are desirable instructions (or instruction groups), and we should probably be talking about principles for handling cross-architecture incompatibilities (beyond NaN cases) so that we can ship this type of functionality.
Right, I think we need to start talking again about conditional compilation for these sorts of features -- either as a generic feature (https://github.com/WebAssembly/conditional-sections) or as something specific to SIMD.
I think we need to set a date for finalizing the instruction set, but this date should be in the future, not imminent. Right now the SIMD specification is missing whole classes of instructions (e.g. #348), and finalizing this early would make it unusable for a set of use cases. IMO a good criterion for finalizing the ISA would be to check that all instructions from ARM NEON and x86 SSE4.1 either:
Adding equivalents of "missing" hardware instructions does not guarantee users would be able to leverage them. In particular, I'd like to avoid a situation where we add some number of instructions that all map to existing hardware ops, but the overall speedup vs. scalar is around 1x. Also, there are other desired properties aside from the mapping of hardware instructions; code size is probably the most important criterion. If we are serious about evaluating whether or not we can add any more instructions, we need to have a set of actual benchmarks, not microbenchmarks, linked from this repo, which we can test for both speed and size of generated binaries. If we can't come up with something like this, I'd rather freeze the set against adding new operations (fixing already-added ops is OK).
Getting full coverage of portable native instructions is an explicit non-goal of this proposal. That's why we've always had criteria demanding demonstrated speedups in real-world use cases. @Maratyszcza, how far in the future would you want the cutoff date to be? |
IMO, we need to hold until at least the end of the year, to make sure the SIMD specification works for multiple classes of applications. I'm concerned that the SIMD specification might be too tied to the small number of applications ported so far, and missing important instructions needed for other applications. Checking instructions present in SSE4.1 and NEON is a good exercise in looking for missing instructions, as the designers of native SIMD instruction sets already had to make similar decisions on which instructions to include.
Are there specific teams that we're waiting for? I'm curious why an additional 3 months is necessary now, since SIMD has been available experimentally for a year+ now.
Agreed, but there's no requirement that these instructions go into the SIMD MVP. We already know that there are instructions that will never be included in the SIMD MVP (e.g. anything with non-determinism), so I think it's reasonable to hold off on other instructions too, knowing that they can be bucketed into the next round of instructions.
The Origin Trial has been active since the July 14, 2020 release of Chrome 84, just over two months. AFAICT, many teams were reluctant to port code to WebAssembly SIMD before the OT, as there wasn't any path to production. Some types of applications we haven't seen yet are:
Besides, I have several instruction proposals in review (#124, #127, #237, #247, #290, #350, #352), and I'd like to see them accepted/rejected based on performance evaluation. I think three months would be enough to finish it.
WebAssembly doesn't have any mechanism for runtime feature detection, and thus software must treat different WAsm extensions as separate architectures (e.g. TensorFlow.js has to recompile 3 times: for WAsm MVP, WAsm SIMD, and WAsm SIMD+Threads). Supporting both WAsm SIMD and WAsm SIMD2 is not like supporting SSE2 and SSE4; it is like supporting both x86 and ARM. Because of the resulting deployment pains, I want WAsm SIMD MVP to be sufficiently feature-complete that most applications could target it.
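Pending a standardized mechanism, the per-feature-set deployment described above can at least be driven from JavaScript by probing the engine with `WebAssembly.validate`. A minimal sketch (the helper name and module bytes are illustrative, not from this thread): a real SIMD probe would validate a tiny module containing a SIMD opcode, while here only the mechanism itself is shown, using a minimal empty module.

```javascript
// Sketch of runtime feature probing via WebAssembly.validate.
// A real SIMD probe would encode a tiny module that uses a SIMD
// opcode; validate() returns true only if the engine accepts it.
// Here we demonstrate the mechanism with a minimal empty module
// (just the wasm magic number and version) and a non-wasm buffer.
const validHeader = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);
const notWasm = new Uint8Array([0x00, 0x00, 0x00, 0x00]);

function supports(moduleBytes) {
  return typeof WebAssembly === 'object' && WebAssembly.validate(moduleBytes);
}

console.log(supports(validHeader)); // true on any MVP-compliant engine
console.log(supports(notWasm));     // false
```

An application can use such probes once at startup to pick which of its precompiled builds (MVP, SIMD, SIMD+Threads) to fetch.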
Thanks, this is helpful. I'm curious if we know whether anyone is actively working on porting for these applications, however. I wouldn't want to wait 3 months and be in the same place we are now.
Some of these are older proposals, and I agree they should be resolved. Some of them are much newer, however. I'm a little concerned that we will keep finding another few instructions to add and not be able to move forward with the MVP.
Agreed, this is an issue @lars-t-hansen brought up above. I think we should seriously consider adding an explicit runtime SIMD feature detection instruction for this purpose, so we don't have to increase the size of the support matrix you mention.
See #356!
Good that Marat mentions crypto - AES would probably be helpful, as it's very widely used. PPC and x86 have straightforward instructions, but unfortunately ARM splits the operation differently: instead of ShiftRows, SubBytes, MixColumns, XOR (x86 AESENC), it has ShiftRows+SubBytes+XOR (AESE) and MixColumns (AESMC), and the AESE and AESMC must be consecutive to enable fusing. Anyone interested in defining a way to bridge that gap?
Cryptographic primitives are already exposed through the Web Cryptography API, so I see little value in having AES instructions in the spec: they would be nice for parallel RNG, but given that multithreaded WAsm applications are uncommon, it wouldn't have much usage. I was thinking more about carryless multiplication, which has use cases in hashing beyond cryptography.
CLMUL is indeed interesting, +1 for that. AES is useful beyond standard crypto, though - RNG as you say (I wrote one), and some other hashes also use it (Meow hash).
Hi there, I would love to add a set of sample applications or test suites that would demonstrate whether or not we have all of the needed instructions, MVP or not. For the last few weeks I've been diligently testing the SIMD suite with a very basic but important computation: conversion of RGB to YUV (rgb2yuv). In this process I've noticed that we're missing key features that exist in every modern chipset, for instance a swizzle that can zero lanes in a single op, or a horizontal add operation (#20). If these don't make it into the MVP, we definitely need a way to detect them, add them, and possibly polyfill them in the future.

In addition, there are weird gotchas, like V8 deciding to treat everything as unaligned accesses. It's super sad when we're putting in the work to make sure we get efficient performance improvements to our JavaScript applications. Where do these fit in the standardization pipeline, and how do we encourage everyone to keep implementing new features?
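For context on the rgb2yuv kernel mentioned above, here is a scalar sketch of the luma (Y) half of the conversion, using the common BT.601 fixed-point coefficients (the function name is illustrative, not from the library in question). Each output pixel is a small dot product over u8 inputs with constant coefficients, which is exactly the pattern that a horizontal multiply-and-add instruction like pmaddubsw accelerates.

```javascript
// Scalar reference for the luma (Y) channel of an RGB -> YUV
// conversion, BT.601 limited-range fixed-point form: three u8 inputs
// are multiplied by small constants, summed, rounded, and rescaled.
// A SIMD version wants to do 16 of these dot products per iteration.
function rgbToY(r, g, b) {
  return 16 + ((66 * r + 129 * g + 25 * b + 128) >> 8);
}

console.log(rgbToY(0, 0, 0));       // 16  (black)
console.log(rgbToY(255, 255, 255)); // 235 (white)
```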
Thanks for testing! We can reconsider the horizontal add operation in light of other recent additions, since I don't think we've discussed it recently. Could you add a comment to #20 about your use case?
We are actively looking at designing a mechanism to make this easy. See #356.
My guess is that changing this would be a significant amount of effort for V8, but @ngzhian would have more details. It would be very helpful to understand the performance improvements we might get by putting in this effort, especially since I've heard that alignment doesn't matter so much on recent architectures. Would you be able to do an experiment to figure out what the difference is?
Concrete performance data showing the benefit of implementing new features or the benefit of improving implementations is the best way to motivate this work.
It is a bit more subtle. On HSW+, if code is limited by L1 read ports (2 per cycle), unaligned access halves the speed (it can only do one per cycle). Otherwise, there is indeed no impact on throughput or latency.
I'm surprised about that, but uops.info seems to agree.
That's news to me since the documentation says if the index is out of bounds, it selects the 0th lane -- which is definitely what happens when working with shuffle. |
The documentation says "For indices outside of the range the resulting lane is 0." Isn't this what you want? Is the wording confusing? |
Yeah, the wording is confusing. That's definitely what I want. It should say the lane's value will become zero.
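As a scalar sketch of the swizzle semantics being discussed (assuming the documented behavior that any out-of-range index produces a zero lane; the function name is illustrative):

```javascript
// Scalar model of i8x16.swizzle semantics as documented: lane indices
// 0..15 select a byte from the input vector, and any index outside
// that range yields a zero lane.
function swizzle(a, indices) {
  return indices.map(i => (i >= 0 && i < 16) ? a[i] : 0);
}

const v = Array.from({ length: 16 }, (_, i) => i + 100); // lanes 100..115
console.log(swizzle(v, [0, 15, 16, 255])); // [100, 115, 0, 0]
```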
Just to summarize what I'm doing here, and to respond to @tlively's comments: I'm using WebAssembly SIMD to augment existing JavaScript libraries, ssim and jest-image-snapshot, for which I'm an active author and contributor. In a nutshell, jest-image-snapshot does automated testing of webpages by comparing screenshots, and ssim.js is a library for doing image comparison. If I can do runtime detection of WebAssembly, WebAssembly SIMD, and WebAssembly Threads, I can offer a humongous performance improvement to these libraries as a seamless drop-in replacement. Currently, these libraries perform all of the operations in JS.

With respect to the other comments, I'm mostly short a few key ops: direct conversion from 8-bit integers to 32-bit integers and vice versa. If swizzle can fill that gap, that's a help. I'm at a loss as to why there's no regular 'load' op; generally speaking (and someone correct me if I'm wrong), using replace_lane (pinsr) or extract_lane (pextr) should be the exception, not the norm, when I'm writing assembly code. More often than not, I just want to be able to load and store the low bits of any vector, especially when I'm doing something like a prefix sum.

The last two comments on this are with respect to getting a zeroed vector and shuffling zeros in and out from either side. V8 seems to have trouble guessing that I just want a pxor, and I'd really like to have equivalents of pslldq or psrldq. @tlively I'll update #20, though if pmaddubsw turns out to be portable across ARM and Intel, it'll definitely be the way to go.
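The pslldq/psrldq-style whole-vector byte shifts requested above can be sketched in scalar form (treating the vector as an array of 16 bytes in lane order; the function names are illustrative, and the x86 instructions are defined on the 128-bit value rather than on a lane array, so this is a model, not a proposed spec):

```javascript
// Scalar sketch of whole-vector byte shifts: n bytes are shifted out
// of one end of the vector and zeros are shifted in at the other,
// mirroring what psrldq/pslldq do on a 128-bit register.
function byteShiftDown(v, n) {
  return [...v.slice(n), ...Array(n).fill(0)];
}
function byteShiftUp(v, n) {
  return [...Array(n).fill(0), ...v.slice(0, v.length - n)];
}

console.log(byteShiftDown([1, 2, 3, 4], 1)); // [2, 3, 4, 0]
console.log(byteShiftUp([1, 2, 3, 4], 1));   // [0, 1, 2, 3]
```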
@Maratyszcza I adjusted the code for the 32bits to use swizzle, and now V8 is doing something bizarre.
@omnisip Can you verify that the instructions in the .wasm are what you expect so we can rule out the toolchain as the source of weirdness? |
Yes. They look fine. It's definitely V8. https://www.felixcloutier.com/x86/pshufb It's doing something to compensate for the fact that pshufb only respects the 4 least significant bits. 0x70 is 0b01110000. It's totally ridiculous. It has no material effect on the swizzle, but what's worse is that it does it in each iteration of the loop instead of keeping the same shuffle mask. V8's assembly output makes me want to cry. Why can't it just create a single mask at compilation time? Why does it torture us with inline assembly mask creation???
cc @ngzhian. Also, @omnisip, can you file a v8 bug for this? It should have the "WebAssembly" component and the "Proj-SIMD" label: https://bugs.chromium.org/p/v8/issues/entry . It would be good to move the discussion of particular issues like this off of this thread :) |
Yeah, sorry. And I can't seem to create the issue or assign the tags you're suggesting. I'll leave this for @ngzhian.
I doubt this is a V8 bug; zeroing of the lanes requires extra manipulations. See #93.
@tlively LLVM is re-generating the mask on each iteration of the loop instead of keeping it as a constant. |
@omnisip Can you open a new issue on this repo with an example that demonstrates that problem in C/C++? |
@omnisip V8 doesn't support constant loads from memory, so we regenerate the mask every time (tracking bug https://crbug.com/v8/10980). Can you share a simple benchmark for this, and do you think that this poor codegen will be a bottleneck? (Note that if you are hand-writing intrinsics, there are ways around this, like storing the mask into a v128 local; you will eliminate the ceremony around moving values into a register, moving into xmm, then shuffling just to create the mask.)
@ngzhian Thanks for the informed response. Adding insult to injury, it's recreating the mask in the loop, which is being caused outside of V8 by the LLVM code generation. After getting a better understanding of vpaddusb, I can actually appreciate how this works with pshufb. Does this [mask regeneration] happen the same way with the shuffle operation, or just with swizzle?

I'd like to add a zshuffle such that zshuffle v128_a, v128_b, Imm[] becomes completely interchangeable with swizzle when the first two parameters are the same. The logic for adding zeroing looks pretty straightforward on both ARM and x86. I'd also like to add a runtime variant of zshuffle that supports swizzling across two vectors. With both, we can have a functionally complete version of shuffle and swizzle that covers masked (runtime) and unmasked (compile-time) cases.

With respect to the benchmark, I'm happy to do so. The challenge with this particular situation is that I have to keep adding operations to solve my unsigned 8-bit to 32-bit conversion, and I'm constantly shuffling (or swizzling) already because there's no horizontal add or horizontal multiply-and-add. [This particular use case gets a lot of benefit from pmaddubsw, but I haven't finished the research yet to see how portable this functionality is.] So anything I can do to cut the number of ops is helpful.

One thing I'm interested in, but don't know how practical it is, is to update the standard for our shuffle to match the Intel pshufb implementation rather than worrying about adding and saturating. The reason I suggest this is that it's actually a really flexible implementation. Once you understand what they were doing by looking at the bottom four bits, you can easily scale the same logic to look at the bottom 5 bits for 32-byte/256-bit vectors, the bottom 6 bits for 64-byte/512-bit vectors, and the bottom 7 bits for 128-byte/1024-bit vectors. That's a decent amount of room to grow going forward.
If the ARM chipset doesn't require any extra emulation instructions to see that a value is out of range, or it works in a way compatible with pshufb, we'll have a very flexible solution going forward. The only thing that would need to change is how the user is informed of what happens if the value doesn't have the top bit set: instead of correcting it for them, we just explain that it takes the bottom bits and masks the rest out. We could put together a best practice of telling them to put 0xff in that field instead of 33 or 17 or whatever it may be.
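The pmaddubsw instruction mentioned in these comments can be sketched in scalar form (a simplified model over short arrays rather than the full 16-lane semantics; the function name is illustrative). Each pair of adjacent lanes contributes an unsigned-by-signed product, and the two products are summed with signed 16-bit saturation.

```javascript
// Scalar model of pmaddubsw's per-pair behavior: for each pair of
// adjacent lanes, (unsigned a[2i] * signed b[2i]) +
// (unsigned a[2i+1] * signed b[2i+1]), saturated to a signed i16.
const clamp16 = x => Math.max(-32768, Math.min(32767, x));

function pmaddubsw(a, b) { // a: u8 lanes, b: s8 lanes, equal even length
  const out = [];
  for (let i = 0; i < a.length; i += 2) {
    out.push(clamp16(a[i] * b[i] + a[i + 1] * b[i + 1]));
  }
  return out;
}

console.log(pmaddubsw([10, 20, 255, 255], [3, 4, 127, 127])); // [110, 32767]
```

The second output pair shows the saturation: 255*127 + 255*127 = 64770 clamps to 32767.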
This is super intriguing. What do you mean by this? In my particular case, I'm actually pre-generating the shuffle masks once, but LLVM seems to be inlining them (and not consistently). If I could trick LLVM or V8 into thinking the thing was a string or something I could load off of the stack, I'd be open to that too, but it seems like that's something related to constant generation.
I think filing a bug for @tlively to track this would be useful, maybe somewhere in Emscripten? (This repository has more to do with the specification than with the implementation/toolchain.)
This mask regeneration you're seeing is specific to swizzle, and is specifically used to set up a mask (0x7070707070707070) to do saturating addition to deal with pshufb behavior. (We can probably reduce this down to 2 instructions rather than 3). For shuffle, in general, we have no such mask setup, instead we copy the shuffle immediates into a register, then call pshufb. (https://source.chromium.org/chromium/chromium/src/+/master:v8/src/compiler/backend/x64/code-generator-x64.cc;l=3706;drc=1d85b5f7bd02e59acf890afdf6255fed92b2a701). We have a special check for
I think there is interest in such an instruction; raising a separate issue to gauge interest would be good. (I think this was brought up before, but did not get standardized; perhaps I'm misremembering.)
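The 0x70 saturating-add trick described in the codegen discussion above can be modeled in scalar form (illustrative function names, assuming pshufb's documented behavior: bit 7 of an index byte zeroes the lane, and otherwise only the low 4 bits select a source lane). Saturating-add of 0x70 leaves valid indices 0..15 with bit 7 clear and their low nibble intact, while pushing every index >= 16 up to >= 0x80, so out-of-range lanes come out zero, matching the wasm swizzle semantics.

```javascript
// Scalar model of emulating wasm swizzle with x86 pshufb:
//  - pshufb zeroes a lane when bit 7 of the index byte is set,
//    otherwise it uses only the low 4 bits as the source lane.
//  - saturating-add 0x70 to each index keeps 0..15 valid (bit 7
//    stays clear, low nibble unchanged) and maps >= 16 to >= 0x80.
const satAdd = (x, y) => Math.min(x + y, 255);
const pshufbLane = (a, m) => (m & 0x80) ? 0 : a[m & 0x0f];

function swizzleViaPshufb(a, indices) {
  return indices.map(i => pshufbLane(a, satAdd(i, 0x70)));
}

const v = Array.from({ length: 16 }, (_, i) => i + 1); // lanes 1..16
console.log(swizzleViaPshufb(v, [0, 15, 16, 200])); // [1, 16, 0, 0]
```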
Maybe I don't understand your example correctly, say you are using the mask in a hot loop, you can do something like
which should reduce this generated code to just once, rather than every loop:
Interesting idea. We use
Yeah. That's already been filed, and when it's resolved, it would be equivalent to the v128 local solution you proposed in your second comment, making this a bit more moot.
Really, this is what I want. If I could use your shuffle with the same vector and get it to zero the lanes, I would be happy. For grins, is it possible for me to get 0xff all the way through to V8 through WAsm, or will it get filtered out and checked before then?
Yes. I'm following the other ticket #20 and will put together a PR for this when I have a spare moment. |
No promises here, unfortunately; we try to optimize, but get it wrong sometimes, since LLVM does some matching, and V8 does some matching. There have been discussions at length (e.g. #196) about such problems. I am not sure what the current state of that is. A possible thing you can do is:
You're welcome. If this is viable and acceptable, it would be great.
Since we made an initial decision on this, I will close this issue; we can discuss further steps on #389.
It's been a while since we introduced the criteria in #203 for evaluating new proposed instructions. Since then, we have made significant progress on the spec tests, text, and interpreter, to the point where finalizing the instruction set will soon be the main phase 4 blocker.
We should continue evaluating currently proposed instructions using the previous criteria, and we should continue to be open to bug fixes and improvements to our existing instructions, but I propose that effective immediately (upon stakeholder agreement) we start labeling all new proposed instructions as post-MVP. Does that sound reasonable to everyone?