
Finalizing the instruction set #343

Closed

tlively opened this issue Sep 10, 2020 · 42 comments

Comments

@tlively
Member

tlively commented Sep 10, 2020

It's been a while since we introduced the criteria in #203 for evaluating new proposed instructions. Since then, we have made significant progress on the spec tests, text, and interpreter, to the point where finalizing the instruction set will soon be the main phase 4 blocker.

We should continue evaluating currently proposed instructions using the previous criteria and we should continue to be open to bug fixes and improvements to our existing instructions, but I propose that effective immediately (upon stakeholder agreement) we start labeling all new proposed instructions as post-mvp. Does that sound reasonable to everyone?

@lars-t-hansen
Contributor

I think this is a good idea, but as part of this I think we should also come up with and make public an overall strategy for how incremental extensions will be made to the (fixed-width) instruction set. Having such a strategy will make it easier to draw the line for the MVP.

@tlively
Member Author

tlively commented Sep 11, 2020

@lars-t-hansen, that's a good idea. My assumption was that we would lean heavily on the normal proposal process for follow-on extensions, but the more we sketch out a framework now for how we will evaluate and group post-MVP instructions, the more streamlined those follow-on proposals will be. Did you have specific thoughts about what our strategy should be?

I think that one of the things that has worked out really well in this proposal has been having clear technical guidelines (e.g. portability, determinism) to focus development and prevent feature creep. Follow-on extensions to this proposal (i.e. not radically different proposals like scalable vectors or machine-specific instructions) will be easier to scope if they each focus on doing one thing well, either by focusing on a particular use case (e.g. neural networks) or a single instruction family (e.g. fma).

@lars-t-hansen
Contributor

@tlively, I think grouping instructions thematically is going to be very helpful and I support that. We need to be careful about how we define the groups so that we don't get into another (bulk memory, reference types, threads) quagmire, but maybe that's not too hard with these low-level instructions.

Part of our SIMD strategy will likely have to do with dealing with machine-specific corner cases, though. IIRC both arm64 and x64 support 8-bit operations but differ in some cases. Also IIRC, the QFMA class of instructions had similar issues. These are desirable instructions (and instruction groups), and we should probably be talking about principles for handling cross-architecture incompatibilities (beyond NaN cases) so that we can ship this type of functionality.

@binji
Member

binji commented Sep 14, 2020

These are desirable instructions (and instruction groups), and we should probably be talking about principles for handling cross-architecture incompatibilities (beyond NaN cases) so that we can ship this type of functionality.

Right, I think we need to start talking again about conditional compilation for these sorts of features -- either as a generic feature (https://github.com/webassembly/conditional-sections) or as something specific to SIMD.

@Maratyszcza
Contributor

I think we need to set a date for finalizing the instruction set, but this date should be in the future, not imminent. Right now the SIMD specification is missing whole classes of instructions (e.g. #348), and finalizing it this early would make it unusable for a range of use cases.

IMO a good criterion for finalizing the ISA would be to check that all instructions from ARM NEON and x86 SSE4.1 either:

  1. Have a close equivalent in WAsm SIMD, or
  2. Can't be specified in a portable and efficient way across architectures

@penzn
Contributor

penzn commented Sep 18, 2020

IMO a good criterion for finalizing the ISA would be to check that all instructions from ARM NEON and x86 SSE4.1

Adding equivalents of "missing" hardware instructions does not guarantee that users will be able to leverage them. In particular, I'd like to avoid a situation where we add a number of instructions that all map to existing hardware ops, but the overall speedup vs. scalar is still around 1x. There are also other desired properties besides mapping to hardware instructions - code size is probably the most important criterion.

If we are serious about evaluating whether or not we can add any more instructions, we need a set of actual benchmarks (not microbenchmarks) linked from this repo that we can test for both speed and size of generated binaries. If we can't come up with something like this, I'd rather freeze the set against adding new operations (fixing already-added ops is OK).

@tlively
Member Author

tlively commented Sep 19, 2020

Getting full coverage of portable native instructions is an explicit non-goal of this proposal. That's why we've always had criteria demanding demonstrated speedups in real-world use cases. @Maratyszcza, how far in the future would you want the cutoff date to be?

@Maratyszcza
Contributor

IMO, we need to hold until at least the end of the year, to make sure the SIMD specification works for multiple classes of applications.

I'm concerned that the SIMD specification might be too closely tied to the small number of applications ported so far and might be missing important instructions needed for other applications. Checking the instructions present in SSE4.1 and NEON is a good exercise in looking for missing instructions, as the designers of native SIMD instruction sets already had to make similar decisions about which instructions to include.

@binji
Member

binji commented Sep 21, 2020

IMO, we need to hold until at least the end of the year, to make sure the SIMD specification works for multiple classes of applications.

Are there specific teams that we're waiting for? I'm curious why an additional 3 months is necessary now, since SIMD has been available experimentally for a year+ now.

Checking instructions present in SSE4.1 and NEON is a good exercise in looking for missing instructions, as the designers of native SIMD instructions sets already had to make similar decisions on which instructions to include.

Agreed, but there's no requirement that these instructions go into the SIMD MVP. We already know that there are instructions that will never be included in the SIMD MVP (e.g. anything with non-determinism), so I think it's reasonable to hold off on other instructions too, knowing that they can be bucketed into the next round of instructions.

@Maratyszcza
Contributor

Are there specific teams that we're waiting for? I'm curious why an additional 3 months is necessary now, since SIMD has been available experimentally for a year+ now.

The Origin Trial has been active since the July 14, 2020 release of Chrome 84, just over two months ago. AFAICT, many teams were reluctant to port code to WebAssembly SIMD before the OT, as there wasn't any path to production. Some types of applications we haven't seen yet are:

  • Scientific computing applications, in particular using double-precision and mixed-precision (missing whole sets of instructions, see Double-precision SIMD conversions #348)
  • Hashing and cryptography (e.g. missing carry-less multiplication, unclear how well WAsm SIMD works for bitslicing algorithms)
  • Algorithms using Hamming weight (unclear if the existing dynamic shuffle instruction is sufficient, or whether we need dedicated population count instructions; see the sketch after this list)
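
(For illustration only, not part of any proposal: a minimal sketch of the nibble-lookup per-byte popcount that the existing dynamic shuffle already enables, written against the wasm_simd128.h intrinsic spellings of the time, e.g. wasm_v8x16_swizzle.)

#include <wasm_simd128.h>

// Per-byte population count via a 16-entry lookup table and two swizzles
// (the classic pshufb popcount trick, expressed with wasm intrinsics).
static inline v128_t popcount_u8(v128_t v) {
  const v128_t lut  = wasm_i8x16_make(0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4);
  const v128_t mask = wasm_i8x16_splat(0x0F);
  v128_t lo = wasm_v128_and(v, mask);                    // low nibble of each byte
  v128_t hi = wasm_v128_and(wasm_u8x16_shr(v, 4), mask); // high nibble of each byte
  return wasm_i8x16_add(wasm_v8x16_swizzle(lut, lo),
                        wasm_v8x16_swizzle(lut, hi));
}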

Besides, I have several instruction proposals in review (#124, #127, #237, #247, #290, #350, #352), and I'd like to see them accepted/rejected based on performance evaluation. I think three months would be enough to finish it.

Agreed, but there's no requirement that these instructions go into the SIMD MVP. We already know that there are instructions that will never be included in the SIMD MVP (e.g. anything with non-determinism), so I think it's reasonable to hold off on other instructions too, knowing that they can be bucketed into the next round of instructions.

WebAssembly doesn't have any mechanism for runtime feature detection, and thus software must treat different WAsm extensions as separate architectures (e.g. TensorFlow.js has to recompile 3 times: for WAsm MVP, WAsm SIMD, and WAsm SIMD+Threads). Supporting both WAsm SIMD and WAsm SIMD2 is not like supporting SSE2 and SSE4; it is like supporting both x86 and ARM. Because of the resulting deployment pains, I want the WAsm SIMD MVP to be sufficiently feature-complete that most applications can target it.

@binji
Member

binji commented Sep 21, 2020

Some types of applications we haven't seen yet are ...

Thanks, this is helpful. I'm curious if we know whether anyone is actively working on porting these applications, however. I wouldn't want to wait 3 months and be in the same place we are now.

Besides, I have several instruction proposals in review (#124, #127, #237, #247, #290, #350, #352), and I'd like to see them accepted/rejected based on performance evaluation. I think three months would be enough to finish it.

Some of these are older proposals, and I agree they should be resolved. Some of them are much newer, however. I'm a little concerned that we will keep finding another few instructions to add and not be able to move forward with the MVP.

WebAssembly doesn't have any mechanism for runtime feature detection, and thus software must treat different WAsm extensions as separate architectures

Agreed, this is an issue @lars-t-hansen brought up above. I think we should seriously consider adding an explicit runtime SIMD feature detection instruction for this purpose, so we don't have to increase the size of the support matrix you mention.

@tlively
Member Author

tlively commented Sep 22, 2020

I think we should seriously consider adding an explicit runtime SIMD feature detection instruction for this purpose, so we don't have to increase the size of the support matrix you mention.

See #356!

@jan-wassenberg

Good that Marat mentions crypto - AES would probably be helpful, it's very widely used.

PPC and x86 have straightforward instructions, but unfortunately ARM splits the operation differently: instead of ShiftRows, SubBytes, MixColumns, XOR (x86 AESENC), it has ShiftRows+SubBytes+XOR (AESE) and MixColumns (AESMC), and AESE and AESMC should be consecutive to enable fusing. Anyone interested in defining a way to bridge that gap?
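
(For reference, a minimal sketch in plain C of the standard way to bridge that split with native intrinsics; this is illustration only, not a wasm proposal, and it assumes the usual <wmmintrin.h>/_mm_aesenc_si128 and <arm_neon.h>/vaeseq_u8/vaesmcq_u8 intrinsics.)

#include <stdint.h>

#if defined(__AES__)                    /* x86 AES-NI */
#include <wmmintrin.h>
typedef __m128i aes_block_t;
static inline aes_block_t aes_round(aes_block_t state, aes_block_t round_key) {
  /* ShiftRows, SubBytes, MixColumns, then XOR with the round key. */
  return _mm_aesenc_si128(state, round_key);
}
#elif defined(__ARM_FEATURE_CRYPTO)     /* ARMv8 AESE/AESMC */
#include <arm_neon.h>
typedef uint8x16_t aes_block_t;
static inline aes_block_t aes_round(aes_block_t state, aes_block_t round_key) {
  /* AESE(state, 0) = SubBytes(ShiftRows(state)); AESMC = MixColumns.
     XORing the round key afterwards matches the x86 ordering, and keeping
     AESE/AESMC adjacent preserves the fusion opportunity. */
  return veorq_u8(vaesmcq_u8(vaeseq_u8(state, vdupq_n_u8(0))), round_key);
}
#endif

In practice the trailing XOR is usually folded into the next AESE by reordering the key schedule, and that reordering is exactly the kind of mismatch a portable wasm instruction would have to paper over.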

@Maratyszcza
Contributor

Cryptographic primitives are already exposed through the Web Cryptography API, so I see little value in having AES instructions in the spec: they would be nice for a parallel RNG, but given that multithreaded WAsm applications are uncommon, they wouldn't see much usage. I was thinking more about carry-less multiplication, which has use cases in hashing beyond cryptography.

@jan-wassenberg

CLMUL is indeed interesting, +1 for that. AES is useful beyond standard crypto, though - RNG as you say (I wrote one), and some other hashes also use it (Meow hash).

@omnisip

omnisip commented Oct 4, 2020

Hi there,

I would love to add a set of sample applications or test suites that would demonstrate whether or not we have all of the needed instructions -- MVP or not.

For the last few weeks I've been diligently testing the SIMD suite with a very basic but important computation: RGB-to-YUV conversion.

In this process I've noticed that we're missing key features that exist in every modern chipset. For instance, a swizzle that can zero lanes in a single op, or a horizontal add operation (#20).

If these don't make it into the MVP, we definitely need a way to detect them, add them, and possibly polyfill them in the future. In addition, there are weird gotchas, like V8 deciding to treat every access as unaligned. It's super sad when we're putting in the work to make sure we get efficient performance improvements for our JavaScript applications. Where do these fit in the standardization pipeline, and how do we encourage everyone to keep implementing new features?

@tlively
Member Author

tlively commented Oct 5, 2020

Thanks for testing! We can reconsider the horizontal add operation in light of other recent additions, since I don't think we've discussed it recently. Could you add a comment to #20 about your use case?

If these don't make it into the mvp, we definitely need a way to detect them, add them, and possibly polyfill in the future.

We are actively looking at designing a mechanism to make this easy. See #356.

V8 deciding to treat everything as unaligned accesses.

My guess is that changing this would be a significant amount of effort for V8, but @ngzhian would have more details. It would be very helpful to understand the performance improvements we might get by putting in this effort, especially since I've heard that alignment doesn't matter so much on recent architectures. Would you be able to do an experiment to figure out what the difference is?

how do we encourage everyone to keep implementing new features?

Concrete performance data showing the benefit of implementing new features or the benefit of improving implementations is the best way to motivate this work.

@jan-wassenberg

especially since I've heard that alignment doesn't matter so much on recent architectures

It is a bit more subtle. On HSW+, if the code is limited by L1 read ports (2 per cycle), unaligned loads halve the speed (only one can be done per cycle). Otherwise, there is indeed no impact on throughput or latency.

@omnisip

omnisip commented Oct 5, 2020

especially since I've heard that alignment doesn't matter so much on recent architectures

It is a bit more subtle. On HSW+, if the code is limited by L1 read ports (2 per cycle), unaligned loads halve the speed (only one can be done per cycle). Otherwise, there is indeed no impact on throughput or latency.

I'm surprised about that, but uops.info seems to agree.

@Maratyszcza
Contributor

a swizzle that can zero lanes

wasm_v8x16_swizzle zeroes lanes if index is out-of-bounds (>=16)

@omnisip

omnisip commented Oct 5, 2020

a swizzle that can zero lanes

wasm_v8x16_swizzle zeroes lanes if index is out-of-bounds (>=16)

That's news to me since the documentation says if the index is out of bounds, it selects the 0th lane -- which is definitely what happens when working with shuffle.

@Maratyszcza
Contributor

That's news to me since the documentation says if the index is out of bounds, it selects the 0th lane -- which is definitely what happens when working with shuffle.

The documentation says "For indices outside of the range the resulting lane is 0." Isn't this what you want? Is the wording confusing?

@omnisip

omnisip commented Oct 5, 2020

That's news to me since the documentation says if the index is out of bounds, it selects the 0th lane -- which is definitely what happens when working with shuffle.

The documentation says "For indices outside of the range the resulting lane is 0." Isn't this what you want? Is the wording confusing?

Yeah, the wording is confusing. That's definitely what I want. It should say something like "the lane's value will become zero."

@omnisip

omnisip commented Oct 5, 2020

Just to summarize what I'm doing here -- and to respond to @tlively's comments.

I'm using WebAssembly SIMD to augment existing JavaScript libraries, ssim and jest-image-snapshot, of which I'm an active author and contributor. In a nutshell, jest-image-snapshot does automated testing of webpages by comparing screenshots, and ssim.js is a library for doing image comparison. If I can do runtime detection of WebAssembly, WebAssembly SIMD, and WebAssembly Threads, I can offer a humongous performance improvement to these libraries as a seamless drop-in replacement. Currently, these libraries perform all of their operations in JS.

With respect to the other comments, I'm mostly short a few key ops -- direct conversion from 8-bit integers to 32-bit integers and vice versa. If swizzle can fill that gap, that's a help. I'm also at a loss as to why there's no regular 'load' op. Generally speaking -- and someone correct me if I'm wrong -- using replace_lane (pinsr) or extract_lane (pextr) should be the exception, not the norm, when I'm writing assembly code. More often than not, I just want to be able to load and store the low bits of any vector, especially when I'm doing something like a prefix sum. The last two comments on this are with respect to getting a zeroed vector and shuffling zeros in and out from either side: V8 seems to have trouble guessing that I just want a pxor, and I'd really like to have equivalents of pslldq or psrldq.

@tlively I'll update #20, though if pmaddubsw turns out to be portable across ARM and Intel, it'll definitely be the way to go.
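
(Regarding the pslldq/psrldq equivalents above: for illustration, a minimal sketch, assuming the wasm_simd128.h spellings of the time, wasm_i8x16_splat and wasm_v8x16_shuffle, of a "shift right by 4 bytes, shifting in zeros" built from the existing shuffle plus a zeroed second operand.)

#include <wasm_simd128.h>

/* psrldq-by-4 equivalent: result byte i = input byte i+4 for i < 12, else 0.
   Shuffle indices 0-15 select from `v`, indices 16-31 select from `zero`. */
static inline v128_t bytes_shr_4(v128_t v) {
  const v128_t zero = wasm_i8x16_splat(0);
  return wasm_v8x16_shuffle(v, zero,
                            4, 5, 6, 7, 8, 9, 10, 11,
                            12, 13, 14, 15, 16, 17, 18, 19);
}

Whether an engine pattern-matches this to a single psrldq is a separate question; the shuffle special-casing discussed later in this thread is the relevant place for that.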

@omnisip

omnisip commented Oct 5, 2020

@Maratyszcza I adjusted the code for the 32-bit conversion to use swizzle, and now V8 is doing something bizarre.

0x94a9e81d4e0    c0  48b800ffffff01ffffff REX.W movq rax,0xffffff01ffffff00
0x94a9e81d4ea    ca  c4e1f96ec8     vmovq xmm1,rax
0x94a9e81d4ef    cf  c5fb12c9       vmovddup xmm1,xmm1
0x94a9e81d4f3    d3  48b802ffffff03ffffff REX.W movq rax,0xffffff03ffffff02
0x94a9e81d4fd    dd  c4e3f122c801   vpinsrq xmm1,xmm1,rax,0x1
// as convoluted as this is, this should be the proper shuffle mask for me to get 32-bit integers out of the first 4 bytes.
...
// Now down here... where did this add operation come from?
// Why isn't it just using the mask it created to extract the values?
0x94a9e81d653   233  41ba70707070   movl r10,0x70707070
0x94a9e81d659   239  c441796ee2     vmovd xmm12,r10
0x94a9e81d65e   23e  c4417970e400   vpshufd xmm12,xmm12,0x0
0x94a9e81d664   244  c519dce1       vpaddusb xmm12,xmm12,xmm1
0x94a9e81d668   248  c4422100dc     vpshufb xmm11,xmm11,xmm12

@tlively
Member Author

tlively commented Oct 5, 2020

@omnisip Can you verify that the instructions in the .wasm are what you expect so we can rule out the toolchain as the source of weirdness?

@omnisip

omnisip commented Oct 5, 2020

Yes. They look fine. It's definitely V8. https://www.felixcloutier.com/x86/pshufb

It's doing something to compensate for the fact that pshufb only respects the 4 least significant bits. 0x70 is 0b01110000. It's totally ridiculous.

It has no material effect on the swizzle, but what's worse is that it does this in each iteration of the loop instead of reusing the same shuffle mask. V8's assembly output makes me want to cry. Why can't it just create a single mask at compile time? Why does it torture us with inline mask creation?

@tlively
Member Author

tlively commented Oct 5, 2020

cc @ngzhian. Also, @omnisip, can you file a v8 bug for this? It should have the "WebAssembly" component and the "Proj-SIMD" label: https://bugs.chromium.org/p/v8/issues/entry . It would be good to move the discussion of particular issues like this off of this thread :)

@omnisip

omnisip commented Oct 5, 2020

Yeah, sorry. I can't seem to create the issue or assign the tags you're suggesting, so I'll leave this for @ngzhian.

@penzn
Contributor

penzn commented Oct 5, 2020

I doubt this is a V8 bug; zeroing the lanes requires extra manipulation. See #93.

@omnisip

omnisip commented Oct 5, 2020

@tlively LLVM is re-generating the mask on each iteration of the loop instead of keeping it as a constant.

@tlively
Member Author

tlively commented Oct 5, 2020

@omnisip Can you open a new issue on this repo with an example that demonstrates that problem in C/C++?

@ngzhian
Member

ngzhian commented Oct 6, 2020

@omnisip V8 doesn't support constant loads from memory, so we regenerate the mask every time (tracking bug https://crbug.com/v8/10980).

As explained, the vpaddusb is necessary due to how pshufb works: we need to set the top bit of the shuffle mask to zero the destination lane. In general we have to do this all the time, but I think it makes you cry because you clearly used a constant mask, and yet V8 doesn't seem to know that. I think there are some wins we can get from pattern matching on swizzle(v128, v128.const) to get better codegen. (Filed https://crbug.com/v8/10992)

Can you share a simple benchmark for this, and do you think this poor codegen will be a bottleneck? (Note that if you are hand-writing intrinsics, there are ways around this, like storing the mask in a v128 local; you will eliminate the ceremony of moving values into a register, moving into an xmm register, and then shuffling just to create the mask.)

@omnisip

omnisip commented Oct 6, 2020

@ngzhian Thanks for the informed response. What adds insult to injury is that it's recreating the mask in the loop -- which is being caused outside of V8 by the LLVM code generation. After getting a better understanding of vpaddusb, I can actually appreciate how this works with pshufb.

Does this [mask regeneration] happen the same way with the shuffle operation or just with swizzle? I'd like to add a zshuffle such that zshuffle v128_a, v128_b, Imm[] becomes completely interchangeable with swizzle when the first two parameters are the same. The logic for adding zeroing looks pretty straightforward on both ARM and x86. I'd also like to add a runtime variant of zshuffle that supports swizzling across two vectors. With both, we would have a functionally complete version of shuffle and swizzle that covers masked (runtime) and unmasked (compile-time) cases.

With respect to the benchmark, I'm happy to do so. The challenge in this particular situation is that I have to keep adding operations to solve my unsigned 8-bit to 32-bit conversion, and I'm constantly shuffling (or swizzling) already because there's no horizontal add or horizontal multiply and add. [This particular use case gets a lot of benefit from pmaddubsw, but I haven't finished the research yet to see how portable that functionality is.] So anything I can do to cut the number of ops is helpful.

One thing I'm interested in, but don't know how practical it is, is to update the standard for our shuffle to match the Intel pshufb implementation rather than worrying about adding and saturating. The reason I suggest this is that it's actually a really flexible implementation. Once you understand that it looks at the bottom four bits, you can easily scale the same logic to the bottom 5 bits for 32-byte/256-bit vectors, the bottom 6 bits for 64-byte/512-bit vectors, and the bottom 7 bits for 128-byte/1024-bit vectors. That's a decent amount of room to grow going forward. If the ARM chipset doesn't require any extra emulation instructions to detect that a value is out of range, or if it works in a way compatible with pshufb, we'll have a very flexible solution going forward. The only thing that would need to change is how the user is informed about what happens: instead of correcting out-of-range indices for them, we just explain that the instruction takes the bottom bits and masks the rest out. We could put together a best practice of telling them to put 0xff in that field instead of 33 or 17 or whatever it may be.

Edits: Minor clarifications and antecedents.
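
(For comparison, a scalar reference model, illustrative only and not spec text, of the two per-lane selection rules being discussed.)

#include <stdint.h>

/* Current wasm swizzle rule: any index >= 16 produces 0. */
static uint8_t swizzle_lane(const uint8_t src[16], uint8_t idx) {
  return idx < 16 ? src[idx] : 0;
}

/* x86 pshufb rule: a set top bit produces 0; otherwise only the low 4 bits select. */
static uint8_t pshufb_lane(const uint8_t src[16], uint8_t idx) {
  return (idx & 0x80) ? 0 : src[idx & 0x0F];
}

The two rules agree for indices 0-15 and 128-255; they differ only for 16-127, where the current swizzle zeroes the lane but pshufb selects src[idx & 15]. That gap is exactly what the saturating-add fixup papers over.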

@omnisip

omnisip commented Oct 6, 2020

(Note that if you are hand-writing intrinsics, there are ways around this, like storing the mask in a v128 local; you will eliminate the ceremony of moving values into a register, moving into an xmm register, and then shuffling just to create the mask.)

This is super intriguing. What do you mean by this? In my particular case, I'm actually pre-generating the shuffle masks once, but LLVM seems to be inlining them (and not consistently). If I could trick LLVM or V8 into treating the mask as something like a string that I could load off the stack, I'd be open to that too, but it seems like that's related to constant generation.

@ngzhian
Member

ngzhian commented Oct 6, 2020

which is being caused outside of V8 by the LLVM code generation.

I think filing a bug for @tlively to track this would be useful. Maybe somewhere in Emscripten? (This repository has more to do with the specification than with the implementation/toolchain.)

Does this [mask regeneration] happen the same way with the shuffle operation or just with swizzle?

This mask regeneration you're seeing is specific to swizzle, and is specifically used to set up a mask (0x7070707070707070) to do saturating addition to deal with pshufb behavior. Adding 0x70 with unsigned saturation leaves in-range indices (0-15) with their low 4 bits intact and the top bit clear, while any index >= 16 ends up at 0x80 or above (saturating at 0xFF), which pshufb then treats as "zero this lane". (We can probably reduce this down to 2 instructions rather than 3.)

For shuffle, in general, we have no such mask setup, instead we copy the shuffle immediates into a register, then call pshufb. (https://source.chromium.org/chromium/chromium/src/+/master:v8/src/compiler/backend/x64/code-generator-x64.cc;l=3706;drc=1d85b5f7bd02e59acf890afdf6255fed92b2a701).

We have a special check for shuffle(x, x, imms), where the inputs are the same (https://source.chromium.org/chromium/chromium/src/+/master:v8/src/wasm/simd-shuffle.h;l=22?q=f:simd-shuffle&ss=chromium%2Fchromium%2Fsrc:v8%2F), and can make use of this to generate better code.

because there's no horizontal add or horizontal multiply and add

I think there is interest in such an instruction, raising a separate issue for this to gauge interest will be good. (I think this was brought up before, but did not get standardized, perhaps I'm misremembering.)

@ngzhian
Member

ngzhian commented Oct 6, 2020

This is super intriguing. What do you mean by this? In my particular case, I'm actually pre-generating the shuffle masks once

Maybe I don't understand your example correctly. Say you are using the mask in a hot loop; you can do something like

;; assuming v128 locals $input and $mask
v128.const i8x16 0x00 0xff 0xff 0xff 0x01 0xff 0xff 0xff 0x02 0xff 0xff 0xff 0x03 0xff 0xff 0xff
local.set $mask
loop
  local.get $input
  local.get $mask
  i8x16.swizzle        ;; both operands are taken from the stack
  ;; ... do other stuff ...
end

which should reduce this generated mask-setup code to a single occurrence, rather than one per loop iteration:

0x94a9e81d4e0    c0  48b800ffffff01ffffff REX.W movq rax,0xffffff01ffffff00
0x94a9e81d4ea    ca  c4e1f96ec8     vmovq xmm1,rax
0x94a9e81d4ef    cf  c5fb12c9       vmovddup xmm1,xmm1
0x94a9e81d4f3    d3  48b802ffffff03ffffff REX.W movq rax,0xffffff03ffffff02
0x94a9e81d4fd    dd  c4e3f122c801   vpinsrq xmm1,xmm1,rax,0x1
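
(The same hoisting expressed with C intrinsics, as a sketch assuming the wasm_simd128.h names wasm_i8x16_make, wasm_v8x16_swizzle, wasm_v128_load, and wasm_v128_store; note that the LLVM rematerialization issue discussed above may still push the constant back into the loop.)

#include <stddef.h>
#include <stdint.h>
#include <wasm_simd128.h>

/* Widen groups of four u8s to four u32s. The swizzle mask (the -1 lanes are
   out-of-range indices and thus produce zero bytes) is created once outside
   the loop, mirroring the wat sketch above. Each load reads 16 bytes, so
   `src` needs 12 bytes of slack after the last group. */
void widen_u8_to_u32(const uint8_t *src, uint32_t *dst, size_t groups) {
  const v128_t mask = wasm_i8x16_make(0, -1, -1, -1, 1, -1, -1, -1,
                                      2, -1, -1, -1, 3, -1, -1, -1);
  for (size_t i = 0; i < groups; ++i) {
    v128_t bytes = wasm_v128_load(src + 4 * i);
    wasm_v128_store(dst + 4 * i, wasm_v8x16_swizzle(bytes, mask));
  }
}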

@ngzhian
Member

ngzhian commented Oct 6, 2020

One thing I'm interested in, but don't know how practical it is, is to update the standard for our shuffle to match the Intel pshufb implementation rather than worrying about adding and saturating.

Interesting idea. We use TBL for the ARM implementation; the logic there is that any out-of-range index results in 0. If we opt for the pshufb style, where a set top bit zeroes a lane, the ARM implementation wouldn't have to change, since top bit set = out of range for an unsigned index. Note that this would be for swizzle only. For shuffle, the lane indices are immediates, which are validated to be in range, so the implementation shouldn't need to do any unnecessary check. (Which is something V8 may be doing; I will have to double-check, so thanks for bringing this up.)

@omnisip

omnisip commented Oct 6, 2020

which is being caused outside of V8 by the LLVM code generation.

I think filing a bug for @tlively to track this would be useful. Maybe somewhere in Emscripten? (This repository has more to do with the specification than with the implementation/toolchain.)

Yeah, that's already been filed, and when it's resolved it will be equivalent to the v128 local solution you proposed in your second comment -- making this a bit moot.

Does this [mask regeneration] happen the same way with the shuffle operation or just with swizzle?

This mask regeneration you're seeing is specific to swizzle, and is specifically used to set up a mask (0x7070707070707070) to do saturating addition to deal with pshufb behavior. (We can probably reduce this down to 2 instructions rather than 3).

For shuffle, in general, we have no such mask setup, instead we copy the shuffle immediates into a register, then call pshufb. (https://source.chromium.org/chromium/chromium/src/+/master:v8/src/compiler/backend/x64/code-generator-x64.cc;l=3706;drc=1d85b5f7bd02e59acf890afdf6255fed92b2a701).

We have a special check for shuffle(x, x, imms), where the inputs are the same (https://source.chromium.org/chromium/chromium/src/+/master:v8/src/wasm/simd-shuffle.h;l=22?q=f:simd-shuffle&ss=chromium%2Fchromium%2Fsrc:v8%2F), and can make use of this to generate better code.

Really, this is what I want. If I could use your shuffle with the same vector and get it to zero the lanes, I would be happy. For grins, is it possible for me to get 0xff all the way through to V8 in the WASM, or will it get filtered out and checked before then?

because there's no horizontal add or horizontal multiply and add

I think there is interest in such an instruction, raising a separate issue for this to gauge interest will be good. (I think this was brought up before, but did not get standardized, perhaps I'm misremembering.)

Yes. I'm following the other ticket #20 and will put together a PR for this when I have a spare moment.

@ngzhian
Member

ngzhian commented Oct 6, 2020

Really, this is what I want. If I could use your shuffle with the same vector and get it to zero the lanes, I would be happy.

No promises here, unfortunately; we try to optimize but get it wrong sometimes, since LLVM does some matching and V8 does some matching. There have been discussions at length (e.g. #196) about such problems. I am not sure what the current state of that is. Possible things you can do:

  1. have a repro
  2. see where the masks get messed up...
    a. if the masks in the .wasm are different from the source mask you use, file a bug on Emscripten
    b. if the mask you use in the source stays the same as the masks in the .wasm, and things run slower than you expected, file a bug on V8.

@omnisip

omnisip commented Oct 6, 2020

One thing I'm interested in, but don't know how practical it is, is to update the standard for our shuffle to match the Intel pshufb implementation rather than worrying about adding and saturating.

Interesting idea. We use TBL for the ARM implementation; the logic there is that any out-of-range index results in 0. If we opt for the pshufb style, where a set top bit zeroes a lane, the ARM implementation wouldn't have to change, since top bit set = out of range for an unsigned index. Note that this would be for swizzle only. For shuffle, the lane indices are immediates, which are validated to be in range, so the implementation shouldn't need to do any unnecessary check. (Which is something V8 may be doing; I will have to double-check, so thanks for bringing this up.)

You're welcome. If this is viable and acceptable, it would be great.

@tlively
Member Author

tlively commented Oct 30, 2020

Since we made an initial decision on this, I will close this issue; we can discuss further steps in #389.

@tlively tlively closed this as completed Oct 30, 2020