Byte shuffle / table lookup operations #24
I would agree with the capability to do shuffles. pshufb is very useful in ARGB and YUV conversions as well as matrix manipulation. If I understand the proposal, the table data and index is your mask? Could the first two suggestions be made into one API? I guess there isn't a precedent for memory or register operands. Perms are nice but can be expensive; they're good for reverse lookups, so again I agree this is worth having. In my case I believe I used this for a Hamming code - I'd have to go back and check, but that's a use case. I probably wouldn't call it a "lookup" in the API because there are other uses, but I like the instruction selection.
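As a concrete illustration of the kind of channel swizzle pshufb enables (a minimal sketch, not from this comment - the function name and the ARGB-to-BGRA choice are just examples):

```c
#include <tmmintrin.h>  // SSSE3, for _mm_shuffle_epi8 (PSHUFB)

// Reverse the byte order of each 4-byte pixel (e.g. ARGB -> BGRA)
// for four pixels at once with a single constant-index PSHUFB.
__m128i swap_channels(__m128i pixels) {
    const __m128i swap = _mm_setr_epi8(3, 2, 1, 0, 7, 6, 5, 4,
                                       11, 10, 9, 8, 15, 14, 13, 12);
    return _mm_shuffle_epi8(pixels, swap);
}
```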
Hi,
FWIW something along the lines of i8x16.byteshuffle is a feature I'd love to see in wasm. As mentioned in the OP, it is supported by SSSE3 on the x86 side via pshufb, and by NEON on the ARM side via vtbl. My library, https://github.com/zeux/meshoptimizer, provides a fast decoder for index & vertex data. It's currently ~5x slower with wasm (when compiled with Emscripten) compared to the native C++ version, and most of this difference seems to be due to the lack of SIMD, see zeux/meshoptimizer@e9120f0. This decoder requires a variable byte permute - it's not really feasible to emulate this since the emulation scalarizes most of the work required to decode. In general, pshufb is critical for any sort of accelerated decompression.
In the linked PR, @tlively and @dtig requested more precise benchmark numbers that are more specific to byte shuffles. To that end, I've implemented an SSE2 fallback for the decoder (see zeux/meshoptimizer@bbeafeb) - the only instruction the decoder used that wasn't in SSE2 was pshufb. I'd expect this version to map reasonably well to wasm without a dynamic shuffle. The performance numbers are as follows - they are all measured with gcc 7.3, x64, i7-8700K:
You can reproduce these numbers by checking out https://github.com/zeux/meshoptimizer/tree/wasm-simd and running something like
(buddha.obj is a standard Buddha model that you can download from, for example, https://casual-effects.com/data/) The difference between "vertex" and "vertex oct" is that "vertex" compresses an array of 16-byte vertex structures whereas "vertex oct" compresses an array of 12-byte vertex structures. It's the same code, but the data is different and it affects the use of pshufb/emulated pshufb, so I'm including both for completeness. You can see that the scalar version is ~3x slower than the AVX version. The AVX version is a tiny bit faster than the SSSE3 version - I believe this is just due to better register allocation from the use of 3-operand instructions; I'm including it because I'd expect browsers to use 3-operand instructions. SSSE3 is the baseline version with pshufb, and SSE2 replaces pshufb with an emulated version. Happy to answer any questions about the workload or to do more testing.
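For reference, a hedged sketch of what an SSE2 fallback for pshufb can look like (the actual code in the linked commit may differ) - the shuffle goes through memory and is effectively scalarized, which is the cost being measured above:

```c
#include <emmintrin.h>  // SSE2
#include <stdint.h>

// Emulate PSHUFB semantics without SSSE3: each result byte is
// table[idx & 15], or 0 when the high bit of the index is set.
static __m128i pshufb_sse2(__m128i table, __m128i idx) {
    uint8_t t[16], i[16], r[16];
    _mm_storeu_si128((__m128i*)t, table);
    _mm_storeu_si128((__m128i*)i, idx);
    for (int k = 0; k < 16; ++k)
        r[k] = (i[k] & 0x80) ? 0 : t[i[k] & 15];
    return _mm_loadu_si128((__m128i*)r);
}
```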
@zeux It looks like your use could be mapped to something like the AVX-512 VPEXPANDB instruction. Once you do a shuffle with a mask read from memory, it's infeasible for WASM backends to pattern match that dynamic shuffle to more specialized instructions on different targets. I do think WASM needs a dynamic shuffle, but it seems like it would be worthwhile having specialized instructions for things like compress/expand. (there's also some past discussion of this in #8 that I didn't see referenced in the more recent issues)
FWIW LLVM has portable support for these: https://llvm.org/docs/LangRef.html#masked-vector-expanding-load-and-compressing-store-intrinsics, and they are in the pipeline for Rust's packed SIMD module.
@AndrewScheidecker Yeah, this is correct I think. AVX-512 isn't a realistic SIMD target for broad use at this point, and I don't think this is available in other instruction sets - so it feels that the correct thing to do here is to expose dynamic shuffle, since that has widespread support (and the uses are broader than expanding loads). I'm not sure how well the compiler would be able to emulate an expanding load without relying on a similar lookup table, which I haven't seen compilers synthesize before.
And the byte-wise compress/expand are in AVX-512 VBMI2, which isn't even in a shipping CPU yet.
I think compress/expand and other specialized shuffles are no substitute for a dynamic shuffle instruction, but there are advantages to adding instructions that directly express patterns that would otherwise be a table+shuffle, even if it just turns into a table+shuffle on most architectures: it allows applications to transparently support future hardware extensions, which in turn accelerates development and adoption of those extensions. The risk is that if WASM adds instructions that don't map well to future hardware, they become redundant when new instructions that do are added.
It's not too hard for a WASM backend to translate a compress or expand instruction to a table+shuffle (see the sketch below), but going the other direction isn't feasible. That said, LLVM does not yet correctly lower these portable compress/expand intrinsics on anything other than AVX-512 targets.
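As an illustration of the table+shuffle translation mentioned above, here is a hedged sketch (the names and the 8-lanes-at-a-time granularity are assumptions, not from the thread) of a byte expand built from a precomputed table of shuffle masks plus a dynamic byte shuffle (PSHUFB here as the stand-in for wasm's dynamic shuffle):

```c
#include <tmmintrin.h>  // SSSE3
#include <stdint.h>

// shuffle_masks[m][lane] holds, for each set bit of the 8-bit mask m,
// the index of the next consecutive source byte; 0x80 (produce zero)
// everywhere else. 256 * 16 = 4 KB total.
static uint8_t shuffle_masks[256][16];

static void init_masks(void) {
    for (int m = 0; m < 256; ++m) {
        int src = 0;
        for (int lane = 0; lane < 8; ++lane)
            shuffle_masks[m][lane] = (m & (1 << lane)) ? (uint8_t)src++ : 0x80;
        for (int lane = 8; lane < 16; ++lane)
            shuffle_masks[m][lane] = 0x80;  // upper half unused
    }
}

// Expand: scatter the low consecutive bytes of `data` into the output
// lanes selected by `mask`, zeroing unselected lanes.
static __m128i expand8(__m128i data, uint8_t mask) {
    __m128i perm = _mm_loadu_si128((const __m128i*)shuffle_masks[mask]);
    return _mm_shuffle_epi8(data, perm);  // the dynamic byte shuffle
}
```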
Sure - I agree with that; I was writing this from the perspective of LLVM, which I've never seen generate large lookup tables to emulate a single instruction. It feels like the pragmatic thing to do now is to expose dynamic shuffles; if and when expand/compress becomes available and widespread, compress/expand instructions could be considered separately.
Is there an equivalent of the compress/expand shuffles on ARM as well?
In ARM SVE you would execute the COMPACT instruction, which packs the active (predicate-selected) elements of a vector into the low elements of the destination; it operates on 32- and 64-bit elements rather than bytes.
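For example, a minimal sketch using the SVE ACLE intrinsics (assuming a compiler targeting SVE; the function name is made up), where svcompact performs the compress step:

```c
#include <arm_sve.h>

// Pack the active 32-bit lanes of v (those selected by predicate pg)
// into the low elements of the result - SVE's compress primitive.
svfloat32_t compress_f32(svbool_t pg, svfloat32_t v) {
    return svcompact(pg, v);
}
```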
This change adds a variable shuffle instruction to the SIMD proposal. When indices are out of range, the result is specified as 0 for each lane. This matches hardware behavior on the ARM and RISC-V architectures. On x86_64 and MIPS, the hardware provides instructions that select 0 when the high bit is set to 1 (x86_64) or when either of the two high bits is set to 1 (MIPS). On these architectures, the backend is expected to emit a pair of instructions - a saturating add (saturate(x + (128 - 16)) for x86_64) followed by the permute - to emulate the proposed behavior, as sketched below.

To distinguish variable shuffles from immediate shuffles, the existing v8x16.shuffle instruction is renamed to v8x16.shuffle2_imm to be explicit about the fact that it shuffles two vectors with an immediate argument. This naming scheme allows for adding variants like v8x16.shuffle2 and v8x16.shuffle1_imm in the future.

Fixes #68. Contributes to #24. Fixes #11.
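For concreteness, a minimal sketch (not part of the PR; the function name is hypothetical) of the x86_64 lowering described above, written with SSE intrinsics:

```c
#include <tmmintrin.h>  // SSSE3, for _mm_shuffle_epi8 (PSHUFB)

// PSHUFB zeroes a lane when the high bit of its index byte is set, so an
// unsigned saturating add of 128 - 16 = 112 maps indices 0..15 into
// 112..127 (high bit clear, low 4 bits preserved) and pushes any index
// >= 16 to a value with the high bit set, which yields 0 - matching the
// proposed out-of-range semantics.
__m128i wasm_variable_shuffle(__m128i table, __m128i idx) {
    __m128i adjusted = _mm_adds_epu8(idx, _mm_set1_epi8(112));
    return _mm_shuffle_epi8(table, adjusted);
}
```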
All modern SIMD instruction sets provide an instruction to shuffle bytes of one or more SIMD registers using indices from another SIMD register. Such instructions can be used to do parallel table lookup in a small table that fits into a few SIMD registers. The instructions are:

- `PSHUFB xmm, xmm/mem` on x86 with SSSE3 (according to the Steam hardware survey, supported on 97.32% of machines). Parallel table lookup in a 16-entry table. Due to special handling of negative indices, it is easy to extend this operation to larger tables.
- `VPPERM xmm, xmm, xmm, xmm/mem` on x86 with AMD XOP (supported on AMD Bulldozer/Piledriver/Steamroller/Excavator, but not on AMD Zen). Due to 2 source registers, does parallel table lookup in a 32-entry table.
- `VTBL.8 Dd, {reg-list}, Dm` on ARM NEON, where `reg-list` is a list of 1, 2, 3, or 4 D (64-bit) registers. Depending on the number of registers in the list, this instruction does table lookup in a table of 8, 16, 24, or 32 entries. The complementary instruction `VTBX` makes it easy to extend table lookup to more than 32 entries.
- `TBL Vd.<T>, {reg-list}, Vm.<T>` on ARM64, where `reg-list` is a list of 1, 2, 3, or 4 SIMD (128-bit) registers. This instruction enables parallel table lookup in a table of 16, 32, 48, or 64 entries, and the complementary instruction `TBX` extends the operation to larger tables.
- `VPERM vrT, vrA, vrB, vrC` on POWER/PowerPC with AltiVec. Performs parallel table lookup in a 32-entry table.

I suggest WASM SIMD should expose the following 5 instructions:

- `i8x16.byteshuffle(idx: v128, table: v128) -> v128` - shuffle a byte vector using elements of another byte vector as indices
- `i8x16.byteshuffle(idx: v128, tablelo: v128, tablehi: v128) -> v128` - shuffle two concatenated byte vectors using elements of a byte vector as indices
- `i8x16.lookup(idx: v128, entry0: ImmByte, ..., entry15: ImmByte) -> v128` - table lookup in a hard-coded table of 16 byte elements
- `i8x16.lookup(idx: v128, entry0: ImmByte, ..., entry31: ImmByte) -> v128` - table lookup in a hard-coded table of 32 byte elements
- `i8x16.lookup(idx: v128, entry0: ImmByte, ..., entry63: ImmByte) -> v128` - table lookup in a hard-coded table of 64 byte elements

Separate versions with hard-coded table elements are provided for efficiency: depending on the architecture, modifying table entries may enable a more efficient implementation. E.g. an implementation of a 32-entry lookup via `PSHUFB` would find it beneficial to XOR the second 16 elements with the first 16 elements, as sketched below.
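To make the XOR remark concrete, here is a hedged sketch (not from the issue; names are made up) of a 32-entry lookup on SSSE3 where the engine has pre-XORed the second half of the table into the first, storing t1x = t1 ^ t0:

```c
#include <tmmintrin.h>  // SSSE3

// 32-entry byte lookup with two PSHUFBs, assuming all indices are < 32.
// For idx < 16 the second shuffle yields 0 and the result is t0[idx];
// for 16 <= idx < 32 the XOR cancels t0 and the result is t1[idx & 15].
__m128i lookup32(__m128i t0, __m128i t1x /* = t1 ^ t0 */, __m128i idx) {
    __m128i lo = _mm_shuffle_epi8(t0, idx);  // t0[idx & 15], high bit clear
    // Flip bit 4, then saturating-add 0x70: indices 0..15 end up with the
    // high bit set (shuffle to zero), indices 16..31 become valid 0x70..0x7F.
    __m128i hi_idx = _mm_adds_epu8(_mm_xor_si128(idx, _mm_set1_epi8(0x10)),
                                   _mm_set1_epi8(0x70));
    return _mm_xor_si128(lo, _mm_shuffle_epi8(t1x, hi_idx));
}
```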