
Byte shuffle / table lookup operations #24

Maratyszcza opened this issue Nov 5, 2017 · 11 comments

@Maratyszcza
Contributor

All modern SIMD instruction sets provide an instruction to shuffle the bytes of one or more SIMD registers using indices from another SIMD register. Such instructions can be used to do a parallel table lookup in a small table that fits into a few SIMD registers. The instructions are:

  • PSHUFB xmm, xmm/mem on x86 with SSSE3 (according to the Steam hardware survey, supported on 97.32% of machines). Parallel table lookup in a 16-entry table. Due to the special handling of negative indices, it is easy to extend this operation to larger tables (see the sketch after this list).
  • VPPERM xmm, xmm, xmm, xmm/mem on x86 with AMD XOP (supported on AMD Bulldozer/Piledriver/Steamroller/Excavator, but not on AMD Zen). Thanks to its two source registers, it does a parallel table lookup in a 32-entry table.
  • VTBL.8 Dd, {reg-list}, Dm on ARM NEON, where reg-list is a list of 1, 2, 3, or 4 D (64-bit) registers. Depending on the number of registers in the list, this instruction does a table lookup in a table of 8, 16, 24, or 32 entries. The complementary VTBX instruction makes it easy to extend the lookup to more than 32 entries.
  • TBL Vd.<T>, {reg-list}, Vm.<T> on ARM64, where reg-list is a list of 1, 2, 3, or 4 SIMD (128-bit) registers. This instruction enables a parallel table lookup in a table of 16, 32, 48, or 64 entries, and the complementary TBX instruction extends the operation to larger tables.
  • VPERM vrT, vrA, vrB, vrC on POWER/PowerPC with Altivec. Performs parallel table lookup in a 32-entry table.
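For concreteness, here is a common way the PSHUFB case is extended to a 32-entry table with SSSE3 intrinsics. This is a sketch for illustration only (not part of the proposal); the helper name and the assumption that every index lane is in [0, 31] are mine:

```c
#include <emmintrin.h>   // SSE2: _mm_adds_epu8, _mm_sub_epi8, _mm_or_si128
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8 (PSHUFB)

// 32-entry table lookup built from two PSHUFBs; assumes each index lane is in [0, 31].
static __m128i lookup32(__m128i idx, __m128i table_lo, __m128i table_hi)
{
    // idx + 0x70 keeps the low nibble for idx < 16 and sets the sign bit for idx >= 16,
    // so PSHUFB zeroes the lanes that belong to the upper half of the table.
    __m128i lo = _mm_shuffle_epi8(table_lo, _mm_adds_epu8(idx, _mm_set1_epi8(0x70)));
    // idx - 16 has the sign bit set for idx < 16, so PSHUFB zeroes the lanes that
    // belong to the lower half of the table.
    __m128i hi = _mm_shuffle_epi8(table_hi, _mm_sub_epi8(idx, _mm_set1_epi8(16)));
    return _mm_or_si128(lo, hi);
}
```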

I suggest that WASM SIMD expose the following five instructions:

  • i8x16.byteshuffle(idx: v128, table: v128) -> v128 - shuffle a byte vector using elements of another byte vector as indices
  • i8x16.byteshuffle(idx: v128, tablelo: v128, tablehi: v128) -> v128 - shuffle two concatenated byte vectors using elements of a byte vector as indices
  • i8x16.lookup(idx: v128, entry0: ImmByte, ..., entry15: ImmByte) -> v128 - table lookup in a hard-coded table of 16 byte elements.
  • i8x16.lookup(idx: v128, entry0: ImmByte, ..., entry31: ImmByte) -> v128 - table lookup in a hard-coded table of 32 byte elements.
  • i8x16.lookup(idx: v128, entry0: ImmByte, ..., entry63: ImmByte) -> v128 - table lookup in a hard-coded table of 64 byte elements.

Separate versions with hard-coded table elements are provided for efficiency: depending on the architecture, modifying the table entries may enable a more efficient implementation. E.g., an implementation of a 32-byte shift via the PSHUFB instruction would find it beneficial to XOR the second 16 elements with the first 16 elements.
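For reference, the per-lane semantics of the first (one-table) variant could be sketched in scalar code as below. The handling of out-of-range indices (zeroing) is an assumption here, matching the variable-shuffle change that lands later in this thread; the proposal above leaves it open:

```c
// Scalar sketch of i8x16.byteshuffle(idx, table). The zeroing of out-of-range
// indices is an assumed behavior, not spelled out in the proposal above.
typedef struct { unsigned char b[16]; } v128_t;

static v128_t i8x16_byteshuffle(v128_t idx, v128_t table)
{
    v128_t r;
    for (int i = 0; i < 16; ++i)
        r.b[i] = (idx.b[i] < 16) ? table.b[idx.b[i]] : 0;
    return r;
}
```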

@rrwinterton

I would agree with the capability to do shuffles. The pshufb is very useful in ARGB and YUV conversions as well as matrix manipulation. If I understand the proposal, the table data and index is your mask? Could the first two suggestions be made into one API? I guess there isn't a preference for memory or register operations. Perms are nice but can be expensive; they are nice for reverse lookups, so again I agree it is good to use. In my case I believe I used this for a Hamming code, but I would need to go back and check; that is a use case. I probably wouldn't call it a lookup in the API because there are other uses, but I do like the instruction selection.

@lemaitre

lemaitre commented Apr 19, 2018

Hi,
I just created a merge request on shuffling and permutations, and I think there is some overlap: #30
Here are the overlaps (might need some argument reordering):

  • i8x16.byteshuffle(idx: v128, table: v128) -> v128 == v8x16.permuteVar(a: v128, s: v128) -> v128
  • i8x16.byteshuffle(idx: v128, tablelo: v128, tablehi: v128) -> v128 == v8x16.shuffleVar(a: v128, b: v128, s: v128) -> v128

@zeux
Contributor

zeux commented Mar 2, 2019

FWIW something along the lines of i8x16.byteshuffle is a feature I'd love to see in wasm.

As mentioned in the OP, it is supported by SSSE3 on the x86 side via pshufb, and by NEON on the ARM side via vtbl.

My library, https://github.com/zeux/meshoptimizer, provides a fast decoder for index & vertex data. It's currently ~5x slower with wasm (when compiled with Emscripten) compared to the native C++ version, and most of this difference seems to be due to the lack of SIMD, see zeux/meshoptimizer@e9120f0. This decoder requires a variable byte permute; it's not really feasible to emulate this, since emulation scalarizes most of the work required to decode. In general, pshufb is critical for any sort of accelerated decompression.

@zeux
Contributor

zeux commented Mar 7, 2019

In the linked PR, @tlively and @dtig requested more precise benchmark numbers that are more specific to byte shuffles. To that end, I've implemented an SSE2 fallback for the decoder (see zeux/meshoptimizer@bbeafeb) - the only instruction the decoder used that wasn't in SSE2 was pshufb. I'd expect this version to map reasonably well to wasm without a dynamic shuffle.
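For context, a scalar emulation of _mm_shuffle_epi8 looks roughly like the sketch below. This is only an illustration of why such a fallback scalarizes the work, not the exact code from the linked commit:

```c
#include <emmintrin.h>   // SSE2

// Per-lane emulation of PSHUFB: spill both vectors to memory and index byte by byte.
static __m128i shuffle_epi8_emulated(__m128i table, __m128i idx)
{
    unsigned char t[16], i[16], r[16];
    _mm_storeu_si128((__m128i*)t, table);
    _mm_storeu_si128((__m128i*)i, idx);
    for (int k = 0; k < 16; ++k)
        r[k] = (i[k] & 0x80) ? 0 : t[i[k] & 0x0F];
    return _mm_loadu_si128((const __m128i*)r);
}
```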

The performance numbers are as follows - they are all measured on gcc 7.3, x64, i7-8700K:

config=release-avx: 2.67 GB/s vertex, 2.57 GB/s vertex oct
config=release-ssse3: 2.62 GB/s vertex, 2.53 GB/s vertex oct
config=release-sse2: 1.58 GB/s vertex, 1.42 GB/s vertex oct
config=release: 0.88 GB/s vertex, 0.84 GB/s vertex oct

You can reproduce these numbers by checking out https://github.com/zeux/meshoptimizer/tree/wasm-simd and running something like

make config=release-ssse3 dev files=data/buddha.obj

(buddha.obj is a standard Buddha model that you can download from, for example, https://casual-effects.com/data/)

The difference between "vertex" and "vertex oct" is that "vertex" compresses an array of 16-byte vertex structures whereas "vertex oct" compresses an array of 12-byte vertex structures. It's the same code, but the data is different and it affects the use of pshufb/emulated pshufb so I'm including both for completeness.

You can see that the scalar version is ~3x slower than the AVX version. The AVX version is a tiny bit faster than the SSSE3 version; I believe this is just due to better register allocation from the use of 3-operand instructions, and I'm including it because I'd expect browsers to use 3-operand instructions. SSSE3 is the baseline version with pshufb, and SSE2 replaces _mm_shuffle_epi8 with a partial scalar emulation path that only handles the table lookup.

The emulated pshufb path results in the SSE2 version being 1.65x and 1.78x slower than SSSE3 on the two different data sets.

Note that the algorithm only uses pshufb in two places in the midst of many other SSE instructions, so I'd expect this to be far from the worst practical case, but even then the performance loss is substantial. I'm not sure how WASM will compare - the baseline scalar implementation is currently substantially slower in WASM vs C; if memory accesses are range-checked against the WASM heap, for example, it's possible that the emulation will be comparatively slower vs a true vector instruction.

Happy to answer any questions about the workload or to do more testing.

@AndrewScheidecker
Contributor

@zeux It looks like your use could be mapped to something like the AVX-512 vpcompressb instruction. Is that correct?

Once you do a shuffle with a mask read from memory, it's infeasible for WASM backends to pattern match that dynamic shuffle to more specialized instructions on different targets. I do think WASM needs a dynamic shuffle, but it seems like it would be worthwhile having specialized instructions for things like compress/expand.

(there's also some past discussion of this in #8 that I didn't see referenced in the more recent issues)

@gnzlbg
Contributor

gnzlbg commented Mar 7, 2019

but it seems like it would be worthwhile having specialized instructions for things like compress/expand.

FWIW LLVM has portable support for these: https://llvm.org/docs/LangRef.html#masked-vector-expanding-load-and-compressing-store-intrinsics , and they are in the pipeline for Rust's packed SIMD module.

@zeux
Contributor

zeux commented Mar 7, 2019

@AndrewScheidecker Yeah, this is correct I think. AVX-512 isn't a realistic SIMD target for broad use at this point, and I don't think this is available in other instruction sets - so it feels like the correct thing to do here is to expose a dynamic shuffle, since that has widespread support (and the uses are broader than expanding loads). I'm not sure how well the compiler would be able to emulate an expanding load without relying on a similar lookup table, which I haven't seen compilers synthesize before.

@AndrewScheidecker
Contributor

AVX-512 isn't a realistic SIMD target for broad use at this point, and I don't think this is available in other instruction sets

And the byte-wise compress/expand are in AVX-512 VBMI2, which isn't even in a shipping CPU yet.

so it feels like the correct thing to do here is to expose a dynamic shuffle, since that has widespread support (and the uses are broader than expanding loads)

I think compress/expand and other specialized shuffles are no substitute for a dynamic shuffle instruction, but there are advantages to adding instructions that directly express patterns that would otherwise be a table+shuffle, even if it just turns into a table+shuffle on most architectures: it allows applications to transparently support future hardware extensions, which in turn accelerates development and adoption of those extensions. The risk is that if WASM adds instructions that don't map well to future hardware, they become redundant when new instructions that do are added.

I'm not sure how well the compiler would be able to emulate an expanding load without relying on a similar lookup table, which I haven't seen compilers synthesize before.

It's not too hard for a WASM backend to translate a compress or expand instruction to a table+shuffle, but going the other direction isn't feasible. That said, LLVM does not yet correctly lower these portable compress/expand intrinsics on anything other than -mcpu=icelake-client or -mcpu=icelake-server: https://godbolt.org/z/mp7Z3V vs https://godbolt.org/z/eUq2Qn.

@zeux
Contributor

zeux commented Mar 7, 2019

It's not too hard for a WASM backend to translate a compress or expand instruction to a table+shuffle, but going the other direction isn't feasible.

Sure - I agree with that; I was writing this from the perspective of LLVM, which I've never seen generate large lookup tables to emulate a single instruction. It feels like the pragmatic thing to do now is to expose dynamic shuffles; if and when expand/compress becomes available and widespread, a compress/expand instruction could be considered separately.

@dtig
Member

dtig commented Mar 7, 2019

Is there an equivalent of the compress/expand shuffles on ARM as well?

@gnzlbg
Contributor

gnzlbg commented Mar 7, 2019

Is there an equivalent of the compress/expand shuffles on ARM as well?

In ARM SVE you would execute COMPACT on a vector with a mask to move the selected elements to the front (or the back?), and then just write the front to memory. On ARM NEON I don't know of a direct equivalent. If you have a run-time mask, you have to go from the mask to a vector of shuffle indices and then do a dynamic shuffle. If the mask is small (e.g. for a 4-element vector), you might be able to just do a table lookup to get it, but I never implemented this myself (it makes more sense to just work with shuffle indices directly). On AArch64 you can have really big tables for the lookups, but I don't recall how big they are.
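A hypothetical SSSE3 sketch of that mask-to-shuffle-indices approach for a 4-lane compress (left-pack). The table contents and helper name are illustrative, not taken from any of the linked code:

```c
#include <emmintrin.h>   // SSE2: _mm_castsi128_ps, _mm_loadu_si128
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8 (PSHUFB)

// 16-entry table keyed by the 4-bit lane mask; each entry gives PSHUFB byte indices
// that move the selected 32-bit lanes to the front (0x80 lanes come out as zero).
#define X 0x80
static const unsigned char compress_idx[16][16] = {
    { X,X,X,X, X,X,X,X, X,X,X,X, X,X,X,X },              // 0000
    { 0,1,2,3, X,X,X,X, X,X,X,X, X,X,X,X },              // 0001
    { 4,5,6,7, X,X,X,X, X,X,X,X, X,X,X,X },              // 0010
    { 0,1,2,3, 4,5,6,7, X,X,X,X, X,X,X,X },              // 0011
    { 8,9,10,11, X,X,X,X, X,X,X,X, X,X,X,X },            // 0100
    { 0,1,2,3, 8,9,10,11, X,X,X,X, X,X,X,X },            // 0101
    { 4,5,6,7, 8,9,10,11, X,X,X,X, X,X,X,X },            // 0110
    { 0,1,2,3, 4,5,6,7, 8,9,10,11, X,X,X,X },            // 0111
    { 12,13,14,15, X,X,X,X, X,X,X,X, X,X,X,X },          // 1000
    { 0,1,2,3, 12,13,14,15, X,X,X,X, X,X,X,X },          // 1001
    { 4,5,6,7, 12,13,14,15, X,X,X,X, X,X,X,X },          // 1010
    { 0,1,2,3, 4,5,6,7, 12,13,14,15, X,X,X,X },          // 1011
    { 8,9,10,11, 12,13,14,15, X,X,X,X, X,X,X,X },        // 1100
    { 0,1,2,3, 8,9,10,11, 12,13,14,15, X,X,X,X },        // 1101
    { 4,5,6,7, 8,9,10,11, 12,13,14,15, X,X,X,X },        // 1110
    { 0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15 },        // 1111
};
#undef X

// Left-pack the 32-bit lanes of v whose corresponding lane in mask has its sign bit set.
static __m128i compress4(__m128i v, __m128i mask)
{
    int m = _mm_movemask_ps(_mm_castsi128_ps(mask));  // 4-bit lane mask
    return _mm_shuffle_epi8(v, _mm_loadu_si128((const __m128i*)compress_idx[m]));
}
```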

dtig pushed a commit that referenced this issue Mar 27, 2019
This change adds a variable shuffle instruction to SIMD proposal.

When indices are out of range, the result is specified as 0 for each
lane. This matches hardware behavior on ARM and RISCV architectures.

On x86_64 and MIPS, the hardware provides instructions that can select 0
when the high bit is set to 1 (x86_64) or any of the two high bits are
set to 1 (MIPS). On these architectures, the backend is expected to emit
a pair of instructions, saturating add (saturate(x + (128 - 16)) for
x86_64) and permute, to emulate the proposed behavior.

To distinguish variable shuffles from immediate shuffles, the existing
v8x16.shuffle instruction is renamed to v8x16.shuffle2_imm to be
explicit about the fact that it shuffles two vectors with an immediate
argument.

This naming scheme allows for adding variants like v8x16.shuffle2 and
v8x16.shuffle1_imm in the future.

Fixes #68.
Contributes to #24.
Fixes #11.
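A minimal sketch of the x86_64 lowering described in the commit message above (the helper name is hypothetical, since the excerpt does not give the new instruction's final name):

```c
#include <emmintrin.h>   // SSE2: _mm_adds_epu8, _mm_set1_epi8
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8 (PSHUFB)

// Saturating add + PSHUFB: lanes with idx >= 16 end up in the 0x80..0xFF range after
// the add, so PSHUFB zeroes them, matching the proposed out-of-range -> 0 semantics.
static __m128i variable_shuffle_x86(__m128i a, __m128i idx)
{
    return _mm_shuffle_epi8(a, _mm_adds_epu8(idx, _mm_set1_epi8(128 - 16)));
}
```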