
Improve SIMD support in vertex codec #72

Merged
merged 14 commits on Nov 2, 2019
Conversation

zeux (Owner) commented Nov 2, 2019

This change adds AVX512 and WASM SIMD support to the vertex codec.

The AVX512 version gives a minor speedup over the SSSE3 version - it decodes ~10% faster; however, it is also ~2x smaller, which is nice to have. Because the speedup isn't very significant, we only activate this version if AVX512 instructions are declared to be supported - in gcc/clang this happens when -march=icelake-client is specified. Note that the implementation requires AVX512VL and AVX512VBMI2.

The WASM SIMD version is around 3x faster than the scalar version, which is about the same speedup that SIMD gives the native versions. It currently only works on the bleeding edge of the entire ecosystem - it requires several fixes in LLVM/binaryen, plus Chrome Canary for swizzle and some other instructions.

Note that the WASM SIMD version has a few TODO comments where we use suboptimal or slightly unsafe instructions, because Chrome Canary doesn't support the load_splat variants. Once that support lands, we can switch to better and hopefully faster intrinsics, although I'd expect the performance gains to be minimal.

Performance as measured on i5-1035G4 on buddha.obj using the position stream (4.4 MB uncompressed, 8-byte position data):

C++ scalar: 0.72 GB/s
C++ SSSE3: 2.15 GB/s
C++ AVX512: 2.42 GB/s
wasm scalar: 0.44 GB/s
wasm SIMD: 1.23 GB/s

zeux added 14 commits November 1, 2019 13:29
Move transpose8/unzigzag8 into separate ifdef sections that are
independent of byte decoding; this allows us to substitute byte decoding
with different versions separately.
This is a test implementation of decoding using AVX512-VBMI2 and related
instruction sets, available on Ice Lake. The result is ~10% faster than
the baseline SSSE3 version, and much simpler.
Instead of using the gcc intrinsic __builtin_popcount, we now use
_mm_popcnt_u32, which is also available in MSVC.
Use jobs: instead of matrix/include
For now we aren't using SIMD decoding for 2/4 bit groups, but everything
else is using SIMD.

Using buddha.glb as a benchmark and repeating decodeVertexBuffer 100
times, this reduces the time from 880 ms to 530 ms on Chrome 78.
Right now this function is a no-op; however in SIMD implementation it
computes a few tables so we will need to call it sooner or later.
Note: wasmMoveMask is currently very suboptimal; this will be fixed
later.

Unfortunately this code is hitting codegen issues - so it's not clear if
it even works.
Some instructions are predicated on instruction sets like VBMI and VL;
while there's no hardware that supports VBMI2 and doesn't support VBMI
or VL, when manually specifying instruction sets in command line this
becomes important to get right.
Chrome doesn't support load_splat and andnot; we work around the absence
of these instructions to make the bytecode load (it loads only in the
Canary version, but that's better than nothing!)
This change uses relatively efficient wide integer math to build the
final mask as quickly as possible. It still requires a few operations so
we also use a fast path for when the entire mask is 0.
We no longer need the baseline implementation - we always use swizzle et
al now. Also run the code through clang-format.
This is so that we can remove this workaround when we don't need it
anymore.