
Improve SIMD support in vertex codec #72

Merged
merged 14 commits on Nov 2, 2019
Conversation

zeux (Owner) commented Nov 2, 2019

This change adds AVX512 and WASM SIMD support to the vertex codec.

The AVX512 version gives a minor speedup over the SSSE3 version - it decodes ~10% faster; however, it is also ~2x smaller, which is nice to have. Because the speedup isn't very significant, we only activate this version if AVX512 instructions are declared to be supported - in gcc/clang this happens when -march=icelake-client is specified. Note that the implementation requires AVX512VL and AVX512VBMI2.

The WASM SIMD version is around 3x faster than the scalar version, which is about the same speedup that SIMD gives the native versions. It currently only works on the bleeding edge of the entire ecosystem - it requires several fixes in LLVM/binaryen, plus Chrome Canary for swizzle and some other instructions.

Note that the WASM SIMD version has a few TODO comments where we use suboptimal or slightly unsafe instructions, because Chrome Canary doesn't support the load_splat variants. Once that support lands, we can switch to better and hopefully faster intrinsics, although I'd expect the performance gains to be minimal.

Performance as measured on i5-1035G4 on buddha.obj using the position stream (4.4 MB uncompressed, 8-byte position data):

C++ scalar: 0.72 GB/s
C++ SSSE3: 2.15 GB/s
C++ AVX512: 2.42 GB/s
wasm scalar: 0.44 GB/s
wasm SIMD: 1.23 GB/s

zeux added 14 commits November 1, 2019 13:29
Move transpose8/unzigzag8 into separate ifdef sections that are
independent of byte decoding; this allows us to substitute byte decoding
with different versions separately.
This is a test implementation of decoding using AVX512-VBMI2 and related
instruction sets, available on Ice Lake. The result is ~10% faster than
the baseline SSSE3 version, and much simpler.
Instead of using the gcc intrinsic __builtin_popcount, we now use
_mm_popcnt_u32, which is also available in MSVC.
Use jobs: instead of matrix/include
For now we aren't using SIMD decoding for 2/4 bit groups, but everything
else is using SIMD.

Using buddha.glb as a benchmark and repeating decodeVertexBuffer 100
times, this reduces the time from 880 ms to 530 ms on Chrome 78.
Right now this function is a no-op; however in SIMD implementation it
computes a few tables so we will need to call it sooner or later.
Note: wasmMoveMask is currently very suboptimal; this will be fixed
later.

Unfortunately this code is hitting codegen issues - so it's not clear if
it even works.
Some instructions are predicated on instruction sets like VBMI and VL;
while there's no hardware that supports VBMI2 and doesn't support VBMI
or VL, when manually specifying instruction sets in command line this
becomes important to get right.
Chrome doesn't support load_splat and andnot; we work around the absence
of these instructions to make the bytecode load (it loads only in the
Canary version, but that's better than nothing!)
This change uses relatively efficient wide integer math to build the
final mask as quickly as possible. It still requires a few operations so
we also use a fast path for when the entire mask is 0.
We no longer need the baseline implementation - we always use swizzle et
al now. Also run the code through clang-format.
This is so that we can remove this workaround when we don't need it
anymore.