Improve SIMD support in vertex codec #72
Merged
Conversation
Move transpose8/unzigzag8 into separate ifdef sections that are independent of byte decoding; this allows us to substitute different byte decoding implementations independently.
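A rough sketch of the preprocessor layout this enables (the macro names, signatures, and empty bodies are illustrative, not the codec's actual ones): the shared helpers sit in one section, and each byte-decoding backend gets its own section that can be swapped out.

```cpp
#include <stddef.h>

// Shared SIMD helpers, independent of the byte-decoding backend.
#if defined(SIMD_SSE) || defined(SIMD_AVX) || defined(SIMD_WASM)
static void transpose8(unsigned char* data)
{
	(void)data; // body elided in this sketch
}

static void unzigzag8(unsigned char* data, size_t size)
{
	(void)data;
	(void)size; // body elided in this sketch
}
#endif

// Each byte-decoding backend can now be replaced independently.
#if defined(SIMD_SSE)
// SSSE3 byte decoding
#endif

#if defined(SIMD_AVX)
// AVX512-VBMI2 byte decoding
#endif

#if defined(SIMD_WASM)
// WASM SIMD byte decoding
#endif
```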
This is a test implementation of decoding using AVX512-VBMI2 and related instruction sets, available on Ice Lake. The result is ~10% faster than the baseline SSSE3 version, and much simpler.
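A minimal sketch of the VBMI2 building block that makes this path simpler (an assumed shape, not the codec's exact routine): a masked expand-load places the explicit bytes directly into their lanes, which the SSSE3 path has to emulate with a lookup table of shuffle controls.

```cpp
#include <immintrin.h>

// Requires AVX512VBMI2 + AVX512VL (e.g. -march=icelake-client).
// Lanes whose bit is set in `mask` are filled with consecutive bytes read
// from `data`; the remaining lanes are zeroed.
static inline __m128i expand_bytes(const unsigned char* data, unsigned short mask)
{
	return _mm_maskz_expandloadu_epi8((__mmask16)mask, data);
}
```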
Instead of using the gcc intrinsic __builtin_popcount, we now use _mm_popcnt_u32, which is available in MSVC as well.
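For illustration (the helper name is mine), the portable call looks like this; _mm_popcnt_u32 is declared in <nmmintrin.h> on both MSVC and gcc/clang:

```cpp
#include <nmmintrin.h>

// Counts set bits in a 16-bit lane mask without relying on __builtin_popcount.
static inline int popcount16(unsigned int mask)
{
	return _mm_popcnt_u32(mask);
}
```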
Use jobs: instead of matrix/include
For now we aren't using SIMD decoding for 2/4-bit groups, but everything else uses SIMD. Using buddha.glb as a benchmark and repeating decodeVertexBuffer 100 times, this reduces the time from 880 ms to 530 ms on Chrome 78.
Right now this function is a no-op; however, in the SIMD implementation it computes a few tables, so we will need to call it sooner or later.
Note: wasmMoveMask is currently very suboptimal; this will be fixed later. Unfortunately, this code is hitting codegen issues, so it's not clear whether it even works.
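For reference, here is one deliberately naive movemask emulation consistent with the "very suboptimal" note; I don't know the exact code this commit uses, so treat it as a stand-in sketch:

```cpp
#include <wasm_simd128.h>

// Gather the sign bit of each byte lane into a 16-bit mask by round-tripping
// through memory; slow, but needs no native movemask instruction.
static int wasmMoveMaskSlow(v128_t v)
{
	unsigned char bytes[16];
	wasm_v128_store(bytes, v);

	int mask = 0;
	for (int i = 0; i < 16; ++i)
		mask |= (bytes[i] >> 7) << i;

	return mask;
}
```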
Some instructions are predicated on instruction sets like VBMI and VL; while there's no hardware that supports VBMI2 but not VBMI or VL, getting this right matters when instruction sets are specified manually on the command line.
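A sketch of the kind of guard this implies (the SIMD_AVX macro name is illustrative): only enable the AVX512 path when every required subset is advertised by the compiler.

```cpp
// gcc/clang predefine these when the corresponding -m flags or -march are set.
#if defined(__AVX512VBMI2__) && defined(__AVX512VBMI__) && defined(__AVX512VL__)
#define SIMD_AVX
#endif
```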
Chrome doesn't support load_splat and andnot; we work around the absence of these instructions so that the bytecode loads (it only loads in the Canary version, but that's better than nothing!).
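A sketch of such workarounds, written against intrinsics that exist in current wasm_simd128.h (the codec's actual code and the intrinsic spellings of that era may differ):

```cpp
#include <wasm_simd128.h>

// load_splat workaround: scalar load, then splat the byte across all lanes.
static inline v128_t splat_byte(const unsigned char* p)
{
	return wasm_i8x16_splat(*p);
}

// andnot workaround: build a & ~b from separate and/not operations.
static inline v128_t andnot_bytes(v128_t a, v128_t b)
{
	return wasm_v128_and(a, wasm_v128_not(b));
}
```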
This change uses relatively efficient wide integer math to build the final mask as quickly as possible. It still requires a few operations, so we also use a fast path for when the entire mask is 0.
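Illustration only, since the codec's actual mask construction is more involved: the shape is an up-front zero check, then wide integer bit tricks instead of a per-byte loop. The multiply-and-mask below spreads bit i of an 8-bit mask into byte i of a 64-bit word.

```cpp
#include <stdint.h>

static inline uint64_t spread_bits(unsigned int mask)
{
	// Replicate the 8-bit mask into every byte, then keep a different bit in
	// each byte: byte i of the result is nonzero exactly when bit i is set.
	return ((uint64_t)(mask & 0xff) * 0x0101010101010101ull) & 0x8040201008040201ull;
}

static inline uint64_t build_byte_mask(unsigned int mask)
{
	// Fast path: a group with nothing to decode skips the wide math entirely.
	if (mask == 0)
		return 0;

	return spread_bits(mask);
}
```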
We no longer need the baseline implementation - we always use swizzle et al now. Also run the code through clang-format.
This is so that we can remove this workaround when we don't need it anymore.
This change adds AVX512 and WASM SIMD support to the vertex codec.
The AVX512 version gives a minor speedup compared to the SSSE3 version (~10% faster to decode); however, it is also ~2x smaller, which is nice to have. Because the speedup isn't very significant, we only activate this version when AVX512 instructions are declared to be supported, which happens in gcc/clang when -march=icelake-client is specified. Note that the implementation requires AVX512VL and AVX512VBMI2.
The WASM SIMD version is around 3x faster than the scalar version, which is about the same speedup that SIMD gives the native versions. This currently only works on the bleeding edge of the entire ecosystem: it requires several fixes in LLVM/binaryen and Chrome Canary to support swizzle and some other instructions.
Note that the WASM SIMD version has a few TODO comments where we use suboptimal or slightly unsafe instructions, because Chrome Canary doesn't support the load_splat variants. Once that support is implemented, we can switch to better and hopefully faster intrinsics, although I'd expect the performance gains to be minimal.
Performance as measured on an i5-1035G4 on buddha.obj, using the position stream (4.4 MB uncompressed, 8-byte position data):
C++ scalar: 0.72 GB/s
C++ SSSE3: 2.15 GB/s
C++ AVX512: 2.42 GB/s
wasm scalar: 0.44 GB/s
wasm SIMD: 1.23 GB/s