Simplify and optimize i8/i16 interleaving SIMD ops for x64 #567

robertknight · 2025-01-31T21:40:21Z

Simplify zip_hi_i8 and zip_hi_i16 for AVX2. The new code generates the same instructions as before, because LLVM was smart about optimizing the old code, but it is now simpler.
Split the tests for zip_lo_i8 and zip_hi_i8 into separate tests, and likewise for the i16 variants. This makes debugging easier.
Simplify and optimize the zip_* methods for AVX-512.

As a side note, these changes were made with the help of the OpenAI o3-mini-high model. This seems like a notable step up from o1-mini.

Replace several instructions for combining the second 128-bit lane from `lo` and `hi` with a single permute. The generated code is the same because LLVM was already doing this transformation itself, but this makes it obvious in the source that the operation can be done with one instruction.
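To make the lane arithmetic concrete, here is a minimal scalar sketch of the idea. It is not the code from this PR: `unpack_i8` models the per-lane behaviour of `_mm256_unpacklo_epi8` / `_mm256_unpackhi_epi8`, and `combine_lanes` is a hypothetical helper standing in for a single two-source 128-bit lane permute (the assumption here is something like `_mm256_permute2x128_si256`; the exact intrinsic is not named above).

```rust
/// Scalar model of the AVX2 byte unpack instructions. Within each of the two
/// 128-bit lanes (16 bytes), interleave the low or high 8 bytes of `a` and `b`.
fn unpack_i8(a: &[i8; 32], b: &[i8; 32], high: bool) -> [i8; 32] {
    let mut out = [0i8; 32];
    for lane in 0..2 {
        let base = lane * 16;
        let src = base + if high { 8 } else { 0 };
        for i in 0..8 {
            out[base + 2 * i] = a[src + i];
            out[base + 2 * i + 1] = b[src + i];
        }
    }
    out
}

/// Hypothetical helper modelling one two-source lane permute: the result is
/// 128-bit lane `lane` of `lo` followed by 128-bit lane `lane` of `hi`.
fn combine_lanes(lo: &[i8; 32], hi: &[i8; 32], lane: usize) -> [i8; 32] {
    let mut out = [0i8; 32];
    out[..16].copy_from_slice(&lo[lane * 16..lane * 16 + 16]);
    out[16..].copy_from_slice(&hi[lane * 16..lane * 16 + 16]);
    out
}

/// Interleave the first 16 elements of `a` and `b`.
fn zip_lo_i8(a: &[i8; 32], b: &[i8; 32]) -> [i8; 32] {
    let lo = unpack_i8(a, b, false);
    let hi = unpack_i8(a, b, true);
    combine_lanes(&lo, &hi, 0)
}

/// Interleave the last 16 elements of `a` and `b`.
fn zip_hi_i8(a: &[i8; 32], b: &[i8; 32]) -> [i8; 32] {
    let lo = unpack_i8(a, b, false);
    let hi = unpack_i8(a, b, true);
    combine_lanes(&lo, &hi, 1)
}
```

In this model `zip_hi_i8` differs from `zip_lo_i8` only in which lane is selected, which is why the second-lane case reduces to one permute.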

Splitting the tests results in some code duplication, but it makes debugging easier.

Instead of interleaving the low and high 256-bit halves separately and stitching them together, use the AVX-512 versions of the unpacklo / unpackhi instructions to interleave from the low/high halves of all 128-bit lanes in one go, then use `_mm512_permutex2var_epi32` to combine interleaved values from the different 128-bit lanes.
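The approach can be illustrated with a scalar sketch for the i16 case. This is a model written for illustration, not the code from this PR: `unpack_i16` mimics the per-lane behaviour of `_mm512_unpacklo_epi16` / `_mm512_unpackhi_epi16`, `permutex2var_epi32` mimics the intrinsic of the same name on an array of 32 i16 values (each 32-bit element is a pair of i16), and the index tables are my own derivation from the lane layout rather than values taken from the PR.

```rust
type V512 = [i16; 32];

/// Scalar model of the AVX-512 16-bit unpack instructions. Within each of the
/// four 128-bit lanes (8 x i16), interleave the low or high 4 elements of
/// `a` and `b`.
fn unpack_i16(a: &V512, b: &V512, high: bool) -> V512 {
    let mut out = [0i16; 32];
    for lane in 0..4 {
        let base = lane * 8;
        let src = base + if high { 4 } else { 0 };
        for i in 0..4 {
            out[base + 2 * i] = a[src + i];
            out[base + 2 * i + 1] = b[src + i];
        }
    }
    out
}

/// Scalar model of `_mm512_permutex2var_epi32(a, idx, b)`. Bits 0-3 of each
/// index pick a 32-bit element and bit 4 picks the source (0 = `a`, 1 = `b`).
fn permutex2var_epi32(a: &V512, idx: &[u8; 16], b: &V512) -> V512 {
    let mut out = [0i16; 32];
    for (i, &ix) in idx.iter().enumerate() {
        let src = if ix & 16 == 0 { a } else { b };
        let elem = (ix & 15) as usize;
        // One 32-bit element holds two adjacent i16 values.
        out[2 * i] = src[2 * elem];
        out[2 * i + 1] = src[2 * elem + 1];
    }
    out
}

/// Interleave the first 16 elements of `a` and `b`.
fn zip_lo_i16(a: &V512, b: &V512) -> V512 {
    let lo = unpack_i16(a, b, false);
    let hi = unpack_i16(a, b, true);
    // Lane order: lo lane 0, hi lane 0, lo lane 1, hi lane 1.
    let idx = [0, 1, 2, 3, 16, 17, 18, 19, 4, 5, 6, 7, 20, 21, 22, 23];
    permutex2var_epi32(&lo, &idx, &hi)
}

/// Interleave the last 16 elements of `a` and `b`.
fn zip_hi_i16(a: &V512, b: &V512) -> V512 {
    let lo = unpack_i16(a, b, false);
    let hi = unpack_i16(a, b, true);
    // Lane order: lo lane 2, hi lane 2, lo lane 3, hi lane 3.
    let idx = [8, 9, 10, 11, 24, 25, 26, 27, 12, 13, 14, 15, 28, 29, 30, 31];
    permutex2var_epi32(&lo, &idx, &hi)
}
```

Both functions use two unpacks plus one permute. In this model the i8 variant would only change the unpack step, since the permute moves whole 128-bit lanes either way.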

robertknight added 3 commits January 31, 2025 20:34

Split tests for zip_lo_* and zip_hi_*

5c902fa

