Simplify and optimize i8/i16 interleaving SIMD ops for x64 #567

Draft
wants to merge 3 commits into
base: main

robertknight
Owner

  • Simplify zip_hi_i8 and zip_hi_i16 for AVX2. The new code generates the same instructions as before, because LLVM was smart about optimizing the old code, but it is now simpler.
  • Split the tests for zip_lo_i8 and zip_hi_i8, and likewise for the i16 variants. This makes debugging easier.
  • Simplify and optimize the zip_* methods for AVX-512.

As a side note, these changes were made with the help of OpenAI's o3-mini-high model. It seems like a notable step up from o1-mini.

Replace the several instructions that combined the second 128-bit lane of `lo`
and of `hi` with a single permute.

The generated code is the same because LLVM was already doing this
transformation itself, but this makes it obvious in the source that the
operation can be done with one instruction.
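
To make the lane arithmetic concrete, here is a scalar model of the AVX2 approach for i8. It is a sketch, not the code from this PR: the helper names (`unpack_i8`, `concat_lane`) and the array-based representation are assumptions for illustration, and the comments name the intrinsics each helper is meant to mimic.

```rust
// Scalar model only; a real implementation would use the AVX2 intrinsics
// named in the comments. All names here are hypothetical.
type V256 = [i8; 32]; // one 256-bit vector of i8
const I8_PER_LANE: usize = 16; // i8 elements in a 128-bit lane

/// Models `_mm256_unpacklo_epi8` (`high == false`) and `_mm256_unpackhi_epi8`
/// (`high == true`): within each 128-bit lane, interleave the low or high
/// eight bytes of `a` and `b`.
fn unpack_i8(a: &V256, b: &V256, high: bool) -> V256 {
    let half = I8_PER_LANE / 2;
    let offset = if high { half } else { 0 };
    let mut out = [0i8; 32];
    for lane in 0..2 {
        let base = lane * I8_PER_LANE;
        for i in 0..half {
            out[base + 2 * i] = a[base + offset + i];
            out[base + 2 * i + 1] = b[base + offset + i];
        }
    }
    out
}

/// Models `_mm256_permute2x128_si256(x, y, imm)` for the two selectors needed
/// here: `lane == 0` corresponds to imm 0x20 and `lane == 1` to imm 0x31.
fn concat_lane(x: &V256, y: &V256, lane: usize) -> V256 {
    let range = lane * I8_PER_LANE..(lane + 1) * I8_PER_LANE;
    let mut out = [0i8; 32];
    out[..I8_PER_LANE].copy_from_slice(&x[range.clone()]);
    out[I8_PER_LANE..].copy_from_slice(&y[range]);
    out
}

/// Interleave the first 16 elements of `a` and `b`.
fn zip_lo_i8(a: &V256, b: &V256) -> V256 {
    concat_lane(&unpack_i8(a, b, false), &unpack_i8(a, b, true), 0)
}

/// Interleave the last 16 elements of `a` and `b`: one permute takes the
/// second 128-bit lane of `lo` and of `hi`.
fn zip_hi_i8(a: &V256, b: &V256) -> V256 {
    concat_lane(&unpack_i8(a, b, false), &unpack_i8(a, b, true), 1)
}
```

In this model `zip_hi_i8` differs from `zip_lo_i8` only in which lane the final permute selects, which is the point of the simplification.
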

Splitting the tests results in some code duplication, but it makes debugging easier.

Instead of interleaving the low and high 256-bit halves separately and stitching them
together, use the AVX-512 versions of the unpacklo / unpackhi instructions to
interleave from the low/high halves of all 128-bit lanes in one go, then use
`_mm512_permutex2var_epi32` to combine interleaved values from the different
128-bit lanes.
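
The AVX-512 scheme can be modelled the same way. The sketch below uses plain arrays for the i16 case; the index tables `ZIP_LO_IDX` and `ZIP_HI_IDX` are derived here from the documented semantics of the instructions and are assumptions, not values copied from this PR. The helper names are likewise hypothetical.

```rust
// Scalar model only; a real implementation would use the AVX-512 intrinsics
// named in the comments. All names and index tables here are hypothetical.
type V512 = [i16; 32]; // one 512-bit vector of i16
const I16_PER_LANE: usize = 8; // i16 elements in a 128-bit lane

/// Models `_mm512_unpacklo_epi16` (`high == false`) and
/// `_mm512_unpackhi_epi16` (`high == true`): within each of the four 128-bit
/// lanes, interleave the low or high four elements of `a` and `b`.
fn unpack_i16(a: &V512, b: &V512, high: bool) -> V512 {
    let half = I16_PER_LANE / 2;
    let offset = if high { half } else { 0 };
    let mut out = [0i16; 32];
    for lane in 0..4 {
        let base = lane * I16_PER_LANE;
        for i in 0..half {
            out[base + 2 * i] = a[base + offset + i];
            out[base + 2 * i + 1] = b[base + offset + i];
        }
    }
    out
}

/// Models `_mm512_permutex2var_epi32(x, idx, y)`: each of the 16 output
/// dwords comes from `x` (index bit 4 clear) or `y` (index bit 4 set), at the
/// dword position given by index bits 0..=3.
fn permutex2var_epi32(x: &V512, idx: &[u8; 16], y: &V512) -> V512 {
    let mut out = [0i16; 32];
    for (j, &sel) in idx.iter().enumerate() {
        let src = if sel & 16 != 0 { y } else { x };
        let e = (sel & 15) as usize;
        // One dword holds two i16 elements.
        out[2 * j] = src[2 * e];
        out[2 * j + 1] = src[2 * e + 1];
    }
    out
}

// Output lanes: lo.lane0, hi.lane0, lo.lane1, hi.lane1.
const ZIP_LO_IDX: [u8; 16] = [0, 1, 2, 3, 16, 17, 18, 19, 4, 5, 6, 7, 20, 21, 22, 23];
// Output lanes: lo.lane2, hi.lane2, lo.lane3, hi.lane3.
const ZIP_HI_IDX: [u8; 16] = [8, 9, 10, 11, 24, 25, 26, 27, 12, 13, 14, 15, 28, 29, 30, 31];

/// Interleave the first 16 elements of `a` and `b`.
fn zip_lo_i16(a: &V512, b: &V512) -> V512 {
    permutex2var_epi32(&unpack_i16(a, b, false), &ZIP_LO_IDX, &unpack_i16(a, b, true))
}

/// Interleave the last 16 elements of `a` and `b`.
fn zip_hi_i16(a: &V512, b: &V512) -> V512 {
    permutex2var_epi32(&unpack_i16(a, b, false), &ZIP_HI_IDX, &unpack_i16(a, b, true))
}
```

Because the permute only moves whole 128-bit lanes (four dwords at a time), the same two index tables should also serve the i8 variants in this model; only the unpack step depends on the element width.
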