Add compress store / expand load intrinsics & maybe register -> register variant too #241 #240
Comments
oops, we created bug reports simultaneously... description from #241: Also, I don't know if there are portable LLVM intrinsics for them, but supporting register-register compress/expand (without the memory access) would be nice too... those are supported on at least AVX512, SVE, RISC-V V, and SimpleV.
The in-register variant actually seems more useful to me. What ends up in the remaining lanes? Zeros?
x86 supports leaving either zeros or whatever was there before in the destination register in the unused lanes.
Wouldn't that get fixed up by LLVM's mem2reg pass?
compressed store / expand load would probably not be cleaned up by llvm, because iirc it doesn't have arch-independent intrinsics for compress/expand reg->reg.
one major use case is speeding up […]; i haven't looked, but i'd assume it likely can be converted to rust using portable-simd if we added compress-store.
I don't see a register-to-register LLVM intrinsic for compress/expand. Is the only version they have the one with a memory operand? I thought it would be easier to implement register versions of these and then use my additions for masked load & store #374 to complete these operations. This is what I do in my program using my own fork of the library. It's also faster than doing compressstore as a single operation on Zen 4, since LLVM doesn't know about the performance cliff of that instruction on this specific uarch.

But if we wanted to emulate compress/expand using swizzle_dyn, it would limit our options heavily. A generic LUT to compress u8x16 elements is large (2^16 masks × 16 bytes = 1 MiB). I am using an approach like this myself, in application code, but I'm also exploiting domain-specific behavior to know which compress masks are possible in my use case, so I have a two-level LUT that's three orders of magnitude smaller.
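For concreteness, here's a rough sketch of what the generic single-level LUT plus swizzle_dyn approach could look like on nightly std::simd; the function names and table layout are illustrative only, and the much smaller two-level, domain-specific LUT described above is not shown:

```rust
#![feature(portable_simd)]
use std::simd::u8x16;

// Illustrative sketch: a full single-level LUT indexed by the 16-bit mask,
// one 16-byte shuffle pattern per mask (2^16 entries * 16 B = 1 MiB).
fn build_compress_lut() -> Vec<[u8; 16]> {
    (0u32..65536)
        .map(|mask| {
            let mut idx = [0u8; 16];
            let mut out = 0;
            for lane in 0..16 {
                if mask & (1 << lane) != 0 {
                    idx[out] = lane as u8;
                    out += 1;
                }
            }
            idx
        })
        .collect()
}

// Pack the lanes of `v` selected by `mask` to the front via a dynamic swizzle;
// returns the packed vector and the number of valid lanes.
fn compress_u8x16(lut: &[[u8; 16]], v: u8x16, mask: u16) -> (u8x16, usize) {
    let shuffle = u8x16::from_array(lut[mask as usize]);
    (v.swizzle_dyn(shuffle), mask.count_ones() as usize)
}
```

Whether something like this beats a native compress store mostly comes down to whether that 1 MiB table stays cache-resident.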
yes, unfortunately (unless you use target-specific intrinsics, which we're trying to avoid as part of being portable).
that sounds like an optimization that should be added to LLVM; ideally rustc's job should be to translate (after optimizations) to whatever LLVM IR is the target-independent canonical representation, and LLVM's job is to translate that to the best target-specific code.

actually, now that i'm looking at it, 128-bit vpcompressb takes only 2 clock cycles on Zen 4, which is really fast (it should be faster than a load, dynamic swizzle, and store sequence), so idk what performance cliff you're referring to... (timings from Agner Fog's nice PDF)
if the look-up table doesn't fit in the L1 cache, compress store is almost certainly faster.
because we want LLVM to gain a target-independent dynamic swizzle IR operation, which it doesn't currently have; so for now we're using the x86 dynamic byte swizzle intrinsic so we can experiment with it and figure out a good API without waiting on LLVM.
I totally agree, I just mentioned it in passing. I opened up an issue with LLVM about this but have not heard anything since.
The cliff is on compressstore specifically. Register compress, masked store and even expandload are all "fast" - in line with what you'd expect. Agner doesn't list vpcompressb with an m512 operand, but it's shown here: https://uops.info/table.html
I'm proposing to use the look-up table only if there's no native instruction available. AVX512 and SVE (for 32-bit lanes at least) seem to have instructions for this. And I'd only use a LUT for <=16 lanes, falling back to a scalar impl for bigger vectors, just like swizzle_dyn does. I think that's sensible.
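To illustrate the fallback end of that strategy, a scalar compress over an arbitrary Simd<T, N> could look roughly like the sketch below; the bounds and names are my own, not the actual portable-simd API:

```rust
#![feature(portable_simd)]
use std::simd::{LaneCount, Mask, Simd, SimdElement, SupportedLaneCount};

// Illustrative scalar fallback: pack the selected lanes to the front, fill the
// remaining lanes with T::default(), and report how many lanes were kept.
fn compress_scalar<T, const N: usize>(v: Simd<T, N>, keep: Mask<T::Mask, N>) -> (Simd<T, N>, usize)
where
    T: SimdElement + Default,
    LaneCount<N>: SupportedLaneCount,
{
    let mut out = [T::default(); N];
    let mut kept = 0;
    for lane in 0..N {
        if keep.test(lane) {
            out[kept] = v.as_array()[lane];
            kept += 1;
        }
    }
    (Simd::from_array(out), kept)
}
```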
Is this new swizzle op going to be parameterized for different element types, or fixed to support only bytes? IOW, would we implement […]
One of the most important operations in analytics is the SQL equivalent of a WHERE clause.
For SIMD, this is an operation that can be summarized as (pseudo Rust code):
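A minimal stand-in sketch of that summary, with a placeholder predicate standing in for the WHERE condition:

```rust
// Placeholder sketch: keep only the values that satisfy the WHERE predicate,
// packed contiguously into the output.
fn filter(values: &[i32], out: &mut Vec<i32>) {
    for &v in values {
        if v > 0 {
            // the WHERE predicate
            out.push(v);
        }
    }
}
```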
roughly translated to:
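A hedged sketch of the vectorized form follows; `compress_store` here is the operation this issue is asking for, shown with a scalar stand-in body rather than an existing portable-simd method:

```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

// Hypothetical operation this issue asks for: append the lanes of `v` selected
// by `keep` to `out`, contiguously. A real implementation would ideally lower
// to a single compress-store instruction where available.
fn compress_store(v: i32x8, keep: mask32x8, out: &mut Vec<i32>) {
    for lane in 0..8 {
        if keep.test(lane) {
            out.push(v.as_array()[lane]);
        }
    }
}

fn filter_simd(values: &[i32], out: &mut Vec<i32>) {
    for chunk in values.chunks_exact(8) {
        let v = i32x8::from_slice(chunk);
        let keep = v.simd_gt(i32x8::splat(0)); // the WHERE predicate
        compress_store(v, keep, out);
    }
    // (tail elements shorter than one vector are omitted for brevity)
}
```

On AVX-512 the compress itself maps to an instruction like vpcompressd with a memory destination, and the output cursor advances by the mask's popcount.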
As pointed out by @programmerjake on zulip, there are instructions for this.
imo this is sufficiently relevant for portable-simd to support.