
Add compress store / expand load intrinsics & maybe register -> register variant too #241 #240

Open
jorgecarleitao opened this issue Feb 6, 2022 · 10 comments

@jorgecarleitao
Contributor

jorgecarleitao commented Feb 6, 2022

One of the most important operations in analytics is the SQL-equivalent of a WHERE clause.

For SIMD, this is an operation that can be summarized as (pseudo rust code)

fn r#where<T: Copy, const LANES: usize>(array: [T; LANES], mask: [bool; LANES], receiver: &mut [T; LANES], selected: &mut usize)

roughly translated to:

fn r#where<T: Copy, const LANES: usize>(array: [T; LANES], mask: [bool; LANES], receiver: &mut [T; LANES], selected: &mut usize) {
    // Write the selected elements contiguously to the front of `receiver`
    // and report how many were kept.
    *selected = 0;
    for (value, is_selected) in array.iter().zip(mask.iter()) {
        if *is_selected {
            receiver[*selected] = *value;
            *selected += 1;
        }
    }
}
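
To pin down the intended behaviour, here is a small (hypothetical) use of the scalar reference above, filtering one 4-lane chunk:

fn main() {
    let values = [3_i32, 7, 1, 9];
    let mask = [false, true, false, true];
    let mut out = [0_i32; 4];
    let mut kept = 0;
    r#where(values, mask, &mut out, &mut kept);
    // Only the lanes where the mask was set, packed to the front.
    assert_eq!(&out[..kept], &[7, 9]);
}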

As pointed out by @programmerjake on zulip, there are instructions for this.

imo this is sufficiently relevant for portable-simd to support.

@jorgecarleitao jorgecarleitao added the C-feature-request (Category: a feature request, i.e. not implemented / a PR) label on Feb 6, 2022
@jorgecarleitao jorgecarleitao changed the title from Add "llvm.masked.expandload.*" Intrinsics to Add "llvm.masked.compressstore.*" Intrinsics on Feb 6, 2022
@programmerjake
Member

oops, we created bug reports simultaneously...description from #241:
Add support for llvm.masked.compressstore.* and llvm.masked.expandload.* intrinsics.

Also, supporting register-register compress/expand (without the memory access) would be nice too, though I don't know if there are portable LLVM intrinsics for that...those operations are supported on at least AVX512, SVE, RISC-V V, and SimpleV.

Zulip Discussion
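
For readers unfamiliar with the two LLVM intrinsics, here is a purely illustrative pair of scalar stand-ins for their semantics (sketches only, not proposed portable-simd signatures):

// llvm.masked.compressstore: store only the selected lanes, contiguously.
fn compress_store<T: Copy, const N: usize>(values: [T; N], mask: [bool; N], out: &mut Vec<T>) {
    for i in 0..N {
        if mask[i] {
            out.push(values[i]);
        }
    }
}

// llvm.masked.expandload: load contiguous elements into the selected lanes,
// keeping the passthrough value `or` in the unselected ones.
fn expand_load<T: Copy, const N: usize>(src: &[T], mask: [bool; N], or: [T; N]) -> [T; N] {
    let mut result = or;
    let mut next = 0;
    for i in 0..N {
        if mask[i] {
            result[i] = src[next];
            next += 1;
        }
    }
    result
}

fn main() {
    let mut out = Vec::new();
    compress_store([10, 20, 30, 40], [true, false, true, false], &mut out);
    assert_eq!(out, [10, 30]);
    let expanded = expand_load(&out, [true, false, true, false], [0; 4]);
    assert_eq!(expanded, [10, 0, 30, 0]);
}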

@programmerjake programmerjake changed the title from Add "llvm.masked.compressstore.*" Intrinsics to Add compress store / expand load intrinsics & maybe register -> register variant too #241 on Feb 6, 2022
@calebzulawski
Member

The in-register variant actually seems more useful to me. What ends up in the remaining lanes? Zeros?

@programmerjake
Member

x86 supports leaving either zeros or whatever was there before in the destination register in the unused lanes
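
A tiny scalar sketch of those two behaviours (not an existing API; just to make the difference concrete):

// Selected lanes are packed to the front; the remaining lanes are either
// zeroed or left holding the destination's previous contents.
fn compress_in_register(src: [i32; 4], mask: [bool; 4], dst: [i32; 4], zero_unused: bool) -> [i32; 4] {
    let mut out = if zero_unused { [0; 4] } else { dst };
    let mut n = 0;
    for i in 0..4 {
        if mask[i] {
            out[n] = src[i];
            n += 1;
        }
    }
    out
}

fn main() {
    let src = [10, 20, 30, 40];
    let mask = [true, false, true, false];
    let dst = [1, 2, 3, 4];
    assert_eq!(compress_in_register(src, mask, dst, true), [10, 30, 0, 0]);  // zeroing form
    assert_eq!(compress_in_register(src, mask, dst, false), [10, 30, 3, 4]); // merging form
}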

@workingjubilee
Member

Wouldn't that get fixed up by LLVM's mem2reg pass?

@programmerjake
Member

Wouldn't that get fixed up by LLVM's mem2reg pass?

compress store / expand load would probably not be cleaned up by llvm, because iirc it doesn't have arch-independent intrinsics for reg->reg compress/expand.

@programmerjake
Member

one major use case is speeding up [T]::sort -- used in the avx512 fast sort mentioned here: https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/Fast.20sorting/near/328108693
source code (c++): https://github.com/intel/x86-simd-sort

i haven't looked, but i'd assume it likely can be converted to rust using portable-simd if we added compress-store

@workingjubilee workingjubilee added the A-LLVM (Area: LLVM) label on Mar 4, 2023
@farnoy
Contributor

farnoy commented Jan 4, 2024

I don't see a register-to-register LLVM intrinsic for compress/expand. Is the only version they have the one with a memory operand?

I thought it would be easier to implement register versions of these and then use my additions for masked load & store #374 to complete these operations. This is what I do in my program using my own fork of the library. It's also more optimal than doing compressstore as one operation on Zen 4, since LLVM doesn't know about the performance cliff of that instruction on this specific uarch.

But if we wanted to emulate compress/expand using swizzle_dyn, it would limit our options heavily. A generic LUT to compress u8x16 elements is 2^16 * 16B = 1MiB. Using a const expression to build it takes a long time to evaluate and the table pollutes the binary, and it's only one of the multiple variants we'd need. It would be more practical for larger elements, since u32x8 would only need a LUT that is 2^8 * 32B = 8KiB. One blocker for this is that swizzle_dyn is only implemented for Simd<u8, N> today; I'm not sure why.

I am using an approach like this myself, in application code, but I'm also exploiting domain-specific behavior to know which compress masks are possible in my use case and have a two-level LUT that's three orders of magnitude smaller.
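
For concreteness, a minimal sketch of the single-level LUT approach described above, scaled down to 8 u8 lanes so the table is only 2^8 * 8B = 2KiB (sizes and names are illustrative; assumes nightly portable_simd):

#![feature(portable_simd)]
use std::simd::u8x8;

// LUT[m] holds, for mask bit-pattern m, the source-lane indices of the
// selected elements, packed to the front (unused slots stay 0).
const LUT: [[u8; 8]; 256] = {
    let mut lut = [[0u8; 8]; 256];
    let mut m = 0;
    while m < 256 {
        let mut out = 0;
        let mut lane = 0;
        while lane < 8 {
            if m & (1 << lane) != 0 {
                lut[m][out] = lane as u8;
                out += 1;
            }
            lane += 1;
        }
        m += 1;
    }
    lut
};

// Packs the lanes of `v` selected by `mask_bits` to the front; returns the
// compressed vector and the number of selected lanes. Unselected tail lanes
// end up holding a copy of lane 0 (the LUT's padding index).
fn compress8(v: u8x8, mask_bits: u8) -> (u8x8, usize) {
    let idx = u8x8::from_array(LUT[mask_bits as usize]);
    (v.swizzle_dyn(idx), mask_bits.count_ones() as usize)
}

fn main() {
    let v = u8x8::from_array([10, 20, 30, 40, 50, 60, 70, 80]);
    let (packed, n) = compress8(v, 0b0101_0010); // lanes 1, 4, 6 selected
    assert_eq!(&packed.to_array()[..n], &[20, 50, 70]);
}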


@programmerjake
Member

programmerjake commented Jan 4, 2024

I don't see a register-to-register LLVM intrinsic for compress/expand. Is the only version they have the one with a memory operand?

yes, unfortunately (unless you use target-specific intrinsics, which we're trying to avoid as part of being portable).

I thought it would be easier to implement register versions of these and then use my additions for masked load & store #374 to complete these operations. This is what I do in my program using my own fork of the library. It's also more optimal than doing compressstore as one operation on Zen 4, since LLVM doesn't know about the performance cliff of that instruction on this specific uarch.

that sounds like an optimization that should be added to LLVM, ideally rustc's job should be to translate (after optimizations) to whatever LLVM IR is the target-independent canonical representation and LLVM's job is to translate that to the best target-specific code.

actually, now that i'm looking at it, 128-bit vpcompressb takes only 2 clock cycles on Zen 4, which is really fast (should be faster than a load, dynamic swizzle, and store sequence), so idk what performance cliff you're referring to... timings from Agner Fog's nice PDF

But if we wanted to emulate compress/expand using swizzle_dyn, it would limit our options heavily. A generic LUT to compress u8x16 elements is 2^16 * 16B = 1MiB. Using a const expression to build it takes a long time to evaluate and the table pollutes the binary, and it's only one of the multiple variants we'd need. It would be more practical for larger elements, since u32x8 would only need a LUT that is 2^8 * 32B = 8KiB.

if the look-up table doesn't fit in the L1 cache, compress store is almost certainly faster.

One blocker for this is that swizzle_dyn is only implemented for Simd<u8, N> today; I'm not sure why.

because we want LLVM to gain a target-independent dynamic swizzle IR operation, which it doesn't currently have, so we're currently using the x86 dynamic byte swizzle intrinsic so we can experiment with it and figure out a good API without waiting on LLVM.

@farnoy
Contributor

farnoy commented Jan 4, 2024

that sounds like an optimization that should be added to LLVM, ideally rustc's job should be to translate (after optimizations) to whatever LLVM IR is the target-independent canonical representation and LLVM's job is to translate that to the best target-specific code.

I totally agree, I just mentioned it in passing. I opened up an issue with LLVM about this but have not heard anything since.

so idk what performance cliff you're referring to

The cliff is on compressstore specifically. Register compress, masked store and even expandload are all "fast" - in line with what you'd expect. Agner doesn't list vpcompressb with an m512 operand, but it's shown here: https://uops.info/table.html

if the look-up table doesn't fit in the L1 cache, compress store is almost certainly faster.

I'm proposing to use the look-up table only when there's no native instruction available. AVX512 and SVE (for 32-bit lanes at least) seem to have instructions for this. And only use a LUT for <=16 lanes, falling back to a scalar impl for bigger vectors, just like swizzle_dyn. I think that's sensible.

because we want LLVM to gain a target-independent dynamic swizzle IR operation, which it doesn't currently have, so we're currently using the x86 dynamic byte swizzle intrinsic so we can experiment with it and figure out a good API without waiting on LLVM.

Is this new swizzle op going to be parameterized for different element types, or fixed to support only bytes? IOW, would we implement Simd::<u16, N>::swizzle_dyn with an endian-aware wrapper around Simd::<u8, N>::swizzle_dyn or directly with a different monomorphization of a platform-intrinsic leading to a slightly different LLVM IR op? If it's the first option, then we should be able to add that today, by using the existing implementation we have for bytes.
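
For the first option, here is a rough sketch of what the endian-aware wrapper could look like, written for 8 u16 lanes and little-endian only (illustrative, not a proposed implementation; assumes nightly portable_simd):

#![feature(portable_simd)]
use std::simd::{u16x8, u8x16};

// Dynamic swizzle for u16 lanes built on the existing byte-level swizzle_dyn.
// Little-endian shown; a real wrapper would also handle big-endian byte order.
fn swizzle_dyn_u16(v: u16x8, idx: u16x8) -> u16x8 {
    let lanes = v.to_array();
    let lane_idx = idx.to_array();
    let mut bytes = [0u8; 16];
    let mut byte_idx = [0u8; 16];
    for i in 0..8 {
        // View the u16 lanes as bytes...
        bytes[2 * i..2 * i + 2].copy_from_slice(&lanes[i].to_le_bytes());
        // ...and expand each u16 lane index into the two byte indices of that lane.
        byte_idx[2 * i] = (lane_idx[i] as u8) * 2;
        byte_idx[2 * i + 1] = (lane_idx[i] as u8) * 2 + 1;
    }
    let shuffled = u8x16::from_array(bytes)
        .swizzle_dyn(u8x16::from_array(byte_idx))
        .to_array();
    // Reassemble the shuffled bytes into u16 lanes.
    let mut out = [0u16; 8];
    for i in 0..8 {
        out[i] = u16::from_le_bytes([shuffled[2 * i], shuffled[2 * i + 1]]);
    }
    u16x8::from_array(out)
}

fn main() {
    let v = u16x8::from_array([100, 200, 300, 400, 500, 600, 700, 800]);
    let rev = u16x8::from_array([7, 6, 5, 4, 3, 2, 1, 0]);
    assert_eq!(
        swizzle_dyn_u16(v, rev).to_array(),
        [800, 700, 600, 500, 400, 300, 200, 100]
    );
}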
