-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Public Review: Need for whole register (unpredicated) load/stores to facilitate compilers load/store elimination #378
Comments
I'd like to reiterate, there is currently no usable standard way to implement fixed size SIMD abstractions, even with current compiler extensions you can only target a single VLEN per translation unit, which makes it unusable for single header libraries and requires extensive redesigning of existing library architectures. The C code example might not be the most illustrative, so here is a C++ one: https://godbolt.org/z/TW794nxWT |
I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4 |
Some LLVM IR surgery to get good codegen, even being VLEN-agnostic: https://godbolt.org/z/jh8YK394T → https://godbolt.org/z/ePrGdzEKx I'd imagine wouldn't be hard for a compiler to recognize & convert Additionally, I think something like https://godbolt.org/z/Mo184bTxT might be nice to allow, but currently isn't. |
I think the problem here is that llvm is too clever and generates regular vector loads, because it knows the size of |
The regular vector loads were generated because the struct needs to be copied indirectly for ABI reasons since it exceed 2*xlen bytes. Changing the type to |
The core problem you're trying to resolve is using vector intrinsics to implement some fixed-length vector functionality. Honestly, this isn't the focus of the RVV intrinsic spec (at least in version 1.0). A straightforward approach is to use GNU vectors (e.g., This reflects a limitation in the current compiler implementation, as it doesn't handle memory analysis for scalable vector types very effectively. Returning to the main problem we want to solve: creating an easier programming model for SIMD-style programs while also improving code generation. One idea I have is to provide an alternative set of intrinsics to improve both user experience and code generation quality. Here’s a concrete example: int32x4_t a, b, c;
a = __riscv_vle32(int32x4_t, ptr_a, 4);
// We could also provide an overloaded version for VLMAX, e.g., __riscv_vle32(int32x4_t, ptr_a);
// Or simply use: a = *(int32x4_t *)ptr_a;
// ----
b = __riscv_vle32(int32x4_t, ptr_b, 4);
c = __riscv_vadd(int32x4_t, a, b, 4);
// Alternative syntax: c = a + b;
// or c = __riscv_vadd(int32x4_t, a, b);
// ----
__riscv_vse32(int32x4_t, ptr_c, c);
// Or: *(int32x4_t *)ptr_c = c; This approach was discussed in the early stages of the RVV intrinsics, but it wasn’t prioritized, so it didn’t come to fruition. Another idea I have is to try converting scalable vector types to fixed-length vectors, which might improve code generation quality. However, this would require significant engineering effort, so upstream toolchain compilers may not consider it unless there's strong motivation. In conclusion, I would say that introducing intrinsics for whole-register vector load/store doesn’t truly solve the issue...the real problem lies in the compiler implementation. |
FYI: This ABI proposal is trying to resolve this issue: |
What if we added builtins to convert between GNU vector_size types and RVV vector types when the GNU vector type was known to be no larger than the RVV type based on LMUL and Zvl*b? |
That's what I noted as an option here before:
I don't think there necessarily needs to be a restriction on the relative size. Without the restriction, you could have a 32B or 64B buffer and do generic code over VLEN=128/256/512 (VLEN≥1024 would still work, but only use the low 512 bits), allowing good VLEN=256/512 perf while still being compatible with VLEN=128). Also noted as an idea here. |
It is possible to abuse C++'s type system so it generates the code we want https://godbolt.org/z/rWj3oGx7e (I presume in this case your classes will be parameterised by That said, those two conversions might be a bit questionable. I'm not a language lawyer. EDIT: do not mind the |
Unfortunately I think the type punning there is not allowed (breaks strict aliasing). |
Thanks Jan! |
Unfortunately this doesn't nest properly: https://godbolt.org/z/jTP1aeT9E Edit: I do think it's a important blocker on some projects getting RVV support. But, if there isn't a relatively easy change to adress this usecase, it may be preferable to get the 1.0 release done and solve it the next version. |
Looks like we're thwarted by the default copy constructor that introduces a memcpy. Telling it to copy the vector instead seems to do something reasonable: https://godbolt.org/z/9cfh1c975 That said given @jan-wassenberg comment above, I would be a bit wary of using this approach. |
That looks like a generic missed optimization around scalable vs fixed-width types; that is, if not fixed, such behaviors would remain even if there were whole-register load/store intrinsics (and, as far as LLVM goes, the But I don't think whole-register loads/stores are the thing to expand on/encourage for use-cases like this; fixed- |
@rofirrim @jan-wassenberg Can't we just use Here is a quick proof of concept that does 8/16/32/64 bit unsigned add/sub/mul. The type is designed to work with VLEN<=512. https://godbolt.org/z/oeGE65z4n This is what I imagined for the dedicated while register load/store, it seems to work on gcc and clang and produce optimal codegen. I'd prefer the explicit whole register load/store instructions, assuming the semantic -> optimization passes, behave the same, as that would be a lot cleaner. The only thing that could be improved is having all operations, including load/store set |
I can't speak much about the Another thing I've seen is that you may be paying a bit too much due to the copy constructors as you're passing something that must copy its internal buffer (so you store to local memory just to be loaded afterwards). Those copies would be the ones simplified if a whole vector load/store intrinsic were mapped to a plain LLVM/GCC IR load/store. Given that you have only one field, your struct is in practice equivalent to the storage for a vector register. Your mileage may vary of course, but if you're able to use (possibly const) references code generation seems slightly better: https://godbolt.org/z/PafnY9E1P Hope this is useful. |
I wonder if we could somehow special case |
Given the changes may be substantial in the specification, I suggest we postpone this after 1.0. Thoughts? |
We may add few more intrinsics to improve this, but I would prefer postpone this after 1.0. |
I suggest we revisit this after 1.0. |
The absence of whole register load/store instructions was already discussed in the past with the following conclusion:
I'd like to posit what I think is a "compelling use-case".
A lot of libraries use a fixed size SIMD abstraction type that allows code sharing between existing SIMD ISAs (sse,avx,neon,...): simdjson, vectorscan, ...
This requires the ability to store the vector register state in data structures, which is currently only properly possible via the
riscv_rvv_vector_bits
attribute extension supported by both gcc and clang, since it requires a fixed size, known at compile time.This attribute isn't standardized, and can only be used for a single VLEN without potentially majorly rearchitecting code structure and build system, as it depends on the
-mrvv-vector-bits
command line argument.An alternative approach for implementing such abstract SIMD types is to have all operations load/store from a fixed width buffer in the structure, and rely on the compiler eliminating redundant load/stores. This avoids having to store the variable length vector register directly in the SIMD class, and allows multiple implementations of this SIMD type for different VLEN. The generated code using these types would have to be runtime dispatched based on the actual VLEN.
It could also be used to e.g. create a SIMD type that can be stored in data structures, and works for VLEN 128, 256 and 512, by making the buffer always 512-bit wide, and just not using the extra bits for VLEN<512. This isn't possible using the
riscv_rvv_vector_bits
attribute either, because it assumes a single fixed VLEN.This approach is however unusable, since no current compiler is capable of eliminating redundant predicated load/stores: https://godbolt.org/z/TdajMTMKT
The
actual
function in the link above represents what the codegen for such an implementation currently looks like, andexpected
simulations what the codegen should be with redundant load/store elimination.As you can see, even when always using vl=VLMAX no redundant load/stores are removed.
Since the RVV compiler backends aren't as mature, I also compared how the compilers handle predicated (masked) vs unpredicated AVX512 load/stores.
There you can observe, that predicated redundant AVX512 load/stores also can't be eliminated, but unpredicated ones can.
Hence, I suggest adding unpredicated RVV load/store intrinsics, aka the whole register load/stores, to help facilitate the compilers load/store elimination in this use-case.
The text was updated successfully, but these errors were encountered: