Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C operator for scalable vector types #13

Open
Hsiangkai opened this issue Apr 22, 2020 · 15 comments
Open

C operator for scalable vector types #13

Hsiangkai opened this issue Apr 22, 2020 · 15 comments
Labels
Revisit after v1.0 Features or problems we will revisit after the v1.0 release

Comments

@Hsiangkai
Copy link
Collaborator

Hsiangkai commented Apr 22, 2020

What kinds of C operator should we support for scalable vector types? What is the semantic of C operator on scalable vector types? Should it operate on VLMAX or vl or something else?

What is the behavior and limitation of scalable vector types?

@rdolbeau
Copy link
Collaborator

I'm not a huge fan of mixing apples and oranges (except in sponge cake ;-) ).

For fixed-width vector, it's already not obvious - AVX for instance doesn't type the register so the 'SEW' isn't known, so no meaning. For NEON/SVE/V with the more specific (a.k.a. 'better ;-) ') types it can be reasonably meaningful, but then there's the relationship with VL that is an issue in V:

vint32m1_t a, b, c, d;
unsigned long gvl = vsetvl_i32m1(16);
a = (some vector stuff)
b = (more vector stuff);
unsigned long gvl2 = vsetvl_i32m1(8);
c = (and some other vector stuff);
(extra stuff that may or may not be vector)
d = a + b;

So, how wide is 'd' now? To a casual reader, it's less than obvious. Both input should be 16-wide (assuming VLMAX>=16 for SEW=32/LMUL=1 ...).
It can be 8 if we obey the implicit, invisible VL that we set for 'c'.
It can be 16 if we assume the user want to use all its data in 'a' and 'b'.
It can be VLMAX (which may or may not be larger than 16) if we assume C operators are full-vector as they are not intrinsics and therefore might not obey any specific VL.

... or the compiler can just spew an error message and force the user to write d = vadd_vv_i32m1(a, b); or (my favorite) the even more explicit d = vadd_vv_i32_m1_vl(a, b, gvl);

The problem with implicit 'VL' is that it semantically works on instructions, but in many developer's mind, the 'VL' is going to be associated to the data itself (i.e. as an analogy 'N' is the size of the array, and therefore is used as the bound of the loop; not the other way around...).

So for this specific issue - I would say, none.

@kito-cheng
Copy link
Collaborator

kito-cheng commented Apr 22, 2020

Here is several operator we might need to discuss, this table is organized from wiki page: Operators in C and C++

Arithmetic, comparison, relational, logical, bit-wise and compound assignment operators are controversial, because those are relative to the #8, how to pass the VL.

So I would like to discuss other part, operators not list above:

Assume a and b both are vint32m1_t, and n is int :

Operator Synatx
Assignment a = b, a = n
Subscript a[n]
Address-of &a
Ternary conditional n ? a : b, a > b ? a : b
sizeof sizeof (a)
Alignof alignof (a)
typeid typeid (a)
Conversion, static_cast type (a), (type)a, static_cast(a)
new new a
delete delete a

Pointer type, assume p are vint32m1_t *:

Operator Synatx
Subscript p[n]
Deference *p
Increment/Decrement p++, p--, ++p, --p
Pointer Arithmetic p + 1, p + n
reinterpret_cast<> reinterpret_cast<vint64m1_t>(p)

The point we need discuss is, should we support those operator with scalable vector type? if so what's the semantics?

@kito-cheng
Copy link
Collaborator

kito-cheng commented Apr 22, 2020

SiFive's implementation:

Assume a and b both are vint32m1_t, and n is int :

Operator Synatx Supported in SiFive's implementation Semantics
Assignment a = b, a = n Y Same as vcopy, VL-aware
Subscript a[n] N
Address-of &a Y
Ternary conditional n ? a : b, a > b ? a : b Y *(partial support)
sizeof sizeof (a) N
Alignof alignof (a) Y
typeid typeid (a) Y
Conversion, static_cast type (a), (type)a, static_cast(a) N
new new a N Due to sizeof not supported
delete delete a N Due to sizeof not supported

Pointer type, assume p are vint32m1_t *:

Operator Synatx Supported in SiFive's implementation Semantics
Subscript p[n] N
Deference *p Y Same as unit stride load/store, VL-aware
Increment/Decrement p++, p--, ++p, --p N
Pointer Arithmetic p + 1, p + n N
reinterpret_cast<> reinterpret_cast<vint64m1_t>(p) Y

@rofirrim
Copy link
Collaborator

rofirrim commented Apr 22, 2020

I believe that if we choose to implement the GCC extension for vector types for these types, perhaps the more reasonable thing to do here is to give them VLMAX semantics. Just to agree with existing practice of using vectors like "big scalars".

However I see how this introduces confusion in the context of implicit vector length because one could argue that the VL is also implicit in C builtin operations. Also I think lack of control over VLMAX by the user makes these extensions not very useful in general (how much are we going to load/store?).

So I would be inclined, that for now, C builtin operations not be extended to rvv vectors, to avoid introducing legacy.

Assignment is a bit of a special. Seems too fundamental to disallow (otherwise nothing will work), so I'd expect

va = vb;

to copy the whole vector (aligned with my expectation that these "builtin" operations use VLMAX).

This is important if we want to preserve the as-if behaviour that an assignment allows the user to "name" a value (and this is why we can replace usages of va with vb in the compiler). If we only copy up to VL I think we might be breaking this assumption.

Does this make sense?

@nick-knight
Copy link
Collaborator

nick-knight commented Apr 22, 2020

@rofirrim I agree we should support operator= and that it should copy the entire underlying register group (up to VLMAX). This implies uniform semantics in cases of copy-elision like return-value optimization.

vmv.vv is still available if the user wants length-VL and masked copies.

We should all be aware of the possibility of implementation-defined behavior for inactive and tail elements:
https://github.com/riscv/riscv-v-spec/blob/master/v-undisturbed-versus-zeroing.adoc
(This is still under discussion by the V-ext task group.)

@ebahapo
Copy link

ebahapo commented Apr 23, 2020

Besides the assignment operator, it would make sense to me to also implement pointer indirection, the arithmetic and comparison operators on VL length. Methinks that users would be more comfortable porting the core of their algorithms if they could retain at least some of their original algebraic syntax.

@kito-cheng
Copy link
Collaborator

@ebahapo :
It's little counter-intuitive if assignment and arithmetic has different behavior for VL, that mean a = b + c; d = b + c and a = b + c; d = a is different.

@kito-cheng
Copy link
Collaborator

I was thought it might kill the performance if we define assignment/operator= always operate with VLMAX, but it should able to optimized by compiler.

Extend the example in my last comment:

a = vadd(b, c, avl);
d = vadd(b, c, avl); // It could optimized by CSE
vstore (x, a, avl); // Store a to pointer x
vstore (x, d, avl); // Store d to pointer x
a = vadd(b, c, avl);
d = vcopy(a, avl);
vstore (x, a, avl); // Store a to pointer x
vstore (x, d, avl); // Store d to pointer x
a = vadd(b, c, avl);
d = a;
vstore (x, a, avl); // Store a to pointer x
vstore (x, d, avl); // Store d to pointer x

For those 3 case vcopy, recompute and operator= should get same performance after optimization.

d = a should able to optimized out or optimized into copy avl elements only, since the reset part of d are unused.

@jan-wassenberg
Copy link

Joining a bit late :) I'm working on portable wrappers for intrinsics at Google and heard complaints about verbose code. Where possible, operators are very helpful for readability.

Somewhat related: our goal is to reduce the large cost of implementing and porting by having the same code compile for multiple platforms, including RVV. If the code requests VL-aware or even masked load/store unnecessarily, that would be expensive on other platforms.

The same code could be efficient everywhere if we have the main loop using VLMAX, and a second cleanup 'loop' using masks or avl.

Thus it would be nice to have VLMAX operators for the first loop, especially if the app does not need a cleanup because it is able to pad inputs/outputs to VLMAX. Does that make sense?

(BTW an ARM engineer seemed receptive to such operators for SVE ACLE.)

@nick-knight
Copy link
Collaborator

nick-knight commented Dec 9, 2020

Hi @jan-wassenberg, sorry, I don't completely follow your proposal. Please let me clarify where I am confused.

This thread concerned extending C operators to the new RVV types, which exposes a kind of impedance mismatch between the underlying assembly language and the C abstraction. There was consensus that the assignment operator would operate up to VLMAX, but it was unclear whether these semantics should also apply to the other operators.

For example, currently we can express N-by-N matrix multiply (C += A*B) in RVV intrinsics something like this:

for (int i = 0; i < N; ++i)
  for (int j = 0; vl = vsetvl_f32m1(N - j); j += vl)
    for (int k = 0; k < N; ++k) {
      vfloat32m1_t B_vec = vle32_v_f32m1(&B[k][j]);
      vfloat32m1_t C_vec = vle32_v_f32m1(&C[i][j]);
      C_vec = vfmacc_vf_f32m1(C_vec, A[i][k], B_vec);
      vse32_v_f32m1(&C[i][j], C_vec);
    }

It's tempting to extend the += and * operators to replace the vfmacc_vf_f32m1 intrinsic by:

C_vec += A[i][k] * B_vec;

However, like you mention, if these operators are defined to work on VLMAX elements, then we will need to add cleanup code to handle the fringe case, which will use the vfmacc_vf_f32m1 intrinsic anyway. So this alternative appears to me to be strictly more verbose, and also likely less performant because of increased (static and dynamic) code size.

EDIT: to be clear, I'm not opposed to the VLMAX semantics --- it doesn't remove any functionality --- I'm just concerned that it doesn't yield a net decrease on the intrinsics programmer's cognitive burden.

@jan-wassenberg
Copy link

Hi @knightsifive , thanks for looking into this. For c += a*b it makes sense to use FMA despite the increased verbosity, but c += a is enough to show the operator.

Let's imagine we take your code, which looks good for RVV, and replace each intrinsic with a wrapper function, then re-implement the wrappers using AVX2. Because the loop relies on VL, each iteration would have to check whether vl==vlmax, because AVX2 has neither VL nor masks and fairly expensive masked load/store. That is wasteful because only the last iteration actually needs it.

Now if we have a VLMAX first loop followed by cleanup, I agree with you that it is more verbose and also larger code. In the AVX2 case, we can still expect a performance benefit. (ICC also generates two such loops even for AVX3.)

In my experience with the JPEG XL image codec, we are often able to arrange for N to be a multiple of VLMAX, or at least make it safe to pretend it is by padding all inputs/outputs. Then we do not need a cleanup loop, and it would be nice if the first loop is able to use the shorter and more readable operators.

Why talk about AVX2 here? I imagine not all software is going to be rewritten specifically for RVV in the VL style. Projects such as OpenCV/JPEG XL already have such wrappers and would hope to write (performance-portable) code only once, not per platform. Is that something we would want to enable for faster adoption and porting?

I am actually not sure the above use case cares whether += uses VL or VLMAX, but I do hope that operators would be included/allowed for readability.

@jan-wassenberg
Copy link

@kito-cheng now that we are moving to explicit VL, it seems a good time to resume this discussion.
In operator+= etc functions, we do not have a user-provided AVL argument, so they would now use VLMAX, right?

Can the compiler define operator+= builtin functions? Unfortunately overloading them in normal C++ code is not possible because the arguments (vuint*) are built-in types, not user-defined.

@kito-cheng
Copy link
Collaborator

Apologize for very late reply, I would prefer block most operation at this first and then relax later if needed.

The list should be allowed in first version in my mind is:

  • Assignment operator, with VLMAX semantic.
  • typeid operator.
  • Address-of operator.

I think for those operators should be supported when size is known, and maybe we should only supported on VLS type (e.g. int32x8_t), but this part we don't have well discussion yet, although upstream LLVM has some initial support there.

@jan-wassenberg
Copy link

@kito-cheng
For user ergonomics, do you agree operators are much easier to read/understand at a glance? Here are some real-world examples:

const V y1 = (x2 / (sqrt_x2_plus_1 + kOne)) + abs_x;
const V y1 = Add(Div(x2, Add(sqrt_x2_plus_1, kOne)), abs_x);

const V z = (y + kTwo) / (y + kOne) * (y * kHalf);
const V z = Mul(Div(Add(y, kTwo), Add(y, kOne)), Mul(y, kHalf));

"relax later if needed" would have unfortunate consequences: users of a generic interface (e.g. Highway) would have to use Div etc now instead of operators, and once written I doubt code would be changed back to operators (some risk of introducing mistakes).

Is it infeasible to provide a builtin that behaves as if the following were allowed?
vfloat32m8_t operator/(vfloat32m8_t a, vfloat32m8_t b) { return vfdiv_vf_f32m1(a, b, MaxVL()); }

@sh1boot
Copy link

sh1boot commented Jan 3, 2024

FWIW, Clang (but not GCC) lets you pass its own built-in vector types to RVV intrinsics: https://godbolt.org/z/vTYsPWsMT

That is:

#define VECTOR_BITS 256  // use __riscv_v_min_vlen if you don't care (or __riscv_v_fixed_vlen if it is defined)
using fixed_vuint8m1_t = uint8_t __attribute__((vector_size(VECTOR_BITS / 8)));
fixed_vuint8m1_t add(fixed_vuint8m1_t a, fixed_vuint8m1_t b) {
    return __riscv_vadd(a, b, 32);
}

works perfectly well provided the type is no larger than an m8 register of the minimum vector length of the compilation target. It also works for other architectures (but still not with GCC).

That example isn't interesting because addition is already supported by the compiler, but it does mean you can switch freely to and from intrinsics where necessary. Consequently, you can introduce VL by switching to intrinsics.

So in a sense C operator support is already halfway there. The only missing part is the ability to create an unsized vector with __attribute__((vector_size())) which would always use VLMAX.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Revisit after v1.0 Features or problems we will revisit after the v1.0 release
Projects
None yet
Development

No branches or pull requests

9 participants