Lowering of NTuple{4, VecElement{UInt8}} #18445

Open · vchuravy opened this issue Sep 11, 2016 · 10 comments
Labels: compiler:simd instruction-level vectorization

@vchuravy (Member)

I was looking into getting Julia to emit the LLVM IR for a truncation between vector types.

%1 = trunc <4 x i16> %0 to <4 x i8>

which should correspond to the Julia code:

import Core.Intrinsics: box, unbox, VecElement, trunc_int
typealias VUInt16{N} NTuple{N, VecElement{UInt16}}
typealias VUInt8{N} NTuple{N, VecElement{UInt8}}

trunc(x::VUInt16{4}) = box(VUInt8{4}, trunc_int(VUInt8{4}, unbox(VUInt16{4}, x)))
julia> trunc(VUInt16((1,2,3,4)))
------------------------------------------------------------------------------------------
ErrorException                                          Stacktrace (most recent call last)
[#1] — trunc(::NTuple{4,VecElement{UInt16}})
       ⌙ at REPL[4]:1

expected bits type as first argument

I traced it down to staticeval_bitstype:

static jl_value_t *staticeval_bitstype(jl_value_t *targ, const char *fname, jl_codectx_t *ctx)

and it seems that jl_is_bitstype is false for Tuples of VecElement (because nfields != 0).
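For illustration, the mismatch is visible from the Julia side as well (a sketch; nfields applied to a DataType is the 0.5-era spelling of what later became fieldcount):

# A tuple of VecElements reports one field per lane, so the C-level
# jl_is_bitstype check, which requires zero fields, rejects it.
nfields(NTuple{4, VecElement{UInt8}})  # 4
nfields(UInt8)                         # 0, a genuine bitstype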

So my question is: should jl_is_bitstype be true for NTuple{N, VecElement}? Since these tuples lower to LLVM vector types, that seems not unreasonable; or should we special-case them in staticeval_bitstype?

@yuyichao (Contributor) commented Sep 11, 2016

So my question is should jl_is_bitstype be true for NTuple{N, VecElement}?

bitstype is a Julia concept, unrelated to the LLVM code we emit. I think it's okay to add NTuple{N,VecElement} as a special case in the places where that is useful for hand-written SIMD code.

@eschnett (Contributor)

You can use the SIMD package to do this:

using SIMD
f(x) = x % Vec{4,Int8}
@code_llvm f(Vec{4,Int16}(1))

Unfortunately, SIMD currently scalarizes the code. Please open a bug report (or pull request...) if you are interested in improving this.

@vchuravy (Member, Author)

My goal is to give us a bit better support for SIMD in base Julia and to reduce the reliance on llvmcall for SIMD.jl.


@vtjnash (Member) commented Sep 11, 2016

How onerous would it be to make separate vector versions of all of the relevant intrinsics? I'm not completely opposed to making the existing intrinsics accept either a bitstype or NTuple{N, VecElement{bitstype}}, but it seems hard to update all of the code (especially the C implementations we have for all of the intrinsics) to accept them.

@eschnett (Contributor)

@vtjnash Isn't that exactly what the SIMD package already does? Before I implemented it, there was a discussion, and the conclusion was that new intrinsics were not necessary, and using llvmcall was easier, in particular since it didn't require changes to Base.

If there is a reason why intrinsics are to be preferred, e.g. if they are faster, then we should go that route; in that case, it would make sense to move the SIMD package into Base as well. Maybe we should move SIMD into Base first, and then work on intrinsics as a performance improvement?

@vchuravy (Member, Author)

As far as I know, the string form of llvmcall is seen as a hack and should be removed in the long term. One of its obvious limitations is that LLVM IR is not stable between LLVM versions. (Also, we can't really work with Julia structs...)

I personally think that having a great support story for SIMD types in Julia is important, and that it deserves full language support. As you can see in #18470, the changes to Base are not as bad as I thought (except for the runtime intrinsics), and Core.Intrinsics should give us enough basic support to implement most of SIMD.jl.
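To illustrate, this is the kind of definition #18470 is meant to enable (a sketch, assuming the existing intrinsics learn to accept NTuple{N, VecElement} arguments; it does not work without that change):

import Core.Intrinsics: box, unbox, add_int
typealias VUInt8{N} NTuple{N, VecElement{UInt8}}

# Element-wise addition that should lower to a single LLVM `add <4 x i8>`.
vadd(x::VUInt8{4}, y::VUInt8{4}) =
    box(VUInt8{4}, add_int(unbox(VUInt8{4}, x), unbox(VUInt8{4}, y)))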

@eschnett (Contributor)

The big unknown issue with SIMD operations is how to represent vectors of booleans. Julia represents Bool essentially as a UInt8 whose values are restricted to 0 or 1. This is not efficient for SIMD operations: LLVM will often have to resort to scalarizing the code, and thus all code using ifelse or similar constructs will run very slowly.

The solution adopted by OpenCL (and many other packages) is to have several boolean types, essentially one boolean type for each integer or floating point size: Bool8, Bool16, Bool32, Bool64, with respective conversion operations. Also, it is common to use the values 0 and -1 to represent false and true, and sometimes also to interpret all negative values as true when arbitrary integers are presented. This corresponds to hardware instructions on a wide range of architectures.
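In 0.5-era Julia, such types could look like the following sketch (none of these names exist in Base or SIMD.jl; they only illustrate the 0 / -1 convention described above):

bitstype 8  Bool8
bitstype 16 Bool16
bitstype 32 Bool32
bitstype 64 Bool64

# true is all bits set (-1), false is all bits clear (0), matching the
# masks produced by SIMD compare instructions on most architectures.
Bool8(b::Bool) = reinterpret(Bool8, b ? 0xff : 0x00)
istrue(b::Bool8) = reinterpret(Int8, b) < 0  # any negative value reads as true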

I'm bringing this up here since this is one of the big missing pieces in the SIMD package, and given that this is a large-ish change, it also deserves some discussion before going into Base Julia. However, given the state of Julia's and LLVM's optimizers, I don't see a way around introducing multiple boolean types if one wants to write efficient predicated SIMD code.

@vchuravy
Copy link
Member Author

The select instruction in LLVM takes an <N x i1> condition: http://llvm.org/docs/LangRef.html#select-instruction
How does that mesh with the hardware side? Supporting i1 for masked load and select is on my list of things to add.

@eschnett (Contributor)

<N x i1> is only the representation for the LLVM instruction. E.g. Intel AVX2 expects <N x i64> for booleans if the other arguments to the select instruction are of type i64. (The rule of thumb for most architectures is that the number of bits in the boolean and the other arguments needs to be the same.)

If you store booleans in <N x i64>, and then truncate to <N x i1> right before the LLVM select instruction, LLVM will generate efficient code that skips the truncation. If, however, you store booleans in <N x i1> or <N x i8> when they are passed into a function, then LLVM will have to go to great lengths to convert to <N x i64> as expected by the instruction.

In SIMD.jl, I never store booleans in <N x i1>; instead, I only truncate to this type just before LLVM's select intrinsic.
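Concretely, the pattern is something like the following (a simplified sketch in the spirit of SIMD.jl, not its actual source; the mask lives in <4 x i64> and is truncated only inside the llvmcall):

typealias V4I64 NTuple{4, VecElement{Int64}}

# The mask stays <4 x i64> at the function boundary; the trunc to <4 x i1>
# happens immediately before the select, so LLVM can fold it away.
vifelse(mask::V4I64, a::V4I64, b::V4I64) = Base.llvmcall("""
    %cond = trunc <4 x i64> %0 to <4 x i1>
    %res = select <4 x i1> %cond, <4 x i64> %1, <4 x i64> %2
    ret <4 x i64> %res
    """, V4I64, Tuple{V4I64, V4I64, V4I64}, mask, a, b)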

@vchuravy (Member, Author)

Thanks for the explanation; that explains why I was seeing scalarized assembly when extracting a bitvector.

@vchuravy vchuravy self-assigned this Sep 16, 2016
@mbauman mbauman added the compiler:simd instruction-level vectorization label Apr 24, 2018