Lowering of NTuple{4, VecElement{UInt8}} #18445

Open · vchuravy opened this issue Sep 11, 2016 · 10 comments
Labels: compiler:simd instruction-level vectorization

@vchuravy (Member)

I was looking into getting Julia to emit the LLVM IR for a truncation between vector types.

%1 = trunc <4 x i16> %0 to <4 x i8>

which should correspond to the Julia code:

import Core.Intrinsics: box, unbox, VecElement, trunc_int
typealias VUInt16{N} NTuple{N, VecElement{UInt16}}
typealias VUInt8{N} NTuple{N, VecElement{UInt8}}

trunc(x::VUInt16{4}) = box(VUInt8{4}, trunc_int(VUInt8{4}, unbox(VUInt16{4}, x)))
julia> trunc(VUInt16((1,2,3,4)))
------------------------------------------------------------------------------------------
ErrorException                                          Stacktrace (most recent call last)
[#1] — trunc(::NTuple{4,VecElement{UInt16}})
       ⌙ at REPL[4]:1

expected bits type as first argument

I traced it down to staticeval_bitstype:

static jl_value_t *staticeval_bitstype(jl_value_t *targ, const char *fname, jl_codectx_t *ctx)

and it seems that jl_is_bitstype is false for Tuples of VecElement (because nfields != 0).
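For illustration, the mismatch is visible from the Julia side as well (a sketch; nfields applied to a DataType is the 0.5-era spelling of what later became fieldcount):

# A tuple of VecElements reports one field per lane, so the C-level
# jl_is_bitstype check, which requires zero fields, rejects it.
nfields(NTuple{4, VecElement{UInt8}})  # 4
nfields(UInt8)                         # 0, a genuine bitstype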

So my question is: should jl_is_bitstype be true for NTuple{N, VecElement}? Since these tuples lower to LLVM vector types, that seems not unreasonable; or should we special-case them in staticeval_bitstype?

@yuyichao (Contributor) commented Sep 11, 2016

So my question is should jl_is_bitstype be true for NTuple{N, VecElement}?

bitstype is a Julia concept, unrelated to the LLVM code we emit. I think it's okay to add NTuple{N,VecElement} as a special case in the places where that is useful for hand-written SIMD code.

@eschnett (Contributor)

You can use the SIMD package to do this:

using SIMD
f(x) = x % Vec{4,Int8}
@code_llvm f(Vec{4,Int16}(1))

Unfortunately, SIMD currently scalarizes the code. Please open a bug report (or pull request...) if you are interested in improving this.

@vchuravy (Member, Author)

My goal is to give us a bit better support for SIMD in base Julia and to reduce the reliance on llvmcall for SIMD.jl.


@vtjnash (Member) commented Sep 11, 2016

How onerous would it be to make separate vector versions of all of the relevant intrinsics? I'm not completely opposed to making the existing intrinsics accept either a bitstype or NTuple{N, VecElement{bitstype}}, but it seems hard to update all of the code (especially the C implementations we have for all of the intrinsics) to accept them.

@eschnett (Contributor)

@vtjnash Isn't that exactly what the SIMD package already does? Before I implemented it, there was a discussion, and the conclusion was that new intrinsics were not necessary, and using llvmcall was easier, in particular since it didn't require changes to Base.

If there is a reason why intrinsics are to be preferred, e.g. if they are faster, then we should go that route; in that case, it would make sense to move the SIMD package into Base as well. Maybe we should move SIMD into Base first, and then work on intrinsics as a performance improvement?

@vchuravy (Member, Author)

As far as I know, the string form of llvmcall is seen as a hack and should be removed in the long term. One of its obvious limitations is that LLVM IR is not stable between LLVM versions. (Also, we can't really work with Julia structs...)

I personally think that having a great support story for SIMD types in Julia is important, and that it deserves full language support. As you can see in #18470, the changes to Base are not as bad as I thought (except for the runtime intrinsics), and Core.Intrinsics should give us enough basic support to implement most of SIMD.jl.
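To illustrate, this is the kind of definition #18470 is meant to enable (a sketch, assuming the existing intrinsics learn to accept NTuple{N, VecElement} arguments; it does not work without that change):

import Core.Intrinsics: box, unbox, add_int
typealias VUInt8{N} NTuple{N, VecElement{UInt8}}

# Element-wise addition that should lower to a single LLVM `add <4 x i8>`.
vadd(x::VUInt8{4}, y::VUInt8{4}) =
    box(VUInt8{4}, add_int(unbox(VUInt8{4}, x), unbox(VUInt8{4}, y)))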

@eschnett (Contributor)

The big unknown issue with SIMD operations is how to represent vectors of booleans. Julia represents Bool essentially as a UInt8 whose values are restricted to 0 or 1. This is not efficient for SIMD operations: LLVM will often have to resort to scalarizing the code, and thus all code using ifelse or similar constructs will run very slowly.

The solution adopted by OpenCL (and many other packages) is to have several boolean types, essentially one boolean type for each integer or floating point size: Bool8, Bool16, Bool32, Bool64, with respective conversion operations. Also, it is common to use the values 0 and -1 to represent false and true, and sometimes also to interpret all negative values as true when arbitrary integers are presented. This corresponds to hardware instructions on a wide range of architectures.
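In 0.5-era Julia, such types could look like the following sketch (none of these names exist in Base or SIMD.jl; they only illustrate the 0 / -1 convention described above):

bitstype 8  Bool8
bitstype 16 Bool16
bitstype 32 Bool32
bitstype 64 Bool64

# true is all bits set (-1), false is all bits clear (0), matching the
# masks produced by SIMD compare instructions on most architectures.
Bool8(b::Bool) = reinterpret(Bool8, b ? 0xff : 0x00)
istrue(b::Bool8) = reinterpret(Int8, b) < 0  # any negative value reads as true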

I'm bringing this up here since this is one of the big missing pieces in the SIMD package, and given that this is a large-ish change, it also deserves some discussion before going into Base Julia. However, given the state of Julia's and LLVM's optimizers, I don't see a way around introducing multiple boolean types if one wants to write efficient predicated SIMD code.

@vchuravy
Copy link
Member Author

The select instruction in LLVM takes an <N x i1> condition: http://llvm.org/docs/LangRef.html#select-instruction
How does that mesh with the hardware side? Supporting i1 for masked load and select is on my list of things to add.

@eschnett (Contributor)

<N x i1> is only the representation for the LLVM instruction. E.g. Intel AVX2 expects <N x i64> for booleans if the other arguments to the select instruction are of type i64. (The rule of thumb for most architectures is that the number of bits in the boolean and the other arguments needs to be the same.)

If you store booleans in <N x i64>, and then truncate to <N x i1> right before the LLVM select instruction, LLVM will generate efficient code that skips the truncation. If, however, you store booleans in <N x i1> or <N x i8> when they are passed into a function, then LLVM will have to go to great lengths to convert to <N x i64> as expected by the instruction.

In SIMD.jl, I never store booleans in <N x i1>; instead, I only truncate to this type just before LLVM's select intrinsic.
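Concretely, the pattern is something like the following (a simplified sketch in the spirit of SIMD.jl, not its actual source; the mask lives in <4 x i64> and is truncated only inside the llvmcall):

typealias V4I64 NTuple{4, VecElement{Int64}}

# The mask stays <4 x i64> at the function boundary; the trunc to <4 x i1>
# happens immediately before the select, so LLVM can fold it away.
vifelse(mask::V4I64, a::V4I64, b::V4I64) = Base.llvmcall("""
    %cond = trunc <4 x i64> %0 to <4 x i1>
    %res = select <4 x i1> %cond, <4 x i64> %1, <4 x i64> %2
    ret <4 x i64> %res
    """, V4I64, Tuple{V4I64, V4I64, V4I64}, mask, a, b)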

@vchuravy (Member, Author)

Thanks for the explanation; that explains why I was seeing scalarized assembly when extracting a bitvector.

@vchuravy vchuravy self-assigned this Sep 16, 2016
@mbauman mbauman added the compiler:simd instruction-level vectorization label Apr 24, 2018