Code generation problems #12

Closed
GunnarFarneback opened this issue Jul 14, 2016 · 9 comments

Comments

@GunnarFarneback
Contributor

This code

using SIMD

function foo(x::Vector{Float32})
    N = length(x)
    y = Array(Vec{4, Float32}, N)          # uninitialized Vector{Vec{4,Float32}} (Julia 0.5 constructor)
    for k = 1:N
        @inbounds y[k] = Vec{4, Float32}(x[k])   # broadcast each scalar into a 4-wide vector
    end
    return y
end

x = rand(Float32, 100000);

foo(x);
@time foo(x);

run with a current Julia 0.5 master allocates way too much memory:

julia> @time foo(x);
  0.614021 seconds (200.13 k allocations: 7.637 MB, 12.62% gc time)

code_llvm produces a mess, including a scary call to jl_apply_generic.

With a 48-day-old Julia build I had available, this is much better: 6 allocations (1.526 MB), unless @inbounds is removed, in which case it's back to 200k allocations. There, code_llvm also produces a mess, but at least without any jl_apply_generic.

I'm not sure whether I'm doing something I shouldn't, but it looks like something has regressed in recent Julia.
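(For anyone reproducing this, the IR referred to above can be obtained with either of the following; both forms work on 0.5:)

julia> code_llvm(foo, (Vector{Float32},))

julia> @code_llvm foo(x)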

julia> versioninfo()
Julia Version 0.5.0-dev+5429
Commit 828f7ae* (2016-07-14 09:21 UTC)
Platform Info:
  System: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
@eschnett
Owner

Methinks this might be a type instability. The array and tuple handling code in Julia is currently changing. I will have a look.

@eschnett
Owner

eschnett commented Jul 14, 2016

Here is a shorter piece of code that reproduces the problem:

using SIMD
f(a) = @inbounds a[1] = Vec{4,Float32}(1)
@code_llvm f(Array{Vec{4,Float32}}(1))

I see that the code that creates the SIMD vector is translated fine; it is inlined, and as simple as it should be. I thus think that the array indexing is what causes the problem.

The problem disappears when I use NTuple instead of Vec.
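For comparison, the NTuple variant I mean is essentially the following (a sketch; the exact literal does not matter):

g(a) = @inbounds a[1] = (1f0, 1f0, 1f0, 1f0)
@code_llvm g(Array{NTuple{4,Float32}}(1))

Here the generated code contains no call to jl_apply_generic.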

@timholy Any ideas?

Here is the generated LLVM code:

define void @julia_f_67463(%jl_value_t*) #0 !dbg !6 {
top:
  %1 = call %jl_value_t*** @jl_get_ptls_states()
  %2 = alloca [6 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [6 x %jl_value_t*], [6 x %jl_value_t*]* %2, i64 0, i64 0
  %3 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %2, i64 0, i64 3
  %4 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %2, i64 0, i64 2
  %5 = bitcast %jl_value_t** %3 to i8*
  call void @llvm.memset.p0i8.i32(i8* %5, i8 0, i32 24, i32 8, i1 false)
  %6 = bitcast [6 x %jl_value_t*]* %2 to i64*
  store i64 8, i64* %6, align 8
  %7 = bitcast %jl_value_t*** %1 to i64*
  %8 = load i64, i64* %7, align 8
  %9 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %2, i64 0, i64 1
  %10 = bitcast %jl_value_t** %9 to i64*
  store i64 %8, i64* %10, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %1, align 8
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %11 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %2, i64 0, i64 5
  %12 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %2, i64 0, i64 4
  store %jl_value_t* inttoptr (i64 4571598584 to %jl_value_t*), %jl_value_t** %3, align 8
  store %jl_value_t* inttoptr (i64 4647558576 to %jl_value_t*), %jl_value_t** %12, align 8
  %13 = bitcast %jl_value_t*** %1 to i8*
  %14 = getelementptr %jl_value_t**, %jl_value_t*** %1, i64 176
  %15 = bitcast %jl_value_t*** %14 to i8*
  %16 = call %jl_value_t* @jl_gc_pool_alloc(i8* %13, i8* %15, i32 32, i32 16328)
  %17 = getelementptr inbounds %jl_value_t, %jl_value_t* %16, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 4647152816 to %jl_value_t*), %jl_value_t** %17, align 8
  %18 = bitcast %jl_value_t* %16 to <4 x float>*
  store <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, <4 x float>* %18, align 8
  store %jl_value_t* %16, %jl_value_t** %11, align 8
  %19 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %3, i32 3)
  store %jl_value_t* %19, %jl_value_t** %4, align 8
  %20 = bitcast %jl_value_t* %19 to float*
  %21 = load float, float* %20, align 16
  %22 = insertelement <4 x float> undef, float %21, i32 0
  %23 = bitcast %jl_value_t* %19 to i8*
  %24 = getelementptr i8, i8* %23, i64 4
  %25 = bitcast i8* %24 to float*
  %26 = load float, float* %25, align 4
  %27 = insertelement <4 x float> %22, float %26, i32 1
  %28 = getelementptr %jl_value_t, %jl_value_t* %19, i64 1
  %29 = bitcast %jl_value_t* %28 to float*
  %30 = load float, float* %29, align 8
  %31 = insertelement <4 x float> %27, float %30, i32 2
  %32 = getelementptr i8, i8* %23, i64 12
  %33 = bitcast i8* %32 to float*
  %34 = load float, float* %33, align 4
  %35 = insertelement <4 x float> %31, float %34, i32 3
  %36 = bitcast %jl_value_t* %0 to <4 x float>**
  %37 = load <4 x float>*, <4 x float>** %36, align 8
  store <4 x float> %35, <4 x float>* %37, align 8
  %38 = load i64, i64* %10, align 8
  store i64 %38, i64* %7, align 8
  ret void
}

Given this, I assume that the call to jl_apply_generic is the dynamic dispatch; the four insertelement instructions then unpack its boxed result into a <4 x float>, which is finally stored into the array element.

@timholy
Contributor

timholy commented Jul 14, 2016

This doesn't seem like it could possibly be affected by any of my recent array changes:

julia> using SIMD

julia> v = Vec{4,Float32}(1)
4-element SIMD.Vec{4,Float32}:
Float32⟨1.0,1.0,1.0,1.0⟩

julia> a = Array{Vec{4,Float32}}(1)
1-element Array{SIMD.Vec{4,Float32},1}:
 Float32⟨-0.00026230933,4.5644e-41,-0.02648136,4.5644e-41⟩

julia> @which a[1] = v
setindex!{T}(A::Array{T,N<:Any}, x, i1::Real) at array.jl:372

That line is here. This problem seems to be fixed (or at least improved) if you comment out this line. So I'd look into your convert method.
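One quick way to poke at that from the REPL (just an inspection sketch, not the fix):

julia> v = Vec{4,Float32}(1)

julia> @which convert(Vec{4,Float32}, v)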

@KristofferC
Collaborator

Ref #6

@GunnarFarneback
Contributor Author

This looks relevant:

julia> using SIMD

julia> code_llvm(Tuple, (Vec{4, Float32},))

define void @julia_Type_67726([4 x float]* noalias sret, %jl_value_t*, %Vec*) #0 {
  [...]
  %19 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %4, i32 3)
  [...]
}

In my older Julia (commit bc56e32*, a 49-day-old master) this doesn't seem completely sane, but it might explain the observed regression:

julia> using SIMD

julia> code_llvm(Tuple, (Vec{4, Float32},))

define void @julia_Type_50122([4 x float]* sret, %jl_value_t*, %Vec*) #0 {
  [...]
  %18 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %5, i32 3)
  [...]
}

julia> Tuple(Vec{4, Float32}(0))
(0.0f0,0.0f0,0.0f0,0.0f0)

julia> code_llvm(Tuple, (Vec{4, Float32},))

define void @julia_Type_50122([4 x float]* sret, %jl_value_t*, %Vec*) #0 {
top:
  %3 = alloca [4 x float], align 4
  call void @julia_convert_50126([4 x float]* nonnull sret %3, %jl_value_t* inttoptr (i64 140529556360832 to %jl_value_t*), %Vec* %2) #0
  %4 = bitcast [4 x float]* %0 to i8*
  %5 = bitcast [4 x float]* %3 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %4, i8* %5, i64 16, i32 4, i1 false)
  ret void
}

In current Julia, the simpler code can't be provoked by running the constructor once.

@eschnett
Owner

@GunnarFarneback Thanks for pointing to #6; yes, this was the problem. Apologies for not understanding the main point of your pull request when you requested it three months ago.

@timholy
Contributor

timholy commented Jul 15, 2016

(I think you meant @KristofferC.)

@eschnett
Owner

@timholy @KristofferC Yes, sorry again. Not my day today.

@KristofferC
Collaborator

:)
