Code generation problems #12

Closed
GunnarFarneback opened this issue Jul 14, 2016 · 9 comments

Comments

@GunnarFarneback
Contributor

This code

using SIMD

function foo(x::Vector{Float32})
    N = length(x)
    y = Array(Vec{4, Float32}, N)          # uninitialized Vector{Vec{4,Float32}} (Julia 0.5 constructor)
    for k = 1:N
        @inbounds y[k] = Vec{4, Float32}(x[k])   # broadcast each scalar into a 4-wide vector
    end
    return y
end

x = rand(Float32, 100000);

foo(x);
@time foo(x);

run with a current Julia 0.5 master allocates way too much memory:

julia> @time foo(x);
  0.614021 seconds (200.13 k allocations: 7.637 MB, 12.62% gc time)

code_llvm produces a mess, including a scary call to jl_apply_generic.

With a 48-day-old Julia build I had available, this is much better: 6 allocations (1.526 MB), unless @inbounds is removed, in which case it's back to 200k allocations. There, code_llvm also produces a mess, but at least without any jl_apply_generic.

I'm not sure whether I'm doing something I shouldn't, but it looks like something has regressed in recent Julia.
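(For anyone reproducing this, the IR referred to above can be obtained with either of the following; both forms work on 0.5:)

julia> code_llvm(foo, (Vector{Float32},))

julia> @code_llvm foo(x)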

julia> versioninfo()
Julia Version 0.5.0-dev+5429
Commit 828f7ae* (2016-07-14 09:21 UTC)
Platform Info:
  System: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
@eschnett
Owner

Methinks this might be a type instability. The array and tuple handling code in Julia is currently changing. I will have a look.

@eschnett
Owner

eschnett commented Jul 14, 2016

Here is a shorter piece of code that reproduces the problem:

using SIMD
f(a) = @inbounds a[1] = Vec{4,Float32}(1)
@code_llvm f(Array{Vec{4,Float32}}(1))

I see that the code that creates the SIMD vector is translated fine; it is inlined, and as simple as it should be. I thus think that the array indexing is what causes the problem.

The problem disappears when I use NTuple instead of Vec.
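For comparison, the NTuple variant I mean is essentially the following (a sketch; the exact literal does not matter):

g(a) = @inbounds a[1] = (1f0, 1f0, 1f0, 1f0)
@code_llvm g(Array{NTuple{4,Float32}}(1))

Here the generated code contains no call to jl_apply_generic.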

@timholy Any ideas?

Here is the generated LLVM code:

define void @julia_f_67463(%jl_value_t*) #0 !dbg !6 {
top:
  %1 = call %jl_value_t*** @jl_get_ptls_states()
  %2 = alloca [6 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [6 x %jl_value_t*], [6 x %jl_value_t*]* %2, i64 0, i64 0
  %3 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %2, i64 0, i64 3
  %4 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %2, i64 0, i64 2
  %5 = bitcast %jl_value_t** %3 to i8*
  call void @llvm.memset.p0i8.i32(i8* %5, i8 0, i32 24, i32 8, i1 false)
  %6 = bitcast [6 x %jl_value_t*]* %2 to i64*
  store i64 8, i64* %6, align 8
  %7 = bitcast %jl_value_t*** %1 to i64*
  %8 = load i64, i64* %7, align 8
  %9 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %2, i64 0, i64 1
  %10 = bitcast %jl_value_t** %9 to i64*
  store i64 %8, i64* %10, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %1, align 8
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %11 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %2, i64 0, i64 5
  %12 = getelementptr [6 x %jl_value_t*], [6 x %jl_value_t*]* %2, i64 0, i64 4
  store %jl_value_t* inttoptr (i64 4571598584 to %jl_value_t*), %jl_value_t** %3, align 8
  store %jl_value_t* inttoptr (i64 4647558576 to %jl_value_t*), %jl_value_t** %12, align 8
  %13 = bitcast %jl_value_t*** %1 to i8*
  %14 = getelementptr %jl_value_t**, %jl_value_t*** %1, i64 176
  %15 = bitcast %jl_value_t*** %14 to i8*
  %16 = call %jl_value_t* @jl_gc_pool_alloc(i8* %13, i8* %15, i32 32, i32 16328)
  %17 = getelementptr inbounds %jl_value_t, %jl_value_t* %16, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 4647152816 to %jl_value_t*), %jl_value_t** %17, align 8
  %18 = bitcast %jl_value_t* %16 to <4 x float>*
  store <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, <4 x float>* %18, align 8
  store %jl_value_t* %16, %jl_value_t** %11, align 8
  %19 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %3, i32 3)
  store %jl_value_t* %19, %jl_value_t** %4, align 8
  %20 = bitcast %jl_value_t* %19 to float*
  %21 = load float, float* %20, align 16
  %22 = insertelement <4 x float> undef, float %21, i32 0
  %23 = bitcast %jl_value_t* %19 to i8*
  %24 = getelementptr i8, i8* %23, i64 4
  %25 = bitcast i8* %24 to float*
  %26 = load float, float* %25, align 4
  %27 = insertelement <4 x float> %22, float %26, i32 1
  %28 = getelementptr %jl_value_t, %jl_value_t* %19, i64 1
  %29 = bitcast %jl_value_t* %28 to float*
  %30 = load float, float* %29, align 8
  %31 = insertelement <4 x float> %27, float %30, i32 2
  %32 = getelementptr i8, i8* %23, i64 12
  %33 = bitcast i8* %32 to float*
  %34 = load float, float* %33, align 4
  %35 = insertelement <4 x float> %31, float %34, i32 3
  %36 = bitcast %jl_value_t* %0 to <4 x float>**
  %37 = load <4 x float>*, <4 x float>** %36, align 8
  store <4 x float> %35, <4 x float>* %37, align 8
  %38 = load i64, i64* %10, align 8
  store i64 %38, i64* %7, align 8
  ret void
}

Given this, I assume that the call to jl_apply_generic is the dynamic dispatch; the four insertelement instructions then unpack its boxed result into a <4 x float>, which is finally stored into the array element.

@timholy
Contributor

timholy commented Jul 14, 2016

This doesn't seem like it could possibly be affected by any of my recent array changes:

julia> using SIMD

julia> v = Vec{4,Float32}(1)
4-element SIMD.Vec{4,Float32}:
Float32⟨1.0,1.0,1.0,1.0⟩

julia> a = Array{Vec{4,Float32}}(1)
1-element Array{SIMD.Vec{4,Float32},1}:
 Float32⟨-0.00026230933,4.5644e-41,-0.02648136,4.5644e-41⟩

julia> @which a[1] = v
setindex!{T}(A::Array{T,N<:Any}, x, i1::Real) at array.jl:372

That line is here. This problem seems to be fixed (or at least improved) if you comment out this line. So I'd look into your convert method.
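One quick way to poke at that from the REPL (just an inspection sketch, not the fix):

julia> v = Vec{4,Float32}(1)

julia> @which convert(Vec{4,Float32}, v)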

@KristofferC
Collaborator

Ref #6

@GunnarFarneback
Contributor Author

This looks relevant:

julia> using SIMD

julia> code_llvm(Tuple, (Vec{4, Float32},))

define void @julia_Type_67726([4 x float]* noalias sret, %jl_value_t*, %Vec*) #0 {
  [...]
  %19 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %4, i32 3)
  [...]
}

In my older Julia (commit bc56e32*, a 49-day-old master) this doesn't seem completely sane, but it might explain the observed regression:

julia> using SIMD

julia> code_llvm(Tuple, (Vec{4, Float32},))

define void @julia_Type_50122([4 x float]* sret, %jl_value_t*, %Vec*) #0 {
  [...]
  %18 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %5, i32 3)
  [...]
}

julia> Tuple(Vec{4, Float32}(0))
(0.0f0,0.0f0,0.0f0,0.0f0)

julia> code_llvm(Tuple, (Vec{4, Float32},))

define void @julia_Type_50122([4 x float]* sret, %jl_value_t*, %Vec*) #0 {
top:
  %3 = alloca [4 x float], align 4
  call void @julia_convert_50126([4 x float]* nonnull sret %3, %jl_value_t* inttoptr (i64 140529556360832 to %jl_value_t*), %Vec* %2) #0
  %4 = bitcast [4 x float]* %0 to i8*
  %5 = bitcast [4 x float]* %3 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %4, i8* %5, i64 16, i32 4, i1 false)
  ret void
}

In current Julia, the simpler code can't be provoked by running the constructor once.

@eschnett
Owner

@GunnarFarneback Thanks for pointing to #6; yes, this was the problem. Apologies for not understanding the main point of your pull request when you requested it three months ago.

@timholy
Contributor

timholy commented Jul 15, 2016

(I think you meant @KristofferC.)

@eschnett
Owner

@timholy @KristofferC Yes, sorry again. Not my day today.

@KristofferC
Collaborator

:)
