Enable SIMD with Base.literal_pow #332

Merged
merged 5 commits into from
Jul 30, 2018

Conversation

tkf
Contributor

@tkf tkf commented Jul 19, 2018

This branch eliminates run-time branching (introduced in #331) and enables SIMD (previously impossible in master) for power functions with low degree (<= 3). For example,

using ForwardDiff: Dual
using StaticArrays
x = Dual(1.0, 2.0)
xs = SVector(x, x, x, x)
@inline pow2(x) = x^2
f(xs) = pow2.(xs)
@code_llvm f(xs)

executed with julia -O3 prints two fmul <4 x double> instructions (presumably one for .value and one for .partials[1]):

define void @julia_f_63134(%SArray* noalias nocapture sret, %SArray* nocapture readonly dereferenceable(64)) #0 !dbg !5 {
top:
  %2 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 0, i32 0
  %3 = load double, double* %2, align 8
  %4 = fmul double %3, 2.000000e+00
  %5 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 0, i32 1, i32 0, i64 0
  %6 = load double, double* %5, align 8
  %7 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 1, i32 0
  %8 = load double, double* %7, align 8
  %9 = fmul double %8, 2.000000e+00
  %10 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 1, i32 1, i32 0, i64 0
  %11 = load double, double* %10, align 8
  %12 = insertelement <4 x double> undef, double %3, i32 0
  %13 = insertelement <4 x double> %12, double %4, i32 1
  %14 = insertelement <4 x double> %13, double %8, i32 2
  %15 = insertelement <4 x double> %14, double %9, i32 3
  %16 = insertelement <4 x double> %12, double %6, i32 1
  %17 = insertelement <4 x double> %16, double %8, i32 2
  %18 = insertelement <4 x double> %17, double %11, i32 3
  %19 = fmul <4 x double> %15, %18
  %20 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 2, i32 0
  %21 = load double, double* %20, align 8
  %22 = fmul double %21, 2.000000e+00
  %23 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 2, i32 1, i32 0, i64 0
  %24 = load double, double* %23, align 8
  %25 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 3, i32 0
  %26 = load double, double* %25, align 8
  %27 = fmul double %26, 2.000000e+00
  %28 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 3, i32 1, i32 0, i64 0
  %29 = load double, double* %28, align 8
  %30 = insertelement <4 x double> undef, double %21, i32 0
  %31 = insertelement <4 x double> %30, double %22, i32 1
  %32 = insertelement <4 x double> %31, double %26, i32 2
  %33 = insertelement <4 x double> %32, double %27, i32 3
  %34 = insertelement <4 x double> %30, double %24, i32 1
  %35 = insertelement <4 x double> %34, double %26, i32 2
  %36 = insertelement <4 x double> %35, double %29, i32 3
  %37 = fmul <4 x double> %33, %36
  %38 = bitcast %SArray* %0 to <4 x double>*
  store <4 x double> %19, <4 x double>* %38, align 8
  %39 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 2, i32 0
  %40 = bitcast double* %39 to <4 x double>*
  store <4 x double> %37, <4 x double>* %40, align 8
  ret void
}
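
Rather than reading the IR by eye, the check for vector instructions can be automated. The following is only a minimal sketch: `llvm_ir`, `has_vector_fmul`, and `square4` are hypothetical helper names, and plain tuples of `Float64` stand in for the `SVector` of `Dual`s so the snippet needs nothing beyond the standard library. Whether vector instructions actually appear depends on the Julia version, optimization level, and CPU.

```julia
using InteractiveUtils: code_llvm

# Capture the LLVM IR of f for the given argument types as a String.
function llvm_ir(f, argtypes)
    io = IOBuffer()
    code_llvm(io, f, argtypes)
    return String(take!(io))
end

# True if the IR contains a vectorized double-precision multiply,
# e.g. "fmul <4 x double>" (optionally with fast-math flags in between).
has_vector_fmul(ir) = occursin(r"fmul (\w+ )*<\d+ x double>", ir)

# Hypothetical stand-in for the broadcasted pow2 in the example above.
square4(t::NTuple{4,Float64}) = map(x -> x^2, t)

ir = llvm_ir(square4, (NTuple{4,Float64},))
println("vector fmul found: ", has_vector_fmul(ir))
```

The same helper could be pointed at `f` and `typeof(xs)` from the example above to compare branches.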

 

In #331 branch and master, it was not possible to generate IR with SIMD instructions:

define void @julia_f_63115(%SArray* noalias nocapture sret, %SArray* nocapture readonly dereferenceable(64)) #0 !dbg !5 {
top:
  %2 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 0, i32 0
  %3 = load double, double* %2, align 8
  %pow2 = fmul double %3, %3
  %4 = fadd double %3, 2.000000e+00
  %notlhs = fcmp ord double %pow2, 0.000000e+00
  %notrhs = fcmp uno double %4, 0.000000e+00
  %5 = or i1 %notrhs, %notlhs
  br i1 %5, label %L27, label %if

if:                                               ; preds = %top
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L27:                                              ; preds = %top
  %6 = fadd double %3, 1.000000e+00
  %notlhs62 = fcmp ord double %3, 0.000000e+00
  %notrhs63 = fcmp uno double %6, 0.000000e+00
  %7 = or i1 %notlhs62, %notrhs63
  br i1 %7, label %L66, label %if55

if34:                                             ; preds = %L66
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L124:                                             ; preds = %L66
  %8 = fadd double %47, 1.000000e+00
  %notlhs76 = fcmp ord double %47, 0.000000e+00
  %notrhs77 = fcmp uno double %8, 0.000000e+00
  %9 = or i1 %notlhs76, %notrhs77
  br i1 %9, label %L163, label %if52

if38:                                             ; preds = %L163
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L221:                                             ; preds = %L163
  %10 = fadd double %43, 1.000000e+00
  %notlhs90 = fcmp ord double %43, 0.000000e+00
  %notrhs91 = fcmp uno double %10, 0.000000e+00
  %11 = or i1 %notlhs90, %notrhs91
  br i1 %11, label %L260, label %if49

if42:                                             ; preds = %L260
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L318:                                             ; preds = %L260
  %12 = fadd double %39, 1.000000e+00
  %notlhs104 = fcmp ord double %39, 0.000000e+00
  %notrhs105 = fcmp uno double %12, 0.000000e+00
  %13 = or i1 %notlhs104, %notrhs105
  br i1 %13, label %L357, label %if46

if46:                                             ; preds = %L318
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L357:                                             ; preds = %L318
  %14 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 0, i32 1, i32 0, i64 0
  %15 = load double, double* %14, align 8
  %16 = fmul double %15, 2.000000e+00
  %17 = fmul double %3, %16
  %18 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 1, i32 1, i32 0, i64 0
  %19 = load double, double* %18, align 8
  %20 = fmul double %19, 2.000000e+00
  %21 = fmul double %47, %20
  %22 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 2, i32 1, i32 0, i64 0
  %23 = load double, double* %22, align 8
  %24 = fmul double %23, 2.000000e+00
  %25 = fmul double %43, %24
  %26 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 3, i32 1, i32 0, i64 0
  %27 = load double, double* %26, align 8
  %28 = fmul double %27, 2.000000e+00
  %29 = fmul double %39, %28
  %30 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 0, i32 0
  store double %pow2, double* %30, align 8
  %31 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 0, i32 1, i32 0, i64 0
  store double %17, double* %31, align 8
  %32 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 1, i32 0
  store double %pow269, double* %32, align 8
  %33 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 1, i32 1, i32 0, i64 0
  store double %21, double* %33, align 8
  %34 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 2, i32 0
  store double %pow283, double* %34, align 8
  %35 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 2, i32 1, i32 0, i64 0
  store double %25, double* %35, align 8
  %36 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 3, i32 0
  store double %pow297, double* %36, align 8
  %37 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 3, i32 1, i32 0, i64 0
  store double %29, double* %37, align 8
  ret void

if49:                                             ; preds = %L221
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L260:                                             ; preds = %L221
  %38 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 3, i32 0
  %39 = load double, double* %38, align 8
  %pow297 = fmul double %39, %39
  %40 = fadd double %39, 2.000000e+00
  %notlhs99 = fcmp ord double %pow297, 0.000000e+00
  %notrhs100 = fcmp uno double %40, 0.000000e+00
  %41 = or i1 %notrhs100, %notlhs99
  br i1 %41, label %L318, label %if42

if52:                                             ; preds = %L124
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L163:                                             ; preds = %L124
  %42 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 2, i32 0
  %43 = load double, double* %42, align 8
  %pow283 = fmul double %43, %43
  %44 = fadd double %43, 2.000000e+00
  %notlhs85 = fcmp ord double %pow283, 0.000000e+00
  %notrhs86 = fcmp uno double %44, 0.000000e+00
  %45 = or i1 %notrhs86, %notlhs85
  br i1 %45, label %L221, label %if38

if55:                                             ; preds = %L27
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L66:                                              ; preds = %L27
  %46 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 1, i32 0
  %47 = load double, double* %46, align 8
  %pow269 = fmul double %47, %47
  %48 = fadd double %47, 2.000000e+00
  %notlhs71 = fcmp ord double %pow269, 0.000000e+00
  %notrhs72 = fcmp uno double %48, 0.000000e+00
  %49 = or i1 %notrhs72, %notlhs71
  br i1 %49, label %L124, label %if34
}
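
The idea of specializing low-degree literal powers can be sketched with a toy type. This is not the code in this PR: `ToyDual` is a hypothetical one-partial dual number, and the method signatures assume the Julia 0.7+ `Base.literal_pow(::typeof(^), x, ::Val{p})` interface. The point is that the literal cases contain only multiplications, with no run-time check on the exponent.

```julia
# Hypothetical toy dual number with a single partial; not ForwardDiff's Dual.
struct ToyDual
    value::Float64
    partial::Float64
end

# Product rule: (a*b)' = a*b' + a'*b
Base.:*(a::ToyDual, b::ToyDual) =
    ToyDual(a.value * b.value, a.value * b.partial + a.partial * b.value)

# General integer power: has a run-time branch and a loop.
function Base.:^(x::ToyDual, n::Integer)
    n >= 0 || throw(DomainError(n, "negative powers are not handled in this sketch"))
    y = ToyDual(1.0, 0.0)
    for _ in 1:n
        y *= x
    end
    return y
end

# Literal exponents: `x^2` in source code lowers to a literal_pow call,
# so these branch-free methods are picked at compile time for degrees <= 3.
@inline Base.literal_pow(::typeof(^), x::ToyDual, ::Val{0}) = ToyDual(1.0, 0.0)
@inline Base.literal_pow(::typeof(^), x::ToyDual, ::Val{1}) = x
@inline Base.literal_pow(::typeof(^), x::ToyDual, ::Val{2}) = x * x
@inline Base.literal_pow(::typeof(^), x::ToyDual, ::Val{3}) = x * x * x

x = ToyDual(3.0, 1.0)
sq = x^2    # uses the Val{2} method
cube = x^3  # uses the Val{3} method
quart = x^4 # no Val{4} method here, so Base's fallback calls ^(x, 4)
```

Higher literal degrees fall through to the generic `^` method via Base's fallback, which matches the "limit the degree to three" design described in this PR.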

 

My first implementation used @generated and generated literal_pow code for any degree. I thought that was OK since Base.literal_pow has a fallback implementation for higher-degree power functions, but it turned out this implementation was slower when the benchmark involves high-degree power functions. Limiting the degree to three yields better performance for all degrees I've tried. I get similar benchmark results on two machines:

@jrevels jrevels mentioned this pull request Jul 19, 2018
@jrevels
Member

jrevels commented Jul 19, 2018

Once again, great stuff! Could use a rebase now that #331 is merged. We can add tests for this to the other SIMD tests in test/SIMDTest.jl.

@KristofferC
Collaborator

Perhaps go up to 4. Powers of 4 are not that uncommon in physical formulas.

@tkf
Contributor Author

tkf commented Jul 19, 2018

@jrevels Thanks for reviewing & merging #331!

I rebased this PR and (I think) fixed the code for the new literal_pow interface in Julia 0.7. I don't have Julia 0.7 on my laptop at the moment, so I'll let Travis verify that.

@KristofferC Yeah, it would be interesting to see if adding a higher-order implementation helps. I stopped at 3 because Julia Base itself has it only up to 3 (and my benchmark shows it is better than no bound):

https://github.com/JuliaLang/julia/blob/c6a949a9c3c2eecefd3b04d3b1a146c7c86b3868/base/intfuncs.jl#L237-L244

I wonder why they chose 3 and what benchmark they used.
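
For reference, a small sketch of how a literal exponent reaches `Base.literal_pow` in Julia 0.7 and later, and how degrees without a specialized method fall back to the ordinary `^`. It relies only on `Meta.lower` and `Base.literal_pow`; the exact printed form of the lowered code is an implementation detail, so the sketch only searches it for the function name.

```julia
# A literal exponent is rewritten during lowering into a literal_pow call...
literal = Meta.lower(Main, :(x^3))
# ...while a variable exponent stays an ordinary call to ^.
variable = Meta.lower(Main, :(x^n))

uses_literal_pow(lowered) = occursin("literal_pow", string(lowered))

println("x^3 lowers to literal_pow: ", uses_literal_pow(literal))
println("x^n lowers to literal_pow: ", uses_literal_pow(variable))

# A degree with no specialized method goes through the generic fallback,
# which simply calls ^ with the integer exponent.
fallback = Base.literal_pow(^, 2, Val(5))
println(fallback == 2^5)
```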

@tkf
Contributor Author

tkf commented Jul 19, 2018

Oh, I'll have a look at test/SIMDTest.jl as well.

@tkf tkf force-pushed the literal_pow branch 3 times, most recently from 2e087e5 to 8550e49 Compare July 20, 2018 02:49
@tkf
Contributor Author

tkf commented Jul 20, 2018

Making the SIMD test work on Travis was a bit tricky, since it requires StaticArrays and Pkg3 does not add the package's dependencies (including StaticArrays) to the global environment:

https://travis-ci.org/JuliaDiff/ForwardDiff.jl/builds/406057434

I could manually add packages, but I thought adding Project.toml and letting Pkg3 handle everything would be the easiest solution. You are going to add Project.toml at some point anyway, right?

Project.toml Outdated
@@ -0,0 +1,12 @@
name = "ForwardDiff"
uuid = "f6369f11-7733-5829-9624-2563aa707210"
Contributor Author

@fredrikekre Re: #334 (comment)

Thanks, I fixed uuid in this PR.

@KristofferC
Collaborator

For now, until there is a good automatic way of translating REQUIRE files, perhaps just Pkg.add("StaticArrays") before the include is the easiest.

This is missing test dependencies for example (and those are being tweaked at the moment).

@jrevels
Member

jrevels commented Jul 27, 2018

If it would pass tests (since we have JuliaLang/julia#26594 in v0.7), we could also just get rid of the separate julia -O3 run altogether in the Travis config and instead include SIMDTest.jl with the rest of the included files in runtests.jl.

If that doesn't work, we can then try Pkg.add("StaticArrays") as @KristofferC suggests (thanks!).

@KristofferC
Collaborator

If it would pass tests (since we have JuliaLang/julia#26594 in v0.7)

About that... JuliaLang/julia#27049, JuliaLang/julia#27659. If you want it, I think you need to lobby for it.

@jrevels
Member

jrevels commented Jul 27, 2018

About that... JuliaLang/julia#27049, JuliaLang/julia#27659. If you want it, I think you need to lobby for it.

Aww I didn't realize JuliaLang/julia#27659 got closed :(

@tkf
Contributor Author

tkf commented Jul 27, 2018

The SIMD test problem is already fixed by using Project.toml (essentially in 0eb7afb). Sorry if my comment was unclear.

@jrevels Do you have some reason to avoid starting to use Pkg3 now? If so, I'll revert the relevant commits and close #333. If using Pkg3 is fine, I think this PR is ready to go.

But note that I tried to keep the changes related to Project.toml minimal here. If you want to make it perfect in this PR, I'll merge #333 into this PR.

FYI, some details (as I understand them) on why adding Project.toml is the solution: since ForwardDiff has StaticArrays as a dependency, we can import it if we enable ForwardDiff's Project.toml. That happens automatically, since Travis sets JULIA_PROJECT="@." https://github.com/travis-ci/travis-build/blob/5aef38fe4785caf3fec7a53364e6c17501c8fbb9/lib/travis/build/script/julia.rb#L23 which activates ForwardDiff's Project.toml. Then julia --color=yes -O3 -e 'include("test/SIMDTest.jl")' just works, since StaticArrays is importable in this julia process. @KristofferC please correct me if I'm missing something :)
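
A quick way to check this reasoning in a given julia process is to print what the process actually sees. This is only a diagnostic sketch using `Base.active_project` and `Base.find_package`; the variable names are illustrative, and the printed values depend entirely on the environment it is run in.

```julia
# What JULIA_PROJECT was set to for this process (Travis sets it to "@.").
project_setting = get(ENV, "JULIA_PROJECT", "(unset)")

# Path of the Project.toml that is currently active, or nothing.
active = Base.active_project()

# Path StaticArrays would be loaded from, or nothing if it is not importable.
static_arrays_path = Base.find_package("StaticArrays")

println("JULIA_PROJECT  = ", project_setting)
println("active project = ", active)
println("StaticArrays   = ", static_arrays_path)
```

If the last line prints `nothing`, `using StaticArrays` would fail in that process.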

@jrevels
Member

jrevels commented Jul 29, 2018

The reason not to include a Project.toml and instead just do Pkg.add is two-fold: one, using Pkg.add is the more minimal change that (hopefully) fixes the problem, and two, it is not yet standard for packages to have Project.tomls.

IIUC, the plan is for the Pkg3 team to automate the process of adding a Project.toml for all registered packages, or at least announce when packages should make that kind of update themselves. We don't want to jump the gun, since that might make it harder for us to participate in the "official" migration later.

@tkf
Contributor Author

tkf commented Jul 29, 2018

I see. That sounds very reasonable. I updated the PR to use Pkg.add("StaticArrays") in .travis.yml (and Travis is happy with it).
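
For readers following along, the Pkg.add approach amounts to something like the fragment below. This is a sketch of the idea rather than the actual contents of .travis.yml: it assumes it is run from the repository root, and the `Base.find_package` guard is an addition for illustration.

```julia
# Hypothetical CI step, run from the repository root with `julia -O3`.
using Pkg

# Install StaticArrays only if it is not already loadable in this environment.
if Base.find_package("StaticArrays") === nothing
    Pkg.add("StaticArrays")
end

# Run the SIMD tests now that StaticArrays can be imported.
include("test/SIMDTest.jl")
```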

@jrevels
Member

jrevels commented Jul 30, 2018

Awesome stuff, thanks again for all this work @tkf!

@jrevels jrevels merged commit d7ebc3d into JuliaDiff:master Jul 30, 2018
@tkf tkf mentioned this pull request Aug 24, 2018

3 participants