Enable SIMD with Base.literal_pow #332

tkf · 2018-07-19T01:50:05Z

This branch eliminates run-time branching (introduced in #331) and enables SIMD (previously impossible in master) for power functions with low degree (<= 3). For example,

using ForwardDiff: Dual
using StaticArrays
x = Dual(1.0, 2.0)
xs = SVector(x, x, x, x)
@inline pow2(x) = x^2
f(xs) = pow2.(xs)
@code_llvm f(xs)

executed with julia -O3 prints two fmul <4 x double> instructions (judging from the IR below, each one computes the .value and .partials[1] of two elements):

define void @julia_f_63134(%SArray* noalias nocapture sret, %SArray* nocapture readonly dereferenceable(64)) #0 !dbg !5 {
top:
  %2 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 0, i32 0
  %3 = load double, double* %2, align 8
  %4 = fmul double %3, 2.000000e+00
  %5 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 0, i32 1, i32 0, i64 0
  %6 = load double, double* %5, align 8
  %7 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 1, i32 0
  %8 = load double, double* %7, align 8
  %9 = fmul double %8, 2.000000e+00
  %10 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 1, i32 1, i32 0, i64 0
  %11 = load double, double* %10, align 8
  %12 = insertelement <4 x double> undef, double %3, i32 0
  %13 = insertelement <4 x double> %12, double %4, i32 1
  %14 = insertelement <4 x double> %13, double %8, i32 2
  %15 = insertelement <4 x double> %14, double %9, i32 3
  %16 = insertelement <4 x double> %12, double %6, i32 1
  %17 = insertelement <4 x double> %16, double %8, i32 2
  %18 = insertelement <4 x double> %17, double %11, i32 3
  %19 = fmul <4 x double> %15, %18
  %20 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 2, i32 0
  %21 = load double, double* %20, align 8
  %22 = fmul double %21, 2.000000e+00
  %23 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 2, i32 1, i32 0, i64 0
  %24 = load double, double* %23, align 8
  %25 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 3, i32 0
  %26 = load double, double* %25, align 8
  %27 = fmul double %26, 2.000000e+00
  %28 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 3, i32 1, i32 0, i64 0
  %29 = load double, double* %28, align 8
  %30 = insertelement <4 x double> undef, double %21, i32 0
  %31 = insertelement <4 x double> %30, double %22, i32 1
  %32 = insertelement <4 x double> %31, double %26, i32 2
  %33 = insertelement <4 x double> %32, double %27, i32 3
  %34 = insertelement <4 x double> %30, double %24, i32 1
  %35 = insertelement <4 x double> %34, double %26, i32 2
  %36 = insertelement <4 x double> %35, double %29, i32 3
  %37 = fmul <4 x double> %33, %36
  %38 = bitcast %SArray* %0 to <4 x double>*
  store <4 x double> %19, <4 x double>* %38, align 8
  %39 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 2, i32 0
  %40 = bitcast double* %39 to <4 x double>*
  store <4 x double> %37, <4 x double>* %40, align 8
  ret void
}
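A check like the one above can also be done programmatically instead of reading the IR by eye. The following is only a minimal sketch: it uses a plain NTuple{4,Float64} rather than an SVector of Duals so that it needs nothing beyond the standard library, the names g, ir and has_simd are made up for illustration, and whether vector instructions show up depends on the optimization level and the CPU.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

pow2(x) = x^2
g(t) = map(pow2, t)  # hypothetical stand-in for f(xs) = pow2.(xs)

# Capture the LLVM IR as a string instead of printing it.
ir = sprint(io -> code_llvm(io, g, Tuple{NTuple{4,Float64}}))

# Look for a vector multiply such as `fmul <4 x double>`.
has_simd = occursin(r"fmul <\d+ x double>", ir)
println(has_simd ? "vector fmul found" : "no vector fmul found")
```

The result is deliberately only printed, since it can differ between machines and between -O2 and -O3.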

On the #331 branch and on master, the generated IR contains no SIMD instructions:

define void @julia_f_63115(%SArray* noalias nocapture sret, %SArray* nocapture readonly dereferenceable(64)) #0 !dbg !5 {
top:
  %2 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 0, i32 0
  %3 = load double, double* %2, align 8
  %pow2 = fmul double %3, %3
  %4 = fadd double %3, 2.000000e+00
  %notlhs = fcmp ord double %pow2, 0.000000e+00
  %notrhs = fcmp uno double %4, 0.000000e+00
  %5 = or i1 %notrhs, %notlhs
  br i1 %5, label %L27, label %if

if:                                               ; preds = %top
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L27:                                              ; preds = %top
  %6 = fadd double %3, 1.000000e+00
  %notlhs62 = fcmp ord double %3, 0.000000e+00
  %notrhs63 = fcmp uno double %6, 0.000000e+00
  %7 = or i1 %notlhs62, %notrhs63
  br i1 %7, label %L66, label %if55

if34:                                             ; preds = %L66
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L124:                                             ; preds = %L66
  %8 = fadd double %47, 1.000000e+00
  %notlhs76 = fcmp ord double %47, 0.000000e+00
  %notrhs77 = fcmp uno double %8, 0.000000e+00
  %9 = or i1 %notlhs76, %notrhs77
  br i1 %9, label %L163, label %if52

if38:                                             ; preds = %L163
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L221:                                             ; preds = %L163
  %10 = fadd double %43, 1.000000e+00
  %notlhs90 = fcmp ord double %43, 0.000000e+00
  %notrhs91 = fcmp uno double %10, 0.000000e+00
  %11 = or i1 %notlhs90, %notrhs91
  br i1 %11, label %L260, label %if49

if42:                                             ; preds = %L260
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L318:                                             ; preds = %L260
  %12 = fadd double %39, 1.000000e+00
  %notlhs104 = fcmp ord double %39, 0.000000e+00
  %notrhs105 = fcmp uno double %12, 0.000000e+00
  %13 = or i1 %notlhs104, %notrhs105
  br i1 %13, label %L357, label %if46

if46:                                             ; preds = %L318
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L357:                                             ; preds = %L318
  %14 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 0, i32 1, i32 0, i64 0
  %15 = load double, double* %14, align 8
  %16 = fmul double %15, 2.000000e+00
  %17 = fmul double %3, %16
  %18 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 1, i32 1, i32 0, i64 0
  %19 = load double, double* %18, align 8
  %20 = fmul double %19, 2.000000e+00
  %21 = fmul double %47, %20
  %22 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 2, i32 1, i32 0, i64 0
  %23 = load double, double* %22, align 8
  %24 = fmul double %23, 2.000000e+00
  %25 = fmul double %43, %24
  %26 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 3, i32 1, i32 0, i64 0
  %27 = load double, double* %26, align 8
  %28 = fmul double %27, 2.000000e+00
  %29 = fmul double %39, %28
  %30 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 0, i32 0
  store double %pow2, double* %30, align 8
  %31 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 0, i32 1, i32 0, i64 0
  store double %17, double* %31, align 8
  %32 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 1, i32 0
  store double %pow269, double* %32, align 8
  %33 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 1, i32 1, i32 0, i64 0
  store double %21, double* %33, align 8
  %34 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 2, i32 0
  store double %pow283, double* %34, align 8
  %35 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 2, i32 1, i32 0, i64 0
  store double %25, double* %35, align 8
  %36 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 3, i32 0
  store double %pow297, double* %36, align 8
  %37 = getelementptr inbounds %SArray, %SArray* %0, i64 0, i32 0, i64 3, i32 1, i32 0, i64 0
  store double %29, double* %37, align 8
  ret void

if49:                                             ; preds = %L221
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L260:                                             ; preds = %L221
  %38 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 3, i32 0
  %39 = load double, double* %38, align 8
  %pow297 = fmul double %39, %39
  %40 = fadd double %39, 2.000000e+00
  %notlhs99 = fcmp ord double %pow297, 0.000000e+00
  %notrhs100 = fcmp uno double %40, 0.000000e+00
  %41 = or i1 %notrhs100, %notlhs99
  br i1 %41, label %L318, label %if42

if52:                                             ; preds = %L124
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L163:                                             ; preds = %L124
  %42 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 2, i32 0
  %43 = load double, double* %42, align 8
  %pow283 = fmul double %43, %43
  %44 = fadd double %43, 2.000000e+00
  %notlhs85 = fcmp ord double %pow283, 0.000000e+00
  %notrhs86 = fcmp uno double %44, 0.000000e+00
  %45 = or i1 %notrhs86, %notlhs85
  br i1 %45, label %L221, label %if38

if55:                                             ; preds = %L27
  call void @jl_throw(i8** inttoptr (i64 140415041378608 to i8**))
  unreachable

L66:                                              ; preds = %L27
  %46 = getelementptr inbounds %SArray, %SArray* %1, i64 0, i32 0, i64 1, i32 0
  %47 = load double, double* %46, align 8
  %pow269 = fmul double %47, %47
  %48 = fadd double %47, 2.000000e+00
  %notlhs71 = fcmp ord double %pow269, 0.000000e+00
  %notrhs72 = fcmp uno double %48, 0.000000e+00
  %49 = or i1 %notrhs72, %notlhs71
  br i1 %49, label %L124, label %if34
}

My first implementation used @generated and produced literal_pow code for any degree. I thought that was OK since Base.literal_pow has a fall-back implementation for higher-degree power functions, but it turned out that this implementation was slower when the benchmark involves high-degree power functions. Limiting the degree to three yields better performance for all degrees I've tried. I get similar benchmark results on two machines:

Machine 1 (a laptop)
- @generated-based vs Special case x^0 #331: https://gist.github.com/b44e7cfc85e682fa293182eae086ac5f
- for-@eval-based vs Special case x^0 #331: https://gist.github.com/2283e2397df7dd604147cf9077190084
Machine 2 (a desktop)
- @generated-based vs Special case x^0 #331: https://gist.github.com/tkf/0350bc6f30405b489c41b697d3692b88
- for-@eval-based vs Special case x^0 #331: https://gist.github.com/3834bdee95980c7a9268cc7e57b9f495
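To illustrate the for-@eval approach with a bounded degree, here is a minimal sketch. It is not the code in this PR: ToyDual is a hypothetical stand-in for ForwardDiff.Dual with a single partial, the loop covers degrees 1 to 3 only, and it assumes the Julia 0.7+ interface in which `x^2` with a literal exponent is lowered to `Base.literal_pow(^, x, Val(2))`.

```julia
# Hypothetical toy dual number (one value, one partial derivative).
struct ToyDual
    value::Float64
    partial::Float64
end

# Define one branch-free method per literal degree. The exponent is a
# compile-time constant, so no run-time check on it is needed.
for y in 1:3
    @eval function Base.literal_pow(::typeof(^), x::ToyDual, ::Val{$y})
        v = x.value
        # Power rule: d/dx x^y = y * x^(y-1)
        ToyDual(Base.literal_pow(^, v, Val($y)),
                $y * Base.literal_pow(^, v, Val($(y - 1))) * x.partial)
    end
end

d = ToyDual(3.0, 1.0)^2
println((d.value, d.partial))  # prints (9.0, 6.0)
```

For Float64, Base itself defines literal_pow for small exponents as plain multiplications, which is what lets the compiler vectorize the result.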

jrevels · 2018-07-19T12:15:11Z

Once again, great stuff! Could use a rebase now that #331 is merged. We can add tests for this to the other SIMD tests in test/SIMDTest.jl.

KristofferC · 2018-07-19T12:31:29Z

Perhaps go up to 4. Powers of 4 are not that uncommon in physical formulas.

tkf · 2018-07-19T20:17:59Z

@jrevels Thanks for reviewing & merging #331!

I rebased this PR and (I think) fixed the code for the new literal_pow interface in Julia 0.7. I don't have Julia 0.7 on my laptop at the moment, so I'll let Travis verify that.

@KristofferC Yeah, it would be interesting to see if adding a higher-order implementation helps. I stopped at 3 because Julia Base itself only has it up to 3 (and my benchmark shows it is better than no bound):

https://github.com/JuliaLang/julia/blob/c6a949a9c3c2eecefd3b04d3b1a146c7c86b3868/base/intfuncs.jl#L237-L244

I wonder why they chose 3 and what benchmark they used.

tkf · 2018-07-19T20:19:20Z

Oh, I'll have a look at test/SIMDTest.jl as well.

tkf · 2018-07-20T03:01:39Z

So making the SIMD test work on Travis was a bit tricky since it requires StaticArrays, and Pkg3 does not add the dependencies, including StaticArrays, to the global environment:

https://travis-ci.org/JuliaDiff/ForwardDiff.jl/builds/406057434

I could manually add packages, but I thought adding Project.toml and letting Pkg3 handle everything would be the easiest solution. You are going to add Project.toml at some point anyway, right?

tkf · 2018-07-20T21:59:49Z

Project.toml

@@ -0,0 +1,12 @@
+name = "ForwardDiff"
+uuid = "f6369f11-7733-5829-9624-2563aa707210"


@fredrikekre Re: #334 (comment)

Thanks, I fixed the uuid in this PR.

KristofferC · 2018-07-27T14:28:14Z

For now, until there is a good automatic way of translating REQUIRE files, perhaps just Pkg.add("StaticArrays") before the include is the easiest.

This is missing test dependencies for example (and those are being tweaked at the moment).

jrevels · 2018-07-27T14:37:17Z

If it would pass tests (since we have JuliaLang/julia#26594 in v0.7), we could also just get rid of the separate julia -O3 run altogether in the Travis config and instead include SIMDTest.jl with the rest of the included files in runtests.jl.

If that doesn't work, we can then try Pkg.add("StaticArrays") as @KristofferC suggests (thanks!).

KristofferC · 2018-07-27T14:40:47Z

> If it would pass tests (since we have JuliaLang/julia#26594 in v0.7)

About that... JuliaLang/julia#27049, JuliaLang/julia#27659. If you want it, I think you need to lobby for it.

jrevels · 2018-07-27T14:43:32Z

> About that... JuliaLang/julia#27049, JuliaLang/julia#27659. If you want it, I think you need to lobby for it.

Aww I didn't realize JuliaLang/julia#27659 got closed :(

tkf · 2018-07-27T20:26:15Z

The SIMD test problem is already fixed by using Project.toml (essentially in 0eb7afb). Sorry if my comment was unclear.

@jrevels Do you have a reason to avoid starting to use Pkg3 now? If so, I'll revert the relevant commits and close #333. If using Pkg3 is fine, I think this PR is ready to go.

But note that I tried to keep the changes related to Project.toml minimal here. If you want to make it perfect in this PR, I'll merge #333 into this PR.

FYI, some details (as I understand them) on why adding Project.toml is the solution: Since ForwardDiff has StaticArrays as a dependency, we can import it if we enable ForwardDiff's Project.toml. That happens automatically since Travis sets JULIA_PROJECT="@." (https://github.com/travis-ci/travis-build/blob/5aef38fe4785caf3fec7a53364e6c17501c8fbb9/lib/travis/build/script/julia.rb#L23), which activates ForwardDiff's Project.toml. Then julia --color=yes -O3 -e 'include("test/SIMDTest.jl")' just works since StaticArrays is importable in this julia process. @KristofferC please correct me if I'm missing something :)
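One way to see which project a julia process has picked up is sketched below. This is only an illustration under the assumption of Julia 0.7 or later; Base.active_project is an unexported function, and the variable names are made up.

```julia
# JULIA_PROJECT="@." asks Julia to use the Project.toml found in the
# current directory or the nearest parent directory.
env_setting = get(ENV, "JULIA_PROJECT", "(unset)")
println("JULIA_PROJECT = ", env_setting)

# Path of the active project file, or `nothing` if none is active.
project_file = Base.active_project()
println("active project: ", project_file)
```

Run from the ForwardDiff checkout with JULIA_PROJECT="@." set, the second line would be expected to point at the repository's Project.toml.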

jrevels · 2018-07-29T15:48:13Z

The reason not to include a Project.toml and instead just do Pkg.add is two-fold: one, using Pkg.add is the more minimal change that (hopefully) fixes the problem, and two, it is not yet standard for packages to have Project.tomls.

IIUC, the plan is for the Pkg3 team to automate the process of adding a Project.toml for all registered packages, or at least announce when packages should make that kind of update themselves. We don't want to jump the gun, since that might make it harder for us to participate in the "official" migration later.

tkf · 2018-07-29T19:30:20Z

I see. That sounds very reasonable. I updated the PR to use Pkg.add("StaticArrays") in .travis.yml (and Travis is happy with it).

jrevels · 2018-07-30T13:48:43Z

Awesome stuff, thanks again for all this work @tkf!

jrevels mentioned this pull request Jul 19, 2018

Special case x^0 #331

Merged

tkf force-pushed the literal_pow branch from e7e0547 to 855b8fa Compare July 19, 2018 20:17

tkf added 3 commits July 19, 2018 17:33

Add Base.literal_pow

fb5bc6e

Add SIMD tests for vectorized power

bfc71cd

Use testset in SIMDTest.jl

4f04bed

tkf force-pushed the literal_pow branch 3 times, most recently from 2e087e5 to 8550e49 Compare July 20, 2018 02:49

tkf mentioned this pull request Jul 20, 2018

Finishing up Project.toml configuration #333

Closed

4 tasks

tkf force-pushed the literal_pow branch from 8550e49 to 95a16fd Compare July 20, 2018 21:58

tkf added 2 commits July 29, 2018 12:05

Install StaticArrays manually in Travis

1525c4c

Use --color=yes in Travis CI

d311518

tkf force-pushed the literal_pow branch from 95a16fd to d311518 Compare July 29, 2018 19:09

jrevels merged commit d7ebc3d into JuliaDiff:master Jul 30, 2018

tkf mentioned this pull request Aug 24, 2018

Fix CI JuliaPlots/Plots.jl#1689

Merged
