More LoopVectorization tests & checks #57

Merged: mcabbott merged 40 commits into master from avxci2 on Jan 23, 2021


mcabbott
Owner

@mcabbott mcabbott commented Dec 12, 2020

This is part II of #53.

Closes #77, closes #75, closes #72.

@mcabbott
Owner Author

CI on Julia 1.4 picks CUDA v0.1.0 which fails:

WARNING: using CUDA.CUDA in module Main conflicts with an existing identifier.
ERROR: LoadError: importing CUDA into Main conflicts with an existing identifier

Perhaps because it has SpecialFunctions v1.1.0, and KernelAbstractions v0.2.4?

Testing locally on 1.4, the resolver picks CUDA v1.3.3 & it passes. It also has KernelAbstractions v0.4.5, and SpecialFunctions v0.10.3.

Maybe adding [compat] CUDA = "1, 2" as in #55 will solve this.
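
For reference, the `[compat]` bound mentioned above would look roughly like this in the package's `Project.toml`. This is a minimal sketch: only the CUDA entry comes from the comment, and any other entries the real file contains are omitted.

```toml
[compat]
CUDA = "1, 2"
```

In Pkg's compat syntax, `"1, 2"` admits any CUDA release from 1.0.0 up to, but not including, 3.0.0, so the resolver could no longer fall back to v0.1.0.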

@DilumAluthge
Contributor

I would say: merge #55 first, and then rebase this PR on master.

@mcabbott
Owner Author

I was going to suggest the reverse... but either will work ultimately.

I am keen to keep tests on 1.4; I dropped 1.3 when I got tired of fighting the resolver. But 1.4 forces LoopVectorization 0.8, which it seems a bit soon to drop completely, given that this package can't bound the version actually used outside of tests.

There are, however, still some test bugs to track down. The recent 1.4 pass here still has some tests disabled.

There is also a surprising slowdown: somehow LV takes 2410.5 seconds on Julia 1.4 (LV 0.8) vs 6149.6 seconds on 1.5 (LV 0.9), even though there aren't that many tests disabled. From JuliaSIMD/LoopVectorization.jl#171 I learn that this may be caused by coverage; I wonder if that can be run more selectively?

@chriselrod
Contributor

LoopVectorization's tests are 99.9%+ compilation.

That may be a Julia 1.4 vs 1.5 difference. I recall 1.4 being faster than 1.5 on the same LoopVectorization version.

Running tests for both LoopVectorization 0.8.26 and 0.9.6 on Julia 1.5, I get (clipping off the gemm tests):
0.8.26:

#= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/printmethods.jl:2 =# @__LINE__() = 2
  2.300284 seconds (5.02 M allocations: 256.232 MiB, 2.01% gc time)
(Float64, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/fallback.jl:4 =# @__LINE__()) = (Float64, 4)
  6.730764 seconds (12.41 M allocations: 626.128 MiB, 3.69% gc time)
  0.035563 seconds (69.39 k allocations: 3.810 MiB)
  2.574874 seconds (8.74 M allocations: 446.962 MiB, 3.17% gc time)
  0.517646 seconds (2.19 M allocations: 110.348 MiB, 3.07% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/offsetarrays.jl:204 =# @__LINE__()) = (Float32, 204)
r = -1:1
r = -2:2
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/offsetarrays.jl:204 =# @__LINE__()) = (Float64, 204)
r = -1:1
r = -2:2
186.278399 seconds (323.83 M allocations: 24.462 GiB, 5.23% gc time)
(Float64, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/tensors.jl:51 =# @__LINE__()) = (Float64, 51)
  5.293765 seconds (13.67 M allocations: 746.867 MiB, 5.85% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/map.jl:4 =# @__LINE__()) = (Float32, 4)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/map.jl:4 =# @__LINE__()) = (Float64, 4)
  1.959536 seconds (7.39 M allocations: 376.356 MiB, 4.18% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/filter.jl:4 =# @__LINE__()) = (Float32, 4)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/filter.jl:4 =# @__LINE__()) = (Float64, 4)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/filter.jl:4 =# @__LINE__()) = (Int32, 4)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/filter.jl:4 =# @__LINE__()) = (Int64, 4)
  0.306723 seconds (724.39 k allocations: 37.850 MiB, 2.75% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/mapreduce.jl:19 =# @__LINE__()) = (Int32, 19)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/mapreduce.jl:19 =# @__LINE__()) = (Int64, 19)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/mapreduce.jl:19 =# @__LINE__()) = (Float32, 19)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/mapreduce.jl:19 =# @__LINE__()) = (Float64, 19)
 49.398566 seconds (448.45 M allocations: 29.106 GiB, 10.44% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/ifelsemasks.jl:366 =# @__LINE__()) = (Float32, 366)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/ifelsemasks.jl:366 =# @__LINE__()) = (Float64, 366)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/ifelsemasks.jl:366 =# @__LINE__()) = (Int32, 366)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/ifelsemasks.jl:366 =# @__LINE__()) = (Int64, 366)
 20.996625 seconds (56.05 M allocations: 2.809 GiB, 8.57% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/dot.jl:234 =# @__LINE__()) = (Float32, 234)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/dot.jl:234 =# @__LINE__()) = (Float64, 234)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/dot.jl:234 =# @__LINE__()) = (Int32, 234)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/dot.jl:234 =# @__LINE__()) = (Int64, 234)
 12.680781 seconds (44.60 M allocations: 2.271 GiB, 4.08% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/special.jl:339 =# @__LINE__()) = (Float32, 339)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/special.jl:339 =# @__LINE__()) = (Float64, 339)
  4.091667 seconds (13.12 M allocations: 633.796 MiB, 2.57% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/gemv.jl:211 =# @__LINE__()) = (Float32, 211)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/gemv.jl:211 =# @__LINE__()) = (Float64, 211)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/gemv.jl:211 =# @__LINE__()) = (Int32, 211)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/gemv.jl:211 =# @__LINE__()) = (Int64, 211)
 15.978444 seconds (52.62 M allocations: 2.395 GiB, 2.73% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/miscellaneous.jl:789 =# @__LINE__()) = (Float32, 789)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/miscellaneous.jl:789 =# @__LINE__()) = (Float64, 789)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/miscellaneous.jl:1070 =# @__LINE__()) = (Float32, 1070)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/miscellaneous.jl:1070 =# @__LINE__()) = (Float64, 1070)
 30.148872 seconds (127.87 M allocations: 6.813 GiB, 8.33% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/copy.jl:129 =# @__LINE__()) = (Float32, 129)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/copy.jl:129 =# @__LINE__()) = (Float64, 129)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/copy.jl:129 =# @__LINE__()) = (Int32, 129)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/copy.jl:129 =# @__LINE__()) = (Int64, 129)
  3.025664 seconds (9.11 M allocations: 447.853 MiB, 4.00% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/broadcast.jl:5 =# @__LINE__()) = (Float32, 5)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/broadcast.jl:5 =# @__LINE__()) = (Float64, 5)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/broadcast.jl:5 =# @__LINE__()) = (Int32, 5)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/pHMnJ/test/broadcast.jl:5 =# @__LINE__()) = (Int64, 5)
 98.177554 seconds (140.41 M allocations: 7.892 GiB, 4.08% gc time)

0.9.6:

#= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/printmethods.jl:2 =# @__LINE__() = 2
  2.346917 seconds (5.03 M allocations: 256.636 MiB, 2.05% gc time)
  0.010064 seconds (9.74 k allocations: 628.211 KiB)
(Float64, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/fallback.jl:4 =# @__LINE__()) = (Float64, 4)
  9.985701 seconds (19.52 M allocations: 987.222 MiB, 4.43% gc time)
  0.032851 seconds (67.26 k allocations: 3.726 MiB)
  2.540316 seconds (9.26 M allocations: 477.631 MiB, 3.24% gc time)
  0.834916 seconds (3.25 M allocations: 166.502 MiB, 2.88% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/offsetarrays.jl:211 =# @__LINE__()) = (Float32, 211)
r = -1:1
r = -2:2
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/offsetarrays.jl:211 =# @__LINE__()) = (Float64, 211)
r = -1:1
r = -2:2
  5.230486 seconds (16.37 M allocations: 794.435 MiB, 2.26% gc time)
(Float64, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/tensors.jl:51 =# @__LINE__()) = (Float64, 51)
  5.337574 seconds (16.69 M allocations: 905.243 MiB, 9.21% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/map.jl:4 =# @__LINE__()) = (Float32, 4)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/map.jl:4 =# @__LINE__()) = (Float64, 4)
  2.939322 seconds (10.46 M allocations: 519.786 MiB, 4.44% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/filter.jl:4 =# @__LINE__()) = (Float32, 4)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/filter.jl:4 =# @__LINE__()) = (Float64, 4)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/filter.jl:4 =# @__LINE__()) = (Int32, 4)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/filter.jl:4 =# @__LINE__()) = (Int64, 4)
  0.476314 seconds (1.19 M allocations: 62.302 MiB, 1.93% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/mapreduce.jl:19 =# @__LINE__()) = (Int32, 19)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/mapreduce.jl:19 =# @__LINE__()) = (Int64, 19)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/mapreduce.jl:19 =# @__LINE__()) = (Float32, 19)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/mapreduce.jl:19 =# @__LINE__()) = (Float64, 19)
 49.636117 seconds (485.14 M allocations: 30.130 GiB, 10.02% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/ifelsemasks.jl:366 =# @__LINE__()) = (Float32, 366)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/ifelsemasks.jl:366 =# @__LINE__()) = (Float64, 366)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/ifelsemasks.jl:366 =# @__LINE__()) = (Int32, 366)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/ifelsemasks.jl:366 =# @__LINE__()) = (Int64, 366)
 20.505279 seconds (60.46 M allocations: 3.053 GiB, 9.45% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/dot.jl:234 =# @__LINE__()) = (Float32, 234)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/dot.jl:234 =# @__LINE__()) = (Float64, 234)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/dot.jl:234 =# @__LINE__()) = (Int32, 234)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/dot.jl:234 =# @__LINE__()) = (Int64, 234)
 13.784316 seconds (50.99 M allocations: 2.600 GiB, 5.45% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/special.jl:339 =# @__LINE__()) = (Float32, 339)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/special.jl:339 =# @__LINE__()) = (Float64, 339)
  4.632350 seconds (15.96 M allocations: 765.493 MiB, 3.58% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/gemv.jl:211 =# @__LINE__()) = (Float32, 211)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/gemv.jl:211 =# @__LINE__()) = (Float64, 211)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/gemv.jl:211 =# @__LINE__()) = (Int32, 211)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/gemv.jl:211 =# @__LINE__()) = (Int64, 211)
 16.232091 seconds (53.92 M allocations: 2.471 GiB, 3.38% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/miscellaneous.jl:792 =# @__LINE__()) = (Float32, 792)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/miscellaneous.jl:792 =# @__LINE__()) = (Float64, 792)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/miscellaneous.jl:1075 =# @__LINE__()) = (Float32, 1075)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/miscellaneous.jl:1075 =# @__LINE__()) = (Float64, 1075)
 33.904497 seconds (138.12 M allocations: 7.276 GiB, 8.18% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/copy.jl:129 =# @__LINE__()) = (Float32, 129)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/copy.jl:129 =# @__LINE__()) = (Float64, 129)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/copy.jl:129 =# @__LINE__()) = (Int32, 129)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/copy.jl:129 =# @__LINE__()) = (Int64, 129)
  2.892922 seconds (9.38 M allocations: 471.123 MiB, 4.23% gc time)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/broadcast.jl:8 =# @__LINE__()) = (Float32, 8)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/broadcast.jl:8 =# @__LINE__()) = (Float64, 8)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/broadcast.jl:8 =# @__LINE__()) = (Int32, 8)
(T, #= /home/chriselrod/.julia/packages/LoopVectorization/5Eosy/test/broadcast.jl:8 =# @__LINE__()) = (Int64, 8)
101.429469 seconds (151.64 M allocations: 8.432 GiB, 3.62% gc time)

Gemm total, and test total:
0.8.26:

179.480013 seconds (388.27 M allocations: 23.585 GiB, 5.36% gc time)
Test Summary:        | Pass  Total
LoopVectorization.jl | 1724   1724
620.346936 seconds (1.66 G allocations: 103.007 GiB, 5.61% gc time)
    Testing LoopVectorization tests passed

0.9.6:

174.757699 seconds (417.31 M allocations: 25.917 GiB, 5.65% gc time)
Test Summary:        |  Pass  Total
LoopVectorization.jl | 20429  20429
447.869647 seconds (1.47 G allocations: 85.253 GiB, 5.85% gc time)
    Testing LoopVectorization tests passed

Overall, locally, the tests on 0.9.6 are almost 3 minutes (about 28%) faster than on 0.8.26 (447.9 vs 620.3 seconds).
This is on a computer with AVX512; compile times are much better without it.
Additionally, all tests were enabled. When on GitHub CI, several of the tests are disabled starting somewhere in the 0.9 series.

So, I think regressions are Julia-related.
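
The size of the local difference can be checked directly from the two test totals printed above. A small sketch of that arithmetic; the variable names are illustrative.

```julia
# Total test times reported in the logs above, in seconds.
t_old = 620.346936   # LoopVectorization 0.8.26 on Julia 1.5
t_new = 447.869647   # LoopVectorization 0.9.6 on Julia 1.5

saved = t_old - t_new
println("saved:    ", round(saved / 60; digits = 2), " minutes")
println("relative: ", round(100 * saved / t_old; digits = 1), " %")
```

This works out to roughly 2.9 minutes, or a little under 28% of the 0.8.26 total.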

@mcabbott
Owner Author

Thanks for digging, as always.

Am I right to think that these local tests don't include coverage?

And is there, or should there be, a Julia issue about such regressions? They could be an inevitable consequence of progress elsewhere, but could also be bugs.

@chriselrod
Contributor

chriselrod commented Dec 13, 2020

Yes, it was without coverage.

There's another package making a lot of use of DifferentialEquations.jl and ForwardDiff.jl that hit a pretty severe regression in compile times in some benchmarks on Julia 1.6, that some Julia core folks are looking into.

I'm not aware of an issue, but it probably wouldn't hurt.
I suspect LLVM's to blame.
I'm building Julia with ENABLE_TIMINGS defined in src/options.h, in hopes of getting output like this.

Couldn't find any documentation about it anywhere, so I'm not sure what to do from here.

EDIT: While building Julia from source, it keeps dumping these summaries now. So, I guess it'll do that automatically.
EDIT:

640.984369 seconds (379.33 M allocations: 25.899 GiB, 1.58% gc time, 99.55% compilation time)
Test Summary:        |  Pass  Total
LoopVectorization.jl | 20429  20429
1572.714641 seconds (1.61 G allocations: 107.668 GiB, 2.26% gc time, 99.16% compilation time)
ROOT                      :  0.07 %   3378911825
GC                        :  2.26 %   106785484285
LOWERING                  :  2.82 %   133262113044
PARSING                   :  0.01 %   445595991
INFERENCE                 :  5.07 %   239589938800
CODEGEN                   : 75.02 %   3544299937574
METHOD_LOOKUP_SLOW        :  0.01 %   393193036
METHOD_LOOKUP_FAST        :  0.46 %   21748429238
LLVM_OPT                  : 10.95 %   517271539743
LLVM_MODULE_FINISH        :  0.11 %   5067346249
METHOD_MATCH              :  0.30 %   14070103600
TYPE_CACHE_LOOKUP         :  0.78 %   36937850996
TYPE_CACHE_INSERT         :  0.00 %   85812602
STAGED_FUNCTION           :  1.28 %   60412449084
MACRO_INVOCATION          :  0.00 %   167520658
AST_COMPRESS              :  0.37 %   17331975652
AST_UNCOMPRESS            :  0.30 %   14386285566
SYSIMG_LOAD               :  0.00 %   155666605
ADD_METHOD                :  0.01 %   677747888
LOAD_MODULE               :  0.00 %   149572157
INIT_MODULE               :  0.00 %   12006711
     Testing LoopVectorization tests passed

So most of the time is spent in codegen. I'm not really sure what that means.
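
To put a number on that, the rows of the summary can be parsed and the shares added up. The sketch below copies four rows verbatim from the output above; the regular expression is an assumption about the line layout, and grouping the rows whose names mention codegen or LLVM is only one way to slice the table.

```julia
# Four rows copied verbatim from the ENABLE_TIMINGS summary above.
report = """
INFERENCE                 :  5.07 %   239589938800
CODEGEN                   : 75.02 %   3544299937574
LLVM_OPT                  : 10.95 %   517271539743
LLVM_MODULE_FINISH        :  0.11 %   5067346249
"""

# Parse "NAME : PCT % COUNT" lines into named tuples.
rows = map(split(strip(report), '\n')) do line
    m = match(r"^(\w+)\s*:\s*([\d.]+) %\s+(\d+)$", line)
    (name = String(m[1]), pct = parse(Float64, m[2]), count = parse(Int64, m[3]))
end

# Largest share first.
sort!(rows; by = r -> r.pct, rev = true)

# Combined share of the codegen and LLVM rows, compared with inference.
backend = sum(r.pct for r in rows if r.name != "INFERENCE")
println(rows[1].name, " is the largest entry at ", rows[1].pct, " %")
println("CODEGEN + LLVM rows: ", round(backend; digits = 2), " %")
```

On these figures the codegen and LLVM rows together account for about 86% of the measured time, against about 5% for inference.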

Also, I discovered a regression where @avx wasn't actually applying to the offset arrays. That's why that test was so much faster. Fixing it with a new release of ArrayInterface.

src/macro.jl Outdated
@@ -494,7 +494,8 @@ padmodclamp_pair(A, inds, store) = begin
     elseif ex.args[1] == :pad && length(ex.args) >= 2
         i = ex.args[2]
         if !all(==(0), ex.args[3:end]) || length(ex.args) == 2
-            push!(nopadif, :($i ∈ $axes($A,$d)))
+            # push!(nopadif, :($i ∈ $axes($A,$d)))
+            push!(nopadif, :($i >= first(axes($A,$d))), :($i <= Base.last(axes($A,$d)))) # allows avx? Weirdly, deleting "Base." causes errors
Contributor


Weirdly, deleting "Base." causes errors

That's odd, wouldn't mind an issue.

Owner Author


Will do, I was trying to isolate it a bit, but so far the simple ones all work just fine!
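
The rewritten condition in the diff replaces a range-membership test with two endpoint comparisons. For integer indices and the unit-step axes of an ordinary array the two forms should agree, which the following sketch checks. The array `A`, the dimension `d`, and the helper names `inrange_in` and `inrange_cmp` are illustrative and not taken from the PR.

```julia
A = zeros(3, 4)
d = 2

# Original form: membership in the axis range.
inrange_in(i) = i ∈ axes(A, d)

# Replacement form: two scalar comparisons against the axis endpoints.
inrange_cmp(i) = i >= first(axes(A, d)) && i <= Base.last(axes(A, d))

# Compare both forms on indices inside and outside 1:4.
agree = all(inrange_in(i) == inrange_cmp(i) for i in -2:7)
println("forms agree on -2:7: ", agree)
```

This only shows that the two conditions are logically interchangeable in the simple case; it does not reproduce the error seen when `Base.` is removed inside the macro.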

Repository owner locked and limited conversation to collaborators Dec 17, 2020
Repository owner unlocked this conversation Dec 17, 2020
@DilumAluthge
Contributor

FYI, on Julia 1.5, JULIA_NUM_THREADS=6 will give you 2 threads (since GHA only has 2 CPUs per runner). But on Julia 1.6, JULIA_NUM_THREADS=6 will give you 6 threads.
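
A quick way to see which case applies on a given runner is to print the relevant values at the start of the test run. This is a minimal sketch using only Base functions; where to put it (for example at the top of `test/runtests.jl`) is an assumption.

```julia
# What was requested via the environment, if anything.
requested = get(ENV, "JULIA_NUM_THREADS", "unset")

println("JULIA_NUM_THREADS  = ", requested)
println("Threads.nthreads() = ", Threads.nthreads())  # threads Julia actually started
println("Sys.CPU_THREADS    = ", Sys.CPU_THREADS)     # logical CPUs Julia detected
```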

Repository owner deleted a comment from codecov-io Dec 21, 2020
@mcabbott
Owner Author

mcabbott commented Jan 23, 2021

Let's call this done.

  • Gradient checks using LoopVectorization are disabled completely on CI. They sometimes passed, sometimes failed, with strange errors.
  • Gradients for max/min using LoopVectorization are also disabled; they need some work.
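
One common way to switch a group of checks off on CI only is to branch on the `CI` environment variable, which GitHub Actions sets to `true`. The sketch below is not the code merged in this PR: the flag `on_ci`, the helper `finite_diff`, and the stand-in check are all hypothetical.

```julia
using Test

# GitHub Actions sets CI=true; the flag name `on_ci` is hypothetical.
on_ci = get(ENV, "CI", "false") == "true"

# Stand-in for a real gradient check: central difference of f at x.
finite_diff(f, x; h = 1e-6) = (f(x + h) - f(x - h)) / (2h)

@testset "gradient check (sketch)" begin
    if on_ci
        # Recorded as Broken rather than run, so the skip stays visible in the summary.
        @test_skip finite_diff(x -> x^2, 3.0) ≈ 6.0
    else
        @test finite_diff(x -> x^2, 3.0) ≈ 6.0 atol = 1e-6
    end
end
```

Using `@test_skip` instead of deleting the test keeps a reminder in the test summary that the check still needs work.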

@mcabbott mcabbott merged commit 5439bc3 into master Jan 23, 2021
@mcabbott mcabbott deleted the avxci2 branch January 23, 2021 13:07
@mcabbott mcabbott mentioned this pull request Jan 23, 2021