Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add patch to set -O2 for suitesparse build #20165

Merged
merged 1 commit into from
Mar 4, 2017

Conversation

staticfloat
Copy link
Member

This fixes the compiler regression noticed in GCC 6.3.0 in #20123 by disabling -ftree-slp-vectorize as a side-effect of moving from -O3 to -O2. We prefer to compile with -O2 by default anyway, so let's do so.

@tkelman
Copy link
Contributor

tkelman commented Jan 21, 2017

We prefer to compile with -O2 by default anyway, so let's do so.

Where are you getting that from? I see a heck of a lot more instances of -O3 in our makefiles, only a handful of -O2.

@nanosoldier runbenchmarks("sparse", vs = ":master")

ifeq ($(USE_SYSTEM_BLAS), 0)
$(BUILDDIR)/SuiteSparse-$(SUITESPARSE_VER)/build-compiled: | $(build_prefix)/manifest/openblas
else ifeq ($(USE_SYSTEM_LAPACK), 0)
$(BUILDDIR)/SuiteSparse-$(SUITESPARSE_VER)/build-compiled: | $(build_prefix)/manifest/lapack
endif
$(BUILDDIR)/SuiteSparse-$(SUITESPARSE_VER)/build-compiled: $(BUILDDIR)/SuiteSparse-$(SUITESPARSE_VER)/source-extracted $(BUILDDIR)/SuiteSparse-$(SUITESPARSE_VER)/SuiteSparse-winclang.patch-applied
$(BUILDDIR)/SuiteSparse-$(SUITESPARSE_VER)/build-compiled: $(BUILDDIR)/SuiteSparse-$(SUITESPARSE_VER)/source-extracted $(BUILDDIR)/SuiteSparse-$(SUITESPARSE_VER)/SuiteSparse-O2.patch-applied
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run make check-whitespace locally if you ever do ci skip

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here. cc @jrevels

@tkelman
Copy link
Contributor

tkelman commented Jan 21, 2017

Ah some might be under linalg... @nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@ViralBShah
Copy link
Member

As @nalimilan pointed out elsewhere, we really ought to be using -O2 as the default except in cases where we can prove that -O3 does better. This is the policy in many linux distros apparently.

@ararslan ararslan added building Build system, or building Julia or its dependencies sparse Sparse arrays labels Jan 23, 2017
@staticfloat
Copy link
Member Author

@tkelman So it looks like we've got significant performance regressions with this changeset, huh.

@tkelman
Copy link
Contributor

tkelman commented Feb 11, 2017

I'm not so sure those were related, but I'd also prefer this be targeted to the specific architectures where it's known to cause a problem. And, you know, report the gcc bug and see how long it takes them to fix it (they can be very quick on some bugs).

In this case I'm not sure the benchmarks on nanosoldier include good test cases for suitesparse functionality. A cholfact or lufact of some large known sparse matrix (say from the formerly-UF collection) would be worth timing.

@pabloferz
Copy link
Contributor

Those regressions might be due to #19976

Copy link
Contributor

@tkelman tkelman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to fix the whitespace, and benchmark on something that exercises suitesparse thoroughly

@tkelman
Copy link
Contributor

tkelman commented Feb 12, 2017

-O3 definitely enables a whole range of tree vectorization passes in gcc that should be improving performance on code like suitesparse. Since we've been using O3 for so long, the burden should be on proving that changing to O2 doesn't hurt.

@staticfloat staticfloat force-pushed the sf/suitesparse_o2_master branch from 238b9cc to 8dcf9ca Compare February 24, 2017 23:44
@staticfloat
Copy link
Member Author

staticfloat commented Feb 24, 2017

On this branch (newly rebased on top of master), built on x86_64 using gcc 5.4.0:

Test (Worker)             | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
sparse/umfpack (5)        |    9.74  |  0.16  |  1.6 | 325.47     | 241.17
sparse/cholmod (5)        |   21.69  |  0.39  |  1.8 | 636.43     | 313.36
sparse/spqr (5)           |    2.47  |  0.04  |  1.5 | 86.65      | 313.36
sparse/sparse (2)         |   85.97  | 17.31  | 20.1 | 2012.92    | 387.27
sparse/higherorderfns (4) |   87.18  |  1.44  |  1.7 | 3554.29    | 413.45
sparse/sparsevector (3)   |   87.83  |  1.05  |  1.2 | 2588.59    | 381.42

On master (d321300):

Test (Worker)             | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
sparse/umfpack (5)        |    9.60  |  0.17  |  1.8 | 325.55     | 235.16
sparse/cholmod (5)        |   19.08  |  0.36  |  1.9 | 636.44     | 309.36
sparse/spqr (5)           |    2.49  |  0.03  |  1.4 | 86.65      | 309.36
sparse/sparse (2)         |   82.86  | 17.63  | 21.3 | 2012.94    | 387.06
sparse/sparsevector (3)   |   84.38  |  0.97  |  1.2 | 2581.58    | 384.82
sparse/higherorderfns (4) |   84.73  |  1.40  |  1.7 | 3554.21    | 411.90

@ViralBShah are there any other tests that it might be instructive to run to make sure that suitesparse isn't getting severely slowed down? So far it looks like cholmod is ~10% slower.

I should note that the only difference between these two builds is to switch branches and make -C deps distclean-suitesparse. The speed of the compiler itself is not changing here.

@staticfloat
Copy link
Member Author

Here are the same results on the same machine but using gcc 6.3.0, like we use on the buildbots. (All hail Docker 🐳 )

With -O3 on master:

Test (Worker)             | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
sparse/umfpack (5)        |    9.39  |  0.18  |  1.9 | 326.13     | 238.03
sparse/cholmod (5)        |   18.44  |  0.34  |  1.9 | 636.91     | 312.15
sparse/spqr (5)           |    2.42  |  0.03  |  1.4 | 86.65      | 312.15
sparse/sparse (2)         |   80.55  | 17.08  | 21.2 | 2015.10    | 390.14
sparse/sparsevector (3)   |   82.44  |  0.98  |  1.2 | 2590.26    | 385.08
sparse/higherorderfns (4) |   83.13  |  1.38  |  1.7 | 3553.82    | 410.05

With -O2 on this branch:

Test (Worker)             | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
sparse/umfpack (5)        |    8.75  |  0.20  |  2.2 | 302.62     | 228.37
sparse/cholmod (5)        |   17.34  |  0.30  |  1.7 | 603.76     | 302.43
sparse/spqr (5)           |    2.38  |  0.03  |  1.4 | 88.33      | 302.43
sparse/sparse (2)         |   77.48  | 16.12  | 20.8 | 1993.22    | 391.83
sparse/higherorderfns (4) |   79.89  |  1.28  |  1.6 | 3236.99    | 416.92
sparse/sparsevector (3)   |   81.31  |  1.13  |  1.4 | 2754.84    | 357.80

I.....huh. Looks like GCC 6.3.0's -O2 has improved.

@tkelman
Copy link
Contributor

tkelman commented Feb 25, 2017

test runtime isn't as useful as a perf benchmark run via BenchmarkTools, the tests are likely driven by julia JIT compile time rather than library function runtime

@staticfloat
Copy link
Member Author

staticfloat commented Feb 28, 2017

I ran the sparse perf benchmark on x86_64 with both -O2 and -O3 versions of suitesparse built. I tested the performance using the following script:

# sparse_perf.jl
using BaseBenchmarks
using BenchmarkTools

benchmarks = BaseBenchmarks.load!("sparse")

warmup(benchmarks)
results = run(benchmarks; verbose = true)
BenchmarkTools.save("../master.jld", "results", results)

With an identical one that output to ../pr.jld for the -O2 version of Julia. I analyzed with the following script:

# compare.jl
using BenchmarkTools, BaseBenchmarks, JLD

pr = load("pr.jld", "results")
master = load("master.jld", "results")
pairs = leaves(regressions(judge(minimum(pr), minimum(master))))

Which found no consistent regressions. The full results are available as a .jld file, if you're interested.

@tkelman
Copy link
Contributor

tkelman commented Feb 28, 2017

what does that benchmark run though? is it actually doing anything that exercises suitesparse, or only the pure-julia sparse code?

this needs to be reported as a gcc bug.

@staticfloat
Copy link
Member Author

this needs to be reported as a gcc bug.

As I'm assuming I would have to boil this down into an MWE, I do not have the time or energy to report this as a GCC bug at the moment.

is it actually doing anything that exercises suitesparse, or only the pure-julia sparse code?

I downloaded three two matrices (apparently we can't decompose integer-valued matrices) from the University of Florida dataset and ran factorize() on them. Here's the script:

# sparse_perf.jl
using BenchmarkTools
using MatrixMarket

#M1 = MatrixMarket.mmread("rosen2/rosen2.mtx")
M2 = MatrixMarket.mmread("bayer09/bayer09.mtx")
M3 = MatrixMarket.mmread("bcsstk23/bcsstk23.mtx")

#t1 = @benchmark factorize(M1)
t2 = @benchmark factorize(M2)
t3 = @benchmark factorize(M3)


println("Real unsymmetric (bayer09)")
display(t2)
println("\nReal symmetric (bcsstk23)")
display(t3)
println()

Here's the -O3 output:

Real unsymmetric (bayer09)
BenchmarkTools.Trial: 
  memory estimate:  3.17 MiB
  allocs estimate:  64
  --------------
  minimum time:     4.811 ms (0.00% GC)
  median time:      4.878 ms (0.00% GC)
  mean time:        4.946 ms (0.74% GC)
  maximum time:     16.998 ms (1.51% GC)
  --------------
  samples:          1004
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
Real symmetric (bcsstk23)
BenchmarkTools.Trial: 
  memory estimate:  9.49 MiB
  allocs estimate:  48
  --------------
  minimum time:     11.374 ms (0.00% GC)
  median time:      11.447 ms (0.00% GC)
  mean time:        11.701 ms (1.97% GC)
  maximum time:     62.680 ms (81.45% GC)
  --------------
  samples:          425
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Here's the -O2 output:

Real unsymmetric (bayer09)
BenchmarkTools.Trial: 
  memory estimate:  3.17 MiB
  allocs estimate:  64
  --------------
  minimum time:     4.752 ms (0.00% GC)
  median time:      4.833 ms (0.00% GC)
  mean time:        4.884 ms (0.78% GC)
  maximum time:     7.559 ms (3.49% GC)
  --------------
  samples:          1018
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
Real symmetric (bcsstk23)
BenchmarkTools.Trial: 
  memory estimate:  9.49 MiB
  allocs estimate:  48
  --------------
  minimum time:     11.782 ms (0.00% GC)
  median time:      11.830 ms (0.00% GC)
  mean time:        12.111 ms (1.98% GC)
  maximum time:     63.003 ms (81.00% GC)
  --------------
  samples:          411
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

It looks to me that we don't have too much difference here.

@tkelman
Copy link
Contributor

tkelman commented Feb 28, 2017

We already have the Julia-side example here, right? #20123 (comment) Translating that into ccalls shouldn't be that hard. If upstream defaults to -O3 then we should consider this temporary until the GCC bug gets fixed, or convince SuiteSparse to change their defaults.

@staticfloat
Copy link
Member Author

I'd like to get this merged for 0.6 so that building with the latest GCC doesn't cause problems.

@staticfloat staticfloat force-pushed the sf/suitesparse_o2_master branch from 8dcf9ca to 2c62f07 Compare March 4, 2017 07:50
@tkelman
Copy link
Contributor

tkelman commented Mar 4, 2017

I still don't like doing this on platforms where it doesn't cause problems, or merging without an upstream bug report to gcc.

@ViralBShah
Copy link
Member

Sorry @tkelman I am going ahead with this.

@ViralBShah ViralBShah merged commit 145eae8 into master Mar 4, 2017
@ViralBShah
Copy link
Member

Agree about the upstream bug report to gcc and we should perhaps also notify Tim Davis.

@tkelman tkelman deleted the sf/suitesparse_o2_master branch March 4, 2017 15:06
@tkelman
Copy link
Contributor

tkelman commented Mar 4, 2017

And if we don't do these things before merging, they never happen.

@tkelman
Copy link
Contributor

tkelman commented Mar 4, 2017

The passes enabled by -O3 probably can't do much unless you also let gcc use newer instructions with -march=native.

@staticfloat
Copy link
Member Author

And if we don't do these things before merging, they never happen.

Here's a ccall-based MWE with a dockerfile that sets up the environment for the segfault.

@@ -37,12 +37,17 @@ $(BUILDDIR)/SuiteSparse-$(SUITESPARSE_VER)/SuiteSparse-winclang.patch-applied: $
cd $(dir $@) && patch -p0 < $(SRCDIR)/patches/SuiteSparse-winclang.patch
echo 1 > $@

$(BUILDDIR)/SuiteSparse-$(SUITESPARSE_VER)/SuiteSparse-O2.patch-applied: $(BUILDDIR)/SuiteSparse-$(SUITESPARSE_VER)/SuiteSparse-winclang.patch-applied
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also wrong, these belong in srccache

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move all the .patch-applied files into srccache then? I was just following what was already in the file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah on second thought it depends whether the dependency is capable of being built out of tree. we could use comments in these makefiles that explain inconsistencies like this

tkelman added a commit that referenced this pull request Mar 19, 2017
This reverts commit 145eae8.

This was actually a bug in the power kernels in OpenBLAS,
not a GCC or SuiteSparse problem
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
building Build system, or building Julia or its dependencies sparse Sparse arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants