1400% performance regression for custom project on julia 0.5 #16047
Comments
Just a guess: if you are using closures or comprehensions a lot, you might be hitting #15276.
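For context, #15276 is about the performance of variables captured by closures. A hedged illustration of the kind of pattern that triggers it (the function and variable names here are hypothetical, not taken from the project in question):

```julia
# A variable that is reassigned and also captured by an inner function gets
# boxed, so downstream code loses type stability.
function running_total(data)
    total = 0.0
    for x in data
        total += x        # `total` is reassigned...
    end
    f = () -> total       # ...and captured by a closure, so it is boxed
    return f()
end

# `@code_warntype running_total(rand(100))` shows `total` as a `Core.Box`.
```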
That might be it; there are certainly some closures that are affected by this. I'll rerun my test here once #15276 is fixed.
This is 0.5.x, not 0.5.0 – it's not a release blocker, although it is important.
Yes, that's right. However, it would be nice not to release 0.5 with known performance regressions, especially if many users will have a slower experience.
I just reran with the latest master
I had hoped that #16057 and #16060 might have improved things, but no luck so far; if anything, the julia 0.5 version got a bit slower than the previous build I had tried:
I think @JeffBezanson wanted to try one more thing on the closure side to address #16050, although I see that issue on julia 0.4 as well, so I'm not sure it is going to help narrow the difference between 0.4 and 0.5. Of course, this could also be caused by something entirely unrelated to closures... Maybe the best thing is to wait until @JeffBezanson has done everything he wants to do about closures, and then I'll rerun this and see how things look?
If you have the time, does a profile reveal anything? Comparing profiles in the two versions might help narrow it down.
I've profiled the code under julia 0.4 and 0.5; the profiles are here. I'm not really sure what I'm looking for, though, or how to go about this... One thing I did notice is that the 0.4 profile shows simdloop.jl a couple of times, whereas the 0.5 profile never does. Not sure whether that indicates anything, though...
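For anyone reproducing this comparison, a minimal profiling recipe of the kind being discussed (`run_model` is a placeholder for the actual entry point, not a function from the project):

```julia
using Profile            # a standard library on recent Julia; it lived in Base on 0.4/0.5

run_model()              # warm-up run so compilation does not dominate the profile
Profile.clear()
@profile run_model()     # collect samples for the measured run
Profile.print()          # compare the hottest call paths between 0.4 and 0.5
```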
Can you try 0.5 with -O3?
I just tried with the latest nightly, and with that build performance is completely gone; the whole algorithm now takes something like 2 hours. This run was without the -O3 flag. The version I tried today was:
So some commit between 0.5.0-dev+3782 and 0.5.0-dev+3839 seems to have caused a really massive perf regression.
Could be b0c54f0, just a guess.
Just for completeness, the exact timing I get is:
So a little over two hours with the latest master. GC time also seems up relative to the 0.5 build from last week. Let me know if there is anything else I can provide in terms of diagnostics at this point to help.
Can you give me access to the repo? Running ...
@vtjnash I gave you access to the two repos that make up this problem. The readme here has the instructions on how to get this running. Having said that, right now the whole thing doesn't even run on 0.5; some recent change, either in 0.5 or in the Requires package, prevents the code from running on 0.5, and I'm unable to figure out what is wrong. The whole thing does work on 0.4... As an aside, the Requires package is pretty remarkable: it seems to have a lot of users (I'm not even using it directly, but some of the packages I depend on apparently do), and at the same time it has not a single test...
What that package does is also not really compatible with precompilation. Most users of Requires should consider refactoring to avoid it. Optional dependencies at the moment are better handled by moving functionality into separate packages that depend on both.
Just a good guess :) I think the problem child is ForwardDiff.jl. It seems to have a number of questionable functions that rely heavily on code generation (such as generated functions over ...), for example:

```julia
@inline function promote_eltype{F<:ForwardDiffNumber}(::Type{F}, types::DataType...)
    return switch_eltype(F, promote_type(eltype(F), types...))
end
```
I think ForwardDiff.jl actually has performance benchmarks, but I wasn't able to run them; I seem to be missing a package. @jrevels Any chance you could run the ForwardDiff.jl benchmarks on julia 0.4 and the latest nightly and see whether there is a performance regression?
Just a heads up that I don't think ForwardDiff.jl is the culprit. I modified my code to use an old version of ForwardDiff (0.0.3) that doesn't have any of the fancy optimizations of the current version. That old ForwardDiff.jl just uses DualNumbers.jl internally, and all of that seems entirely straightforward; I don't see any of the things that @vtjnash mentioned as problems there. The whole thing is still running, but so far it looks likely that everything is about a factor of 2 slower on julia 0.5. I'll report back once the runs are done. So, @jrevels, probably no need to run those benchmarks.
Ok, here are the final numbers:

julia 0.4: 9478.776417 seconds (4.41 G allocations: 14.830 TB, 3.82% gc time)

So, overall julia 0.5 takes about 75% longer to finish this. The julia 0.5 build is 0.5.0-dev+3993 (bf73102). These numbers use ForwardDiff.jl v0.0.3. Overall things are much slower with that old version of ForwardDiff.jl, but the relative difference between julia 0.4 and 0.5 is large, and I believe ForwardDiff.jl v0.0.3 is not using any of the problematic things that @vtjnash mentioned. Just running this is tricky now: I had to patch ForwardDiff.jl v0.0.3 slightly to run on julia 0.5, and I also had to patch Requires.jl to run on 0.5... So to run my example one should use this branch of ForwardDiff, and on julia 0.5 this branch of Requires.jl. But, long story short: ForwardDiff.jl is not the culprit, and there is an even more severe performance regression in julia 0.5 than the title of this issue indicates. No clue what the root cause might be.
Unless the performance benchmarks were written very carefully, I'm not going to put any stock in them. It's way too easy to craft benchmarks that don't reflect application performance.
@davidanthoff How easy is it to run your code if someone were to dig in?
@vtjnash This is not just a benchmark, as I understand it: it is user code that was fast in 0.4 and has slowed down.
Are you basically saying that it doesn't matter if code written in the wild gets slower with increasing version numbers, as long as microbenchmarks are good? Or are you talking about the ...
@pkofod I think he meant precisely the contrary: that benchmarks can often be misleading, and that actual useful code is the best way to measure performance.
Ah, okay. I misunderstood then. My bad.
Thanks @davidanthoff for the reduced case. I think I found at least another chunk of the problem.
This fixes a bug plus optimizes more variables. Should help #16047.
@JeffBezanson I tried the latest master with #16386. It improves my repro case a lot! I'm now starting to add back more of the complexity of my "real" problem to the repro case. I updated the gist with the repro here with a new version that is now again much slower on 0.5 than 0.4. What I added is one ... I'm not sure this is the most efficient way to go about this. I could also give you access to the full problem, but it is involved and messy, and I probably have an easier time isolating these relatively straightforward repro cases than someone who isn't familiar with the full problem... So unless you have a better idea, I'll just continue with this setup.
@davidanthoff – thanks for having the patience to go through this process. It's a huge benefit to everyone that we have these use cases to figure out where the performance regressions are.
Any news on this? Just a warning: the current example that shows a slowdown is still not up to the complexity level of the whole problem, so we might need a couple more iterations until this is fully resolved.
Bump: what is the plan for this? This is a really bad performance regression, and I am worried that it will take a whole bunch of further back and forth until this is completely resolved. This clearly doesn't have to be resolved before a feature freeze, but I do think it should be resolved before a 0.5 release.
Yes, I'm still ruminating on how to address this for 0.5. We do at least have a workaround now, which is to declare types for variables used by inner functions (e.g. ...).
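The example in that comment is cut off above; the following is a hedged sketch of the kind of declaration meant (the function and variable names are hypothetical), applying the workaround to the captured-variable pattern illustrated earlier in the thread:

```julia
# Declaring a concrete type for a variable that an inner function captures
# lets the compiler keep accesses to the captured binding type-stable even
# though the variable itself is still boxed.
function running_total(data)
    total::Float64 = 0.0      # <- the workaround: annotate the captured variable
    f = x -> (total += x)     # inner function capturing and reassigning `total`
    for v in data
        f(v)
    end
    return total
end
```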
I guess I could try to fix all the type instabilities by declaring types for the variables that are used in inner functions, as you suggest, and then rerun my performance comparison. That would at least give us a sense of whether there are some other perf problems lurking somewhere, or whether the type instabilities we have identified so far are the whole story. Not sure when I'll have time for this, but it is on my list.
I reran this yesterday. Here are my new numbers:
For a while we thought #15276 (or #16048) might be the culprit, but I don't think that is the case. I changed the repro case so that it avoids the #15276 issue, and then I get these numbers:
So while #15276 doesn't help performance, it doesn't explain the huge regression from 0.4 to 0.5. I have a much simpler repro story now, but after a ...
If you run with --track-allocation, do you see where all those extra allocations come from?
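For reference, a hedged sketch of the usual allocation-tracking workflow (the `run_model` name is a placeholder for the actual entry point; the flag and `Profile.clear_malloc_data` are real Julia features):

```julia
using Profile                  # provides clear_malloc_data on recent Julia versions

# Launch as:  julia --track-allocation=user run_model.jl
run_model()                    # first run: allocations include compilation
Profile.clear_malloc_data()    # reset the per-line allocation counters
run_model()                    # second run: this is what gets reported

# After Julia exits, each source file gets a companion *.jl.mem file with
# per-line allocation counts; the largest numbers point at the hot spots.
```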
I'm seeing something similar. I'm running a simple computational physics problem (the Ising model) as a benchmark, found here: https://github.com/aristotle2497/SimpleJuliaTests/blob/master/Ising.jl The code is simple function declarations and in-place array manipulations; no packages are imported. I'm using the Ubuntu PPAs to get releases and nightly builds. The code run times:

A comparison of the mem files produced by --track-allocation=user shows that the extra memory is being generated at the end statement of the outer for loop in the step_time function? Thanks
Note that the PPA is more than a thousand commits behind master. |
And on current master (...)?
@aristotle2497 If you still see some problems, probably better to open a new issue for your case.
But please test a recent version before opening a new issue. |
Wait, ppa:staticfloat/julianightlies is a thousand commits behind master? Someone should edit this page http://julialang.org/downloads/platform.html, which currently indicates that the ppa is a build of master from last night. Anyway, I'll pull master, recompile, and let you know if I still see any problems. Thanks. I've been following Julia for a couple of years, but I'm just starting to use it seriously.
We've mentioned in many places that those should be removed, and I thought we already had, but apparently not...
Yes, we need to edit that page; the PPA is not recommended any more because it's not actively maintained. There are generic Linux binaries that are up to date.
We are also seeing a roughly 30% slowdown in our real-world julia app. Any tips for debugging this?
Hi Jack. I would look at ...
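The original list of suggestions is truncated above. As a generic starting point (an assumption on my part, not necessarily what was being recommended), checking the hot functions for type instabilities with `@code_warntype` is a common first step; the function below is a hypothetical stand-in for the app's hot path:

```julia
using InteractiveUtils    # where @code_warntype lives on recent Julia versions

# Hypothetical hot function standing in for the real workload.
function kernel(xs::Vector{Float64})
    acc = 0.0
    for x in xs
        acc += x * x
    end
    return acc
end

# `Any` or `Core.Box` entries in the output indicate type instabilities
# worth comparing between 0.4 and 0.5.
@code_warntype kernel(rand(1000))
```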
I tracked down the primary cause of the slowdown @jminardi mentioned. Here is an abbreviated example:

```julia
@time begin
    scale = Any[1,2,3]
    data = [Vector{Float64}(rand(3)) for i in 1:47000]
    Vector{Float64}[Vector{Float64}(v .* scale) for v in data]
end
```
I get the output:

Oddly the slowdown is not linear: changing the size of ...

Changing ...

While the fix is simple, I am surprised to see a non-linear performance impact.
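The exact change is cut off above, but here is a hedged sketch of the kind of fix implied (an assumption: giving `scale` a concrete element type so the broadcast stays type-inferable):

```julia
@time begin
    scale = Float64[1, 2, 3]                  # concretely typed instead of Any[1,2,3]
    data = [rand(3) for i in 1:47000]         # Vector{Vector{Float64}}
    Vector{Float64}[v .* scale for v in data]
end
```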
Thanks @maxvoxel8, great to have this reduced example. Glad there is a workaround and that 0.5 is faster in some cases.
I suspect this is ok to close and that if these issues are still relevant in the 0.7 release cycle, we can have new issues.
UPDATE 7/15/16: These numbers are outdated. Current results and info are here.
I brought this up on julia-dev: I'm seeing a 30% increase in run time for one internal project on julia 0.5 relative to the latest julia 0.4, all on Windows:
The repo code is here https://github.com/davidanthoff/4sD.jl. The repo is private, but I'd be happy to give access to any core developer who wants to diagnose this.
This is full end-to-end code, i.e. I haven't isolated the root cause. Unfortunately I don't have the time (and probably not the expertise) to narrow it down. I know these sorts of reports are not ideal, but I guess they are still better than ignoring this regression...
Just pinging @tkelman: this is the Bellman equation solving code we developed last summer with your help. Just for your info, no expectation that you do anything about this!