1400% performance regression for custom project on julia 0.5 #16047

Closed
davidanthoff opened this issue Apr 25, 2016 · 48 comments
Labels
help wanted (Indicates that a maintainer wants help on an issue or pull request), performance (Must go faster), regression (Regression in behavior compared to a previous version)
Milestone

Comments

@davidanthoff (Contributor) commented Apr 25, 2016

UPDATE 7/15/16: These numbers are outdated. Current results and info are here.

I brought this up on julia-dev: I'm seeing a 30% increase in run time for one internal project on julia 0.5 relative to the latest julia 0.4, all on Windows:

julia 0.4: 665.095440 seconds (2.03 G allocations: 36.515 GB, 0.53% gc time)
julia 0.5: 904.061203 seconds (4.55 G allocations: 75.579 GB, 0.72% gc time)

The repo code is here: https://github.com/davidanthoff/4sD.jl. The repo is private, but I'd be happy to give access to any core developer who wants to diagnose this.

This is full end-to-end code, i.e. I haven't isolated the root cause. Unfortunately I don't have the time (and probably not the expertise) to narrow it down. I know these sorts of reports are not ideal, but I guess still better than ignoring this regression...

Just pinging @tkelman: this is the Bellman equation solving code we developed last summer with your help. Just for your info, no expectation that you do anything about this!

@yuyichao (Contributor)

Just a guess: if you are using closures or comprehensions a lot, you might be hitting #15276.
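For anyone unfamiliar with #15276, here is a minimal sketch (hypothetical code, not from this project) of the kind of closure pattern it penalizes: a variable that is reassigned and also captured by an inner function gets boxed and inferred as Any, so the hot loop allocates and is slow on 0.5.

# Hypothetical illustration of the #15276 pattern.
function sum_of_squares(xs)
    total = 0.0
    addsq = x -> (total += x^2)   # closure capturing and reassigning `total`
    for x in xs
        addsq(x)                  # `total` is boxed, so this loop allocates on 0.5
    end
    return total
end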

@davidanthoff (Contributor, Author)

That might be it, there are certainly some closures that are affected by this. I'll rerun my test here once #15276 is fixed.

@ViralBShah ViralBShah added performance Must go faster regression Regression in behavior compared to a previous version labels Apr 26, 2016
@ViralBShah ViralBShah added this to the 0.5.0 milestone Apr 26, 2016
@StefanKarpinski (Member)

This is 0.5.x not 0.5.0 – it's not a release blocker, although it is important.

@StefanKarpinski StefanKarpinski modified the milestones: 0.5.x, 0.5.0 Apr 28, 2016
@ViralBShah (Member) commented Apr 28, 2016

Yes, that's right. However, it would be nice not to release 0.5 with known perf regressions, especially if many will have a slower experience.

@davidanthoff (Contributor, Author)

I just reran with the latest master

              _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.0-dev+3782 (2016-04-28 12:43 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit e8601e8* (0 days old master)
|__/                   |  x86_64-w64-mingw32

I had hoped that #16057 and #16060 might have improved things, but no luck so far; if anything, the julia 0.5 version got a bit slower than the previous build I had tried:

julia 0.4: 676.414572 seconds (2.03 G allocations: 36.772 GB, 0.60% gc time)
julia 0.5: 967.918446 seconds (4.52 G allocations: 75.280 GB, 0.75% gc time)

I think @JeffBezanson wanted to try one more thing on the closure side of things to address #16050, although I see that issue on julia 0.4 as well, so I'm not sure it will help narrow the difference between 0.4 and 0.5.

But of course this could also be caused by something entirely unrelated to closures...

Maybe the best thing is to wait until @JeffBezanson has done all the things he wants to do about closures, and then I'll rerun this and see how things look?

@JeffBezanson (Member)

If you have the time --- does a profile reveal anything? Comparing profiles in the two versions might help narrow it down.

@davidanthoff (Contributor, Author)

I've profiled the code under julia 0.4 and 0.5, the profiles are here.

I'm not really sure what I'm looking for, though, or how to go about this...

One thing that I did notice is that the 0.4 profile shows simdloop.jl a couple of times, whereas the 0.5 profile never shows it. Not sure that indicates anything, though...

@yuyichao (Contributor) commented May 2, 2016

Can you try 0.5 with -O3?

@davidanthoff (Contributor, Author)

I just tried with the latest nightly, and with that the performance is completely gone: the whole algorithm now takes something like 2 hours. This run was without the -O3 flag. The version I tried today was:

julia> versioninfo()
Julia Version 0.5.0-dev+3839
Commit 6bc704d* (2016-05-02 16:25 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

So some commit between 0.5.0-dev+3782 and 0.5.0-dev+3839 seems to have caused a really massive perf regression.

@JeffBezanson (Member)

Could be b0c54f0, just a guess.

@davidanthoff (Contributor, Author)

Just for completeness, the exact timing I get is:

7329.798341 seconds (179.17 G allocations: 4.941 TB, 7.75% gc time)

So a little over two hours with the latest master. GC time also seems to be up relative to the 0.5 build from last week. Let me know if there is anything else I can provide in terms of diagnostics at this point to help.

@vtjnash (Member) commented May 3, 2016

Can you give me access to the repo? Running with --track-allocation should hopefully also point to which part of the program is now allocating much more.
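For reference, the usual workflow for that looks roughly like the sketch below (runmodel() is a hypothetical stand-in for the project's entry point):

# Invoke as: julia --track-allocation=user track_alloc.jl
runmodel()                     # run once so compilation-related allocation is excluded
Profile.clear_malloc_data()    # reset the per-line allocation counters
runmodel()                     # the measured run
# On exit, julia writes a *.mem file next to each source file with per-line
# allocation counts; diffing those between 0.4 and 0.5 should show the hot spots.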

@davidanthoff (Contributor, Author)

@vtjnash I gave you access to the two repos that make up this problem. The readme here has the instructions on how to get this running.

Having said that, right now the whole thing doesn't even run on 0.5; some recent change, either in 0.5 or in the Requires package, prevents the code from running there. I'm unable to figure out what is wrong. The whole thing does work on 0.4...

As an aside, the Requires package is pretty remarkable: it seems to have a lot of users (i.e. I'm not even using it directly, but some of the packages that I depend on apparently do), and at the same time the package doesn't have a single test...

@tkelman (Contributor) commented May 6, 2016

What that package does is also not really compatible with precompilation. Most users of Requires should consider refactoring to avoid it. Optional dependencies at the moment are better handled by moving functionality into separate packages that depend on both.

@vtjnash (Member) commented May 7, 2016

Could be b0c54f0, just a guess.

just a good guess :)

I think the problem child is ForwardDiff.jl. It seems to have a number of questionable functions that rely heavily on code generation (such as generated functions over Val{length(array)} or conditional tests for subtypes – aka reimplementing dispatch badly). Overall, this is likely to make it very susceptible to improved heuristics in Base. I suspect the Varargs fix broke its promotion rule for promote(x, y):

@inline function promote_eltype{F<:ForwardDiffNumber}(::Type{F}, types::DataType...)
    return switch_eltype(F, promote_type(eltype(F), types...))
end
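To illustrate the Val{length(array)} pattern mentioned above, here is a hypothetical sketch (not ForwardDiff's actual code): generating code keyed on the array length forces a fresh specialization, and a dynamic dispatch, for every length seen at run time.

# Hypothetical sketch of the antipattern: a generated function specialized on array length.
@generated function sumval{N}(::Type{Val{N}}, x)
    ex = :(zero(eltype(x)))
    for i in 1:N
        ex = :($ex + x[$i])
    end
    return ex
end

# length(x) is a run-time value, so every distinct length triggers a new
# specialization and the call site becomes a dynamic dispatch.
sumval_all(x) = sumval(Val{length(x)}, x)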

@davidanthoff (Contributor, Author)

I think ForwardDiff.jl actually has performance benchmarks, but I wasn't able to run them; I seem to be missing a package.

@jrevels Any chance you could run the ForwardDiff.jl benchmarks on julia 0.4 and the latest nightly and see whether there is a performance regression?

@davidanthoff (Contributor, Author)

Just a heads up that I don't think ForwardDiff.jl is the culprit. I modified my code to use an old version of ForwardDiff (0.0.3) that doesn't have any of the fancy optimizations of the current version. That old ForwardDiff.jl just uses DualNumbers.jl internally, and all of that seems entirely straightforward; I don't see any of the things that @vtjnash mentioned as problems there.

The whole thing is still running, but so far it looks likely that everything is about a factor of 2 slower on julia 0.5. I'll report back once the runs are done.

So, @jrevels probably no need to run those benchmarks.

@davidanthoff (Contributor, Author)

Ok, here are the final numbers:

julia 0.4: 9478.776417 seconds (4.41 G allocations: 14.830 TB, 3.82% gc time)
julia 0.5: 16511.839204 seconds (7.55 G allocations: 361.927 GB, 0.21% gc time)

So, overall julia 0.5 takes about 75% longer to finish this.

The julia 0.5 build is 0.5.0-dev+3993 (bf73102). These numbers use ForwardDiff.jl v0.0.3. Overall things are much slower with that old version of ForwardDiff.jl, but the relative difference between julia 0.4 and 0.5 is large, and I believe ForwardDiff.jl v0.0.3 is not using any of the problematic things that @vtjnash mentioned. Just running this is tricky now: I had to patch ForwardDiff.jl v0.0.3 slightly to run on julia 0.5, and I also had to patch Requires.jl to run on 0.5... So to run my example one should use this branch of ForwardDiff, and on julia 0.5 this branch of Requires.jl.

But, long story short: ForwardDiff.jl is not the culprit, and there is an even more severe performance regression in julia 0.5 than the title of this issue indicates. No clue what the root cause might be.

@vtjnash (Member) commented May 10, 2016

Unless the performance benchmarks were written very carefully, I'm not going to put any stock in them. It's way too easy to craft benchmarks that harm application performance.

@ViralBShah (Member)

@davidanthoff How easy is it to run your code if someone were to dig in?

@ViralBShah (Member)

@vtjnash As I understand it, this is not just a benchmark: it is user code that was fast in 0.4 and has slowed down.

@pkofod (Contributor) commented May 10, 2016

Unless the performance benchmarks were written very carefully, I'm not going to put any stock in them. It's way too easy to craft benchmarks that harm application performance.

Are you basically saying that it doesn't matter if code written in the wild gets slower with increasing version numbers, as long as microbenchmarks are good? Or are you talking about the ForwardDiff.jl perf benchmarks?

@nalimilan (Member)

@pkofod I think he meant precisely the contrary: that benchmarks can often be misleading, and that actual useful code is the best way to measure performance.

@pkofod (Contributor) commented May 10, 2016

Ah, okay. I misunderstood then. My bad.

@JeffBezanson (Member)

Thanks @davidanthoff for the reduced case. I think I found at least another chunk of the problem.

JeffBezanson added a commit that referenced this issue May 16, 2016
This fixes a bug plus optimizes more variables. Should help #16047.
@davidanthoff (Contributor, Author)

@JeffBezanson I tried the latest master with #16386. It improves my repro case a lot!

I'm now starting to add back more of the complexity of my "real" problem to the repro case. I updated the gist with the repro here with a new version that is now again much slower on 0.5 than on 0.4. What I added is one if block in the function that creates the closures.

I'm not sure this is the most efficient way to go about this. I could also give you access to the full problem, but it is involved and messy, and I probably have an easier time isolating these relatively straightforward repro cases than someone who isn't familiar with the full problem... So unless you have a better idea, I'll just continue with this setup.

@StefanKarpinski (Member)

@davidanthoff – thanks for having the patience to go through this process. It's a huge benefit to everyone that we have these use cases to figure out where the performance regressions are.

@davidanthoff (Contributor, Author)

Any news on this? Just a warning: the current example that shows a slowdown is still not up to the complexity level of the whole problem, so we might need a couple more iterations until this is fully resolved.

@davidanthoff (Contributor, Author)

Bump: what is the plan for this? This is a really bad performance regression, and I am worried that it will take a whole bunch of further back and forth until this is completely resolved.

This clearly doesn't have to be resolved before a feature freeze, but I do think it should be resolved before a 0.5 release.

@JeffBezanson (Member)

Yes, I'm still ruminating on how to address this for 0.5. We do at least have a workaround now, which is to declare types for variables used by inner functions (e.g. x::Float64 = 1.0). Ideally that should not be necessary, but for now it should (almost) completely fix performance.
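For anyone hitting the same thing, here is that workaround applied to the closure sketch from earlier in the thread (still hypothetical code, not the project's):

# Same pattern as before, with the suggested type declaration on the captured variable.
function sum_of_squares(xs)
    total::Float64 = 0.0          # the workaround: declare the captured variable's type
    addsq = x -> (total += x^2)   # inner function capturing `total`
    for x in xs
        addsq(x)                  # the loop is type-stable again
    end
    return total
end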

@davidanthoff (Contributor, Author)

I guess I could try to fix all the type instabilities by declaring types for the variables that are used in inner functions, as you suggest, and then rerun my performance comparison. That would at least give us a sense of whether there are some other perf problems lurking somewhere, or whether the type instabilities we have identified so far are the whole story. Not sure when I'll have time for this, but it is on my list.

@davidanthoff (Contributor, Author)

I reran this yesterday. Here are my new numbers:

julia 0.4: 719.568477 seconds (2.13 G allocations: 38.170 GB, 0.60% gc time)
julia 0.5: 11385.609778 seconds (181.07 G allocations: 5.269 TB, 7.52% gc time)

For a while we thought #15276 (or #16048) might be the culprit, but I don't think that is the case. I changed the repro case so that it avoids the #15276 issue, and then I get these numbers:

julia 0.4: 684.270957 seconds (2.13 G allocations: 38.236 GB, 0.58% gc time)
julia 0.5: 10780.112021 seconds (177.44 G allocations: 5.218 TB, 7.91% gc time)

So while #15276 does hurt performance somewhat, it doesn't explain the huge regression from 0.4 to 0.5.

I have a much simpler repro story now, but after a Pkg.update this morning it doesn't work anymore... I'll try to fix that next, and then I could give anyone who wants to reproduce this access to the private repository with the code.

julia> versioninfo()
Julia Version 0.5.0-dev+5435
Commit 894fbe7* (2016-07-14 16:00 UTC)
Platform Info:
  System: NT (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@davidanthoff davidanthoff changed the title 30% performance regression for custom project on julia 0.5 1400% performance regression for custom project on julia 0.5 Jul 15, 2016
@vtjnash (Member) commented Jul 15, 2016

If you run with --track-allocation, do you see where all those extra allocations come from?

@davidanthoff (Contributor, Author)

@vtjnash: I'll try, but #17438 broke pretty much everything for me right now, so I can't rerun until that is sorted out.

@mbrookhart

I'm seeing something similar. I'm running a simple computational physics problem (the Ising model) as a benchmark, found here: https://github.com/aristotle2497/SimpleJuliaTests/blob/master/Ising.jl. The code is simple function declarations and in-place array manipulations; no packages are imported. I'm using the Ubuntu PPAs to get releases and nightly builds.

The code run times:
0.4.6: 4.327104 seconds (158 allocations: 373.151 KB)
0.5.0-dev+4154: 10.384944 seconds (365.32 M allocations: 5.444 GB, 1.51% gc time)

A comparison of the .mem files produced by --track-allocation=user shows that the extra memory is being generated at the end statement of the outer for loop in the step_time function.

Thanks

@yuyichao (Contributor)

Note that the PPA is more than a thousand commits behind master.

@yuyichao (Contributor)

And on current master (0.5.0-dev+5475) I see the same performance compared to 0.4 for the Ising.jl example.

@davidanthoff (Contributor, Author)

@aristotle2497 If you still see some problems, probably better to open a new issue for your case.

@yuyichao (Contributor)

But please test a recent version before opening a new issue.

@mbrookhart

Wait, ppa:staticfloat/julianightlies is a thousand commits behind master? Someone should edit this page, http://julialang.org/downloads/platform.html, which currently indicates that the PPA is a build of master from last night.

Anyway, I'll pull master, recompile, and let you know if I still see any problems.

Thanks. I've been following Julia for a couple of years, but I'm just starting to use it seriously.

@yuyichao (Contributor)

We've mentioned in many places that those should be removed, and I thought we already had, but apparently not...

@tkelman (Contributor) commented Jul 18, 2016

Yes, we need to edit that page; the PPA is not recommended any more because it's not actively maintained. There are generic Linux binaries that are up to date.

@jminardi

We are also seeing a roughly 30% slowdown in our real-world Julia app.

Any tips for debugging this?

@JeffBezanson (Member)

Hi Jack. I would look at the following (a short sketch of the first two checks follows the list):

  • Is there an increase in allocation?
  • Compare profiles from the old version and the new version.
  • Look at uses of inner functions that might touch performance hot spots. If those functions use variables from their outer scopes, try adding type declarations to those variables.
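A rough sketch of the first two checks (workload() is a hypothetical stand-in for the app's hot path); run the same script under the old and new versions and compare the output:

workload()                  # warm up so compilation is excluded from the numbers
@time workload()            # check 1: compare time and allocation counts across versions
Profile.clear()
@profile workload()         # check 2: collect a profile (on newer Julia, `using Profile` first)
Profile.print()             # compare the hottest frames between versions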

@maxvoxel8 commented Sep 23, 2016

I tracked down the primary cause of the slowdown @jminardi mentioned. Here is an abbreviated example:

@time begin
    scale = Any[1,2,3]

    data = [Vector{Float64}(rand(3)) for i in 1:47000]

    Vector{Float64}[Vector{Float64}(v .* scale) for v in data]
end

I get the output

Julia 0.4.6: 1.329589 seconds (2.23 M allocations: 76.748 MB, 4.32% gc time)
Julia 0.5.0: 2.985515 seconds (3.44 M allocations: 107.333 MB, 7.80% gc time)

Oddly, the slowdown is not linear: changing the size of data to 500,000 results in

Julia 0.4.6: 4.359198 seconds (18.54 M allocations: 574.429 MB, 8.55% gc time)
Julia 0.5.0: 23.488645 seconds (31.98 M allocations: 943.713 MB, 2.57% gc time)

Changing Any to Float64 changes the result to

47,000 Elements:
Julia 0.4.6: 0.966634 seconds (1.15 M allocations: 44.238 MB, 2.66% gc time)
Julia 0.5.0: 0.260013 seconds (397.43 k allocations: 22.304 MB)

500,000 Elements:
Julia 0.4.6: 2.276912 seconds (8.85 M allocations: 320.727 MB, 13.29% gc time)
Julia 0.5.0: 0.689436 seconds (2.21 M allocations: 146.724 MB, 45.75% gc time)

While the fix is simple, I am surprised to see a non-linear performance impact.
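Concretely, the change in the sketch above is just making the container concretely typed, so that v .* scale can be inferred:

scale = Float64[1, 2, 3]    # was Any[1,2,3]; a concretely typed array keeps the broadcast type-stable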

@JeffBezanson (Member)

Thanks @maxvoxel8, great to have this reduced example. Glad there is a workaround and that 0.5 is faster in some cases.

@StefanKarpinski StefanKarpinski added help wanted Indicates that a maintainer wants help on an issue or pull request and removed help wanted Indicates that a maintainer wants help on an issue or pull request labels Oct 27, 2016
@ViralBShah (Member) commented Mar 4, 2018

I suspect this is OK to close, and if these issues are still relevant in the 0.7 release cycle, we can open new issues.
