Latency regression in Plots on master vs 1.7 #41914

Closed
KristofferC opened this issue Aug 18, 2021 · 20 comments
Labels: compiler:latency (Compiler latency), regression (Regression in behavior compared to a previous version)

Comments

@KristofferC
Member

1.7:

julia> @time using Plots
  2.311376 seconds (7.10 M allocations: 490.159 MiB, 2.82% gc time, 0.35% compilation time)

julia> @time (p = plot(rand(2,2)); display(p));
  6.417196 seconds (18.23 M allocations: 959.720 MiB, 3.93% gc time, 17.02% compilation time)

master:

julia> @time using Plots
  3.946437 seconds (9.13 M allocations: 622.219 MiB, 4.27% gc time, 1.18% compilation time)

julia> @time (p = plot(rand(2,2)); display(p));
  8.140353 seconds (18.87 M allocations: 1001.987 MiB, 1.61% gc time, 17.00% compilation time)

I haven't looked deeply into it yet.

@KristofferC added the regression (Regression in behavior compared to a previous version) and compiler:latency (Compiler latency) labels Aug 18, 2021
@aviatesk
Member

aviatesk commented Aug 20, 2021

From the sysimg log, there seems to be a significant difference in the precompile statements:

release/v1.7

    JULIA usr/lib/julia/sys-o.a
Generating REPL precompile statements... 31/31
Executing precompile statements... 1395/1431
Precompilation complete. Summary:
Total ─────── 105.469442 seconds
Generation ──  79.721119 seconds 75.5869%
Execution ───  25.748323 seconds 24.4131%

master

    JULIA usr/lib/julia/sys-o.a
Generating REPL precompile statements... 31/31
Executing precompile statements... 1244/1276
Precompilation complete. Summary:
Total ───────  98.027579 seconds
Generation ──  74.974410 seconds 76.483%
Execution ───  23.053170 seconds 23.517%

so I guess we need to compile more things at runtime on master? I haven't bisected what caused this particular regression though.

EDIT: well, there is not much difference between 1.7.0-beta4.1 and master?

release/v1.7

~/julia3 remotes/origin/release-1.7 aviatesk@amdci2 7s
❯ ./usr/bin/julia -e '@time using Plots; @time plot(rand(10,3))'
  3.693188 seconds (7.44 M allocations: 510.916 MiB, 3.79% gc time, 0.24% compilation time)
  2.870833 seconds (3.60 M allocations: 199.335 MiB, 7.50% gc time, 56.30% compilation time)

master

~/julia avi/noinfer ⇣⇡ aviatesk@amdci2 7s
❯ ./usr/bin/julia -e '@time using Plots; @time plot(rand(10,3))'
  3.460878 seconds (7.19 M allocations: 493.561 MiB, 5.72% gc time, 1.23% compilation time)
  2.995208 seconds (3.49 M allocations: 196.758 MiB, 9.84% gc time, 57.56% compilation time)

@timholy
Member

timholy commented Aug 20, 2021

Plots is a tough case because the most effective way to be cross-version and cross-platform is to use

if ccall(:jl_generating_output, Cint, ()) == 1
    while false; end   # disable the interpreter
    do_work(small_dataset)
end

rather than

if ccall(:jl_generating_output, Cint, ()) == 1
    precompile(do_work, (DataSetType,))
end

but plotting has side effects. Still, they should probably try to transition to this style wherever possible. Currently they rely on the bot to generate precompiles.
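For a package whose workload has display side effects, one possible shape for the workload style is to exercise the plotting pipeline while rendering into an in-memory buffer. This is only a sketch under assumptions: the tiny workload, the IOBuffer redirect, and the "image/png" MIME (which depends on the active backend) are all illustrative, not Plots' actual precompile script:

```julia
# Sketch: inside the package's top-level code. Runs a small workload at
# precompile time while keeping its display side effects off-screen.
if ccall(:jl_generating_output, Cint, ()) == 1
    let p = plot(rand(2, 2))
        io = IOBuffer()
        # Render via `show` so the display path gets compiled too,
        # without opening a GUI window or writing a file.
        show(io, MIME("image/png"), p)
    end
end
```

The point of the redirect is that precompilation then covers the `show`/display code paths that the user's first `display(p)` would otherwise have to compile at runtime.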

@KristofferC
Member Author

KristofferC commented Aug 25, 2021

EDIT: well, there is not much difference between 1.7.0-beta4.1 and master ?

Yes, looks like something regressed vs beta3?

beta3:

julia> @time using Plots
  3.277906 seconds (7.42 M allocations: 511.464 MiB, 9.06% gc time, 0.20% compilation time)

julia> @time (p = plot(rand(2,2)); display(p));
  7.475758 seconds (18.31 M allocations: 962.836 MiB, 2.53% gc time, 15.59% compilation time)

on #41781:

julia> @time using Plots
  4.682400 seconds (8.93 M allocations: 610.531 MiB, 7.02% gc time, 0.16% compilation time)

julia> @time (p = plot(rand(2,2)); display(p));
  9.137335 seconds (19.63 M allocations: 1.005 GiB, 2.10% gc time, 16.63% compilation time)

Should be fairly quick to bisect perhaps.

@JeffBezanson
Member

I get:

1.7.0-rc1:

julia> @time using Plots
  4.155797 seconds (8.16 M allocations: 571.039 MiB, 8.09% gc time, 0.34% compilation time)

julia> @time display(plot(rand(10)))
  6.991758 seconds (10.62 M allocations: 569.403 MiB, 1.73% gc time)

master:

julia> @time using Plots
  6.922687 seconds (12.16 M allocations: 778.810 MiB, 5.00% gc time, 1.01% compilation time)

julia> @time display(plot(rand(10)))
 10.923829 seconds (22.90 M allocations: 1.203 GiB, 2.35% gc time)

@JeffBezanson added the priority (This should be addressed urgently) label Sep 14, 2021
@aviatesk

This comment has been minimized.

@aviatesk
Member

So there are two levels of TTFP regression...:

  • between 1.7-beta3 and 1.7-rc1
  • 1.7-rc1 and master

I filed an issue about throw-block deoptimization (#42258), which I thought could be a trigger of the regression between 1.7-rc1 and master.
I tried the plot benchmark with #42149 reverted, since that PR could be badly affected by the issue, but it seems the regression between master and v1.7-rc1 is not caused by it:

master

~/julia master aviatesk@amdci2 11s
❯ ./usr/bin/julia -e '@time using Plots; @time plot(rand(10,3))'
  7.265401 seconds (12.46 M allocations: 793.437 MiB, 5.08% gc time, 0.65% compilation time)
  2.661832 seconds (3.55 M allocations: 196.283 MiB, 2.25% gc time)

with reverting #42149

~/julia3 master ⇡ aviatesk@amdci2 10m 7s
❯ ./usr/bin/julia -e '@time using Plots; @time plot(rand(10,3))'
  7.294589 seconds (12.44 M allocations: 791.542 MiB, 5.15% gc time, 0.63% compilation time)
  2.622218 seconds (3.64 M allocations: 204.145 MiB, 1.88% gc time)

@KristofferC
Member Author

KristofferC commented Sep 15, 2021

Tried again: I get:

julia> @time using Plots
# 1.6.2:
  2.795078 seconds (8.13 M allocations: 564.960 MiB, 3.26% gc time, 0.10% compilation time)
# 1.7.0-rc1:
  4.379164 seconds (10.48 M allocations: 713.449 MiB, 5.69% gc time, 0.15% compilation time)
# master:
  7.175294 seconds (16.00 M allocations: 1018.494 MiB, 3.80% gc time, 0.17% compilation time)
julia> @time display(plot(rand(10)))
# 1.6.2:
  6.755702 seconds (18.10 M allocations: 1.013 GiB, 4.01% gc time, 15.37% compilation time)
# 1.7.0-rc1:
  8.278466 seconds (18.85 M allocations: 1001.208 MiB, 1.95% gc time)
# master:
 10.549896 seconds (25.17 M allocations: 1.321 GiB, 1.90% gc time, 13.70% compilation time)

So 1.7 is bad and master is really bad. 2x regression in using seems milestone worthy. I'll try to bisect it.

@KristofferC
Member Author

From sysimg log, there seems to be significant difference between precompiled statements

@aviatesk, that's from #41898

@KristofferC added this to the 1.7 milestone Sep 15, 2021
@aviatesk
Member

So 1.7 is bad and master is really bad.

😱 okay, I will also try to bisect it (I think currently we don't have a solid way to define a latency regression)

@KristofferC
Member Author

KristofferC commented Sep 15, 2021

I checked out 1.6.2 locally and get:

julia> @time using Plots
  5.467579 seconds (10.95 M allocations: 743.108 MiB, 4.75% gc time, 0.13% compilation time)

julia> VERSION
v"1.6.2"

And if I use the released version downloaded from the website:

julia> @time using Plots
  2.821719 seconds (7.97 M allocations: 554.232 MiB, 7.47% gc time, 0.15% compilation time)

julia> VERSION
v"1.6.2"

Hard to bisect when the same Julia version can give so dramatically different results... I even cleaned out .julia/compiled in between.

Edit: I need to check when I'm back at the computer that I am not running with assertions on or something like that.

@vchuravy
Member

Hard to bisect when the same Julia version can give so dramatically different results...

Do you have any Make.user flags set?

@KristofferC
Member Author

Literally just updated my comment :)

@aviatesk
Member

I started git bisect between HEAD and 9eade61 with the following script (to find the regression between 1.7-rc1 and master):

# ~/julia4 tags/v1.7.0-rc1 aviatesk@amdci2
# ❯ ./usr/bin/julia -e '@time using Plots; @time plot(rand(10,3))'
#   4.071415 seconds (8.09 M allocations: 569.073 MiB, 7.24% gc time, 0.10% compilation time)
#   2.741481 seconds (3.46 M allocations: 194.945 MiB, 5.41% gc time)

# ~/julia master aviatesk@amdci2 4m 33s
# ❯ ./usr/bin/julia -e '@time using Plots; @time plot(rand(10,3))'
#   7.275295 seconds (12.46 M allocations: 793.529 MiB, 5.06% gc time, 0.65% compilation time)
#   2.663575 seconds (3.55 M allocations: 196.283 MiB, 2.25% gc time)

make cleanall
make -j 64
./usr/bin/julia -e 'using Pkg; Pkg.precompile()'
./usr/bin/julia -e """
let
    allocated = @allocated using Plots
    if allocated > 750_000_000 # threshold: 750 MB
        exit(1)
    else
        exit(0)
    end
end
"""
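Assuming the script above is saved as an executable file (the name `bisect.sh` below is an assumption), it plugs into git's automated bisection via the exit-code convention, with 0 marking a commit good and 1 marking it bad:

```shell
# Hypothetical driver commands, run from the Julia source checkout:
# mark HEAD (master) bad and 9eade61 good, then let git re-run the
# script at each candidate commit and classify it by exit status.
git bisect start HEAD 9eade61
git bisect run ./bisect.sh
git bisect reset   # return to the original checkout when done
```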

And it bisected to 5405994, which is fairly reasonable: before that commit, @time using Plots ran in the interpreter, but now it is compiled and comes with some overhead.
If we revert that commit, we get back to the 1.7-rc1 latency at least:

# HEAD
~/julia3 master aviatesk@amdci2 10s
❯ ./usr/bin/julia -e '@time using Plots; @time plot(rand(10,3))'
  7.271732 seconds (12.44 M allocations: 791.544 MiB, 5.17% gc time, 0.63% compilation time)
  2.656330 seconds (3.64 M allocations: 204.145 MiB, 2.37% gc time)

# reverted 5405994f2c5a2a91813dccd203e8f07f72b07521
~/julia3 avi/try aviatesk@amdci2 9m 40s
❯ ./usr/bin/julia -e '@time using Plots; @time plot(rand(10,3))'
  4.482852 seconds (9.17 M allocations: 606.832 MiB, 4.89% gc time, 0.93% compilation time)
  2.759052 seconds (3.59 M allocations: 200.138 MiB, 5.18% gc time)

This suggests we may need to change our mind on how to measure latency regressions quickly?
I'd like to hear what @timholy suggests on this here again.

@KristofferC
Member Author

And it bisected to 5405994, which is fairly reasonable: before that commit, @time using Plots ran in the interpreter, but now it is compiled and comes with some overhead.

Thanks, now we just need to find the 1.6 -> 1.7 regression, I guess.

@timholy
Member

timholy commented Sep 15, 2021

With regard to 5405994, I think there was some consensus that @time was supposed to compile its workload first. Does that mean the previous heuristic didn't always work? Of course you can circumvent this with tstart = time(); using Plots; time() - tstart, which is recommended any time you're analyzing latency-related issues.

Was that your question, @aviatesk, or is it something else? I haven't bisected or done anything else like that.

@aviatesk
Member

Was that your question, @aviatesk, or is it something else?

Yes, thanks for your response.

I think there was some consensus that @time was supposed to compile its workload first. Does that mean that the previous heuristic didn't always work?

Yeah, when I merged the PR, I assumed the previous while false; end trick was functional, but it seems it wasn't.

I wonder whether @time will show very different results for this sort of measurement between v1.8 and earlier versions, which could be a source of confusion? Maybe we now want a simple macro intended for latency-related measurement?
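Such a macro could be a thin wrapper over wall-clock time that deliberately does not compile its argument first. A minimal sketch, assuming nothing beyond Base; the name `@loadtime` is made up here, not an existing macro:

```julia
# Sketch of a latency-measurement macro: report wall-clock time for an
# expression, mirroring the `tstart = time(); using Plots; time() - tstart`
# pattern, so the measured expression is not pre-compiled by the macro.
macro loadtime(ex)
    quote
        local t0 = time_ns()
        $(esc(ex))
        local elapsed = (time_ns() - t0) / 1e9
        println("elapsed: ", elapsed, " seconds")
        elapsed
    end
end

# Usage at the REPL / top level:
# @loadtime using Plots
```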

@KristofferC
Member Author

Okay, so, embarrassingly, after making sure I don't run with assertions, I don't see a regression in 1.7 vs 1.6. And the regression on master is explained by #41914 (comment). So I'm not sure it makes sense to keep this open?

@aviatesk
Member

I'm trying to find the difference between 1.7 and 1.6; let me close this if I can't find one.

@aviatesk removed the priority (This should be addressed urgently) label Sep 15, 2021
@aviatesk
Member

aviatesk commented Sep 15, 2021

Okay, I got:

~/julia3 remotes/origin/release-1.6 aviatesk@amdci2 7s
❯ ./usr/bin/julia -e '@time using Plots; @time plot(rand(10,3))'
  4.115876 seconds (7.77 M allocations: 553.396 MiB, 5.08% gc time, 14.70% compilation time)
  2.541933 seconds (3.12 M allocations: 187.996 MiB, 1.98% gc time, 54.41% compilation time)

~/julia3 tags/v1.7.0-rc1 aviatesk@amdci2 10m 28s
❯ ./usr/bin/julia -e '@time using Plots; @time plot(rand(10,3))'
  3.951941 seconds (7.92 M allocations: 551.359 MiB, 5.15% gc time, 0.10% compilation time)
  2.640314 seconds (3.45 M allocations: 193.259 MiB, 1.89% gc time)

So I'd like to conclude there is not much difference between 1.6 and 1.7.
I remember #39305 slightly increased the memory allocation of this super simple benchmark, but I think we prefer the performance benefit from that PR over this minor latency difference.

@KristofferC removed this from the 1.7 milestone Sep 15, 2021
@timholy
Member

timholy commented Sep 15, 2021

There's a very tiny using regression that I see between 1.6 (julia) and 1.8 (juliam):

Fairly recent master:

tim@diva:~/src/julia-master$ juliam -q --project --startup-file=no
julia> @time using Requires
  0.069032 seconds (50.07 k allocations: 4.413 MiB, 50.21% compilation time)

julia> 
tim@diva:~/src/julia-master$ juliam -q --project --startup-file=no
julia> tstart = time(); using Requires; time()-tstart
0.0838780403137207

1.6:

julia> 
tim@diva:~/src/julia-master$ julia -q --project --startup-file=no
julia> @time using Requires
  0.038621 seconds (108.95 k allocations: 6.865 MiB, 92.67% compilation time)

julia> 
tim@diva:~/src/julia-master$ julia -q --project --startup-file=no
julia> tstart = time(); using Requires; time()-tstart
0.06404614448547363

There's run-to-run variability, but these particular instances are pretty typical.

Requires is a small package so we are sensitive to any workload in the using pipeline that hasn't been precompiled. SnoopCompile can't help here because you first have to load it, but we can mimic what it does:

tim@diva:~/src/julia-master$ juliam -q --startup-file=no
julia> Core.Compiler.Timings.reset_timings(); Core.Compiler.__set_measure_typeinf(true); using Requires; Core.Compiler.__set_measure_typeinf(false); Core.Compiler.Timings.close_current_timer()

julia> traw = deepcopy(Core.Compiler.Timings._timings[1]);

julia> using SnoopCompile

julia> tinf = SnoopCompile.SnoopCompileCore.InferenceTimingNode(traw)
InferenceTimingNode: 0.073816/0.073816 on Core.Compiler.Timings.ROOT() with 0 direct children

The 0 direct children means it's not an inference issue.

Of course, this doesn't stress-test things like method table insertion, but the benchmarks above make it seem like this isn't a big issue.
