-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improving performance of AMR with p4est #638
Improving performance of AMR with p4est #638
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this PR. I have only one minor suggestion and a couple of questions for me to understand things better. Before merging this, I'd also like for @efaulhaber to approve this PR.
How does the P4estMesh now hold up against the TreeMesh in terms of performance, e.g, for the test setup referenced in the PR description?
Thank you very much! I appreciate the thorough guide on how to approach performance optimizations like that.
There's also an option to show a flame graph: I can see that a lot of time is spent in I'd like to have the first picture in reverse, like a quantitative version of the flame graph by |
No, I have never tried that.
What exactly do you mean by that? |
julia> using Trixi
julia> trixi_include(joinpath(examples_dir(), "2d", "elixir_advection_amr.jl"))
[...]
─────────────────────────────────────────────────────────────────────────────────────
Trixi.jl Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 108ms / 93.1% 16.0MiB / 99.2%
Section ncalls time %tot avg alloc %tot avg
─────────────────────────────────────────────────────────────────────────────────────
rhs! 1.60k 60.1ms 60.1% 37.6μs 586KiB 3.60% 375B
[...]
AMR 63 28.1ms 28.0% 445μs 13.6MiB 85.6% 221KiB
refine 63 17.8ms 17.8% 282μs 5.52MiB 34.8% 89.7KiB
solver 63 15.6ms 15.5% 247μs 5.06MiB 31.8% 82.2KiB
mesh 63 2.16ms 2.16% 34.2μs 459KiB 2.82% 7.28KiB
refine_unbalanced! 63 1.91ms 1.91% 30.4μs 8.59KiB 0.05% 140B
~mesh~ 63 191μs 0.19% 3.03μs 348KiB 2.14% 5.53KiB
rebalance! 63 54.7μs 0.05% 869ns 102KiB 0.63% 1.62KiB
~refine~ 63 57.3μs 0.06% 909ns 17.6KiB 0.11% 285B
coarsen 63 8.11ms 8.10% 129μs 7.14MiB 45.0% 116KiB
solver 63 5.04ms 5.04% 80.0μs 5.18MiB 32.6% 84.2KiB
mesh 63 2.47ms 2.47% 39.3μs 302KiB 1.85% 4.79KiB
~coarsen~ 63 597μs 0.60% 9.48μs 1.66MiB 10.5% 27.1KiB
indicator 63 1.77ms 1.77% 28.1μs 441KiB 2.71% 7.00KiB
~AMR~ 63 393μs 0.39% 6.24μs 515KiB 3.17% 8.18KiB
[...]
julia> trixi_include(joinpath(examples_dir(), "2d", "elixir_advection_amr.jl"),
mesh=P4estMesh((1, 1), polydeg=3,
coordinates_min=coordinates_min, coordinates_max=coordinates_max,
initial_refinement_level=4))
[...]
────────────────────────────────────────────────────────────────────────────────────
Trixi.jl Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 575ms / 98.3% 69.6MiB / 100%
Section ncalls time %tot avg alloc %tot avg
────────────────────────────────────────────────────────────────────────────────────
rhs! 1.61k 232ms 41.0% 144μs 53.5MiB 77.1% 34.1KiB
[...]
AMR 64 68.7ms 12.2% 1.07ms 13.5MiB 19.5% 217KiB
refine 64 34.0ms 6.02% 532μs 5.95MiB 8.57% 95.1KiB
solver 64 21.4ms 3.79% 335μs 5.44MiB 7.84% 87.1KiB
mesh 64 12.6ms 2.22% 196μs 515KiB 0.73% 8.05KiB
rebalance 128 11.6ms 2.05% 90.3μs 2.00KiB 0.00% 16.0B
refine 64 580μs 0.10% 9.06μs 0.00B 0.00% 0.00B
~mesh~ 64 432μs 0.08% 6.75μs 513KiB 0.72% 8.02KiB
~refine~ 64 27.2μs 0.00% 424ns 1.66KiB 0.00% 26.5B
coarsen 64 32.5ms 5.75% 507μs 7.36MiB 10.6% 118KiB
solver 64 20.3ms 3.59% 317μs 5.55MiB 8.00% 88.8KiB
mesh 64 12.2ms 2.16% 190μs 1.81MiB 2.61% 29.0KiB
rebalance 128 10.6ms 1.88% 83.2μs 2.00KiB 0.00% 16.0B
~mesh~ 64 1.03ms 0.18% 16.2μs 1.29MiB 1.86% 20.6KiB
coarsen! 64 505μs 0.09% 7.90μs 534KiB 0.75% 8.34KiB
~coarsen~ 64 28.7μs 0.01% 448ns 1.66KiB 0.00% 26.5B
indicator 64 1.96ms 0.35% 30.6μs 224KiB 0.32% 3.50KiB
~AMR~ 64 244μs 0.04% 3.82μs 2.48KiB 0.00% 39.8B
[...] The code is run twice to exclude compilation time on Julia v1.6.1 using |
Thanks for the timings! The p4est AMR now looks very competitive in comparison to the the TreeMesh. If we exclude the solver-specific parts from the |
Codecov Report
@@ Coverage Diff @@
## main #638 +/- ##
==========================================
- Coverage 93.36% 93.33% -0.03%
==========================================
Files 155 156 +1
Lines 15353 15465 +112
==========================================
+ Hits 14334 14434 +100
- Misses 1019 1031 +12
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Why is |
Thus, the allocations are indeed caused by #628, but that's not the only reason for the reduced performance compared to the |
Something like my first screenshot above, with exact timings of each part (instead of a graph that only shows qualitative information). Where I could see the lowest row of the flame graph with exact timings, expand it, and then the row above that is shown with exact timings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small thing and a few comment suggestions, otherwise everything looks good to me.
Good job! It's really fast now!
nodes) | ||
basis::LobattoLegendreBasis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing other than the basis' nodes is used here, right? This function will be called by Trixi2Vtk where we only have the nodes and no basis. I should've commented that, sorry!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind changing that part in Trixi2Vtk? I would prefer passing the basis for consistency with other meshes and to enable multiple dispatch. For example, other meshes use something like
# Get cell length in reference mesh
dx = 2 / size(mesh, 1)
dy = 2 / size(mesh, 2)
# Calculate node coordinates of reference mesh
cell_x_offset = -1 + (cell_x-1) * dx + dx/2
cell_y_offset = -1 + (cell_y-1) * dy + dy/2
which is specific to the nodes
of the LobattoLegendreBasis
. In particular, it won't work for FD methods implemented in #617.
What do you think, @sloede?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no strong feelings either way. However, I'd suggest to follow the principle of not implementing something for a feature that isn't even there yet (YAGNI). Thus if changing this for dispatch is only needed for a future version of FD-SBP, I wouldn't change the interface right now.
On the other hand, if it would remove currently existing features if we continue to use plain nodes instead of passing the basis
, then we should use the implementation proposed by Hendrik.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, FD-SBP are already in main
, so I would like to keep this version here
As far as I know, the length of the bars in the output of ProfileView is quantitative (but there is no axis/legend). Might be worth opening an issue there or asking on discourse. If you get an answer, I would also be interested in it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@efaulhaber: You can use PProf.jl for this. After running |
That's pretty cool, thanks! A bit chaotic for bigger profiles, but I guess I like it better than the flame graphs. |
You can find an extended description of these changes and my reasoning for them in my blog post. I will be happy to get your feedback and improve both this PR and the blog post accompanying this PR.
Using
julia --num-threads=1 --check-bounds=no
, I get the following results on the currentmain
.This PR improves these results to
Closes #627