-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
p4est interface performance #767
Conversation
Codecov Report
@@ Coverage Diff @@
## main #767 +/- ##
==========================================
- Coverage 93.60% 93.59% -0.01%
==========================================
Files 182 182
Lines 17713 17789 +76
==========================================
+ Hits 16579 16649 +70
- Misses 1134 1140 +6
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for working on this!
However, I'm still not convinced that pulling the if
out of this loop is the best solution, and that there is no way to let the compiler handle this part. If this is really the only way (for now at least), I'm fine with it for 2D.
But I guess the 3D code will be horribly ugly, as we will end up with a lot of boilerplate code there. Wouldn't that be 48 possibilities, in both interface and mortar logic? Plus the 12 for the primary element?
Co-authored-by: Erik Faulhaber <[email protected]>
Thanks you very much for your input, @efaulhaber.
I'm open to better suggestions but I haven't found one yet... Do you have an idea?
I didn't count the number of possibilities but that's definitely the bad part about it... How do you usually handle this, @andrewwinters5000? |
I added this to the TODO notes of our meeting next week |
Unfortunately, no. The best ideas I had with my limited Julia knowledge are the ones I described in #642. |
We discussed this in the Trixi meeting yesterday with @andrewwinters5000 and decided to make a compromise of performance and readability/simplicity of the code. I implemented such an approach loosely mimicking what we do for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for starting to tackle this performance bottleneck! I've left a few notes and suggestions regarding naming/style, but nothing I noticed really affects the logic you used in the implementation. I already apologize for some of the more verbose comments which also reflect my own flow of thoughts ;-)
Have you run a test what this does for the timings of a simulation with mortar elements?
julia> using Trixi
julia> trixi_include("examples/p4est_2d_dgsem/elixir_advection_amr_unstructured_flag.jl",
save_solution=TrivialCallback(), save_restart=TrivialCallback(),
tspan=(0.0, 100.0))
[...]
Section ncalls time %tot avg alloc %tot avg
────────────────────────────────────────────────────────────────────────────────────
rhs! 5.33k 486ms 62.2% 91.2μs 166MiB 47.4% 31.9KiB
prolong2mortars 5.33k 158ms 20.2% 29.7μs 164MiB 46.9% 31.5KiB
interface flux 5.33k 81.3ms 10.4% 15.3μs 0.00B 0.00% 0.00B
volume integral 5.33k 63.5ms 8.12% 11.9μs 0.00B 0.00% 0.00B
~rhs!~ 5.33k 58.1ms 7.43% 10.9μs 1.88MiB 0.54% 370B
prolong2interfaces 5.33k 49.4ms 6.31% 9.26μs 0.00B 0.00% 0.00B
boundary flux 5.33k 29.7ms 3.79% 5.56μs 0.00B 0.00% 0.00B
mortar flux 5.33k 24.1ms 3.08% 4.51μs 0.00B 0.00% 0.00B
surface integral 5.33k 10.3ms 1.32% 1.93μs 0.00B 0.00% 0.00B
Jacobian 5.33k 7.25ms 0.93% 1.36μs 0.00B 0.00% 0.00B
prolong2boundaries 5.33k 2.36ms 0.30% 443ns 0.00B 0.00% 0.00B
reset ∂u/∂t 5.33k 1.99ms 0.25% 374ns 0.00B 0.00% 0.00B
source terms 5.33k 114μs 0.01% 21.3ns 0.00B 0.00% 0.00B This PR: julia> using Trixi
julia> trixi_include("examples/p4est_2d_dgsem/elixir_advection_amr_unstructured_flag.jl",
save_solution=TrivialCallback(), save_restart=TrivialCallback(),
tspan=(0.0, 100.0))
[...]
Section ncalls time %tot avg alloc %tot avg
────────────────────────────────────────────────────────────────────────────────────
rhs! 5.33k 399ms 55.5% 74.8μs 166MiB 47.4% 31.9KiB
prolong2mortars 5.33k 160ms 22.3% 30.1μs 164MiB 46.9% 31.5KiB
~rhs!~ 5.33k 68.1ms 9.47% 12.8μs 1.88MiB 0.54% 370B
volume integral 5.33k 62.4ms 8.68% 11.7μs 0.00B 0.00% 0.00B
interface flux 5.33k 35.1ms 4.88% 6.58μs 0.00B 0.00% 0.00B
boundary flux 5.33k 25.4ms 3.53% 4.77μs 0.00B 0.00% 0.00B
prolong2interfaces 5.33k 15.7ms 2.18% 2.95μs 0.00B 0.00% 0.00B
mortar flux 5.33k 11.4ms 1.58% 2.13μs 0.00B 0.00% 0.00B
surface integral 5.33k 9.93ms 1.38% 1.86μs 0.00B 0.00% 0.00B
Jacobian 5.33k 7.10ms 0.99% 1.33μs 0.00B 0.00% 0.00B
reset ∂u/∂t 5.33k 1.89ms 0.26% 354ns 0.00B 0.00% 0.00B
prolong2boundaries 5.33k 1.56ms 0.22% 293ns 0.00B 0.00% 0.00B
source terms 5.33k 115μs 0.02% 21.7ns 0.00B 0.00% 0.00B So there is still #628 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the changes! LGTM!
@efaulhaber We have already merged this to be able to move ahead, but it would still be great if you can have a look at this PR once you find the time, and also to have a look at #777. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like your solution. In my eyes, it's just as clean as before, but with a big increase in performance.
# Copy solution data from the secondary element on a case-by-case basis to get | ||
# the correct face and orientation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a case-by-case basis anymore, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, right. I will update the comment later
@@ -20,32 +20,65 @@ function create_cache(mesh::P4estMesh{2}, equations, mortar_l2::LobattoLegendreM | |||
end | |||
|
|||
|
|||
# TODO: p4est interface performance, move and generalize this function for 3D | |||
@inline function index_to_start_step(index::Symbol, index_range) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't really have an idea what this function does until I worked through the following code. Would it make sense to add a comment here explaining what this function does, or would such a comment be more complicated than reading the following code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this has been vastly improved with #783.
This PR improves the performance of surface stuff for the
P4estMesh
. It's a first step with a new design. Instead of using the clean but inefficient code based onevaluate_index
in inner loops, the logic is moved to a higher level and some loop code is repeated with minor adaptations. This is somehow less clean but also much more efficient. Here are the timings based on one of the setups described in issue #642 (usingjulia --check-bounds=no --threads=1
).With the first (verbose) version of PR, I get
With the second version mimicking the handling of interfaces for the
UnstructuredMesh2D
, I getThat's ca. 10% less efficient than the verbose first version but still much more efficient than the version in
main
.The current version is limited to interfaces in 2D. Further changes that would need to be addressed if we follow this design are
Left for another PR later (listed in #642):
TODO: p4est interface performance
I would like to get your feedback on this approach at first such that we can discuss whether there is a better approach. If not, I suggest to merge this (since it's easy enough to review) and perform more performance improvements in other PRs step by step to attack #642.