Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

p4est interface performance #767

Merged
merged 15 commits into from
Aug 12, 2021
Merged

p4est interface performance #767

merged 15 commits into from
Aug 12, 2021

Conversation

ranocha
Copy link
Member

@ranocha ranocha commented Aug 4, 2021

This PR improves the performance of surface stuff for the P4estMesh. It's a first step with a new design. Instead of using the clean but inefficient code based on evaluate_index in inner loops, the logic is moved to a higher level and some loop code is repeated with minor adaptations. This is somehow less clean but also much more efficient. Here are the timings based on one of the setups described in issue #642 (using julia --check-bounds=no --threads=1).

julia> trixi_include(joinpath(examples_dir(), "tree_2d_dgsem", "elixir_euler_vortex.jl"),
           save_restart=TrivialCallback(), save_solution=TrivialCallback())
[...]
 ───────────────────────────────────────────────────────────────────────────────
            Trixi.jl                    Time                   Allocations      
                                ──────────────────────   ───────────────────────
        Tot / % measured:            354ms / 88.7%           1.37MiB / 84.1%    

 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 rhs!                    2.78k    288ms  91.9%   104μs   0.99MiB  85.4%     372B
   interface flux        2.78k    119ms  37.9%  42.8μs     0.00B  0.00%    0.00B
   volume integral       2.78k   72.5ms  23.1%  26.1μs     0.00B  0.00%    0.00B
   ~rhs!~                2.78k   44.2ms  14.1%  15.9μs   0.99MiB  85.4%     372B
   surface integral      2.78k   23.6ms  7.51%  8.48μs     0.00B  0.00%    0.00B
   prolong2interfaces    2.78k   16.1ms  5.12%  5.78μs     0.00B  0.00%    0.00B
   reset ∂u/∂t           2.78k   6.95ms  2.22%  2.50μs     0.00B  0.00%    0.00B
   Jacobian              2.78k   5.49ms  1.75%  1.97μs     0.00B  0.00%    0.00B
   prolong2boundaries    2.78k    139μs  0.04%  49.9ns     0.00B  0.00%    0.00B
   prolong2mortars       2.78k    132μs  0.04%  47.3ns     0.00B  0.00%    0.00B
   mortar flux           2.78k   82.6μs  0.03%  29.7ns     0.00B  0.00%    0.00B
   boundary flux         2.78k   79.3μs  0.03%  28.5ns     0.00B  0.00%    0.00B
   source terms          2.78k   59.1μs  0.02%  21.2ns     0.00B  0.00%    0.00B
 calculate dt              557   14.9ms  4.74%  26.7μs     0.00B  0.00%    0.00B
 analyze solution            7   10.5ms  3.36%  1.51ms    173KiB  14.6%  24.7KiB
 ───────────────────────────────────────────────────────────────────────────────

julia> trixi_include(joinpath(examples_dir(), "tree_2d_dgsem", "elixir_euler_vortex.jl"),
           save_restart=TrivialCallback(), save_solution=TrivialCallback(),
           mesh=StructuredMesh((16, 16), float.(coordinates_min), float.(coordinates_max)))
[...]
 ─────────────────────────────────────────────────────────────────────────────
           Trixi.jl                   Time                   Allocations      
                              ──────────────────────   ───────────────────────
       Tot / % measured:           374ms / 88.4%           1.73MiB / 87.2%    

 Section              ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
 rhs!                  2.78k    300ms  90.7%   108μs   0.98MiB  65.3%     370B
   interface flux      2.78k    133ms  40.1%  47.7μs     0.00B  0.00%    0.00B
   volume integral     2.78k   94.3ms  28.5%  33.9μs     0.00B  0.00%    0.00B
   ~rhs!~              2.78k   32.4ms  9.79%  11.7μs   0.98MiB  65.3%     370B
   surface integral    2.78k   25.5ms  7.71%  9.18μs     0.00B  0.00%    0.00B
   Jacobian            2.78k   7.82ms  2.36%  2.81μs     0.00B  0.00%    0.00B
   reset ∂u/∂t         2.78k   7.21ms  2.18%  2.59μs     0.00B  0.00%    0.00B
   boundary flux       2.78k   73.7μs  0.02%  26.5ns     0.00B  0.00%    0.00B
   source terms        2.78k   58.6μs  0.02%  21.1ns     0.00B  0.00%    0.00B
 calculate dt            557   20.1ms  6.08%  36.1μs     0.00B  0.00%    0.00B
 analyze solution          7   10.7ms  3.25%  1.53ms    535KiB  34.7%  76.4KiB
 ─────────────────────────────────────────────────────────────────────────────

julia> trixi_include(joinpath(examples_dir(), "tree_2d_dgsem", "elixir_euler_vortex.jl"),
           save_restart=TrivialCallback(), save_solution=TrivialCallback(),
           mesh=P4estMesh((1, 1), polydeg=3,
                   coordinates_min=coordinates_min, coordinates_max=coordinates_max,
                   initial_refinement_level=4))
[...]
 ───────────────────────────────────────────────────────────────────────────────
            Trixi.jl                    Time                   Allocations      
                                ──────────────────────   ───────────────────────
        Tot / % measured:            804ms / 94.6%           1.73MiB / 87.3%    

 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 rhs!                    2.78k    727ms  95.6%   262μs   0.99MiB  65.2%     372B
   prolong2interfaces    2.78k    275ms  36.1%  98.8μs     0.00B  0.00%    0.00B
   interface flux        2.78k    268ms  35.2%  96.2μs     0.00B  0.00%    0.00B
   volume integral       2.78k   96.5ms  12.7%  34.7μs     0.00B  0.00%    0.00B
   ~rhs!~                2.78k   48.9ms  6.43%  17.6μs   0.99MiB  65.2%     372B
   surface integral      2.78k   24.3ms  3.20%  8.74μs     0.00B  0.00%    0.00B
   Jacobian              2.78k   7.59ms  1.00%  2.73μs     0.00B  0.00%    0.00B
   reset ∂u/∂t           2.78k   7.17ms  0.94%  2.58μs     0.00B  0.00%    0.00B
   prolong2mortars       2.78k    157μs  0.02%  56.4ns     0.00B  0.00%    0.00B
   prolong2boundaries    2.78k    147μs  0.02%  52.7ns     0.00B  0.00%    0.00B
   mortar flux           2.78k   89.1μs  0.01%  32.0ns     0.00B  0.00%    0.00B
   boundary flux         2.78k   77.1μs  0.01%  27.7ns     0.00B  0.00%    0.00B
   source terms          2.78k   61.4μs  0.01%  22.1ns     0.00B  0.00%    0.00B
 calculate dt              557   20.8ms  2.73%  37.3μs     0.00B  0.00%    0.00B
 analyze solution            7   12.8ms  1.68%  1.82ms    539KiB  34.8%  77.0KiB
 ───────────────────────────────────────────────────────────────────────────────

With the first (verbose) version of PR, I get

 ───────────────────────────────────────────────────────────────────────────────
            Trixi.jl                    Time                   Allocations      
                                ──────────────────────   ───────────────────────
        Tot / % measured:            394ms / 89.5%           1.73MiB / 87.2%    

 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 rhs!                    2.78k    322ms  91.3%   116μs   0.99MiB  65.3%     372B
   interface flux        2.78k    125ms  35.6%  45.1μs     0.00B  0.00%    0.00B
   volume integral       2.78k   91.1ms  25.8%  32.8μs     0.00B  0.00%    0.00B
   ~rhs!~                2.78k   44.2ms  12.6%  15.9μs   0.99MiB  65.3%     372B
   prolong2interfaces    2.78k   24.0ms  6.79%  8.61μs     0.00B  0.00%    0.00B
   surface integral      2.78k   22.5ms  6.37%  8.08μs     0.00B  0.00%    0.00B
   Jacobian              2.78k   7.21ms  2.04%  2.59μs     0.00B  0.00%    0.00B
   reset ∂u/∂t           2.78k   7.03ms  2.00%  2.53μs     0.00B  0.00%    0.00B
   prolong2boundaries    2.78k    143μs  0.04%  51.3ns     0.00B  0.00%    0.00B
   prolong2mortars       2.78k    136μs  0.04%  48.9ns     0.00B  0.00%    0.00B
   mortar flux           2.78k   87.8μs  0.02%  31.6ns     0.00B  0.00%    0.00B
   boundary flux         2.78k   78.4μs  0.02%  28.2ns     0.00B  0.00%    0.00B
   source terms          2.78k   61.1μs  0.02%  22.0ns     0.00B  0.00%    0.00B
 calculate dt              557   20.0ms  5.66%  35.8μs     0.00B  0.00%    0.00B
 analyze solution            7   10.6ms  3.01%  1.52ms    536KiB  34.7%  76.6KiB
 ───────────────────────────────────────────────────────────────────────────────

With the second version mimicking the handling of interfaces for the UnstructuredMesh2D, I get

 ───────────────────────────────────────────────────────────────────────────────
            Trixi.jl                    Time                   Allocations      
                                ──────────────────────   ───────────────────────
        Tot / % measured:            417ms / 87.4%           1.73MiB / 87.4%    

 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 rhs!                    2.78k    334ms  91.5%   120μs   0.99MiB  65.3%     372B
   interface flux        2.78k    138ms  37.8%  49.6μs     0.00B  0.00%    0.00B
   volume integral       2.78k   87.8ms  24.1%  31.6μs     0.00B  0.00%    0.00B
   ~rhs!~                2.78k   42.1ms  11.5%  15.1μs   0.99MiB  65.3%     372B
   prolong2interfaces    2.78k   29.2ms  8.01%  10.5μs     0.00B  0.00%    0.00B
   surface integral      2.78k   22.6ms  6.19%  8.12μs     0.00B  0.00%    0.00B
   reset ∂u/∂t           2.78k   6.85ms  1.88%  2.46μs     0.00B  0.00%    0.00B
   Jacobian              2.78k   6.78ms  1.86%  2.44μs     0.00B  0.00%    0.00B
   prolong2boundaries    2.78k    139μs  0.04%  50.2ns     0.00B  0.00%    0.00B
   prolong2mortars       2.78k    119μs  0.03%  42.6ns     0.00B  0.00%    0.00B
   mortar flux           2.78k   93.7μs  0.03%  33.7ns     0.00B  0.00%    0.00B
   boundary flux         2.78k   71.8μs  0.02%  25.8ns     0.00B  0.00%    0.00B
   source terms          2.78k   46.6μs  0.01%  16.8ns     0.00B  0.00%    0.00B
 calculate dt              557   19.7ms  5.41%  35.4μs     0.00B  0.00%    0.00B
 analyze solution            7   11.4ms  3.11%  1.62ms    538KiB  34.7%  76.8KiB
 ───────────────────────────────────────────────────────────────────────────────

That's ca. 10% less efficient than the verbose first version but still much more efficient than the version in main.

The current version is limited to interfaces in 2D. Further changes that would need to be addressed if we follow this design are

  • 2D interfaces
  • 2D boundaries
  • 2D mortars

Left for another PR later (listed in #642):

  • 3D interfaces
  • 3D boundaries
  • 3D mortars
  • Look for TODO: p4est interface performance

I would like to get your feedback on this approach at first such that we can discuss whether there is a better approach. If not, I suggest to merge this (since it's easy enough to review) and perform more performance improvements in other PRs step by step to attack #642.

@ranocha ranocha requested a review from efaulhaber August 4, 2021 17:01
@ranocha ranocha added the performance We are greedy label Aug 4, 2021
@codecov
Copy link

codecov bot commented Aug 4, 2021

Codecov Report

Merging #767 (58955d0) into main (5d457f2) will decrease coverage by 0.01%.
The diff coverage is 98.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #767      +/-   ##
==========================================
- Coverage   93.60%   93.59%   -0.01%     
==========================================
  Files         182      182              
  Lines       17713    17789      +76     
==========================================
+ Hits        16579    16649      +70     
- Misses       1134     1140       +6     
Flag Coverage Δ
unittests 93.59% <98.55%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...sem/elixir_advection_nonconforming_unstructured.jl 100.00% <ø> (ø)
src/solvers/dgsem_p4est/dg.jl 90.91% <85.71%> (-2.84%) ⬇️
src/solvers/dgsem_p4est/dg_2d.jl 99.02% <99.17%> (-0.25%) ⬇️
src/solvers/dgsem_p4est/containers.jl 91.28% <100.00%> (-1.29%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5d457f2...58955d0. Read the comment docs.

Copy link
Member

@efaulhaber efaulhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for working on this!

However, I'm still not convinced that pulling the if out of this loop is the best solution, and that there is no way to let the compiler handle this part. If this is really the only way (for now at least), I'm fine with it for 2D.

But I guess the 3D code will be horribly ugly, as we will end up with a lot of boilerplate code there. Wouldn't that be 48 possibilities, in both interface and mortar logic? Plus the 12 for the primary element?

src/solvers/dgsem_p4est/dg_2d.jl Show resolved Hide resolved
src/solvers/dgsem_p4est/dg_2d.jl Outdated Show resolved Hide resolved
src/solvers/dgsem_p4est/dg_2d.jl Outdated Show resolved Hide resolved
@ranocha ranocha added discussion triage To be decided at the next Trixi meeting and removed discussion labels Aug 5, 2021
@ranocha
Copy link
Member Author

ranocha commented Aug 5, 2021

Thanks you very much for your input, @efaulhaber.

However, I'm still not convinced that pulling the if out of this loop is the best solution, and that there is no way to let the compiler handle this part. If this is really the only way (for now at least), I'm fine with it for 2D.

I'm open to better suggestions but I haven't found one yet... Do you have an idea?

But I guess the 3D code will be horribly ugly, as we will end up with a lot of boilerplate code there. Wouldn't that be 48 possibilities, in both interface and mortar logic? Plus the 12 for the primary element?

I didn't count the number of possibilities but that's definitely the bad part about it... How do you usually handle this, @andrewwinters5000?

@ranocha
Copy link
Member Author

ranocha commented Aug 5, 2021

I added this to the TODO notes of our meeting next week

@efaulhaber
Copy link
Member

I'm open to better suggestions but I haven't found one yet... Do you have an idea?

Unfortunately, no. The best ideas I had with my limited Julia knowledge are the ones I described in #642.

@ranocha
Copy link
Member Author

ranocha commented Aug 11, 2021

We discussed this in the Trixi meeting yesterday with @andrewwinters5000 and decided to make a compromise of performance and readability/simplicity of the code. I implemented such an approach loosely mimicking what we do for the UnstrucutredMesh2D at the interfaces. See the updated performance numbers in the first post. If you don't have any better idea, this seems to be an acceptable compromise.

@ranocha ranocha mentioned this pull request Aug 11, 2021
@ranocha ranocha removed the triage To be decided at the next Trixi meeting label Aug 11, 2021
@ranocha ranocha marked this pull request as draft August 11, 2021 13:58
@ranocha ranocha marked this pull request as ready for review August 11, 2021 15:46
@ranocha ranocha requested review from efaulhaber and sloede August 11, 2021 16:10
Copy link
Member

@sloede sloede left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for starting to tackle this performance bottleneck! I've left a few notes and suggestions regarding naming/style, but nothing I noticed really affects the logic you used in the implementation. I already apologize for some of the more verbose comments which also reflect my own flow of thoughts ;-)

Have you run a test what this does for the timings of a simulation with mortar elements?

src/solvers/dgsem_p4est/dg_2d.jl Outdated Show resolved Hide resolved
src/solvers/dgsem_p4est/dg_2d.jl Outdated Show resolved Hide resolved
src/solvers/dgsem_p4est/dg_2d.jl Outdated Show resolved Hide resolved
src/solvers/dgsem_p4est/dg_2d.jl Show resolved Hide resolved
src/solvers/dgsem_p4est/dg_2d.jl Outdated Show resolved Hide resolved
src/solvers/dgsem_p4est/dg_2d.jl Outdated Show resolved Hide resolved
src/solvers/dgsem_p4est/dg_2d.jl Outdated Show resolved Hide resolved
src/solvers/dgsem_p4est/dg_2d.jl Outdated Show resolved Hide resolved
src/solvers/dgsem_p4est/dg_2d.jl Show resolved Hide resolved
src/solvers/dgsem_p4est/dg_2d.jl Show resolved Hide resolved
@ranocha
Copy link
Member Author

ranocha commented Aug 12, 2021

Have you run a test what this does for the timings of a simulation with mortar elements?

julia> using Trixi

julia> trixi_include("examples/p4est_2d_dgsem/elixir_advection_amr_unstructured_flag.jl",
                     save_solution=TrivialCallback(), save_restart=TrivialCallback(),
                     tspan=(0.0, 100.0))
[...]
 Section                     ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                         5.33k    486ms  62.2%  91.2μs    166MiB  47.4%  31.9KiB
   prolong2mortars            5.33k    158ms  20.2%  29.7μs    164MiB  46.9%  31.5KiB
   interface flux             5.33k   81.3ms  10.4%  15.3μs     0.00B  0.00%    0.00B
   volume integral            5.33k   63.5ms  8.12%  11.9μs     0.00B  0.00%    0.00B
   ~rhs!~                     5.33k   58.1ms  7.43%  10.9μs   1.88MiB  0.54%     370B
   prolong2interfaces         5.33k   49.4ms  6.31%  9.26μs     0.00B  0.00%    0.00B
   boundary flux              5.33k   29.7ms  3.79%  5.56μs     0.00B  0.00%    0.00B
   mortar flux                5.33k   24.1ms  3.08%  4.51μs     0.00B  0.00%    0.00B
   surface integral           5.33k   10.3ms  1.32%  1.93μs     0.00B  0.00%    0.00B
   Jacobian                   5.33k   7.25ms  0.93%  1.36μs     0.00B  0.00%    0.00B
   prolong2boundaries         5.33k   2.36ms  0.30%   443ns     0.00B  0.00%    0.00B
   reset ∂u/∂t                5.33k   1.99ms  0.25%   374ns     0.00B  0.00%    0.00B
   source terms               5.33k    114μs  0.01%  21.3ns     0.00B  0.00%    0.00B

This PR:

julia> using Trixi

julia> trixi_include("examples/p4est_2d_dgsem/elixir_advection_amr_unstructured_flag.jl",
                     save_solution=TrivialCallback(), save_restart=TrivialCallback(),
                     tspan=(0.0, 100.0))
[...]
 Section                     ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 rhs!                         5.33k    399ms  55.5%  74.8μs    166MiB  47.4%  31.9KiB
   prolong2mortars            5.33k    160ms  22.3%  30.1μs    164MiB  46.9%  31.5KiB
   ~rhs!~                     5.33k   68.1ms  9.47%  12.8μs   1.88MiB  0.54%     370B
   volume integral            5.33k   62.4ms  8.68%  11.7μs     0.00B  0.00%    0.00B
   interface flux             5.33k   35.1ms  4.88%  6.58μs     0.00B  0.00%    0.00B
   boundary flux              5.33k   25.4ms  3.53%  4.77μs     0.00B  0.00%    0.00B
   prolong2interfaces         5.33k   15.7ms  2.18%  2.95μs     0.00B  0.00%    0.00B
   mortar flux                5.33k   11.4ms  1.58%  2.13μs     0.00B  0.00%    0.00B
   surface integral           5.33k   9.93ms  1.38%  1.86μs     0.00B  0.00%    0.00B
   Jacobian                   5.33k   7.10ms  0.99%  1.33μs     0.00B  0.00%    0.00B
   reset ∂u/∂t                5.33k   1.89ms  0.26%   354ns     0.00B  0.00%    0.00B
   prolong2boundaries         5.33k   1.56ms  0.22%   293ns     0.00B  0.00%    0.00B
   source terms               5.33k    115μs  0.02%  21.7ns     0.00B  0.00%    0.00B

So there is still #628

@ranocha ranocha requested a review from sloede August 12, 2021 06:27
Copy link
Member

@sloede sloede left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the changes! LGTM!

@ranocha ranocha merged commit 7eaf913 into main Aug 12, 2021
@ranocha ranocha deleted the hr/p4est_interface_performance branch August 12, 2021 10:06
@sloede
Copy link
Member

sloede commented Aug 12, 2021

@efaulhaber We have already merged this to be able to move ahead, but it would still be great if you can have a look at this PR once you find the time, and also to have a look at #777.

Copy link
Member

@efaulhaber efaulhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like your solution. In my eyes, it's just as clean as before, but with a big increase in performance.

Comment on lines +66 to +67
# Copy solution data from the secondary element on a case-by-case basis to get
# the correct face and orientation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a case-by-case basis anymore, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, right. I will update the comment later

@@ -20,32 +20,65 @@ function create_cache(mesh::P4estMesh{2}, equations, mortar_l2::LobattoLegendreM
end


# TODO: p4est interface performance, move and generalize this function for 3D
@inline function index_to_start_step(index::Symbol, index_range)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't really have an idea what this function does until I worked through the following code. Would it make sense to add a comment here explaining what this function does, or would such a comment be more complicated than reading the following code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has been vastly improved with #783.

src/solvers/dgsem_p4est/dg_2d.jl Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance We are greedy
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants