Surface stuff is slow for the P4estMesh #642

Closed · 7 tasks done · Tracked by #720 · Fixed by #783
ranocha opened this issue Jun 13, 2021 · 13 comments
Labels
performance We are greedy

Comments

@ranocha (Member) commented Jun 13, 2021

This came up in #638 (comment). For example, I get

julia> trixi_include(joinpath(examples_dir(), "tree_2d_dgsem", "elixir_advection_extended.jl"),
                     save_restart=TrivialCallback(), save_solution=TrivialCallback())
[...]
 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 rhs!                      201   6.05ms  86.3%  30.1μs   83.1KiB  78.9%     423B
   volume integral         201   2.26ms  32.3%  11.3μs     0.00B  0.00%    0.00B
   ~rhs!~                  201   2.19ms  31.3%  10.9μs   83.1KiB  78.9%     423B
   interface flux          201    633μs  9.04%  3.15μs     0.00B  0.00%    0.00B
   surface integral        201    432μs  6.17%  2.15μs     0.00B  0.00%    0.00B
   prolong2interfaces      201    316μs  4.52%  1.57μs     0.00B  0.00%    0.00B
   Jacobian                201   89.6μs  1.28%   446ns     0.00B  0.00%    0.00B
   reset ∂u/∂t             201   84.9μs  1.21%   422ns     0.00B  0.00%    0.00B
   prolong2mortars         201   10.5μs  0.15%  52.3ns     0.00B  0.00%    0.00B
   prolong2boundaries      201   7.33μs  0.10%  36.5ns     0.00B  0.00%    0.00B
   mortar flux             201   5.52μs  0.08%  27.4ns     0.00B  0.00%    0.00B
   boundary flux           201   5.29μs  0.08%  26.3ns     0.00B  0.00%    0.00B
   source terms            201   3.77μs  0.05%  18.8ns     0.00B  0.00%    0.00B
 analyze solution            2    936μs  13.4%   468μs   22.3KiB  21.1%  11.1KiB
 calculate dt               41   22.7μs  0.32%   553ns     0.00B  0.00%    0.00B


julia> trixi_include(joinpath(examples_dir(), "tree_2d_dgsem", "elixir_advection_extended.jl"),
           save_restart=TrivialCallback(), save_solution=TrivialCallback(),
           mesh=StructuredMesh((16, 16), float.(coordinates_min), float.(coordinates_max)))
[...]
 Section              ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
 rhs!                    206   7.57ms  83.2%  36.7μs   80.0KiB  38.8%     398B
   volume integral       206   3.07ms  33.8%  14.9μs     0.00B  0.00%    0.00B
   ~rhs!~                206   1.82ms  20.0%  8.81μs   80.0KiB  38.8%     398B
   interface flux        206   1.72ms  18.9%  8.37μs     0.00B  0.00%    0.00B
   surface integral      206    503μs  5.53%  2.44μs     0.00B  0.00%    0.00B
   Jacobian              206    343μs  3.77%  1.67μs     0.00B  0.00%    0.00B
   reset ∂u/∂t           206   96.0μs  1.05%   466ns     0.00B  0.00%    0.00B
   boundary flux         206   5.54μs  0.06%  26.9ns     0.00B  0.00%    0.00B
   source terms          206   3.67μs  0.04%  17.8ns     0.00B  0.00%    0.00B
 analyze solution          2   1.06ms  11.7%   532μs    126KiB  61.2%  63.1KiB
 calculate dt             42    468μs  5.15%  11.1μs     0.00B  0.00%    0.00B

julia> trixi_include(joinpath(examples_dir(), "tree_2d_dgsem", "elixir_advection_extended.jl"),
           save_restart=TrivialCallback(), save_solution=TrivialCallback(),
           mesh=P4estMesh((1, 1), polydeg=3,
                   coordinates_min=coordinates_min, coordinates_max=coordinates_max,
                   initial_refinement_level=4))
[...]
 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 rhs!                      206   15.0ms  90.1%  72.8μs   84.9KiB  40.2%     422B
   interface flux          206   5.10ms  30.7%  24.8μs     0.00B  0.00%    0.00B
   volume integral         206   3.16ms  19.0%  15.3μs     0.00B  0.00%    0.00B
   prolong2interfaces      206   3.03ms  18.2%  14.7μs     0.00B  0.00%    0.00B
   ~rhs!~                  206   2.71ms  16.3%  13.2μs   84.9KiB  40.2%     422B
   surface integral        206    498μs  2.99%  2.42μs     0.00B  0.00%    0.00B
   Jacobian                206    355μs  2.13%  1.72μs     0.00B  0.00%    0.00B
   reset ∂u/∂t             206   99.5μs  0.60%   483ns     0.00B  0.00%    0.00B
   prolong2boundaries      206   9.76μs  0.06%  47.4ns     0.00B  0.00%    0.00B
   prolong2mortars         206   8.90μs  0.05%  43.2ns     0.00B  0.00%    0.00B
   mortar flux             206   6.95μs  0.04%  33.7ns     0.00B  0.00%    0.00B
   boundary flux           206   6.21μs  0.04%  30.1ns     0.00B  0.00%    0.00B
   source terms            206   3.90μs  0.02%  19.0ns     0.00B  0.00%    0.00B
 analyze solution            2   1.16ms  6.98%   581μs    126KiB  59.8%  63.1KiB
 calculate dt               42    479μs  2.88%  11.4μs     0.00B  0.00%    0.00B

Another example:

julia> trixi_include(joinpath(examples_dir(), "tree_2d_dgsem", "elixir_euler_vortex.jl"),
           save_restart=TrivialCallback(), save_solution=TrivialCallback())
[...]
 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 rhs!                    2.78k    292ms  92.3%   105μs   0.99MiB  85.6%     372B
   interface flux        2.78k    121ms  38.2%  43.4μs     0.00B  0.00%    0.00B
   volume integral       2.78k   72.7ms  23.0%  26.1μs     0.00B  0.00%    0.00B
   ~rhs!~                2.78k   45.1ms  14.3%  16.2μs   0.99MiB  85.6%     372B
   surface integral      2.78k   23.9ms  7.55%  8.58μs     0.00B  0.00%    0.00B
   prolong2interfaces    2.78k   16.3ms  5.16%  5.86μs     0.00B  0.00%    0.00B
   reset ∂u/∂t           2.78k   6.95ms  2.20%  2.50μs     0.00B  0.00%    0.00B
   Jacobian              2.78k   5.69ms  1.80%  2.04μs     0.00B  0.00%    0.00B
   prolong2boundaries    2.78k    178μs  0.06%  63.8ns     0.00B  0.00%    0.00B
   prolong2mortars       2.78k    137μs  0.04%  49.3ns     0.00B  0.00%    0.00B
   mortar flux           2.78k   76.5μs  0.02%  27.5ns     0.00B  0.00%    0.00B
   boundary flux         2.78k   73.6μs  0.02%  26.5ns     0.00B  0.00%    0.00B
   source terms          2.78k   61.9μs  0.02%  22.2ns     0.00B  0.00%    0.00B
 calculate dt              557   14.5ms  4.60%  26.1μs     0.00B  0.00%    0.00B
 analyze solution            7   9.74ms  3.08%  1.39ms    171KiB  14.4%  24.4KiB

julia> trixi_include(joinpath(examples_dir(), "tree_2d_dgsem", "elixir_euler_vortex.jl"),
           save_restart=TrivialCallback(), save_solution=TrivialCallback(),
           mesh=StructuredMesh((16, 16), float.(coordinates_min), float.(coordinates_max)))
[...]
 Section              ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
 rhs!                  2.78k    314ms  90.0%   113μs   0.98MiB  65.1%     370B
   interface flux      2.78k    138ms  39.5%  49.6μs     0.00B  0.00%    0.00B
   volume integral     2.78k    101ms  28.8%  36.2μs     0.00B  0.00%    0.00B
   ~rhs!~              2.78k   34.1ms  9.77%  12.3μs   0.98MiB  65.1%     370B
   surface integral    2.78k   24.9ms  7.12%  8.94μs     0.00B  0.00%    0.00B
   Jacobian            2.78k   8.61ms  2.46%  3.10μs     0.00B  0.00%    0.00B
   reset ∂u/∂t         2.78k   7.89ms  2.26%  2.84μs     0.00B  0.00%    0.00B
   boundary flux       2.78k   76.1μs  0.02%  27.4ns     0.00B  0.00%    0.00B
   source terms        2.78k   68.7μs  0.02%  24.7ns     0.00B  0.00%    0.00B
 calculate dt            557   21.2ms  6.06%  38.0μs     0.00B  0.00%    0.00B
 analyze solution          7   13.9ms  3.99%  1.99ms    538KiB  34.9%  76.9KiB

julia> trixi_include(joinpath(examples_dir(), "tree_2d_dgsem", "elixir_euler_vortex.jl"),
           save_restart=TrivialCallback(), save_solution=TrivialCallback(),
           mesh=P4estMesh((1, 1), polydeg=3,
                   coordinates_min=coordinates_min, coordinates_max=coordinates_max,
                   initial_refinement_level=4))
[...]
 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 rhs!                    2.78k    669ms  95.6%   241μs   0.99MiB  65.3%     372B
   interface flux        2.78k    249ms  35.5%  89.4μs     0.00B  0.00%    0.00B
   prolong2interfaces    2.78k    246ms  35.2%  88.5μs     0.00B  0.00%    0.00B
   volume integral       2.78k   91.6ms  13.1%  32.9μs     0.00B  0.00%    0.00B
   ~rhs!~                2.78k   44.5ms  6.35%  16.0μs   0.99MiB  65.3%     372B
   surface integral      2.78k   23.3ms  3.33%  8.38μs     0.00B  0.00%    0.00B
   Jacobian              2.78k   7.66ms  1.09%  2.75μs     0.00B  0.00%    0.00B
   reset ∂u/∂t           2.78k   6.98ms  1.00%  2.51μs     0.00B  0.00%    0.00B
   prolong2boundaries    2.78k    156μs  0.02%  56.1ns     0.00B  0.00%    0.00B
   prolong2mortars       2.78k    127μs  0.02%  45.8ns     0.00B  0.00%    0.00B
   boundary flux         2.78k   94.2μs  0.01%  33.9ns     0.00B  0.00%    0.00B
   mortar flux           2.78k   90.9μs  0.01%  32.7ns     0.00B  0.00%    0.00B
   source terms          2.78k   58.8μs  0.01%  21.1ns     0.00B  0.00%    0.00B
 calculate dt              557   19.6ms  2.80%  35.2μs     0.00B  0.00%    0.00B
 analyze solution            7   11.1ms  1.59%  1.59ms    536KiB  34.7%  76.6KiB

(All timings are from a second run to exclude compilation times.) It's okay that the volume integral is a bit slower, since it needs to handle curvilinear geometry. However, everything related to the surfaces is also much slower. We should investigate this and try to fix it (if reasonably possible).

Related to #628 (but more general).

TODO

@sloede (Member) commented Jun 13, 2021

Have you compared the numbers to the same setup with the CurvedMesh or UnstructuredQuadMesh to find out whether it is maybe systemic to the curved solvers?

@ranocha (Member, Author) commented Jun 13, 2021

I added the structured mesh above. I don't have a mesh file to check the unstructured mesh right now.

@sloede (Member) commented Jun 13, 2021

Now we have 16.1 ms vs. 10.1 ms vs. 2.6 ms for the interface flux for P4estMesh vs. CurvedMesh vs. TreeMesh. Considering that P4estMesh and CurvedMesh need to take normal vectors into account compared to TreeMesh, and that P4estMesh additionally needs to handle flipped/different element-local coordinate systems compared to CurvedMesh, the differences seem at least explainable. However, I don't know why they are so large, or whether they have to be.

@ranocha (Member, Author) commented Jun 13, 2021

I added another example for the Euler equations. That's less dramatic but we should still have a look at it, I think. However, it doesn't have top priority, of course.

@ranocha added the 'performance (We are greedy)' label on Jun 13, 2021
@sloede (Member) commented Jun 13, 2021

Please don't get me wrong: thanks a lot for reporting this, and I do think we should investigate! With my comment above I was not trying to dismiss this issue, but rather to add to the discussion about potential reasons.

@ranocha (Member, Author) commented Jun 13, 2021

Appreciated 👍 Maybe it's just related to the possibility of unstructured interfaces and we need to find a better way to improve the performance of that part.

@ranocha (Member, Author) commented Jun 13, 2021

A first profiling run of the Euler example shows that a lot of time is spent in evaluate_index. Calling it in an inner loop such as

for i in eachnode(dg), v in eachvariable(equations)
  interfaces.u[1, v, i, interface] = u[v, evaluate_index(primary_indices, size_, 1, i),
                                          evaluate_index(primary_indices, size_, 2, i),
                                          primary_element]
  interfaces.u[2, v, i, interface] = u[v, evaluate_index(secondary_indices, size_, 1, i),
                                          evaluate_index(secondary_indices, size_, 2, i),
                                          secondary_element]
end

again and again is very inefficient. Instead, the correct loop indices should be determined once before these inner loops run.
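
For illustration, here is a minimal sketch of that idea, assuming a hypothetical helper index_start_step that translates a symbolic index description into a start value and a constant step. This is not the actual Trixi.jl code, and the eventual fix may look different:

# Hypothetical sketch (not the actual implementation): resolve the index
# pattern once per interface, then use plain integer counters in the hot loop
# instead of calling evaluate_index for every node and variable.
i_primary_start, i_primary_step = index_start_step(primary_indices[1], size_)
j_primary_start, j_primary_step = index_start_step(primary_indices[2], size_)

i_primary, j_primary = i_primary_start, j_primary_start
for i in eachnode(dg)
  for v in eachvariable(equations)
    interfaces.u[1, v, i, interface] = u[v, i_primary, j_primary, primary_element]
  end
  i_primary += i_primary_step
  j_primary += j_primary_step
end

The secondary element would be handled analogously, and the start/step values could even be computed once during interface initialization.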

@sloede (Member) commented Jun 13, 2021

Calling it in an inner loop such as ... again and again is very inefficient

Yes, this seems like a likely candidate for optimization. However, IMHO we should not touch this until we have support for 3D unstructured meshes in Trixi, on which @efaulhaber is currently working. Once that's finished, we can optimize it directly on the 3D code and then just do the same for 2D to make it easy to transition between the two.

@efaulhaber (Member) commented Jun 13, 2021

I was wondering whether there's a way to implement something similar to Colon (or :, which is a synonym) and index the array with such a new type, so that the indexing itself is handled by to_indices, similar to how Colon works.
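
For what it's worth, a minimal sketch of what such a Colon-like index type could look like; ReversedAxis is purely hypothetical and only covers reversing one axis, while a real implementation would have to encode the general interface index patterns:

# Hypothetical sketch: a custom index type that Base.to_indices expands into a
# concrete index range, similar to how Colon (:) is expanded.
struct ReversedAxis end

Base.to_indices(A, inds, I::Tuple{ReversedAxis, Vararg{Any}}) =
    (reverse(first(inds)), Base.to_indices(A, Base.tail(inds), Base.tail(I))...)

A = [1 2; 3 4]
A[ReversedAxis(), :]  # same result as A[2:-1:1, :], i.e. [3 4; 1 2]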

Another idea: It's possible to create a view like this.

julia> A = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> view(A, 2:-1:1, :)
2×2 view(::Matrix{Int64}, 2:-1:1, :) with eltype Int64:
 3  4
 1  2

I haven't tested yet whether this would be any more efficient, but it could be worth a try.

@efaulhaber (Member) commented Jun 13, 2021

Btw, where do all these allocations come from? I get this when running your first example:

julia> trixi_include("examples/2d/elixir_advection_extended.jl")
[...]
 Section                   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────
 rhs!                         201   4.40ms  39.2%  21.9μs   82.7KiB  21.9%     421B
   volume integral            201   2.59ms  23.0%  12.9μs     0.00B  0.00%    0.00B
   interface flux             201    564μs  5.01%  2.80μs     0.00B  0.00%    0.00B
   surface integral           201    480μs  4.27%  2.39μs     0.00B  0.00%    0.00B
   prolong2interfaces         201    361μs  3.21%  1.80μs     0.00B  0.00%    0.00B
   ~rhs!~                     201    221μs  1.96%  1.10μs   82.7KiB  21.9%     421B
   Jacobian                   201   93.0μs  0.83%   463ns     0.00B  0.00%    0.00B
   reset ∂u/∂t                201   62.1μs  0.55%   309ns     0.00B  0.00%    0.00B
   prolong2mortars            201   7.30μs  0.06%  36.3ns     0.00B  0.00%    0.00B
   prolong2boundaries         201   7.20μs  0.06%  35.8ns     0.00B  0.00%    0.00B
   source terms               201   6.70μs  0.06%  33.3ns     0.00B  0.00%    0.00B
   mortar flux                201   6.70μs  0.06%  33.3ns     0.00B  0.00%    0.00B
   boundary flux              201   6.40μs  0.06%  31.8ns     0.00B  0.00%    0.00B


julia> trixi_include(joinpath(examples_dir(), "2d", "elixir_advection_extended.jl"),
          mesh=CurvedMesh((16, 16), float.(coordinates_min), float.(coordinates_max)))
[...]
 Section                   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────
 rhs!                         206   5.12ms  43.1%  24.9μs   79.8KiB  17.5%     397B
   volume integral            206   3.04ms  25.6%  14.7μs     0.00B  0.00%    0.00B
   interface flux             206   1.14ms  9.58%  5.52μs     0.00B  0.00%    0.00B
   surface integral           206    451μs  3.80%  2.19μs     0.00B  0.00%    0.00B
   Jacobian                   206    246μs  2.07%  1.20μs     0.00B  0.00%    0.00B
   ~rhs!~                     206    174μs  1.47%   846ns   79.8KiB  17.5%     397B
   reset ∂u/∂t                206   63.7μs  0.54%   309ns     0.00B  0.00%    0.00B
   source terms               206   7.00μs  0.06%  34.0ns     0.00B  0.00%    0.00B
   boundary flux              206   5.00μs  0.04%  24.3ns     0.00B  0.00%    0.00B

julia> trixi_include(joinpath(examples_dir(), "2d", "elixir_advection_extended.jl"),
          mesh=P4estMesh((1, 1), polydeg=3,
                  coordinates_min=coordinates_min, coordinates_max=coordinates_max,
                  initial_refinement_level=4))
[...]
 Section                   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────
 rhs!                         206   10.5ms  57.1%  51.2μs   84.5KiB  18.2%     420B
   interface flux             206   4.34ms  23.5%  21.1μs     0.00B  0.00%    0.00B
   volume integral            206   3.05ms  16.5%  14.8μs     0.00B  0.00%    0.00B
   prolong2interfaces         206   2.06ms  11.2%  10.0μs     0.00B  0.00%    0.00B
   surface integral           206    458μs  2.48%  2.22μs     0.00B  0.00%    0.00B
   Jacobian                   206    263μs  1.43%  1.28μs     0.00B  0.00%    0.00B
   ~rhs!~                     206    259μs  1.40%  1.26μs   84.5KiB  18.2%     420B
   reset ∂u/∂t                206   74.5μs  0.40%   362ns     0.00B  0.00%    0.00B
   prolong2boundaries         206   8.50μs  0.05%  41.3ns     0.00B  0.00%    0.00B
   prolong2mortars            206   8.40μs  0.05%  40.8ns     0.00B  0.00%    0.00B
   mortar flux                206   8.40μs  0.05%  40.8ns     0.00B  0.00%    0.00B
   source terms               206   5.70μs  0.03%  27.7ns     0.00B  0.00%    0.00B
   boundary flux              206   5.50μs  0.03%  26.7ns     0.00B  0.00%    0.00B

Edit: This was with --check-bounds=no. Otherwise, the volume integral still takes the most time in the last example:

julia> trixi_include(joinpath(examples_dir(), "2d", "elixir_advection_extended.jl"),
          mesh=P4estMesh((1, 1), polydeg=3,
                  coordinates_min=coordinates_min, coordinates_max=coordinates_max,
                  initial_refinement_level=4))
[...]
 Section                   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────
 rhs!                         206   17.1ms  68.2%  83.1μs   84.5KiB  18.2%     420B
   volume integral            206   6.55ms  26.1%  31.8μs     0.00B  0.00%    0.00B
   interface flux             206   5.99ms  23.8%  29.1μs     0.00B  0.00%    0.00B
   prolong2interfaces         206   2.87ms  11.4%  13.9μs     0.00B  0.00%    0.00B
   surface integral           206    839μs  3.34%  4.07μs     0.00B  0.00%    0.00B
   Jacobian                   206    475μs  1.89%  2.31μs     0.00B  0.00%    0.00B
   ~rhs!~                     206    278μs  1.11%  1.35μs   84.5KiB  18.2%     420B
   reset ∂u/∂t                206   74.7μs  0.30%   363ns     0.00B  0.00%    0.00B
   prolong2boundaries         206   7.80μs  0.03%  37.9ns     0.00B  0.00%    0.00B
   prolong2mortars            206   7.60μs  0.03%  36.9ns     0.00B  0.00%    0.00B
   mortar flux                206   7.60μs  0.03%  36.9ns     0.00B  0.00%    0.00B
   boundary flux              206   7.40μs  0.03%  35.9ns     0.00B  0.00%    0.00B
   source terms               206   5.80μs  0.02%  28.2ns     0.00B  0.00%    0.00B

@ranocha (Member, Author) commented Jun 14, 2021

Btw, where do all these allocations come from?

Looks like I was running it on a computer with JULIA_NUM_THREADS set to something greater than one - the numbers above are now fixed.
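
For reference, the thread count of a running session can be checked directly; single-threaded runs give the allocation-free rhs! timings shown above:

julia> Threads.nthreads()
1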

@ranocha (Member, Author) commented Jun 14, 2021

However, IMHO we should not touch this until we have support for 3D unstructured in Trixi

Sure. As I said above, I opened this issue to record this observation @efaulhaber made in the PR linked above and to prepare some information that can be used to improve the solver based on the P4estMesh. As always, we should follow our usual steps of

  1. Make it work
  2. Make it fast

@ranocha (Member, Author) commented Aug 4, 2021

I was wondering whether there's a way to implement something similar to Colon (or :, which is a synonym) and index the array with such a new type, so that the indexing itself is handled by to_indices, similar to how Colon works.

Another idea: It's possible to create a view like this.

julia> A = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> view(A, 2:-1:1, :)
2×2 view(::Matrix{Int64}, 2:-1:1, :) with eltype Int64:
 3  4
 1  2

I haven't tested yet whether this would be any more efficient, but it could be worth a try.

The problem with such an approach is that we would get type instabilities, since the different views would have different types. Thus, we would also need to introduce additional function barriers/call sites to get efficient code...
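
A quick REPL check illustrates this: the index types become part of the SubArray type, so a view built with a UnitRange and one built with a StepRange already have different types.

julia> A = [1 2; 3 4];

julia> typeof(view(A, 1:2, :)) == typeof(view(A, 2:-1:1, :))
false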
