BL/TinyProfiler SyncBeforeComms #2762

kngott · 2022-05-11T21:36:20Z

Summary

Adds macros and sync points before FillBoundary and ParallelCopy. This lets codes get an accurate picture of the time spent in communication itself, by separating out the load imbalance that would otherwise be absorbed into these calls. Only works when running with TinyProfiler or BLProfiler; the time the MPI_Barrier takes is reported as SyncBeforeComms.

Turned on with amrex.use_profiler_syncs = 1.
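
As a rough illustration of the idea in the summary, the sketch below times a barrier separately from the communication that follows it. It is not the code in this PR: `TimerTable`, `timed_comm`, and the callback parameters are hypothetical names, and the barrier is passed in as a callable so the example does not depend on MPI (in practice it would wrap `MPI_Barrier`).

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for a profiler's timer table: name -> accumulated seconds.
using TimerTable = std::map<std::string, double>;

// Runs `barrier` and charges its wall time to "SyncBeforeComms", then runs
// `comm` and charges its wall time to `comm_name`.
// When `use_syncs` is false the barrier is skipped, so any time spent waiting
// for slower ranks ends up inside the communication timer instead.
inline void timed_comm(TimerTable& timers, bool use_syncs,
                       const std::string& comm_name,
                       const std::function<void()>& barrier,
                       const std::function<void()>& comm)
{
    using clock = std::chrono::steady_clock;
    auto seconds = [](clock::time_point a, clock::time_point b) {
        return std::chrono::duration<double>(b - a).count();
    };

    if (use_syncs) {
        const auto t0 = clock::now();
        barrier();  // assumption: wraps MPI_Barrier in a real run
        timers["SyncBeforeComms"] += seconds(t0, clock::now());
    }

    const auto t1 = clock::now();
    comm();         // e.g. the body of a FillBoundary-like call
    timers[comm_name] += seconds(t1, clock::now());
}
```

With the sync enabled, waiting caused by load imbalance is charged to `SyncBeforeComms`, and the communication timer reflects only the work done after all ranks have arrived.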

Additional background

Tested on Perlmutter with TinyProfiler. BLProfiler testing is underway.

FabArray::FillBoundary()                                    14262      79.74      82.08      84.49  10.94%
SyncBeforeComms                                             23262       16.8         37      77.84  10.08%
FabArray::ParallelCopy()                                     3000      58.95      68.46      73.11   9.47%

Also tweaked the position of the FillBoundary_nowait timer for consistency.
Where else should the sync be placed? (Particle Redistribute and FillPatch?)

Checklist

The proposed changes:

fix a bug or incorrect behavior in AMReX
add new capabilities to AMReX
changes answers in the test suite to more than roundoff level
are likely to significantly affect the results of downstream AMReX users
include documentation in the code and/or rst files, if appropriate

WeiqunZhang · 2022-05-12T22:48:16Z

Do we really need both BLProfileSync and TinyProfileSync?

WeiqunZhang · 2022-05-12T22:58:27Z

Src/Base/AMReX_FabArrayCommI.H

@@ -240,6 +243,8 @@ FabArray<FAB>::FillBoundary_finish ()
    fbd.reset();

 #endif
+
+    BL_PROFILE_SYNC_STOP();


It might be too late because the function might return early.

Should be fixed.
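
One common way to handle the early-return concern raised above is a scope guard that stops the timer in its destructor. The sketch below only illustrates that pattern; it is not necessarily how this PR addressed it, and `SyncStopGuard` and `finish_like` are hypothetical names.

```cpp
#include <functional>
#include <utility>

// Hypothetical guard: runs `stop` when it goes out of scope, so the sync timer
// is closed on every return path, including early returns.
class SyncStopGuard {
public:
    explicit SyncStopGuard(std::function<void()> stop) : m_stop(std::move(stop)) {}
    ~SyncStopGuard() { if (m_stop) { m_stop(); } }

    // Non-copyable so the stop action cannot run twice.
    SyncStopGuard(const SyncStopGuard&) = delete;
    SyncStopGuard& operator=(const SyncStopGuard&) = delete;

private:
    std::function<void()> m_stop;
};

// Stand-in for a function shaped like FillBoundary_finish: it may return
// before reaching its last line. `stops` counts how often the timer is stopped.
inline void finish_like(bool nothing_to_do, int& stops)
{
    SyncStopGuard guard([&stops] { ++stops; });
    if (nothing_to_do) { return; }  // early return: the guard still fires
    // ... unpack receive buffers here ...
}
```

Calling `finish_like(true, n)` and `finish_like(false, n)` both increment `n` exactly once, which is the property a stop call placed only at the end of the function does not have.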

kngott · 2022-05-12T23:22:46Z

Do we really need both BLProfileSync and TinyProfileSync?

I suppose not, if we use BL_PROFILE for the timers and wrap it in a TINY_PROFILE or BASE_PROFILE ifdef. I can't imagine a substantial difference between the two.

Where do you see it living? In TinyProfiler? (because BLProfile is completely (ifdef-else if-else)'d over.)

WeiqunZhang · 2022-05-16T14:46:51Z

I suppose it needs to be in BLProfile and visible to either profiler because AMReX_TinyProfiler.H is not included if AMREX_TINY_PROFILING is not defined.
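
A minimal sketch of what "visible to either profiler" could look like at the preprocessor level. All `SKETCH_*` names are hypothetical stand-ins for the real build flags and macros; the point is only that one header defines the sync macro whenever either profiler is compiled in, and defines it as a no-op otherwise.

```cpp
// Hypothetical compile-time switches standing in for the two profilers' flags.
// Define either one (or both) to turn the sync macro into real code.
#define SKETCH_TINY_PROFILE 1
// #define SKETCH_BASE_PROFILE 1

namespace sketch {
// Counts how many sync points were hit; a real version would start a timer
// and call a barrier here instead.
inline int& sync_count() { static int n = 0; return n; }
}

#if defined(SKETCH_BASE_PROFILE) || defined(SKETCH_TINY_PROFILE)
#  define SKETCH_PROFILE_SYNC() (++sketch::sync_count())
#else
#  define SKETCH_PROFILE_SYNC() ((void)0)
#endif
```

Call sites then use `SKETCH_PROFILE_SYNC();` unconditionally and do not need to know which profiler, if any, is active.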

WeiqunZhang · 2022-05-16T17:56:11Z

We also need to update the cmake logic. See #2774.

kngott · 2022-05-18T05:23:42Z

Any thoughts on where the runtime static variables would go?

BLProfiler doesn't exist at all if TinyProfiler is on, but the TinyProfiler.H file isn't loaded when BLProfiler is on. So, there currently isn't a single consistent namespace or class to hold the static values. There also isn't a consistent Initialize() with a ParmParse to check for the runtime flag in either. Is there a higher-level place that makes sense to keep it, e.g. system in AMReX.H?

WeiqunZhang · 2022-05-18T15:23:31Z

$ git grep "BL_PROFILE_INITIALIZE"
Src/Base/AMReX.cpp:    BL_PROFILE_INITIALIZE();
Src/Base/AMReX_BLProfiler.H:#define BL_PROFILE_INITIALIZE()  amrex::BLProfiler::Initialize();
Src/Base/AMReX_BLProfiler.H:#define BL_PROFILE_INITIALIZE()
Src/Base/AMReX_BLProfiler.H:#define BL_PROFILE_INITIALIZE()

$ git grep "BL_TINY_PROFILE_INITIALIZE"
Src/Base/AMReX.cpp:    BL_TINY_PROFILE_INITIALIZE();
Src/Base/AMReX_BLProfiler.H:#define BL_TINY_PROFILE_INITIALIZE()
Src/Base/AMReX_BLProfiler.H:#define BL_TINY_PROFILE_INITIALIZE()   amrex::TinyProfiler::Initialize()
Src/Base/AMReX_BLProfiler.H:#define BL_TINY_PROFILE_INITIALIZE()

Change to something like amrex::BLProfiler::Initialize(); amrex::ProfileSync::Initialize();?
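
A minimal sketch of what such a `ProfileSync` class might hold, assuming it owns the runtime flag and reads it once at start-up. This is illustrative only: the `sketch` namespace and the `Inputs` map are hypothetical, with the map standing in for ParmParse so the example stays self-contained.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for runtime inputs; in AMReX this role is played by ParmParse.
using Inputs = std::map<std::string, std::string>;

namespace sketch {

// One home for the runtime switch, independent of which profiler is compiled in.
class ProfileSync {
public:
    // Reads "amrex.use_profiler_syncs"; a missing key leaves syncs off.
    static void Initialize(const Inputs& inputs)
    {
        const auto it = inputs.find("amrex.use_profiler_syncs");
        flag() = (it != inputs.end() && it->second == "1");
    }

    static bool Enabled() { return flag(); }

private:
    // Function-local static avoids needing a separate definition in a .cpp file.
    static bool& flag()
    {
        static bool use_syncs = false;
        return use_syncs;
    }
};

} // namespace sketch
```

Both profiler initialization macros could then expand to their own `Initialize()` followed by `ProfileSync::Initialize()`, so the flag is set no matter which profiler is enabled.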

The default stream (e.g., stream used outside MFIter) used to be the null stream for CUDA and HIP. By default, there is implicit synchronization between the null stream and other streams. To avoid that, the default stream in AMReX is now no longer the null stream. The behavior of Gpu::synchronize being device wide synchronization has not changed. However, for most of its use cases, it can be replaced by a new function Gpu::streamSynchronizeAll that will synchronize the activities on all AMReX streams without performing a device wide synchronization that could potentially interfere with other libraries (e.g., MPI). The behavior of [dtod|dtoh|htod]_memcpy has changed. For CUDA and HIP, these functions used to call the synchronous version of the memcpy. However, the exact synchronization behavior depends on the memory types. For SYCL/DPC++, there is no equivalent form because a queue (i.e., stream) must be specified. Furthermore, there is no guarantee of consistency across different vendor platforms. This has now changed to calling the asynchronous form using the current stream followed by a stream synchronization.

On Perlmutter, `g++ -O3 -march=znver3` produces lots of stringop-overflow warnings in FabConv. These warnings are false positives because the compiler does not know sizeof(amrex::Real) is either 4 or 8. This commit fixes the warnings. Close AMReX-Codes#2750

This allows, for example, refining based on the mass in a cell rather than only on its density. A function to obtain the cell volume at runtime given an IntVect, that can be run inside a ParallelFor, is added to Geometry.

kngott · 2022-05-19T04:44:25Z

Accidentally rebased in here, so will reopen in a new PR that doesn't have a broken history.

kngott added 4 commits May 10, 2022 16:19

First draft.

f236e57

Base Profile Fixes.

c6fd925

Adjust timers and syncs for consistency.

8bafe67

Extra lines.

bea31cb

kngott requested review from atmyers and WeiqunZhang May 11, 2022 21:36

WeiqunZhang reviewed May 12, 2022

Take into account leaving finishes early.

088771c

jrood-nrel and others added 16 commits May 18, 2022 21:35

Add HDF5 H5Z-ZFP support in CMake (AMReX-Codes#2753)

08638fc

add scomp and ncomp arguments to IntegratorOps functions. (AMReX-Codes#2759)

cdc0daa

multilevel version of writeplotfiletoascii (AMReX-Codes#2742)

a4bf36e

Fix the Advection_AmrCore test (AMReX-Codes#2761)

60a5547

The time used for computing velocity in the non-subcycling mode is incorrect. Close AMReX-Codes#2725

Time step in the AmrLevel test (AMReX-Codes#2763)

45f9617

Make the dt in the AmrLevel test consistent with that in the AmrCore Test. That is we use the velocity at t+0.5*dt (here dt is from the previous step) to estimate the dt for the next step.

this updates to recent Hypre API changes (AMReX-Codes#2765)

6a8011c

Fix maybe-uninitialized warning in calling mlock (AMReX-Codes#2768)

b6a4e64

Update particle << operator after changes to id/cpu (AMReX-Codes#2769)

80a15e4

Fix: AmrCore Move (AMReX-Codes#2773)

92ace57

The move constructor and assignment operator for `AmrCore` with particles was broken. When moving `AmrParGDB`, its internal `m_amrcore` pointer needs to be updated, too.

Change repo html address to Ubuntu 20.04 (AMReX-Codes#2766)

2ab2135

CI--HIP: wget gpg key from https instead of http (AMReX-Codes#2771)

e04edae

* CI--HIP: wget gpg key from https instead of http * change other http to https

configure value of AMReX_GPU_RDC flag for use in cmake find_package(AMReX ...) (AMReX-Codes#2770)

5df7ff4

Fix the bug in the CMake build with AMReX_BASE_PROFILE. (AMReX-Codes#2774)

7bbe9fa

ax3l and others added 4 commits May 18, 2022 21:35

AmrCore: Include utility (AMReX-Codes#2778)

32b1a0b

Add the `<utility>` header for `std::move`.

Add some timestep controls to the AMReX TimeIntegrator class for its integrate() driver function. (AMReX-Codes#2780)

899d907

Remove a Sync object.

a0f5b04

Back to before

42a189e

kngott closed this May 19, 2022

kngott mentioned this pull request Oct 3, 2022

enable GPU-aware MPI when performance conditions are met #2967

Closed
