Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BL/TinyProfiler SyncBeforeComms #2762

Closed
wants to merge 25 commits into from

Conversation

kngott
Copy link
Contributor

@kngott kngott commented May 11, 2022

Summary

Adds macros and sync locations before FillBoundary and ParallelCopy. This allows codes to get an accurate picture of the time spent in their communications, removing any load imbalance that can be captured in these types of calls. Only works when running TinyProfiler or BLProfiler, and the time the MPI_Barrier took is reported as SyncBeforeComms.

Turned on with amrex.use_profiler_syncs = 1.

Additional background

Tested on Perlmutter with TinyProfiler. BLProfiler testing is underway.

FabArray::FillBoundary()                                    14262      79.74      82.08      84.49  10.94%
SyncBeforeComms                                             23262       16.8         37      77.84  10.08%
FabArray::ParallelCopy()                                     3000      58.95      68.46      73.11   9.47%

Also tweaked the position of the FillBoundary_nowait timer for consistency.
Where else should the sync be placed? (Particle Redistribute and FillPatch?)

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • changes answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@kngott kngott requested review from atmyers and WeiqunZhang May 11, 2022 21:36
@WeiqunZhang
Copy link
Member

Do we really need both BLProfileSync and TinyProfileSync?

@@ -240,6 +243,8 @@ FabArray<FAB>::FillBoundary_finish ()
fbd.reset();

#endif

BL_PROFILE_SYNC_STOP();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might too late because the function might return early.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed.

@kngott
Copy link
Contributor Author

kngott commented May 12, 2022

Do we really need both BLProfileSync and TinyProfileSync?

I suppose not, if we use BL_PROFILE for the timers and wrap it in a TINY_PROFILE or BASE_PROFILE ifdef. I can't imagine a substantial difference between the two.

Where do you see it living? In TinyProfiler? (because BLProfile is completely (ifdef-else if-else)'d over.)

@WeiqunZhang
Copy link
Member

I suppose it needs to be in BLProfile and visible to either profiler because AMReX_TinyProfiler.H is not included if AMREX_TINY_PROFILING is not included.

@WeiqunZhang
Copy link
Member

We also need to update the cmake logic. See #2774.

@kngott
Copy link
Contributor Author

kngott commented May 18, 2022

Any thoughts on where the runtime static variables would go?

BLProfiler doesn't exist at all if TinyProfiler is on, but the TinyProfiler.H file isn't loaded when BLProfiler is on. So, there currently isn't a single consistent namespace or class to hold the static values. There also isn't an consistent Initialize() with a ParmParse to check for the runtime flag in either. Is there a higher-level place that makes sense to keep it? i.e. system in AMReX.H?

@WeiqunZhang
Copy link
Member

$ git grep "BL_PROFILE_INITIALIZE"
Src/Base/AMReX.cpp:    BL_PROFILE_INITIALIZE();
Src/Base/AMReX_BLProfiler.H:#define BL_PROFILE_INITIALIZE()  amrex::BLProfiler::Initialize();
Src/Base/AMReX_BLProfiler.H:#define BL_PROFILE_INITIALIZE()
Src/Base/AMReX_BLProfiler.H:#define BL_PROFILE_INITIALIZE()
$ git grep "BL_TINY_PROFILE_INITIALIZE"
Src/Base/AMReX.cpp:    BL_TINY_PROFILE_INITIALIZE();
Src/Base/AMReX_BLProfiler.H:#define BL_TINY_PROFILE_INITIALIZE()
Src/Base/AMReX_BLProfiler.H:#define BL_TINY_PROFILE_INITIALIZE()   amrex::TinyProfiler::Initialize()
Src/Base/AMReX_BLProfiler.H:#define BL_TINY_PROFILE_INITIALIZE()

Change to something like amrex::BLProfiler::Initialize(); amerx::ProfileSync::Initialize();?

jrood-nrel and others added 16 commits May 18, 2022 21:35
The default stream (e.g., stream used outside MFIter) used to be the null
stream for CUDA and HIP.  By default, there is implicit synchronization
between the null stream and other streams.  To avoid that, the default
stream in AMReX is now no longer the null stream.

The behavior of Gpu::synchronize being device wide synchronization has not
changed.  However, for most of its use cases, it can be replaced by a new
function Gpu::streamSynchronizeAll that will synchronize the activities on
all AMReX streams without performing a device wide synchronization that
could potentially interfere with other libraries (e.g., MPI).

The behavior of [dtod|dtoh|htod]_memcpy has changed.  For CUDA and HIP,
these functions used to call the synchronous version of the memcpy.
However, the exact synchronization behavior depends on the memory types.
For SYCL/DPC++, there is no equivalent form because a queue (i.e., stream)
must be specified.  Furthermore, there is no guarantee of consistence across
different vendor platforms.  This has now changed to calling the
asynchronous form using the current stream followed by a stream
synchronization.
The time used for computing velocity in the non-subcycling mode is
incorrect.

Close AMReX-Codes#2725
Make the dt in the AmrLevel test consistent with that in the AmrCore Test.
That is we use the velocity at t+0.5*dt (here dt is from the previous step)
to estimate the dt for the next step.
On Perlmutter, `g++ -O3 -march=znver3` produces lots of stringop-overflow warnings in
FabConv.  These warnings are false positive because the compiler does not know
sizeof(amrex::Real) is either 4 or 8.  This commit fixes the warnings.

Close AMReX-Codes#2750
The move constructor and assignment operator for `AmrCore` with
particles was broken.

When moving `AmrParGDB`, its internal `m_amrcore` pointer needs
to be updated, too.
This allows, for example, refining based on the mass in a cell rather than only on its density.

A function to obtain the cell volume at runtime given an IntVect, that can be run inside a ParallelFor, is added to Geometry.
* CI--HIP: wget gpg key from https instead of http

* change other http to https
@kngott
Copy link
Contributor Author

kngott commented May 19, 2022

Accidentally rebased in here, so will reopen in a new PR that doesn't have a broken history.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.