-
Notifications
You must be signed in to change notification settings - Fork 374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BL/TinyProfiler SyncBeforeComms #2762
Conversation
Do we really need both |
@@ -240,6 +243,8 @@ FabArray<FAB>::FillBoundary_finish () | |||
fbd.reset(); | |||
|
|||
#endif | |||
|
|||
BL_PROFILE_SYNC_STOP(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might too late because the function might return early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed.
I suppose not, if we use BL_PROFILE for the timers and wrap it in a TINY_PROFILE or BASE_PROFILE ifdef. I can't imagine a substantial difference between the two. Where do you see it living? In TinyProfiler? (because BLProfile is completely (ifdef-else if-else)'d over.) |
I suppose it needs to be in BLProfile and visible to either profiler because AMReX_TinyProfiler.H is not included if |
We also need to update the cmake logic. See #2774. |
Any thoughts on where the runtime static variables would go? BLProfiler doesn't exist at all if TinyProfiler is on, but the TinyProfiler.H file isn't loaded when BLProfiler is on. So, there currently isn't a single consistent namespace or class to hold the static values. There also isn't an consistent |
Change to something like |
The default stream (e.g., stream used outside MFIter) used to be the null stream for CUDA and HIP. By default, there is implicit synchronization between the null stream and other streams. To avoid that, the default stream in AMReX is now no longer the null stream. The behavior of Gpu::synchronize being device wide synchronization has not changed. However, for most of its use cases, it can be replaced by a new function Gpu::streamSynchronizeAll that will synchronize the activities on all AMReX streams without performing a device wide synchronization that could potentially interfere with other libraries (e.g., MPI). The behavior of [dtod|dtoh|htod]_memcpy has changed. For CUDA and HIP, these functions used to call the synchronous version of the memcpy. However, the exact synchronization behavior depends on the memory types. For SYCL/DPC++, there is no equivalent form because a queue (i.e., stream) must be specified. Furthermore, there is no guarantee of consistence across different vendor platforms. This has now changed to calling the asynchronous form using the current stream followed by a stream synchronization.
The time used for computing velocity in the non-subcycling mode is incorrect. Close AMReX-Codes#2725
Make the dt in the AmrLevel test consistent with that in the AmrCore Test. That is we use the velocity at t+0.5*dt (here dt is from the previous step) to estimate the dt for the next step.
On Perlmutter, `g++ -O3 -march=znver3` produces lots of stringop-overflow warnings in FabConv. These warnings are false positive because the compiler does not know sizeof(amrex::Real) is either 4 or 8. This commit fixes the warnings. Close AMReX-Codes#2750
The move constructor and assignment operator for `AmrCore` with particles was broken. When moving `AmrParGDB`, its internal `m_amrcore` pointer needs to be updated, too.
This allows, for example, refining based on the mass in a cell rather than only on its density. A function to obtain the cell volume at runtime given an IntVect, that can be run inside a ParallelFor, is added to Geometry.
* CI--HIP: wget gpg key from https instead of http * change other http to https
Add the `<utility>` header for `std::move`.
…integrate() driver function. (AMReX-Codes#2780)
Accidentally rebased in here, so will reopen in a new PR that doesn't have a broken history. |
Summary
Adds macros and sync locations before FillBoundary and ParallelCopy. This allows codes to get an accurate picture of the time spent in their communications, removing any load imbalance that can be captured in these types of calls. Only works when running TinyProfiler or BLProfiler, and the time the MPI_Barrier took is reported as
SyncBeforeComms
.Turned on with
amrex.use_profiler_syncs = 1
.Additional background
Tested on Perlmutter with TinyProfiler. BLProfiler testing is underway.
Also tweaked the position of the FillBoundary_nowait timer for consistency.
Where else should the sync be placed? (Particle Redistribute and FillPatch?)
Checklist
The proposed changes: