Better separation of concerns in the communication framework #3907

Closed · jngrad opened this issue Sep 22, 2020 · 4 comments · Fixed by #3971

Comments

jngrad (Member) commented Sep 22, 2020

Background

ESPResSo uses a custom MPI communication framework in which everything is managed from the head node. Functions visible in the script interface must be implemented twice: once as a worker function, and once as a head function that contains the same body as the worker function, plus a preamble that triggers the worker function on the worker nodes via mpi_call_*(). The body usually contains boost::mpi and MPI_* calls to synchronize data. Often, the body can be factored out into a separate free function so that it is implemented only once. These head and worker functions (a.k.a. "callbacks") are managed by our MpiCallbacks framework, which executes them via a visitor pattern.
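
The pattern looks roughly like the following sketch with a hypothetical mpi_set_value callback. The REGISTER_CALLBACK macro and mpi_call() helper are the registration and trigger mechanisms of the MpiCallbacks framework, but their exact signatures and header locations should be treated as assumptions here.

#include "communication.hpp" // mpi_call() and REGISTER_CALLBACK in the real tree

static double local_value = 0.;

/** Worker function: executed on every worker node when triggered. */
void mpi_set_value_slave(double value) { local_value = value; }

REGISTER_CALLBACK(mpi_set_value_slave)

/** Head function: visible to the script interface, runs on the head node. */
void mpi_set_value(double value) {
  mpi_call(mpi_set_value_slave, value); // preamble: trigger the workers
  local_value = value;                  // same body as the worker function
}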

Problem statement

The source file communication.cpp is the central place to register callback functions, which should be defined in the relevant .cpp files and declared in the corresponding .hpp files. However, for historical reasons, many callbacks have been written directly in communication.cpp, which means this file has to know about the implementation details of a variety of classes/structs. This prevents us from hiding implementation details of these classes/structs (e.g. making members private, replacing manual memory management with standard containers, etc.) and can lead to code duplication when the body of the worker function is copy-pasted into the body of the head function. Duplicated code will diverge over time.

Prior work

Between 4.1 and 4.3 (waLBerla branch), many callbacks were moved to dedicated source files. For example, the NpT callback was moved to npt.cpp (d21c7a4), because communication.cpp doesn't need to know the internal structure of the NpT parameters class; it only needs to know that a callback function exists, which is visible via a function prototype in npt.hpp. The steepest descent parameters object goes one step further and has static storage, making it inaccessible outside its source file:

/** Currently active steepest descent instance */
static SteepestDescentParameters params{};

Course of action

Here are a few candidates that could be simplified further or moved to dedicated files:

  • mpi_bcast_ia_params() is written twice, and the two copies have now diverged: the worker implementation allocates a pointer with new and overwrites the old pointer without a delete, leading to a memory leak. Adding a delete before the new doesn't work, because the MPI communication step that precedes it has already overwritten the address of the allocated memory (see Memory leaks in tabulated bonds #3906 for details).
  • mpi_bcast_coulomb_params_slave() needs to know about the Coulomb and dipolar parameter structs. It should instead rely on boost::serialization, with the callbacks written in the relevant electrostatics .cpp files (a sketch follows after this list).
  • mpi_gather_stats() mixes pressure/energy observables with LB communication. The LB communication logic was removed in 4.3. This can be simplified further in 4.3 by replacing the enum class GatherStats and the switch statement with a 0/1 integer and a conditional, and by removing the code duplication between the head and worker functions.
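
As a hedged illustration of the boost::serialization approach mentioned in the second bullet: the struct name, fields and callback name below are made up, and only comm_cart and boost::mpi::broadcast() are taken from the existing code.

#include <boost/mpi/collectives.hpp>
#include <boost/mpi/communicator.hpp>

extern boost::mpi::communicator comm_cart; // provided by communication.hpp

// Hypothetical parameter struct; the real Coulomb/dipolar structs differ.
struct CoulombParameters {
  double prefactor = 0.;
  double r_cut = 0.;

  // boost::serialization hook: lets boost::mpi transfer the whole struct
  template <class Archive> void serialize(Archive &ar, unsigned int) {
    ar &prefactor;
    ar &r_cut;
  }
};

// Callback living in the relevant electrostatics .cpp file: broadcast the
// head node's parameters without exposing struct internals to communication.cpp.
void mpi_bcast_coulomb_params_local(CoulombParameters &params) {
  boost::mpi::broadcast(comm_cart, params, 0);
}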

Challenges

Moving the remaining callbacks from communication.cpp to dedicated .cpp/.hpp files means we have to rely on the comm_cart global variable. Passing this global by reference means the dedicated .hpp files that declare callbacks must include MPI headers. However, some parts of the CUDA code are not allowed to include MPI header files; therefore, features that are used by CUDA code and rely on MPI communication must either use the comm_cart global variable via extern, or rely on a second header file, e.g. npt_mpi.hpp, to declare the prototype of the callback.
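
A rough sketch of the two options, using the NpT example; the file layout and function names are assumptions rather than the actual ESPResSo headers.

// Option 1 (npt.hpp): keep the prototype MPI-free so the header remains safe
// to include from CUDA code; the definition in npt.cpp accesses comm_cart
// via extern.
void mpi_bcast_npt_params();

// Option 2 (npt_mpi.hpp): move MPI-typed prototypes into a second header
// that is never included from CUDA translation units.
#include <boost/mpi/communicator.hpp>
void mpi_bcast_npt_params(boost::mpi::communicator const &comm);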

Outlook

This refactoring will facilitate hiding implementation details of parameter structs and help find new ways to simplify the callback infrastructure. For example, #3891 removed one level of MPI indirection by having mpi_steepest_descent() call integrate() via mpi_steepest_descent_slave() instead of calling a chain of callbacks. In fact, mpi_steepest_descent() could probably call mpi_integrate_slave() directly instead of mpi_steepest_descent_slave(), removing yet another MPI indirection. This refactoring will also help in the long term if we choose to move in the direction of MPI+OpenMP or pure OpenMP.

jngrad added the Core label on Sep 22, 2020

mkuron (Member) commented Sep 22, 2020

> However, some parts of the CUDA code are not allowed to include MPI header files, therefore features that are used by CUDA code and rely on MPI communication must either use the comm_cart global variable via extern, or rely on a second header file, e.g. npt_mpi.hpp, to write the prototype of the callback.

That's an anachronism from the autotools build system (which would call mpic++ and nvcc; nowadays we use the MPI library search path directly with c++ or nvcc) and from the days when the OpenMPI headers contained syntax that could not be parsed by nvcc (that was fixed around CUDA 5.5).


RudolfWeeber (Contributor) commented Sep 22, 2020 via email


jngrad (Member, Author) commented Oct 22, 2020

Addendum: there are a lot of places in the code where free functions that should have internal linkage end up with external linkage, just so that they can be used in communication.cpp. Some of them carry a clear warning that they should not be used for any other purpose:

/** Used by \ref mpi_rescale_particles, should not be used elsewhere.
 *  Locally rescale all particles on current node.
 *  @param dir   direction to scale (0/1/2 = x/y/z, 3 = x+y+z isotropically)
 *  @param scale factor by which to rescale (>1: stretch, <1: contract)
 */
void local_rescale_particles(int dir, double scale);

But most of them don't. Furthermore, many of them have side effects in the form of events, so accidentally calling one of them can have a noticeable impact on performance. This design is not sustainable. It is also not difficult to fix: it's only a matter of moving the callbacks from communication.cpp to the .cpp files where they are used (jngrad/espresso:fix-3906...fix-3907). The free functions can then be made static, or merged into the callbacks directly, so that they are not accessible from other translation units.
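
A hedged sketch of the intended end state for such a helper; the registration macro and event call follow names used in the code base, but the exact headers and signatures are assumptions.

#include "communication.hpp" // REGISTER_CALLBACK (via MpiCallbacks.hpp) in the real tree
#include "event.hpp"         // on_particle_change()

// Internal linkage: the helper can no longer be called from other
// translation units, so it cannot be invoked accidentally.
static void local_rescale_particles(int dir, double scale) {
  // ... rescale all particles on the current node ...
}

// The callback wrapping the helper lives in the same .cpp file; the event
// side effect stays next to it instead of being hidden in communication.cpp.
void mpi_rescale_particles_local(int dir, double scale) {
  local_rescale_particles(dir, scale);
  on_particle_change();
}

REGISTER_CALLBACK(mpi_rescale_particles_local)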

jngrad (Member, Author) commented Oct 23, 2020

Also, there doesn't seem to be a good reason for the following header files to be indirectly included in 90 .cpp files:

#include "Particle.hpp"
#include "cuda_init.hpp"
#include "grid_based_algorithms/lb_constants.hpp"

kodiakhq bot closed this as completed in #3971 on Nov 2, 2020
kodiakhq bot added a commit that referenced this issue Nov 2, 2020
Fixes #3907

Move callbacks from `communication.cpp` to the relevant source files. This hides implementation details of many submodules, which no longer need to expose local functions in the public interface with doxygen blocks stating for example "Used by \ref mpi_place_particle, should not be used elsewhere". This also reduces the visibility of `MpiCallbacks.hpp`, `communication.hpp`, `Particle.hpp` and `cuda_init.hpp`. When the head function and worker function have the same code, remove code duplication using `mpi_call_all()`. Remove the `CALLBACK_LIST` macro and split `mpi_gather_stats()` into independent callbacks.
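
For reference, a minimal sketch of the deduplicated pattern enabled by mpi_call_all(): the hypothetical mpi_set_value example from the Background section, rewritten so that the body exists only once and runs on the head node and all workers. The exact signature of mpi_call_all() is inferred from the commit description and should be treated as an assumption.

#include "communication.hpp"

static double local_value = 0.;

// Single body, registered once; runs on the head node and on all workers.
void mpi_set_value_local(double value) { local_value = value; }

REGISTER_CALLBACK(mpi_set_value_local)

// Head function seen by the script interface: no duplicated body anymore.
void mpi_set_value(double value) {
  mpi_call_all(mpi_set_value_local, value);
}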