Better separation of concerns in the communication framework #3907

Closed · jngrad opened this issue Sep 22, 2020 · 4 comments · Fixed by #3971

Comments

jngrad (Member) commented Sep 22, 2020

Background

ESPResSo uses a custom MPI communication framework in which everything is managed from the head node. Functions visible in the script interface must be implemented twice: once as a worker function, and once as a head function that contains the same body as the worker function, plus a preamble that triggers the worker function on the worker nodes via mpi_call_*(). The body usually contains boost::mpi and MPI_* calls to synchronize data. Often, the body can be factored out into a separate free function so that it is implemented only once. These head and worker functions (a.k.a. "callbacks") are managed by our MpiCallbacks framework, which executes them via a visitor pattern.
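
The pattern looks roughly like the following sketch with a hypothetical mpi_set_value callback. The REGISTER_CALLBACK macro and mpi_call() helper are the registration and trigger mechanisms of the MpiCallbacks framework, but their exact signatures and header locations should be treated as assumptions here.

#include "communication.hpp" // mpi_call() and REGISTER_CALLBACK in the real tree

static double local_value = 0.;

/** Worker function: executed on every worker node when triggered. */
void mpi_set_value_slave(double value) { local_value = value; }

REGISTER_CALLBACK(mpi_set_value_slave)

/** Head function: visible to the script interface, runs on the head node. */
void mpi_set_value(double value) {
  mpi_call(mpi_set_value_slave, value); // preamble: trigger the workers
  local_value = value;                  // same body as the worker function
}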

Problem statement

The source file communication.cpp is the central place to register callback functions, which should be defined in the relevant .cpp files and declared in the corresponding .hpp files. However, for historical reasons, many callbacks have been written directly in communication.cpp, which means this file has to know about the implementation details of a variety of classes/structs. This prevents us from hiding implementation details of these classes/structs (e.g. making members private, replacing manual memory management with standard containers, etc.) and can lead to code duplication when the body of the worker function is copy-pasted into the body of the head function. Duplicated code will diverge over time.

Prior work

Between 4.1 and 4.3 (waLBerla branch), many callbacks were moved to dedicated source files. For example, the NpT callback was moved to npt.cpp (d21c7a4), because communication.cpp doesn't need to know the internal structure of the NpT parameters class; it only needs to know that a callback function exists, which is visible via a function prototype in npt.hpp. The steepest descent parameters object goes one step further and has static storage, making it inaccessible outside its source file:

/** Currently active steepest descent instance */
static SteepestDescentParameters params{};

Course of action

Here are a few candidates that could be simplified further or moved to dedicated files:

  • mpi_bcast_ia_params() is written twice, and the two copies have now diverged: the worker implementation allocates a pointer with new and overwrites the old pointer without a delete, leading to a memory leak. Adding a delete before the new doesn't work, because the MPI communication step that precedes it has already overwritten the address of the allocated memory (see Memory leaks in tabulated bonds #3906 for details).
  • mpi_bcast_coulomb_params_slave() needs to know about the Coulomb and dipolar parameter structs. It should instead rely on boost::serialization, with the callbacks written in the relevant electrostatics .cpp files (a sketch follows after this list).
  • mpi_gather_stats() mixes pressure/energy observables with LB communication. The LB communication logic was removed in 4.3. This can be simplified further in 4.3 by replacing the enum class GatherStats and the switch statement with a 0/1 integer and a conditional, and by removing the code duplication between the head and worker functions.
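
As a hedged illustration of the boost::serialization approach mentioned in the second bullet: the struct name, fields and callback name below are made up, and only comm_cart and boost::mpi::broadcast() are taken from the existing code.

#include <boost/mpi/collectives.hpp>
#include <boost/mpi/communicator.hpp>

extern boost::mpi::communicator comm_cart; // provided by communication.hpp

// Hypothetical parameter struct; the real Coulomb/dipolar structs differ.
struct CoulombParameters {
  double prefactor = 0.;
  double r_cut = 0.;

  // boost::serialization hook: lets boost::mpi transfer the whole struct
  template <class Archive> void serialize(Archive &ar, unsigned int) {
    ar &prefactor;
    ar &r_cut;
  }
};

// Callback living in the relevant electrostatics .cpp file: broadcast the
// head node's parameters without exposing struct internals to communication.cpp.
void mpi_bcast_coulomb_params_local(CoulombParameters &params) {
  boost::mpi::broadcast(comm_cart, params, 0);
}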

Challenges

Moving the remaining callbacks from communication.cpp to dedicated .cpp/.hpp files means we have to rely on the comm_cart global variable. Passing this global by reference means the dedicated .hpp files that declare callbacks must include MPI headers. However, some parts of the CUDA code are not allowed to include MPI header files; therefore, features that are used by CUDA code and rely on MPI communication must either use the comm_cart global variable via extern, or rely on a second header file, e.g. npt_mpi.hpp, to declare the prototype of the callback.
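
A rough sketch of the two options, using the NpT example; the file layout and function names are assumptions rather than the actual ESPResSo headers.

// Option 1 (npt.hpp): keep the prototype MPI-free so the header remains safe
// to include from CUDA code; the definition in npt.cpp accesses comm_cart
// via extern.
void mpi_bcast_npt_params();

// Option 2 (npt_mpi.hpp): move MPI-typed prototypes into a second header
// that is never included from CUDA translation units.
#include <boost/mpi/communicator.hpp>
void mpi_bcast_npt_params(boost::mpi::communicator const &comm);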

Outlook

This refactoring will facilitate hiding implementation details of parameter structs and help find new ways to simplify the callback infrastructure. For example, #3891 removed one level of MPI indirection by having mpi_steepest_descent() call integrate() via mpi_steepest_descent_slave() instead of calling a chain of callbacks. In fact, mpi_steepest_descent() could probably call mpi_integrate_slave() directly instead of mpi_steepest_descent_slave(), removing yet another MPI indirection. This refactoring will also help in the long term if we choose to move in the direction of MPI+OpenMP or pure OpenMP.

jngrad added the Core label on Sep 22, 2020

mkuron (Member) commented Sep 22, 2020

> However, some parts of the CUDA code are not allowed to include MPI header files, therefore features that are used by CUDA code and rely on MPI communication must either use the comm_cart global variable via extern, or rely on a second header file, e.g. npt_mpi.hpp, to write the prototype of the callback.

That's an anachronism from the autotools build system (which would call mpic++ and nvcc; nowadays we use the MPI library search path directly with c++ or nvcc) and from the days when the OpenMPI headers contained syntax that could not be parsed by nvcc (that was fixed around CUDA 5.5).


RudolfWeeber (Contributor) commented Sep 22, 2020 via email


jngrad (Member, Author) commented Oct 22, 2020

Addendum: there are a lot of places in the code where free functions that should have internal linkage end up with external linkage, just so that they can be used in communication.cpp. Some of them carry a clear warning that they should not be used for any other purpose:

/** Used by \ref mpi_rescale_particles, should not be used elsewhere.
 *  Locally rescale all particles on current node.
 *  @param dir   direction to scale (0/1/2 = x/y/z, 3 = x+y+z isotropically)
 *  @param scale factor by which to rescale (>1: stretch, <1: contract)
 */
void local_rescale_particles(int dir, double scale);

But most of them don't. Furthermore, many of them have side effects in the form of events, so accidentally calling one of them can have a noticeable impact on performance. This design is not sustainable. It is also not difficult to fix: it's only a matter of moving the callbacks from communication.cpp to the .cpp files where they are used (jngrad/espresso:fix-3906...fix-3907). The free functions can then be made static, or merged into the callbacks directly, so that they are not accessible from other translation units.
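
A hedged sketch of the intended end state for such a helper; the registration macro and event call follow names used in the code base, but the exact headers and signatures are assumptions.

#include "communication.hpp" // REGISTER_CALLBACK (via MpiCallbacks.hpp) in the real tree
#include "event.hpp"         // on_particle_change()

// Internal linkage: the helper can no longer be called from other
// translation units, so it cannot be invoked accidentally.
static void local_rescale_particles(int dir, double scale) {
  // ... rescale all particles on the current node ...
}

// The callback wrapping the helper lives in the same .cpp file; the event
// side effect stays next to it instead of being hidden in communication.cpp.
void mpi_rescale_particles_local(int dir, double scale) {
  local_rescale_particles(dir, scale);
  on_particle_change();
}

REGISTER_CALLBACK(mpi_rescale_particles_local)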

jngrad (Member, Author) commented Oct 23, 2020

Also, there doesn't seem to be a good reason for the following header files to be indirectly included in 90 .cpp files:

#include "Particle.hpp"
#include "cuda_init.hpp"
#include "grid_based_algorithms/lb_constants.hpp"

kodiakhq bot closed this as completed in #3971 on Nov 2, 2020
kodiakhq bot added a commit that referenced this issue Nov 2, 2020
Fixes #3907

Move callbacks from `communication.cpp` to the relevant source files. This hides implementation details of many submodules, which no longer need to expose local functions in the public interface with doxygen blocks stating for example "Used by \ref mpi_place_particle, should not be used elsewhere". This also reduces the visibility of `MpiCallbacks.hpp`, `communication.hpp`, `Particle.hpp` and `cuda_init.hpp`. When the head function and worker function have the same code, remove code duplication using `mpi_call_all()`. Remove the `CALLBACK_LIST` macro and split `mpi_gather_stats()` into independent callbacks.
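
For reference, a minimal sketch of the deduplicated pattern enabled by mpi_call_all(): the hypothetical mpi_set_value example from the Background section, rewritten so that the body exists only once and runs on the head node and all workers. The exact signature of mpi_call_all() is inferred from the commit description and should be treated as an assumption.

#include "communication.hpp"

static double local_value = 0.;

// Single body, registered once; runs on the head node and on all workers.
void mpi_set_value_local(double value) { local_value = value; }

REGISTER_CALLBACK(mpi_set_value_local)

// Head function seen by the script interface: no duplicated body anymore.
void mpi_set_value(double value) {
  mpi_call_all(mpi_set_value_local, value);
}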