MPI communication overhead #4415

Closed
jngrad opened this issue Jan 3, 2022 · 1 comment · Fixed by #4414
jngrad commented Jan 3, 2022

TL;DR: Simulations of a Lennard-Jones (LJ) fluid can be sped up by 4% to 5% by following the MPI performance optimization instructions below to reduce overhead when communicating Utils::Vector and Particle objects.

Serialization optimization

The serialization payload for Utils::Vector has the following layout:

template <typename T, std::size_t N>
struct Buffer {
  /* metadata */
  short unsigned version[2]; // for Utils::detail::Storage and Utils::Vector
  std::size_t n_elements;    // equal to N
  /* data */
  T data[N];
};

This payload is stored without padding in a std::vector<char>. For Utils::Vector3i, the metadata consumes 12 bytes (two 2-byte version numbers plus an 8-byte element count), while the data itself consumes only 12 bytes. Similarly, for Utils::Vector3d the metadata consumes 12 bytes while the data consumes 24 bytes.

We can remove the metadata as follows:

  • since the vector size is known at compile-time, we can store the data as a contiguous array with the boost::serialization::make_array wrapper and pass the vector size as a function argument, thus saving 8 bytes (afba12e)
  • since the Utils::Vector class and its dependencies have the same layout everywhere in ESPResSo, and we cannot reload a checkpoint with a different version of ESPResSo (undefined behavior), we can skip versioning of the Utils::detail::Storage and Utils::Vector classes with boost::serialization::object_serializable, thus saving 2x2 bytes (803841e)

You can visualize the buffer content under different serialization conditions with the MWE below, using 803841e.

MWE (minimal working example):
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/array.hpp>
#include <boost/mpi.hpp>
#include <boost/mpi/datatype.hpp>
#include <boost/mpi/packed_iarchive.hpp>
#include <boost/mpi/packed_oarchive.hpp>
#include <boost/serialization/access.hpp>
#include <boost/serialization/array.hpp>

#include <utils/Vector.hpp>

#include <array>
#include <cassert>
#include <iostream>
#include <sstream>

namespace boost::mpi {
using packed_archive = packed_oarchive::buffer_type;
}

void print(std::stringstream &buffer) {
  std::cout << buffer.str() << "\n";
  buffer.seekg(0, std::ios_base::end);
  std::cout << "(" << buffer.tellg() << " bytes)\n";
  buffer.seekg(0, std::ios_base::beg);
}

void print(boost::mpi::packed_archive &buffer) {
  auto const remainder = buffer.size() % 4;
  for (std::size_t i = 0; i < remainder; ++i)
    std::cout << "  ";
  for (std::size_t i = 0; i < buffer.size(); ++i) {
    auto const j = i + remainder;
    if ((j != 0) and (j % 4 == 0))
      std::cout << "\n";
    std::cout << static_cast<short int>(buffer[i]) << " ";
  }
  std::cout << "\n("
            << buffer.size() * sizeof(boost::mpi::packed_archive::value_type)
            << " bytes)\n";
}

namespace serialization_default {
template <typename T> void save(T &oa) {
  std::array<int, 3> values{{4, 5, 6}};
  oa << values;
}
template <typename T> void load(T &ia) {
  std::array<int, 3> values;
  ia >> values;
  assert(values[0] == 4 and values[1] == 5 and values[2] == 6);
}
} // namespace serialization_default

namespace serialization_make_array {
template <typename T> void save(T &oa) {
  std::array<int, 3> values{{4, 5, 6}};
  oa << boost::serialization::make_array(values.data(), values.size());
}
template <typename T> void load(T &ia) {
  std::array<int, 3> values;
  ia >> boost::serialization::make_array(values.data(), values.size());
  assert(values[0] == 4 and values[1] == 5 and values[2] == 6);
}
} // namespace serialization_make_array

namespace serialization_vector {
template <typename T> void save(T &oa) {
  Utils::Vector3i values{{4, 5, 6}};
  oa << values;
}
template <typename T> void load(T &ia) {
  Utils::Vector3i values;
  ia >> values;
  assert(values[0] == 4 and values[1] == 5 and values[2] == 6);
}
} // namespace serialization_vector

int main(int argc, char **argv) {
  boost::mpi::environment mpi_env{argc, argv};
  boost::mpi::communicator comm_cart{};
  {
    using namespace serialization_default;
    std::stringstream buffer{};
    boost::archive::text_oarchive oa{buffer};
    save(oa);
    boost::archive::text_iarchive ia{buffer};
    load(ia);
    std::cout << std::endl << "default text serialization:\n";
    print(buffer);
  }
  {
    using namespace serialization_make_array;
    std::stringstream buffer{};
    boost::archive::text_oarchive oa{buffer};
    save(oa);
    boost::archive::text_iarchive ia{buffer};
    load(ia);
    std::cout << std::endl << "make_array text serialization:\n";
    print(buffer);
  }
  {
    using namespace serialization_default;
    boost::mpi::packed_archive buffer{};
    boost::mpi::packed_oarchive oa{comm_cart, buffer};
    save(oa);
    boost::mpi::packed_iarchive ia{comm_cart, buffer};
    load(ia);
    std::cout << std::endl << "default mpi serialization:\n";
    print(buffer);
  }
  {
    using namespace serialization_make_array;
    boost::mpi::packed_archive buffer{};
    boost::mpi::packed_oarchive oa{comm_cart, buffer};
    save(oa);
    boost::mpi::packed_iarchive ia{comm_cart, buffer};
    load(ia);
    std::cout << std::endl << "make_array mpi serialization:\n";
    print(buffer);
  }
  {
    using namespace serialization_vector;
    boost::mpi::packed_archive buffer{};
    boost::mpi::packed_oarchive oa{comm_cart, buffer};
    save(oa);
    boost::mpi::packed_iarchive ia{comm_cart, buffer};
    load(ia);
    std::cout << std::endl << "vector mpi serialization:\n";
    print(buffer);
  }
}

Output:

$ mpic++ mwe.cpp -std=c++17 -lboost_serialization -lboost_mpi -Isrc/utils/include
$ ./a.out

default text serialization:
22 serialization::archive 17 0 0 3 4 5 6
(40 bytes)

make_array text serialization:
22 serialization::archive 17 4 5 6
(34 bytes)

default mpi serialization:
    0 0
3 0 0 0
0 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
(22 bytes)

make_array mpi serialization:
4 0 0 0
5 0 0 0
6 0 0 0
(12 bytes)

vector mpi serialization:
4 0 0 0
5 0 0 0
6 0 0 0
(12 bytes)

Communication optimization

There are additional ways to optimize communication that don't have a visible impact on the serialization buffer, yet reduce the communication overhead by a small amount:

  • since Utils::Vector objects don't have a virtual base class and are communicated between MPI ranks, serialization-by-pointer is not useful, and we can skip address tracking with boost::serialization::track_never (5d8dae4)
  • since Utils::Vector only stores an array, we can:
    • serialize it as an MPI datatype (e.g. double* or int*) if the underlying type is itself an MPI datatype (e.g. double or int), using boost::mpi::is_mpi_datatype (6e94858)
    • serialize it bitwise if the underlying type is itself bitwise serializable (e.g. double, int) and the platform defines the macro BOOST_MPI_HOMOGENEOUS, using boost::serialization::is_bitwise_serializable (55b6a32)
      • boost::serialization::array_wrapper objects (obtained by applying boost::serialization::make_array to class members) are not bitwise serializable because the type is non-trivial. However, if we mark Utils::detail::Storage as bitwise serializable, there is no need to use array_wrapper inside it, because bitwise serializable types don't write the data length into the Boost archive

Bitwise serialization of Particle data members

The last performance bottleneck is the serialization of Particle substructs. Since they now contain MPI datatypes exclusively, we can mark them as bitwise serializable and change their implementation level to reduce the communication overhead.

The implementation level only has an effect on Boost archives, which we don't use in ghost communication. The ghost communication protocol relies on MemcpyIArchive and MemcpyOArchive (defined in Utils/memcpy_archive.hpp), which re-implement the Boost MPI serialization logic and prioritize bitwise serialization for types that support it. By making the Particle substructs bitwise serializable, we therefore guarantee that ghost communication always uses std::memcpy.

To further optimize MPI communication, one can shrink the Particle struct and its substructs by reducing the amount of padding between members of different types. This is achieved by placing the boolean flags and char members consecutively, grouped in batches of sizeof(double), which is typically 8 bytes. This removes 32 bytes from the Particle struct (i.e. 5% of 624 bytes) and 24 bytes from the ParticleProperties struct (i.e. 7% of 328 bytes).

Benchmarking

For a simple LJ gas at 0.5 packing fraction with the maxset config file, the performance gain is:

  • 3.8% +/- 0.7% for 1'000 particles per core (Utils::Vector communication optimization is the main contributing factor)
  • 5.4% +/- 0.6% for 10'000 particles per core (reducing the size of the Particle substructs is the main contributing factor)

(benchmarks figure)

Raw data: benchmarks.csv, benchmarks.py

jngrad commented Mar 17, 2022

Compact vectors

Follow-up to #3638.

The memory layout of std::vector in LLVM's libc++ looks like this:

template <typename T>
class vector {
  T *begin;
  T *end;
  T *end_capacity;
};

The memory layout for boost::container::vector since Boost 1.67 looks like this:

template <typename T, typename size_type>
class vector {
  T *begin;
  size_type size;
  size_type capacity;
};

By choosing a small type for size_type, e.g. std::uint16_t, one can decrease the size of the Particle struct significantly compared to a std::vector, whose pointers are usually 64 bits wide. This also reduces the serialization payload: only 16 bits are written for the header, whereas std::vector serializes the vector length as a std::size_t, which is usually 64 bits wide. Such a compact vector can be used for both the bond list and the exclusion list, leading to an additional 4% speed-up on maxset (2dbe3d4).

Benchmarking

For a simple LJ gas at 0.5 packing fraction with exclusions on the maxset config file, the performance gain is:

  • 8.6% +/- 0.8% for 1'000 particles
  • 9.1% +/- 0.4% for 10'000 particles

(benchmarks figure)

Raw data: benchmarks.csv, benchmarks.py

kodiakhq bot closed this as completed in #4414 on Mar 28, 2022
kodiakhq bot added a commit that referenced this issue Mar 28, 2022
Fixes #4415, fixes #3638

Description of changes:
- mark classes `Utils::Array`, `Utils::Vector` and `Utils::Quaternion` as MPI datatypes and bitwise serializable
- remove MPI communication overhead for `Utils::Vector` and `Particle` by removing tracking information and metadata
- check in unit tests that all `Particle` substructs are bitwise serializable
- use compact vectors for the bond list and exclusion list

For a LJ fluid simulation, the speed-up is around 8% on maxset configuration and 3% on empty configuration.