TL;DR: Simulations of a LJ fluid can be sped up by 4% to 5% by applying the MPI performance optimizations described below, which reduce the overhead of communicating Utils::Vector and Particle objects.
Serialization optimization
The serialization payload for Utils::Vector has the following layout:
template <typename T, std::size_t N>
struct Buffer {
  /* metadata */
  unsigned short version[2]; // for Utils::detail::Storage and Utils::Vector
  std::size_t n_elements;    // equal to N
  /* data */
  T data[N];
};
This payload is stored without padding in a std::vector<char>. For Utils::Vector3i, the metadata consumes 12 bytes (2 × 2 bytes of version numbers plus 8 bytes for n_elements) while the data itself consumes 12 bytes (three 4-byte int). Similarly, for Utils::Vector3d the metadata consumes 12 bytes while the data consumes 24 bytes (three 8-byte double).
We can remove the metadata as follows:
- since the vector size is known at compile time, we can store the data as a contiguous array with the boost::serialization::make_array wrapper and pass the vector size as a function argument, thus saving 8 bytes (afba12e)
- since the Utils::Vector class and its dependencies have the same layout everywhere in ESPResSo, and reloading a checkpoint with a different version of ESPResSo is not supported (undefined behavior), we can skip versioning of the Utils::detail::Storage and Utils::Vector classes with boost::serialization::object_serializable, thus saving 2 × 2 bytes (803841e)
You can visualize the buffer content under different serialization conditions with the MWE below, using 803841e.
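A minimal sketch along those lines (this is not the original MWE: it uses a plain Boost binary archive over an in-memory buffer and a toy Vec3i type in place of Utils::Vector3i, so the exact byte counts depend on the archive type and Boost version):

```cpp
// Sketch: serialize a toy fixed-size vector and hex-dump the payload to see
// how many bytes are metadata (class version, element count) vs. actual data.
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/array_wrapper.hpp> // make_array (Boost >= 1.64)
#include <cstddef>
#include <cstdio>
#include <sstream>

struct Vec3i {
  int data[3];
  template <class Archive> void serialize(Archive &ar, unsigned /* version */) {
    std::size_t n_elements = 3; // length metadata, dropped by the make_array change
    ar & n_elements;
    ar & boost::serialization::make_array(data, n_elements);
  }
};

int main() {
  std::ostringstream oss;
  {
    boost::archive::binary_oarchive oa(oss, boost::archive::no_header);
    Vec3i v{{1, 2, 3}};
    oa << v; // writes class version + n_elements + 3 * sizeof(int) bytes of data
  }
  auto const buf = oss.str();
  std::printf("payload: %zu bytes\n", buf.size());
  for (unsigned char c : buf)
    std::printf("%02x ", c);
  std::printf("\n");
}
```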
There are additional ways to optimize communication that don't have a visible impact on the serialization buffer, yet reduce the communication overhead by a small amount (a sketch of the corresponding trait specializations follows the list):
- since Utils::Vector objects don't have a virtual base class and are communicated between MPI nodes, serialization-by-pointer is not useful, and we can skip address tracking during serialization with boost::serialization::track_never (5d8dae4)
- since Utils::Vector only stores an array, we can:
  - serialize it as an MPI datatype (e.g. double* resp. int*) if the underlying type is itself an MPI datatype (e.g. double resp. int) using boost::mpi::is_mpi_datatype (6e94858)
  - serialize it bitwise if the underlying type is itself bitwise serializable (e.g. double, int) and the platform defines the macro BOOST_MPI_HOMOGENEOUS, using boost::serialization::is_bitwise_serializable (55b6a32)
- boost::serialization::array_wrapper objects (obtained by applying boost::serialization::make_array to class members) are not bitwise serializable because the type is non-trivial; however, if we mark Utils::detail::Storage as bitwise serializable, there is no need to use array_wrapper inside it, because bitwise serializable types don't write the data length to the Boost archive
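A sketch of what these trait specializations can look like, combined with the object_serializable level from the previous section, using an illustrative MyVec type as a stand-in for Utils::Vector (the actual ESPResSo code may differ; the boost::mpi part additionally requires an MPI installation and linking against Boost.MPI):

```cpp
// Sketch: opt a fixed-size vector type out of versioning and address tracking,
// and map it onto an MPI datatype / bitwise-serializable type whenever its
// element type allows it. MyVec is illustrative, not the ESPResSo class.
#include <boost/mpi/datatype.hpp>
#include <boost/mpl/int.hpp>
#include <boost/serialization/is_bitwise_serializable.hpp>
#include <boost/serialization/level.hpp>
#include <boost/serialization/tracking.hpp>
#include <cstddef>

template <typename T, std::size_t N> struct MyVec { T data[N]; };

namespace boost {
namespace serialization {
// no class version numbers are written to the archive
template <typename T, std::size_t N>
struct implementation_level<MyVec<T, N>> : mpl::int_<object_serializable> {};
// no address tracking: the objects are always sent by value between MPI ranks
template <typename T, std::size_t N>
struct tracking_level<MyVec<T, N>> : mpl::int_<track_never> {};
// bitwise serializable whenever the element type is (e.g. double, int)
template <typename T, std::size_t N>
struct is_bitwise_serializable<MyVec<T, N>> : is_bitwise_serializable<T> {};
} // namespace serialization
namespace mpi {
// a plain MPI datatype whenever the element type is one (e.g. double, int)
template <typename T, std::size_t N>
struct is_mpi_datatype<MyVec<T, N>> : is_mpi_datatype<T> {};
} // namespace mpi
} // namespace boost

static_assert(boost::mpi::is_mpi_datatype<MyVec<double, 3>>::value,
              "a fixed-size vector of doubles maps onto an MPI datatype");
```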
Bitwise serialization of Particle data members
The last performance bottleneck is the serialization of Particle substructs. Since they now contain MPI datatypes exclusively, we can mark them as bitwise serializable and change their implementation level to reduce the communication overhead.
The implementation level only has an effect on Boost archives, which we don't use in ghost communication. The ghost communication protocol relies on MemcpyIArchive and MemcpyOArchive (defined in Utils/memcpy_archive.hpp), which are a re-implementation of the Boost MPI serialization logic that prioritizes bitwise serialization for types that support it. So by making the Particle substructs bitwise serializable, we guarantee that ghost communication always uses std::memcpy.
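For illustration, a substruct whose members are all trivially copyable can be marked with the macro provided by Boost.Serialization; ParticleForce below is a made-up stand-in, not the actual ESPResSo definition:

```cpp
// Sketch: mark a Particle-like substruct as bitwise serializable so that a
// memcpy-based archive can copy it in one block instead of member by member.
#include <boost/serialization/is_bitwise_serializable.hpp>
#include <type_traits>

struct ParticleForce {
  double f[3]; // only trivially copyable members, no pointers or virtuals
};

BOOST_IS_BITWISE_SERIALIZABLE(ParticleForce)

// the memcpy fast path is only sound for trivially copyable types
static_assert(std::is_trivially_copyable<ParticleForce>::value,
              "bitwise serialization requires a trivially copyable type");
```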
To further optimize MPI communication, one can shrink the size of the Particle struct and its substructs by reducing the amount of padding between members of different types. This is achieved by re-ordering the boolean flags and char members so that they are consecutive, grouping them in batches of sizeof(double) bytes, which is typically 8. This removes 32 bytes from the Particle struct (i.e. 5% of 624 bytes) and 24 bytes from the ParticleProperties struct (i.e. 7% of 328 bytes).
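A minimal illustration of the padding effect (the structs and member names below are invented for the example and do not reflect the real Particle layout):

```cpp
// Sketch: the same members, ordered differently. Illustrative names only;
// sizes assume 8-byte doubles and 8-byte alignment (typical 64-bit ABI).
struct Padded {
  bool is_virtual;  // 1 byte + 7 bytes padding before the next double
  double mass;      // 8 bytes
  bool is_fixed;    // 1 byte + 7 bytes padding before the next double
  double charge;    // 8 bytes
};                  // sizeof(Padded) == 32

struct Packed {
  double mass;      // 8 bytes
  double charge;    // 8 bytes
  bool is_virtual;  // 1 byte
  bool is_fixed;    // 1 byte + 6 bytes tail padding
};                  // sizeof(Packed) == 24

static_assert(sizeof(Packed) < sizeof(Padded),
              "grouping small members removes padding");
```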
Benchmarking
For a simple LJ gas at 0.5 packing fraction on the maxset config file, the performance gain is:
- 3.8% +/- 0.7% for 1'000 particles per core (the Utils::Vector communication optimization is the main contributing factor)
- 5.4% +/- 0.6% for 10'000 particles per core (reducing the size of the Particle substructs is the main contributing factor)

Raw data: benchmarks.csv, benchmarks.py
By choosing a small type for size_type, e.g. std::uint16_t, one can decrease the size of the Particle struct significantly compared to using a std::vector, whose internal pointers are usually 64 bits long. This also reduces the serialization payload: only 16 bits are written in the length header, while std::vector serializes its length as a std::size_t, which is usually 64 bits long. Such a compact vector can be used for both the bond list and the exclusion list, leading to an additional 4% speed-up on maxset (2dbe3d4).
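A sketch of the idea, assuming a thin wrapper around std::vector whose serialized length is a std::uint16_t (this only illustrates the smaller payload header; the actual compact vector used in ESPResSo also reduces the in-memory size of the Particle struct, and a 16-bit size type naturally limits the container to 65535 elements, which is plenty for bond and exclusion lists):

```cpp
// Sketch: a vector-like container whose serialized length header is 2 bytes
// (std::uint16_t) instead of the 8-byte std::size_t written for std::vector.
#include <boost/serialization/access.hpp>
#include <boost/serialization/array_wrapper.hpp> // make_array (Boost >= 1.64)
#include <boost/serialization/split_member.hpp>
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename T> class CompactVector {
public:
  using size_type = std::uint16_t;
  void push_back(T const &value) { m_data.push_back(value); }
  std::size_t size() const { return m_data.size(); }

private:
  std::vector<T> m_data; // a real implementation would also use compact storage

  friend class boost::serialization::access;
  template <class Archive> void save(Archive &ar, unsigned const) const {
    // 2-byte length header; assumes fewer than 65536 elements
    auto const n = static_cast<size_type>(m_data.size());
    ar << n;
    ar << boost::serialization::make_array(m_data.data(), n);
  }
  template <class Archive> void load(Archive &ar, unsigned const) {
    size_type n;
    ar >> n;
    m_data.resize(n);
    ar >> boost::serialization::make_array(m_data.data(), n);
  }
  BOOST_SERIALIZATION_SPLIT_MEMBER()
};
```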
Benchmarking
For a simple LJ gas at 0.5 packing fraction with exclusions on the maxset config file, the performance gain is:
Fixes #4415, fixes #3638
Description of changes:
- mark classes `Utils::Array`, `Utils::Vector` and `Utils::Quaternion` as MPI datatypes and bitwise serializable
- remove MPI communication overhead for `Utils::Vector` and `Particle` by removing tracking information and metadata
- check in unit tests that all `Particle` substructs are bitwise serializable
- use compact vectors for the bond list and exclusion list
For a LJ fluid simulation, the speed-up is around 8% on the maxset configuration and 3% on the empty configuration.