
Checkpointing doesn't support recursion with more than 1 MPI rank #4280

Closed

jngrad opened this issue Jun 11, 2021 · 4 comments · Fixed by #4724

Comments

jngrad (Member) commented Jun 11, 2021

In ESPResSo 4.2-dev, a regression from #4167 broke the checkpointing mechanism. When reloading from a checkpoint file, script interface objects with global creation policy are not broadcast, i.e. they are only restored on MPI rank 0. This affects ~63 Python classes.

The checkpointing tests don't check that these objects are restored on all MPI ranks. This could be achieved by, e.g., introducing a new particle of type 10 in a cell belonging to MPI rank 1 or rank 2 that only interacts with a shape-based constraint. Then we load the system from the checkpoint, recalculate the force on that particle, and check that it is non-zero; see the sketch below.
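
A sketch of such a test (illustrative values only; whether the particle actually lands on rank 1 depends on the domain decomposition, here assumed to split the 10×10×10 box along x between 2 ranks):

import numpy as np
import espressomd
import espressomd.shapes

system = espressomd.System(box_l=3 * [10.0], time_step=0.01)
center = np.array([7.5, 5.0, 5.0])  # assumed to lie in a cell owned by rank 1
p = system.part.add(pos=center + 1., type=10)
system.constraints.add(shape=espressomd.shapes.Sphere(
    center=center, radius=1.0), particle_type=1)
system.non_bonded_inter[10, 1].lennard_jones.set_params(
    epsilon=1., sigma=1., cutoff=3.0, shift=0.1)
# ... register and save the checkpoint, reload it in a second script, then:
system.integrator.run(0)
assert np.linalg.norm(np.copy(p.f)) > 0.  # force must be restored on all ranks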

MWE:

# mwe_save.py
import os
import espressomd.checkpointing
import espressomd.shapes

checkpoint = espressomd.checkpointing.Checkpoint(checkpoint_id='mwe', checkpoint_path=".")
if checkpoint.has_checkpoints(): # cleanup old checkpoint files
    for filepath in os.listdir(checkpoint.checkpoint_dir):
        os.remove(os.path.join(checkpoint.checkpoint_dir, filepath))

system = espressomd.System(box_l=3 * [10.0], time_step=0.01)
system.part.add(pos=[system.box_l / 2. - 1., system.box_l / 2. + 1.], type=[0, 0])
system.constraints.add(shape=espressomd.shapes.Sphere(
    center=system.box_l / 2., radius=1.0), particle_type=1)
system.non_bonded_inter[1, 0].lennard_jones.set_params(
    epsilon=1., sigma=1., cutoff=3.0, shift=0.1)
checkpoint.register("system")
checkpoint.save(0)

system.integrator.run(0)
print(system.part[:].f)

# mwe_load.py
import espressomd.checkpointing

checkpoint = espressomd.checkpointing.Checkpoint(checkpoint_id='mwe', checkpoint_path=".")
checkpoint.load(0)  # restores the registered "system" into the global namespace

system.integrator.run(0)
print(system.part[:].f)

Output:

$> mpiexec -n 2 ./pypresso mwe_save.py
[[-1475.26213767 -1475.26213767 -1475.26213767]
 [ 1475.26213767  1475.26213767  1475.26213767]]
$> mpiexec -n 2 ./pypresso mwe_load.py
[[-1475.26213767 -1475.26213767 -1475.26213767]
 [    0.             0.             0.        ]]
jngrad (Member, Author) commented Jun 14, 2021

This doesn't affect release 4.1.4.

On 4.2-dev, this only affects objects contained in an object list (e.g. Constraints or LBBoundaries). The contained objects are deserialized on the head node, then added to the core via std::move() on the head node only:

// Runs on the head node only; worker nodes never receive the containees.
for (auto const &packed_object : object_states) {
  auto o = std::dynamic_pointer_cast<ManagedType>(
      BaseType::deserialize(packed_object, *BaseType::context()));
  add(std::move(o));
}

jngrad self-assigned this on Jun 14, 2021
jngrad changed the title from “Checkpointing doesn't work with more than 1 MPI rank” to “Checkpointing doesn't support recursion with more than 1 MPI rank” on Jun 21, 2021
jngrad (Member, Author) commented Jun 21, 2021

The fundamental issue with the current object management framework is the lack of support for recursive unpickling. At the moment, contained objects are managed by the GlobalContext in which the ObjectList container is stored. The container and its containees need to be stored in the same context's std::unordered_map, so that the ObjectList can access its contained objects.
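
As a rough schematic (plain Python with hypothetical names, not the actual C++ internals), the current design amounts to one flat id-to-object map per context, shared by the container and its containees:

# Schematic only: one flat registry per context, keyed by ObjectId.
flat_registry = {
    1: {"name": "Constraints", "containees": [2, 3]},
    2: {"name": "ShapeBasedConstraint"},
    3: {"name": "ShapeBasedConstraint"},
}

# The container resolves its children with plain lookups in the same map:
children = [flat_registry[oid] for oid in flat_registry[1]["containees"]]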

Calling ObjectList::set_internal_state() on all nodes

This doesn't work out, because the contained objects end up being deserialized twice on worker nodes:

  • first, they are deserialized with the wrong ObjectId but are properly added in the core (via the call to ObjectList::set_internal_state() on all nodes)
  • then, they are deserialized with the correct ObjectId but are not added in the core (via the call to GlobalContext::make_shared() which calls GlobalContext::remote_make_handle()).

This is a consequence of the separation of concerns: ObjectHandle (from which both ObjectList and the contained objects inherit) only takes care of sending shared pointers to core containers, while Context takes care of instantiation and broadcast on MPI nodes.

Creating a third policy GlobalContaineePolicy for contained objects

This doesn't work, because the contained objects would live in a different std::unordered_map from the one the global ObjectList object lives in, while the container needs to have access to its containees.

Adding a new flag bool is_containee in the ObjectHandle class

This allows us to inject a hook in GlobalContext::make_shared() that prevents the call to GlobalContext::remote_make_handle() and therefore prevents the constructed containee from being overwritten later on worker nodes. But the ObjectId is still incorrect: UnpackVisitor::operator() ends up looking up an ObjectId that doesn't exist.

Rewrite the ContextManager framework to handle recursion

This would make the MPI callback logic a lot more complicated, since it would have to keep track of the recursion level, and we could no longer use a single std::unordered_map to store (ObjectId, shared_ptr<ManagedType>) key/value pairs, because ObjectList would now be managing the MPI communication of its contained objects. To find a specific object, we would have to traverse the GlobalContext map and, whenever it contains ObjectList objects, traverse their maps as well. This would break the separation of concerns between the Context classes (which manage MPI communication and object storage) and the ObjectHandle classes. We would also need a mechanism to prevent key collisions across the different maps.
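
To picture the lookup problem alone (again a schematic in plain Python with hypothetical names, not a proposed implementation): with per-container registries, one flat lookup turns into a recursive traversal, and ObjectIds would have to stay unique across all nested maps:

# Schematic only: every ObjectList carries its own registry of containees.
nested_registry = {
    1: {"name": "Constraints",
        "containees": {2: {"name": "ShapeBasedConstraint"},
                       3: {"name": "Union",
                           "containees": {4: {"name": "Sphere"}}}}},
}

def find_object(registry, oid):
    """Depth-first search for an ObjectId across nested registries."""
    if oid in registry:
        return registry[oid]
    for obj in registry.values():
        found = find_object(obj.get("containees", {}), oid)
        if found is not None:
            return found
    return None

assert find_object(nested_registry, 4)["name"] == "Sphere"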

RudolfWeeber (Contributor) commented
Thank you for investigating this so thoroughly. In view of the ongoing discussion about moving to shared-memory parallelization, we might consider working around this by reverting (most of) the PR which introduced the regression. Then we would be back with the old 'solution' of handling the serialization of containers from Cython (via the ScriptObjectRegistry Cython class, IIRC). Everything except union shapes would work again; we would then have to write serialization for the Union shape in Cython, analogously to ScriptObjectRegistry.

If we choose to go to shared-memory parallelism, the issue will go away anyway.

kodiakhq bot added a commit that referenced this issue Jul 5, 2021
Temporary fix for #4280

Description of changes:
- prevent serialization of `ObjectList` objects when the MPI world size is greater than 1
- test that `ObjectList` objects are correctly reloaded when the MPI world size is 1
- document LB checkpointing and cleanup script interface documentation
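
The guard can be pictured roughly like this (a sketch with hypothetical names, not the actual patch): the container refuses to be pickled when more than one MPI rank is running.

import pickle

class GuardedObjectList:
    """Stand-in for a script interface ObjectList."""
    def __init__(self, n_ranks=1):
        self.n_ranks = n_ranks  # hypothetical stand-in for the MPI world size
        self.elements = []

    def __reduce__(self):
        if self.n_ranks > 1:
            raise RuntimeError(
                "ObjectList cannot be checkpointed with more than 1 MPI rank")
        return (GuardedObjectList, (self.n_ranks,), {"elements": self.elements})

pickle.dumps(GuardedObjectList(n_ranks=1))    # works
# pickle.dumps(GuardedObjectList(n_ranks=2))  # raises RuntimeError
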

jngrad (Member, Author) commented May 12, 2022

For the time being, we can restore checkpointing of most ObjectList classes by re-implementing the Python pickling functions (see a4e28bc in #4510). This doesn't work for all ObjectList classes: for example, the Union shape is an ObjectList that can be stored as a shape-based constraint in a Constraints container, which is itself an ObjectList. There is no easy way to handle this kind of recursion in the Python pickling logic.
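
Roughly, the re-implemented pickling can be sketched as follows (simplified, hypothetical names, not the actual commit): serialize an ObjectList as the plain list of its elements and re-add them on unpickling, so each element passes again through the regular add() path.

import pickle

def _restore_object_list(cls, elements):
    """Rebuild the container by re-adding each element via add()."""
    container = cls()
    for element in elements:
        container.add(element)
    return container

class PicklableObjectList:
    """Stand-in for a script interface ObjectList."""
    def __init__(self):
        self._elements = []

    def add(self, element):
        # The real add() also inserts the object into the core on all ranks.
        self._elements.append(element)

    def __reduce__(self):
        # Serialize as (callable, args): unpickling calls
        # _restore_object_list(type(self), [...]).
        return (_restore_object_list, (type(self), list(self._elements)))

box = PicklableObjectList()
box.add("constraint-1")
clone = pickle.loads(pickle.dumps(box))
assert clone._elements == ["constraint-1"]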

kodiakhq bot closed this as completed in #4724 (commit b3efb66) on May 12, 2023
jngrad pushed a commit to jngrad/espresso that referenced this issue Jul 13, 2023
…4724)

Fixes espressomd#4280

Description of changes:
- checkpoint restrictions on the number of MPI ranks have been lifted