Loading partial networks from checkpoints #1

joegilkes · 2023-10-16T11:03:12Z

Currently, networks explored via IterativeExplore that are terminated early (due to an error, exceeding walltime, etc.) can be restarted directly from the contents of their rdir_head. However, this is only useful as long as rdir_head is always available.

If running on distributed resources like HPC, network exploration should be performed within a scratch space to allow for the currently heavy IO requirements of CDE runs. However, these scratch spaces are usually semi-volatile and in many cases cease to exist once a job is finished. This wipes the entire rdir_head directory tree, preventing restarts.

While rdir_head could be periodically backed up to non-volatile storage, copying the whole directory tree would be very expensive and would nullify many of the benefits of performing exploration on a scratch space. Instead, we could use the already implemented incomplete network saves (which can be written to a non-scratch directory) as checkpoints, and allow partial (or full) network restoration from them when rdir_head is not present (e.g. when it has been wiped at the end of a job). This would work as follows:

1. Check if rdir_head exists. If it does, the network within may either be full (present in the directory tree from the initial level) or partial (present in the directory tree only from a certain level onwards, because it was previously restored from a checkpoint).
2. If rdir_head does not exist, check if checkpoints exist. If they do, read in the latest checkpoint, establish the next level's seeds and create a new partial directory tree starting from this level.
3. If no checkpoints exist either, start a new exploration from scratch.
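The three-way decision above could be sketched roughly as follows. This is an illustration in Python only, not the package's implementation: the function name `choose_restart_mode`, the `checkpoint_*` file pattern and the use of modification time to pick the "latest" checkpoint are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional, Tuple


def choose_restart_mode(
    rdir_head: Path,
    checkpoint_dir: Path,
    pattern: str = "checkpoint_*",  # hypothetical checkpoint naming scheme
) -> Tuple[str, Optional[Path]]:
    """Decide how an exploration should be (re)started.

    Returns a (mode, checkpoint) pair, where checkpoint is only set
    when the network should be restored from a checkpoint file.
    """
    # Step 1: an existing directory tree takes priority. Whether it holds
    # a full or a partial network is decided later by the loader.
    if rdir_head.is_dir():
        return ("resume_from_rdir_head", None)

    # Step 2: no directory tree, so look for checkpoints in non-volatile
    # storage. "Latest" is assumed here to mean most recently modified.
    if checkpoint_dir.is_dir():
        checkpoints = sorted(
            checkpoint_dir.glob(pattern), key=lambda p: p.stat().st_mtime
        )
        if checkpoints:
            return ("restore_from_checkpoint", checkpoints[-1])

    # Step 3: nothing to restore from.
    return ("start_fresh", None)
```

Returning a mode rather than loading anything keeps the decision separate from the (much heavier) work of rebuilding the network, which makes the logic easy to test without any exploration data.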

In step 1, when there is a full network it can be directly loaded. However, when there is only a partial network, a checkpoint file corresponding to the exploration progress made from the level(s) before those that exist in rdir_head MUST be available for exploration to continue without error.
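The partial-network requirement could be enforced with an up-front check along these lines. Again this is only a Python sketch under stated assumptions: the `level_<n>` directory names, the `checkpoint_level_<n>` file names and the idea that one checkpoint is written per completed level are all hypothetical and not taken from the existing code.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme for per-level directories inside rdir_head.
LEVEL_RE = re.compile(r"^level_(\d+)$")


def first_level_on_disk(rdir_head: Path) -> Optional[int]:
    """Return the lowest level number present in rdir_head, if any."""
    levels = []
    for entry in rdir_head.iterdir():
        match = LEVEL_RE.match(entry.name)
        if entry.is_dir() and match:
            levels.append(int(match.group(1)))
    return min(levels) if levels else None


def required_checkpoint(
    rdir_head: Path, checkpoint_dir: Path, initial_level: int = 1
) -> Optional[Path]:
    """Return the checkpoint a partial tree depends on, or None if the
    tree is full (or holds no level directories at all).

    Raises FileNotFoundError when the tree is partial but the checkpoint
    covering the earlier levels is missing.
    """
    first = first_level_on_disk(rdir_head)
    if first is None or first == initial_level:
        return None  # full network, or nothing to reconcile

    # Assumption: the checkpoint for level n holds all progress up to
    # and including level n, so a tree starting at `first` needs first - 1.
    needed = checkpoint_dir / f"checkpoint_level_{first - 1}"
    if not needed.is_file():
        raise FileNotFoundError(
            f"rdir_head starts at level {first} but {needed} is missing"
        )
    return needed
```

Failing early with a clear error here avoids discovering the missing checkpoint only after the exploration has resumed.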


joegilkes added the enhancement New feature or request label Oct 16, 2023

joegilkes self-assigned this Oct 16, 2023
