Currently, networks explored via IterativeExplore that are terminated early (due to an error, exceeding walltime, etc.) can be restarted directly from the contents of their rdir_head. However, this is only useful as long as rdir_head is always available.
If running on distributed resources such as HPC clusters, network exploration should be performed within a scratch space to accommodate the currently heavy IO requirements of CDE runs. However, these scratch spaces are usually semi-volatile and in many cases cease to exist once a job is finished. This wipes the entire rdir_head directory tree, preventing restarts.
While rdir_head could be periodically backed up to non-volatile storage, this would be incredibly expensive and would nullify many of the benefits of performing exploration on a scratch space. Instead, we could use the already implemented incomplete network saves (which can be saved into a non-scratch directory) as checkpoints and allow for partial (or full) network restoration from them when rdir_head is not present (e.g. when it has been wiped by end of job). This would work as follows:
1. Check if rdir_head exists. If it does, the network within may either be full (present in the directory tree from the initial level) or partial (present in the directory tree only from a certain point, as it has been loaded from a checkpoint before).
2. If rdir_head does not exist, check if checkpoints exist. If they do, read in the latest checkpoint, establish next level seeds and create a new partial directory tree starting from this level.
3. If no checkpoints exist either, start a new exploration from scratch.
In step 1, a full network can be loaded directly. A partial network, however, only contains the levels explored after it was restored from a checkpoint, so a checkpoint file covering the exploration progress from the level(s) before those that exist in rdir_head MUST be available for exploration to continue without error.
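The decision logic above could be sketched roughly as follows. This is only an illustration in Python, not the actual implementation: the function names, the `level_<n>` directory naming, the choice of "latest checkpoint" by modification time, and the treatment of an empty rdir_head as missing are all assumptions made for the example.

```python
from enum import Enum
from pathlib import Path
from typing import Optional, Tuple


class RestartMode(Enum):
    RESUME_FROM_RDIR = "resume_from_rdir"
    RESTORE_FROM_CHECKPOINT = "restore_from_checkpoint"
    START_FRESH = "start_fresh"


def latest_checkpoint(checkpoint_dir: Path) -> Optional[Path]:
    """Return the most recently modified file in checkpoint_dir, or None."""
    if not checkpoint_dir.is_dir():
        return None
    files = [p for p in checkpoint_dir.iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def first_level_in_rdir(rdir_head: Path) -> Optional[int]:
    """Return the lowest level number present under rdir_head, or None.

    Assumes a hypothetical layout of level_1, level_2, ... subdirectories.
    """
    levels = []
    for p in rdir_head.glob("level_*"):
        suffix = p.name.split("_", 1)[1]
        if p.is_dir() and suffix.isdigit():
            levels.append(int(suffix))
    return min(levels, default=None)


def choose_restart(
    rdir_head: Path, checkpoint_dir: Path
) -> Tuple[RestartMode, Optional[Path]]:
    """Pick a restart strategy and the checkpoint it needs (if any)."""
    ckpt = latest_checkpoint(checkpoint_dir)

    # Step 1: rdir_head exists and holds a full or partial network.
    if rdir_head.is_dir():
        first = first_level_in_rdir(rdir_head)
        if first is not None:
            if first == 1:
                # Full network: everything needed is in the directory tree.
                return RestartMode.RESUME_FROM_RDIR, None
            if ckpt is None:
                # Partial network without a checkpoint cannot be continued.
                raise FileNotFoundError(
                    f"rdir_head starts at level {first} but no checkpoint "
                    "covering the earlier levels was found"
                )
            return RestartMode.RESUME_FROM_RDIR, ckpt
        # An empty rdir_head is treated like a missing one (assumption).

    # Step 2: no usable rdir_head, fall back to the latest checkpoint.
    if ckpt is not None:
        return RestartMode.RESTORE_FROM_CHECKPOINT, ckpt

    # Step 3: nothing to restore from.
    return RestartMode.START_FRESH, None
```

A caller would then branch on the returned mode, e.g. `mode, ckpt = choose_restart(Path("scratch/rdir_head"), Path("home/checkpoints"))`, and load the checkpoint before rebuilding the partial directory tree when one is returned.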