Add simple and distributed checkpointing and automatic recovery #144

Merged 5 commits into chainer:master on Dec 7, 2017

@kuenishi kuenishi commented Dec 1, 2017

At each distributed checkpoint, all processes run an MPI gather and an
MPI bcast to agree on the common set of iteration numbers for which
every process has a corresponding snapshot. The same synchronization
among processes also runs at startup, but only if
_DistCPRExtension.maybe_resume() is called explicitly.

Limitations

  • The number of nodes must be the same across runs
  • When resuming, every process must be able to reach its own snapshot
    files, i.e. those with the corresponding rank number
  • Manual intervention is needed when any file in the latest set is
    broken: remove that set and resume from the next older one
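
The agreement step described above can be sketched in plain Python. This is a minimal illustration, not the code in this PR: the helper names `common_iterations` and `latest_common_iteration` are hypothetical, and the list `per_rank` stands in for what the root process would receive from the gather. With mpi4py, one would presumably build that list with `comm.gather(local_iterations, root=0)` and distribute the result with `comm.bcast(common, root=0)`.

```python
from typing import Iterable, List, Optional, Sequence


def common_iterations(per_rank: Sequence[Iterable[int]]) -> List[int]:
    """Return, sorted, the iterations for which every rank has a snapshot.

    per_rank[r] is the list of iteration numbers rank r found on disk
    (hypothetical input; it models the result of an MPI gather at root).
    """
    if not per_rank:
        return []
    sets = [set(iterations) for iterations in per_rank]
    # Only iterations present on *all* ranks form a usable snapshot set.
    return sorted(set.intersection(*sets))


def latest_common_iteration(per_rank: Sequence[Iterable[int]]) -> Optional[int]:
    """Return the newest iteration every rank can resume from, or None."""
    common = common_iterations(per_rank)
    return common[-1] if common else None


if __name__ == "__main__":
    # Rank 1 is missing the snapshot for iteration 300.
    gathered = [[100, 200, 300], [100, 200]]
    print(common_iterations(gathered))
    print(latest_common_iteration(gathered))
```

In this example rank 1 lacks the snapshot for iteration 300, so the common set is `[100, 200]` and all ranks would resume from iteration 200. Taking the intersection is also why removing a broken latest set by hand is enough to fall back to the next older one.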

Signed-off-by: UENISHI Kota <[email protected]>
@keisukefukuda keisukefukuda self-assigned this Dec 4, 2017
@keisukefukuda keisukefukuda self-requested a review December 4, 2017 01:27
@keisukefukuda keisukefukuda merged commit 6997e9d into chainer:master Dec 7, 2017
@kuenishi kuenishi deleted the sq-cpr branch December 7, 2017 05:37
@iwiwi iwiwi added this to the v1.1.0 milestone Dec 13, 2017
3 participants