
Sort out distributed computation #32

Open
bpiwowar opened this issue Jan 13, 2024 · 0 comments
bpiwowar commented Jan 13, 2024

Distributed computation is not working well, and we should switch to DistributedDataParallel for better efficiency

  • Samplers should work on independent data subsets
  • Checkpointing needs to be handled properly in the distributed setting (see the sketch after this list)
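As a rough illustration of what the DDP setup could look like, here is a minimal sketch (CPU/gloo for simplicity; the model, dataset and checkpoint path are placeholders, not the actual experimaestro trainer code). Each rank receives a disjoint shard of the data via `DistributedSampler`, and only rank 0 writes the checkpoint:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train():
    # Join the default process group; with the env:// init method, torchrun
    # provides RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    # Placeholder model/dataset just to make the sketch self-contained.
    model = torch.nn.Linear(10, 1)
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    dataset = TensorDataset(torch.randn(128, 10), torch.randn(128, 1))
    # DistributedSampler assigns each rank a disjoint subset of the data,
    # which is what "independent data subsets" amounts to.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    for epoch in range(2):
        # Keep shuffling consistent across ranks while varying it per epoch.
        sampler.set_epoch(epoch)
        for x, y in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(ddp_model(x), y)
            loss.backward()
            optimizer.step()

    # Save from a single rank only; after each step all ranks hold the same
    # parameters, so one copy of the checkpoint is enough.
    if rank == 0:
        torch.save(ddp_model.module.state_dict(), "checkpoint.pt")
    dist.barrier()

    dist.destroy_process_group()


if __name__ == "__main__":
    train()
```

Launching would be done with `torchrun --nproc_per_node=N`, which sets the rendezvous environment variables used by `init_process_group`.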

Solve multiple backwards issues:

  • Backward is called within trainers (using the no_sync context might lead to problems if the parameters involved are not the same...)
  • Micro-batching using the no_sync context (see the sketch after this list)
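For the micro-batching item, the usual pattern (sketched below, assuming the `ddp_model`, `optimizer` and `loader` from the previous sketch, and a placeholder `accum_steps`) is to run backward under `no_sync()` for all but the last micro-batch, so gradients accumulate locally and a single all-reduce happens on the final backward. This is also where the concern above bites: if the parameters involved differ between the no_sync backwards and the final synchronising backward, the accumulated gradients and the all-reduce can get out of step.

```python
# Gradient accumulation over micro-batches with DDP.  All but the last
# micro-batch run backward under no_sync(), which skips the gradient
# all-reduce; the final backward then synchronises the accumulated grads.
accum_steps = 4  # placeholder value

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(ddp_model(x), y) / accum_steps
    if (step + 1) % accum_steps != 0:
        with ddp_model.no_sync():
            loss.backward()  # local accumulation only, no communication
    else:
        loss.backward()      # triggers the all-reduce across ranks
        optimizer.step()
        optimizer.zero_grad()
```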

See https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Depends on experimaestro/experimaestro-python#32 since object duplication does not work with the current config/object layout

bpiwowar self-assigned this Jan 18, 2024