
Sort out distributed computation #32

Open
bpiwowar opened this issue Jan 13, 2024 · 0 comments
bpiwowar commented Jan 13, 2024

Distributed computation is not working well, and we should switch to DistributedDataParallel for better efficiency

  • Samplers should work on independent data subsets
  • Checkpointing needs to be handled properly in the distributed setting (see the sketch after this list)
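As a rough illustration of what the DDP setup could look like, here is a minimal sketch (CPU/gloo for simplicity; the model, dataset and checkpoint path are placeholders, not the actual experimaestro trainer code). Each rank receives a disjoint shard of the data via `DistributedSampler`, and only rank 0 writes the checkpoint:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train():
    # Join the default process group; with the env:// init method, torchrun
    # provides RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    # Placeholder model/dataset just to make the sketch self-contained.
    model = torch.nn.Linear(10, 1)
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    dataset = TensorDataset(torch.randn(128, 10), torch.randn(128, 1))
    # DistributedSampler assigns each rank a disjoint subset of the data,
    # which is what "independent data subsets" amounts to.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    for epoch in range(2):
        # Keep shuffling consistent across ranks while varying it per epoch.
        sampler.set_epoch(epoch)
        for x, y in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(ddp_model(x), y)
            loss.backward()
            optimizer.step()

    # Save from a single rank only; after each step all ranks hold the same
    # parameters, so one copy of the checkpoint is enough.
    if rank == 0:
        torch.save(ddp_model.module.state_dict(), "checkpoint.pt")
    dist.barrier()

    dist.destroy_process_group()


if __name__ == "__main__":
    train()
```

Launching would be done with `torchrun --nproc_per_node=N`, which sets the rendezvous environment variables used by `init_process_group`.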

Solve multiple backwards issues:

  • Backward is called within trainers (using the no_sync context might lead to problems if the parameters involved are not the same...)
  • Micro-batching using the no_sync context (see the sketch after this list)
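For the micro-batching item, the usual pattern (sketched below, assuming the `ddp_model`, `optimizer` and `loader` from the previous sketch, and a placeholder `accum_steps`) is to run backward under `no_sync()` for all but the last micro-batch, so gradients accumulate locally and a single all-reduce happens on the final backward. This is also where the concern above bites: if the parameters involved differ between the no_sync backwards and the final synchronising backward, the accumulated gradients and the all-reduce can get out of step.

```python
# Gradient accumulation over micro-batches with DDP.  All but the last
# micro-batch run backward under no_sync(), which skips the gradient
# all-reduce; the final backward then synchronises the accumulated grads.
accum_steps = 4  # placeholder value

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(ddp_model(x), y) / accum_steps
    if (step + 1) % accum_steps != 0:
        with ddp_model.no_sync():
            loss.backward()  # local accumulation only, no communication
    else:
        loss.backward()      # triggers the all-reduce across ranks
        optimizer.step()
        optimizer.zero_grad()
```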

See https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Depends on experimaestro/experimaestro-python#32 since object duplication does not work with the current config/object layout

bpiwowar self-assigned this Jan 18, 2024