[dataloader] dataloading improvement tracking issue #37

Open
3 tasks
d4l3k opened this issue Dec 12, 2024 · 2 comments
Labels
data Related to dataloading enhancement New feature or request

Comments

@d4l3k
Member

d4l3k commented Dec 12, 2024

This is a tracking issue for dataloader improvements. The current support is very basic, and we likely need to make some bigger changes to make it more efficient.

  • track dataloader step counts on a per replica_id basis
  • add mechanism for reinstantiating dataloader from checkpoint and fast forwarding to the correct step count
  • throw this all out and use a deterministic index managed by Lighthouse?
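
The first two tasks above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `ReplicaStepTracker` and `fast_forward` are hypothetical names, and the sketch assumes the dataloader yields batches in a deterministic order so that skipping N batches reproduces the pre-checkpoint position.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


class ReplicaStepTracker:
    """Hypothetical helper: counts batches consumed per replica_id."""

    def __init__(self) -> None:
        self._steps: Dict[str, int] = {}

    def record_step(self, replica_id: str) -> None:
        # Call once per batch consumed by the given replica.
        self._steps[replica_id] = self._steps.get(replica_id, 0) + 1

    def step_count(self, replica_id: str) -> int:
        return self._steps.get(replica_id, 0)

    def state_dict(self) -> Dict[str, int]:
        # Plain dict so it can be stored alongside a model checkpoint.
        return dict(self._steps)

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self._steps = dict(state)


def fast_forward(batches: Iterable, steps: int) -> Iterator:
    """Re-create an iterator and discard the first `steps` batches.

    Assumes `batches` iterates in the same order as before the checkpoint.
    """
    it = iter(batches)
    for _ in islice(it, steps):
        pass  # Throw away batches that were already consumed.
    return it
```

On restore, a replica would load the tracker state, look up its own step count, and call `fast_forward` on a freshly instantiated dataloader. The cost is linear in the number of skipped batches, which is one reason a deterministic index might be preferable.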
@d4l3k
Member Author

d4l3k commented Dec 12, 2024

This relates to pytorch/data#1337

@d4l3k
Member Author

d4l3k commented Dec 12, 2024

Notes from Andrew:

We do have a flag, “snapshot_every_n_steps”, that only updates the checkpoint every N steps (say, 10). There is also a counter, so if you request a checkpoint at step 15, it loads the snapshot from step 10 and then throws away 5 batches to recover the state.

This is very similar to what we want.
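
To make the behavior in the notes concrete, here is a toy sketch of snapshotting every N steps and replaying the remainder. `SnapshottingLoader` and its `restore` method are hypothetical and only mimic the described idea. The real flag's semantics may differ, and here the "snapshot" is simply a position in an in-memory list.

```python
from typing import Dict, List


class SnapshottingLoader:
    """Hypothetical loader over a fixed, deterministic list of batches."""

    def __init__(self, batches: List[int], snapshot_every_n_steps: int) -> None:
        self.batches = batches
        self.snapshot_every_n_steps = snapshot_every_n_steps
        self.step = 0
        # Maps step -> saved position; step 0 is always recoverable.
        self.snapshots: Dict[int, int] = {0: 0}

    def __iter__(self) -> "SnapshottingLoader":
        return self

    def __next__(self) -> int:
        if self.step >= len(self.batches):
            raise StopIteration
        batch = self.batches[self.step]
        self.step += 1
        # Only take a snapshot every N steps to keep overhead low.
        if self.step % self.snapshot_every_n_steps == 0:
            self.snapshots[self.step] = self.step
        return batch

    def restore(self, target_step: int) -> int:
        """Load the latest snapshot at or before target_step, then discard
        batches until target_step is reached. Returns the number discarded."""
        snapshot_step = max(s for s in self.snapshots if s <= target_step)
        self.step = self.snapshots[snapshot_step]
        discarded = 0
        while self.step < target_step:
            next(self)  # Throw away a batch to catch up.
            discarded += 1
        return discarded
```

With `snapshot_every_n_steps=10`, restoring to step 15 loads the step-10 snapshot and discards 5 batches, matching the example in the notes. The trade-off is fewer snapshots in exchange for up to N-1 replayed batches on recovery.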

@d4l3k d4l3k added enhancement New feature or request data Related to dataloading labels Jan 8, 2025