Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-download only relevant files per rank. #124

Merged

Conversation

2015aroras
Copy link
Contributor

@2015aroras 2015aroras commented Dec 18, 2024

This builds on #123 and tries to change it in ways listed below. Completely untested and just an idea.

  • Instead of pre-downloading the whole checkpoint, download a checkpoint file when a range query is about to made on it for the first time. resource_path will (hopefully) prevent any concurrency issues. As a bonus, this also splits up the work between the various processes.
  • Instead of saving each non-sharded data item on the rank with the smallest save payload (thus spreading the items out across a bunch of files/ranks), save them all to the lowest rank (empirically, rank 0). This will limit how many files we have to download/read from, but makes rank 0 bigger (I think only by like 1 MB though).

@2015aroras 2015aroras requested a review from epwalsh December 18, 2024 00:46
Copy link
Member

@epwalsh epwalsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, thanks for implemented it

@2015aroras 2015aroras marked this pull request as ready for review December 18, 2024 19:09
@epwalsh
Copy link
Member

epwalsh commented Dec 18, 2024

@2015aroras ready to merge into the other branch?

@2015aroras
Copy link
Contributor Author

c99da44 I need to cache the GCS client in cached_path, otherwise we get an error we saw in the old trainer until we did a similar fix.

@2015aroras 2015aroras merged commit aa4afb3 into epwalsh/pre-download-checkpoint Dec 19, 2024
1 check passed
@2015aroras 2015aroras deleted the shanea/pre-download-checkpoint branch December 19, 2024 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants