-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce nightly runners #410
Conversation
The documentation is not available anymore as the PR was closed or merged. |
- name: Update clone | ||
run: | | ||
source activate accelerate | ||
git config --global --add safe.directory '*' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was getting odd permission errors, and this was the documented solution I found. (Will post the link in the AM)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LysandreJik here's the specific thread: actions/checkout#760 (comment)
How we got here is the runner has a clone of accelerate inside the working directory for the action, and that clone is then updated, checked out, etc. But we run into this permission denied unsafe repository from github, so github needs to be told we can safely adjust the repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, interesting. Thanks!
* Install git * Fix CPU image as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
* Fix CUDA tests * Use num_processes to keep everything under one test
* Update test requirements to include psutil, tensorboard, and the right tensorflow version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
- name: Update clone | ||
run: | | ||
source activate accelerate | ||
git config --global --add safe.directory '*' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, interesting. Thanks!
* Introduce nightly builds * Fixup docker images slightly * Make device-count specific test use `torch.cuda.device_count()` rather than `Accelerator.num_processes` to avoid bug.
Introduce nightly runners
Important
Should be merged after:
What does this add?
This PR adds nightly runners that will test on a GPU and multi-GPU, and runs all slow tests. In total it should take somewhere between 5-10 minutes to run (depending on whether a cache of a Docker image needed to be downloaded beforehand)
Why is it needed?
We currently have no runners for testing CUDA and multi-GPU tests, or slow tests on the CI. As a result there were many undiscovered errors and bugs that needed to be solved. This PR ensures that we can trust the quality of our code and know when we have issues on platforms other than on the CPU.
Currently they are running nightly, a different workflow will handle when pushes to main occur that I will work on later next week.