-
Notifications
You must be signed in to change notification settings - Fork 27.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workflow / ENH: Add SSH into our runners workflow #30425
Workflow / ENH: Add SSH into our runners workflow #30425
Conversation
.github/workflows/ssh-runner.yml
Outdated
- name: Check out code | ||
uses: actions/checkout@v4 | ||
|
||
- name: Install locally transformers & other libs | ||
run: | | ||
apt install sudo | ||
sudo -H pip install --upgrade pip | ||
sudo -H pip uninstall -y transformers | ||
sudo -H pip install -U -e ".[testing]" | ||
MAX_JOBS=4 pip install flash-attn --no-build-isolation | ||
pip install bitsandbytes | ||
|
||
- name: NVIDIA-SMI | ||
run: | | ||
nvidia-smi | ||
|
||
- name: Show installed libraries and their versions | ||
run: pip freeze |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just use the docker image of daily CI and other steps (except the tests and report part). That way, it's more close to the actual env of the test environment, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See: 522ed15
- name: Tailscale # In order to be able to SSH when a test fails | ||
uses: huggingface/tailscale-action@v1 | ||
with: | ||
authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }} | ||
slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }} | ||
slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} | ||
waitForSSH: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@glegendre01 Is it possible to timeout the job with explicit amount of time here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it's possible.
sshTimeout: 10m for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @glegendre01 I didn't see this sshTimeout
in this file - hope I am looking the right file.
What I have in mind (from my experience with CircleCI), there are 2 interesting timeout:
- timeout the job if no ssh connection being done with X minutes (say 10m)
- timeout the job after Y minutes even if there is connection (say 1 hour)
Probably this/these is/are already done in this file without having them in the inputs
? (I don't mean to ask them being in inputs
, just wondering if such timeout are implemented). I do see timeout 5m
in that file, but not sure its meaning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think that file is different, you need to look for the file with the v1
tag: https://github.com/huggingface/tailscale-action/blob/v1/action.yaml#L59 🤯 for the other points, i will let @glegendre01 answer 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice @younesbelkada thanks a lot
. I think so far we can add sshTimeout
then merge the PR. Or we wait for another approve if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks ! Ok yes, I think we can use the default one (5min), I would love to have a review from @glegendre01 before merging it if possible 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, there is default value 😅
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but let's check how it works after it is merged to main
.
Thanks a lot for this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea !
* add SSH into our runners workflow * fix * fix * fix * use our previous approaches * forward contrib credits from discussions --------- Co-authored-by: Yih-Dar <[email protected]>
What does this PR do?
This PR adds a new workflow that enables developers to easily SSH into our runners and debug the failing tests, as discussed offline with @ydshieh during lunch
cc @glegendre01