Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workflow / ENH: Add SSH into our runners workflow #30425

Merged
merged 6 commits into from
Apr 25, 2024

Conversation

younesbelkada
Copy link
Contributor

What does this PR do?

This PR adds a new workflow that enables developers to easily SSH into our runners and debug the failing tests, as discussed offline with @ydshieh during lunch

cc @glegendre01

Comment on lines 34 to 51
- name: Check out code
uses: actions/checkout@v4

- name: Install locally transformers & other libs
run: |
apt install sudo
sudo -H pip install --upgrade pip
sudo -H pip uninstall -y transformers
sudo -H pip install -U -e ".[testing]"
MAX_JOBS=4 pip install flash-attn --no-build-isolation
pip install bitsandbytes

- name: NVIDIA-SMI
run: |
nvidia-smi

- name: Show installed libraries and their versions
run: pip freeze
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just use the docker image of daily CI and other steps (except the tests and report part). That way, it's more close to the actual env of the test environment, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See: 522ed15

Comment on lines +53 to +59
- name: Tailscale # In order to be able to SSH when a test fails
uses: huggingface/tailscale-action@v1
with:
authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }}
slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }}
slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
waitForSSH: true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glegendre01 Is it possible to timeout the job with explicit amount of time here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it's possible.
sshTimeout: 10m for example

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @glegendre01 I didn't see this sshTimeout in this file - hope I am looking the right file.

What I have in mind (from my experience with CircleCI), there are 2 interesting timeout:

  • timeout the job if no ssh connection being done with X minutes (say 10m)
  • timeout the job after Y minutes even if there is connection (say 1 hour)

Probably this/these is/are already done in this file without having them in the inputs? (I don't mean to ask them being in inputs, just wondering if such timeout are implemented). I do see timeout 5m in that file, but not sure its meaning.

Copy link
Contributor Author

@younesbelkada younesbelkada Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think that file is different, you need to look for the file with the v1 tag: https://github.com/huggingface/tailscale-action/blob/v1/action.yaml#L59 🤯 for the other points, i will let @glegendre01 answer 🙏

Copy link
Collaborator

@ydshieh ydshieh Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice @younesbelkada thanks a lot

. I think so far we can add sshTimeout then merge the PR. Or we wait for another approve if you prefer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks ! Ok yes, I think we can use the default one (5min), I would love to have a review from @glegendre01 before merging it if possible 🙏

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, there is default value 😅

@younesbelkada younesbelkada requested a review from ydshieh April 23, 2024 13:02
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but let's check how it works after it is merged to main.

Thanks a lot for this!

Copy link
Contributor

@glegendre01 glegendre01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea !

@younesbelkada younesbelkada merged commit cebb072 into huggingface:main Apr 25, 2024
8 checks passed
@younesbelkada younesbelkada deleted the add-workflow-ssh branch April 25, 2024 08:23
itazap pushed a commit that referenced this pull request May 14, 2024
* add SSH into our runners workflow

* fix

* fix

* fix

* use our previous approaches

* forward contrib credits from discussions

---------

Co-authored-by: Yih-Dar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants