Update nightly workflows to open an issue if CI fails #3952

Merged on Aug 10, 2023 (42 commits)
Commits
b2e118b
Update H100 workflow to open an issue if nightly CI fails
loadams Jul 13, 2023
fa860f5
Test running as not CI
loadams Jul 13, 2023
3942858
Merge branch 'master' into loadams/auto-task-open-failure
loadams Jul 13, 2023
2575723
Add all nightly/switch envvar name
loadams Jul 13, 2023
53799ac
Test with AMD
loadams Jul 13, 2023
05ca420
Add way to get url, switch path of template
loadams Jul 13, 2023
94c4687
Merge branch 'master' into loadams/auto-task-open-failure
loadams Jul 14, 2023
5360ad5
Add additional checkout step
loadams Jul 14, 2023
d48eec2
Merge branch 'master' into loadams/auto-task-open-failure
loadams Jul 14, 2023
da696b8
Move actions checkout step
loadams Jul 14, 2023
ca0eb6e
Try absolute path with github workspace
loadams Jul 14, 2023
668d5cd
Create issue without template/path
loadams Jul 17, 2023
3f19897
Re-enable and add debug logic
loadams Jul 17, 2023
9458cdb
Merge branch 'master' into loadams/auto-task-open-failure
loadams Jul 17, 2023
bb96008
add if failed()
loadams Jul 17, 2023
2bd7d36
More debug
loadams Jul 17, 2023
1f3d0c7
Try without checkout action uses
loadams Jul 17, 2023
571cffb
Rename file
loadams Jul 18, 2023
fa6e66a
Update variables
loadams Jul 18, 2023
f0fbd13
Update issue template
loadams Jul 18, 2023
e7c2915
Confirm removing permissions still work
loadams Jul 18, 2023
b93fdbb
Revert "Confirm removing permissions still work"
loadams Jul 18, 2023
c0d11fb
Re-enable permissions
loadams Jul 18, 2023
5c5c5fd
Remove PR trigger for AMD MI200 tests
loadams Jul 18, 2023
bb90f1d
Merge branch 'master' into loadams/auto-task-open-failure
loadams Jul 18, 2023
f58e97c
Revert "Remove PR trigger for AMD MI200 tests"
loadams Jul 18, 2023
e5c1aa7
Test update_existing
loadams Jul 18, 2023
48aea0b
Merge branch 'master' into loadams/auto-task-open-failure
loadams Jul 19, 2023
c3a273c
Switch to composite action
loadams Jul 19, 2023
9d468f2
Fix line ending encoding issue
loadams Jul 19, 2023
65732ad
Switch failure to be a variable
loadams Jul 19, 2023
baa2d48
Test with second workflow
loadams Jul 19, 2023
607b787
Format fix
loadams Jul 19, 2023
027b18b
Switch failure to always
loadams Jul 19, 2023
82321bd
Merge branch 'master' into loadams/auto-task-open-failure
loadams Jul 19, 2023
3e75c59
Merge branch 'master' into loadams/auto-task-open-failure
loadams Jul 20, 2023
403adb6
Merge branch 'master' into loadams/auto-task-open-failure
loadams Aug 9, 2023
cccf6cf
Switch back to previously working way
loadams Aug 9, 2023
e051da7
Test permission changes
loadams Aug 9, 2023
40a5871
Revert "Test permission changes"
loadams Aug 9, 2023
f568389
Update existing bugs with newest build failure link
loadams Aug 9, 2023
2a18d21
Remove PR triggers for that were used for testing.
loadams Aug 9, 2023
10 changes: 10 additions & 0 deletions .github/ISSUE_TEMPLATE/ci_failure_report.md
@@ -0,0 +1,10 @@
---
name: CI failure report
about: Report a DeepSpeed CI failure
title: "{{ env.GITHUB_WORKFLOW }} CI test failure"
labels: ci-failure
assignees: ''

---

The Nightly CI for {{ env.GITHUB_SERVER_URL }}/{{ env.GITHUB_REPOSITORY }}/actions/runs/{{ env.GITHUB_RUN_ID }} failed.
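The template above relies on `{{ env.* }}` placeholders, which `JasonEtco/create-an-issue` fills in from the runner's environment when it renders the file. As a rough illustration of the strings that result, the following Python sketch performs the same substitution by hand. It is not the action's own rendering code; the function names and the sample values (`octo-org/octo-repo`, the run ID) are hypothetical.

```python
def render_failure_title(env: dict) -> str:
    """Expand the template's title: "{{ env.GITHUB_WORKFLOW }} CI test failure"."""
    return f"{env['GITHUB_WORKFLOW']} CI test failure"


def render_failure_body(env: dict) -> str:
    """Expand the template's body, which links to the failed workflow run."""
    run_url = (
        f"{env['GITHUB_SERVER_URL']}/{env['GITHUB_REPOSITORY']}"
        f"/actions/runs/{env['GITHUB_RUN_ID']}"
    )
    return f"The Nightly CI for {run_url} failed."


# Hypothetical values; on a real runner GitHub Actions sets these variables.
example_env = {
    "GITHUB_WORKFLOW": "nv-nightly",
    "GITHUB_SERVER_URL": "https://github.com",
    "GITHUB_REPOSITORY": "octo-org/octo-repo",
    "GITHUB_RUN_ID": "1234567890",
}

print(render_failure_title(example_env))
print(render_failure_body(example_env))
```

Because the title is derived from the workflow name, each nightly workflow ends up with its own distinct issue title.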
13 changes: 13 additions & 0 deletions .github/workflows/amd-mi200.yml
@@ -8,6 +8,10 @@ concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read
  issues: write

jobs:
  amd-tests:
    # The type of runner that the job will run on
@@ -65,3 +69,12 @@ jobs:
          cd tests
          pytest $PYTEST_OPTS -n 4 --verbose unit/
          pytest $PYTEST_OPTS -m 'sequential' unit/

      - name: Open GitHub issue if nightly CI fails
        if: failure()
        uses: JasonEtco/create-an-issue@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          filename: .github/ISSUE_TEMPLATE/ci_failure_report.md
          update_existing: true
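The `update_existing: true` input is what keeps repeated nightly failures from piling up as duplicate issues: an open issue with the same title is updated with the newest run link instead of a new one being opened. The Python sketch below models that behaviour on an in-memory list, assuming matching is by exact title among open issues. It calls no GitHub API, and `report_failure` and the dictionary layout are hypothetical names used only for illustration.

```python
def report_failure(issues: list, title: str, body: str) -> dict:
    """Update the open issue with this title, or create one if none exists.

    `issues` is a plain list of dicts standing in for the issue tracker.
    """
    for issue in issues:
        if issue["state"] == "open" and issue["title"] == title:
            issue["body"] = body  # point the existing issue at the newest failed run
            return issue

    new_issue = {
        "number": len(issues) + 1,
        "title": title,
        "body": body,
        "state": "open",
    }
    issues.append(new_issue)
    return new_issue


tracker = []
first = report_failure(tracker, "amd-mi200 CI test failure", "run 1 failed")
second = report_failure(tracker, "amd-mi200 CI test failure", "run 2 failed")
print(len(tracker), second["body"])
```

Once a maintainer closes the issue, the next failure no longer finds an open match and a fresh issue is created.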
13 changes: 13 additions & 0 deletions .github/workflows/nv-h100.yml
@@ -8,6 +8,10 @@ concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read
  issues: write

jobs:
  unit-tests:
    runs-on: [self-hosted, nvidia, h100]
@@ -49,3 +53,12 @@ jobs:
          cd tests
          python -m pytest $PYTEST_OPTS -n 4 unit/ --torch_ver="2.0" --cuda_ver="12"
          python -m pytest $PYTEST_OPTS -m 'sequential' unit/ --torch_ver="2.0" --cuda_ver="12"

      - name: Open GitHub issue if nightly CI fails
        if: failure()
        uses: JasonEtco/create-an-issue@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          filename: .github/ISSUE_TEMPLATE/ci_failure_report.md
          update_existing: true
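Each workflow gates the issue step with `if: failure()`, so it runs only when an earlier step in the same job has failed; steps without an `if:` behave as if guarded by `success()`. A minimal Python simulation of that gating follows. It is a simplified model that ignores cancellation and `continue-on-error`, and `run_job` and the step tuples are hypothetical.

```python
def run_job(steps):
    """Run (name, condition, succeeds) steps and return the names that executed.

    condition is "success" (the default for a step with no `if:`),
    "failure", or "always".
    """
    job_failed = False
    executed = []
    for name, condition, succeeds in steps:
        should_run = (
            condition == "always"
            or (condition == "failure" and job_failed)
            or (condition == "success" and not job_failed)
        )
        if not should_run:
            continue
        executed.append(name)
        if not succeeds:
            job_failed = True
    return executed


def nightly_steps(tests_pass: bool):
    """Build a three-step job whose last step mirrors the issue-opening step."""
    return [
        ("install", "success", True),
        ("unit tests", "success", tests_pass),
        ("open issue", "failure", True),
    ]


print(run_job(nightly_steps(tests_pass=True)))
print(run_job(nightly_steps(tests_pass=False)))
```

In this model the issue step is skipped on a green run and executes only after the test step fails, which is the behaviour the workflows depend on.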
13 changes: 13 additions & 0 deletions .github/workflows/nv-nightly.yml
@@ -8,6 +8,10 @@ concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read
  issues: write

jobs:
  unit-tests:
    runs-on: [self-hosted, nvidia, cu116, v100]
@@ -47,3 +51,12 @@ jobs:
          unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
          cd tests
          pytest $PYTEST_OPTS --forked -m 'nightly' unit/ --torch_ver="1.13" --cuda_ver="11.6"

      - name: Open GitHub issue if nightly CI fails
        if: failure()
        uses: JasonEtco/create-an-issue@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          filename: .github/ISSUE_TEMPLATE/ci_failure_report.md
          update_existing: true
13 changes: 13 additions & 0 deletions .github/workflows/nv-torch-nightly-v100.yml
@@ -8,6 +8,10 @@ concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read
  issues: write

jobs:
  unit-tests:
    runs-on: [self-hosted, nvidia, cu116, v100]
@@ -48,3 +52,12 @@ jobs:
          cd tests
          pytest $PYTEST_OPTS --forked -n 4 unit/
          pytest $PYTEST_OPTS --forked -m 'sequential' unit/

      - name: Open GitHub issue if nightly CI fails
        if: failure()
        uses: JasonEtco/create-an-issue@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          filename: .github/ISSUE_TEMPLATE/ci_failure_report.md
          update_existing: true
13 changes: 13 additions & 0 deletions .github/workflows/nv-torch19-p40.yml
@@ -8,6 +8,10 @@ concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read
  issues: write

jobs:
  unit-tests:
    runs-on: [self-hosted, nvidia, cu111, p40]
@@ -47,3 +51,12 @@ jobs:
          unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
          cd tests
          pytest $PYTEST_OPTS --forked -n 4 unit/ --torch_ver="1.9" --cuda_ver="11.1"

      - name: Open GitHub issue if nightly CI fails
        if: failure()
        uses: JasonEtco/create-an-issue@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          filename: .github/ISSUE_TEMPLATE/ci_failure_report.md
          update_existing: true
13 changes: 13 additions & 0 deletions .github/workflows/nv-torch19-v100.yml
@@ -8,6 +8,10 @@ concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read
  issues: write

jobs:
  unit-tests:
    runs-on: [self-hosted, nvidia, cu111, v100]
@@ -48,3 +52,12 @@ jobs:
          cd tests
          pytest $PYTEST_OPTS --forked -n 4 unit/ --torch_ver="1.9" --cuda_ver="11"
          pytest $PYTEST_OPTS --forked -m 'sequential' unit/ --torch_ver="1.9" --cuda_ver="11"

      - name: Open GitHub issue if nightly CI fails
        if: failure()
        uses: JasonEtco/create-an-issue@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          filename: .github/ISSUE_TEMPLATE/ci_failure_report.md
          update_existing: true