Skip to content

Commit

Permalink
remove delete-temp-images job (#709)
Browse files Browse the repository at this point in the history
Follow-up to #708.

Proposes completely removing the `delete-temp-images` job, in favor of relying on the scheduled nightly cleanup at https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml.

## Notes for Reviewers

### Details

CI here writes images to the `rapidsai/staging` repo on DockerHub, then later copies them to individual user-facing repos.
To avoid those temporary CI artifacts piling up in the `rapidsai/staging` repo, pull requests and branch builds run a job called `delete-temp-images` which does what it sounds like.

In exchange for more aggressive cleaning, this job introduces significant complexity for development here. Most notably, we've observed several instances where that job deletes images before all CI jobs needing them have completed successfully, leading to all of CI needing to be re-run.

Significant effort has been put into trying to avoid that, and we've found it's been difficult to get it right:

some attempts:

* #702
* #708

a recent example:

* #696 (comment)

### Ok so how will we clean up?

The workflow at https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml.

It runs once a day and deletes anything from `rapidsai/staging` that's more than 30 days old.

### Benefits of these changes

As described in #708 (comment) ...

CI here will work as it does in other RAPIDS repos.... if any jobs fail for retryable reasons (like network issues), you can safely click "re-run failed jobs" and make incremental progress towards all builds passing.

Also reduces the need to maintain code that has to keep up with the DockerHub API in two places (by deleting `ci/delete-temp-images.sh` here).

Authors:
  - James Lamb (https://github.com/jameslamb)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)
  - https://github.com/jakirkham

URL: #709
  • Loading branch information
jameslamb authored Aug 22, 2024
1 parent afd8c20 commit b007a92
Show file tree
Hide file tree
Showing 6 changed files with 15 additions and 90 deletions.
35 changes: 0 additions & 35 deletions .github/workflows/build-test-publish-images.yml
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,6 @@ jobs:
- build
- build-multiarch-manifest
- test
- delete-temp-images
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/[email protected]
checks:
Expand Down Expand Up @@ -249,37 +248,3 @@ jobs:
cuda${{ matrix.CUDA_VER }}-\
py${{ matrix.PYTHON_VER }}-\
${{ matrix.ARCH }}"
delete-temp-images:
if: ${{ !cancelled() && (needs.test.result == 'success' || needs.test.result == 'skipped') }}
needs: [compute-matrix, build, build-multiarch-manifest, test]
strategy:
matrix: ${{ fromJSON(needs.compute-matrix.outputs.MATRIX) }}
fail-fast: false
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Remove temporary images
shell: bash
env:
RAFT_ANN_BENCH_CPU_IMAGE_BUILT: ${{ matrix.BUILD_RAFT_ANN_BENCH_CPU_IMAGE }}
BASE_IMAGE_REPO: ${{ needs.compute-matrix.outputs.BASE_IMAGE_REPO }}
BASE_TAG_PREFIX: ${{ needs.compute-matrix.outputs.BASE_TAG_PREFIX }}
RAPIDS_VER: ${{ needs.compute-matrix.outputs.RAPIDS_VER }}
ALPHA_TAG: ${{ needs.compute-matrix.outputs.ALPHA_TAG }}
CUDA_TAG: ${{ matrix.CUDA_TAG }}
PYTHON_VER: ${{ matrix.PYTHON_VER }}
NOTEBOOKS_IMAGE_REPO: ${{ needs.compute-matrix.outputs.NOTEBOOKS_IMAGE_REPO }}
NOTEBOOKS_TAG_PREFIX: ${{ needs.compute-matrix.outputs.NOTEBOOKS_TAG_PREFIX }}
RAFT_ANN_BENCH_IMAGE_REPO: ${{ needs.compute-matrix.outputs.RAFT_ANN_BENCH_IMAGE_REPO }}
RAFT_ANN_BENCH_TAG_PREFIX: ${{ needs.compute-matrix.outputs.RAFT_ANN_BENCH_TAG_PREFIX }}
RAFT_ANN_BENCH_DATASETS_IMAGE_REPO: ${{ needs.compute-matrix.outputs.RAFT_ANN_BENCH_DATASETS_IMAGE_REPO }}
RAFT_ANN_BENCH_DATASETS_TAG_PREFIX: ${{ needs.compute-matrix.outputs.RAFT_ANN_BENCH_DATASETS_TAG_PREFIX }}
RAFT_ANN_BENCH_CPU_IMAGE_REPO: ${{ needs.compute-matrix.outputs.RAFT_ANN_BENCH_CPU_IMAGE_REPO }}
RAFT_ANN_BENCH_CPU_TAG_PREFIX: ${{ needs.compute-matrix.outputs.RAFT_ANN_BENCH_CPU_TAG_PREFIX }}
GPUCIBOT_DOCKERHUB_USER: ${{ secrets.GPUCIBOT_DOCKERHUB_USER }}
GPUCIBOT_DOCKERHUB_TOKEN: ${{ secrets.GPUCIBOT_DOCKERHUB_TOKEN }}
ARCHES: ${{ toJSON(matrix.ARCHES) }}
run: ci/delete-temp-images.sh
11 changes: 11 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,3 +15,14 @@ To build just the `base` image with default arguments: `docker buildx build --pu
- `CUDA_VER` - Version of CUDA to use. Should be `major.minor.patch`
- `PYTHON_VER` - Version of Python to use. Should be `major.minor`
- `RAPIDS_VER` - Version of RAPIDS to use. Should be `YY.MM`

## Cleaning Up

Every build first writes images to the https://hub.docker.com/r/rapidsai/staging repo on DockerHub,
then pushes them on to the individual repos like `rapidsai/base`, `rapidsai/notebooks`, etc.

A scheduled job regularly deletes old images from that `rapidsai/staging` repo.
See https://github.com/rapidsai/workflows/blob/main/.github/workflows/cleanup_staging.yaml for details.

If you come back to a pull requests here after more than a few days and find that jobs are failing with errors
that suggest that some necessary images don't exist, re-run all of CI on that pull request to produce new images.
50 changes: 0 additions & 50 deletions ci/delete-temp-images.sh

This file was deleted.

6 changes: 3 additions & 3 deletions matrix-test.yaml
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
# CUDA_VER is `<major>.<minor>` (e.g. `12.0`)

pull-request:
- { CUDA_VER: '11.8', ARCH: 'amd64', PYTHON_VER: '3.9', GPU: 'v100', DRIVER: 'earliest' }
- { CUDA_VER: '11.8', ARCH: 'amd64', PYTHON_VER: '3.10', GPU: 'v100', DRIVER: 'earliest' }
- { CUDA_VER: '12.0', ARCH: 'amd64', PYTHON_VER: '3.10', GPU: 'v100', DRIVER: 'latest' }
- { CUDA_VER: '12.2', ARCH: 'arm64', PYTHON_VER: '3.11', GPU: 'a100', DRIVER: 'latest' }
- { CUDA_VER: '12.5', ARCH: 'amd64', PYTHON_VER: '3.11', GPU: 'v100', DRIVER: 'latest' }
branch:
- { CUDA_VER: '11.8', ARCH: 'amd64', PYTHON_VER: '3.9', GPU: 'v100', DRIVER: 'earliest' }
- { CUDA_VER: '11.8', ARCH: 'amd64', PYTHON_VER: '3.9', GPU: 'v100', DRIVER: 'latest' }
- { CUDA_VER: '11.8', ARCH: 'amd64', PYTHON_VER: '3.10', GPU: 'v100', DRIVER: 'earliest' }
- { CUDA_VER: '11.8', ARCH: 'amd64', PYTHON_VER: '3.10', GPU: 'v100', DRIVER: 'latest' }
- { CUDA_VER: '12.0', ARCH: 'amd64', PYTHON_VER: '3.10', GPU: 'v100', DRIVER: 'latest' }
- { CUDA_VER: '12.0', ARCH: 'arm64', PYTHON_VER: '3.10', GPU: 'a100', DRIVER: 'latest' }
- { CUDA_VER: '12.2', ARCH: 'amd64', PYTHON_VER: '3.11', GPU: 'v100', DRIVER: 'latest' }
Expand Down
1 change: 0 additions & 1 deletion matrix.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,5 @@ CUDA_VER: # Should be `<major>.<minor>.<patch>` (e.g. `11.2.2`)
- "12.2.2"
- "12.5.1"
PYTHON_VER:
- "3.9"
- "3.10"
- "3.11"
2 changes: 1 addition & 1 deletion raft-ann-bench/cpu/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ EOF
# we need perl temporarily for the remaining benchmark perl scripts
RUN apt-get install perl -y

# Install python before updating environment, otherwise Python 3.9 image
# Install python before updating environment, otherwise Python 3 image
# runs into a solver conflict with truststore 0.8.0. This avoids the environment installing
# packages incompatible with python version needed before python itself is pinned to the correct version.
RUN <<EOF
Expand Down

0 comments on commit b007a92

Please sign in to comment.