Reduce the number of CI jobs requiring self-hosted runners #17957

Closed
Tracked by #16203
ScottTodd opened this issue Jul 18, 2024 · 4 comments
Labels
cleanup 🧹 infrastructure Relating to build systems, CI, or testing

Comments

@ScottTodd
Member

See also

There are currently 11 jobs in ci.yml that use the - cpu runner label to get a large self-hosted CPU runner:

  1. build_test_all_bazel
  2. python_release_packages (see Remove 'python_release_packages' job from ci.yml #17955 for that specifically)
  3. sanitizers :: asan
  4. sanitizers :: tsan
  5. gcc
  6. debug
  7. byo_llvm
  8. cross_compile_and_test (riscv_64 linux)
  9. cross_compile_and_test (riscv_32 linux)
  10. cross_compile_and_test (riscv_32 generic)
  11. cross_compile_and_test (wasm32)

There are even more jobs in build_and_test_android.yml, benchmark.yml, build_e2e_test_artifacts.yml, etc.

Some of these jobs are fast and could be switched to standard GitHub-hosted runners if they kept the access needed to download and upload artifacts (most of these jobs use storage in Google Cloud). Of the slow jobs, several could be moved to opt-in or even nightly.
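As a rough illustration of the switch described above, the difference between the two kinds of job comes down to the `runs-on` value. This is a minimal sketch, not the project's actual configuration: the job names, the `self-hosted` label, and the build scripts are hypothetical placeholders, and only the `cpu` label comes from the issue text.

```yaml
# Sketch only: job names, scripts, and the `self-hosted` label are assumptions.
jobs:
  # A slow, resource-hungry build that stays on a large self-hosted CPU runner.
  heavy_build:
    runs-on:
      - self-hosted
      - cpu
    steps:
      - uses: actions/checkout@v4
      - run: ./build_heavy.sh  # hypothetical placeholder script

  # A fast job moved to a standard GitHub-hosted runner.
  light_build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./build_light.sh  # hypothetical placeholder script
```

A job moved this way would still need credentials for any external artifact storage it reads from or writes to, which is the caveat raised above.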

@ScottTodd ScottTodd added infrastructure Relating to build systems, CI, or testing cleanup 🧹 labels Jul 18, 2024
jpienaar added a commit that referenced this issue Jul 19, 2024
To decrease resource pressure on presubmits, disable the cross compile tests for now.

#17957
jpienaar added a commit that referenced this issue Jul 19, 2024
To decrease resource pressure on presubmits, disable the cross compile
tests for now.

#17957

---------

Co-authored-by: Scott Todd <[email protected]>
ScottTodd added a commit that referenced this issue Jul 19, 2024
Progress on #17957

These auxiliary jobs use expensive CPU runners that are currently in
short supply.

We could add these back as on-demand or nightly jobs, but running them
on every commit is too expensive relative to the value they provide.

skip-ci: just disabling jobs
@ScottTodd
Member Author

With the attached commits, we're down to:

ci.yml, presubmit and postsubmit:

  1. build_all
  2. build_test_all_bazel
  3. sanitizers :: asan
  4. build_and_test_android / cross_compile

benchmark.yml, postsubmit and opt-in on presubmit (llvm integrates, PRs with benchmark labels):

  1. build_for_benchmarks
  2. build_benchmark_tools
  3. build_e2e_test_artifacts
  4. compilation_benchmarks
  5. execution_benchmarks / generate_matrix

pkgci.yml, presubmit and postsubmit:

  1. linux_x86_64_release_packages
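
The "opt-in on presubmit" behavior noted for benchmark.yml could be expressed with a label check on the job. The following is a sketch under stated assumptions: the label name `benchmarks:opt-in` is hypothetical, the job body is a placeholder, and the real workflow may gate its jobs differently.

```yaml
# Sketch only: the label name and the job body are assumptions.
on:
  pull_request:
    types: [opened, synchronize, reopened, labeled]
  push:
    branches: [main]

jobs:
  build_for_benchmarks:
    # Always run on postsubmit (push); on presubmit, run only when the
    # pull request carries the opt-in label.
    if: >-
      github.event_name == 'push' ||
      contains(github.event.pull_request.labels.*.name, 'benchmarks:opt-in')
    runs-on: ubuntu-latest
    steps:
      - run: echo "build for benchmarks"  # placeholder step
```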

@ScottTodd ScottTodd self-assigned this Jul 19, 2024
@ScottTodd
Member Author

Started some experiments moving heavy workflows to nightly using standard runners (could also strike some middle-ground with self-hosted runners but wanted to test the limits): https://github.com/ScottTodd/iree/tree/infra-ci-nightly

ScottTodd added a commit that referenced this issue Jul 24, 2024
Progress on #17957.

This moves the ASan (Address Sanitizer) and TSan (Thread Sanitizer) jobs
into their own workflows, so they can run based on independent triggers
and can be individually enabled/disabled as needed.

For now these are the triggers:
* ASan runs on `pull_request` and `push` events, similar to how it ran
as part of `ci.yml`
* TSan runs on a nightly `schedule` and on-demand using
`workflow_dispatch`. We can add other ways to trigger it like via pull
request labels or git trailers as needed.

Both of these jobs need more disk space than GitHub's standard runners
have, so they are running on larger self-hosted runners. If we have
enough runner capacity then both jobs could run on every commit
(`pull_request` and `push`). I think ASan is valuable enough for that
but TSan is more situational.

---

For more information about these sanitizers, see
* https://iree.dev/developers/debugging/sanitizers/
* https://github.com/google/sanitizers
* https://clang.llvm.org/docs/AddressSanitizer.html
* https://clang.llvm.org/docs/ThreadSanitizer.html
* https://clang.llvm.org/docs/MemorySanitizer.html
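
The trigger split described in the commit message above could look roughly like the two workflow headers below. This is a sketch, not the committed files: the workflow names, the branch filter, and the cron time are assumptions, and the two documents would live in separate files under `.github/workflows/`.

```yaml
# ASan workflow (separate file): runs on every pull request and push,
# similar to how it ran as part of ci.yml.
name: asan  # hypothetical name
on:
  pull_request:
  push:
    branches: [main]  # assumed branch filter
---
# TSan workflow (separate file): runs nightly and on demand.
name: tsan  # hypothetical name
on:
  schedule:
    - cron: "0 9 * * *"  # assumed time; cron schedules are evaluated in UTC
  workflow_dispatch:
```

Keeping each sanitizer in its own file is what allows the triggers to diverge, and lets either workflow be disabled from the Actions UI without touching the other.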
ScottTodd added a commit that referenced this issue Jul 24, 2024
)

Progress on #17957 - bringing
back jobs that were disabled, with changes to the `runs-on` and
triggering.

For now these are the configurations:
* `gcc` runs on a nightly `schedule` on standard GitHub-hosted runners,
taking ~3h30m
* `byo_llvm` runs on a nightly `schedule` on standard GitHub-hosted
runners, taking ~3h20m
* `debug` runs on a nightly `schedule` on self-hosted runners, taking
~20m
  * This build runs out of disk space for standard GitHub-hosted runners

We can adjust the triggers over time, such as by adding ways to trigger
with labels / git trailers, or by triggering on `pull_request` when file
paths like `third_party/llvm-project` are changed.

skip-ci: adding new workflows
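
A nightly workflow with an additional path-filtered `pull_request` trigger, as suggested in the commit message above, might be sketched as follows. The workflow name, cron time, timeout, and build script are assumptions. Whether a bare submodule path matches a submodule pointer update should be verified before relying on it.

```yaml
# Sketch only: names, schedule, timeout, and script are assumptions.
name: nightly_byo_llvm
on:
  schedule:
    - cron: "0 9 * * *"  # assumed nightly time (UTC)
  workflow_dispatch:
  pull_request:
    paths:
      # Run on pull requests only when the LLVM submodule changes.
      - "third_party/llvm-project"

jobs:
  byo_llvm:
    runs-on: ubuntu-latest
    # The build reportedly takes ~3h20m, which is under the 6 hour job
    # limit for GitHub-hosted runners; set an explicit ceiling anyway.
    timeout-minutes: 300
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: true
      - run: ./build_byo_llvm.sh  # hypothetical placeholder script
```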
@ScottTodd
Member Author

Largest remaining unknowns (which I'm not actively working on) are the https://github.com/iree-org/iree/blob/main/.github/workflows/benchmark.yml jobs and cross compilation jobs.

ScottTodd added a commit that referenced this issue Aug 5, 2024
Progress on #16203 and
#17957.

New page preview:
https://scotttodd.github.io/iree/developers/general/github-actions/

## Overview

This documents the state of our GitHub Actions usage after recent
refactoring work. We used to have a monolithic `ci.yml` that ran 10+
builds on every commit, many of which used large CPU runners to build
the full project (including the compiler) from source. Now we have
platform builds, some of which run every commit but many of which run on
nightly schedules. Most interesting _test_ jobs are now a part of
"pkgci", which builds release packages and then runs tests using those
packages (and sometimes runtime source builds).

Net results:

* Infrastructure costs should be more manageable - far fewer workflow
runs will require powerful CPU builders
* Self-hosted runners with special hardware have dedicated places to be
slotted in and are explicitly not on the critical path for pull requests

## Details

* Add new README badges
  * pre-commit, OpenSSF Best Practices, releases (stable, nightly), PyPI
    packages, select CI workflows
* Add github-actions.md webpage
  * Full workflow status tables
  * Descriptions of each type of workflow
  * Tips on (not) using Docker
  * Information about `pull_request`, `push`, `schedule` triggers
  * Information about GitHub-hosted runners and self-hosted runners
  * Maintenance/debugging tips
* Small cleanup passes through related files

skip-ci: modified script does not affect builds
ScottTodd added a commit that referenced this issue Aug 5, 2024
Progress on #16203 and
#17957.

This migrates `.github/workflows/build_and_test_android.yml` to
`.github/workflows/pkgci_test_android.yml`.

**For now, this only builds for Android, it does not run tests or use
real Android devices at all**.

The previous workflow
* Relied on the "install" directory from a CMake build 
* Ran on large self-hosted CPU build machines
* Built within Docker (using
[`build_tools/docker/dockerfiles/android.Dockerfile`](https://github.com/iree-org/iree/blob/main/build_tools/docker/dockerfiles/android.Dockerfile))
* Used GCP/GCS for remote ccache storage
* Used GCP/GCS for passing files between jobs
* Ran tests on self-hosted lab machines (I think a Raspberry Pi
connected to some physical Android devices)

The new workflow
* Relies on Python packages produced by pkgci_build_packages
* Runs on standard GitHub-hosted runners
* Installs dependencies that it needs on-demand (ninja, Android NDK),
without using Docker
* Uses caches provided by GitHub Actions for ccache storage
* Could use Artifacts provided by GitHub Actions for passing files
between jobs
* Could run tests on self-hosted lab machines or Android emulators

I made some attempts at passing files from the build job to a test job
but ran into some GitHub Actions debugging that was tricky. Leaving the
remaining migration work there to contributors at Google or other
parties directly interested in Android CI infrastructure.

ci-exactly: build_packages, test_android
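
The caching and file-passing approach described for the new workflow could be sketched as below. This is an illustration under assumptions, not the contents of `pkgci_test_android.yml`: the job names, cache key, directory names, and build script are hypothetical, and Android NDK installation is omitted.

```yaml
# Sketch only: job names, keys, paths, and scripts are assumptions.
jobs:
  build_android:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install build dependencies on demand (no Docker)
        run: sudo apt-get update && sudo apt-get install -y ninja-build ccache
      - name: Restore ccache from the GitHub Actions cache
        uses: actions/cache@v4
        with:
          path: .ccache
          # A unique key per commit saves a fresh cache each run, while
          # restore-keys falls back to the most recent earlier cache.
          key: ccache-android-${{ github.sha }}
          restore-keys: |
            ccache-android-
      - name: Build
        env:
          CCACHE_DIR: ${{ github.workspace }}/.ccache
        run: ./build_android.sh  # hypothetical placeholder script
      - name: Upload build outputs for later jobs
        uses: actions/upload-artifact@v4
        with:
          name: android-build
          path: build-android/

  test_android:
    needs: build_android
    runs-on: ubuntu-latest  # could instead be a self-hosted lab machine
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: android-build
          path: build-android/
      - run: ls build-android/  # placeholder for real tests
```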
@ScottTodd
Member Author

Going to call this fixed now.

In the ci.yml workflow, the only jobs still using self-hosted CPU runners are build_test_all_bazel and build_all, and build_all is about to be replaced with the equivalent linux_x64_clang job in ci_linux_x64_clang.yml.

Beyond that, we have pkgci.yml build_packages and ci_linux_x64_clang_asan.yml running on presubmit. A few postsubmit (nightly) builds also use self-hosted CPU runners (TSan, Debug).

The cross_compile_and_test jobs (riscv, wasm) have yet to be migrated to pkgci, but they are currently disabled.

The benchmark.yml workflows still use self-hosted runners where they shouldn't be needed, but I'm not personally working on migrating them. Might delete them if they incur shared costs though.

ScottTodd added a commit that referenced this issue Aug 6, 2024
…18107)

Cleanup relating to #16203 and
#17957.

After #18070, no jobs in `ci.yml`
are depending on the output of the `build_all` job. All jobs using the
compiler now live in `pkgci.yml` and use packages instead of the
"install dir".

* The reusable
[`build_all.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/build_all.yml)
workflow is still in use from benchmarking workflows, but those are
unmaintained and may be deleted soon.
* After this is merged I'll set the `linux_x64_clang` check to
"required", so PRs will need it to be passing before they can be merged.
* Now that `ci.yml` is pruned to just a few remaining jobs
(`build_test_all_bazel`, `build_test_runtime`, `small_runtime`,
`tracing`), we can also limit when the workflow runs at all. I've added
some `paths` patterns so this won't run on push events if only files
under `docs/` are changed, for example.
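
A `paths` filter of the kind described in the last bullet might be sketched like this. The exact patterns in the real `ci.yml` are not shown in this thread, so the patterns and branch filter below are assumptions.

```yaml
# Sketch only: the real ci.yml patterns may differ.
on:
  push:
    branches: [main]  # assumed branch filter
    paths:
      # Match everything, then exclude documentation-only changes.
      # The workflow is skipped when every changed file is under docs/.
      - "**"
      - "!docs/**"
```

A `paths` list needs at least one positive pattern before any `!` exclusions, which is why the catch-all `"**"` comes first.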