Reduce the number of CI jobs requiring self-hosted runners #17957

ScottTodd · 2024-07-18T22:16:16Z

See also

Convert more CI jobs to be package-based #16203

There are currently 11 jobs in ci.yml that use the - cpu runner label to get a large self-hosted CPU runner:

build_test_all_bazel
python_release_packages (see Remove 'python_release_packages' job from ci.yml #17955 for that specifically)
sanitizers :: asan
sanitizers :: tsan
gcc
debug
byo_llvm
cross_compile_and_test (riscv_64 linux)
cross_compile_and_test (riscv_32 linux)
cross_compile_and_test (riscv_32 generic)
cross_compile_and_test (wasm32)

There are even more jobs in build_and_test_android.yml, benchmark.yml, build_e2e_test_artifacts.yml, etc.

Some of these jobs are fast and could be switched to standard GitHub-hosted runners if they kept the access needed to download and upload artifacts (most of these jobs use storage in Google Cloud). Of the slow jobs, several could be moved to opt-in or even nightly.


To decrease resource pressure on presubmits, disable the cross compile tests for now. #17957

Co-authored-by: Scott Todd

Progress on #17957.

These auxiliary jobs use expensive CPU runners that are currently in short supply. We could add these back as on-demand or nightly jobs, but running them on every commit is too expensive relative to the value they provide.

skip-ci: just disabling jobs

ScottTodd · 2024-07-19T18:39:20Z

With the attached commits, we're down to:

ci.yml, presubmit and postsubmit:

build_all
build_test_all_bazel
sanitizers :: asan
build_and_test_android / cross_compile

benchmark.yml, postsubmit and opt-in on presubmit (llvm integrates, PRs with benchmark labels):

build_for_benchmarks
build_benchmark_tools
build_e2e_test_artifacts
compilation_benchmarks
execution_benchmarks / generate_matrix

pkgci.yml, presubmit and postsubmit:

linux_x86_64_release_packages

ScottTodd · 2024-07-19T21:15:09Z

Started some experiments moving heavy workflows to nightly using standard runners (could also strike some middle-ground with self-hosted runners but wanted to test the limits): https://github.com/ScottTodd/iree/tree/infra-ci-nightly

Progress on #17957.

This moves the ASan (Address Sanitizer) and TSan (Thread Sanitizer) jobs into their own workflows, so they can run based on independent triggers and can be individually enabled/disabled as needed. For now these are the triggers:

* ASan runs on `pull_request` and `push` events, similar to how it ran as part of `ci.yml`
* TSan runs on a nightly `schedule` and on-demand using `workflow_dispatch`. We can add other ways to trigger it like via pull request labels or git trailers as needed.

Both of these jobs need more disk space than GitHub's standard runners have, so they are running on larger self-hosted runners. If we have enough runner capacity then both jobs could run on every commit (`pull_request` and `push`). I think ASan is valuable enough for that but TSan is more situational.

---

For more information about these sanitizers, see

* https://iree.dev/developers/debugging/sanitizers/
* https://github.com/google/sanitizers
* https://clang.llvm.org/docs/AddressSanitizer.html
* https://clang.llvm.org/docs/ThreadSanitizer.html
* https://clang.llvm.org/docs/MemorySanitizer.html
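As a rough illustration of the TSan trigger setup described above (nightly `schedule` plus on-demand `workflow_dispatch`, on a larger self-hosted runner), a workflow file might look something like the sketch below. The workflow name, cron time, job name, runner labels, and build script are all placeholders assumed for illustration, not values taken from the actual workflows.

```yaml
# Hypothetical sketch of a standalone TSan workflow.
# The name, cron time, job name, runner labels, and script path are assumptions.
name: TSan nightly (sketch)

on:
  # Nightly run. GitHub evaluates cron expressions in UTC.
  schedule:
    - cron: "0 9 * * *"
  # Allows manual, on-demand runs.
  workflow_dispatch:

jobs:
  linux_x64_clang_tsan:
    # Placeholder labels for a larger self-hosted runner with enough disk space.
    runs-on: [self-hosted, Linux, X64]
    steps:
      - uses: actions/checkout@v4
      - name: Build and test with TSan
        # Placeholder command; substitute the project's real build script.
        run: ./build_and_test_tsan.sh
```

An ASan workflow following the same pattern would instead list `pull_request` and `push` under `on:`, matching how it ran as part of `ci.yml`.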

Progress on #17957 - bringing back jobs that were disabled, with changes to the `runs-on` and triggering.

For now these are the configurations:

* `gcc` runs on a nightly `schedule` on standard GitHub-hosted runners, taking ~3h30m
* `byo_llvm` runs on a nightly `schedule` on standard GitHub-hosted runners, taking ~3h20m
* `debug` runs on a nightly `schedule` on self-hosted runners, taking ~20m
  * This build runs out of disk space for standard GitHub-hosted runners

We can adjust the triggers over time, such as by adding ways to trigger with labels / git trailers, or by triggering on `pull_request` when file paths like `third_party/llvm-project` are changed.

skip-ci: adding new workflows
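To make the `gcc` configuration above more concrete, here is a minimal sketch of a nightly job on a standard GitHub-hosted runner, together with the path-filtered `pull_request` trigger floated as a possible future adjustment. The workflow name, cron time, job name, timeout value, and build script are assumptions for illustration only; the `pull_request` block is not something the change above adds.

```yaml
# Hypothetical sketch; names, cron time, timeout, and script path are assumptions.
name: gcc nightly (sketch)

on:
  schedule:
    - cron: "0 10 * * *"
  # Possible future trigger: only run on pull requests that touch this path.
  # The two patterns match the path itself and anything beneath it.
  pull_request:
    paths:
      - "third_party/llvm-project"
      - "third_party/llvm-project/**"

jobs:
  linux_x64_gcc:
    # Standard GitHub-hosted runner.
    runs-on: ubuntu-latest
    # A ~3h30m build fits under the 6-hour job limit on GitHub-hosted runners;
    # this explicit cap is an assumed safety margin.
    timeout-minutes: 300
    steps:
      - uses: actions/checkout@v4
      - name: Build with gcc
        # Placeholder command; substitute the project's real build script.
        run: ./build_with_gcc.sh
```

The `debug` job would differ mainly in `runs-on`, since it needs a self-hosted runner with more disk space.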

ScottTodd · 2024-07-24T23:09:45Z

Largest remaining unknowns (which I'm not actively working on) are the https://github.com/iree-org/iree/blob/main/.github/workflows/benchmark.yml jobs and cross compilation jobs.

Progress on #16203 and #17957.

New page preview: https://scotttodd.github.io/iree/developers/general/github-actions/

## Overview

This documents the state of our GitHub Actions usage after recent refactoring work.

We used to have a monolithic `ci.yml` that ran 10+ builds on every commit, many of which used large CPU runners to build the full project (including the compiler) from source. Now we have platform builds, some of which run every commit but many of which run on nightly schedules. Most interesting _test_ jobs are now a part of "pkgci", which builds release packages and then runs tests using those packages (and sometimes runtime source builds).

Net results:

* Infrastructure costs should be more manageable - far fewer workflow runs will require powerful CPU builders
* Self-hosted runners with special hardware have dedicated places to be slotted in and are explicitly not on the critical path for pull requests

## Details

* Add new README badges
  * pre-commit, OpenSSF Best Practices, releases (stable, nightly), PyPI packages, select CI workflows
* Add github-actions.md webpage
  * Full workflow status tables
  * Descriptions of each type of workflow
  * Tips on (not) using Docker
  * Information about `pull_request`, `push`, `schedule` triggers
  * Information about GitHub-hosted runners and self-hosted runners
  * Maintenance/debugging tips
* Small cleanup passes through related files

skip-ci: modified script does not affect builds

Progress on #16203 and #17957.

This migrates `.github/workflows/build_and_test_android.yml` to `.github/workflows/pkgci_test_android.yml`. **For now, this only builds for Android, it does not run tests or use real Android devices at all**.

The previous workflow

* Relied on the "install" directory from a CMake build
* Ran on large self-hosted CPU build machines
* Built within Docker (using [`build_tools/docker/dockerfiles/android.Dockerfile`](https://github.com/iree-org/iree/blob/main/build_tools/docker/dockerfiles/android.Dockerfile))
* Used GCP/GCS for remote ccache storage
* Used GCP/GCS for passing files between jobs
* Ran tests on self-hosted lab machines (I think a raspberry pi connected to some physical Android devices)

The new workflow

* Relies on Python packages produced by pkgci_build_packages
* Runs on standard GitHub-hosted runners
* Installs dependencies that it needs on-demand (ninja, Android NDK), without using Docker
* Uses caches provided by GitHub Actions for ccache storage
* Could use Artifacts provided by GitHub Actions for passing files between jobs
* Could run tests on self-hosted lab machines or Android emulators

I made some attempts at passing files from the build job to a test job but ran into some GitHub Actions debugging that was tricky. Leaving the remaining migration work there to contributors at Google or other parties directly interested in Android CI infrastructure.

ci-exactly: build_packages, test_android

ScottTodd · 2024-08-05T22:24:11Z

Going to call this fixed now.

The ci.yml workflow now only uses self-hosted CPU runners for build_test_all_bazel (except for build_all, which is about to be replaced with the equivalent linux_x64_clang job in ci_linux_x64_clang.yml).

Beyond that, we have pkgci.yml build_packages and ci_linux_x64_clang_asan.yml running on presubmit. A few postsubmit (nightly) builds also use self-hosted CPU runners (TSan, Debug).

The cross_compile_and_test jobs (riscv, wasm) have yet to be migrated to pkgci, but they are currently disabled.

The benchmark.yml workflows still use self-hosted runners where they shouldn't be needed, but I'm not personally working on migrating them. Might delete them if they incur shared costs though.

…18107)

Cleanup relating to #16203 and #17957.

After #18070, no jobs in `ci.yml` are depending on the output of the `build_all` job. All jobs using the compiler now live in `pkgci.yml` and use packages instead of the "install dir".

* The reusable [`build_all.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/build_all.yml) workflow is still in use from benchmarking workflows, but those are unmaintained and may be deleted soon.
* After this is merged I'll set the `linux_x64_clang` check to "required", so PRs will need it to be passing before they can be merged.
* Now that `ci.yml` is pruned to just a few remaining jobs (`build_test_all_bazel`, `build_test_runtime`, `small_runtime`, `tracing`), we can also limit when the workflow runs at all. I've added some `paths` patterns so this won't run on push events if only files under `docs/` are changed, for example.
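For readers unfamiliar with `paths` filters, the fragment below sketches one way such a filter can skip push-triggered runs when only files under `docs/` change. The branch name and the exact pattern list are assumptions; the patterns actually added to `ci.yml` may differ.

```yaml
# Hypothetical fragment of a workflow's trigger section.
# The branch name and pattern list are assumptions for illustration.
on:
  pull_request:
  push:
    branches:
      - main
    paths:
      # Start by matching every changed file...
      - "**"
      # ...then exclude docs. A negated pattern needs at least one positive
      # pattern alongside it, which is why "**" comes first.
      - "!docs/**"
```

With this filter, a push that only touches files under `docs/` leaves no matching paths, so the workflow is skipped. A push that touches at least one file outside `docs/` still triggers it.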

ScottTodd added the infrastructure and cleanup 🧹 labels Jul 18, 2024

jpienaar added a commit that referenced this issue Jul 19, 2024

Disable cross compilation tests for now

c875e9d

jpienaar mentioned this issue Jul 19, 2024

Disable cross compilation tests for now #17961

Merged

jpienaar added a commit that referenced this issue Jul 19, 2024

Disable cross compilation tests for now (#17961)

71d58a3

ScottTodd mentioned this issue Jul 19, 2024

Disable tsan, gcc, debug, and byo_llvm jobs. #17967

Merged

ScottTodd self-assigned this Jul 19, 2024

This was referenced Jul 22, 2024

Move 'gcc', 'byo_llvm', and 'debug' jobs to their own workflows. #17980

Closed

Move 'gcc', 'byo_llvm', and 'debug' jobs to their own workflows. #17981

Merged

Move ASan and TSan jobs to their own workflows. #18003

Merged

This was referenced Jul 29, 2024

Convert more CI jobs to be package-based #16203

Closed

Refresh docs for GitHub Actions usage. #18035

Merged

ScottTodd mentioned this issue Aug 5, 2024

Move build_and_test_android from ci.yml to pkgci.yml. #18070

Merged

ScottTodd closed this as completed Aug 5, 2024

ScottTodd mentioned this issue Aug 5, 2024

Replace build_all job from ci.yml with ci_linux_x64_clang.yml. #18107

Merged

ScottTodd mentioned this issue Aug 15, 2024

Migrate jobs off current GCP GHA runner cluster #18238

Closed

30 tasks

ScottTodd mentioned this issue Dec 10, 2024

Trigger presubmit ci workflows from ci.yml via workflow_call. #19445

Draft
