Reduce the number of CI jobs requiring self-hosted runners #17957
Comments
To decrease resource pressure on presubmits, disable the cross compile tests for now. #17957
To decrease resource pressure on presubmits, disable the cross compile tests for now. #17957 --------- Co-authored-by: Scott Todd <[email protected]>
Progress on #17957

These auxiliary jobs use expensive CPU runners that are currently in short supply. We could add these back as on-demand or nightly jobs, but running them on every commit is too expensive relative to the value they provide.

skip-ci: just disabling jobs
With the attached commits, we're down to:
Started some experiments moving heavy workflows to nightly using standard runners (we could also strike some middle ground with self-hosted runners, but I wanted to test the limits): https://github.com/ScottTodd/iree/tree/infra-ci-nightly
Progress on #17957.

This moves the ASan (Address Sanitizer) and TSan (Thread Sanitizer) jobs into their own workflows, so they can run based on independent triggers and can be individually enabled/disabled as needed.

For now these are the triggers:

* ASan runs on `pull_request` and `push` events, similar to how it ran as part of `ci.yml`
* TSan runs on a nightly `schedule` and on-demand using `workflow_dispatch`. We can add other ways to trigger it like via pull request labels or git trailers as needed.

Both of these jobs need more disk space than GitHub's standard runners have, so they are running on larger self-hosted runners. If we have enough runner capacity then both jobs could run on every commit (`pull_request` and `push`). I think ASan is valuable enough for that but TSan is more situational.

---

For more information about these sanitizers, see

* https://iree.dev/developers/debugging/sanitizers/
* https://github.com/google/sanitizers
* https://clang.llvm.org/docs/AddressSanitizer.html
* https://clang.llvm.org/docs/ThreadSanitizer.html
* https://clang.llvm.org/docs/MemorySanitizer.html
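As a rough illustration of the TSan triggers described above, a standalone workflow could declare a nightly `schedule` plus `workflow_dispatch` along these lines. This is a minimal sketch and not the actual workflow file: the workflow name, cron time, runner labels, and build script path are all placeholders assumed for the example.

```yaml
# Hypothetical sketch of a standalone TSan workflow; all names and values are illustrative.
name: TSan (sketch)

on:
  # Nightly run. The cron time (09:00 UTC) is an assumption.
  schedule:
    - cron: "0 9 * * *"
  # Allows starting a run on demand from the Actions UI or the API.
  workflow_dispatch:

jobs:
  tsan:
    # Placeholder labels standing in for a larger self-hosted runner
    # with enough disk space for a sanitizer build.
    runs-on: [self-hosted, hypothetical-large-disk]
    steps:
      - uses: actions/checkout@v4
      - name: Build and test with TSan
        # Hypothetical script path; substitute the project's real entry point.
        run: ./hypothetical/build_and_test_tsan.sh
```

Keeping each sanitizer in its own file means the `on:` block can be changed for one job without touching the other.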
Progress on #17957 - bringing back jobs that were disabled, with changes to the `runs-on` and triggering.

For now these are the configurations:

* `gcc` runs on a nightly `schedule` on standard GitHub-hosted runners, taking ~3h30m
* `byo_llvm` runs on a nightly `schedule` on standard GitHub-hosted runners, taking ~3h20m
* `debug` runs on a nightly `schedule` on self-hosted runners, taking ~20m
  * This build runs out of disk space for standard GitHub-hosted runners

We can adjust the triggers over time, such as by adding ways to trigger with labels / git trailers, or by triggering on `pull_request` when file paths like `third_party/llvm-project` are changed.

skip-ci: adding new workflows
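The path-based trigger suggested above could look roughly like the following. This is a sketch under stated assumptions: the workflow name, cron time, timeout, runner image, and script path are placeholders, and whether a submodule bump matches exactly these `paths` patterns should be verified before relying on it.

```yaml
# Hypothetical sketch: nightly run plus a path-filtered pull_request trigger.
name: byo_llvm (sketch)

on:
  schedule:
    # Assumed nightly time (09:00 UTC).
    - cron: "0 9 * * *"
  pull_request:
    # Only run on pull requests that touch these paths.
    paths:
      - "third_party/llvm-project"
      - "third_party/llvm-project/**"

jobs:
  byo_llvm:
    # Standard GitHub-hosted runner; the image name is an assumption.
    runs-on: ubuntu-24.04
    # The build reportedly takes ~3h20m, so leave headroom below the
    # 6 hour job limit on GitHub-hosted runners. The value is illustrative.
    timeout-minutes: 300
    steps:
      - uses: actions/checkout@v4
      - name: Build with a separately built LLVM
        # Hypothetical script path.
        run: ./hypothetical/build_byo_llvm.sh
```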
The largest remaining unknowns (which I'm not actively working on) are the jobs in https://github.com/iree-org/iree/blob/main/.github/workflows/benchmark.yml and the cross-compilation jobs.
Progress on #16203 and #17957.

New page preview: https://scotttodd.github.io/iree/developers/general/github-actions/

## Overview

This documents the state of our GitHub Actions usage after recent refactoring work.

We used to have a monolithic `ci.yml` that ran 10+ builds on every commit, many of which used large CPU runners to build the full project (including the compiler) from source. Now we have platform builds, some of which run every commit but many of which run on nightly schedules. Most interesting _test_ jobs are now a part of "pkgci", which builds release packages and then runs tests using those packages (and sometimes runtime source builds).

Net results:

* Infrastructure costs should be more manageable - far fewer workflow runs will require powerful CPU builders
* Self-hosted runners with special hardware have dedicated places to be slotted in and are explicitly not on the critical path for pull requests

## Details

* Add new README badges
  * pre-commit, OpenSSF Best Practices, releases (stable, nightly), PyPI packages, select CI workflows
* Add github-actions.md webpage
  * Full workflow status tables
  * Descriptions of each type of workflow
  * Tips on (not) using Docker
  * Information about `pull_request`, `push`, `schedule` triggers
  * Information about GitHub-hosted runners and self-hosted runners
  * Maintenance/debugging tips
* Small cleanup passes through related files

skip-ci: modified script does not affect builds
Progress on #16203 and #17957.

This migrates `.github/workflows/build_and_test_android.yml` to `.github/workflows/pkgci_test_android.yml`. **For now, this only builds for Android, it does not run tests or use real Android devices at all**.

The previous workflow

* Relied on the "install" directory from a CMake build
* Ran on large self-hosted CPU build machines
* Built within Docker (using [`build_tools/docker/dockerfiles/android.Dockerfile`](https://github.com/iree-org/iree/blob/main/build_tools/docker/dockerfiles/android.Dockerfile))
* Used GCP/GCS for remote ccache storage
* Used GCP/GCS for passing files between jobs
* Ran tests on self-hosted lab machines (I think a raspberry pi connected to some physical Android devices)

The new workflow

* Relies on Python packages produced by pkgci_build_packages
* Runs on standard GitHub-hosted runners
* Installs dependencies that it needs on-demand (ninja, Android NDK), without using Docker
* Uses caches provided by GitHub Actions for ccache storage
* Could use Artifacts provided by GitHub Actions for passing files between jobs
* Could run tests on self-hosted lab machines or Android emulators

I made some attempts at passing files from the build job to a test job but ran into some GitHub Actions debugging that was tricky. Leaving the remaining migration work there to contributors at Google or other parties directly interested in Android CI infrastructure.

ci-exactly: build_packages, test_android
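For readers unfamiliar with using GitHub Actions caches for ccache storage, the general pattern looks something like the sketch below. It is not the contents of `pkgci_test_android.yml`: the job name, cache key prefix, cache directory, and build script are assumptions made for illustration. Only `actions/checkout`, `actions/cache`, and the `CCACHE_DIR` environment variable are existing pieces being relied on.

```yaml
# Hypothetical sketch: ccache backed by the GitHub Actions cache on a hosted runner.
name: Android build (sketch)

on:
  workflow_dispatch:

jobs:
  build_android:
    # Standard GitHub-hosted runner; the image name is an assumption.
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4

      - name: Install build dependencies on demand
        run: sudo apt-get update && sudo apt-get install -y ninja-build ccache

      - name: Restore and save the ccache directory
        uses: actions/cache@v4
        with:
          path: ${{ github.workspace }}/.ccache
          # A unique key per commit, so a fresh cache is saved after each run.
          key: ccache-android-${{ github.sha }}
          # Fall back to the most recent cache with this prefix.
          restore-keys: |
            ccache-android-

      - name: Build
        env:
          # Point ccache at the directory restored above.
          CCACHE_DIR: ${{ github.workspace }}/.ccache
        # Hypothetical script path; the real build commands are not shown in this thread.
        run: ./hypothetical/build_android.sh
```

Compared with remote storage in GCS, this keeps the workflow free of cloud credentials, at the cost of being subject to the repository's cache size limits.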
Going to call this fixed now. The Beyond that, we have The The |
…18107)

Cleanup relating to #16203 and #17957.

After #18070, no jobs in `ci.yml` are depending on the output of the `build_all` job. All jobs using the compiler now live in `pkgci.yml` and use packages instead of the "install dir".

* The reusable [`build_all.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/build_all.yml) workflow is still in use from benchmarking workflows, but those are unmaintained and may be deleted soon.
* After this is merged I'll set the `linux_x64_clang` check to "required", so PRs will need it to be passing before they can be merged.
* Now that `ci.yml` is pruned to just a few remaining jobs (`build_test_all_bazel`, `build_test_runtime`, `small_runtime`, `tracing`), we can also limit when the workflow runs at all. I've added some `paths` patterns so this won't run on push events if only files under `docs/` are changed, for example.
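One way to express a "skip when only `docs/` changed" rule with `paths` is to pair a catch-all pattern with a negated one. The sketch below shows that general technique only. The patterns actually added to `ci.yml` are not quoted in this thread and may differ, and the branch name, job name, and script path are assumptions.

```yaml
# Hypothetical sketch: skip push-triggered runs when only docs/ files changed.
name: CI (sketch)

on:
  pull_request:
  push:
    branches:
      - main
    paths:
      # Start by matching every file...
      - "**"
      # ...then exclude docs. A push touching only docs/ matches nothing
      # and does not trigger the workflow.
      - "!docs/**"

jobs:
  build_test_runtime:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - name: Build and test
        # Hypothetical script path.
        run: ./hypothetical/build_test_runtime.sh
```

A push that changes both `docs/` and source files still triggers the workflow, since at least one changed file survives the filter.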
There are currently 11 jobs in ci.yml that use the `- cpu` runner label to get a large self-hosted CPU runner:

* build_test_all_bazel
* python_release_packages (see Remove 'python_release_packages' job from ci.yml #17955 for that specifically)
* sanitizers :: asan
* sanitizers :: tsan
* gcc
* debug
* byo_llvm
* cross_compile_and_test (riscv_64 linux)
* cross_compile_and_test (riscv_32 linux)
* cross_compile_and_test (riscv_32 generic)
* cross_compile_and_test (wasm32)

There are even more jobs in build_and_test_android.yml, benchmark.yml, build_e2e_test_artifacts.yml, etc.

Some of these jobs are fast and could be switched to standard GitHub-hosted runners if they kept the access needed to download and upload artifacts (most of these jobs use storage in Google Cloud). Of the slow jobs, several could be moved to opt-in or even nightly.