[Infra] Initial Migration off GCP Runners #18381
Signed-off-by: saienduri <[email protected]>
This choice of jobs to migrate first SGTM. Please also announce this change at least on IREE's #builds channel on Discord, so developers know to watch for issues.
- runs-on:
-   - self-hosted # must come first
-   - runner-group=${{ github.event_name == 'pull_request' && 'presubmit' || 'postsubmit' }}
-   - environment=prod
-   - cpu
-   - os-family=Linux
+ runs-on: azure-linux-scale
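For context, the hunk above replaces a list of self-hosted runner labels with a single label. The following is a minimal sketch of how a job might look after the change, assuming `azure-linux-scale` is the name of a runner scale set (as the AKS/ARC setup discussed in this PR suggests). Only the `runs-on` value comes from this PR; the workflow name, job id, triggers, and steps are hypothetical placeholders.

```yaml
# Hypothetical workflow file; only the `runs-on` value comes from this PR.
name: example-linux-build  # hypothetical name

on:
  pull_request:
  push:
    branches: [main]

jobs:
  build:  # hypothetical job id
    # Previously this was a list of labels (self-hosted, runner-group=...,
    # environment=prod, cpu, os-family=Linux). Here the job targets the
    # runner pool through one label instead.
    runs-on: azure-linux-scale
    steps:
      - uses: actions/checkout@v4
      - name: Build (placeholder)
        run: echo "build commands go here"
```

One consequence of the single label is that the presubmit/postsubmit split previously encoded in `runner-group=...` no longer appears in `runs-on`; whether that split is preserved elsewhere is not stated in this PR.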
Sounds good, updated with a table
I think the times will change though because the cache won't be very useful soon when we can't write to it anymore. I'll actually update the table to times without using the GCP cache as that's probably a better reference. EDIT: Updated
You had some docs for runner configuration. Are those public or can parts of them be public? We could link to them from here: https://iree.dev/developers/general/github-actions/#self-hosted-runner-maintenance
It's all public, but I have it in my own repo here: https://github.com/saienduri/AKS-GitHubARC-Setup. Not sure if we want to add a link to my repo in the IREE docs or create another repo?
A link to there is fine as a start. If the docs are small enough and generally useful to the project, we can add them in-tree. If the docs (and images used, etc.) get large, we may want to consider moving the docs/ directory to a full standalone repo like https://github.com/llvm/llvm-www. I like being able to update docs alongside code though.
This commit is part of the larger issue tracking our migration off the GCP runners, storage buckets, etc.: iree-org#18238.

In this initial port, we move over one high-traffic job (`linux_x86_64_release_packages`) and a few nightlies (`linux_x64_clang_tsan`, `linux_x64_clang_debug`) to monitor and make sure the cluster is working as intended.

Time Comparisons:

Job | Cache? | Runner cluster | Time | Logs
-- | -- | -- | -- | --
linux_x86_64_release_packages | GitHub Cache | AKS cluster | 9 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464301/job/29948809708)
linux_x64_clang_tsan | GCP Cache | AKS cluster | 10 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464292/job/29948816896)
linux_x64_clang_debug | GCP Cache | AKS cluster | 11 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464308/job/29948805561)
linux_x64_clang_tsan | No Cache | AKS cluster | 17 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10798471545/job/29952051686)
linux_x64_clang_debug | No Cache | AKS cluster | 13 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10798475582/job/29952064138)
 | | | | 
linux_x86_64_release_packages | GitHub Cache | GCP Runners | 11 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10796348911/job/29945148145)
linux_x64_clang_tsan | GCP Cache | GCP Runners | 14 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10789692182/job/29923234380)
linux_x64_clang_debug | GCP Cache | GCP Runners | 15 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10680250213/job/29601266656)

The GCP cache timings for the AKS cluster are not a great representation of what we will be seeing going forward because the AKS cluster does not have the setup/authentication to write to the GCP cache. We have changes coming in https://github.com/iree-org/iree/tree/shared/runner-cluster-migration that will spin up an Azure cache using sccache to help with the No Cache timings.

Right now the cluster is using 96-core machines, which we can probably scale down once the caching work lands.

---------

Signed-off-by: saienduri <[email protected]>
…e-org#18511)

This commit is part of the larger issue tracking our migration off the GCP runners, storage buckets, etc.: iree-org#18238.

This builds on iree-org#18381, which migrated

* `linux_x86_64_release_packages`
* `linux_x64_clang_debug`
* `linux_x64_clang_tsan`

Here, we move the rest of the critical Linux builder workflows off the GCP runners:

* `linux_x64_clang`
* `linux_x64_clang_asan`

This also drops all CI usage of the GCP cache (`http://storage.googleapis.com/iree-sccache/ccache`). Some workflows now use sccache backed by Azure Blob Storage as a replacement. There are a few issues with this (mozilla/sccache#2258) that prevent us from providing read-only access to the cache in PRs created from forks, so **PRs from forks currently don't use the cache and will have slower builds**. We're covering for this slowdown by using larger runners, but if we can roll out caching to all builds then we might use runners with fewer cores.

Along with the changes to the cache, usage of Docker is rebased on images in the https://github.com/iree-org/base-docker-images/ repo, and the `build_tools/docker/docker_run.sh` script is now only used by unmigrated workflows (`linux_arm64_clang` and `build_test_all_bazel`).

---------

Signed-off-by: saienduri <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Co-authored-by: Scott Todd <[email protected]>
Co-authored-by: Elias Joseph <[email protected]>
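To illustrate the fork limitation described above, here is a minimal sketch of a job that enables sccache with Azure Blob Storage only when the run is not a pull request from a fork. This is not the configuration used in the IREE workflows. The secret name, container name, job id, and build commands are hypothetical; the `SCCACHE_AZURE_*` variables are the ones sccache documents for its Azure backend, and it assumes sccache is already installed on the runner.

```yaml
# Hypothetical job fragment; secret and container names are placeholders.
jobs:
  build:  # hypothetical job id
    runs-on: azure-linux-scale
    env:
      # sccache reads its Azure Blob Storage settings from these variables.
      SCCACHE_AZURE_CONNECTION_STRING: ${{ secrets.EXAMPLE_AZURE_CONNECTION_STRING }}  # hypothetical secret
      SCCACHE_AZURE_BLOB_CONTAINER: example-ccache-container  # hypothetical container
    steps:
      - uses: actions/checkout@v4

      - name: Configure with sccache (pushes and same-repo PRs)
        # Secrets are not passed to pull_request runs from forks,
        # so the cache is only wired up when they are available.
        if: ${{ github.event_name != 'pull_request' || github.event.pull_request.head.repo.fork == false }}
        run: |
          cmake -B build -S . \
            -DCMAKE_C_COMPILER_LAUNCHER=sccache \
            -DCMAKE_CXX_COMPILER_LAUNCHER=sccache

      - name: Configure without cache (PRs from forks)
        if: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.repo.fork }}
        run: cmake -B build -S .

      - name: Build
        run: cmake --build build
```

The two configure steps are mutually exclusive, so fork PRs fall back to an uncached build, which matches the slower-build behavior the commit message describes.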