[Infra] Initial Migration off GCP Runners #18381

Merged 9 commits on Sep 10, 2024

Collaborator

@saienduri saienduri commented Aug 28, 2024

This commit is part of the larger issue tracking our migration off the GCP runners, storage buckets, etc.: #18238.
In this initial port, we move over one high-traffic job (`linux_x86_64_release_packages`) and a few nightlies (`linux_x64_clang_tsan`, `linux_x64_clang_debug`) so we can monitor the cluster and make sure it is working as intended.

Time Comparisons:

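Job | Cache? | Runner cluster | Time | Logs
-- | -- | -- | -- | --
linux_x86_64_release_packages | GitHub Cache | AKS cluster | 9 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464301/job/29948809708)
linux_x64_clang_tsan | GCP Cache | AKS cluster | 10 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464292/job/29948816896)
linux_x64_clang_debug | GCP Cache | AKS cluster | 11 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464308/job/29948805561)
linux_x64_clang_tsan | No Cache | AKS cluster | 17 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10798471545/job/29952051686)
linux_x64_clang_debug | No Cache | AKS cluster | 13 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10798475582/job/29952064138)
linux_x86_64_release_packages | GitHub Cache | GCP Runners | 11 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10796348911/job/29945148145)
linux_x64_clang_tsan | GCP Cache | GCP Runners | 14 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10789692182/job/29923234380)
linux_x64_clang_debug | GCP Cache | GCP Runners | 15 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10680250213/job/29601266656)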

The GCP cache timings for the AKS cluster are not representative of what we will see going forward, because the AKS cluster does not have the setup/authentication to write to the GCP cache. We have changes coming in https://github.com/iree-org/iree/tree/shared/runner-cluster-migration that will spin up an Azure cache using sccache to help with the No Cache timings. Right now the cluster is using 96-core machines, which we can probably scale down once the caching work lands.
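
For a rough sense of the difference, the cached rows of the time comparison can be compared directly. The sketch below is illustrative only: the dictionary layout and the helper names (`percent_shorter`, `summarize`) are assumptions and not part of this PR, and each number is a single sampled run, so the percentages are indicative at best.

```python
# Wall-clock times in minutes, copied from the cached rows of the
# time comparison in this PR description (one sampled run per job).
TIMINGS = {
    "linux_x86_64_release_packages": {"aks": 9, "gcp": 11},
    "linux_x64_clang_tsan": {"aks": 10, "gcp": 14},
    "linux_x64_clang_debug": {"aks": 11, "gcp": 15},
}


def percent_shorter(aks_minutes, gcp_minutes):
    """Return how much shorter the AKS run is, as a percentage of the GCP run."""
    return round(100 * (gcp_minutes - aks_minutes) / gcp_minutes, 1)


def summarize(timings):
    """Map each job name to its percentage reduction on the AKS cluster."""
    return {
        job: percent_shorter(t["aks"], t["gcp"]) for job, t in timings.items()
    }


if __name__ == "__main__":
    for job, pct in summarize(TIMINGS).items():
        print(f"{job}: {pct}% shorter on AKS")
```

The No Cache rows are left out on purpose, since there is no uncached GCP run in the table to compare them against.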

@saienduri saienduri requested a review from ScottTodd as a code owner August 28, 2024 02:33
@ScottTodd ScottTodd marked this pull request as draft August 28, 2024 15:26
@saienduri saienduri force-pushed the users/saienduri/cluster-testing branch 3 times, most recently from ac5fa7d to 03e76ee Compare September 9, 2024 23:54
@saienduri saienduri force-pushed the users/saienduri/cluster-testing branch from 03e76ee to 3adc115 Compare September 10, 2024 17:07
@saienduri saienduri marked this pull request as ready for review September 10, 2024 17:16
@saienduri saienduri changed the title [Cluster Testing] PkgCI Linux Build [Infra] Initial Migration off GCP Runners Sep 10, 2024
@saienduri saienduri added the infrastructure Relating to build systems, CI, or testing label Sep 10, 2024
Signed-off-by: saienduri <[email protected]>
Member

@ScottTodd ScottTodd left a comment

This choice of jobs to migrate first SGTM. Please also announce this change at least on IREE's #builds channel on Discord, so developers know to watch for issues.

- runs-on:
-   - self-hosted # must come first
-   - runner-group=${{ github.event_name == 'pull_request' && 'presubmit' || 'postsubmit' }}
-   - environment=prod
-   - cpu
-   - os-family=Linux
+ runs-on: azure-linux-scale
Member

When these jobs finish running, can you share some sample logs before and after with the build time changes in the PR description?

Sample table: #18396
Or sample bulleted list with less info: #18429

Collaborator Author

Sounds good, updated with a table

Collaborator Author

@saienduri saienduri Sep 10, 2024

I think the times will change though because the cache won't be very useful soon when we can't write to it anymore. I'll actually update the table to times without using the GCP cache as that's probably a better reference. EDIT: Updated

- runs-on:
-   - self-hosted # must come first
-   - runner-group=${{ github.event_name == 'pull_request' && 'presubmit' || 'postsubmit' }}
-   - environment=prod
-   - cpu
-   - os-family=Linux
+ runs-on: azure-linux-scale
Member

You had some docs for runner configuration. Are those public or can parts of them be public? We could link to them from here: https://iree.dev/developers/general/github-actions/#self-hosted-runner-maintenance

Collaborator Author

It's all public, but I have it on my own repo here: https://github.com/saienduri/AKS-GitHubARC-Setup. Not sure if we want to add a link to my repo into the iree docs or create another repo?

Member

A link to there is fine as a start. If the docs are small enough and generally useful to the project, we can add them in-tree. If the docs (and images used, etc.) get large, we may want to consider moving the docs/ directory to a full standalone repo like https://github.com/llvm/llvm-www. I like being able to update docs alongside code though.

@saienduri saienduri requested a review from ScottTodd September 10, 2024 18:26
@saienduri saienduri merged commit c3cbfbd into main Sep 10, 2024
44 checks passed
@saienduri saienduri deleted the users/saienduri/cluster-testing branch September 10, 2024 22:42
ScottTodd added a commit that referenced this pull request Sep 13, 2024
)

This commit is part of this larger issue that is tracking our migration
off the GCP runners, storage buckets, etc:
#18238.

This builds on #18381, which migrated
* `linux_x86_64_release_packages`
* `linux_x64_clang_debug`
* `linux_x64_clang_tsan`

Here, we move the rest of the critical Linux builder workflows off
the GCP runners:
* `linux_x64_clang`
* `linux_x64_clang_asan`

This also drops all CI usage of the GCP cache
(`http://storage.googleapis.com/iree-sccache/ccache`). Some workflows
now use sccache backed by Azure Blob Storage as a replacement. There are
a few issues with this (mozilla/sccache#2258)
that prevent us from providing read-only access to the cache in PRs created
from forks, so **PRs from forks currently don't use the cache and will
have slower builds**. We're covering for this slowdown by using larger
runners, but if we can roll out caching to all builds then we might use
runners with fewer cores.
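
The fork restriction can be pictured as a small predicate over the GitHub event payload. This is a sketch under assumptions, not the logic the workflows actually use: the helper name `can_use_shared_cache` is hypothetical, and it relies on the `pull_request.head.repo.fork` boolean that GitHub includes in `pull_request` event payloads.

```python
def can_use_shared_cache(event_name, event):
    """Return True when a run may use the shared sccache storage.

    Hypothetical helper: runs that are not pull requests (pushes,
    schedules) execute in the base repository and may use the cache;
    pull requests from forks may not.
    """
    if event_name != "pull_request":
        return True
    pull_request = event.get("pull_request") or {}
    head_repo = (pull_request.get("head") or {}).get("repo")
    if head_repo is None:
        # The head repository is unknown (for example, a deleted fork),
        # so be conservative and skip the cache.
        return False
    return not head_repo.get("fork", False)


if __name__ == "__main__":
    fork_pr = {"pull_request": {"head": {"repo": {"fork": True}}}}
    print(can_use_shared_cache("pull_request", fork_pr))
    print(can_use_shared_cache("push", {}))
```

Treating an unknown head repository as a fork is a design choice in this sketch: a slower uncached build is a safer failure mode than granting cache access by mistake.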

Along with the changes to the cache, usage of Docker is rebased on
images in the https://github.com/iree-org/base-docker-images/ repo and
the `build_tools/docker/docker_run.sh` script is now only used by
unmigrated workflows (`linux_arm64_clang` and `build_test_all_bazel`).

---------

Signed-off-by: saienduri <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Co-authored-by: Scott Todd <[email protected]>
Co-authored-by: Elias Joseph <[email protected]>
josemonsalve2 pushed a commit to josemonsalve2/iree that referenced this pull request Sep 14, 2024
This commit is part of this larger issue that is tracking our migration
off the GCP runners, storage buckets, etc:
iree-org#18238.
In this initial port, we move over one high traffic job
(`linux_x86_64_release_packages`) and a few nightlies
(`linux_x64_clang_tsan`, `linux_x64_clang_debug`) to monitor and make
sure the cluster is working as intended.

Time Comparisons:

Job | Cache? | Runner cluster | Time | Logs
-- | -- | -- | -- | --
linux_x86_64_release_packages | GitHub Cache | AKS cluster | 9 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464301/job/29948809708)
linux_x64_clang_tsan | GCP Cache | AKS cluster | 10 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464292/job/29948816896)
linux_x64_clang_debug | GCP Cache | AKS cluster | 11 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464308/job/29948805561)
linux_x64_clang_tsan | No Cache | AKS cluster | 17 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10798471545/job/29952051686)
linux_x64_clang_debug | No Cache | AKS cluster | 13 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10798475582/job/29952064138)
linux_x86_64_release_packages | GitHub Cache | GCP Runners | 11 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10796348911/job/29945148145)
linux_x64_clang_tsan | GCP Cache | GCP Runners | 14 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10789692182/job/29923234380)
linux_x64_clang_debug | GCP Cache | GCP Runners | 15 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10680250213/job/29601266656)

The GCP cache timings for the AKS cluster are not a great representation
of what we will be seeing going forward because the AKS cluster does not
have the setup/authentication to write to the GCP cache. We have changes
coming in
https://github.com/iree-org/iree/tree/shared/runner-cluster-migration
that will spin up an Azure cache using sccache to help with the No Cache
timings. Right now the cluster is using 96 core machines, which we can
probably tone down when the caching work lands.

---------

Signed-off-by: saienduri <[email protected]>
raikonenfnu pushed a commit to raikonenfnu/iree that referenced this pull request Sep 16, 2024
…e-org#18511)
Labels
infrastructure Relating to build systems, CI, or testing
2 participants