Migrate jobs off current GCP GHA runner cluster #18238

Closed
22 of 30 tasks
ScottTodd opened this issue Aug 15, 2024 · 17 comments · Fixed by #19242
Labels
infrastructure Relating to build systems, CI, or testing

Comments

@ScottTodd
Member

ScottTodd commented Aug 15, 2024

Following the work at #17957 and #16203, it is just about time to migrate away from the GitHub Actions runners hosted on Google Cloud Platform.

Workflow refactoring tasks

Refactor workflows such that they don't depend on GCP:

  • Docker prefetch/preload
  • Installed packages like the gcloud command
  • Read/write access to the remote ccache storage bucket at http://storage.googleapis.com/iree-sccache/ccache (configured using setup_ccache.sh)
  • General reliance on the build_tools/github_actions/docker_run.sh script

Runner setup tasks

  • Read up on https://github.com/actions/actions-runner-controller and give it a try
  • Add Linux x86_64 CPU builders
  • Experiment with core count: 16 cores minimum, 96 cores ideal?
  • Experiment with autoscaling instances: up to 8-16 max? scale down to 1 at midnight PST?
  • Add Linux NVIDIA GPU runner(s): can use small/cheap GPUs like the NVIDIA T4s we currently test on - need baseline coverage for CUDA and Vulkan
  • Add other runners: arm64? Android? Windows? Some of these could be off the cloud and just run in local labs
  • Consider setting up a remote cache storage bucket/account. 10GB minimum - ideally located on a network close to the runners
  • Consider prepopulating caches on runners somehow: git repository / submodules, Dockerfiles, test inputs
  • Register new runners in iree-org (organization) or iree-org/iree (repository)
  • Decide on how runners should be distributed. We currently have separate pools for "presubmit" and "postsubmit"
  • Research monitoring/logging (queue times, uptime, autoscaling usage, crash frequency, etc.)

Transition tasks

  • Switch a few non-critical jobs (like the nightly 'debug' or 'tsan' jobs) to the new runners and monitor for stability, performance, etc.

  • Switch all jobs that need a self-hosted runner to the new runners

Other

@ScottTodd ScottTodd added the infrastructure Relating to build systems, CI, or testing label Aug 15, 2024
@ScottTodd
Member Author

Experiments are showing that a local ccache using the GitHub Actions cache is going to be nowhere near sufficient for some of the current CI builds. Maybe I have something misconfigured, but I'm seeing cache sizes of up to 2GB still not being enough for Debug or ASan jobs. I can try running with no cache limit to see what that produces, but GitHub's soft limit of 10GB across all cache entries before it starts evicting entries will be hit very frequently if we have too many jobs using unique cache keys.
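
For reference, a minimal sketch of the size-capped local ccache setup I've been experimenting with; the 2GB cap, paths, and build commands are illustrative, not the exact CI configuration:

    # Illustrative only: cap the local cache, route compiles through ccache,
    # and check hit rates / cache size after a build.
    export CCACHE_DIR="$HOME/.cache/ccache"
    ccache --set-config=max_size=2G       # the size that proved too small for Debug/ASan
    ccache --zero-stats                   # reset counters before the build
    cmake -G Ninja -B build/ -S . \
      -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
    cmake --build build/
    ccache --show-stats                   # inspect hits, misses, and total cache size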

ScottTodd added a commit that referenced this issue Aug 16, 2024
…18252)

Progress on #15332 and
#18238 .

The
[`build_tools/docker/docker_run.sh`](https://github.com/iree-org/iree/blob/main/build_tools/docker/docker_run.sh)
script does a bunch of weird/hacky setup, including setup for `gcloud`
(for working with GCP) and Bazel-specific Docker workarounds. Most CMake
builds can just use a container for the entire workflow
(https://docs.github.com/en/actions/writing-workflows/choosing-where-your-workflow-runs/running-jobs-in-a-container).
Note that GitHub in its infinite wisdom changed the default shell _just_
for jobs that run in a container, from `bash` to `sh`, so we flip it
back.

These jobs run nightly on GitHub-hosted runners, so I tested here:
*
https://github.com/iree-org/iree/actions/runs/10396020082/job/28789218696
*
https://github.com/iree-org/iree/actions/runs/10422541951/job/28867245589

(Those jobs should also run on this PR, but they'll take a while)

skip-ci: no impact on other workflows
ScottTodd added a commit that referenced this issue Aug 19, 2024
Progress on #15332 and
#18238 .

Similar to #18252, this drops a
dependency on the
[`build_tools/docker/docker_run.sh`](https://github.com/iree-org/iree/blob/main/build_tools/docker/docker_run.sh)
script. Unlike that PR, this goes a step further and also stops using
[`build_tools/cmake/build_all.sh`](https://github.com/iree-org/iree/blob/main/build_tools/cmake/build_all.sh).

Functional changes:
* No more building `iree-test-deps`
* We only get marginal value out of compiling test files using a debug
compiler
* Those tests are on the path to being moved to
https://github.com/iree-org/iree-test-suites
* No more ccache
* The debug build cache is too large for a local / GitHub Actions cache
* I want to limit our reliance on the remote cache at
`http://storage.googleapis.com/iree-sccache/ccache` (which uses GCP for
storage and needs GCP auth)
* Experiments show that this build is not significantly faster when
using a cache, or at least dropping `iree-test-deps` provides equivalent
time savings

Logs before:
https://github.com/iree-org/iree/actions/runs/10417779910/job/28864909582
(96% cache hits, 9 minute build but 19 minutes total, due to
`iree-test-deps`)
Logs after:
https://github.com/iree-org/iree/actions/runs/10423409599/job/28870060781?pr=18255
(no cache, 11 minute build)

ci-exactly: linux_x64_clang_debug

---------

Co-authored-by: Marius Brehler <[email protected]>
@amd-chrissosa amd-chrissosa self-assigned this Aug 20, 2024
@amd-chrissosa
Contributor

Experiments so far:

I have gone through https://github.com/actions/actions-runner-controller and gave it a try with a basic POC, but many things aren't working yet.

To replicate what I've done so far:

  • Create an AKS cluster, with a node pool that is set up to autoscale.
  • Enabled ARC (actions-runner-controller) on the cluster after installing Helm on my local client. I suggest creating your own values.yaml file in order to set the values you'll need to work with.
  • Configured a new workflow to use the runners set up in this config.

These all work fairly well out of the box. A few suggestions:

  • Use different node pools for the linux_x86_64 builders vs. the Linux NVIDIA GPU runner(s). Suggest getting basic pre-ci / ci working through the x86_64 builders first.
  • Don't worry too much about autoscaling settings for now; they are very easy to reconfigure. Suggest setting up autoscaling with a minimum of 3 nodes and a maximum of something like 20 to be safe for the original node pool.
  • Distinguish between different uses with different runner scale sets. Runner scale sets are homogeneous runners - they all share the same runner config. You can of course use just one runner scale set and customize as part of the build, but you can install any number of runner scale sets per k8s namespace/cluster.

Currently blocked on getting images working. Going to keep working on this but may pull someone in to help at this point, since the k8s part is at least figured out. A rough sketch of the Helm steps so far is below.
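
For anyone replicating this, the Helm steps roughly follow the upstream ARC quickstart; the namespaces, release names, secret name, and values.yaml contents below are assumptions to adapt, not our exact config:

    # Rough sketch based on the upstream ARC quickstart, not the final setup.
    # 1. Install the ARC controller into the cluster.
    helm install arc \
      --namespace arc-systems --create-namespace \
      oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller

    # 2. Install a runner scale set pointing at the repo (or org), with our
    #    overrides in values.yaml. The secret name here is a placeholder.
    helm install arc-runner-set \
      --namespace arc-runners --create-namespace \
      --set githubConfigUrl="https://github.com/iree-org/iree" \
      --set githubConfigSecret=arc-github-secret \
      -f values.yaml \
      oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set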

@ScottTodd
Member Author

I created https://github.com/iree-org/base-docker-images and am working to migrate what's left in https://github.com/iree-org/iree/tree/main/build_tools/docker to that repo, starting with a few workflows that don't have special GCP requirements right now, like https://github.com/iree-org/iree/blob/main/.github/workflows/ci_linux_x64_clang_debug.yml.

Local testing of iree-org/base-docker-images#4 looks promising to replace gcr.io/iree-oss/base with a new ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64 (or we can just put ghcr.io/iree-org/cpubuilder_ubuntu_jammy_ghr_x86_64 on the cluster for those builds, instead of using Docker inside Docker).
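
My local testing looks roughly like the following; the image tag and the in-container build command are assumptions for illustration, not the final workflow wiring:

    # Hedged local-repro sketch: run an IREE CMake build inside the new cpubuilder image.
    docker run --rm -it \
      -v "$PWD":/work -w /work \
      ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main \
      ./build_tools/cmake/build_all.sh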

We could also try using the manylinux image, but I'm not sure if we should expect that to work well enough with the base C++ toolchains outside of Python packaging. I gave that a try locally too but got errors like:

# python3 -m pip install -r ./runtime/bindings/python/iree/runtime/build_requirements.txt
WARNING: Running pip install with root privileges is generally not a good idea. Try `__main__.py install --user` instead.
Collecting pip>=21.3 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 6))
  Downloading https://files.pythonhosted.org/packages/a4/6d/6463d49a933f547439d6b5b98b46af8742cc03ae83543e4d7688c2420f8b/pip-21.3.1-py3-none-any.whl (1.7MB)
    100% |████████████████████████████████| 1.7MB 1.6MB/s 
Collecting setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7))
  Could not find a version that satisfies the requirement setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7)) (from versions: 0.6b1, 0.6b2, 0.6b3, 0.6b4, 0.6rc1, ...
... 59.3.0, 59.4.0, 59.5.0, 59.6.0)
No matching distribution found for setuptools>=62.4.0 (from -r ./runtime/bindings/python/iree/runtime/build_requirements.txt (line 7)
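
(For what it's worth, manylinux images ship newer CPython builds under /opt/python, so the failure above is probably just the old system python3 being first on PATH. A hedged workaround sketch, with the interpreter version as an assumption:)

    # Put a modern interpreter from the manylinux image first on PATH before
    # installing build requirements. The cp311 choice is illustrative.
    export PATH=/opt/python/cp311-cp311/bin:$PATH
    python3 -m pip install -r ./runtime/bindings/python/iree/runtime/build_requirements.txt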

@ScottTodd
Member Author

If we're not sure how we want to set up a remote cache by the time we want to transition, I could at least prep a PR that switches relevant workflows to stop using a remote cache.

ScottTodd added a commit that referenced this issue Aug 29, 2024
Progress on #15332. This uses a
new `cpubuilder_ubuntu_jammy_x86_64` dockerfile from
https://github.com/iree-org/base-docker-images.

This stops using the remote cache that is hosted on GCP. Build time
_without a cache_ is about 20 minutes on current runners, while build
_with a cache_ is closer to 10 minutes. Build time without a cache is
closer to 28-30 minutes on new runners. We can try adding back a cache
using GitHub or our own hosted storage.

I tried to continue using the previous cache during this transition
period, but the `gcloud` command needs to run on the host, and I'd like
to stop using the `docker_run.sh` script. I'm hoping we can keep folding
away this sort of complexity by having the build machines run a
dockerfile that includes key environment components like utility tools
and any needed authorization/secrets (see
#18238).

ci-exactly: linux_x64_clang
@ScottTodd
Member Author

Shared branch tracking the migration: https://github.com/iree-org/iree/tree/shared/runner-cluster-migration

That currently switches the runs-on: for multiple jobs to the new cluster and changes some workflows from using the GCP cache to using no cache. We'll try setting up a new cache and continue testing there before merging to main.

ScottTodd added a commit that referenced this issue Aug 29, 2024
Progress on #15332. I'm trying to
get rid of the `docker_run.sh` scripts, replacing them with GitHub's
`container:` feature. While local development flows _may_ want to use
Docker like the CI workflows do, those scripts contained a lot of
special handling and file mounting to be compatible with Bazel. Much of
that is not needed for CMake and can be folded away, though the
`--privileged` option needed here is one exception.

This stops using the remote cache that is hosted on GCP. We can try
adding back a cache using GitHub or our own hosted storage as part of
#18238.

Job | Cache? | Runner cluster | Time | Logs
-- | -- | -- | -- | --
ASan | Cache | GCP runners | 14 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10620030527/job/29438925064)
ASan | No cache | GCP runners | 28 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10605848397/job/29395467181)
ASan | Cache | Azure runners | (not configured yet) |
ASan | No cache | Azure runners | 35 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10621238709/job/29442788013?pr=18396)
| | | | |
TSan | Cache | GCP runners | 12 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10612418711/job/29414025939)
TSan | No cache | GCP runners | 21 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10605848414/job/29395467002)
TSan | Cache | Azure runners | (not configured yet) |
TSan | No cache | Azure runners | 32 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10621238738/job/29442788341?pr=18396)

ci-exactly: linux_x64_clang_asan
@ScottTodd
Member Author

We're still figuring out how to get build times on the new cluster back to a reasonable level by configuring some sort of cache. The linux_x64_clang build is taking around 30 minutes for the entire job on the new runner cluster with no cache, compared to 9 minutes for the entire job on the old runners with a cache.

ccache (https://ccache.dev/) does not have first class support for Azure Blob Storage, so we are trying a few things:

  • Not sure if Azure supports HTTP access in the way that GCP does:

        export CCACHE_REMOTE_STORAGE="http://storage.googleapis.com/iree-sccache/ccache"
        if (( IREE_WRITE_REMOTE_CCACHE == 1 )); then
          set +x # Don't leak the token (even though it's short-lived)
          export CCACHE_REMOTE_STORAGE="${CCACHE_REMOTE_STORAGE}|bearer-token=${IREE_CCACHE_GCP_TOKEN}"
          set -x
        else
          export CCACHE_REMOTE_STORAGE="${CCACHE_REMOTE_STORAGE}|read-only"
        fi
  • We've tried using blobfuse2 (https://github.com/Azure/azure-storage-fuse) to mount the remote directory and treat it as local (blobfuse2 mount ... /mnt/azureblob + CCACHE_DIR=/mnt/azureblob/ccache-container; a rough sketch is included below), but the configuration is confusing and it doesn't appear to support multiple concurrent readers/writers:

    Blobfuse2 supports both reads and writes however, it does not guarantee continuous sync of data written to storage using other APIs or other mounts of Blobfuse2. For data integrity it is recommended that multiple sources do not modify the same blob/file.
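
For reference, the blobfuse2 experiment looked roughly like this; the mount point, container name, and config file are placeholders:

    # Rough sketch of the blobfuse2 experiment (paths and config are placeholders).
    blobfuse2 mount /mnt/azureblob --config-file=blobfuse2-config.yaml
    export CCACHE_DIR=/mnt/azureblob/ccache-container
    ccache --show-stats   # treat the mounted blob container as a local cache dir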

sccache (https://github.com/mozilla/sccache) is promising since it does have first class support for Azure Blob Storage: https://github.com/mozilla/sccache/blob/main/docs/Azure.md
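
A minimal sketch of the sccache wiring we'd be setting up, assuming the connection string arrives via a GitHub secret; the secret, container, and prefix names are placeholders:

    # Minimal sketch, not the final CI configuration.
    export SCCACHE_AZURE_CONNECTION_STRING="${AZURE_CACHE_CONNECTION_STRING}"  # from a GitHub secret (placeholder name)
    export SCCACHE_AZURE_BLOB_CONTAINER="sccache"                              # placeholder container name
    export SCCACHE_AZURE_KEY_PREFIX="iree/linux-x64-clang"                     # placeholder scope
    # Route CMake's compiler invocations through sccache at configure time.
    export CMAKE_C_COMPILER_LAUNCHER=sccache
    export CMAKE_CXX_COMPILER_LAUNCHER=sccache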

Either way we still need to figure out the security/access model. Ideally we'd have public read access to the cache, but we might need to limit even that if the APIs aren't available. We might have to make some (temporary?) tradeoffs where only PRs sent from the main repo get access to the cache via GitHub Secrets (which aren't shared with PRs from forks) 🙁

@benvanik
Collaborator

benvanik commented Sep 9, 2024

As a data point, I've used sccache locally and it worked as expected for our CMake builds.

@ScottTodd
Member Author

Yep, I just had good results with sccache locally on Linux using Azure. I think good next steps are:

  1. Install sccache in the dockerfiles: "Install sccache in cpubuilder dockerfiles" (base-docker-images#8)
  2. Test sccache inside Docker, or skip this step if confident in the cache hit rates and such (a quick sanity check is sketched below)
  3. Switch the test PR ("Implemented caching with Azure containers using sccache" #18466) to use sccache instead of ccache and confirm that GitHub Actions + Docker + sccache + Azure all play nicely together
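
For step 2, the quickest sanity check inside the container is just comparing sccache stats before and after a build (illustrative commands, not the exact job script):

    # Illustrative sanity check: confirm non-zero cache hits against Azure storage.
    sccache --zero-stats
    cmake --build build/     # or whichever build command the job runs
    sccache --show-stats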

ScottTodd added a commit to iree-org/base-docker-images that referenced this issue Sep 10, 2024
Progress on iree-org/iree#18238

https://github.com/mozilla/sccache

We may use this instead of ccache for our shared remote cache usage, considering sccache has
first-class Azure Blob Storage support:
https://github.com/mozilla/sccache/blob/main/docs/Azure.md.
@ScottTodd
Member Author

Cache scopes / namespaces / keys

sccache supports a SCCACHE_AZURE_KEY_PREFIX environment variable:

You can also define a prefix that will be prepended to the keys of all cache objects created and read within the container, effectively creating a scope. To do that use the SCCACHE_AZURE_KEY_PREFIX environment variable. This can be useful when sharing a bucket with another application.

We can use that to have a single storage account for multiple projects, and it will also allow us to better manage the storage in the cloud project itself, e.g. checking the size of each folder or deleting an entire folder. Note that sccache's architecture (https://github.com/mozilla/sccache/blob/main/docs/Architecture.md) uses a sophisticated hash function that includes environment variables, the compiler binary, compiler arguments, files, etc., so sharing a cache folder between e.g. MSVC on Windows and clang on Linux should be fine. I'd still prefer we separate those caches though.

Some naming ideas:

Any of the scopes that have frequently changing names should have TTLs on their files, or we should audit and clean them up manually from time to time, so they don't live indefinitely.
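
For example, a prefix scheme along these lines (names purely illustrative):

    # Purely illustrative: one scope per project / OS / toolchain / job so folders
    # can be sized, audited, and deleted independently.
    export SCCACHE_AZURE_KEY_PREFIX="iree/linux/clang/asan"
    # export SCCACHE_AZURE_KEY_PREFIX="iree/windows/msvc/release"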

saienduri added a commit that referenced this issue Sep 10, 2024
This commit is part of this larger issue that is tracking our migration
off the GCP runners, storage buckets, etc:
#18238.
In this initial port, we move over one high traffic job
(`linux_x86_64_release_packages`) and a few nightlies
(`linux_x64_clang_tsan`, `linux_x64_clang_debug`) to monitor and make
sure the cluster is working as intended.

Time Comparisons:

Job | Cache? | Runner cluster | Time | Logs
-- | -- | -- | -- | --
linux_x86_64_release_packages | GitHub Cache | AKS cluster | 9 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464301/job/29948809708)
linux_x64_clang_tsan | GCP Cache | AKS cluster | 10 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464292/job/29948816896)
linux_x64_clang_debug | GCP Cache | AKS cluster | 11 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464308/job/29948805561)
linux_x64_clang_tsan | No Cache | AKS cluster | 17 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10798471545/job/29952051686)
linux_x64_clang_debug | No Cache | AKS cluster | 13 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10798475582/job/29952064138)
| | | | |
linux_x86_64_release_packages | GitHub Cache | GCP Runners | 11 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10796348911/job/29945148145)
linux_x64_clang_tsan | GCP Cache | GCP Runners | 14 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10789692182/job/29923234380)
linux_x64_clang_debug | GCP Cache | GCP Runners | 15 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10680250213/job/29601266656)

The GCP cache timings for the AKS cluster are not a great representation
of what we will be seeing going forward because the AKS cluster does not
have the setup/authentication to write to the GCP cache. We have changes
coming in
https://github.com/iree-org/iree/tree/shared/runner-cluster-migration
that will spin up an Azure cache using sccache to help with the No Cache
timings. Right now the cluster is using 96 core machines, which we can
probably tone down when the caching work lands.

---------

Signed-off-by: saienduri <[email protected]>
saienduri pushed a commit that referenced this issue Sep 12, 2024
Progress on #15332. This uses a
new `cpubuilder_ubuntu_jammy_x86_64` dockerfile from
https://github.com/iree-org/base-docker-images.

This stops using the remote cache that is hosted on GCP. Build time
_without a cache_ is about 20 minutes on current runners, while build
_with a cache_ is closer to 10 minutes. Build time without a cache is
closer to 28-30 minutes on new runners. We can try adding back a cache
using GitHub or our own hosted storage.

I tried to continue using the previous cache during this transition
period, but the `gcloud` command needs to run on the host, and I'd like
to stop using the `docker_run.sh` script. I'm hoping we can keep folding
away this sort of complexity by having the build machines run a
dockerfile that includes key environment components like utility tools
and any needed authorization/secrets (see
#18238).

ci-exactly: linux_x64_clang
saienduri pushed a commit that referenced this issue Sep 12, 2024
Progress on #15332. I'm trying to
get rid of the `docker_run.sh` scripts, replacing them with GitHub's
`container:` feature. While local development flows _may_ want to use
Docker like the CI workflows do, those scripts contained a lot of
special handling and file mounting to be compatible with Bazel. Much of
that is not needed for CMake and can be folded away, though the
`--privileged` option needed here is one exception.

This stops using the remote cache that is hosted on GCP. We can try
adding back a cache using GitHub or our own hosted storage as part of
#18238.

Job | Cache? | Runner cluster | Time | Logs
-- | -- | -- | -- | --
ASan | Cache | GCP runners | 14 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10620030527/job/29438925064)
ASan | No cache | GCP runners | 28 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10605848397/job/29395467181)
ASan | Cache | Azure runners | (not configured yet) |
ASan | No cache | Azure runners | 35 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10621238709/job/29442788013?pr=18396)
| | | | |
TSan | Cache | GCP runners | 12 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10612418711/job/29414025939)
TSan | No cache | GCP runners | 21 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10605848414/job/29395467002)
TSan | Cache | Azure runners | (not configured yet) |
TSan | No cache | Azure runners | 32 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10621238738/job/29442788341?pr=18396)

ci-exactly: linux_x64_clang_asan
saienduri pushed a commit that referenced this issue Sep 12, 2024
Progress on #15332. This uses a
new `cpubuilder_ubuntu_jammy_x86_64` dockerfile from
https://github.com/iree-org/base-docker-images.

This stops using the remote cache that is hosted on GCP. Build time
_without a cache_ is about 20 minutes on current runners, while build
_with a cache_ is closer to 10 minutes. Build time without a cache is
closer to 28-30 minutes on new runners. We can try adding back a cache
using GitHub or our own hosted storage.

I tried to continue using the previous cache during this transition
period, but the `gcloud` command needs to run on the host, and I'd like
to stop using the `docker_run.sh` script. I'm hoping we can keep folding
away this sort of complexity by having the build machines run a
dockerfile that includes key environment components like utility tools
and any needed authorization/secrets (see
#18238).

ci-exactly: linux_x64_clang
Signed-off-by: saienduri <[email protected]>
josemonsalve2 pushed a commit to josemonsalve2/iree that referenced this issue Sep 14, 2024
This commit is part of this larger issue that is tracking our migration
off the GCP runners, storage buckets, etc:
iree-org#18238.
In this initial port, we move over one high traffic job
(`linux_x86_64_release_packages`) and a few nightlies
(`linux_x64_clang_tsan`, `linux_x64_clang_debug`) to monitor and make
sure the cluster is working as intended.

Time Comparisons:

Job | Cache? | Runner cluster | Time | Logs
-- | -- | -- | -- | --
linux_x86_64_release_packages | GitHub Cache | AKS cluster | 9 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464301/job/29948809708)
linux_x64_clang_tsan | GCP Cache | AKS cluster | 10 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464292/job/29948816896)
linux_x64_clang_debug | GCP Cache | AKS cluster | 11 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10797464308/job/29948805561)
linux_x64_clang_tsan | No Cache | AKS cluster | 17 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10798471545/job/29952051686)
linux_x64_clang_debug | No Cache | AKS cluster | 13 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10798475582/job/29952064138)
| | | | |
linux_x86_64_release_packages | GitHub Cache | GCP Runners | 11 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10796348911/job/29945148145)
linux_x64_clang_tsan | GCP Cache | GCP Runners | 14 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10789692182/job/29923234380)
linux_x64_clang_debug | GCP Cache | GCP Runners | 15 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10680250213/job/29601266656)

The GCP cache timings for the AKS cluster are not a great representation
of what we will be seeing going forward because the AKS cluster does not
have the setup/authentication to write to the GCP cache. We have changes
coming in
https://github.com/iree-org/iree/tree/shared/runner-cluster-migration
that will spin up an Azure cache using sccache to help with the No Cache
timings. Right now the cluster is using 96 core machines, which we can
probably tone down when the caching work lands.

---------

Signed-off-by: saienduri <[email protected]>
ScottTodd added a commit that referenced this issue Sep 16, 2024
See #18238.

We've finished migrating most load bearing workflows to use a new
cluster of self-hosted runners. These workflows are still using GCP
runners and are disabled:

* `build_test_all_bazel`: this may work on the new cluster using the
existing `gcr.io/iree-oss/base-bleeding-edge` dockerfile, but it uses
some remote cache storage on GCP and I want to migrate that to
https://github.com/iree-org/base-docker-images/. Need to take some time
to install deps, evaluate build times with/without a remote cache, etc.
* `test_nvidia_t4`, `nvidiagpu_cuda`, `nvidiagpu_vulkan`: we'll try to
spin up some VMs in the new cluster / cloud project with similar GPUs.
That's a high priority for us, so maybe within a few weeks.

Additionally, these workflows are still enabled but we should find a
longer term solution for them:

* `linux_arm64_clang` this is still enabled in code... for now. We can
disable
https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml
from the UI
* arm64 packages are also still enabled:
https://github.com/iree-org/iree/blob/cc891ba8e7da3a3ef1c8650a66af0aa53ceed06b/.github/workflows/build_package.yml#L46-L50
@ScottTodd
Member Author

Current status:

  • New cluster of x86_64 Linux CPU build machines on Azure using https://github.com/actions/actions-runner-controller is online. Some documentation on our setup is at https://github.com/saienduri/AKS-GitHubARC-Setup
  • Most workflows have been migrated to using the new cluster
  • Workflows using the new runners only have access to the sccache remote storage when triggered from this repository (not from PRs originating from forks). The ASan workflow in particular is slow because of this: 30 minutes when it could be 10 minutes.
  • The GCP runners have been deregistered and turned off, except for the arm64 runners
  • The Bazel and NVIDIA GPU (CUDA + Vulkan) workflows are currently disabled
  • Some workflows still read from GCP storage buckets. Migrate GCS files to new (ideally public) locations #18518 tracks cleaning those up. If the buckets are made private / deleted before those uses are updated, we'll have some tests to disable
  • We're looking at bringing up Windows CPU build runners that will let us move the current 5 hour nightly Windows build to a 20-30 minute nightly build or ideally a build that runs on every commit/PR. We'll need to figure out the cost / budgeting there and take a look at workflow time, caching optimizations, etc.

raikonenfnu pushed a commit to raikonenfnu/iree that referenced this issue Sep 16, 2024
…e-org#18511)

This commit is part of this larger issue that is tracking our migration
off the GCP runners, storage buckets, etc:
iree-org#18238.

This builds on iree-org#18381, which
migrated
* `linux_x86_64_release_packages`
* `linux_x64_clang_debug`
* `linux_x64_clang_tsan`

Here, we move over the rest of the critical linux builder workflows off
of the GCP runners:
* `linux_x64_clang`
* `linux_x64_clang_asan`

This also drops all CI usage of the GCP cache
(`http://storage.googleapis.com/iree-sccache/ccache`). Some workflows
now use sccache backed by Azure Blob Storage as a replacement. There are
few issues with this (mozilla/sccache#2258)
that prevent us providing read only access to the cache in PRs created
from forks, so **PRs from forks currently don't use the cache and will
have slower builds**. We're covering for this slowdown by using larger
runners, but if we can roll out caching to all builds then we might use
runners with fewer cores.

Along with the changes to the cache, usage of Docker is rebased on
images in the https://github.com/iree-org/base-docker-images/ repo and
the `build_tools/docker/docker_run.sh` script is now only used by
unmigrated workflows (`linux_arm64_clang` and `build_test_all_bazel`).

---------

Signed-off-by: saienduri <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Co-authored-by: Scott Todd <[email protected]>
Co-authored-by: Elias Joseph <[email protected]>
raikonenfnu pushed a commit to raikonenfnu/iree that referenced this issue Sep 16, 2024
…#18526)

See iree-org#18238.

We've finished migrating most load bearing workflows to use a new
cluster of self-hosted runners. These workflows are still using GCP
runners and are disabled:

* `build_test_all_bazel`: this may work on the new cluster using the
existing `gcr.io/iree-oss/base-bleeding-edge` dockerfile, but it uses
some remote cache storage on GCP and I want to migrate that to
https://github.com/iree-org/base-docker-images/. Need to take some time
to install deps, evaluate build times with/without a remote cache, etc.
* `test_nvidia_t4`, `nvidiagpu_cuda`, `nvidiagpu_vulkan`: we'll try to
spin up some VMs in the new cluster / cloud project with similar GPUs.
That's a high priority for us, so maybe within a few weeks.

Additionally, these workflows are still enabled but we should find a
longer term solution for them:

* `linux_arm64_clang` this is still enabled in code... for now. We can
disable
https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml
from the UI
* arm64 packages are also still enabled:
https://github.com/iree-org/iree/blob/cc891ba8e7da3a3ef1c8650a66af0aa53ceed06b/.github/workflows/build_package.yml#L46-L50
@ScottTodd
Member Author

The Bazel build would also benefit from a remote cache we can directly manage and configure for public read + privileged write access.

Instructions for Bazel: https://bazel.build/remote/caching#nginx
Instructions for sccache: https://github.com/mozilla/sccache/blob/main/docs/Webdav.md
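
A hedged sketch of what the Bazel side could look like once such a cache exists; the URL is a placeholder, and write access would be restricted to trusted CI jobs:

    # Hedged sketch: point Bazel at a self-managed HTTP cache (e.g. nginx WebDAV).
    bazel build //... \
      --remote_cache=https://cache.example.iree.dev/bazel \
      --remote_upload_local_results=false   # read-only for untrusted PRs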

ScottTodd added a commit that referenced this issue Sep 19, 2024
Progress on #15332 and
#18238. Fixes
#16915.

This switches the `build_test_all_bazel` CI job from the
`gcr.io/iree-oss/base-bleeding-edge` Dockerfile using GCP for remote
cache storage to the `ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64`
Dockerfile with no remote cache.

With no cache, this job takes between 18 and 25 minutes. Early testing
also showed times as long as 60 minutes, if the Docker command and
runner are both not optimally configured for Bazel (e.g. not using a RAM
disk).

The job is also moved from running on every commit to running on a
nightly schedule while we evaluate how frequently it breaks and how long
it takes to run. If we set up a new remote cache
(https://bazel.build/remote/caching), we can move it back to running
more regularly.
ScottTodd added a commit that referenced this issue Sep 23, 2024
Progress on #15332. This was the
last active use of
[`build_tools/docker/`](https://github.com/iree-org/iree/tree/main/build_tools/docker),
so we can now delete that directory:
#18566.

This uses the same "cpubuilder" dockerfile as the x86_64 builds, which
is now built for multiple architectures thanks to
iree-org/base-docker-images#11. As before, we
install a qemu binary in the dockerfile, this time using the approach in
iree-org/base-docker-images#13 instead of a
forked dockerfile.

Prior PRs for context:
* #14372
* #16331

Build time varies pretty wildly depending on cache hit rate and the
phase of the moon:

| Scenario | Cache hit rate | Time | Logs |
| -- | -- | -- | -- |
| Cold cache | 0% | 1h45m | [Logs](https://github.com/iree-org/iree/actions/runs/10962049593/job/30440393279) |
| Warm (?) cache | 61% | 48m | [Logs](https://github.com/iree-org/iree/actions/runs/10963546631/job/30445257323) |
| Warm (hot?) cache | 98% | 16m | [Logs](https://github.com/iree-org/iree/actions/runs/10964289304/job/30447618503?pr=18569) |

CI history
(https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml?query=branch%3Amain)
shows that regular 97% cache hit rates and 17 minute job times are
possible. I'm not sure why one test run only got 61% cache hits. This
job only runs nightly, so that's not a super high priority to
investigate and fix.

If we migrate the arm64 runner off of GCP
(#18238) we can further simplify
this workflow by dropping its reliance on `gcloud auth
application-default print-access-token` and the `docker_run.sh` script.
Other workflows are now using `source setup_sccache.sh` and some other
code.
ScottTodd added a commit that referenced this issue Sep 23, 2024
Fixes #15332.

The dockerfiles in this repository have all been migrated to
https://github.com/iree-org/base-docker-images/ and all uses in-tree
have been updated.

I'm keeping the
https://github.com/iree-org/iree/blob/main/build_tools/docker/docker_run.sh
script for now, but I've replaced nearly all uses of that with GitHub's
`container:` argument
(https://docs.github.com/en/actions/writing-workflows/choosing-where-your-workflow-runs/running-jobs-in-a-container).
All remaining uses need to run some code outside of Docker first, like
`gcloud auth application-default print-access-token`. As we continue to
migrate jobs off of GCP runners
(#18238), we'll be using a
different authentication and caching setup that removes that
requirement.
@banach-space
Collaborator

Hey @ScottTodd, IREE has now been added to the list of supported repos for https://gitlab.arm.com/tooling/gha-runner-docs 🥳

Would you be able to give that a try? C7g instances include SVE (these are Graviton 3 machines), and that's what I suggest using. Here's an overview of the hardware:

I'd probably start with c7g.4xlarge as the medium option and see how things go. I am obviously available to help with this :)

-Andrzej

@ScottTodd
Member Author

Thanks! Do you know if the iree-org/iree repository or the whole iree-org organization was approved? I'm looking at where we would install the app and what access it would want/need.

@banach-space
Collaborator

Just the repo. Let me know if that's an issue - these are "early days" and IREE is effectively one of the guinea pigs :)

ScottTodd added a commit that referenced this issue Oct 1, 2024
Context:
#18238 (comment)

This uses https://gitlab.arm.com/tooling/gha-runner-docs to run on Arm
Hosted GitHub Action (GHA) Runners, instead of the runners that Google
has been hosting. Note that GitHub also offers Arm runners, but they are
expensive and require a paid GitHub plan to use
(https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners/about-larger-runners).

For now this is continuing to run nightly, but we could also explore
running more regularly if Arm wants and approves. We'd want to figure
out how to use a build cache efficiently for that though. We can use
sccache storage on Azure, but there might be charges between Azure and
AWS for the several gigabytes of data moving back and forth. If we set
up a dedicated cache server
(#18557), we'll at least have
more visibility into and control over the storage and compute side of
billing.

Test runs:
* https://github.com/iree-org/iree/actions/runs/11114007934 (42 minutes)
* https://github.com/iree-org/iree/actions/runs/11114658487 (40 minutes)
* https://github.com/iree-org/iree/actions/runs/11114757082 (38 minutes)
* https://github.com/iree-org/iree/actions/runs/11128634554 (40 minutes)

skip-ci: no impact on other builds
ScottTodd added a commit that referenced this issue Oct 1, 2024
Progress on #18238. See also
#18643.

This is for nightly package builds / releases.

Untested. We can roll this back if it fails to build.

skip-ci: no impact on other workflows.
@ScottTodd
Member Author

ARM runners are migrated (assuming tonight's nightly package build works).

We're still working on bringing back NVIDIA/CUDA runners and larger Windows runners.

@jpienaar
Member

Should I pull down the other ARM runners?

@marbre
Member

marbre commented Oct 10, 2024

Should I pull down the other ARM runners?

Yes, that should be fine.

Groverkss pushed a commit to Groverkss/iree that referenced this issue Dec 1, 2024
…9242)

This code was used to configure self-hosted runners on GCP. We have
migrated self-hosted runners to Azure and on-prem runners, so this fixes
iree-org#18238. Sub-issues to add back
Windows and NVIDIA GPU coverage will remain open.
giacs-epic pushed a commit to giacs-epic/iree that referenced this issue Dec 4, 2024
…9242)

This code was used to configure self-hosted runners on GCP. We have
migrated self-hosted runners to Azure and on-prem runners, so this fixes
iree-org#18238. Sub-issues to add back
Windows and NVIDIA GPU coverage will remain open.

Signed-off-by: Giacomo Serafini <[email protected]>