
Convert more CI jobs to be package-based #16203

Closed · 20 of 22 tasks
ScottTodd opened this issue Jan 24, 2024 · 41 comments
Labels: infrastructure (Relating to build systems, CI, or testing)
Assignee: ScottTodd

@ScottTodd (Member) commented Jan 24, 2024

Current tasks

Background information

Test workflows (for integration and e2e tests as well as benchmarks) should be consuming packages, not source build directories. Build workflows could run unit tests.

For example, the test_all workflow currently downloads a nearly 10GB build archive here: https://github.com/openxla/iree/blob/79b6129e2333ae26e7e13b68c27566102dcece6e/.github/workflows/ci.yml#L290-L294

The PkgCI workflows show how this could be set up.
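
As a rough illustration of the package-based shape (runner labels, artifact names, and paths here are placeholders rather than the actual PkgCI configuration), a test job would download the built packages and point the test build at them instead of downloading a multi-gigabyte build directory:

```yaml
jobs:
  test_gpu:
    needs: build_packages
    runs-on: [self-hosted, gpu]            # illustrative runner label
    steps:
      - uses: actions/checkout@v4
      - name: Download compiler/runtime packages
        uses: actions/download-artifact@v4
        with:
          name: release-packages           # placeholder artifact name
          path: packages
      - name: Configure, build test deps, and run tests against the packages
        run: |
          cmake -G Ninja -B build-tests . \
            -DIREE_BUILD_COMPILER=OFF \
            -DIREE_HOST_BIN_DIR="${PWD}/packages/bin"
          cmake --build build-tests --target iree-test-deps
          ctest --test-dir build-tests --output-on-failure
```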

This may involve restructuring some parts of the project (like tests/e2e/) to be primarily based on packages and not the full CMake project.

In ci.yml, these are the jobs that currently depend on the build archive produced by build_all:

  • test_all
  • test_gpu
  • test_a100
  • test_tf_integrations
  • test_tf_integrations_gpu
  • build_benchmark_tools
  • build_e2e_test_artifacts
  • cross_compile_and_test
  • build_and_test_android
  • test_benchmark_suites

Related discussions:

ScottTodd added the infrastructure (Relating to build systems, CI, or testing) label Jan 24, 2024
ScottTodd self-assigned this Jan 24, 2024
@ScottTodd (Member Author)

We could separate "unit tests" from "integration/e2e tests" in the CMake project. Unit tests should be able to run right after the build step, while integration tests should use a release/dist package for compiler tools and a [cross-compiled] runtime build for test binaries.

I'm considering a nested CMake project for integration tests, replacing the iree-test-deps utility target, but that might not be needed.


Take these test_gpu logs as an example.

That is running this command:

```yaml
        run: |
          ./build_tools/github_actions/docker_run.sh \
              --env IREE_NVIDIA_SM80_TESTS_DISABLE \
              --env IREE_MULTI_DEVICE_TESTS_DISABLE \
              --env IREE_CTEST_LABEL_REGEX \
              --env IREE_VULKAN_DISABLE=0 \
              --env IREE_VULKAN_F16_DISABLE=0 \
              --env IREE_CUDA_DISABLE=0 \
              --env IREE_NVIDIA_GPU_TESTS_DISABLE=0 \
              --env CTEST_PARALLEL_LEVEL=2 \
              --env NVIDIA_DRIVER_CAPABILITIES=all \
              --gpus all \
              gcr.io/iree-oss/nvidia@sha256:892fefbdf90c93b407303adadfa87f22c0f1e84b7e819e69643c78fc5927c2ba \
              bash -euo pipefail -c \
                "./build_tools/scripts/check_cuda.sh
                ./build_tools/scripts/check_vulkan.sh
                ./build_tools/cmake/ctest_all.sh ${BUILD_DIR}"
```

With all of those filters set, these are the only test source directories included (a sketch of the corresponding ctest filtering follows the bullet points below):

```
iree/hal/drivers/cuda2/cts                          =  83.08 sec*proc (24 tests)
iree/hal/drivers/vulkan                             =   0.33 sec*proc (1 test)
iree/hal/drivers/vulkan/cts                         =  44.97 sec*proc (12 tests)
iree/modules/check/test                             =   5.95 sec*proc (2 tests)
iree/samples/custom_dispatch/vulkan/shaders         =   2.19 sec*proc (2 tests)
iree/samples/simple_embedding                       =   0.65 sec*proc (1 test)
iree/tests/e2e/linalg                               =   8.50 sec*proc (7 tests)
iree/tests/e2e/linalg_ext_ops                       =  32.42 sec*proc (11 tests)
iree/tests/e2e/matmul                               =   8.48 sec*proc (5 tests)
iree/tests/e2e/regression                           =  47.80 sec*proc (41 tests)
iree/tests/e2e/stablehlo_models/mnist_train_test    =  29.41 sec*proc (2 tests)
iree/tests/e2e/stablehlo_ops                        = 410.55 sec*proc (302 tests)
iree/tests/e2e/tensor_ops                           =   9.71 sec*proc (6 tests)
iree/tests/e2e/tosa_ops                             = 142.57 sec*proc (120 tests)
iree/tests/e2e/vulkan_specific                      =   5.20 sec*proc (5 tests)
iree/tests/transform_dialect/cuda                   =   0.20 sec*proc (1 test)
```
  • iree/hal/drivers (except cts) is pure runtime code
  • iree/hal/drivers/*/cts can use compiler tools
  • iree/tests/ is mostly made up of "check" tests. A few tests use lit (with iree-compile, iree-opt, iree-run-mlir, FileCheck, iree-run-module, etc.)
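
Here is a sketch of how the IREE_*_DISABLE settings become ctest filters (the driver=metal regex is the one quoted later in this thread from ctest_all.sh; the variable handling and timeout value are illustrative rather than the exact script logic):

```yaml
- name: Run filtered test suite (sketch)
  env:
    IREE_METAL_DISABLE: 1                  # no Metal device on this runner
    BUILD_DIR: build-tests
  run: |
    # Each IREE_*_DISABLE=1 setting appends a label regex to the exclusion
    # list (the driver=metal regex below is the one ctest_all.sh adds).
    declare -a label_exclude_args=()
    if (( IREE_METAL_DISABLE == 1 )); then
      label_exclude_args+=("^driver=metal$")
    fi
    declare -a ctest_args=(--test-dir "${BUILD_DIR}" --timeout 900 --output-on-failure)
    if (( ${#label_exclude_args[@]} > 0 )); then
      # Join the individual regexes into a single alternation for ctest.
      ctest_args+=(--label-exclude "$(IFS="|"; echo "${label_exclude_args[*]}")")
    fi
    ctest "${ctest_args[@]}"
```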

@stellaraccident (Collaborator)

+1 - I've been meaning to do something like that.

The line between unit and integration tests can sometimes be blurred a bit, but a unit test can never require special hardware. That belongs in something that can be independently executed with tools provided out of band.

@ScottTodd (Member Author)

I'm deciding which of these job sequences to aim for:

  • build_dist_package --> compile_test_deps --> test_gpu
  • build_dist_package --> test_gpu

Portable targets like Android require running the compiler on a host machine, but other jobs like test_gpu that run on Linux/Windows can run the compiler themselves if they want. The current tests/e2e/ folder after building iree-test-deps is ~34MB with ~1500 files, and large CPU runners take around 30 seconds to generate all of those .vmfb files. All of those stats will likely grow over time. We don't really want to be spending CPU time on GPU machines, but keeping that flexibility for e2e Python tests could be useful.

@ScottTodd (Member Author) commented Jan 25, 2024

Got some good data from my test PR #16216 (on the first try too, woohoo!)


Here's a sample run using just the "install" dir from a prior job: https://github.com/openxla/iree/actions/runs/7659217597/job/20874202528?pr=16216.

| Stage | Time taken | Notes |
| --- | --- | --- |
| Checkout | 1m16s | Could use the smaller 'runtime' checkout? |
| Download install/ dir | 5s | 2.97GB file, could be trimmed - see notes below |
| Extract install/ dir | 30s | |
| Build runtime | 1m | |
| Build test deps | 2m04s | Currently generating all test .vmfb files, even those for CPU |
| Test all | 6m20s | |
| TOTAL | 11m47s | |

The install dir has two copies of the 3.6GB libIREECompiler.so that don't appear to be symlinks:

  • install/lib/libIREECompiler.so
  • install/python_packages/iree_compiler/iree/compiler/_mlir_libs/libIREECompiler.so

If this was using an "iree-dist" package instead of the full "install/" directory, that would just be the bin/ and lib/ folders, instead of also including python_packages/.
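
As a sketch of the kind of trimming the notes above mention (the step and archive names are assumptions, not the current workflow):

```yaml
- name: Archive install directory without the duplicate compiler library
  run: |
    # python_packages/ carries its own copy of the ~3.6GB libIREECompiler.so;
    # test jobs only need bin/ and lib/, so leave the Python packages out.
    tar -cJf iree-install.tar.xz \
      --exclude='install/python_packages' \
      install/
```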


Compared to a baseline run using the full "build" dir from a prior job: https://github.com/openxla/iree/actions/runs/7647953292/job/20840561210

| Stage | Time taken | Notes |
| --- | --- | --- |
| Checkout | 46s | |
| Download build/ dir | 9s | 5.94 GB (was 8.5GB) |
| Extract install/ dir | 1m01s | |
| Test all | 6m37s | |
| TOTAL | 8m46s | |

It doesn't seem too unreasonable to keep the test artifact generation on the same job (on a GPU machine), at least with the current number of test cases. It would be nice to share workflow configurations between "desktop Linux GPU" and "Android GPU" though.
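
One way to share that configuration would be a reusable workflow; here is a minimal `workflow_call` sketch, where the input names, runner labels, and file name are made up for illustration (package download, configure, and build steps elided):

```yaml
# .github/workflows/test_gpu_common.yml (hypothetical shared workflow)
on:
  workflow_call:
    inputs:
      runner:
        type: string
        required: true
      ctest-label-regex:
        type: string
        default: ""
jobs:
  test:
    runs-on: ${{ inputs.runner }}
    env:
      BUILD_DIR: build-tests
      IREE_CTEST_LABEL_REGEX: ${{ inputs.ctest-label-regex }}
    steps:
      - uses: actions/checkout@v4
      # ...download packages, configure, and build test deps here...
      - name: Run tests
        run: ./build_tools/cmake/ctest_all.sh "${BUILD_DIR}"
```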

@benvanik (Collaborator)

oof at double libIREECompiler.so

may need some more samples - 46s -> 1m16s for the same checkout makes me wonder if the timescales match

@ScottTodd (Member Author)

> oof at double libIREECompiler.so

Stella has been suggesting using a package distribution like iree-dist (without that problem), but I was just starting with the install/ directory from a regular source build.

> may need some more samples - 46s -> 1m16s for the same checkout makes me wonder if the timescales match

That variance looks about right for Linux (just clicking through the asan, tsan, build_all, test_all, etc. jobs on a run like https://github.com/openxla/iree/actions/runs/7647953292 and looking at the checkout step in each).

The smaller "runtime only" checkout is more like 5s (no submodules) + 6s (runtime submodules): https://github.com/openxla/iree/actions/runs/7647953292/job/20839868561.

Having these test jobs use the main project makes it hard to draw solid lines around components like "compiler" and "runtime", though. I don't really want to accidentally exclude certain tests by forcing "integration test" jobs to use the big -DIREE_BUILD_COMPILER=OFF hammer. If tests were in a separate CMake project (nested in a subdir of the repo) then it would be easier to say to test authors/maintainers (oh... that's me) "these have to use installed artifacts, work with that".

@ScottTodd (Member Author)

I think I'll try converting tests/ (and later possibly samples/) into a standalone CMake project, possibly with the ability to still include it from the root project for developer source builds.

The test_gpu job would still build the runtime (for CUDA, Vulkan, etc. unit tests that run on a GPU), but it will also build the "tests" subproject using iree-dist (or install/). That may pull in FileCheck, llvm-lit.py, and other tools from LLVM that are needed to run the tests, but it will not build the compiler binaries from source.

@ScottTodd (Member Author)

Found a few things to fix first / as part of this.

Most of the tests in tests/e2e/stablehlo_models/ have been skipped, I think following #15837 with this change:

ctest_all.sh:

```diff
+if (( IREE_METAL_DISABLE == 1 )); then
+  label_exclude_args+=("^driver=metal$")
+fi
```

The combination of these different filtering mechanisms is excluding tests since no CI configuration has both Vulkan and Metal:
https://github.com/openxla/iree/blob/f3b008c6db310f787ad76f151c21a30f72b14794/tests/e2e/stablehlo_models/CMakeLists.txt#L32-L35 https://github.com/openxla/iree/blob/f3b008c6db310f787ad76f151c21a30f72b14794/tests/e2e/stablehlo_models/edge_detection.mlir#L2-L4

I don't think we should use RUN: [[ $IREE_*_DISABLE == 1 ]] any longer. These tests should have separate targets for each HAL driver ('check' test suites do this automatically).

@ScottTodd (Member Author)

I was considering disallowing "lit" integration tests in tests/ altogether, but many of them are legitimate uses:
[screenshot]

So I think we should still have the lit / FileCheck tooling available on host platforms that run those tests.

@stellaraccident (Collaborator)

Yeah, we can fix that double compiler binary thing with the proper flow.

Looks like the main variance in timing is coming from build test deps. Is that mostly coming down to CMake configure or something? Other than that, it is the same work done in a different place.

@ScottTodd (Member Author)

> Looks like the main variance in timing is coming from build test deps. Is that mostly coming down to CMake configure or something? Other than that, it is the same work done in a different place.

Here is the timing on the GPU machine:

| Step | Timing |
| --- | --- |
| Configure | 20s |
| Build runtime | 60s |
| Build test deps (all) | 2m04s |
| Run ctest (GPU only) | 6m20s |

The "build test deps" step is running the compiler to generate .vmfb files for iree-check-module:

```
[1/1421] Generating check_vulkan-spirv_vulkan_conv2d.mlir_module.vmfb from conv2d.mlir
[2/1421] Generating check_winograd_vulkan-spirv_vulkan_conv2d.mlir_module.vmfb from conv2d.mlir
[3/1421] Generating check_large_linalg_matmul_cuda_f32_to_i4.mlir_module.vmfb from f32_to_i4.mlir
[4/1421] Generating check_large_linalg_matmul_cuda_conv2d.mlir_module.vmfb from conv2d.mlir
[5/1421] Generating check_vmvx_local-task_conv2d.mlir_module.vmfb from conv2d.mlir
[6/1421] Generating check_large_linalg_matmul_cuda_i4_to_f32.mlir_module.vmfb from i4_to_f32.mlir
[7/1421] Generating check_vulkan-spirv_vulkan_i4_to_f32.mlir_module.vmfb from i4_to_f32.mlir
[8/1421] Generating check_winograd_llvm-cpu_local-task_conv2d.mlir_module.vmfb from conv2d.mlir
[9/1421] Generating check_llvm-cpu_local-task_sort.mlir_module.vmfb from sort.mlir
```

If "build test deps" was taking 5+ minutes, I'd be more inclined to move it to a CPU machine and pass the test files to the GPU runner, as we have the Android / RISC-V cross-compilation test jobs currently configured. We might still end up with that sort of setup eventually, but I don't think it is needed for a "v0" of package-based CI.

@stellaraccident (Collaborator)

Yeah, and I'd rather make the compiler faster than build exotic infra unless it becomes needed...

@ScottTodd (Member Author)

Made some pretty good progress on prerequisite tasks this week.


The latest thing I'm trying to enable is the use of iree_lit_test without needing to build the full compiler. That would let us run various tests under samples/ and tests/ that use lit instead of check (some for good reasons, others just by convention) with an iree-dist package*.

I have a WIP commit here that gets close: ScottTodd@47bb19a

* another way to enable that is to allow test jobs to set IREE_BUILD_COMPILER and then be very careful about which targets those jobs choose to build before running tests (ctest for now, but also pytest in the future?). Some tests require building googletest binaries like hal/drivers/vulkan/cts/vulkan_driver_test.exe, so I've been leaning on just building the all target, but we could instead have more utility targets like iree-test-deps. Actually, something like that may be a more robust idea... @stellaraccident have any suggestions? Something like iree-run-tests from #12156 ?

@benvanik (Collaborator)

would really like to not overload IREE_BUILD_COMPILER - KISS - if we can't make iree_lit_test use FileCheck from the package for some reason then we should convert all those tests to something else (check tests, cmake tests, etc)

@ScottTodd (Member Author)

Simple is what I'm aiming for... just figuring out how to get there still.

I want test jobs to run any tests that either use special hardware (GPUs) or use both the compiler and the runtime.

I'd like them to follow this pattern:

```
cmake -B ../iree-build-tests/ . -DIREE_HOST_BIN_DIR={PATH_TO_IREE_DIST} {CONFIGURE_OPTIONS}
cmake --build ../iree-build-tests/ {--target SOME_TARGET?}
ctest --test-dir ../iree-build-tests/
```
  • If {CONFIGURE_OPTIONS} can be empty or at least leave off -DIREE_BUILD_COMPILER=OFF then great
  • If the build step can build the default `all` target, great. Otherwise, I can build iree-test-deps or some similar utility target

@stellaraccident (Collaborator)

> would really like to not overload IREE_BUILD_COMPILER - KISS - if we can't make iree_lit_test use FileCheck from the package for some reason then we should convert all those tests to something else (check tests, cmake tests, etc)

What Ben says. I've been down the path of mixing this stuff up and don't want to see anyone else fall in and then have to dig out of the pit.

We can package some of the test utilities. No big deal.

ScottTodd added a commit that referenced this issue Jan 30, 2024
Progress on #16203.

* Install more tools that are used for lit tests
* Import those installed tools
* Filter based on tools existing (imported or built) rather than based
on building the compiler
* Add tests subdirectory _after_ tools are imported/defined

Now I can run more tests from a build that imports compiler tools rather
than build them from source:
```
cmake -G Ninja -B ../iree-build-rt/ . \
  -DIREE_BUILD_COMPILER=OFF \
  -DIREE_HOST_BIN_DIR=D:\dev\projects\iree-build\install\bin \
  -DLLVM_EXTERNAL_LIT=D:\dev\projects\iree\third_party\llvm-project\llvm\utils\lit\lit.py
cmake --build ../iree-build-rt
cmake --build ../iree-build-rt --target iree-test-deps
ctest --test-dir ../iree-build-rt
```

New failures that I see are:

* `iree/tests/e2e/regression/libm_linking.mlir.test`
  * WindowsLinkerTool looks for `lld-link`, which we don't install. That appears to be a copy of `lld`, which is different from IREE's `iree-lld` (which complains with `Expected -flavor <gnu|link|darwin|wasm>` when I try to use it in that test)
* `iree/samples/custom_module/dynamic/test/example.mlir.test`
  * Not sure what's happening here - this passes in my regular build and works when run standalone but segfaults when run under ctest in the installed test suite
@ScottTodd (Member Author)

I think enough of the tests are segmented now (or will be once I land a few PRs).

Next I was planning on

  • switching a few existing jobs from using the "build" archive to using the "install" archive in-place. That will let us closely check for coverage gaps
  • switching from "install" to "iree-dist", try using github artifacts instead of GCS
  • taking a closer look at which build jobs are actually needed and how they are configured... I have a feeling that we don't need asan/tsan/tracing/debug/etc. as currently implemented (we still want that coverage... but the current jobs are unwieldy)

ScottTodd added a commit that referenced this issue Jan 31, 2024
Progress on #16203

This mechanism for filtering inside lit tests
* is a poor match for package-based CI
* has caused some tests to be skipped unintentionally (a test suite is
skipped if _any_ tag is excluded, so including Vulkan _and_ Metal tests
in the same suite results in the suite always being skipped)
* is awkward for developers - running a single lit file to test multiple
backends is an interesting idea, but it requires developers to set these
environment variables instead of filtering at configure time

The `IREE_.*_DISABLE` environment variables are still used as a way to
instruct the various `build_tools/.*/.*test.*.sh` scripts which test
labels to filter. The existing CI jobs use those environment variables
extensively.

Specific changes:

* Drop vulkan/metal and only use vmvx and llvm-cpu in some lit tests 
* Convert some lit tests into check tests
@ScottTodd (Member Author)

> switching a few existing jobs from using the "build" archive to using the "install" archive in-place

Or I could just fork the jobs over to pkgci... that might be better. 🤔 (would want iree-dist over there, and maybe a few extra tools included like lit.py)

@ScottTodd (Member Author)

Okay, I have a script that "tests a package": ScottTodd@43b3922. Going to try wiring that up to GitHub Actions, first pointed at the iree-dist-*.tar.xz files from a nightly release like https://github.com/openxla/iree/releases/tag/candidate-20240131.787.

The script is basically:

```
cmake -B build-tests/ -DIREE_BUILD_COMPILER=OFF -DIREE_HOST_BIN_DIR={PACKAGE_DIR}/bin
cmake --build build-tests/
cmake --build build-tests/ --target iree-test-deps
ctest --test-dir build-tests
```

with all the filtering goo from https://github.com/openxla/iree/blob/main/build_tools/cmake/ctest_all.sh (now that I say that, I realize I could also call that script... but it might be worth keeping this self-contained before it gets too entangled again)

The GitHub Action will need to (a rough sketch follows this list):

  • Clone the repo, with submodules
    • I'd like to limit it to only runtime submodules, need at least lit.py installed for that
  • Download the release and unzip the iree-dist files (OR pull them from another action, like pkgci_build_packages.yml)
  • Run the script (under Docker?)
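
A rough sketch of those steps (the artifact/script names and URL are placeholders; the real script is the one linked above):

```yaml
jobs:
  test_package_cpu:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: true                 # ideally limited to runtime submodules only
      - name: Download and extract an iree-dist package
        run: |
          # RELEASE_URL is a placeholder for an iree-dist-*.tar.xz asset from a
          # nightly release (or for artifacts from pkgci_build_packages.yml).
          wget -O iree-dist.tar.xz "${RELEASE_URL}"
          mkdir -p iree-dist
          tar -xJf iree-dist.tar.xz -C iree-dist
      - name: Run the package test script
        # Placeholder path for the script from ScottTodd@43b3922.
        run: ./build_tools/pkgci/test_packages.sh iree-dist
```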

We should then be able to use that for "test_cpu", "test_gpu_nvidia_a100", "test_gpu_nvidia_t4" and "test_gpu_amd_???" jobs (all Linux, but Windows/macOS could also work)

Could then include Python packages in the tests too and fold the "test_tf_integrations_gpu" job and other Python tests in as well.

The remaining jobs will need something else - probably just a different pile of GitHub Actions yaml...

  • build_benchmark_tools
  • build_e2e_test_artifacts
  • cross_compile_and_test
  • build_and_test_android
  • test_benchmark_suites

@ScottTodd (Member Author) commented Feb 1, 2024

Nice, that test script is working with iree-dist from a release.

Here's the workflow file: https://github.com/ScottTodd/iree/blob/infra-test-pkg/.github/workflows/test_package.yml

A few sample runs on a standard GitHub Actions Linux runner (CPU tests only):

(100% tests passed, 0 tests failed out of 647 🥳 -- though that does filter out a few preexisting and new test failures)

Next I'll give that a coat of paint (organize the steps, prefetch Docker, trim the build, make the input package source configurable, etc.).

@ScottTodd (Member Author)

I have a reasonable line of sight to removing ci.yml entirely:

@ScottTodd (Member Author)

> We can package some of the test utilities. No big deal.

Opinions on including each of these tools in the iree-compiler / iree-runtime python packages?

  • generate_embed_data
  • iree-flatcc-cli
  • FileCheck
  • not
  • iree-opt
  • iree-run-mlir

That gets us coverage for most tests using just packages and no "install" directory. Other tools that might be useful, but which I think we can avoid, are clang and llvm-link.

@ScottTodd (Member Author)

While I ponder including more tools in the packages, I found a lightweight way to exclude all lit tests from a test run. We may be able to trust the full compiler build job to cover lit tests, then require that package CI tests use other test facilities (pytest, native tests, etc.).
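
For illustration, one lightweight exclusion along those lines could key off the test names, since the lit tests above all surface as ctest entries ending in `.mlir.test`; this is a sketch of that idea, not necessarily the mechanism actually used:

```yaml
- name: Run everything except lit tests (sketch)
  run: |
    # -E excludes tests whose names match the regex; lit-based tests in this
    # tree show up as "<path>.mlir.test" entries.
    ctest --test-dir build-tests -E '\.mlir\.test$' --output-on-failure
```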

@ScottTodd (Member Author)

Well, generate_embed_data and iree-flatcc-cli are pretty much mandatory for building the runtime, tests, and samples. Right now we only build those if IREE_HOST_BIN_DIR is not set, expecting them to come from that folder if it is. On host platforms we can build those tools cheaply and easily, but when cross compiling they should really come from the host, not the target build.

So I'm thinking about including those in the iree-runtime package. Probably first renaming generate_embed_data to iree-c-embed-data, matching the CMake function name.

ScottTodd added a commit that referenced this issue Jul 24, 2024
Progress on #16203. Once all jobs
in `ci.yml` are migrated to pkgci or their own workflow files this will
be the primary "build the compiler and run the compiler tests"
workflow/job.

For now I just forked the existing `build_all.yml` workflow, but future
cleanup should change how Docker is used here and consider just inlining
the `build_all.sh` and `ctest_all.sh` scripts.

skip-ci: adding a new workflow (with its own triggering)
ScottTodd added a commit that referenced this issue Jul 24, 2024
…18001)

Progress on #16203.

First, this renames `generate_embed_data` to `iree-c-embed-data`. The
name here dates back to having a single implementation shared across
multiple projects using Bazel (and Google's internal Blaze). There isn't
much code in the tool, but we've been using it for long enough to give
it a name matching our other tools IMO.

Next, this includes `iree-c-embed-data` and `iree-flatcc-cli` in the
`iree-runtime` Python package (https://pypi.org/project/iree-runtime/).
Both of these host tools are required for building large parts of the
runtime (embed data for builtins and bitcode, flatcc for schemas), so
including them in packages will allow users (and pkgci workflows) to get
them without needing to build from source.
@ScottTodd (Member Author)

I'd like for jobs consuming/testing the packages produced by pkgci to be easy to disable or change the triggers for. Ideally this would be possible with small code changes to individual workflow files or with changes via a web interface:
[screenshot]

GitHub doesn't make this easy. Buildkite or other CI providers might be a better fit for this usage, but I want to keep using GitHub Actions if possible.

I ran two experiments so far:

  • workflow_run (docs here): This triggers a workflow after another workflow completes. Looks great: we'd run pkgci_build_packages.yml, then a pkgci_test_packages_nvidia.yml workflow would watch for that and start on its own. Unfortunately, the triggered workflow does not run in the pull request context or appear in 'checks' on pull requests, so results wouldn't be easily visible and they can't be marked as 'required' before merging is allowed.
  • Reusable workflows (docs here): these include other workflows via `uses:`. I was hoping that if I disabled the reusable workflow in the UI then it wouldn't trigger. Unfortunately, `uses:` acts like an inlined string substitution and just pulls the callee workflow file into the caller workflow, so it ignores whether the callee is enabled.

Next I want to try https://github.com/lewagon/wait-on-check-action . Test workflows would have a "wait" job running on standard GitHub-hosted runners that would poll for the build_packages check completing. When the check completes they would start jobs using self-hosted runners that test the packages. Those test workflows could then be disabled independently from the core package workflow(s). Polling on a runner for ~20 minutes is less than ideal though, especially since GitHub sets a limit on the number of concurrent GitHub-hosted runners (20 on the free GitHub plan: docs here). Maybe we could run some very small self-hosted runners (like 10 runners per CPU core) whose only job was to poll.
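
A minimal sketch of that polling setup, assuming the commonly documented inputs of lewagon/wait-on-check-action (the exact input names and version should be checked against that action's README; job names and runner labels are illustrative):

```yaml
jobs:
  wait_for_build_packages:
    runs-on: ubuntu-latest
    steps:
      - name: Poll for the build_packages check to finish
        uses: lewagon/wait-on-check-action@v1.3.4
        with:
          ref: ${{ github.event.pull_request.head.sha || github.sha }}
          check-name: build_packages
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          wait-interval: 60
  test_packages_nvidia:
    needs: wait_for_build_packages
    runs-on: [self-hosted, nvidia-gpu]     # illustrative label
    steps:
      - name: Fetch packages and run tests
        run: echo "download the build_packages artifacts and run tests here"
```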

@ScottTodd (Member Author)

> Polling on a runner for ~20 minutes is less than ideal though, especially since GitHub sets a limit on the number of concurrent GitHub-hosted runners (20 on the free GitHub plan: docs here).

For opt-in jobs on PRs polling might not be so bad:

  1. Add label to PR
  2. Trigger action runs on
      pull_request_target:
        types:
          - labeled
  3. Trigger action (or another action it runs) polls until the build_package check finishes
    • This will spin for 10-20 minutes if started immediately after PR creation
    • This will run with no delay if build_package already completed (dev or reviewer: "unit tests passed, let's now also check benchmarks on system foo and system bar")
  4. If the build_package check was not successful, skip/fail
  5. Opt-in jobs use the artifacts from build_package to run their tests/benchmarks

Because polling would only occur on a limited subset of PRs, we wouldn't (hopefully) use too many concurrent runners for it. That would then let us get the benefits of decoupled workflows that can be independently triggered and enabled/disabled.

@ScottTodd (Member Author)

End of week updates:

  • With Migrate GPU test jobs to pkgci. #18007 I've finished migrating most workflows to pkgci.
    • build_and_test_android is the main holdout, but otherwise build_all can be deleted in favor of ci_linux_x64_clang.yml. Then we'll no longer be uploading the 3GB "install" archive on every CI run.
    • The remaining jobs in that workflow can either stay there or move into new files: build_test_all_bazel, build_test_runtime, small_runtime, tracing.
  • I'll be updating the build status badge(s) in our root README.md file. Might be able to combine multiple workflows into a single status, or we can split by platform. Not sure if I want all builds listed or just a few:
    [screenshot]
  • I'll be adding a new page to our developer docs with a full set of workflow status badges, information about each workflow, debugging/maintenance tips, explanations about github-hosted vs self-hosted runners, and instructions for contributors to add/propose new runners and workflows:
    [screenshot]

ScottTodd added a commit that referenced this issue Jul 30, 2024
Progress on #16203. Depends on
#18000.

These jobs used to use the 3.2GB install directory produced by `cmake
--build full-build-dir --target install` in the `build_all` job. Now
they use the 73MB Python packages produced by `python -m pip wheel
runtime/` and `python -m pip wheel compiler/` in the `build_packages`
job. Python packages are what we expect users to consume, so test jobs
should use them too.
* Note that the Python packages will be larger once we enable asserts
and/or debug symbols in them. These tests may also fail with less useful
error messages and callstacks as a result of this change until that is
fixed.

I tried to keep changes to the workflow jobs minimal for now. Once the
migrations are further along we can cut out some of the remaining layers
of scripts / Dockerfiles. As before, these jobs are all opt-in on
presubmit (always running on LLVM integrate PRs or PRs affecting
NVGPU/AMDGPU code). Diffs between previous jobs and new jobs to confirm
how similar they are:

Job name | Logs before | Logs after | Notes
-- | -- | -- | --
`test_nvidia_t4` | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102664951/job/27939423899) | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102841435/job/27939725841?pr=18007) | 433 tests -> 430 tests<br>skipping `samples/custom_dispatch/vulkan/shaders`<br>`IREE custom_dispatch/vulkan/shaders ignored -- glslc not found`<br>(no longer running under Docker)
`test_amd_mi250` | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102664951/job/27939423747) | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102841435/job/27939725347?pr=18007) | 138 tests before/after
`test_amd_mi300` | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102664951/job/27939424223) | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102841435/job/27939725525?pr=18007) | 141 tests before/after
`test_amd_w7900` | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102664951/job/27939424084) | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102841435/job/27939725679?pr=18007) | 289 tests before/after

Each job is now included from its own standalone workflow file, allowing
for testing of individual workflows using `workflow_dispatch` triggers.
I have some other ideas for further decoupling these optional jobs from
the core workflow(s).

ci-extra: test_amd_mi250, test_amd_mi300, test_amd_w7900, test_nvidia_t4
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this issue Jul 30, 2024
…ree-org#17274)

As we add more test jobs (specifically for AMDGPU on self-hosted mi250
and w7900 runners), relying on the `gcloud` CLI to interface with
workflow artifacts will require extra setup (install and auth). As these
files are public and the jobs only need to read and not write, plain
`wget` should be enough.

Progress on iree-org#17159 and
iree-org#16203

ci-exactly: build_all, test_nvidia_gpu
Signed-off-by: Lubo Litchev <[email protected]>
LLITCHEV pushed two more commits to LLITCHEV/iree that referenced this issue Jul 30, 2024 (mirrors of the two commits quoted above, signed off by Lubo Litchev).
@ScottTodd (Member Author)

Taking a look at the Android and other cross-compilation jobs now, since build_and_test_android is the last (enabled) job in ci.yml that is using build_all. Once that is migrated (or disabled/removed) we can switch to using ci_linux_x64_clang.yml as our primary job that builds the compiler and runs lit tests.

I can iterate on at least the cross_compile job in https://github.com/iree-org/iree/blob/main/.github/workflows/build_and_test_android.yml. The test job may be harder since it uses self-hosted Android runners, some of which are limited to only run on postsubmit.

Rough plan (a workflow sketch follows this list):

  • See if android.Dockerfile can be replaced with installing clang/ninja and then getting the Android NDK from something like https://github.com/nttld/setup-ndk
  • Update build_tools/cmake/build_android.sh as needed to be compatible with packages instead of the install dir
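
A sketch of what that could look like (the NDK version, ABI, platform level, and paths are illustrative, and the setup-ndk inputs/outputs should be checked against that action's documentation):

```yaml
- name: Install the Android NDK
  id: setup-ndk
  uses: nttld/setup-ndk@v1
  with:
    ndk-version: r25c                      # illustrative NDK release
- name: Cross-compile the IREE runtime for Android
  run: |
    cmake -G Ninja -B ../iree-build-android . \
      -DCMAKE_TOOLCHAIN_FILE="${{ steps.setup-ndk.outputs.ndk-path }}/build/cmake/android.toolchain.cmake" \
      -DANDROID_ABI=arm64-v8a \
      -DANDROID_PLATFORM=android-29 \
      -DIREE_BUILD_COMPILER=OFF \
      -DIREE_HOST_BIN_DIR="${PWD}/iree-dist/bin"   # host tools come from the package
    cmake --build ../iree-build-android
```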

@ScottTodd (Member Author)

Getting closer to migrating the Android job to pkgci, and then could also look at the other cross compilation workflows.

work stack since yesterday:

```
remove `build_all` job from `ci.yml`
└── move `build_and_test_android` job from `ci.yml` to `pkgci.yml`
    └── (DONE) make Android `cross_compile` job install deps instead of using Docker
    └── make Android `cross_compile` job compatible with python packages
        └── (DONE) actually use clang instead of gcc in various builds
            └── (DONE) update iree-template-runtime-cmake to use clang
                └── (DONE) update iree-template-runtime-cmake to latest release
        └── (DONE) use ccache in various runtime builds
    └── make Android `test` job compatible with GitHub artifacts
```

ScottTodd added a commit that referenced this issue Aug 1, 2024
Progress on #16203.

We only run tests from
https://github.com/iree-org/iree/tree/main/experimental/regression_suite
in a few jobs, but other jobs are sharing this
`build_tools/pkgci/setup_venv.py` script, so this moves the extra
install closer to where it is needed.

Also drop the seemingly unused `IREERS_ARTIFACT_DIR` environment
variable that got copy/pasted around.
ScottTodd added a commit that referenced this issue Aug 5, 2024
Progress on #16203 and
#17957.

New page preview:
https://scotttodd.github.io/iree/developers/general/github-actions/

## Overview

This documents the state of our GitHub Actions usage after recent
refactoring work. We used to have a monolithic `ci.yml` that ran 10+
builds on every commit, many of which used large CPU runners to build
the full project (including the compiler) from source. Now we have
platform builds, some of which run every commit but many of which run on
nightly schedules. Most interesting _test_ jobs are now a part of
"pkgci", which builds release packages and then runs tests using those
packages (and sometimes runtime source builds).

Net results:

* Infrastructure costs should be more manageable - far fewer workflow
runs will require powerful CPU builders
* Self-hosted runners with special hardware have dedicated places to be
slotted in and are explicitly not on the critical path for pull requests

## Details

* Add new README badges
* pre-commit, OpenSSF Best Practices, releases (stable, nightly), PyPI
packages, select CI workflows
* Add github-actions.md webpage
  * Full workflow status tables
  * Descriptions of each type of workflow
  * Tips on (not) using Docker
  * Information about `pull_request`, `push`, `schedule` triggers
  * Information about GitHub-hosted runners and self-hosted runners
  * Maintenance/debugging tips
* Small cleanup passes through related files

skip-ci: modified script does not affect builds
ScottTodd added a commit that referenced this issue Aug 5, 2024
Progress on #16203 and
#17957.

This migrates `.github/workflows/build_and_test_android.yml` to
`.github/workflows/pkgci_test_android.yml`.

**For now, this only builds for Android, it does not run tests or use
real Android devices at all**.

The previous workflow
* Relied on the "install" directory from a CMake build 
* Ran on large self-hosted CPU build machines
* Built within Docker (using
[`build_tools/docker/dockerfiles/android.Dockerfile`](https://github.com/iree-org/iree/blob/main/build_tools/docker/dockerfiles/android.Dockerfile))
* Used GCP/GCS for remote ccache storage
* Used GCP/GCS for passing files between jobs
* Ran tests on self-hosted lab machines (I think a raspberry pi
connected to some physical Android devices)

The new workflow
* Relies on Python packages produced by pkgci_build_packages
* Runs on standard GitHub-hosted runners
* Installs dependencies that it needs on-demand (ninja, Android NDK),
without using Docker
* Uses caches provided by GitHub Actions for ccache storage
* Could use Artifacts provided by GitHub Actions for passing files
between jobs
* Could run tests on self-hosted lab machines or Android emulators

I made some attempts at passing files from the build job to a test job
but ran into some GitHub Actions debugging that was tricky. Leaving the
remaining migration work there to contributors at Google or other
parties directly interested in Android CI infrastructure.

ci-exactly: build_packages, test_android
ScottTodd added a commit that referenced this issue Aug 6, 2024
…18107)

Cleanup relating to #16203 and
#17957.

After #18070, no jobs in `ci.yml`
are depending on the output of the `build_all` job. All jobs using the
compiler now live in `pkgci.yml` and use packages instead of the
"install dir".

* The reusable
[`build_all.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/build_all.yml)
workflow is still in use from benchmarking workflows, but those are
unmaintained and may be deleted soon.
* After this is merged I'll set the `linux_x64_clang` check to
"required", so PRs will need it to be passing before they can be merged.
* Now that `ci.yml` is pruned to just a few remaining jobs
(`build_test_all_bazel`, `build_test_runtime`, `small_runtime`,
`tracing`), we can also limit when the workflow runs at all. I've added
some `paths` patterns so this won't run on push events if only files
under `docs/` are changed, for example.
@ScottTodd (Member Author)

This is finished, for the CI jobs that are currently enabled.

Items to follow-up on:

  • The Android job is building (cross-compiling with the Android NDK) but it is not testing. That will need some workflow changes and either physical or emulated devices.
  • The cross_compile_and_test job matrix in ci.yml that covered [riscv_64 linux, riscv_32 linux, riscv_32 generic, emscripten wasm32] has been disabled for some time. Can port those to pkgci.yml by following how the Android job was refactored.
  • Many jobs could be faster with changes to how runtime HAL CTS tests are built/linked
