
Convert more CI jobs to be package-based #16203

Closed · 20 of 22 tasks
ScottTodd opened this issue Jan 24, 2024 · 41 comments
Labels: infrastructure (Relating to build systems, CI, or testing)
Assignee: ScottTodd

@ScottTodd (Member) commented Jan 24, 2024

Current tasks

Background information

Test workflows (for integration and e2e tests as well as benchmarks) should be consuming packages, not source build directories. Build workflows could run unit tests.

For example, the test_all workflow currently downloads a nearly 10GB build archive here: https://github.com/openxla/iree/blob/79b6129e2333ae26e7e13b68c27566102dcece6e/.github/workflows/ci.yml#L290-L294

The PkgCI workflows show how this could be set up.
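
As a rough illustration of the package-based shape (runner labels, artifact names, and paths here are placeholders rather than the actual PkgCI configuration), a test job would download the built packages and point the test build at them instead of downloading a multi-gigabyte build directory:

```yaml
jobs:
  test_gpu:
    needs: build_packages
    runs-on: [self-hosted, gpu]            # illustrative runner label
    steps:
      - uses: actions/checkout@v4
      - name: Download compiler/runtime packages
        uses: actions/download-artifact@v4
        with:
          name: release-packages           # placeholder artifact name
          path: packages
      - name: Configure, build test deps, and run tests against the packages
        run: |
          cmake -G Ninja -B build-tests . \
            -DIREE_BUILD_COMPILER=OFF \
            -DIREE_HOST_BIN_DIR="${PWD}/packages/bin"
          cmake --build build-tests --target iree-test-deps
          ctest --test-dir build-tests --output-on-failure
```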

This may involve restructuring some parts of the project (like tests/e2e/) to be primarily based on packages and not the full CMake project.

In ci.yml, these are the jobs that currently depend on the build archive produced by build_all:

  • test_all
  • test_gpu
  • test_a100
  • test_tf_integrations
  • test_tf_integrations_gpu
  • build_benchmark_tools
  • build_e2e_test_artifacts
  • cross_compile_and_test
  • build_and_test_android
  • test_benchmark_suites

Related discussions:

ScottTodd added the infrastructure (Relating to build systems, CI, or testing) label Jan 24, 2024
ScottTodd self-assigned this Jan 24, 2024
@ScottTodd (Member Author)

We could separate "unit tests" from "integration/e2e tests" in the CMake project. Unit tests should be able to run right after the build step, while integration tests should use a release/dist package for compiler tools and a [cross-compiled] runtime build for test binaries.

I'm considering a nested CMake project for integration tests, replacing the iree-test-deps utility target, but that might not be needed.


Take these test_gpu logs as an example.

That is running this command:

```yaml
        run: |
          ./build_tools/github_actions/docker_run.sh \
              --env IREE_NVIDIA_SM80_TESTS_DISABLE \
              --env IREE_MULTI_DEVICE_TESTS_DISABLE \
              --env IREE_CTEST_LABEL_REGEX \
              --env IREE_VULKAN_DISABLE=0 \
              --env IREE_VULKAN_F16_DISABLE=0 \
              --env IREE_CUDA_DISABLE=0 \
              --env IREE_NVIDIA_GPU_TESTS_DISABLE=0 \
              --env CTEST_PARALLEL_LEVEL=2 \
              --env NVIDIA_DRIVER_CAPABILITIES=all \
              --gpus all \
              gcr.io/iree-oss/nvidia@sha256:892fefbdf90c93b407303adadfa87f22c0f1e84b7e819e69643c78fc5927c2ba \
              bash -euo pipefail -c \
                "./build_tools/scripts/check_cuda.sh
                ./build_tools/scripts/check_vulkan.sh
                ./build_tools/cmake/ctest_all.sh ${BUILD_DIR}"
```

With all of those filters set, these are the only test source directories included (a sketch of the corresponding ctest filtering follows the bullet points below):

```
iree/hal/drivers/cuda2/cts                          =  83.08 sec*proc (24 tests)
iree/hal/drivers/vulkan                             =   0.33 sec*proc (1 test)
iree/hal/drivers/vulkan/cts                         =  44.97 sec*proc (12 tests)
iree/modules/check/test                             =   5.95 sec*proc (2 tests)
iree/samples/custom_dispatch/vulkan/shaders         =   2.19 sec*proc (2 tests)
iree/samples/simple_embedding                       =   0.65 sec*proc (1 test)
iree/tests/e2e/linalg                               =   8.50 sec*proc (7 tests)
iree/tests/e2e/linalg_ext_ops                       =  32.42 sec*proc (11 tests)
iree/tests/e2e/matmul                               =   8.48 sec*proc (5 tests)
iree/tests/e2e/regression                           =  47.80 sec*proc (41 tests)
iree/tests/e2e/stablehlo_models/mnist_train_test    =  29.41 sec*proc (2 tests)
iree/tests/e2e/stablehlo_ops                        = 410.55 sec*proc (302 tests)
iree/tests/e2e/tensor_ops                           =   9.71 sec*proc (6 tests)
iree/tests/e2e/tosa_ops                             = 142.57 sec*proc (120 tests)
iree/tests/e2e/vulkan_specific                      =   5.20 sec*proc (5 tests)
iree/tests/transform_dialect/cuda                   =   0.20 sec*proc (1 test)
```
  • iree/hal/drivers (except cts) is pure runtime code
  • iree/hal/drivers/*/cts can use compiler tools
  • iree/tests/ is mostly made up of "check" tests. A few tests use lit (with iree-compile, iree-opt, iree-run-mlir, FileCheck, iree-run-module, etc.)
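
Here is a sketch of how the IREE_*_DISABLE settings become ctest filters (the driver=metal regex is the one quoted later in this thread from ctest_all.sh; the variable handling and timeout value are illustrative rather than the exact script logic):

```yaml
- name: Run filtered test suite (sketch)
  env:
    IREE_METAL_DISABLE: 1                  # no Metal device on this runner
    BUILD_DIR: build-tests
  run: |
    # Each IREE_*_DISABLE=1 setting appends a label regex to the exclusion
    # list (the driver=metal regex below is the one ctest_all.sh adds).
    declare -a label_exclude_args=()
    if (( IREE_METAL_DISABLE == 1 )); then
      label_exclude_args+=("^driver=metal$")
    fi
    declare -a ctest_args=(--test-dir "${BUILD_DIR}" --timeout 900 --output-on-failure)
    if (( ${#label_exclude_args[@]} > 0 )); then
      # Join the individual regexes into a single alternation for ctest.
      ctest_args+=(--label-exclude "$(IFS="|"; echo "${label_exclude_args[*]}")")
    fi
    ctest "${ctest_args[@]}"
```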

@stellaraccident (Collaborator)

+1 - I've been meaning to do something like that.

The line between unit and integration tests can sometimes be blurred a bit, but a unit test can never require special hardware. That belongs in something that can be independently executed with tools provided out of band.

@ScottTodd (Member Author)

I'm deciding which of these job sequences to aim for:

  • build_dist_package --> compile_test_deps --> test_gpu
  • build_dist_package --> test_gpu

Portable targets like Android require running the compiler on a host machine, but other jobs like test_gpu that run on Linux/Windows can run the compiler themselves if they want. The current tests/e2e/ folder after building iree-test-deps is ~34MB with ~1500 files, and large CPU runners take around 30 seconds to generate all of those .vmfb files. All of those stats will likely grow over time. We don't really want to be spending CPU time on GPU machines, but keeping that flexibility for e2e Python tests could be useful.

@ScottTodd (Member Author) commented Jan 25, 2024

Got some good data from my test PR #16216 (on the first try too, woohoo!)


Here's a sample run using just the "install" dir from a prior job: https://github.com/openxla/iree/actions/runs/7659217597/job/20874202528?pr=16216.

| Stage | Time taken | Notes |
| --- | --- | --- |
| Checkout | 1m16s | Could use the smaller 'runtime' checkout? |
| Download install/ dir | 5s | 2.97GB file, could be trimmed - see notes below |
| Extract install/ dir | 30s | |
| Build runtime | 1m | |
| Build test deps | 2m04s | Currently generating all test .vmfb files, even those for CPU |
| Test all | 6m20s | |
| TOTAL | 11m47s | |

The install dir has two copies of the 3.6GB libIREECompiler.so that don't appear to be symlinks:

  • install/lib/libIREECompiler.so
  • install/python_packages/iree_compiler/iree/compiler/_mlir_libs/libIREECompiler.so

If this was using an "iree-dist" package instead of the full "install/" directory, that would just be the bin/ and lib/ folders, instead of also including python_packages/.
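
As a sketch of the kind of trimming the notes above mention (the step and archive names are assumptions, not the current workflow):

```yaml
- name: Archive install directory without the duplicate compiler library
  run: |
    # python_packages/ carries its own copy of the ~3.6GB libIREECompiler.so;
    # test jobs only need bin/ and lib/, so leave the Python packages out.
    tar -cJf iree-install.tar.xz \
      --exclude='install/python_packages' \
      install/
```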


Compared to a baseline run using the full "build" dir from a prior job: https://github.com/openxla/iree/actions/runs/7647953292/job/20840561210

| Stage | Time taken | Notes |
| --- | --- | --- |
| Checkout | 46s | |
| Download build/ dir | 9s | 5.94 GB (was 8.5GB) |
| Extract install/ dir | 1m01s | |
| Test all | 6m37s | |
| TOTAL | 8m46s | |

It doesn't seem too unreasonable to keep the test artifact generation on the same job (on a GPU machine), at least with the current number of test cases. It would be nice to share workflow configurations between "desktop Linux GPU" and "Android GPU" though.
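
One way to share that configuration would be a reusable workflow; here is a minimal `workflow_call` sketch, where the input names, runner labels, and file name are made up for illustration (package download, configure, and build steps elided):

```yaml
# .github/workflows/test_gpu_common.yml (hypothetical shared workflow)
on:
  workflow_call:
    inputs:
      runner:
        type: string
        required: true
      ctest-label-regex:
        type: string
        default: ""
jobs:
  test:
    runs-on: ${{ inputs.runner }}
    env:
      BUILD_DIR: build-tests
      IREE_CTEST_LABEL_REGEX: ${{ inputs.ctest-label-regex }}
    steps:
      - uses: actions/checkout@v4
      # ...download packages, configure, and build test deps here...
      - name: Run tests
        run: ./build_tools/cmake/ctest_all.sh "${BUILD_DIR}"
```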

@benvanik (Collaborator)

oof at double libIREECompiler.so

may need some more samples - 46s -> 1m16s for the same checkout makes me wonder if the timescales match

@ScottTodd (Member Author)

> oof at double libIREECompiler.so

Stella has been suggesting using a package distribution like iree-dist (without that problem), but I was just starting with the install/ directory from a regular source build.

> may need some more samples - 46s -> 1m16s for the same checkout makes me wonder if the timescales match

That variance looks about right for Linux (just clicking through the asan, tsan, build_all, test_all, etc. jobs on a run like https://github.com/openxla/iree/actions/runs/7647953292 and looking at the checkout step in each).

The smaller "runtime only" checkout is more like 5s (no submodules) + 6s (runtime submodules): https://github.com/openxla/iree/actions/runs/7647953292/job/20839868561.

Having these test jobs use the main project makes it hard to draw solid lines around components like "compiler" and "runtime", though. I don't really want to accidentally exclude certain tests by forcing "integration test" jobs to use the big -DIREE_BUILD_COMPILER=OFF hammer. If tests were in a separate CMake project (nested in a subdir of the repo) then it would be easier to say to test authors/maintainers (oh... that's me) "these have to use installed artifacts, work with that".

@ScottTodd (Member Author)

I think I'll try converting tests/ (and later possibly samples/) into a standalone CMake project, possibly with the ability to still include it from the root project for developer source builds.

The test_gpu job would still build the runtime (for CUDA, Vulkan, etc. unit tests that run on a GPU), but it will also build the "tests" subproject using iree-dist (or install/). That may pull in FileCheck, llvm-lit.py, and other tools from LLVM that are needed to run the tests, but it will not build the compiler binaries from source.

@ScottTodd (Member Author)

Found a few things to fix first / as part of this.

Most of the tests in tests/e2e/stablehlo_models/ have been skipped, I think following #15837 with this change:

ctest_all.sh:

```diff
+if (( IREE_METAL_DISABLE == 1 )); then
+  label_exclude_args+=("^driver=metal$")
+fi
```

The combination of these different filtering mechanisms is excluding tests since no CI configuration has both Vulkan and Metal:
https://github.com/openxla/iree/blob/f3b008c6db310f787ad76f151c21a30f72b14794/tests/e2e/stablehlo_models/CMakeLists.txt#L32-L35 https://github.com/openxla/iree/blob/f3b008c6db310f787ad76f151c21a30f72b14794/tests/e2e/stablehlo_models/edge_detection.mlir#L2-L4

I don't think we should use RUN: [[ $IREE_*_DISABLE == 1 ]] any longer. These tests should have separate targets for each HAL driver ('check' test suites do this automatically).

@ScottTodd (Member Author)

I was considering disallowing "lit" integration tests in tests/ altogether, but many of them are legitimate uses:
[screenshot]

So I think we should still have the lit / FileCheck tooling available on host platforms that run those tests.

@stellaraccident (Collaborator)

Yeah, we can fix that double compiler binary thing with the proper flow.

Looks like the main variance in timing is coming from build test deps. Is that mostly coming down to CMake configure or something? Other than that, it is the same work done in a different place.

@ScottTodd (Member Author)

> Looks like the main variance in timing is coming from build test deps. Is that mostly coming down to CMake configure or something? Other than that, it is the same work done in a different place.

Here is the timing on the GPU machine:

| Step | Timing |
| --- | --- |
| Configure | 20s |
| Build runtime | 60s |
| Build test deps (all) | 2m04s |
| Run ctest (GPU only) | 6m20s |

The "build test deps" step is running the compiler to generate .vmfb files for iree-check-module:

```
[1/1421] Generating check_vulkan-spirv_vulkan_conv2d.mlir_module.vmfb from conv2d.mlir
[2/1421] Generating check_winograd_vulkan-spirv_vulkan_conv2d.mlir_module.vmfb from conv2d.mlir
[3/1421] Generating check_large_linalg_matmul_cuda_f32_to_i4.mlir_module.vmfb from f32_to_i4.mlir
[4/1421] Generating check_large_linalg_matmul_cuda_conv2d.mlir_module.vmfb from conv2d.mlir
[5/1421] Generating check_vmvx_local-task_conv2d.mlir_module.vmfb from conv2d.mlir
[6/1421] Generating check_large_linalg_matmul_cuda_i4_to_f32.mlir_module.vmfb from i4_to_f32.mlir
[7/1421] Generating check_vulkan-spirv_vulkan_i4_to_f32.mlir_module.vmfb from i4_to_f32.mlir
[8/1421] Generating check_winograd_llvm-cpu_local-task_conv2d.mlir_module.vmfb from conv2d.mlir
[9/1421] Generating check_llvm-cpu_local-task_sort.mlir_module.vmfb from sort.mlir
```

If "build test deps" was taking 5+ minutes, I'd be more inclined to move it to a CPU machine and pass the test files to the GPU runner, as we have the Android / RISC-V cross-compilation test jobs currently configured. We might still end up with that sort of setup eventually, but I don't think it is needed for a "v0" of package-based CI.

@stellaraccident (Collaborator)

Yeah, and I'd rather make the compiler faster than build exotic infra unless it becomes needed...

@ScottTodd (Member Author)

Made some pretty good progress on prerequisite tasks this week.


The latest thing I'm trying to enable is the use of iree_lit_test without needing to build the full compiler. That would let us run various tests under samples/ and tests/ that use lit instead of check (some for good reasons, others just by convention) with an iree-dist package*.

I have a WIP commit here that gets close: ScottTodd@47bb19a

* another way to enable that is to allow test jobs to set IREE_BUILD_COMPILER and then be very careful about which targets those jobs choose to build before running tests (ctest for now, but also pytest in the future?). Some tests require building googletest binaries like hal/drivers/vulkan/cts/vulkan_driver_test.exe, so I've been leaning on just building the all target, but we could instead have more utility targets like iree-test-deps. Actually, something like that may be a more robust idea... @stellaraccident have any suggestions? Something like iree-run-tests from #12156 ?

@benvanik (Collaborator)

would really like to not overload IREE_BUILD_COMPILER - KISS - if we can't make iree_lit_test use FileCheck from the package for some reason then we should convert all those tests to something else (check tests, cmake tests, etc)

@ScottTodd (Member Author)

Simple is what I'm aiming for... just figuring out how to get there still.

I want test jobs to run any tests that either use special hardware (GPUs) or use both the compiler and the runtime.

I'd like them to follow this pattern:

```
cmake -B ../iree-build-tests/ . -DIREE_HOST_BIN_DIR={PATH_TO_IREE_DIST} {CONFIGURE_OPTIONS}
cmake --build ../iree-build-tests/ {--target SOME_TARGET?}
ctest --test-dir ../iree-build-tests/
```
  • If {CONFIGURE_OPTIONS} can be empty or at least leave off -DIREE_BUILD_COMPILER=OFF then great
  • If the build step can build the default `all` target, great. Otherwise, I can build iree-test-deps or some similar utility target

@stellaraccident (Collaborator)

> would really like to not overload IREE_BUILD_COMPILER - KISS - if we can't make iree_lit_test use FileCheck from the package for some reason then we should convert all those tests to something else (check tests, cmake tests, etc)

What Ben says. I've been down the path of mixing this stuff up and don't want to see anyone else fall in and then have to dig out of the pit.

We can package some of the test utilities. No big deal.

ScottTodd added a commit that referenced this issue Jan 30, 2024
Progress on #16203.

* Install more tools that are used for lit tests
* Import those installed tools
* Filter based on tools existing (imported or built) rather than based
on building the compiler
* Add tests subdirectory _after_ tools are imported/defined

Now I can run more tests from a build that imports compiler tools rather
than build them from source:
```
cmake -G Ninja -B ../iree-build-rt/ . \
  -DIREE_BUILD_COMPILER=OFF \
  -DIREE_HOST_BIN_DIR=D:\dev\projects\iree-build\install\bin \
  -DLLVM_EXTERNAL_LIT=D:\dev\projects\iree\third_party\llvm-project\llvm\utils\lit\lit.py
cmake --build ../iree-build-rt
cmake --build ../iree-build-rt --target iree-test-deps
ctest --test-dir ../iree-build-rt
```

New failures that I see are:

* `iree/tests/e2e/regression/libm_linking.mlir.test`
  * WindowsLinkerTool looks for `lld-link`, which we don't install. That appears to be a copy of `lld`, which is different from IREE's `iree-lld` (which complains with `Expected -flavor <gnu|link|darwin|wasm>` when I try to use it in that test)
* `iree/samples/custom_module/dynamic/test/example.mlir.test`
  * Not sure what's happening here - this passes in my regular build and works when run standalone but segfaults when run under ctest in the installed test suite
@ScottTodd (Member Author)

I think enough of the tests are segmented now (or will be once I land a few PRs).

Next I was planning on

  • switching a few existing jobs from using the "build" archive to using the "install" archive in-place. That will let us closely check for coverage gaps
  • switching from "install" to "iree-dist", try using github artifacts instead of GCS
  • taking a closer look at which build jobs are actually needed and how they are configured... I have a feeling that we don't need asan/tsan/tracing/debug/etc. as currently implemented (we still want that coverage... but the current jobs are unwieldy)

ScottTodd added a commit that referenced this issue Jan 31, 2024
Progress on #16203

This mechanism for filtering inside lit tests
* is a poor match for package-based CI
* has caused some tests to be skipped unintentionally (a test suite is
skipped if _any_ tag is excluded, so including Vulkan _and_ Metal tests
in the same suite results in the suite always being skipped)
* is awkward for developers - running a single lit file to test multiple
backends is an interesting idea, but it requires developers to set these
environment variables instead of filtering at configure time

The `IREE_.*_DISABLE` environment variables are still used as a way to
instruct the various `build_tools/.*/.*test.*.sh` scripts which test
labels to filter. The existing CI jobs use those environment variables
extensively.

Specific changes:

* Drop vulkan/metal and only use vmvx and llvm-cpu in some lit tests 
* Convert some lit tests into check tests
@ScottTodd (Member Author)

> switching a few existing jobs from using the "build" archive to using the "install" archive in-place

Or I could just fork the jobs over to pkgci... that might be better. 🤔 (would want iree-dist over there, and maybe a few extra tools included like lit.py)

@ScottTodd (Member Author)

Okay, I have a script that "tests a package": ScottTodd@43b3922. Going to try wiring that up to GitHub Actions, first pointed at the iree-dist-*.tar.xz files from a nightly release like https://github.com/openxla/iree/releases/tag/candidate-20240131.787.

The script is basically:

```
cmake -B build-tests/ -DIREE_BUILD_COMPILER=OFF -DIREE_HOST_BIN_DIR={PACKAGE_DIR}/bin
cmake --build build-tests/
cmake --build build-tests/ --target iree-test-deps
ctest --test-dir build-tests
```

with all the filtering goo from https://github.com/openxla/iree/blob/main/build_tools/cmake/ctest_all.sh (now that I say that, I realize I could also call that script... but it might be worth keeping this self-contained before it gets too entangled again)

The GitHub Action will need to (a rough sketch follows this list):

  • Clone the repo, with submodules
    • I'd like to limit it to only runtime submodules, need at least lit.py installed for that
  • Download the release and unzip the iree-dist files (OR pull them from another action, like pkgci_build_packages.yml)
  • Run the script (under Docker?)
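
A rough sketch of those steps (the artifact/script names and URL are placeholders; the real script is the one linked above):

```yaml
jobs:
  test_package_cpu:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: true                 # ideally limited to runtime submodules only
      - name: Download and extract an iree-dist package
        run: |
          # RELEASE_URL is a placeholder for an iree-dist-*.tar.xz asset from a
          # nightly release (or for artifacts from pkgci_build_packages.yml).
          wget -O iree-dist.tar.xz "${RELEASE_URL}"
          mkdir -p iree-dist
          tar -xJf iree-dist.tar.xz -C iree-dist
      - name: Run the package test script
        # Placeholder path for the script from ScottTodd@43b3922.
        run: ./build_tools/pkgci/test_packages.sh iree-dist
```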

We should then be able to use that for "test_cpu", "test_gpu_nvidia_a100", "test_gpu_nvidia_t4" and "test_gpu_amd_???" jobs (all Linux, but Windows/macOS could also work)

Could then include Python packages in the tests too and fold the "test_tf_integrations_gpu" job and other Python tests in as well.

The remaining jobs will need something else - probably just a different pile of GitHub Actions yaml...

  • build_benchmark_tools
  • build_e2e_test_artifacts
  • cross_compile_and_test
  • build_and_test_android
  • test_benchmark_suites

@ScottTodd (Member Author) commented Feb 1, 2024

Nice, that test script is working with iree-dist from a release.

Here's the workflow file: https://github.com/ScottTodd/iree/blob/infra-test-pkg/.github/workflows/test_package.yml

A few sample runs on a standard GitHub Actions Linux runner (CPU tests only):

(100% tests passed, 0 tests failed out of 647 🥳 -- though that does filter out a few preexisting and new test failures)

Next I'll give that a coat of paint (organize the steps, prefetch Docker, trim the build, make the input package source configurable, etc.).

@ScottTodd (Member Author)

I have a reasonable line of sight to removing ci.yml entirely:

@ScottTodd (Member Author)

> We can package some of the test utilities. No big deal.

Opinions on including each of these tools in the iree-compiler / iree-runtime python packages?

  • generate_embed_data
  • iree-flatcc-cli
  • FileCheck
  • not
  • iree-opt
  • iree-run-mlir

That gets us coverage for most tests using just packages and no "install" directory. Other tools that might be useful, but which I think we can avoid, are clang and llvm-link.

@ScottTodd (Member Author)

While I ponder including more tools in the packages, I found a lightweight way to exclude all lit tests from a test run. We may be able to trust the full compiler build job to cover lit tests, then require that package CI tests use other test facilities (pytest, native tests, etc.).
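
For illustration, one lightweight exclusion along those lines could key off the test names, since the lit tests above all surface as ctest entries ending in `.mlir.test`; this is a sketch of that idea, not necessarily the mechanism actually used:

```yaml
- name: Run everything except lit tests (sketch)
  run: |
    # -E excludes tests whose names match the regex; lit-based tests in this
    # tree show up as "<path>.mlir.test" entries.
    ctest --test-dir build-tests -E '\.mlir\.test$' --output-on-failure
```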

@ScottTodd (Member Author)

Well, generate_embed_data and iree-flatcc-cli are pretty much mandatory for building the runtime, tests, and samples. Right now we only build those if IREE_HOST_BIN_DIR is not set, expecting them to come from that folder if it is. On host platforms we can build those tools cheaply and easily, but when cross compiling they should really come from the host, not the target build.

So I'm thinking about including those in the iree-runtime package. Probably first renaming generate_embed_data to iree-c-embed-data, matching the CMake function name.

ScottTodd added a commit that referenced this issue Jul 24, 2024
Progress on #16203. Once all jobs
in `ci.yml` are migrated to pkgci or their own workflow files this will
be the primary "build the compiler and run the compiler tests"
workflow/job.

For now I just forked the existing `build_all.yml` workflow, but future
cleanup should change how Docker is used here and consider just inlining
the `build_all.sh` and `ctest_all.sh` scripts.

skip-ci: adding a new workflow (with its own triggering)
ScottTodd added a commit that referenced this issue Jul 24, 2024
…18001)

Progress on #16203.

First, this renames `generate_embed_data` to `iree-c-embed-data`. The
name here dates back to having a single implementation shared across
multiple projects using Bazel (and Google's internal Blaze). There isn't
much code in the tool, but we've been using it for long enough to give
it a name matching our other tools IMO.

Next, this includes `iree-c-embed-data` and `iree-flatcc-cli` in the
`iree-runtime` Python package (https://pypi.org/project/iree-runtime/).
Both of these host tools are required for building large parts of the
runtime (embed data for builtins and bitcode, flatcc for schemas), so
including them in packages will allow users (and pkgci workflows) to get
them without needing to build from source.
@ScottTodd (Member Author)

I'd like for jobs consuming/testing the packages produced by pkgci to be easy to disable or change the triggers for. Ideally this would be possible with small code changes to individual workflow files or with changes via a web interface:
[screenshot]

GitHub doesn't make this easy. Buildkite or other CI providers might be a better fit for this usage, but I want to keep using GitHub Actions if possible.

I ran two experiments so far:

  • workflow_run (docs here): This triggers a workflow after another workflow completes. Looks great: we'd run pkgci_build_packages.yml, then a pkgci_test_packages_nvidia.yml workflow would watch for that and start on its own. Unfortunately, the triggered workflow does not run in the pull request context or appear in 'checks' on pull requests, so results wouldn't be easily visible and they can't be marked as 'required' before merging is allowed.
  • Reusable workflows (docs here): these include other workflows via `uses:`. I was hoping that if I disabled the reusable workflow in the UI then it wouldn't trigger. Unfortunately, `uses:` acts like an inlined string substitution and just pulls the callee workflow file into the caller workflow, so it ignores whether the callee is enabled.

Next I want to try https://github.com/lewagon/wait-on-check-action . Test workflows would have a "wait" job running on standard GitHub-hosted runners that would poll for the build_packages check completing. When the check completes they would start jobs using self-hosted runners that test the packages. Those test workflows could then be disabled independently from the core package workflow(s). Polling on a runner for ~20 minutes is less than ideal though, especially since GitHub sets a limit on the number of concurrent GitHub-hosted runners (20 on the free GitHub plan: docs here). Maybe we could run some very small self-hosted runners (like 10 runners per CPU core) whose only job was to poll.
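
A minimal sketch of that polling setup, assuming the commonly documented inputs of lewagon/wait-on-check-action (the exact input names and version should be checked against that action's README; job names and runner labels are illustrative):

```yaml
jobs:
  wait_for_build_packages:
    runs-on: ubuntu-latest
    steps:
      - name: Poll for the build_packages check to finish
        uses: lewagon/wait-on-check-action@v1.3.4
        with:
          ref: ${{ github.event.pull_request.head.sha || github.sha }}
          check-name: build_packages
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          wait-interval: 60
  test_packages_nvidia:
    needs: wait_for_build_packages
    runs-on: [self-hosted, nvidia-gpu]     # illustrative label
    steps:
      - name: Fetch packages and run tests
        run: echo "download the build_packages artifacts and run tests here"
```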

@ScottTodd (Member Author)

> Polling on a runner for ~20 minutes is less than ideal though, especially since GitHub sets a limit on the number of concurrent GitHub-hosted runners (20 on the free GitHub plan: docs here).

For opt-in jobs on PRs polling might not be so bad:

  1. Add label to PR
  2. Trigger action runs on
      pull_request_target:
        types:
          - labeled
  3. Trigger action (or another action it runs) polls until the build_package check finishes
    • This will spin for 10-20 minutes if started immediately after PR creation
    • This will run with no delay if build_package already completed (dev or reviewer: "unit tests passed, let's now also check benchmarks on system foo and system bar")
  4. If the build_package check was not successful, skip/fail
  5. Opt-in jobs use the artifacts from build_package to run their tests/benchmarks

Because polling would only occur on a limited subset of PRs, we wouldn't (hopefully) use too many concurrent runners for it. That would then let us get the benefits of decoupled workflows that can be independently triggered and enabled/disabled.

@ScottTodd (Member Author)

End of week updates:

  • With Migrate GPU test jobs to pkgci. #18007 I've finished migrating most workflows to pkgci.
    • build_and_test_android is the main holdout, but otherwise build_all can be deleted in favor of ci_linux_x64_clang.yml. Then we'll no longer be uploading the 3GB "install" archive on every CI run.
    • The remaining jobs in that workflow can either stay there or move into new files: build_test_all_bazel, build_test_runtime, small_runtime, tracing.
  • I'll be updating the build status badge(s) in our root README.md file. Might be able to combine multiple workflows into a single status, or we can split by platform. Not sure if I want all builds listed or just a few:
    [screenshot]
  • I'll be adding a new page to our developer docs with a full set of workflow status badges, information about each workflow, debugging/maintenance tips, explanations about github-hosted vs self-hosted runners, and instructions for contributors to add/propose new runners and workflows:
    [screenshot]

ScottTodd added a commit that referenced this issue Jul 30, 2024
Progress on #16203. Depends on
#18000.

These jobs used to use the 3.2GB install directory produced by `cmake
--build full-build-dir --target install` in the `build_all` job. Now
they use the 73MB Python packages produced by `python -m pip wheel
runtime/` and `python -m pip wheel compiler/` in the `build_packages`
job. Python packages are what we expect users to consume, so test jobs
should use them too.
* Note that the Python packages will be larger once we enable asserts
and/or debug symbols in them. These tests may also fail with less useful
error messages and callstacks as a result of this change until that is
fixed.

I tried to keep changes to the workflow jobs minimal for now. Once the
migrations are further along we can cut out some of the remaining layers
of scripts / Dockerfiles. As before, these jobs are all opt-in on
presubmit (always running on LLVM integrate PRs or PRs affecting
NVGPU/AMDGPU code). Diffs between previous jobs and new jobs to confirm
how similar they are:

Job name | Logs before | Logs after | Notes
-- | -- | -- | --
`test_nvidia_t4` | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102664951/job/27939423899) | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102841435/job/27939725841?pr=18007) | 433 tests -> 430 tests<br>skipping `samples/custom_dispatch/vulkan/shaders`<br>`IREE custom_dispatch/vulkan/shaders ignored -- glslc not found`<br>(no longer running under Docker)
`test_amd_mi250` | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102664951/job/27939423747) | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102841435/job/27939725347?pr=18007) | 138 tests before/after
`test_amd_mi300` | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102664951/job/27939424223) | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102841435/job/27939725525?pr=18007) | 141 tests before/after
`test_amd_w7900` | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102664951/job/27939424084) | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102841435/job/27939725679?pr=18007) | 289 tests before/after

Each job is now included from its own standalone workflow file, allowing
for testing of individual workflows using `workflow_dispatch` triggers.
I have some other ideas for further decoupling these optional jobs from
the core workflow(s).

ci-extra: test_amd_mi250, test_amd_mi300, test_amd_w7900, test_nvidia_t4
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this issue Jul 30, 2024
…ree-org#17274)

As we add more test jobs (specifically for AMDGPU on self-hosted mi250
and w7900 runners), relying on the `gcloud` CLI to interface with
workflow artifacts will require extra setup (install and auth). As these
files are public and the jobs only need to read and not write, plain
`wget` should be enough.

Progress on iree-org#17159 and
iree-org#16203

ci-exactly: build_all, test_nvidia_gpu
Signed-off-by: Lubo Litchev <[email protected]>
LLITCHEV pushed two more commits to LLITCHEV/iree that referenced this issue Jul 30, 2024 (mirrors of the two commits quoted above, signed off by Lubo Litchev).
@ScottTodd (Member Author)

Taking a look at the Android and other cross-compilation jobs now, since build_and_test_android is the last (enabled) job in ci.yml that is using build_all. Once that is migrated (or disabled/removed) we can switch to using ci_linux_x64_clang.yml as our primary job that builds the compiler and runs lit tests.

I can iterate on at least the cross_compile job in https://github.com/iree-org/iree/blob/main/.github/workflows/build_and_test_android.yml. The test job may be harder since it uses self-hosted Android runners, some of which are limited to only run on postsubmit.

Rough plan (a workflow sketch follows this list):

  • See if android.Dockerfile can be replaced with installing clang/ninja and then getting the Android NDK from something like https://github.com/nttld/setup-ndk
  • Update build_tools/cmake/build_android.sh as needed to be compatible with packages instead of the install dir
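
A sketch of what that could look like (the NDK version, ABI, platform level, and paths are illustrative, and the setup-ndk inputs/outputs should be checked against that action's documentation):

```yaml
- name: Install the Android NDK
  id: setup-ndk
  uses: nttld/setup-ndk@v1
  with:
    ndk-version: r25c                      # illustrative NDK release
- name: Cross-compile the IREE runtime for Android
  run: |
    cmake -G Ninja -B ../iree-build-android . \
      -DCMAKE_TOOLCHAIN_FILE="${{ steps.setup-ndk.outputs.ndk-path }}/build/cmake/android.toolchain.cmake" \
      -DANDROID_ABI=arm64-v8a \
      -DANDROID_PLATFORM=android-29 \
      -DIREE_BUILD_COMPILER=OFF \
      -DIREE_HOST_BIN_DIR="${PWD}/iree-dist/bin"   # host tools come from the package
    cmake --build ../iree-build-android
```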

@ScottTodd (Member Author)

Getting closer to migrating the Android job to pkgci, and then could also look at the other cross compilation workflows.

work stack since yesterday:

```
remove `build_all` job from `ci.yml`
└── move `build_and_test_android` job from `ci.yml` to `pkgci.yml`
    └── (DONE) make Android `cross_compile` job install deps instead of using Docker
    └── make Android `cross_compile` job compatible with python packages
        └── (DONE) actually use clang instead of gcc in various builds
            └── (DONE) update iree-template-runtime-cmake to use clang
                └── (DONE) update iree-template-runtime-cmake to latest release
        └── (DONE) use ccache in various runtime builds
    └── make Android `test` job compatible with GitHub artifacts
```

ScottTodd added a commit that referenced this issue Aug 1, 2024
Progress on #16203.

We only run tests from
https://github.com/iree-org/iree/tree/main/experimental/regression_suite
in a few jobs, but other jobs are sharing this
`build_tools/pkgci/setup_venv.py` script, so this moves the extra
install closer to where it is needed.

Also drop the seemingly unused `IREERS_ARTIFACT_DIR` environment
variable that got copy/pasted around.
ScottTodd added a commit that referenced this issue Aug 5, 2024
Progress on #16203 and
#17957.

New page preview:
https://scotttodd.github.io/iree/developers/general/github-actions/

## Overview

This documents the state of our GitHub Actions usage after recent
refactoring work. We used to have a monolithic `ci.yml` that ran 10+
builds on every commit, many of which used large CPU runners to build
the full project (including the compiler) from source. Now we have
platform builds, some of which run every commit but many of which run on
nightly schedules. Most interesting _test_ jobs are now a part of
"pkgci", which builds release packages and then runs tests using those
packages (and sometimes runtime source builds).

Net results:

* Infrastructure costs should be more manageable - far fewer workflow
runs will require powerful CPU builders
* Self-hosted runners with special hardware have dedicated places to be
slotted in and are explicitly not on the critical path for pull requests

## Details

* Add new README badges
* pre-commit, OpenSSF Best Practices, releases (stable, nightly), PyPI
packages, select CI workflows
* Add github-actions.md webpage
  * Full workflow status tables
  * Descriptions of each type of workflow
  * Tips on (not) using Docker
  * Information about `pull_request`, `push`, `schedule` triggers
  * Information about GitHub-hosted runners and self-hosted runners
  * Maintenance/debugging tips
* Small cleanup passes through related files

skip-ci: modified script does not affect builds
ScottTodd added a commit that referenced this issue Aug 5, 2024
Progress on #16203 and
#17957.

This migrates `.github/workflows/build_and_test_android.yml` to
`.github/workflows/pkgci_test_android.yml`.

**For now, this only builds for Android, it does not run tests or use
real Android devices at all**.

The previous workflow
* Relied on the "install" directory from a CMake build 
* Ran on large self-hosted CPU build machines
* Built within Docker (using
[`build_tools/docker/dockerfiles/android.Dockerfile`](https://github.com/iree-org/iree/blob/main/build_tools/docker/dockerfiles/android.Dockerfile))
* Used GCP/GCS for remote ccache storage
* Used GCP/GCS for passing files between jobs
* Ran tests on self-hosted lab machines (I think a raspberry pi
connected to some physical Android devices)

The new workflow
* Relies on Python packages produced by pkgci_build_packages
* Runs on standard GitHub-hosted runners
* Installs dependencies that it needs on-demand (ninja, Android NDK),
without using Docker
* Uses caches provided by GitHub Actions for ccache storage
* Could use Artifacts provided by GitHub Actions for passing files
between jobs
* Could run tests on self-hosted lab machines or Android emulators

I made some attempts at passing files from the build job to a test job
but ran into some GitHub Actions debugging that was tricky. Leaving the
remaining migration work there to contributors at Google or other
parties directly interested in Android CI infrastructure.

ci-exactly: build_packages, test_android
ScottTodd added a commit that referenced this issue Aug 6, 2024
…18107)

Cleanup relating to #16203 and
#17957.

After #18070, no jobs in `ci.yml`
are depending on the output of the `build_all` job. All jobs using the
compiler now live in `pkgci.yml` and use packages instead of the
"install dir".

* The reusable
[`build_all.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/build_all.yml)
workflow is still in use from benchmarking workflows, but those are
unmaintained and may be deleted soon.
* After this is merged I'll set the `linux_x64_clang` check to
"required", so PRs will need it to be passing before they can be merged.
* Now that `ci.yml` is pruned to just a few remaining jobs
(`build_test_all_bazel`, `build_test_runtime`, `small_runtime`,
`tracing`), we can also limit when the workflow runs at all. I've added
some `paths` patterns so this won't run on push events if only files
under `docs/` are changed, for example.
@ScottTodd (Member Author)

This is finished, for the CI jobs that are currently enabled.

Items to follow-up on:

  • The Android job is building (cross-compiling with the Android NDK) but it is not testing. That will need some workflow changes and either physical or emulated devices.
  • The cross_compile_and_test job matrix in ci.yml that covered [riscv_64 linux, riscv_32 linux, riscv_32 generic, emscripten wasm32] has been disabled for some time. Can port those to pkgci.yml by following how the Android job was refactored.
  • Many jobs could be faster with changes to how runtime HAL CTS tests are built/linked
