Convert more CI jobs to be package-based #16203
We could separate "unit tests" from "integration/e2e tests" in the CMake project. Unit tests should be able to run right after the build step, while integration tests should use a release/dist package for compiler tools and a [cross-compiled] runtime build for test binaries. I'm considering a nested CMake project for integration tests, replacing the current single-project setup.

Take one of the current GPU test jobs as an example. Its `run` step is this command:

```shell
./build_tools/github_actions/docker_run.sh \
--env IREE_NVIDIA_SM80_TESTS_DISABLE \
--env IREE_MULTI_DEVICE_TESTS_DISABLE \
--env IREE_CTEST_LABEL_REGEX \
--env IREE_VULKAN_DISABLE=0 \
--env IREE_VULKAN_F16_DISABLE=0 \
--env IREE_CUDA_DISABLE=0 \
--env IREE_NVIDIA_GPU_TESTS_DISABLE=0 \
--env CTEST_PARALLEL_LEVEL=2 \
--env NVIDIA_DRIVER_CAPABILITIES=all \
--gpus all \
gcr.io/iree-oss/nvidia@sha256:892fefbdf90c93b407303adadfa87f22c0f1e84b7e819e69643c78fc5927c2ba \
bash -euo pipefail -c \
"./build_tools/scripts/check_cuda.sh
./build_tools/scripts/check_vulkan.sh
./build_tools/cmake/ctest_all.sh ${BUILD_DIR}" with all of those filters set, these are the only test source directories included:
|
+1 - I've been meaning to do something like that. The line between unit and integration tests can sometimes be blurred a bit, but a unit test can never require special hardware. That belongs in something that can be independently executed with tools provided out of band.
I'm deciding which of these job sequences to aim for:
Portable targets like Android require running the compiler on a host machine, but other jobs could potentially build and test on the same machine.
Got some good data from my test PR #16216 (on the first try too, woohoo!) Here's a sample run using just the "install" dir from a prior job: https://github.com/openxla/iree/actions/runs/7659217597/job/20874202528?pr=16216.
The install dir has two copies of the 3.6GB `libIREECompiler.so`. If this were using an "iree-dist" package instead of the full "install/" directory, that duplication would go away.

Compared to a baseline run using the full "build" dir from a prior job: https://github.com/openxla/iree/actions/runs/7647953292/job/20840561210
It doesn't seem too unreasonable to keep the test artifact generation on the same job (on a GPU machine), at least with the current number of test cases. It would be nice to share workflow configurations between "desktop Linux GPU" and "Android GPU" though.
Oof at the double libIREECompiler.so. May need some more samples - 46s -> 1m16s for the same checkout makes me wonder if the timescales match.
Stella has been suggesting using a package distribution like `iree-dist` for this.
That variance looks about right for Linux (just clicking through a few recent runs). The smaller "runtime only" checkout is more like 5s (no submodules) + 6s (runtime submodules): https://github.com/openxla/iree/actions/runs/7647953292/job/20839868561. Having these test jobs use the main project makes drawing solid lines around components like "compiler" and "runtime" hard though. I don't really want to accidentally exclude certain tests by forcing "integration test" jobs to use the big full CMake project.
I think I'll try converting one of the existing GPU test jobs first.
Found a few things to fix first / as part of this. Most of the tests run through `ctest_all.sh` are filtered with label excludes like this:

```diff
+if (( IREE_METAL_DISABLE == 1 )); then
+  label_exclude_args+=("^driver=metal$")
+fi
```

The combination of these different filtering mechanisms is excluding tests, since no CI configuration has both Vulkan and Metal enabled. I don't think we should keep layering these mechanisms on top of each other.
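For context, the label filtering in `ctest_all.sh` boils down to building one regex for ctest's `--label-exclude` flag. A simplified sketch of that mechanism (assumed defaults and only two variables shown; the real script reads many more `IREE_*_DISABLE` settings):

```bash
label_exclude_args=()
if (( ${IREE_VULKAN_DISABLE:-0} == 1 )); then
  label_exclude_args+=("^driver=vulkan$")
fi
if (( ${IREE_METAL_DISABLE:-1} == 1 )); then
  label_exclude_args+=("^driver=metal$")
fi

ctest_args=(--test-dir "${BUILD_DIR:-build}" --output-on-failure)
if (( ${#label_exclude_args[@]} > 0 )); then
  # Join the individual patterns with "|" so ctest sees a single regex.
  ctest_args+=(--label-exclude "($(IFS="|"; echo "${label_exclude_args[*]}"))")
fi
ctest "${ctest_args[@]}"
```

The lit-level feature tags layer on top of this ctest-level filtering, which is how whole suites can end up silently skipped.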
Yeah, we can fix that double compiler binary thing with the proper flow. Looks like the main variance in timing is coming from build test deps. Is that mostly coming down to CMake configure or something? Other than that, it is the same work done in a different place.
Here is the timing on the GPU machine:
The "build test deps" step is running the compiler to generate .vmfb files for
If "build test deps" was taking 5+ minutes, I'd be more inclined to move it to a CPU machine and pass the test files to the GPU runner, as we have the Android / RISC-V cross-compilation test jobs currently configured. We might still end up with that sort of setup eventually, but I don't think it is needed for a "v0" of package-based CI. |
Yeah, and I'd rather make the compiler faster than build exotic infra unless it becomes needed...
Made some pretty good progress on prerequisite tasks this week. The latest thing I'm trying to enable is the use of installed tools (FileCheck, etc.) in lit tests so they can run from package-based builds. I have a WIP commit here that gets close: ScottTodd@47bb19a

* Another way to enable that is to allow test jobs to set `IREE_BUILD_COMPILER` in some limited way.
Would really like to not overload `IREE_BUILD_COMPILER` - KISS - if we can't make `iree_lit_test` use FileCheck from the package for some reason then we should convert all those tests to something else (check tests, cmake tests, etc.).
Simple is what I'm aiming for... just figuring out how to get there still. I want test jobs to run any tests that either use special hardware (GPUs) or use both the compiler and the runtime. I'd like them to follow this pattern:
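A rough sketch of that pattern, assuming the runtime-only configure flow discussed below (the `IREE_TOOLS_DIR` placeholder stands in for wherever the packaged compiler tools land):

```bash
# 1. Fetch compiler tools from a package/dist built earlier, not from source
#    (install dir, iree-dist, or Python packages -- whatever we settle on).

# 2. Configure a runtime-only build that imports those host tools.
cmake -G Ninja -B "${BUILD_DIR}" . \
  -DIREE_BUILD_COMPILER=OFF \
  -DIREE_HOST_BIN_DIR="${IREE_TOOLS_DIR}"

# 3. Build runtime binaries and the compiled test dependencies.
cmake --build "${BUILD_DIR}"
cmake --build "${BUILD_DIR}" --target iree-test-deps

# 4. Run only what this machine supports, using ctest labels.
ctest --test-dir "${BUILD_DIR}" \
  --label-regex "driver=vulkan|driver=cuda" \
  --output-on-failure
```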
What Ben says. I've been down the path of mixing this stuff up and don't want to see anyone else fall in and then have to dig out of the pit. We can package some of the test utilities. No big deal.
Progress on #16203.

* Install more tools that are used for lit tests
* Import those installed tools
* Filter based on tools existing (imported or built) rather than based on building the compiler
* Add tests subdirectory _after_ tools are imported/defined

Now I can run more tests from a build that imports compiler tools rather than building them from source:

```
cmake -G Ninja -B ../iree-build-rt/ . \
  -DIREE_BUILD_COMPILER=OFF \
  -DIREE_HOST_BIN_DIR=D:\dev\projects\iree-build\install\bin \
  -DLLVM_EXTERNAL_LIT=D:\dev\projects\iree\third_party\llvm-project\llvm\utils\lit\lit.py
cmake --build ../iree-build-rt
cmake --build ../iree-build-rt --target iree-test-deps
ctest --test-dir ../iree-build-rt
```

New failures that I see are:

* `iree/tests/e2e/regression/libm_linking.mlir.test`
  * WindowsLinkerTool looks for `lld-link`, which we don't install. That appears to be a copy of `lld`, which is different from IREE's `iree-lld` (which complains with `Expected -flavor <gnu|link|darwin|wasm>` when I try to use it in that test)
* `iree/samples/custom_module/dynamic/test/example.mlir.test`
  * Not sure what's happening here - this passes in my regular build and works when run standalone but segfaults when run under ctest in the installed test suite
I think enough of the tests are segmented now (or will be once I land a few PRs). Next I was planning on converting the existing jobs in `ci.yml` over to use the packages.
Progress on #16203.

This mechanism for filtering inside lit tests:

* is a poor match for package-based CI
* has caused some tests to be skipped unintentionally (a test suite is skipped if _any_ tag is excluded, so including Vulkan _and_ Metal tests in the same suite results in the suite always being skipped)
* is awkward for developers - running a single lit file to test multiple backends is an interesting idea, but it requires developers to set these environment variables instead of filtering at configure time

The `IREE_.*_DISABLE` environment variables are still used as a way to instruct the various `build_tools/.*/.*test.*.sh` scripts which test labels to filter. The existing CI jobs use those environment variables extensively.

Specific changes:

* Drop vulkan/metal and only use vmvx and llvm-cpu in some lit tests
* Convert some lit tests into check tests
Or I could just fork the jobs over to pkgci... that might be better. 🤔 (would want iree-dist over there, and maybe a few extra tools included like lit.py)
Okay, I have a script that "tests a package": ScottTodd@43b3922. Going to try wiring that up to GitHub Actions, first pointed at the packages from a recent release. The script is basically the configure/build/test flow described above,
with all the filtering goo from https://github.com/openxla/iree/blob/main/build_tools/cmake/ctest_all.sh (now that I say that I realize I could also call that script... but it might be worth keeping this self-contained before it gets too entangled again).

The GitHub Action will need to download the packages and run that script on the appropriate runner.
We should then be able to use that for "test_cpu", "test_gpu_nvidia_a100", "test_gpu_nvidia_t4", and "test_gpu_amd_???" jobs (all Linux, but Windows/macOS could also work). We could then include Python packages in the tests too and fold the "test_tf_integrations_gpu" job and other Python tests in as well. The remaining jobs will need something else - probably just a different pile of GitHub Actions yaml...
Nice, that test script is working with iree-dist from a release. Here's the workflow file: https://github.com/ScottTodd/iree/blob/infra-test-pkg/.github/workflows/test_package.yml A few sample runs on a standard GitHub Actions Linux runner (CPU tests only):
Next I'll give that a coat of paint (organize the steps, prefetch Docker, trim the build, make the input package source configurable, etc.).
I have reasonable line of sight to removing the dependency on the "install" directory:
Opinions on including each of these tools in the packages?

That gets us coverage for most tests using just packages and no "install" directory. Other tools that might be useful, but that I think we can avoid, are:
While I ponder including more tools in the packages, I found a lightweight way to exclude all lit tests from a test run. We may be able to trust the full compiler build job to cover lit tests, then require that package CI tests use other test facilities (pytest, native tests, etc.).
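One lightweight option along these lines (an assumption based on the `.mlir.test` suffix visible in the test names earlier in this thread, not necessarily the mechanism that was actually adopted) is to exclude lit tests by name:

```bash
# Skip every test whose ctest name ends in ".mlir.test" (the lit tests).
ctest --test-dir "${BUILD_DIR}" -E "\.mlir\.test$" --output-on-failure
```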
Well, several tests still need those tools anyway, so I'm thinking about including them in the packages.
Progress on #16203. Once all jobs in `ci.yml` are migrated to pkgci or their own workflow files this will be the primary "build the compiler and run the compiler tests" workflow/job. For now I just forked the existing `build_all.yml` workflow, but future cleanup should change how Docker is used here and consider just inlining the `build_all.sh` and `ctest_all.sh` scripts. skip-ci: adding a new workflow (with its own triggering)
(#18001) Progress on #16203. First, this renames `generate_embed_data` to `iree-c-embed-data`. The name here dates back to having a single implementation shared across multiple projects using Bazel (and Google's internal Blaze). There isn't much code in the tool, but we've been using it for long enough to give it a name matching our other tools IMO. Next, this includes `iree-c-embed-data` and `iree-flatcc-cli` in the `iree-runtime` Python package (https://pypi.org/project/iree-runtime/). Both of these host tools are required for building large parts of the runtime (embed data for builtins and bitcode, flatcc for schemas), so including them in packages will allow users (and pkgci workflows) to get them without needing to build from source.
I'd like for jobs consuming/testing the packages produced by pkgci to be easy to disable or to change the triggers for. Ideally this would be possible with small code changes to individual workflow files or with changes via a web interface; GitHub doesn't make this easy. Buildkite or other CI providers might be a better fit for this usage, but I want to keep using GitHub Actions if possible. I've run two experiments so far:
Next I want to try https://github.com/lewagon/wait-on-check-action. Test workflows would have a "wait" job running on standard GitHub-hosted runners that would poll for the `build_packages` check to complete. When the check completes they would start jobs using self-hosted runners that test the packages. Those test workflows could then be disabled independently from the core package workflow(s). Polling on a runner for ~20 minutes is less than ideal though, especially since GitHub sets a limit on the number of concurrent GitHub-hosted runners (20 on the free GitHub plan: docs here). Maybe we could run some very small self-hosted runners (like 10 runners per CPU core) whose only job is to poll.
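A minimal sketch of that wiring, using the inputs I remember from the wait-on-check-action README (`ref`, `check-name`, `repo-token`, `wait-interval`); treat the exact input names and version tag as assumptions to double-check:

```yaml
jobs:
  wait_for_packages:
    runs-on: ubuntu-latest
    steps:
      - uses: lewagon/wait-on-check-action@v1.3.1  # pin to a current release
        with:
          ref: ${{ github.sha }}
          check-name: "build_packages"
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          wait-interval: 60  # seconds between polls

  test_packages:
    needs: wait_for_packages
    runs-on: [self-hosted, gpu]  # placeholder runner labels
    steps:
      - run: echo "download the packages and run tests here"
```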
For opt-in jobs on PRs, polling might not be so bad:

Because polling would only occur on a limited subset of PRs, we (hopefully) wouldn't use too many concurrent runners for it. That would then let us get the benefits of decoupled workflows that can be independently triggered and enabled/disabled.
End of week updates:
Progress on #16203. Depends on #18000.

These jobs used to use the 3.2GB install directory produced by `cmake --build full-build-dir --target install` in the `build_all` job. Now they use the 73MB Python packages produced by `python -m pip wheel runtime/` and `python -m pip wheel compiler/` in the `build_packages` job. Python packages are what we expect users to consume, so test jobs should use them too.

* Note that the Python packages will be larger once we enable asserts and/or debug symbols in them. These tests may also fail with less useful error messages and callstacks as a result of this change until that is fixed.

I tried to keep changes to the workflow jobs minimal for now. Once the migrations are further along we can cut out some of the remaining layers of scripts / Dockerfiles. As before, these jobs are all opt-in on presubmit (always running on LLVM integrate PRs or PRs affecting NVGPU/AMDGPU code).

Diffs between previous jobs and new jobs to confirm how similar they are:

Job name | Logs before | Logs after | Notes
-- | -- | -- | --
`test_nvidia_t4` | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102664951/job/27939423899) | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102841435/job/27939725841?pr=18007) | 433 tests -> 430 tests<br>skipping `samples/custom_dispatch/vulkan/shaders`<br>`IREE custom_dispatch/vulkan/shaders ignored -- glslc not found`<br>(no longer running under Docker)
`test_amd_mi250` | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102664951/job/27939423747) | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102841435/job/27939725347?pr=18007) | 138 tests before/after
`test_amd_mi300` | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102664951/job/27939424223) | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102841435/job/27939725525?pr=18007) | 141 tests before/after
`test_amd_w7900` | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102664951/job/27939424084) | [workflow logs](https://github.com/iree-org/iree/actions/runs/10102841435/job/27939725679?pr=18007) | 289 tests before/after

Each job is now included from its own standalone workflow file, allowing for testing of individual workflows using `workflow_dispatch` triggers. I have some other ideas for further decoupling these optional jobs from the core workflow(s).

ci-extra: test_amd_mi250, test_amd_mi300, test_amd_w7900, test_nvidia_t4
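The consuming side of this is roughly the following sketch; the real jobs drive it through `build_tools/pkgci/setup_venv.py`, and the wheelhouse paths here are placeholders:

```bash
python -m venv .venv && source .venv/bin/activate
# Install the wheels produced by the build_packages job.
python -m pip install \
  wheelhouse/iree_compiler-*.whl \
  wheelhouse/iree_runtime-*.whl
# Tools like iree-compile and iree-run-module now come from the packages
# instead of a multi-GB install directory.
iree-compile --help
iree-run-module --help
```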
(iree-org#17274) As we add more test jobs (specifically for AMDGPU on self-hosted mi250 and w7900 runners), relying on the `gcloud` CLI to interface with workflow artifacts will require extra setup (install and auth). As these files are public and the jobs only need to read and not write, plain `wget` should be enough.

Progress on iree-org#17159 and iree-org#16203.

ci-exactly: build_all, test_nvidia_gpu
Taking a look at the Android and other cross-compilation jobs now, since I can iterate on at least the build portion of those locally. Rough plan:
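The cross-compile step in that plan presumably follows the standard IREE Android runtime flow; a sketch under that assumption, where the NDK path, API level, and `IREE_TOOLS_DIR` are placeholders:

```bash
cmake -G Ninja -B ../iree-build-android/ . \
  -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK}/build/cmake/android.toolchain.cmake" \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-29 \
  -DIREE_BUILD_COMPILER=OFF \
  -DIREE_HOST_BIN_DIR="${IREE_TOOLS_DIR}"  # host tools from the packages
cmake --build ../iree-build-android/
```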
Getting closer to migrating the Android job to pkgci, and then I could also look at the other cross-compilation workflows. Work stack since yesterday:
Progress on #16203. We only run tests from https://github.com/iree-org/iree/tree/main/experimental/regression_suite in a few jobs, but other jobs are sharing this `build_tools/pkgci/setup_venv.py` script, so this moves the extra install closer to where it is needed. Also drop the seemingly unused `IREERS_ARTIFACT_DIR` environment variable that got copy/pasted around.
Progress on #16203 and #17957.

New page preview: https://scotttodd.github.io/iree/developers/general/github-actions/

## Overview

This documents the state of our GitHub Actions usage after recent refactoring work. We used to have a monolithic `ci.yml` that ran 10+ builds on every commit, many of which used large CPU runners to build the full project (including the compiler) from source. Now we have platform builds, some of which run every commit but many of which run on nightly schedules. Most interesting _test_ jobs are now a part of "pkgci", which builds release packages and then runs tests using those packages (and sometimes runtime source builds).

Net results:

* Infrastructure costs should be more manageable - far fewer workflow runs will require powerful CPU builders
* Self-hosted runners with special hardware have dedicated places to be slotted in and are explicitly not on the critical path for pull requests

## Details

* Add new README badges
  * pre-commit, OpenSSF Best Practices, releases (stable, nightly), PyPI packages, select CI workflows
* Add github-actions.md webpage
  * Full workflow status tables
  * Descriptions of each type of workflow
  * Tips on (not) using Docker
  * Information about `pull_request`, `push`, `schedule` triggers
  * Information about GitHub-hosted runners and self-hosted runners
  * Maintenance/debugging tips
* Small cleanup passes through related files

skip-ci: modified script does not affect builds
Progress on #16203 and #17957. This migrates `.github/workflows/build_and_test_android.yml` to `.github/workflows/pkgci_test_android.yml`.

**For now, this only builds for Android, it does not run tests or use real Android devices at all**.

The previous workflow

* Relied on the "install" directory from a CMake build
* Ran on large self-hosted CPU build machines
* Built within Docker (using [`build_tools/docker/dockerfiles/android.Dockerfile`](https://github.com/iree-org/iree/blob/main/build_tools/docker/dockerfiles/android.Dockerfile))
* Used GCP/GCS for remote ccache storage
* Used GCP/GCS for passing files between jobs
* Ran tests on self-hosted lab machines (I think a raspberry pi connected to some physical Android devices)

The new workflow

* Relies on Python packages produced by pkgci_build_packages
* Runs on standard GitHub-hosted runners
* Installs dependencies that it needs on-demand (ninja, Android NDK), without using Docker
* Uses caches provided by GitHub Actions for ccache storage
* Could use Artifacts provided by GitHub Actions for passing files between jobs
* Could run tests on self-hosted lab machines or Android emulators

I made some attempts at passing files from the build job to a test job but ran into some GitHub Actions debugging that was tricky. Leaving the remaining migration work there to contributors at Google or other parties directly interested in Android CI infrastructure.

ci-exactly: build_packages, test_android
(#18107) Cleanup relating to #16203 and #17957. After #18070, no jobs in `ci.yml` are depending on the output of the `build_all` job. All jobs using the compiler now live in `pkgci.yml` and use packages instead of the "install dir".

* The reusable [`build_all.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/build_all.yml) workflow is still in use from benchmarking workflows, but those are unmaintained and may be deleted soon.
* After this is merged I'll set the `linux_x64_clang` check to "required", so PRs will need it to be passing before they can be merged.
* Now that `ci.yml` is pruned to just a few remaining jobs (`build_test_all_bazel`, `build_test_runtime`, `small_runtime`, `tracing`), we can also limit when the workflow runs at all. I've added some `paths` patterns so this won't run on push events if only files under `docs/` are changed, for example.
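Those `paths` patterns are roughly this shape (an illustration, not the exact patterns that landed in `ci.yml`):

```yaml
on:
  push:
    branches: [main]
    paths-ignore:
      - "docs/**"
      - "*.md"
```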
This is finished for the CI jobs that are currently enabled. Items to follow up on:
**Current tasks**

* Ensure that tests, including `lit` tests, can be run using packages instead of source builds (`-DIREE_BUILD_COMPILER`)
* Produce an `install-dir` in the `build_all` step
* Switch test jobs from `build-dir` to `install-dir`
* Convert the `test_tf_integrations` job to pkgci
* Build `iree-dist` in `pkgci.yml` at the head of the pipeline
* Convert the `test_gpu` job to pkgci

**Background information**

Test workflows (for integration and e2e tests as well as benchmarks) should be consuming packages, not source build directories. Build workflows could run unit tests.

For example, the `test_all` workflow currently downloads a nearly 10GB build archive here: https://github.com/openxla/iree/blob/79b6129e2333ae26e7e13b68c27566102dcece6e/.github/workflows/ci.yml#L290-L294. The PkgCI workflows show how this could be set up.

This may involve restructuring some parts of the project (like `tests/e2e/`) to be primarily based on packages and not the full CMake project.

In `ci.yml`, these are the jobs that currently depend on the build archive produced by `build_all`:

* `test_all`
* `test_gpu`
* `test_a100`
* `test_tf_integrations`
* `test_tf_integrations_gpu`
* `build_benchmark_tools`
* `build_e2e_test_artifacts`
* `cross_compile_and_test`
* `build_and_test_android`
* `test_benchmark_suites`
Related discussions: