Remove mlir-hlo (replace with stablehlo). #2460
Conversation
Thank you. Hopefully this will be the end of the index.lock issues from the MHLO repo. Hopefully we won't see it from the LLVM side, though.
I don't know where that is coming from. If we upgrade to the style of runners that are used in IREE's PkgCI, it shouldn't be possible, FWIW.
LGTM, thanks for seeing this through, since it had been on the cards for a long time.
I think we could also nix the `STABLEHLO_PASS_SET` set in `e2e_testing/xfail_sets.py`, but we can do that in a separate commit. Same with the `examples/torchscript_stablehlo_backend_resnet.py` file. This patch might be helpful to find other references to MHLO that we can now remove.
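For reference, `STABLEHLO_PASS_SET` in `e2e_testing/xfail_sets.py` is a plain Python set of e2e test names consumed by the test runner. A minimal sketch of the shape of the definition that could now be deleted (the test names below are illustrative placeholders, not the file's actual contents):

```python
# Sketch of the kind of definition found in e2e_testing/xfail_sets.py; the
# names below are illustrative placeholders rather than the file's real contents.
STABLEHLO_PASS_SET = {
    "ElementwiseAddModule_basic",
    "MatmulStaticBroadcast_basic",
}
```

With the StableHLO e2e backend removed, presumably nothing references this set anymore, so deleting the definition (and its uses in the e2e test runner configuration) should be a mechanical cleanup.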
Ok, that is the same Windows compilation issue that drove this switch. Landing this so that I can patch it.
Aligns with #2460 and fixes bazel build. GHA workflow: https://github.com/sjain-stanford/torch-mlir/actions/runs/6178894329
…ync (#11)

## Why

When bumping LLVM up, it is crucial to be able to test all downstream repos depending on it to ensure they work **in tandem** (and not just in isolation). In the past, LLVM upgrades were simpler because torch-mlir took a hard dependency on mhlo/stablehlo and, in doing so, ensured that the llvm "green commit" (sha1) that torch-mlir and stablehlo were built+tested against was pre-identified. During this time mlir-tcp was developed on a branch of torch-mlir. This meant that when upgrades were needed downstream, we'd simply point to torch-mlir@HEAD (sha4) and pick the llvm-project (sha1) and mhlo/stablehlo (sha3) hashes it referred to, since these were already tested to work together. This became our set of green commits (llvm@sha1, stablehlo@sha3, torch-mlir@sha4) for downstream integrations (e.g. the Cruise monorepo).

<img width="500" alt="image" src="https://github.com/cruise-automation/mlir-tcp/assets/19234106/42078522-466c-449f-8d7e-496facc1447c">

At present the situation is complicated because torch-mlir no longer takes a hard dependency on stablehlo (stablehlo e2e tests [disabled](llvm/torch-mlir#2460)). Here are the details from a recent upgrade scenario that motivated this RFC. We picked torch-mlir@HEAD, which was right after the llvm bump in llvm/torch-mlir#2511 pointing to llvm/llvm-project@b44b349, but soon realized (when we started building torch-mlir) that the llvm bazel build upstream was broken:

```
ERROR: /root/.cache/bazel/_bazel_root/b89349c08f7224396763d14fe35cba11/external/llvm-project/mlir/BUILD.bazel:5837:18: TdGenerate external/llvm-project/mlir/include/mlir/Dialect/LLVMIR/NVVMOpsInterface.h.inc failed: (Exit 1): mlir-tblgen failed: error executing command ...
external/llvm-project/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td:20:9: error: Could not find include file 'mlir/Dialect/LLVMIR/BasicPtxBuilderInterface.td'
include "mlir/Dialect/LLVMIR/BasicPtxBuilderInterface.td"
        ^
```

The bazel fixes followed in a subsequent commit at llvm/llvm-project@28b27c1. Hence llvm had to be re-bumped in torch-mlir (llvm/torch-mlir#2517). However, after a bit more work we hit these failing stablehlo tests, which surfaced the fact that the stablehlo pointed to by torch-mlir could no longer be used, and we had to separately identify the sha3 of stablehlo that would build cleanly against sha1 of llvm:

```
@stablehlo//stablehlo/conversions/tosa/tests:binary.mlir.test FAILED in 0.7s
@stablehlo//stablehlo/tests:print_stablehlo.mlir.test FAILED in 4.7s
```

This means the burden of identifying the llvm green commit (one that works across the board) has shifted further downstream from torch-mlir. Incidentally, we are in a great position to leverage mlir-tcp to identify the set of green commits, given that it already directly depends on each of these repos.

<img width="500" alt="image" src="https://github.com/cruise-automation/mlir-tcp/assets/19234106/cadd38c4-71ec-45b0-8888-85ac0bfd4e99">

## What

This PR is an attempt to leverage the mlir-tcp repo as our "proxy" for such downstream integrations, and _I think_ it contains everything needed to be able to do that.

## How

Specifically, we should now be able to run these from the comfort of `mlir-tcp`:

```shell
bazel test --config=clang_linux @llvm-project//mlir/...
bazel test --config=clang_linux @stablehlo//...
bazel test --config=clang_linux @torch-mlir//...
```

We provide `local_repos.bzl`, which allows easier local testing of patches that later need to be upstreamed; while they're being upstreamed, we can land them as patches to our `http_archive` targets. Note: I include a `stablehlo.patch` that allows testing stablehlo from `mlir-tcp`. This is temporary and can be removed once openxla/stablehlo#1810 lands.

This PR also enables each of the 3p test suites as GHA workflows (non-merge-gating for now; we can change this). These workflows are automatically skipped unless a change is made to `deps.bzl` (which usually means bumping 3p deps), as it would be unnecessary to run them for every PR and every `main` commit post-merge. Here's a snapshot from this PR's workflows, having bumped the stablehlo commit.

<img width="747" alt="image" src="https://github.com/cruise-automation/mlir-tcp/assets/19234106/e535ed39-33f7-4941-958c-3a5d0c0adef6">
We just have to do this: I ran into an issue today where I needed to make a one-line patch to stablehlo to work around a compiler issue, and it is completely unapparent how to do so given that the `mlir-hlo` repo is a read-only export and is at the tail end of a multi-week integration chain from the open-source stablehlo repo. We've discussed this often enough and gotten a +1 from everyone that they are OK with taking the e2e testing hit if it becomes necessary: it is necessary, as the current situation is unmanageable.

Looking at it, I expect it wouldn't actually be very difficult to build a little runner binary out of the stablehlo interpreter and subprocess-call that in order to get the testing coverage back. I leave that as an exercise to the users of this part of the stack and recommend following the breadcrumbs from the deleted `python/torch_mlir_e2e_test/stablehlo_backends/linalg_on_tensors.py` file and the `main.py` changes.

Note that I am pointing us at a stablehlo fork for the moment until it is apparent that we don't need to carry any local patches to it. We can update this in a few days if everything is clear.
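A minimal sketch of what that subprocess-based approach could look like, assuming a hypothetical `stablehlo-interpreter-runner` binary built from the StableHLO interpreter; the binary name, its command-line interface, and the text-based I/O convention are illustrative assumptions, not an existing tool:

```python
# Hypothetical sketch: feed a serialized StableHLO module to an external
# interpreter binary via subprocess and return its textual output.
# The binary name "stablehlo-interpreter-runner" and its CLI are assumptions.
import subprocess
import tempfile


def run_stablehlo_module(stablehlo_asm: str,
                         runner_path: str = "stablehlo-interpreter-runner") -> str:
    """Write the StableHLO assembly to a temp file, invoke the runner, return stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".mlir", delete=False) as f:
        f.write(stablehlo_asm)
        module_path = f.name
    # The runner is assumed to parse the module, interpret its entry function,
    # and print the resulting tensors in some parseable text form.
    result = subprocess.run(
        [runner_path, module_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Tiny StableHLO module used only to illustrate the call path.
    example = """
    func.func @main() -> tensor<2xf32> {
      %0 = stablehlo.constant dense<[1.0, 2.0]> : tensor<2xf32>
      %1 = stablehlo.add %0, %0 : tensor<2xf32>
      return %1 : tensor<2xf32>
    }
    """
    print(run_stablehlo_module(example))
```

A backend built along these lines would slot in roughly where the deleted `linalg_on_tensors.py` StableHLO backend used to sit, with the compile step lowering to StableHLO assembly and the load/run step delegating to the external runner.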