
[MHLO] Init end-to-end unit tests #1223

Merged: 1 commit into main on Aug 23, 2022
Conversation

tanyokwok (Collaborator)

See RFC #999

Co-authored-by: Bairen Yi [email protected]
Co-authored-by: Jiawei Wu [email protected]
Co-authored-by: Tianyou Guo [email protected]
Co-authored-by: Xu Yan [email protected]
Co-authored-by: Ziheng Jiang [email protected]

@tanyokwok (Collaborator, Author)

As @silvasean mentioned in #1025 (comment), this PR adds the MHLO end-to-end unit tests to CI. It lowers MHLO to Linalg and runs the tests on the Linalg-on-Tensors backend. Please review, @silvasean @ZihengJiang @Vremold.

@Vremold (Collaborator) left a comment

LGTM!

@tanyokwok (Collaborator, Author) commented Aug 15, 2022

@silvasean I can't reproduce the CI failure locally with the following environment:

Collecting environment information...
PyTorch version: 1.13.0.dev20220814+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu Kinetic Kudu (development branch) (x86_64)
GCC version: (Ubuntu 11.3.0-5ubuntu1) 11.3.0
Clang version: 14.0.6-2
CMake version: version 3.24.0
Libc version: glibc-2.35

Python version: 3.10.5 (main, Jun  8 2022, 09:26:22) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-108-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.2
[pip3] torch==1.13.0.dev20220814+cpu
[pip3] torchvision==0.14.0.dev20220814+cpu
[conda] Could not collect

My testing script is:

cmake -GNinja -Bbuild \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DCMAKE_LINKER=lld \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_PROJECTS=mlir \
    -DLLVM_EXTERNAL_PROJECTS="torch-mlir;torch-mlir-dialects" \
    -DLLVM_EXTERNAL_TORCH_MLIR_SOURCE_DIR="$PWD" \
    -DLLVM_EXTERNAL_TORCH_MLIR_DIALECTS_SOURCE_DIR="${PWD}/externals/llvm-external-projects/torch-mlir-dialects" \
    -DLLVM_TARGETS_TO_BUILD=host \
    -DMLIR_ENABLE_BINDINGS_PYTHON=ON \
    -DTORCH_MLIR_ENABLE_LTC=ON \
    -DTORCH_MLIR_USE_INSTALLED_PYTORCH="ON" \
    -DPython3_EXECUTABLE="$(which python)" \
    externals/llvm-project/llvm

cmake --build build

bash build_tools/write_env_file.sh
bash tools/torchscript_e2e_test.sh -c mhlo --verbose 2>&1 | tee test.log

@sjain-stanford (Member)

@fortianyou, I'm about to send a PR that dockerizes the CI; this should help with local reproducers. Once that's out, could you try to rebase on that and then run these tests locally? It should hopefully eliminate any environmental issues and enable a robust reproducer.

@sjain-stanford (Member)

@fortianyou Here it is: #1225. Please let me know whether you're able to repro locally once you rebase.

@tanyokwok (Collaborator, Author)

> Once that's out, could you try to rebase on that and then run these tests locally?

@sjain-stanford Thanks! I would love to do that.

@ramiro050 (Collaborator)

> I can't reproduce the CI failure locally with the following environment

The CI is failing because one of the e2e tests fails an assertion. For some reason this causes a cascade of failures in the CI, which can make it hard to tell what is going wrong. (See the assertion error here: https://github.com/llvm/torch-mlir/runs/7840097389?check_suite_focus=true#step:12:9)

If you run the tests sequentially, you should also see the assertion error locally, and it should crash the entire program, allowing you to debug further.

When I run python -m e2e_testing.torchscript.main -v -c mhlo -s locally, I get the error:

python: /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h:280: llvm::SmallVectorTemplateCommon::reference llvm::SmallVectorTemplateCommon<long>::operator[](llvm::SmallVectorTemplateCommon::size_type) [T = long]: Assertion `idx < size()' failed.
fish: Job 1, 'python -m e2e_testing.torchscri…' terminated by signal SIGABRT (Abort)
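
That idx value is telling: 18446744073709551614 is static_cast<uint64_t>(-2), i.e. a negative index reinterpreted as unsigned, which is exactly what SmallVector's idx < size() assertion catches in assertion-enabled builds. A standalone illustration of the reinterpretation (plain C++, not the MLIR code itself):

#include <cstdint>
#include <iostream>

int main() {
  int64_t dim = -2;  // a negative, PyTorch-style dimension index
  // Passing it through an unsigned 64-bit parameter reinterprets the bits:
  uint64_t idx = static_cast<uint64_t>(dim);
  std::cout << idx << "\n";  // prints 18446744073709551614
  return 0;
}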

Here is the relevant backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1  0x00007ffff7c35546 in __GI_abort () at abort.c:79
#2  0x00007ffff7c3542f in __assert_fail_base (fmt=0x7ffff7dabdf8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7fff3e0f9710 "idx < size()", 
    file=0x7fff3ea0e118 "/usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h", line=280, function=<optimized out>) at assert.c:92
#3  0x00007ffff7c44222 in __GI___assert_fail (assertion=0x7fff3e0f9710 "idx < size()", 
    file=0x7fff3ea0e118 "/usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h", line=280, 
    function=0x7fff3dd5c227 "llvm::SmallVectorTemplateCommon::reference llvm::SmallVectorTemplateCommon<long>::operator[](llvm::SmallVectorTemplateCommon::size_type) [T = long]")
    at assert.c:101
#4  0x00007fff41419c59 in llvm::SmallVectorTemplateCommon<long, void>::operator[] (this=0x7fffffff91f0, idx=18446744073709551614)
    at /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h:280
#5  0x00007fff4712709d in mlir::mhlo::ConcatenateOp::inferReturnTypes (location=..., operands=..., attributes=..., regions=..., inferredReturnTypes=...)
    at /usr/local/google/home/ramiroleal/torch-mlir/externals/mlir-hlo/lib/Dialect/mhlo/IR/hlo_ops.cc:3778
#6  0x00007fff4717e53a in mlir::mhlo::ConcatenateOp::build (odsBuilder=..., odsState=..., val=..., dimension=18446744073709551614)
    at tools/torch-mlir/mlir-hlo/include/mlir-hlo/Dialect/mhlo/IR/hlo_ops.cc.inc:6846
#7  0x00007fff46ebd4e3 in mlir::OpBuilder::create<mlir::mhlo::ConcatenateOp, mlir::ValueRange, unsigned long> (this=0x7fffffffa378, location=..., 
    args=@0x7fffffff9698: 18446744073709551614, args=@0x7fffffff9698: 18446744073709551614)
    at /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/../mlir/include/mlir/IR/Builders.h:455
#8  0x00007fff46ebd3ef in mlir::RewriterBase::replaceOpWithNewOp<mlir::mhlo::ConcatenateOp, mlir::ValueRange, unsigned long> (this=0x7fffffffa370, op=0x55555a1d6e30, 
    args=@0x7fffffff9698: 18446744073709551614, args=@0x7fffffff9698: 18446744073709551614)
    at /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/../mlir/include/mlir/IR/PatternMatch.h:452
#9  0x00007fff46eae374 in (anonymous namespace)::ConvertAtenOp<mlir::torch::Torch::AtenCatOp>::matchAndRewrite (this=0x55555a1bb6d0, op=..., adaptor=..., rewriter=...)
    at /usr/local/google/home/ramiroleal/torch-mlir/lib/Conversion/TorchToMhlo/Basic.cpp:999
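
Frame #6 shows mhlo::ConcatenateOp::build being called with dimension=18446744073709551614, i.e. -2, from the AtenCatOp lowering in lib/Conversion/TorchToMhlo/Basic.cpp. That suggests the negative aten.cat dim is reaching the MHLO op without being normalized first. A minimal sketch of the kind of normalization the lowering needs (hypothetical helper, not the actual Basic.cpp code):

#include <cassert>
#include <cstdint>

// Hypothetical helper: map a PyTorch dim in [-rank, rank) to [0, rank).
// mhlo::ConcatenateOp takes an unsigned dimension, so a negative dim must
// be normalized before the op is built, or it wraps to a huge index.
static int64_t toPositiveDim(int64_t dim, int64_t rank) {
  assert(rank > 0 && "rank must be positive");
  if (dim < 0)
    dim += rank;  // e.g. dim = -2 with rank 4 becomes 2
  assert(dim >= 0 && dim < rank && "dim out of range after normalization");
  return dim;
}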

Note: other tests are also hitting assertion errors. If you run the tests in parallel, you should see the assertion messages print before the results do.

Let me know if you're able to reproduce things.

@silvasean (Contributor) left a comment

Awesome!!!

@sjain-stanford (Member)

> For some reason this causes a cascade of failures in the CI, which can make it hard to tell what is going wrong.

I've seen clearer failure output when disabling multiprocessing (with the -s flag you used above).

@sjain-stanford (Member)

(Apologies for the accidental closing with a comment :P)

@ZihengJiang (Collaborator) left a comment

Great work! Thanks @fortianyou

@tanyokwok tanyokwok merged commit 2374098 into main Aug 23, 2022
@tanyokwok tanyokwok deleted the tanyo/e2e_test branch August 23, 2022 08:47
qedawkins pushed a commit to nod-ai/torch-mlir that referenced this pull request Oct 3, 2022