
[MHLO] Init end-to-end unit tests #1223

Merged: 1 commit into main on Aug 23, 2022
Conversation

tanyokwok (Collaborator)

See RFC #999

Co-authored-by: Bairen Yi [email protected]
Co-authored-by: Jiawei Wu [email protected]
Co-authored-by: Tianyou Guo [email protected]
Co-authored-by: Xu Yan [email protected]
Co-authored-by: Ziheng Jiang [email protected]

@tanyokwok (Collaborator, Author)

As @silvasean mentioned in #1025 (comment), this PR adds the MHLO end-to-end unit tests to CI. It lowers MHLO to Linalg and runs the tests on the Linalg-on-Tensors backend. Please review, @silvasean @ZihengJiang @Vremold.

@Vremold (Collaborator) left a comment

LGTM!

@tanyokwok (Collaborator, Author) commented Aug 15, 2022

@silvasean I can't reproduce the CI failure locally with the following environment:

Collecting environment information...
PyTorch version: 1.13.0.dev20220814+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu Kinetic Kudu (development branch) (x86_64)
GCC version: (Ubuntu 11.3.0-5ubuntu1) 11.3.0
Clang version: 14.0.6-2
CMake version: version 3.24.0
Libc version: glibc-2.35

Python version: 3.10.5 (main, Jun  8 2022, 09:26:22) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-108-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.2
[pip3] torch==1.13.0.dev20220814+cpu
[pip3] torchvision==0.14.0.dev20220814+cpu
[conda] Could not collect

My testing script is:

cmake -GNinja -Bbuild \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DCMAKE_LINKER=lld \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_PROJECTS=mlir \
    -DLLVM_EXTERNAL_PROJECTS="torch-mlir;torch-mlir-dialects" \
    -DLLVM_EXTERNAL_TORCH_MLIR_SOURCE_DIR="$PWD" \
    -DLLVM_EXTERNAL_TORCH_MLIR_DIALECTS_SOURCE_DIR="${PWD}/externals/llvm-external-projects/torch-mlir-dialects" \
    -DLLVM_TARGETS_TO_BUILD=host \
    -DMLIR_ENABLE_BINDINGS_PYTHON=ON \
    -DTORCH_MLIR_ENABLE_LTC=ON \
    -DTORCH_MLIR_USE_INSTALLED_PYTORCH="ON" \
    -DPython3_EXECUTABLE="$(which python)" \
    externals/llvm-project/llvm

cmake --build build

bash build_tools/write_env_file.sh
bash tools/torchscript_e2e_test.sh -c mhlo --verbose 2>&1 | tee test.log

@sjain-stanford (Member)

@fortianyou, I'm about to send a PR that dockerizes the CI; this should help with local reproducers. Once that's out, could you try to rebase on that and then run these tests locally? It should hopefully eliminate any environmental issues and enable a robust reproducer.

@sjain-stanford (Member)

@fortianyou Here it is: #1225. Please let me know whether you're able to repro locally once you rebase.

@tanyokwok (Collaborator, Author)

> Once that's out, could you try to rebase on that and then run these tests locally?

@sjain-stanford Thanks! I would love to do that.

@ramiro050 (Collaborator)

> I can't reproduce the CI failure locally with the following environment

The CI is failing because one of the e2e tests fails an assertion. For some reason this causes a cascade of failures in the CI, which can make it hard to tell what is going wrong. (See the assertion error here: https://github.com/llvm/torch-mlir/runs/7840097389?check_suite_focus=true#step:12:9)

If you run the tests sequentially, you should also see the assertion error locally, and it should crash the entire program, allowing you to debug further.

When I run python -m e2e_testing.torchscript.main -v -c mhlo -s locally, I get the error:

python: /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h:280: llvm::SmallVectorTemplateCommon::reference llvm::SmallVectorTemplateCommon<long>::operator[](llvm::SmallVectorTemplateCommon::size_type) [T = long]: Assertion `idx < size()' failed.
fish: Job 1, 'python -m e2e_testing.torchscri…' terminated by signal SIGABRT (Abort)
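
That idx value is telling: 18446744073709551614 is static_cast<uint64_t>(-2), i.e. a negative index reinterpreted as unsigned, which is exactly what SmallVector's idx < size() assertion catches in assertion-enabled builds. A standalone illustration of the reinterpretation (plain C++, not the MLIR code itself):

#include <cstdint>
#include <iostream>

int main() {
  int64_t dim = -2;  // a negative, PyTorch-style dimension index
  // Passing it through an unsigned 64-bit parameter reinterprets the bits:
  uint64_t idx = static_cast<uint64_t>(dim);
  std::cout << idx << "\n";  // prints 18446744073709551614
  return 0;
}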

Here is the relevant backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1  0x00007ffff7c35546 in __GI_abort () at abort.c:79
#2  0x00007ffff7c3542f in __assert_fail_base (fmt=0x7ffff7dabdf8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7fff3e0f9710 "idx < size()", 
    file=0x7fff3ea0e118 "/usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h", line=280, function=<optimized out>) at assert.c:92
#3  0x00007ffff7c44222 in __GI___assert_fail (assertion=0x7fff3e0f9710 "idx < size()", 
    file=0x7fff3ea0e118 "/usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h", line=280, 
    function=0x7fff3dd5c227 "llvm::SmallVectorTemplateCommon::reference llvm::SmallVectorTemplateCommon<long>::operator[](llvm::SmallVectorTemplateCommon::size_type) [T = long]")
    at assert.c:101
#4  0x00007fff41419c59 in llvm::SmallVectorTemplateCommon<long, void>::operator[] (this=0x7fffffff91f0, idx=18446744073709551614)
    at /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h:280
#5  0x00007fff4712709d in mlir::mhlo::ConcatenateOp::inferReturnTypes (location=..., operands=..., attributes=..., regions=..., inferredReturnTypes=...)
    at /usr/local/google/home/ramiroleal/torch-mlir/externals/mlir-hlo/lib/Dialect/mhlo/IR/hlo_ops.cc:3778
#6  0x00007fff4717e53a in mlir::mhlo::ConcatenateOp::build (odsBuilder=..., odsState=..., val=..., dimension=18446744073709551614)
    at tools/torch-mlir/mlir-hlo/include/mlir-hlo/Dialect/mhlo/IR/hlo_ops.cc.inc:6846
#7  0x00007fff46ebd4e3 in mlir::OpBuilder::create<mlir::mhlo::ConcatenateOp, mlir::ValueRange, unsigned long> (this=0x7fffffffa378, location=..., 
    args=@0x7fffffff9698: 18446744073709551614, args=@0x7fffffff9698: 18446744073709551614)
    at /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/../mlir/include/mlir/IR/Builders.h:455
#8  0x00007fff46ebd3ef in mlir::RewriterBase::replaceOpWithNewOp<mlir::mhlo::ConcatenateOp, mlir::ValueRange, unsigned long> (this=0x7fffffffa370, op=0x55555a1d6e30, 
    args=@0x7fffffff9698: 18446744073709551614, args=@0x7fffffff9698: 18446744073709551614)
    at /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/../mlir/include/mlir/IR/PatternMatch.h:452
#9  0x00007fff46eae374 in (anonymous namespace)::ConvertAtenOp<mlir::torch::Torch::AtenCatOp>::matchAndRewrite (this=0x55555a1bb6d0, op=..., adaptor=..., rewriter=...)
    at /usr/local/google/home/ramiroleal/torch-mlir/lib/Conversion/TorchToMhlo/Basic.cpp:999
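
Frame #6 shows mhlo::ConcatenateOp::build being called with dimension=18446744073709551614, i.e. -2, from the AtenCatOp lowering in lib/Conversion/TorchToMhlo/Basic.cpp. That suggests the negative aten.cat dim is reaching the MHLO op without being normalized first. A minimal sketch of the kind of normalization the lowering needs (hypothetical helper, not the actual Basic.cpp code):

#include <cassert>
#include <cstdint>

// Hypothetical helper: map a PyTorch dim in [-rank, rank) to [0, rank).
// mhlo::ConcatenateOp takes an unsigned dimension, so a negative dim must
// be normalized before the op is built, or it wraps to a huge index.
static int64_t toPositiveDim(int64_t dim, int64_t rank) {
  assert(rank > 0 && "rank must be positive");
  if (dim < 0)
    dim += rank;  // e.g. dim = -2 with rank 4 becomes 2
  assert(dim >= 0 && dim < rank && "dim out of range after normalization");
  return dim;
}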

Note: other tests are also hitting assertion errors. If you run the tests in parallel, you should see the assertion messages print before the results do.

Let me know if you're able to reproduce things.

@silvasean (Contributor) left a comment

Awesome!!!

@sjain-stanford (Member)

> For some reason this causes a cascade of failures in the CI, which can make it hard to tell what is going wrong.

I've seen clearer failure output when disabling multiprocessing (with the -s flag you used above).

@sjain-stanford (Member)

(Apologies for the accidental closing with a comment :P)

@ZihengJiang (Collaborator) left a comment

Great work! Thanks @fortianyou

@tanyokwok tanyokwok merged commit 2374098 into main Aug 23, 2022
@tanyokwok tanyokwok deleted the tanyo/e2e_test branch August 23, 2022 08:47
qedawkins pushed a commit to nod-ai/torch-mlir that referenced this pull request Oct 3, 2022