-
Notifications
You must be signed in to change notification settings - Fork 520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve CI testing failure for Lazy Tensor Core #1088
Resolve CI testing failure for Lazy Tensor Core #1088
Conversation
We're still seeing
which is causing a whole bunch of tests to fail. We're unable to reproduce this locally. @silvasean @powderluv How should we proceed? What can be done to try and root cause this issue? |
Did this ever work? If so, can we bisect to find where it failed? |
We've never had it run successfully thru CI before |
Does passing |
erm now it dies without running the test?
|
Oh, that is good and expected. That means that the multiprocessing was swallowing the crash by restarting the worker process. I think from here you can run the testing process under gdb/lldb if it is in the VM image, and have it break on This appears to be a memory corruption issue, so likely running it locally with AddressSanitizer would flag it, even if locally there are no symptoms on a normal build. btw, have you verified that locally you are doing the same Release+Asserts build as CI? This is the command line:
|
When I try to run the same cmake command as CI, I get the following error:
And when I run |
For those following along, we got a stack trace: https://github.com/llvm/torch-mlir/runs/7452705521?check_suite_focus=true
|
Awesome. that's a good first step. It sounds like ASan would catch this bug and give a very clear diagnosis. ASan is probably pretty hard to set up for the full Python/etc. e2e test, but a small .cpp file with just a main() and a few calls into libtorch + LTC might be doable. Let me know how your debugging goes and I can get more hands-on (vs armchair debugging this) if you folks need. |
And by Asan I mean a local ASan -- I suspect that the issue exists in the local build but is somehow not causing any symptoms locally (e.g. your local libc/malloc doesn't do as strict of checking) - asan will see it though. |
Thanks for continuing debugging this. Eventually we can add an ASAN builder (at least for the C++ parts) |
ca6d89e
to
29922a2
Compare
I made a new PR for experimenting with CI so everyone doesn't get spammed with emails: #1095 Once a solution is found, I'll bring it back over here |
Is your -DLLVM_EXTERNAL_TORCH_MLIR_DIALECTS_SOURCE_DIR set correctly? Where is that error coming from? |
Yes, it was set to
Honestly not sure where that error is coming from. Its not something that's encountered when running |
6698bb0
to
005748b
Compare
29922a2
to
23e9bb8
Compare
Here's some updates on the situation. It looks like we forgot to add
The behaviour is the same when running e2e tests, but in this case I'm running a small example model to help with isolating the source. |
23e9bb8
to
c5e8448
Compare
You probably need to call torchMlirRegisterAllDialects: torch-mlir/python/torch_mlir/dialects/torch/importer/jit_ir/csrc/module_builder.cpp Line 118 in e23fbc8
|
Hmm the strange thing is that it is called: https://github.com/llvm/torch-mlir/blob/torch_mlir_ltc_backend/python/torch_mlir/csrc/base_lazy_backend/mlir_lowering_context.cpp#L279 |
I think we might also need the equivalent of torchMlirRegisterRequiredDialects: https://github.com/llvm/torch-mlir/pull/1084/files @ashay I recall that patch got reverted -- is it okay for henry to just copy that function for now? |
Oh now that I think of it, |
c5e8448
to
bfe533f
Compare
From my testing on another branch, this PR should enable e2e CI tests to run successfully; however, we still run into some issues during Build out-of-tree |
we were talking today about the possibility of removing the out of tree build at the developer hour -- @powderluv, this sounds like another data point. |
@silvasean assuming the other 2 tests pass, are there any other changes you want on this PR, or would we be good to merge into |
LGTM |
And btw, thanks for debugging this. I've been on the other end of these "debug iteration goes through GitHub Actions" things and it is incredibly painful. I can't imagine debugging a memory corruption issue like you did here! |
Sorry for being late to the discussion. Henry, feel free to copy that function. Now that macOS builds run in CI, I should be able to fix any redundancy in a subsequent patch. |
I'm going to merge this into the LTC branch now. The failures during source and macos build seem to be outside the scope of this PR, so they'll be addressed separately. |
* Xfail unsupported ops * Register FuncDialect * Include dynamic_ir in build * Code reformat * Enable LTC tests for macOS and Source Build
* Xfail unsupported ops * Register FuncDialect * Include dynamic_ir in build * Code reformat * Enable LTC tests for macOS and Source Build
* Xfail unsupported ops * Register FuncDialect * Include dynamic_ir in build * Code reformat * Enable LTC tests for macOS and Source Build
Signed-off-by: Gong Su <[email protected]> Co-authored-by: Alexandre Eichenberger <[email protected]>
* Add check-onnx-backend to Mac CI. (llvm#1069) * Add check-onnx-backend to Mac CI. Signed-off-by: Ettore Tiotto <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Additional Docker help and split README for easier reading (llvm#1084) * initial docker documentation Signed-off-by: Alexandre Eichenberger <[email protected]> * split README with no redundant place for info Signed-off-by: Alexandre Eichenberger <[email protected]> * update Signed-off-by: Alexandre Eichenberger <[email protected]> * update Signed-off-by: Alexandre Eichenberger <[email protected]> * update Signed-off-by: Alexandre Eichenberger <[email protected]> * update Signed-off-by: Alexandre Eichenberger <[email protected]> * update Signed-off-by: Alexandre Eichenberger <[email protected]> * respond to suggestions Signed-off-by: Alexandre Eichenberger <[email protected]> * specify that onnx-mlir.py script generates only code suitable to be exec in Linux and/or Docker env Signed-off-by: Alexandre Eichenberger <[email protected]> * fix checkdocs Signed-off-by: Alexandre Eichenberger <[email protected]> * responded to review suggestion on onnx-mlir --help Signed-off-by: Alexandre Eichenberger <[email protected]> * use ONNX-MLIR everywhere Signed-off-by: Alexandre Eichenberger <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * add verify for concat Signed-off-by: Ethan Wang <[email protected]> * check all inputs Signed-off-by: Ethan Wang <[email protected]> * Support filtering out lit tests based on targets (llvm#1087) Currently we ignore what targets llvm was built for in the lit tests, but recent changes to onnx-mlir explicitly initialize the available targets. This makes the corresponding change to the lit configuration, so that we can filter out the lit tests based on the available targets. Signed-off-by: Stella Stamenova <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Switch URLs to use main instead of master (llvm#1094) Signed-off-by: Charles Volzka <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Fix MacOS build badge (llvm#1092) Signed-off-by: Gong Su <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * onnx-mlir.py warning about binary output (.so and .jar) (llvm#1090) not directly usable if host is not Linux Signed-off-by: Gong Su <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Make the doc example obey ONNX_MLIR_BUILD_TESTS (llvm#1083) * Make the doc example obey ONNX_MLIR_BUILD_TESTS Currently, ONNX_MLIR_BUILD_TESTS controls EXCLUDE_FROM_ALL, however, the targets added through add_executable will always build. We follow the llvm pattern and explicitly set EXCLUDE_FROM_ALL in the add_onnx_mlir_executable function if it is set for the directory, so that add_executable targets don't always build. Signed-off-by: Stella Stamenova <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Explicitly install into lib on all systems (llvm#1088) Signed-off-by: Gong Su <[email protected]> Co-authored-by: Alexandre Eichenberger <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * add check (llvm#1098) Signed-off-by: Tong Chen <[email protected]> Co-authored-by: Alexandre Eichenberger <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * fix typos and add ssh-client to dockerfile (llvm#1096) * fix typos and add ssh-client to dockerfile Signed-off-by: Ethan Wang <[email protected]> * sync doc and script Signed-off-by: Ethan Wang <[email protected]> Co-authored-by: Alexandre Eichenberger <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Emit print statement only when the verbose option is in effect. (llvm#1097) Signed-off-by: Ettore Tiotto <[email protected]> Co-authored-by: Alexandre Eichenberger <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * format & refine code by request Signed-off-by: Ethan Wang <[email protected]> * Support older versions 6, 11, 12 for Clip Op (llvm#1100) Signed-off-by: Tung D. Le <[email protected]> Co-authored-by: Alexandre Eichenberger <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * using front to get first input Signed-off-by: Ethan Wang <[email protected]> * add 3 lit test for concat verifier Signed-off-by: Ethan Wang <[email protected]> * add newline Signed-off-by: Ethan Wang <[email protected]> * Add check-onnx-backend to Mac CI. (llvm#1069) * Add check-onnx-backend to Mac CI. Signed-off-by: Ettore Tiotto <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Additional Docker help and split README for easier reading (llvm#1084) * initial docker documentation Signed-off-by: Alexandre Eichenberger <[email protected]> * split README with no redundant place for info Signed-off-by: Alexandre Eichenberger <[email protected]> * update Signed-off-by: Alexandre Eichenberger <[email protected]> * update Signed-off-by: Alexandre Eichenberger <[email protected]> * update Signed-off-by: Alexandre Eichenberger <[email protected]> * update Signed-off-by: Alexandre Eichenberger <[email protected]> * update Signed-off-by: Alexandre Eichenberger <[email protected]> * respond to suggestions Signed-off-by: Alexandre Eichenberger <[email protected]> * specify that onnx-mlir.py script generates only code suitable to be exec in Linux and/or Docker env Signed-off-by: Alexandre Eichenberger <[email protected]> * fix checkdocs Signed-off-by: Alexandre Eichenberger <[email protected]> * responded to review suggestion on onnx-mlir --help Signed-off-by: Alexandre Eichenberger <[email protected]> * use ONNX-MLIR everywhere Signed-off-by: Alexandre Eichenberger <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Switch URLs to use main instead of master (llvm#1094) Signed-off-by: Charles Volzka <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Fix MacOS build badge (llvm#1092) Signed-off-by: Gong Su <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * fix typos and add ssh-client to dockerfile (llvm#1096) * fix typos and add ssh-client to dockerfile Signed-off-by: Ethan Wang <[email protected]> * sync doc and script Signed-off-by: Ethan Wang <[email protected]> Co-authored-by: Alexandre Eichenberger <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Update document (llvm#1077) * create Signed-off-by: Tong Chen <[email protected]> * delete HowTOAddAnOperation.md Signed-off-by: Tong Chen <[email protected]> * modify testing Signed-off-by: Tong Chen <[email protected]> * create Signed-off-by: Tong Chen <[email protected]> * delete HowTOAddAnOperation.md Signed-off-by: Tong Chen <[email protected]> * modify testing Signed-off-by: Tong Chen <[email protected]> * fix Signed-off-by: Tong Chen <[email protected]> * create Signed-off-by: Tong Chen <[email protected]> * add comment Signed-off-by: Tong Chen <[email protected]> * delete HowTOAddAnOperation.md Signed-off-by: Tong Chen <[email protected]> * modify testing Signed-off-by: Tong Chen <[email protected]> * fix Signed-off-by: Tong Chen <[email protected]> * create Signed-off-by: Tong Chen <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Update LLVM level (llvm#1095) * Update LLVM level to 700997a Signed-off-by: Ettore Tiotto <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Pass a type converter to all ONNX operations. (llvm#1102) Signed-off-by: Ettore Tiotto <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Nuke KrnlDummyCastOp now that we use MLIR's UnrealizedConversionCastOp (llvm#1103) * Nuke KrnlDummyCastOp now that we use MLIR's UnrealizedConversionCastOp Signed-off-by: Ettore Tiotto <[email protected]> * Remove a dependency in src/Dialect/Krnl/CMakeList.txt. Regenerate docs via 'ninja onnx-mlir-docs'. Signed-off-by: Ettore Tiotto <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Add --emitObj option to onnx-mlir (llvm#1104) Signed-off-by: Ettore Tiotto <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * fix warnings (llvm#1093) Signed-off-by: Ian Bearman <[email protected]> Co-authored-by: Stella Stamenova <[email protected]> Co-authored-by: Alexandre Eichenberger <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Add -march option to onnx-mlir (llvm#1107) Signed-off-by: Ettore Tiotto <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Fix Doc spelling and broken links, removed warnings about using main (llvm#1106) * removed warning about main vs master in CONTRIBUTING, fixed links and spelling mistakes Signed-off-by: Alexandre Eichenberger <[email protected]> Signed-off-by: Ethan Wang <[email protected]> * Update BuildONNX.md Signed-off-by: Ethan Wang <[email protected]> Co-authored-by: Ettore Tiotto <[email protected]> Co-authored-by: Alexandre Eichenberger <[email protected]> Co-authored-by: Stella Stamenova <[email protected]> Co-authored-by: Charles Volzka <[email protected]> Co-authored-by: gongsu832 <[email protected]> Co-authored-by: chentong319 <[email protected]> Co-authored-by: Tung D. Le <[email protected]> Co-authored-by: Ian Bearman <[email protected]>
dynamic_ir.cpp
to source list, which resolvedfree()
error in CIFuncDialect
, which resolved MLIR error in CIcc: @ke1337 @antoniojkim