Crash with bus error on Android (arm-v7a) when creating InferenceSession with SSD model #4103

Closed
muxyrmuc opened this issue Jun 1, 2020 · 8 comments

Comments

@muxyrmuc

muxyrmuc commented Jun 1, 2020

Describe the bug
An application linked against a Release build of libonnxruntime.so crashes with SIGBUS (BUS_ADRALN) while creating the session. Logcat output:

F/libc    (30024): Fatal signal 7 (SIGBUS), code 1, fault addr 0xb7acf8a1 in tid 30024 (app)
I/DEBUG   (  194): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
I/DEBUG   (  194): Build fingerprint: 'lge/palman/palman:5.1/LMY47O.L017/1474611459:user/release-keys'
I/DEBUG   (  194): Revision: '11'
I/DEBUG   (  194): ABI: 'arm'
I/DEBUG   (  194): pid: 30024, tid: 30024, name: app  >>> ./app <<<
I/DEBUG   (  194): signal 7 (SIGBUS), code 1 (BUS_ADRALN), fault addr 0xb7acf8a1
W/NativeCrashListener(  630): Couldn't find ProcessRecord for pid 30024
I/DEBUG   (  194):     r0 b7b53938  r1 b7b53938  r2 b7b53940  r3 bed7e058
E/DEBUG   (  194): AM write failure (32 / Broken pipe)
I/DEBUG   (  194):     r4 b7acf8a9  r5 b7acf8a1  r6 00000000  r7 bed7e038
I/DEBUG   (  194):     r8 00000000  r9 00000008  sl b64c4dd4  fp bed7e050
I/DEBUG   (  194):     ip b64c064c  sp bed7e000  lr b649513b  pc b6536a78  cpsr 20010030
I/DEBUG   (  194): 
I/DEBUG   (  194): backtrace:
I/DEBUG   (  194):     #00 pc 00069a78  /data/local/tmp/libonnxruntime.so
I/DEBUG   (  194):     #01 pc 0002b137  /system/lib/libc.so (dlmalloc_real+3486)
I/DEBUG   (  194):     #02 pc 0020f4f4  [heap]
I/DEBUG   (  194): 
I/DEBUG   (  194): Tombstone written to: /data/tombstones/tombstone_00
I/BootReceiver(  630): Copying /data/tombstones/tombstone_00 to DropBox (SYSTEM_TOMBSTONE)

Urgency
None.

System information

  • OS Platform and Distribution: Android 5.1 on arm-v7a (NDK versions 21 and 19 were tested).
  • ONNX Runtime installed from source: yes.
  • ONNX Runtime version: 1.3.0 (1.2.0 is affected too).
  • Python version: none
  • Visual Studio version (if applicable): not applicable.
  • GCC/Compiler version (if compiling from source): clang from NDK version 21.
  • CUDA/cuDNN version: not applicable.
  • GPU model and memory: not applicable.

To Reproduce

Code to reproduce:
const char *model_path = "ssd-10.onnx";
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "somestring");
Ort::SessionOptions session_options;
session_options.SetIntraOpNumThreads(1);
session_options.SetGraphOptimizationLevel(ORT_ENABLE_BASIC);
Ort::Session session(env, model_path, session_options);

This is how I build libonnxruntime.so:

./build.sh --android --android_sdk_path /home/username/Android/Sdk/ --android_ndk_path /home/username/Android/Sdk/ndk/21.0.6113669/ --android_abi armeabi-v7a --android_api 21 --build_shared_lib --config Release

Expected behavior
InferenceSession is created without errors.

Screenshots
Not applicable.

Additional context
Debug build works fine.
Other models with NMS layers are affected too.

@skottmckay
Contributor

Could you try building with --config RelWithDebInfo? Hopefully that will give a bit more info on where it's breaking.

@muxyrmuc
Author

muxyrmuc commented Jun 2, 2020

Looks like building with RelWithDebInfo doesn't give additional info:

F/libc    (32474): Fatal signal 7 (SIGBUS), code 1, fault addr 0xb89cd4c1 in tid 32474 (app)
W/debuggerd(  194): type=1400 audit(0.0:1177): avc: denied { search } for name="tmp" dev="mmcblk0p29" ino=627090 scontext=u:r:debuggerd:s0 tcontext=u:object_r:shell_data_file:s0 tclass=dir
W/debuggerd(  194): type=1400 audit(0.0:1178): avc: denied { search } for name="tmp" dev="mmcblk0p29" ino=627090 scontext=u:r:debuggerd:s0 tcontext=u:object_r:shell_data_file:s0 tclass=dir
W/debuggerd(  194): type=1400 audit(0.0:1179): avc: denied { search } for name="tmp" dev="mmcblk0p29" ino=627090 scontext=u:r:debuggerd:s0 tcontext=u:object_r:shell_data_file:s0 tclass=dir
W/debuggerd(  194): type=1400 audit(0.0:1180): avc: denied { search } for name="tmp" dev="mmcblk0p29" ino=627090 scontext=u:r:debuggerd:s0 tcontext=u:object_r:shell_data_file:s0 tclass=dir
W/debuggerd(  194): type=1400 audit(0.0:1181): avc: denied { search } for name="tmp" dev="mmcblk0p29" ino=627090 scontext=u:r:debuggerd:s0 tcontext=u:object_r:shell_data_file:s0 tclass=dir
W/debuggerd(  194): type=1400 audit(0.0:1182): avc: denied { search } for name="tmp" dev="mmcblk0p29" ino=627090 scontext=u:r:debuggerd:s0 tcontext=u:object_r:shell_data_file:s0 tclass=dir
W/debuggerd(  194): type=1400 audit(0.0:1183): avc: denied { search } for name="tmp" dev="mmcblk0p29" ino=627090 scontext=u:r:debuggerd:s0 tcontext=u:object_r:shell_data_file:s0 tclass=dir
I/DEBUG   (  194): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
I/DEBUG   (  194): Build fingerprint: 'lge/palman/palman:5.1/LMY47O.L017/1474611459:user/release-keys'
I/DEBUG   (  194): Revision: '11'
I/DEBUG   (  194): ABI: 'arm'
I/DEBUG   (  194): pid: 32474, tid: 32474, name: app  >>> ./app <<<
I/DEBUG   (  194): signal 7 (SIGBUS), code 1 (BUS_ADRALN), fault addr 0xb89cd4c1
W/NativeCrashListener(  630): Couldn't find ProcessRecord for pid 32474
I/DEBUG   (  194):     r0 b8a51558  r1 00000000  r2 b8a51558  r3 b89cd4c1
E/DEBUG   (  194): AM write failure (32 / Broken pipe)
I/DEBUG   (  194):     r4 00000000  r5 b89cd4c9  r6 b89cd4c1  r7 beb3a010
I/DEBUG   (  194):     r8 00000001  r9 00000000  sl b8a51558  fp 00000001
I/DEBUG   (  194):     ip 00000000  sp beb39fd8  lr 00000000  pc b64820a0  cpsr 80010030
I/DEBUG   (  194): 
I/DEBUG   (  194): backtrace:
I/DEBUG   (  194):     #00 pc 000830a0  /data/local/tmp/libonnxruntime.so
I/DEBUG   (  194):     #01 pc 00000000  <unknown>
I/DEBUG   (  194): 
I/DEBUG   (  194): Tombstone written to: /data/tombstones/tombstone_02
I/BootReceiver(  630): Copying /data/tombstones/tombstone_02 to DropBox (SYSTEM_TOMBSTONE)

addr2line for the NDK r19 build points into the libc++ headers:

/home/username/Applications/AndroidNDKr19c/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include/c++/v1/memory:1812

Output for NDK v21 is similar:

/home/username/Android/Sdk/ndk/21.0.6113669/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include/c++/v1/memory:1825

Update: ndk-stack produced a symbolized stack:

********** Crash dump: **********
Build fingerprint: 'lge/palman/palman:5.1/LMY47O.L017/1474611459:user/release-keys'
#00 0x00079f78 /data/local/tmp/libonnxruntime.so
void std::__ndk1::allocator<long long>::construct<long long, long long const&>(long long*, long long const&)
/home/muxyrmuc/Android/Sdk/ndk/21.0.6113669/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include/c++/v1/memory:1825:35
void std::__ndk1::allocator_traits<std::__ndk1::allocator<long long> >::__construct<long long, long long const&>(std::__ndk1::integral_constant<bool, true>, std::__ndk1::allocator<long long>&, long long*, long long const&)
/home/muxyrmuc/Android/Sdk/ndk/21.0.6113669/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include/c++/v1/memory:1717:0
void std::__ndk1::allocator_traits<std::__ndk1::allocator<long long> >::construct<long long, long long const&>(std::__ndk1::allocator<long long>&, long long*, long long const&)
/home/muxyrmuc/Android/Sdk/ndk/21.0.6113669/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include/c++/v1/memory:1560:0
std::__ndk1::enable_if<__is_forward_iterator<long long const*>::value, void>::type std::__ndk1::__split_buffer<long long, std::__ndk1::allocator<long long>&>::__construct_at_end<long long const*>(long long const*, long long const*)
/home/muxyrmuc/Android/Sdk/ndk/21.0.6113669/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include/c++/v1/__split_buffer:275:0
std::__ndk1::enable_if<(__is_forward_iterator<long long const*>::value) && (is_constructible<long long, std::__ndk1::iterator_traits<long long const*>::reference>::value), std::__ndk1::__wrap_iter<long long*> >::type std::__ndk1::vector<long long, std::__ndk1::allocator<long long> >::insert<long long const*>(std::__ndk1::__wrap_iter<long long const*>, long long const*, long long const*)
/home/muxyrmuc/Android/Sdk/ndk/21.0.6113669/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include/c++/v1/vector:2005:0
#01 0xffffffff <unknown>
Crash dump is completed

@skottmckay
Contributor

It looks like something is going wrong inserting into a vector where a new allocation is required to grow the vector, but it's not clear what code is trying to do the insert as the stack has no info on that.

Can this be reproduced in an emulator or is an ARM device needed? Some info on running that here: https://github.com/microsoft/onnxruntime/blob/master/docs/Android_testing.md. Possibly need to specify the RAM size to match the device it fails on using the '-memory' parameter. You'd have to rebuild ORT targeting the host device though.

Or alternatively can you run the program in GDB and see if a more complete stack can be obtained? https://source.android.com/devices/tech/debug/gdb

@muxyrmuc
Author

muxyrmuc commented Jun 4, 2020

I don't think this issue can be reproduced in an x86 emulator, because x86 tolerates the kind of unaligned memory access that faults on this ARM device, so I skipped this step for now.
GDB didn't give me a usable backtrace either:

Program received signal SIGBUS, Bus error.
0xb6b51f38 in ?? ()
(gdb) bt
#0  0xb6b51f38 in ?? ()
#1  0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I tracked the issue down manually and found the function that causes the problem.
File:

/home/username/onnxruntime/cmake/external/onnx/onnx/defs/generator/defs.cc (located at https://github.com/onnx/onnx/blob/master/onnx/defs/generator/defs.cc)

Function:

TypeAndShapeInferenceFunction() for ConstantOfShape (called from InferenceContextImpl::RunInferencing() in onnxruntime/core/graph/graph.cc)

Lines:

std::vector<int64_t> targetShape;
if (targetShapeInitializer->has_raw_data()) {
  const std::string& bytes = targetShapeInitializer->raw_data();
  targetShape.insert(
      targetShape.end(),
      reinterpret_cast<const int64_t*>(bytes.c_str()),
      reinterpret_cast<const int64_t*>(bytes.c_str() + bytes.size()));
}

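SIGBUS is raised in targetShape.insert() because the data is read through a const int64_t* obtained from bytes.c_str(), and that pointer is not guaranteed to be aligned for int64_t. To check this idea I copied the data into memalign()-allocated storage first, and the crash went away.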
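To make the alignment argument concrete, here is a minimal sketch of a check that could be logged just before the insert() call. IsAlignedFor is a hypothetical helper, not something from ONNX or ONNX Runtime. The fault address in the first log (0xb7acf8a1) is odd, so it cannot be a multiple of 8, which is consistent with a misaligned 64-bit read.

```cpp
#include <cstdint>

// Hypothetical helper (not part of ONNX or ONNX Runtime): reports whether
// the address p is a multiple of alignof(T), i.e. suitably aligned to be
// read as a T.
template <typename T>
bool IsAlignedFor(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}
```

For example, IsAlignedFor<int64_t>(bytes.c_str()) returning false would confirm that the reinterpret_cast above yields a pointer the CPU may refuse to load from.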

I also found a similar issue in TypeAndShapeInferenceFunction() for the Slice operator (checked with a different model file), which uses ParseData() from https://github.com/onnx/onnx/blob/master/onnx/defs/tensor_proto_util.cc:

res.insert(                                                            \
        res.end(),                                                         \
        reinterpret_cast<const type*>(bytes),                              \
        reinterpret_cast<const type*>(bytes + raw_data.size()));           \

These lines also cause SIGBUS.
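One way both call sites could avoid the misaligned read is to copy the raw bytes with memcpy() instead of iterating through a casted pointer. The following is only a sketch under my own assumptions: ParseRawData is a hypothetical name, it is not the actual ONNX fix, and like the original code it assumes the raw data is already in host byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical helper, not part of ONNX: copies `size` raw bytes into a
// vector<T> without ever dereferencing a possibly misaligned T*.
template <typename T>
std::vector<T> ParseRawData(const char* bytes, std::size_t size) {
  static_assert(std::is_trivially_copyable<T>::value,
                "T must be trivially copyable");
  if (size % sizeof(T) != 0) {
    throw std::invalid_argument("size is not a multiple of sizeof(T)");
  }
  // The vector's own storage is correctly aligned for T.
  std::vector<T> res(size / sizeof(T));
  if (!res.empty()) {
    // memcpy has no alignment requirement on its source pointer.
    std::memcpy(res.data(), bytes, size);
  }
  return res;
}
```

With something like this, the ConstantOfShape case would become ParseRawData<int64_t>(bytes.c_str(), bytes.size()).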

Hope it helps (should I create issue in ONNX repo for these cases?).

@skottmckay
Contributor

skottmckay commented Jun 4, 2020

Hope it helps (should I create issue in ONNX repo for these cases?).

Thanks for the detailed info! If you could create an issue on the ONNX repo that would be awesome as the problem needs to be fixed there.

Out of interest what did your change to use memalign involve?

Also curious as to why it's only a Release build that fails.

@muxyrmuc
Author

muxyrmuc commented Jun 4, 2020

I've just opened an issue in the ONNX repo: onnx/onnx#2813

Out of interest what did your change to use memalign involve?

I allocated char* storage aligned to sizeof(int64_t) and copied the data from bytes.c_str() into it with memcpy().
Of course, it's not a proper fix, but it is enough to avoid the alignment issue.
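A rough reconstruction of that workaround might look like the sketch below. The function name is hypothetical, and it uses posix_memalign() rather than the memalign() call mentioned above, purely as a portability assumption on my side.

```cpp
#include <stdlib.h>  // posix_memalign, free

#include <cstdint>
#include <cstring>
#include <memory>
#include <new>
#include <string>
#include <vector>

// Hypothetical reconstruction of the workaround; names are illustrative.
std::vector<std::int64_t> ReadShapeViaAlignedCopy(const std::string& bytes) {
  std::vector<std::int64_t> shape;
  if (bytes.empty()) return shape;

  // Allocate scratch storage aligned for int64_t. posix_memalign requires a
  // power-of-two alignment that is a multiple of sizeof(void*).
  void* raw = nullptr;
  if (posix_memalign(&raw, sizeof(std::int64_t), bytes.size()) != 0) {
    throw std::bad_alloc();
  }
  std::unique_ptr<void, decltype(&free)> scratch(raw, &free);

  // Copy the possibly misaligned bytes into the aligned scratch buffer.
  std::memcpy(scratch.get(), bytes.data(), bytes.size());

  // Reading through int64_t* is now alignment-safe.
  const auto* first = static_cast<const std::int64_t*>(scratch.get());
  shape.insert(shape.end(), first, first + bytes.size() / sizeof(std::int64_t));
  return shape;
}
```

The extra allocation is what makes this a stopgap: copying straight into the destination vector's storage would give the same alignment guarantee without the scratch buffer.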

@stale

stale bot commented Aug 3, 2020

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@stale stale bot added the wontfix label Aug 3, 2020
@stale

stale bot commented Aug 15, 2020

This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.

@stale stale bot closed this as completed Aug 15, 2020