A weird, yet critical issue with x86 and OpenGL ES #54

Closed
DoDoENT opened this issue Mar 29, 2016 · 9 comments

Comments

@DoDoENT

DoDoENT commented Mar 29, 2016

In our project we use OpenGL ES initialized with a pbuffer surface to perform some offscreen rendering (OpenGL ES shaders are used to accelerate some image processing).

With NDK r10e everything works OK; however, when the same source code is built with NDK r11b, the app crashes when OpenGL is initialised.

The device in question is a Prestigio Multiphone (device model: PAP5430); the target is set to android-21, and the build uses static GNU STL and GCC 4.9 (with both NDK r10e and NDK r11b).

When built with NDK r10, the app works correctly; when built with NDK r11, we get the following crash:

stack corruption detected: aborted

This is how to reproduce it:

  • obtain default display with eglGetDisplay(EGL_DEFAULT_DISPLAY)
  • initialize EGL with eglInitialize(display, &versionMajor, &versionMinor)
  • try creating OpenGLES 3 context - this will fail as this device does not support it
  • try creating OpenGLES 2 context - this succeeds
  • create pbuffer surface with eglCreatePbufferSurface
  • make context current with eglMakeCurrent
  • after this, attempt to return from function causes stack corruption
  • the very same code works on same device when compiled with same compiler (GCC 4.9), same STL (gnustl_static) and same NDK target (android-21), but with NDK r10d (and r10e)
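
The context-creation part of the steps above (try OpenGL ES 3, then fall back to OpenGL ES 2) can be sketched as follows. This is only an illustration, not the reporter's code: `pickGlesVersion` and the `tryCreate` callback are hypothetical names. On a device, the callback would presumably wrap eglCreateContext with the EGL_CONTEXT_CLIENT_VERSION attribute and compare the result against EGL_NO_CONTEXT.

```cpp
#include <functional>
#include <initializer_list>

// Hypothetical helper: returns the first OpenGL ES major version for which
// tryCreate succeeds, or 0 if no version works. The callback stands in for
// a real eglCreateContext call so the fallback logic can be read (and run)
// without EGL headers or a device.
int pickGlesVersion(const std::function<bool(int)>& tryCreate) {
    for (int version : {3, 2}) {  // prefer ES 3, fall back to ES 2
        if (tryCreate(version)) {
            return version;
        }
    }
    return 0;  // no usable context could be created
}
```

On the device described here, the callback would fail for version 3 and succeed for version 2, so the helper would return 2 before the pbuffer surface is created and made current.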

The GPU on device is PowerVR SGX 540.

I believe this is a NDK issue, not a device issue.

@DoDoENT
Author

DoDoENT commented Mar 29, 2016

I've also tried on the x86 emulator. With the x86 Marshmallow image I get a different crash (see below), and the Intel 4.0.4 image is not recognised by the virtual devices manager, even after installing it according to the instructions (probably another issue?).

On the x86 Marshmallow image I get this crash as soon as I attempt to compile a GL shader:

DEBUG  F  #00 pc 00081ba3  /system/lib/libc.so (pthread_mutex_lock+11)
F  #01 pc 00010752  /system/lib/egl/libEGL_emulation.so (GLSharedGroup::addShaderData(unsigned int)+32)
F  #02 pc 0000961f  /system/lib/libGLESv2_enc.so (GL2Encoder::s_glCreateShader(void*, unsigned int)+203)
F  #03 pc 00005e94  /system/lib/egl/libGLESv2_emulation.so (glCreateShader+52)
F #04 (my code that calls glCreateShader)

However, I believe this crash is not related to this issue...

@DoDoENT
Author

DoDoENT commented Mar 29, 2016

I've just tried running the problematic binary on a Pegatron Hudl 2 and it works there (this tablet runs Android 5.1).

So it is possible that this issue is related either to the ICS x86 image (I cannot test that because the official Intel image is not recognized by the recent virtual devices manager) or to the Prestigio Multiphone duo.

Any ideas what happened?

@DoDoENT
Author

DoDoENT commented Mar 29, 2016

OK, after some investigation I've discovered that the error I am getting is due to the -fstack-protector-all flag, i.e. when the code is compiled with -fno-stack-protector, there is no crash.

This now indicates one of several possible things:

  1. in NDK r10, -fstack-protector-all flag was ignored completely and now in r11 it works and has discovered a bug in our code
  2. same as 1., but only for x86 platform
  3. x86 versions of libGLESv2.so and libEGL.so corrupt caller's stack in r11, and work correctly in r10

If 1. is true, then why do our binaries compiled with r11 not crash on other devices (SGS6, Nexus5X, Hudl 2, ...)? Maybe it's specific to Android 4.0.4? EDIT: obviously it's not, as it works on HTC One V on Android 4.0.3 (ARMv7).
If 2. is true, then why does the same x86 binary work on Hudl 2 (it's an x86_64 device, but it runs x86 binaries as well)?
If 3. is true, then this is an r11 issue. But why then does the x86 binary work on the Hudl 2 device?

It is also possible that device in question has some bug in their OpenGL ES driver, but why then our binary works when compiled with r10?

I am totally confused...
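
For readers unfamiliar with what the stack protector checks, here is a minimal simulation in plain C++. It does not touch the real stack: `SimFrame`, `enter`, `calleeWrite`, and `canaryIntact` are hypothetical names made up for this sketch, and the guard value is arbitrary. With -fstack-protector, the compiler stores a guard value next to a function's locals on entry and compares it before returning, calling __stack_chk_fail on a mismatch. The sketch models why a stray write can go unnoticed until a guard happens to sit in its path.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kBufSize = 8;             // simulated local buffer size
constexpr std::uint32_t kCanary = 0xDEADC0DEu;  // arbitrary illustrative guard

// Simulated frame: kBufSize bytes of "locals" followed by a 4-byte guard.
// A real frame is laid out by the compiler; this flat array only models it.
struct SimFrame {
    unsigned char bytes[kBufSize + sizeof(std::uint32_t)];
};

// "Prologue": clear the locals and place the guard right after them.
void enter(SimFrame& f) {
    std::memset(f.bytes, 0, kBufSize);
    std::memcpy(f.bytes + kBufSize, &kCanary, sizeof kCanary);
}

// A misbehaving callee: writes len bytes from the start of the buffer.
// Any len above kBufSize spills into the guard.
void calleeWrite(SimFrame& f, std::size_t len) {
    if (len > sizeof f.bytes) {
        len = sizeof f.bytes;  // never leave the simulated frame
    }
    std::memset(f.bytes, 0x41, len);
}

// "Epilogue": true if the guard still holds its original value.
bool canaryIntact(const SimFrame& f) {
    std::uint32_t value = 0;
    std::memcpy(&value, f.bytes + kBufSize, sizeof value);
    return value == kCanary;
}
```

Writing 8 bytes leaves the guard intact, while writing 9 clobbers its first byte, which is the kind of mismatch the epilogue check reports. Whether a given function gets a guard at all, and where it sits, is up to the compiler, which is one way the same stray write could be silent in one build and fatal in another.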

@DanAlbert
Member

> The device in question is Prestigio Multiphone (device model: PAP5430), target is set to android-21

Uh, is that device actually running Lollipop? The target is a minimum target, and there are a lot of problems specifically with L and newer running on pre-L devices.

> It is also possible that device in question has some bug in their OpenGL ES driver, but why then our binary works when compiled with r10?

That seems pretty likely. According to gsmarena, that device shipped with 4.0, which was before we had any CTS tests to verify stack protector behavior.

Where was -fstack-protector-all coming from? I don't see it provided by our build system (though clang will use -fstack-protector-strong by default as of r11).

The emulator crash in pthread_mutex_lock is pretty interesting too, but I think you're right about it not being related.

@DoDoENT
Author

DoDoENT commented Mar 30, 2016

> Uh, is that device actually running Lollipop? The target is a minimum target, and there are a lot of problems specifically with L and newer running on pre-L devices.

No, this device is running ICS. I am aware that NDK target is a minimum target and up until 16 months ago I used android-8 as a target, but with release of Lollipop and android-21 NDK target I wanted to take advantage of OpenGLES 3 on devices that support it. In order to do that, I needed appropriate OpenGL ES headers which are not available when targeting less than android-21. In order to ensure our binary works correctly, we rigorously tested it down to Android 2.3, implementing all the required workarounds to make our binary work across all Android versions our SDK supports (API level 10 and higher)

> That seems pretty likely. According to gsmarena, that device shipped with 4.0, which was before we had any CTS tests to verify stack protector behavior.

Then why does binary work when compiled with NDK r10? We have used -fstack-protector-all in our Android.mk for more than a year; it was also used with the r10 NDK releases to build our app.
A possible explanation would be that GCC in r10 ignored that flag completely. Can you verify that?

> Where was -fstack-protector-all coming from? I don't see it provided by our build system (though clang will use -fstack-protector-strong by default as of r11).

It comes from our Android.mk.

> The emulator crash in pthread_mutex_lock is pretty interesting too, but I think you're right about it not being related.

I'll open another issue for that later. My guess is that it might be related to a rare crash we have on some very specific devices when initialising OpenGL ES with a pbuffer for offscreen-only rendering.

@DanAlbert
Member

If your testing is solid then you're probably fine. Just know that it is an untested and unsupported configuration that is known broken in many cases.

> Then why does binary work when compiled with NDK r10?

That's a good question. GCC was updated (still 4.9 but it's a newer 4.9). Maybe thanks to some new optimization you're hitting a bad case now? Possibly something that was being inlined before isn't now and that's affected stack guard generation?

> A possible explanation would be that GCC in r10 ignored that flag completely. Can you verify that?

objdump says it works in r10. It also shows that "all" doesn't actually mean "all". It's also possible that the definition of "all" has changed to include the function that is failing the check.

@DoDoENT
Author

DoDoENT commented Mar 30, 2016

Since this appears to be a device-specific issue (other Intel-based devices work fine), I've contacted our Intel contact to check if they know more about possible GPU driver bugs in this device. They responded that they deprecated the platform this device uses (Medfield) years ago, and they recommended dropping support for this device from our SDK. I've also contacted the client who required us to support this device to check whether it's OK to drop support for it. Anyway, even if they still require us to support it, I will make sure our x86 binary will be built with -fno-stack-protector (ugly, but it works).

So, I think we can close this issue because there is obviously nothing to do here. To conclude:

  • the device in question has been deprecated by its manufacturer
  • the device most probably has a bug in its OpenGL ES library that corrupts the caller's stack
  • in binaries compiled with r10 this bug was not observed, for a currently unknown reason
  • in r11, something has changed so that this bug is now observed

@DoDoENT DoDoENT closed this as completed Mar 30, 2016
@DanAlbert
Member

> I will make sure our x86 binary will be built with -fno-stack-protector (ugly, but it works).

FYI anything built with clang will use -fstack-protector-strong by default.

@jevinskie

It looks like bionic added the stack protector TLS slot in this commit which first appeared in a release branch with jb-mr2-release or 4.3. As far as I can tell, Clang, LLVM, and the NDK makefiles contain no logic to gate the use of the TLS cookies to >= 4.3 platforms. I haven't looked at GCC. My company must support down to 4.1 so we patched LLVM to unconditionally use the global cookie instead of the TLS cookie.
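
The gating that the comment above says is missing could look roughly like the sketch below. This is an illustration only, not code from Clang, LLVM, GCC, or the NDK: `CookieSource`, `chooseCookieSource`, and `kFirstApiWithTlsCookie` are hypothetical names. It assumes, per the comment, that the TLS slot first shipped in Android 4.3, which is API level 18.

```cpp
// Where the stack guard value would be read from.
enum class CookieSource { Global, Tls };

// Assumption taken from the comment above: the TLS slot first appeared in
// Android 4.3 (jb-mr2), which corresponds to API level 18.
constexpr int kFirstApiWithTlsCookie = 18;

// Hypothetical gate: use the TLS cookie only when every platform the binary
// may run on is expected to provide the slot; otherwise use the global one.
constexpr CookieSource chooseCookieSource(int minApiLevel) {
    return minApiLevel >= kFirstApiWithTlsCookie ? CookieSource::Tls
                                                 : CookieSource::Global;
}
```

Note that a gate like this keys on the declared minimum API level. In this thread the build targeted android-21 while the device ran ICS (API level 15), so such a gate would still have selected the TLS cookie for that build.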
