[arm64] membarrier causing silent crashes/freezes #12605

Closed
filipnavara opened this issue Apr 29, 2019 · 14 comments
Labels
arch-arm64 area-PAL-coreclr os-unsupported OS which is not officially supported
Comments

@filipnavara
Member

I was trying to run CoreCLR on a Galaxy S10 phone in the Linux-on-DeX environment (essentially an Ubuntu 16.04 Docker container). Almost every non-trivial operation results in a silent crash, including running dotnet --version.

Running under strace seems to suggest that the problem is the membarrier syscall introduced with PRs dotnet/coreclr#20949 and dotnet/coreclr#23778. The last line I can see in the log is the following:

membarrier(MEMBARRIER_CMD_QUERY, 0

After that the process freezes and it's listed as <defunct> in ps.

I wrote a test application to verify the assumption that membarrier is the culprit. Calling it through syscall in the same way CoreCLR does results in a SIGSYS signal:

membarrier(MEMBARRIER_CMD_QUERY, 0 <unfinished ...>
+++ killed by SIGSYS +++
Bad system call
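
A minimal sketch of such a test, assuming a Linux system that provides `<linux/membarrier.h>`. This is not the original test application; the function names `membarrier_raw` and `query_membarrier` are made up for illustration. It issues the raw syscall because older glibc versions provide no wrapper for it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/membarrier.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue the raw membarrier syscall. */
static int membarrier_raw(int cmd, unsigned int flags)
{
    return (int)syscall(__NR_membarrier, cmd, flags);
}

/*
 * Returns the bitmask of supported commands, or -1 with errno set.
 * Under a seccomp filter that kills on this syscall, the call never
 * returns: the process dies with SIGSYS instead.
 */
int query_membarrier(void)
{
    int mask = membarrier_raw(MEMBARRIER_CMD_QUERY, 0);
    int saved_errno = errno;

    if (mask < 0)
        fprintf(stderr, "membarrier query failed: errno=%d\n", saved_errno);
    else
        printf("supported membarrier commands: 0x%x\n", (unsigned int)mask);

    errno = saved_errno; /* keep errno intact for the caller */
    return mask;
}
```

Calling `query_membarrier()` from `main` and running the binary under strace should show whether the syscall returns normally or the process is killed at that point.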

The underlying kernel version reported by uname -a is Linux localhost 4.14.85-15820661 #1 SMP PREEMPT Tue Apr 16 17:32:20 KST 2019 aarch64 aarch64 aarch64 GNU/Linux.

/cc @VSadov @janvorli @tmds

@filipnavara filipnavara changed the title [arm64] membarrier causing silent crashes [arm64] membarrier causing silent crashes/freezes Apr 29, 2019
@janvorli
Member

I wonder if the syscall could have a different number in the kernel compiled for Android.
Without that syscall, we have no way to implement FlushProcessWriteBuffers, which the runtime depends on heavily on arm64. Unlike on other architectures, the mprotect trick doesn't work on arm64, since arm64 doesn't need an IPI to flush the TLB on all CPU cores.
I will try to find out whether the syscall has a different number on Android.
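
For readers unfamiliar with the mechanism, here is a rough sketch of how a process-wide barrier can be built on membarrier. It is an illustration only, not the CoreCLR implementation; the function names are hypothetical, and it assumes kernel headers from Linux 4.14 or later, which define the expedited private commands.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue the raw membarrier syscall. */
static int membarrier_raw(int cmd, unsigned int flags)
{
    return (int)syscall(__NR_membarrier, cmd, flags);
}

/*
 * Hypothetical one-time setup. Returns 0 if the expedited private
 * barrier is usable, -1 otherwise (the caller then needs a fallback).
 */
int init_process_wide_barrier(void)
{
    int mask = membarrier_raw(MEMBARRIER_CMD_QUERY, 0);
    if (mask < 0 || !(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED))
        return -1;

    /* The kernel requires registration before the expedited command. */
    if (membarrier_raw(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) != 0)
        return -1;
    return 0;
}

/* Executes a full memory barrier on every running thread of the process. */
int flush_process_write_buffers(void)
{
    return membarrier_raw(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
}
```

Note that the very first step is the MEMBARRIER_CMD_QUERY call, which is the call that never returns in the environment described above.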

@filipnavara
Member Author

The syscall number matches the system headers distributed with the Ubuntu container (which may not be correct) and it is decoded properly by strace.

@janvorli
Member

There must be something strange going on. If you look at the implementation of the syscall in the Linux kernel https://elixir.bootlin.com/linux/v4.14.85/source/kernel/sched/membarrier.c#L152, the only case where it could fail with a bad system call would be when the flags passed in were non-zero.
I've also checked the syscall numbers - the Android kernel seems to be the same in this respect as the vanilla Linux.

@filipnavara
Member Author

I think it could be the container that the Linux is running in but there's not too much info available about it. If you have some hint what to look for I can check it, or I can try to share access to it.

@filipnavara
Member Author

filipnavara commented Apr 29, 2019

Further inspection suggests that it's caused by seccomp, which is turned on in the container (SECCOMP_MODE_FILTER in /proc/self/status). It could be part of the default Android policy. I will try to get more detailed logs, but it's non-trivial.
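
The seccomp mode of a process can be read from the `Seccomp:` field of /proc/self/status, where 0 means disabled, 1 means strict mode and 2 means filter mode (SECCOMP_MODE_FILTER). A small sketch of that check follows; the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extracts the value of the "Seccomp:" field from the text of
 * /proc/<pid>/status. Returns 0 (disabled), 1 (strict), 2 (filter),
 * or -1 if the field is absent.
 */
int parse_seccomp_mode(const char *status_text)
{
    const char *line = status_text;

    while (line != NULL && *line != '\0') {
        int mode;
        /* Matches only at the start of a line; "Seccomp_filters:" does not match. */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            return mode;
        line = strchr(line, '\n');
        if (line != NULL)
            line++; /* step past the newline to the next line */
    }
    return -1;
}

/* Reads /proc/self/status and returns the seccomp mode, or -1 on error. */
int read_seccomp_mode(void)
{
    char buf[8192];
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_seccomp_mode(buf);
}
```

A return value of 2 from `read_seccomp_mode()` indicates that a seccomp filter is installed, but not which syscalls it blocks.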

@filipnavara
Member Author

Here's a snippet from the log (from my test app to avoid too much output):

04-29 20:59:53.269  5178  5178 E audit   : type=1326 audit(1556564393.265:488): auid=4294967295 uid=1638401000 gid=1638401000 ses=4294967295 subj=u:r:lxd_cont_app:s0 pid=15663 comm="test" exe="/home/dextop/Downloads/test" sig=31 arch=c00000b7 syscall=283 compat=0 ip=0x70d5b57b44 code=0x0
04-29 20:59:53.269  5178  5178 E audit   : type=1701 audit(1556564393.265:489): auid=4294967295 uid=1638401000 gid=1638401000 ses=4294967295 subj=u:r:lxd_cont_app:s0 pid=15663 comm="test" exe="/home/dextop/Downloads/test" sig=31 res=1
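
The fields in these records can be decoded by hand: type=1326 is a seccomp audit event, sig=31 is SIGSYS on Linux, arch=c00000b7 identifies 64-bit little-endian AArch64, and syscall=283 is membarrier in the generic syscall table that arm64 uses. The sketch below spells out that arithmetic. The constant values are copied by hand from `<elf.h>` and `<linux/audit.h>` rather than included, and the macro and function names are made up for illustration.

```c
#include <signal.h>

/* Values mirrored from <elf.h> and <linux/audit.h>. */
#define EM_AARCH64_VALUE      183u        /* ELF machine number for AArch64 */
#define AUDIT_ARCH_64BIT_FLAG 0x80000000u /* 64-bit ABI */
#define AUDIT_ARCH_LE_FLAG    0x40000000u /* little-endian */

/* arm64 uses the generic syscall table, where membarrier is number 283. */
#define ARM64_NR_MEMBARRIER   283

/* Builds the audit arch token for AArch64: 0xc00000b7. */
unsigned int audit_arch_aarch64(void)
{
    return EM_AARCH64_VALUE | AUDIT_ARCH_64BIT_FLAG | AUDIT_ARCH_LE_FLAG;
}

/* Returns 1 if the record describes membarrier being killed with SIGSYS. */
int record_is_membarrier_sigsys(unsigned int arch, int syscall_nr, int sig)
{
    return arch == audit_arch_aarch64()
        && syscall_nr == ARM64_NR_MEMBARRIER
        && sig == SIGSYS;
}
```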

@filipnavara
Member Author

filipnavara commented Apr 29, 2019

There's something really weird going on. AOSP has a whitelist entry for membarrier and it seems to be used by ART, so I don't really understand why it fails.

Update: Turns out the answer is obvious. It was actually added only in Android Q and this device is still on Android P. This is what the ART source code says:

#if defined(__BIONIC__)
  // Avoid calling membarrier on older Android versions where membarrier may be barred by secomp
  // causing the current process to be killed. The probing here could be considered expensive so
  // endeavour not to repeat too often.
  static int api_level = android_get_device_api_level();
  if (api_level < __ANDROID_API_Q__) {
    errno = ENOSYS;
    return -1;
  }
#endif  // __BIONIC__
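
ART can consult the Android API level, which is not available outside Bionic. A different way to avoid being killed, shown here only as a hedged sketch and not something proposed in this thread, is to make the first membarrier call in a forked child and inspect how the child ended. The function name `membarrier_is_safe` is hypothetical, and forking from a multi-threaded runtime has costs of its own.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical probe. Returns 1 if membarrier(MEMBARRIER_CMD_QUERY)
 * returns normally, 0 if the call fails or kills the caller, and -1
 * if the probe itself could not run.
 */
int membarrier_is_safe(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: if seccomp kills us here, the parent observes a signal. */
        long r = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);
        _exit(r >= 0 ? 0 : 1);
    }

    int status;
    if (waitpid(child, &status, 0) != child)
        return -1;

    /* A child killed by SIGSYS is reported via WIFSIGNALED, not WIFEXITED. */
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```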

@filipnavara
Member Author

filipnavara commented Apr 30, 2019

Turns out that I hit the same problem with membarrier on some other indirect dependencies (lttng-ust -> liburcu) as well. I reported the problem to Samsung but I don't expect them to act swiftly on it.

@filipnavara
Member Author

On an unrelated note, I rebuilt CoreCLR without the membarrier call and everything else seems to work.

@janvorli
Member

janvorli commented May 2, 2019

> everything else seems to work.

Issues caused by a missing FlushProcessWriteBuffers would manifest mostly under heavy multi-threaded stress, so reproducing them in a regular app may take many runs.

@filipnavara
Member Author

I understand that, but it raises the question of why it was not used in .NET Core 2.2 or 2.1 and/or why it was not backported if it's such an issue. liburcu seems to have an alternative implementation without membarrier that should work on ARM systems, but I haven't investigated it in detail yet.

@janvorli
Member

janvorli commented May 2, 2019

> why it was not used in .NET Core 2.2 or 2.1 and/or why it was not backported if it's such an issue

ARM64 Linux is not supported in 2.1 and 2.2.

@filipnavara
Member Author

Ah, that explains a lot. Maybe it would make sense to mark the fallback path with some warning/assert on ARM64 if it's known not to work correctly. I'm fine with running somewhat broken builds locally until there is a better alternative.

@filipnavara
Member Author

Samsung just announced that they are killing Linux on DeX so this is a dead end.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 13, 2020