Native crash, sgen gc aborting #6546
Comments
I think I found a workaround for the point where I was stuck using xabuild (this is a file from Visual Studio, imported through the
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<PropertyGroup>
<NuGetTargets Condition="'$(NuGetTargets)'==''">$(MSBuildExtensionsPath)\Microsoft\NuGet\$(VisualStudioVersion)\Microsoft.NuGet.targets</NuGetTargets>
</PropertyGroup>
<Import Condition="Exists('$(NuGetTargets)') and '$(SkipImportNuGetBuildTargets)' != 'true'" Project="$(NuGetTargets)" />
</Project>
If I set NuGetTargets to the right value before any import, the PackageReference now works correctly. |
I managed to get the application to crash in the debugger, but since I have not yet compiled my own mono, I don't have the source or sufficiently detailed debugging information in the shared objects. I cannot print any values or go up or down the stack yet 😢. We still have maybe a little more information than the previous stack dump, though...
(limited to frame 12 and 14) |
Using the "new" bridge implementation seems to be a valid workaround for now.
I am still worried that some of the code is corrupting memory and that changing the implementation only prevents the crashes through implementation luck. But if all the features of my app seem to work correctly and it is not crashing anymore, then it is still better than nothing 😅. I managed to get more information.
It seems the three lower bits are consistently set to zero during the few times I was able to reproduce the crash. Maybe the values can give people familiar with the GC implementation an idea of what could be causing this.
EDIT: I am rather eager to understand what is going on, to be sure this bug will not show up again uninvited, so I read a little of the code and documentation about sgen. I saw that the 3 lower bits of the vtable address are actually flags for the cemented/pinned/forwarded state in the GC (and in my screenshot that is the case...),
so I am wondering whether there could be a rare race condition with the tarjan implementation that re-tags an object after this line of code, so that the scan then works on a tagged object where it does not expect one? |
As I suspected, if I just clear the 3 lower bits of the vtable address in the debugger, I get a valid object every time (but never the same type of object). |
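To make the masking step concrete, here is a minimal sketch of what "clearing the 3 lower bits" means for a tagged pointer-sized word. The flag names, their bit positions, and the address are assumptions made up for illustration; only the idea that the three low bits of the vtable word carry GC state comes from the comments above.

```python
# Hypothetical flag names and bit positions (assumptions, not taken from sgen).
TAG_FORWARDED = 0b001
TAG_PINNED = 0b010
TAG_CEMENTED = 0b100
TAG_MASK = TAG_FORWARDED | TAG_PINNED | TAG_CEMENTED  # 0b111


def untag(vtable_word):
    """Return the aligned address with the three low flag bits cleared."""
    return vtable_word & ~TAG_MASK


def tags(vtable_word):
    """List the (hypothetical) flags that are set in the low bits."""
    names = {
        TAG_FORWARDED: "forwarded",
        TAG_PINNED: "pinned",
        TAG_CEMENTED: "cemented",
    }
    return [name for bit, name in names.items() if vtable_word & bit]


# Made-up, 8-byte-aligned address with all three low bits set on top of it.
vtable_address = 0x7F3A1C40
tagged_word = vtable_address | TAG_MASK

print(hex(tagged_word))         # the value a debugger would show
print(hex(untag(tagged_word)))  # the address that actually holds the vtable
print(tags(tagged_word))
```

Because the address is 8-byte aligned, setting all three bits moves the value exactly 7 bytes past the real vtable, which is why masking them in the debugger yields a valid object again.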
I have made substantial progress in identifying and reproducing the issue, which I reported here: the bug affects xamarin-android in its default configuration, but it is specific to mono sgen. It is also rather a corner case, most probably an unlikely race condition, so I don't know whether any action needs to be taken here (other than continuing to integrate upstream mono bug fixes when they are released). |
@tmijieux thanks for a very thorough job investigating the issue! However, Xamarin.Android is but a "client" of the Mono runtime, so I'll pass the buck to @lambdageek who will hopefully be able to address and fix the issue, thanks again! |
There's not much for me to do, other than collect all the backports and shepherd them in: @tmijieux did a great job investigating and fixing the underlying issue. |
…od_union_preclean (#21384) fixes #21369 Related to dotnet/android#6546 job_major_mod_union_preclean can race with the tarjan bridge implementation that changes the vtable pointer by setting the three lower bits. This results in an invalid load of the vtable (shifted by 7 bytes), which in turn gives a wrong desc to the scan functions. This change is released under the MIT license.
…od_union_preclean (#63293) Fixes mono/mono#21369 Related to dotnet/android#6546 job_major_mod_union_preclean can race with the tarjan bridge implementation that changes the vtable pointer by setting the three lower bits. This results in an invalid load of the vtable (shifted by 7 bytes), which in turn gives a wrong desc to the scan functions. This change is released under the MIT license. Co-authored-by: tmijieux <[email protected]>
…od_union_preclean (#21391) fixes #21369 Related to dotnet/android#6546 job_major_mod_union_preclean can race with the tarjan bridge implementation that changes the vtable pointer by setting the three lower bits. This results in an invalid load of the vtable (shifted by 7 bytes), which in turn gives a wrong desc to the scan functions. This change is released under the MIT license. Co-authored-by: Thomas Mijieux <[email protected]>
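The "shifted by 7 bytes" effect described in these commit messages can be reproduced on a toy byte buffer. This is only a sketch under stated assumptions: the 64-bit word size, the two-word vtable layout, the position of the descriptor field, and every value below are invented for illustration, and the two helper functions are hypothetical rather than the code of the actual patch.

```python
import struct

WORD = 8              # assumed pointer size in bytes (64-bit target)
TAG_MASK = 0b111      # the three low bits used as GC state flags

# Fake memory holding a hypothetical vtable at offset 16. Its second word
# is assumed to be the GC descriptor; the word after it is unrelated data.
memory = bytearray(64)
VTABLE_OFFSET = 16
DESC_FIELD = WORD
struct.pack_into("<Q", memory, VTABLE_OFFSET, 0x1111111111111111)
struct.pack_into("<Q", memory, VTABLE_OFFSET + DESC_FIELD, 0x42)
struct.pack_into("<Q", memory, VTABLE_OFFSET + 2 * WORD, 0xAAAAAAAAAAAAAAAA)


def load_descriptor_unsafe(vtable_word):
    """Dereference the word as-is, like a reader unaware of the tag bits."""
    return struct.unpack_from("<Q", memory, vtable_word + DESC_FIELD)[0]


def load_descriptor_safe(vtable_word):
    """Strip the tag bits first, then dereference the aligned address."""
    return struct.unpack_from("<Q", memory, (vtable_word & ~TAG_MASK) + DESC_FIELD)[0]


# Another thread tagged the word between the store and our read.
tagged = VTABLE_OFFSET | TAG_MASK

print(hex(load_descriptor_safe(tagged)))    # the real descriptor
print(hex(load_descriptor_unsafe(tagged)))  # bytes straddling two fields
```

The unsafe read starts 7 bytes late, so it mixes the last byte of the descriptor with the first 7 bytes of the neighbouring field and hands the scan code a value that was never a descriptor.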
@jonpryor @grendello The runtime fixes are in mono 2020-02 and dotnet release/6.0-maui. For !NET6, bump to mono/mono@a5d1934. Thanks again @tmijieux! |
Fixes: dotnet#6546 Context: mono/mono#21391 Changes: * transform sgen_get_descriptor to parallel safe version in job_major_mod_union_preclean (#21391)
Fixes: #6546 Context: mono/mono#21384 Changes: mono/mono@b8d7525...a5d1934 * mono/mono@a5d1934898b: transform sgen_get_descriptor to parallel safe version in job_major_mod_union_preclean (#21391)
Glad to see this fixed so quickly 🙂 any idea when we'll see a release with this fix included (currently a pretty large crasher for us)? |
Does anybody know a Xamarin.Android version that works with .NET 5? @FelixZY I am building the app with v. 6.12.0.164 but it does not fix it: |
Android application type
Classic Xamarin.Android (MonoAndroid11.0, MonoAndroid12.0, etc.)
Affected platform version
VS2022 17.0.1, 17.0.2; VS2019 (latest, probably https://docs.microsoft.com/fr-fr/visualstudio/releases/2019/release-notes#16.11.5 but I have uninstalled it since)
Description
I am currently investigating a native crash very similar to #3892.
The relevant part of the log seems to be this message:
Assertion: should not be reached at /Users/builder/jenkins/workspace/archive-mono/2020-02/android/release/mono/sgen/sgen-scan-object.h:91
(See here: this is actually a header file that contains preprocessor-templated code and is included at multiple positions in the code.) This seems to indicate that a portion of theoretically unreachable code was reached in the code that scans for references in the sgen garbage collector. Specifically, it looked at a field in a native struct to determine what type of object it is currently looking at, but the switch did not match any case, which seems to indicate that what it is looking at is not what it expected (could this be memory corruption?).
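The failure mode described here, a dispatch on a type field that falls through every case, can be sketched as follows. The descriptor kinds, the mask, and the function name are hypothetical stand-ins and do not reproduce the real sgen descriptor encoding; the point is only that a value which was never a valid descriptor ends up in the "should not be reached" branch.

```python
# Hypothetical descriptor kinds stored in the low bits of a descriptor word.
DESC_TYPE_MASK = 0b111
SCANNERS = {
    1: "run_length",
    2: "bitmap",
    3: "pointer_free",
}


def scan_kind(descriptor):
    """Pick a scanning strategy from the descriptor's type bits."""
    kind = descriptor & DESC_TYPE_MASK
    if kind not in SCANNERS:
        # Counterpart of the "should not be reached" assertion: no known
        # object layout matches, so the descriptor itself is suspect.
        raise AssertionError(f"should not be reached: descriptor type {kind}")
    return SCANNERS[kind]


print(scan_kind(0x42))  # a well-formed descriptor selects a scanner

try:
    scan_kind(0xAAAAAAAAAAAAAA00)  # arbitrary bytes read as a descriptor
except AssertionError as error:
    print(error)
```

An assertion of this kind reports that the input was already wrong when it arrived, so the cause has to be looked for in whatever produced the descriptor rather than in the switch itself.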
My intuition for now is that it has something to do with some C# code we introduced (though I am not 100% sure about that, and my rational self rather believes it is unlikely, since it looks like a native crash).
It started crashing in a production release (we never reproduced during development until it happened in production).
In the new release, we did update some nugets but even after reverting the nuget versions, the crash was still there. But when we rebuilt (with the same visual-studio/xamarin-android version that was producing the crashing builds) an old version that we knew was not triggering this crash initially, then the new build of that old version was still not crashing.
At first we thought the crash was more likely to happen under low-memory conditions (which is plausible, because it happens when the GC is running), so we looked for memory leaks. We found some, but even after fixing most of them the crashes are still there.
While we were looking for leaks, we tried a "git bisect" to find where the problem was coming from, but we were more focused on the leaks at the time, so I should probably retry the "git bisect" with a focus on triggering the crash (this is a little problematic, because we have not yet found a way to reproduce the issue systematically on any of our test devices).
Things I've looked at that I thought could be related but do not really look similar:
https://docs.microsoft.com/en-us/xamarin/android/release-notes/11/11.1#corrected-garbage-collection-behavior-for-android-bindings-and-bindings-projects
Our app is a Xamarin.Forms app.
Here is the list of NuGet packages we use:
xml package references
So we have a few bindings libraries (com.onesignal, openid.appauth,...) and the main native library here is SkiaSharp.
What I am currently stuck on:
I succeeded in attaching gdb to my app as described here (https://github.com/xamarin/xamarin-android/blob/main/Documentation/workflow/DevelopmentTips.md#attaching-gdb-using-visual-studio-on-windows)
and I think I also got my app to trigger the crash once, but there was virtually no information when printing the backtrace (just question marks).
My current goal is to build xamarin-android so that I can debug my app natively and try to get more information out of it (with debug symbols in mono sgen and so on; I would also like to have the address and undefined-behavior sanitizers on mono and SkiaSharp if possible).
I successfully built xamarin-android and xabuild (I checked out the d17-0 branch because I wanted the same version I have in my current Visual Studio 2022; is that a good idea or not?).
I had to change the PlatformTarget of xabuild.csproj to x64 because the msbuild in VS2022 seems to be 64-bit.
If I run xabuild on the samples/HelloWorld project I can build and deploy an app, but if I add a PackageReference to Xamarin.Forms
diff
somehow the referenced assemblies from the NuGet package do not get added to the csc command line and the project fails to build:
C:\src\xamarin-android\samples\HelloWorld\MainActivity.cs(4,15): error CS0234: The type or namespace name 'Forms' does not exist in the namespace 'Xamarin' (are you missing an assembly reference?) [C:\src\xamarin-android\samples\HelloWorld\HelloWorld.csproj]
This is just what would happen if the PackageReference was not there. The same issue happens with my own project, so I am currently unable to build my project with xabuild. (Maybe something in my environment is hindering correct behavior, or there is an issue with xabuild itself? If someone has an idea about this issue, that would be helpful. I attached a binlog for the modified HelloWorld.)
msbuild.binlog.zip
For the items mentioned in the referenced issue:
I tried disabling the concurrent garbage collector, but the app suffered a slowdown, and it did not seem to fix the crashes.
The crash seems to happen even on debug builds, not only on app store releases, but maybe less often...
What I did not try yet:
Steps to Reproduce
It is still very hard, even for our team, to reproduce (especially on the emulator, where it seems to happen very rarely).
Did you find any workaround?
Not yet, reverting my app to an old version did the trick for now but we cannot go forward until we find what is causing this.
Relevant log output
development environment information