Segfault in AdjustContextForVirtualStub #49070
Tagging subscribers to this area: @tommcdon
Issue Details
I have an application that consistently crashes only on ARM64, with the following stacktrace:
This is an application that throws a bunch of exceptions, on top of lengthy callstacks (this one is 153 frames long). It's important to note that the crash only occurs when our profiler is attached and we've ReJITted some methods. So it's entirely possible we're corrupting something. Still, the consistent location of the error and the fact that it happens only on ARM64 is fishy. Using .NET 5.0.3. I'll keep digging on my side to figure out if the issue is in the runtime or in our profiler.
|
I am not familiar with this code path, but it may just be a bug in the runtime. There is no check for runtime/src/coreclr/vm/arm64/stubs.cpp Line 1197 in f7de865
But runtime/src/coreclr/vm/exceptionhandling.cpp Line 4579 in f7de865
I don't know off the top of my head why you would only see it when your profiler is attached, but it's suspicious that we check for null on other architectures: runtime/src/coreclr/vm/amd64/excepamd64.cpp Lines 614 to 617 in f7de865
I will have to do some research to convince myself that this is the right fix. @kevingosse, if you are feeling motivated it would be helpful to know if adding a check for null solves your problem. |
I'll give it a try tomorrow. That would be my first time building the CLR on ARM64, but I suppose it shouldn't be much harder than x64. |
It's largely the same, except for setting up the cross-building environment. It's documented here: https://github.com/dotnet/runtime/blob/main/docs/workflow/building/coreclr/cross-building.md. We have Docker images with the environment preconfigured, which makes it a lot easier. |
I have more information on the issue. There are actually 2 segfaults. The first is at:
When I resume execution from this point, it triggers another segfault, which is the one I reported:
Adding the null-check as you suggested fixes that second segfault. The first one still happens (I don't know if that's expected) but it does not crash the process anymore. |
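For readers following along, the kind of guard being discussed can be sketched roughly as below. This is a minimal, self-contained illustration and not the coreclr code: `FakeExceptionRecord`, `FakeContext`, and `adjustContextForStub` are hypothetical stand-ins, and which pointer is actually left unchecked on ARM64 is an assumption here, since the thread does not spell it out.

```cpp
#include <cstdint>

// Hypothetical stand-ins for illustration only; these are NOT the coreclr types.
struct FakeExceptionRecord { std::uintptr_t exceptionAddress = 0; };
struct FakeContext { std::uintptr_t pc = 0; std::uintptr_t lr = 0; };

// Rewinds the context to the call site and records that address in the
// exception record, but only when the caller actually supplied a record.
inline bool adjustContextForStub(FakeExceptionRecord* record, FakeContext* context)
{
    if (context == nullptr)
        return false;

    // Assumption: the link register holds the address to resume at.
    const std::uintptr_t callSite = context->lr;

    // The guard under discussion: some callers may pass a null record.
    if (record != nullptr)
        record->exceptionAddress = callSite;

    context->pc = callSite;
    return true;
}
```

Without the `record != nullptr` test, a caller that passes a null record would fault inside the handler itself, which would match a crash at a consistent location.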
Do you have a coredump for the first segfault? Also, what does the native stack look like at the first segfault? |
I can try capturing one tomorrow. There's no native stack (top frame is |
Frames (the coreclr data structure, they are confusingly named the same as stack frames) are what we use to track native code that is used by the runtime but needs to act like managed code. Each thread has a list of Frames that the StackWalker can use to determine if the code it is walking is one of our FCalls/Helpers/etc that plays by the same rules as managed code. The HelperMethodFrame means that it is a jit helper. You should be able to look at the assembly of the helper by disassembling at the IP (0000ffff7d1fbee8) and you can inspect the HelperMethodFrame by looking at the address of the frame (0000ffff457f0f38). E.g. in lldb

Just because it segfaults doesn't mean it's a bug though. You'd have to determine what the helper is and what it's supposed to be doing. NullReferenceExceptions in managed code are achieved by letting the native jitted code run and then if a segfault happens we look at the address of the segfault, and if it's in managed code we translate it to a managed NullReferenceException. The code that does that is what you added a NULL check to. So long story short, this could very well be normal operation. |
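The translation step described above can be pictured with a small sketch. It is only a model under stated assumptions: `CodeRange`, `FaultAction`, `isManagedCode`, and `classifyFault` are hypothetical names, managed code is represented as a plain list of address ranges, and the real runtime's lookup and additional checks are not reproduced.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model: what to do with a hardware fault.
enum class FaultAction { ThrowManagedNullReference, CrashProcess };

// Half-open address range [start, end) assumed to contain jitted managed code.
struct CodeRange { std::uintptr_t start; std::uintptr_t end; };

// Returns true when the instruction pointer falls inside any managed range.
inline bool isManagedCode(const std::vector<CodeRange>& ranges, std::uintptr_t ip)
{
    for (const CodeRange& r : ranges)
        if (ip >= r.start && ip < r.end)
            return true;
    return false;
}

// A fault raised from managed code is turned into a managed exception;
// anything else is treated as a genuine native crash.
inline FaultAction classifyFault(const std::vector<CodeRange>& managedRanges,
                                 std::uintptr_t faultingIp)
{
    return isManagedCode(managedRanges, faultingIp)
        ? FaultAction::ThrowManagedNullReference
        : FaultAction::CrashProcess;
}
```

Under this model, a segfault observed in a debugger is not by itself evidence of a runtime bug: the same signal is the normal trigger for a managed NullReferenceException.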
Unfortunately I'm not sure how to capture a coredump for this segfault (since it doesn't crash the process). Saving a coredump from LLDB doesn't seem to be supported on this OS. I can't use gcore because the debugger is attached. I can capture one with GDB but the format is not supported by LLDB, and GDB seems to struggle to reconstruct the callstack. In any case, I did more digging with LLDB. I see no evidence in the helper frame of what method is being called ( The disassembly of the stub is:
It fails on the very first instruction (the value of x0 is 0x0). I'm going to assume this is this code: https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/arm/stubs.cpp#L957 (I can't find a version for ARM64, does it mean it uses the version for ARM?) in which case the segfault is expected (according to the comment) just like you described. |
Though it's in the middle of throwing a JSON serialization exception, it seems weird there would be a null reference there. |
@kevingosse re: the first segfault, do you happen to have the call and data that triggers this? Might possibly be a bug in the json serializer (Utf8Json fork in 7.x) in the Elasticsearch .NET client that would be good for us to fix 🙂 In the interests of not wanting to derail this issue, an issue can be opened on https://github.com/elastic/elasticsearch-net/issues/new/choose |
That's in version 6 of the client, so still based on JSON.NET. In the end, the segfault happens in |
From the runtime side I don't think that this is a bug. I don't look at codegen all that often so I could always be wrong, but I am pretty sure x0 is the object that is being dispatched, so if x0 is null that means someone is trying to call a virtual method on a null object. Looking at ExceptionDispatchInfo.Throw it's a fairly simple method: Lines 52 to 57 in bd7630d
It seems likely that _exception is null here. How _exception is null is not obvious, since the ExceptionDispatchInfo would be returned by whatever Task is running, either through task.GetCancellationExceptionDispatchInfo() or task.GetExceptionDispatchInfos(). Are you able to capture it under a managed debugger so you can inspect the managed state? |
I retrieved the ExceptionDispatchInfo instance with
I then checked the disassembly of
Given that the IP for the ExceptionDispatchInfo.Throw() frame is at 0000ffff80e39a84, I assume we haven't returned from the call
Here it's a simple jump to 0xfffff71c0378. Which seems to be
Unfortunately there are too many branches in this method, so I got lost pretty quickly. But that made me think: how can we make it all the way to a virtual dispatch stub if we still haven't returned from |
We're starting to get far enough in the details that I don't have an immediate answer. @jkotas is there anyone who is familiar with exception handling that could take a look? If not I can dig further but it will take me a bit to dig in and get an answer. |
The original problem tracked by this issue (segfault in AdjustContextForVirtualStub) is fixed. We should open a new issue to discuss the other AV. Everything in the description suggests that an interface method was called on a NULL pointer, and the exception was caught and handled gracefully. The only outlier is the output from |
I have an application that consistently crashes only on Linux/ARM64, with the following stacktrace:
This is an application that throws a bunch of exceptions, on top of lengthy callstacks (this one is 153 frames long).
It's important to note that the crash only occurs when our profiler is attached and we've ReJITted some methods. So it's entirely possible we're corrupting something. Still, the consistent location of the error and the fact that it happens only on ARM64 is fishy.
Using .NET 5.0.3.
I'll keep digging on my side to figure out if the issue is in the runtime or in our profiler.
coredump.zip