paltest_pal_sxs_test1 failing on NetBSD #5158
Comments
Related to https://github.com/dotnet/coreclr/issues/2090. That issue has the up-for-grabs label, so we can probably take a stab at it. |
Thanks. For now I will push a patch to disable this test on NetBSD as well and try to work with @mikem8361 towards investigating it. |
This test has also been disabled on FreeBSD; hardware exceptions always seem to abort on NetBSD as well. Related issues: dotnet/coreclr#2090 dotnet/coreclr#3287
This functionality is critical for CoreFX tests. Part of dlltest1.cpp.i
On NetBSD this test fails because an exception is thrown and never caught (it is a C++ exception raised with "throw"). @janvorli @jkotas do you have any pointers on what may be wrong or missing? The same issue likely exists on FreeBSD. |
|
DllTest1 fails the same way -- it doesn't matter that there are two DllTests. Calling just one or the other results in the same termination. |
@krytarowski I would try to run it under a debugger with breakpoints set to |
GDB/LLDB doesn't want to attach to |
Doesn't work either. I will try to investigate it. |
That's strange, it works fine on Linux; I've used it a lot for debugging in the past. Maybe the function names have a different number of underscores or something? |
|
OK, I know what was going on. I had to load the library first and make it visible to GDB:
|
It looks like I know that GNU libsigsegv works on NetBSD. I'm not sure about GCJ. |
GDB always catches the SIGSEGV first; you should just enter "c" (continue) after that. |
I see. You are right.
|
How can I extract useful information from this position, and what should I look for? |
Can you also dump the stack each time you hit a breakpoint? |
|
I will get back to it in the evening. |
@krytarowski My guess is that NetBSD is unable to propagate an exception through the signal trampoline, and probably cannot even unwind the stack through it, which would explain the nonsense frames 10 and 11. |
If I am right, then we will need to handle hardware errors differently, in a way similar to what we do on OSX: modify the context passed in by the signal, redirect it to an exception handling function, and then return from the signal. See HijackFaultingThread in src/pal/src/exception/machexception.cpp |
Thanks! I will have a look at it. |
@krytarowski, @janvorli, it looks like NetBSD’s C++ runtime isn’t allowing the thrown PAL_SEHException to be caught by the try/catch (which is wrapped in the PAL_TRY/PAL_EXCEPT/etc. macros) in the dlltest1/dlltest2 test code. In the dlltest1.cpp.i file I don’t actually see the “try” and “catch”. The CatchHardwareExceptionHolder count was non-zero (which is the reason the SEHProcessException code throws the exception), so the PAL_TRY/etc. macros must have gotten at least that far. The h/w exception holder “enable” is in the HardwareExceptionHolder macro inside the PAL_EXCEPT macro. I hope this helps. |
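To make the holder-count reasoning above concrete, here is a minimal sketch of the pattern, not the PAL's actual code. It assumes a per-thread counter that an RAII scope object increments, and a dispatch function that throws only while that counter is non-zero. `HardwareCatchScope`, `FakeHardwareException`, `DispatchFault` and `GuardedFault` are hypothetical names invented for this illustration, and the fault is simulated by a plain function call instead of a real signal.

```cpp
#include <cassert>

// Hypothetical stand-in for a "catch hardware exceptions here" holder.
// Each live instance bumps a per-thread depth counter.
class HardwareCatchScope {
public:
    HardwareCatchScope() { ++s_depth; }
    ~HardwareCatchScope() { --s_depth; }
    HardwareCatchScope(const HardwareCatchScope&) = delete;
    HardwareCatchScope& operator=(const HardwareCatchScope&) = delete;

    // True while at least one scope is active on the calling thread.
    static bool IsEnabled() { return s_depth > 0; }

private:
    static thread_local int s_depth;
};

thread_local int HardwareCatchScope::s_depth = 0;

// Hypothetical exception type carrying the faulting address.
struct FakeHardwareException {
    unsigned long faultAddress;
};

// Simulated dispatch decision: throw only if some frame has declared
// that it wants to catch. Returns false where a real runtime would
// refuse to throw (for example, by failing fast).
bool DispatchFault(unsigned long address) {
    if (!HardwareCatchScope::IsEnabled()) {
        return false;
    }
    throw FakeHardwareException{address};
}

// A guarded region: the scope is alive while the fault is dispatched,
// so the exception is thrown and lands in the catch block below.
unsigned long GuardedFault(unsigned long address) {
    try {
        HardwareCatchScope scope;
        DispatchFault(address);
    } catch (const FakeHardwareException& ex) {
        return ex.faultAddress;
    }
    return 0;
}
```

Calling `DispatchFault(0x11)` outside any scope returns `false`, while `GuardedFault(0x11)` returns the address carried by the exception. In the failing NetBSD case discussed here, the equivalent of the `throw` did run, but the matching `catch` was never reached.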
@mikem8361 - I think you may have missed the try/catch blocks in there, since they are all on one line. I have reformatted the code here so that they are clearly visible:
extern "C"
int DllTest1()
{
    Trace("Starting pal_sxs test1 DllTest1\n");
    {
        void* __param = 0;
        auto tryBlock = [](void* unused)
        {
            {
                volatile int* p = (volatile int *)0x11;
                bTry = 1;
                *p = 1;
                Fail("ERROR: code was executed after the access violation.\n");
            }
        };
        const bool isFinally = false;
        auto finallyBlock = []() {};
        EXCEPTION_DISPOSITION disposition = -1;
        auto exceptionFilter = [&disposition, &__param](PAL_SEHException& ex)
        {
            disposition = 1;
            do
            {
                if (!(disposition != -1))
                {
                    PAL_fprintf ((PAL_get_stderr(0)),
                        "ASSERT FAILED\n" "\tExpression: %s\n" "\tLocation: line %d in %s\n" "\tFunction: %s\n" "\tProcess: %d\n",
                        "disposition != EXCEPTION_CONTINUE_EXECUTION", 40,
                        "/tmp/pkgsrc-tmp/wip/coreclr-git/work/coreclr/src/pal/tests/palsuite/exception_handling/pal_sxs/test1/dlltest1.cpp",
                        __FUNCTION__, GetCurrentProcessId());
                    DebugBreak();
                }
            }while (0);
            return disposition;
        };
        try
        {
            CatchHardwareExceptionHolder __catchHardwareException;
            auto __exceptionHolder = NativeExceptionHolderFactory::CreateHolder(&exceptionFilter);
            __exceptionHolder.Push();
            tryBlock(__param);
        }
        catch (PAL_SEHException& ex)
        {
            if (disposition == -1)
            {
                exceptionFilter(ex);
            }
            if (disposition == 0)
            {
                throw;
            }
            {
                if (!bTry)
                {
                    Fail("ERROR: PAL_EXCEPT was hit without PAL_TRY being hit.\n");
                }
                if (ex.ExceptionRecord.ExceptionInformation[1] != 0x11)
                {
                    Fail("ERROR: PAL_EXCEPT ExceptionInformation[1] != 0x11\n");
                }
                bExcept = 1;
            }
        };
        if (isFinally)
        {
            try
            {
                tryBlock(__param);
            }
            catch (...)
            {
                finallyBlock();
                throw;
            }
            finallyBlock();
        }
    };
    if (!bTry)
    {
        Trace("ERROR: the code in the PAL_TRY block was not executed.\n");
    }
    if (!bExcept)
    {
        Trace("ERROR: the code in the PAL_EXCEPT block was not executed.\n");
    }
    if(!bTry || !bExcept)
    {
        Fail("DllTest1 FAILED\n");
    }
    Trace("DLLTest1 PASSED\n");
    return PASS;
} |
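The test above writes to address 0x11 and then checks that ExceptionInformation[1] reports that same address. As a rough, standalone illustration of where such an address comes from, the sketch below reads `si_addr` in a plain POSIX `SIGSEGV` handler. It deliberately swaps in a simpler recovery technique, `sigsetjmp`/`siglongjmp`, in place of the C++ throw that the PAL uses, so it says nothing about unwinding through the signal trampoline. `OnSegv` and `ProbeFaultAddress` are hypothetical names, and the sketch assumes a platform (such as Linux on x86-64) where a write to an unmapped low address raises `SIGSEGV` with `si_addr` set to the faulting address.

```cpp
#include <cassert>
#include <cstdint>
#include <setjmp.h>
#include <signal.h>

// Recovery point and the last fault address seen by the handler.
static sigjmp_buf g_recovery;
static volatile std::uintptr_t g_faultAddress = 0;

// Hypothetical handler: record the faulting address, then jump back to
// the recovery point instead of returning to the faulting instruction.
extern "C" void OnSegv(int, siginfo_t* info, void*) {
    g_faultAddress = reinterpret_cast<std::uintptr_t>(info->si_addr);
    siglongjmp(g_recovery, 1);
}

// Writes to `address`, which is expected to be unmapped, and returns the
// fault address reported by the kernel (0 if the write did not fault).
std::uintptr_t ProbeFaultAddress(std::uintptr_t address) {
    struct sigaction action = {};
    struct sigaction previous = {};
    action.sa_sigaction = OnSegv;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    int rc = sigaction(SIGSEGV, &action, &previous);
    assert(rc == 0);
    (void)rc;

    g_faultAddress = 0;
    // Saving the signal mask (second argument 1) lets siglongjmp
    // unblock SIGSEGV again when it leaves the handler.
    if (sigsetjmp(g_recovery, 1) == 0) {
        volatile int* p = reinterpret_cast<volatile int*>(address);
        *p = 1;  // faults; control resumes at sigsetjmp returning 1
    }

    // Restore whatever handler was installed before the probe.
    sigaction(SIGSEGV, &previous, nullptr);
    return g_faultAddress;
}
```

A call such as `ProbeFaultAddress(0x11)` mirrors the write in the test's try block. Jumping out with `siglongjmp` skips C++ destructors and exception handling entirely, which is why it is only an illustration here and not a substitute for the approach discussed in this thread.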
@krytarowski, @janvorli. Thanks. I see them now. But I don’t understand why on NetBSD they don’t catch the PAL_SEHException that is thrown. Your suggestion of looking at the mach exception hijacking may not help. On OSX that code sets up the faulting thread from the exception thread to end up in the same SEHProcessException code that throws the software exception. |
@mikem8361 As I've said to @krytarowski, the problem is most likely that the exception unwinding cannot unwind the stack across the signal handler trampoline. As you can see from the stack dumps, the frames displayed by GDB after the sigsegv_handler frame don't make sense, so GDB itself is not able to cross the signal trampoline either. |
I was debugging it with help from Christos Zoulas (christos / netbsd.org). It seems that the trick to switch context and throw from EH is to go for the following patch:
However it still requires stack restoration.
@janvorli do you have any pointers on how to continue? Perhaps some use of LLVM libunwind? Thanks! |
Another question:
In the above function we are saving pointer to a stack object ( |
Could that logic be rewritten with I was told that mixing EH and program context leads to dangerous situations: calling non-signal-safe functions may leave the program in an unpredictable state (like (Actually, I'm not volunteering to do this redesign myself, since the remaining CoreFX support bugs on NetBSD are much more important.) |
@krytarowski Implementing it the way we do on OSX would be more involved than this. We would need to create a fake frame on the stack that contains the context and the exception record and keeps the stack walkable for the unwinder, from the exception handler to the actual code that raised the exception. Basically, it is a stack frame whose return address on the stack is set to SEHProcessException; below that is the RBP of the context where the exception happened, then the context, and finally an address in the middle of a fake function that is never called but provides unwind info for the stack walker, like PAL_DispatchExceptionWrapper. Then you set the RIP in the ucontext like you do, and set the RSP in it to the address on the stack where the address in the middle of the fake function was stored. You also set RSI to point to the context and RDI to the exception record in that helper frame (the AMD64 calling convention passes the EXCEPTION_POINTERS in registers, and that is what we form in RDI/RSI this way).
I was originally thinking that we would just modify the context passed in by the signal handler and return from the handler, but that would have the problem you've mentioned: any function called between the return from the signal handler and the context restoration by the system could overwrite our fake stack frame with the context. Moreover, we would not be able to put the fake frame right below the faulting frame.
However, it seems there is still a potential problem: the red zone, in case the function that caused the exception was a leaf function. We would probably need to write the fake function in assembly with manual CFI annotations to allow us to skip the red zone. Or we can disable the red zone for all of our code using the -mno-red-zone compiler option.
Since a hardware exception outside of our code leads to fail fast, we don't have to care about polluting the red zone of the platform library functions. You are right that one has to be extremely careful about what to call from the signal handler. But again, we try to handle only exceptions in our managed code (and some of our native code); exceptions at other places cause an abort, so we just need to make sure we don't call any non-signal-safe function before the point where we check where the RIP of the faulting instruction is located. As for the idea of using kqueue and sigwait, I am not sure how we would do that. Did you have some specific ideas? |
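The red zone concern can be illustrated with a little address arithmetic. The sketch below is not taken from CoreCLR; it only assumes the System V AMD64 ABI facts that a leaf function may use the 128 bytes below RSP without adjusting the stack pointer, and that stack frames are kept 16-byte aligned. `PlaceHelperFrame` is a hypothetical helper that computes where a block of a given size could start so that it lies entirely below the faulting function's red zone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// System V AMD64 ABI: the 128 bytes below RSP belong to the current
// function (the "red zone") and must not be overwritten.
constexpr std::uintptr_t kRedZoneSize = 128;
constexpr std::uintptr_t kStackAlignment = 16;

// Hypothetical helper: returns the highest 16-byte-aligned address at
// which a block of `frameSize` bytes can start so that the whole block
// lies below the red zone of the faulting stack pointer.
constexpr std::uintptr_t PlaceHelperFrame(std::uintptr_t faultingRsp,
                                          std::size_t frameSize) {
    // First address below the red zone (the stack grows downwards).
    const std::uintptr_t top = faultingRsp - kRedZoneSize;
    // Reserve room for the block, then round down to the alignment.
    const std::uintptr_t base = top - frameSize;
    return base & ~(kStackAlignment - 1);
}
```

For example, with a faulting RSP of 0x1008 and a 24-byte block, the block has to end at or below 0x0F88, so the helper returns 0x0F70. Building with -mno-red-zone, as suggested above, would remove the need for the 128-byte gap in code compiled that way.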
Thank you for your feedback! I need to process it and try to produce a working solution. |
Surprisingly, I think the easiest solution is to write a kernel module to unwind the stack for the process. I'm researching the kernel internals for |
I'm not sure if a kernel module would be the best solution here. I'm not sure how that is viewed in the NetBSD world, but as far as I've understood on Linux, installing a kernel module is considered tainting the kernel. |
In the Linux world, tainting the kernel means inserting a module whose license is not the GPL (or one of a few GPL-compatible alternatives). The drawback is that one needs to be a superuser to insert it. Given my current knowledge of unwinding on AMD64, this is not trivial and it takes time. Thank you for your support! |
I'm still stuck with it. Is upstream considering redesigning the code and going for an |
Actually, some time ago I got an idea that I believe could work. It is similar to the thing we do for unwinding native frames during exception handling. There is a "StartUnwindingNativeFrames" function that gets a context and the PAL_SEHException to throw. What it does is that it sets the current context right below the frame from which the exception should get unwound by the C++ exception handling and calls a helper C++ function to throw the passed in PAL_SEHException. |
I will happily test it. Sadly, I have too little spare time to redesign the code myself. |
@krytarowski I can check the core of the idea on my Ubuntu and if it works, then it should not be complicated to finalize it. |
@krytarowski I have made a quick test for the idea. It would be nice if you could give it a try on BSD. Here is my branch with the experimental change: |
I tested your patch on NetBSD and it works!
Please push it to master. I'm now one month behind on porting .NET Core 1.0 to NetBSD... it could still happen! |
Great! It will still need a little cleanup though, so I guess I'll have a PR ready tomorrow. |
Fixed by @janvorli |
Just for the record, PR dotnet/coreclr#5140 solved this issue. I am observing a similar issue, but with an LTO build of CoreCLR on Ubuntu. I am going to test after that commit to see if it passes. |
Currently paltest_pal_sxs_test1 is disabled on FreeBSD. I would like to know why. Should I disable it on NetBSD as well? Is it testing crucial functionality?