paltest_pal_sxs_test1 failed on SmartOS x64 #35362
With line 67 commented out, the backtraces look like this: https://gist.github.com/am11/2464eff2db3a3ceadd3d56b976ee4a15. |
The ExceptionInformation[1] value comes from sigsegv_handler, which passes siginfo->si_addr to common_signal_handler, which in turn stores it in ExceptionInformation[1]. |
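A minimal, self-contained sketch of that flow, assuming a POSIX system: an SA_SIGINFO handler records siginfo->si_addr, and the caller compares it with the pointer that was dereferenced. The names probe_fault_address, on_segv, g_resume and g_fault_addr are hypothetical and do not come from the PAL sources; the real PAL dispatches through common_signal_handler rather than siglongjmp.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf g_resume;          /* where to resume after the fault */
static void *volatile g_fault_addr;  /* si_addr seen by the handler */

/* Hypothetical handler: record the faulting address and jump back out. */
static void on_segv(int sig, siginfo_t *info, void *context)
{
    (void)sig;
    (void)context;
    g_fault_addr = info->si_addr;
    siglongjmp(g_resume, 1);
}

/* Hypothetical helper: write through p (expected to fault) and return
 * the address the kernel reported, or NULL if nothing was reported. */
void *probe_fault_address(volatile int *p)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return NULL;

    g_fault_addr = NULL;
    /* savemask = 1 so that SIGSEGV is unblocked again after the jump. */
    if (sigsetjmp(g_resume, 1) == 0)
        *p = 1; /* access violation, as in FailingFunction */

    sigaction(SIGSEGV, &old, NULL);
    return g_fault_addr;
}
```

On a platform that reports the fault address as expected, probe_fault_address((volatile int *)0x22) should return 0x22, which is the comparison the failing test makes against ExceptionInformation[1].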
I put breakpoints in a few places in sigsegv_handler and common_signal_handler, and both get called as expected. At src/coreclr/src/pal/src/exception/signal.cpp:516, the failure address (captured from siginfo->si_addr) was 96070, which is the same (wrong) one assigned to
(gdb) c
Continuing.
Thread 2 hit Breakpoint 6, sigsegv_handler (code=11, siginfo=0xfffffc7fef0c3ef0, context=0xfffffc7fef0c3b90) at /home/am11/runtime/src/coreclr/src/pal/src/exception/signal.cpp:516
516 if ((failureAddress - (sp - GetVirtualPageSize())) < 2 * GetVirtualPageSize())
(gdb) print failureAddress
$8 = 96070
(gdb) bt
#0 sigsegv_handler (code=11, siginfo=0xfffffc7fef0c3ef0, context=0xfffffc7fef0c3b90) at /home/am11/runtime/src/coreclr/src/pal/src/exception/signal.cpp:516
#1 0xfffffc7fef245df6 in __sighndlr () from /lib/64/libc.so.1
#2 0xfffffc7fef238c3b in call_user_handler () from /lib/64/libc.so.1
#3 0xfffffc7fef238f6e in sigacthandler () from /lib/64/libc.so.1
#4 0xffffffffffffffff in ?? ()
#5 0x000000000000000b in ?? ()
#6 0xfffffc7fef0c3ef0 in ?? ()
#7 0x000000000000000f in ?? ()
#8 0x0000000000000000 in ?? ()
Could this be an indication of data getting corrupted somewhere in libc? |
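The condition at signal.cpp:516 in the session above is an unsigned range check: it asks whether the failure address falls inside a two-page window that starts one page below the stack pointer. A sketch of that arithmetic, assuming size_t operands and using a hypothetical helper name, is_stack_overflow_candidate; the page size is passed in as a parameter instead of being read from GetVirtualPageSize().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper mirroring the expression at signal.cpp:516.
 * Because the operands are unsigned, a failure address below
 * (sp - page_size) wraps around to a very large value and fails
 * the comparison, so a single "<" covers both bounds. */
bool is_stack_overflow_candidate(size_t failure_address, size_t sp,
                                 size_t page_size)
{
    return (failure_address - (sp - page_size)) < 2 * page_size;
}
```

With 4096-byte pages and an illustrative stack pointer of 0x7fff0000, a failure address of 96070 is nowhere near the window, so the check is false for it.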
Another strange thing is that dlltest2 passes and then dlltest1 fails. They are exactly the same except for the failure address. So maybe the first SIGSEGV breaks something in the kernel? |
I switched the calls:
--- a/src/coreclr/src/pal/tests/palsuite/exception_handling/pal_sxs/test1/exceptionsxs.cpp
+++ b/src/coreclr/src/pal/tests/palsuite/exception_handling/pal_sxs/test1/exceptionsxs.cpp
@@ -84,12 +84,12 @@ int main(int argc, char *argv[])
printf("PAL_SXS test1 SIGSEGV handler %p\n", oldAction.sa_sigaction);
- if (0 != InitializeDllTest1())
+ if (0 != InitializeDllTest2())
{
return FAIL;
}
- if (0 != InitializeDllTest2())
+ if (0 != InitializeDllTest1())
{
return FAIL;
}
I rebuilt and ran the test under the debugger without any breakpoints; it failed at the same place:
(gdb) r
Starting program: /home/am11/runtime/artifacts/obj/coreclr/SunOS.x64.Debug/src/pal/tests/palsuite/exception_handling/pal_sxs/test1/paltest_pal_sxs_test1
[Thread debugging using libthread_db enabled]
PAL_SXS test1 SIGSEGV handler 0
Starting pal_sxs test1 DllTest2
[New Thread 1 (LWP 1)]
Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1 (LWP 1)]
0xfffffc7fe7238653 in FailingFunction (p=0x22) at /home/am11/runtime/src/coreclr/src/pal/tests/palsuite/exception_handling/pal_sxs/test1/dlltest2.cpp:34
34 *p = 1; // Causes an access violation exception
(gdb) c
Continuing.
ERROR: PAL_EXCEPT ExceptionInformation[1] != 0x22
[Inferior 1 (process 270 ) exited with code 01]
(gdb) bt
No stack. |
I didn't mean to switch the initialization, I meant switching the calls to DllTest1 and DllTest2 functions. |
Ah, right. The current order is DllTest2, DllTest1, DllTest2. I changed it to DllTest1, DllTest2, DllTest2 and got these results:
(gdb) r
Starting program: /home/am11/runtime/artifacts/obj/coreclr/SunOS.x64.Debug/src/pal/tests/palsuite/exception_handling/pal_sxs/test1/paltest_pal_sxs_test1
[Thread debugging using libthread_db enabled]
PAL_SXS test1 SIGSEGV handler 0
Starting pal_sxs test1 DllTest1
[New Thread 1 (LWP 1)]
Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1 (LWP 1)]
0xfffffc7fe7d78653 in FailingFunction (p=0x11) at /home/am11/runtime/src/coreclr/src/pal/tests/palsuite/exception_handling/pal_sxs/test1/dlltest1.cpp:34
34 *p = 1; // Causes an access violation exception
(gdb) c
Continuing.
ERROR: PAL_EXCEPT ExceptionInformation[1] != 0x11
[Inferior 1 (process 4173 ) exited with code 01] |
DllTest2 is the first one (in master), which is why it was failing; see runtime/src/coreclr/src/pal/tests/palsuite/exception_handling/pal_sxs/test1/exceptionsxs.cpp, lines 97 to 100 in af36c6d.
Now I have put DllTest1 before 2, so DllTest1 is failing. Reordering has no effect. |
Oh, I somehow misread your debugging log; I thought that the first one was succeeding and the second one was failing. |
I tried to isolate the issue, but could not reproduce it with this program: http://sprunge.us/6rXQXO. It prints the correct values. |
Another strange thing is that the condition on line 65 only fails under the debugger, so it is not related to the actual test failure. I commented out two of the three tests:
// Test catching exceptions in other PAL instances
DllTest2();
- DllTest1();
- DllTest2();
+// DllTest1();
+// DllTest2();
and got a crash on the second SIGSEGV (in chaining):
PAL_SXS test1 SIGSEGV handler 0
Starting pal_sxs test1 DllTest2
DLLTest2 PASSED
Starting PAL_SXS test1 signal chaining
Segmentation Fault (core dumped) (exit code 139)
The truss output looks like this: https://gist.github.com/am11/682e8f2a6b22a551e78004506774047f. For comparison, here is the output of truss on FreeBSD x64: https://gist.github.com/am11/419ca311e2219c7f033155e2d2a44c6d |
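For readers unfamiliar with the "signal chaining" step that crashes here, the general pattern is that a newly installed handler keeps the previous struct sigaction and forwards the signal to it. The sketch below illustrates that pattern only; it is not the PAL implementation. It uses SIGUSR1 instead of SIGSEGV so it is safe to run, and every name in it (run_chaining_demo, chaining_handler, inner_handler, g_previous and the counters) is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

static struct sigaction g_previous;          /* handler that was installed first */
static volatile sig_atomic_t g_outer_calls;  /* calls seen by the chaining handler */
static volatile sig_atomic_t g_inner_calls;  /* calls forwarded to the older handler */

/* Stands in for the handler the test program registered first. */
static void inner_handler(int sig, siginfo_t *info, void *context)
{
    (void)sig; (void)info; (void)context;
    g_inner_calls++;
}

/* Stands in for a handler installed later that chains to the old one. */
static void chaining_handler(int sig, siginfo_t *info, void *context)
{
    g_outer_calls++;
    if (g_previous.sa_flags & SA_SIGINFO) {
        g_previous.sa_sigaction(sig, info, context);
    } else if (g_previous.sa_handler != SIG_DFL &&
               g_previous.sa_handler != SIG_IGN) {
        g_previous.sa_handler(sig);
    }
    /* SIG_DFL and SIG_IGN are deliberately not handled in this sketch. */
}

/* Returns 1 if one raised signal reached both handlers exactly once. */
int run_chaining_demo(void)
{
    struct sigaction inner, outer, saved;

    g_outer_calls = 0;
    g_inner_calls = 0;

    memset(&inner, 0, sizeof inner);
    inner.sa_sigaction = inner_handler;
    inner.sa_flags = SA_SIGINFO;
    sigemptyset(&inner.sa_mask);
    if (sigaction(SIGUSR1, &inner, &saved) != 0)
        return 0;

    memset(&outer, 0, sizeof outer);
    outer.sa_sigaction = chaining_handler;
    outer.sa_flags = SA_SIGINFO;
    sigemptyset(&outer.sa_mask);
    if (sigaction(SIGUSR1, &outer, &g_previous) != 0)
        return 0;

    raise(SIGUSR1);                   /* delivered before raise() returns */
    sigaction(SIGUSR1, &saved, NULL); /* restore the original disposition */
    return g_outer_calls == 1 && g_inner_calls == 1;
}
```

A production handler would also have to deal with a previous disposition of SIG_DFL, typically by restoring the default action and re-raising the signal; the sketch simply returns in that case.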
I added a few print messages:
--- a/src/coreclr/src/pal/src/exception/signal.cpp
+++ b/src/coreclr/src/pal/src/exception/signal.cpp
@@ -251,6 +251,7 @@ void SEHCleanupSignals()
restore_signal(SIGTRAP, &g_previous_sigtrap);
restore_signal(SIGFPE, &g_previous_sigfpe);
restore_signal(SIGBUS, &g_previous_sigbus);
+system("echo detaching pal internal sigsegv_handler");
restore_signal(SIGSEGV, &g_previous_sigsegv);
restore_signal(SIGINT, &g_previous_sigint);
restore_signal(SIGQUIT, &g_previous_sigquit);
@@ -505,6 +506,7 @@ Parameters :
--*/
static void sigsegv_handler(int code, siginfo_t *siginfo, void *context)
{
+system("echo pal internal sigsegv_handler");
if (PALIsInitialized())
{
// First check if we have a stack overflow
--- a/src/coreclr/src/pal/tests/palsuite/exception_handling/pal_sxs/test1/exceptionsxs.cpp
+++ b/src/coreclr/src/pal/tests/palsuite/exception_handling/pal_sxs/test1/exceptionsxs.cpp
@@ -96,8 +96,8 @@ int main(int argc, char *argv[])
// Test catching exceptions in other PAL instances
DllTest2();
- DllTest1();
- DllTest2();
+// DllTest1();
+ // DllTest2();
if (bHandler)
{
On FreeBSD, we get:
On SmartOS, we get:
and it continues until the process is terminated. It looks like the second SIGSEGV starts a never-ending loop. |
Does the runtime make use of in-process profiling through |
Also I would be very careful using system(3C) in a signal handler, even just for debugging. It is listed with an MT-Level attribute of Unsafe, which as per my reading of attributes(5) means it is not safe for use when multiple threads or signals are at play. If it were safe in a signal handler, I would expect it to be marked Async-Signal-Safe. |
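As an alternative for this kind of quick debugging, POSIX lists write(2) as async-signal-safe, so a handler can format a value by hand and emit it with write. A minimal sketch under that assumption; format_hex and log_fault_address are hypothetical helpers, not PAL functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: write "0x" followed by the lowercase hex digits of
 * value into buf (no terminating NUL). Returns the number of bytes
 * written, or 0 if cap is too small. Uses no library calls, so it can be
 * called from a signal handler. */
size_t format_hex(uintptr_t value, char *buf, size_t cap)
{
    static const char digits[] = "0123456789abcdef";
    char tmp[2 * sizeof value]; /* enough for every hex digit of value */
    size_t n = 0, len = 0;

    do {
        tmp[n++] = digits[value & 0xf];
        value >>= 4;
    } while (value != 0);

    if (cap < n + 2)
        return 0;
    buf[len++] = '0';
    buf[len++] = 'x';
    while (n > 0)
        buf[len++] = tmp[--n]; /* digits were produced in reverse order */
    return len;
}

/* Hypothetical helper: print "<prefix><address>\n" to stderr using only
 * write(2). prefix_len is passed explicitly to avoid calling strlen. */
void log_fault_address(const char *prefix, size_t prefix_len, const void *addr)
{
    char buf[2 + 2 * sizeof(uintptr_t) + 1];
    size_t len = format_hex((uintptr_t)addr, buf, sizeof buf - 1);
    ssize_t ignored;

    buf[len++] = '\n';
    ignored = write(STDERR_FILENO, prefix, prefix_len);
    ignored = write(STDERR_FILENO, buf, len);
    (void)ignored;
}
```

Inside a handler this could be called as log_fault_address("si_addr=", 8, siginfo->si_addr), which would also print the value that was actually detected rather than only the one that was expected.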
Afaict, setitimer is not used in coreclr code; the only use of setitimer is in one of the libunwind's test code, which is passing in upstream repo on SmartOS 2020. Also, SIGPROF is not used in coreclr (mono's POSIX library uses it in one of its APIs, but that is separate).
Thanks, I have removed it. It was just for a quick peek in the absence of printf/std::cout in that context (PAL redefines some of the print functions, and tests are selectively linked with components). The overview of this test failure is as follows:
ps - aside from |
To investigate this further, I see that your error message tells you that the expected value was not detected -- but it'd be good to print the value that was detected. In addition, our
It might help to ask the runtime linker which shared object a particular PC value comes from, too, to see which frame was interrupted; e.g.,
    /* Fragment for use inside the signal handler; dladdr() needs <dlfcn.h>,
     * ucontext_t needs <ucontext.h>, uintptr_t needs <stdint.h>. */
    ucontext_t *uc = /* value from handler arguments */;
    /* Walk the chain of interrupted contexts, innermost first. */
    while (uc != NULL) {
        Dl_info_t dli;
        void *pc = (void *)(uintptr_t)uc->uc_mcontext.gregs[REG_PC];
        /* report pc */
        if (dladdr(pc, &dli) != 0) {
            /* report dli.dli_fname (path of the object containing pc) */
        }
        uc = uc->uc_link;
    } |
I agree, we should emit the faulting address at
Thank you. I will try to walk up the chain to find more details. Looks like Linux/BSD also define
Could you give some pointers on how -- which linker flag to set in order to obtain this shared object to PC/IP mappings? We are using Solaris native ld (as opposed to |
@jclulow, if you want to give it a try, you can use my branch with the libunwind upgrade commit:
sudo pkgin -y install git mozilla-rootcerts cmake icu py37-expat gcc7 gmake gdb-7
git clone https://github.com/am11/runtime --branch feature/solaris/pal-test-fixes --single-branch --depth 1
# skip everything and only build coreclr with PAL tests
runtime/src/coreclr/build-runtime.sh -skipgenerateversion -nopgooptimize \
    -cmakeargs -DCLR_CMAKE_BUILD_TESTS=1 -gcc
With seven physical cores, the build took 11m33.128s on a clean tree; subsequent runs (without git-clean) are faster. The test executable |
Failed again:
Failed tests:
Error message:
|
@JulieLeeMSFT, this is a different OS (illumos), while outerloop failure is in linux-musl-arm. Opened #81113. |
@gwr, @AustinWise, the PAL tests instructions are here: https://github.com/dotnet/runtime/blob/aed5c225ded0fecf98988ac3736d0f6399a82df3/docs/workflow/testing/coreclr/testing.md#pal-tests-macos-and-linux-only (it is meant to be |
The following condition is failing on SmartOS x64:
runtime/src/coreclr/src/pal/tests/palsuite/exception_handling/pal_sxs/test1/dlltest2.cpp
Lines 65 to 67 in 0fbd88e
Based on the discussion in #5158, I have captured the backtraces and printed the value of ex.GetExceptionRecord()->ExceptionInformation[1], which is different in every run, but never the expected 0x22: https://gist.github.com/am11/0bef90cabb1185d41a93c456e9083b4d.
cc @janvorli, if I comment out this FAIL line, it throws SIGABRT a few lines later.