Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDK11 Linux ppc64le java/util/Hashtable/SerializationDeadlock.java *** stack smashing detected *** #10630

Closed
JasonFengJ9 opened this issue Sep 17, 2020 · 14 comments · Fixed by eclipse-omr/omr#5574

Comments

@JasonFengJ9
Copy link
Member

Failure link

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.openjdk_ppc64le_linux_xl_Nightly/148/consoleFull

Optional info

Failure output (captured from console output)

01:11:38  --------------------------------------------------
01:11:38  TEST: java/util/Hashtable/SerializationDeadlock.java
01:11:38  TEST JDK: /home/jenkins/workspace/Test_openjdk11_j9_sanity.openjdk_ppc64le_linux_xl_Nightly/openjdkbinary/j2sdk-image
01:11:38  

01:11:38  stderr:
01:11:38  *** stack smashing detected ***: /home/jenkins/workspace/Test_openjdk11_j9_sanity.openjdk_ppc64le_linux_xl_Nightly/openjdkbinary/j2sdk-image/bin/java terminated
01:11:38  ======= Backtrace: =========
01:11:38  /lib64/libc.so.6(__fortify_fail+0x54)[0x3fff7944d1f4]
01:11:38  /lib64/libc.so.6(__stack_chk_fail+0x20)[0x3fff7944d190]
01:11:38  /home/jenkins/workspace/Test_openjdk11_j9_sanity.openjdk_ppc64le_linux_xl_Nightly/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so(+0xb26a24)[0x3fff6d336a24]

...

01:11:38  ======= Memory map: ========
01:11:38  00010000-00020000 rw-p 00000000 00:00 0 
01:11:38  116c60000-116c70000 r-xp 00000000 fd:00 1181615                          /home/jenkins/workspace/Test_openjdk11_j9_sanity.openjdk_ppc64le_linux_xl_Nightly/openjdkbinary/j2sdk-image/bin/java
01:11:38  116c70000-116c80000 r--p 00000000 fd:00 1181615                          /home/jenkins/workspace/Test_openjdk11_j9_sanity.openjdk_ppc64le_linux_xl_Nightly/openjdkbinary/j2sdk-image/bin/java
01:11:38  116c80000-116c90000 rw-p 00010000 fd:00 1181615                          /home/jenkins/workspace/Test_openjdk11_j9_sanity.openjdk_ppc64le_linux_xl_Nightly/openjdkbinary/j2sdk-image/bin/java
01:11:38  1003b8c0000-1003b8f0000 rw-p 00000000 00:00 0     

...
@pshipton
Copy link
Member

Changes from the previous build:
1d44c6a...a3acbbf
eclipse-omr/omr@c72851c...af16e89
adoptium/aqa-tests@403f848...005b311

@andrewcraik @r30shah could this be caused by #10366 ?

@r30shah
Copy link
Contributor

r30shah commented Sep 17, 2020

@pshipton Let me take a look.

@r30shah
Copy link
Contributor

r30shah commented Sep 17, 2020

@pshipton Looking into core-dump, I think this failure is not related to the #10366 as On the stack I am seeing,

(gdb) info symbol 0x3fff6d25cc48
TR::RegDepCopyRemoval::makeFreshCopy(short) + 1320 in section .text of /team/rahil/P/openjdk-tests/TKG/test_output_16003173712911/jdk_util_j9_0/work/JDK/jdk/lib/default/libj9jit29.so

So it is more likely related to eclipse-omr/omr#5219 and hitting ASSERT at https://github.com/eclipse/omr/blob/25e93bed006afa4333c1a1c5f6a9a24342da3ecc/compiler/optimizer/RegDepCopyRemoval.cpp#L441

I am trying to reproduce this locally/internally as core-dump/jitdump in the test does not provide more information.

@r30shah
Copy link
Contributor

r30shah commented Sep 18, 2020

Would it be possible to launch a grinder of 100 runs on the Eclipse CI ? I ran the test internally on PPCLE 64 machine for 100 runs but can not reproduce the failure, if we can get it fairly reproducible, then I would have to get access to the external machine to continue investigation.

@pshipton
Copy link
Member

pshipton commented Sep 18, 2020

I launched 2 grinders, one on the same machine where it originally failed.
https://ci.eclipse.org/openj9/job/Grinder/1054 - 100x cent7-ppcle-5 - passed
https://ci.eclipse.org/openj9/job/Grinder/1055 - 100x cent7-ppcle-1 - passed

Started more
https://ci.eclipse.org/openj9/job/Grinder/1056 - 100x cent7-ppcle-5 - passed
https://ci.eclipse.org/openj9/job/Grinder/1057 - 100x cent7-ppcle-6 - passed
https://ci.eclipse.org/openj9/job/Grinder/1058 - 20x jdk_util_j9_0 ub16-ppcle-3 - passed

@r30shah
Copy link
Contributor

r30shah commented Sep 21, 2020

None of the grinders from #10630 (comment) had failures (0/420).
For context, hitting this ASSERT means, we have encountered a node under GlRegDeps node which is not *regLoad and if it is the PassThrough node then we have not seen the regStore of the child node in extended basic block. Unfortunately failure in https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.openjdk_ppc64le_linux_xl_Nightly/148/consoleFull does not print the assert message which may have provided more information (It prints type of the node if it is not what we expect) to get more information about the failure.

@pshipton
Copy link
Member

Can we update the assert to provide more information in case the problem occurs again?

@r30shah
Copy link
Contributor

r30shah commented Sep 22, 2020

Can we update the assert to provide more information in case the problem occurs again?

Let me see, currently it provides information about the node that was encountered but not expected (Information contains node index and opcode).

I think this kind of failure would benefit most from good jitdump. I see in the stack trace in https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.openjdk_ppc64le_linux_xl_Nightly/148/consoleFull , it was in middle of processing the jitdump and fails. Looked through the jitdump from the failure, tree looked OK ( I can not find any incorrect GlRegDeps).
@fjeremic Your changes for jitdmp improvements are committed right? Do you think they could have helped here?

@fjeremic
Copy link
Contributor

@r30shah yes this failure generated a useful jitdump showing a clear error:

TREE VERIFICATION ERROR -- node [    0x3ffea5945ff0] ref count is 11 and should be 10
BLOCK VERIFICATION ERROR -- node [    0x3ffea5945ff0] accessed outside of its (extended) basic block: 1 time(s)
n59320n   BBStart <block_1729> (freq 16) ()                                                   [    0x3ffea5945dc0] bci=[48,10,234] rc=0 vc=2 vn=- li=- udi=- nc=1 flg=0x20
n59328n     GlRegDeps ()                                                                      [    0x3ffea5946040] bci=[48,10,234] rc=1 vc=2 vn=- li=1 udi=- nc=6 flg=0x20
n59327n       aRegLoad gr30   <temp slot 5>[#4231  Auto] [flags 0x20004007 0x0 ]              [    0x3ffea5945ff0] bci=[48,7,234] rc=11 vc=2 vn=- li=10 udi=- nc=0
n59326n       iRegLoad gr19   unshared<parm 1 Z>[#385  Parm] [flags 0x40000103 0x0 ] (cannotOverflow SeenRealReference )  [    0x3ffea5945fa0] bci=[-1,188,2237] rc=7 vc=2 vn=- ln59325n       iRegLoad gr20   enumHandle<auto slot 3>[#418  Auto] [flags 0x3 0x0 ] (cannotOverflow SeenRealReference )  [    0x3ffea5945f50] bci=[61,4,4093] rc=8 vc=2 vn=- li=8
n59324n       aRegLoad gr22   this<'this' parm Ljava/io/ObjectInputStream;>[#384  Parm] [flags 0x40000107 0x0 ] (X!=0 X>=0 SeenRealReference sharedMemory )  [    0x3ffea5945f00]n59323n       aRegLoad gr23   cl<auto slot 7>[#433  Auto] [flags 0x4007 0x0 ]                 [    0x3ffea5945eb0] bci=[48,7,234] rc=2 vc=2 vn=- li=2 udi=- nc=0
n59322n       aRegLoad gr28   name<auto slot 5>[#429  Auto] [flags 0x4007 0x0 ] (SeenRealReference sharedMemory )  [    0x3ffea5945e60] bci=[48,7,234] rc=2 vc=2 vn=- li=2 udi=- n39853n   checkcast [#86]                                                                     [    0x3ffea60998d0] bci=[48,10,234] rc=0 vc=2 vn=- li=- udi=- nc=2
n59327n     ==>aRegLoad
n39852n     loadaddr  java/lang/Enum[#4441  Static] [flags 0x18307 0x0 ]                      [    0x3ffea6099880] bci=[48,10,234] rc=1 vc=2 vn=- li=1 udi=- nc=0
n39859n   ifacmpeq --> block_1554 BBStart at n54225n ()                                       [    0x3ffea6099ab0] bci=[48,15,235] rc=0 vc=2 vn=- li=- udi=- nc=3 flg=0x20
n59327n     ==>aRegLoad
n39856n     aconst NULL (X==0 sharedMemory )                                                  [    0x3ffea60999c0] bci=[48,15,235] rc=1 vc=2 vn=- li=1 udi=- nc=0 flg=0x2
n54228n     GlRegDeps ()                                                                      [    0x3ffea74b2620] bci=[48,15,235] rc=1 vc=2 vn=- li=1 udi=- nc=2 flg=0x20
n59322n       ==>aRegLoad
n59323n       ==>aRegLoad
n39844n   BBEnd </block_1729>                                                                 [    0x3ffea6099600] bci=[48,15,235] rc=0 vc=2 vn=- li=- udi=- nc=0

You can find the jitdump in the Artifacts safed on the build:

Filips-MacBook-Pro:openjdk-tests fjeremic$ find . -name jitdump*
./TKG/test_output_16003173712911/jdk_util_j9_0/work/java/util/HashSet/Serialization/jitdump.20200917.020905.9297.0004.dmp

@fjeremic
Copy link
Contributor

By the way @r30shah we should use TR_ASSERT_FATAL_WITH_NODE for that assert you linked which gives a lot more context.

@r30shah
Copy link
Contributor

r30shah commented Sep 23, 2020

@fjeremic Yes, checked it again with the context of #10671 where I did see the ASSERT message as well which is missing from this failure. But I can verify that the node in question pointed by the TREE VERIFICATION ERROR -- node [ 0x3ffea5945ff0] ref count is 11 and should be 10 is the one that is used under one of the GlRegDeps where is is under PassThrough (in this case both passthrough and regload under it uses same register which would not have the corresponding regStore).
I will change the ASSERT to use TR_ASSERT_FATAL_WITH_NODE

r30shah added a commit to r30shah/omr that referenced this issue Sep 23, 2020
Fix the assert added in RegDepCopyRemoval to consider missing case
where child of GlRegDeps is PassThrough with the regLoad child and
both uses same register.

Fixes: eclipse-openj9/openj9#10630

Signed-off-by: Rahil Shah <[email protected]>
@r30shah
Copy link
Contributor

r30shah commented Sep 24, 2020

@pshipton Wanted to verify if we can deliver this in 0.23 branch, I will open PR against it.

r30shah added a commit to r30shah/openj9-omr that referenced this issue Sep 24, 2020
Fix the assert added in RegDepCopyRemoval to consider missing case
where child of GlRegDeps is PassThrough with the regLoad child and
both uses same register.

Fixes: eclipse-openj9/openj9#10630

Signed-off-by: Rahil Shah <[email protected]>
@pshipton
Copy link
Member

pshipton commented Sep 24, 2020

@r30shah Pls cherry pick the change against https://github.com/eclipse/openj9-omr/tree/v0.23.0-release and let me know the PR #.

@r30shah
Copy link
Contributor

r30shah commented Sep 24, 2020

Thanks @pshipton , here is the PR eclipse-openj9/openj9-omr#79

qasimy123 pushed a commit to qasimy123/omr that referenced this issue Feb 24, 2022
Fix the assert added in RegDepCopyRemoval to consider missing case
where child of GlRegDeps is PassThrough with the regLoad child and
both uses same register.

Fixes: eclipse-openj9/openj9#10630

Signed-off-by: Rahil Shah <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants