[BUG] occasional crashes in CI #1797
Copied from #1823: JVM crash in nightly build during UT, intermittently.
I will try to keep monitoring this to determine the root cause of the crash.
I think I found the issue. This is related to the concurrent modification exception that we occasionally see when the tests are shutting down. The stack trace for the bad free in hs_err_pid2506.log is when we are freeing a host buffer that was "leaked". I cannot tell the size of the host buffer, but from the stack trace it is not a pinned buffer, so it is not the pinned memory pool, which we expect to leak. But it is being cleaned by the MemoryCleaner on shutdown to verify any leaks that we might have. This is something that no one would turn on in production, so I think we are OK with shipping 0.4 without any fix in place. When I look at the cleaner code there is no locking. It was written with the assumption that GC would remove any need for locking, because an object would only show up in the queue once there are no references left to it. When we force the cleanup to happen at shutdown, there are now race conditions. I think we need to put some locking into the actual cleanup code for each buffer. It should be a simple change, and cheap, because there should be no lock contention in the common case.
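The race described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual cuDF cleaner code: `BufferCleaner`, `clean`, and the `frees` counter are illustrative names. The idea is that guarding the cleanup path with a lock makes a second, racing call a no-op, so only one caller ever frees the underlying resource.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: two callers (GC-driven cleanup and the forced
// shutdown pass) may both try to clean the same buffer. Synchronizing
// the cleanup method ensures the native free runs at most once.
class BufferCleaner {
    private boolean cleaned = false;                    // guarded by `this`
    static final AtomicInteger frees = new AtomicInteger();

    // Returns true only for the call that actually freed the resource.
    public synchronized boolean clean() {
        if (cleaned) {
            return false;                               // already freed; do nothing
        }
        cleaned = true;
        frees.incrementAndGet();                        // stands in for the native free
        return true;
    }
}
```

Because the lock is per-buffer and uncontended in the normal GC-driven path, the cost should be negligible in the common case.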
I spoke with @abellina, and he had a patch he was working on related to synchronization in the MemoryCleaner for UCX work. He is going to extend that patch to also cover what I suspect is the cause of this issue.
Because the bug is in the Java cuDF code and not in this code, I am going to target this to the 0.5 release. Also, the only way this can be triggered, if my analysis is correct, is when the debug leak detection is turned on, which no one should ever do in production.
Add synchronization in `cleanImpl` and `close` in various places where race conditions could exist, and also within the `MemoryCleaner`, to address some concurrent modification issues we've seen in tests while shutting down (i.e. invoking the cleaner; see NVIDIA/spark-rapids#1797).
Authors: Alessandro Bellina (@abellina)
Approvers: Robert (Bobby) Evans (@revans2), Jason Lowe (@jlowe)
URL: #7474
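The other half of the fix addresses the ConcurrentModificationException seen at shutdown: the shutdown hook iterates the cleaner's tracking collection while other threads are still registering and deregistering buffers. A minimal sketch of one way to make that iteration safe, assuming hypothetical names (`LeakTracker`, `register`, `reportLeaks` are illustrative, not the real cuDF API), is to back the tracking list with a concurrent collection whose iterator is weakly consistent:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: a ConcurrentLinkedQueue can be iterated safely
// while other threads add or remove entries, so the shutdown-time leak
// report never throws ConcurrentModificationException.
class LeakTracker {
    private final Queue<String> tracked = new ConcurrentLinkedQueue<>();

    void register(String buffer)   { tracked.add(buffer); }
    void deregister(String buffer) { tracked.remove(buffer); }

    // Called from the shutdown hook; the iterator is weakly consistent,
    // so concurrent mutation of `tracked` is safe during this loop.
    int reportLeaks() {
        int leaks = 0;
        for (String b : tracked) {
            leaks++;   // the real cleaner would log and force-clean `b` here
        }
        return leaks;
    }
}
```

The actual patch combined this kind of safe traversal with the per-buffer synchronization in `cleanImpl` and `close`.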
My fix targeted the cleaner synchronization. I'd say close this one and reopen if more CI crashes show up?
We are closing this one for now, as we haven't seen it recur since rapidsai/cudf#7474. If you happen to see a similar bug show up in CI, please reopen.
Describe the bug
I have been seeing some intermittent crashes in CI that look like they are caused by some kind of memory corruption, but I am not sure.