[BUG] occasional crashes in CI #1797
Copied from #1823: JVM crash in nightly build during UT, intermittently.
I will try to keep monitoring this to determine the root cause of the crash.
I think I found the issue. This is related to the concurrent modification exception that we occasionally see when the tests are shutting down. The stack trace for the bad free in hs_err_pid2506.log is when we are freeing a host buffer that was "leaked". I cannot tell the size of the host buffer, but from the stack trace it is not a pinned buffer, so it is not the pinned memory pool, which we expect to leak. But it is being cleaned by the MemoryCleaner on shutdown to verify any leaks that we might have. This is something that no one would turn on in production, so I think we are OK with shipping 0.4 without any fix in place. When I look at the cleaner code there is no locking. It was written with the assumption that GC would remove any need for locking, because an object would only show up in the queue once there are no references left to it. When we force the cleanup to happen at shutdown, there are now race conditions. I think we need to put some locking into the actual cleanup code for each buffer. It should be a simple change, and cheap, because there should be no lock contention in the common case.
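The race described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual cuDF cleaner code: `BufferCleaner`, `clean`, and the `frees` counter are illustrative names. The idea is that guarding the cleanup path with a lock makes a second, racing call a no-op, so only one caller ever frees the underlying resource.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: two callers (GC-driven cleanup and the forced
// shutdown pass) may both try to clean the same buffer. Synchronizing
// the cleanup method ensures the native free runs at most once.
class BufferCleaner {
    private boolean cleaned = false;                    // guarded by `this`
    static final AtomicInteger frees = new AtomicInteger();

    // Returns true only for the call that actually freed the resource.
    public synchronized boolean clean() {
        if (cleaned) {
            return false;                               // already freed; do nothing
        }
        cleaned = true;
        frees.incrementAndGet();                        // stands in for the native free
        return true;
    }
}
```

Because the lock is per-buffer and uncontended in the normal GC-driven path, the cost should be negligible in the common case.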
I spoke with @abellina, and he had a patch he was working on related to synchronization in the MemoryCleaner for UCX work. He is going to extend that patch to also cover what I suspect is the cause of this issue.
Because the bug is in the Java cuDF code and not in this code, I am going to target this to the 0.5 release. Also, the only way this can be triggered, if my analysis is correct, is when the debug leak detection is turned on, which no one should ever do in production.
Add synchronization in `cleanImpl` and `close` in various places where race conditions could exist, and also within the `MemoryCleaner`, to address some concurrent modification issues we've seen in tests while shutting down (i.e. invoking the cleaner; see NVIDIA/spark-rapids#1797).
Authors: Alessandro Bellina (@abellina)
Approvers: Robert (Bobby) Evans (@revans2), Jason Lowe (@jlowe)
URL: #7474
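The other half of the fix addresses the ConcurrentModificationException seen at shutdown: the shutdown hook iterates the cleaner's tracking collection while other threads are still registering and deregistering buffers. A minimal sketch of one way to make that iteration safe, assuming hypothetical names (`LeakTracker`, `register`, `reportLeaks` are illustrative, not the real cuDF API), is to back the tracking list with a concurrent collection whose iterator is weakly consistent:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: a ConcurrentLinkedQueue can be iterated safely
// while other threads add or remove entries, so the shutdown-time leak
// report never throws ConcurrentModificationException.
class LeakTracker {
    private final Queue<String> tracked = new ConcurrentLinkedQueue<>();

    void register(String buffer)   { tracked.add(buffer); }
    void deregister(String buffer) { tracked.remove(buffer); }

    // Called from the shutdown hook; the iterator is weakly consistent,
    // so concurrent mutation of `tracked` is safe during this loop.
    int reportLeaks() {
        int leaks = 0;
        for (String b : tracked) {
            leaks++;   // the real cleaner would log and force-clean `b` here
        }
        return leaks;
    }
}
```

The actual patch combined this kind of safe traversal with the per-buffer synchronization in `cleanImpl` and `close`.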
My fix targeted the cleaner synchronization. I'd say close this one and reopen if more CI crashes show up?
We are closing this one for now, as we haven't seen it recur since rapidsai/cudf#7474. If you happen to see a similar bug show up in CI, please reopen.
Describe the bug
I have been seeing some intermittent crashes in CI that look like they are caused by some kind of memory corruption, but I am not sure.