Backport arena MR fix for simultaneous access by PTDS and other streams #1396

Merged
merged 1 commit into from
Dec 4, 2023

@bdice bdice commented Dec 1, 2023

Description

This PR backports #1395 from 24.02 to 23.12. It contains an arena MR fix for simultaneous access by PTDS and other streams.

Backport requested by @sameerz @GregoryKimball.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

…rapidsai#1395)

Replaces rapidsai#1394, this is targeted for 24.02.

fixes rapidsai#1393

In Spark with the Spark RAPIDS accelerator, using a cuDF 23.12 snapshot, we have an application that reads ORC files, does some light processing, and then writes ORC files. It consistently fails during the ORC write with:

```
terminate called after throwing an instance of 'rmm::logic_error'
  what():  RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-594-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:238: allocation not found
```

The underlying issue arises because Spark with the RAPIDS accelerator uses the ARENA allocator with per-thread default streams (PTDS) enabled. cuDF recently added its own stream pool, which is used in addition to the per-thread default streams.
It's now possible to use per-thread default streams along with another pool of streams. This means that it's possible for memory to move from a thread or stream arena back into the global arena during a defragmentation and then move down into an arena of another type. For instance, thread arena -> global arena -> stream arena. If this happens and there was an allocation from that memory while it belonged to a thread arena, deallocation now has to check whether the allocation is part of a stream arena.
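
The migration described above can be sketched with a toy model. This is not RMM's code: `toy_arena`, `toy_arena_resource`, the integer IDs, and every function name below are hypothetical, and the model only assumes that memory blocks (called superblocks here) move between arenas while their live allocations stay inside them.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <utility>

// Toy model only; all names are hypothetical and this is NOT RMM's implementation.
struct toy_arena {
  // superblock id -> ids of live allocations carved out of that superblock
  std::map<int, std::set<int>> superblocks;

  // Returns true if this arena currently holds the allocation and released it.
  bool deallocate(int alloc) {
    for (auto& entry : superblocks) {
      if (entry.second.erase(alloc) > 0) { return true; }
    }
    return false;
  }
};

struct toy_arena_resource {
  toy_arena global;
  std::map<int, toy_arena> thread_arenas;  // keyed by a made-up thread id
  std::map<int, toy_arena> stream_arenas;  // keyed by a made-up stream id

  // Hand every superblock of `from` over to `to`, live allocations included.
  static void release_to(toy_arena& from, toy_arena& to) {
    for (auto& entry : from.superblocks) {
      to.superblocks[entry.first] = std::move(entry.second);
    }
    from.superblocks.clear();
  }

  // Defragmentation in this model: all thread and stream arenas return
  // their superblocks to the global arena.
  void defragment() {
    for (auto& entry : thread_arenas) { release_to(entry.second, global); }
    for (auto& entry : stream_arenas) { release_to(entry.second, global); }
  }

  // The global arena hands one superblock down to a stream arena.
  bool hand_to_stream(int stream, int superblock) {
    auto it = global.superblocks.find(superblock);
    if (it == global.superblocks.end()) { return false; }
    stream_arenas[stream].superblocks[superblock] = std::move(it->second);
    global.superblocks.erase(it);
    return true;
  }

  // Lookup that only considers thread arenas and the global arena.
  bool deallocate_thread_and_global_only(int alloc) {
    for (auto& entry : thread_arenas) {
      if (entry.second.deallocate(alloc)) { return true; }
    }
    return global.deallocate(alloc);
  }

  // Lookup that additionally checks the stream arenas.
  bool deallocate_all(int alloc) {
    for (auto& entry : thread_arenas) {
      if (entry.second.deallocate(alloc)) { return true; }
    }
    for (auto& entry : stream_arenas) {
      if (entry.second.deallocate(alloc)) { return true; }
    }
    return global.deallocate(alloc);
  }
};

// Walks through thread arena -> global arena -> stream arena.
bool demo_thread_to_stream_migration() {
  toy_arena_resource mr;
  const int superblock = 1, alloc = 42, thread = 7, stream = 3;
  mr.thread_arenas[thread].superblocks[superblock].insert(alloc);
  mr.defragment();                         // superblock returns to the global arena
  mr.hand_to_stream(stream, superblock);   // and is handed to a stream arena
  const bool narrow = mr.deallocate_thread_and_global_only(alloc);  // misses it
  const bool wide   = mr.deallocate_all(alloc);                     // finds it
  return !narrow && wide;
}
```

In this model, calling `demo_thread_to_stream_migration()` exercises the thread arena -> global arena -> stream arena path: the lookup restricted to thread arenas and the global arena fails to find the allocation, which corresponds to the "allocation not found" failure, while the lookup that also searches stream arenas succeeds.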

I added a test here. I was trying to make sure that all the allocations end up in stream arenas; if there is a better way to do this, please let me know.

Authors:
  - Thomas Graves (https://github.com/tgravescs)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Bradley Dice (https://github.com/bdice)
  - Rong Ou (https://github.com/rongou)
  - Mark Harris (https://github.com/harrism)

URL: rapidsai#1395
@github-actions github-actions bot added the cpp Pertains to C++ code label Dec 1, 2023
@bdice bdice added non-breaking Non-breaking change bug Something isn't working labels Dec 1, 2023
@raydouglass raydouglass changed the base branch from branch-24.02 to branch-23.12 December 1, 2023 22:18
@bdice bdice marked this pull request as ready for review December 1, 2023 22:21
@bdice bdice requested a review from a team as a code owner December 1, 2023 22:21
@bdice bdice requested review from rongou and cwharris December 1, 2023 22:21
@tgravescs

+1

@raydouglass raydouglass merged commit 0054957 into rapidsai:branch-23.12 Dec 4, 2023
45 checks passed