
[BUG] raft::allocate bypasses RMM #308

Closed
harrism opened this issue Aug 3, 2021 · 15 comments
Labels
bug Something isn't working inactive-30d inactive-90d

Comments

harrism (Member) commented Aug 3, 2021

Describe the bug

/** cuda malloc */
template <typename Type>
void allocate(Type*& ptr, size_t len, bool setZero = false) {
  CUDA_CHECK(cudaMalloc((void**)&ptr, sizeof(Type) * len));
  if (setZero) CUDA_CHECK(cudaMemset(ptr, 0, sizeof(Type) * len));
}

Surprised to see that this quite commonly called function (in RAFT and cuML) calls directly into cudaMalloc, making memory pool use impossible.

harrism added the bug (Something isn't working) label on Aug 3, 2021
dantegd (Member) commented Aug 3, 2021

Not sure where you found that allocate; indeed, it shouldn't be around unless it's perhaps for a test. But in general, the allocate that is used by RAFT (and cuML) is here:

void* ptr = rmm::mr::get_current_device_resource()->allocate(n, stream);

(which is being replaced to use RMM directly, but it works with the pool allocator currently AFAIK), so if something is using that allocate you mention, it is a bug and should be removed.

Cc @viclafargue: since you're doing the RAFT RMM refactor, it'd be good to double-check that nothing in the main code is using this cudaMalloc.
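For context, the reason that path works with the pool allocator is that get_current_device_resource() returns whichever resource was most recently installed for the device. A minimal sketch, assuming the RMM API of that era (the size 256 and stream are illustrative):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

int main()
{
  // Upstream resource that calls cudaMalloc/cudaFree directly.
  rmm::mr::cuda_memory_resource upstream;

  // Pool that sub-allocates from large slabs obtained upstream.
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool{&upstream};

  // Install the pool as the current device resource.
  rmm::mr::set_current_device_resource(&pool);

  // This allocation is now served from the pool, not a fresh cudaMalloc.
  auto* mr = rmm::mr::get_current_device_resource();
  void* p  = mr->allocate(256, rmm::cuda_stream_default);
  mr->deallocate(p, 256, rmm::cuda_stream_default);
}
```

Anything that bypasses get_current_device_resource() (such as a raw cudaMalloc) is invisible to this configuration, which is exactly the bug being reported.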

viclafargue (Contributor) commented:

Yes, with the RAFT refactor, raft::allocate should now use RMM:

template <typename Type>
void allocate(Type*& ptr, size_t len, cudaStream_t stream, bool setZero = false) {
  size_t size = len * sizeof(Type);
  ptr = (Type*)rmm::mr::get_current_device_resource()->allocate(size, stream);
  if (setZero) CUDA_CHECK(cudaMemset((void*)ptr, 0, size));
  std::lock_guard<std::mutex> _(mutex_);
  allocations[ptr] = size;
}

The new design of raft::allocate, raft::deallocate, and raft::deallocate_all is still in progress though.

harrism (Member, Author) commented Aug 4, 2021

@dantegd it's in <raft/cudart_utils.h>

And it's called all over the place (both RAFT and cuML), not just in tests. Just search the codebase for raft::allocate( and you will see.

harrism (Member, Author) commented Aug 4, 2021

Moreover, very little code should need to call memory_resource::allocate: only thrust allocators. Everything else should use containers: device_buffer for untyped or byte data, device_uvector and device_vector for typed data. No raw pointers. Searching the codebase for the string allocate( should turn up very few results.
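The container approach above can be sketched as follows (a minimal illustration assuming RMM's public headers; the function, sizes, and names are hypothetical):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/device_uvector.hpp>

void example(rmm::cuda_stream_view stream)
{
  // Untyped byte storage: RAII-owned, allocated stream-ordered from the
  // current device resource, freed automatically on destruction.
  rmm::device_buffer bytes{256, stream};

  // Typed, uninitialized device storage for 100 ints; same ownership model.
  rmm::device_uvector<int> ints{100, stream};

  // No cudaMalloc/cudaFree, no mr->allocate()/deallocate(), and no raw
  // owning pointers anywhere in user code.
}
```

Because the containers capture the memory resource and stream at construction, swapping in a pool or tracking resource requires no changes to the code that uses them.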

As an example, if you search for this string in libcudf it comes up 10 times, 8 of which are inside allocators. And I think all 10 will be gone from libcudf in the near future.

dantegd (Member) commented Aug 9, 2021

@harrism I fully agree, though I commented that it is mostly in tests since I did the search for raft::allocate( and found it in a lot of tests. I couldn't find it in any non-test files in RAFT, and only in 2 non-test files in cuML:

cpp/src/ml_mg_utils.cuh
cpp/src_prims/metrics/scores.cuh

I was just curious whether I had missed some place else.

Regardless, the PR from @viclafargue should solve the bypassing issue. And yes, the whole codebase is moving to use containers directly and has been for a while; I can't find a tracking issue on a quick search, though. @divyegala, would you happen to know if we have an issue for the use of containers?

divyegala (Member) commented:

@dantegd @harrism I don't believe we have an issue in cuML for directly using containers. I am not against using raft::allocate as long as it's only in tests.

harrism (Member, Author) commented Aug 10, 2021

OK, I see now it's mostly in tests. I do see it in src_prims/metrics/scores.cuh and src/ml_mg_utils.cuh.

That said, why is it OK for test code to be lower quality than core library code? If the tests allocate most memory using a raw byte allocator, that means they are using raw pointers. No Raw Pointers is an important goal to achieve, so I would add a corollary: No Raw Allocation.

Second, bypassing RMM in tests is a bad idea, as it means you can't easily switch the underlying memory resource used in the tests to help isolate behavior happening in real apps where RMM MRs are used.

viclafargue (Contributor) commented:

@harrism I could remove it from src, src_prims and bench. I agree that it's probably better to avoid using raw allocation in tests as well. However, we would like to merge the changes in RMM, RAFT, cuML, cuGraph and cuHornet as soon as possible to avoid possible future conflicts with PRs people are currently working on. It could be interesting to have a follow-up PR to remove any call to raft::allocate in testing as well.

> Second, bypassing RMM in tests is a bad idea, as it means you can't easily switch the underlying memory resource used in the tests to help isolate behavior happening in real apps where RMM MRs are used.

raft::allocate used in tests should not bypass RMM anymore. It now calls rmm::mr::get_current_device_resource(), which should return a pointer to an RMM resource that follows the RMM configuration. Please correct me if I'm missing something, though. However, again, it might indeed be better to remove all calls to raft::allocate in the end, maybe in a follow-up PR.

dantegd (Member) commented Sep 7, 2021

@viclafargue this should now have been dealt with by #286, correct? Should we update this issue or open a new one about removing raft::allocate entirely?

viclafargue (Contributor) commented:

raft::allocate should indeed now use RMM thanks to #286. I just opened two new issues (#323 and rapidsai/cuml#4197) to remove any calls to raft::allocate in RAFT and cuML.

harrism (Member, Author) commented Oct 6, 2021

I'm reopening this because there are still calls to rmm::mr::get_current_device_resource()->allocate(size, stream); in RAFT. These will cause a problem in the near future as we transition to replacing the rmm::device_memory_resource interface with the proposed cuda::memory_resource and cuda::stream_ordered_memory_resource interfaces in libcu++. Those interfaces will somewhat change the API: for example, there will be separate allocate() and allocate_async() functions, the latter accepting a stream. There will also be an optional alignment parameter.

This is a great example of why RMM clients should use rmm::device_buffer for raw byte allocation rather than direct allocation using mr::allocate(). "No raw pointers."
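The suggested replacement can be sketched as follows (a hypothetical helper; assumes RMM's public headers, and the names are illustrative):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

// Instead of:
//   void* p = rmm::mr::get_current_device_resource()->allocate(size, stream);
// (which requires a matching manual deallocate with the same size and
// resource), let a container own the bytes:
rmm::device_buffer make_scratch(std::size_t size, rmm::cuda_stream_view stream)
{
  // Allocated stream-ordered from the current device resource; freed
  // automatically on destruction via the resource it was allocated from.
  return rmm::device_buffer{size, stream};
}
```

Because the container, not the caller, talks to the memory resource, a future change to the allocate()/allocate_async() split would be absorbed inside device_buffer rather than at every call site.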

harrism reopened this on Oct 6, 2021
harrism (Member, Author) commented Oct 6, 2021

Sorry, just saw there are new issues opened. Do they cover all uses of MR->allocate()?

viclafargue (Contributor) commented:

Well, replacing the raft::allocate calls (see #323) would allow removing the function from cpp/include/raft/cudart_utils.h.
Then, if my observations are right, the MR->allocate() pattern would only remain in these two files: cpp/include/raft/mr/buffer_base.hpp and cpp/include/raft/mr/device/allocator.hpp.

github-actions (bot) commented:

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions (bot) commented:

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

rapids-bot bot pushed a commit that referenced this issue Mar 18, 2022
Answers #308.
Requires the appropriate changes in `cuML` and `cuGraph` before merging.

Authors:
  - Victor Lafargue (https://github.com/viclafargue)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #400
cjnolet closed this as completed on Oct 12, 2022
5 participants