[Story] Enabling prefetching of unified memory #16251
Comments
I'd like us to consider an alternate libcudf implementation that is more work but may be better in terms of control and maintenance going forward. I believe we could build a set of utilities that accept pointers or a variety of container types and perform the prefetch, and then insert those prefetch/utility calls before each kernel launch. This gives the algorithm author the best control over when and what is prefetched, with no surprises or side effects. I'd like to keep logic like this out of the containers.
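To make the utility approach concrete, here is a minimal sketch of what such a helper could look like, assuming a free function in a hypothetical `cudf::detail` namespace that takes a raw pointer and a byte count. None of these names exist in libcudf today; this is an illustration of the idea, not a proposed implementation.

```cpp
// Hypothetical utility-based prefetch, called explicitly by the algorithm
// author right before a kernel launch. The cudf::detail::prefetch name and
// signature are illustrative only.
#include <cuda_runtime_api.h>
#include <rmm/cuda_stream_view.hpp>

#include <cstddef>

namespace cudf::detail {

// Prefetch [ptr, ptr + bytes) to the current device. For non-managed
// allocations cudaMemPrefetchAsync returns an error, which a production
// implementation would check and clear rather than ignore.
inline void prefetch(void const* ptr, std::size_t bytes, rmm::cuda_stream_view stream)
{
  if (ptr == nullptr || bytes == 0) { return; }
  int device{};
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(ptr, bytes, device, stream.value());
}

}  // namespace cudf::detail

// Usage inside an algorithm, before launching a kernel:
//   cudf::detail::prefetch(input.head<char>(), input_bytes, stream);
//   my_kernel<<<grid, block, 0, stream.value()>>>(...);
```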
I concur with your assessment long term, but as detailed in the issue I don't think it is feasible on the timeline we are seeking. Inserting changes before every kernel launch, even fairly trivial changes, seems like a task that will take at least one full release since the initial work will require achieving consensus on what those changes should be. Is there something I wrote in the issue that you disagree with? I tried to address pretty much this exact concern in the issue since I share it and anticipated that others would raise it at this point.
I only disagree with modifying the containers themselves. I was hoping that we could add prefetch to a few APIs quickly using a targeted approach with a handful of utilities in the short term, and then roll out the rest in the long term.
The problem I see with that approach is that while we might be able to see good results on a particular set of benchmarks that way, we will not be able to make a managed memory resource the default without substantially slowing down a wide range of APIs (anything that doesn't have prefetching enabled). We should at minimum test running the cudf microbenchmarks with a managed memory resource. I suspect that the results will not support using a managed memory resource by default in cudf.pandas without the more blanket approach to prefetching, unless we choose to wait for the longer-term solution where we roll out your proposed changes to more APIs.
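For reference, a minimal sketch of how a test run could install a managed memory resource as the default, using public RMM APIs; how this would be wired into the cudf microbenchmark harness itself is not shown and is an assumption of this sketch.

```cpp
// Route all RMM allocations through cudaMallocManaged for the duration of a
// benchmark run by installing a managed memory resource as the default.
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main()
{
  rmm::mr::managed_memory_resource managed_mr;
  rmm::mr::set_current_device_resource(&managed_mr);

  // ... run the benchmarks of interest here; all device allocations made via
  // RMM now come from unified (managed) memory ...
  return 0;
}
```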
Copying from Slack:
This implements a `PrefetchResourceAdaptor`, following the proposal from rapidsai/cudf#16251.

> Implement a new PrefetchMemoryResource that performs a prefetch when data is allocated. This is important because injecting prefetches in cuIO is more challenging than in the rest of libcudf, so prefetching on allocate is a short-term fix that ensures buffers are prefetched before being written to in cuIO.

Authors:
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Rong Ou (https://github.com/rongou)
- Lawrence Mitchell (https://github.com/wence-)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #1608
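The core idea of a prefetch-on-allocate resource can be sketched as a thin adaptor over an upstream RMM resource. This is a simplified illustration of the pattern under that assumption, not the actual rmm `PrefetchResourceAdaptor` implementation merged in the PR above.

```cpp
// Simplified prefetch-on-allocate adaptor: wrap an upstream resource and
// prefetch every allocation to the current device before handing it back.
#include <cuda_runtime_api.h>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cstddef>

template <typename Upstream>
class prefetch_adaptor final : public rmm::mr::device_memory_resource {
 public:
  explicit prefetch_adaptor(Upstream* upstream) : upstream_{upstream} {}

 private:
  void* do_allocate(std::size_t bytes, rmm::cuda_stream_view stream) override
  {
    void* p = upstream_->allocate(bytes, stream);
    int device{};
    cudaGetDevice(&device);
    // Migrate the pages to the GPU before the first write (the point made in
    // the quoted proposal about cuIO output buffers).
    cudaMemPrefetchAsync(p, bytes, device, stream.value());
    return p;
  }

  void do_deallocate(void* p, std::size_t bytes, rmm::cuda_stream_view stream) override
  {
    upstream_->deallocate(p, bytes, stream);
  }

  Upstream* upstream_;
};
```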
Based on some discussions of the current state of things, we've determined that we are happy to leave the managed memory support largely as is. We have no plans to support multi-stream usage, and the amount of work it would take to insert prefetching using new accessors everywhere is substantial and has little benefit at the moment. Improving the I/O side of things is also a significant chunk of work, and at present it seems like we are getting enough benefit out of the prefetching allocator in cudf.pandas to be satisfied. Until we anticipate benefits in those directions, the current state is a tolerable steady state. If some new data suggests that we should consider revising those decisions (e.g. if UVM experiments with cudf-polars indicate that we do need better I/O performance there), we can reopen this issue.
Problem statement
cudf.pandas has substantially increased the number of users attempting to use cuDF on workloads that trigger out-of-memory (OOM) errors. This is particularly problematic because cuDF typically OOMs on datasets that are far smaller than the available GPU memory due to the overhead of various algorithms, so users are unable to process datasets that they might reasonably expect to be able to process. Addressing these OOM errors is one of the highest priorities for cuDF in order to enable users with smaller GPUs, such as consumer cards with less memory.

Unified memory is one possible solution to this problem, since algorithms are no longer bound by the memory available on the device. RMM exposes a managed memory resource so users can easily switch over to using unified memory. However, naive usage of unified memory introduces severe performance bottlenecks due to page faulting, so simply switching over is not an option for cuDF or libcudf. Before we can use unified memory in production, we need mitigation strategies that avoid faults, using either hinting or prefetching to trigger migrations ahead of demand. Here we propose using systematic prefetching for this purpose.
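As background on why prefetching matters here, the following standalone CUDA sketch contrasts on-demand page migration with an explicit `cudaMemPrefetchAsync` before a kernel launch. It is a toy example, not cuDF code.

```cpp
// Managed memory allocated with cudaMallocManaged is migrated on demand via
// page faults unless it is explicitly prefetched to the GPU first.
#include <cuda_runtime_api.h>

#include <cstddef>

__global__ void scale(double* x, std::size_t n)
{
  auto i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { x[i] *= 2.0; }
}

int main()
{
  std::size_t const n     = 1 << 24;
  std::size_t const bytes = n * sizeof(double);
  double* x{};
  cudaMallocManaged(&x, bytes);
  for (std::size_t i = 0; i < n; ++i) { x[i] = 1.0; }  // pages now resident on the host

  int device{};
  cudaGetDevice(&device);
  // Without this call the kernel below faults page by page as it touches the
  // data; with it the pages are migrated in bulk before the kernel runs.
  cudaMemPrefetchAsync(x, bytes, device, 0);

  scale<<<static_cast<unsigned int>((n + 255) / 256), 256>>>(x, n);
  cudaDeviceSynchronize();
  cudaFree(x);
  return 0;
}
```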
Goals:
Non-Goals:
Short-term Proposal (mix of 24.08 and 24.10)
We need to make an expedient set of changes to enable prefetching when using managed memory. While cudf.pandas is the primary target in the immediate term, we cannot realistically achieve this purely in the Python layer and will need some libcudf work. With that in mind, I propose the following changes:
1. Implement a new `PrefetchMemoryResource` that performs a prefetch when data is allocated. This is important because injecting prefetches in cuIO is more challenging than in the rest of libcudf, so prefetching on allocate is a short-term fix that ensures buffers are prefetched before being written to in cuIO.
2. Add a prefetch call to `column_view`/`mutable_column_view::head`.
3. Subclass `rmm::device_uvector` to create `cudf::device_uvector` and add a prefetch call to `cudf::device_uvector::data`. All internal uses of `rmm::device_uvector` should be replaced with `cudf::device_uvector`, but functions returning `rmm::device_uvector` need not be changed.

Items 2-4 are implemented in #16020 (3 is partially implemented; a global find-and-replace is still needed). Item 1 has been prototyped as a callback memory resource in Python for testing but needs to be converted to a proper C++ implementation.
This plan involves a number of compromises, but it offers significant advantages that I think make it worthwhile to proceed with in the short term.
The drawbacks of this plan:

- We are adding cudf-specific subclasses of rmm containers (currently just `device_uvector`, and we may eventually find that we need `cudf::device_buffer` too), and we are perhaps adding functionality that really should be added to rmm instead.

Long-term plans
Here we lay out various potential long-term solutions to address the concerns above.
Adding new pointer accessors for prefetching
Instead of modifying the behavior of existing data accessors and gating that behavior behind a configuration, we could instead introduce new data accessors. For instance, we could add `column_view::data_prefetch(rmm::cuda_stream_view)`; a sketch of this idea follows the pros/cons below.

Pros:
Cons:
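A minimal sketch of the accessor idea, written as a free function for illustration. The proposed member `column_view::data_prefetch` does not exist in libcudf; the function name and shape here are hypothetical.

```cpp
// Free-function stand-in for the proposed accessor: return the column's data
// pointer after prefetching the column's contents to the current device.
#include <cuda_runtime_api.h>
#include <cudf/column/column_view.hpp>
#include <rmm/cuda_stream_view.hpp>

template <typename T>
T const* data_prefetch(cudf::column_view const& col, rmm::cuda_stream_view stream)
{
  auto const* ptr = col.data<T>();  // existing accessor, does not prefetch
  int device{};
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(ptr, col.size() * sizeof(T), device, stream.value());
  return ptr;
}

// An algorithm author would then opt in explicitly before a kernel launch:
//   auto const* in = data_prefetch<int32_t>(input, stream);
//   my_kernel<<<grid, block, 0, stream.value()>>>(in, input.size());
```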
Using a macro of some sort to trigger prefetching
Instead of adding new accessors, we could add a macro that could be inserted into algorithms to indicate a set of columns that need to be prefetched. This approach has essentially the same pros and cons as the above, so it's really a question of which implementation we prefer if we choose to go either of these routes.
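For concreteness, a hypothetical version of such a macro might simply wrap a prefetch utility like the one sketched earlier in this thread. `CUDF_PREFETCH` is an invented name, and the real proposal might operate on whole columns rather than raw pointers.

```cpp
// Hypothetical macro form of the explicit prefetch: dropped into an algorithm
// immediately before the kernel launches that read the prefetched data.
#include <cuda_runtime_api.h>

#define CUDF_PREFETCH(ptr, bytes, stream)                                   \
  do {                                                                      \
    int cudf_prefetch_device_{};                                            \
    cudaGetDevice(&cudf_prefetch_device_);                                  \
    cudaMemPrefetchAsync((ptr), (bytes), cudf_prefetch_device_, (stream));  \
  } while (0)

// Usage inside an algorithm:
//   CUDF_PREFETCH(input.head<char>(), input_bytes, stream.value());
//   my_kernel<<<grid, block, 0, stream.value()>>>(...);
```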
Adding prefetching to rmm data structures
Pros:
Cons:
Updating cuIO to properly handle prefetching
Updating cuIO data structures to properly handle prefetching is a long-term requirement.
Pros:
Cons: