[FEA] Interface support to reserve an address space without actually allocating it #6
@teju85 @seunghwak is this something that will be solved by using RMM directly?
Yes, I think we should be able to use RMM's `managed_memory_resource` for this purpose. But I'd like to hear from @seunghwak whether this is what he had in mind or something else.
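For reference, a minimal sketch of what that would look like with RMM's public API (`managed_memory_resource` and `device_buffer` are real RMM types; the surrounding code is illustrative):

```cpp
#include <cstddef>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/managed_memory_resource.hpp>

int main()
{
  // Managed (unified) memory: device allocations whose pages can migrate to
  // host memory on demand, allowing oversubscription of GPU memory. Note this
  // is different from virtual address reservation.
  rmm::mr::managed_memory_resource mr;

  // 1 GiB buffer backed by cudaMallocManaged.
  rmm::device_buffer buf{std::size_t{1} << 30, rmm::cuda_stream_default, &mr};

  // buf.data() can be passed to kernels; pages migrate as they are touched.
  return 0;
}
```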
My understanding is that this issue is about reserving an address space (https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/), not about using managed memory. The address space reservation feature is mainly to avoid reallocation. When we resize a vector, 1) it first allocates a memory block of the new size, 2) copies the old data to the new block, and 3) deallocates the old block. Address reservation lets us replace 1), 2), and 3) with just allocating a (new size - old size) block and mapping it into the reserved address space. Managed memory, in contrast, allows using host memory as a lower-level buffer (larger but slower) for device memory. So these are two different things. There was some brief discussion about supporting address space reservation in RMM, but AFAIK it has not been implemented yet.
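For context, the pattern from the linked post looks roughly like the condensed sketch below, using the CUDA driver VMM APIs (`cuMemAddressReserve`, `cuMemCreate`, `cuMemMap`, `cuMemSetAccess`). Sizes must be multiples of the allocation granularity (queryable via `cuMemGetAllocationGranularity`); this sketch hard-codes 2 MiB multiples and omits error checking:

```cpp
#include <cstddef>

#include <cuda.h>

// Grow a buffer in place by mapping a new physical chunk at the tail of a
// previously reserved virtual address range. Error checking omitted.
void grow_in_place(CUdeviceptr base, std::size_t old_size, std::size_t new_size, int device)
{
  CUmemAllocationProp prop{};
  prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id   = device;

  // 1) Allocate only the delta: (new_size - old_size) bytes of physical memory.
  CUmemGenericAllocationHandle handle;
  cuMemCreate(&handle, new_size - old_size, &prop, 0);

  // 2) Map the new chunk right after the existing one; data already mapped in
  //    [base, base + old_size) is untouched, so no copy is needed.
  cuMemMap(base + old_size, new_size - old_size, 0, handle, 0);

  // 3) Enable read/write access from this device on the newly mapped range.
  CUmemAccessDesc access{};
  access.location = prop.location;
  access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(base + old_size, new_size - old_size, &access, 1);

  cuMemRelease(handle);  // the mapping keeps the physical memory alive
}

int main()
{
  cuInit(0);
  CUdevice dev{};
  cuDeviceGet(&dev, 0);
  CUcontext ctx{};
  cuCtxCreate(&ctx, 0, dev);

  // Reserve a large virtual address range up front; no physical memory yet.
  std::size_t const reserved = std::size_t{1} << 33;  // 8 GiB of address space
  CUdeviceptr base{};
  cuMemAddressReserve(&base, reserved, 0, 0, 0);

  grow_in_place(base, 0, 2 << 20, dev);        // back the first 2 MiB
  grow_in_place(base, 2 << 20, 4 << 20, dev);  // grow to 4 MiB, no copy
  return 0;
}
```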
I now remember this discussion. This needs support for CUDA's address-reservation APIs, which I don't see in RMM. @harrism and/or @jrhemstad is this being planned for in RMM?
Use of the CUDA VMM APIs would be an implementation detail of a memory resource implementation. There aren't any plans to expose explicit RMM APIs for CUDA VMM.
@jrhemstad if there's a use case like the one Seunghwa discussed above, is there a chance of getting this feature added to the RMM roadmap?
OK, so the specific use case is resizing buffers, and the reasoning is to avoid copies. The trouble with this is that RMM implements the `device_memory_resource` allocate/deallocate interface, which has no concept of resizing an existing allocation. So let me ask: is reallocation definitely a bottleneck?
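For background, the essential shape of RMM's resource interface is an allocate/deallocate pair; a simplified sketch of that shape (exact virtuals vary across RMM versions, so treat this as illustrative):

```cpp
#include <cstddef>

#include <rmm/cuda_stream_view.hpp>

// Simplified sketch of the shape of rmm::mr::device_memory_resource. Note
// there is no reallocate/resize hook: a resize built on top of RMM is always
// allocate-new + copy + free-old.
class device_memory_resource_sketch {
 public:
  void* allocate(std::size_t bytes, rmm::cuda_stream_view stream)
  {
    return do_allocate(bytes, stream);
  }
  void deallocate(void* p, std::size_t bytes, rmm::cuda_stream_view stream)
  {
    do_deallocate(p, bytes, stream);
  }
  virtual ~device_memory_resource_sketch() = default;

 private:
  virtual void* do_allocate(std::size_t bytes, rmm::cuda_stream_view stream) = 0;
  virtual void do_deallocate(void* p, std::size_t bytes, rmm::cuda_stream_view stream) = 0;
};
```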
Fair enough @harrism. So far I haven't seen this being the bottleneck; maybe @seunghwak had a use case in mind when he suggested this feature?
So, the biggest benefit of this is memory footprint. We can clearly live without it, but having it allows more memory footprint optimization (and we can handle bigger graphs within the GPU memory limit). In more detail: without this feature, a resize needs "old_size + new_size" bytes of memory, while with address space reservation we need only max(old_size, new_size). Say we're filtering multiple blocks; we do something like the sketch below.
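A hypothetical sketch of that filtering pattern (names here are illustrative, not actual cuGraph code):

```cpp
#include <vector>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

// Illustrative only: append the filtered output of each input block to one
// growing result vector. Each resize() past the current allocation does
// allocate(new) -> copy -> free(old), so the transient footprint peaks at
// old_size + new_size. With a reserved address space, growing would map only
// the (new - old) delta, keeping the footprint at max(old_size, new_size).
template <typename T>
rmm::device_uvector<T> filter_all_blocks(std::vector<rmm::device_uvector<T>> const& blocks,
                                         rmm::cuda_stream_view stream)
{
  rmm::device_uvector<T> result(0, stream);
  for (auto const& block : blocks) {
    auto const old_size = result.size();
    result.resize(old_size + block.size(), stream);  // may allocate + copy + free
    // ... launch a kernel that compacts the elements of `block` passing the
    //     predicate into result.data() + old_size, then shrink `result` to
    //     the number of elements actually kept ...
  }
  return result;
}
```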
So, with address reservation, we just reserve the address space for the maximum size up front. Avoiding the copy is a second benefit (but less important, as the copy is pretty fast).
Address reservation can be used (as Jake pointed out) as an implementation detail of the vector to reduce the memory overhead of resizing. This does not require an external interface for address reservation.
Describe the solution you'd like

@seunghwak has a very good point here about supporting memory reservation (a.k.a. over-subscription) using CUDA's virtual memory management APIs. I believe this would be a good improvement to our existing `Allocator`, `device_buffer`, and `host_buffer` interfaces. Thus, filing this issue so that this feature item is not lost.
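A hypothetical shape such an interface could take (illustrative only; no such type exists in RMM or RAFT today):

```cpp
#include <cstddef>

// Illustrative only -- a buffer type that separates virtual address
// reservation from physical backing, in the spirit of this request.
class reserved_device_buffer {
 public:
  // Reserve `max_bytes` of device virtual address space up front; no
  // physical memory is allocated yet.
  explicit reserved_device_buffer(std::size_t max_bytes);

  // Map physical memory for [size(), new_size); existing data stays mapped
  // where it is, so growing never copies and the footprint stays at
  // max(old_size, new_size).
  void grow(std::size_t new_size);

  void* data() const noexcept;
  std::size_t size() const noexcept;      // bytes currently backed
  std::size_t capacity() const noexcept;  // bytes of reserved address space
};
```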
Additional context

Ref: https://devblogs.nvidia.com/introducing-low-level-gpu-virtual-memory-management/