Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Presburger The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@Presburger Please associate the related issue to the body of your Pull Request. (eg. “issue: #”)
@wphicks Hello, can you help me review my code, which adapts Knowhere for RAFT IVF?
src/index/ivf_raft/ivf_raft.cuh
Outdated
namespace memory_pool {
static rmm::mr::device_memory_resource*
resources() {
    auto device_id = rmm::detail::current_device();
I propose we not invoke anything in the detail namespace (for RMM or for RAFT). That namespace is subject to change without warning and could cause integration nightmares going forward. We have public functions and helpers we can expose for things like this in RAFT.
Any particular reason there's a preference here to invoke internal APIs for RMM instead of using the public APIs?
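A minimal sketch, for illustration only, of how the pool could be set up through public entry points instead of rmm::detail (cudaGetDevice plus rmm::mr::get/set_current_device_resource); the memory_pool_sketch namespace is hypothetical and pool_memory_resource's constructor arguments may differ slightly between RMM versions:

// Sketch only, not the PR's code: a pool set up through public APIs.
#include <cuda_runtime_api.h>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

namespace memory_pool_sketch {

inline rmm::mr::device_memory_resource*
resources() {
    int device_id = 0;
    cudaGetDevice(&device_id);  // public CUDA runtime call; a real version would
    (void)device_id;            // key one pool per device_id (omitted for brevity)

    // Construct-on-first-use pool backed by the plain CUDA resource.
    static rmm::mr::cuda_memory_resource upstream{};
    static rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool{&upstream};

    // Route subsequent RMM/RAFT allocations on this device through the pool.
    rmm::mr::set_current_device_resource(&pool);
    return rmm::mr::get_current_device_resource();
}

}  // namespace memory_pool_sketch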
Echoing @cjnolet, this is a fairly unusual use of RMM, and I would recommend avoiding it for the same reasons Corey mentioned. Can you outline what you're trying to achieve with this shift or what error you were seeing with the previous memory pool invocation?
Thanks for tackling all this! I think we might want to tweak a few things. In particular, the way that RMM is being used now is a little unusual and may lead to some issues for us. I also think that we would probably be better served by just generating a stream pool on the device_resources handle rather than relying on the current workarounds to use the default stream per thread.
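A minimal sketch of that suggestion, assuming the raft::device_resources constructor that takes a stream view plus a shared rmm::cuda_stream_pool; the pool size and function name are illustrative:

// Sketch only: attach a stream pool to the handle instead of relying on the
// per-thread default stream workaround.
#include <raft/core/device_resources.hpp>
#include <rmm/cuda_stream_pool.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cstddef>
#include <memory>

inline raft::device_resources
make_resources_with_stream_pool(std::size_t n_streams = 16) {
    auto stream_pool = std::make_shared<rmm::cuda_stream_pool>(n_streams);
    // Algorithms that support it can fan work out over the pool's streams,
    // instead of every call depending on the per-thread default stream.
    return raft::device_resources{rmm::cuda_stream_per_thread, stream_pool};
}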
src/index/ivf_raft/ivf_raft.cuh
Outdated
auto scoped_device = detail::device_setter{*ivf_raft_cfg.gpu_ids.begin()};
res_ = std::make_unique<raft::device_resources>();
auto res_ = std::make_unique<raft::device_resources>(rmm::cuda_stream_per_thread, nullptr,
Any reason we recreate the device_resources with every call? I'm assuming this is to ensure that we are getting the per-thread default stream each time.
I think we might be better served with a stream pool, which can be added directly to the device_resources. If we need to ensure that Train and Add complete before we perform a search, we can synchronize on the whole pool after those calls. Then, we can just pull streams from that pool to perform overlapping searches.
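A sketch of that flow, using the sync_stream_pool / get_stream_pool_size / get_stream_from_stream_pool members of raft::device_resources; the surrounding helper is hypothetical:

// Sketch: synchronize the whole pool once Train/Add have been submitted, then
// hand individual pool streams to concurrent searches.
#include <raft/core/device_resources.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cstddef>

void
overlap_searches(raft::device_resources const& res) {
    // ... Train/Add have already enqueued work on the handle's streams ...
    res.sync_stream_pool();  // ensure index construction is complete

    for (std::size_t i = 0; i < res.get_stream_pool_size(); ++i) {
        rmm::cuda_stream_view s = res.get_stream_from_stream_pool(i);
        // launch one search on stream `s`; searches on different streams overlap
        (void)s;
    }
}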
Just to add to @wphicks, device_resources manages all of the resources that you need to interact with the GPU (and with other GPU-accelerated libraries like cuBLAS), and they can be very expensive to create. As an example, reusing the same device_resources instance instead of creating a new one, especially during search, led to well over a 6x speedup in our recent 4-bit RAFT IVFPQ vs SCANN benchmarks. This matters a lot.
result = gpu_index_->size();
}
return result;
return counts_;
Is Count not supposed to return the total number of rows in the index? It looks like counts_ is getting overridden with each Add call.
Hello, when we plug Knowhere into Milvus, Milvus checks the count of the original dataset. The Knowhere RAFT IVF index does some data padding, so if we return the size reported by the RAFT index, Milvus will crash.
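For illustration, one hypothetical way to keep Count consistent with what Milvus expects is to accumulate the un-padded row count across Add calls rather than overwrite it or query the padded RAFT index size; member names here are illustrative:

#include <cstdint>

class ivf_index_sketch {
 public:
    void
    Add(std::int64_t rows_in_batch) {
        // ... append the batch to the underlying RAFT index ...
        counts_ += rows_in_batch;  // accumulate rather than overwrite
    }

    std::int64_t
    Count() const {
        return counts_;  // caller-visible rows, independent of internal padding
    }

 private:
    std::int64_t counts_ = 0;
};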
Could you give us a little more detail on this? index.size() should return the actual size of the index without padding. Are you seeing a specific mismatch between the reported size and the number of added vectors? If you have reproducer code, that would be even better.
My apologies! I just realized that what I said is true in 23.04 but false in 23.02 (which we are currently using). I'd recommend we upgrade for this and the other 23.04 features we've touched on elsewhere. I'm trying to get this PR to build right now, but once I've done so, I'll try to contribute code suggestions for how best to make use of 23.04.
std::ostream os(&buf);
os.write((char*)(&this->dim_), sizeof(this->dim_));
This whole serialization setup seems fine, but I wonder if we wouldn't make our lives easier by upgrading to RAFT 23.04 and just invoking the serialization API directly from that. It would likely dramatically simplify this logic. That can also happen as a follow-up, if this is working for now.
Another benefit of centralizing this is that, for example, you could load an index into Milvus that was previously trained with RAFT.
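A rough sketch of what the binary-set path could collapse to, assuming the stream-based serialize/deserialize overloads in RAFT 23.04's raft/neighbors/ivf_flat_serialize.cuh; the exact signatures are an assumption to verify against the 23.04 headers before adopting this:

#include <raft/core/device_resources.hpp>
#include <raft/neighbors/ivf_flat_serialize.cuh>

#include <sstream>
#include <string>

template <typename T, typename IdxT>
std::string
serialize_ivf_flat(raft::device_resources const& res,
                   raft::neighbors::ivf_flat::index<T, IdxT> const& index) {
    std::ostringstream os;
    raft::neighbors::ivf_flat::serialize(res, os, index);  // RAFT writes dim, lists, etc.
    res.sync_stream();
    return os.str();
}

template <typename T, typename IdxT>
raft::neighbors::ivf_flat::index<T, IdxT>
deserialize_ivf_flat(raft::device_resources const& res, std::string const& blob) {
    std::istringstream is(blob);
    return raft::neighbors::ivf_flat::deserialize<T, IdxT>(res, is);
}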
force-pushed from 391d9af to 9e453a3
force-pushed from 9e453a3 to 7d9349a
src/index/ivf_raft/ivf_raft.cuh
Outdated
namespace raft_res_pool {

struct context {
    static constexpr std::size_t default_size{16};
Is this intending to create 16 raft::device_resources instances up front in the same thread? All the device_resources instances are going to be using the same cudaStream_t underneath, so it's not going to be asynchronous or otherwise parallelizable.
No, we create a group of streams: std::vector<rmm::cuda_stream> stream_;
src/index/ivf_raft/ivf_raft.cuh
Outdated
auto scoped_device = detail::device_setter{*ivf_raft_cfg.gpu_ids.begin()};
res_ = std::make_unique<raft::device_resources>();
auto res_ = raft_res_pool::resource::instance().get_raft_res(devs_[0]);
I notice a stream pool being created above. Is the goal to parallelize multiple queries and saturate the GPU by using different CUDA streams or different threads? What guarantee does Knowhere make w.r.t. the number of threads/GPUs which might be invoking search() concurrently to compute the QPS?
If Knowhere is using different threads (and GPUs) to perform queries, then we absolutely should be able to submit the queries on different streams to saturate each GPU. Though I think we will want to be careful to make sure each GPU has its own copy of the IVF indexes in that case.
We are going to need to set the stream from the stream pool explicitly on a copy of the device_resources instance for each stream in the stream pool, though. As it works right now, the device_resources can contain its own stream pool, which allows algorithms to use it internally when they support it. It's not automatic, though. If you are using default_stream_per_thread, then I would create a new device_resources instance for each thread (maybe change context() to thread_local in that case?).
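A minimal sketch of that thread_local variant, assuming the per-thread default stream; the accessor name is hypothetical:

#include <raft/core/device_resources.hpp>

inline raft::device_resources&
get_thread_local_resources() {
    // Each search thread lazily gets its own handle, so nothing is shared
    // across threads; the default ctor uses rmm::cuda_stream_per_thread.
    thread_local raft::device_resources res;
    return res;
}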
Knowhere has a thread pool; currently the search concurrency is the number of CPU cores.
@Presburger Okay, so it sounds like multiple threads might be using the raft::device_resources instances, but are all of the instances being created by the same thread?
Is it possible for us to populate the
This is a good proposal: we can init the resources when we build and load, and assume the resources are available when we do search.
force-pushed from 7d9349a to 833967b
force-pushed from 833967b to 04aea4c
Signed-off-by: Yusheng.Ma <[email protected]>
force-pushed from 04aea4c to 19490f3
context&
get_context() {
    thread_local context ctx;
This looks much better!
context()
    : resources_(
          []() {
              return new rmm::cuda_stream();  // Avoid destruction at program exit
None of these arguments are actually needed. By default, the device_resources instance will use the per_thread_default_cuda_stream, and the workspace_resources (the last argument to the constructor here) will automatically use rmm::mr::get_current_device_resource().
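A sketch of the resulting simplification; the context type name here is hypothetical:

#include <raft/core/device_resources.hpp>

struct context_sketch {
    // A default-constructed device_resources already picks up the per-thread
    // default stream and rmm::mr::get_current_device_resource().
    raft::device_resources resources_{};
};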
As for the CUDA runtime error on exit: I've not encountered that exact error before, but I have encountered weird errors in the order of deleter invocations when using statics. I think one way to avoid that error might be to use a thread_local instead of a single static.
Never mind, Will figured out the cause below.
This looks great! In terms of the current error on exit, it is happening because a static variable holds access to GPU memory, and the CUDA context is destroyed before the static variable's destructor runs.
In the existing code, these lines took care of that. In the new version, we just need to make sure that we call .release() on our pool allocators before we exit.
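A hedged sketch of such an explicit teardown: the pooled GPU memory is given back while the CUDA context is still alive rather than from a static destructor at exit. It assumes rmm::mr::set_current_device_resource(nullptr) restores the initial CUDA resource, and init_pool/shutdown_pool are hypothetical hook names:

#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <memory>

namespace {
rmm::mr::cuda_memory_resource g_upstream;
std::unique_ptr<rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource>> g_pool;
}  // namespace

void
init_pool() {
    g_pool = std::make_unique<
        rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource>>(&g_upstream);
    rmm::mr::set_current_device_resource(g_pool.get());
}

void
shutdown_pool() {
    // Call this from the library's teardown path, before the process exits, so
    // the pool's destructor frees device memory while the driver is still loaded.
    rmm::mr::set_current_device_resource(nullptr);
    g_pool.reset();
}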
When I use the python3 client to try the RAFT IVF index, the old version got this error:
CUDA Error detected. cudaErrorCudartUnloading driver shutting down
It seems the order of some static variables' destructor calls causes this error.
issue: #742