This repository has been archived by the owner on Aug 16, 2023. It is now read-only.

adapt for raft ivf #734

Merged
merged 1 commit into milvus-io:main on Mar 16, 2023

Conversation

@Presburger (Member) commented Mar 13, 2023

  1. Implemented serialization and deserialization of the raft ivf index.
  2. Work is submitted through the thread pool; the intent is that each thread id maps to one GPU stream.
  3. Use the default device resource map, which avoids some CUDA errors.

When I used the python3 client to try the raft ivf index, the old version produced this error:
CUDA Error detected. cudaErrorCudartUnloading driver shutting down
It seems the order of static variable destructor calls causes this error.
issue: #742

@sre-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Presburger

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify mergify bot commented Mar 13, 2023

@Presburger Please associate the related issue to the body of your Pull Request. (eg. “issue: #”)

@Presburger (Member, Author)

@wphicks hello, can you help me review my code, which adapts for raft ivf?

@mergify mergify bot added the ci-passed label Mar 13, 2023
namespace memory_pool {
static rmm::mr::device_memory_resource*
resources() {
auto device_id = rmm::detail::current_device();
@cjnolet commented Mar 13, 2023

I propose we not invoke anything in the detail namespace (for RMM or for RAFT). That namespace is subject to change without warning and could cause integration nightmares going forward. We have public functions and helpers we can expose for things like this in RAFT.

Any particular reason there's a preference here to invoke internal APIs for RMM instead of using the public APIs?
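
For reference, a minimal sketch of the same lookup using only public entry points, with no claim about how it should be wired into the knowhere pool:

#include <cuda_runtime_api.h>
#include <rmm/mr/device/per_device_resource.hpp>

// Hedged sketch: cudaGetDevice() stands in for rmm::detail::current_device(), and
// rmm::mr::get_current_device_resource() is the public accessor for the memory
// resource of the currently active device.
static rmm::mr::device_memory_resource*
resources() {
    int device_id = 0;
    cudaGetDevice(&device_id);  // public CUDA runtime call; shown only to mirror the snippet above
    return rmm::mr::get_current_device_resource();
}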

@wphicks (Contributor)

Echoing @cjnolet, this is a fairly unusual use of RMM, and I would recommend avoiding it for the same reasons Corey mentioned. Can you outline what you're trying to achieve with this shift or what error you were seeing with the previous memory pool invocation?

@wphicks (Contributor) left a comment

Thanks for tackling all this! I think we might want to tweak a few things. In particular, the way that RMM is being used now is a little unusual and may lead to some issues for us. I also think that we would probably be better served by just generating a stream pool on the device_resources handle rather than relying on the current workarounds to use default stream per thread.


auto scoped_device = detail::device_setter{*ivf_raft_cfg.gpu_ids.begin()};
res_ = std::make_unique<raft::device_resources>();
auto res_ = std::make_unique<raft::device_resources>(rmm::cuda_stream_per_thread, nullptr,
@wphicks (Contributor)

Any reason we recreate the device_resources with every call? I'm assuming this is to ensure that we are getting the per-thread default stream each time.

I think we might be better served with a stream pool, which can be added directly to the device_resources. If we need to ensure that Train and Add complete before we perform a search, we can synchronize on the whole pool after those calls. Then, we can just pull streams from that pool to perform overlapping searches.
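
A rough sketch of that direction, assuming the stream-pool constructor argument and pool accessors available on recent raft::device_resources (the pool size and helper name here are illustrative):

#include <cstddef>
#include <memory>
#include <raft/core/device_resources.hpp>
#include <rmm/cuda_stream_pool.hpp>

// Hedged sketch: build device_resources once with an attached stream pool
// instead of recreating it on every call.
std::unique_ptr<raft::device_resources>
make_pooled_resources(std::size_t n_streams = 16) {
    auto stream_pool = std::make_shared<rmm::cuda_stream_pool>(n_streams);
    return std::make_unique<raft::device_resources>(rmm::cuda_stream_per_thread, stream_pool);
}

// Usage idea:
//   auto res = make_pooled_resources();
//   ... Train / Add with *res ...
//   res->sync_stream_pool();                      // ensure build work has finished
//   auto s = res->get_stream_from_stream_pool();  // stream for one overlapping search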

@cjnolet

Just to add to @wphicks, device_resources manages all of the resources that you need to interact w/ the GPU (and with other GPU-accelerated libraries like cublas), and they can be very expensive to create. As an example, reusing the same device_resources instead of creating a new one, especially during search, led to well over a 6x speedup in our recent 4-bit RAFT IVFPQ vs SCANN benchmarks. This matters a lot.

result = gpu_index_->size();
}
return result;
return counts_;
@wphicks (Contributor)

Is Count not supposed to return the total number of rows in the index? It looks like counts_ is getting overridden with each Add call.

@Presburger (Member, Author)

hello, when we plug knowhere into milvus, milvus checks the count of the original dataset. The knowhere raft ivf index does some data padding, so if we return the size reported by the raft index, milvus will crash.

@wphicks (Contributor)

Could you give us a little more detail on this? index.size() should return the actual size of the index without padding. Are you seeing a specific mismatch between the reported size and the number of added vectors? If you have reproducer code, that would be even better.

@wphicks (Contributor)

My apologies! I just realized that what I said is true in 23.04 but false in 23.02 (which we are currently using). I'd recommend we upgrade for this and the other 23.04 features we've touched on elsewhere. I'm trying to get this PR to build right now, but once I've done so, I'll try to contribute code suggestions for how best to make use of 23.04.


std::ostream os(&buf);

os.write((char*)(&this->dim_), sizeof(this->dim_));
@wphicks (Contributor)

This whole serialization setup seems fine, but I wonder if we wouldn't make our lives easier by upgrading to RAFT 23.04 and just invoking the serialization API directly from that. It would likely dramatically simplify this logic. That can also happen as a follow up, if this is working for now.
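
To make the suggestion concrete, and assuming the 23.04 API shape (the header name, namespace, and overload used here are my guess and should be checked against the actual release), the hand-rolled writes could shrink to roughly:

#include <cstdint>
#include <ostream>
#include <raft/neighbors/ivf_flat_serialize.cuh>  // assumed header name in 23.04

// Hedged sketch only: delegate the whole index layout to RAFT's own
// serialization instead of writing fields by hand.
void
save_ivf_flat(raft::device_resources const& res,
              std::ostream& os,
              raft::neighbors::ivf_flat::index<float, std::int64_t> const& idx) {
    raft::neighbors::ivf_flat::serialize(res, os, idx);  // assumed (handle, stream, index) overload
}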

@cjnolet commented Mar 13, 2023

Another benefit of centralizing this is that, for example, you could load an index into Milvus that has previously been trained w/ RAFT.

@mergify mergify bot added ci-passed and removed ci-passed labels Mar 14, 2023
@mergify mergify bot removed the ci-passed label Mar 14, 2023
@mergify mergify bot added the ci-passed label Mar 14, 2023
namespace raft_res_pool {

struct context {
static constexpr std::size_t default_size{16};
@cjnolet

Is this intending to create 16 raft::device_resources instances up front in the same thread? All the device_resources instances are going to be using the same cudaStream_t underneath, so it's not going to be asynchronous or otherwise parallelizable.

@Presburger (Member, Author)

no, we create a group of streams: std::vector<rmm::cuda_stream> stream_;

auto scoped_device = detail::device_setter{*ivf_raft_cfg.gpu_ids.begin()};
res_ = std::make_unique<raft::device_resources>();
auto res_ = raft_res_pool::resource::instance().get_raft_res(devs_[0]);
@cjnolet

I notice a stream-pool being created above. Is the goal to parallelize multiple queries and saturate the GPU by using different cuda streams or different threads? What guarantee does Knowhere make wrt the number of threads/gpus which might be invoking search() concurrently to compute the QPS?

If Knowhere is using different threads (and GPUs) to perform queries then we absolutely should be able to submit the queries on different streams to saturate each GPU. Though- I think we will want to be careful to make sure each GPU has its own copy of the IVF indexes in that case.

We are going to need to set the stream from the stream pool explicitly on a copy of the device_resources instance for each stream in the stream pool, though. As it works right now, the device_resources can contain its own stream pool, which allows algorithms to use it internally when they support it. It's not automatic, though. If you are using the default stream per thread, then I would create a new device_resources instance for each thread (maybe change context() to thread_local in that case?).
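
A minimal sketch of the thread_local variant mentioned at the end there, assuming each worker thread is already pinned to a single device:

#include <raft/core/device_resources.hpp>
#include <rmm/cuda_stream_view.hpp>

// Hedged sketch: with the per-thread default stream, one device_resources per
// worker thread means work submitted from different threads lands on different streams.
raft::device_resources&
thread_local_raft_res() {
    thread_local raft::device_resources res{rmm::cuda_stream_per_thread};
    return res;
}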

@Presburger (Member, Author)

knowhere has a thread pool; right now the search concurrency equals the number of CPU cores.

@cjnolet commented Mar 15, 2023

@Presburger okay, so it sounds like multiple threads might be using the raft::device_resources instances, but are all of the instances being created by the same thread?

@wphicks (Contributor) commented Mar 14, 2023

Is it possible for us to populate the device_resources only on Build and Load calls and then assume that we have that available for each device at search time (rather than acquiring a lock and checking for existence on every call)? Given that we're benchmarking single-query calls, we should also assess the performance impact of using std::map to store the device_resources objects. If it is a significant part of the query time for small batches, shifting to a sorted vector would likely help us quite a bit.

@Presburger (Member, Author)

Is it possible for us to populate the device_resources only on Build and Load calls and then assume that we have that available for each device at search time (rather than acquiring a lock and checking for existence on every call)? Given that we're benchmarking single-query calls, we should also assess the performance impact of using std::map to store the device_resources objects. If it is a significant part of the query time for small batches, shifting to a sorted vector would likely help us quite a bit.

This is a good proposal. We can init the resources during build and load, and assume they are available when we do search.
Thanks.

Signed-off-by: Yusheng.Ma <[email protected]>

context&
get_context() {
thread_local context ctx;
@cjnolet

This looks much better!

context()
: resources_(
[]() {
return new rmm::cuda_stream(); // Avoid program exit datart
@cjnolet

None of these arguments are actually needed. By default, the device_resources instance will use the per-thread default CUDA stream, and the workspace resource (the last argument to the constructor here) will automatically use rmm::mr::get_current_device_resource().
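
Put differently, a default-constructed handle already picks up those defaults:

#include <raft/core/device_resources.hpp>

// Hedged illustration of the defaults described above: no arguments needed.
void
build_with_defaults() {
    raft::device_resources res;  // per-thread default stream; workspace uses
                                 // rmm::mr::get_current_device_resource()
    // ... pass `res` to raft calls ...
}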

@cjnolet commented Mar 15, 2023

As for the cuda runtime error on exit, I've not encountered that exact error before, but I have encountered weird errors in the order of deleter invocations when using statics. I think one way to avoid that error might be to use a thread_local instead of a single static.

Nevermind, Will figured out the cause below.

@wphicks (Contributor) left a comment

This looks great! In terms of the current error on exit, it is happening because a static variable holds access to GPU memory, and the cuda context is destroyed before that static variable's destructor runs.

In the existing code, these lines took care of that. In the new version, we just need to make sure that we call .release() on our pool allocators before we exit.
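
One way to read that advice, sketched with hypothetical static holders (the real objects live in knowhere's resource code): keep the pool allocators behind unique_ptrs and call release() on them at shutdown, so their destructors never run after the CUDA runtime has begun unloading.

#include <memory>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

// Hypothetical static holders standing in for the pool allocators in this PR.
static auto upstream_mr = std::make_unique<rmm::mr::cuda_memory_resource>();
static auto pool_mr =
    std::make_unique<rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource>>(upstream_mr.get());

// Call from an explicit shutdown hook, before cudart unloads. unique_ptr::release()
// drops ownership without invoking the destructor, so no CUDA frees are attempted
// during static destruction after the runtime is already shutting down.
void
release_gpu_pools() {
    pool_mr.release();
    upstream_mr.release();
}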

@mergify mergify bot removed the ci-passed label Mar 16, 2023
@sre-ci-robot sre-ci-robot merged commit a926d86 into milvus-io:main Mar 16, 2023