This repository has been archived by the owner on Aug 16, 2023. It is now read-only.

adapt for raft ivf #734

Merged
merged 1 commit into milvus-io:main on Mar 16, 2023

Conversation

@Presburger (Member) commented Mar 13, 2023

  1. Implemented serialization and deserialization of the raft ivf index.
  2. Work is submitted through the thread pool; the intent is that each thread id maps to one GPU stream.
  3. Use the default device resource map, which avoids some CUDA errors.

When I used the python3 client to try the raft ivf index, the old version produced this error:
CUDA Error detected. cudaErrorCudartUnloading driver shutting down
It seems the order of static variable destructor calls causes this error.
issue: #742

@sre-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Presburger

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify mergify bot commented Mar 13, 2023

@Presburger Please associate the related issue to the body of your Pull Request. (eg. “issue: #”)

@Presburger (Member, Author)

@wphicks hello, can you help me review my code, which adapts for raft ivf?

@mergify mergify bot added the ci-passed label Mar 13, 2023
namespace memory_pool {
static rmm::mr::device_memory_resource*
resources() {
auto device_id = rmm::detail::current_device();
@cjnolet commented Mar 13, 2023

I propose we not invoke anything in the detail namespace (for RMM or for RAFT). That namespace is subject to change without warning and could cause integration nightmares going forward. We have public functions and helpers we can expose for things like this in RAFT.

Any particular reason there's a preference here to invoke internal APIs for RMM instead of using the public APIs?
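
For reference, a minimal sketch of the same lookup using only public entry points, with no claim about how it should be wired into the knowhere pool:

#include <cuda_runtime_api.h>
#include <rmm/mr/device/per_device_resource.hpp>

// Hedged sketch: cudaGetDevice() stands in for rmm::detail::current_device(), and
// rmm::mr::get_current_device_resource() is the public accessor for the memory
// resource of the currently active device.
static rmm::mr::device_memory_resource*
resources() {
    int device_id = 0;
    cudaGetDevice(&device_id);  // public CUDA runtime call; shown only to mirror the snippet above
    return rmm::mr::get_current_device_resource();
}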

@wphicks (Contributor)

Echoing @cjnolet, this is a fairly unusual use of RMM, and I would recommend avoiding it for the same reasons Corey mentioned. Can you outline what you're trying to achieve with this shift or what error you were seeing with the previous memory pool invocation?

@wphicks (Contributor) left a comment

Thanks for tackling all this! I think we might want to tweak a few things. In particular, the way that RMM is being used now is a little unusual and may lead to some issues for us. I also think that we would probably be better served by just generating a stream pool on the device_resources handle rather than relying on the current workarounds to use default stream per thread.


auto scoped_device = detail::device_setter{*ivf_raft_cfg.gpu_ids.begin()};
res_ = std::make_unique<raft::device_resources>();
auto res_ = std::make_unique<raft::device_resources>(rmm::cuda_stream_per_thread, nullptr,
@wphicks (Contributor)

Any reason we recreate the device_resources with every call? I'm assuming this is to ensure that we are getting the per-thread default stream each time.

I think we might be better served with a stream pool, which can be added directly to the device_resources. If we need to ensure that Train and Add complete before we perform a search, we can synchronize on the whole pool after those calls. Then, we can just pull streams from that pool to perform overlapping searches.
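
A rough sketch of that direction, assuming the stream-pool constructor argument and pool accessors available on recent raft::device_resources (the pool size and helper name here are illustrative):

#include <cstddef>
#include <memory>
#include <raft/core/device_resources.hpp>
#include <rmm/cuda_stream_pool.hpp>

// Hedged sketch: build device_resources once with an attached stream pool
// instead of recreating it on every call.
std::unique_ptr<raft::device_resources>
make_pooled_resources(std::size_t n_streams = 16) {
    auto stream_pool = std::make_shared<rmm::cuda_stream_pool>(n_streams);
    return std::make_unique<raft::device_resources>(rmm::cuda_stream_per_thread, stream_pool);
}

// Usage idea:
//   auto res = make_pooled_resources();
//   ... Train / Add with *res ...
//   res->sync_stream_pool();                      // ensure build work has finished
//   auto s = res->get_stream_from_stream_pool();  // stream for one overlapping search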

@cjnolet

Just to add to @wphicks, device_resources manages all of the resources that you need to interact w/ the GPU (and with other GPU-accelerated libraries like cublas), and they can be very expensive to create. As an example, reusing the same device_resources instead of creating a new one, especially during search, led to well over a 6x speedup in our recent 4-bit RAFT IVFPQ vs SCANN benchmarks. This matters a lot.

result = gpu_index_->size();
}
return result;
return counts_;
@wphicks (Contributor)

Is Count not supposed to return the total number of rows in the index? It looks like counts_ is getting overridden with each Add call.

@Presburger (Member, Author)

hello, when we plug knowhere into milvus, milvus checks the count of the original dataset. The knowhere raft ivf index does some data padding, so if we return the size reported by the raft index, milvus will crash.

@wphicks (Contributor)

Could you give us a little more detail on this? index.size() should return the actual size of the index without padding. Are you seeing a specific mismatch between the reported size and the number of added vectors? If you have reproducer code, that would be even better.

@wphicks (Contributor)

My apologies! I just realized that what I said is true in 23.04 but false in 23.02 (which we are currently using). I'd recommend we upgrade for this and the other 23.04 features we've touched on elsewhere. I'm trying to get this PR to build right now, but once I've done so, I'll try to contribute code suggestions for how best to make use of 23.04.


std::ostream os(&buf);

os.write((char*)(&this->dim_), sizeof(this->dim_));
@wphicks (Contributor)

This whole serialization setup seems fine, but I wonder if we wouldn't make our lives easier by upgrading to RAFT 23.04 and just invoking the serialization API directly from that. It would likely dramatically simplify this logic. That can also happen as a follow up, if this is working for now.
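
To make the suggestion concrete, and assuming the 23.04 API shape (the header name, namespace, and overload used here are my guess and should be checked against the actual release), the hand-rolled writes could shrink to roughly:

#include <cstdint>
#include <ostream>
#include <raft/neighbors/ivf_flat_serialize.cuh>  // assumed header name in 23.04

// Hedged sketch only: delegate the whole index layout to RAFT's own
// serialization instead of writing fields by hand.
void
save_ivf_flat(raft::device_resources const& res,
              std::ostream& os,
              raft::neighbors::ivf_flat::index<float, std::int64_t> const& idx) {
    raft::neighbors::ivf_flat::serialize(res, os, idx);  // assumed (handle, stream, index) overload
}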

@cjnolet commented Mar 13, 2023

Another benefit of centralizing this is that, for example, you could load an index into Milvus that has previously been trained w/ RAFT.

@mergify mergify bot added ci-passed and removed ci-passed labels Mar 14, 2023
@mergify mergify bot removed the ci-passed label Mar 14, 2023
@mergify mergify bot added the ci-passed label Mar 14, 2023
namespace raft_res_pool {

struct context {
static constexpr std::size_t default_size{16};
@cjnolet

Is this intending to create 16 raft::device_resources instances up front in the same thread? All the device_resources instances are going to be using the same cudaStream_t underneath, so it's not going to be asynchronous or otherwise parallelizable.

@Presburger (Member, Author)

no, we create a group of streams: std::vector<rmm::cuda_stream> stream_;

auto scoped_device = detail::device_setter{*ivf_raft_cfg.gpu_ids.begin()};
res_ = std::make_unique<raft::device_resources>();
auto res_ = raft_res_pool::resource::instance().get_raft_res(devs_[0]);
@cjnolet

I notice a stream-pool being created above. Is the goal to parallelize multiple queries and saturate the GPU by using different cuda streams or different threads? What guarantee does Knowhere make wrt the number of threads/gpus which might be invoking search() concurrently to compute the QPS?

If Knowhere is using different threads (and GPUs) to perform queries then we absolutely should be able to submit the queries on different streams to saturate each GPU. Though- I think we will want to be careful to make sure each GPU has its own copy of the IVF indexes in that case.

We are going to need to set the stream from the stream pool explicitly on a copy of the device_resources instance for each stream in the stream pool, though. As it works right now, the device_resources can contain its own stream pool, which allows algorithms to use it internally when they support it. It's not automatic, though. If you are using the default stream per thread, then I would create a new device_resources instance for each thread (maybe change context() to thread_local in that case?).
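
A minimal sketch of the thread_local variant mentioned at the end there, assuming each worker thread is already pinned to a single device:

#include <raft/core/device_resources.hpp>
#include <rmm/cuda_stream_view.hpp>

// Hedged sketch: with the per-thread default stream, one device_resources per
// worker thread means work submitted from different threads lands on different streams.
raft::device_resources&
thread_local_raft_res() {
    thread_local raft::device_resources res{rmm::cuda_stream_per_thread};
    return res;
}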

@Presburger (Member, Author)

knowhere has a thread pool; right now the search concurrency equals the number of CPU cores.

@cjnolet commented Mar 15, 2023

@Presburger okay, so it sounds like multiple threads might be using the raft::device_resources instances, but are all of the instances being created by the same thread?

@wphicks (Contributor) commented Mar 14, 2023

Is it possible for us to populate the device_resources only on Build and Load calls and then assume that we have that available for each device at search time (rather than acquiring a lock and checking for existence on every call)? Given that we're benchmarking single-query calls, we should also assess the performance impact of using std::map to store the device_resources objects. If it is a significant part of the query time for small batches, shifting to a sorted vector would likely help us quite a bit.

@Presburger (Member, Author)

Is it possible for us to populate the device_resources only on Build and Load calls and then assume that we have that available for each device at search time (rather than acquiring a lock and checking for existence on every call)? Given that we're benchmarking single-query calls, we should also assess the performance impact of using std::map to store the device_resources objects. If it is a significant part of the query time for small batches, shifting to a sorted vector would likely help us quite a bit.

This is a good proposal. We can init the resources during build and load, and assume they are available when we do search.
Thanks.

Signed-off-by: Yusheng.Ma <[email protected]>

context&
get_context() {
thread_local context ctx;
@cjnolet

This looks much better!

context()
: resources_(
[]() {
return new rmm::cuda_stream(); // Avoid program exit datart
@cjnolet

None of these arguments are actually needed. By default, the device_resources instance will use the per-thread default CUDA stream, and the workspace resource (the last argument to the constructor here) will automatically use rmm::mr::get_current_device_resource().
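
Put differently, a default-constructed handle already picks up those defaults:

#include <raft/core/device_resources.hpp>

// Hedged illustration of the defaults described above: no arguments needed.
void
build_with_defaults() {
    raft::device_resources res;  // per-thread default stream; workspace uses
                                 // rmm::mr::get_current_device_resource()
    // ... pass `res` to raft calls ...
}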

@cjnolet commented Mar 15, 2023

As for the cuda runtime error on exit, I've not encountered that exact error before, but I have encountered weird errors in the order of deleter invocations when using statics. I think one way to avoid that error might be to use a thread_local instead of a single static.

Nevermind, Will figured out the cause below.

@wphicks (Contributor) left a comment

This looks great! In terms of the current error on exit, it is happening because a static variable holds access to GPU memory, and the cuda context is destroyed before that static variable's destructor runs.

In the existing code, these lines took care of that. In the new version, we just need to make sure that we call .release() on our pool allocators before we exit.
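
One way to read that advice, sketched with hypothetical static holders (the real objects live in knowhere's resource code): keep the pool allocators behind unique_ptrs and call release() on them at shutdown, so their destructors never run after the CUDA runtime has begun unloading.

#include <memory>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

// Hypothetical static holders standing in for the pool allocators in this PR.
static auto upstream_mr = std::make_unique<rmm::mr::cuda_memory_resource>();
static auto pool_mr =
    std::make_unique<rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource>>(upstream_mr.get());

// Call from an explicit shutdown hook, before cudart unloads. unique_ptr::release()
// drops ownership without invoking the destructor, so no CUDA frees are attempted
// during static destruction after the runtime is already shutting down.
void
release_gpu_pools() {
    pool_mr.release();
    upstream_mr.release();
}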

@mergify mergify bot removed the ci-passed label Mar 16, 2023
@sre-ci-robot sre-ci-robot merged commit a926d86 into milvus-io:main Mar 16, 2023