[RFC] : Boosting OpenSearch Vector Engine Performance using GPUs #2293
Comments
@navneet1v Thanks for the proposal - it's very exciting!

Question 1: A very important aspect of a separate index build service will be defining compatibility/contracts between the index build service and OpenSearch. Is there a plan for this? Will the index build service follow the release cycle of a plugin? That being said, why not come up with a very specialized node type for index builds as opposed to a separate architecture? The node does not even have to be a part of the cluster - it could operate independently, something like cross-cluster search.

Question 2: In the future, I could see GPUs/specialized hardware being used in very high-throughput search use cases. For instance, maybe we cache frequently visited vectors of the HNSW graph in a GPU cache, or create some kind of high-cardinality partition (like IVF) and search all cluster centroids in parallel on the GPU. I assume we could still do this regardless of the decisions we make around the remote vector index service, correct?

Question 3: I understand that Nvidia CAGRA is a good first place to start, but in general, whatever we build should be generic enough that a different org/company can plug in their own hardware and build a similar component. That being said, will the index build service be generic? Should we come up with a list of tenets to follow for the remote index build service?
Yes, there will be contracts around this. As we start to explore these components in detail, the contracts will begin to take shape - for example, which APIs the index build service exposes and in what format the vectors are written to the object store.
At this point, I don't see a strong reason to follow the plugin release cycle. For releasing the image that will be exposed to users, I am thinking of something similar to how OpenSearch clients are released.
I'm not sure what you mean by coming up with a separate node type. Since we are exposing a docker image that can be deployed and maintained by distribution owners themselves, it is already fully independent. It is now up to the providers how they want to host and use this image.
Yes, that is correct. I see these GPU-based index build nodes/images being used as stateless workers for now: they get spun up, do their work, and then go back to the pool. OpenSearch providers can make them stateful by building control-plane aspects around these workers as per their needs. But this design doesn't cover those aspects, since they are specific to what each OpenSearch provider wants to do.
The k-NN plugin side will support any remote endpoint that can perform a vector index build and implements the APIs used by the k-NN plugin, like createIndex, getIndexBuildJobDetails, etc. (pending RFC creation). The details of these APIs will be shared in upcoming RFCs. So in that sense, if specialized hardware arrives tomorrow, all one needs to do is switch the workers to that hardware. I hope this clarifies. Ref: #2294 for the plugin design.
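To make the shape of that contract concrete, here is a hypothetical sketch; the real API surface is pending its own RFC, only the createIndex/getIndexBuildJobDetails names come from the discussion above, and everything else is illustrative:

```python
# Hypothetical sketch only -- the actual contract is pending a separate RFC.
from abc import ABC, abstractmethod

class RemoteIndexBuildClient(ABC):
    """What the k-NN plugin would need from any remote index build endpoint."""

    @abstractmethod
    def create_index(self, vector_path: str, index_params: dict) -> str:
        """Start a remote build from vectors in the object store; returns a job id."""

    @abstractmethod
    def get_index_build_job_details(self, job_id: str) -> dict:
        """Poll build status; on completion, includes the built index's location."""
```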
Introduction
As the popularity and adoption of generative AI continue to grow, the demand for high-performance vector search databases has skyrocketed. Customers are now dealing with vector datasets that have scaled from hundreds of millions to tens of billions of vectors, and they require these vector databases to handle continuous updates and queries at massive scale. Building vector indices at such scale takes significant time, so it is imperative that we substantially improve the performance and scalability of the OpenSearch Vector Engine to handle these workloads.
This document outlines a strategic plan to leverage GPU acceleration to dramatically boost the indexing performance of the OpenSearch vector engine. By harnessing the massive parallelism of GPUs, we aim to achieve faster index build times compared to CPU-based solutions, all while maintaining a compelling price-performance ratio. This will enable OpenSearch to meet the growing demands of our customers.
The key elements of this RFC include:
By implementing this GPU-powered approach, we will significantly enhance the indexing capabilities of the OpenSearch vector engine and better position it to meet the evolving needs of our customers in the generative AI era.
Opportunity
Index Build Acceleration
In 2024, the Vector Engine team made significant investments to improve the performance of the OpenSearch vector engine. This includes adding AVX512 SIMD support, implementing segment replication, transitioning to the more efficient KNNVectorsFormat, employing iterative graph builds during merges to reduce memory footprint, and introducing greedy graph builds. These optimizations have greatly improved the overall performance and stability of the vector engine.
However, we believe we have now reached the limits of what can be achieved with CPU-based hardware alone. The next logical step is to leverage GPUs and other purpose-built hardware to further accelerate vector indexing. Our experiments with Nvidia's GPU-accelerated vector indexing algorithm, CAGRA, have shown very promising results - achieving 10-30x faster index build times compared to CPU-based solutions, all while maintaining a compelling price-performance ratio. A snapshot of the results is included in Appendix C.
The bar graph illustrates the substantial percentage improvements in index build time achieved by leveraging GPU-based infrastructure compared to CPU-based solutions for different datasets.
These performance gains are critical as our customers continue to scale their vector workloads into the billions of vectors. Faster indexing will allow them to more efficiently build and manage their vector indices.
Feasibility Study
To ensure full compatibility of GPU-created indices with CPU machines, multiple experiments were conducted. One crucial aspect that was validated is the ability to search a Faiss index built on a GPU machine (with the CAGRA algorithm) using a CPU machine. The performance benchmarks mentioned in the previous section were set up so that the CAGRA index was built on a GPU with Faiss, converted to a CPU-compatible index using Faiss, and then searched using Faiss. This proves that we can build a Faiss index on a GPU machine and search it on a CPU machine. The CAGRA graph, before serialization, is converted to an HNSW-based graph (IndexHNSWCagra), which can be searched on a CPU. This conversion is supported natively by Faiss and is maintained and contributed by the Nvidia team. The benchmarking code, along with the code to convert a CAGRA index to an HNSW index, is available in the VectorSearchForge repository.
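For illustration, here is a minimal sketch of that round trip, assuming a Faiss build compiled with GPU and cuVS/CAGRA support; the dataset size and parameters are toy values, not the benchmark configuration:

```python
# Sketch only: build CAGRA on GPU, convert to a CPU-searchable HNSW graph,
# serialize, and search on CPU. Assumes a cuVS-enabled Faiss build.
import faiss
import numpy as np

d = 128
xb = np.random.rand(100_000, d).astype("float32")

# Build the CAGRA graph on the GPU.
res = faiss.StandardGpuResources()
gpu_index = faiss.GpuIndexCagra(res, d, faiss.METRIC_L2)
gpu_index.train(xb)

# Convert to an HNSW-based index (IndexHNSWCagra), supported natively by Faiss.
cpu_index = faiss.IndexHNSWCagra(d, 32, faiss.METRIC_L2)
gpu_index.copyTo(cpu_index)

# Serialize, then load and search on a CPU-only machine.
faiss.write_index(cpu_index, "cagra_as_hnsw.faiss")
index = faiss.read_index("cagra_as_hnsw.faiss")
distances, ids = index.search(xb[:5], 10)
```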
Proposed Architecture
Below is the high-level, simplified flow for the improved vector index build:
Note: This is an oversimplified architecture and provides a 10K ft. view of the overall system.
Components Definition
Pros:
Cons:
Alternatives Considered
Alternative 1: Hosting the IndexBuild Service on GPU-based Data Node Instances in the OpenSearch Cluster
The idea here is that rather than having a separate service/component outside of the OpenSearch cluster, we have GPU-based nodes in the cluster that can be used to build the index. Exploring this idea further, GPU-based instances can come in two forms:
Alternative 2: Using Node-to-Node Communication to Transfer Vectors from Data Nodes to the IndexBuildService
The idea here is that rather than using intermediate storage in between, we directly transfer the vectors from the data nodes to the GPU-based index build service, where the actual index build happens. This completely removes the need to build support for different remote stores.
Pros:
Cons:
Vector Index Build Component
Below is a sneak peek into the Vector Index Build node and its components. These nodes will be responsible for building the vector index; the Vector Engine will call into and interact with them to build the index. The OpenSearch Project will provide a docker image containing the different components, which can be deployed on-premises or in the cloud to create/run/manage the Vector Index Build Service. Below is a very high-level diagram and flow of how the create-index API flow will work.
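As a rough mental model of how such a stateless worker might operate (purely illustrative; every name and interface here is an assumption, not the final design):

```python
# Illustrative skeleton of one pass of a stateless build worker:
# pull a job, fetch vectors, build on GPU, upload the result, exit.
from dataclasses import dataclass

@dataclass
class BuildJob:
    job_id: str
    vector_path: str   # where the data node wrote the raw vectors
    output_path: str   # where the built, CPU-searchable graph should land

def run_worker(poll_job, download, build_and_convert, upload, complete):
    """Dependencies are injected callables; implementations are provider-specific."""
    job = poll_job()                          # pick up a pending build job
    vectors = download(job.vector_path)       # raw vectors from the remote store
    index_bytes = build_and_convert(vectors)  # GPU CAGRA build -> CPU HNSW bytes
    upload(job.output_path, index_bytes)      # only the CPU-searchable graph
    complete(job.job_id)                      # report success, return to the pool
```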
Components Definition
Below are the high-level responsibilities and purpose of the components in the above diagram. These roles and responsibilities are still high level; during the detailed low-level design, we will further explore and cement them.
Pros:
Cons:
Alternatives Considered
The alternatives below are mainly from the standpoint of distributing the above software. Discussion of the internal components will happen as part of a separate RFC.
Alternative 1: Provide the whole software as a zip/tar rather than a container
The idea here is that rather than containerizing the service, we directly ship .zip/.tar archives containing the software, just like the OpenSearch tar.gz distribution.
Pros:
Cons:
Next steps
Overall, the high-level design is divided among various RFCs/issues. The division will look like this:
Security
Since there are multiple moving pieces in the design, details on the security aspects/certifications will be covered in each component-level design/RFC.
FAQs
Why are we focusing on improving indexing time and not necessarily search time?
The acceleration provided by GPUs comes especially when processing a large number of vectors at once, whether for indexing or search. Since indexing always buffers vectors in segments and then performs graph creation, indexing can take full advantage of GPUs.
Search, on the other hand, receives just one query vector at a time, and OpenSearch currently has no capability to batch/buffer query vectors at the segment/shard level. So, at least initially, we want to introduce GPU acceleration for indexing. We also believe that when batched vector search queries become a use case, we can start leveraging GPUs for search as well.
Why did we choose GPUs rather than purpose-built vector search hardware?
One of the biggest reasons is that Nvidia's hardware is popular and already available in most clouds. Another is that Nvidia's k-NN algorithm is open source and compatible with CPUs and Faiss.
How can a GPU-built Faiss index (using CAGRA) be searched on a CPU machine?
The CAGRA graph itself cannot be searched on a CPU machine. This is why, when the index is built on the GPU machine, the IndexBuild Service converts the CAGRA index to an IndexHNSWCagra index before serialization, which can then be searched on a CPU (see the sketch in the Feasibility Study section).
Does CAGRA support fp16 and byte quantization, along with a binary index for disk-based vector search?
We had a discussion with the Nvidia team and got confirmation that CAGRA supports fp16 and byte quantization. A binary index is also available with CAGRA. But as of Tuesday, November 5, this support is only present in the RAFT library and is not yet exposed via Faiss. The Nvidia team will be working on enabling support for these quantizations in Faiss.
How much time will it take to upload the vectors to the remote store?
As of the writing of this RFC, actual upload times have not been measured. But looking at the raw build-time improvements (see Appendix C) for 30 GB of raw vectors (OpenAI, 1536D), we see a difference of close to 90 minutes. So even if uploads and downloads add another 15-20 minutes (at ~250 Megabit/sec), we would still save a huge amount of time. Also, when the index is built on the GPU, only the HNSW structure will be uploaded to S3, which further reduces the total upload time of the index (ref this code on how we will do this).
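As a back-of-the-envelope check (the payload size comes from the paragraph above; the bandwidth figures are illustrative assumptions, not measurements):

```python
def transfer_minutes(size_gb: float, mbit_per_s: float) -> float:
    """Minutes to move size_gb gigabytes over a link of mbit_per_s megabits/sec."""
    return size_gb * 8_000 / mbit_per_s / 60

print(transfer_minutes(30, 250))    # 30 GB of raw vectors at 250 Mbit/s -> ~16 min
print(transfer_minutes(30, 2_000))  # the same payload at 2 Gbit/s       -> ~2 min
```

Even at the slower figure, the transfer cost stays small relative to the ~90-minute build-time saving, and only the much smaller HNSW structure travels back.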
Appendix
Appendix A: Reference Documents
Appendix B
Current Architecture
The OpenSearch k-NN plugin supports three different engines to perform Approximate Nearest Neighbor (ANN) search: Lucene (Java implementation), Faiss (C++ implementation), and Nmslib (C++ implementation). These engines serve as an abstraction layer over the downstream libraries used to implement the nearest-neighbor search functionality.
Specifically, the plugin supports the following algorithms within each engine:
At a high level, OpenSearch index data is stored in shards, which are essentially Lucene indices. Each shard is further divided into immutable segments, created during the indexing process.
For indices with k-NN fields, the k-NN plugin leverages the same underlying Lucene architecture. During segment creation, in addition to the standard Lucene data structures (e.g., FST, BKDs, DocValues), the plugin also generates the necessary vector-specific data structures for each vector field. These vector-related files are then written to disk as part of the segment files and managed by Lucene.
When performing ANN search, the plugin loads the relevant vector data structures from disk into memory (but not the JVM heap) and uses the respective library (Lucene, Faiss, or Nmslib) to execute the search operation.
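For reference, this is how a k-NN vector field is declared on an index today (a minimal sketch using the opensearch-py client; the index name, field name, and parameter values are illustrative):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="my-vector-index",
    body={
        "settings": {"index": {"knn": True}},  # enable the k-NN plugin for this index
        "mappings": {
            "properties": {
                "my_vector": {
                    "type": "knn_vector",
                    "dimension": 768,
                    "method": {  # selects the engine/algorithm pair
                        "name": "hnsw",
                        "engine": "faiss",
                        "space_type": "l2",
                        "parameters": {"m": 16, "ef_construction": 128},
                    },
                }
            }
        },
    },
)
```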
Appendix C
Comparison between g5.2xlarge and r6g.2xlarge
While maintaining similar recall and search time, the table below provides the index build time improvement numbers.
Legends:
CPU Build Time: Total time in seconds required to build and serialize the graph on the CPU
GPU Build Time: Total time in seconds required to build the graph on the GPU, convert it to a CPU-compatible graph, and serialize it
Observations
Appendix D
Prices for GPU and CPU machines.
EC2 machine prices