Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] : Boosting OpenSearch Vector Engine Performance using GPUs #2293

Open
navneet1v opened this issue Nov 28, 2024 · 4 comments
Open

[RFC] : Boosting OpenSearch Vector Engine Performance using GPUs #2293

navneet1v opened this issue Nov 28, 2024 · 4 comments
Assignees
Labels
enhancement indexing indexing-improvements This label should be attached to all the github issues which will help improving the indexing time.

Comments

@navneet1v
Copy link
Collaborator

navneet1v commented Nov 28, 2024

Introduction

As the popularity and adoption of generative AI continues to grow, the demand for high-performance vector search databases has skyrocketed. Customers are now dealing with vector datasets that have scaled from hundreds of millions to tens of billions of vectors, and they require these vector databases to handle continuous updates and queries at massive scale. Building vector indices at such massive scales takes significant time and it is imperative that we significantly improve the performance and scalability of the OpenSearch Vector Engine to handle such workloads.
This document outlines a strategic plan to leverage GPU acceleration to dramatically boost the indexing performance of the OpenSearch vector engine. By harnessing the massive parallelism of GPUs, we aim to achieve faster index build times compared to CPU-based solutions, all while maintaining a compelling price-performance ratio. This will enable OpenSearch to meet the growing demands of our customers.
The key elements of this RFC include:

  • Offloading vector index building to a GPU-powered remote fleet that can leverage the latest advancements in GPU-accelerated vector indexing algorithms
  • Enabling seamless integration of GPU-built indices with the CPU-based OpenSearch vector engine for search and query serving

By implementing this GPU-powered approach, we will significantly enhance the indexing capabilities of the OpenSearch vector engine and better position it to meet the evolving needs of our customers in the generative AI era.

Opportunity

Index Build Acceleration

In 2024, the Vector Engine team made significant investments to improve the performance of the OpenSearch vector engine. This includes adding AVX512 SIMD support, implementing segment replication, transitioning to the more efficient KNNVectorsFormat, and employing iterative graph builds during merges to reduce memory footprint, greedy graph builds. These optimizations have greatly improved the overall performance and stability of the vector engine.

However, we believe we have now reached the limits of what can be achieved with CPU-based hardware alone. The next logical step is to leverage GPUs and other purpose-built hardware to further accelerate vector indexing. Our experiments with Nvidia's GPU-accelerated vector indexing algorithm, CAGRA, have shown very promising results - achieving 10-30x faster index build times compared to CPU-based solutions, all while maintaining a compelling price-performance ratio. A snapshot of the results are added in Appendix C.

The bar graph illustrates the substantial percentage improvements in index build time achieved by leveraging GPU-based infrastructure compared to CPU-based solutions for different datasets.

These performance gains are critical as our customers continue to scale their vector workloads into the billions of vectors. Faster indexing will allow them to more efficiently build and manage their vector indices.

Feasibility Study

To ensure full compatibility of GPU-created indices with CPU machines, multiple experiments were conducted. One crucial aspect that was validated is the ability to search a Faiss Index built on a GPU machine(with CAGRA algorithm) using a CPU machine. The performance benchmarks, as mentioned in the previous section, were set up in a manner where the CAGRA index was built on a GPU with Faiss, converted to a CPU-compatible index using Faiss, and then searched using Faiss. This proves that we can build the Faiss Index on a GPU machine and search it on a CPU machine. The CAGRA graph, before serialization, will be converted to an HNSW-based graph (IndexHNSWCagra), which can be searched on a CPU. This conversion is supported internally by Faiss and is maintained and contributed by the Nvidia team. Benchmarking code and code to convert a Cagra index to HNSW index is present in repo named VectorSearchForge

Proposed Architecture

Below is the high level simplified flow for improved vector index build. The overall flow goes like this:

  1. Vectors are ingested in OpenSearch Vector Engine via standard bulk apis and segments will start to get created and stored on the disk.
  2. During the flush/merges if Vector Engine Recognize that rather than building the index on current data node, handing it over the index build service is more beneficial, vector engine will stream/uploads the vectors data to an object store/intermediate store.
  3. Once the data is uploaded Vector Engine will call the createIndex api of IndexBuildService(new component).
  4. IndexBuildService based on the details of createIndexRequest will choose the right worker/instance for building the index and will trigger the index build and return a token back to Vector Engine. (More details related to this step is provided in later sections)
  5. Once the index is created it will be uploaded to remote store (Refer below sections on how we can reduce the size of the index) by IndexBuildService.
  6. Vector Engine will be notified back with details on where the index is stored and few other details.(Refer the below sections about more details on this interaction)
  7. Vector Engine will then download the index from the remote store and store it with other segment files in Lucene Directory.

High Level Diagram

Note: This is an oversimplified architecture and provides a 10K ft. view of the overall system.

Components Definition

  1. Vector Index Build Component: This is new component that is being proposed as part of this design. It will be responsible for building a vector index for a segment and then notifying Vector Engine that index is ready to download. This index build component can have both GPU and CPU machines or any customer hardware to build the vector index.
  2. Opensearch Vector Engine(k-NN plugin): This nothing but k-NN plugin which is responsible for providing vector related capabilities in Opensearch.
  3. Object Store/Intermediate Vector Store: This an intermediate storage component which will temporarily store the vectors and index for different components. The store will ensure that it has right deletion polices to remove these vectors. This is not the same store where we store the segments for Remote store feature in Opensearch. This will be a separate store.

Pros:

  1. Decoupled Architecture: With the decoupled architecture the index build component can evolve separately and new algorithms/machines can be added without impacting the overall system. There will also be no coupling between OS version and Vector Index Build Component version.
  2. Better GPU node usages: The GPU nodes will only be building the vector index and will not be indexing any text data.
  3. Better Error Handling: Since we are uploading the data to an intermediate store even if a GPU node crashes in between we don’t need to again transfer the data from data nodes to GPU nodes.
  4. Multi-Tenancy: Since the fleet is outside of the cluster, multiple clusters can use the same Vector Index build service.

Cons:

  1. May be Higher Upload Times: The overall benefits might get reduced(actual number still needs to be found) as vectors first needs to be uploaded to object store. But this still debatable as with object stores we can parallelize the uploads. Refer FAQ section for details on the upload times

Alternatives Considered

Alternative1: Hosting IndexBuild Service on GPU based Data Node Instances in OpenSearch Cluster

The idea here is rather than having a separate service/component outside of OpenSearch cluster, have a GPU based node in the cluster and can be used to build the index. On exploring further on this idea we can have GPU based instances in 2 forms:

  1. GPU based data node doing only Indexing: With IndexReaders and Writers separation feature/project in OpenSearch, we can very well have create GPU machines as Indexing nodes.
    1. Pros:
      1. No intermediate Store: One of the biggest advantage of such architecture we don’t need to transfer vectors back and forth between OpenSearch cluster and Index Build Service.
    2. Cons:
      1. GPU machine indexing non vector data: Since this node will also be indexing the non vector fields(in OpenSearch we can have a vector field and a text field in same index), GPU machines doing indexing will not be economical. (we know similar price GPU machine has both low memory and vCPUs as compared to CPU machines with same price. Refer Appendix D.
      2. Price Performance: Since OpenSearch will be running on machine, which will take CPU resources and also RAM to run the OpenSearch cluster, which will further make this approach not economical.
  2. A separate node role which builds the k-NN index: In this approach we will still have the GPU nodes running OpenSearch process on it, but the node will not have any shards/index writer capability in it. The node will have a separate role lets say vector-index-builder, where data nodes will transfer their vectors(similar to original recommended solution). The node will need to build and run a very simple OpenSearch process
    1. Pros:
      1. Since the GPU node is also part of the cluster no separate security controls/encryption is needed to be built.
    2. Cons:
      1. Low Price Performance: Since the node is in the cluster, and once ingestion is completed the node is no longer needed. But it will still remain in the cluster and cannot be shared across the different clusters. We saw similar thing with ML node, and at end we switched to using remote models to do the inference as that is more scaleable
      2. Maintenance Overhead: This will also require building a trimmed version of OpenSearch which might not be useful, since for every feature we need to see if we want to add it in the trimmed version or not.
Alternative2: Using Node to Node communication to transfer the vectors from data node to IndexBuildService

The idea here is rather than using an intermediate storage in between we directly transfer the vectors from data nodes to GPU machine to index build service where the actual index build will happen. This will completely remove building the support for different remote stores.
Pros:

  1. As mentioned above since there is no remote store users can save cost of the remote stores and from code standpoint it becomes simple to manage as Vector Engine doesn’t need to support different remote stores.

Cons

  1. Data loss: If a GPU machine fails, all data must be resent from the data node, wasting time and resources. Since the data is getting transferred to GPU machines if that machine is no longer active the whole data needs to be transferred again from data node to GPU machine.
  2. Tight Coupling: Vector engine would become overloaded and too tightly coupled by handling all the decision-making for graph building. Decisions like which node to use for graph building, which tenant needs to be prioritized

Vector Index Build Component

Below is a sneak peek into the Vector Index build node and it components. This node/nodes will be responsible for building the Vector Index. The Vector Engine will calling/interacting with these nodes to build the index. Opensearch Project will be providing a docker image containing different components that can be deployed into on-prem/cloud to create/run/manage the Vector Index Build Service. Below is a very high level diagram and flow for how create Index api flow will happen.

Components Definition

Below is high level responsibilities and purpose of the components added in the above diagram. This is still a high level roles and responsibilities. During the detailed low level design we will further explore and cement it.

  1. API Handlers: A thin layer REST/gRPC to handle the incoming requests. There will be no auth/authz happening here. Users who are hosting this service can add a side car container to handle the auth/authz.
  2. Index Builder: This component will be responsible for triggering the creation of the index. The component will first load the vectors in memory and then pass that to Faiss for Index build. Once the index is build and converted to CPU based index it will then trigger the upload.
  3. Vector Index Data Accessor: The accessor component will be a thin abstraction between the Builder and the remote client. The component will handle trigger the downloads(if required in parallel) using the remote store client.
  4. Remote Store Client: A wrapper over different remote store clients that can be used to download the vectors data.

Pros:

  1. Reproducible Setup: The main benefit of having a image is, the setup becomes super easy and can be deployed on different machines in the cloud/on-prem.
  2. Efficient Resource Management: With containerized setups a user can run multiple containers on a machine with more resources which can help provide isolation if needed between different index build processes.
  3. Ease of Development and upgrades: Since the whole env is containerized, the development and upgrades becomes easy as there are already multiple tools like Kubernetes etc that can easily do the upgrades.
  4. Side Car Architecture: With containerized environment, user can plugin in any side car container for security, metrics of their choice without making any changes in the main container.

Cons:

  1. The learning curve for accessing the GPUs via container could be a problem. We might have to use GPU vendor(in this case Nvidia) specific images as base images. But with documentation and proper developer guide this can easily be overcome.

Alternative Considered

The below alternatives are mainly from distribution of the above software standpoint. The discussion on the internal components will happen as part of a separate RFC.

Alternative 1: Provide the whole software as an zip/tar rather than a container

The idea here is rather than containerizing the service, we can actually give direct .zip/tars containing the software just like Opensearch tar.gz.
Pros:

  1. There will be no learning curve for containers setup with GPUs.

Cons:

  1. All the pros mentioned in the recommended approach will be a con for this approach.

Next steps

Overall the high level design is divided among various RFCs/issues. The division will look like this:

  1. Remote Vector Index Build Feature in OpenSearch Vector Engine: This will be a generic capability in OpenSearch Vector engine which will allow Vector Engine to take the advantage of a remote fleet to build the vector index. The capability and interfaces will be used as building blocks for integrating GPUs based Vector Index builds. REF: [RFC] Remote Vector Index Build Feature with OpenSearch Vector Engine #2294
  2. Remote Store and Vector Index Build Component Integration Interfaces: This RFC will add the details on how Vector Engine will be integrating with Remote store and also with Vector Index Build Component. The RFC will talk about the Remote Index Build Client as a first class citizen to integrate with vector index build component.
  3. Vector Index Build Component: In this RFC we talked on very high level about the create index api for vector index build service. But this is not the only API that could be exposed. In the upcoming RFC more details about other APIs will be added, hosting framework for the APIs, multiple internal components that are just added in above diagram but haven’t been explained in much details.

Security

Since there are multiple moving pieces in the design, details on the security aspect/certification will be covered with each component level design/RFC.

FAQs

Why we are focusing on improving the indexing time and not necessarily the search time?

The acceleration provided by GPUs specially comes when you have high number of vectors either for indexing or search. Since for indexing always we buffer the vectors in segments and then do a graph creation indexing can take huge advantage of GPUs.
But if we look at the Search use-case we just get 1 query vector at a time. Plus there is no capability in Opensearch currently where we can batch/buffer the vectors for doing the query at segment/shard level. So at-least initially we want to introduce GPU acceleration with indexing. We also believe that when the use case of batch vector search query will come in that case we can start leveraging the GPUs for search.

Why we choose GPUs rather than a purpose built vector search hardware?

One of the biggest reason around this was Nvidia’s hardware is popular and already available in most of the clouds. Another thing is the k-NN algorithm of Nvidia is Opensource and compatible with CPUs and Faiss.

How a GPU based Faiss Index(built using CAGRA) can be searched on CPU machine?

The CAGRA graph cannot be searched on a CPU machine, this is where when the index will be built on the GPU machine, before doing the serialization, the IndexBuild Service convert the Cagra Index to a IndexHNSWCagra, which can be searched on CPU.

Does CAGRA supports fp16, Byte quantization along with Binary Index for disk based vector search?

We had a discussion with Nvidia team and got the confirmation that Cagra support fp16 and byte quantization. The Binary Index is also present with Cagra. But of Tuesday, November 5 the support is only present in the raft library but not exposed via Faiss. Nvidia team will be working on enabling the support in Faiss for these quantizations.

How much time will be taken to upload the Vectors to remote store?

As per writing of this RFC, actual upload times have not been measured, but if we look at the raw numbers in terms of build time improvement (refer in Appendix C) for 30GB raw vector (openai 1536D) vector, we see a difference of close to 90mins. So even if we spend like 5-10mins in uploads and downloads(~250 Megabit/sec) we would still be saving huge amount of time. Also, when index is built on the GPU, only the HNSW structure will be uploaded to S3, this will further reduce the total upload time of the index(ref this code on how we will do this).

Appendix

Appendix A: Reference Documents

  1. Overall Initiative to improve the index build time: [META] [Build-Time] Improving Build time for Vector Indices #1599
  2. Serialize Faiss Index build with HSNW structure only: add skip_storage flag to HNSW facebookresearch/faiss#3487
  3. Carga Algorithm: https://developer.nvidia.com/blog/accelerating-vector-search-fine-tuning-gpu-index-algorithms/#cagra
  4. Library Vending out the Cagra algorithm via Faiss: https://github.com/rapidsai/cuvs
  5. Benchmarking code: https://github.com/navneet1v/VectorSearchForge
  6. Meta issue: [META]: Boosting Opensearch Vector Engine Performance using GPUs #2295

Appendix B

Current Architecture

The OpenSearch k-NN plugin supports three different types of engines to perform Approximate Nearest Neighbor (ANN) search: Lucene(Java Implementation), Faiss(C++ implementation) and Nmslib(C++ implementation). These engines serve as an abstraction layer over the downstream libraries used to implement the nearest neighbor search functionality.
Specifically, the plugin supports the following algorithms within each engine:

  1. Lucene Engine: HNSW algorithm
  2. Nmslib Engine (Native): HNSW algorithm
  3. Faiss Engine (Native): HNSW and IVF algorithms

At a high level, OpenSearch index data is stored in shards, which are essentially Lucene indices. Each shard is further divided into immutable segments, created during the indexing process.
For indices with k-NN fields, the k-NN plugin leverages the same underlying Lucene architecture. During segment creation, in addition to the standard Lucene data structures (e.g., FST, BKDs, DocValues), the plugin also generates the necessary vector-specific data structures for each vector field. These vector-related files are then written to disk as part of the segment files and managed by Lucene.
When performing ANN search, the plugin loads the relevant vector data structures from disk into memory (but not the JVM heap) and uses the respective library (Lucene, Faiss, or Nmslib) to execute the search operation.

Appendix C

Comparison between g5.2xlarge and r6g.2xlarge

By maintaining similar recall and search time below table provides the index build time improvement numbers

Dataset Name Dimension Number of Vectors CPU Build Time (sec) GPU Build Time (sec) %age improvement s with GPU in Build Time (CPU/GPU)
sift-128 128 1M 490.42 16.61 30
ms-marco-384 384 1M 1098.01 35.68 31
cohere-768-ip 768 1M 3251.76 67.01 49
gist 960 1M 3146.3 68.64 46
open-ai 1536 5M 5647.2 181.17 32
bigann-10M 128 10M 9489.49 226.81 42

Legends:
CPU Build Time: Total time in sec required to build + serialize the graph on CPU
GPU Build Time: Total time in sec required to build the graph on GPU + converting to CPU compatible graph + serialization of graph

Observations

  1. For similar recalls and almost same overall search latency are able to get 30x-50x faster build times with GPUs, which are just 3x extra price. Ref price here: Appendix D
  2. The conversion time from GPU based index to CPU based index is similar to what we observe with pure CPU based indices.
  3. Increasing the compression rate reduces the build time but the return diminishes after we cross 16x compression rate. This is observed across all the different datasets.

Appendix D

Prices for GPU and CPU machines.

EC2 machine prices

Instance Size GPU GPU Memory (GiB) vCPUs Memory (GiB) Price
g5.2xlarge 1 24 8 32 1.212
r6g.2xlarge 0 0 8 64 0.4032
r6g.4xlarge 0 0 16 128 0.8064
r6g.8xlarge 0 0 32 256 1.6128

@navneet1v navneet1v self-assigned this Nov 28, 2024
@navneet1v navneet1v moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Nov 28, 2024
@navneet1v navneet1v added indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. indexing labels Nov 28, 2024
@jmazanec15
Copy link
Member

@navneet1v Thanks for proposal - its very exciting!

Question 1: A very important aspect of a separate index build service will be defining compatibility/contracts between the index build service and opensearch. Is there a plan for this? Will the index build service follow a release cycle of a plugin? That being said, why not come up with a very specialized node type for index build as opposed to a separate architecture? The node does not even have to be a part of the cluster - it could operate independently - something like cross cluster search.

Question 2: In future, I could see GPUs/specialized hardware being used in very high throughput search use cases. For instances, maybe we cache frequently visited vectors in HNSW graph on GPU cache or create some kind of high cardinality partition (like IVF) and search all cluster centroids in parallel on GPU. I assume we could still do this regardless of decisions we make around remote vector index service correct?

Question 3: In regards to

Why we choose GPUs rather than a purpose built vector search hardware?

I understand that Nvidia Cagra is a good first place to start, but in general, whatever we build should be generic enough so that a different org/company can plug in their own hardware and build a similar component. That being said, will the index build service be generic? Should we come up with a list of tenets to follow for remote index build service?

@navneet1v
Copy link
Collaborator Author

Question 1: A very important aspect of a separate index build service will be defining compatibility/contracts between the index build service and opensearch. Is there a plan for this?

Yes there will contracts around this. As we start to explore the details of these components in details the contract will start to formulate. Like what APIs to be exposed by an index build service, in what format the vectors will be written to object store etc.

Will the index build service follow a release cycle of a plugin?

Till this point, I don't see a strong reason to follow the release cycle. I am thinking similar to clients of Opensearch for the release of the image which will be exposed to users.

That being said, why not come up with a very specialized node type for index build as opposed to a separate architecture? The node does not even have to be a part of the cluster - it could operate independently - something like cross cluster search.

Not sure what you mean by come up with a separate node. Since we are exposing a docker image which can be deployed and maintained by distribution owners themselves, it is very well independent. It is no upto the providers how they want to host this image and use it.

@navneet1v
Copy link
Collaborator Author

navneet1v commented Dec 27, 2024

Question 2: In future, I could see GPUs/specialized hardware being used in very high throughput search use cases. For instances, maybe we cache frequently visited vectors in HNSW graph on GPU cache or create some kind of high cardinality partition (like IVF) and search all cluster centroids in parallel on GPU. I assume we could still do this regardless of decisions we make around remote vector index service correct?

Yes that is correct. I see these GPU based index build nodes/images being used as stateless workers for now. They gets spun up, do their work and then peacefully goes back to the pool. Opensearch providers can make them stateful by building a control plane aspects around these workers as per their needs. But this design doesn't touch base on those things since they are specific to the usecase which a Opensearch provider want to do.

@navneet1v
Copy link
Collaborator Author

Question 3: In regards to

Why we choose GPUs rather than a purpose built vector search hardware?

I understand that Nvidia Cagra is a good first place to start, but in general, whatever we build should be generic enough so that a different org/company can plug in their own hardware and build a similar component. That being said, will the index build service be generic? Should we come up with a list of tenets to follow for remote index build service?

See the parts of k-NN plugin side will support any remote endpoint that can do vector index build and implements the APIs use by k-NN plugin like createIndex, getIndexBuildJobDetails etc(pending RFC creation). The details of these APIs will be shared in upcoming RFCs. So in that sense if tomorrow any specialized hardware comes, all one needs to do is flip the workers to use that hardware. I hope this clarifies. Ref: #2294 for plugin design

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement indexing indexing-improvements This label should be attached to all the github issues which will help improving the indexing time.
Projects
Status: Backlog (Hot)
Development

No branches or pull requests

2 participants