[WIP] Multi-Vector support for HNSW search #13525
Conversation
pending:
- merge fields
- scorer changes
- default scorer etc.
- reader changes
No. "default run" is knn search where each embedding is a separate document with no relationship between them. I'm still wiring things up to see benchmark results for this PR. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
Hey @vigyasharma there is a lot of good work here. I am going to shift my focus and see how I can help here more fully. What are the next steps? I am guessing handling all the merging from main; I can take care of that sometime next week. Just wondering where I can help.
Thanks @benwtrent. I've been working on getting a multi-vector benchmark running to wire this end to end, and found some pesky bugs and oversights. I'm planning to split this feature into multiple smaller PRs. This PR was mainly to get inputs on the approach; it's too big to test and review. I'll share a plan of the split PRs soon.

re: the multi-vector benchmark for the passage search use-case, I've been stuck on a bug where I run into the following exception:

Exception in thread "main" java.lang.RuntimeException: java.io.EOFException: read past EOF: MemorySegmentIndexInput(path="/Users/vigyas/forks/bench/util/knnIndices/cohere-wikipedia-docs-768d.vec-32-50-multiVector.index/_0_Lucene99HnswMultiVectorsFormat_0.vecmv") [slice=multi-vector-data]
at knn.KnnGraphTester$ComputeBaselineNNFloatTask.call(KnnGraphTester.java:1115)
at knn.KnnGraphTester.computeNN(KnnGraphTester.java:967)
at knn.KnnGraphTester.getNN(KnnGraphTester.java:812)
at knn.KnnGraphTester.run(KnnGraphTester.java:438)
at knn.KnnGraphTester.runWithCleanUp(KnnGraphTester.java:177)
at knn.KnnGraphTester.main(KnnGraphTester.java:172)
Caused by: java.io.EOFException: read past EOF: MemorySegmentIndexInput(path="/Users/vigyas/forks/bench/util/knnIndices/cohere-wikipedia-docs-768d.vec-32-50-multiVector.index/_0_Lucene99HnswMultiVectorsFormat_0.vecmv") [slice=multi-vector-data]
at org.apache.lucene.store.MemorySegmentIndexInput.readByte(MemorySegmentIndexInput.java:146)
at org.apache.lucene.store.DataInput.readInt(DataInput.java:95)
at org.apache.lucene.store.MemorySegmentIndexInput.readInt(MemorySegmentIndexInput.java:261)
at org.apache.lucene.store.DataInput.readFloats(DataInput.java:202)
at org.apache.lucene.store.MemorySegmentIndexInput.readFloats(MemorySegmentIndexInput.java:231)
at org.apache.lucene.codecs.lucene99.OffHeapFloatMultiVectorValues.vectorValue(OffHeapFloatMultiVectorValues.java:111)
at org.apache.lucene.codecs.lucene99.OffHeapFloatMultiVectorValues.vectorValue(OffHeapFloatMultiVectorValues.java:130)
at org.apache.lucene.codecs.hnsw.DefaultFlatMultiVectorScorer$FloatMultiVectorScorer.score(DefaultFlatMultiVectorScorer.java:185)
at org.apache.lucene.codecs.lucene99.OffHeapFloatMultiVectorValues$DenseOffHeapMultiVectorValues$1.score(OffHeapFloatMultiVectorValues.java:248)
at org.apache.lucene.search.AbstractKnnVectorQuery.exactSearch(AbstractKnnVectorQuery.java:220)
at knn.KnnFloatVectorBenchmarkQuery.exactSearch(KnnFloatVectorBenchmarkQuery.java:33)
at knn.KnnFloatVectorBenchmarkQuery.runExactSearch(KnnFloatVectorBenchmarkQuery.java:50)
at knn.KnnGraphTester$ComputeBaselineNNFloatTask.call(KnnGraphTester.java:1111)
... 5 more
re: single v/s multi-vectors, I think it makes sense to not force users to choose multi-valued fields upfront. There's value in being able to go from single to multi-values when the need arises (and treating single-vectors as a storage optimization). However, I do think that we should not support changing the aggregation function once it has been set. Allowing different aggregate functions per segment will make merging and general debugging overly complicated. As such, I'm thinking of keeping the
The solution really depends on the semantics. In its current form, the way multi-vectors are incorporated in this PR doesn't quite extend the single-vector case. With max similarity, we assume that each similarity score results from a full comparison, which works well when the operations are limited (such as in re-ranking scenarios). However, for ColBERT, where the average number of vectors per document is large (in the hundreds or thousands), using HNSW with max similarity layered on top may not be the optimal approach. This is likely why other vector libraries don't expose this setup.

If our aim is to introduce max similarity in Lucene, we might need to explore a more effective strategy. Although nested vectors could be promising, they're currently constrained by the 2B vector limit, which isn't ideal for ColBERT, given that each input token is represented as a dense vector. The primary limitation with HNSW and the knn codec today seems to be this 2B cap on vectors.

Given these factors, we may want to reconsider HNSW for this purpose. A scalable solution would likely involve running multiple queries (one per query vector) rather than relying on an aggregation strategy. Maybe the first goal should be to incorporate max sim for re-ranking use cases using a flat format?
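To make the "one query per query vector" idea concrete, here is a minimal sketch that unions one KnnFloatVectorQuery per query token vector; the field name "token_vector" and the overall wiring are assumptions for illustration, and hits would still need a full MaxSim re-rank afterwards:

import java.io.IOException;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;

// One ANN query per query vector, combined as a disjunction; scores add up for
// documents matched by several query vectors.
static TopDocs searchPerQueryVector(IndexSearcher searcher, float[][] queryVectors, int k)
    throws IOException {
  BooleanQuery.Builder union = new BooleanQuery.Builder();
  for (float[] qv : queryVectors) {
    union.add(new KnnFloatVectorQuery("token_vector", qv, k), BooleanClause.Occur.SHOULD);
  }
  return searcher.search(union.build(), k);
}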
Hi @jimczi, the main change in this PR is support for multi-vectors in flat readers and writers, along with a similarity spec for multiple vector values. It is possible that HNSW is not the ideal data structure to expose multi-vector ANN. We don't really change much in the hnsw impl, except using multi-vector similarity for comparisons (graph build and search). Users can use the

Notably, this change maps all vector values for a document to a single ordinal. This gets us past the 2B vector limit (which I like), but also reads all vector values for the document whenever fetched. I can't think of a case where we'd only want partial values, but if we do, perhaps we can handle it in the similarity/aggregate functions.
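For illustration only, a minimal sketch (not the PR's actual API) of how a consumer might slice a document's packed array back into its individual vector values; the helper name is made up:

import java.util.Arrays;

// Split a packed per-document array into its individual vector values.
static float[][] sliceByDimension(float[] packed, int dimension) {
  int numVectors = packed.length / dimension;
  float[][] vectors = new float[numVectors][];
  for (int i = 0; i < numVectors; i++) {
    vectors[i] = Arrays.copyOfRange(packed, i * dimension, (i + 1) * dimension);
  }
  return vectors;
}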
This could be set up using 1) a single-vector field for hnsw matching, and 2) a separate field with multi-vector values to directly access the flat format for the subset of matched readers. Basically a
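A sketch of what that two-field setup could look like at index time, using standard Lucene fields; the field names are assumptions, and plain binary doc values stand in here for the flat multi-vector storage discussed above:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.util.BytesRef;

static Document buildDoc(float[] summaryVector, float[] packedMultiVector) {
  Document doc = new Document();
  // 1) single vector indexed into HNSW for first-pass ANN matching
  doc.add(new KnnFloatVectorField("summary_vector", summaryVector, VectorSimilarityFunction.COSINE));
  // 2) all multi-vector values packed into one binary field, read back only for re-ranking
  ByteBuffer buf = ByteBuffer.allocate(packedMultiVector.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
  buf.asFloatBuffer().put(packedMultiVector);
  doc.add(new BinaryDocValuesField("multi_vector_values", new BytesRef(buf.array())));
  return doc;
}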
As mentioned earlier, here is my rough plan for splitting this change into smaller PRs. Some of these steps could be merged if the impl. warrants it:
The more I think about it, the less I feel like the knn codec is the best choice for this feature (assuming that this issue is focused on late interaction models).
Using the knn codec to handle multi-vectors seems limiting, especially since it treats multi-vectors as a single unit for scoring. This works well for late interaction models, where we're dealing with a collection of embeddings, but it's restrictive if we want to index each vector separately. It could be helpful to explore other options instead of relying on the knn codec alone. Along those lines, I created a quick draft of a

What do you think of this approach? It feels like we could skip the full knn framework if our main goal is just to score a bag of embeddings. This would keep things simpler and allow us to focus specifically on max similarity scoring without the added weight of the full knn codec.

My main worry is that adding multi-vectors to the knn codec as a late interaction model might add complexity later. It's really two different approaches, and it seems valuable to keep the option for indexing each vector separately. We could expose this flexibility through the aggregation function, but that might complicate things across all codecs, as they'd need to handle both aggregate and independent cases efficiently.
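As a concrete illustration of the doc-values direction (this is not the draft mentioned above), a sketch of reading a document's packed vectors back at search time for re-ranking; the field name and the float packing mirror the indexing sketch earlier and are assumptions:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

// Read a document's packed vector values straight from doc values; the caller
// would feed the result to a MaxSim/SumMax style aggregation for re-ranking.
static float[] readPackedVectors(LeafReader reader, String field, int docId) throws IOException {
  BinaryDocValues dv = DocValues.getBinary(reader, field);
  if (!dv.advanceExact(docId)) {
    return new float[0];  // document has no vectors for this field
  }
  BytesRef bytes = dv.binaryValue();
  float[] packed = new float[bytes.length / Float.BYTES];
  ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length)
      .order(ByteOrder.LITTLE_ENDIAN)
      .asFloatBuffer()
      .get(packed);
  return packed;
}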
One use-case for multi-vectors is indexing product aspects as separate embeddings for e-commerce search. At Amazon Product Search (where I work), we'd like to experiment with separate embeddings to represent product attributes, user product opinions, and product images. Such e-commerce use-cases would have a limited set of embeddings, but leverage similarity computations across all of them.

I see your point about scaling challenges with very high cardinality multi-vectors like token level ColBERT embeddings. Keeping them in a

However, I do think there is space for both solutions. It's not obvious to me how the knn codec gets polluted with future complexity. We would still support single vectors as is. My mental model is: if you want to use multi-vectors in nearest neighbor search (hnsw or newer algos later), index them in the knn field. Otherwise, index them separately as doc-values used only for re-ranking top results.
This seems like just more than one knn field, or the nested field support. But I understand the desire to add multi-vector support to the flat codecs. I am honestly torn about what the best path forward is for the majority of users in Lucene.
I tried to find some blogs and benchmarks on other library implementations. Astra Db, Vespa, faiss and nmslib, all seem to support multi-vectors in some form. From what I can tell, Astra Db and Vespa have ColBERT style multi-vector support in ANN [1] [2]. Benchmarks indicate ColBERT outperforms other techniques in quality, but full ColBERT on ANN has higher latency [3]. For large scale applications, users seem to overquery on ANN with single vector representations, and rerank them with ColBERT token vectors [4]. However, there's also ongoing work/research on reducing the no. of embeddings in ColBERT, like PLAID which replaces a bunch of vectors with their centroids [5]. ...
I hear you! And I don't want to add complexity only because we have some body of work in this PR. Thanks for raising the concern Jim, it led me to some interesting reading. ...

My current thinking is that this is a rapidly evolving field, and it's early to lean one way or another. Adding this support unlocks experimentation. We might add different, scalable ANN algos going forward, and our flat storage format should work with most of them. Meanwhile, there's research on different ways to run late interaction with multiple but fewer vectors. This change will help users experiment with what works at their scale, for their cost/performance/quality requirements.

I'm happy to change my perspective, and would like to hear more opinions. One reason to not add this would be if it makes the single vector setup hard to evolve. I'd like to understand if (and how) this is happening, and think about how we can address those concerns.

1: https://docs.datastax.com/en/ragstack/examples/colbert.html
Amen! This ends up being so domain specific. Multi-embeddings become key when you deal with domain voids in the LLMs used to create the embeddings. That's most big corpuses. So at least being able to experiment would get you far more feedback. I would be ok with writing some tests if that helps.
I believe we should carefully consider the approach to adding multi-vector support through an aggregate function. From the outset, we assume that multi-vectors should be scored together, which is an important principle. Moreover, the default aggregate function proposed in the PR relies on brute force, which is not practical for any indexing setup.

My concern is that this proposal doesn't truly add support for independent multi-vectors. Instead, it introduces a block of vectors that must be scored together, which feels like a workaround rather than a comprehensive solution. This approach doesn't address the key challenges of implementing true multi-vector support in the codec.

The root issue is that the current KNN codec assumes the number of vectors is bounded by a single integer, a limitation that needs to be addressed first. Removing this constraint is a complex task but essential for properly supporting multi-vectors. Once that foundation is in place, adding support for setups like ColBERT should become relatively straightforward.

Finally, while the max-sim function proposed in this PR may work as a ranking function, it isn't suitable for indexing any documents. A true solution should allow for independent multi-vectors to be queried and scored flexibly without these constraints.
That's a valid concern. I've been thinking about a more comprehensive multi-vector solution. Sharing some raw thoughts below, would love to get feedback.

We support a default aggregation value of

Once this is in place, we can add support for "dependent" multi-vector values like ColBERT. They'll take an aggregation function. Each graph node will represent all vectors for a document and use aggregated similarity (like in this PR). This will let us experiment with full ANN on ColBERT style multi-vectors.
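Purely as an illustration of the shape this setting could take (hypothetical names, not actual Lucene API):

// Hypothetical per-field aggregation setting, fixed at field creation time.
enum MultiVectorAggregation {
  NONE,     // "independent" vectors: each vector value is its own graph node / scoring unit
  SUM_MAX   // "dependent" vectors (ColBERT-style): all of a document's vectors score together
}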
...contd. from above – thoughts on supporting independent multi-vectors specified via

Our codec today has a single, unique, sequentially increasing vector ordinal per doc, which we can store and fetch with the DirectMonotonicWriter. For multi-vectors, we need to handle multiple nodeIds mapping to a single document. I'm thinking of using "ordinals" and "sub-ordinals" to identify each vector value. 'Ordinal' is incremented when docId changes. 'Sub-ordinals' start at 0 for each new doc and are incremented for subsequent vector values in the doc. A nodeId in the graph is a "long" with ordinal and sub-ordinal packed into the MSB and LSB bits separately.

For flat storage, we can continue to use the technique in this PR; i.e. have one DirectMonotonicWriter object for docIds indexed by "ordinals", and another that stores start offsets for each docId, again indexed by ordinals. The sub-ordinal bits help us seek to exact vector values from this metadata.

int ordToDoc(long nodeId) {
  // get int ordinal from most-significant 32 bits
  // get docId for the ordinal from DirectMonotonicWriter
}

float[] vectorValue(long nodeId) {
  // get int ordinal from most-significant 32 bits
  // get "startOffset" for ordinal
  // get subOrdinal from least-significant 32 bits
  // read vector value from startOffset + (subOrdinal * dimension * byteSize)
}

float[] getAllVectorValues(long nodeId) {
  // get int ordinal from most-significant 32 bits
  // get "startOffset" for ordinal
  // get "endOffset" from offset value for ordinal + 1
  // return values from [startOffset, endOffset)
}

With this setup, we won't need parent-block join queries for multiple vector values. And we can use

I'm skeptical whether this will give a visible performance boost. It should at least be similar to the block-join setup we have today, but hopefully more convenient to use. And it sets us up for "dependent" multi-vector values like ColBERT. We'll need to code this up to iron out any wrinkles. I can work on a draft PR if the idea makes sense.

Note that this still doesn't allow >2B vector values. While the "long" nodeId can support it, our ANN impl. returns arrays containing all nodeIds in various places. I don't think Java can support >2B array length. But we can address this limitation separately, perhaps with a different ANN algo for such high cardinality graphs.
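A small, self-contained sketch of the ordinal / sub-ordinal packing described above (illustrative only, not code from this PR):

// Pack the doc-level ordinal into the most-significant 32 bits and the
// per-doc sub-ordinal into the least-significant 32 bits of a long nodeId.
static long toNodeId(int ordinal, int subOrdinal) {
  return (((long) ordinal) << 32) | (subOrdinal & 0xFFFFFFFFL);
}

static int ordinal(long nodeId) {
  return (int) (nodeId >>> 32);   // most-significant 32 bits
}

static int subOrdinal(long nodeId) {
  return (int) nodeId;            // least-significant 32 bits
}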
Your proposal to implement (sidenote: if you are doing max/average, you can do that during index time though, right?)

I'm currently conducting A/B tests on three methods to retrieve and rank documents with multiple vectors:

The third approach is particularly promising for domain-specific applications, where standard aggregation methods may not suffice. For instance, embedding tags could be linked to user access controls, unlocking certain vectors at query time, or to specific n-grams, activating them based on query content. Incorporating a mechanism to override the default aggregation method would facilitate experimentation with these strategies.
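A toy sketch of that kind of query-time override, where only vectors whose tag is "unlocked" for the current query participate in scoring; everything here (names, tag model, dot-product similarity) is illustrative:

import java.util.Set;

// Score a document using only the vectors whose tag the current query may see.
static float tagFilteredMaxSim(float[][] docVectors, String[] vectorTags,
                               Set<String> allowedTags, float[] queryVector) {
  float best = Float.NEGATIVE_INFINITY;
  for (int i = 0; i < docVectors.length; i++) {
    if (!allowedTags.contains(vectorTags[i])) {
      continue;  // this vector stays "locked" for the current query
    }
    float dot = 0f;
    for (int j = 0; j < queryVector.length; j++) {
      dot += docVectors[i][j] * queryVector[j];
    }
    best = Math.max(best, dot);
  }
  return best == Float.NEGATIVE_INFINITY ? 0f : best;
}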
Thank you for sharing these use-cases @krickert!
Honestly, I think the existing parent-block join can achieve most use-cases for independent multi-vectors (the passage vector use case), but the approach above might make it easier to use? We also need it for dependent multi-vectors like ColBERT, though it's a separate question whether ANN is even viable for ColBERT (v/s only for reranking). I'd like to know what issues or limitations people face with the existing parent-child support for multiple vector values, so we can address them here.
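For reference, the existing parent-block join path looks roughly like the sketch below; it uses the standard Lucene join API rather than anything in this PR, and the field/term names are assumptions (passages indexed as child docs with a "passage_vector" field, parents marked with docType=parent):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.search.join.DiversifyingChildrenFloatKnnVectorQuery;
import org.apache.lucene.search.join.QueryBitSetProducer;

// Runs ANN over child passage vectors, keeping at most one (best) child per parent document.
static TopDocs searchPassages(IndexSearcher searcher, float[] queryVector, int k) throws IOException {
  BitSetProducer parents = new QueryBitSetProducer(new TermQuery(new Term("docType", "parent")));
  Query childFilter = new TermQuery(new Term("docType", "passage"));
  Query knn = new DiversifyingChildrenFloatKnnVectorQuery("passage_vector", queryVector, childFilter, k, parents);
  return searcher.search(knn, k);
}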
Not sure. But it is frustrating for me: we only calculate K chunks and not N documents. I want to return N documents all the time, and keep running K until N is reached. Since it runs K on the chunks, I'd rather it return all the chunks that it can until it reaches N documents. Then we can return the chunks that match, which can be used for highlighting.
Indexing the child docs requires making more docs. We just care about the resulting embedding, so why not treat it like a tensor instead of an entire document? It's frustrating to always make a child doc for multiple vectors when I can just do a keyword-value style instead. Also, there are definitely some limitations with how you can use it with scoring, and the query ends up looking like a mess. If we can simplify the query syntax, that would help a lot. If you can get a unit test going for your PR, I'd be glad to expand on it and play with it a bit.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
Adds support for multi-valued vectors to Lucene.
In addition to max-similarity aggregations like parent-block joins, this change supports ColBERT style distance functions that compute interaction across all query and document vector values. Documents can have a variable number of vector values, but to support distance function computations, we require all values to have the same dimension.
Addresses #12313.
Approach
We define a new "Tensor" field that comprises multiple vector values, and a new TensorSimilarityFunction to compute distance across multiple vectors (uses SumMax() currently). Node ordinal is assigned to the tensor value, giving us one ordinal per document. All vector values of a tensor field are processed together during writing, reading and scoring. They are passed around as a packed float[] or byte[] array with all vector values concatenated. Consumers (like the TensorSimilarityFunction) slice this array by dimension to get individual vector values.

Tensors are stored using a new FlatVectorStorage that supports writing/reading variable length values per field (allowing us to have a different number of vectors per tensor). We reuse the existing HNSW readers and writers. Each graph node is a tensor and maps to a single document. I also added a new codec tensor format, to allow both tensors and vectors to coexist. I'm not yet sure how to integrate with the quantization changes (a separate, later change) and didn't want to force everything into a single format. Tensors continue to work with KnnVectorWriter/Reader and extend the FlatVectorWriter/Reader classes.

Finally, I named the field and format "Tensors" though technically these are only rank-2 tensors. The thought was that we might extend this field and format if we ever go for higher rank tensor support. I'm open to renaming based on community feedback.
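A rough sketch of a SumMax-style aggregation over the packed arrays described above, assuming dot product as the per-vector similarity; the PR's actual TensorSimilarityFunction may differ in details:

// Sum, over query vectors, of the maximum similarity against any of the document's vectors.
static float sumMax(float[] packedQuery, float[] packedDoc, int dimension) {
  float total = 0f;
  for (int q = 0; q < packedQuery.length; q += dimension) {
    float best = Float.NEGATIVE_INFINITY;
    for (int d = 0; d < packedDoc.length; d += dimension) {
      float dot = 0f;
      for (int i = 0; i < dimension; i++) {
        dot += packedQuery[q + i] * packedDoc[d + i];
      }
      best = Math.max(best, dot);
    }
    total += best;
  }
  return total;
}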
Major Changes
The PR has a lot of files, which is not practical to review. Here are the files with the key changes. If we align on the approach, I'm happy to re-raise separate split PRs with the different changes.
- Lucene99FlatTensorsWriter for writing in the new flat tensor format - lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatTensorsWriter.java
- Lucene99FlatTensorsReader for reading the flat tensor format - lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatTensorsReader.java
- HnswTensorFormat that uses FlatTensorFormat to initialize the flat storage readers/writers underlying the HNSW reader/writer.
Open Questions
- Should we use the vectorEncoding and vectorDimension attributes in FieldInfo instead of a separate tensor encoding and dimension?