-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate COSINE VectorSimilarity function #13308
Conversation
As per my understanding the segment present at above directory already has |
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java
Outdated
Show resolved
Hide resolved
Correct. Practically, this means keeping the enum constant but marking it as deprecated, and modifying IndexingChain to fail using COSINE on 10.x Lucene indexes. It would be nice to also add a new file format under lucene/codecs that does cosine similarity, like we just added for hamming distance. This could help with the migration for some users. |
Thanks @jpountz for confirming and suggestions. |
@@ -302,7 +301,6 @@ private static VectorSimilarityFunction getDistFunc(IndexInput input, byte b) th | |||
List.of( | |||
VectorSimilarityFunction.EUCLIDEAN, | |||
VectorSimilarityFunction.DOT_PRODUCT, | |||
VectorSimilarityFunction.COSINE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is exactly the problem. There must be a null entry, so the ordinals in index format stay alive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ The reason for the list here is exactly to support and allow such evolution and separation from the enum ordinals. A null here trivially need to a change of List implementation type. And tests need to be carefully updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course, null does not work with default lists :-) Also downstream the code needs to check for a null value.
In any case we can later also go back to a Map<Integer,VectorSimilarity> or to an array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @uschindler @ChrisHegarty for the pointers, but IIUC, we cannot add a null entry to VectorSimilarityFunction map for COSINE in these files because we still want to use COSINE similarity function for Lucene versions < 10.0, right? Instead we need to create a separate file for new versions. Please correct me if I am wrong here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a larger review.
I think it would be prudent to first attempt to deprecate and handle internal usage, etc. After that is merged and backported, another PR can be made to fully remove it from main only on writes.
At least, that would make reviewing & testing this simpler :)
@@ -98,8 +99,14 @@ public class TestBasicBackwardsCompatibility extends BackwardsCompatibilityTestB | |||
private static final int KNN_VECTOR_MIN_SUPPORTED_VERSION = LUCENE_9_0_0.major; | |||
private static final String KNN_VECTOR_FIELD = "knn_field"; | |||
private static final FieldType KNN_VECTOR_FIELD_TYPE = | |||
KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.COSINE); | |||
KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.DOT_PRODUCT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% sure this will work as simply as this. We will index NEW fields into an index created with the OLD code (see where we call addDoc
), meaning you will be trying to index a dot_product
knn field into an index with a cosine
field.
You will have to handle the index versions from before a certain version (thus on cosine) to ones created after (and thus dot-product).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense.
private static final float[] KNN_VECTOR = {0.2f, -0.1f, 0.1f}; | ||
private static final float[] NORMALIZED_KNN_VECTOR = new float[KNN_VECTOR.length]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same sort of comment as above on versioning and that we index fields into an old index.
@@ -61,24 +60,6 @@ public float compare(byte[] v1, byte[] v2) { | |||
} | |||
}, | |||
|
|||
/** | |||
* Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to deprecate this first, get all the internal paths updated to not use it.
Then the deprecation can be merged and backported. Then work on removing it from main.
Mostly because the deprecation path and internal code usage will all have to occur anyways :)
@@ -30,23 +28,6 @@ | |||
public class ScalarQuantizedRandomVectorScorer | |||
extends RandomVectorScorer.AbstractRandomVectorScorer<byte[]> { | |||
|
|||
public static float quantizeQuery( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even in Lucene 10, we will need to support quantizing a query sent to a Lucene index that uses Cosine.
return (1 + cosine(v1, v2)) / 2; | ||
} | ||
}, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will get tricky. If you remove this ordinal (I think it would be wise to deprecate first, work through the warnings, etc. get everything passing, etc.), field info & knn vector format readers will need to somehow distinguish between "Max inner product" (whose ordinal decreases to cosine's) and "cosine but in an old index version".
Maybe on write, we adjust the ordinal dynamically, knowing what they are (thus correcting MIP's ordinal on write, allowing for readers to distinguish).
d500c89
to
589d1e2
Compare
Thanks @benwtrent for all the feedback. |
That doesn't make sense to me. IntelliJ tells me there are over 30 usages of VectorSimilarityFunction.COSINE. |
Sorry I think I am missing something, I think we cannot remove COSINE from those places, since we require them in LUCENE version < 10.0. They are mostly being used in FieldInfosFormat, VectorReader, ScalarQuantizedVectorWriter classes. I think the main task is to first get rid of dependency on enum for VectorSimilarityFunction and instead have some interface to get those functions and support. |
@Pulkitg64 could you deprecate also |
lucene/CHANGES.txt
Outdated
* GITHUB#13281: Mark COSINE VectorSimilarityFunction as deprecated. (Pulkit Gupta) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be under 9.11 as we will backport the deprecation as a way to warn users that it is going away in 10.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Make sense. Will change in next revision
Thanks for mentioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, though you it would be good get another reviewers approval too.
Hi @benwtrent, could we merge this change, if everything looks good to you? |
…ntries; don't even consider "forbidden" entries anymore (apache#13429) Hunspell: speed up "compress"; minimize the number of the generated entries; don't even consider "forbidden" entries anymore
…3414) Add a count field to LabelAndValue
Pull request apache#13406 inadvertly broke Lucene's handling of tragic exceptions by stopping after the first `DocValuesProducer` whose `close()` calls throws an exception, instead of keeping calling `close()` on further producers in the list. This moves back to the previous behavior. Closes apache#13434
…e vector values is empty (apache#13444) This commit ensures that SimpleText[Float|Byte]VectorValues::scorer returns null when the vector values is empty, as per the scorer javadoc. Other KnnVectorsReader implementations have specialised empty implementations that do similar, e.g. OffHeapFloatVectorValues.EmptyOffHeapVectorValues. The VectorScorer interface in new in Lucene 9.11, see apache#13181 An existing test randomly hits this, but a new test has been added that exercises this code path consistently. It's also useful to verify other KnnVectorsReader implementations.
… format (apache#13445) When int4 scalar quantization was merged, it added a new way to dynamically calculate quantiles. However, when that was merged, I inadvertently changed the default behavior, where a null confidenceInterval would actually calculate the dynamic quantiles instead of doing the previous auto-setting to 1 - 1/(dim + 1). This commit formalizes the dynamic quantile calculate through setting the confidenceInterval to 0, and preserves the previous behavior for null confidenceIntervals so that users upgrading will not see different quantiles than they would expect.
…ern. (apache#13450) This applies to files where performing readahead could help: - Doc values data (`.dvd`) - Norms data (`.nvd`) - Docs and freqs in postings lists (`.doc`) - Points data (`.kdd`) Other files (KNN vectors, stored fields, term vectors) keep using a `RANDOM` advice.
This follows a similar approach as postings and only prefetches the first page of data. I verified that it works well for collectors such as `TopFieldCollector`, as `IndexSearcher` first pulls a `LeafCollector`, then a `BulkScorer` and only then starts feeding the `BulkScorer` into the `LeafCollector`. So the background I/O for the `LeafCollector` which will prefetch the first page of doc values and the background I/O for the `BulkScorer` will run in parallel.
@Pulkitg64 sorry for radio silence, I was out of office and swamped with other things. If you could move the change log to 9.12, I can merge and backport. |
…che#13322) * implement Weight#count for vector values * add change log * apply review comment * apply review comment * changelog * remove null check
This commit updates the Gradle wrapper to 8.8, which has support for Java 22.
…ks when gradle version is changed (apache#13456)
If Caller requires Weight then they have to keep track of Weight with which Scorer was created in the first place instead of relying on Scorer. Closes apache#13410
…e#13460) Merges all immutable attributes in FieldInfos.FieldNumbers into one hashmap saving memory when writing big indices. Fixes an exotic bug when calling clear where not all attributes were cleared.
MultiTermQuery return null for ScoreSupplier if there are no terms in an index that match query terms. With the introduction of PR apache#12156 we saw degradation in performance of bool queries where one of the mandatory clauses is a TermInSetQuery with query terms not present in the field. Before for such cases TermsInSetQuery returned null for ScoreSupplier which would shortcut the whole bool query. This PR adds ability for MultiTermQuery to return null for ScoreSupplier if a field doesn't contain any query terms. Relates to PR apache#12156
We consume a lot of memory for the `indexIn` slices. If `indexIn` is of type `MemorySegmentIndexInput` the overhead of keeping loads of slices around just for cloning is far higher than the extra 12b per reader this adds (the slice description alone often costs a lot). In a number of Elasticsearch example uses with high segment counts I investigated, this change would save up to O(GB) of heap.
also make links to CONTRIBUTING.md more prominent, and demote link to dev-docs
…e link to dev-docs
I somehow messed up with my branch while pulling changes from Sorry for the mess! |
Description
Tries to solve #13281 by deprecating
COSINE
VectorSimilarity function for Vector Search.