Deprecate COSINE VectorSimilarity function #13308

Pulkitg64 · 2024-04-15T20:34:11Z

Description

Attempts to resolve #13281 by deprecating the COSINE VectorSimilarity function for vector search.

Pulkitg64 · 2024-04-15T20:46:17Z

The TestInt8HnswBackwardsCompatibility and TestBasicBackwardsCompatibility test cases are failing for Lucene versions older than LUCENE_10_0_0. For older Lucene versions, we read the index directory from the resources folder. For example, for Lucene 9.10.0 the index is located at lucene/backward-codecs/src/test/org/apache/lucene/backward_index/index.9.10.0-cfs.zip.

As I understand it, the segment in that directory already has a KNNVectorField with MAXIMUM_INNER_PRODUCT as its vector similarity function. The tests above change the similarity function to DOT_PRODUCT, which causes a conflict when more docs are added to the index, and hence these test cases fail.

lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java

jpountz · 2024-04-17T20:47:32Z

> Oh okay, that means we need to remove support for COSINE only from the indexing side, not from the searching side?

Correct. Practically, this means keeping the enum constant but marking it as deprecated, and modifying IndexingChain to fail using COSINE on 10.x Lucene indexes.

It would be nice to also add a new file format under lucene/codecs that does cosine similarity, like we just added for hamming distance. This could help with the migration for some users.
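The "keep the constant, deprecate it, and reject it on writes" approach described above could be sketched roughly as follows. This is a minimal illustration, not Lucene's IndexingChain: the class `CosineWriteGuardSketch`, the stand-in enum `Similarity`, the method `checkWritable`, and the cutoff at major version 10 are all hypothetical names and assumptions made for this example.

```java
import java.util.Objects;

/** Hypothetical sketch; none of these names are Lucene APIs. */
final class CosineWriteGuardSketch {

  /** Stand-in for VectorSimilarityFunction: COSINE stays readable but is deprecated. */
  enum Similarity {
    EUCLIDEAN,
    DOT_PRODUCT,
    /** @deprecated normalize vectors and use DOT_PRODUCT instead. */
    @Deprecated
    COSINE,
    MAXIMUM_INNER_PRODUCT
  }

  /** Assumed cutoff: indexes created on this major version or later reject COSINE writes. */
  static final int FIRST_MAJOR_REJECTING_COSINE = 10;

  /**
   * Fails when a field with COSINE similarity is written into a new-style index.
   * Indexes created on older major versions stay writable with COSINE.
   */
  @SuppressWarnings("deprecation")
  static void checkWritable(String field, Similarity similarity, int indexCreatedMajor) {
    Objects.requireNonNull(similarity, "similarity");
    if (similarity == Similarity.COSINE && indexCreatedMajor >= FIRST_MAJOR_REJECTING_COSINE) {
      throw new IllegalArgumentException(
          "field \"" + field + "\": COSINE is not supported for indexes created on version "
              + indexCreatedMajor + "; normalize vectors and use DOT_PRODUCT");
    }
  }

  public static void main(String[] args) {
    // Allowed: the index was created by an older major version.
    checkWritable("knn_field", Similarity.COSINE, 9);
    try {
      checkWritable("knn_field", Similarity.COSINE, 10);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Keeping the constant in the enum means readers can still decode existing segments; only the write path changes behavior.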

Pulkitg64 · 2024-04-18T07:03:33Z

Thanks @jpountz for confirming and for the suggestions.

uschindler · 2024-04-19T11:47:27Z

lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94FieldInfosFormat.java

@@ -302,7 +301,6 @@ private static VectorSimilarityFunction getDistFunc(IndexInput input, byte b) th
      List.of(
          VectorSimilarityFunction.EUCLIDEAN,
          VectorSimilarityFunction.DOT_PRODUCT,
-          VectorSimilarityFunction.COSINE,


Here is exactly the problem. There must be a null entry, so the ordinals in index format stay alive.

ChrisHegarty · 2024-04-19

++ The reason for the list here is exactly to support and allow such evolution and separation from the enum ordinals. A null here trivially requires a change of the List implementation type, and tests need to be carefully updated.

uschindler · 2024-04-19

Of course, null does not work with default lists :-) Also, the code downstream needs to check for a null value.

In any case, we can later also go back to a Map<Integer,VectorSimilarityFunction> or to an array.

Pulkitg64 · 2024-04-21

Thanks @uschindler @ChrisHegarty for the pointers, but IIUC we cannot add a null entry for COSINE to the VectorSimilarityFunction list in these files, because we still want to use the COSINE similarity function for Lucene versions < 10.0, right? Instead, we need to create a separate file for new versions. Please correct me if I am wrong here.
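The null-placeholder idea from this thread can be illustrated with a small sketch. `List.of` rejects null elements, so a table that retires a slot needs a different list type (or an array), and the decoder has to check for the hole. The names `OrdinalTableSketch`, `Similarity`, `BY_ORDINAL`, and `fromOrdinal` are hypothetical, and the slot numbers are assumed from the order shown in the diff above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of an ordinal table with a retired slot; not Lucene code. */
final class OrdinalTableSketch {

  /** Stand-in enum from which COSINE has already been removed. */
  enum Similarity {
    EUCLIDEAN,
    DOT_PRODUCT,
    MAXIMUM_INNER_PRODUCT
  }

  // Arrays.asList permits null elements, unlike List.of, which throws NullPointerException.
  static final List<Similarity> BY_ORDINAL =
      Collections.unmodifiableList(
          Arrays.asList(
              Similarity.EUCLIDEAN, // slot 0
              Similarity.DOT_PRODUCT, // slot 1
              null, // slot 2: retired (assumed to be the former COSINE slot)
              Similarity.MAXIMUM_INNER_PRODUCT)); // slot 3 keeps its on-disk value

  /** Decodes an on-disk ordinal, rejecting out-of-range values and the retired slot. */
  static Similarity fromOrdinal(int ordinal) {
    if (ordinal < 0 || ordinal >= BY_ORDINAL.size()) {
      throw new IllegalArgumentException("invalid distance function: " + ordinal);
    }
    Similarity similarity = BY_ORDINAL.get(ordinal);
    if (similarity == null) {
      throw new IllegalArgumentException("distance function " + ordinal + " is no longer supported");
    }
    return similarity;
  }
}
```

Because the hole keeps the later entries at their original positions, a value written for MAXIMUM_INNER_PRODUCT by older code still decodes to the same function.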

benwtrent

Left a larger review.

I think it would be prudent to first attempt to deprecate and handle internal usage, etc. After that is merged and backported, another PR can be made to fully remove it from main only on writes.

At least, that would make reviewing & testing this simpler :)

benwtrent · 2024-04-22T21:37:44Z

...ckward-codecs/src/test/org/apache/lucene/backward_index/TestBasicBackwardsCompatibility.java

@@ -98,8 +99,14 @@ public class TestBasicBackwardsCompatibility extends BackwardsCompatibilityTestB
  private static final int KNN_VECTOR_MIN_SUPPORTED_VERSION = LUCENE_9_0_0.major;
  private static final String KNN_VECTOR_FIELD = "knn_field";
  private static final FieldType KNN_VECTOR_FIELD_TYPE =
-      KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.COSINE);
+      KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.DOT_PRODUCT);


I am not 100% sure this will work as simply as this. We will index NEW fields into an index created with the OLD code (see where we call addDoc), meaning you will be trying to index a dot_product knn field into an index with a cosine field.

You will have to handle indexes created before a certain version (thus on cosine) differently from ones created after it (and thus on dot-product).

Pulkitg64 · 2024-04-26

Makes sense.
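The version handling discussed in this thread might look roughly like the sketch below: the test picks the similarity that matches the index it is about to add documents to, since adding a field with a different similarity than the existing one conflicts. All names here (`BackCompatFieldTypeSketch`, `similarityFor`, `checkSameSimilarity`) and the cutoff at major version 10 are assumptions for illustration, not the test's real helpers.

```java
/** Hypothetical sketch of picking a similarity per back-compat index; not Lucene test code. */
final class BackCompatFieldTypeSketch {

  enum Similarity {
    COSINE,
    DOT_PRODUCT
  }

  /** Assumed cutoff: back-compat indexes created before this major version used COSINE. */
  static final int FIRST_MAJOR_ON_DOT_PRODUCT = 10;

  /** Chooses the similarity that matches the field already present in the old index. */
  static Similarity similarityFor(int indexCreatedMajor) {
    return indexCreatedMajor < FIRST_MAJOR_ON_DOT_PRODUCT
        ? Similarity.COSINE
        : Similarity.DOT_PRODUCT;
  }

  /** Mimics a schema consistency check: a field cannot switch similarity once indexed. */
  static void checkSameSimilarity(String field, Similarity existing, Similarity incoming) {
    if (existing != incoming) {
      throw new IllegalArgumentException(
          "cannot change field \"" + field + "\" from " + existing + " to " + incoming);
    }
  }

  public static void main(String[] args) {
    int oldIndexMajor = 9;
    Similarity existing = Similarity.COSINE; // what the old index was created with
    // Matching choice passes; a hard-coded DOT_PRODUCT would throw here.
    checkSameSimilarity("knn_field", existing, similarityFor(oldIndexMajor));
  }
}
```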

benwtrent · 2024-04-22T21:38:13Z

...ard-codecs/src/test/org/apache/lucene/backward_index/TestInt8HnswBackwardsCompatibility.java

  private static final float[] KNN_VECTOR = {0.2f, -0.1f, 0.1f};
+  private static final float[] NORMALIZED_KNN_VECTOR = new float[KNN_VECTOR.length];


Same sort of comment as above on versioning and that we index fields into an old index.
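The NORMALIZED_KNN_VECTOR constant in the diff above presumably exists because dot product over unit-length vectors equals cosine similarity over the original vectors. A self-contained sketch of that relationship follows; the helper names (`dot`, `l2normalize`, `cosine`) are written for this example and are not Lucene's VectorUtil methods, and the query vector is made up.

```java
/** Hypothetical sketch: cosine similarity equals dot product of L2-normalized vectors. */
final class NormalizedDotProductSketch {

  /** Plain dot product of two equal-length vectors. */
  static float dot(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Returns a unit-length copy of v; the input array is left untouched. */
  static float[] l2normalize(float[] v) {
    double norm = Math.sqrt(dot(v, v));
    if (norm == 0) {
      throw new IllegalArgumentException("cannot normalize a zero vector");
    }
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (float) (v[i] / norm);
    }
    return out;
  }

  /** Cosine of the angle between a and b, computed without normalizing first. */
  static float cosine(float[] a, float[] b) {
    return (float) (dot(a, b) / Math.sqrt((double) dot(a, a) * dot(b, b)));
  }

  public static void main(String[] args) {
    float[] doc = {0.2f, -0.1f, 0.1f}; // same values as KNN_VECTOR in the diff
    float[] query = {0.3f, 0.4f, -0.5f}; // made-up query vector
    System.out.println("cosine(doc, query)         = " + cosine(doc, query));
    System.out.println("dot(normalized, normalized) = " + dot(l2normalize(doc), l2normalize(query)));
  }
}
```

The two printed values agree up to floating-point rounding, which is why normalizing at index and query time lets DOT_PRODUCT replace COSINE.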

benwtrent · 2024-04-22T22:14:14Z

lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java

@@ -61,24 +60,6 @@ public float compare(byte[] v1, byte[] v2) {
    }
  },

-  /**
-   * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all


I think it would be good to deprecate this first, get all the internal paths updated to not use it.

Then the deprecation can be merged and backported. Then work on removing it from main.

Mostly because the deprecation path and internal code usage will all have to occur anyways :)

benwtrent · 2024-04-22T22:15:27Z

lucene/core/src/java/org/apache/lucene/util/quantization/ScalarQuantizedRandomVectorScorer.java

@@ -30,23 +28,6 @@
 public class ScalarQuantizedRandomVectorScorer
    extends RandomVectorScorer.AbstractRandomVectorScorer<byte[]> {

-  public static float quantizeQuery(


Even in Lucene 10, we will need to support quantizing a query sent to a Lucene index that uses Cosine.

benwtrent · 2024-04-22T22:19:19Z

lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java

-      return (1 + cosine(v1, v2)) / 2;
-    }
-  },
-


I think this will get tricky. If you remove this ordinal (I think it would be wise to deprecate first, work through the warnings, and get everything passing), the field info & knn vector format readers will need to somehow distinguish between "Max inner product" (whose ordinal decreases to cosine's) and "cosine but in an old index version".

Maybe on write, we adjust the ordinal dynamically, knowing what they are (thus correcting MIP's ordinal on write, allowing for readers to distinguish).
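One way to read the "adjust the ordinal on write" suggestion is to give each function an explicit on-disk id that does not depend on `Enum.ordinal()`, so removing or reordering constants cannot shift what readers see. The sketch below assumes ids 0 to 3 with 2 formerly used by COSINE; the class `StableDiskIdSketch` and its methods are hypothetical and only illustrate the idea.

```java
/** Hypothetical sketch of explicit on-disk ids decoupled from enum ordinals; not Lucene code. */
final class StableDiskIdSketch {

  /** Post-removal stand-in enum; the declaration order is deliberately arbitrary. */
  enum Similarity {
    MAXIMUM_INNER_PRODUCT,
    EUCLIDEAN,
    DOT_PRODUCT
  }

  /** Assumed id that older indexes used for COSINE; never written by new code. */
  static final int LEGACY_COSINE_ID = 2;

  /** Writer side: always emits the historical id, regardless of enum declaration order. */
  static int toDiskId(Similarity similarity) {
    switch (similarity) {
      case EUCLIDEAN:
        return 0;
      case DOT_PRODUCT:
        return 1;
      case MAXIMUM_INNER_PRODUCT:
        return 3;
      default:
        throw new AssertionError("unhandled similarity: " + similarity);
    }
  }

  /** Reader side: lets callers detect old COSINE fields before decoding. */
  static boolean isLegacyCosine(int diskId) {
    return diskId == LEGACY_COSINE_ID;
  }

  /** Reader side: decodes ids of functions that still exist in the enum. */
  static Similarity fromDiskId(int diskId) {
    switch (diskId) {
      case 0:
        return Similarity.EUCLIDEAN;
      case 1:
        return Similarity.DOT_PRODUCT;
      case LEGACY_COSINE_ID:
        // A real reader would route this to a read-only cosine code path instead.
        throw new UnsupportedOperationException("legacy COSINE field");
      case 3:
        return Similarity.MAXIMUM_INNER_PRODUCT;
      default:
        throw new IllegalArgumentException("invalid distance function: " + diskId);
    }
  }
}
```

With ids pinned this way, a reader never has to guess from the index version whether id 2 means cosine or max inner product.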

Pulkitg64 · 2024-05-06T11:20:22Z

Thanks @benwtrent for all the feedback.
I have pushed another revision that only marks the COSINE function as deprecated. I didn't find any internal usages of this function.
In a follow-up PR, I will remove COSINE from writes for Lucene versions >= 10.0.0.

benwtrent · 2024-05-08T12:20:11Z

> I didn't find any internal usages of this function.

That doesn't make sense to me. IntelliJ tells me there are over 30 usages of VectorSimilarityFunction.COSINE.

Pulkitg64 · 2024-05-08T13:44:29Z

> > I didn't find any internal usages of this function.
>
> That doesn't make sense to me. IntelliJ tells me there are over 30 usages of VectorSimilarityFunction.COSINE.

Sorry, I think I am missing something. I think we cannot remove COSINE from those places, since we require them for Lucene versions < 10.0. They are mostly used in the FieldInfosFormat, VectorReader, and ScalarQuantizedVectorWriter classes.

I think the main task is to first get rid of the dependency on the VectorSimilarityFunction enum and instead have some interface that provides those functions.

benwtrent · 2024-05-08T13:51:25Z

@Pulkitg64 could you also deprecate VectorUtil.cosine? That is technically a public API.

benwtrent · 2024-05-09T13:30:22Z

lucene/CHANGES.txt

+* GITHUB#13281: Mark COSINE VectorSimilarityFunction as deprecated. (Pulkit Gupta)
+


This should be under 9.11, as we will backport the deprecation as a way to warn users that it is going away in 10.0.

Pulkitg64 · 2024-05-09

Yes, makes sense. I will change it in the next revision.
Thanks for mentioning it.

ChrisHegarty

LGTM, though it would be good to get another reviewer's approval too.

Pulkitg64 · 2024-05-24T19:44:11Z

Hi @benwtrent, could we merge this change, if everything looks good to you?

benwtrent · 2024-06-05T13:26:44Z

@Pulkitg64 sorry for radio silence, I was out of office and swamped with other things. If you could move the change log to 9.12, I can merge and backport.

Pulkitg64 · 2024-06-09T18:15:55Z

I somehow messed up my branch while pulling changes from the main branch. @benwtrent I have created a new PR for marking the cosine VectorSimilarity function as deprecated.

#13473

Sorry for the mess!

benwtrent requested changes Apr 16, 2024

uschindler reviewed Apr 19, 2024

benwtrent reviewed Apr 22, 2024

Mark COSINE Vector Similarity Function as Deprecated

589d1e2

Pulkitg64 force-pushed the cosine-deprecate branch from d500c89 to 589d1e2 Compare May 6, 2024 10:07

tidy fixes

3f23885

Pulkitg64 requested a review from benwtrent May 7, 2024 17:26

Added entry to CHANGES.txt

0c0c35b

Mark VectorUtil cosine function as deprecated

821f775

benwtrent reviewed May 9, 2024

Move change from 10.0 to 9.11.0

6ca2e43

ChrisHegarty approved these changes May 16, 2024

Pulkitg64 requested a review from benwtrent May 16, 2024 16:14

Pulkitg64 and others added 9 commits May 28, 2024 19:35

Merge branch 'main' into cosine-deprecate

a644bea

hunspell: speed up "compress"; minimize the number of the generated e…

54d3ff6

…ntries; don't even consider "forbidden" entries anymore (apache#13429) Hunspell: speed up "compress"; minimize the number of the generated entries; don't even consider "forbidden" entries anymore

Fixes failing test case for TestOrdinalMap.testRamBytesUsed (apache#1…

ea0646d

…3421)

Allow users to retrieve counts from taxo association facets (apache#1…

b3dc915

…3414) Add a count field to LabelAndValue

Add next minor version 9.12.0

9a3dbd5

Fix test failure on TestPoint#testEqualsAndHashCode (apache#13433)

a6f920d

jpountz added 2 commits June 5, 2024 13:41

bugmakerrrrrr and others added 25 commits June 5, 2024 15:02

Implement Weight#count for vector values in the FieldExistsQuery (apa…

fe50e86

…che#13322) * implement Weight#count for vector values * add change log * apply review comment * apply review comment * changelog * remove null check

Update Gradle wrapper to 8.8 - supports Java 22 (apache#13453)

9a4caa9

This commit updates the Gradle wrapper to 8.8, which has support for Java 22.

Update WrapperDownloader to accept java 22 and correct deprecated new…

868897e

… URL API apache#13458

Add a github workflow that checks common (and less common) gradle tas…

14782a2

…ks when gradle version is changed (apache#13456)

Java 22 has been released, so drop -ea from smoketester gh workflow m…

b85c99d

…atrix.

Add test for ghost fields to BaseKnnVectorQueryTestCase. (apache#13455)

d5aa88b

Removed Scorer#getWeight (apache#13440)

d0d2aa2

If Caller requires Weight then they have to keep track of Weight with which Scorer was created in the first place instead of relying on Scorer. Closes apache#13410

DOAP changes for release 9.11.0

61a6abd

Merge related HashMaps in FieldInfos#FieldNumbers into one map (apach…

58ab5b7

…e#13460) Merges all immutable attributes in FieldInfos.FieldNumbers into one hashmap saving memory when writing big indices. Fixes an exotic bug when calling clear where not all attributes were cleared.

Sync CHANGES for 9.11.0

51d8d72

Add back-compat indices for 9.11.0

39a7eab

Move entry in CHANGES.txt

9f8e886

Document how to make tests run faster in IntelliJ (apache#13466)

a5b4b8c

also make links to CONTRIBUTING.md more prominent, and demote link to dev-docs

Add int8_hnsw backcompat index creawtion to dev tools scripts (apache…

2d62faa

…#13465)

on README.md, make links to CONTRIBUTING.md more prominent, and demot…

262341b

…e link to dev-docs

fix fumble-finger

71a9aed

clarify that IntelliJ UI varies across platforms

0699117

Mark COSINE Vector Similarity Function as Deprecated

00a8704

tidy fixes

be3527a

Move changes from 9.11.0 to 9.12.0

0e91af0

Mark VectorUtil cosine function as deprecated

1163749

Move change from 10.0 to 9.11.0

ec6a037

Merge remote-tracking branch 'origin/cosine-deprecate' into cosine-de…

2a52ebf

…precate

Pulkitg64 closed this Jun 9, 2024
