Use the Lucene Distance Calculation Function in Script Scoring for doing exact search #1699

Merged
merged 7 commits into from
May 24, 2024
Changes from 4 commits
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 
 ## [Unreleased 2.x](https://github.com/opensearch-project/k-NN/compare/2.14...2.x)
 ### Features
+* Use the Lucene Distance Calculation Function in Script Scoring for doing exact search [#1699](https://github.com/opensearch-project/k-NN/pull/1699)
 ### Enhancements
 * Add KnnCircuitBreakerException and modify exception message [#1688](https://github.com/opensearch-project/k-NN/pull/1688)
 ### Bug Fixes
@@ -10,6 +10,7 @@
 import java.util.Objects;
 import org.apache.logging.log4j.LogManager;
 import org.apache.logging.log4j.Logger;
+import org.apache.lucene.util.VectorUtil;
 import org.opensearch.knn.index.KNNVectorScriptDocValues;
 import org.opensearch.knn.index.SpaceType;
 import org.opensearch.knn.index.VectorDataType;
@@ -48,13 +49,7 @@ private static void requireEqualDimension(final float[] queryVector, final float
      * @return L2 score
      */
     public static float l2Squared(float[] queryVector, float[] inputVector) {
-        requireEqualDimension(queryVector, inputVector);
-        float squaredDistance = 0;
-        for (int i = 0; i < inputVector.length; i++) {
-            float diff = queryVector[i] - inputVector[i];
-            squaredDistance += diff * diff;
-        }
-        return squaredDistance;
+        return VectorUtil.squareDistance(queryVector, inputVector);
     }
 
     private static float[] toFloat(List<Number> inputVector, VectorDataType vectorDataType) {
@@ -101,19 +96,16 @@ public static float l2Squared(List<Number> queryVector, KNNVectorScriptDocValues
      * @return cosine score
      */
     public static float cosinesimilOptimized(float[] queryVector, float[] inputVector, float normQueryVector) {
-        requireEqualDimension(queryVector, inputVector);
-        float dotProduct = 0.0f;
         float normInputVector = 0.0f;
         for (int i = 0; i < queryVector.length; i++) {
-            dotProduct += queryVector[i] * inputVector[i];
             normInputVector += inputVector[i] * inputVector[i];
         }
         float normalizedProduct = normQueryVector * normInputVector;
         if (normalizedProduct == 0) {
             logger.debug("Invalid vectors for cosine. Returning minimum score to put this result to end");
             return 0.0f;
         }
-        return (float) (dotProduct / (Math.sqrt(normalizedProduct)));
+        return (float) (VectorUtil.dotProduct(queryVector, inputVector) / (Math.sqrt(normalizedProduct)));
Collaborator

If we compute cosine as dotProduct divided by the norm, this adds one more iteration over the vectors, because the loop at the original L108 was already computing the dot product.

PS: I checked, and Lucene#DefaultVectorUtilSupport#cosine(float, float) does the cosine normalization.

Member Author

I used that below; I'll see if I can get it to work with the normVector present in this method.

Collaborator

I see there is `return (float) (sum / Math.sqrt((double) norm1 * (double) norm2));` in Lucene#DefaultVectorUtilSupport#cosine(float, float), so we can use it directly in `public static float cosinesimilOptimized` without using dotProduct.

Member Author

Where are you seeing the DefaultVectorUtilSupport class? I've only been able to find VectorUtil so far, and that class doesn't have a cosine method that takes floats, only float[].

Collaborator

Member Author

Collaborator

SIMD would be better, as per our older SIMD experiments. Also, given that Lucene lacks that implementation, I am fine with removing this optimized cosine code for now.

Collaborator

I can see multiple versions of cosinesimil and cosineSimilarity. Let's just move towards one where we use Lucene functions to do the distance calculations, and remove all the others.

Some use the optimized version and some don't. Let's just clean things up and move towards one implementation.

Member Author

Sure, I'll incorporate that with this PR then

Member

IMHO, those who are very serious about performance will normalize their data during preprocessing and use inner product directly. So I think it's okay to not change the cosine functionality for now and just focus on dot product and L2 for this optimization.
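
The suggestion above (normalize once during preprocessing, then score with inner product) can be illustrated with a minimal sketch. It is plain Java with no Lucene dependency, and `normalize`, `dot`, and `cosine` are hypothetical helper names chosen for illustration, not functions from the k-NN plugin or Lucene. For unit-length vectors the dot product equals the cosine similarity, so the per-document norm computation drops out of the scoring path.

```java
final class NormalizeThenDot {
    // Hypothetical helper: returns a unit-length copy of v (a zero vector is copied unchanged).
    static float[] normalize(float[] v) {
        double sumSquares = 0.0;
        for (float x : v) {
            sumSquares += (double) x * x;
        }
        float[] out = v.clone();
        if (sumSquares == 0.0) {
            return out;
        }
        float inverseNorm = (float) (1.0 / Math.sqrt(sumSquares));
        for (int i = 0; i < out.length; i++) {
            out[i] *= inverseNorm;
        }
        return out;
    }

    // Plain scalar dot product; stands in for an optimized library call.
    static float dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ: " + a.length + " != " + b.length);
        }
        float sum = 0.0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Reference cosine that computes both norms at query time.
    static float cosine(float[] a, float[] b) {
        double normProduct = (double) dot(a, a) * (double) dot(b, b);
        return (float) (dot(a, b) / Math.sqrt(normProduct));
    }

    public static void main(String[] args) {
        float[] query = {3.0f, 4.0f};
        float[] doc = {4.0f, 3.0f};
        float viaCosine = cosine(query, doc);
        float viaDot = dot(normalize(query), normalize(doc));
        // The two scores agree up to floating-point rounding.
        System.out.println(viaCosine + " vs " + viaDot);
    }
}
```

The trade-off in this sketch is that normalization cost moves to indexing time and the original magnitudes are no longer stored, which is only acceptable when cosine is the intended metric.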

     }
 
     /**
@@ -148,20 +140,28 @@ public static float cosineSimilarity(List<Number> queryVector, KNNVectorScriptDo
      */
     public static float cosinesimil(float[] queryVector, float[] inputVector) {
         requireEqualDimension(queryVector, inputVector);
-        float dotProduct = 0.0f;
-        float normQueryVector = 0.0f;
-        float normInputVector = 0.0f;
-        for (int i = 0; i < queryVector.length; i++) {
-            dotProduct += queryVector[i] * inputVector[i];
-            normQueryVector += queryVector[i] * queryVector[i];
-            normInputVector += inputVector[i] * inputVector[i];
+        int numZeroInInput = 0;
Member

Member Author

That method would still return true for a zero vector, right?

Member

I believe cosine will be infinite if one vector is finite

Member Author

As long as we validate that it's not a zero vector in the above method, we should be able to remove the other check because of the finite assertion.
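
To make the zero-vector question in this thread concrete, here is a minimal sketch of a textbook cosine with no guard. It is plain Java written for illustration, not Lucene's implementation; `naiveCosine` is a hypothetical name. With an all-zero vector the expression reduces to 0/0, which under IEEE 754 arithmetic is NaN, so the result is not finite.

```java
final class ZeroVectorCosine {
    // Textbook cosine with no zero-vector guard (illustration only, not Lucene's code).
    static float naiveCosine(float[] a, float[] b) {
        float sum = 0.0f;
        float normA = 0.0f;
        float normB = 0.0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (sum / Math.sqrt((double) normA * (double) normB));
    }

    public static void main(String[] args) {
        float[] zero = {0.0f, 0.0f, 0.0f};
        float[] other = {1.0f, 2.0f, 3.0f};
        float score = naiveCosine(zero, other);
        System.out.println(Float.isNaN(score));    // prints true
        System.out.println(Float.isFinite(score)); // prints false
    }
}
```

One caveat, assuming the finite check discussed above is a Java `assert` statement: such statements are evaluated only when the JVM runs with assertions enabled (`-ea`), so an explicit zero-vector check behaves the same way regardless of that flag.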

+        int numZeroInQuery = 0;
+        float cosine = 0.0f;
+        for (int i = 0; i < inputVector.length; i++) {
+            if (inputVector[i] == 0) {
+                numZeroInInput++;
+            }
+
+            if (queryVector[i] == 0) {
+                numZeroInQuery++;
+            }
         }
-        float normalizedProduct = normQueryVector * normInputVector;
-        if (normalizedProduct == 0) {
+        if (numZeroInInput == inputVector.length || numZeroInQuery == queryVector.length) {
+            return cosine;
+        }
+        try {
+            cosine = VectorUtil.cosine(queryVector, inputVector);
+        } catch (IllegalArgumentException e) {
Collaborator

Doesn't Lucene have cosine functions directly available that we can leverage?

Member Author

We use the Lucene cosine function on line 159. The rest just returns 0 if either the input or query vector is all zeros.
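
The behavior described in the reply above (return 0 when either vector is all zeros, otherwise compute cosine) can be sketched as follows. This is a variant for illustration, not the PR's code: it swaps the zero counters for a hypothetical early-exit helper `isAllZero`, and it uses a plain scalar loop where the PR calls `VectorUtil.cosine`.

```java
final class GuardedCosine {
    // Hypothetical helper: true when every component is zero; stops at the first nonzero value.
    static boolean isAllZero(float[] v) {
        for (float x : v) {
            if (x != 0.0f) {
                return false;
            }
        }
        return true;
    }

    // Returns 0 for all-zero inputs, otherwise the cosine similarity.
    static float guardedCosine(float[] query, float[] input) {
        if (query.length != input.length) {
            throw new IllegalArgumentException("vector dimensions differ: " + query.length + " != " + input.length);
        }
        if (isAllZero(query) || isAllZero(input)) {
            return 0.0f; // minimum score, so the result sorts to the end
        }
        double dot = 0.0;
        double normQuery = 0.0;
        double normInput = 0.0;
        for (int i = 0; i < query.length; i++) {
            dot += (double) query[i] * input[i];
            normQuery += (double) query[i] * query[i];
            normInput += (double) input[i] * input[i];
        }
        return (float) (dot / Math.sqrt(normQuery * normInput));
    }
}
```

The early exit avoids scanning a whole vector once a nonzero component is found, at the cost of two separate passes in the worst case.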

             logger.debug("Invalid vectors for cosine. Returning minimum score to put this result to end");
             return 0.0f;
         }
-        return (float) (dotProduct / (Math.sqrt(normalizedProduct)));
+        return cosine;
     }
 
     /**
@@ -217,7 +217,6 @@ public static float calculateHammingBit(Long queryLong, Long inputLong) {
      * @return L1 score
      */
     public static float l1Norm(float[] queryVector, float[] inputVector) {
-        requireEqualDimension(queryVector, inputVector);
         float distance = 0;
         for (int i = 0; i < inputVector.length; i++) {
             float diff = queryVector[i] - inputVector[i];
@@ -255,7 +254,6 @@ public static float l1Norm(List<Number> queryVector, KNNVectorScriptDocValues do
      * @return L-inf score
      */
     public static float lInfNorm(float[] queryVector, float[] inputVector) {
-        requireEqualDimension(queryVector, inputVector);
         float distance = 0;
         for (int i = 0; i < inputVector.length; i++) {
             float diff = queryVector[i] - inputVector[i];
@@ -293,12 +291,7 @@ public static float lInfNorm(List<Number> queryVector, KNNVectorScriptDocValues
      * @return dot product score
      */
     public static float innerProduct(float[] queryVector, float[] inputVector) {
-        requireEqualDimension(queryVector, inputVector);
-        float distance = 0;
-        for (int i = 0; i < inputVector.length; i++) {
-            distance += queryVector[i] * inputVector[i];
-        }
-        return distance;
+        return VectorUtil.dotProduct(queryVector, inputVector);
     }
 
     /**
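
In the hunks shown here, `l1Norm` and `lInfNorm` keep their hand-written loops while the `requireEqualDimension` call is dropped. The sketch below shows how such a loop behaves on mismatched lengths without a check. It is plain Java for illustration only: the class and method names are hypothetical, and the accumulation step `distance += Math.abs(diff)` is an assumption, since the diff truncates the loop body.

```java
final class L1NormDimensionCheck {
    // Same loop shape as the l1Norm hunk, with no dimension check.
    // The accumulation line is assumed; the diff does not show it.
    static float l1NormUnchecked(float[] queryVector, float[] inputVector) {
        float distance = 0;
        for (int i = 0; i < inputVector.length; i++) {
            float diff = queryVector[i] - inputVector[i];
            distance += Math.abs(diff);
        }
        return distance;
    }

    // Variant with an explicit check, similar in spirit to requireEqualDimension.
    static float l1NormChecked(float[] queryVector, float[] inputVector) {
        if (queryVector.length != inputVector.length) {
            throw new IllegalArgumentException(
                "vector dimensions differ: " + queryVector.length + " != " + inputVector.length);
        }
        return l1NormUnchecked(queryVector, inputVector);
    }
}
```

Without the check, a longer query vector has its extra components silently ignored, and a shorter one fails with `ArrayIndexOutOfBoundsException` rather than a descriptive error.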