Use the Lucene Distance Calculation Function in Script Scoring for doing exact search #1699
Changes from 4 commits
Commits: 1752bfd, d21a24f, f423779, e1ec3b9, 540782c, f872d83, f5b76cf
@@ -10,6 +10,7 @@
 import java.util.Objects;
 import org.apache.logging.log4j.LogManager;
 import org.apache.logging.log4j.Logger;
+import org.apache.lucene.util.VectorUtil;
 import org.opensearch.knn.index.KNNVectorScriptDocValues;
 import org.opensearch.knn.index.SpaceType;
 import org.opensearch.knn.index.VectorDataType;
@@ -48,13 +49,7 @@ private static void requireEqualDimension(final float[] queryVector, final float
      * @return L2 score
      */
     public static float l2Squared(float[] queryVector, float[] inputVector) {
-        requireEqualDimension(queryVector, inputVector);
-        float squaredDistance = 0;
-        for (int i = 0; i < inputVector.length; i++) {
-            float diff = queryVector[i] - inputVector[i];
-            squaredDistance += diff * diff;
-        }
-        return squaredDistance;
+        return VectorUtil.squareDistance(queryVector, inputVector);
     }
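For reviewers who want to sanity-check this swap locally, here is a minimal standalone sketch (not part of the diff; the class name and test values are illustrative) comparing the old hand-rolled loop with Lucene's VectorUtil.squareDistance. It assumes lucene-core is on the classpath.

import org.apache.lucene.util.VectorUtil;

public class L2EquivalenceCheck {

    // Old hand-rolled implementation, reproduced here only for comparison.
    static float l2SquaredLoop(float[] queryVector, float[] inputVector) {
        float squaredDistance = 0;
        for (int i = 0; i < inputVector.length; i++) {
            float diff = queryVector[i] - inputVector[i];
            squaredDistance += diff * diff;
        }
        return squaredDistance;
    }

    public static void main(String[] args) {
        float[] query = { 1.0f, 2.0f, 3.0f, 4.0f };
        float[] doc = { 4.0f, 3.0f, 2.0f, 1.0f };

        // Both should print 20.0 for this input; in general a vectorized
        // implementation may accumulate in a different order, so expect the
        // results to agree only within a small epsilon.
        System.out.println("loop   = " + l2SquaredLoop(query, doc));
        System.out.println("lucene = " + VectorUtil.squareDistance(query, doc));
    }
}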
@@ -101,19 +96,16 @@ public static float l2Squared(List<Number> queryVector, KNNVectorScriptDocValues
      * @return cosine score
      */
     public static float cosinesimilOptimized(float[] queryVector, float[] inputVector, float normQueryVector) {
-        requireEqualDimension(queryVector, inputVector);
-        float dotProduct = 0.0f;
         float normInputVector = 0.0f;
         for (int i = 0; i < queryVector.length; i++) {
-            dotProduct += queryVector[i] * inputVector[i];
             normInputVector += inputVector[i] * inputVector[i];
         }
         float normalizedProduct = normQueryVector * normInputVector;
         if (normalizedProduct == 0) {
             logger.debug("Invalid vectors for cosine. Returning minimum score to put this result to end");
             return 0.0f;
         }
-        return (float) (dotProduct / (Math.sqrt(normalizedProduct)));
+        return (float) (VectorUtil.dotProduct(queryVector, inputVector) / (Math.sqrt(normalizedProduct)));
     }
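To make the shape of this change easier to follow, here is a hedged standalone sketch (illustrative only, not the plugin's actual class) of the optimized cosine path after the change: the query's squared norm is precomputed once per query, only the document vector's squared norm is accumulated per hit, and VectorUtil.dotProduct supplies the numerator. As the review discussion further down notes, this traverses the document vector twice (once for the norm, once inside dotProduct).

import org.apache.lucene.util.VectorUtil;

public class CosineWithPrecomputedNorm {

    /**
     * Cosine similarity where normQueryVector is the query's squared L2 norm,
     * precomputed once per query (mirrors the cosinesimilOptimized signature).
     */
    static float cosine(float[] queryVector, float[] inputVector, float normQueryVector) {
        float normInputVector = 0.0f;
        for (int i = 0; i < inputVector.length; i++) {
            normInputVector += inputVector[i] * inputVector[i];
        }
        float normalizedProduct = normQueryVector * normInputVector;
        if (normalizedProduct == 0) {
            return 0.0f; // degenerate vectors sort to the end
        }
        return (float) (VectorUtil.dotProduct(queryVector, inputVector) / Math.sqrt(normalizedProduct));
    }

    public static void main(String[] args) {
        float[] query = { 1.0f, 0.0f };
        float[] doc = { 1.0f, 1.0f };
        float normQuery = 1.0f; // 1^2 + 0^2, computed once per query
        System.out.println(cosine(query, doc, normQuery)); // ~0.7071
    }
}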
@@ -148,20 +140,28 @@ public static float cosineSimilarity(List<Number> queryVector, KNNVectorScriptDo
      */
     public static float cosinesimil(float[] queryVector, float[] inputVector) {
         requireEqualDimension(queryVector, inputVector);
-        float dotProduct = 0.0f;
-        float normQueryVector = 0.0f;
-        float normInputVector = 0.0f;
-        for (int i = 0; i < queryVector.length; i++) {
-            dotProduct += queryVector[i] * inputVector[i];
-            normQueryVector += queryVector[i] * queryVector[i];
-            normInputVector += inputVector[i] * inputVector[i];
+        int numZeroInInput = 0;
Review thread on the zero-count checks:
- I think these are unnecessary: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/VectorUtil.java#L79. Can we just do this?
- That method would still return true for a zero vector, right?
- I believe the cosine will come out infinite if one of the vectors is a zero vector.
- As long as we validate that it's not a zero vector in the above method, we should be able to remove the other check because of the assert finite.
+        int numZeroInQuery = 0;
+        float cosine = 0.0f;
+        for (int i = 0; i < inputVector.length; i++) {
+            if (inputVector[i] == 0) {
+                numZeroInInput++;
+            }
+            if (queryVector[i] == 0) {
+                numZeroInQuery++;
+            }
+        }
-        float normalizedProduct = normQueryVector * normInputVector;
-        if (normalizedProduct == 0) {
+        if (numZeroInInput == inputVector.length || numZeroInQuery == queryVector.length) {
+            return cosine;
+        }
+        try {
+            cosine = VectorUtil.cosine(queryVector, inputVector);
+        } catch (IllegalArgumentException e) {
Review thread on the try/catch:
- Doesn't Lucene have a cosine function directly available that we can leverage?
- We use the Lucene cosine function on line 159. The rest just returns 0 if either the input or query vectors are all 0's.
             logger.debug("Invalid vectors for cosine. Returning minimum score to put this result to end");
             return 0.0f;
         }
-        return (float) (dotProduct / (Math.sqrt(normalizedProduct)));
+        return cosine;
     }
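A possibly tidier form of the zero-vector guard debated in the thread above, sketched here with hypothetical helper names (not the PR's code): short-circuit as soon as either vector is known to be all zeros instead of counting zero components, then delegate to VectorUtil.cosine. This assumes, per the linked Lucene source, that VectorUtil.cosine validates dimensions (throwing IllegalArgumentException) but does not itself special-case zero vectors, so the guard is what keeps the score finite.

import org.apache.lucene.util.VectorUtil;

public class GuardedCosine {

    /** True if every component of the vector is exactly zero. */
    static boolean isZeroVector(float[] vector) {
        for (float v : vector) {
            if (v != 0.0f) {
                return false;
            }
        }
        return true;
    }

    static float cosineOrMinScore(float[] queryVector, float[] inputVector) {
        // A zero vector has no direction, so cosine is undefined (0/0);
        // return the minimum score instead of propagating NaN.
        if (isZeroVector(queryVector) || isZeroVector(inputVector)) {
            return 0.0f;
        }
        try {
            return VectorUtil.cosine(queryVector, inputVector);
        } catch (IllegalArgumentException e) {
            // e.g. dimension mismatch between query and document vectors
            return 0.0f;
        }
    }

    public static void main(String[] args) {
        System.out.println(cosineOrMinScore(new float[] { 1f, 2f }, new float[] { 2f, 4f })); // ~1.0
        System.out.println(cosineOrMinScore(new float[] { 0f, 0f }, new float[] { 2f, 4f })); // 0.0
    }
}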
@@ -217,7 +217,6 @@ public static float calculateHammingBit(Long queryLong, Long inputLong) {
      * @return L1 score
      */
     public static float l1Norm(float[] queryVector, float[] inputVector) {
-        requireEqualDimension(queryVector, inputVector);
         float distance = 0;
         for (int i = 0; i < inputVector.length; i++) {
             float diff = queryVector[i] - inputVector[i];
@@ -255,7 +254,6 @@ public static float l1Norm(List<Number> queryVector, KNNVectorScriptDocValues do
      * @return L-inf score
      */
     public static float lInfNorm(float[] queryVector, float[] inputVector) {
-        requireEqualDimension(queryVector, inputVector);
         float distance = 0;
         for (int i = 0; i < inputVector.length; i++) {
             float diff = queryVector[i] - inputVector[i];
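As far as I can tell, VectorUtil does not expose L1 or L-infinity helpers, so these two methods keep their hand-rolled loops; only the explicit dimension check is dropped. Since the hunks above show only the first lines of each method, here is a hedged standalone sketch of what the full loops compute (my reconstruction, with an explicit dimension check kept for illustration):

public class ManhattanAndChebyshev {

    /** L1 (Manhattan) distance: sum of absolute component differences. */
    static float l1Norm(float[] queryVector, float[] inputVector) {
        if (queryVector.length != inputVector.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
        float distance = 0;
        for (int i = 0; i < inputVector.length; i++) {
            float diff = queryVector[i] - inputVector[i];
            distance += Math.abs(diff);
        }
        return distance;
    }

    /** L-infinity (Chebyshev) distance: largest absolute component difference. */
    static float lInfNorm(float[] queryVector, float[] inputVector) {
        if (queryVector.length != inputVector.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
        float distance = 0;
        for (int i = 0; i < inputVector.length; i++) {
            float diff = Math.abs(queryVector[i] - inputVector[i]);
            distance = Math.max(distance, diff);
        }
        return distance;
    }

    public static void main(String[] args) {
        float[] q = { 1f, 2f, 3f };
        float[] v = { 2f, 4f, 0f };
        System.out.println(l1Norm(q, v));   // 1 + 2 + 3 = 6.0
        System.out.println(lInfNorm(q, v)); // max(1, 2, 3) = 3.0
    }
}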
@@ -293,12 +291,7 @@ public static float lInfNorm(List<Number> queryVector, KNNVectorScriptDocValues
      * @return dot product score
      */
     public static float innerProduct(float[] queryVector, float[] inputVector) {
-        requireEqualDimension(queryVector, inputVector);
-        float distance = 0;
-        for (int i = 0; i < inputVector.length; i++) {
-            distance += queryVector[i] * inputVector[i];
-        }
-        return distance;
+        return VectorUtil.dotProduct(queryVector, inputVector);
     }
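As a quick, hedged illustration of why dropping requireEqualDimension here is presumably safe: recent VectorUtil methods appear to validate dimensions themselves, so a mismatched pair should fail with an IllegalArgumentException rather than silently reading out of bounds. A small standalone check, assuming that behavior (class name and values are made up):

import org.apache.lucene.util.VectorUtil;

public class DotProductDimensionCheck {

    public static void main(String[] args) {
        float[] query = { 1.0f, 2.0f, 3.0f };
        float[] doc = { 4.0f, 5.0f, 6.0f };

        // 1*4 + 2*5 + 3*6 = 32
        System.out.println("dot = " + VectorUtil.dotProduct(query, doc));

        try {
            // Mismatched dimensions: expected to be rejected by Lucene itself,
            // which is why the explicit requireEqualDimension call was dropped.
            VectorUtil.dotProduct(query, new float[] { 1.0f, 2.0f });
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}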
Review discussion on the cosine functions:
- If we want to use dotProduct / normalize to calculate cosine, this would do one more iteration, since the original loop at L108 already computes the dot product. PS: I checked, Lucene#DefaultVectorUtilSupport#cosine(float, float) does the cosine normalization.
- I used that below; I'll see if I can get it to work with the normVector present in this method.
- I see there is return (float) (sum / Math.sqrt((double) norm1 * (double) norm2)); in Lucene#DefaultVectorUtilSupport#cosine(float, float), so we can use it directly in public static float cosinesimilOptimized without using dotProduct.
- Where are you seeing the DefaultVectorUtilSupport class? I've only been able to find VectorUtil so far, and that class doesn't have a cosine method that takes floats, only float[].
- https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultVectorUtilSupport.java#L65
- Looks like it's also used in KNNScoringSpace as well: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/plugin/script/KNNScoringSpace.java#L106
- SIMD would be better, as per our older experiments with SIMD. Also, given that Lucene lacks that implementation, I'm fine with removing this optimized cosine code for now.
- I can see multiple versions of cosinesimil and cosineSimilarity. Let's just move towards one, where we use Lucene functions to do the distance calculations, and remove all the others. Some use the optimized path and some don't; let's clean things up and move towards a single implementation.
- Sure, I'll incorporate that with this PR then.
- IMHO, those who are very serious about performance will normalize their data during preprocessing and use inner product directly. So I think it's okay to not change the cosine functionality for now and just focus on dot product and L2 for this optimization.
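To make that last point concrete, here is a hedged sketch (illustrative only; class and helper names are made up) of the preprocessing idea: if vectors are L2-normalized at ingest time, the inner product of the normalized vectors equals the cosine similarity of the originals, so the scripted cosine path can be skipped entirely.

import org.apache.lucene.util.VectorUtil;

public class NormalizeThenDotProduct {

    /** Returns a copy of the vector scaled to unit L2 norm (input assumed non-zero). */
    static float[] l2Normalize(float[] vector) {
        double norm = 0.0;
        for (float v : vector) {
            norm += (double) v * v;
        }
        double invNorm = 1.0 / Math.sqrt(norm);
        float[] normalized = new float[vector.length];
        for (int i = 0; i < vector.length; i++) {
            normalized[i] = (float) (vector[i] * invNorm);
        }
        return normalized;
    }

    public static void main(String[] args) {
        float[] query = { 3.0f, 4.0f };
        float[] doc = { 5.0f, 12.0f };

        float cosine = VectorUtil.cosine(query, doc);
        float dotOfNormalized = VectorUtil.dotProduct(l2Normalize(query), l2Normalize(doc));

        // Both should print ~0.9692: (3*5 + 4*12) / (5 * 13)
        System.out.println("cosine            = " + cosine);
        System.out.println("dot of normalized = " + dotOfNormalized);
    }
}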