This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Support k-NN similarity functions in painless scripting #281

Merged

VijayanB
Member

*Issue #213

Implement IndexFieldDataBuilder to allow Painless scripting to identify KNNScriptDocValues
Add a dependency on PainlessExtension so that custom scoring methods can be exposed to users inside Painless scripts.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Painless scripting uses IndexFieldDataBuilder to load field data values. Hence, override
fielddataBuilder to return an instance of KNNVectorIndexFieldData.Builder.
Subsequently, this builder is responsible for returning the ScriptDocValues that
will be passed as an argument to our whitelisted k-NN custom methods.
Added new methods, which will be whitelisted for users to
call in the source field of a Painless script.
Add an extension class that returns the list of whitelisted methods.
Update Gradle to add the dependency on Painless.
@VijayanB
Member Author

The following methods are whitelisted:
l2Squared
cosineSimilarity
cosineSimilarityOptimized
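
As a rough illustration of what a scoring helper such as l2Squared computes, here is a minimal sketch over plain float[] arrays. The class name L2Sketch is hypothetical, and the plugin's whitelisted method may take different argument types, so this is an assumption-laden example rather than the merged code.

```java
// Illustrative sketch only; L2Sketch is a hypothetical name, not the plugin's KNNScoringUtil.
final class L2Sketch {
    // Squared Euclidean (L2) distance: the sum of squared per-dimension differences.
    // Assumes both vectors have the same length.
    static float l2Squared(float[] queryVector, float[] inputVector) {
        float sum = 0.0f;
        for (int i = 0; i < queryVector.length; i++) {
            float diff = queryVector[i] - inputVector[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

Skipping the square root keeps the ranking of documents unchanged, because the square root is monotonic on non-negative values.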

@vamshin @jmazanec15 we can discuss here whether we want to rename the methods for better reach

@codecov

codecov bot commented Dec 12, 2020

Codecov Report

Merging #281 (44b2ace) into master (8c802d9) will increase coverage by 1.11%.
The diff coverage is 91.13%.

@@             Coverage Diff              @@
##             master     #281      +/-   ##
============================================
+ Coverage     79.27%   80.38%   +1.11%     
- Complexity      359      388      +29     
============================================
  Files            58       62       +4     
  Lines          1404     1458      +54     
  Branches        126      127       +1     
============================================
+ Hits           1113     1172      +59     
+ Misses          243      239       -4     
+ Partials         48       47       -1     
Impacted Files                                         Coverage Δ                   Complexity Δ
...sticsearch/knn/index/KNNVectorDVLeafFieldData.java  72.72% <72.72%> (ø)          4.00 <4.00> (?)
...sticsearch/knn/index/KNNVectorScriptDocValues.java  81.81% <81.81%> (ø)          8.00 <8.00> (?)
...relasticsearch/knn/index/KNNVectorFieldMapper.java  79.25% <100.00%> (+0.31%)    16.00 <0.00> (ø)
...asticsearch/knn/index/KNNVectorIndexFieldData.java  100.00% <100.00%> (ø)        7.00 <7.00> (?)
...lasticsearch/knn/plugin/script/KNNScoreScript.java  100.00% <100.00%> (+25.58%)  1.00 <0.00> (ø)
...lasticsearch/knn/plugin/script/KNNScoringUtil.java  98.03% <100.00%> (+1.26%)    18.00 <15.00> (+7.00)
...earch/knn/plugin/script/KNNWhitelistExtension.java  100.00% <100.00%> (ø)        3.00 <3.00> (?)
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

test annotation is forbidden
Member

@jmazanec15 jmazanec15 left a comment

A few comments. Still need to dig into the test cases more. One note, once #222 is resolved, we will need to add a section on this.

@@ -50,7 +90,7 @@ public static float l2Squared(float[] queryVector, float[] inputVector) {
      * @return cosine score
      */
     public static float cosinesimilOptimized(float[] queryVector, float[] inputVector, float normQueryVector) {
-        float dotProduct = 0.0f;
+            float dotProduct = 0.0f;
Member

Why is this indented?

Member Author

Ack. Will reformat the code in the next commit.

* @param normQueryVector normalized query vector value.
* @return cosine score
*/
public static float cosineSimilarityOptimized(
Member

I think this interface is vague: cosineSimilarityOptimized does not give a user any information about how this function differs from cosineSimilarity. If cosineSimilarity can be optimized, why even provide cosineSimilarity?

Cosine similarity is A · B / (||A|| × ||B||). The difference between the two functions is that cosineSimilarityOptimized has the user pass in both the query vector and the magnitude of the query vector. Assuming the query vector is A, ||A|| does not change throughout the query, so time is saved by computing ||A|| once, separately, and then passing it into the function. Also, I think normQueryVector may not be the appropriate term: normalization refers to the process of scaling a vector to a magnitude of one.

My proposal is to switch it to the following interface:

cosineSimilarity(List<Number> queryVector, KNNVectorScriptDocValues docValues, Number queryVectorMagnitude)
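
To make the saving concrete, here is a minimal sketch of the two overloads, assuming plain float[] inputs instead of the List<Number> and KNNVectorScriptDocValues types proposed above. CosineSketch and magnitude are hypothetical names introduced only for illustration.

```java
// Illustrative sketch only; CosineSketch and magnitude are hypothetical names.
final class CosineSketch {
    // ||v||: the Euclidean magnitude of a vector.
    static float magnitude(float[] v) {
        double sumOfSquares = 0.0;
        for (float x : v) {
            sumOfSquares += x * x;
        }
        return (float) Math.sqrt(sumOfSquares);
    }

    // Convenience overload: recomputes ||query|| on every call.
    static float cosineSimilarity(float[] queryVector, float[] docVector) {
        return cosineSimilarity(queryVector, docVector, magnitude(queryVector));
    }

    // Overload taking a precomputed ||query||, which is constant for the whole query.
    // Returns NaN if either vector has zero magnitude.
    static float cosineSimilarity(float[] queryVector, float[] docVector, float queryVectorMagnitude) {
        float dotProduct = 0.0f;
        float docSumOfSquares = 0.0f;
        for (int i = 0; i < queryVector.length; i++) {
            dotProduct += queryVector[i] * docVector[i];
            docSumOfSquares += docVector[i] * docVector[i];
        }
        return (float) (dotProduct / (queryVectorMagnitude * Math.sqrt(docSumOfSquares)));
    }
}
```

With the three-argument form, a caller computes the query magnitude once and reuses it for every document scored, which removes one pass over the query vector per document.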

Member Author

Ack

1. Made KNNScriptDocValues thread safe
2. reformat code
3. rename cosinesimiloptimized.
@VijayanB VijayanB requested a review from jmazanec15 December 15, 2020 02:43
Add painless scripting score to verify whether following methods
are available for users.
1. l2Squared
2. cosineSimilarity(queryVector,doc[field])
3. cosineSimilarity(queryVector,doc[field],normalizedVector)
Member

@jmazanec15 jmazanec15 left a comment

I think test cases look pretty good. I added a few additional comments in implementation.


public synchronized float[] getValue() throws IOException {
if (!docExists) {
throw new IllegalArgumentException("no value found for the corresponding doc ID");
Member

Why throw IllegalArgumentException? It seems like it should be an IllegalStateException.

Member Author

Ack

float[] knnDocVector;
try {
knnDocVector = docValues.getValue();
} catch (Exception e) {
Member

Should we catch the specific exception here?

Member Author

@VijayanB VijayanB Dec 16, 2020

getValue can throw RuntimeException and IOException. I see three options here:

  1. Let docValues throw the exceptions as is, catch each individual exception here, and return Float.Min as we do now.
  2. Update getValue to throw only RuntimeException, and catch only RuntimeException here.
  3. Just update catch (Exception e) to catch (RuntimeException | IOException e).

I don't see any difference among the three, since the outcome is the same in the end, but if you have a strong opinion on one versus another, I can change it. If you see any other option, I can make that change as well.

Member

Makes sense, Option 1 is good with me
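
For readers following the thread, Option 1 might look roughly like the sketch below. ScoreFallbackSketch and VectorSource are hypothetical stand-ins for the plugin's classes, and reading the comment's "Float.Min" as Java's Float.MIN_VALUE is an assumption.

```java
import java.io.IOException;

// Illustrative sketch only; all names here are hypothetical stand-ins.
final class ScoreFallbackSketch {
    // Stand-in for the doc values object whose getValue() may fail.
    interface VectorSource {
        float[] getValue() throws IOException;
    }

    static float cosineOrFallback(float[] queryVector, VectorSource docValues) {
        float[] docVector;
        try {
            docVector = docValues.getValue();
        } catch (IllegalStateException | IOException e) {
            // Assumption: "Float.Min" means Float.MIN_VALUE, which is the
            // smallest positive float, not the most negative one.
            return Float.MIN_VALUE;
        }
        float dot = 0.0f;
        float querySquares = 0.0f;
        float docSquares = 0.0f;
        for (int i = 0; i < queryVector.length; i++) {
            dot += queryVector[i] * docVector[i];
            querySquares += queryVector[i] * queryVector[i];
            docSquares += docVector[i] * docVector[i];
        }
        return (float) (dot / (Math.sqrt(querySquares) * Math.sqrt(docSquares)));
    }
}
```

Catching the specific exception types, rather than Exception, lets unexpected failures such as a NullPointerException surface instead of being silently turned into a fallback score.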

1. updated exception type.
2. Added test
@VijayanB VijayanB requested a review from jmazanec15 December 16, 2020 21:58
Member

@jmazanec15 jmazanec15 left a comment

All looks good to me!


@Override
public void setNextDocId(int docId) throws IOException {
synchronized (this) {
Member

do we need synchronized here?

Member Author

I added thread safety to prevent any race condition based on feedback, since docExists is set in one place and read in another. If there is no risk of a race condition, then we can definitely remove this. In the end, it comes down to whether we need to make this instance thread safe or not. What do you think?

Member

@vamshin vamshin Dec 17, 2020

Good point. Do you have pointers to any other base classes extending ScriptDocValues<> doing synchronization? My understanding is each search request should have its own instance and documents are iterated sequentially so synchronization should not be required. If you see any base class doing this we can definitely consider.

Member Author

Removed synchronization.
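
A minimal sketch of the unsynchronized shape discussed above, assuming (as the thread does) that each search request gets its own instance and visits documents sequentially. VectorDocValuesSketch is a hypothetical stand-in backed by an in-memory array; it does not extend the real ScriptDocValues class.

```java
// Illustrative sketch only; a hypothetical stand-in, not the plugin's KNNVectorScriptDocValues.
final class VectorDocValuesSketch {
    // Index is the doc ID; a null entry means the document has no vector.
    private final float[][] vectorsByDocId;
    private int docId = -1;
    private boolean docExists = false;

    VectorDocValuesSketch(float[][] vectorsByDocId) {
        this.vectorsByDocId = vectorsByDocId;
    }

    // Called once per document by the single thread that owns this instance,
    // so plain field writes are sufficient here.
    void setNextDocId(int docId) {
        this.docId = docId;
        this.docExists = docId >= 0
            && docId < vectorsByDocId.length
            && vectorsByDocId[docId] != null;
    }

    // 1 if the current document has a vector, 0 otherwise.
    int size() {
        return docExists ? 1 : 0;
    }

    float[] getValue() {
        if (!docExists) {
            throw new IllegalStateException("no value found for the corresponding doc ID");
        }
        return vectorsByDocId[docId];
    }
}
```

If an instance were ever shared across threads, this class would not be safe; the sketch relies entirely on the one-instance-per-request assumption.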


@Override
public int size() {
synchronized (this) {
Member

Do we need synchronized here? Avoid synchronization in the places not needed as it could hamper performance.

Member Author

I added it so that access to docExists is synchronized. Since it is just one instruction, I think I can remove this.

Member

Yes please. I have not seen other places doing this. We can remove this.

Member Author

Removed


private Map<String, Float[]> getL2TestData() {
Map<String, Float[]> data = new HashMap<>();
data.put("1", new Float[]{6.0f, 6.0f});
Member

How about we have a document without the vector and then assert that particular doc comes at the end of the result? Same for both CosineTestData as well.

Member Author

Added test data.

1. Require that the dimension be consistent between the query vector and the input vector
2. Remove thread safety from KNNScriptDocValues
3. Throw an exception if the document doesn't contain the vector field.
To avoid the exception, users can use the size method to validate.
Include an error message with additional details on how to fix the
problem.
Add the field name as part of the error message
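
The checks described in these commit messages could be sketched as follows. ValidationSketch, its method names, and the exact message wording are hypothetical; only the behaviour (reject mismatched dimensions, name the field, and point users at the size check) comes from the commit messages.

```java
// Illustrative sketch only; names and message text are hypothetical.
final class ValidationSketch {
    // Rejects a query vector whose dimension differs from the stored vector's.
    static void requireSameDimension(float[] queryVector, float[] inputVector) {
        if (queryVector.length != inputVector.length) {
            throw new IllegalArgumentException(
                "query vector dimension " + queryVector.length
                    + " does not match input vector dimension " + inputVector.length);
        }
    }

    // Returns the stored vector, or fails with a message that names the field
    // and tells the user how to guard against missing values.
    static float[] requireVector(String fieldName, float[] storedVector) {
        if (storedVector == null) {
            throw new IllegalStateException(
                "document has no value for field '" + fieldName
                    + "'; check doc['" + fieldName + "'].size() == 0 before scoring");
        }
        return storedVector;
    }
}
```

Naming the field in the message matters when a script scores against several vector fields, since the stack trace alone would not say which one was missing.
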
Member

@jmazanec15 jmazanec15 left a comment

LGTM!

Use scriptdoc to extract values from the segment.
Use KNNScriptDocValues to retrieve float[] from the
document instead of explicit retrieval.
Member

@vamshin vamshin left a comment

LGTM! Thanks for adding painless support. This will help users to have more customization on the scoring.

@VijayanB VijayanB merged commit c41f375 into opendistro-for-elasticsearch:master Dec 29, 2020
@VijayanB VijayanB added the Features New functionality added label Feb 1, 2021
@VijayanB VijayanB changed the title Whitelist scoring methods Use k-NN similarity functions in painless scripting Feb 1, 2021
@VijayanB VijayanB changed the title Use k-NN similarity functions in painless scripting Support k-NN similarity functions in painless scripting Feb 1, 2021