
Undocumented behaviour / Bug: ElasticSearch Painless - (Vector) Functions in (for) loops #70437

Closed
coreation opened this issue Mar 16, 2021 · 22 comments
Labels
>bug :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache :Search Relevance/Ranking Scoring, rescoring, rank evaluation. Team:Core/Infra Meta label for core/infra team Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@coreation

coreation commented Mar 16, 2021

Elasticsearch version (bin/elasticsearch --version): 7.11

Plugins installed: none outside the default ones coming with 7.11

JVM version (java -version): Java8

OS version (uname -a if on a Unix-like system): OS X

Description of the problem including expected versus actual behavior:

I'm currently experiencing something weird in the following script:

def scores = []; def xt = params.filterVectors; for (int i = 0; i < xt.size(); i++) { scores.add(cosineSimilarity(xt[i], doc['media.full_body_dense_vector'].asList())); } throw new Exception(scores.toString());

The value added to the array is always the cosineSimilarity result from the first iteration. However, if you log xt[i], the loop is clearly iterating over the parameters passed to the script. It seems that when a (vector) function is called inside a for loop, it always returns the same result as in the first iteration. In other words, it doesn't seem possible to dynamically re-use the function inside a for loop. (?)

I have no issue with providing more detail if someone could tell me whether this is a bug or just undocumented behaviour. By undocumented I mean that there is documentation on using for loops and documentation on a variety of vector functions (much appreciated, btw!), but nothing states that you can't use them inside a loop.

Steps to reproduce:

  1. mapping
"mappings": {
  "properties": {
    "title": {
      "type": "text"
    },
    "title_vector": {
      "type": "dense_vector",
      "dims": 512
    },
    "tags": {
      "type": "keyword"
    }
  }
}
  2. script / query
    {
    	"query": {
    		"function_score": {
    			"boost_mode": "replace",
    			"query": {
    				"bool": {
    					"filter": [{
    						"exists": {
    							"field": "title_vector"
    						}
    					}]
    				}
    			},
    			"functions": [{
    				"script_score": {
    					"script": {
    						"source": "def scores =[]; def xt = params.filterVectors; for (int i=0; i< xt.size();i++) { scores.add(cosineSimilarity(xt[i], doc['title_vector']));} throw new Exception(scores.toString());",
    						"params": {
    							"filterVectors": [
    								[1.0, 2.0, 3.0],
    								[0.1, 0.4, 0.5]
    							]
    						},
    						"lang": "painless"
    					}
    				}
    			}]
    		}
    	},
    	"size": 500,
    	"from": 0,
    	"track_scores": true
    }

Provide logs (if relevant):

No logs are produced; the thrown exception prints the same calculation outcome twice.

Link to the relevant Stack Overflow question:

https://stackoverflow.com/questions/66639279/elasticsearch-painless-using-vector-functions-in-for-loops-bug/66662131#66662131

@coreation coreation added >bug needs:triage Requires assignment of a team area label labels Mar 16, 2021
@danielmitterdorfer danielmitterdorfer added :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache and removed needs:triage Requires assignment of a team area label labels Mar 18, 2021
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Mar 18, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@jdconrad
Contributor

@mayya-sharipova Would you please take a look at this? I don't know what the expected behavior for cosineSimilarity is well enough to determine if this is indeed a bug or not.

It looks like all the parameters are intentionally "locked" at the first invocation as they are all part of the constructor of a Painless class binding. I see in the history there was at one point a mutable parameter passed into the method of the class binding that was removed in #48604.

@jdconrad
Contributor

@mayya-sharipova @rjernst looked through the code with me and we both think it should work as-is. I will take some time to try to reproduce and see if we can figure out what isn't getting updated as we think it should be.

@coreation
Author

@jdconrad maybe @mayya-sharipova's reply helps; she provided a reproducible flow in this comment.

@jdconrad
Contributor

I have found the issue here. We have certain methods that we whitelist that fall under the category of class binding. A class binding in Painless is a method that we internally create a new instance of a class for and pass specific arguments that we consider read-only to a constructor for caching and other mutable arguments to a method that is invoked each time. We do this for performance reasons as this set of methods would otherwise be prohibitive to use repetitively if a large number of documents match a search.

cosineSimilarity falls under the class binding category and both arguments are passed to the constructor and cached with some work done on them. When the method is invoked a second time the originally passed in arguments are used as opposed to mutable ones.
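The caching behaviour described above can be illustrated with a minimal Python stand-in (all class and function names here are hypothetical illustrations, not the actual Painless internals, which are Java):

```python
import math

class CosineSimilarityBinding:
    """Mimics a Painless class binding: constructor arguments are
    treated as read-only and are captured at the first invocation."""

    def __init__(self, query_vector):
        # Cached once, at construction: the first query vector seen
        # and its magnitude.
        self.query_vector = query_vector
        self.query_magnitude = math.sqrt(sum(q * q for q in query_vector))

    def similarity(self, doc_vector):
        dot = sum(q * d for q, d in zip(self.query_vector, doc_vector))
        doc_magnitude = math.sqrt(sum(d * d for d in doc_vector))
        return dot / (self.query_magnitude * doc_magnitude)

# The engine instantiates the binding on the first call, so later calls
# in a loop silently reuse the constructor-time arguments:
_binding = None

def cosine_similarity(query_vector, doc_vector):
    global _binding
    if _binding is None:  # first call wins; later arguments are ignored
        _binding = CosineSimilarityBinding(query_vector)
    return _binding.similarity(doc_vector)

doc = [0.2, 0.9, 0.4]
scores = [cosine_similarity(v, doc) for v in [[1.0, 2.0, 3.0], [0.1, 0.4, 0.5]]]
# Both entries are identical, mirroring the behaviour reported above.
```

Even though two different vectors are passed in the loop, both scores come from the first vector, which is exactly the symptom in the original report.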

I see three possible options that I would like to discuss as alternatives to what exists now:

  1. We add/improve documentation to describe the situation here and leave it as-is, but that means that these methods cannot be used in a loop.
  2. We modify the method to process the vectors each time, but this means we lose the original purpose of a class binding to have good performance.
  3. We provide an alternative method that's the "slow" path but does accept new data each time as well as the original "fast" path method.

Given our general goal to make Painless as user-friendly as possible, 3 is probably the best choice, but I'd like to hear thoughts from other members of the team on this. @rjernst @stu-elastic @mayya-sharipova @jtibshirani

@mayya-sharipova mayya-sharipova added the :Search Relevance/Ranking Scoring, rescoring, rank evaluation. label Mar 18, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Mar 18, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@jzzfs

jzzfs commented Mar 19, 2021

@jdconrad my 2 cents if I may: I don't think this is limited to cosineSimilarity and the other vector functions, but applies to essentially all static Painless methods that exhibit this class-binding behavior. I debugged l2norm in this SO issue, for instance:

"script": {
  "source": """
      def field = doc['media.full_body_dense_vector'];
      
      def hardcodedVectors = [ [1,2,3], [0.1,0.4,0.5] ];
      
      def noLoopDistances = [
        l2norm(hardcodedVectors[0], field),
        l2norm(hardcodedVectors[1], field)
      ];
      
      def hardcodedDistances = [];
      for (vector in hardcodedVectors) {
        double euDistance = l2norm(vector, field);
        hardcodedDistances.add(euDistance);
      }
      
      def parametrizedDistances = [];
      for (vector in params.filterVectors) {
        double euDistance = l2norm(vector, field);
        parametrizedDistances.add(euDistance);
      }
      
      def comparisonMap = [
        "no-loop": noLoopDistances,
        "hardcoded": hardcodedDistances,
        "parametrized": parametrizedDistances
      ];
      
      Debug.explain(comparisonMap);
     """,
  "params": {
    "filterVectors": [ [1,2,3], [0.1,0.4,0.5] ]
  },
  "lang": "painless"
}

yielding

{
  "no-loop":[             
    8.558621384311845,    // <-- the only time two different l2norm calls behave correctly
    11.071133967619906
  ],
  "parametrized":[
    8.558621384311845,
    8.558621384311845
  ],
  "hardcoded":[
    8.558621384311845,
    8.558621384311845
  ]
}

I also tried to call l2norm in a separate function def but then the compiler complains about the this context.

When I try to pass the top-level l2norm to a separate function as an argument, the compiler doesn't recognize the l2norm keyword (unless it was of course called with arguments that it expects).

@rjernst
Member

rjernst commented Mar 22, 2021

@jzzfs Thanks for the additional info! You are correct that all of the score-script "static" methods have this problem. Note though that this does not apply to all static methods in Painless; there are many more of those.

@jdconrad @mayya-sharipova @jtibshirani I think the problem here is the argument order. We want field to be passed in as a fixed argument, so that we can cache the doc lookup/doc value accessor inside the ctor of the class binding. But we want the query vectors to be passed in at runtime, since they are unique per document. Class bindings already support this, but fixing it means completely breaking existing uses of these methods. The call sites would need to change to e.g. cosineSimilarity(field, queryVector), so that the ctor of CosineSimilarity can take the field name while the runtime method takes the query vector. The only way to not change the signature would be to come up with some way to map the arguments, which I think would be difficult given how the bindings/definitions are defined now.
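The proposed argument reordering might look roughly like the following Python sketch (DocValues and the call signature are hypothetical stand-ins for the real class-binding machinery):

```python
import math

class DocValues:
    """Hypothetical stand-in for a cached per-field doc-values accessor."""
    def __init__(self, vector):
        self._vector = vector

    def vector(self):
        return self._vector

class CosineSimilarity:
    def __init__(self, field):
        # Fixed argument: only the field accessor is cached in the ctor.
        self.field = field

    def __call__(self, query_vector):
        # Mutable argument: a fresh query vector on every invocation.
        doc_vector = self.field.vector()
        dot = sum(q * d for q, d in zip(query_vector, doc_vector))
        q_mag = math.sqrt(sum(q * q for q in query_vector))
        d_mag = math.sqrt(sum(d * d for d in doc_vector))
        return dot / (q_mag * d_mag)

field = DocValues([0.2, 0.9, 0.4])
cosine = CosineSimilarity(field)  # the ctor caches only the field
scores = [cosine(v) for v in [[1.0, 2.0, 3.0], [0.1, 0.4, 0.5]]]
# The two scores now differ, because the query vector is a runtime argument.
```

With the field as the fixed argument, looping over query vectors would work, at the cost of breaking the existing call-site signature.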

@jdconrad
Contributor

@mayya-sharipova @jtibshirani I was hoping one of you would be able to chime in to give your thoughts on the following:

  1. Are these methods caching as expected? I see a ton of work done on the passed in queryVector as part of the constructor/caching process, so I'm concerned that removing this may give very slow behavior.
  2. Is it better to just document this behavior, or would it be better to have fast/slow versions? I understand the latter may be confusing, though.

@jzzfs The "this" context is indeed a bug. We don't appropriately capture the outer script class with an inner lambda or user-defined function.

@coreation
Author

Thanks a bunch for the ongoing discussion @jdconrad and everyone else involved :) Just out of curiosity, is your proposal to leave it as-is, document this behaviour, and fix the "this" bug, so that a workaround would be to define a custom function within the Painless script, making it possible to use these functions in loops?

@mayya-sharipova
Contributor

mayya-sharipova commented Mar 25, 2021

@rjernst @jdconrad Thank you for your feedback and suggestions.

I see a ton of work done on the passed in queryVector as part of the constructor/caching process, so I'm concerned that removing this may give very slow behavior.

For the cosineSimilarity function (which is the most popular vector function), we indeed would like to pre-calculate the query vector magnitude only once for all documents and cache it.
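The performance win being protected here can be sketched as follows (a hypothetical Python stand-in, not the actual implementation): the query-vector magnitude is hoisted out of the per-document loop.

```python
import math

def score_all_docs(query_vector, doc_vectors):
    # Hoist the query-vector magnitude out of the loop: it is computed
    # once per search rather than once per matching document.
    q_mag = math.sqrt(sum(q * q for q in query_vector))
    scores = []
    for doc_vector in doc_vectors:
        dot = sum(q * d for q, d in zip(query_vector, doc_vector))
        d_mag = math.sqrt(sum(d * d for d in doc_vector))
        scores.append(dot / (q_mag * d_mag))
    return scores
```

If the query vector could change on every call, this hoisting would no longer be valid, which is the tension behind the fast/slow discussion above.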

So, I would suggest I can just document this behaviour without introducing slow functions or doing any other updates.


@coreation It seems that exposing a dense vector iterator would solve your issue; we will try to expose one.

@coreation
Author

Thanks for your reply @mayya-sharipova , does that mean you'd allow the dense vector to be accessible as an iterator, so we could write our own cosineSimilarity by iterating over the values of the (soon to be exposed) dense_vector values?

@jtibshirani
Contributor

jtibshirani commented Mar 29, 2021

So, I would suggest I can just document this behaviour without introducing slow functions or doing any other updates.

Would there be an inexpensive way to catch this case and throw an error? That way we'd avoid silently returning an incorrect result based on cached parameters (which is extra confusing and difficult to debug).

@rjernst
Member

rjernst commented Mar 30, 2021

@jtibshirani Yes, I think we could throw an error. If I understand the intention correctly, since all the parameters are fixed, the only thing that changes per execution is the underlying document. So we can cache the last docid value used inside DenseVectorFunction.getEncodedVector(), and error if the value does not change, i.e. the function was called twice for the same document. Would those semantics match what you were thinking?
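A minimal sketch of that guard, with hypothetical names (the real DenseVectorFunction is Java code inside Elasticsearch):

```python
class DenseVectorFunctionSketch:
    """Hypothetical guard: remember the last doc id seen and raise if
    the function is invoked twice for the same document."""

    def __init__(self, vectors_by_doc):
        self._vectors_by_doc = vectors_by_doc
        self._last_doc_id = -1

    def get_encoded_vector(self, doc_id):
        if doc_id == self._last_doc_id:
            raise ValueError(
                "vector function called twice for the same document; "
                "calling it in a loop is not supported")
        self._last_doc_id = doc_id
        return self._vectors_by_doc[doc_id]
```

Calling the function once per document proceeds normally; a second call for the same doc id, as happens inside a loop, fails loudly instead of silently returning a cached result.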

@jtibshirani
Contributor

I think that would work. It would also disallow cases where you make the same call twice even though the query vector parameter hasn't changed, but that is easy to avoid or work around. I guess this restriction is a bit arbitrary, but if we plan to "leave as is" then a clear error seems preferable to none.

@jdconrad
Contributor

@rjernst While that will work in this case, that doesn't necessarily mean the value isn't changing between iterations as the vectors could be pulled from say a field in each new doc that are different. If you actually wanted to cover all cases you'd have to check the incoming parameters each time the method is called, and I imagine that would be pretty slow.

@rjernst
Member

rjernst commented Mar 30, 2021

While I understand the concern there, I think limiting to one call per document will achieve the desired result in practice.

The expectation is the query params are passed in through script params. While in theory they could be changed across documents, in practice this is not the expectation, and is unsupported.

The other case I could think of would be a user generating a query values array per document, but still only calling once. Could we perhaps use reference checking to ensure the same object is passed in at each call site? It would seem like a good change for class bindings in general. It won't catch all cases (like modifying the underlying map values as described above), but it would catch unsupported cases so the user at least gets an error rather than confusing behavior.
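The reference check proposed here could look something like this Python sketch (names hypothetical; Java would compare object references the same way with `==`):

```python
class ClassBindingSketch:
    """Hypothetical reference check: remember the object passed at the
    first call and raise if a later call passes a different object."""

    def __init__(self):
        self._cached_arg = None

    def call(self, fixed_arg):
        if self._cached_arg is None:
            self._cached_arg = fixed_arg
        elif self._cached_arg is not fixed_arg:
            raise ValueError(
                "class binding invoked with an argument other than the "
                "one cached at the first call")
        return self._cached_arg
```

As noted above, this only checks object identity: mutating the contents of the same list between calls would still slip through, but passing a different object per iteration, the common loop mistake, would now error.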

@jdconrad
Contributor

Reference checking plus doc id seems like a good solution without hurting performance. We could still miss cases, but it would require users to actively change the data in the direct references passed in.

@coreation
Author

coreation commented Apr 15, 2021

@mayya-sharipova Thanks so much for linking the PR! Do I understand correctly then that it is not possible to use the cosineSimilarity function in a loop, but that by providing access to the underlying values of the dense vector, the PR opens up the possibility for clients to script this themselves?

Edit: nvm, the documentation edits in the PR say yes :)

@konstantinmiller

In my use case, I submit several vectors as query parameters and would like to iterate over them inside the Painless script, computing their cosine similarity to the document.

Now, do I understand correctly that the decision is not to support that? That would be very sad!

Is there any documentation on a possible workaround?

@coreation
Author

@konstantinmiller I think the choice is indeed not to support it via the built-in cosineSimilarity function, but if you look at the PR that was merged, the documentation it adds shows how you can work around it, using the cosineSimilarity computation as the example.
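For readers looking for the shape of that workaround: computing cosine similarity by hand over the exposed vector values looks conceptually like the following Python sketch. The real version is a Painless script over the dense_vector doc values; the vectors below are made-up sample data.

```python
import math

def cosine_similarity(query_vector, doc_vector, doc_magnitude):
    # Plain cosine similarity: dot product over the product of magnitudes.
    dot = sum(q * d for q, d in zip(query_vector, doc_vector))
    q_mag = math.sqrt(sum(q * q for q in query_vector))
    return dot / (q_mag * doc_magnitude)

# Made-up sample data standing in for the document's vector field and
# its precomputed magnitude.
doc_vector = [0.2, 0.9, 0.4]
doc_magnitude = math.sqrt(sum(v * v for v in doc_vector))

scores = [cosine_similarity(v, doc_vector, doc_magnitude)
          for v in [[1.0, 2.0, 3.0], [0.1, 0.4, 0.5]]]
# Each loop iteration now produces its own score.
```

Because the computation is an ordinary function rather than a class binding, nothing is cached between iterations, so looping over several query vectors behaves as expected.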

@coreation
Author

Closing this issue as it is resolved by #71313

@javanna javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 12, 2024