
Undocumented behaviour / Bug: ElasticSearch Painless - (Vector) Functions in (for) loops #70437

Closed
coreation opened this issue Mar 16, 2021 · 22 comments
Labels
>bug :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache :Search Relevance/Ranking Scoring, rescoring, rank evaluation. Team:Core/Infra Meta label for core/infra team Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@coreation

coreation commented Mar 16, 2021

Elasticsearch version (bin/elasticsearch --version): 7.11

Plugins installed: none outside the default ones coming with 7.11

JVM version (java -version): Java8

OS version (uname -a if on a Unix-like system): OS X

Description of the problem including expected versus actual behavior:

I'm currently experiencing something weird in the following script:

def scores = []; def xt = params.filterVectors; for (int i = 0; i < xt.size(); i++) { scores.add(cosineSimilarity(xt[i], doc['media.full_body_dense_vector'].asList())); } throw new Exception(scores.toString());

The value added to the array is always the cosineSimilarity result from the first iteration. However, if you log xt[i], the loop is clearly iterating over the parameters passed to the script. It seems that when a (vector) function is called inside a for loop, it always returns the same result as in the first iteration. In other words, it doesn't seem possible to dynamically re-use the function inside a for loop. (?)

I have no issue with providing more detail if someone could tell me whether this is a bug or just undocumented behaviour. By undocumented I mean that there is documentation on using for loops and documentation on a variety of vector functions (much appreciated, btw!), but nothing states that you can't use them inside a loop.

Steps to reproduce:

  1. mapping
"mappings": {
  "properties": {
    "title": {
      "type": "text"
    },
    "title_vector": {
      "type": "dense_vector",
      "dims": 512
    },
    "tags": {
      "type": "keyword"
    }
  }
}
  2. script / query
    {
    	"query": {
    		"function_score": {
    			"boost_mode": "replace",
    			"query": {
    				"bool": {
    					"filter": [{
    						"exists": {
    							"field": "title_vector"
    						}
    					}]
    				}
    			},
    			"functions": [{
    				"script_score": {
    					"script": {
    						"source": "def scores =[]; def xt = params.filterVectors; for (int i=0; i< xt.size();i++) { scores.add(cosineSimilarity(xt[i], doc['title_vector']));} throw new Exception(scores.toString());",
    						"params": {
    							"filterVectors": [
    								[1.0, 2.0, 3.0],
    								[0.1, 0.4, 0.5]
    							]
    						},
    						"lang": "painless"
    					}
    				}
    			}]
    		}
    	},
    	"size": 500,
    	"from": 0,
    	"track_scores": true
    }

Provide logs (if relevant):

No logs are produced; the thrown exception prints the same calculation outcome twice.

Link to the relevant Stack Overflow question:

https://stackoverflow.com/questions/66639279/elasticsearch-painless-using-vector-functions-in-for-loops-bug/66662131#66662131

@coreation coreation added >bug needs:triage Requires assignment of a team area label labels Mar 16, 2021
@danielmitterdorfer danielmitterdorfer added :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache and removed needs:triage Requires assignment of a team area label labels Mar 18, 2021
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Mar 18, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@jdconrad
Contributor

@mayya-sharipova Would you please take a look at this? I don't know what the expected behavior for cosineSimilarity is well enough to determine if this is indeed a bug or not.

It looks like all the parameters are intentionally "locked" at the first invocation as they are all part of the constructor of a Painless class binding. I see in the history there was at one point a mutable parameter passed into the method of the class binding that was removed in #48604.

@jdconrad
Contributor

@mayya-sharipova @rjernst looked through the code with me and we both think it should work as-is. I will take some time to try to reproduce and see if we can figure out what isn't getting updated as we think it should be.

@coreation
Author

@jdconrad maybe @mayya-sharipova's reply helps; she provided a reproducible flow in this comment.

@jdconrad
Contributor

I have found the issue here. We have certain methods that we whitelist that fall under the category of class binding. A class binding in Painless is a method that we internally create a new instance of a class for and pass specific arguments that we consider read-only to a constructor for caching and other mutable arguments to a method that is invoked each time. We do this for performance reasons as this set of methods would otherwise be prohibitive to use repetitively if a large number of documents match a search.

cosineSimilarity falls under the class binding category and both arguments are passed to the constructor and cached with some work done on them. When the method is invoked a second time the originally passed in arguments are used as opposed to mutable ones.
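The caching behaviour described above can be illustrated with a minimal Python stand-in (all class and function names here are hypothetical illustrations, not the actual Painless internals, which are Java):

```python
import math

class CosineSimilarityBinding:
    """Mimics a Painless class binding: constructor arguments are
    treated as read-only and are captured at the first invocation."""

    def __init__(self, query_vector):
        # Cached once, at construction: the first query vector seen
        # and its magnitude.
        self.query_vector = query_vector
        self.query_magnitude = math.sqrt(sum(q * q for q in query_vector))

    def similarity(self, doc_vector):
        dot = sum(q * d for q, d in zip(self.query_vector, doc_vector))
        doc_magnitude = math.sqrt(sum(d * d for d in doc_vector))
        return dot / (self.query_magnitude * doc_magnitude)

# The engine instantiates the binding on the first call, so later calls
# in a loop silently reuse the constructor-time arguments:
_binding = None

def cosine_similarity(query_vector, doc_vector):
    global _binding
    if _binding is None:  # first call wins; later arguments are ignored
        _binding = CosineSimilarityBinding(query_vector)
    return _binding.similarity(doc_vector)

doc = [0.2, 0.9, 0.4]
scores = [cosine_similarity(v, doc) for v in [[1.0, 2.0, 3.0], [0.1, 0.4, 0.5]]]
# Both entries are identical, mirroring the behaviour reported above.
```

Even though two different vectors are passed in the loop, both scores come from the first vector, which is exactly the symptom in the original report.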

I see three possible options that I would like to discuss as alternatives to what exists now:

  1. We add/improve documentation to describe the situation here and leave it as-is, but that means that these methods cannot be used in a loop.
  2. We modify the method to process the vectors each time, but this means we lose the original purpose of a class binding to have good performance.
  3. We provide an alternative method that's the "slow" path but does accept new data each time as well as the original "fast" path method.

Given our general goal to make Painless as user-friendly as possible, 3 is probably the best choice, but I'd like to hear thoughts from other members of the team on this. @rjernst @stu-elastic @mayya-sharipova @jtibshirani

@mayya-sharipova mayya-sharipova added the :Search Relevance/Ranking Scoring, rescoring, rank evaluation. label Mar 18, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Mar 18, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@jzzfs

jzzfs commented Mar 19, 2021

@jdconrad my 2 cents if I may: I don't think this is limited to cosineSimilarity and the other vector functions, but applies to essentially all static Painless methods that exhibit this class-binding behavior. I debugged l2norm in this SO issue, for instance:

"script": {
  "source": """
      def field = doc['media.full_body_dense_vector'];
      
      def hardcodedVectors = [ [1,2,3], [0.1,0.4,0.5] ];
      
      def noLoopDistances = [
        l2norm(hardcodedVectors[0], field),
        l2norm(hardcodedVectors[1], field)
      ];
      
      def hardcodedDistances = [];
      for (vector in hardcodedVectors) {
        double euDistance = l2norm(vector, field);
        hardcodedDistances.add(euDistance);
      }
      
      def parametrizedDistances = [];
      for (vector in params.filterVectors) {
        double euDistance = l2norm(vector, field);
        parametrizedDistances.add(euDistance);
      }
      
      def comparisonMap = [
        "no-loop": noLoopDistances,
        "hardcoded": hardcodedDistances,
        "parametrized": parametrizedDistances
      ];
      
      Debug.explain(comparisonMap);
     """,
  "params": {
    "filterVectors": [ [1,2,3], [0.1,0.4,0.5] ]
  },
  "lang": "painless"
}

yielding

{
  "no-loop":[             
    8.558621384311845,    // <-- the only time two different l2norm calls behave correctly
    11.071133967619906
  ],
  "parametrized":[
    8.558621384311845,
    8.558621384311845
  ],
  "hardcoded":[
    8.558621384311845,
    8.558621384311845
  ]
}

I also tried to call l2norm in a separate function def but then the compiler complains about the this context.

When I try to pass the top-level l2norm to a separate function as an argument, the compiler doesn't recognize the l2norm keyword (unless it was of course called with arguments that it expects).

@rjernst
Member

rjernst commented Mar 22, 2021

@jzzfs Thanks for the additional info! You are correct that all of the score-script "static" methods have this problem. Note though that this does not apply to all static methods in Painless; there are many more of those.

@jdconrad @mayya-sharipova @jtibshirani I think the problem here is the argument order. We want field to be passed in as a fixed argument, so that we can cache the doc lookup/doc value accessor inside the ctor of the class binding. But we want the query vectors to be passed in at runtime, since they are unique per document. Class bindings already support this, but fixing it means completely breaking existing uses of these methods. The call sites would need to change to e.g. cosineSimilarity(field, queryVector), so that the ctor of CosineSimilarity can take the field name while the runtime method takes the query vector. The only way to not change the signature would be to come up with some way to map the arguments, which I think would be difficult given how the bindings/definitions are defined now.
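The proposed argument reordering might look roughly like the following Python sketch (DocValues and the call signature are hypothetical stand-ins for the real class-binding machinery):

```python
import math

class DocValues:
    """Hypothetical stand-in for a cached per-field doc-values accessor."""
    def __init__(self, vector):
        self._vector = vector

    def vector(self):
        return self._vector

class CosineSimilarity:
    def __init__(self, field):
        # Fixed argument: only the field accessor is cached in the ctor.
        self.field = field

    def __call__(self, query_vector):
        # Mutable argument: a fresh query vector on every invocation.
        doc_vector = self.field.vector()
        dot = sum(q * d for q, d in zip(query_vector, doc_vector))
        q_mag = math.sqrt(sum(q * q for q in query_vector))
        d_mag = math.sqrt(sum(d * d for d in doc_vector))
        return dot / (q_mag * d_mag)

field = DocValues([0.2, 0.9, 0.4])
cosine = CosineSimilarity(field)  # the ctor caches only the field
scores = [cosine(v) for v in [[1.0, 2.0, 3.0], [0.1, 0.4, 0.5]]]
# The two scores now differ, because the query vector is a runtime argument.
```

With the field as the fixed argument, looping over query vectors would work, at the cost of breaking the existing call-site signature.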

@jdconrad
Contributor

@mayya-sharipova @jtibshirani I was hoping one of you would be able to chime in to give your thoughts on the following:

  1. Are these methods caching as expected? I see a ton of work done on the passed in queryVector as part of the constructor/caching process, so I'm concerned that removing this may give very slow behavior.
  2. Is it better to just document this behavior, or would it be better to have fast/slow versions? I understand the latter may be confusing, though.

@jzzfs The "this" context is indeed a bug. We don't appropriately capture the outer script class with an inner lambda or user-defined function.

@coreation
Author

Thanks a bunch for the ongoing discussion @jdconrad and everyone else involved :) Just out of curiosity, is your proposal to leave it as-is, document this behaviour, and fix the "this" bug, so that a workaround would be to define a custom function within the Painless script, making it possible to use these functions in loops?

@mayya-sharipova
Contributor

mayya-sharipova commented Mar 25, 2021

@rjernst @jdconrad Thank you for your feedback and suggestions.

I see a ton of work done on the passed in queryVector as part of the constructor/caching process, so I'm concerned that removing this may give very slow behavior.

For the cosineSimilarity function (which is the most popular vector function), we indeed would like to pre-calculate the query vector magnitude only once for all documents and cache it.
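The performance win being protected here can be sketched as follows (a hypothetical Python stand-in, not the actual implementation): the query-vector magnitude is hoisted out of the per-document loop.

```python
import math

def score_all_docs(query_vector, doc_vectors):
    # Hoist the query-vector magnitude out of the loop: it is computed
    # once per search rather than once per matching document.
    q_mag = math.sqrt(sum(q * q for q in query_vector))
    scores = []
    for doc_vector in doc_vectors:
        dot = sum(q * d for q, d in zip(query_vector, doc_vector))
        d_mag = math.sqrt(sum(d * d for d in doc_vector))
        scores.append(dot / (q_mag * d_mag))
    return scores
```

If the query vector could change on every call, this hoisting would no longer be valid, which is the tension behind the fast/slow discussion above.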

So, I would suggest I can just document this behaviour without introducing slow functions or doing any other updates.


@coreation It seems that exposing a dense vector iterator would solve your issue; we will try to expose one.

@coreation
Author

Thanks for your reply @mayya-sharipova , does that mean you'd allow the dense vector to be accessible as an iterator, so we could write our own cosineSimilarity by iterating over the values of the (soon to be exposed) dense_vector values?

@jtibshirani
Contributor

jtibshirani commented Mar 29, 2021

So, I would suggest I can just document this behaviour without introducing slow functions or doing any other updates.

Would there be an inexpensive way to catch this case and throw an error? That way we'd avoid silently returning an incorrect result based on cached parameters (which is extra confusing and difficult to debug).

@rjernst
Member

rjernst commented Mar 30, 2021

@jtibshirani Yes, I think we could throw an error. If I understand the intention correctly, since all the parameters are fixed, the only thing that changes per execution is the underlying document. So we can cache the last docid value used inside DenseVectorFunction.getEncodedVector(), and error if the value does not change, i.e. the function was called twice for the same document. Would those semantics match what you were thinking?
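A minimal sketch of that guard, with hypothetical names (the real DenseVectorFunction is Java code inside Elasticsearch):

```python
class DenseVectorFunctionSketch:
    """Hypothetical guard: remember the last doc id seen and raise if
    the function is invoked twice for the same document."""

    def __init__(self, vectors_by_doc):
        self._vectors_by_doc = vectors_by_doc
        self._last_doc_id = -1

    def get_encoded_vector(self, doc_id):
        if doc_id == self._last_doc_id:
            raise ValueError(
                "vector function called twice for the same document; "
                "calling it in a loop is not supported")
        self._last_doc_id = doc_id
        return self._vectors_by_doc[doc_id]
```

Calling the function once per document proceeds normally; a second call for the same doc id, as happens inside a loop, fails loudly instead of silently returning a cached result.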

@jtibshirani
Contributor

I think that would work. It would also disallow cases where you make the same call twice even though the query vector parameter hasn't changed, but that is easy to avoid or work around. I guess this restriction is a bit arbitrary, but if we plan to "leave as is" then a clear error seems preferable to none.

@jdconrad
Contributor

@rjernst While that will work in this case, that doesn't necessarily mean the value isn't changing between iterations as the vectors could be pulled from say a field in each new doc that are different. If you actually wanted to cover all cases you'd have to check the incoming parameters each time the method is called, and I imagine that would be pretty slow.

@rjernst
Member

rjernst commented Mar 30, 2021

While I understand the concern there, I think limiting to one call per document will achieve the desired result in practice.

The expectation is the query params are passed in through script params. While in theory they could be changed across documents, in practice this is not the expectation, and is unsupported.

The other case I could think of would be a user generating a query values array per document, but still only calling once. Could we perhaps use reference checking to ensure the same object is passed in at each call site? It would seem like a good change for class bindings in general. It won't catch all cases (like modifying the underlying map values as described above), but it would catch unsupported cases so the user at least gets an error rather than confusing behavior.
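The reference check proposed here could look something like this Python sketch (names hypothetical; Java would compare object references the same way with `==`):

```python
class ClassBindingSketch:
    """Hypothetical reference check: remember the object passed at the
    first call and raise if a later call passes a different object."""

    def __init__(self):
        self._cached_arg = None

    def call(self, fixed_arg):
        if self._cached_arg is None:
            self._cached_arg = fixed_arg
        elif self._cached_arg is not fixed_arg:
            raise ValueError(
                "class binding invoked with an argument other than the "
                "one cached at the first call")
        return self._cached_arg
```

As noted above, this only checks object identity: mutating the contents of the same list between calls would still slip through, but passing a different object per iteration, the common loop mistake, would now error.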

@jdconrad
Contributor

Reference checking plus doc id seems like a good solution without hurting performance. We could still miss cases, but it would require users to actively change the data in the direct references passed in.

@coreation
Author

coreation commented Apr 15, 2021

@mayya-sharipova Thanks so much for linking the PR! Do I understand correctly then that it is not possible to use the cosineSimilarity function in a loop, but that by providing access to the underlying values of the dense vector, the PR opens up the possibility for clients to script this themselves?

Edit: nvm, the documentation edits in the PR say yes :)

@konstantinmiller

In my use case, I submit several vectors as query parameters and would like to iterate over them inside the Painless script, computing their cosine similarity to the document.

Now, do I understand correctly that the decision is not to support that? That would be very sad!

Is there any documentation on a possible workaround?

@coreation
Author

@konstantinmiller I think the choice is indeed not to support it via the built-in cosineSimilarity function, but if you look at the PR that was merged, the documentation it adds shows how you can work around it, using the cosineSimilarity computation as the example.
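For readers looking for the shape of that workaround: computing cosine similarity by hand over the exposed vector values looks conceptually like the following Python sketch. The real version is a Painless script over the dense_vector doc values; the vectors below are made-up sample data.

```python
import math

def cosine_similarity(query_vector, doc_vector, doc_magnitude):
    # Plain cosine similarity: dot product over the product of magnitudes.
    dot = sum(q * d for q, d in zip(query_vector, doc_vector))
    q_mag = math.sqrt(sum(q * q for q in query_vector))
    return dot / (q_mag * doc_magnitude)

# Made-up sample data standing in for the document's vector field and
# its precomputed magnitude.
doc_vector = [0.2, 0.9, 0.4]
doc_magnitude = math.sqrt(sum(v * v for v in doc_vector))

scores = [cosine_similarity(v, doc_vector, doc_magnitude)
          for v in [[1.0, 2.0, 3.0], [0.1, 0.4, 0.5]]]
# Each loop iteration now produces its own score.
```

Because the computation is an ordinary function rather than a class binding, nothing is cached between iterations, so looping over several query vectors behaves as expected.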

@coreation
Author

Closing this issue as it is resolved by #71313

@javanna javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 12, 2024