Undocumented behaviour / Bug: ElasticSearch Painless - (Vector) Functions in (for) loops #70437
Comments
Pinging @elastic/es-core-infra (Team:Core/Infra)
@mayya-sharipova Would you please take a look at this? I don't know what the expected behavior for It looks like all the parameters are intentionally "locked" at the first invocation as they are all part of the constructor of a Painless class binding. I see in the history there was at one point a mutable parameter passed into the method of the class binding that was removed in #48604. |
@mayya-sharipova @rjernst looked through the code with me and we both think it should work as-is. I will take some time to try to reproduce and see if we can figure out what isn't getting updated as we think it should be. |
@jdconrad maybe @mayya-sharipova's reply helps; she provided a reproducible flow in this comment.
I have found the issue here. We have certain methods that we whitelist that fall under the category of class binding. A class binding in Painless is a method that we internally create a new instance of a class for and pass specific arguments that we consider read-only to a constructor for caching and other mutable arguments to a method that is invoked each time. We do this for performance reasons as this set of methods would otherwise be prohibitive to use repetitively if a large number of documents match a search.
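To make the class-binding behaviour described above concrete, here is a minimal Java sketch. It is not the Elasticsearch implementation: `CosineBinding`, `CallSite`, and `ClassBindingDemo` are hypothetical names, and passing the document vector as a method argument is a simplification. It only illustrates how an argument captured by a constructor on the first invocation stays "locked" when the same call site runs again inside a loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a class binding; not the actual Elasticsearch classes.
final class CosineBinding {
    private final double[] queryVector;  // "read-only" argument, captured once
    private final double queryMagnitude; // cached so it is not recomputed per call

    CosineBinding(double[] queryVector) {
        this.queryVector = queryVector;
        this.queryMagnitude = magnitude(queryVector);
    }

    // Invoked on every call; only the document vector is expected to vary.
    double cosineSimilarity(double[] docVector) {
        double dot = 0.0;
        for (int i = 0; i < docVector.length; i++) {
            dot += queryVector[i] * docVector[i];
        }
        return dot / (queryMagnitude * magnitude(docVector));
    }

    static double magnitude(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}

// Hypothetical call site: the binding instance is created on first use and reused.
final class CallSite {
    private CosineBinding binding;

    double invoke(double[] queryVector, double[] docVector) {
        if (binding == null) {
            binding = new CosineBinding(queryVector);
        }
        // On later invocations the queryVector argument is silently ignored.
        return binding.cosineSimilarity(docVector);
    }
}

final class ClassBindingDemo {
    // Mimics a script that calls the same function inside a for loop.
    static List<Double> scoreAll(double[][] queryVectors, double[] docVector) {
        CallSite site = new CallSite(); // one call site in the loop body
        List<Double> scores = new ArrayList<>();
        for (double[] q : queryVectors) {
            scores.add(site.invoke(q, docVector));
        }
        return scores;
    }
}
```

With query vectors `{1, 0}` and `{0, 1}` and a document vector `{1, 0}`, both entries of the result equal the first iteration's similarity, which mirrors the symptom reported in this issue.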
I see three possible options that I would like to discuss as alternatives to what exists now:
Given our general goal to make Painless as user-friendly as possible, 3 is probably the best choice, but I'd like to hear thoughts from other members of the team on this. @rjernst @stu-elastic @mayya-sharipova @jtibshirani |
Pinging @elastic/es-search (Team:Search)
@jdconrad my 2 cents if I may: I don't think it's only limited to
yielding
I also tried to call When I try to pass the top-level |
@jzzfs Thanks for the additional info! You are correct all of the score script "static" methods have this problem. Note though that it is not all static methods in painless; there are many more than that. @jdconrad @mayya-sharipova @jtibshirani I think the problem here is the argument order. We want |
@mayya-sharipova @jtibshirani I was hoping one of you would be able to chime in to give your thoughts on the following:
@jzzfs The "this" context is indeed a bug. We don't appropriately capture the outer script class with an inner lambda or user-defined function. |
Thanks a bunch for the ongoing discussion @jdconrad and everyone else involved :) Just out of curiosity, your proposal would be to leave it as is, document this behaviour and fix the "this" bug so that a work-around could be to define a custom function within the Painless script so that it could be made possible to use functions in loops? |
@rjernst @jdconrad Thank you for your feedback and suggestions.
For So, I would suggest that I just document this behaviour without introducing slow functions or making any other updates. @coreation It seems that exposing a dense vector iterator will solve your issue; we will try to expose one.
Thanks for your reply @mayya-sharipova , does that mean you'd allow the dense vector to be accessible as an iterator, so we could write our own cosineSimilarity by iterating over the values of the (soon to be exposed) dense_vector values? |
Would there be an inexpensive way to catch this case and throw an error? That way we'd avoid silently returning an incorrect result based on cached parameters (which is extra confusing and difficult to debug). |
@jtibshirani Yes, I think we could throw an error. If I understand the intention correctly, since all the parameters are fixed, the only thing that changes is the underlying document per execution. So we can cache the last docid value used inside DenseVectorFunction.getEncodedVector() and throw an error if the value does not change, that is, if the function was called twice for the same document. Would those semantics match what you were thinking?
I think that would work. It would also disallow cases where you make the same call twice but the query vector parameter hasn't changed, but that is easy to avoid or work around. I guess this restriction is a bit arbitrary, but if we plan to "leave as is" then a clear error seems preferable to none.
@rjernst While that will work in this case, that doesn't necessarily mean the value isn't changing between iterations as the vectors could be pulled from say a field in each new doc that are different. If you actually wanted to cover all cases you'd have to check the incoming parameters each time the method is called, and I imagine that would be pretty slow. |
While I understand the concern there, I think limiting to one call per document will achieve the desired result in practice. The expectation is the query params are passed in through script params. While in theory they could be changed across documents, in practice this is not the expectation, and is unsupported. The other case I could think of would be a user generating a query values array per document, but still only calling once. Could we perhaps use reference checking to ensure the same object is passed in at each call site? It would seem like a good change for class bindings in general. It won't catch all cases (like modifying the underlying map values as described above), but it would catch unsupported cases so the user at least gets an error rather than confusing behavior. |
Reference checking plus id seems like a good solution w/o hurting performance. We could still miss, but it would require users to actively change the data in the direct references passed in. |
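The "reference checking plus id" idea from the last few comments could be sketched roughly as follows. This is an illustration under assumptions, not the code that was merged: `GuardedBinding` and `check` are hypothetical names, and the document id is passed in explicitly here rather than read from the underlying doc values.

```java
// Hypothetical guard for a class binding; names and structure are assumptions.
final class GuardedBinding {
    private final Object cachedArg; // object captured at the first invocation
    private int lastDocId = -1;     // last document this call site ran for

    GuardedBinding(Object firstArg) {
        this.cachedArg = firstArg;
    }

    // Call before using the cached state; throws instead of returning a stale result.
    void check(Object arg, int docId) {
        // Reference check only: cheap, but misses in-place mutation of the same object.
        if (arg != cachedArg) {
            throw new IllegalArgumentException(
                "a different argument object was passed at the same call site");
        }
        // One call per document: a second call for the same doc id is rejected.
        if (docId == lastDocId) {
            throw new IllegalStateException(
                "function was called more than once for document " + docId);
        }
        lastDocId = docId;
    }
}
```

As noted in the discussion, a guard like this still misses users who mutate the contents of the referenced object between calls, but it turns the common loop case into an error instead of a silently repeated result.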
@mayya-sharipova Thanks so much for linking the PR! Do I understand correctly that it is not possible to use the cosineSimilarity function in a loop, but that providing access to the underlying value of the dense vector opens up the possibility for clients to script this themselves? Edit: nvm, the documentation edits in the PR say yes :)
In my use case, I submit several vectors as query parameters and would like to iterate over them inside the Painless script and compute their cosine similarity to the document. Do I understand correctly that the decision is not to support that? That would be very sad! Is there any documentation on a possible workaround?
@konstantinmiller I think the choice is indeed not to support it via the built-in cosineSimilarity function, but if you look at the PR that was merged, you'll see that the documentation explains how to work around it. The example given there even uses the cosineSimilarity computation.
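For readers who want a feel for that workaround, here is a rough sketch in Java (Painless syntax is close to Java). It assumes the document's vector values are available to the script as a float array; how that array is obtained is not shown, and `ManualCosine` and `scoreAll` are hypothetical names rather than anything taken from the merged documentation.

```java
// Hypothetical helper; computes cosine similarity by hand, with no cached state.
final class ManualCosine {
    static double cosineSimilarity(float[] queryVector, float[] docVector) {
        if (queryVector.length != docVector.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
        double dot = 0.0;
        double queryNorm = 0.0;
        double docNorm = 0.0;
        for (int i = 0; i < docVector.length; i++) {
            dot += queryVector[i] * docVector[i];
            queryNorm += queryVector[i] * queryVector[i];
            docNorm += docVector[i] * docVector[i];
        }
        // Zero-magnitude vectors are not handled here and would yield NaN.
        return dot / (Math.sqrt(queryNorm) * Math.sqrt(docNorm));
    }

    // Scores one document vector against several query vectors in a loop.
    static double[] scoreAll(float[][] queryVectors, float[] docVector) {
        double[] scores = new double[queryVectors.length];
        for (int i = 0; i < queryVectors.length; i++) {
            scores[i] = cosineSimilarity(queryVectors[i], docVector);
        }
        return scores;
    }
}
```

Because every call recomputes from its own arguments, each loop iteration gets its own result. The trade-off is the one raised earlier in the thread: nothing is cached, so this is slower than the built-in function when many documents match.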
Closing this issue as it is resolved by #71313
Elasticsearch version (bin/elasticsearch --version): 7.11
Plugins installed: none outside the default ones coming with 7.11
JVM version (java -version): Java8
OS version (uname -a if on a Unix-like system): OS X
Description of the problem including expected versus actual behavior:
I'm currently experiencing something weird in the following script:
The result that is added to the array is always the cosineSimilarity outcome of the first iteration; however, if you log "xt[i]", it does loop through the parameters passed to the script. It seems that if a (vector) function is called in a for loop, it always returns the same outcome as in the first iteration. In other words, it doesn't seem like you can dynamically re-use the function inside a for loop. (?)
I have no issue with providing more detail. Could someone tell me whether this is a bug or just undocumented behaviour? By undocumented I mean that there's documentation on using for loops, and documentation on a variety of vector functions (much appreciated btw!!), but there's nothing that states you can't use them inside a loop.
Steps to reproduce:
Provide logs (if relevant):
No logs are produced, only the exception, which prints the same calculation outcome twice.
Link to relevant StackOverflow ticket
https://stackoverflow.com/questions/66639279/elasticsearch-painless-using-vector-functions-in-for-loops-bug/66662131#66662131