Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose dense vector iterator in painless #51964

Closed
stu-elastic opened this issue Feb 5, 2020 · 13 comments · Fixed by #71313
Closed

Expose dense vector iterator in painless #51964

stu-elastic opened this issue Feb 5, 2020 · 13 comments · Fixed by #71313
Assignees
Labels
:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >enhancement Team:Core/Infra Meta label for core/infra team

Comments

@stu-elastic
Copy link
Contributor

stu-elastic commented Feb 5, 2020

Painless does not have access to dense_vector at the moment except as an opaque value, users can pass it around but not access it's contents.

We can expose the vector as an iterator, which will give access to the data without revealing the internal representation.

cc: @mayya-sharipova @jtibshirani

@stu-elastic stu-elastic added >enhancement :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache labels Feb 5, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Scripting)

@mayya-sharipova
Copy link
Contributor

I would suggest NOT to expose dense vector iterator.
There are two reasons for this:

  1. I have not encountered a need to iterate over values of a dense_vector field, except probably the issue Bug report: Indexing a list of floats in a painless script seems to access the elements in sorted order #49695, which a user solved through a custom plugin. We provide predefined vector functions, and would like users to access vectors only through these functions.
  2. Complex vector representation. Each vector is represented as a binary value that encodes not only an original vector values but also computed vector magnitude. In the future, we may add more data to this encoding. So even if a user gets this binary value, it will be challenging to decode it.

@stu-elastic I am wondering if you have any specific need to expose dense vector iterator?

@stu-elastic
Copy link
Contributor Author

stu-elastic commented Feb 5, 2020

@mayya-sharipova this is based on a request from @honzakral. After chatting we thought an iterator was pretty lightweight.

I'm trying to understand point 2, if we iterated over a vector and provided the values, what's the issue there? We can simply ignore the magnitude at the end.

@mayya-sharipova
Copy link
Contributor

@honzakral I am interested to learn about a use case to access vector values directly.

@stu-elastic

We can simply ignore the magnitude at the end.

For now depending on index version it is just magnitude at the end. But later we may add more metadata.

Were you planning to iterate over Binary DocValues as it would be tricky for a user to decode these docvalues?
Alternatively I can see how we can expose an iterator over float[] -- original vectors' values by first decoding them, something we do in our vector functions

@stu-elastic
Copy link
Contributor Author

Alternatively I can see how we can expose an iterator over float[] -- original vectors' values by first decoding them, something we do in our vector functions

That's what we were thinking, allow users to get their data back.

@honzakral
Copy link
Contributor

The use case we encountered with is the ability to access data in order, just as #49695. The idea was to store historical records (price at a point of time) where we are then interested in deltas between two arbitrary points (price yesterday compared to price a year ago).

Another use case that came up was a user implementing custom vector function in painless.

@mayya-sharipova
Copy link
Contributor

@honzakral thanks, looks to be valid use cases to me.

That's what we were thinking, allow users to get their data back.

@stu-elastic thanks, makes sense. We need to think how to implement it as DenseVectorScriptDocValues needs to know an index version to decode vectors in a right way.

@rjernst rjernst added the Team:Core/Infra Meta label for core/infra team label May 4, 2020
@mayya-sharipova
Copy link
Contributor

mayya-sharipova commented Sep 18, 2020

Relevant request from another user of exposing vector functions in other painless contexts besides ScoreScript.CONTEXT.

@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@jdconrad
Copy link
Contributor

jdconrad commented Dec 9, 2020

@mayya-sharipova Just wanted to check to see if you would still like this to be exposed.

@jdconrad jdconrad removed the needs:triage Requires assignment of a team area label label Dec 9, 2020
@coreation
Copy link

coreation commented Mar 15, 2021

@jdconrad @mayya-sharipova You'll have to excuse me if this isn't on topic, but to my understanding the following example could be another use case to have access to the vector fields. I'm currently experiencing something weird in the following:

def aha =[]; def xt = params.filterVectors; for (int i=0; i< xt.size();i++) { aha.add(cosineSimilarity(xt[i], doc['media.full_body_dense_vector'].asList()));} throw new Exception(aha.toString());

The result that is added to the array, is always the cosineSimilarity outcome of the first iteration, however if you log the xt[i], it's definitely looping through the vectors we pass as parameters. This is my best hunch so far: is this because the script is compiled, where the function is compiled without having the variable values injected? I just can't wrap my head around how I should be using these functions within a for/while loop :/

If I'd have access to the dense vector field, I could write my cosSim. function which might get rid of that problem...but I'm not sure about that one. Any pointers on this would be great!

@mayya-sharipova
Copy link
Contributor

mayya-sharipova commented Mar 17, 2021

@coreation I see the problem. Thank you for reporting it. I can reproduce your problem with this example:

PUT test_index
{
  "mappings": {
    "properties": {
      "v": {
        "type": "dense_vector",
        "dims" : 3
      }
    }
  }
}


POST test_index/_bulk
{ "index" : { "_id" : "1" } }
{"v" : [10, 10, 10]}
{ "index" : { "_id" : "2" } }
{"v" : [10, 20, 30]}

POST test_index/_explain/1
{
  "query": {
    "script_score" : {
      "query" : {"match_all" : {}},
      "script": {
        "source": """
            def aha =[];
            def xt = params.filterVectors; 
            for (int i=0; i< xt.size(); i++) 
            { 
            	aha.add(cosineSimilarity(xt[i], 'v'));
            } 
            Debug.explain(aha); 
        """, 
        "params": {
          "filterVectors": [[10, 10, 10], [10, 20, 30] ] 
        }
      }
    }
  }
}

Debug.explain(aha) returns "[0.999999889879181, 0.999999889879181]" instead of the expected "[0.999999889879181, 0.9258200697395147]

@jdconrad @stu-elastic The problem here is that the result of cosineSimilarity(xt[i], 'v') seems to be calculated only once in the loop and then cached and reused for all other loops iterations. What can be done in this case to ensure we calculate cosineSimilarity(xt[i], 'v') for each iteration?

@stu-elastic
Copy link
Contributor Author

stu-elastic commented Mar 18, 2021

@coreation and @mayya-sharipova let's move this discussion to a new issue.

You'll want to look at the implementation of the CosineSimilarity whitelisted method.

@coreation
Copy link

@stu-elastic @mayya-sharipova thanks for following up! I've already made a separate issue for this and it has been triaged :) The ticket is #70437

@mayya-sharipova mayya-sharipova self-assigned this Apr 5, 2021
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Apr 5, 2021
Allow direct access to a dense_vector' values in script
through the following functions:

- getVectorValue – returns a vector's value as an array of floats
- getVectorMagnitude – returns a vector's magnitude

Closes elastic#51964
mayya-sharipova added a commit that referenced this issue Apr 19, 2021
Allow direct access to a dense_vector' values in script
through the following functions:

- getVectorValue – returns a vector's value as an array of floats
- getMagnitude – returns a vector's magnitude

Closes #51964
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Apr 19, 2021
Allow direct access to a dense_vector' values in script
through the following functions:

- getVectorValue – returns a vector's value as an array of floats
- getMagnitude – returns a vector's magnitude

Closes elastic#51964
Backport for elastic#71313
mayya-sharipova added a commit that referenced this issue Apr 19, 2021
Allow direct access to a dense_vector' values in script
through the following functions:

- getVectorValue – returns a vector's value as an array of floats
- getMagnitude – returns a vector's magnitude

Closes #51964
Backport for #71313
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >enhancement Team:Core/Infra Meta label for core/infra team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants