Add access to dense_vector values #71313

Merged Apr 19, 2021 (13 commits). Showing changes from 3 commits.

13 changes: 8 additions & 5 deletions docs/reference/mapping/types/dense-vector.asciidoc
@@ -10,10 +10,6 @@ A `dense_vector` field stores dense vectors of float values.
The maximum number of dimensions that can be in a vector should
not exceed 2048. A `dense_vector` field is a single-valued field.

These vectors can be used for <<vector-functions,document scoring>>.
For example, a document score can represent a distance between
a given query vector and the indexed document vector.

You index a dense vector as an array of floats.

[source,console]
@@ -47,4 +43,11 @@ PUT my-index-000001/_doc/2

--------------------------------------------------

<1> dims—the number of dimensions in the vector, required parameter.
<1> dims – the number of dimensions in the vector, required parameter.


`dense_vector` fields do not support querying, sorting or aggregating. They can
only be accessed in scripts through the dedicated <<vector-functions,vector functions>>.



57 changes: 57 additions & 0 deletions docs/reference/vectors/vector-functions.asciidoc
@@ -8,6 +8,16 @@ linearly scanned. Thus, expect the query time grow linearly
with the number of matched documents. For this reason, we recommend
to limit the number of matched documents with a `query` parameter.

This is the list of available vector functions:

1. `cosineSimilarity` – calculates cosine similarity
2. `dotProduct` – calculates dot product
3. `l1norm` – calculates L^1^ distance
4. `l2norm` - calculates L^2^ distance
5. `getVectorValue` – returns a vector's value as an array of floats
6. `getVectorMagnitude` – returns a vector's magnitude


Let's create an index with a `dense_vector` mapping and index a couple
of documents into it.

@@ -195,3 +205,50 @@ You can check if a document has a value for the field `my_vector` by
"source": "doc['my_vector'].size() == 0 ? 0 : cosineSimilarity(params.queryVector, 'my_vector')"
--------------------------------------------------
// NOTCONSOLE

The recommended way to access dense vectors is through `cosineSimilarity`,
`dotProduct`, `l1norm` or `l2norm` functions. But for custom use cases,
you can access a dense vector's values directly through the following functions:

- `float[] getVectorValue()` – returns a vector's value as an array of floats

- `float getVectorMagnitude()` – returns a vector's magnitude (available for
vectors created in version 7.5 or later).

Review thread on the lines above:

Comment (Contributor):
Could we implement a slow version of getVectorMagnitude for vectors before 7.5? This seems easy and would make the API simpler. (Then we might also be able to use DenseVectorScriptDocValues#getVectorMagnitude to remove some logic inside cosineSimilarity that requires direct access to the index version!)

Comment (Contributor Author, mayya-sharipova, Apr 8, 2021):
@jtibshirani Thanks for the feedback. I was thinking the same: implementing a slower version of getVectorMagnitude. But this would require decoding the whole vector. If a user is already using vectorValue in their script and decoding a vector, using magnitude would mean decoding this vector a second time. So it would be faster for users to implement the magnitude function themselves, since they would already have the decoded vector available. WDYT?

Comment (Contributor):
To me it's the right trade-off for a simple API. First, it will only be slow for vectors indexed before 7.5, which was before they were even GA. It also seems okay that it's slow, since users can easily work around it. Maybe we could just write a short note in the docs about the pre-7.5 behavior so that users are aware.
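
The trade-off discussed in the review thread above can be illustrated in plain Java. This is only a sketch and not code from this PR: the class `MagnitudeFallbackSketch` and its method are hypothetical, and the vector is assumed to be already decoded into a `float[]`. It shows that once the decoded values are available, the magnitude is a single pass over the array.

```java
final class MagnitudeFallbackSketch {

    // Hypothetical helper: L2 magnitude computed from already-decoded values.
    // The sum is accumulated in double to limit rounding error.
    static float magnitude(float[] vector) {
        double sumOfSquares = 0.0;
        for (float value : vector) {
            sumOfSquares += (double) value * value;
        }
        return (float) Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        float[] decoded = {1f, 1f, 3f}; // illustrative values only
        System.out.println(magnitude(decoded));
    }
}
```

A stored magnitude avoids this pass entirely, which is why the fallback would only be slower for vectors that do not carry one.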

For example, the script below implements a cosine similarity using these
two functions:

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
"query": {
"script_score": {
"query" : {
"bool" : {
"filter" : {
"term" : {
"status" : "published"
}
}
}
},
"script": {
"source": """
float[] v = doc['my_dense_vector'].getVectorValue();
float vm = doc['my_dense_vector'].getVectorMagnitude();
float dotProduct = 0;
for (int i = 0; i < v.length; i++) {
dotProduct += v[i] * params.queryVector[i];
}
return dotProduct / (vm * (float) params.queryVectorMag);
""",
"params": {
"queryVector": [4, 3.4, -0.2],
"queryVectorMag": 5.25357
}
}
}
}
}
--------------------------------------------------
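
The arithmetic in the script above can be checked outside Elasticsearch with a small Java sketch. The class and method names here are hypothetical, and the document vector in `main` is made up for illustration. Only the query vector `[4, 3.4, -0.2]` and its magnitude of roughly `5.25357` come from the request.

```java
final class CosineFromValuesSketch {

    // L2 magnitude of a vector, accumulated in double to limit rounding error.
    static float magnitude(float[] vector) {
        double sumOfSquares = 0.0;
        for (float value : vector) {
            sumOfSquares += (double) value * value;
        }
        return (float) Math.sqrt(sumOfSquares);
    }

    // Same steps as the script: dot product divided by the product of both magnitudes.
    static float cosine(float[] docVector, float docMagnitude,
                        float[] queryVector, float queryMagnitude) {
        float dotProduct = 0f;
        for (int i = 0; i < docVector.length; i++) {
            dotProduct += docVector[i] * queryVector[i];
        }
        return dotProduct / (docMagnitude * queryMagnitude);
    }

    public static void main(String[] args) {
        float[] queryVector = {4f, 3.4f, -0.2f};        // same values as params.queryVector
        float[] docVector = {0.5f, 10f, 6f};            // hypothetical document vector
        float queryMagnitude = magnitude(queryVector);  // close to params.queryVectorMag
        System.out.println(cosine(docVector, magnitude(docVector), queryVector, queryMagnitude));
    }
}
```

Passing the query magnitude as a parameter, as the request does, means it is computed once per request rather than once per scored document.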
@@ -189,3 +189,55 @@ setup:
- match: {hits.hits.0._id: "1"}
- match: {hits.hits.1._id: "2"}
- match: {hits.hits.1._score: 0.0}

---
"No sort, no aggs, no docvalue_fields are allowed":
- do:
index:
refresh: true
index: test-index
id: 1
body:
my_dense_vector: [10, 10, 10]

# sorting on dense_vector field is not supported
- do:
catch: bad_request
search:
index: test-index
body:
query:
match_all: {}
sort:
my_dense_vector

- match: { status: 400 }
- match: { error.root_cause.0.reason: "Field [my_dense_vector] of type [dense_vector] doesn't support sort" }

# aggs on dense_vector field are not supported
- do:
catch: bad_request
search:
index: test-index
body:
aggs:
my_agg:
terms:
field: my_dense_vector

- match: { status: 400 }
- match: { error.root_cause.0.reason: "Field [my_dense_vector] of type [dense_vector] doesn't support docvalue_fields or aggregations" }


# docvalue_fields of dense_vector field are not supported
- do:
catch: bad_request
search:
index: test-index
body:
query:
match_all: {}
docvalue_fields: ["my_dense_vector"]

- match: { status: 400 }
- match: { error.root_cause.0.reason: "Field [my_dense_vector] of type [dense_vector] doesn't support docvalue_fields or aggregations" }
@@ -0,0 +1,142 @@
---
"Access to values of dense_vector in script":
- skip:
features: headers
version: " - 7.12.99"
reason: "Access to values of dense_vector in script was added in 7.13"
- do:
indices.create:
index: test-index
body:
mappings:
properties:
v:
type: dense_vector
dims: 3

- do:
bulk:
index: test-index
refresh: true
body:
- '{"index": {"_id": "1"}}'
- '{"v": [1, 1, 1]}'
- '{"index": {"_id": "2"}}'
- '{"v": [1, 1, 2]}'
- '{"index": {"_id": "3"}}'
- '{"v": [1, 1, 3]}'
- '{"index": {"_id": "missing_vector"}}'
- '{}'

# check getVectorValue() API
- do:

Comment (Contributor):
This test coverage looks good! Since unit tests are generally easier to work with than REST tests, I wondered if there is a way to perform some of the same checks as unit tests.

Comment (Contributor Author):
@jtibshirani Thanks for the feedback. I could not find any good examples of unit tests with script doc values. The unit tests we have mock the script contexts and mock what the script returns, which kind of defeats the purpose of testing what getVectorValue() and getMagnitude() return.

I am happy to redesign the tests as unit tests if you know of any examples I can follow.

Comment (Contributor):
It seems like there are some simple cases where we just want to check getVectorValue and getMagnitude return the right value or error appropriately. These could be covered in a test like DenseVectorScriptDocValuesTests. A similar test would be ScriptDocValuesGeoPointsTests.

search:
body:
query:
script_score:
query: { "exists" : { "field" : "v" } }
script:
source: |
float s = 0;
for (def el : doc['v'].getVectorValue()) {
s += el;
}
s;

- match: { hits.hits.0._id: "3" }
- match: { hits.hits.0._score: 5 }
- match: { hits.hits.1._id: "2" }
- match: { hits.hits.1._score: 4 }
- match: { hits.hits.2._id: "1" }
- match: { hits.hits.2._score: 3 }


# check getVectorMagnitude() API
- do:
headers:
Content-Type: application/json
search:
body:
query:
script_score:
query: { "exists" : { "field" : "v" } }
script:
source: "doc['v'].getVectorMagnitude()"

- match: { hits.hits.0._id: "3" }
- gte: {hits.hits.0._score: 3.3166}
- lte: {hits.hits.0._score: 3.3167}
- match: { hits.hits.1._id: "2" }
- gte: {hits.hits.1._score: 2.4494}
- lte: {hits.hits.1._score: 2.4495}
- match: { hits.hits.2._id: "1" }
- gte: {hits.hits.2._score: 1.7320}
- lte: {hits.hits.2._score: 1.7321}

# check that getVectorValue() fails on a missing value
- do:
catch: bad_request
search:
body:
query:
script_score:
query: { match_all: { } }
script:
source: "doc['v'].getVectorValue()[0]"

- match: { status: 400 }
- match: { error.root_cause.0.type: "script_exception" }

# check that getVectorMagnitude() fails on a missing value
- do:
catch: bad_request
search:
body:
query:
script_score:
query: { match_all: { } }
script:
source: "doc['v'].getVectorMagnitude()"

- match: { status: 400 }
- match: { error.root_cause.0.type: "script_exception" }


# vector functions in loop – return the index of the closest parameter vector based on cosine similarity
- do:
headers:
Content-Type: application/json
search:
body:
query:
script_score:
query: { "exists": { "field": "v" } }
script:
source: |
float[] v = doc['v'].getVectorValue();
float vm = doc['v'].getVectorMagnitude();

int closestPv = 0;
float maxCosSim = -1;
for (int i = 0; i < params.pvs.length; i++) {
float dotProduct = 0;
for (int j = 0; j < v.length; j++) {
dotProduct += v[j] * params.pvs[i][j];
}
float cosSim = dotProduct / (vm * (float) params.pvs_lengths[i]);
if (maxCosSim < cosSim) {
maxCosSim = cosSim;
closestPv = i;
}
}
closestPv;
params:
pvs: [ [ 1, 1, 1 ], [ 1, 1, 2 ], [ 1, 1, 3 ] ]
pvs_lengths: [1.7320, 2.4495, 3.3166]

- match: { hits.hits.0._id: "3" }
- match: { hits.hits.0._score: 2 }
- match: { hits.hits.1._id: "2" }
- match: { hits.hits.1._score: 1 }
- match: { hits.hits.2._id: "1" }
- match: { hits.hits.2._score: 0 }
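
The expected scores in the last test step can be reproduced with a plain Java sketch of the same loop. The class `ClosestVectorSketch` is hypothetical and exists only for illustration. The parameter vectors and their rounded lengths are copied from the test, and the document magnitude is assumed to equal the exact L2 norm of the document vector.

```java
final class ClosestVectorSketch {

    // Returns the index of the parameter vector with the highest cosine similarity to v.
    static int closest(float[] v, float vm, float[][] pvs, float[] pvsLengths) {
        int closestPv = 0;
        float maxCosSim = -1f;
        for (int i = 0; i < pvs.length; i++) {
            float dotProduct = 0f;
            for (int j = 0; j < v.length; j++) {
                dotProduct += v[j] * pvs[i][j];
            }
            float cosSim = dotProduct / (vm * pvsLengths[i]);
            if (maxCosSim < cosSim) {
                maxCosSim = cosSim;
                closestPv = i;
            }
        }
        return closestPv;
    }

    public static void main(String[] args) {
        float[][] pvs = {{1f, 1f, 1f}, {1f, 1f, 2f}, {1f, 1f, 3f}};
        float[] pvsLengths = {1.7320f, 2.4495f, 3.3166f};
        float[] doc3 = {1f, 1f, 3f}; // document "3" from the bulk request
        System.out.println(closest(doc3, (float) Math.sqrt(11.0), pvs, pvsLengths));
    }
}
```

Each indexed vector is identical to one of the parameter vectors, so the closest index is the position of that vector in `pvs`, which matches the scores 2, 1 and 0 asserted for documents "3", "2" and "1".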
@@ -82,7 +82,7 @@ protected List<Parameter<?>> getParameters() {
public DenseVectorFieldMapper build(ContentPath contentPath) {
return new DenseVectorFieldMapper(
name,
new DenseVectorFieldType(buildFullName(contentPath), dims.getValue(), meta.getValue()),
new DenseVectorFieldType(buildFullName(contentPath), indexVersionCreated, dims.getValue(), meta.getValue()),
dims.getValue(),
indexVersionCreated,
multiFieldsBuilder.build(this, contentPath),
@@ -94,10 +94,12 @@ public DenseVectorFieldMapper build(ContentPath contentPath) {

public static final class DenseVectorFieldType extends MappedFieldType {
private final int dims;
private final Version indexVersionCreated;

public DenseVectorFieldType(String name, int dims, Map<String, String> meta) {
public DenseVectorFieldType(String name, Version indexVersionCreated, int dims, Map<String, String> meta) {
super(name, false, false, true, TextSearchInfo.NONE, meta);
this.dims = dims;
this.indexVersionCreated = indexVersionCreated;
}

int dims() {
@@ -124,7 +126,7 @@ protected Object parseSourceValue(Object value) {

@Override
public DocValueFormat docValueFormat(String format, ZoneId timeZone) {
throw new UnsupportedOperationException(
throw new IllegalArgumentException(
"Field [" + name() + "] of type [" + typeName() + "] doesn't support docvalue_fields or aggregations");
}

@@ -135,7 +137,7 @@ public boolean isAggregatable() {

@Override
public IndexFieldData.Builder fielddataBuilder(String fullyQualifiedIndexName, Supplier<SearchLookup> searchLookup) {
return new VectorIndexFieldData.Builder(name(), CoreValuesSourceType.KEYWORD);
return new VectorIndexFieldData.Builder(name(), CoreValuesSourceType.KEYWORD, indexVersionCreated, dims);
}

@Override
@@ -13,6 +13,7 @@

import java.nio.ByteBuffer;


public final class VectorEncoderDecoder {
public static final byte INT_BYTES = 4;

@@ -31,7 +32,39 @@ public static int denseVectorLength(Version indexVersion, BytesRef vectorBR) {
*/
public static float decodeVectorMagnitude(Version indexVersion, BytesRef vectorBR) {
assert indexVersion.onOrAfter(Version.V_7_5_0);
int offset = vectorBR.offset + vectorBR.length - INT_BYTES;
int intValue = ((vectorBR.bytes[offset] & 0xFF) << 24) |
((vectorBR.bytes[offset+1] & 0xFF) << 16) |
((vectorBR.bytes[offset+2] & 0xFF) << 8) |
(vectorBR.bytes[offset+3] & 0xFF);
return Float.intBitsToFloat(intValue);
}

public static float getVectorMagnitude(Version indexVersion, BytesRef vectorBR) {
if (vectorBR == null) {
throw new IllegalArgumentException("A document doesn't have a value for a vector field!");
}
if (indexVersion.onOrAfter(Version.V_7_5_0)) {
return decodeVectorMagnitude(indexVersion, vectorBR);
} else {
throw new IllegalArgumentException(
"Vector magnitude is not stored for vectors created before version [" + indexVersion.toString() + "].");
}
}

/**
* Decodes a BytesRef into the provided array of floats
* @param vectorBR - dense vector encoded in BytesRef
* @param vector - array of floats where the decoded vector should be stored
*/
public static void decodeDenseVector(BytesRef vectorBR, float[] vector) {
if (vectorBR == null) {
throw new IllegalArgumentException("A document doesn't have a value for a vector field!");
}
ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);
return byteBuffer.getFloat(vectorBR.offset + vectorBR.length - 4);
for (int dim = 0; dim < vector.length; dim++) {
vector[dim] = byteBuffer.getFloat();
}
}

}
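
The decoding logic in this file can be exercised in isolation with the sketch below. It is not the Elasticsearch implementation: the class `VectorCodecSketch` and its `encode` method are hypothetical, and the byte layout (big-endian floats followed by a trailing 4-byte magnitude) is an assumption inferred from the decoder above, which reads the magnitude from the last four bytes. Plain `byte[]`, offset and length stand in for Lucene's `BytesRef`.

```java
import java.nio.ByteBuffer;

final class VectorCodecSketch {
    static final int INT_BYTES = 4;

    // Hypothetical encoder: each value as a big-endian float, then the magnitude.
    static byte[] encode(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate((vector.length + 1) * INT_BYTES);
        double sumOfSquares = 0.0;
        for (float value : vector) {
            buffer.putFloat(value); // ByteBuffer is big-endian by default
            sumOfSquares += (double) value * value;
        }
        buffer.putFloat((float) Math.sqrt(sumOfSquares));
        return buffer.array();
    }

    // Reads the trailing magnitude with manual bit shifts, without allocating a ByteBuffer.
    static float decodeMagnitude(byte[] bytes, int offset, int length) {
        int start = offset + length - INT_BYTES;
        int bits = ((bytes[start] & 0xFF) << 24)
            | ((bytes[start + 1] & 0xFF) << 16)
            | ((bytes[start + 2] & 0xFF) << 8)
            | (bytes[start + 3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    // Fills the provided array with the leading float values.
    static void decodeVector(byte[] bytes, int offset, int length, float[] vector) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes, offset, length);
        for (int dim = 0; dim < vector.length; dim++) {
            vector[dim] = buffer.getFloat();
        }
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new float[] {1f, 1f, 3f});
        float[] decoded = new float[3];
        decodeVector(encoded, 0, encoded.length, decoded);
        System.out.println(decoded[2] + " " + decodeMagnitude(encoded, 0, encoded.length));
    }
}
```

Decoding into a caller-supplied array lets the caller reuse one `float[]` across documents instead of allocating a new one per call.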