Add access to dense_vector values #71313

Merged
merged 13 commits into from
Apr 19, 2021
Changes from 4 commits
13 changes: 8 additions & 5 deletions docs/reference/mapping/types/dense-vector.asciidoc
@@ -10,10 +10,6 @@ A `dense_vector` field stores dense vectors of float values.
The maximum number of dimensions that can be in a vector should
not exceed 2048. A `dense_vector` field is a single-valued field.

These vectors can be used for <<vector-functions,document scoring>>.
For example, a document score can represent a distance between
a given query vector and the indexed document vector.

You index a dense vector as an array of floats.

[source,console]
@@ -47,4 +43,11 @@ PUT my-index-000001/_doc/2

--------------------------------------------------

<1> dims—the number of dimensions in the vector, required parameter.
<1> dims – the number of dimensions in the vector, required parameter.


`dense_vector` fields do not support querying, sorting or aggregating. They can
only be accessed in scripts through the dedicated <<vector-functions,vector functions>>.



57 changes: 57 additions & 0 deletions docs/reference/vectors/vector-functions.asciidoc
@@ -8,6 +8,16 @@ linearly scanned. Thus, expect the query time grow linearly
with the number of matched documents. For this reason, we recommend
to limit the number of matched documents with a `query` parameter.

This is the list of available vector functions and vector access methods:

1. `cosineSimilarity` – calculates cosine similarity
2. `dotProduct` – calculates dot product
3. `l1norm` – calculates L^1^ distance
4. `l2norm` – calculates L^2^ distance
5. `doc[<field>].vectorValue` – returns a vector's value as an array of floats
6. `doc[<field>].magnitude` – returns a vector's magnitude
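
As a rough illustration of what the four functions in the list compute, the sketch below writes them out in plain Java. The class and method names are hypothetical and only mirror the function names above; this is not the Elasticsearch implementation, and it assumes both vectors have the same number of dimensions.

```java
final class VectorFunctionsSketch {
    // Sum of element-wise products.
    static double dotProduct(float[] q, float[] d) {
        double sum = 0;
        for (int i = 0; i < q.length; i++) {
            sum += (double) q[i] * d[i];
        }
        return sum;
    }

    // Euclidean length of a vector.
    static double magnitude(float[] v) {
        return Math.sqrt(dotProduct(v, v));
    }

    // Dot product divided by the product of the two magnitudes.
    static double cosineSimilarity(float[] q, float[] d) {
        return dotProduct(q, d) / (magnitude(q) * magnitude(d));
    }

    // L1 (Manhattan) distance: sum of absolute differences.
    static double l1norm(float[] q, float[] d) {
        double sum = 0;
        for (int i = 0; i < q.length; i++) {
            sum += Math.abs((double) q[i] - d[i]);
        }
        return sum;
    }

    // L2 (Euclidean) distance: square root of the sum of squared differences.
    static double l2norm(float[] q, float[] d) {
        double sum = 0;
        for (int i = 0; i < q.length; i++) {
            double diff = (double) q[i] - d[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        float[] q = {1f, 1f, 1f};
        float[] d = {1f, 1f, 3f};
        System.out.println(cosineSimilarity(q, d));
        System.out.println(l1norm(q, d));
        System.out.println(l2norm(q, d));
    }
}
```

Note that `l1norm` and `l2norm` are distances, so smaller values mean closer vectors, while larger `cosineSimilarity` and `dotProduct` values mean closer vectors.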


Let's create an index with a `dense_vector` mapping and index a couple
of documents into it.

@@ -195,3 +205,50 @@ You can check if a document has a value for the field `my_vector` by
"source": "doc['my_vector'].size() == 0 ? 0 : cosineSimilarity(params.queryVector, 'my_vector')"
--------------------------------------------------
// NOTCONSOLE

The recommended way to access dense vectors is through the `cosineSimilarity`,
`dotProduct`, `l1norm` or `l2norm` functions. But for custom use cases,
you can access dense vectors' values directly through the following functions:

- `doc[<field>].vectorValue` – returns a vector's value as an array of floats

- `doc[<field>].magnitude` – returns a vector's magnitude as a float
(available for vectors created in version 7.5 or later).

For example, the script below implements a cosine similarity using these
two functions:

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": """
          float[] v = doc['my_dense_vector'].vectorValue;
          float vm = doc['my_dense_vector'].magnitude;
          float dotProduct = 0;
          for (int i = 0; i < v.length; i++) {
            dotProduct += v[i] * params.queryVector[i];
          }
          return dotProduct / (vm * (float) params.queryVectorMag);
        """,
        "params": {
          "queryVector": [4, 3.4, -0.2],
          "queryVectorMag": 5.25357
        }
      }
    }
  }
}
--------------------------------------------------
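
The `queryVectorMag` parameter in the request above is not computed by the script; the client is expected to supply it. A minimal sketch of how a client might derive it from `queryVector` follows. The class and method names are hypothetical and not part of any Elasticsearch client API.

```java
final class QueryVectorMagnitudeSketch {
    // Euclidean length: square root of the sum of squared components.
    static double magnitude(double[] vector) {
        double sumOfSquares = 0;
        for (double component : vector) {
            sumOfSquares += component * component;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Same values as "queryVector" in the request above.
        double[] queryVector = {4, 3.4, -0.2};
        System.out.println(magnitude(queryVector));
    }
}
```

Computing the magnitude once on the client avoids recomputing it in the script for every matched document.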
@@ -0,0 +1,142 @@
---
"Access to values of dense_vector in script":
- skip:
features: headers
version: " - 7.12.99"
reason: "Access to values of dense_vector in script was added in 7.13"
- do:
indices.create:
index: test-index
body:
mappings:
properties:
v:
type: dense_vector
dims: 3

- do:
bulk:
index: test-index
refresh: true
body:
- '{"index": {"_id": "1"}}'
- '{"v": [1, 1, 1]}'
- '{"index": {"_id": "2"}}'
- '{"v": [1, 1, 2]}'
- '{"index": {"_id": "3"}}'
- '{"v": [1, 1, 3]}'
- '{"index": {"_id": "missing_vector"}}'
- '{}'

# check getVectorValue() API
- do:
Contributor:

This test coverage looks good! Since unit tests are generally easier to work with than REST tests, I wondered if there was a way to perform some of the same checks as unit tests?

Contributor Author:

@jtibshirani Thanks for the feedback. I could not find any good examples of unit tests with script doc values. The unit tests we have mock script contexts and mock what the script returns, which defeats the purpose of testing what getVectorValue() and getMagnitude() return.

I am happy to redesign the tests as unit tests if you know of any examples I can follow.

Contributor:

It seems like there are some simple cases where we just want to check that getVectorValue and getMagnitude return the right value or error appropriately. These could be covered in a test like DenseVectorScriptDocValuesTests. A similar test would be ScriptDocValuesGeoPointsTests.

search:
body:
query:
script_score:
query: { "exists" : { "field" : "v" } }
script:
source: |
float s = 0;
for (def el : doc['v'].vectorValue) {
s += el;
}
s;

- match: { hits.hits.0._id: "3" }
- match: { hits.hits.0._score: 5 }
- match: { hits.hits.1._id: "2" }
- match: { hits.hits.1._score: 4 }
- match: { hits.hits.2._id: "1" }
- match: { hits.hits.2._score: 3 }


# check getMagnitude() API
- do:
headers:
Content-Type: application/json
search:
body:
query:
script_score:
query: { "exists" : { "field" : "v" } }
script:
source: "doc['v'].magnitude"

- match: { hits.hits.0._id: "3" }
- gte: {hits.hits.0._score: 3.3166}
- lte: {hits.hits.0._score: 3.3167}
- match: { hits.hits.1._id: "2" }
- gte: {hits.hits.1._score: 2.4494}
- lte: {hits.hits.1._score: 2.4495}
- match: { hits.hits.2._id: "1" }
- gte: {hits.hits.2._score: 1.7320}
- lte: {hits.hits.2._score: 1.7321}

# check failed request on missing values
- do:
catch: bad_request
search:
body:
query:
script_score:
query: { match_all: { } }
script:
source: "doc['v'].vectorValue[0]"

- match: { status: 400 }
- match: { error.root_cause.0.type: "script_exception" }

# check failed request on missing values for magnitude
- do:
catch: bad_request
search:
body:
query:
script_score:
query: { match_all: { } }
script:
source: "doc['v'].magnitude"

- match: { status: 400 }
- match: { error.root_cause.0.type: "script_exception" }


# vector functions in loop – return the index of the closest parameter vector based on cosine similarity
- do:
headers:
Content-Type: application/json
search:
body:
query:
script_score:
query: { "exists": { "field": "v" } }
script:
source: |
float[] v = doc['v'].vectorValue;
float vm = doc['v'].magnitude;

int closestPv = 0;
float maxCosSim = -1;
for (int i = 0; i < params.pvs.length; i++) {
float dotProduct = 0;
for (int j = 0; j < v.length; j++) {
dotProduct += v[j] * params.pvs[i][j];
}
float cosSim = dotProduct / (vm * (float) params.pvs_lengths[i]);
if (maxCosSim < cosSim) {
maxCosSim = cosSim;
closestPv = i;
}
}
closestPv;
params:
pvs: [ [ 1, 1, 1 ], [ 1, 1, 2 ], [ 1, 1, 3 ] ]
pvs_lengths: [1.7320, 2.4495, 3.3166]

- match: { hits.hits.0._id: "3" }
- match: { hits.hits.0._score: 2 }
- match: { hits.hits.1._id: "2" }
- match: { hits.hits.1._score: 1 }
- match: { hits.hits.2._id: "1" }
- match: { hits.hits.2._score: 0 }
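
To make the expected scores of the last test case easier to check by hand, here is a sketch of the same closest-vector loop in plain Java. The class and method names are hypothetical; the vectors and the rounded `pvs_lengths` values are copied from the test above.

```java
final class ClosestParamVectorSketch {
    // Euclidean length, standing in for doc['v'].magnitude in the script.
    static float magnitude(float[] v) {
        double sum = 0;
        for (float x : v) {
            sum += (double) x * x;
        }
        return (float) Math.sqrt(sum);
    }

    // Returns the index of the parameter vector with the highest cosine similarity to v.
    static int closest(float[] v, float vm, float[][] pvs, float[] pvLengths) {
        int closestPv = 0;
        float maxCosSim = -1;
        for (int i = 0; i < pvs.length; i++) {
            float dotProduct = 0;
            for (int j = 0; j < v.length; j++) {
                dotProduct += v[j] * pvs[i][j];
            }
            float cosSim = dotProduct / (vm * pvLengths[i]);
            if (maxCosSim < cosSim) {
                maxCosSim = cosSim;
                closestPv = i;
            }
        }
        return closestPv;
    }

    public static void main(String[] args) {
        float[][] pvs = {{1, 1, 1}, {1, 1, 2}, {1, 1, 3}};
        float[] pvLengths = {1.7320f, 2.4495f, 3.3166f};
        for (float[] doc : pvs) {
            System.out.println(closest(doc, magnitude(doc), pvs, pvLengths));
        }
    }
}
```

Each indexed vector equals one of the parameter vectors, so its own index wins and becomes the document's score, which is why the hits are ordered 3, 2, 1 with scores 2, 1, 0.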
@@ -82,7 +82,7 @@ protected List<Parameter<?>> getParameters() {
public DenseVectorFieldMapper build(ContentPath contentPath) {
return new DenseVectorFieldMapper(
name,
new DenseVectorFieldType(buildFullName(contentPath), dims.getValue(), meta.getValue()),
new DenseVectorFieldType(buildFullName(contentPath), indexVersionCreated, dims.getValue(), meta.getValue()),
dims.getValue(),
indexVersionCreated,
multiFieldsBuilder.build(this, contentPath),
@@ -94,10 +94,12 @@ public DenseVectorFieldMapper build(ContentPath contentPath) {

public static final class DenseVectorFieldType extends MappedFieldType {
private final int dims;
private final Version indexVersionCreated;

public DenseVectorFieldType(String name, int dims, Map<String, String> meta) {
public DenseVectorFieldType(String name, Version indexVersionCreated, int dims, Map<String, String> meta) {
super(name, false, false, true, TextSearchInfo.NONE, meta);
this.dims = dims;
this.indexVersionCreated = indexVersionCreated;
}

int dims() {
@@ -124,7 +126,7 @@ protected Object parseSourceValue(Object value) {

@Override
public DocValueFormat docValueFormat(String format, ZoneId timeZone) {
throw new UnsupportedOperationException(
throw new IllegalArgumentException(
"Field [" + name() + "] of type [" + typeName() + "] doesn't support docvalue_fields or aggregations");
}

@@ -135,7 +137,7 @@ public boolean isAggregatable() {

@Override
public IndexFieldData.Builder fielddataBuilder(String fullyQualifiedIndexName, Supplier<SearchLookup> searchLookup) {
return new VectorIndexFieldData.Builder(name(), CoreValuesSourceType.KEYWORD);
return new VectorIndexFieldData.Builder(name(), CoreValuesSourceType.KEYWORD, indexVersionCreated, dims);
}

@Override
@@ -13,6 +13,7 @@

import java.nio.ByteBuffer;


public final class VectorEncoderDecoder {
public static final byte INT_BYTES = 4;

@@ -34,4 +35,32 @@ public static float decodeVectorMagnitude(Version indexVersion, BytesRef vectorB
ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);
return byteBuffer.getFloat(vectorBR.offset + vectorBR.length - 4);
}

public static float getMagnitude(Version indexVersion, BytesRef vectorBR) {
if (vectorBR == null) {
throw new IllegalArgumentException("A document doesn't have a value for a vector field!");
}
if (indexVersion.onOrAfter(Version.V_7_5_0)) {
return decodeVectorMagnitude(indexVersion, vectorBR);
} else {
throw new IllegalArgumentException(
"Vector magnitude is not stored for vectors created before version [" + Version.V_7_5_0.toString() + "].");
}
}

/**
* Decodes a BytesRef into the provided array of floats
* @param vectorBR - dense vector encoded in BytesRef
* @param vector - array of floats where the decoded vector should be stored
*/
public static void decodeDenseVector(BytesRef vectorBR, float[] vector) {
if (vectorBR == null) {
throw new IllegalArgumentException("A document doesn't have a value for a vector field!");
}
ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);
for (int dim = 0; dim < vector.length; dim++) {
vector[dim] = byteBuffer.getFloat();
}
}

}
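
The two decode methods above imply a byte layout in which the vector's floats come first and the magnitude occupies the last four bytes. The sketch below illustrates that layout with plain `ByteBuffer` calls. The `encode` helper and the class name are hypothetical, written only to produce input for the decoders; they are not the encoder used by the mapper.

```java
import java.nio.ByteBuffer;

final class VectorLayoutSketch {
    // Hypothetical encoder: dims floats followed by the magnitude as a trailing float.
    static byte[] encode(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate((vector.length + 1) * Float.BYTES);
        double sumOfSquares = 0;
        for (float value : vector) {
            buffer.putFloat(value);
            sumOfSquares += (double) value * value;
        }
        buffer.putFloat((float) Math.sqrt(sumOfSquares));
        return buffer.array();
    }

    // Reads out.length floats starting at offset, using relative reads as decodeDenseVector does.
    static void decodeVector(byte[] bytes, int offset, int length, float[] out) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes, offset, length);
        for (int dim = 0; dim < out.length; dim++) {
            out[dim] = buffer.getFloat();
        }
    }

    // Reads the trailing float with an absolute index, as decodeVectorMagnitude does.
    static float decodeMagnitude(byte[] bytes, int offset, int length) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes, offset, length);
        return buffer.getFloat(offset + length - Float.BYTES);
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new float[] {1f, 1f, 3f});
        float[] decoded = new float[3];
        decodeVector(encoded, 0, encoded.length, decoded);
        System.out.println(java.util.Arrays.toString(decoded));
        System.out.println(decodeMagnitude(encoded, 0, encoded.length));
    }
}
```

The absolute `getFloat(int)` index is relative to the start of the backing array, not to the wrapped offset, which is why the magnitude read adds `offset` explicitly.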
@@ -10,17 +10,23 @@

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.Version;
import org.elasticsearch.index.fielddata.ScriptDocValues;
import org.elasticsearch.xpack.vectors.mapper.VectorEncoderDecoder;

import java.io.IOException;

public class DenseVectorScriptDocValues extends ScriptDocValues<BytesRef> {

private final BinaryDocValues in;
private final Version indexVersion;
private BytesRef value;
private final float[] vector;

DenseVectorScriptDocValues(BinaryDocValues in) {
DenseVectorScriptDocValues(BinaryDocValues in, Version indexVersion, int dims) {
this.in = in;
this.indexVersion = indexVersion;
this.vector = new float[dims];
}

@Override
@@ -39,7 +45,23 @@ BytesRef getEncodedValue() {

@Override
public BytesRef get(int index) {
throw new UnsupportedOperationException("accessing a vector field's value through 'get' or 'value' is not supported");
throw new UnsupportedOperationException("accessing a vector field's value through 'get' or 'value' is not supported! " +
"Use 'vectorValue' or 'magnitude' instead!");
}

/**
* Get dense vector's value as an array of floats
*/
public float[] getVectorValue() {
VectorEncoderDecoder.decodeDenseVector(value, vector);
return vector;
}

/**
* Get dense vector's magnitude
*/
public float getMagnitude() {
return VectorEncoderDecoder.getMagnitude(indexVersion, value);
}

@Override
Expand Down
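
One property of `getVectorValue()` as written in this diff is that it decodes into the single `vector` array allocated in the constructor and returns that same array each time. The sketch below reproduces only that pattern with a hypothetical stand-in class (no Lucene or Elasticsearch types) to show what a caller would observe if it kept a reference across documents.

```java
final class ReusedBufferSketch {
    private final float[] vector;   // allocated once, like the 'vector' field in the diff
    private float[] current;        // stands in for the current document's encoded value

    ReusedBufferSketch(int dims) {
        this.vector = new float[dims];
    }

    // Stands in for moving to the next document.
    void advance(float[] next) {
        this.current = next;
    }

    // Copies the current value into the shared array and returns that array.
    float[] getVectorValue() {
        System.arraycopy(current, 0, vector, 0, vector.length);
        return vector;
    }

    public static void main(String[] args) {
        ReusedBufferSketch values = new ReusedBufferSketch(3);
        values.advance(new float[] {1f, 1f, 1f});
        float[] first = values.getVectorValue();
        float[] snapshot = first.clone();   // defensive copy survives later calls
        values.advance(new float[] {1f, 1f, 3f});
        float[] second = values.getVectorValue();
        System.out.println(first == second);
        System.out.println(java.util.Arrays.toString(snapshot));
    }
}
```

Reusing the array avoids one allocation per document; a caller that needs to retain values beyond the current call would have to copy them, as the `clone()` call above does.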
@@ -13,6 +13,7 @@
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.Version;
import org.elasticsearch.index.fielddata.LeafFieldData;
import org.elasticsearch.index.fielddata.ScriptDocValues;
import org.elasticsearch.index.fielddata.SortedBinaryDocValues;
@@ -25,10 +26,14 @@ final class VectorDVLeafFieldData implements LeafFieldData {

private final LeafReader reader;
private final String field;
private final Version indexVersion;
private final int dims;

VectorDVLeafFieldData(LeafReader reader, String field) {
VectorDVLeafFieldData(LeafReader reader, String field, Version indexVersion, int dims) {
this.reader = reader;
this.field = field;
this.indexVersion = indexVersion;
this.dims = dims;
}

@Override
@@ -50,7 +55,7 @@ public SortedBinaryDocValues getBytesValues() {
public ScriptDocValues<BytesRef> getScriptValues() {
try {
final BinaryDocValues values = DocValues.getBinary(reader, field);
return new DenseVectorScriptDocValues(values);
return new DenseVectorScriptDocValues(values, indexVersion, dims);
} catch (IOException e) {
throw new IllegalStateException("Cannot load doc values for vector field!", e);
}