diff --git a/benchmarks/perf-tool/README.md b/benchmarks/perf-tool/README.md
index eb4ac0dc1..d4404f551 100644
--- a/benchmarks/perf-tool/README.md
+++ b/benchmarks/perf-tool/README.md
@@ -229,6 +229,28 @@ Ingests a dataset of vectors into the cluster.
| ----------- | ----------- | ----------- |
| took | Total time to ingest the dataset into the index.| ms |
+#### ingest_multi_field
+
+Ingests a dataset of vectors, together with additional attribute fields, into the cluster.
+
+##### Parameters
+
+| Parameter Name | Description | Default |
+| ----------- |-----------------------------------------------------------------------------------------------------------------------------------------------------------| ----------- |
+| index_name | Name of index to ingest into | No default |
+| field_name | Name of field to ingest into | No default |
+| bulk_size | Documents per bulk request | 300 |
+| dataset_format | Format the dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
+| dataset_path | Path to dataset | No default |
+| doc_count | Number of documents to create from the dataset | Size of the dataset |
+| attributes_dataset_name | Name of the dataset with additional attributes inside the main dataset file | No default |
+| attribute_spec | Definition of the attributes; the format is [{ name: [name_val], type: [type_val]}]. Order is important and must match the order of the attributes columns in the dataset file | No default |
+
+##### Metrics
+
+| Metric Name | Description | Unit |
+| ----------- | ----------- | ----------- |
+| took | Total time to ingest the dataset into the index.| ms |
+
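+For illustration, here is a minimal sketch of how a single document is assembled from a vector and its row of the attributes dataset, mirroring what the `ingest_multi_field` step does for each bulk entry (the field name `target_field` and the sample values are placeholders):
+
+```python
+# attribute_spec as it would appear in a test config
+attribute_spec = [{'name': 'color', 'type': 'str'},
+                  {'name': 'taste', 'type': 'str'},
+                  {'name': 'age', 'type': 'int'}]
+
+# one vector and its attributes row; HDF5 stores the attribute values as byte strings
+vector = [0.12, 0.56, 0.78]
+attributes_row = [b'green', b'bitter', b'28']
+
+doc = {'target_field': vector}
+for spec, raw in zip(attribute_spec, attributes_row):
+    value = raw.decode()
+    if spec['type'] == 'int':
+        doc[spec['name']] = int(value)
+    elif value != 'None':  # 'None' marks a missing attribute and is skipped
+        doc[spec['name']] = value
+
+# doc: {'target_field': [0.12, 0.56, 0.78], 'color': 'green', 'taste': 'bitter', 'age': 28}
+```
+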
#### query
Runs a set of queries against an index.
@@ -257,6 +279,60 @@ Runs a set of queries against an index.
| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |
+#### query_with_filter
+
+Runs a set of filtered queries against an index.
+
+##### Parameters
+
+| Parameter Name | Description | Default |
+| ----------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
+| k | Number of neighbors to return on search | 100 |
+| r | r value in Recall@R | 1 |
+| index_name | Name of index to search | No default |
+| field_name | Name of field to search | No default |
+| calculate_recall | Whether to calculate recall values | False |
+| dataset_format | Format the dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
+| dataset_path | Path to dataset | No default |
+| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
+| neighbors_path | Path to neighbors dataset | No default |
+| neighbors_dataset | Name of filter dataset inside the neighbors dataset file | No default |
+| filter_spec | Path to filter specification | No default |
+| filter_type | Type of filter format. The following types are supported: FILTER (inner filter format for approximate k-NN search), SCRIPT (score script with exact k-NN search and pre-filtering), BOOL_POST_FILTER (bool query with post-filtering) | SCRIPT |
+| score_script_similarity | Similarity function that was used to index the dataset. Used for the SCRIPT filter type and ignored for others | l2 |
+| query_count | Number of queries to create from the dataset | Size of the dataset |
+
+##### Metrics
+
+| Metric Name | Description | Unit |
+| ----------- | ----------- | ----------- |
+| took | Took times returned per query aggregated as total, p50, p90 and p99 (when applicable) | ms |
+| memory_kb | Native memory k-NN is using at the end of the query workload | KB |
+| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
+| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |
+
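+For reference, here is a sketch of the request body the tool builds when filter_type is FILTER; the other two types wrap the same filter JSON into a script_score query or a bool post-filter. `field_name`, `vec`, `k` and `filter_json` stand in for the step parameters and the JSON loaded from `filter_spec`:
+
+```python
+def filter_query_body(field_name, vec, k, filter_json):
+    """Approximate k-NN query with an inner filter (filter_type FILTER)."""
+    return {
+        'size': k,
+        'query': {
+            'knn': {
+                field_name: {
+                    'vector': vec,
+                    'k': k,
+                    'filter': filter_json
+                }
+            }
+        }
+    }
+```
+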
+### Data sets
+
+This benchmark tool uses pre-generated data sets to run indexing and query workloads. For some benchmark types an existing dataset needs to be
+extended; filtering is an example of a use case where such a dataset extension is needed.
+
+A script provided with this repo can generate the dataset needed to run benchmarks for filtering queries.
+You need an existing dataset with vector data. This dataset is used to generate additional attribute data and sets of ground truth neighbor document ids.
+
+To generate a dataset with attributes from a vectors-only dataset, use the following command pattern:
+
+```commandline
+python add-filters-to-dataset.py <in_file_path> <out_file_path> True False
+```
+
+To generate the neighbors datasets for different filters from the dataset with attributes, use the following command pattern:
+
+```commandline
+python add-filters-to-dataset.py <in_file_path> <out_file_path> False True
+```
+
+After that, the new dataset(s) can be referenced from a test case definition in `ingest_multi_field` and `query_with_filter` steps.
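+
+As a quick sanity check, the generated file can be inspected with h5py (a minimal sketch; the file name follows the sample datasets shipped with this change):
+
+```python
+import h5py
+
+# print every dataset in the generated file together with its shape
+with h5py.File('data-with-attr-with-filters.hdf5', 'r') as hf:
+    for name in hf.keys():
+        print(name, hf[name].shape)
+```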
+
## Contributing
### Linting
diff --git a/benchmarks/perf-tool/add-filters-to-dataset.py b/benchmarks/perf-tool/add-filters-to-dataset.py
new file mode 100644
index 000000000..0624f7323
--- /dev/null
+++ b/benchmarks/perf-tool/add-filters-to-dataset.py
@@ -0,0 +1,200 @@
+# SPDX-License-Identifier: Apache-2.0
+#
+# The OpenSearch Contributors require contributions made to
+# this file be licensed under the Apache-2.0 license or a
+# compatible open source license.
+"""
+This script builds a complex dataset with additional attributes from an existing dataset that has only vectors.
+The additional attributes are predefined in the script: color, taste, age. Only the HDF5 format of vector dataset is supported.
+
+The output dataset file will have an additional dataset 'attributes' with multiple columns; each column corresponds to one attribute
+from the attribute set, and values are generated at random, e.g.:
+
+0: green None 71
+1: green bitter 28
+
+There is no explicit index reference in the 'attributes' dataset; the index of a row corresponds to a document id.
+For instance, in the example above the two rows map to documents with ids '0' and '1'.
+
+If the 'generate_filters' flag is set, the script generates an additional dataset of neighbors (ground truth) for each filter type.
+The output is a new file with several datasets, where each dataset corresponds to one filter. Datasets are named 'neighbors_filter_X',
+where X is the 1-based index of the particular filter.
+Each dataset has rows with arrays of integers, where each integer corresponds to
+a document id from the original dataset with additional fields. An array can have -1 values that are treated as null, because the
+subset of filtered documents is the same size as or smaller than the original set.
+
+For example, the dataset file content may look like:
+
+neighbors_filter_1: [[ 2,  5, -1],
+                     [ 3,  1, -1],
+                     [ 2,  5,  7]]
+neighbors_filter_2: [[-1, -1, -1],
+                     [ 5,  6, -1],
+                     [ 4,  2,  1]]
+
+In this case there are datasets for two filters, with 3 query results for each. [2, 5, -1] indicates that for the first query,
+if filter 1 is used, the most similar document has id 2, the next most similar has id 5, and the rest do not pass the filter 1 criteria.
+
+Example of script usage:
+
+ create new hdf5 file with attribute dataset
+ add-filters-to-dataset.py ~/dev/opensearch/k-NN/benchmarks/perf-tool/dataset/data.hdf5 ~/dev/opensearch/datasets/data-with-attr True False
+
+ create new hdf5 file with filter datasets
+ add-filters-to-dataset.py ~/dev/opensearch/k-NN/benchmarks/perf-tool/dataset/data-with-attr.hdf5 ~/dev/opensearch/datasets/data-with-filters False True
+"""
+
+import getopt
+import os
+import random
+import sys
+
+import h5py
+
+from osb.extensions.data_set import HDF5DataSet
+
+
+class _Dataset:
+ """Type of dataset container for data with additional attributes"""
+ DEFAULT_TYPE = HDF5DataSet.FORMAT_NAME
+
+ def create_dataset(self, source_dataset_path, out_file_path, generate_attrs: bool, generate_filters: bool) -> None:
+ path_elements = os.path.split(os.path.abspath(source_dataset_path))
+ data_set_dir = path_elements[0]
+
+        # Read the existing dataset; for HDF5, multiple datasets can be
+        # grouped in the same file.
+        data_hdf5 = os.path.abspath(source_dataset_path)
+
+ with h5py.File(data_hdf5, "r") as hf:
+
+ if generate_attrs:
+ data_set_w_attr = self.create_dataset_file(out_file_path, self.DEFAULT_TYPE, data_set_dir)
+
+ possible_colors = ['red', 'green', 'yellow', 'blue', None]
+ possible_tastes = ['sweet', 'salty', 'sour', 'bitter', None]
+ max_age = 100
+
+ for key in hf.keys():
+ if key not in ['neighbors', 'test', 'train']:
+ continue
+ data_set_w_attr.create_dataset(key, data=hf[key][()])
+
+ attributes = []
+ for i in range(len(hf['train'])):
+                    attr = [random.choice(possible_colors), random.choice(possible_tastes),
+                            random.randint(0, max_age)]  # randint is inclusive on both ends
+ attributes.append(attr)
+
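+                # values are stored as fixed-length byte strings ('S10'); ints and None are serialized and decoded on read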
+ data_set_w_attr.create_dataset('attributes', (len(attributes), 3), 'S10', data=attributes)
+
+ data_set_w_attr.flush()
+ data_set_w_attr.close()
+
+ if generate_filters:
+ attributes = hf['attributes'][()]
+ expected_neighbors = hf['neighbors'][()]
+
+ data_set_filters = self.create_dataset_file(out_file_path, self.DEFAULT_TYPE, data_set_dir)
+
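+                # filter 1 - color = red and age >= 20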
+ def filter1(attributes, vector_idx):
+ if attributes[vector_idx][0].decode() == 'red' and int(attributes[vector_idx][2].decode()) >= 20:
+ return True
+ else:
+ return False
+
+ self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_1', filter1)
+
+ # filter 2 - color = blue or None and taste = 'salty'
+ def filter2(attributes, vector_idx):
+ if (attributes[vector_idx][0].decode() == 'blue' or attributes[vector_idx][
+ 0].decode() == 'None') and attributes[vector_idx][1].decode() == 'salty':
+ return True
+ else:
+ return False
+
+ self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_2', filter2)
+
+ # filter 3 - color and taste are not None and age is between 20 and 80
+ def filter3(attributes, vector_idx):
+ if attributes[vector_idx][0].decode() != 'None' and attributes[vector_idx][
+ 1].decode() != 'None' and 20 <= \
+ int(attributes[vector_idx][2].decode()) <= 80:
+ return True
+ else:
+ return False
+
+ self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_3', filter3)
+
+ # filter 4 - color green or blue and taste is bitter and age is between (30, 60)
+ def filter4(attributes, vector_idx):
+ if (attributes[vector_idx][0].decode() == 'green' or attributes[vector_idx][0].decode() == 'blue') \
+ and (attributes[vector_idx][1].decode() == 'bitter') \
+ and 30 <= int(attributes[vector_idx][2].decode()) <= 60:
+ return True
+ else:
+ return False
+
+ self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_4', filter4)
+
+ # filter 5 color is (green or blue or yellow) or taste = sweet or age is between (30, 70)
+ def filter5(attributes, vector_idx):
+ if attributes[vector_idx][0].decode() == 'green' or attributes[vector_idx][0].decode() == 'blue' \
+ or attributes[vector_idx][0].decode() == 'yellow' \
+ or attributes[vector_idx][1].decode() == 'sweet' \
+ or 30 <= int(attributes[vector_idx][2].decode()) <= 70:
+ return True
+ else:
+ return False
+
+ self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_5', filter5)
+
+ data_set_filters.flush()
+ data_set_filters.close()
+
+ def apply_filter(self, expected_neighbors, attributes, data_set_w_filtering, filter_name, filter_func):
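+        # Keep, per ground-truth row, only the neighbor ids that pass the filter;
+        # unfilled positions stay -1 and are treated as null by the recall calculation.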
+ neighbors_filter = []
+ filtered_count = 0
+ for expected_neighbors_row in expected_neighbors:
+ neighbors_filter_row = [-1] * len(expected_neighbors_row)
+ idx = 0
+ for vector_idx in expected_neighbors_row:
+ if filter_func(attributes, vector_idx):
+ neighbors_filter_row[idx] = vector_idx
+ idx += 1
+ filtered_count += 1
+ neighbors_filter.append(neighbors_filter_row)
+ overall_count = len(expected_neighbors) * len(expected_neighbors[0])
+ perc = float(filtered_count / overall_count) * 100
+ print('ground truth size for {} is {}, percentage {}'.format(filter_name, filtered_count, perc))
+ data_set_w_filtering.create_dataset(filter_name, data=neighbors_filter)
+ return expected_neighbors
+
+ def create_dataset_file(self, file_name, extension, data_set_dir) -> h5py.File:
+ data_set_file_name = "{}.{}".format(file_name, extension)
+ data_set_path = os.path.join(data_set_dir, data_set_file_name)
+
+ data_set_w_filtering = h5py.File(data_set_path, 'a')
+
+ return data_set_w_filtering
+
+
+def main(argv):
+ opts, args = getopt.getopt(argv, "")
+ in_file_path = args[0]
+ out_file_path = args[1]
+ generate_attr = str2bool(args[2])
+ generate_filters = str2bool(args[3])
+
+ worker = _Dataset()
+ worker.create_dataset(in_file_path, out_file_path, generate_attr, generate_filters)
+
+
+def str2bool(v):
+ return v.lower() in ("yes", "true", "t", "1")
+
+
+if __name__ == "__main__":
+ main(sys.argv[1:])
diff --git a/benchmarks/perf-tool/dataset/data-with-attr-with-filters.hdf5 b/benchmarks/perf-tool/dataset/data-with-attr-with-filters.hdf5
new file mode 100644
index 000000000..01df75f83
Binary files /dev/null and b/benchmarks/perf-tool/dataset/data-with-attr-with-filters.hdf5 differ
diff --git a/benchmarks/perf-tool/dataset/data-with-attr.hdf5 b/benchmarks/perf-tool/dataset/data-with-attr.hdf5
new file mode 100644
index 000000000..22873b06c
Binary files /dev/null and b/benchmarks/perf-tool/dataset/data-with-attr.hdf5 differ
diff --git a/benchmarks/perf-tool/okpt/io/config/parsers/util.py b/benchmarks/perf-tool/okpt/io/config/parsers/util.py
index cecb9f2d0..454fec5a0 100644
--- a/benchmarks/perf-tool/okpt/io/config/parsers/util.py
+++ b/benchmarks/perf-tool/okpt/io/config/parsers/util.py
@@ -13,9 +13,9 @@
def parse_dataset(dataset_format: str, dataset_path: str,
- context: Context) -> DataSet:
+ context: Context, custom_context=None) -> DataSet:
if dataset_format == 'hdf5':
- return HDF5DataSet(dataset_path, context)
+ return HDF5DataSet(dataset_path, context, custom_context)
if dataset_format == 'bigann' and context == Context.NEIGHBORS:
return BigANNNeighborDataSet(dataset_path)
diff --git a/benchmarks/perf-tool/okpt/io/dataset.py b/benchmarks/perf-tool/okpt/io/dataset.py
index 4f8bc22a2..001563bab 100644
--- a/benchmarks/perf-tool/okpt/io/dataset.py
+++ b/benchmarks/perf-tool/okpt/io/dataset.py
@@ -34,6 +34,7 @@ class Context(Enum):
INDEX = 1
QUERY = 2
NEIGHBORS = 3
+    CUSTOM = 4  # a dataset referenced by name via custom_context, e.g. 'attributes' or 'neighbors_filter_1'
class DataSet(ABC):
@@ -64,9 +65,9 @@ class HDF5DataSet(DataSet):
`_
"""
- def __init__(self, dataset_path: str, context: Context):
+ def __init__(self, dataset_path: str, context: Context, custom_context=None):
file = h5py.File(dataset_path)
- self.data = cast(h5py.Dataset, file[self._parse_context(context)])
+ self.data = cast(h5py.Dataset, file[self._parse_context(context, custom_context)])
self.current = 0
def read(self, chunk_size: int):
@@ -88,7 +89,7 @@ def reset(self):
self.current = 0
@staticmethod
- def _parse_context(context: Context) -> str:
+ def _parse_context(context: Context, custom_context=None) -> str:
if context == Context.NEIGHBORS:
return "neighbors"
@@ -98,6 +99,9 @@ def _parse_context(context: Context) -> str:
if context == Context.QUERY:
return "test"
+ if context == Context.CUSTOM:
+ return custom_context
+
raise Exception("Unsupported context")
diff --git a/benchmarks/perf-tool/okpt/test/steps/factory.py b/benchmarks/perf-tool/okpt/test/steps/factory.py
index 2b7bcc68d..2e53b4d4d 100644
--- a/benchmarks/perf-tool/okpt/test/steps/factory.py
+++ b/benchmarks/perf-tool/okpt/test/steps/factory.py
@@ -9,7 +9,7 @@
from okpt.test.steps.base import Step, StepConfig
from okpt.test.steps.steps import CreateIndexStep, DisableRefreshStep, RefreshIndexStep, DeleteIndexStep, \
- TrainModelStep, DeleteModelStep, ForceMergeStep, ClearCacheStep, IngestStep, QueryStep
+ TrainModelStep, DeleteModelStep, ForceMergeStep, ClearCacheStep, IngestStep, IngestMultiFieldStep, QueryStep, QueryWithFilterStep
def create_step(step_config: StepConfig) -> Step:
@@ -27,8 +27,12 @@ def create_step(step_config: StepConfig) -> Step:
return DeleteIndexStep(step_config)
elif step_config.step_name == IngestStep.label:
return IngestStep(step_config)
+ elif step_config.step_name == IngestMultiFieldStep.label:
+ return IngestMultiFieldStep(step_config)
elif step_config.step_name == QueryStep.label:
return QueryStep(step_config)
+ elif step_config.step_name == QueryWithFilterStep.label:
+ return QueryWithFilterStep(step_config)
elif step_config.step_name == ForceMergeStep.label:
return ForceMergeStep(step_config)
elif step_config.step_name == ClearCacheStep.label:
diff --git a/benchmarks/perf-tool/okpt/test/steps/steps.py b/benchmarks/perf-tool/okpt/test/steps/steps.py
index b61781a6e..bc43bf195 100644
--- a/benchmarks/perf-tool/okpt/test/steps/steps.py
+++ b/benchmarks/perf-tool/okpt/test/steps/steps.py
@@ -9,6 +9,7 @@
so the profiling decorators aren't needed for some functions.
"""
import json
+from abc import abstractmethod
from typing import Any, Dict, List
import numpy as np
@@ -18,7 +19,8 @@
from opensearchpy import OpenSearch, RequestsHttpConnection
from okpt.io.config.parsers.base import ConfigurationError
-from okpt.io.config.parsers.util import parse_string_param, parse_int_param, parse_dataset, parse_bool_param
+from okpt.io.config.parsers.util import parse_string_param, parse_int_param, parse_dataset, parse_bool_param, \
+ parse_list_param
from okpt.io.dataset import Context
from okpt.io.utils.reader import parse_json_from_path
from okpt.test.steps import base
@@ -279,11 +281,8 @@ def _get_measures(self) -> List[str]:
return ['took']
-class IngestStep(OpenSearchStep):
+class BaseIngestStep(OpenSearchStep):
"""See base class."""
-
- label = 'ingest'
-
def __init__(self, step_config: StepConfig):
super().__init__(step_config)
self.index_name = parse_string_param('index_name', step_config.config,
@@ -300,9 +299,9 @@ def __init__(self, step_config: StepConfig):
self.dataset = parse_dataset(dataset_format, dataset_path,
Context.INDEX)
- input_doc_count = parse_int_param('doc_count', step_config.config, {},
+ self.input_doc_count = parse_int_param('doc_count', step_config.config, {},
self.dataset.size())
- self.doc_count = min(input_doc_count, self.dataset.size())
+ self.doc_count = min(self.input_doc_count, self.dataset.size())
def _action(self):
@@ -313,10 +312,7 @@ def action(doc_id):
# much state may cause out of memory failure
for i in range(0, self.doc_count, self.bulk_size):
partition = self.dataset.read(self.bulk_size)
- if partition is None:
- break
- body = bulk_transform(partition, self.field_name, action, i)
- bulk_index(self.opensearch, self.index_name, body)
+ self._handle_data_bulk(partition, action, i)
self.dataset.reset()
@@ -325,11 +321,96 @@ def action(doc_id):
def _get_measures(self) -> List[str]:
return ['took']
+ @abstractmethod
+ def _handle_data_bulk(self, partition, action, i):
+ pass
+
-class QueryStep(OpenSearchStep):
+class IngestStep(BaseIngestStep):
"""See base class."""
- label = 'query'
+ label = 'ingest'
+
+ def _handle_data_bulk(self, partition, action, i):
+ if partition is None:
+ return
+ body = bulk_transform(partition, self.field_name, action, i)
+ bulk_index(self.opensearch, self.index_name, body)
+
+
+class IngestMultiFieldStep(BaseIngestStep):
+ """See base class."""
+
+ label = 'ingest_multi_field'
+
+ def __init__(self, step_config: StepConfig):
+ super().__init__(step_config)
+
+ dataset_path = parse_string_param('dataset_path', step_config.config,
+ {}, None)
+
+ self.attributes_dataset_name = parse_string_param('attributes_dataset_name',
+ step_config.config, {}, None)
+
+ self.attributes_dataset = parse_dataset('hdf5', dataset_path,
+ Context.CUSTOM, self.attributes_dataset_name)
+
+ self.attribute_spec = parse_list_param('attribute_spec',
+ step_config.config, {}, [])
+
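+        # attribute rows are read up front for all documents; the row index aligns with the document id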
+ self.partition_attr = self.attributes_dataset.read(self.doc_count)
+
+ def _handle_data_bulk(self, partition, action, i):
+ if partition is None:
+ return
+ body = self.bulk_transform_with_attributes(partition, self.partition_attr, self.field_name,
+ action, i, self.attribute_spec)
+ bulk_index(self.opensearch, self.index_name, body)
+
+ def bulk_transform_with_attributes(self, partition: np.ndarray, partition_attr, field_name: str,
+ action, offset: int, attributes_def) -> List[Dict[str, Any]]:
+ """Partitions and transforms a list of vectors into OpenSearch's bulk
+ injection format.
+ Args:
+ partition: An array of vectors to transform.
+ partition_attr: dictionary of additional data to transform
+ field_name: field name for action
+ action: Bulk API action.
+ offset: to start counting from
+ attributes_def: definition of additional doc fields
+ Returns:
+ An array of transformed vectors in bulk format.
+ """
+        # interleave bulk action metadata with placeholders for the documents
+        actions = []
+        for i in range(len(partition)):
+            actions.extend([action(i + offset), None])
+ idx = 1
+ part_list = partition.tolist()
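+        # documents occupy the odd slots of the actions list (bulk metadata sits at the even slots)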
+ for i in range(len(partition)):
+ actions[idx] = {field_name: part_list[i]}
+ attr_idx = i + offset
+ attr_def_idx = 0
+ for attribute in attributes_def:
+ attr_def_name = attribute['name']
+ attr_def_type = attribute['type']
+
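+                # a decoded value of 'None' marks a missing string attribute and is not indexed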
+ if attr_def_type == 'str':
+ val = partition_attr[attr_idx][attr_def_idx].decode()
+ if val != 'None':
+ actions[idx][attr_def_name] = val
+ elif attr_def_type == 'int':
+ val = int(partition_attr[attr_idx][attr_def_idx].decode())
+ actions[idx][attr_def_name] = val
+ attr_def_idx += 1
+ idx += 2
+
+ return actions
+
+
+class BaseQueryStep(OpenSearchStep):
+ """See base class."""
def __init__(self, step_config: StepConfig):
super().__init__(step_config)
@@ -353,29 +434,13 @@ def __init__(self, step_config: StepConfig):
self.dataset.size())
self.query_count = min(input_query_count, self.dataset.size())
- neighbors_format = parse_string_param('neighbors_format',
- step_config.config, {}, 'hdf5')
- neighbors_path = parse_string_param('neighbors_path',
- step_config.config, {}, None)
- self.neighbors = parse_dataset(neighbors_format, neighbors_path,
- Context.NEIGHBORS)
- self.implicit_config = step_config.implicit_config
+ self.neighbors_format = parse_string_param('neighbors_format',
+ step_config.config, {}, 'hdf5')
+ self.neighbors_path = parse_string_param('neighbors_path',
+ step_config.config, {}, None)
def _action(self):
- def get_body(vec):
- return {
- 'size': self.k,
- 'query': {
- 'knn': {
- self.field_name: {
- 'vector': vec,
- 'k': self.k
- }
- }
- }
- }
-
results = {}
query_responses = []
for _ in range(self.query_count):
@@ -384,7 +449,7 @@ def get_body(vec):
break
query_responses.append(
query_index(self.opensearch, self.index_name,
- get_body(query[0]), [self.field_name]))
+                            self.get_body(query[0]), [self.field_name]))
results['took'] = [
float(query_response['took']) for query_response in query_responses
@@ -414,6 +479,115 @@ def _get_measures(self) -> List[str]:
return measures
+ @abstractmethod
+ def get_body(self, vec):
+ pass
+
+
+class QueryStep(BaseQueryStep):
+ """See base class."""
+
+ label = 'query'
+
+ def __init__(self, step_config: StepConfig):
+ super().__init__(step_config)
+ self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path,
+ Context.NEIGHBORS)
+ self.implicit_config = step_config.implicit_config
+
+ def get_body(self, vec):
+ return {
+ 'size': self.k,
+ 'query': {
+ 'knn': {
+ self.field_name: {
+ 'vector': vec,
+ 'k': self.k
+ }
+ }
+ }
+ }
+
+
+class QueryWithFilterStep(BaseQueryStep):
+ """See base class."""
+
+ label = 'query_with_filter'
+
+ def __init__(self, step_config: StepConfig):
+ super().__init__(step_config)
+
+ neighbors_dataset = parse_string_param('neighbors_dataset',
+ step_config.config, {}, None)
+
+ self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path,
+ Context.CUSTOM, neighbors_dataset)
+
+ self.filter_type = parse_string_param('filter_type', step_config.config, {}, 'SCRIPT')
+ self.filter_spec = parse_string_param('filter_spec', step_config.config, {}, None)
+ self.score_script_similarity = parse_string_param('score_script_similarity', step_config.config, {}, 'l2')
+
+ self.implicit_config = step_config.implicit_config
+
+    def get_body(self, vec):
+        # load the filter specification from file; the context manager closes it promptly
+        with open(self.filter_spec) as filter_spec_file:
+            filter_json = json.load(filter_spec_file)
+ if self.filter_type == 'FILTER':
+ return {
+ 'size': self.k,
+ 'query': {
+ 'knn': {
+ self.field_name: {
+ 'vector': vec,
+ 'k': self.k,
+ 'filter': filter_json
+ }
+ }
+ }
+ }
+ elif self.filter_type == 'SCRIPT':
+ return {
+ 'size': self.k,
+ 'query': {
+ 'script_score': {
+ 'query': {
+ 'bool': {
+ 'filter': filter_json
+ }
+ },
+ 'script': {
+ 'source': 'knn_score',
+ 'lang': 'knn',
+ 'params': {
+ 'field': self.field_name,
+ 'query_value': vec,
+ 'space_type': self.score_script_similarity
+ }
+ }
+ }
+ }
+ }
+ elif self.filter_type == 'BOOL_POST_FILTER':
+ return {
+ 'size': self.k,
+ 'query': {
+ 'bool': {
+ 'filter': filter_json,
+ 'must': [
+ {
+ 'knn': {
+ self.field_name: {
+ 'vector': vec,
+ 'k': self.k
+ }
+ }
+ }
+ ]
+ }
+ }
+ }
+ else:
+            raise ConfigurationError('Unsupported filter type {}'.format(self.filter_type))
+
# Helper functions - (AKA not steps)
def bulk_transform(partition: np.ndarray, field_name: str, action,
@@ -520,16 +694,20 @@ def recall_at_r(results, neighbor_dataset, r, k, query_count):
Recall at R
"""
correct = 0.0
+ total_num_of_results = 0
for query in range(query_count):
true_neighbors = neighbor_dataset.read(1)
if true_neighbors is None:
break
true_neighbors_set = set(true_neighbors[0][:k])
- for j in range(r):
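+        # -1 entries mark ground-truth slots with no neighbor passing the filter; drop them
+        # and normalize recall by the number of valid ground-truth results instead of r * query_count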
+ true_neighbors_set.discard(-1)
+ min_r = min(r, len(true_neighbors_set))
+ total_num_of_results += min_r
+ for j in range(min_r):
if results[query][j] in true_neighbors_set:
correct += 1.0
- return correct / (r * query_count)
+ return correct / total_num_of_results
def get_index_size_in_kb(opensearch, index_name):
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-1-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-1-spec.json
new file mode 100644
index 000000000..f529de4fe
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/filter-spec/filter-1-spec.json
@@ -0,0 +1,24 @@
+{
+ "bool":
+ {
+ "must":
+ [
+ {
+ "range":
+ {
+ "age":
+ {
+ "gte": 20,
+ "lte": 100
+ }
+ }
+ },
+ {
+ "term":
+ {
+ "color": "red"
+ }
+ }
+ ]
+ }
+}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-2-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-2-spec.json
new file mode 100644
index 000000000..9d4514e62
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/filter-spec/filter-2-spec.json
@@ -0,0 +1,40 @@
+{
+ "bool":
+ {
+ "must":
+ [
+ {
+ "term":
+ {
+ "taste": "salty"
+ }
+ },
+ {
+ "bool":
+ {
+ "should":
+ [
+ {
+ "bool":
+ {
+ "must_not":
+ {
+ "exists":
+ {
+ "field": "color"
+ }
+ }
+ }
+ },
+ {
+ "term":
+ {
+ "color": "blue"
+ }
+ }
+ ]
+ }
+ }
+ ]
+ }
+}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-3-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-3-spec.json
new file mode 100644
index 000000000..d69f8768e
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/filter-spec/filter-3-spec.json
@@ -0,0 +1,30 @@
+{
+ "bool":
+ {
+ "must":
+ [
+ {
+ "range":
+ {
+ "age":
+ {
+ "gte": 20,
+ "lte": 80
+ }
+ }
+ },
+ {
+ "exists":
+ {
+ "field": "color"
+ }
+ },
+ {
+ "exists":
+ {
+ "field": "taste"
+ }
+ }
+ ]
+ }
+}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-4-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-4-spec.json
new file mode 100644
index 000000000..9e6356f1c
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/filter-spec/filter-4-spec.json
@@ -0,0 +1,44 @@
+{
+ "bool":
+ {
+ "must":
+ [
+ {
+ "range":
+ {
+ "age":
+ {
+ "gte": 30,
+ "lte": 60
+ }
+ }
+ },
+ {
+ "term":
+ {
+ "taste": "bitter"
+ }
+ },
+ {
+ "bool":
+ {
+ "should":
+ [
+ {
+ "term":
+ {
+ "color": "blue"
+ }
+ },
+ {
+ "term":
+ {
+ "color": "green"
+ }
+ }
+ ]
+ }
+ }
+ ]
+ }
+}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-5-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-5-spec.json
new file mode 100644
index 000000000..fecde0392
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/filter-spec/filter-5-spec.json
@@ -0,0 +1,42 @@
+{
+ "bool":
+ {
+ "should":
+ [
+ {
+ "range":
+ {
+ "age":
+ {
+ "gte": 30,
+ "lte": 70
+ }
+ }
+ },
+ {
+ "term":
+ {
+ "color": "green"
+ }
+ },
+ {
+ "term":
+ {
+ "color": "blue"
+ }
+ },
+ {
+ "term":
+ {
+ "color": "yellow"
+ }
+ },
+ {
+ "term":
+ {
+ "color": "sweet"
+ }
+ }
+ ]
+ }
+}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/index-spec.json b/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/index-spec.json
new file mode 100644
index 000000000..83ea79b15
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/index-spec.json
@@ -0,0 +1,27 @@
+{
+ "settings": {
+ "index": {
+ "knn": true,
+ "refresh_interval": "10s",
+ "number_of_shards": 30,
+ "number_of_replicas": 0
+ }
+ },
+ "mappings": {
+ "properties": {
+ "target_field": {
+ "type": "knn_vector",
+ "dimension": 128,
+ "method": {
+ "name": "hnsw",
+ "space_type": "l2",
+ "engine": "lucene",
+ "parameters": {
+ "ef_construction": 100,
+ "m": 16
+ }
+ }
+ }
+ }
+ }
+}
diff --git a/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/test.yml b/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/test.yml
new file mode 100644
index 000000000..aa2ee6389
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/test.yml
@@ -0,0 +1,41 @@
+endpoint: localhost
+test_name: lucene_sift_hnsw
+test_id: "Test workflow for lucene hnsw"
+num_runs: 1
+show_runs: false
+setup:
+ - name: delete_index
+ index_name: target_index
+steps:
+ - name: create_index
+ index_name: target_index
+ index_spec: sample-configs/lucene-sift-hnsw-filter/index-spec.json
+ - name: ingest_multi_field
+ index_name: target_index
+ field_name: target_field
+ bulk_size: 500
+ dataset_format: hdf5
+ dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
+ attributes_dataset_name: attributes
+ attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
+ - name: refresh_index
+ index_name: target_index
+ - name: force_merge
+ index_name: target_index
+ max_num_segments: 10
+ - name: query_with_filter
+ k: 10
+ r: 1
+ calculate_recall: true
+ index_name: target_index
+ field_name: target_field
+ dataset_format: hdf5
+ dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
+ neighbors_format: hdf5
+ neighbors_path: ../dataset/sift-128-euclidean-with-attr-with-filters.hdf5
+ neighbors_dataset: neighbors_filter_1
+ filter_spec: sample-configs/filter-spec/filter-1-spec.json
+ query_count: 100
+cleanup:
+ - name: delete_index
+ index_name: target_index
\ No newline at end of file