diff --git a/benchmarks/perf-tool/README.md b/benchmarks/perf-tool/README.md
index eb4ac0dc1..d4404f551 100644
--- a/benchmarks/perf-tool/README.md
+++ b/benchmarks/perf-tool/README.md
@@ -229,6 +229,28 @@ Ingests a dataset of vectors into the cluster.
 | ----------- | ----------- | ----------- |
 | took | Total time to ingest the dataset into the index.| ms |
 
+#### ingest_multi_field
+
+Ingests a dataset of vectors and additional attribute fields into the cluster.
+
+##### Parameters
+
+| Parameter Name | Description | Default |
+| ----------- | ----------- | ----------- |
+| index_name | Name of index to ingest into | No default |
+| field_name | Name of field to ingest into | No default |
+| bulk_size | Documents per bulk request | 300 |
+| dataset_format | Format the dataset is in. Currently hdf5 and bigann are supported. | 'hdf5' |
+| dataset_path | Path to data-set | No default |
+| doc_count | Number of documents to create from data-set | Size of the data-set |
+| attributes_dataset_name | Name of dataset with additional attributes inside the main dataset | No default |
+| attribute_spec | Definition of attributes, format is: [{ name: [name_val], type: [type_val] }]. Order is important and must match the order of the attributes column in the dataset file | No default |
+
+##### Metrics
+
+| Metric Name | Description | Unit |
+| ----------- | ----------- | ----------- |
+| took | Total time to ingest the dataset into the index.| ms |
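+
+A step definition using these parameters might look like the following (index, field, and dataset names are illustrative; they match the sample config under `sample-configs/lucene-sift-hnsw-filter`):
+
+```yaml
+  - name: ingest_multi_field
+    index_name: target_index
+    field_name: target_field
+    bulk_size: 500
+    dataset_format: hdf5
+    dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
+    attributes_dataset_name: attributes
+    attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
+```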
 
 #### query
 
 Runs a set of queries against an index.
 
@@ -257,6 +279,60 @@ Runs a set of queries against an index.
 | recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
 | recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |
 
+#### query_with_filter
+
+Runs a set of filtered queries against an index.
+
+##### Parameters
+
+| Parameter Name | Description | Default |
+| ----------- | ----------- | ----------- |
+| k | Number of neighbors to return on search | 100 |
+| r | r value in Recall@R | 1 |
+| index_name | Name of index to search | No default |
+| field_name | Name of field to search | No default |
+| calculate_recall | Whether to calculate recall values | False |
+| dataset_format | Format the dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that ann-benchmarks organizes theirs. | 'hdf5' |
+| dataset_path | Path to dataset | No default |
+| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that ann-benchmarks organizes theirs. | 'hdf5' |
+| neighbors_path | Path to neighbors dataset | No default |
+| neighbors_dataset | Name of filter dataset inside the neighbors dataset | No default |
+| filter_spec | Path to filter specification | No default |
+| filter_type | Type of filter to apply. The following types are supported:
FILTER: inner filter of the approximate k-NN query
SCRIPT: score script with exact k-NN search and pre-filtering
BOOL_POST_FILTER: bool query with post-filtering | SCRIPT |
+| score_script_similarity | Similarity function that was used to index the dataset. Used for the SCRIPT filter type and ignored for others | l2 |
+| query_count | Number of queries to create from data-set | Size of the data-set |
+
+##### Metrics
+
+| Metric Name | Description | Unit |
+| ----------- | ----------- | ----------- |
+| took | Took times returned per query aggregated as total, p50, p90 and p99 (when applicable) | ms |
+| memory_kb | Native memory the k-NN plugin is using at the end of the query workload | KB |
+| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
+| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |
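+
+A step definition might look like the following (paths and dataset names are illustrative and match the sample config; `filter_type` falls back to SCRIPT when omitted):
+
+```yaml
+  - name: query_with_filter
+    k: 10
+    r: 1
+    calculate_recall: true
+    index_name: target_index
+    field_name: target_field
+    dataset_format: hdf5
+    dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
+    neighbors_format: hdf5
+    neighbors_path: ../dataset/sift-128-euclidean-with-attr-with-filters.hdf5
+    neighbors_dataset: neighbors_filter_1
+    filter_spec: sample-configs/filter-spec/filter-1-spec.json
+    filter_type: SCRIPT
+    query_count: 100
+```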
+
+### Data sets
+
+This benchmark tool uses pre-generated data sets to run indexing and query workloads. For some benchmark types an
+existing dataset needs to be extended; filtering is one use case where such an extension is needed.
+
+A script provided with this repo can generate the extended datasets and ground truth needed to benchmark filtering queries.
+You need an existing dataset with vector data; it is used to generate additional attribute data and the sets of ground truth
+neighbor document ids.
+
+To generate a dataset with attributes from a vectors-only dataset, use the following command pattern (the script appends the
+`.hdf5` extension to the output file name):
+
+```commandline
+python add-filters-to-dataset.py <in_file_path> <out_file_path> True False
+```
+
+To generate the neighbors datasets for the different filters from the dataset with attributes, use the following command pattern:
+
+```commandline
+python add-filters-to-dataset.py <in_file_path> <out_file_path> False True
+```
+
+After that, the new dataset(s) can be referenced from a test case definition in the `ingest_multi_field` and `query_with_filter` steps.
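+
+For example (paths are illustrative):
+
+```commandline
+python add-filters-to-dataset.py ../dataset/sift-128-euclidean.hdf5 ../dataset/sift-128-euclidean-with-attr True False
+python add-filters-to-dataset.py ../dataset/sift-128-euclidean-with-attr.hdf5 ../dataset/sift-128-euclidean-with-attr-with-filters False True
+```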
 
 ## Contributing
 
 ### Linting
diff --git a/benchmarks/perf-tool/add-filters-to-dataset.py b/benchmarks/perf-tool/add-filters-to-dataset.py
new file mode 100644
index 000000000..0624f7323
--- /dev/null
+++ b/benchmarks/perf-tool/add-filters-to-dataset.py
@@ -0,0 +1,200 @@
+# SPDX-License-Identifier: Apache-2.0
+#
+# The OpenSearch Contributors require contributions made to
+# this file be licensed under the Apache-2.0 license or a
+# compatible open source license.
+"""
+Builds a complex dataset with additional attributes from an existing dataset that has only vectors.
+The additional attributes are predefined in the script: color, taste, age. Only the HDF5 format of vector dataset is supported.
+
+The output dataset file will have an additional dataset 'attributes' with multiple columns; each column corresponds to one
+attribute from the attribute set, and each value is generated at random, e.g.:
+
+0: green None 71
+1: green bitter 28
+
+There is no explicit index reference in the 'attributes' dataset; the index of a row corresponds to a document id.
+For instance, in the example above the two rows of fields map to the documents with ids '0' and '1'.
+
+If the 'generate_filters' flag is set, the script generates an additional dataset of neighbors (ground truth) for each filter type.
+The output is a new file with several datasets; each dataset corresponds to one filter and is named 'neighbors_filter_X',
+where X is the 1-based index of the particular filter.
+Each dataset has rows of integer arrays, where each integer corresponds to a document id from the original dataset with
+additional fields. An array can have -1 values that are treated as null; this is because the subset of filtered documents
+is the same size as, or smaller than, the original set.
+
+For example, the dataset file content may look like:
+
+neighbors_filter_1: [[ 2, 5, -1],
+                     [ 3, 1, -1],
+                     [ 2, 5, 7]]
+neighbors_filter_2: [[-1, -1, -1],
+                     [ 5, 6, -1],
+                     [ 4, 2, 1]]
+
+In this case we have datasets for two filters, with 3 query results each. [2, 5, -1] indicates that for the first query,
+if filter 1 is used, the most similar document has id 2, the next most similar has id 5, and the rest do not pass the
+filter 1 criteria.
+
+Example of script usage:
+
+    create a new hdf5 file with the attributes dataset
+    add-filters-to-dataset.py ~/dev/opensearch/k-NN/benchmarks/perf-tool/dataset/data.hdf5 ~/dev/opensearch/datasets/data-with-attr True False
+
+    create a new hdf5 file with the filter datasets
+    add-filters-to-dataset.py ~/dev/opensearch/k-NN/benchmarks/perf-tool/dataset/data-with-attr.hdf5 ~/dev/opensearch/datasets/data-with-filters False True
+"""
+
+import getopt
+import os
+import random
+import sys
+
+import h5py
+
+from osb.extensions.data_set import HDF5DataSet
+
+
+class _Dataset:
+    """Builds the additional attribute and filter (ground truth) datasets."""
+    DEFAULT_TYPE = HDF5DataSet.FORMAT_NAME
+
+    def create_dataset(self, source_dataset_path, out_file_path, generate_attrs: bool, generate_filters: bool) -> None:
+        path_elements = os.path.split(os.path.abspath(source_dataset_path))
+        data_set_dir = path_elements[0]
+
+        # read the existing dataset, resolved to an absolute path
+        data_hdf5 = os.path.abspath(source_dataset_path)
+
+        with h5py.File(data_hdf5, "r") as hf:
+
+            if generate_attrs:
+                data_set_w_attr = self.create_dataset_file(out_file_path, self.DEFAULT_TYPE, data_set_dir)
+
+                possible_colors = ['red', 'green', 'yellow', 'blue', None]
+                possible_tastes = ['sweet', 'salty', 'sour', 'bitter', None]
+                max_age = 100
+
+                # copy the original datasets over unchanged
+                for key in hf.keys():
+                    if key not in ['neighbors', 'test', 'train']:
+                        continue
+                    data_set_w_attr.create_dataset(key, data=hf[key][()])
+
+                # one random [color, taste, age] triple per 'train' vector
+                attributes = []
+                for i in range(len(hf['train'])):
+                    attr = [random.choice(possible_colors), random.choice(possible_tastes),
+                            random.randint(0, max_age)]
+                    attributes.append(attr)
+
+                data_set_w_attr.create_dataset('attributes', (len(attributes), 3), 'S10', data=attributes)
+
+                data_set_w_attr.flush()
+                data_set_w_attr.close()
+
+            if generate_filters:
+                attributes = hf['attributes'][()]
+                expected_neighbors = hf['neighbors'][()]
+
+                data_set_filters = self.create_dataset_file(out_file_path, self.DEFAULT_TYPE, data_set_dir)
+
+                # filter 1 - color = red and age >= 20
+                def filter1(attributes, vector_idx):
+                    return attributes[vector_idx][0].decode() == 'red' \
+                           and int(attributes[vector_idx][2].decode()) >= 20
+
+                self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_1', filter1)
+
+                # filter 2 - color = blue or None and taste = salty
+                def filter2(attributes, vector_idx):
+                    return (attributes[vector_idx][0].decode() == 'blue'
+                            or attributes[vector_idx][0].decode() == 'None') \
+                           and attributes[vector_idx][1].decode() == 'salty'
+
+                self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_2', filter2)
+
+                # filter 3 - color and taste are not None and age is between 20 and 80
+                def filter3(attributes, vector_idx):
+                    return attributes[vector_idx][0].decode() != 'None' \
+                           and attributes[vector_idx][1].decode() != 'None' \
+                           and 20 <= int(attributes[vector_idx][2].decode()) <= 80
+
+                self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_3', filter3)
+
+                # filter 4 - color is green or blue and taste is bitter and age is between 30 and 60
+                def filter4(attributes, vector_idx):
+                    return (attributes[vector_idx][0].decode() == 'green'
+                            or attributes[vector_idx][0].decode() == 'blue') \
+                           and attributes[vector_idx][1].decode() == 'bitter' \
+                           and 30 <= int(attributes[vector_idx][2].decode()) <= 60
+
+                self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_4', filter4)
+
+                # filter 5 - color is green, blue or yellow, or taste = sweet, or age is between 30 and 70
+                def filter5(attributes, vector_idx):
+                    return attributes[vector_idx][0].decode() == 'green' \
+                           or attributes[vector_idx][0].decode() == 'blue' \
+                           or attributes[vector_idx][0].decode() == 'yellow' \
+                           or attributes[vector_idx][1].decode() == 'sweet' \
+                           or 30 <= int(attributes[vector_idx][2].decode()) <= 70
+
+                self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_5', filter5)
+
+                data_set_filters.flush()
+                data_set_filters.close()
+
+    def apply_filter(self, expected_neighbors, attributes, data_set_w_filtering, filter_name, filter_func):
+        neighbors_filter = []
+        filtered_count = 0
+        for expected_neighbors_row in expected_neighbors:
+            # unfilled positions stay -1, the 'null' marker
+            neighbors_filter_row = [-1] * len(expected_neighbors_row)
+            idx = 0
+            for vector_idx in expected_neighbors_row:
+                if filter_func(attributes, vector_idx):
+                    neighbors_filter_row[idx] = vector_idx
+                    idx += 1
+                    filtered_count += 1
+            neighbors_filter.append(neighbors_filter_row)
+        overall_count = len(expected_neighbors) * len(expected_neighbors[0])
+        perc = float(filtered_count / overall_count) * 100
+        print('ground truth size for {} is {}, percentage {}'.format(filter_name, filtered_count, perc))
+        data_set_w_filtering.create_dataset(filter_name, data=neighbors_filter)
+        return expected_neighbors
+
+    def create_dataset_file(self, file_name, extension, data_set_dir) -> h5py.File:
+        data_set_file_name = "{}.{}".format(file_name, extension)
+        data_set_path = os.path.join(data_set_dir, data_set_file_name)
+
+        data_set_w_filtering = h5py.File(data_set_path, 'a')
+
+        return data_set_w_filtering
+
+
+def main(argv):
+    opts, args = getopt.getopt(argv, "")
+    in_file_path = args[0]
+    out_file_path = args[1]
+    generate_attr = str2bool(args[2])
+    generate_filters = str2bool(args[3])
+
+    worker = _Dataset()
+    worker.create_dataset(in_file_path, out_file_path, generate_attr, generate_filters)
+
+
+def str2bool(v):
+    return v.lower() in ("yes", "true", "t", "1")
+
+
+if __name__ == "__main__":
+    main(sys.argv[1:])
diff --git a/benchmarks/perf-tool/dataset/data-with-attr-with-filters.hdf5 b/benchmarks/perf-tool/dataset/data-with-attr-with-filters.hdf5
new file mode 100644
index 000000000..01df75f83
Binary files /dev/null and b/benchmarks/perf-tool/dataset/data-with-attr-with-filters.hdf5 differ
diff --git a/benchmarks/perf-tool/dataset/data-with-attr.hdf5 b/benchmarks/perf-tool/dataset/data-with-attr.hdf5
new file mode 100644
index 000000000..22873b06c
Binary files /dev/null and b/benchmarks/perf-tool/dataset/data-with-attr.hdf5 differ
diff --git a/benchmarks/perf-tool/okpt/io/config/parsers/util.py b/benchmarks/perf-tool/okpt/io/config/parsers/util.py
index cecb9f2d0..454fec5a0 100644
--- a/benchmarks/perf-tool/okpt/io/config/parsers/util.py
+++ b/benchmarks/perf-tool/okpt/io/config/parsers/util.py
@@ -13,9 +13,9 @@
 def parse_dataset(dataset_format: str,
                   dataset_path: str,
-                  context: Context) -> DataSet:
+                  context: Context, custom_context=None) -> DataSet:
     if dataset_format == 'hdf5':
-        return HDF5DataSet(dataset_path, context)
+        return HDF5DataSet(dataset_path, context, custom_context)
 
     if dataset_format == 'bigann' and context == Context.NEIGHBORS:
         return BigANNNeighborDataSet(dataset_path)
diff --git a/benchmarks/perf-tool/okpt/io/dataset.py b/benchmarks/perf-tool/okpt/io/dataset.py
index 4f8bc22a2..001563bab 100644
--- a/benchmarks/perf-tool/okpt/io/dataset.py
+++ b/benchmarks/perf-tool/okpt/io/dataset.py
@@ -34,6 +34,7 @@ class Context(Enum):
     INDEX = 1
     QUERY = 2
     NEIGHBORS = 3
+    CUSTOM = 4
 
 
 class DataSet(ABC):
@@ -64,9 +65,9 @@ class HDF5DataSet(DataSet):
     `_
     """
 
-    def __init__(self, dataset_path: str, context: Context):
+    def __init__(self, dataset_path: str, context: Context, custom_context=None):
         file = h5py.File(dataset_path)
-        self.data = cast(h5py.Dataset, file[self._parse_context(context)])
+        self.data = cast(h5py.Dataset, file[self._parse_context(context, custom_context)])
         self.current = 0
 
     def read(self, chunk_size: int):
@@ -88,7 +89,7 @@ def reset(self):
         self.current = 0
 
     @staticmethod
-    def _parse_context(context: Context) -> str:
+    def _parse_context(context: Context, custom_context=None) -> str:
         if context == Context.NEIGHBORS:
             return "neighbors"
 
@@ -98,6 +99,9 @@ def _parse_context(context: Context) -> str:
         if context == Context.QUERY:
             return "test"
 
+        if context == Context.CUSTOM:
+            return custom_context
+
         raise Exception("Unsupported context")
diff --git a/benchmarks/perf-tool/okpt/test/steps/factory.py b/benchmarks/perf-tool/okpt/test/steps/factory.py
index 2b7bcc68d..2e53b4d4d 100644
--- a/benchmarks/perf-tool/okpt/test/steps/factory.py
+++ b/benchmarks/perf-tool/okpt/test/steps/factory.py
@@ -9,7 +9,7 @@
 from okpt.test.steps.base import Step, StepConfig
 
 from okpt.test.steps.steps import CreateIndexStep, DisableRefreshStep, RefreshIndexStep, DeleteIndexStep, \
-    TrainModelStep, DeleteModelStep, ForceMergeStep, ClearCacheStep, IngestStep, QueryStep
+    TrainModelStep, DeleteModelStep, ForceMergeStep, ClearCacheStep, IngestStep, IngestMultiFieldStep, QueryStep, QueryWithFilterStep
 
 
 def create_step(step_config: StepConfig) -> Step:
@@ -27,8 +27,12 @@ def create_step(step_config: StepConfig) -> Step:
         return DeleteIndexStep(step_config)
     elif step_config.step_name == IngestStep.label:
         return IngestStep(step_config)
+    elif step_config.step_name == IngestMultiFieldStep.label:
+        return IngestMultiFieldStep(step_config)
     elif step_config.step_name == QueryStep.label:
         return QueryStep(step_config)
+    elif step_config.step_name == QueryWithFilterStep.label:
+        return QueryWithFilterStep(step_config)
     elif step_config.step_name == ForceMergeStep.label:
         return ForceMergeStep(step_config)
     elif step_config.step_name == ClearCacheStep.label:
diff --git a/benchmarks/perf-tool/okpt/test/steps/steps.py b/benchmarks/perf-tool/okpt/test/steps/steps.py
index b61781a6e..bc43bf195 100644
--- a/benchmarks/perf-tool/okpt/test/steps/steps.py
+++ b/benchmarks/perf-tool/okpt/test/steps/steps.py
@@ -9,6 +9,7 @@
 so the profiling decorators aren't needed for some functions.
 """
 import json
+from abc import abstractmethod
 from typing import Any, Dict, List
 
 import numpy as np
@@ -18,7 +19,8 @@
 from opensearchpy import OpenSearch, RequestsHttpConnection
 
 from okpt.io.config.parsers.base import ConfigurationError
-from okpt.io.config.parsers.util import parse_string_param, parse_int_param, parse_dataset, parse_bool_param
+from okpt.io.config.parsers.util import parse_string_param, parse_int_param, parse_dataset, parse_bool_param, \
+    parse_list_param
 from okpt.io.dataset import Context
 from okpt.io.utils.reader import parse_json_from_path
 from okpt.test.steps import base
@@ -279,11 +281,8 @@ def _get_measures(self) -> List[str]:
         return ['took']
 
 
-class IngestStep(OpenSearchStep):
+class BaseIngestStep(OpenSearchStep):
     """See base class."""
-
-    label = 'ingest'
-
     def __init__(self, step_config: StepConfig):
         super().__init__(step_config)
         self.index_name = parse_string_param('index_name', step_config.config,
@@ -300,9 +299,9 @@ def __init__(self, step_config: StepConfig):
         self.dataset = parse_dataset(dataset_format, dataset_path,
                                      Context.INDEX)
 
-        input_doc_count = parse_int_param('doc_count', step_config.config, {},
+        self.input_doc_count = parse_int_param('doc_count', step_config.config, {},
                                           self.dataset.size())
-        self.doc_count = min(input_doc_count, self.dataset.size())
+        self.doc_count = min(self.input_doc_count, self.dataset.size())
 
     def _action(self):
 
@@ -313,10 +312,7 @@ def action(doc_id):
         # much state may cause out of memory failure
         for i in range(0, self.doc_count, self.bulk_size):
             partition = self.dataset.read(self.bulk_size)
-            if partition is None:
-                break
-            body = bulk_transform(partition, self.field_name, action, i)
-            bulk_index(self.opensearch, self.index_name, body)
+            self._handle_data_bulk(partition, action, i)
 
         self.dataset.reset()
 
@@ -325,11 +321,96 @@ def action(doc_id):
     def _get_measures(self) -> List[str]:
         return ['took']
 
+    @abstractmethod
+    def _handle_data_bulk(self, partition, action, i):
+        pass
+
 
-class QueryStep(OpenSearchStep):
+class IngestStep(BaseIngestStep):
     """See base class."""
 
-    label = 'query'
+    label = 'ingest'
+
+    def _handle_data_bulk(self, partition, action, i):
+        if partition is None:
+            return
+        body = bulk_transform(partition, self.field_name, action, i)
+        bulk_index(self.opensearch, self.index_name, body)
+
+
+class IngestMultiFieldStep(BaseIngestStep):
+    """See base class."""
+
+    label = 'ingest_multi_field'
+
+    def __init__(self, step_config: StepConfig):
+        super().__init__(step_config)
+
+        dataset_path = parse_string_param('dataset_path', step_config.config,
+                                          {}, None)
+
+        self.attributes_dataset_name = parse_string_param('attributes_dataset_name',
+                                                          step_config.config, {}, None)
+
+        self.attributes_dataset = parse_dataset('hdf5', dataset_path,
+                                                Context.CUSTOM, self.attributes_dataset_name)
+
+        self.attribute_spec = parse_list_param('attribute_spec',
+                                               step_config.config, {}, [])
+
+        self.partition_attr = self.attributes_dataset.read(self.doc_count)
+
+    def _handle_data_bulk(self, partition, action, i):
+        if partition is None:
+            return
+        body = self.bulk_transform_with_attributes(partition, self.partition_attr, self.field_name,
+                                                   action, i, self.attribute_spec)
+        bulk_index(self.opensearch, self.index_name, body)
+
+    def bulk_transform_with_attributes(self, partition: np.ndarray, partition_attr, field_name: str,
+                                       action, offset: int, attributes_def) -> List[Dict[str, Any]]:
+        """Partitions and transforms a list of vectors into OpenSearch's bulk
+        injection format.
+        Args:
+            partition: An array of vectors to transform.
+            partition_attr: array of additional attribute data to transform
+            field_name: field name for action
+            action: Bulk API action.
+            offset: to start counting from
+            attributes_def: definition of additional doc fields
+        Returns:
+            An array of transformed vectors in bulk format.
+        """
+        actions = []
+        _ = [
+            actions.extend([action(i + offset), None])
+            for i in range(len(partition))
+        ]
+        idx = 1
+        part_list = partition.tolist()
+        for i in range(len(partition)):
+            actions[idx] = {field_name: part_list[i]}
+            attr_idx = i + offset
+            attr_def_idx = 0
+            for attribute in attributes_def:
+                attr_def_name = attribute['name']
+                attr_def_type = attribute['type']
+
+                if attr_def_type == 'str':
+                    val = partition_attr[attr_idx][attr_def_idx].decode()
+                    # the string 'None' marks a missing attribute; such values
+                    # are not added to the document
+                    if val != 'None':
+                        actions[idx][attr_def_name] = val
+                elif attr_def_type == 'int':
+                    val = int(partition_attr[attr_idx][attr_def_idx].decode())
+                    actions[idx][attr_def_name] = val
+                attr_def_idx += 1
+            idx += 2
+
+        return actions
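+
+# Illustrative shape of one action/source pair produced above (attribute values
+# are randomly generated, the vector is truncated, and the exact action line
+# comes from the step's action(doc_id) callable):
+#   { ...bulk index action for doc 0... }
+#   {'target_field': [0.12, 0.85, ...], 'color': 'green', 'age': 71}
+# A 'None'-valued string attribute ('taste' here) is omitted from the document.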
+
+
+class BaseQueryStep(OpenSearchStep):
+    """See base class."""
 
     def __init__(self, step_config: StepConfig):
         super().__init__(step_config)
@@ -353,29 +434,13 @@ def __init__(self, step_config: StepConfig):
                                            self.dataset.size())
         self.query_count = min(input_query_count, self.dataset.size())
 
-        neighbors_format = parse_string_param('neighbors_format',
-                                              step_config.config, {}, 'hdf5')
-        neighbors_path = parse_string_param('neighbors_path',
-                                            step_config.config, {}, None)
-        self.neighbors = parse_dataset(neighbors_format, neighbors_path,
-                                       Context.NEIGHBORS)
-        self.implicit_config = step_config.implicit_config
+        self.neighbors_format = parse_string_param('neighbors_format',
+                                                   step_config.config, {}, 'hdf5')
+        self.neighbors_path = parse_string_param('neighbors_path',
+                                                 step_config.config, {}, None)
 
     def _action(self):
-        def get_body(vec):
-            return {
-                'size': self.k,
-                'query': {
-                    'knn': {
-                        self.field_name: {
-                            'vector': vec,
-                            'k': self.k
-                        }
-                    }
-                }
-            }
-
         results = {}
         query_responses = []
         for _ in range(self.query_count):
@@ -384,7 +449,7 @@ def get_body(vec):
                 break
             query_responses.append(
                 query_index(self.opensearch, self.index_name,
-                            get_body(query[0]), [self.field_name]))
+                            self.get_body(query[0]), [self.field_name]))
 
         results['took'] = [
             float(query_response['took'])
             for query_response in query_responses
@@ -414,6 +479,115 @@ def _get_measures(self) -> List[str]:
 
         return measures
 
+    @abstractmethod
+    def get_body(self, vec):
+        pass
+
+
+class QueryStep(BaseQueryStep):
+    """See base class."""
+
+    label = 'query'
+
+    def __init__(self, step_config: StepConfig):
+        super().__init__(step_config)
+        self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path,
+                                       Context.NEIGHBORS)
+        self.implicit_config = step_config.implicit_config
+
+    def get_body(self, vec):
+        return {
+            'size': self.k,
+            'query': {
+                'knn': {
+                    self.field_name: {
+                        'vector': vec,
+                        'k': self.k
+                    }
+                }
+            }
+        }
+
+
+class QueryWithFilterStep(BaseQueryStep):
+    """See base class."""
+
+    label = 'query_with_filter'
+
+    def __init__(self, step_config: StepConfig):
+        super().__init__(step_config)
+
+        neighbors_dataset = parse_string_param('neighbors_dataset',
+                                               step_config.config, {}, None)
+
+        self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path,
+                                       Context.CUSTOM, neighbors_dataset)
+
+        self.filter_type = parse_string_param('filter_type', step_config.config, {}, 'SCRIPT')
+        self.filter_spec = parse_string_param('filter_spec', step_config.config, {}, None)
+        self.score_script_similarity = parse_string_param('score_script_similarity',
+                                                          step_config.config, {}, 'l2')
+
+        self.implicit_config = step_config.implicit_config
+
+    def get_body(self, vec):
+        with open(self.filter_spec) as filter_spec_file:
+            filter_json = json.load(filter_spec_file)
+        if self.filter_type == 'FILTER':
+            return {
+                'size': self.k,
+                'query': {
+                    'knn': {
+                        self.field_name: {
+                            'vector': vec,
+                            'k': self.k,
+                            'filter': filter_json
+                        }
+                    }
+                }
+            }
+        elif self.filter_type == 'SCRIPT':
+            return {
+                'size': self.k,
+                'query': {
+                    'script_score': {
+                        'query': {
+                            'bool': {
+                                'filter': filter_json
+                            }
+                        },
+                        'script': {
+                            'source': 'knn_score',
+                            'lang': 'knn',
+                            'params': {
+                                'field': self.field_name,
+                                'query_value': vec,
+                                'space_type': self.score_script_similarity
+                            }
+                        }
+                    }
+                }
+            }
+        elif self.filter_type == 'BOOL_POST_FILTER':
+            return {
+                'size': self.k,
+                'query': {
+                    'bool': {
+                        'filter': filter_json,
+                        'must': [
+                            {
+                                'knn': {
+                                    self.field_name: {
+                                        'vector': vec,
+                                        'k': self.k
+                                    }
+                                }
+                            }
+                        ]
+                    }
+                }
+            }
+        else:
+            raise ConfigurationError('Unsupported filter type {}'.format(self.filter_type))
+
 
 # Helper functions - (AKA not steps)
 def bulk_transform(partition: np.ndarray, field_name: str, action,
@@ -520,16 +694,20 @@ def recall_at_r(results, neighbor_dataset, r, k, query_count):
         Recall at R
     """
     correct = 0.0
+    total_num_of_results = 0
     for query in range(query_count):
         true_neighbors = neighbor_dataset.read(1)
         if true_neighbors is None:
             break
         true_neighbors_set = set(true_neighbors[0][:k])
-        for j in range(r):
+        # -1 entries pad ground-truth rows that have fewer than k filtered
+        # neighbors; drop them so they are never counted as hits
+        true_neighbors_set.discard(-1)
+        min_r = min(r, len(true_neighbors_set))
+        total_num_of_results += min_r
+        for j in range(min_r):
             if results[query][j] in true_neighbors_set:
                 correct += 1.0
 
-    return correct / (r * query_count)
+    return correct / total_num_of_results
 
 
 def get_index_size_in_kb(opensearch, index_name):
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-1-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-1-spec.json
new file mode 100644
index 000000000..f529de4fe
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/filter-spec/filter-1-spec.json
@@ -0,0 +1,24 @@
+{
+    "bool":
+    {
+        "must":
+        [
+            {
+                "range":
+                {
+                    "age":
+                    {
+                        "gte": 20,
+                        "lte": 100
+                    }
+                }
+            },
+            {
+                "term":
+                {
+                    "color": "red"
+                }
+            }
+        ]
+    }
+}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-2-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-2-spec.json
new file mode 100644
index 000000000..9d4514e62
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/filter-spec/filter-2-spec.json
@@ -0,0 +1,40 @@
+{
+    "bool":
+    {
+        "must":
+        [
+            {
+                "term":
+                {
+                    "taste": "salty"
+                }
+            },
+            {
+                "bool":
+                {
+                    "should":
+                    [
+                        {
+                            "bool":
+                            {
+                                "must_not":
+                                {
+                                    "exists":
+                                    {
+                                        "field": "color"
+                                    }
+                                }
+                            }
+                        },
+                        {
+                            "term":
+                            {
+                                "color": "blue"
+                            }
+                        }
+                    ]
+                }
+            }
+        ]
+    }
+}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-3-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-3-spec.json
new file mode 100644
index 000000000..d69f8768e
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/filter-spec/filter-3-spec.json
@@ -0,0 +1,30 @@
+{
+    "bool":
+    {
+        "must":
+        [
+            {
+                "range":
+                {
+                    "age":
+                    {
+                        "gte": 20,
+                        "lte": 80
+                    }
+                }
+            },
+            {
+                "exists":
+                {
+                    "field": "color"
+                }
+            },
+            {
+                "exists":
+                {
+                    "field": "taste"
+                }
+            }
+        ]
+    }
+}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-4-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-4-spec.json
new file mode 100644
index 000000000..9e6356f1c
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/filter-spec/filter-4-spec.json
@@ -0,0 +1,44 @@
+{
+    "bool":
+    {
+        "must":
+        [
+            {
+                "range":
+                {
+                    "age":
+                    {
+                        "gte": 30,
+                        "lte": 60
+                    }
+                }
+            },
+            {
+                "term":
+                {
+                    "taste": "bitter"
+                }
+            },
+            {
+                "bool":
+                {
+                    "should":
+                    [
+                        {
+                            "term":
+                            {
+                                "color": "blue"
+                            }
+                        },
+                        {
+                            "term":
+                            {
+                                "color": "green"
+                            }
+                        }
+                    ]
+                }
+            }
+        ]
+    }
+}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-5-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-5-spec.json
new file mode 100644
index 000000000..fecde0392
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/filter-spec/filter-5-spec.json
@@ -0,0 +1,42 @@
+{
+    "bool":
+    {
+        "should":
+        [
+            {
+                "range":
+                {
+                    "age":
+                    {
+                        "gte": 30,
+                        "lte": 70
+                    }
+                }
+            },
+            {
+                "term":
+                {
+                    "color": "green"
+                }
+            },
+            {
+                "term":
+                {
+                    "color": "blue"
+                }
+            },
+            {
+                "term":
+                {
+                    "color": "yellow"
+                }
+            },
+            {
+                "term":
+                {
+                    "taste": "sweet"
+                }
+            }
+        ]
+    }
+}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/index-spec.json b/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/index-spec.json
new file mode 100644
index 000000000..83ea79b15
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/index-spec.json
@@ -0,0 +1,27 @@
+{
+  "settings": {
+    "index": {
+      "knn": true,
+      "refresh_interval": "10s",
+      "number_of_shards": 30,
+      "number_of_replicas": 0
+    }
+  },
+  "mappings": {
+    "properties": {
+      "target_field": {
+        "type": "knn_vector",
+        "dimension": 128,
+        "method": {
+          "name": "hnsw",
+          "space_type": "l2",
+          "engine": "lucene",
+          "parameters": {
+            "ef_construction": 100,
+            "m": 16
+          }
+        }
+      }
+    }
+  }
+}
diff --git a/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/test.yml b/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/test.yml
new file mode 100644
index 000000000..aa2ee6389
--- /dev/null
+++ b/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/test.yml
@@ -0,0 +1,41 @@
+endpoint: localhost
+test_name: lucene_sift_hnsw
+test_id: "Test workflow for lucene hnsw"
+num_runs: 1
+show_runs: false
+setup:
+  - name: delete_index
+    index_name: target_index
+steps:
+  - name: create_index
+    index_name: target_index
+    index_spec: sample-configs/lucene-sift-hnsw-filter/index-spec.json
+  - name: ingest_multi_field
+    index_name: target_index
+    field_name: target_field
+    bulk_size: 500
+    dataset_format: hdf5
+    dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
+    attributes_dataset_name: attributes
+    attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
+  - name: refresh_index
+    index_name: target_index
+  - name: force_merge
+    index_name: target_index
+    max_num_segments: 10
+  - name: query_with_filter
+    k: 10
+    r: 1
+    calculate_recall: true
+    index_name: target_index
+    field_name: target_field
+    dataset_format: hdf5
+    dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
+    neighbors_format: hdf5
+    neighbors_path: ../dataset/sift-128-euclidean-with-attr-with-filters.hdf5
+    neighbors_dataset: neighbors_filter_1
+    filter_spec: sample-configs/filter-spec/filter-1-spec.json
+    query_count: 100
+cleanup:
+  - name: delete_index
+    index_name: target_index
\ No newline at end of file