Adding workflow for benchmarking queries with filters

Signed-off-by: Martin Gaievski <[email protected]>
martin-gaievski committed Oct 27, 2022
1 parent 5bb7a3f commit 3064f07
Showing 15 changed files with 748 additions and 9 deletions.
77 changes: 77 additions & 0 deletions benchmarks/perf-tool/README.md
@@ -229,6 +229,29 @@ Ingests a dataset of vectors into the cluster.
| ----------- | ----------- | ----------- |
| took | Total time to ingest the dataset into the index.| ms |

#### ingest_extended

Ingests a dataset with multiple context types (for example, vectors plus an additional attributes dataset) into the cluster.

##### Parameters

| Parameter Name | Description | Default |
| ----------- | ----------- | ----------- |
| index_name | Name of index to ingest into | No default |
| field_name | Name of field to ingest into | No default |
| bulk_size | Documents per bulk request | 300 |
| dataset_format | Format the dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that ann-benchmarks organizes theirs. | 'hdf5' |
| dataset_path | Path to dataset | No default |
| doc_count | Number of documents to create from the dataset | Size of the dataset |
| attributes_dataset_name | Name of the dataset with additional attributes inside the main dataset | No default |
| attribute_spec | Definition of the attributes, in the format [{id: [id_val], name: [name_val], type: [type_val]}] | No default |
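
For example, a hypothetical `attribute_spec` for the three attributes generated by the bundled `add-filters-to-dataset.py` script could be `[{id: 0, name: color, type: str}, {id: 1, name: taste, type: str}, {id: 2, name: age, type: int}]` (the ids and type names here are illustrative, not a verified schema).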

##### Metrics

| Metric Name | Description | Unit |
| ----------- | ----------- | ----------- |
| took | Total time to ingest the dataset into the index.| ms |

#### query

Runs a set of queries against an index.
@@ -257,6 +280,60 @@ Runs a set of queries against an index.
| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |

#### query_with_filter

Runs a set of queries with a filter against an index.

##### Parameters

| Parameter Name | Description | Default |
| ----------- | ----------- | ----------- |
| k | Number of neighbors to return on search | 100 |
| r | r value in Recall@R | 1 |
| index_name | Name of index to search | No default |
| field_name | Name of field to search | No default |
| calculate_recall | Whether to calculate recall values | False |
| dataset_format | Format the dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that ann-benchmarks organizes theirs. | 'hdf5' |
| dataset_path | Path to dataset | No default |
| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that ann-benchmarks organizes theirs. | 'hdf5' |
| neighbors_path | Path to neighbors dataset | No default |
| neighbors_dataset | Name of the filter dataset inside the neighbors dataset, e.g. `neighbors_filter_1` as generated by the bundled script | No default |
| filter_spec | Path to filter specification | No default |
| filter_type | Type of filter format. Supported types are: <br/>FILTER: inner filter format for approximate k-NN search<br/>SCRIPT: score-script style with exact k-NN search | SCRIPT |
| score_script_similarity | Similarity function that was used to index the dataset. Used for the SCRIPT filter type and ignored for others | l2 |
| query_count | Number of queries to create from the dataset | Size of the dataset |
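
For illustration, the two filter types correspond to the following OpenSearch query shapes (a minimal sketch based on the k-NN plugin's query syntax; the field name and filter values are placeholders, not the tool's exact request bodies):

```python
# FILTER: the filter is embedded directly in the approximate k-NN query clause.
approximate_filter_query = {
    "query": {
        "knn": {
            "target_field": {
                "vector": [0.1, 0.2, 0.3],
                "k": 100,
                "filter": {"term": {"color": "red"}},
            }
        }
    }
}

# SCRIPT: exact k-NN scoring via a score script over the pre-filtered documents.
script_filter_query = {
    "query": {
        "script_score": {
            "query": {"bool": {"filter": {"term": {"color": "red"}}}},
            "script": {
                "source": "knn_score",
                "lang": "knn",
                "params": {
                    "field": "target_field",
                    "query_value": [0.1, 0.2, 0.3],
                    "space_type": "l2",
                },
            },
        }
    }
}
```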

##### Metrics

| Metric Name | Description | Unit |
| ----------- | ----------- | ----------- |
| took | Took times returned per query aggregated as total, p50, p90 and p99 (when applicable) | ms |
| memory_kb | Native memory k-NN is using at the end of the query workload | KB |
| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |

### Data sets

This benchmark tool uses pre-generated datasets to run indexing and query workloads. For some benchmark types an existing
dataset needs to be extended; filtering is one example of a use case where such an extension is needed.

You can use the script provided with this repo to generate the datasets and run benchmarks for filtering queries.
You need an existing dataset with vector data; it is used to generate the additional attribute data and the sets of ground-truth neighbor document ids.

To generate a dataset with attributes from a vectors-only dataset, use the following command pattern:

```commandline
python add-filters-to-dataset.py <path_to_dataset_with_vectors> <path_of_new_dataset_with_attributes> True False
```

To generate neighbor datasets for different filters from the dataset with attributes, use the following command pattern:

```commandline
python add-filters-to-dataset.py <path_to_dataset_with_attributes> <path_of_new_dataset_with_filters> False True
```
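
In both cases the script appends the `hdf5` extension to the output file name (see `create_dataset_file` in `add-filters-to-dataset.py`), so pass the new dataset path without an extension.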

After that, the new dataset(s) can be referenced from a test-case definition in the `ingest_extended` and `query_with_filter` steps.
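
To sanity-check the generated file, you can inspect it with `h5py` (a minimal sketch; the file name is illustrative and follows the examples above):

```python
import h5py

# The script creates one ground-truth dataset per predefined filter:
# 'neighbors_filter_1' ... 'neighbors_filter_5'.
with h5py.File('data-with-filters.hdf5', 'r') as hf:
    print(list(hf.keys()))
    # Each row mirrors the corresponding 'neighbors' row of the source dataset;
    # ids that do not match the filter are replaced with -1.
    print(hf['neighbors_filter_1'][0][:10])
```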

## Contributing

### Linting
172 changes: 172 additions & 0 deletions benchmarks/perf-tool/add-filters-to-dataset.py
@@ -0,0 +1,172 @@
import getopt
import os
import random
import sys

import h5py

from osb.extensions.data_set import Context, HDF5DataSet

"""
Script builds a complex dataset with additional attributes from an existing dataset that has only vectors.
The additional attributes are predefined in the script: color, taste, age. Only the HDF5 format of vector dataset is supported.
The script also generates additional datasets of neighbors (ground truth), one for each predefined filter.
Example of usage:
create a new hdf5 file with an attributes dataset:
add-filters-to-dataset.py ~/dev/opensearch/k-NN/benchmarks/perf-tool/dataset/data.hdf5 ~/dev/opensearch/datasets/data-with-attr True False
create a new hdf5 file with filter datasets:
add-filters-to-dataset.py ~/dev/opensearch/k-NN/benchmarks/perf-tool/dataset/data-with-attr.hdf5 ~/dev/opensearch/datasets/data-with-filters False True
"""

class Dataset():
DEFAULT_INDEX_NAME = "test-index"
DEFAULT_FIELD_NAME = "test-field"
DEFAULT_CONTEXT = Context.INDEX
DEFAULT_TYPE = HDF5DataSet.FORMAT_NAME
DEFAULT_NUM_VECTORS = 10
DEFAULT_DIMENSION = 10
DEFAULT_RANDOM_STRING_LENGTH = 8

def createDataset(self, source_dataset_path, out_file_path, generate_attrs: bool, generate_filters: bool) -> None:
path_elements = os.path.split(os.path.abspath(source_dataset_path))
data_set_dir = path_elements[0]

# resolve the source dataset to an absolute path and read the existing file
data_hdf5 = os.path.abspath(source_dataset_path)

with h5py.File(data_hdf5, "r") as hf:

if generate_attrs:
data_set_w_attr = self.create_dataset_file(out_file_path, self.DEFAULT_TYPE, data_set_dir)

possible_colors = ['red', 'green', 'yellow', 'blue', None]
possible_tastes = ['sweet', 'salty', 'sour', 'bitter', None]
max_age = 100

for key in hf.keys():
if key not in ['neighbors', 'test', 'train']:
continue
data_set_w_attr.create_dataset(key, data=hf[key][()])

attributes = []
for i in range(len(hf['train'])):
# random.randint is inclusive on both ends, so this yields ages in [0, max_age]
attr = [random.choice(possible_colors), random.choice(possible_tastes),
random.randint(0, max_age)]
attributes.append(attr)

# numpy stringifies None as b'None' under the 'S10' dtype, which is why the
# filter functions below compare against the literal string 'None'
data_set_w_attr.create_dataset('attributes', (len(attributes), 3), 'S10', data=attributes)

data_set_w_attr.flush()
data_set_w_attr.close()

if generate_filters:
attributes = hf['attributes'][()]
expected_neighbors = hf['neighbors'][()]

data_set_filters = self.create_dataset_file(out_file_path, self.DEFAULT_TYPE, data_set_dir)

# filter 1 - color = red and age >= 20
def filter1(attributes, vector_idx):
return (attributes[vector_idx][0].decode() == 'red'
and int(attributes[vector_idx][2].decode()) >= 20)

self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_1', filter1)

# filter 2 - color = blue or None and taste = 'salty'
def filter2(attributes, vector_idx):
return (attributes[vector_idx][0].decode() in ('blue', 'None')
and attributes[vector_idx][1].decode() == 'salty')

self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_2', filter2)

# filter 3 - color and taste are not None and age is between 20 and 80
def filter3(attributes, vector_idx):
return (attributes[vector_idx][0].decode() != 'None'
and attributes[vector_idx][1].decode() != 'None'
and 20 <= int(attributes[vector_idx][2].decode()) <= 80)

self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_3', filter3)

# filter 4 - color green or blue and taste is bitter and age is between (30, 60)
def filter4(attributes, vector_idx):
return (attributes[vector_idx][0].decode() in ('green', 'blue')
and attributes[vector_idx][1].decode() == 'bitter'
and 30 <= int(attributes[vector_idx][2].decode()) <= 60)

self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_4', filter4)

# filter 5 color is (green or blue or yellow) or taste = sweet or age is between (30, 70)
def filter5(attributes, vector_idx):
return (attributes[vector_idx][0].decode() in ('green', 'blue', 'yellow')
or attributes[vector_idx][1].decode() == 'sweet'
or 30 <= int(attributes[vector_idx][2].decode()) <= 70)

self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_5', filter5)

data_set_filters.flush()
data_set_filters.close()

# Builds the ground-truth dataset for one filter: neighbor ids that pass the
# filter keep their position, the rest are marked with -1; also prints the
# share of ground-truth entries that survive the filter.
def apply_filter(self, expected_neighbors, attributes, data_set_w_filtering, filter_name, filter_func):
neighbors_filter = []
filtered_count = 0
for expected_neighbors_row in expected_neighbors:
neighbors_filter_row = [-1] * len(expected_neighbors_row)
idx = 0
for vector_idx in expected_neighbors_row:
if filter_func(attributes, vector_idx):
neighbors_filter_row[idx] = vector_idx
idx += 1
filtered_count += 1
neighbors_filter.append(neighbors_filter_row)
overall_count = len(expected_neighbors) * len(expected_neighbors[0])
perc = float(filtered_count/overall_count) * 100
print('ground truth size for {} is {}, percentage {}'.format(filter_name, filtered_count, perc))
data_set_w_filtering.create_dataset(filter_name, data=neighbors_filter)
return expected_neighbors

# Creates (or opens) the output HDF5 file; the extension is appended here, and
# the file is created under data_set_dir unless file_name is an absolute path
def create_dataset_file(self, file_name, extension, data_set_dir) -> h5py.File:
data_set_file_name = "{}.{}".format(file_name, extension)
data_set_path = os.path.join(data_set_dir, data_set_file_name)

data_set_w_filtering = h5py.File(data_set_path, 'a')

return data_set_w_filtering


# args: <source_dataset_path> <output_path> <generate_attrs: bool> <generate_filters: bool>
def main(argv):
opts, args = getopt.getopt(argv, "")
in_file_path = args[0]
out_file_path = args[1]
generate_attr = str2bool(args[2])
generate_filters = str2bool(args[3])

worker = Dataset()
worker.createDataset(in_file_path, out_file_path, generate_attr, generate_filters)

def str2bool(v):
return v.lower() in ("yes", "true", "t", "1")

if __name__ == "__main__":
main(sys.argv[1:])

Binary file added benchmarks/perf-tool/dataset/data-with-attr.hdf5
4 changes: 2 additions & 2 deletions benchmarks/perf-tool/okpt/io/config/parsers/util.py
@@ -13,9 +13,9 @@


def parse_dataset(dataset_format: str, dataset_path: str,
context: Context) -> DataSet:
context: Context, custom_context=None) -> DataSet:
if dataset_format == 'hdf5':
return HDF5DataSet(dataset_path, context)
return HDF5DataSet(dataset_path, context, custom_context)

if dataset_format == 'bigann' and context == Context.NEIGHBORS:
return BigANNNeighborDataSet(dataset_path)
10 changes: 7 additions & 3 deletions benchmarks/perf-tool/okpt/io/dataset.py
Expand Up @@ -34,6 +34,7 @@ class Context(Enum):
INDEX = 1
QUERY = 2
NEIGHBORS = 3
CUSTOM = 4


class DataSet(ABC):
@@ -64,9 +65,9 @@ class HDF5DataSet(DataSet):
<https://github.com/erikbern/ann-benchmarks#data-sets>`_
"""

def __init__(self, dataset_path: str, context: Context):
def __init__(self, dataset_path: str, context: Context, custom_context=None):
file = h5py.File(dataset_path)
self.data = cast(h5py.Dataset, file[self._parse_context(context)])
self.data = cast(h5py.Dataset, file[self._parse_context(context, custom_context)])
self.current = 0

def read(self, chunk_size: int):
@@ -88,7 +89,7 @@ def reset(self):
self.current = 0

@staticmethod
def _parse_context(context: Context) -> str:
def _parse_context(context: Context, custom_context=None) -> str:
if context == Context.NEIGHBORS:
return "neighbors"

@@ -98,6 +99,9 @@ def _parse_context(context: Context) -> str:
if context == Context.QUERY:
return "test"

if context == Context.CUSTOM:
return custom_context

raise Exception("Unsupported context")
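
# Hypothetical usage of the new CUSTOM context (path and dataset name are
# illustrative, not part of this change):
#   attributes = HDF5DataSet('dataset/data-with-attr.hdf5', Context.CUSTOM,
#                            custom_context='attributes')
#   chunk = attributes.read(100)  # first 100 attribute rows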


Expand Down
6 changes: 5 additions & 1 deletion benchmarks/perf-tool/okpt/test/steps/factory.py
@@ -9,7 +9,7 @@
from okpt.test.steps.base import Step, StepConfig

from okpt.test.steps.steps import CreateIndexStep, DisableRefreshStep, RefreshIndexStep, DeleteIndexStep, \
TrainModelStep, DeleteModelStep, ForceMergeStep, ClearCacheStep, IngestStep, QueryStep
TrainModelStep, DeleteModelStep, ForceMergeStep, ClearCacheStep, IngestStep, IngestStepExtended, QueryStep, QueryWithFilterStep


def create_step(step_config: StepConfig) -> Step:
Expand All @@ -27,8 +27,12 @@ def create_step(step_config: StepConfig) -> Step:
return DeleteIndexStep(step_config)
elif step_config.step_name == IngestStep.label:
return IngestStep(step_config)
elif step_config.step_name == IngestStepExtended.label:
return IngestStepExtended(step_config)
elif step_config.step_name == QueryStep.label:
return QueryStep(step_config)
elif step_config.step_name == QueryWithFilterStep.label:
return QueryWithFilterStep(step_config)
elif step_config.step_name == ForceMergeStep.label:
return ForceMergeStep(step_config)
elif step_config.step_name == ClearCacheStep.label:
