
Adding benchmark workflow for queries with filters #598

Merged · 7 commits merged into main on Nov 14, 2022

Conversation

martin-gaievski (Member)

Signed-off-by: Martin Gaievski [email protected]

Description

Adds the ability to run benchmarks using k-NN queries with filters. There are two main parts to this change:

  • Data ingestion: a new script for dataset enrichment that builds an HDF5 dataset with additional attributes (for now, string and int attributes are supported), and a new test step that can read and ingest data with attributes.
  • Query: a new test step that builds a test query with a filter added from a provided filter definition. Filters can be defined with a scoring script or in the 'filter' field as part of the query (only supported for the Lucene engine).
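For illustration, the shape of a filtered query this step builds might look like the sketch below, expressed as a Python dict. The index and field names are made up for this example; the `filter` clause nested inside the `knn` clause follows the Lucene-engine form described above.

```python
# Hedged sketch of a k-NN query body carrying a filter, as supported by the
# Lucene engine. "target_field" and the attribute name are hypothetical.
def build_filtered_knn_query(field_name, vector, k, filter_clause):
    """Assemble a query body with a filter inside the knn clause."""
    return {
        "size": k,
        "query": {
            "knn": {
                field_name: {
                    "vector": vector,
                    "k": k,
                    "filter": filter_clause,
                }
            }
        },
    }

query = build_filtered_knn_query(
    "target_field", [0.1, 0.2, 0.3], 10, {"term": {"color": "red"}}
)
```

The scoring-script alternative mentioned above would instead wrap the filter in a script_score query; the nested form shown here is the one the PR calls out as Lucene-only.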

The README file has been updated, along with five sample filter definitions and an example test configuration that uses queries with filters.
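As a rough illustration of the enrichment idea (not the actual script), per-vector attributes could be generated like this. The spec shape below is hypothetical, loosely mirroring the attribute_spec entries shown later in this PR:

```python
import random

# Hypothetical spec: string attributes drawn from a fixed pool,
# int attributes drawn from a range.
ATTRIBUTE_SPEC = [
    {"name": "color", "type": "str", "values": ["red", "green", "blue"]},
    {"name": "taste", "type": "str", "values": ["sweet", "salty", "sour"]},
    {"name": "age", "type": "int", "min": 0, "max": 100},
]

def generate_attributes(num_vectors, spec, seed=42):
    """Produce one attribute row per vector, columns in spec order."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_vectors):
        row = []
        for attr in spec:
            if attr["type"] == "str":
                row.append(rng.choice(attr["values"]))
            else:
                row.append(rng.randint(attr["min"], attr["max"]))
        rows.append(row)
    return rows

rows = generate_attributes(5, ATTRIBUTE_SPEC)
```

The real script additionally writes these rows into the HDF5 file next to the vector dataset; this sketch only shows the attribute generation step.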

Check List

  • New functionality has been documented.
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski martin-gaievski added Enhancements Increases software capabilities beyond original client specifications backport 2.x 2.4.0 labels Oct 27, 2022
codecov-commenter commented Oct 27, 2022

Codecov Report

Merging #598 (1c66410) into main (5bb7a3f) will increase coverage by 0.04%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##               main     #598      +/-   ##
============================================
+ Coverage     84.77%   84.82%   +0.04%     
- Complexity     1059     1072      +13     
============================================
  Files           149      149              
  Lines          4301     4361      +60     
  Branches        382      397      +15     
============================================
+ Hits           3646     3699      +53     
- Misses          480      485       +5     
- Partials        175      177       +2     
Impacted Files Coverage Δ
...va/org/opensearch/knn/index/KNNCircuitBreaker.java 60.00% <0.00%> (-20.00%) ⬇️
...ain/java/org/opensearch/knn/index/KNNSettings.java 80.88% <0.00%> (-2.21%) ⬇️
...rg/opensearch/knn/index/query/KNNQueryBuilder.java 90.56% <0.00%> (+6.35%) ⬆️


@martin-gaievski martin-gaievski marked this pull request as ready for review October 27, 2022 15:41
@martin-gaievski martin-gaievski requested a review from a team October 27, 2022 15:41

The script generates an additional dataset of neighbours (ground truth) for each filter type.

Example of usage:
Member: What differentiates a filter from an attribute?

martin-gaievski (Member, Author):

In this context I use attribute for an additional field on a document that we'll index, and filter for the set of criteria that selects a subset out of the main set of documents. The script can work in two modes:

  1. It takes an existing set of vector data and adds fields of different types to each document. This is where I use the term attribute.
  2. Based on predefined rules, it takes the dataset generated in step 1 and applies a filter to it, so the outcome is a set of new datasets with true neighbors that are both ordered by similarity and filtered. All the new datasets are stored in a separate new file.
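A brute-force sketch of mode 2, computing filtered ground truth, assuming Euclidean distance and per-document attribute dicts (all names here are hypothetical, not the script's actual code):

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filtered_ground_truth(query, vectors, attributes, predicate, k):
    """Rank only the documents that pass the filter, then keep the top k ids."""
    candidates = [
        (doc_id, euclidean(query, vec))
        for doc_id, vec in enumerate(vectors)
        if predicate(attributes[doc_id])
    ]
    candidates.sort(key=lambda pair: pair[1])  # order by similarity
    return [doc_id for doc_id, _ in candidates[:k]]

vectors = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
attributes = [{"color": "red"}, {"color": "blue"},
              {"color": "red"}, {"color": "red"}]
truth = filtered_ground_truth([0.0, 0.0], vectors, attributes,
                              lambda a: a["color"] == "red", k=2)
# truth == [0, 2]: the two nearest vectors whose color is "red"
```

Document 1 is closer than document 2 but is excluded by the filter, which is exactly why filtered ground truth has to be recomputed rather than reused from the unfiltered dataset.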

@martin-gaievski martin-gaievski force-pushed the feature/benchmark-for-filtering-load branch from 2d7440e to fb8f345 Compare October 28, 2022 01:23
@heemin32 heemin32 removed the 2.4.0 label Nov 2, 2022
@martin-gaievski martin-gaievski force-pushed the feature/benchmark-for-filtering-load branch 2 times, most recently from 9778d79 to 3d1d915 Compare November 4, 2022 00:31
Signed-off-by: Martin Gaievski <[email protected]>
@martin-gaievski martin-gaievski force-pushed the feature/benchmark-for-filtering-load branch from 3d1d915 to bc09d85 Compare November 4, 2022 00:40
bulk_index(self.opensearch, self.index_name, body)


class IngestStepExtended(BaseIngestStep):
Member:

I think a better name would be IngestMultiFieldStep. I don't think Extended is intuitive.

martin-gaievski (Member, Author):

Agreed, Extended isn't very intuitive, but I couldn't figure out a better name. IngestMultiFieldStep sounds reasonable.
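A minimal sketch of what the multi-field ingestion could do: interleaving bulk action lines with documents that carry both the vector and the attribute fields. The function, index, and field names below are hypothetical, not the PR's actual code:

```python
def build_bulk_body(index_name, vector_field, vectors,
                    attribute_names, attribute_rows):
    """Build a bulk body: one action line plus one document per vector."""
    body = []
    for doc_id, vec in enumerate(vectors):
        body.append({"index": {"_index": index_name, "_id": doc_id}})
        doc = {vector_field: vec}
        # Attach the attribute columns for this row under their field names.
        doc.update(zip(attribute_names, attribute_rows[doc_id]))
        body.append(doc)
    return body

body = build_bulk_body(
    "target_index", "target_field",
    [[0.1, 0.2], [0.3, 0.4]],
    ["color", "age"],
    [["red", 7], ["blue", 42]],
)
```

The resulting list alternates action metadata and document source, which is the shape a bulk_index helper like the one in the snippet above would pass to OpenSearch.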

neighbors_dataset = parse_string_param('neighbors_dataset',
step_config.config, {}, None)

self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path,
Member:

Why is this CUSTOM and not NEIGHBORS?

martin-gaievski (Member, Author):

We have a dataset for each filter; instead of creating multiple files with the same dataset name, I use one file with a separate dataset per filter. That makes things a bit easier when there are many filters; for the Lucene benchmarking I used five.
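The layout being described, one file holding a separate ground-truth dataset per filter, can be pictured with an in-memory stand-in (the dataset names and values below are hypothetical; in the real tool these are named datasets inside one HDF5 file):

```python
# Stand-in for the HDF5 file: a mapping from dataset name to the
# filtered ground-truth neighbor ids, one dataset per filter.
ground_truth_file = {
    "neighbors_filter_1": [[4, 9, 2], [7, 1, 3]],
    "neighbors_filter_2": [[5, 0, 8], [6, 2, 4]],
}

def load_neighbors(datasets, filter_name):
    """Pick the ground-truth dataset matching the filter under test."""
    key = f"neighbors_{filter_name}"
    if key not in datasets:
        raise KeyError(f"no ground truth for {filter_name}")
    return datasets[key]

truth = load_neighbors(ground_truth_file, "filter_1")
```

Keeping every filter's ground truth in one file means a test configuration only has to name the dataset, rather than point at a different file per filter.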

dataset_format: hdf5
dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
attributes_dataset_name: attributes
attribute_spec: [ { id: 0, name: 'color', type: 'str' }, { id: 1, name: 'taste', type: 'str' }, { id: 2, name: 'age', type: 'int' } ]
Member:

Why is id needed here? Shouldn't all names be unique?

martin-gaievski (Member, Author):

It's more for ordering: when we generate the dataset with additional fields, those are written as a table, so in order to map a column from the dataset to a schema field we use ids. For instance, given the data set row:

2 | 32 | red

and the schema {{id: 0, name: age}, {id: 2, name: color}, {id: 1, name: weight}}

we can map age -> 2, color -> red, weight -> 32.
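The id-based mapping in that example can be sketched in a few lines (a toy illustration, not the PR's code):

```python
def map_row_to_fields(row, schema):
    """Map table columns to schema fields: ids give each field its column index."""
    ordered = sorted(schema, key=lambda field: field["id"])
    return {field["name"]: row[i] for i, field in enumerate(ordered)}

row = [2, 32, "red"]
schema = [{"id": 0, "name": "age"},
          {"id": 2, "name": "color"},
          {"id": 1, "name": "weight"}]
mapped = map_row_to_fields(row, schema)
# mapped == {"age": 2, "weight": 32, "color": "red"}
```

Sorting the schema by id recovers the column order regardless of how the spec entries are listed, which is the role the ids play here.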

@jmazanec15 (Member), Nov 10, 2022:
Right, but a list is being passed in. Why can't we keep that order for the reference?

Martin: I see, that makes sense. We can use the fields' order as the sequence.

Signed-off-by: Martin Gaievski <[email protected]>
@vamshin left a comment:

LGTM! Thanks

@martin-gaievski martin-gaievski merged commit 79ae6c2 into main Nov 14, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Nov 14, 2022
* Adding workflow for benchmarking queries with filters

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 79ae6c2)
martin-gaievski added a commit that referenced this pull request Nov 15, 2022
* Adding workflow for benchmarking queries with filters

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 79ae6c2)

Co-authored-by: Martin Gaievski <[email protected]>
Labels
backport 2.x Enhancements Increases software capabilities beyond original client specifications

5 participants