
Adding benchmark workflow for queries with filters #598

Merged · 7 commits merged into main on Nov 14, 2022

Conversation

martin-gaievski (Member)

Signed-off-by: Martin Gaievski [email protected]

Description

Adds the ability to run benchmarks using k-NN queries with filters. There are two main parts to this change:

  • Data ingestion: a new script for dataset enrichment that builds an HDF5 dataset with additional attributes (for now, string and int attributes are supported), and a new test step that can read and ingest data with attributes.
  • Query: a new test step that builds a test query with a filter added from a provided filter definition. Filters can be defined with a scoring script or in the 'filter' field as part of the query (only supported for the Lucene engine).
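For illustration, the shape of a filtered query this step builds might look like the sketch below, expressed as a Python dict. The index and field names are made up for this example; the `filter` clause nested inside the `knn` clause follows the Lucene-engine form described above.

```python
# Hedged sketch of a k-NN query body carrying a filter, as supported by the
# Lucene engine. "target_field" and the attribute name are hypothetical.
def build_filtered_knn_query(field_name, vector, k, filter_clause):
    """Assemble a query body with a filter inside the knn clause."""
    return {
        "size": k,
        "query": {
            "knn": {
                field_name: {
                    "vector": vector,
                    "k": k,
                    "filter": filter_clause,
                }
            }
        },
    }

query = build_filtered_knn_query(
    "target_field", [0.1, 0.2, 0.3], 10, {"term": {"color": "red"}}
)
```

The scoring-script alternative mentioned above would instead wrap the filter in a script_score query; the nested form shown here is the one the PR calls out as Lucene-only.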

The README file has been updated, along with five sample filter definitions and an example test configuration that uses queries with filters.
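As a rough illustration of the enrichment idea (not the actual script), per-vector attributes could be generated like this. The spec shape below is hypothetical, loosely mirroring the attribute_spec entries shown later in this PR:

```python
import random

# Hypothetical spec: string attributes drawn from a fixed pool,
# int attributes drawn from a range.
ATTRIBUTE_SPEC = [
    {"name": "color", "type": "str", "values": ["red", "green", "blue"]},
    {"name": "taste", "type": "str", "values": ["sweet", "salty", "sour"]},
    {"name": "age", "type": "int", "min": 0, "max": 100},
]

def generate_attributes(num_vectors, spec, seed=42):
    """Produce one attribute row per vector, columns in spec order."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_vectors):
        row = []
        for attr in spec:
            if attr["type"] == "str":
                row.append(rng.choice(attr["values"]))
            else:
                row.append(rng.randint(attr["min"], attr["max"]))
        rows.append(row)
    return rows

rows = generate_attributes(5, ATTRIBUTE_SPEC)
```

The real script additionally writes these rows into the HDF5 file next to the vector dataset; this sketch only shows the attribute generation step.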

Check List

  • New functionality has been documented.
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski martin-gaievski added Enhancements Increases software capabilities beyond original client specifications backport 2.x 2.4.0 labels Oct 27, 2022
codecov-commenter commented Oct 27, 2022

Codecov Report

Merging #598 (1c66410) into main (5bb7a3f) will increase coverage by 0.04%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##               main     #598      +/-   ##
============================================
+ Coverage     84.77%   84.82%   +0.04%     
- Complexity     1059     1072      +13     
============================================
  Files           149      149              
  Lines          4301     4361      +60     
  Branches        382      397      +15     
============================================
+ Hits           3646     3699      +53     
- Misses          480      485       +5     
- Partials        175      177       +2     
Impacted Files Coverage Δ
...va/org/opensearch/knn/index/KNNCircuitBreaker.java 60.00% <0.00%> (-20.00%) ⬇️
...ain/java/org/opensearch/knn/index/KNNSettings.java 80.88% <0.00%> (-2.21%) ⬇️
...rg/opensearch/knn/index/query/KNNQueryBuilder.java 90.56% <0.00%> (+6.35%) ⬆️


@martin-gaievski martin-gaievski marked this pull request as ready for review October 27, 2022 15:41
@martin-gaievski martin-gaievski requested a review from a team October 27, 2022 15:41

The script generates an additional dataset of neighbours (ground truth) for each filter type.

Example of usage:
Member: What differentiates a filter from an attribute?

martin-gaievski (Member, Author):

In this context I use attribute for an additional field on a document that we'll index, and filter for the set of criteria that selects a subset out of the main set of documents. The script can work in two modes:

  1. It takes an existing set of vector data and adds fields of different types to each document. This is where I use the term attribute.
  2. Based on predefined rules, it takes the dataset generated in step 1 and applies a filter to it, so the outcome is a set of new datasets with true neighbors that are both ordered by similarity and filtered. All the new datasets are stored in a separate new file.
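A brute-force sketch of mode 2, computing filtered ground truth, assuming Euclidean distance and per-document attribute dicts (all names here are hypothetical, not the script's actual code):

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filtered_ground_truth(query, vectors, attributes, predicate, k):
    """Rank only the documents that pass the filter, then keep the top k ids."""
    candidates = [
        (doc_id, euclidean(query, vec))
        for doc_id, vec in enumerate(vectors)
        if predicate(attributes[doc_id])
    ]
    candidates.sort(key=lambda pair: pair[1])  # order by similarity
    return [doc_id for doc_id, _ in candidates[:k]]

vectors = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
attributes = [{"color": "red"}, {"color": "blue"},
              {"color": "red"}, {"color": "red"}]
truth = filtered_ground_truth([0.0, 0.0], vectors, attributes,
                              lambda a: a["color"] == "red", k=2)
# truth == [0, 2]: the two nearest vectors whose color is "red"
```

Document 1 is closer than document 2 but is excluded by the filter, which is exactly why filtered ground truth has to be recomputed rather than reused from the unfiltered dataset.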

@martin-gaievski martin-gaievski force-pushed the feature/benchmark-for-filtering-load branch from 2d7440e to fb8f345 Compare October 28, 2022 01:23
@heemin32 heemin32 removed the 2.4.0 label Nov 2, 2022
@martin-gaievski martin-gaievski force-pushed the feature/benchmark-for-filtering-load branch 2 times, most recently from 9778d79 to 3d1d915 Compare November 4, 2022 00:31
Signed-off-by: Martin Gaievski <[email protected]>
@martin-gaievski martin-gaievski force-pushed the feature/benchmark-for-filtering-load branch from 3d1d915 to bc09d85 Compare November 4, 2022 00:40
bulk_index(self.opensearch, self.index_name, body)


class IngestStepExtended(BaseIngestStep):
Member:

I think a better name would be IngestMultiFieldStep. I don't think Extended is intuitive.

martin-gaievski (Member, Author):

Agreed, Extended isn't very intuitive, but I couldn't figure out a better name. IngestMultiFieldStep sounds reasonable.
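A minimal sketch of what the multi-field ingestion could do: interleaving bulk action lines with documents that carry both the vector and the attribute fields. The function, index, and field names below are hypothetical, not the PR's actual code:

```python
def build_bulk_body(index_name, vector_field, vectors,
                    attribute_names, attribute_rows):
    """Build a bulk body: one action line plus one document per vector."""
    body = []
    for doc_id, vec in enumerate(vectors):
        body.append({"index": {"_index": index_name, "_id": doc_id}})
        doc = {vector_field: vec}
        # Attach the attribute columns for this row under their field names.
        doc.update(zip(attribute_names, attribute_rows[doc_id]))
        body.append(doc)
    return body

body = build_bulk_body(
    "target_index", "target_field",
    [[0.1, 0.2], [0.3, 0.4]],
    ["color", "age"],
    [["red", 7], ["blue", 42]],
)
```

The resulting list alternates action metadata and document source, which is the shape a bulk_index helper like the one in the snippet above would pass to OpenSearch.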

neighbors_dataset = parse_string_param('neighbors_dataset',
step_config.config, {}, None)

self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path,
Member:

Why is this CUSTOM and not NEIGHBORS?

martin-gaievski (Member, Author):

We have a dataset for each filter; instead of creating multiple files with the same dataset name, I use one file with a separate dataset per filter. That makes things a bit easier when there are many filters; for the Lucene benchmarking I used five.
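The layout being described, one file holding a separate ground-truth dataset per filter, can be pictured with an in-memory stand-in (the dataset names and values below are hypothetical; in the real tool these are named datasets inside one HDF5 file):

```python
# Stand-in for the HDF5 file: a mapping from dataset name to the
# filtered ground-truth neighbor ids, one dataset per filter.
ground_truth_file = {
    "neighbors_filter_1": [[4, 9, 2], [7, 1, 3]],
    "neighbors_filter_2": [[5, 0, 8], [6, 2, 4]],
}

def load_neighbors(datasets, filter_name):
    """Pick the ground-truth dataset matching the filter under test."""
    key = f"neighbors_{filter_name}"
    if key not in datasets:
        raise KeyError(f"no ground truth for {filter_name}")
    return datasets[key]

truth = load_neighbors(ground_truth_file, "filter_1")
```

Keeping every filter's ground truth in one file means a test configuration only has to name the dataset, rather than point at a different file per filter.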

dataset_format: hdf5
dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
attributes_dataset_name: attributes
attribute_spec: [ { id: 0, name: 'color', type: 'str' }, { id: 1, name: 'taste', type: 'str' }, { id: 2, name: 'age', type: 'int' } ]
Member:

Why is id needed here? Shouldn't all names be unique?

martin-gaievski (Member, Author):

It's more for ordering: when we generate the dataset with additional fields, those are written as a table, so in order to map a column from the dataset to a schema field we use ids. For instance, given the data set row:

2 | 32 | red

and the schema {{id: 0, name: age}, {id: 2, name: color}, {id: 1, name: weight}}

we can map age -> 2, color -> red, weight -> 32.
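The id-based mapping in that example can be sketched in a few lines (a toy illustration, not the PR's code):

```python
def map_row_to_fields(row, schema):
    """Map table columns to schema fields: ids give each field its column index."""
    ordered = sorted(schema, key=lambda field: field["id"])
    return {field["name"]: row[i] for i, field in enumerate(ordered)}

row = [2, 32, "red"]
schema = [{"id": 0, "name": "age"},
          {"id": 2, "name": "color"},
          {"id": 1, "name": "weight"}]
mapped = map_row_to_fields(row, schema)
# mapped == {"age": 2, "weight": 32, "color": "red"}
```

Sorting the schema by id recovers the column order regardless of how the spec entries are listed, which is the role the ids play here.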

@jmazanec15 (Member), Nov 10, 2022:
Right, but a list is being passed in. Why can't we keep that order for the reference?

Martin: I see, that makes sense. We can use the fields' order as the sequence.

Signed-off-by: Martin Gaievski <[email protected]>
@vamshin left a comment:

LGTM! Thanks

@martin-gaievski martin-gaievski merged commit 79ae6c2 into main Nov 14, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Nov 14, 2022
* Adding workflow for benchmarking queries with filters

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 79ae6c2)
martin-gaievski added a commit that referenced this pull request Nov 15, 2022
* Adding workflow for benchmarking queries with filters

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 79ae6c2)

Co-authored-by: Martin Gaievski <[email protected]>
Labels
backport 2.x Enhancements Increases software capabilities beyond original client specifications

5 participants