Adding benchmark workflow for queries with filters #598
Conversation
Codecov Report
@@ Coverage Diff @@
## main #598 +/- ##
============================================
+ Coverage 84.77% 84.82% +0.04%
- Complexity 1059 1072 +13
============================================
Files 149 149
Lines 4301 4361 +60
Branches 382 397 +15
============================================
+ Hits 3646 3699 +53
- Misses 480 485 +5
- Partials 175 177 +2
Script generates additional dataset of neighbours (ground truth) for each filter type.
Example of usage:
What differentiates a filter from an attribute?
In this context I use attribute for an additional field on each document that we index, and filter for the set of criteria that selects a subset of the main document set. So this script can work in two modes:
- It takes an existing set of vector data and adds fields of different types to each document. This is the context in which I use the term attribute.
- Based on predefined rules, it takes the dataset generated in step 1 and applies a filter to it, so the outcome is a set of new datasets with true neighbors that are both ordered by similarity and filtered. All of those new datasets are stored in a separate new file (a minimal sketch of this second mode follows below).
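A hedged sketch of that second mode, assuming euclidean distance; the file and dataset names and the helper are hypothetical, not the actual script:

```python
# Minimal sketch, not the actual script: compute filtered ground-truth
# neighbors and store one dataset per filter in a single new HDF5 file.
import h5py
import numpy as np

def filtered_ground_truth(train, test, mask, k=100):
    """Brute-force k nearest euclidean neighbors among docs passing the filter."""
    candidates = np.where(mask)[0]                            # doc ids that satisfy the filter
    dists = np.linalg.norm(test[:, None, :] - train[candidates][None, :, :], axis=2)
    order = np.argsort(dists, axis=1)[:, :k]                  # ordered by similarity
    return candidates[order]                                  # map back to original doc ids

with h5py.File("sift-128-euclidean-with-attr.hdf5", "r") as src, \
     h5py.File("sift-128-euclidean-with-filters.hdf5", "w") as dst:
    train, test = src["train"][:], src["test"][:]
    colors = src["attributes"][:, 0].astype(str)              # assumes column 0 is 'color'
    # One ground-truth dataset per filter, all stored in the same output file.
    for name, mask in {"filter_color_red": colors == "red"}.items():
        dst.create_dataset("neighbors_" + name,
                           data=filtered_ground_truth(train, test, mask))
```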
bulk_index(self.opensearch, self.index_name, body)

class IngestStepExtended(BaseIngestStep):
I think a better name would be IngestMultiFieldStep. I don't think "Extended" is intuitive.
Agree, "Extended" isn't very intuitive, but I couldn't figure out a better name. IngestMultiFieldStep sounds reasonable.
neighbors_dataset = parse_string_param('neighbors_dataset',
                                       step_config.config, {}, None)

self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path,
Why is this CUSTOM and not NEIGHBORS?
We have a dataset for each filter. Instead of creating multiple files with the same dataset name, I use one file with multiple datasets, one per filter. That makes things a bit easier when there are many filters; for the Lucene benchmarking I used 5.
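To illustrate the layout, a minimal sketch of reading such a file (dataset names here are illustrative, not the tool's actual naming scheme):

```python
import h5py

# One HDF5 file holding a separate ground-truth dataset per filter.
with h5py.File("sift-128-euclidean-with-filters.hdf5", "r") as f:
    for name in f.keys():              # e.g. neighbors_filter_1 ... neighbors_filter_5
        print(name, f[name].shape)     # each dataset: (num_queries, k)
```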
dataset_format: hdf5
dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
attributes_dataset_name: attributes
attribute_spec: [ { id: 0, name: 'color', type: 'str' }, { id: 1, name: 'taste', type: 'str' }, { id: 2, name: 'age', type: 'int' } ]
Why is the ID needed here? Shouldn't all names be unique?
It's more for ordering: when we generate the dataset with additional fields, those fields are written as a table, so to map a column from the dataset to a schema field we use the ids. For instance, given the data row:
2 | 32 | red
and the schema {{id: 0, name: age}, {id: 2, name: color}, {id: 1, name: weight}},
we can map age -> 2, color -> red, weight -> 32.
Right, but a list is being passed in. Why can't we keep that order for the reference?
Martin: I see, that makes sense. We can use the fields' order in the list as the sequence.
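A minimal sketch of what order-based mapping could look like (field names taken from the example above, row values hypothetical):

```python
# Map one row of the attributes table to schema fields by list position,
# making the explicit ids unnecessary.
attribute_spec = [
    {"name": "color", "type": "str"},
    {"name": "taste", "type": "str"},
    {"name": "age", "type": "int"},
]

row = ["red", "sweet", "32"]  # one row from the generated attributes table
doc = {spec["name"]: (int(value) if spec["type"] == "int" else value)
       for spec, value in zip(attribute_spec, row)}
# -> {'color': 'red', 'taste': 'sweet', 'age': 32}
```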
LGTM! Thanks
* Adding workflow for benchmarking queries with filters Signed-off-by: Martin Gaievski <[email protected]> (cherry picked from commit 79ae6c2) Co-authored-by: Martin Gaievski <[email protected]>
Signed-off-by: Martin Gaievski [email protected]
Description
Adds the ability to run benchmarks using k-NN queries with filters. There are two main parts in this change: a script that extends an existing vector dataset with attribute fields and generates filtered ground-truth neighbors for each filter type, and new workflow steps for ingesting the multi-field documents and running queries with filters.
The README has been updated along with five sample filter definitions and an example test configuration that uses queries with filters.
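For illustration, a hedged sketch of what one filter definition might look like, written as a Python dict in OpenSearch query DSL form (the field names are hypothetical; the actual sample filters ship with the README changes in this PR):

```python
# Illustrative filter body: restrict the k-NN search to red documents
# within a given age range. 'color' and 'age' are hypothetical attributes.
filter_spec = {
    "bool": {
        "must": [
            {"term": {"color": "red"}},
            {"range": {"age": {"gte": 20, "lte": 40}}},
        ]
    }
}
```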
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.