Added support for Efficient Pre-filtering for Faiss Engine.

The changes include:
* Enabled the efficient filtering support for Faiss Engine (#907)
* Fixed the ef_search default value for faiss HNSW with filters and updated the perf-tool to include Faiss HNSW tests (#926)
* Added exact search for cases when filteredIds < k to improve the recall for exact search (#928)
* Improved Exact Search to return only K results and added client side latency metric for query Benchmarks (#933)
* Added Integration Tests and Unit test for Efficient Filtering for Faiss Engine (#934)

Signed-off-by: Navneet Verma <[email protected]>
opensearch-project · Jun 14, 2023 · f6418b1
1 parent 556bb1a
commit f6418b1
Showing 45 changed files with 1,146 additions and 123 deletions.
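
The commit message mentions falling back to exact search when the number of filtered IDs is smaller than k (#928) and returning only K results (#933). The following is a minimal Python sketch of that idea, not the plugin's actual implementation. The names `should_use_exact_search` and `exact_search` are hypothetical, and L2 distance is assumed only because the benchmark index specs in this commit use `space_type: l2`.

```python
import heapq
import math
from typing import Dict, Iterable, List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def should_use_exact_search(num_filtered_ids: int, k: int) -> bool:
    """Hypothetical rule taken from the commit message: filteredIds < k."""
    return num_filtered_ids < k


def exact_search(
    query: Sequence[float],
    vectors: Dict[int, Sequence[float]],
    filtered_ids: Iterable[int],
    k: int,
) -> List[Tuple[int, float]]:
    """Score every filtered document and keep at most k nearest ones."""
    scored = ((doc_id, l2_distance(query, vectors[doc_id])) for doc_id in filtered_ids)
    # nsmallest returns results ordered from nearest to farthest.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Toy data: three stored vectors, two of which pass the filter.
vectors = {1: [0.0, 0.0], 2: [3.0, 4.0], 3: [1.0, 0.0]}
filtered = [2, 3]
if should_use_exact_search(len(filtered), k=5):
    hits = exact_search([0.0, 0.0], vectors, filtered, k=5)
```

With so few candidates, scoring every filtered document is cheap and cannot miss a neighbor, which is presumably why the commit uses it to improve recall.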
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -15,9 +15,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 
 ## [Unreleased 2.x](https://github.com/opensearch-project/k-NN/compare/2.8...2.x)
 ### Features
+* Added efficient filtering support for Faiss Engine ([#936](https://github.com/opensearch-project/k-NN/pull/936))
 ### Enhancements
 ### Bug Fixes
 ### Infrastructure
 ### Documentation
 ### Maintenance
-### Refactoring
+### Refactoring
diff --git a/DEVELOPER_GUIDE.md b/DEVELOPER_GUIDE.md
@@ -56,11 +56,11 @@ In addition to this, the plugin has been tested with JDK 17, and this JDK versio
 
 #### CMake
 
-The plugin requires that cmake >= 3.17.2 is installed in order to build the JNI libraries.
+The plugin requires that cmake >= 3.23.1 is installed in order to build the JNI libraries.
 
 One easy way to install on mac or linux is to use pip:
 ```bash
-pip install cmake==3.17.2
+pip install cmake==3.23.1
 ```
 
 #### Faiss Dependencies

diff --git a/benchmarks/perf-tool/README.md b/benchmarks/perf-tool/README.md
@@ -13,18 +13,36 @@ file.
 
 ## Install Prerequisites
 
-### Python
+### Setup
 
-Python 3.7 or above is required.
+K-NN perf requires Python 3.8 or greater to be installed. One of 
+the easier ways to do this is through Conda, a package and environment 
+management system for Python.
 
-### Pip
+First, follow the 
+[installation instructions](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) 
+to install Conda on your system.
 
-Use pip to install the necessary requirements:
+Next, create a Python 3.8 environment:
+```
+conda create -n knn-perf python=3.8
+```
+
+After the environment is created, activate it:
+```
+source activate knn-perf
+```
 
+Lastly, clone the k-NN repo and install all required python packages:
 ```
+git clone https://github.com/opensearch-project/k-NN.git
+cd k-NN/benchmarks/perf-tool
 pip install -r requirements.txt
 ```
 
+After all of this completes, you should be ready to run your first performance benchmarks!
+
+
 ## Usage
 
 ### Quick Start
@@ -72,16 +90,17 @@ The output will be the delta between the two metrics.
 
 ### Test Parameters
 
-| Parameter Name | Description | Default |  
-| ----------- | ----------- | ----------- |
-| endpoint | Endpoint OpenSearch cluster is running on | localhost |
-| test_name | Name of test | No default |
-| test_id | String ID of test | No default |
-| num_runs | Number of runs to execute steps | 1 |
-| show_runs | Whether to output each run in addition to the total summary | false |
-| setup | List of steps to run once before metric collection starts | [] |
-| steps | List of steps that make up one test run. Metrics will be collected on these steps. | No default |
-| cleanup | List of steps to run after each test run | [] |
+| Parameter Name | Description                                                                        | Default    |  
+|----------------|------------------------------------------------------------------------------------|------------|
+| endpoint       | Endpoint OpenSearch cluster is running on                                          | localhost  |
+| port           | Port the OpenSearch cluster is running on                                          | 9200       |
+| test_name      | Name of test                                                                       | No default |
+| test_id        | String ID of test                                                                  | No default |
+| num_runs       | Number of runs to execute steps                                                    | 1          |
+| show_runs      | Whether to output each run in addition to the total summary                        | false      |
+| setup          | List of steps to run once before metric collection starts                          | []         |
+| steps          | List of steps that make up one test run. Metrics will be collected on these steps. | No default |
+| cleanup        | List of steps to run after each test run                                           | []         |
 
 ### Steps
 

diff --git a/benchmarks/perf-tool/okpt/io/config/parsers/test.py b/benchmarks/perf-tool/okpt/io/config/parsers/test.py
@@ -23,6 +23,7 @@ class TestConfig:
     test_name: str
     test_id: str
     endpoint: str
+    port: int
     num_runs: int
     show_runs: bool
     setup: List[Step]
@@ -48,6 +49,9 @@ def parse(self, file_obj: TextIOWrapper) -> TestConfig:
         if 'endpoint' in config_obj:
             implicit_step_config['endpoint'] = config_obj['endpoint']
 
+        if 'port' in config_obj:
+            implicit_step_config['port'] = config_obj['port']
+
         # Each step should have its own parse - take the config object and check if its valid
         setup = []
         if 'setup' in config_obj:
@@ -62,6 +66,7 @@ def parse(self, file_obj: TextIOWrapper) -> TestConfig:
 
         test_config = TestConfig(
             endpoint=config_obj['endpoint'],
+            port=config_obj['port'],
             test_name=config_obj['test_name'],
             test_id=config_obj['test_id'],
             num_runs=config_obj['num_runs'],

diff --git a/benchmarks/perf-tool/okpt/io/config/schemas/test.yml b/benchmarks/perf-tool/okpt/io/config/schemas/test.yml
@@ -9,6 +9,9 @@
 endpoint:
   type: string
   default: "localhost"
+port:
+  type: integer
+  default: 9200
 test_name:
   type: string
 test_id:
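
The schema above gives `endpoint` a default of `"localhost"` and `port` a default of `9200`. As a rough illustration of how those defaults resolve when a test file omits either key, here is a small sketch. `resolve_connection` is a hypothetical helper, not part of the `okpt` package, and it does not reproduce the tool's own schema validation.

```python
from typing import Any, Dict, Tuple

# Defaults copied from the schema in this diff.
DEFAULT_ENDPOINT = "localhost"
DEFAULT_PORT = 9200


def resolve_connection(config_obj: Dict[str, Any]) -> Tuple[str, int]:
    """Return (endpoint, port), falling back to the schema defaults."""
    endpoint = config_obj.get("endpoint", DEFAULT_ENDPOINT)
    port = int(config_obj.get("port", DEFAULT_PORT))
    return endpoint, port


# A config that sets neither key resolves to the defaults.
local = resolve_connection({})
# A config that sets both keys keeps its own values.
remote = resolve_connection({"endpoint": "my-cluster.example", "port": 80})
```

Making the port explicit replaces the earlier heuristic in `steps.py`, which assumed 9200 for `localhost` and 80 for anything else.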

diff --git a/benchmarks/perf-tool/okpt/test/steps/steps.py b/benchmarks/perf-tool/okpt/test/steps/steps.py
@@ -5,7 +5,7 @@
 # compatible open source license.
 """Provides steps for OpenSearch tests.
 
-Some of the OpenSearch operations return a `took` field in the response body,
+Some OpenSearch operations return a `took` field in the response body,
 so the profiling decorators aren't needed for some functions.
 """
 import json
@@ -454,8 +454,10 @@ def _action(self):
         results['took'] = [
             float(query_response['took']) for query_response in query_responses
         ]
-        port = 9200 if self.endpoint == 'localhost' else 80
-        results['memory_kb'] = get_cache_size_in_kb(self.endpoint, port)
+        results['client_time'] = [
+            float(query_response['client_time']) for query_response in query_responses
+        ]
+        results['memory_kb'] = get_cache_size_in_kb(self.endpoint, self.port)
 
         if self.calculate_recall:
             ids = [[int(hit['_id'])
@@ -473,7 +475,7 @@ def _action(self):
         return results
 
     def _get_measures(self) -> List[str]:
-        measures = ['took', 'memory_kb']
+        measures = ['took', 'memory_kb', 'client_time']
 
         if self.calculate_recall:
             measures.extend(['recall@K', f'recall@{str(self.r)}'])
@@ -614,7 +616,6 @@ def _action(self):
         num_of_search_segments = 0;
         for shard_key in shards.keys():
             for segment in shards[shard_key]:
-
                 num_of_committed_segments += segment["num_committed_segments"]
                 num_of_search_segments += segment["num_search_segments"]
 
@@ -689,12 +690,13 @@ def delete_model(endpoint, port, model_id):
     return response.json()
 
 
-def get_opensearch_client(endpoint: str, port: int):
+def get_opensearch_client(endpoint: str, port: int, timeout=60):
     """
     Get an opensearch client from an endpoint and port
     Args:
         endpoint: Endpoint OpenSearch is running on
         port: Port OpenSearch is running on
+        timeout: timeout for OpenSearch client, default value 60
     Returns:
         OpenSearch client
 
@@ -708,7 +710,7 @@ def get_opensearch_client(endpoint: str, port: int):
         use_ssl=False,
         verify_certs=False,
         connection_class=RequestsHttpConnection,
-        timeout=60,
+        timeout=timeout,
     )
 
 
@@ -784,9 +786,13 @@ def get_cache_size_in_kb(endpoint, port):
 
 def query_index(opensearch: OpenSearch, index_name: str, body: dict,
                 excluded_fields: list):
-    return opensearch.search(index=index_name,
+    start_time = round(time.time()*1000)
+    queryResponse = opensearch.search(index=index_name,
                              body=body,
                              _source_excludes=excluded_fields)
+    end_time = round(time.time() * 1000)
+    queryResponse['client_time'] = end_time - start_time
+    return queryResponse
 
 
 def bulk_index(opensearch: OpenSearch, index_name: str, body: List):
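
The `query_index` change above records client-side latency by wrapping the search call in two wall-clock readings. A variation on the same idea is sketched below using `time.perf_counter`, a monotonic clock that is not affected by system clock adjustments. `timed_search` and the stand-in search callable are hypothetical and only illustrate the pattern; they are not part of the perf-tool.

```python
import time
from typing import Any, Callable, Dict


def timed_search(search_fn: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Call search_fn and attach the elapsed client time in milliseconds."""
    start = time.perf_counter()
    response = search_fn()
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Same key the diff uses, so downstream aggregation would not change.
    response["client_time"] = elapsed_ms
    return response


# Stand-in for opensearch.search(...): returns a dict with a server-side 'took'.
def fake_search() -> Dict[str, Any]:
    return {"took": 3, "hits": {"hits": []}}


response = timed_search(fake_search)
```

Unlike `took`, which the server reports, `client_time` also includes network and serialization overhead, so the two measures are expected to differ.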

diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/index.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/index.json
@@ -0,0 +1,26 @@
+{
+  "settings": {
+    "index": {
+      "knn": true,
+      "number_of_shards": 24,
+      "number_of_replicas": 1
+    }
+  },
+  "mappings": {
+    "properties": {
+      "target_field": {
+        "type": "knn_vector",
+        "dimension": 128,
+        "method": {
+          "name": "hnsw",
+          "space_type": "l2",
+          "engine": "faiss",
+          "parameters": {
+            "ef_construction": 256,
+            "m": 16
+          }
+        }
+      }
+    }
+  }
+}
diff --git a/...ks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json b/...ks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json
@@ -0,0 +1,42 @@
+{
+    "bool":
+    {
+        "should":
+        [
+            {
+                "range":
+                {
+                    "age":
+                    {
+                        "gte": 30,
+                        "lte": 70
+                    }
+                }
+            },
+            {
+                "term":
+                {
+                    "color": "green"
+                }
+            },
+            {
+                "term":
+                {
+                    "color": "blue"
+                }
+            },
+            {
+                "term":
+                {
+                    "color": "yellow"
+                }
+            },
+            {
+                "term":
+                {
+                    "color": "sweet"
+                }
+            }
+        ]
+    }
+}
diff --git a/...rks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml b/...rks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml
@@ -0,0 +1,34 @@
+endpoint: [ENDPOINT]
+test_name: "Faiss HNSW Relaxed Filter Test"
+test_id: "Faiss HNSW Relaxed Filter Test"
+num_runs: 10
+show_runs: false
+steps:
+  - name: delete_index
+    index_name: target_index
+  - name: create_index
+    index_name: target_index
+    index_spec: [INDEX_SPEC_PATH]/relaxed-filter/index.json
+  - name: ingest_multi_field
+    index_name: target_index
+    field_name: target_field
+    bulk_size: 500
+    dataset_format: hdf5
+    dataset_path: [DATASET_PATH]/sift-128-euclidean-with-attr.hdf5
+    attributes_dataset_name: attributes
+    attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
+  - name: refresh_index
+    index_name: target_index
+  - name: query_with_filter
+    k: 100
+    r: 1
+    calculate_recall: true
+    index_name: target_index
+    field_name: target_field
+    dataset_format: hdf5
+    dataset_path: [DATASET_PATH]/sift-128-euclidean-with-attr.hdf5
+    neighbors_format: hdf5
+    neighbors_path: [DATASET_PATH]/sift-128-euclidean-with-filters.hdf5
+    neighbors_dataset: neighbors_filter_5
+    filter_spec: [INDEX_SPEC_PATH]/relaxed-filter-spec.json
+    filter_type: FILTER
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/index.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/index.json
@@ -0,0 +1,26 @@
+{
+  "settings": {
+    "index": {
+      "knn": true,
+      "number_of_shards": 24,
+      "number_of_replicas": 1
+    }
+  },
+  "mappings": {
+    "properties": {
+      "target_field": {
+        "type": "knn_vector",
+        "dimension": 128,
+        "method": {
+          "name": "hnsw",
+          "space_type": "l2",
+          "engine": "faiss",
+          "parameters": {
+            "ef_construction": 256,
+            "m": 16
+          }
+        }
+      }
+    }
+  }
+}
diff --git a/...tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json b/...tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json
@@ -0,0 +1,44 @@
+{
+    "bool":
+    {
+        "must":
+        [
+            {
+                "range":
+                {
+                    "age":
+                    {
+                        "gte": 30,
+                        "lte": 60
+                    }
+                }
+            },
+            {
+                "term":
+                {
+                    "taste": "bitter"
+                }
+            },
+            {
+                "bool":
+                {
+                    "should":
+                    [
+                        {
+                            "term":
+                            {
+                                "color": "blue"
+                            }
+                        },
+                        {
+                            "term":
+                            {
+                                "color": "green"
+                            }
+                        }
+                    ]
+                }
+            }
+        ]
+    }
+}
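
For context, a filter spec like the one above is intended to be applied during the k-NN search itself rather than after it. The sketch below shows how such a spec could be embedded in a k-NN query body with a `filter` clause inside the `knn` block. `build_filtered_knn_query` is a hypothetical helper; how the perf-tool's `query_with_filter` step builds its request is not shown in this diff, so treat this as an assumption about the request shape.

```python
from typing import Any, Dict, List


def build_filtered_knn_query(
    field_name: str,
    vector: List[float],
    k: int,
    filter_spec: Dict[str, Any],
) -> Dict[str, Any]:
    """Build a k-NN search body with the filter nested inside the knn clause."""
    return {
        "size": k,
        "query": {
            "knn": {
                field_name: {
                    "vector": vector,
                    "k": k,
                    "filter": filter_spec,
                }
            }
        },
    }


# A shortened version of the restrictive filter from this commit.
restrictive_filter = {
    "bool": {
        "must": [
            {"range": {"age": {"gte": 30, "lte": 60}}},
            {"term": {"taste": "bitter"}},
        ]
    }
}

body = build_filtered_knn_query("target_field", [0.0] * 128, 100, restrictive_filter)
```

The resulting `body` could then be passed to a search call such as the `query_index` helper in `steps.py`.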