Added support for Efficient Pre-filtering for Faiss Engine. The changes include

 * Enabled the efficient filtering support for Faiss Engine (#907)
 * Fixed the ef_search default value for faiss HNSW with filters and updated the perf-tool to include Faiss HNSW tests (#926)
 * Added exact search for cases when filteredIds < k to improve the recall for exact search (#928)
 * Improved Exact Search to return only K results and added client side latency metric for query Benchmarks (#933)
 * Added Integration Tests and Unit test for Efficient Filtering for Faiss Engine (#934)
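
The exact-search fallback from #928 and the "return only K results" behavior from #933 can be illustrated with a minimal sketch. This is not the plugin's implementation: the function names, the in-memory `vectors` dictionary, and the squared-L2 scoring are all assumptions made for illustration, and the threshold follows the wording of the commit message (`filteredIds < k`).

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def should_use_exact_search(filtered_id_count: int, k: int) -> bool:
    # Assumption: mirrors the commit message, "filteredIds < k".
    return filtered_id_count < k


def exact_search(query: Sequence[float],
                 vectors: Dict[int, Sequence[float]],
                 filtered_ids: List[int],
                 k: int) -> List[Tuple[int, float]]:
    """Score every filtered document and keep at most k nearest ones."""
    scored = ((doc_id, l2_squared(query, vectors[doc_id]))
              for doc_id in filtered_ids)
    # nsmallest caps the result at k entries, closest first.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical data: three documents, two of which pass the filter.
vectors = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [5.0, 5.0]}
filtered = [1, 3]
if should_use_exact_search(len(filtered), k=5):
    hits = exact_search([0.0, 0.0], vectors, filtered, k=5)
```

When the filter leaves fewer candidates than `k`, scoring all of them directly is cheap and cannot miss a neighbor, which is the recall improvement the commit message describes.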

Signed-off-by: Navneet Verma <[email protected]>
navneet1v committed Jun 14, 2023
1 parent 556bb1a commit 0aaad0f
Showing 45 changed files with 1,160 additions and 117 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -15,9 +15,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

## [Unreleased 2.x](https://github.com/opensearch-project/k-NN/compare/2.8...2.x)
### Features
* Added efficient filtering support for Faiss Engine ([#936](https://github.com/opensearch-project/k-NN/pull/936))
### Enhancements
### Bug Fixes
### Infrastructure
### Documentation
### Maintenance
### Refactoring
### Refactoring
4 changes: 2 additions & 2 deletions DEVELOPER_GUIDE.md
@@ -56,11 +56,11 @@ In addition to this, the plugin has been tested with JDK 17, and this JDK versio

#### CMake

The plugin requires that cmake >= 3.17.2 is installed in order to build the JNI libraries.
The plugin requires that cmake >= 3.23.1 is installed in order to build the JNI libraries.

One easy way to install on mac or linux is to use pip:
```bash
pip install cmake==3.17.2
pip install cmake==3.23.1
```

#### Faiss Dependencies
47 changes: 33 additions & 14 deletions benchmarks/perf-tool/README.md
@@ -13,18 +13,36 @@ file.

## Install Prerequisites

### Python
### Setup

Python 3.7 or above is required.
K-NN perf requires Python 3.8 or greater to be installed. One of
the easier ways to do this is through Conda, a package and environment
management system for Python.

### Pip
First, follow the
[installation instructions](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)
to install Conda on your system.

Use pip to install the necessary requirements:
Next, create a Python 3.8 environment:
```
conda create -n knn-perf python=3.8
```

After the environment is created, activate it:
```
source activate knn-perf
```

Lastly, clone the k-NN repo and install all required python packages:
```
git clone https://github.com/opensearch-project/k-NN.git
cd k-NN/benchmarks/perf-tool
pip install -r requirements.txt
```

After all of this completes, you should be ready to run your first performance benchmarks!


## Usage

### Quick Start
@@ -72,16 +90,17 @@ The output will be the delta between the two metrics.

### Test Parameters

| Parameter Name | Description | Default |
| ----------- | ----------- | ----------- |
| endpoint | Endpoint OpenSearch cluster is running on | localhost |
| test_name | Name of test | No default |
| test_id | String ID of test | No default |
| num_runs | Number of runs to execute steps | 1 |
| show_runs | Whether to output each run in addition to the total summary | false |
| setup | List of steps to run once before metric collection starts | [] |
| steps | List of steps that make up one test run. Metrics will be collected on these steps. | No default |
| cleanup | List of steps to run after each test run | [] |
| Parameter Name | Description | Default |
|----------------|------------------------------------------------------------------------------------|------------|
| endpoint | Endpoint OpenSearch cluster is running on | localhost |
| port | Port OpenSearch cluster is running on | 9200 |
| test_name | Name of test | No default |
| test_id | String ID of test | No default |
| num_runs | Number of runs to execute steps | 1 |
| show_runs | Whether to output each run in addition to the total summary | false |
| setup | List of steps to run once before metric collection starts | [] |
| steps | List of steps that make up one test run. Metrics will be collected on these steps. | No default |
| cleanup | List of steps to run after each test run | [] |
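
To make the defaults in the table concrete, here is a small sketch of how a test configuration could be resolved against them. It is an illustration only: the perf-tool applies defaults through its schema file (`schemas/test.yml`), not through this code, and the names `DEFAULTS`, `REQUIRED_KEYS`, and `resolve_test_config` are hypothetical. Treating the "No default" parameters as required is also an assumption.

```python
import copy
from typing import Any, Dict

# Defaults copied from the parameter table.
DEFAULTS: Dict[str, Any] = {
    "endpoint": "localhost",
    "port": 9200,
    "num_runs": 1,
    "show_runs": False,
    "setup": [],
    "cleanup": [],
}

# Assumption: parameters listed with "No default" must be supplied.
REQUIRED_KEYS = ("test_name", "test_id", "steps")


def resolve_test_config(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a config with table defaults filled in for missing keys."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"missing required test parameters: {missing}")
    # Deep copy so the list defaults are never shared between configs.
    resolved = copy.deepcopy(DEFAULTS)
    resolved.update(raw)
    return resolved


config = resolve_test_config(
    {"test_name": "demo", "test_id": "demo-1", "steps": []}
)
```

With no `endpoint` or `port` given, the resolved config points at `localhost:9200`; supplying `port: 80` in the test file overrides the default.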

### Steps

5 changes: 5 additions & 0 deletions benchmarks/perf-tool/okpt/io/config/parsers/test.py
@@ -23,6 +23,7 @@ class TestConfig:
test_name: str
test_id: str
endpoint: str
port: int
num_runs: int
show_runs: bool
setup: List[Step]
@@ -48,6 +49,9 @@ def parse(self, file_obj: TextIOWrapper) -> TestConfig:
if 'endpoint' in config_obj:
implicit_step_config['endpoint'] = config_obj['endpoint']

if 'port' in config_obj:
implicit_step_config['port'] = config_obj['port']

# Each step should have its own parse - take the config object and check if its valid
setup = []
if 'setup' in config_obj:
@@ -62,6 +66,7 @@ def parse(self, file_obj: TextIOWrapper) -> TestConfig:

test_config = TestConfig(
endpoint=config_obj['endpoint'],
port=config_obj['port'],
test_name=config_obj['test_name'],
test_id=config_obj['test_id'],
num_runs=config_obj['num_runs'],
3 changes: 3 additions & 0 deletions benchmarks/perf-tool/okpt/io/config/schemas/test.yml
@@ -9,6 +9,9 @@
endpoint:
type: string
default: "localhost"
port:
type: integer
default: 9200
test_name:
type: string
test_id:
22 changes: 14 additions & 8 deletions benchmarks/perf-tool/okpt/test/steps/steps.py
@@ -5,7 +5,7 @@
# compatible open source license.
"""Provides steps for OpenSearch tests.
Some of the OpenSearch operations return a `took` field in the response body,
Some OpenSearch operations return a `took` field in the response body,
so the profiling decorators aren't needed for some functions.
"""
import json
@@ -454,8 +454,10 @@ def _action(self):
results['took'] = [
float(query_response['took']) for query_response in query_responses
]
port = 9200 if self.endpoint == 'localhost' else 80
results['memory_kb'] = get_cache_size_in_kb(self.endpoint, port)
results['client_time'] = [
float(query_response['client_time']) for query_response in query_responses
]
results['memory_kb'] = get_cache_size_in_kb(self.endpoint, self.port)

if self.calculate_recall:
ids = [[int(hit['_id'])
@@ -473,7 +475,7 @@ def _action(self):
return results

def _get_measures(self) -> List[str]:
measures = ['took', 'memory_kb']
measures = ['took', 'memory_kb', 'client_time']

if self.calculate_recall:
measures.extend(['recall@K', f'recall@{str(self.r)}'])
@@ -614,7 +616,6 @@ def _action(self):
num_of_search_segments = 0;
for shard_key in shards.keys():
for segment in shards[shard_key]:

num_of_committed_segments += segment["num_committed_segments"]
num_of_search_segments += segment["num_search_segments"]

@@ -689,12 +690,13 @@ def delete_model(endpoint, port, model_id):
return response.json()


def get_opensearch_client(endpoint: str, port: int):
def get_opensearch_client(endpoint: str, port: int, timeout=60):
"""
Get an opensearch client from an endpoint and port
Args:
endpoint: Endpoint OpenSearch is running on
port: Port OpenSearch is running on
timeout: timeout for OpenSearch client, default value 60
Returns:
OpenSearch client
@@ -708,7 +710,7 @@ def get_opensearch_client(endpoint: str, port: int):
use_ssl=False,
verify_certs=False,
connection_class=RequestsHttpConnection,
timeout=60,
timeout=timeout,
)


@@ -784,9 +786,13 @@ def get_cache_size_in_kb(endpoint, port):

def query_index(opensearch: OpenSearch, index_name: str, body: dict,
excluded_fields: list):
return opensearch.search(index=index_name,
start_time = round(time.time()*1000)
queryResponse = opensearch.search(index=index_name,
body=body,
_source_excludes=excluded_fields)
end_time = round(time.time() * 1000)
queryResponse['client_time'] = end_time - start_time
return queryResponse


def bulk_index(opensearch: OpenSearch, index_name: str, body: List):
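
The `client_time` measurement added to `query_index` in the hunk above can be sketched in isolation. This is a variant, not the committed code: the helper name `timed_call` is hypothetical, and it swaps `time.time()` for `time.perf_counter()`, which is monotonic and therefore not affected by wall-clock adjustments during a benchmark run.

```python
import time
from typing import Any, Callable, Dict


def timed_call(search_fn: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Run search_fn and attach the client-side latency in milliseconds."""
    start = time.perf_counter()
    response = search_fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Same key the commit adds next to the server-reported 'took'.
    response['client_time'] = elapsed_ms
    return response


# Stand-in for opensearch.search(...); a real client call would go here.
fake_response = timed_call(lambda: {'took': 3, 'hits': {'hits': []}})
```

Comparing `client_time` with `took` for the same query gives a rough view of network and serialization overhead on top of the server-side search time.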
@@ -0,0 +1,26 @@
{
"settings": {
"index": {
"knn": true,
"number_of_shards": 24,
"number_of_replicas": 1
}
},
"mappings": {
"properties": {
"target_field": {
"type": "knn_vector",
"dimension": 128,
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "faiss",
"parameters": {
"ef_construction": 256,
"m": 16
}
}
}
}
}
}
@@ -0,0 +1,42 @@
{
"bool":
{
"should":
[
{
"range":
{
"age":
{
"gte": 30,
"lte": 70
}
}
},
{
"term":
{
"color": "green"
}
},
{
"term":
{
"color": "blue"
}
},
{
"term":
{
"color": "yellow"
}
},
{
"term":
{
"color": "sweet"
}
}
]
}
}
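
A filter spec like the one above is meant to be embedded in a k-NN query. The sketch below assumes the query shape documented for OpenSearch efficient filtering, where `filter` sits inside the `knn` clause next to `vector` and `k`. How the perf-tool's `query_with_filter` step assembles its request body is not shown in this diff, so the helper name `build_filtered_knn_query` and the shortened filter are illustrative only.

```python
import json
from typing import Any, Dict, List


def build_filtered_knn_query(field_name: str,
                             vector: List[float],
                             k: int,
                             filter_clause: Dict[str, Any]) -> Dict[str, Any]:
    """Build a search body with the filter nested inside the knn clause."""
    return {
        "size": k,
        "query": {
            "knn": {
                field_name: {
                    "vector": vector,
                    "k": k,
                    "filter": filter_clause,
                }
            }
        },
    }


# Shortened version of the relaxed filter: an age range OR a color match.
relaxed_filter = {
    "bool": {
        "should": [
            {"range": {"age": {"gte": 30, "lte": 70}}},
            {"term": {"color": "green"}},
        ]
    }
}

body = build_filtered_knn_query("target_field", [0.0] * 128, 100, relaxed_filter)
request_json = json.dumps(body)
```

The 128-dimensional zero vector and `k = 100` match the index mapping and test configuration in this commit; a real benchmark would take query vectors from the dataset file.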
@@ -0,0 +1,34 @@
endpoint: [ENDPOINT]
test_name: "Faiss HNSW Relaxed Filter Test"
test_id: "Faiss HNSW Relaxed Filter Test"
num_runs: 10
show_runs: false
steps:
- name: delete_index
index_name: target_index
- name: create_index
index_name: target_index
index_spec: [INDEX_SPEC_PATH]/relaxed-filter/index.json
- name: ingest_multi_field
index_name: target_index
field_name: target_field
bulk_size: 500
dataset_format: hdf5
dataset_path: [DATASET_PATH]/sift-128-euclidean-with-attr.hdf5
attributes_dataset_name: attributes
attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
- name: refresh_index
index_name: target_index
- name: query_with_filter
k: 100
r: 1
calculate_recall: true
index_name: target_index
field_name: target_field
dataset_format: hdf5
dataset_path: [DATASET_PATH]/sift-128-euclidean-with-attr.hdf5
neighbors_format: hdf5
neighbors_path: [DATASET_PATH]/sift-128-euclidean-with-filters.hdf5
neighbors_dataset: neighbors_filter_5
filter_spec: [INDEX_SPEC_PATH]/relaxed-filter-spec.json
filter_type: FILTER
@@ -0,0 +1,26 @@
{
"settings": {
"index": {
"knn": true,
"number_of_shards": 24,
"number_of_replicas": 1
}
},
"mappings": {
"properties": {
"target_field": {
"type": "knn_vector",
"dimension": 128,
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "faiss",
"parameters": {
"ef_construction": 256,
"m": 16
}
}
}
}
}
}
@@ -0,0 +1,44 @@
{
"bool":
{
"must":
[
{
"range":
{
"age":
{
"gte": 30,
"lte": 60
}
}
},
{
"term":
{
"taste": "bitter"
}
},
{
"bool":
{
"should":
[
{
"term":
{
"color": "blue"
}
},
{
"term":
{
"color": "green"
}
}
]
}
}
]
}
}