[Backport-2.x] Enabled the efficient filtering support for Faiss Engine. #960

Merged: 3 commits, Jul 10, 2023
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -14,9 +14,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

 ## [Unreleased 2.x](https://github.com/opensearch-project/k-NN/compare/2.8...2.x)
 ### Features
+* Added efficient filtering support for Faiss Engine ([#936](https://github.com/opensearch-project/k-NN/pull/936))
 ### Enhancements
 ### Bug Fixes
 ### Infrastructure
 ### Documentation
 ### Maintenance
-### Refactoring
+### Refactoring
4 changes: 2 additions & 2 deletions DEVELOPER_GUIDE.md
@@ -56,11 +56,11 @@ In addition to this, the plugin has been tested with JDK 17, and this JDK versio

 #### CMake

-The plugin requires that cmake >= 3.17.2 is installed in order to build the JNI libraries.
+The plugin requires that cmake >= 3.23.3 is installed in order to build the JNI libraries.

 One easy way to install on mac or linux is to use pip:
 ```bash
-pip install cmake==3.17.2
+pip install cmake==3.23.3
 ```

 #### Faiss Dependencies
47 changes: 33 additions & 14 deletions benchmarks/perf-tool/README.md
@@ -13,18 +13,36 @@ file.

## Install Prerequisites

### Python
### Setup

Python 3.7 or above is required.
K-NN perf requires Python 3.8 or greater to be installed. One of
the easier ways to do this is through Conda, a package and environment
management system for Python.

### Pip
First, follow the
[installation instructions](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)
to install Conda on your system.

Use pip to install the necessary requirements:
Next, create a Python 3.8 environment:
```
conda create -n knn-perf python=3.8
```

After the environment is created, activate it:
```
source activate knn-perf
```

Lastly, clone the k-NN repo and install all required python packages:
```
git clone https://github.com/opensearch-project/k-NN.git
cd k-NN/benchmarks/perf-tool
pip install -r requirements.txt
```

After all of this completes, you should be ready to run your first performance benchmarks!


## Usage

### Quick Start
@@ -72,16 +90,17 @@ The output will be the delta between the two metrics.

### Test Parameters

| Parameter Name | Description | Default |
| ----------- | ----------- | ----------- |
| endpoint | Endpoint OpenSearch cluster is running on | localhost |
| test_name | Name of test | No default |
| test_id | String ID of test | No default |
| num_runs | Number of runs to execute steps | 1 |
| show_runs | Whether to output each run in addition to the total summary | false |
| setup | List of steps to run once before metric collection starts | [] |
| steps | List of steps that make up one test run. Metrics will be collected on these steps. | No default |
| cleanup | List of steps to run after each test run | [] |
| Parameter Name | Description | Default |
|----------------|------------------------------------------------------------------------------------|------------|
| endpoint | Endpoint OpenSearch cluster is running on | localhost |
| port | Port the OpenSearch cluster is running on | 9200 |
| test_name | Name of test | No default |
| test_id | String ID of test | No default |
| num_runs | Number of runs to execute steps | 1 |
| show_runs | Whether to output each run in addition to the total summary | false |
| setup | List of steps to run once before metric collection starts | [] |
| steps | List of steps that make up one test run. Metrics will be collected on these steps. | No default |
| cleanup | List of steps to run after each test run | [] |
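
As a rough illustration of how the defaults in this table could be applied, the sketch below merges a user-supplied test config with the documented defaults. `TEST_DEFAULTS`, `REQUIRED`, and `apply_test_defaults` are hypothetical names used only for this example; the perf tool resolves defaults through its own schema files, so treat this as a sketch of the behavior rather than the tool's implementation.

```python
from typing import Any, Dict

# Defaults copied from the Test Parameters table. Parameters listed as
# "No default" (test_name, test_id, steps) must be supplied by the user.
TEST_DEFAULTS: Dict[str, Any] = {
    "endpoint": "localhost",
    "port": 9200,
    "num_runs": 1,
    "show_runs": False,
    "setup": [],
    "cleanup": [],
}
REQUIRED = ("test_name", "test_id", "steps")


def apply_test_defaults(config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of config with defaults filled in."""
    missing = [key for key in REQUIRED if key not in config]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    # Copy list defaults so callers never share the module-level lists.
    merged = {key: (list(value) if isinstance(value, list) else value)
              for key, value in TEST_DEFAULTS.items()}
    merged.update(config)
    return merged


config = apply_test_defaults({"test_name": "demo", "test_id": "demo-1", "steps": []})
print(config["endpoint"], config["port"])
```

Supplying `port` explicitly (for example `80` for a cluster behind a load balancer) overrides the default, which is the case the new parameter is meant to cover.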

### Steps

5 changes: 5 additions & 0 deletions benchmarks/perf-tool/okpt/io/config/parsers/test.py
@@ -23,6 +23,7 @@ class TestConfig:
     test_name: str
     test_id: str
     endpoint: str
+    port: int
     num_runs: int
     show_runs: bool
     setup: List[Step]
@@ -48,6 +49,9 @@ def parse(self, file_obj: TextIOWrapper) -> TestConfig:
         if 'endpoint' in config_obj:
             implicit_step_config['endpoint'] = config_obj['endpoint']

+        if 'port' in config_obj:
+            implicit_step_config['port'] = config_obj['port']
+
         # Each step should have its own parse - take the config object and check if its valid
         setup = []
         if 'setup' in config_obj:
@@ -62,6 +66,7 @@ def parse(self, file_obj: TextIOWrapper) -> TestConfig:

         test_config = TestConfig(
             endpoint=config_obj['endpoint'],
+            port=config_obj['port'],
             test_name=config_obj['test_name'],
             test_id=config_obj['test_id'],
             num_runs=config_obj['num_runs'],
3 changes: 3 additions & 0 deletions benchmarks/perf-tool/okpt/io/config/schemas/test.yml
@@ -9,6 +9,9 @@
 endpoint:
   type: string
   default: "localhost"
+port:
+  type: integer
+  default: 9200
 test_name:
   type: string
 test_id:
22 changes: 14 additions & 8 deletions benchmarks/perf-tool/okpt/test/steps/steps.py
@@ -5,7 +5,7 @@
 # compatible open source license.
 """Provides steps for OpenSearch tests.

-Some of the OpenSearch operations return a `took` field in the response body,
+Some OpenSearch operations return a `took` field in the response body,
 so the profiling decorators aren't needed for some functions.
 """
 import json
@@ -454,8 +454,10 @@ def _action(self):
         results['took'] = [
             float(query_response['took']) for query_response in query_responses
         ]
-        port = 9200 if self.endpoint == 'localhost' else 80
-        results['memory_kb'] = get_cache_size_in_kb(self.endpoint, port)
+        results['client_time'] = [
+            float(query_response['client_time']) for query_response in query_responses
+        ]
+        results['memory_kb'] = get_cache_size_in_kb(self.endpoint, self.port)

         if self.calculate_recall:
             ids = [[int(hit['_id'])
@@ -473,7 +475,7 @@ def _action(self):
         return results

     def _get_measures(self) -> List[str]:
-        measures = ['took', 'memory_kb']
+        measures = ['took', 'memory_kb', 'client_time']

         if self.calculate_recall:
             measures.extend(['recall@K', f'recall@{str(self.r)}'])
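
To make the relationship between the `took` and `client_time` measures in the hunks above concrete, here is a minimal sketch that gathers both from a list of search responses. `collect_latency_measures` and the derived `overhead` list are hypothetical and not part of the PR; the sketch only assumes that each response carries `took` (set by OpenSearch) and `client_time` (attached by the caller), both in milliseconds.

```python
from typing import Dict, List


def collect_latency_measures(query_responses: List[dict]) -> Dict[str, List[float]]:
    """Gather server-side and client-side latencies (ms) from search responses."""
    took = [float(response["took"]) for response in query_responses]
    client_time = [float(response["client_time"]) for response in query_responses]
    # Hypothetical extra measure: time spent outside the server-reported search,
    # e.g. network transfer and response deserialization.
    overhead = [client - server for client, server in zip(client_time, took)]
    return {"took": took, "client_time": client_time, "overhead": overhead}


responses = [{"took": 12, "client_time": 15}, {"took": 9, "client_time": 14}]
measures = collect_latency_measures(responses)
print(measures["overhead"])  # [3.0, 5.0]
```

Reporting both numbers side by side is useful because `took` excludes everything that happens between the client and the cluster.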
@@ -614,7 +616,6 @@ def _action(self):
         num_of_search_segments = 0;
         for shard_key in shards.keys():
             for segment in shards[shard_key]:
-
                 num_of_committed_segments += segment["num_committed_segments"]
                 num_of_search_segments += segment["num_search_segments"]

@@ -689,12 +690,13 @@ def delete_model(endpoint, port, model_id):
     return response.json()


-def get_opensearch_client(endpoint: str, port: int):
+def get_opensearch_client(endpoint: str, port: int, timeout=60):
     """
     Get an opensearch client from an endpoint and port
     Args:
         endpoint: Endpoint OpenSearch is running on
         port: Port OpenSearch is running on
+        timeout: timeout for OpenSearch client, default value 60
     Returns:
         OpenSearch client

@@ -708,7 +710,7 @@ def get_opensearch_client(endpoint: str, port: int):
         use_ssl=False,
         verify_certs=False,
         connection_class=RequestsHttpConnection,
-        timeout=60,
+        timeout=timeout,
     )


@@ -784,9 +786,13 @@ def get_cache_size_in_kb(endpoint, port):

 def query_index(opensearch: OpenSearch, index_name: str, body: dict,
                 excluded_fields: list):
-    return opensearch.search(index=index_name,
+    start_time = round(time.time()*1000)
+    queryResponse = opensearch.search(index=index_name,
                              body=body,
                              _source_excludes=excluded_fields)
+    end_time = round(time.time() * 1000)
+    queryResponse['client_time'] = end_time - start_time
+    return queryResponse


 def bulk_index(opensearch: OpenSearch, index_name: str, body: List):
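
The client-side timing added to `query_index` can be sketched in isolation as follows. This is not the PR's code: `timed_search` and `fake_search` are hypothetical, the fake stands in for a real client call so the example runs without a cluster, and the sketch uses `time.perf_counter()` (a monotonic clock) where the PR uses `time.time()`.

```python
import time
from typing import Any, Callable, Dict


def timed_search(search: Callable[..., Dict[str, Any]], **kwargs: Any) -> Dict[str, Any]:
    """Call `search` and attach the elapsed wall-clock time in ms as `client_time`."""
    start = time.perf_counter()
    response = search(**kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    response["client_time"] = round(elapsed_ms)
    return response


def fake_search(index: str, body: dict) -> Dict[str, Any]:
    """Stand-in for a real search call; sleeps briefly to simulate latency."""
    time.sleep(0.01)
    return {"took": 5, "hits": {"hits": []}}


result = timed_search(fake_search, index="target_index", body={"query": {"match_all": {}}})
print(result["took"], result["client_time"])
```

A monotonic clock avoids negative or inflated durations if the system clock is adjusted mid-request, which is the main reason to prefer it for interval measurements.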
@@ -0,0 +1,26 @@
{
    "settings": {
        "index": {
            "knn": true,
            "number_of_shards": 24,
            "number_of_replicas": 1
        }
    },
    "mappings": {
        "properties": {
            "target_field": {
                "type": "knn_vector",
                "dimension": 128,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "faiss",
                    "parameters": {
                        "ef_construction": 256,
                        "m": 16
                    }
                }
            }
        }
    }
}
@@ -0,0 +1,42 @@
{
    "bool":
    {
        "should":
        [
            {
                "range":
                {
                    "age":
                    {
                        "gte": 30,
                        "lte": 70
                    }
                }
            },
            {
                "term":
                {
                    "color": "green"
                }
            },
            {
                "term":
                {
                    "color": "blue"
                }
            },
            {
                "term":
                {
                    "color": "yellow"
                }
            },
            {
                "term":
                {
                    "color": "sweet"
                }
            }
        ]
    }
}
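
With efficient filtering, a filter such as the one above is placed inside the `knn` clause of the search request rather than applied after the search. The sketch below shows one way such a request body could be assembled. `build_filtered_knn_query` is a hypothetical helper, the filter is shortened to two clauses, and the field name, dimension, and `k` are taken from the configs in this PR; how the perf tool builds its request body may differ.

```python
import json
from typing import List


def build_filtered_knn_query(field: str, vector: List[float], k: int,
                             filter_spec: dict) -> dict:
    """Hypothetical helper: place a filter clause inside a k-NN query body."""
    return {
        "size": k,
        "query": {
            "knn": {
                field: {
                    "vector": vector,
                    "k": k,
                    "filter": filter_spec,
                }
            }
        },
    }


# Shortened version of the relaxed filter: age in [30, 70] OR color is green.
relaxed_filter = {
    "bool": {
        "should": [
            {"range": {"age": {"gte": 30, "lte": 70}}},
            {"term": {"color": "green"}},
        ]
    }
}

# A zero vector is used only as a placeholder for a real 128-dimensional query.
body = build_filtered_knn_query("target_field", [0.0] * 128, 100, relaxed_filter)
print(json.dumps(body["query"]["knn"]["target_field"]["filter"], indent=2))
```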
@@ -0,0 +1,34 @@
endpoint: [ENDPOINT]
test_name: "Faiss HNSW Relaxed Filter Test"
test_id: "Faiss HNSW Relaxed Filter Test"
num_runs: 10
show_runs: false
steps:
  - name: delete_index
    index_name: target_index
  - name: create_index
    index_name: target_index
    index_spec: [INDEX_SPEC_PATH]/relaxed-filter/index.json
  - name: ingest_multi_field
    index_name: target_index
    field_name: target_field
    bulk_size: 500
    dataset_format: hdf5
    dataset_path: [DATASET_PATH]/sift-128-euclidean-with-attr.hdf5
    attributes_dataset_name: attributes
    attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
  - name: refresh_index
    index_name: target_index
  - name: query_with_filter
    k: 100
    r: 1
    calculate_recall: true
    index_name: target_index
    field_name: target_field
    dataset_format: hdf5
    dataset_path: [DATASET_PATH]/sift-128-euclidean-with-attr.hdf5
    neighbors_format: hdf5
    neighbors_path: [DATASET_PATH]/sift-128-euclidean-with-filters.hdf5
    neighbors_dataset: neighbors_filter_5
    filter_spec: [INDEX_SPEC_PATH]/relaxed-filter-spec.json
    filter_type: FILTER
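
The `k`, `r`, and `calculate_recall` settings in the `query_with_filter` step drive the `recall@K` and `recall@{r}` measures. The sketch below shows one common way such a measure could be computed from returned ids and ground-truth neighbors. It is an assumption made for illustration, not necessarily the perf tool's implementation; in particular, treating `-1` as padding for queries with fewer than `k` matching neighbors is a convention of this sketch.

```python
from typing import Sequence


def recall_at_r(results: Sequence[Sequence[int]],
                true_neighbors: Sequence[Sequence[int]],
                r: int, k: int) -> float:
    """Fraction of the first r returned ids that appear in the true top-k set."""
    correct = 0
    total = 0
    for returned, truth in zip(results, true_neighbors):
        # Drop padding entries so short ground-truth lists are not penalized.
        truth_set = {neighbor for neighbor in truth[:k] if neighbor != -1}
        limit = min(r, len(truth_set))
        total += limit
        correct += sum(1 for doc_id in returned[:limit] if doc_id in truth_set)
    return correct / total if total else 0.0


returned_ids = [[1, 2, 3], [7, 9, 8]]
ground_truth = [[1, 2, 4], [7, 8, -1]]
print(recall_at_r(returned_ids, ground_truth, r=3, k=3))
print(recall_at_r(returned_ids, ground_truth, r=1, k=3))
```

With filtering, some queries can have fewer than `k` documents that pass the filter, which is why the denominator is capped by the size of the ground-truth set.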
@@ -0,0 +1,26 @@
{
    "settings": {
        "index": {
            "knn": true,
            "number_of_shards": 24,
            "number_of_replicas": 1
        }
    },
    "mappings": {
        "properties": {
            "target_field": {
                "type": "knn_vector",
                "dimension": 128,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "faiss",
                    "parameters": {
                        "ef_construction": 256,
                        "m": 16
                    }
                }
            }
        }
    }
}
@@ -0,0 +1,44 @@
{
    "bool":
    {
        "must":
        [
            {
                "range":
                {
                    "age":
                    {
                        "gte": 30,
                        "lte": 60
                    }
                }
            },
            {
                "term":
                {
                    "taste": "bitter"
                }
            },
            {
                "bool":
                {
                    "should":
                    [
                        {
                            "term":
                            {
                                "color": "blue"
                            }
                        },
                        {
                            "term":
                            {
                                "color": "green"
                            }
                        }
                    ]
                }
            }
        ]
    }
}
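
To see which documents a filter like this lets through, the sketch below evaluates a small subset of the query DSL (`bool` with `must`/`should`, `term`, and `range`) against plain dictionaries. `matches` is a hypothetical helper written for this example. It assumes exact-match `term` semantics and inclusive `range` bounds, and ignores analysis, scoring, and `minimum_should_match` overrides, so it approximates the filter's intent rather than reproducing OpenSearch behavior.

```python
from typing import Any, Dict


def matches(clause: Dict[str, Any], doc: Dict[str, Any]) -> bool:
    """Evaluate a simplified bool/term/range clause against a document."""
    if "term" in clause:
        field, value = next(iter(clause["term"].items()))
        return doc.get(field) == value
    if "range" in clause:
        field, bounds = next(iter(clause["range"].items()))
        value = doc.get(field)
        if value is None:
            return False
        return bounds.get("gte", value) <= value <= bounds.get("lte", value)
    if "bool" in clause:
        must = clause["bool"].get("must", [])
        should = clause["bool"].get("should", [])
        if not all(matches(sub, doc) for sub in must):
            return False
        # With no must clauses, at least one should clause has to match.
        if should and not must:
            return any(matches(sub, doc) for sub in should)
        return True
    raise ValueError(f"unsupported clause: {list(clause)}")


restrictive_filter = {
    "bool": {
        "must": [
            {"range": {"age": {"gte": 30, "lte": 60}}},
            {"term": {"taste": "bitter"}},
            {"bool": {"should": [
                {"term": {"color": "blue"}},
                {"term": {"color": "green"}},
            ]}},
        ]
    }
}

print(matches(restrictive_filter, {"age": 45, "taste": "bitter", "color": "green"}))  # True
print(matches(restrictive_filter, {"age": 45, "taste": "sweet", "color": "green"}))   # False
print(matches(restrictive_filter, {"age": 65, "taste": "bitter", "color": "blue"}))   # False
```

Every `must` clause has to hold, so this filter passes far fewer documents than a `should`-only filter over similar fields.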