Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Backport 2.x]Add parent join support for faiss hnsw (#1398) #1400

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 9 additions & 0 deletions .github/workflows/CI.yml
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,15 @@ jobs:
with:
submodules: true

# Git functionality in CMAKE file does not work with given ubuntu image. Therefore, handling it here.
- name: Apply Git Patch
# Deleting file at the end to skip `git apply` inside CMAKE file
run: |
cd jni/external/faiss
git apply --ignore-space-change --ignore-whitespace --3way ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
rm ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
working-directory: ${{ github.workspace }}

- name: Setup Java ${{ matrix.java }}
uses: actions/setup-java@v1
with:
Expand Down
9 changes: 9 additions & 0 deletions .github/workflows/test_security.yml
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,15 @@ jobs:
with:
submodules: true

# Git functionality in CMAKE file does not work with given ubuntu image. Therefore, handling it here.
- name: Apply Git Patch
# Deleting file at the end to skip `git apply` inside CMAKE file
run: |
cd jni/external/faiss
git apply --ignore-space-change --ignore-whitespace --3way ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
rm ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
working-directory: ${{ github.workspace }}

- name: Setup Java ${{ matrix.java }}
uses: actions/setup-java@v1
with:
Expand Down
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
## [Unreleased 2.x](https://github.com/opensearch-project/k-NN/compare/2.12...2.x)
### Features
* Add parent join support for lucene knn [#1182](https://github.com/opensearch-project/k-NN/pull/1182)
* Add parent join support for faiss hnsw [#1398](https://github.com/opensearch-project/k-NN/pull/1398)
### Enhancements
* Increase Lucene max dimension limit to 16,000 [#1346](https://github.com/opensearch-project/k-NN/pull/1346)
* Tuned default values for ef_search and ef_construction for better indexing and search performance for vector search [#1353](https://github.com/opensearch-project/k-NN/pull/1353)
Expand Down
7 changes: 7 additions & 0 deletions DEVELOPER_GUIDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -229,6 +229,13 @@ For users that want to get the most out of the libraries, they should follow [th
and build the libraries from source in their production environment, so that if their environment has optimized
instruction sets, they take advantage of them.

### Custom patch on JNI Library
If you want to make a custom patch on JNI library
1. Make a change on top of current version of JNI library and push the commit locally.
2. Create a patch file for the change using `git format-patch -o patches HEAD^`
3. Place the patch file under `jni/patches`
4. Make a change in `jni/CmakeLists.txt`, `.github/workflows/CI.yml` to apply the patch during build

## Run OpenSearch k-NN

### Run Single-node Cluster Locally
Expand Down
56 changes: 56 additions & 0 deletions benchmarks/perf-tool/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -270,6 +270,26 @@ Ingests a dataset of multiple context types into the cluster.
| ----------- | ----------- | ----------- |
| took | Total time to ingest the dataset into the index.| ms |

#### ingest_nested_field

Ingests a dataset with nested field into the cluster.

##### Parameters

| Parameter Name | Description | Default |
| ----------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| ----------- |
| index_name | Name of index to ingest into | No default |
| field_name | Name of field to ingest into | No default |
| dataset_path | Path to data-set | No default |
| attributes_dataset_name | Name of dataset with additional attributes inside the main dataset | No default |
| attribute_spec | Definition of attributes, format is: [{ name: [name_val], type: [type_val]}] Order is important and must match order of attributes column in dataset file. It should contains { name: 'parent_id', type: 'int'} | No default |

##### Metrics

| Metric Name | Description | Unit |
| ----------- | ----------- | ----------- |
| took | Total time to ingest the dataset into the index.| ms |

#### query

Runs a set of queries against an index.
Expand Down Expand Up @@ -330,6 +350,36 @@ Runs a set of queries with filter against an index.
| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |


#### query_nested_field

Runs a set of queries with nested field against an index.

##### Parameters

| Parameter Name | Description | Default |
| ----------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
| k | Number of neighbors to return on search | 100 |
| r | r value in Recall@R | 1 |
| index_name | Name of index to search | No default |
| field_name | Name field to search | No default |
| calculate_recall | Whether to calculate recall values | False |
| dataset_format | Format the dataset is in. Currently hdf5 and bigann is supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
| dataset_path | Path to dataset | No default |
| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann is supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
| neighbors_path | Path to neighbors dataset | No default |
| neighbors_dataset | Name of filter dataset inside the neighbors dataset | No default |
| query_count | Number of queries to create from data-set | Size of the data-set |

##### Metrics

| Metric Name | Description | Unit |
| ----------- | ----------- | ----------- |
| took | Took times returned per query aggregated as total, p50, p90 and p99 (when applicable) | ms |
| memory_kb | Native memory k-NN is using at the end of the query workload | KB |
| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |

#### get_stats

Gets the index stats.
Expand Down Expand Up @@ -369,6 +419,12 @@ python add-filters-to-dataset.py <path_to_dataset_with_vectors> <path_of_new_dat

After that new dataset(s) can be referred from testcase definition in `ingest_extended` and `query_with_filter` steps.

To generate dataset with parent doc id based on vectors only dataset, use following command pattern:
```commandline
python add-parent-doc-id-to-dataset.py <path_to_dataset_with_vectors> <path_of_new_dataset_with_parent_id>
```
This will generate neighbours dataset as well. This new dataset(s) can be referred from testcase definition in `ingest_nested_field` and `query_nested_field` steps.

## Contributing

### Linting
Expand Down
Loading
Loading