diff --git a/benchmarks/osb/README.md b/benchmarks/osb/README.md
deleted file mode 100644
index 0d0b05f8d..000000000
--- a/benchmarks/osb/README.md
+++ /dev/null
@@ -1,478 +0,0 @@
-# IMPORTANT NOTE: This tool is in maintenance mode and no new features will be added to it. All new features will be added to the [vector search workload](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch).
-# OpenSearch Benchmarks for k-NN
-
-## Overview
-
-This directory contains code and configurations to run k-NN benchmarking
-workloads using OpenSearch Benchmarks.
-
-The [extensions](extensions) directory contains common code shared between
-procedures. The [procedures](procedures) directory contains the individual
-test procedures for this workload.
-
-## Getting Started
-
-### OpenSearch Benchmarks Background
-
-OpenSearch Benchmark is a framework for performance benchmarking an OpenSearch
-cluster. For more details, check out their
-[repo](https://github.com/opensearch-project/opensearch-benchmark/).
-
-Before getting into the benchmarks, it is helpful to know a few terms:
-1. Workload - Top-level description of a benchmark suite. A workload will have a `workload.json` file that defines the different components of the tests
-2. Test Procedures - A workload can have a schedule of operations that run the test. However, a workload can also have several test procedures that define their own schedule of operations. This is helpful for sharing code between tests
-3. Operation - An action against the OpenSearch cluster
-4. Parameter source - Producer of parameters for OpenSearch operations
-5. Runners - Code that actually executes the OpenSearch operations
-
-### Setup
-
-OpenSearch Benchmarks requires Python 3.8 or greater to be installed. One of
-the easier ways to do this is through Conda, a package and environment
-management system for Python.
-
-First, follow the
-[installation instructions](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)
-to install Conda on your system.
-
-Next, create a Python 3.8 environment:
-```
-conda create -n knn-osb python=3.8
-```
-
-After the environment is created, activate it:
-```
-source activate knn-osb
-```
-
-Lastly, clone the k-NN repo and install all required python packages:
-```
-git clone https://github.com/opensearch-project/k-NN.git
-cd k-NN/benchmarks/osb
-pip install -r requirements.txt
-```
-
-After all of this completes, you should be ready to run your first benchmark!
-
-### Running a benchmark
-
-Before running a benchmark, make sure you have the endpoint of your cluster and
-that the machine you are running the benchmarks from can access it.
-Additionally, ensure that all data sets have been downloaded to the client machine.
-
-Currently, we support two test procedures for the k-NN workload: train-test and
-no-train-test. The train-test procedure includes steps to train a model in its
-schedule, while no-train-test does not. Both test procedures index a data set
-of vectors into an OpenSearch index and then run a set of queries against it.
-
-Once you have decided which test procedure you want to use, open up
-[params/train-params.json](params/train-params.json) or
-[params/no-train-params.json](params/no-train-params.json) and
-fill out the parameters. Note that at the bottom of `no-train-params.json`
-there are several parameters that relate to training. Ignore these; they need
-to be defined for the workload but are not used.
-
-Once the parameters are set, export the URL and port of your cluster and run
-the following command to execute the test procedure.
-
-```
-export URL=
-export PORT=
-export PARAMS_FILE=
-export PROCEDURE={no-train-test | train-test}
-
-opensearch-benchmark execute_test \
- --target-hosts $URL:$PORT \
- --workload-path ./workload.json \
- --workload-params ${PARAMS_FILE} \
- --test-procedure=${PROCEDURE} \
- --pipeline benchmark-only
-```
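-
-For example, a run of the no-train-test procedure against a local cluster might
-use values like the following (the host, port, and params file shown here are
-illustrative placeholders, not required values):
-
-```
-export URL=localhost
-export PORT=9200
-export PARAMS_FILE=./params/no-train-params.json
-export PROCEDURE=no-train-test
-```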
-
-## Current Procedures
-
-### No Train Test
-
-The No Train Test procedure is used to test `knn_vector` indices that do not
-use an algorithm that requires training.
-
-#### Workflow
-
-1. Delete old resources in the cluster if they are present
-2. Create an OpenSearch index with `knn_vector` configured to use the HNSW algorithm
-3. Wait for cluster to be green
-4. Ingest data set into the cluster
-5. Refresh the index
-6. Run queries from data set against the cluster
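-
-In step 6, the `knn-query-from-data-set` parameter source turns each vector in
-the query data set into an approximate k-NN search. A single query is roughly
-equivalent to the following request (the index name, field name, vector values,
-and `k` below are illustrative; real queries use vectors with
-`target_index_dimension` entries):
-
-```
-curl -X POST "$URL:$PORT/target_index/_search" -H 'Content-Type: application/json' -d'
-{
-  "size": 10,
-  "query": {
-    "knn": {
-      "target_field": {
-        "vector": [0.12, 0.23, 0.34],
-        "k": 10
-      }
-    }
-  }
-}'
-```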
-
-#### Parameters
-
-| Name | Description |
-|-----------------------------------------|--------------------------------------------------------------------------|
-| target_index_name | Name of index to add vectors to |
-| target_field_name | Name of field to add vectors to |
-| target_index_body | Path to target index definition |
-| target_index_primary_shards | Target index primary shards |
-| target_index_replica_shards | Target index replica shards |
-| target_index_dimension | Dimension of target index |
-| target_index_space_type | Target index space type |
-| target_index_bulk_size | Target index bulk size |
-| target_index_bulk_index_data_set_format | Format of vector data set |
-| target_index_bulk_index_data_set_path | Path to vector data set |
-| target_index_bulk_index_clients | Clients to be used for bulk ingestion (must be divisor of data set size) |
-| target_index_max_num_segments | Number of segments to merge target index down to before beginning search |
-| target_index_force_merge_timeout        | Timeout for force merge requests in seconds                              |
-| hnsw_ef_search | HNSW ef search parameter |
-| hnsw_ef_construction | HNSW ef construction parameter |
-| hnsw_m | HNSW m parameter |
-| query_k | The number of neighbors to return for the search |
-| query_clients | Number of clients to use for running queries |
-| query_data_set_format | Format of vector data set for queries |
-| query_data_set_path | Path to vector data set for queries |
-
-#### Metrics
-
-The result metrics of this procedure will look like:
-```
-------------------------------------------------------
- _______ __ _____
- / ____(_)___ ____ _/ / / ___/_________ ________
- / /_ / / __ \/ __ `/ / \__ \/ ___/ __ \/ ___/ _ \
- / __/ / / / / / /_/ / / ___/ / /__/ /_/ / / / __/
-/_/ /_/_/ /_/\__,_/_/ /____/\___/\____/_/ \___/
-------------------------------------------------------
-
-| Metric | Task | Value | Unit |
-|---------------------------------------------------------------:|------------------------:|------------:|-------:|
-| Cumulative indexing time of primary shards | | 1.82885 | min |
-| Min cumulative indexing time across primary shards | | 0.4121 | min |
-| Median cumulative indexing time across primary shards | | 0.559617 | min |
-| Max cumulative indexing time across primary shards | | 0.857133 | min |
-| Cumulative indexing throttle time of primary shards | | 0 | min |
-| Min cumulative indexing throttle time across primary shards | | 0 | min |
-| Median cumulative indexing throttle time across primary shards | | 0 | min |
-| Max cumulative indexing throttle time across primary shards | | 0 | min |
-| Cumulative merge time of primary shards | | 5.89065 | min |
-| Cumulative merge count of primary shards | | 3 | |
-| Min cumulative merge time across primary shards | | 1.95945 | min |
-| Median cumulative merge time across primary shards | | 1.96345 | min |
-| Max cumulative merge time across primary shards | | 1.96775 | min |
-| Cumulative merge throttle time of primary shards | | 0 | min |
-| Min cumulative merge throttle time across primary shards | | 0 | min |
-| Median cumulative merge throttle time across primary shards | | 0 | min |
-| Max cumulative merge throttle time across primary shards | | 0 | min |
-| Cumulative refresh time of primary shards | | 8.52517 | min |
-| Cumulative refresh count of primary shards | | 29 | |
-| Min cumulative refresh time across primary shards | | 2.64265 | min |
-| Median cumulative refresh time across primary shards | | 2.93913 | min |
-| Max cumulative refresh time across primary shards | | 2.94338 | min |
-| Cumulative flush time of primary shards | | 0.00221667 | min |
-| Cumulative flush count of primary shards | | 3 | |
-| Min cumulative flush time across primary shards | | 0.000733333 | min |
-| Median cumulative flush time across primary shards | | 0.000733333 | min |
-| Max cumulative flush time across primary shards | | 0.00075 | min |
-| Total Young Gen GC time | | 0.318 | s |
-| Total Young Gen GC count | | 2 | |
-| Total Old Gen GC time | | 0 | s |
-| Total Old Gen GC count | | 0 | |
-| Store size | | 1.43566 | GB |
-| Translog size | | 1.53668e-07 | GB |
-| Heap used for segments | | 0.00410843 | MB |
-| Heap used for doc values | | 0.000286102 | MB |
-| Heap used for terms | | 0.00121307 | MB |
-| Heap used for norms | | 0 | MB |
-| Heap used for points | | 0 | MB |
-| Heap used for stored fields | | 0.00260925 | MB |
-| Segment count | | 3 | |
-| Min Throughput | custom-vector-bulk | 38005.8 | docs/s |
-| Mean Throughput | custom-vector-bulk | 44827.9 | docs/s |
-| Median Throughput | custom-vector-bulk | 40507.2 | docs/s |
-| Max Throughput | custom-vector-bulk | 88967.8 | docs/s |
-| 50th percentile latency | custom-vector-bulk | 29.5857 | ms |
-| 90th percentile latency | custom-vector-bulk | 49.0719 | ms |
-| 99th percentile latency | custom-vector-bulk | 72.6138 | ms |
-| 99.9th percentile latency | custom-vector-bulk | 279.826 | ms |
-| 100th percentile latency | custom-vector-bulk | 15688 | ms |
-| 50th percentile service time | custom-vector-bulk | 29.5857 | ms |
-| 90th percentile service time | custom-vector-bulk | 49.0719 | ms |
-| 99th percentile service time | custom-vector-bulk | 72.6138 | ms |
-| 99.9th percentile service time | custom-vector-bulk | 279.826 | ms |
-| 100th percentile service time | custom-vector-bulk | 15688 | ms |
-| error rate | custom-vector-bulk | 0 | % |
-| Min Throughput | refresh-target-index | 0.01 | ops/s |
-| Mean Throughput | refresh-target-index | 0.01 | ops/s |
-| Median Throughput | refresh-target-index | 0.01 | ops/s |
-| Max Throughput | refresh-target-index | 0.01 | ops/s |
-| 100th percentile latency | refresh-target-index | 176610 | ms |
-| 100th percentile service time | refresh-target-index | 176610 | ms |
-| error rate | refresh-target-index | 0 | % |
-| Min Throughput | knn-query-from-data-set | 444.17 | ops/s |
-| Mean Throughput | knn-query-from-data-set | 601.68 | ops/s |
-| Median Throughput | knn-query-from-data-set | 621.19 | ops/s |
-| Max Throughput | knn-query-from-data-set | 631.23 | ops/s |
-| 50th percentile latency | knn-query-from-data-set | 14.7612 | ms |
-| 90th percentile latency | knn-query-from-data-set | 20.6954 | ms |
-| 99th percentile latency | knn-query-from-data-set | 27.7499 | ms |
-| 99.9th percentile latency | knn-query-from-data-set | 41.3506 | ms |
-| 99.99th percentile latency | knn-query-from-data-set | 162.391 | ms |
-| 100th percentile latency | knn-query-from-data-set | 162.756 | ms |
-| 50th percentile service time | knn-query-from-data-set | 14.7612 | ms |
-| 90th percentile service time | knn-query-from-data-set | 20.6954 | ms |
-| 99th percentile service time | knn-query-from-data-set | 27.7499 | ms |
-| 99.9th percentile service time | knn-query-from-data-set | 41.3506 | ms |
-| 99.99th percentile service time | knn-query-from-data-set | 162.391 | ms |
-| 100th percentile service time | knn-query-from-data-set | 162.756 | ms |
-| error rate | knn-query-from-data-set | 0 | % |
-
-
----------------------------------
-[INFO] SUCCESS (took 618 seconds)
----------------------------------
-```
-
-### Train Test
-
-The Train Test procedure is used to test `knn_vector` indices that do use an
-algorithm that requires training.
-
-#### Workflow
-
-1. Delete old resources in the cluster if they are present
-2. Create an OpenSearch index with a `knn_vector` field to hold the training data
-3. Wait for cluster to be green
-4. Ingest data set into the training index
-5. Refresh the index
-6. Train a model based on user provided input parameters
-7. Create an OpenSearch index with `knn_vector` configured to use the model
-8. Ingest vectors into the target index
-9. Refresh the target index
-10. Run queries from data set against the cluster
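-
-In step 6, the `train-model` runner posts a training request to the k-NN models
-API and then polls `GET /_plugins/_knn/models/<model_id>` once per second until
-the model's `state` becomes `created` or `train_timeout` is exceeded. For the
-`ivf` method, the request is roughly equivalent to the following (the values
-shown mirror the defaults in [params/train-params.json](params/train-params.json)
-and are illustrative only):
-
-```
-curl -X POST "$URL:$PORT/_plugins/_knn/models/test-model/_train" -H 'Content-Type: application/json' -d'
-{
-  "training_index": "train_index",
-  "training_field": "train_field",
-  "dimension": 128,
-  "search_size": 500,
-  "method": {
-    "name": "ivf",
-    "engine": "faiss",
-    "space_type": "l2",
-    "parameters": {
-      "nlist": 10,
-      "nprobes": 1
-    }
-  }
-}'
-```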
-
-#### Parameters
-
-| Name | Description |
-|-----------------------------------------|--------------------------------------------------------------------------|
-| target_index_name | Name of index to add vectors to |
-| target_field_name | Name of field to add vectors to |
-| target_index_body | Path to target index definition |
-| target_index_primary_shards | Target index primary shards |
-| target_index_replica_shards | Target index replica shards |
-| target_index_dimension | Dimension of target index |
-| target_index_space_type | Target index space type |
-| target_index_bulk_size | Target index bulk size |
-| target_index_bulk_index_data_set_format | Format of vector data set for ingestion |
-| target_index_bulk_index_data_set_path | Path to vector data set for ingestion |
-| target_index_bulk_index_clients | Clients to be used for bulk ingestion (must be divisor of data set size) |
-| target_index_max_num_segments | Number of segments to merge target index down to before beginning search |
-| target_index_force_merge_timeout        | Timeout for force merge requests in seconds                              |
-| ivf_nlists | IVF nlist parameter |
-| ivf_nprobes | IVF nprobe parameter |
-| pq_code_size | PQ code_size parameter |
-| pq_m | PQ m parameter |
-| train_model_method | Method to be used for model (ivf or ivfpq) |
-| train_model_id | Model ID |
-| train_index_name | Name of index to put training data into |
-| train_field_name | Name of field to put training data into |
-| train_index_body | Path to train index definition |
-| train_search_size | Search size to use when pulling training data |
-| train_timeout | Timeout to wait for training to finish |
-| train_index_primary_shards | Train index primary shards |
-| train_index_replica_shards | Train index replica shards |
-| train_index_bulk_size | Train index bulk size |
-| train_index_data_set_format | Format of vector data set for training |
-| train_index_data_set_path | Path to vector data set for training |
-| train_index_num_vectors | Number of vectors to use from vector data set for training |
-| train_index_bulk_index_clients | Clients to be used for bulk ingestion (must be divisor of data set size) |
-| query_k | The number of neighbors to return for the search |
-| query_clients | Number of clients to use for running queries |
-| query_data_set_format | Format of vector data set for queries |
-| query_data_set_path | Path to vector data set for queries |
-
-#### Metrics
-
-The result metrics of this procedure will look like:
-```
-------------------------------------------------------
- _______ __ _____
- / ____(_)___ ____ _/ / / ___/_________ ________
- / /_ / / __ \/ __ `/ / \__ \/ ___/ __ \/ ___/ _ \
- / __/ / / / / / /_/ / / ___/ / /__/ /_/ / / / __/
-/_/ /_/_/ /_/\__,_/_/ /____/\___/\____/_/ \___/
-------------------------------------------------------
-
-| Metric | Task | Value | Unit |
-|---------------------------------------------------------------:|------------------------:|-----------:|-----------------:|
-| Cumulative indexing time of primary shards | | 2.92382 | min |
-| Min cumulative indexing time across primary shards | | 0.42245 | min |
-| Median cumulative indexing time across primary shards | | 0.43395 | min |
-| Max cumulative indexing time across primary shards | | 1.63347 | min |
-| Cumulative indexing throttle time of primary shards | | 0 | min |
-| Min cumulative indexing throttle time across primary shards | | 0 | min |
-| Median cumulative indexing throttle time across primary shards | | 0 | min |
-| Max cumulative indexing throttle time across primary shards | | 0 | min |
-| Cumulative merge time of primary shards | | 1.36293 | min |
-| Cumulative merge count of primary shards | | 20 | |
-| Min cumulative merge time across primary shards | | 0.263283 | min |
-| Median cumulative merge time across primary shards | | 0.291733 | min |
-| Max cumulative merge time across primary shards | | 0.516183 | min |
-| Cumulative merge throttle time of primary shards | | 0.701683 | min |
-| Min cumulative merge throttle time across primary shards | | 0.163883 | min |
-| Median cumulative merge throttle time across primary shards | | 0.175717 | min |
-| Max cumulative merge throttle time across primary shards | | 0.186367 | min |
-| Cumulative refresh time of primary shards | | 0.222217 | min |
-| Cumulative refresh count of primary shards | | 67 | |
-| Min cumulative refresh time across primary shards | | 0.03915 | min |
-| Median cumulative refresh time across primary shards | | 0.039825 | min |
-| Max cumulative refresh time across primary shards | | 0.103417 | min |
-| Cumulative flush time of primary shards | | 0.0276833 | min |
-| Cumulative flush count of primary shards | | 1 | |
-| Min cumulative flush time across primary shards | | 0 | min |
-| Median cumulative flush time across primary shards | | 0 | min |
-| Max cumulative flush time across primary shards | | 0.0276833 | min |
-| Total Young Gen GC time | | 0.074 | s |
-| Total Young Gen GC count | | 8 | |
-| Total Old Gen GC time | | 0 | s |
-| Total Old Gen GC count | | 0 | |
-| Store size | | 1.67839 | GB |
-| Translog size | | 0.115145 | GB |
-| Heap used for segments | | 0.0350914 | MB |
-| Heap used for doc values | | 0.00771713 | MB |
-| Heap used for terms | | 0.0101089 | MB |
-| Heap used for norms | | 0 | MB |
-| Heap used for points | | 0 | MB |
-| Heap used for stored fields | | 0.0172653 | MB |
-| Segment count | | 25 | |
-| Min Throughput | delete-model | 25.45 | ops/s |
-| Mean Throughput | delete-model | 25.45 | ops/s |
-| Median Throughput | delete-model | 25.45 | ops/s |
-| Max Throughput | delete-model | 25.45 | ops/s |
-| 100th percentile latency | delete-model | 39.0409 | ms |
-| 100th percentile service time | delete-model | 39.0409 | ms |
-| error rate | delete-model | 0 | % |
-| Min Throughput | train-vector-bulk | 49518.9 | docs/s |
-| Mean Throughput | train-vector-bulk | 54418.8 | docs/s |
-| Median Throughput | train-vector-bulk | 52984.2 | docs/s |
-| Max Throughput | train-vector-bulk | 62118.3 | docs/s |
-| 50th percentile latency | train-vector-bulk | 26.5293 | ms |
-| 90th percentile latency | train-vector-bulk | 41.8212 | ms |
-| 99th percentile latency | train-vector-bulk | 239.351 | ms |
-| 99.9th percentile latency | train-vector-bulk | 348.507 | ms |
-| 100th percentile latency | train-vector-bulk | 436.292 | ms |
-| 50th percentile service time | train-vector-bulk | 26.5293 | ms |
-| 90th percentile service time | train-vector-bulk | 41.8212 | ms |
-| 99th percentile service time | train-vector-bulk | 239.351 | ms |
-| 99.9th percentile service time | train-vector-bulk | 348.507 | ms |
-| 100th percentile service time | train-vector-bulk | 436.292 | ms |
-| error rate | train-vector-bulk | 0 | % |
-| Min Throughput | refresh-train-index | 0.47 | ops/s |
-| Mean Throughput | refresh-train-index | 0.47 | ops/s |
-| Median Throughput | refresh-train-index | 0.47 | ops/s |
-| Max Throughput | refresh-train-index | 0.47 | ops/s |
-| 100th percentile latency | refresh-train-index | 2142.96 | ms |
-| 100th percentile service time | refresh-train-index | 2142.96 | ms |
-| error rate | refresh-train-index | 0 | % |
-| Min Throughput | ivfpq-train-model | 0.01 | models_trained/s |
-| Mean Throughput | ivfpq-train-model | 0.01 | models_trained/s |
-| Median Throughput | ivfpq-train-model | 0.01 | models_trained/s |
-| Max Throughput | ivfpq-train-model | 0.01 | models_trained/s |
-| 100th percentile latency | ivfpq-train-model | 136563 | ms |
-| 100th percentile service time | ivfpq-train-model | 136563 | ms |
-| error rate | ivfpq-train-model | 0 | % |
-| Min Throughput | custom-vector-bulk | 62384.8 | docs/s |
-| Mean Throughput | custom-vector-bulk | 69035.2 | docs/s |
-| Median Throughput | custom-vector-bulk | 68675.4 | docs/s |
-| Max Throughput | custom-vector-bulk | 80713.4 | docs/s |
-| 50th percentile latency | custom-vector-bulk | 18.7726 | ms |
-| 90th percentile latency | custom-vector-bulk | 34.8881 | ms |
-| 99th percentile latency | custom-vector-bulk | 150.435 | ms |
-| 99.9th percentile latency | custom-vector-bulk | 296.862 | ms |
-| 100th percentile latency | custom-vector-bulk | 344.394 | ms |
-| 50th percentile service time | custom-vector-bulk | 18.7726 | ms |
-| 90th percentile service time | custom-vector-bulk | 34.8881 | ms |
-| 99th percentile service time | custom-vector-bulk | 150.435 | ms |
-| 99.9th percentile service time | custom-vector-bulk | 296.862 | ms |
-| 100th percentile service time | custom-vector-bulk | 344.394 | ms |
-| error rate | custom-vector-bulk | 0 | % |
-| Min Throughput | refresh-target-index | 28.32 | ops/s |
-| Mean Throughput | refresh-target-index | 28.32 | ops/s |
-| Median Throughput | refresh-target-index | 28.32 | ops/s |
-| Max Throughput | refresh-target-index | 28.32 | ops/s |
-| 100th percentile latency | refresh-target-index | 34.9811 | ms |
-| 100th percentile service time | refresh-target-index | 34.9811 | ms |
-| error rate | refresh-target-index | 0 | % |
-| Min Throughput | knn-query-from-data-set | 0.9 | ops/s |
-| Mean Throughput | knn-query-from-data-set | 453.84 | ops/s |
-| Median Throughput | knn-query-from-data-set | 554.15 | ops/s |
-| Max Throughput | knn-query-from-data-set | 681 | ops/s |
-| 50th percentile latency | knn-query-from-data-set | 11.7174 | ms |
-| 90th percentile latency | knn-query-from-data-set | 15.4445 | ms |
-| 99th percentile latency | knn-query-from-data-set | 21.0682 | ms |
-| 99.9th percentile latency | knn-query-from-data-set | 39.5414 | ms |
-| 99.99th percentile latency | knn-query-from-data-set | 1116.33 | ms |
-| 100th percentile latency | knn-query-from-data-set | 1116.66 | ms |
-| 50th percentile service time | knn-query-from-data-set | 11.7174 | ms |
-| 90th percentile service time | knn-query-from-data-set | 15.4445 | ms |
-| 99th percentile service time | knn-query-from-data-set | 21.0682 | ms |
-| 99.9th percentile service time | knn-query-from-data-set | 39.5414 | ms |
-| 99.99th percentile service time | knn-query-from-data-set | 1116.33 | ms |
-| 100th percentile service time | knn-query-from-data-set | 1116.66 | ms |
-| error rate | knn-query-from-data-set | 0 | % |
-
-
----------------------------------
-[INFO] SUCCESS (took 281 seconds)
----------------------------------
-```
-
-## Adding a procedure
-
-Adding additional benchmarks is simple. First, place any custom parameter
-sources or runners in the [extensions](extensions) directory so that other
-tests can use them, and update the [documentation](#custom-extensions)
-accordingly.
-
-Next, create a new test procedure file and add the operations you want your test
-to run. Lastly, be sure to update documentation.
-
-## Custom Extensions
-
-OpenSearch Benchmark is highly extensible. To fit the plugin's needs, we add
-custom parameter sources and custom runners. Parameter sources allow users to
-supply custom parameters to an operation. Runners are what actually perform
-the operations against OpenSearch.
-
-### Custom Parameter Sources
-
-Custom parameter sources are defined in [extensions/param_sources.py](extensions/param_sources.py).
-
-| Name | Description | Parameters |
-|-------------------------|------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| bulk-from-data-set      | Provides bulk payloads containing vectors from a data set for indexing  | 1. data_set_format - (hdf5, bigann)<br>2. data_set_path - path to data set<br>3. index - name of index for bulk ingestion<br>4. field - field to place vector in<br>5. bulk_size - vectors per bulk request<br>6. num_vectors - number of vectors to use from the data set. Defaults to the whole data set. |
-| knn-query-from-data-set | Provides a query generated from a data set                              | 1. data_set_format - (hdf5, bigann)<br>2. data_set_path - path to data set<br>3. index - name of index to query against<br>4. field - field to query against<br>5. k - number of results to return<br>6. dimension - size of vectors to produce<br>7. num_vectors - number of vectors to use from the data set. Defaults to the whole data set. |
-
-
-### Custom Runners
-
-Custom runners are defined in [extensions/runners.py](extensions/runners.py).
-
-| Syntax | Description | Parameters |
-|--------------------|-----------------------------------------------------|:-------------------------------------------------------------------------------------------------------------|
-| custom-vector-bulk | Bulk index a set of vectors in an OpenSearch index. | 1. bulk-from-data-set |
-| custom-refresh     | Run refresh with retry capabilities.                 | 1. index - name of index to refresh<br>2. retries - number of times to retry the operation |
-| train-model        | Trains a model.                                       | 1. body - model definition<br>2. timeout - time to wait for model to finish<br>3. model_id - ID of model |
-| delete-model | Deletes a model if it exists. | 1. model_id - ID of model |
-
-### Testing
-
-We have a set of unit tests for our extensions in
-[tests](tests). To run all the tests, run the following
-command:
-
-```commandline
-python -m unittest discover ./tests
-```
-
-To run an individual test:
-```commandline
-python -m unittest tests.test_param_sources.VectorsFromDataSetParamSourceTestCase.test_partition_hdf5
-```
diff --git a/benchmarks/osb/__init__.py b/benchmarks/osb/__init__.py
deleted file mode 100644
index e69de29bb..000000000
diff --git a/benchmarks/osb/extensions/__init__.py b/benchmarks/osb/extensions/__init__.py
deleted file mode 100644
index e69de29bb..000000000
diff --git a/benchmarks/osb/extensions/data_set.py b/benchmarks/osb/extensions/data_set.py
deleted file mode 100644
index 7e8058844..000000000
--- a/benchmarks/osb/extensions/data_set.py
+++ /dev/null
@@ -1,202 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-import os
-import numpy as np
-from abc import ABC, ABCMeta, abstractmethod
-from enum import Enum
-from typing import cast
-import h5py
-import struct
-
-
-class Context(Enum):
- """DataSet context enum. Can be used to add additional context for how a
- data-set should be interpreted.
- """
- INDEX = 1
- QUERY = 2
- NEIGHBORS = 3
-
-
-class DataSet(ABC):
- """DataSet interface. Used for reading data-sets from files.
-
- Methods:
- read: Read a chunk of data from the data-set
- seek: Get to position in the data-set
- size: Gets the number of items in the data-set
- reset: Resets internal state of data-set to beginning
- """
- __metaclass__ = ABCMeta
-
- BEGINNING = 0
-
- @abstractmethod
- def read(self, chunk_size: int):
- pass
-
- @abstractmethod
- def seek(self, offset: int):
- pass
-
- @abstractmethod
- def size(self):
- pass
-
- @abstractmethod
- def reset(self):
- pass
-
-
-class HDF5DataSet(DataSet):
-    """ Data-set format corresponding to `ANN Benchmarks
-    <https://github.com/erikbern/ann-benchmarks>`_
- """
-
- FORMAT_NAME = "hdf5"
-
- def __init__(self, dataset_path: str, context: Context):
- file = h5py.File(dataset_path)
- self.data = cast(h5py.Dataset, file[self.parse_context(context)])
- self.current = self.BEGINNING
-
- def read(self, chunk_size: int):
- if self.current >= self.size():
- return None
-
- end_offset = self.current + chunk_size
- if end_offset > self.size():
- end_offset = self.size()
-
- v = cast(np.ndarray, self.data[self.current:end_offset])
- self.current = end_offset
- return v
-
- def seek(self, offset: int):
-
- if offset < self.BEGINNING:
- raise Exception("Offset must be greater than or equal to 0")
-
- if offset >= self.size():
- raise Exception("Offset must be less than the data set size")
-
- self.current = offset
-
- def size(self):
- return self.data.len()
-
- def reset(self):
- self.current = self.BEGINNING
-
- @staticmethod
- def parse_context(context: Context) -> str:
- if context == Context.NEIGHBORS:
- return "neighbors"
-
- if context == Context.INDEX:
- return "train"
-
- if context == Context.QUERY:
- return "test"
-
- raise Exception("Unsupported context")
-
-
-class BigANNVectorDataSet(DataSet):
- """ Data-set format for vector data-sets for `Big ANN Benchmarks
-    <https://big-ann-benchmarks.com>`_
- """
-
- DATA_SET_HEADER_LENGTH = 8
- U8BIN_EXTENSION = "u8bin"
- FBIN_EXTENSION = "fbin"
- FORMAT_NAME = "bigann"
-
- BYTES_PER_U8INT = 1
- BYTES_PER_FLOAT = 4
-
- def __init__(self, dataset_path: str):
- self.file = open(dataset_path, 'rb')
- self.file.seek(BigANNVectorDataSet.BEGINNING, os.SEEK_END)
- num_bytes = self.file.tell()
- self.file.seek(BigANNVectorDataSet.BEGINNING)
-
- if num_bytes < BigANNVectorDataSet.DATA_SET_HEADER_LENGTH:
- raise Exception("File is invalid")
-
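-        # The 8-byte header stores the number of points followed by the
-        # vector dimension, each as a 4-byte little-endian integer.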
- self.num_points = int.from_bytes(self.file.read(4), "little")
- self.dimension = int.from_bytes(self.file.read(4), "little")
- self.bytes_per_num = self._get_data_size(dataset_path)
-
- if (num_bytes - BigANNVectorDataSet.DATA_SET_HEADER_LENGTH) != self.num_points * \
- self.dimension * self.bytes_per_num:
- raise Exception("File is invalid")
-
- self.reader = self._value_reader(dataset_path)
- self.current = BigANNVectorDataSet.BEGINNING
-
- def read(self, chunk_size: int):
- if self.current >= self.size():
- return None
-
- end_offset = self.current + chunk_size
- if end_offset > self.size():
- end_offset = self.size()
-
- v = np.asarray([self._read_vector() for _ in
- range(end_offset - self.current)])
- self.current = end_offset
- return v
-
- def seek(self, offset: int):
-
- if offset < self.BEGINNING:
- raise Exception("Offset must be greater than or equal to 0")
-
- if offset >= self.size():
- raise Exception("Offset must be less than the data set size")
-
- bytes_offset = BigANNVectorDataSet.DATA_SET_HEADER_LENGTH + \
- self.dimension * self.bytes_per_num * offset
- self.file.seek(bytes_offset)
- self.current = offset
-
- def _read_vector(self):
- return np.asarray([self.reader(self.file) for _ in
- range(self.dimension)])
-
- def size(self):
- return self.num_points
-
- def reset(self):
- self.file.seek(BigANNVectorDataSet.DATA_SET_HEADER_LENGTH)
- self.current = BigANNVectorDataSet.BEGINNING
-
- def __del__(self):
- self.file.close()
-
- @staticmethod
- def _get_data_size(file_name):
- ext = file_name.split('.')[-1]
- if ext == BigANNVectorDataSet.U8BIN_EXTENSION:
- return BigANNVectorDataSet.BYTES_PER_U8INT
-
- if ext == BigANNVectorDataSet.FBIN_EXTENSION:
- return BigANNVectorDataSet.BYTES_PER_FLOAT
-
- raise Exception("Unknown extension")
-
- @staticmethod
- def _value_reader(file_name):
- ext = file_name.split('.')[-1]
- if ext == BigANNVectorDataSet.U8BIN_EXTENSION:
- return lambda file: float(int.from_bytes(file.read(BigANNVectorDataSet.BYTES_PER_U8INT), "little"))
-
- if ext == BigANNVectorDataSet.FBIN_EXTENSION:
-            return lambda file: struct.unpack('<f', file.read(BigANNVectorDataSet.BYTES_PER_FLOAT))[0]
-        if self.current >= self.num_vectors + self.offset:
- raise StopIteration
-
- if self.vector_batch is None or len(self.vector_batch) == 0:
- self.vector_batch = self._batch_read(self.data_set)
- if self.vector_batch is None:
- raise StopIteration
- vector = self.vector_batch.pop(0)
- self.current += 1
- self.percent_completed = self.current / self.total
-
- return self._build_query_body(self.index_name, self.field_name, self.k,
- vector)
-
- def _batch_read(self, data_set: DataSet):
- return list(data_set.read(self.VECTOR_READ_BATCH_SIZE))
-
- def _build_query_body(self, index_name: str, field_name: str, k: int,
- vector) -> dict:
- """Builds a k-NN query that can be used to execute an approximate nearest
- neighbor search against a k-NN plugin index
- Args:
- index_name: name of index to search
- field_name: name of field to search
- k: number of results to return
- vector: vector used for query
- Returns:
- A dictionary containing the body used for search, a set of request
- parameters to attach to the search and the name of the index.
- """
- return {
- "index": index_name,
- "request-params": {
- "_source": {
- "exclude": [field_name]
- }
- },
- "body": {
- "size": k,
- "query": {
- "knn": {
- field_name: {
- "vector": vector,
- "k": k
- }
- }
- }
- }
- }
-
-
-class BulkVectorsFromDataSetParamSource(VectorsFromDataSetParamSource):
- """ Create bulk index requests from a data set of vectors.
-
- Attributes:
- bulk_size: number of vectors per request
- retries: number of times to retry the request when it fails
- """
-
- DEFAULT_RETRIES = 10
-
- def __init__(self, workload, params, **kwargs):
- super().__init__(params, Context.INDEX)
- self.bulk_size: int = parse_int_parameter("bulk_size", params)
- self.retries: int = parse_int_parameter("retries", params,
- self.DEFAULT_RETRIES)
-
- def params(self):
- """
- Returns: A bulk index parameter with vectors from a data set.
- """
- if self.current >= self.num_vectors + self.offset:
- raise StopIteration
-
- def action(doc_id):
- return {'index': {'_index': self.index_name, '_id': doc_id}}
-
- partition = self.data_set.read(self.bulk_size)
- body = bulk_transform(partition, self.field_name, action, self.current)
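-        # bulk_transform emits two entries per vector (an action line followed
-        # by the document), so the number of documents is half the body length.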
- size = len(body) // 2
- self.current += size
- self.percent_completed = self.current / self.total
-
- return {
- "body": body,
- "retries": self.retries,
- "size": size
- }
diff --git a/benchmarks/osb/extensions/registry.py b/benchmarks/osb/extensions/registry.py
deleted file mode 100644
index 5ce17ab6f..000000000
--- a/benchmarks/osb/extensions/registry.py
+++ /dev/null
@@ -1,13 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-from .param_sources import register as param_sources_register
-from .runners import register as runners_register
-
-
-def register(registry):
- param_sources_register(registry)
- runners_register(registry)
diff --git a/benchmarks/osb/extensions/runners.py b/benchmarks/osb/extensions/runners.py
deleted file mode 100644
index d048f80b0..000000000
--- a/benchmarks/osb/extensions/runners.py
+++ /dev/null
@@ -1,121 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-from opensearchpy.exceptions import ConnectionTimeout
-from .util import parse_int_parameter, parse_string_parameter
-import logging
-import time
-
-
-def register(registry):
- registry.register_runner(
- "custom-vector-bulk", BulkVectorsFromDataSetRunner(), async_runner=True
- )
- registry.register_runner(
- "custom-refresh", CustomRefreshRunner(), async_runner=True
- )
- registry.register_runner(
- "train-model", TrainModelRunner(), async_runner=True
- )
- registry.register_runner(
- "delete-model", DeleteModelRunner(), async_runner=True
- )
-
-
-class BulkVectorsFromDataSetRunner:
-
- async def __call__(self, opensearch, params):
- size = parse_int_parameter("size", params)
- retries = parse_int_parameter("retries", params, 0) + 1
-
- for _ in range(retries):
- try:
- await opensearch.bulk(
- body=params["body"],
- timeout='5m'
- )
-
- return size, "docs"
- except ConnectionTimeout:
- logging.getLogger(__name__)\
- .warning("Bulk vector ingestion timed out. Retrying")
-
- raise TimeoutError("Failed to submit bulk request in specified number "
- "of retries: {}".format(retries))
-
- def __repr__(self, *args, **kwargs):
- return "custom-vector-bulk"
-
-
-class CustomRefreshRunner:
-
- async def __call__(self, opensearch, params):
- retries = parse_int_parameter("retries", params, 0) + 1
-
- for _ in range(retries):
- try:
- await opensearch.indices.refresh(
- index=parse_string_parameter("index", params)
- )
-
- return
- except ConnectionTimeout:
- logging.getLogger(__name__)\
- .warning("Custom refresh timed out. Retrying")
-
- raise TimeoutError("Failed to refresh the index in specified number "
- "of retries: {}".format(retries))
-
- def __repr__(self, *args, **kwargs):
- return "custom-refresh"
-
-
-class TrainModelRunner:
-
- async def __call__(self, opensearch, params):
-        # Train a model and wait for the training to complete
- body = params["body"]
- timeout = parse_int_parameter("timeout", params)
- model_id = parse_string_parameter("model_id", params)
-
- method = "POST"
- model_uri = "/_plugins/_knn/models/{}".format(model_id)
- await opensearch.transport.perform_request(method, "{}/_train".format(model_uri), body=body)
-
- start_time = time.time()
- while time.time() < start_time + timeout:
- time.sleep(1)
- model_response = await opensearch.transport.perform_request("GET", model_uri)
-
- if 'state' not in model_response.keys():
- continue
-
- if model_response['state'] == 'created':
- #TODO: Return model size as well
- return 1, "models_trained"
-
- if model_response['state'] == 'failed':
- raise Exception("Failed to create model: {}".format(model_response))
-
- raise Exception('Failed to create model: {} within timeout {} seconds'
- .format(model_id, timeout))
-
- def __repr__(self, *args, **kwargs):
- return "train-model"
-
-
-class DeleteModelRunner:
-
- async def __call__(self, opensearch, params):
- # Delete model provided by model id
- method = "DELETE"
- model_id = parse_string_parameter("model_id", params)
- uri = "/_plugins/_knn/models/{}".format(model_id)
-
-        # Ignore if the model doesn't exist
- await opensearch.transport.perform_request(method, uri, params={"ignore": [400, 404]})
-
- def __repr__(self, *args, **kwargs):
- return "delete-model"
diff --git a/benchmarks/osb/extensions/util.py b/benchmarks/osb/extensions/util.py
deleted file mode 100644
index f7f6aab62..000000000
--- a/benchmarks/osb/extensions/util.py
+++ /dev/null
@@ -1,71 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-import numpy as np
-from typing import List
-from typing import Dict
-from typing import Any
-
-
-def bulk_transform(partition: np.ndarray, field_name: str, action,
- offset: int) -> List[Dict[str, Any]]:
- """Partitions and transforms a list of vectors into OpenSearch's bulk
-    ingestion format.
- Args:
- offset: to start counting from
- partition: An array of vectors to transform.
- field_name: field name for action
- action: Bulk API action.
- Returns:
- An array of transformed vectors in bulk format.
- """
- actions = []
- _ = [
- actions.extend([action(i + offset), None])
- for i in range(len(partition))
- ]
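-    # Fill the odd positions with the vector documents so the result alternates
-    # [action, doc, action, doc, ...], as the bulk API expects.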
- actions[1::2] = [{field_name: vec} for vec in partition.tolist()]
- return actions
-
-
-def parse_string_parameter(key: str, params: dict, default: str = None) -> str:
- if key not in params:
- if default is not None:
- return default
- raise ConfigurationError(
- "Value cannot be None for param {}".format(key)
- )
-
- if type(params[key]) is str:
- return params[key]
-
- raise ConfigurationError("Value must be a string for param {}".format(key))
-
-
-def parse_int_parameter(key: str, params: dict, default: int = None) -> int:
- if key not in params:
-        if default is not None:
- return default
- raise ConfigurationError(
- "Value cannot be None for param {}".format(key)
- )
-
- if type(params[key]) is int:
- return params[key]
-
-    raise ConfigurationError("Value must be an int for param {}".format(key))
-
-
-class ConfigurationError(Exception):
-    """Exception raised for configuration errors.
-
- Attributes:
- message -- explanation of the error
- """
-
- def __init__(self, message: str):
- self.message = f'{message}'
- super().__init__(self.message)
diff --git a/benchmarks/osb/indices/faiss-index.json b/benchmarks/osb/indices/faiss-index.json
deleted file mode 100644
index 2db4d34d4..000000000
--- a/benchmarks/osb/indices/faiss-index.json
+++ /dev/null
@@ -1,27 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": {{ target_index_primary_shards }},
- "number_of_replicas": {{ target_index_replica_shards }}
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": {{ target_index_dimension }},
- "method": {
- "name": "hnsw",
- "space_type": "{{ target_index_space_type }}",
- "engine": "faiss",
- "parameters": {
- "ef_search": {{ hnsw_ef_search }},
- "ef_construction": {{ hnsw_ef_construction }},
- "m": {{ hnsw_m }}
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/osb/indices/lucene-index.json b/benchmarks/osb/indices/lucene-index.json
deleted file mode 100644
index 0a4ed868a..000000000
--- a/benchmarks/osb/indices/lucene-index.json
+++ /dev/null
@@ -1,26 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": {{ target_index_primary_shards }},
- "number_of_replicas": {{ target_index_replica_shards }}
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": {{ target_index_dimension }},
- "method": {
- "name": "hnsw",
- "space_type": "{{ target_index_space_type }}",
- "engine": "lucene",
- "parameters": {
- "ef_construction": {{ hnsw_ef_construction }},
- "m": {{ hnsw_m }}
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/osb/indices/model-index.json b/benchmarks/osb/indices/model-index.json
deleted file mode 100644
index 0e92c8903..000000000
--- a/benchmarks/osb/indices/model-index.json
+++ /dev/null
@@ -1,17 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": {{ target_index_primary_shards | default(1) }},
- "number_of_replicas": {{ target_index_replica_shards | default(0) }}
- }
- },
- "mappings": {
- "properties": {
- "{{ target_field_name }}": {
- "type": "knn_vector",
- "model_id": "{{ train_model_id }}"
- }
- }
- }
-}
diff --git a/benchmarks/osb/indices/nmslib-index.json b/benchmarks/osb/indices/nmslib-index.json
deleted file mode 100644
index 4ceb57977..000000000
--- a/benchmarks/osb/indices/nmslib-index.json
+++ /dev/null
@@ -1,27 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "knn.algo_param.ef_search": {{ hnsw_ef_search }},
- "number_of_shards": {{ target_index_primary_shards }},
- "number_of_replicas": {{ target_index_replica_shards }}
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": {{ target_index_dimension }},
- "method": {
- "name": "hnsw",
- "space_type": "{{ target_index_space_type }}",
- "engine": "nmslib",
- "parameters": {
- "ef_construction": {{ hnsw_ef_construction }},
- "m": {{ hnsw_m }}
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/osb/indices/train-index.json b/benchmarks/osb/indices/train-index.json
deleted file mode 100644
index 82af8215e..000000000
--- a/benchmarks/osb/indices/train-index.json
+++ /dev/null
@@ -1,16 +0,0 @@
-{
- "settings": {
- "index": {
- "number_of_shards": {{ train_index_primary_shards }},
- "number_of_replicas": {{ train_index_replica_shards }}
- }
- },
- "mappings": {
- "properties": {
- "{{ train_field_name }}": {
- "type": "knn_vector",
- "dimension": {{ target_index_dimension }}
- }
- }
- }
-}
diff --git a/benchmarks/osb/operations/default.json b/benchmarks/osb/operations/default.json
deleted file mode 100644
index ee33166f0..000000000
--- a/benchmarks/osb/operations/default.json
+++ /dev/null
@@ -1,53 +0,0 @@
-[
- {
- "name": "ivfpq-train-model",
- "operation-type": "train-model",
- "model_id": "{{ train_model_id }}",
- "timeout": {{ train_timeout }},
- "body": {
- "training_index": "{{ train_index_name }}",
- "training_field": "{{ train_field_name }}",
- "dimension": {{ target_index_dimension }},
- "search_size": {{ train_search_size }},
- "max_training_vector_count": {{ train_index_num_vectors }},
- "method": {
- "name":"ivf",
- "engine":"faiss",
- "space_type": "{{ target_index_space_type }}",
- "parameters":{
- "nlist": {{ ivf_nlists }},
- "nprobes": {{ ivf_nprobes }},
- "encoder":{
- "name":"pq",
- "parameters":{
- "code_size": {{ pq_code_size }},
- "m": {{ pq_m }}
- }
- }
- }
- }
- }
- },
- {
- "name": "ivf-train-model",
- "operation-type": "train-model",
- "model_id": "{{ train_model_id }}",
- "timeout": {{ train_timeout | default(1000) }},
- "body": {
- "training_index": "{{ train_index_name }}",
- "training_field": "{{ train_field_name }}",
- "search_size": {{ train_search_size }},
- "dimension": {{ target_index_dimension }},
- "max_training_vector_count": {{ train_index_num_vectors }},
- "method": {
- "name":"ivf",
- "engine":"faiss",
- "space_type": "{{ target_index_space_type }}",
- "parameters":{
- "nlist": {{ ivf_nlists }},
- "nprobes": {{ ivf_nprobes }}
- }
- }
- }
- }
-]
diff --git a/benchmarks/osb/params/no-train-params.json b/benchmarks/osb/params/no-train-params.json
deleted file mode 100644
index 58e4197fd..000000000
--- a/benchmarks/osb/params/no-train-params.json
+++ /dev/null
@@ -1,40 +0,0 @@
-{
- "target_index_name": "target_index",
- "target_field_name": "target_field",
- "target_index_body": "indices/nmslib-index.json",
- "target_index_primary_shards": 3,
- "target_index_replica_shards": 1,
- "target_index_dimension": 128,
- "target_index_space_type": "l2",
- "target_index_bulk_size": 200,
- "target_index_bulk_index_data_set_format": "hdf5",
- "target_index_bulk_index_data_set_path": "",
- "target_index_bulk_index_clients": 10,
- "target_index_max_num_segments": 10,
- "target_index_force_merge_timeout": 45.0,
- "hnsw_ef_search": 512,
- "hnsw_ef_construction": 512,
- "hnsw_m": 16,
-
- "query_k": 10,
- "query_clients": 10,
- "query_data_set_format": "hdf5",
- "query_data_set_path": "",
-
- "ivf_nlists": 1,
- "ivf_nprobes": 1,
- "pq_code_size": 1,
- "pq_m": 1,
- "train_model_method": "",
- "train_model_id": "",
- "train_index_name": "",
- "train_field_name": "",
- "train_index_body": "",
- "train_search_size": 1,
- "train_timeout": 1,
- "train_index_bulk_size": 1,
- "train_index_data_set_format": "",
- "train_index_data_set_path": "",
- "train_index_num_vectors": 1,
- "train_index_bulk_index_clients": 1
-}
diff --git a/benchmarks/osb/params/train-params.json b/benchmarks/osb/params/train-params.json
deleted file mode 100644
index f55ed4333..000000000
--- a/benchmarks/osb/params/train-params.json
+++ /dev/null
@@ -1,38 +0,0 @@
-{
- "target_index_name": "target_index",
- "target_field_name": "target_field",
- "target_index_body": "indices/model-index.json",
- "target_index_primary_shards": 3,
- "target_index_replica_shards": 1,
- "target_index_dimension": 128,
- "target_index_space_type": "l2",
- "target_index_bulk_size": 200,
- "target_index_bulk_index_data_set_format": "hdf5",
- "target_index_bulk_index_data_set_path": "",
- "target_index_bulk_index_clients": 10,
- "target_index_max_num_segments": 10,
- "target_index_force_merge_timeout": 45.0,
- "ivf_nlists": 10,
- "ivf_nprobes": 1,
- "pq_code_size": 8,
- "pq_m": 8,
- "train_model_method": "ivfpq",
- "train_model_id": "test-model",
- "train_index_name": "train_index",
- "train_field_name": "train_field",
- "train_index_body": "indices/train-index.json",
- "train_search_size": 500,
- "train_timeout": 5000,
- "train_index_primary_shards": 1,
- "train_index_replica_shards": 0,
- "train_index_bulk_size": 200,
- "train_index_data_set_format": "hdf5",
- "train_index_data_set_path": "",
- "train_index_num_vectors": 1000000,
- "train_index_bulk_index_clients": 10,
-
- "query_k": 10,
- "query_clients": 10,
- "query_data_set_format": "hdf5",
- "query_data_set_path": ""
-}
diff --git a/benchmarks/osb/procedures/no-train-test.json b/benchmarks/osb/procedures/no-train-test.json
deleted file mode 100644
index 01985b914..000000000
--- a/benchmarks/osb/procedures/no-train-test.json
+++ /dev/null
@@ -1,73 +0,0 @@
-{% import "benchmark.helpers" as benchmark with context %}
-{
- "name": "no-train-test",
- "default": true,
- "schedule": [
- {
- "operation": {
- "name": "delete-target-index",
- "operation-type": "delete-index",
- "only-if-exists": true,
- "index": "{{ target_index_name }}"
- }
- },
- {
- "operation": {
- "name": "create-target-index",
- "operation-type": "create-index",
- "index": "{{ target_index_name }}"
- }
- },
- {
- "name": "wait-for-cluster-to-be-green",
- "operation": "cluster-health",
- "request-params": {
- "wait_for_status": "green"
- }
- },
- {
- "operation": {
- "name": "custom-vector-bulk",
- "operation-type": "custom-vector-bulk",
- "param-source": "bulk-from-data-set",
- "index": "{{ target_index_name }}",
- "field": "{{ target_field_name }}",
- "bulk_size": {{ target_index_bulk_size }},
- "data_set_format": "{{ target_index_bulk_index_data_set_format }}",
- "data_set_path": "{{ target_index_bulk_index_data_set_path }}"
- },
- "clients": {{ target_index_bulk_index_clients }}
- },
- {
- "operation": {
- "name": "refresh-target-index",
- "operation-type": "custom-refresh",
- "index": "{{ target_index_name }}",
- "retries": 100
- }
- },
- {
- "operation": {
- "name": "force-merge",
- "operation-type": "force-merge",
- "request-timeout": {{ target_index_force_merge_timeout }},
- "index": "{{ target_index_name }}",
- "mode": "polling",
- "max-num-segments": {{ target_index_max_num_segments }}
- }
- },
- {
- "operation": {
- "name": "knn-query-from-data-set",
- "operation-type": "search",
- "index": "{{ target_index_name }}",
- "param-source": "knn-query-from-data-set",
- "k": {{ query_k }},
- "field": "{{ target_field_name }}",
- "data_set_format": "{{ query_data_set_format }}",
- "data_set_path": "{{ query_data_set_path }}"
- },
- "clients": {{ query_clients }}
- }
- ]
-}
diff --git a/benchmarks/osb/procedures/train-test.json b/benchmarks/osb/procedures/train-test.json
deleted file mode 100644
index ca26db0b0..000000000
--- a/benchmarks/osb/procedures/train-test.json
+++ /dev/null
@@ -1,127 +0,0 @@
-{% import "benchmark.helpers" as benchmark with context %}
-{
- "name": "train-test",
- "default": false,
- "schedule": [
- {
- "operation": {
- "name": "delete-target-index",
- "operation-type": "delete-index",
- "only-if-exists": true,
- "index": "{{ target_index_name }}"
- }
- },
- {
- "operation": {
- "name": "delete-train-index",
- "operation-type": "delete-index",
- "only-if-exists": true,
- "index": "{{ train_index_name }}"
- }
- },
- {
- "operation": {
- "operation-type": "delete-model",
- "name": "delete-model",
- "model_id": "{{ train_model_id }}"
- }
- },
- {
- "operation": {
- "name": "create-train-index",
- "operation-type": "create-index",
- "index": "{{ train_index_name }}"
- }
- },
- {
- "name": "wait-for-train-index-to-be-green",
- "operation": "cluster-health",
- "request-params": {
- "wait_for_status": "green"
- }
- },
- {
- "operation": {
- "name": "train-vector-bulk",
- "operation-type": "custom-vector-bulk",
- "param-source": "bulk-from-data-set",
- "index": "{{ train_index_name }}",
- "field": "{{ train_field_name }}",
- "bulk_size": {{ train_index_bulk_size }},
- "data_set_format": "{{ train_index_data_set_format }}",
- "data_set_path": "{{ train_index_data_set_path }}",
- "num_vectors": {{ train_index_num_vectors }}
- },
- "clients": {{ train_index_bulk_index_clients }}
- },
- {
- "operation": {
- "name": "refresh-train-index",
- "operation-type": "custom-refresh",
- "index": "{{ train_index_name }}",
- "retries": 100
- }
- },
- {
- "operation": "{{ train_model_method }}-train-model"
- },
- {
- "operation": {
- "name": "create-target-index",
- "operation-type": "create-index",
- "index": "{{ target_index_name }}"
- }
- },
- {
- "name": "wait-for-target-index-to-be-green",
- "operation": "cluster-health",
- "request-params": {
- "wait_for_status": "green"
- }
- },
- {
- "operation": {
- "name": "custom-vector-bulk",
- "operation-type": "custom-vector-bulk",
- "param-source": "bulk-from-data-set",
- "index": "{{ target_index_name }}",
- "field": "{{ target_field_name }}",
- "bulk_size": {{ target_index_bulk_size }},
- "data_set_format": "{{ target_index_bulk_index_data_set_format }}",
- "data_set_path": "{{ target_index_bulk_index_data_set_path }}"
- },
- "clients": {{ target_index_bulk_index_clients }}
- },
- {
- "operation": {
- "name": "refresh-target-index",
- "operation-type": "custom-refresh",
- "index": "{{ target_index_name }}",
- "retries": 100
- }
- },
- {
- "operation": {
- "name": "force-merge",
- "operation-type": "force-merge",
- "request-timeout": {{ target_index_force_merge_timeout }},
- "index": "{{ target_index_name }}",
- "mode": "polling",
- "max-num-segments": {{ target_index_max_num_segments }}
- }
- },
- {
- "operation": {
- "name": "knn-query-from-data-set",
- "operation-type": "search",
- "index": "{{ target_index_name }}",
- "param-source": "knn-query-from-data-set",
- "k": {{ query_k }},
- "field": "{{ target_field_name }}",
- "data_set_format": "{{ query_data_set_format }}",
- "data_set_path": "{{ query_data_set_path }}"
- },
- "clients": {{ query_clients }}
- }
- ]
-}
diff --git a/benchmarks/osb/requirements.in b/benchmarks/osb/requirements.in
deleted file mode 100644
index a9e12b5d3..000000000
--- a/benchmarks/osb/requirements.in
+++ /dev/null
@@ -1,4 +0,0 @@
-opensearch-py
-numpy
-h5py
-opensearch-benchmark
diff --git a/benchmarks/osb/requirements.txt b/benchmarks/osb/requirements.txt
deleted file mode 100644
index a220ee44f..000000000
--- a/benchmarks/osb/requirements.txt
+++ /dev/null
@@ -1,96 +0,0 @@
-#
-# This file is autogenerated by pip-compile with python 3.8
-# To update, run:
-#
-# pip-compile
-#
-aiohttp==3.9.4
- # via opensearch-py
-aiosignal==1.2.0
- # via aiohttp
-async-timeout==4.0.2
- # via aiohttp
-attrs==21.4.0
- # via
- # aiohttp
- # jsonschema
-cachetools==4.2.4
- # via google-auth
-certifi==2023.7.22
- # via
- # opensearch-benchmark
- # opensearch-py
-frozenlist==1.3.0
- # via
- # aiohttp
- # aiosignal
-google-auth==1.22.1
- # via opensearch-benchmark
-google-crc32c==1.3.0
- # via google-resumable-media
-google-resumable-media==1.1.0
- # via opensearch-benchmark
-h5py==3.6.0
- # via -r requirements.in
-idna==3.7
- # via yarl
-ijson==2.6.1
- # via opensearch-benchmark
-importlib-metadata==4.11.3
- # via jsonschema
-jinja2==3.1.3
- # via opensearch-benchmark
-jsonschema==3.1.1
- # via opensearch-benchmark
-markupsafe==2.0.1
- # via
- # jinja2
- # opensearch-benchmark
-multidict==6.0.2
- # via
- # aiohttp
- # yarl
-numpy==1.24.2
- # via
- # -r requirements.in
- # h5py
-opensearch-benchmark==0.0.2
- # via -r requirements.in
-opensearch-py[async]==1.0.0
- # via
- # -r requirements.in
- # opensearch-benchmark
-psutil==5.8.0
- # via opensearch-benchmark
-py-cpuinfo==7.0.0
- # via opensearch-benchmark
-pyasn1==0.4.8
- # via
- # pyasn1-modules
- # rsa
-pyasn1-modules==0.2.8
- # via google-auth
-pyrsistent==0.18.1
- # via jsonschema
-rsa==4.8
- # via google-auth
-six==1.16.0
- # via
- # google-auth
- # google-resumable-media
- # jsonschema
-tabulate==0.8.7
- # via opensearch-benchmark
-thespian==3.10.1
- # via opensearch-benchmark
-urllib3==1.26.18
- # via opensearch-py
-yappi==1.2.3
- # via opensearch-benchmark
-yarl==1.7.2
- # via aiohttp
-zipp==3.7.0
- # via importlib-metadata
-
-# The following packages are considered to be unsafe in a requirements file:
-# setuptools
diff --git a/benchmarks/osb/tests/__init__.py b/benchmarks/osb/tests/__init__.py
deleted file mode 100644
index e69de29bb..000000000
diff --git a/benchmarks/osb/tests/data_set_helper.py b/benchmarks/osb/tests/data_set_helper.py
deleted file mode 100644
index 2b144da49..000000000
--- a/benchmarks/osb/tests/data_set_helper.py
+++ /dev/null
@@ -1,197 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-from abc import ABC, abstractmethod
-
-import h5py
-import numpy as np
-
-from osb.extensions.data_set import Context, HDF5DataSet, BigANNVectorDataSet
-
-""" Module containing utility classes and functions for working with data sets.
-
-Included are utilities that can be used to build data sets and write them to
-paths.
-"""
-
-
-class DataSetBuildContext:
- """ Data class capturing information needed to build a particular data set
-
- Attributes:
- data_set_context: Indicator of what the data set is used for,
- vectors: A 2D array containing vectors that are used to build data set.
- path: string representing path where data set should be serialized to.
- """
- def __init__(self, data_set_context: Context, vectors: np.ndarray, path: str):
- self.data_set_context: Context = data_set_context
- self.vectors: np.ndarray = vectors #TODO: Validate shape
- self.path: str = path
-
- def get_num_vectors(self) -> int:
- return self.vectors.shape[0]
-
- def get_dimension(self) -> int:
- return self.vectors.shape[1]
-
- def get_type(self) -> np.dtype:
- return self.vectors.dtype
-
-
-class DataSetBuilder(ABC):
- """ Abstract builder used to create a build a collection of data sets
-
- Attributes:
- data_set_build_contexts: list of data set build contexts that builder
- will build.
- """
- def __init__(self):
- self.data_set_build_contexts = list()
-
- def add_data_set_build_context(self, data_set_build_context: DataSetBuildContext):
- """ Adds a data set build context to list of contexts to be built.
-
- Args:
- data_set_build_context: DataSetBuildContext to be added to list
-
- Returns: Updated DataSetBuilder
-
- """
- self._validate_data_set_context(data_set_build_context)
- self.data_set_build_contexts.append(data_set_build_context)
- return self
-
- def build(self):
- """ Builds and serializes all data sets build contexts
-
- Returns:
-
- """
- [self._build_data_set(data_set_build_context) for data_set_build_context
- in self.data_set_build_contexts]
-
- @abstractmethod
- def _build_data_set(self, context: DataSetBuildContext):
- """ Builds an individual data set
-
- Args:
- context: DataSetBuildContext of data set to be built
-
- Returns:
-
- """
- pass
-
- @abstractmethod
- def _validate_data_set_context(self, context: DataSetBuildContext):
- """ Validates that data set context can be added to this builder
-
- Args:
- context: DataSetBuildContext to be validated
-
- Returns:
-
- """
- pass
-
-
-class HDF5Builder(DataSetBuilder):
-
- def __init__(self):
- super(HDF5Builder, self).__init__()
- self.data_set_meta_data = dict()
-
- def _validate_data_set_context(self, context: DataSetBuildContext):
- if context.path not in self.data_set_meta_data.keys():
- self.data_set_meta_data[context.path] = {
- context.data_set_context: context
- }
- return
-
- if context.data_set_context in \
- self.data_set_meta_data[context.path].keys():
- raise IllegalDataSetBuildContext("Path and context for data set "
- "are already present in builder.")
-
- self.data_set_meta_data[context.path][context.data_set_context] = \
- context
-
- @staticmethod
- def _validate_extension(context: DataSetBuildContext):
- ext = context.path.split('.')[-1]
-
- if ext != HDF5DataSet.FORMAT_NAME:
- raise IllegalDataSetBuildContext("Invalid file extension")
-
- def _build_data_set(self, context: DataSetBuildContext):
- # For HDF5, multiple data sets can be grouped in the same file, so the
- # file is opened in append mode and this data set is added to it
- with h5py.File(context.path, 'a') as hf:
- hf.create_dataset(
- HDF5DataSet.parse_context(context.data_set_context),
- data=context.vectors
- )
-
-
-class BigANNBuilder(DataSetBuilder):
-
- def _validate_data_set_context(self, context: DataSetBuildContext):
- self._validate_extension(context)
-
- # prevent the duplication of paths for data sets
- data_set_paths = [c.path for c in self.data_set_build_contexts]
- if any(data_set_paths.count(x) > 1 for x in data_set_paths):
- raise IllegalDataSetBuildContext("Build context paths have to be "
- "unique.")
-
- @staticmethod
- def _validate_extension(context: DataSetBuildContext):
- ext = context.path.split('.')[-1]
-
- if ext != BigANNVectorDataSet.U8BIN_EXTENSION and ext != \
- BigANNVectorDataSet.FBIN_EXTENSION:
- raise IllegalDataSetBuildContext("Invalid file extension")
-
- if ext == BigANNVectorDataSet.U8BIN_EXTENSION and context.get_type() != \
- np.uint8:
- raise IllegalDataSetBuildContext("Invalid data type for {} ext."
- .format(BigANNVectorDataSet
- .U8BIN_EXTENSION))
-
- if ext == BigANNVectorDataSet.FBIN_EXTENSION and context.get_type() != \
- np.float32:
- print(context.get_type())
- raise IllegalDataSetBuildContext("Invalid data type for {} ext."
- .format(BigANNVectorDataSet
- .FBIN_EXTENSION))
-
- def _build_data_set(self, context: DataSetBuildContext):
- num_vectors = context.get_num_vectors()
- dimension = context.get_dimension()
-
- with open(context.path, 'wb') as f:
- f.write(int.to_bytes(num_vectors, 4, "little"))
- f.write(int.to_bytes(dimension, 4, "little"))
- context.vectors.tofile(f)
-
-
-def create_random_2d_array(num_vectors: int, dimension: int) -> np.ndarray:
- rng = np.random.default_rng()
- return rng.random(size=(num_vectors, dimension), dtype=np.float32)
-
-
-class IllegalDataSetBuildContext(Exception):
- """Exception raised when passed in DataSetBuildContext is illegal
-
- Attributes:
- message -- explanation of the error
- """
-
- def __init__(self, message: str):
- self.message = f'{message}'
- super().__init__(self.message)
-
diff --git a/benchmarks/osb/tests/test_param_sources.py b/benchmarks/osb/tests/test_param_sources.py
deleted file mode 100644
index cda730cee..000000000
--- a/benchmarks/osb/tests/test_param_sources.py
+++ /dev/null
@@ -1,353 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-import os
-import random
-import shutil
-import string
-import sys
-import tempfile
-import unittest
-
-# Add parent directory to path
-import numpy as np
-
-sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
-
-from osb.tests.data_set_helper import HDF5Builder, create_random_2d_array, \
- DataSetBuildContext, BigANNBuilder
-from osb.extensions.data_set import Context, HDF5DataSet
-from osb.extensions.param_sources import VectorsFromDataSetParamSource, \
- QueryVectorsFromDataSetParamSource, BulkVectorsFromDataSetParamSource
-from osb.extensions.util import ConfigurationError
-
-DEFAULT_INDEX_NAME = "test-index"
-DEFAULT_FIELD_NAME = "test-field"
-DEFAULT_CONTEXT = Context.INDEX
-DEFAULT_TYPE = HDF5DataSet.FORMAT_NAME
-DEFAULT_NUM_VECTORS = 10
-DEFAULT_DIMENSION = 10
-DEFAULT_RANDOM_STRING_LENGTH = 8
-
-
-class VectorsFromDataSetParamSourceTestCase(unittest.TestCase):
-
- def setUp(self) -> None:
- self.data_set_dir = tempfile.mkdtemp()
-
- # Create a data set we know to be valid for convenience
- self.valid_data_set_path = _create_data_set(
- DEFAULT_NUM_VECTORS,
- DEFAULT_DIMENSION,
- DEFAULT_TYPE,
- DEFAULT_CONTEXT,
- self.data_set_dir
- )
-
- def tearDown(self):
- shutil.rmtree(self.data_set_dir)
-
- def test_missing_params(self):
- empty_params = dict()
- self.assertRaises(
- ConfigurationError,
- lambda: VectorsFromDataSetParamSourceTestCase.
- TestVectorsFromDataSetParamSource(empty_params, DEFAULT_CONTEXT)
- )
-
- def test_invalid_data_set_format(self):
- invalid_data_set_format = "invalid-data-set-format"
-
- test_param_source_params = {
- "index": DEFAULT_INDEX_NAME,
- "field": DEFAULT_FIELD_NAME,
- "data_set_format": invalid_data_set_format,
- "data_set_path": self.valid_data_set_path,
- }
- self.assertRaises(
- ConfigurationError,
- lambda: self.TestVectorsFromDataSetParamSource(
- test_param_source_params,
- DEFAULT_CONTEXT
- )
- )
-
- def test_invalid_data_set_path(self):
- invalid_data_set_path = "invalid-data-set-path"
- test_param_source_params = {
- "index": DEFAULT_INDEX_NAME,
- "field": DEFAULT_FIELD_NAME,
- "data_set_format": HDF5DataSet.FORMAT_NAME,
- "data_set_path": invalid_data_set_path,
- }
- self.assertRaises(
- FileNotFoundError,
- lambda: self.TestVectorsFromDataSetParamSource(
- test_param_source_params,
- DEFAULT_CONTEXT
- )
- )
-
- def test_partition_hdf5(self):
- num_vectors = 100
-
- hdf5_data_set_path = _create_data_set(
- num_vectors,
- DEFAULT_DIMENSION,
- HDF5DataSet.FORMAT_NAME,
- DEFAULT_CONTEXT,
- self.data_set_dir
- )
-
- test_param_source_params = {
- "index": DEFAULT_INDEX_NAME,
- "field": DEFAULT_FIELD_NAME,
- "data_set_format": HDF5DataSet.FORMAT_NAME,
- "data_set_path": hdf5_data_set_path,
- }
- test_param_source = self.TestVectorsFromDataSetParamSource(
- test_param_source_params,
- DEFAULT_CONTEXT
- )
-
- num_partitions = 10
- vecs_per_partition = test_param_source.num_vectors // num_partitions
-
- self._test_partition(
- test_param_source,
- num_partitions,
- vecs_per_partition
- )
-
- def test_partition_bigann(self):
- num_vectors = 100
- float_extension = "fbin"
-
- bigann_data_set_path = _create_data_set(
- num_vectors,
- DEFAULT_DIMENSION,
- float_extension,
- DEFAULT_CONTEXT,
- self.data_set_dir
- )
-
- test_param_source_params = {
- "index": DEFAULT_INDEX_NAME,
- "field": DEFAULT_FIELD_NAME,
- "data_set_format": "bigann",
- "data_set_path": bigann_data_set_path,
- }
- test_param_source = self.TestVectorsFromDataSetParamSource(
- test_param_source_params,
- DEFAULT_CONTEXT
- )
-
- num_partitions = 10
- vecs_per_partition = test_param_source.num_vectors // num_partitions
-
- self._test_partition(
- test_param_source,
- num_partitions,
- vecs_per_partition
- )
-
- def _test_partition(
- self,
- test_param_source: VectorsFromDataSetParamSource,
- num_partitions: int,
- vec_per_partition: int
- ):
- for i in range(num_partitions):
- test_param_source_i = test_param_source.partition(i, num_partitions)
- self.assertEqual(test_param_source_i.num_vectors, vec_per_partition)
- self.assertEqual(test_param_source_i.offset, i * vec_per_partition)
-
- class TestVectorsFromDataSetParamSource(VectorsFromDataSetParamSource):
- """
- Empty implementation of ABC VectorsFromDataSetParamSource so that we can
- test the concrete methods.
- """
-
- def params(self):
- pass
-
-
-class QueryVectorsFromDataSetParamSourceTestCase(unittest.TestCase):
-
- def setUp(self) -> None:
- self.data_set_dir = tempfile.mkdtemp()
-
- def tearDown(self):
- shutil.rmtree(self.data_set_dir)
-
- def test_params(self):
- # Create a data set
- k = 12
- data_set_path = _create_data_set(
- DEFAULT_NUM_VECTORS,
- DEFAULT_DIMENSION,
- DEFAULT_TYPE,
- Context.QUERY,
- self.data_set_dir
- )
-
- # Create a QueryVectorsFromDataSetParamSource with relevant params
- test_param_source_params = {
- "index": DEFAULT_INDEX_NAME,
- "field": DEFAULT_FIELD_NAME,
- "data_set_format": DEFAULT_TYPE,
- "data_set_path": data_set_path,
- "k": k,
- }
- query_param_source = QueryVectorsFromDataSetParamSource(
- None, test_param_source_params
- )
-
- # Check each set of params returned by the param source
- for i in range(DEFAULT_NUM_VECTORS):
- self._check_params(
- query_param_source.params(),
- DEFAULT_INDEX_NAME,
- DEFAULT_FIELD_NAME,
- DEFAULT_DIMENSION,
- k
- )
-
- # Assert the next call raises StopIteration
- self.assertRaises(
- StopIteration,
- lambda: query_param_source.params()
- )
-
- def _check_params(
- self,
- params: dict,
- expected_index: str,
- expected_field: str,
- expected_dimension: int,
- expected_k: int
- ):
- index_name = params.get("index")
- self.assertEqual(expected_index, index_name)
- body = params.get("body")
- self.assertIsInstance(body, dict)
- query = body.get("query")
- self.assertIsInstance(query, dict)
- query_knn = query.get("knn")
- self.assertIsInstance(query_knn, dict)
- field = query_knn.get(expected_field)
- self.assertIsInstance(field, dict)
- vector = field.get("vector")
- self.assertIsInstance(vector, np.ndarray)
- self.assertEqual(len(list(vector)), expected_dimension)
- k = field.get("k")
- self.assertEqual(k, expected_k)
-
-
-class BulkVectorsFromDataSetParamSourceTestCase(unittest.TestCase):
-
- def setUp(self) -> None:
- self.data_set_dir = tempfile.mkdtemp()
-
- def tearDown(self):
- shutil.rmtree(self.data_set_dir)
-
- def test_params(self):
- num_vectors = 49
- bulk_size = 10
- data_set_path = _create_data_set(
- num_vectors,
- DEFAULT_DIMENSION,
- DEFAULT_TYPE,
- Context.INDEX,
- self.data_set_dir
- )
-
- test_param_source_params = {
- "index": DEFAULT_INDEX_NAME,
- "field": DEFAULT_FIELD_NAME,
- "data_set_format": DEFAULT_TYPE,
- "data_set_path": data_set_path,
- "bulk_size": bulk_size
- }
- bulk_param_source = BulkVectorsFromDataSetParamSource(
- None, test_param_source_params
- )
-
- # Check each payload returned
- vectors_consumed = 0
- while vectors_consumed < num_vectors:
- expected_num_vectors = min(num_vectors - vectors_consumed, bulk_size)
- self._check_params(
- bulk_param_source.params(),
- DEFAULT_INDEX_NAME,
- DEFAULT_FIELD_NAME,
- DEFAULT_DIMENSION,
- expected_num_vectors
- )
- vectors_consumed += expected_num_vectors
-
- # Assert the next call raises StopIteration
- self.assertRaises(
- StopIteration,
- lambda: bulk_param_source.params()
- )
-
- def _check_params(
- self,
- params: dict,
- expected_index: str,
- expected_field: str,
- expected_dimension: int,
- expected_num_vectors_in_payload: int
- ):
- size = params.get("size")
- self.assertEqual(size, expected_num_vectors_in_payload)
- body = params.get("body")
- self.assertIsInstance(body, list)
- self.assertEqual(len(body) // 2, expected_num_vectors_in_payload)
-
- # Each document in the bulk payload has 2 parts: the first is the header
- # and the second is the body. The header has the index name and the body
- # has the vector
- for header, req_body in zip(*[iter(body)] * 2):
- index = header.get("index")
- self.assertIsInstance(index, dict)
- index_name = index.get("_index")
- self.assertEqual(index_name, expected_index)
-
- vector = req_body.get(expected_field)
- self.assertIsInstance(vector, list)
- self.assertEqual(len(vector), expected_dimension)
-
-
-def _create_data_set(
- num_vectors: int,
- dimension: int,
- extension: str,
- data_set_context: Context,
- data_set_dir
-) -> str:
-
- file_name_base = ''.join(random.choice(string.ascii_letters) for _ in
- range(DEFAULT_RANDOM_STRING_LENGTH))
- data_set_file_name = "{}.{}".format(file_name_base, extension)
- data_set_path = os.path.join(data_set_dir, data_set_file_name)
- context = DataSetBuildContext(
- data_set_context,
- create_random_2d_array(num_vectors, dimension),
- data_set_path)
-
- if extension == HDF5DataSet.FORMAT_NAME:
- HDF5Builder().add_data_set_build_context(context).build()
- else:
- BigANNBuilder().add_data_set_build_context(context).build()
-
- return data_set_path
-
-
-if __name__ == '__main__':
- unittest.main()
diff --git a/benchmarks/osb/workload.json b/benchmarks/osb/workload.json
deleted file mode 100644
index bd0d84195..000000000
--- a/benchmarks/osb/workload.json
+++ /dev/null
@@ -1,17 +0,0 @@
-{% import "benchmark.helpers" as benchmark with context %}
-{
- "version": 2,
- "description": "k-NN Plugin train workload",
- "indices": [
- {
- "name": "{{ target_index_name }}",
- "body": "{{ target_index_body }}"
- },
- {
- "name": "{{ train_index_name }}",
- "body": "{{ train_index_body }}"
- }
- ],
- "operations": {{ benchmark.collect(parts="operations/*.json") }},
- "test_procedures": [{{ benchmark.collect(parts="procedures/*.json") }}]
-}
diff --git a/benchmarks/osb/workload.py b/benchmarks/osb/workload.py
deleted file mode 100644
index 32e6ad02c..000000000
--- a/benchmarks/osb/workload.py
+++ /dev/null
@@ -1,18 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-# This code needs to be included at the top of every workload.py file.
-# OpenSearch Benchmarks is not able to find other helper files unless the path
-# is updated.
-import os
-import sys
-sys.path.append(os.path.abspath(os.getcwd()))
-
-from extensions.registry import register as custom_register
-
-
-def register(registry):
- custom_register(registry)
diff --git a/benchmarks/perf-tool/.pylintrc b/benchmarks/perf-tool/.pylintrc
deleted file mode 100644
index 15bf4ccc3..000000000
--- a/benchmarks/perf-tool/.pylintrc
+++ /dev/null
@@ -1,443 +0,0 @@
-# This Pylint rcfile contains a best-effort configuration to uphold the
-# best-practices and style described in the Google Python style guide:
-# https://google.github.io/styleguide/pyguide.html
-#
-# Its canonical open-source location is:
-# https://google.github.io/styleguide/pylintrc
-
-[MASTER]
-
-fail-under=9.0
-
-# Files or directories to be skipped. They should be base names, not paths.
-ignore=third_party
-
-# Files or directories matching the regex patterns are skipped. The regex
-# matches against base names, not paths.
-ignore-patterns=
-
-# Pickle collected data for later comparisons.
-persistent=no
-
-# List of plugins (as comma separated values of python modules names) to load,
-# usually to register additional checkers.
-load-plugins=
-
-# Use multiple processes to speed up Pylint.
-jobs=4
-
-# Allow loading of arbitrary C extensions. Extensions are imported into the
-# active Python interpreter and may run arbitrary code.
-unsafe-load-any-extension=no
-
-
-[MESSAGES CONTROL]
-
-# Only show warnings with the listed confidence levels. Leave empty to show
-# all. Valid levels: HIGH, INFERENCE, INFERENCE_FAILURE, UNDEFINED
-confidence=
-
-# Enable the message, report, category or checker with the given id(s). You can
-# either give multiple identifier separated by comma (,) or put this option
-# multiple time (only on the command line, not in the configuration file where
-# it should appear only once). See also the "--disable" option for examples.
-#enable=
-
-# Disable the message, report, category or checker with the given id(s). You
-# can either give multiple identifiers separated by comma (,) or put this
-# option multiple times (only on the command line, not in the configuration
-# file where it should appear only once).You can also use "--disable=all" to
-# disable everything first and then reenable specific checks. For example, if
-# you want to run only the similarities checker, you can use "--disable=all
-# --enable=similarities". If you want to run only the classes checker, but have
-# no Warning level messages displayed, use"--disable=all --enable=classes
-# --disable=W"
-disable=abstract-method,
- apply-builtin,
- arguments-differ,
- attribute-defined-outside-init,
- backtick,
- bad-option-value,
- basestring-builtin,
- buffer-builtin,
- c-extension-no-member,
- consider-using-enumerate,
- cmp-builtin,
- cmp-method,
- coerce-builtin,
- coerce-method,
- delslice-method,
- div-method,
- duplicate-code,
- eq-without-hash,
- execfile-builtin,
- file-builtin,
- filter-builtin-not-iterating,
- fixme,
- getslice-method,
- global-statement,
- hex-method,
- idiv-method,
- implicit-str-concat-in-sequence,
- import-error,
- import-self,
- import-star-module-level,
- inconsistent-return-statements,
- input-builtin,
- intern-builtin,
- invalid-str-codec,
- locally-disabled,
- long-builtin,
- long-suffix,
- map-builtin-not-iterating,
- misplaced-comparison-constant,
- missing-function-docstring,
- metaclass-assignment,
- next-method-called,
- next-method-defined,
- no-absolute-import,
- no-else-break,
- no-else-continue,
- no-else-raise,
- no-else-return,
- no-init, # added
- no-member,
- no-name-in-module,
- no-self-use,
- nonzero-method,
- oct-method,
- old-division,
- old-ne-operator,
- old-octal-literal,
- old-raise-syntax,
- parameter-unpacking,
- print-statement,
- raising-string,
- range-builtin-not-iterating,
- raw_input-builtin,
- rdiv-method,
- reduce-builtin,
- relative-import,
- reload-builtin,
- round-builtin,
- setslice-method,
- signature-differs,
- standarderror-builtin,
- suppressed-message,
- sys-max-int,
- too-few-public-methods,
- too-many-ancestors,
- too-many-arguments,
- too-many-boolean-expressions,
- too-many-branches,
- too-many-instance-attributes,
- too-many-locals,
- too-many-nested-blocks,
- too-many-public-methods,
- too-many-return-statements,
- too-many-statements,
- trailing-newlines,
- unichr-builtin,
- unicode-builtin,
- unnecessary-pass,
- unpacking-in-except,
- useless-else-on-loop,
- useless-object-inheritance,
- useless-suppression,
- using-cmp-argument,
- wrong-import-order,
- xrange-builtin,
- zip-builtin-not-iterating,
-
-
-[REPORTS]
-
-# Set the output format. Available formats are text, parseable, colorized, msvs
-# (visual studio) and html. You can also give a reporter class, eg
-# mypackage.mymodule.MyReporterClass.
-output-format=text
-
-# Put messages in a separate file for each module / package specified on the
-# command line instead of printing them on stdout. Reports (if any) will be
-# written in a file name "pylint_global.[txt|html]". This option is deprecated
-# and it will be removed in Pylint 2.0.
-files-output=no
-
-# Tells whether to display a full report or only the messages
-reports=no
-
-# Python expression which should return a note less than 10 (10 is the highest
-# note). You have access to the variables errors warning, statement which
-# respectively contain the number of errors / warnings messages and the total
-# number of statements analyzed. This is used by the global evaluation report
-# (RP0004).
-evaluation=10.0 - ((float(5 * error + warning + refactor + convention) / statement) * 10)
-
-# Template used to display messages. This is a python new-style format string
-# used to format the message information. See doc for all details
-#msg-template=
-
-
-[BASIC]
-
-# Good variable names which should always be accepted, separated by a comma
-good-names=main,_
-
-# Bad variable names which should always be refused, separated by a comma
-bad-names=
-
-# Colon-delimited sets of names that determine each other's naming style when
-# the name regexes allow several styles.
-name-group=
-
-# Include a hint for the correct naming format with invalid-name
-include-naming-hint=no
-
-# List of decorators that produce properties, such as abc.abstractproperty. Add
-# to this list to register other decorators that produce valid properties.
-property-classes=abc.abstractproperty,cached_property.cached_property,cached_property.threaded_cached_property,cached_property.cached_property_with_ttl,cached_property.threaded_cached_property_with_ttl
-
-# Regular expression matching correct function names
-function-rgx=^(?:(?P<exempt>setUp|tearDown|setUpModule|tearDownModule)|(?P<camel_case>_?[A-Z][a-zA-Z0-9]*)|(?P<snake_case>_?[a-z][a-z0-9_]*))$
-
-# Regular expression matching correct variable names
-variable-rgx=^[a-z][a-z0-9_]*$
-
-# Regular expression matching correct constant names
-const-rgx=^(_?[A-Z][A-Z0-9_]*|__[a-z0-9_]+__|_?[a-z][a-z0-9_]*)$
-
-# Regular expression matching correct attribute names
-attr-rgx=^_{0,2}[a-z][a-z0-9_]*$
-
-# Regular expression matching correct argument names
-argument-rgx=^[a-z][a-z0-9_]*$
-
-# Regular expression matching correct class attribute names
-class-attribute-rgx=^(_?[A-Z][A-Z0-9_]*|__[a-z0-9_]+__|_?[a-z][a-z0-9_]*)$
-
-# Regular expression matching correct inline iteration names
-inlinevar-rgx=^[a-z][a-z0-9_]*$
-
-# Regular expression matching correct class names
-class-rgx=^_?[A-Z][a-zA-Z0-9]*$
-
-# Regular expression matching correct module names
-module-rgx=^(_?[a-z][a-z0-9_]*|__init__)$
-
-# Regular expression matching correct method names
-method-rgx=(?x)^(?:(?P<exempt>_[a-z0-9_]+__|runTest|setUp|tearDown|setUpTestCase|tearDownTestCase|setupSelf|tearDownClass|setUpClass|(test|assert)_*[A-Z0-9][a-zA-Z0-9_]*|next)|(?P<camel_case>_{0,2}[A-Z][a-zA-Z0-9_]*)|(?P<snake_case>_{0,2}[a-z][a-z0-9_]*))$
-
-# Regular expression which should only match function or class names that do
-# not require a docstring.
-no-docstring-rgx=(__.*__|main|test.*|.*test|.*Test)$
-
-# Minimum line length for functions/classes that require docstrings, shorter
-# ones are exempt.
-docstring-min-length=10
-
-
-[TYPECHECK]
-
-# List of decorators that produce context managers, such as
-# contextlib.contextmanager. Add to this list to register other decorators that
-# produce valid context managers.
-contextmanager-decorators=contextlib.contextmanager,contextlib2.contextmanager
-
-# Tells whether missing members accessed in mixin class should be ignored. A
-# mixin class is detected if its name ends with "mixin" (case insensitive).
-ignore-mixin-members=yes
-
-# List of module names for which member attributes should not be checked
-# (useful for modules/projects where namespaces are manipulated during runtime
-# and thus existing member attributes cannot be deduced by static analysis. It
-# supports qualified module names, as well as Unix pattern matching.
-ignored-modules=
-
-# List of class names for which member attributes should not be checked (useful
-# for classes with dynamically set attributes). This supports the use of
-# qualified names.
-ignored-classes=optparse.Values,thread._local,_thread._local
-
-# List of members which are set dynamically and missed by pylint inference
-# system, and so shouldn't trigger E1101 when accessed. Python regular
-# expressions are accepted.
-generated-members=
-
-
-[FORMAT]
-
-# Maximum number of characters on a single line.
-max-line-length=80
-
-# TODO(https://github.com/PyCQA/pylint/issues/3352): Direct pylint to exempt
-# lines made too long by directives to pytype.
-
-# Regexp for a line that is allowed to be longer than the limit.
-ignore-long-lines=(?x)(
- ^\s*(\#\ )?<?https?://\S+>?$|
- ^\s*(from\s+\S+\s+)?import\s+.+$)
-
-# Allow the body of an if to be on the same line as the test if there is no
-# else.
-single-line-if-stmt=yes
-
-# List of optional constructs for which whitespace checking is disabled. `dict-
-# separator` is used to allow tabulation in dicts, etc.: {1 : 1,\n222: 2}.
-# `trailing-comma` allows a space between comma and closing bracket: (a, ).
-# `empty-line` allows space-only lines.
-no-space-check=
-
-# Maximum number of lines in a module
-max-module-lines=99999
-
-# String used as indentation unit. The internal Google style guide mandates 2
-# spaces. Google's externally-published style guide says 4, consistent with
-# PEP 8. Here, we use 2 spaces, for conformity with many open-sourced Google
-# projects (like TensorFlow).
-indent-string=' '
-
-# Number of spaces of indent required inside a hanging or continued line.
-indent-after-paren=4
-
-# Expected format of line ending, e.g. empty (any line ending), LF or CRLF.
-expected-line-ending-format=
-
-
-[MISCELLANEOUS]
-
-# List of note tags to take in consideration, separated by a comma.
-notes=TODO
-
-
-[STRING]
-
-# This flag controls whether inconsistent-quotes generates a warning when the
-# character used as a quote delimiter is used inconsistently within a module.
-check-quote-consistency=yes
-
-
-[VARIABLES]
-
-# Tells whether we should check for unused import in __init__ files.
-init-import=no
-
-# A regular expression matching the name of dummy variables (i.e. expectedly
-# not used).
-dummy-variables-rgx=^\*{0,2}(_$|unused_|dummy_)
-
-# List of additional names supposed to be defined in builtins. Remember that
-# you should avoid to define new builtins when possible.
-additional-builtins=
-
-# List of strings which can identify a callback function by name. A callback
-# name must start or end with one of those strings.
-callbacks=cb_,_cb
-
-# List of qualified module names which can have objects that can redefine
-# builtins.
-redefining-builtins-modules=six,six.moves,past.builtins,future.builtins,functools
-
-
-[LOGGING]
-
-# Logging modules to check that the string format arguments are in logging
-# function parameter format
-logging-modules=logging,absl.logging,tensorflow.io.logging
-
-
-[SIMILARITIES]
-
-# Minimum lines number of a similarity.
-min-similarity-lines=4
-
-# Ignore comments when computing similarities.
-ignore-comments=yes
-
-# Ignore docstrings when computing similarities.
-ignore-docstrings=yes
-
-# Ignore imports when computing similarities.
-ignore-imports=no
-
-
-[SPELLING]
-
-# Spelling dictionary name. Available dictionaries: none. To make it working
-# install python-enchant package.
-spelling-dict=
-
-# List of comma separated words that should not be checked.
-spelling-ignore-words=
-
-# A path to a file that contains private dictionary; one word per line.
-spelling-private-dict-file=
-
-# Tells whether to store unknown words to indicated private dictionary in
-# --spelling-private-dict-file option instead of raising a message.
-spelling-store-unknown-words=no
-
-
-[IMPORTS]
-
-# Deprecated modules which should not be used, separated by a comma
-deprecated-modules=regsub,
- TERMIOS,
- Bastion,
- rexec,
- sets
-
-# Create a graph of every (i.e. internal and external) dependencies in the
-# given file (report RP0402 must not be disabled)
-import-graph=
-
-# Create a graph of external dependencies in the given file (report RP0402 must
-# not be disabled)
-ext-import-graph=
-
-# Create a graph of internal dependencies in the given file (report RP0402 must
-# not be disabled)
-int-import-graph=
-
-# Force import order to recognize a module as part of the standard
-# compatibility libraries.
-known-standard-library=
-
-# Force import order to recognize a module as part of a third party library.
-known-third-party=enchant, absl
-
-# Analyse import fallback blocks. This can be used to support both Python 2 and
-# 3 compatible code, which means that the block might have code that exists
-# only in one or another interpreter, leading to false positives when analysed.
-analyse-fallback-blocks=no
-
-
-[CLASSES]
-
-# List of method names used to declare (i.e. assign) instance attributes.
-defining-attr-methods=__init__,
- __new__,
- setUp
-
-# List of member names, which should be excluded from the protected access
-# warning.
-exclude-protected=_asdict,
- _fields,
- _replace,
- _source,
- _make
-
-# List of valid names for the first argument in a class method.
-valid-classmethod-first-arg=cls,
- class_
-
-# List of valid names for the first argument in a metaclass class method.
-valid-metaclass-classmethod-first-arg=mcs
-
-
-[EXCEPTIONS]
-
-# Exceptions that will emit a warning when being caught. Defaults to
-# "Exception"
-overgeneral-exceptions=StandardError,
- Exception,
- BaseException
diff --git a/benchmarks/perf-tool/.style.yapf b/benchmarks/perf-tool/.style.yapf
deleted file mode 100644
index 39b663a7a..000000000
--- a/benchmarks/perf-tool/.style.yapf
+++ /dev/null
@@ -1,10 +0,0 @@
-[style]
-COLUMN_LIMIT: 80
-DEDENT_CLOSING_BRACKETS: True
-INDENT_DICTIONARY_VALUE: True
-SPLIT_ALL_COMMA_SEPARATED_VALUES: True
-SPLIT_ARGUMENTS_WHEN_COMMA_TERMINATED: True
-SPLIT_BEFORE_CLOSING_BRACKET: True
-SPLIT_BEFORE_EXPRESSION_AFTER_OPENING_PAREN: True
-SPLIT_BEFORE_FIRST_ARGUMENT: True
-SPLIT_BEFORE_NAMED_ASSIGNS: True
diff --git a/benchmarks/perf-tool/README.md b/benchmarks/perf-tool/README.md
deleted file mode 100644
index 36f76bcdb..000000000
--- a/benchmarks/perf-tool/README.md
+++ /dev/null
@@ -1,449 +0,0 @@
-# IMPORTANT NOTE: No new features will be added to this tool. This tool is currently in maintenance mode. All new features will be added to the [vector search workload](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch).
-
-# OpenSearch k-NN Benchmarking
-- [Welcome!](#welcome)
-- [Install Prerequisites](#install-prerequisites)
-- [Usage](#usage)
-- [Contributing](#contributing)
-
-## Welcome!
-
-This directory contains the code related to benchmarking the k-NN plugin.
-Benchmarks can be run against any OpenSearch cluster with the k-NN plugin
-installed. Benchmarks are highly configurable using the test configuration
-file.
-
-## Install Prerequisites
-
-### Setup
-
-K-NN perf requires Python 3.8 or greater to be installed. One of
-the easier ways to do this is through Conda, a package and environment
-management system for Python.
-
-First, follow the
-[installation instructions](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)
-to install Conda on your system.
-
-Next, create a Python 3.8 environment:
-```
-conda create -n knn-perf python=3.8
-```
-
-After the environment is created, activate it:
-```
-source activate knn-perf
-```
-
-Lastly, clone the k-NN repo and install all required python packages:
-```
-git clone https://github.com/opensearch-project/k-NN.git
-cd k-NN/benchmarks/perf-tool
-pip install -r requirements.txt
-```
-
-After all of this completes, you should be ready to run your first performance benchmarks!
-
-
-## Usage
-
-### Quick Start
-
-In order to run a benchmark, you must first create a test configuration yml
-file. Check out [this example](https://github.com/opensearch-project/k-NN/blob/main/benchmarks/perf-tool/sample-configs) file
-for benchmarking *faiss*'s IVF method. This file contains the definition for
-the benchmark that you want to run. At the top are
-[test parameters](#test-parameters). These define high level settings of the
-test, such as the endpoint of the OpenSearch cluster.
-
-Next, you define the actions that the test will perform. These actions are
-referred to as steps. First, you can define "setup" steps. These are steps that
-are run once at the beginning of the execution to configure the cluster how you
-want it. These steps do not contribute to the final metrics.
-
-After that, you define the "steps". These are the steps that the test will be
-collecting metrics on. Each step emits certain metrics. These are run
-multiple times, depending on the test parameter "num_runs". At the end of the
-execution of all of the runs, the metrics from each run are collected and
-averaged.
-
-Lastly, you define the "cleanup" steps. The "cleanup" steps are executed after
-each test run. For instance, if you are measuring index performance, you may
-want to delete the index after each run.
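-
-For reference, a minimal configuration might look like the sketch below. This
-is an illustrative outline only: the parameters for each step are documented in
-the tables later in this document, the dataset and index-spec paths are
-placeholders, and the exact layout (for example, the `name` key identifying
-each step) should be confirmed against the files in the sample-configs
-directory.
-```
-endpoint: localhost
-port: 9200
-test_name: "Sample ingest and query test"
-test_id: "sample-test"
-num_runs: 1
-show_runs: false
-steps:
-  - name: create_index
-    index_name: target_index
-    index_spec: sample-configs/index-spec.json
-  - name: ingest
-    index_name: target_index
-    field_name: target_field
-    bulk_size: 300
-    dataset_format: hdf5
-    dataset_path: dataset/my-vectors.hdf5
-  - name: refresh_index
-    index_name: target_index
-  - name: query
-    k: 100
-    calculate_recall: true
-    index_name: target_index
-    field_name: target_field
-    dataset_format: hdf5
-    dataset_path: dataset/my-vectors.hdf5
-    neighbors_format: hdf5
-    neighbors_path: dataset/my-vectors.hdf5
-cleanup:
-  - name: delete_index
-    index_name: target_index
-```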
-
-To run the test, execute the following command:
-```
-python knn-perf-tool.py [--log LOGLEVEL] test config-path.yml output.json
-
---log log level of tool, options are: info, debug, warning, error, critical
-```
-
-The output will be a json document containing the results.
-
-Additionally, you can get the difference between two test runs using the diff
-command:
-```
-python knn-perf-tool.py [--log LOGLEVEL] diff result1.json result2.json
-
---log log level of tool, options are: info, debug, warning, error, critical
-```
-
-The output will be the delta between the two metrics.
-
-### Test Parameters
-
-| Parameter Name | Description | Default |
-|----------------|------------------------------------------------------------------------------------|------------|
-| endpoint | Endpoint the OpenSearch cluster is running on | localhost |
-| port | Port the OpenSearch cluster is running on | 9200 |
-| test_name | Name of test | No default |
-| test_id | String ID of test | No default |
-| num_runs | Number of runs to execute steps | 1 |
-| show_runs | Whether to output each run in addition to the total summary | false |
-| setup | List of steps to run once before metric collection starts | [] |
-| steps | List of steps that make up one test run. Metrics will be collected on these steps. | No default |
-| cleanup | List of steps to run after each test run | [] |
-
-### Steps
-
-Included are the list of steps that are currently supported. Each step contains
-a set of parameters that are passed in the test configuration file and a set
-of metrics that the test produces.
-
-#### create_index
-
-Creates an OpenSearch index.
-
-##### Parameters
-| Parameter Name | Description | Default |
-| ----------- | ----------- | ----------- |
-| index_name | Name of index to create | No default |
-| index_spec | Path to index specification | No default |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- | ----------- | ----------- |
-| took | Time to execute step end to end. | ms |
-
-#### disable_refresh
-
-Disables refresh for all indices in the cluster.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- | ----------- | ----------- |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- | ----------- | ----------- |
-| took | Time to execute step end to end. | ms |
-
-#### refresh_index
-
-Refreshes an OpenSearch index.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- | ----------- | ----------- |
-| index_name | Name of index to refresh | No default |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- | ----------- | ----------- |
-| took | Time to execute step end to end. | ms |
-| store_kb | Size of index after refresh completes | KB |
-
-#### force_merge
-
-Force merges an index to a specified number of segments.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- | ----------- | ----------- |
-| index_name | Name of index to force merge | No default |
-| max_num_segments | Number of segments to force merge to | No default |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- | ----------- | ----------- |
-| took | Time to execute step end to end. | ms |
-
-#### train_model
-
-Trains a model.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- | ----------- | ----------- |
-| model_id | Model id to set | Test |
-| train_index | Index to pull training data from | No default |
-| train_field | Field to pull training data from | No default |
-| dimension | Dimension of model | No default |
-| description | Description of model | No default |
-| max_training_vector_count | Number of training vectors to use | No default |
-| method_spec | Path to method specification | No default |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- | ----------- | ----------- |
-| took | Time to execute step end to end | ms |
-
-#### delete_model
-
-Deletes a model from the cluster.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- | ----------- | ----------- |
-| model_id | Model id to delete | Test |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- | ----------- | ----------- |
-| took | Time to execute step end to end | ms |
-
-#### delete_index
-
-Deletes an index from the cluster.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- | ----------- | ----------- |
-| index_name | Name of index to delete | No default |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- | ----------- | ----------- |
-| took | Time to execute step end to end | ms |
-
-#### ingest
-
-Ingests a dataset of vectors into the cluster.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- | ----------- | ----------- |
-| index_name | Name of index to ingest into | No default |
-| field_name | Name of field to ingest into | No default |
-| bulk_size | Documents per bulk request | 300 |
-| dataset_format | Format the data-set is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
-| dataset_path | Path to data-set | No default |
-| doc_count | Number of documents to create from data-set | Size of the data-set |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- | ----------- | ----------- |
-| took | Total time to ingest the dataset into the index.| ms |
-
-#### ingest_multi_field
-
-Ingests a dataset of multiple context types into the cluster.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- |-----------------------------------------------------------------------------------------------------------------------------------------------------------| ----------- |
-| index_name | Name of index to ingest into | No default |
-| field_name | Name of field to ingest into | No default |
-| bulk_size | Documents per bulk request | 300 |
-| dataset_path | Path to data-set | No default |
-| doc_count | Number of documents to create from data-set | Size of the data-set |
-| attributes_dataset_name | Name of dataset with additional attributes inside the main dataset | No default |
-| attribute_spec | Definition of attributes, format is: [{ name: [name_val], type: [type_val]}] Order is important and must match order of attributes column in dataset file | No default |
-
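-As an illustration, an `attribute_spec` for the attributes generated by the bundled
-`add-filters-to-dataset.py` script (color, taste, age) might look like the snippet below. The attribute
-names and the `str` type keyword are assumptions and must match how the dataset was actually generated:
-
-```
-attribute_spec: [ {name: 'color', type: 'str'}, {name: 'taste', type: 'str'}, {name: 'age', type: 'int'} ]
-```
-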
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- | ----------- | ----------- |
-| took | Total time to ingest the dataset into the index.| ms |
-
-#### ingest_nested_field
-
-Ingests a dataset with nested field into the cluster.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| ----------- |
-| index_name | Name of index to ingest into | No default |
-| field_name | Name of field to ingest into | No default |
-| dataset_path | Path to data-set | No default |
-| attributes_dataset_name | Name of dataset with additional attributes inside the main dataset | No default |
-| attribute_spec | Definition of attributes, format is: [{ name: [name_val], type: [type_val]}] Order is important and must match the order of the attributes column in the dataset file. It should contain { name: 'parent_id', type: 'int'} | No default |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- | ----------- | ----------- |
-| took | Total time to ingest the dataset into the index.| ms |
-
-#### query
-
-Runs a set of queries against an index.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- | ----------- | ----------- |
-| k | Number of neighbors to return on search | 100 |
-| r | r value in Recall@R | 1 |
-| index_name | Name of index to search | No default |
-| field_name | Name of field to search | No default |
-| calculate_recall | Whether to calculate recall values | False |
-| dataset_format | Format the dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
-| dataset_path | Path to dataset | No default |
-| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
-| neighbors_path | Path to neighbors dataset | No default |
-| query_count | Number of queries to create from data-set | Size of the data-set |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- |---------------------------------------------------------------------------------------------------------| ----------- |
-| took | Took times returned per query aggregated as total, p50, p90, p99, p99.9 and p100 (when applicable) | ms |
-| memory_kb | Native memory k-NN is using at the end of the query workload | KB |
-| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
-| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |
-
-#### query_with_filter
-
-Runs a set of queries with filter against an index.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
-| k | Number of neighbors to return on search | 100 |
-| r | r value in Recall@R | 1 |
-| index_name | Name of index to search | No default |
-| field_name | Name of field to search | No default |
-| calculate_recall | Whether to calculate recall values | False |
-| dataset_format | Format the dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
-| dataset_path | Path to dataset | No default |
-| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
-| neighbors_path | Path to neighbors dataset | No default |
-| neighbors_dataset | Name of filter dataset inside the neighbors dataset | No default |
-| filter_spec | Path to filter specification | No default |
-| filter_type | Type of filter format. The following types are supported: FILTER (inner filter format for approximate k-NN search), SCRIPT (score scripting with exact k-NN search and pre-filtering), BOOL_POST_FILTER (bool query with post-filtering) | SCRIPT |
-| score_script_similarity | Similarity function that has been used to index dataset. Used for SCRIPT filter type and ignored for others | l2 |
-| query_count | Number of queries to create from data-set | Size of the data-set |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- | ----------- | ----------- |
-| took | Took times returned per query aggregated as total, p50, p90 and p99 (when applicable) | ms |
-| memory_kb | Native memory k-NN is using at the end of the query workload | KB |
-| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
-| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |
-
-
-#### query_nested_field
-
-Runs a set of queries with nested field against an index.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
-| k | Number of neighbors to return on search | 100 |
-| r | r value in Recall@R | 1 |
-| index_name | Name of index to search | No default |
-| field_name | Name of field to search | No default |
-| calculate_recall | Whether to calculate recall values | False |
-| dataset_format | Format the dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
-| dataset_path | Path to dataset | No default |
-| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
-| neighbors_path | Path to neighbors dataset | No default |
-| neighbors_dataset | Name of filter dataset inside the neighbors dataset | No default |
-| query_count | Number of queries to create from data-set | Size of the data-set |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- | ----------- | ----------- |
-| took | Took times returned per query aggregated as total, p50, p90 and p99 (when applicable) | ms |
-| memory_kb | Native memory k-NN is using at the end of the query workload | KB |
-| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
-| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |
-
-#### get_stats
-
-Gets the index stats.
-
-##### Parameters
-
-| Parameter Name | Description | Default |
-| ----------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
-| index_name | Name of index to search | No default |
-
-##### Metrics
-
-| Metric Name | Description | Unit |
-| ----------- |-------------------------------------------------|------------|
-| num_of_committed_segments | Total number of committed segments in the index | integer >= 0 |
-| num_of_search_segments | Total number of search segments in the index | integer >= 0 |
-
-### Data sets
-
-This benchmark tool uses pre-generated data sets to run indexing and query workloads. For some benchmark types, an existing dataset needs to be
-extended. Filtering is an example of a use case where such a dataset extension is needed.
-
-Scripts provided with this repo can be used to generate the datasets needed to run benchmarks for filtering queries.
-You need an existing dataset with vector data. This dataset is used to generate additional attribute data and a set of ground truth neighbour document ids.
-
-To generate a dataset with attributes from a vectors-only dataset, use the following command pattern:
-
-```commandline
-python add-filters-to-dataset.py <in_file_path> <out_file_path> True False
-```
-
-To generate neighbours datasets for different filters from a dataset with attributes, use the following command pattern:
-
-```commandline
-python add-filters-to-dataset.py <in_file_path> <out_file_path> False True
-```
-
-After that, the new dataset(s) can be referenced from a test case definition in the `ingest_extended` and `query_with_filter` steps, as sketched below.
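-
-For example, a `query_with_filter` step could then point at the generated files. The sketch below is
-illustrative only (paths, index and field names are placeholders, and the step layout follows the same
-assumed format as the earlier sketch); `neighbors_dataset` refers to one of the `neighbors_filter_X`
-datasets produced by the script, and `filter_spec` points to your filter definition:
-```
-  - name: query_with_filter
-    k: 100
-    calculate_recall: true
-    index_name: target_index
-    field_name: target_field
-    dataset_format: hdf5
-    dataset_path: dataset/data-with-attr.hdf5
-    neighbors_format: hdf5
-    neighbors_path: dataset/data-with-filters.hdf5
-    neighbors_dataset: neighbors_filter_1
-    filter_spec: sample-configs/filter-spec.json
-    filter_type: FILTER
-```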
-
-To generate a dataset with parent doc ids from a vectors-only dataset, use the following command pattern:
-```commandline
-python add-parent-doc-id-to-dataset.py <in_file_path> <out_file_path>
-```
-This will generate the neighbours datasets as well. The new dataset(s) can be referenced from a test case definition in the `ingest_nested_field` and `query_nested_field` steps.
-
-## Contributing
-
-### Linting
-
-Use pylint to lint the code:
-```
-pylint knn-perf-tool.py okpt/**/*.py okpt/**/**/*.py
-```
-
-### Formatting
-
-We use yapf and the google style to format our code. After installing yapf, you can format your code by running:
-
-```
-yapf --style google knn-perf-tool.py okpt/**/*.py okpt/**/**/*.py
-```
-
-### Updating requirements
-
-Add new requirements to "requirements.in" and run `pip-compile`
diff --git a/benchmarks/perf-tool/add-filters-to-dataset.py b/benchmarks/perf-tool/add-filters-to-dataset.py
deleted file mode 100644
index 0624f7323..000000000
--- a/benchmarks/perf-tool/add-filters-to-dataset.py
+++ /dev/null
@@ -1,200 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-"""
-Script builds a complex dataset with additional attributes from an existing dataset that has only vectors.
-Additional attributes are predefined in the script: color, taste, age. Only the HDF5 format of vector dataset is supported.
-
-The output dataset file will have an additional dataset 'attributes' with multiple columns; each column corresponds to one attribute
-from the attribute set, and each value is generated at random, e.g.:
-
-0: green None 71
-1: green bitter 28
-
-There is no explicit index reference in the 'attributes' dataset; the index of a row corresponds to a document id.
-For instance, in the example above the two rows of fields are mapped to documents with ids '0' and '1'.
-
-If the 'generate_filters' flag is set, the script generates an additional dataset of neighbours (ground truth) for each filter type.
-The output is a new file with several datasets; each dataset corresponds to one filter. Datasets are named 'neighbors_filter_X',
-where X is the 1-based index of a particular filter.
-Each dataset has rows with an array of integers, where each integer corresponds to
-a document id from the original dataset with additional fields. The array can have -1 values that are treated as null, because the
-subset of filtered documents is the same size as or smaller than the original set.
-
-For example, dataset file content may look like :
-
-neighbors_filter_1: [[ 2, 5, -1],
- [ 3, 1, -1],
- [ 2, 5, 7]]
-neighbors_filter_2: [[-1, -1, -1],
- [ 5, 6, -1],
- [ 4, 2, 1]]
-
-In this case we have datasets for two filters, with 3 query results for each. [2, 5, -1] indicates that for the first query,
-if filter 1 is used, the most similar document has id 2, the next most similar has id 5, and the rest do not pass the filter 1 criteria.
-
-Example of script usage:
-
- create new hdf5 file with attribute dataset
- add-filters-to-dataset.py ~/dev/opensearch/k-NN/benchmarks/perf-tool/dataset/data.hdf5 ~/dev/opensearch/datasets/data-with-attr True False
-
- create new hdf5 file with filter datasets
- add-filters-to-dataset.py ~/dev/opensearch/k-NN/benchmarks/perf-tool/dataset/data-with-attr.hdf5 ~/dev/opensearch/datasets/data-with-filters False True
-"""
-
-import getopt
-import os
-import random
-import sys
-
-import h5py
-
-from osb.extensions.data_set import HDF5DataSet
-
-
-class _Dataset:
- """Type of dataset container for data with additional attributes"""
- DEFAULT_TYPE = HDF5DataSet.FORMAT_NAME
-
- def create_dataset(self, source_dataset_path, out_file_path, generate_attrs: bool, generate_filters: bool) -> None:
- path_elements = os.path.split(os.path.abspath(source_dataset_path))
- data_set_dir = path_elements[0]
-
- # Read the existing source dataset. For HDF5, multiple data sets can be
- # grouped in the same file.
- data_hdf5 = os.path.join(os.path.dirname(os.path.realpath('/')), source_dataset_path)
-
- with h5py.File(data_hdf5, "r") as hf:
-
- if generate_attrs:
- data_set_w_attr = self.create_dataset_file(out_file_path, self.DEFAULT_TYPE, data_set_dir)
-
- possible_colors = ['red', 'green', 'yellow', 'blue', None]
- possible_tastes = ['sweet', 'salty', 'sour', 'bitter', None]
- max_age = 100
-
- for key in hf.keys():
- if key not in ['neighbors', 'test', 'train']:
- continue
- data_set_w_attr.create_dataset(key, data=hf[key][()])
-
- attributes = []
- for i in range(len(hf['train'])):
- attr = [random.choice(possible_colors), random.choice(possible_tastes),
- random.randint(0, max_age + 1)]
- attributes.append(attr)
-
- data_set_w_attr.create_dataset('attributes', (len(attributes), 3), 'S10', data=attributes)
-
- data_set_w_attr.flush()
- data_set_w_attr.close()
-
- if generate_filters:
- attributes = hf['attributes'][()]
- expected_neighbors = hf['neighbors'][()]
-
- data_set_filters = self.create_dataset_file(out_file_path, self.DEFAULT_TYPE, data_set_dir)
-
- def filter1(attributes, vector_idx):
- if attributes[vector_idx][0].decode() == 'red' and int(attributes[vector_idx][2].decode()) >= 20:
- return True
- else:
- return False
-
- self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_1', filter1)
-
- # filter 2 - color = blue or None and taste = 'salty'
- def filter2(attributes, vector_idx):
- if (attributes[vector_idx][0].decode() == 'blue' or attributes[vector_idx][
- 0].decode() == 'None') and attributes[vector_idx][1].decode() == 'salty':
- return True
- else:
- return False
-
- self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_2', filter2)
-
- # filter 3 - color and taste are not None and age is between 20 and 80
- def filter3(attributes, vector_idx):
- if attributes[vector_idx][0].decode() != 'None' and attributes[vector_idx][
- 1].decode() != 'None' and 20 <= \
- int(attributes[vector_idx][2].decode()) <= 80:
- return True
- else:
- return False
-
- self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_3', filter3)
-
- # filter 4 - color green or blue and taste is bitter and age is between (30, 60)
- def filter4(attributes, vector_idx):
- if (attributes[vector_idx][0].decode() == 'green' or attributes[vector_idx][0].decode() == 'blue') \
- and (attributes[vector_idx][1].decode() == 'bitter') \
- and 30 <= int(attributes[vector_idx][2].decode()) <= 60:
- return True
- else:
- return False
-
- self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_4', filter4)
-
- # filter 5 color is (green or blue or yellow) or taste = sweet or age is between (30, 70)
- def filter5(attributes, vector_idx):
- if attributes[vector_idx][0].decode() == 'green' or attributes[vector_idx][0].decode() == 'blue' \
- or attributes[vector_idx][0].decode() == 'yellow' \
- or attributes[vector_idx][1].decode() == 'sweet' \
- or 30 <= int(attributes[vector_idx][2].decode()) <= 70:
- return True
- else:
- return False
-
- self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_5', filter5)
-
- data_set_filters.flush()
- data_set_filters.close()
-
- def apply_filter(self, expected_neighbors, attributes, data_set_w_filtering, filter_name, filter_func):
- neighbors_filter = []
- filtered_count = 0
- for expected_neighbors_row in expected_neighbors:
- neighbors_filter_row = [-1] * len(expected_neighbors_row)
- idx = 0
- for vector_idx in expected_neighbors_row:
- if filter_func(attributes, vector_idx):
- neighbors_filter_row[idx] = vector_idx
- idx += 1
- filtered_count += 1
- neighbors_filter.append(neighbors_filter_row)
- overall_count = len(expected_neighbors) * len(expected_neighbors[0])
- perc = float(filtered_count / overall_count) * 100
- print('ground truth size for {} is {}, percentage {}'.format(filter_name, filtered_count, perc))
- data_set_w_filtering.create_dataset(filter_name, data=neighbors_filter)
- return expected_neighbors
-
- def create_dataset_file(self, file_name, extension, data_set_dir) -> h5py.File:
- data_set_file_name = "{}.{}".format(file_name, extension)
- data_set_path = os.path.join(data_set_dir, data_set_file_name)
-
- data_set_w_filtering = h5py.File(data_set_path, 'a')
-
- return data_set_w_filtering
-
-
-def main(argv):
- opts, args = getopt.getopt(argv, "")
- in_file_path = args[0]
- out_file_path = args[1]
- generate_attr = str2bool(args[2])
- generate_filters = str2bool(args[3])
-
- worker = _Dataset()
- worker.create_dataset(in_file_path, out_file_path, generate_attr, generate_filters)
-
-
-def str2bool(v):
- return v.lower() in ("yes", "true", "t", "1")
-
-
-if __name__ == "__main__":
- main(sys.argv[1:])
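
The `apply_filter` routine above writes one ground-truth dataset per filter and pads each row with `-1` for neighbors that fail the filter. A minimal sketch (not part of the deleted tool) for inspecting those datasets; the output file name is an assumption based on the dataset files in this directory:

```
import h5py
import numpy as np

# Open the generated file and report, per filter, how many neighbors survived.
with h5py.File('data-with-attr-with-filters.hdf5', 'r') as f:
    for name in sorted(k for k in f.keys() if k.startswith('neighbors_filter_')):
        neighbors = np.asarray(f[name])
        # Rows are padded with -1 where a neighbor did not pass the filter.
        valid = int((neighbors != -1).sum())
        print(f'{name}: shape={neighbors.shape}, neighbors passing the filter={valid}')
```
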
diff --git a/benchmarks/perf-tool/add-parent-doc-id-to-dataset.py b/benchmarks/perf-tool/add-parent-doc-id-to-dataset.py
deleted file mode 100644
index a4acafd03..000000000
--- a/benchmarks/perf-tool/add-parent-doc-id-to-dataset.py
+++ /dev/null
@@ -1,291 +0,0 @@
-# Copyright OpenSearch Contributors
-# SPDX-License-Identifier: Apache-2.0
-
-"""
-Script builds a complex dataset with additional attributes from an existing dataset that has only vectors.
-Additional attributes are predefined in the script: color, taste, age, and parent doc id. Only the HDF5 format of vector dataset is supported.
-
-The output dataset file will have an additional dataset 'attributes' with multiple columns; each column corresponds to one attribute
-from the attribute set, and each value is generated at random, e.g.:
-
-0: green None 71 1
-1: green bitter 28 1
-2: green bitter 28 1
-3: green bitter 28 2
-...
-
-There is no explicit index reference in the 'attributes' dataset; the index of a row corresponds to a document id.
-For instance, in the example above the first two rows of fields map to the documents with ids '0' and '1'.
-
-The parent doc ids are assigned in non-decreasing order.
-
-The script also generates ground truth datasets of neighbours, one per query type, in the same output file.
-Dataset 'neighbour_nested' is the ground truth for a query without filtering.
-Dataset 'neighbour_relaxed' is the ground truth for a query with a filter of (30 <= age <= 70) or color in ["green", "blue", "yellow"] or taste in ["sweet"].
-Dataset 'neighbour_restricted' is the ground truth for a query with a filter of (30 <= age <= 60) and color in ["green", "blue"] and taste in ["bitter"].
-
-
-Each ground truth dataset has rows of integer arrays, where each integer corresponds to
-a document id from the original dataset with additional fields.
-
-Example of script usage:
-
- create new hdf5 file with attribute dataset
- add-parent-doc-id-to-dataset.py ~/dev/opensearch/k-NN/benchmarks/perf-tool/dataset/data.hdf5 ~/dev/opensearch/datasets/data-nested.hdf5
-
-"""
-import getopt
-import multiprocessing
-import random
-import sys
-from multiprocessing import Process
-from typing import cast
-import traceback
-
-import h5py
-import numpy as np
-
-
-class MyVector:
- def __init__(self, vector, id, color=None, taste=None, age=None, parent_id=None):
- self.vector = vector
- self.id = id
- self.age = age
- self.color = color
- self.taste = taste
- self.parent_id = parent_id
-
- def apply_restricted_filter(self):
- return (30 <= self.age <= 60) and self.color in ["green", "blue"] and self.taste in ["bitter"]
-
- def apply_relaxed_filter(self):
- return (30 <= self.age <= 70) or self.color in ["green", "blue", "yellow"] or self.taste in ["sweet"]
-
- def __str__(self):
- return f'Vector : {self.vector}, id : {self.id}, color: {self.color}, taste: {self.taste}, age: {self.age}, parent_id: {self.parent_id}\n'
-
- def __repr__(self):
- return f'Vector : {self.vector}, id : {self.id}, color: {self.color}, taste: {self.taste}, age: {self.age}, parent_id: {self.parent_id}\n'
-
-class HDF5DataSet:
- def __init__(self, file_path, key):
- self.file_name = file_path
- self.file = h5py.File(self.file_name)
- self.key = key
- self.data = cast(h5py.Dataset, self.file[key])
- self.metadata = None
- self.metadata = cast(h5py.Dataset, self.file["attributes"]) if key == "train" else None
- print(f'Keys in the file are {self.file.keys()}')
-
- def read(self, start, end=None):
- if end is None:
- end = self.data.len()
- values = cast(np.ndarray, self.data[start:end])
- metadata = cast(list, self.metadata[start:end]) if self.metadata is not None else None
- if metadata is not None:
- print(metadata)
- vectors = []
- i = 0
- for value in values:
- if self.metadata is None:
- vector = MyVector(value, i)
- else:
- # color, taste, age, and parent id
- vector = MyVector(value, i, str(metadata[i][0].decode()), str(metadata[i][1].decode()),
- int(metadata[i][2]), int(metadata[i][3]))
- vectors.append(vector)
- i = i + 1
- return vectors
-
- def read_neighbors(self, start, end):
- return cast(np.ndarray, self.data[start:end])
-
- def size(self):
- return self.data.len()
-
- def close(self):
- self.file.close()
-
-class _Dataset:
- def run(self, source_path, target_path) -> None:
- # Add attributes
- print(f'Adding attributes started.')
- with h5py.File(source_path, "r") as in_file:
- out_file = h5py.File(target_path, "w")
- possible_colors = ['red', 'green', 'yellow', 'blue', None]
- possible_tastes = ['sweet', 'salty', 'sour', 'bitter', None]
- max_age = 100
- min_field_size = 10
- max_field_size = 10
-
- # Copy train and test data
- for key in in_file.keys():
- if key not in ['test', 'train']:
- continue
- out_file.create_dataset(key, data=in_file[key][()])
-
- # Generate attributes
- attributes = []
- field_size = random.randint(min_field_size, max_field_size)
- parent_id = 1
- field_count = 0
- for i in range(len(in_file['train'])):
- attr = [random.choice(possible_colors), random.choice(possible_tastes),
- random.randint(0, max_age + 1), parent_id]
- attributes.append(attr)
- field_count += 1
- if field_count >= field_size:
- field_size = random.randint(min_field_size, max_field_size)
- field_count = 0
- parent_id += 1
- out_file.create_dataset('attributes', (len(attributes), 4), 'S10', data=attributes)
-
- out_file.flush()
- out_file.close()
-
- print(f'Adding attributes completed.')
-
-
- # Calculate ground truth
- print(f'Calculating ground truth started.')
- cpus = multiprocessing.cpu_count()
- total_clients = min(8, cpus) # 1 # 10
- hdf5Data_train = HDF5DataSet(target_path, "train")
- train_vectors = hdf5Data_train.read(0, hdf5Data_train.size())
- hdf5Data_train.close()
- print(f'Train vector size: {len(train_vectors)}')
-
- hdf5Data_test = HDF5DataSet(target_path, "test")
- total_queries = hdf5Data_test.size() # 10000
- dis = [] * total_queries
-
- for i in range(total_queries):
- dis.insert(i, [])
-
- queries_per_client = int(total_queries / total_clients + 0.5)
- if queries_per_client == 0:
- queries_per_client = total_queries
-
- processes = []
- test_vectors = hdf5Data_test.read(0, total_queries)
- hdf5Data_test.close()
- tasks_that_are_done = multiprocessing.Queue()
- for client in range(total_clients):
- start_index = int(client * queries_per_client)
- if start_index + queries_per_client <= total_queries:
- end_index = int(start_index + queries_per_client)
- else:
- end_index = total_queries
-
- print(f'Start Index: {start_index}, end Index: {end_index}')
- print(f'client is : {client}')
- p = Process(target=queryTask, args=(
- train_vectors, test_vectors, start_index, end_index, client, total_queries, tasks_that_are_done))
- processes.append(p)
- p.start()
- if end_index >= total_queries:
- print(f'Exiting end Index : {end_index} total_queries: {total_queries}')
- break
-
- # wait for tasks to be completed
- print('Waiting for all tasks to be completed')
- j = 0
-        # This is required because the join can hang if the data sent from the sub processes exceeds a certain size
- # https://stackoverflow.com/questions/21641887/python-multiprocessing-process-hangs-on-join-for-large-queue
- while j < total_queries:
- while not tasks_that_are_done.empty():
- calculatedDis = tasks_that_are_done.get()
- i = 0
- for d in calculatedDis:
- if d:
- dis[i] = d
- j = j + 1
- i = i + 1
-
- for p in processes:
- if p.is_alive():
- p.join()
- else:
- print("Process was not alive hence shutting down")
-
- data_set_file = h5py.File(target_path, "a")
- for type in ['nested', 'relaxed', 'restricted']:
- results = []
- for d in dis:
- r = []
- for i in range(min(10000, len(d[type]))):
- r.append(d[type][i]['id'])
- results.append(r)
-
-
- data_set_file.create_dataset("neighbour_" + type, (len(results), len(results[0])), data=results)
- data_set_file.flush()
- data_set_file.close()
-
-def calculateL2Distance(point1, point2):
- return np.linalg.norm(point1 - point2)
-
-
-def queryTask(train_vectors, test_vectors, startIndex, endIndex, process_number, total_queries, tasks_that_are_done):
- print(f'Starting Process number : {process_number}')
- all_distances = [] * total_queries
- for i in range(total_queries):
- all_distances.insert(i, {})
- try:
- test_vectors = test_vectors[startIndex:endIndex]
- i = startIndex
- for test in test_vectors:
- distances = []
- values = {}
- for value in train_vectors:
- values[value.id] = value
- distances.append({
- "dis": calculateL2Distance(test.vector, value.vector),
- "id": value.parent_id
- })
-
- distances.sort(key=lambda vector: vector['dis'])
- seen_set_nested = set()
- seen_set_restricted = set()
- seen_set_relaxed = set()
- nested = []
- restricted = []
- relaxed = []
- for sub_i in range(len(distances)):
- id = distances[sub_i]['id']
- # Check if the number has been seen before
- if len(nested) < 1000 and id not in seen_set_nested:
- # If not seen before, mark it as seen
- seen_set_nested.add(id)
- nested.append(distances[sub_i])
- if len(restricted) < 1000 and id not in seen_set_restricted and values[id].apply_restricted_filter():
- seen_set_restricted.add(id)
- restricted.append(distances[sub_i])
- if len(relaxed) < 1000 and id not in seen_set_relaxed and values[id].apply_relaxed_filter():
- seen_set_relaxed.add(id)
- relaxed.append(distances[sub_i])
-
- all_distances[i]['nested'] = nested
- all_distances[i]['restricted'] = restricted
- all_distances[i]['relaxed'] = relaxed
- print(f"Process {process_number} queries completed: {i + 1 - startIndex}, queries left: {endIndex - i - 1}")
- i = i + 1
- except:
- print(
- f"Got exception while running the thread: {process_number} with startIndex: {startIndex} endIndex: {endIndex} ")
- traceback.print_exc()
- tasks_that_are_done.put(all_distances)
- print(f'Exiting Process number : {process_number}')
-
-
-def main(argv):
- opts, args = getopt.getopt(argv, "")
- in_file_path = args[0]
- out_file_path = args[1]
-
- worker = _Dataset()
- worker.run(in_file_path, out_file_path)
-
-if __name__ == "__main__":
- main(sys.argv[1:])
\ No newline at end of file
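
To see what the script produces, the output file can be opened with h5py. A small sketch, assuming the output path from the usage example above:

```
import h5py

# List the datasets written by the script and decode the first few attribute
# rows (stored as fixed-size 'S10' byte strings: color, taste, age, parent id).
with h5py.File('data-nested.hdf5', 'r') as f:
    print(list(f.keys()))
    for i, (color, taste, age, parent_id) in enumerate(f['attributes'][:5]):
        print(i, color.decode(), taste.decode(), int(age), int(parent_id))
```
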
diff --git a/benchmarks/perf-tool/dataset/data-nested.hdf5 b/benchmarks/perf-tool/dataset/data-nested.hdf5
deleted file mode 100644
index 4223d7281..000000000
Binary files a/benchmarks/perf-tool/dataset/data-nested.hdf5 and /dev/null differ
diff --git a/benchmarks/perf-tool/dataset/data-with-attr-with-filters.hdf5 b/benchmarks/perf-tool/dataset/data-with-attr-with-filters.hdf5
deleted file mode 100644
index 01df75f83..000000000
Binary files a/benchmarks/perf-tool/dataset/data-with-attr-with-filters.hdf5 and /dev/null differ
diff --git a/benchmarks/perf-tool/dataset/data-with-attr.hdf5 b/benchmarks/perf-tool/dataset/data-with-attr.hdf5
deleted file mode 100644
index 22873b06c..000000000
Binary files a/benchmarks/perf-tool/dataset/data-with-attr.hdf5 and /dev/null differ
diff --git a/benchmarks/perf-tool/dataset/data.hdf5 b/benchmarks/perf-tool/dataset/data.hdf5
deleted file mode 100644
index c9268606d..000000000
Binary files a/benchmarks/perf-tool/dataset/data.hdf5 and /dev/null differ
diff --git a/benchmarks/perf-tool/knn-perf-tool.py b/benchmarks/perf-tool/knn-perf-tool.py
deleted file mode 100644
index 48eedc427..000000000
--- a/benchmarks/perf-tool/knn-perf-tool.py
+++ /dev/null
@@ -1,10 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-"""Script for user to run the testing tool."""
-
-import okpt.main
-
-okpt.main.main()
diff --git a/benchmarks/perf-tool/okpt/__init__.py b/benchmarks/perf-tool/okpt/__init__.py
deleted file mode 100644
index c3bffc54c..000000000
--- a/benchmarks/perf-tool/okpt/__init__.py
+++ /dev/null
@@ -1,6 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
diff --git a/benchmarks/perf-tool/okpt/diff/diff.py b/benchmarks/perf-tool/okpt/diff/diff.py
deleted file mode 100644
index 23f424ab9..000000000
--- a/benchmarks/perf-tool/okpt/diff/diff.py
+++ /dev/null
@@ -1,142 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-"""Provides the Diff class."""
-
-from enum import Enum
-from typing import Any, Dict, Tuple
-
-
-class InvalidTestResultsError(Exception):
- """Exception raised when the test results are invalid.
-
- The results can be invalid if they have different fields, non-numeric
- values, or if they don't follow the standard result format.
- """
- def __init__(self, msg: str):
- self.message = msg
- super().__init__(self.message)
-
-
-def _is_numeric(a) -> bool:
- return isinstance(a, (int, float))
-
-
-class TestResultFields(str, Enum):
- METADATA = 'metadata'
- RESULTS = 'results'
- TEST_PARAMETERS = 'test_parameters'
-
-
-class TestResultNames(str, Enum):
- BASE = 'base_result'
- CHANGED = 'changed_result'
-
-
-class Diff:
- """Diff class for validating and diffing two test result files.
-
- Methods:
- diff: Returns the diff between two test results. (changed - base)
- """
-    def __init__(
-        self,
-        base_result: Dict[str, Any],
-        changed_result: Dict[str, Any],
-        metadata: bool
-    ):
- """Initializes test results and validate them."""
- self.base_result = base_result
- self.changed_result = changed_result
- self.metadata = metadata
-
- # make sure results have proper test result fields
- is_valid, key, result = self._validate_keys()
- if not is_valid:
- raise InvalidTestResultsError(
- f'{result} has a missing or invalid key `{key}`.'
- )
-
- self.base_results = self.base_result[TestResultFields.RESULTS]
- self.changed_results = self.changed_result[TestResultFields.RESULTS]
-
- # make sure results have the same fields
- is_valid, key, result = self._validate_structure()
- if not is_valid:
- raise InvalidTestResultsError(
- f'key `{key}` is not present in {result}.'
- )
-
- # make sure results have numeric values
- is_valid, key, result = self._validate_types()
- if not is_valid:
- raise InvalidTestResultsError(
- f'key `{key}` in {result} points to a non-numeric value.'
- )
-
- def _validate_keys(self) -> Tuple[bool, str, str]:
- """Ensure both test results have `metadata` and `results` keys."""
- check_keydict = lambda key, res: key in res and isinstance(
- res[key], dict)
-
- # check if results have a `metadata` field and if `metadata` is a dict
- if self.metadata:
- if not check_keydict(TestResultFields.METADATA, self.base_result):
- return (False, TestResultFields.METADATA, TestResultNames.BASE)
- if not check_keydict(TestResultFields.METADATA,
- self.changed_result):
- return (
- False,
- TestResultFields.METADATA,
- TestResultNames.CHANGED
- )
- # check if results have a `results` field and `results` is a dict
- if not check_keydict(TestResultFields.RESULTS, self.base_result):
- return (False, TestResultFields.RESULTS, TestResultNames.BASE)
- if not check_keydict(TestResultFields.RESULTS, self.changed_result):
- return (False, TestResultFields.RESULTS, TestResultNames.CHANGED)
- return (True, '', '')
-
- def _validate_structure(self) -> Tuple[bool, str, str]:
- """Ensure both test results have the same keys."""
- for k in self.base_results:
- if not k in self.changed_results:
- return (False, k, TestResultNames.CHANGED)
- for k in self.changed_results:
- if not k in self.base_results:
- return (False, k, TestResultNames.BASE)
- return (True, '', '')
-
- def _validate_types(self) -> Tuple[bool, str, str]:
- """Ensure both test results have numeric values."""
- for k, v in self.base_results.items():
- if not _is_numeric(v):
- return (False, k, TestResultNames.BASE)
- for k, v in self.changed_results.items():
- if not _is_numeric(v):
-                return (False, k, TestResultNames.CHANGED)
- return (True, '', '')
-
- def diff(self) -> Dict[str, Any]:
- """Return the diff between the two test results. (changed - base)"""
- results_diff = {
- key: self.changed_results[key] - self.base_results[key]
- for key in self.base_results
- }
-
- # add metadata if specified
- if self.metadata:
- return {
- f'{TestResultNames.BASE}_{TestResultFields.METADATA}':
- self.base_result[TestResultFields.METADATA],
- f'{TestResultNames.CHANGED}_{TestResultFields.METADATA}':
- self.changed_result[TestResultFields.METADATA],
- 'diff':
- results_diff
- }
- return results_diff
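
A minimal usage sketch for the Diff class above; the result keys and values are made up purely for illustration:

```
from okpt.diff.diff import Diff

base = {'metadata': {'test_id': '1'}, 'results': {'took_total': 120.0, 'recall@10': 0.91}}
changed = {'metadata': {'test_id': '2'}, 'results': {'took_total': 100.0, 'recall@10': 0.93}}

# diff() returns the per-metric difference (changed - base); passing
# metadata=True would also include both results' metadata blocks.
print(Diff(base, changed, metadata=False).diff())
```
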
diff --git a/benchmarks/perf-tool/okpt/io/args.py b/benchmarks/perf-tool/okpt/io/args.py
deleted file mode 100644
index f8c5d8809..000000000
--- a/benchmarks/perf-tool/okpt/io/args.py
+++ /dev/null
@@ -1,178 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-"""Parses and defines command line arguments for the program.
-
-Defines the subcommands `test` and `diff` and the corresponding
-files that are required by each command.
-
-Functions:
- define_args(): Define the command line arguments.
- get_args(): Returns a dictionary of the command line args.
-"""
-
-import argparse
-import sys
-from dataclasses import dataclass
-from io import TextIOWrapper
-from typing import Union
-
-_read_type = argparse.FileType('r')
-_write_type = argparse.FileType('w')
-
-
-def _add_config(parser, name, **kwargs):
-    """Add configuration file path argument."""
- opts = {
- 'type': _read_type,
- 'help': 'Path of configuration file.',
- 'metavar': 'config_path',
- **kwargs,
- }
- parser.add_argument(name, **opts)
-
-
-def _add_result(parser, name, **kwargs):
-    """Add result file path argument."""
- opts = {
- 'type': _read_type,
- 'help': 'Path of one result file.',
- 'metavar': 'result_path',
- **kwargs,
- }
- parser.add_argument(name, **opts)
-
-
-def _add_results(parser, name, **kwargs):
-    """Add result file paths argument."""
- opts = {
- 'nargs': '+',
- 'type': _read_type,
- 'help': 'Paths of result files.',
- 'metavar': 'result_paths',
- **kwargs,
- }
- parser.add_argument(name, **opts)
-
-
-def _add_output(parser, name, **kwargs):
-    """Add output file path argument."""
- opts = {
- 'type': _write_type,
- 'help': 'Path of output file.',
- 'metavar': 'output_path',
- **kwargs,
- }
- parser.add_argument(name, **opts)
-
-
-def _add_metadata(parser, name, **kwargs):
- opts = {
- 'action': 'store_true',
- **kwargs,
- }
- parser.add_argument(name, **opts)
-
-
-def _add_test_cmd(subparsers):
- test_parser = subparsers.add_parser('test')
- _add_config(test_parser, 'config')
- _add_output(test_parser, 'output')
-
-
-def _add_diff_cmd(subparsers):
- diff_parser = subparsers.add_parser('diff')
- _add_metadata(diff_parser, '--metadata')
- _add_result(
- diff_parser,
- 'base_result',
- help='Base test result.',
- metavar='base_result'
- )
- _add_result(
- diff_parser,
- 'changed_result',
- help='Changed test result.',
- metavar='changed_result'
- )
- _add_output(diff_parser, '--output', default=sys.stdout)
-
-
-@dataclass
-class TestArgs:
- log: str
- command: str
- config: TextIOWrapper
- output: TextIOWrapper
-
-
-@dataclass
-class DiffArgs:
- log: str
- command: str
- metadata: bool
- base_result: TextIOWrapper
- changed_result: TextIOWrapper
- output: TextIOWrapper
-
-
-def get_args() -> Union[TestArgs, DiffArgs]:
- """Define, parse and return command line args.
-
- Returns:
-        A TestArgs or DiffArgs instance containing the parsed command line args.
- """
- parser = argparse.ArgumentParser(
- description=
- 'Run performance tests against the OpenSearch plugin and various ANN '
-        'libraries.'
- )
-
- def define_args():
- """Define tool commands."""
-
- # add log level arg
- parser.add_argument(
- '--log',
- default='info',
- type=str,
- choices=['debug',
- 'info',
- 'warning',
- 'error',
- 'critical'],
- help='Log level of the tool.'
- )
-
- subparsers = parser.add_subparsers(
- title='commands',
- dest='command',
- help='sub-command help'
- )
- subparsers.required = True
-
- # add subcommands
- _add_test_cmd(subparsers)
- _add_diff_cmd(subparsers)
-
- define_args()
- args = parser.parse_args()
- if args.command == 'test':
- return TestArgs(
- log=args.log,
- command=args.command,
- config=args.config,
- output=args.output
- )
- else:
- return DiffArgs(
- log=args.log,
- command=args.command,
- metadata=args.metadata,
- base_result=args.base_result,
- changed_result=args.changed_result,
- output=args.output
- )
diff --git a/benchmarks/perf-tool/okpt/io/config/parsers/base.py b/benchmarks/perf-tool/okpt/io/config/parsers/base.py
deleted file mode 100644
index 795aab1b2..000000000
--- a/benchmarks/perf-tool/okpt/io/config/parsers/base.py
+++ /dev/null
@@ -1,67 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-"""Base Parser class.
-
-Classes:
- BaseParser: Base class for config parsers.
-
-Exceptions:
- ConfigurationError: An error in the configuration syntax.
-"""
-
-import os
-from io import TextIOWrapper
-
-import cerberus
-
-from okpt.io.utils import reader
-
-
-class ConfigurationError(Exception):
- """Exception raised for errors in the tool configuration.
-
- Attributes:
- message -- explanation of the error
- """
-
- def __init__(self, message: str):
- self.message = f'{message}'
- super().__init__(self.message)
-
-
-def _get_validator_from_schema_name(schema_name: str):
- """Get the corresponding Cerberus validator from a schema name."""
- curr_file_dir = os.path.dirname(os.path.abspath(__file__))
- schemas_dir = os.path.join(os.path.dirname(curr_file_dir), 'schemas')
- schema_file_path = os.path.join(schemas_dir, f'{schema_name}.yml')
- schema_obj = reader.parse_yaml_from_path(schema_file_path)
- return cerberus.Validator(schema_obj)
-
-
-class BaseParser:
- """Base class for config parsers.
-
- Attributes:
- validator: Cerberus validator for a particular schema
- errors: Cerberus validation errors (if any are found during validation)
-
- Methods:
- parse: Parse config.
- """
-
- def __init__(self, schema_name: str):
- self.validator = _get_validator_from_schema_name(schema_name)
- self.errors = ''
-
- def parse(self, file_obj: TextIOWrapper):
- """Convert file object to dict, while validating against config schema."""
- config_obj = reader.parse_yaml(file_obj)
- is_config_valid = self.validator.validate(config_obj)
- if not is_config_valid:
- raise ConfigurationError(self.validator.errors)
-
- return self.validator.document
diff --git a/benchmarks/perf-tool/okpt/io/config/parsers/test.py b/benchmarks/perf-tool/okpt/io/config/parsers/test.py
deleted file mode 100644
index c47e30ecc..000000000
--- a/benchmarks/perf-tool/okpt/io/config/parsers/test.py
+++ /dev/null
@@ -1,81 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-"""Provides ToolParser.
-
-Classes:
- ToolParser: Tool config parser.
-"""
-from dataclasses import dataclass
-from io import TextIOWrapper
-from typing import List
-
-from okpt.io.config.parsers import base
-from okpt.test.steps.base import Step, StepConfig
-from okpt.test.steps.factory import create_step
-
-
-@dataclass
-class TestConfig:
- test_name: str
- test_id: str
- endpoint: str
- port: int
- timeout: int
- num_runs: int
- show_runs: bool
- setup: List[Step]
- steps: List[Step]
- cleanup: List[Step]
-
-
-class TestParser(base.BaseParser):
- """Parser for Test config.
-
- Methods:
- parse: Parse and validate the Test config.
- """
-
- def __init__(self):
- super().__init__('test')
-
- def parse(self, file_obj: TextIOWrapper) -> TestConfig:
- """See base class."""
- config_obj = super().parse(file_obj)
-
- implicit_step_config = dict()
- if 'endpoint' in config_obj:
- implicit_step_config['endpoint'] = config_obj['endpoint']
-
- if 'port' in config_obj:
- implicit_step_config['port'] = config_obj['port']
-
-        # Each step should have its own parse - take the config object and check if it's valid
- setup = []
- if 'setup' in config_obj:
- setup = [create_step(StepConfig(step["name"], step, implicit_step_config)) for step in config_obj['setup']]
-
- steps = [create_step(StepConfig(step["name"], step, implicit_step_config)) for step in config_obj['steps']]
-
- cleanup = []
- if 'cleanup' in config_obj:
- cleanup = [create_step(StepConfig(step["name"], step, implicit_step_config)) for step
- in config_obj['cleanup']]
-
- test_config = TestConfig(
- endpoint=config_obj['endpoint'],
- port=config_obj['port'],
- timeout=config_obj['timeout'],
- test_name=config_obj['test_name'],
- test_id=config_obj['test_id'],
- num_runs=config_obj['num_runs'],
- show_runs=config_obj['show_runs'],
- setup=setup,
- steps=steps,
- cleanup=cleanup
- )
-
- return test_config
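
A sketch of parsing a test config with the TestParser above; the config path is a placeholder:

```
from okpt.io.config.parsers.test import TestParser

# Parse and validate the YAML config; the result is a TestConfig whose
# setup/steps/cleanup lists already hold ready-to-execute Step objects.
with open('test-config.yml', 'r', encoding='UTF-8') as config_file:
    test_config = TestParser().parse(config_file)

print(test_config.test_name, len(test_config.steps))
```
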
diff --git a/benchmarks/perf-tool/okpt/io/config/parsers/util.py b/benchmarks/perf-tool/okpt/io/config/parsers/util.py
deleted file mode 100644
index 454fec5a0..000000000
--- a/benchmarks/perf-tool/okpt/io/config/parsers/util.py
+++ /dev/null
@@ -1,116 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-"""Utility functions for parsing"""
-
-
-from okpt.io.config.parsers.base import ConfigurationError
-from okpt.io.dataset import HDF5DataSet, BigANNNeighborDataSet, \
- BigANNVectorDataSet, DataSet, Context
-
-
-def parse_dataset(dataset_format: str, dataset_path: str,
- context: Context, custom_context=None) -> DataSet:
- if dataset_format == 'hdf5':
- return HDF5DataSet(dataset_path, context, custom_context)
-
- if dataset_format == 'bigann' and context == Context.NEIGHBORS:
- return BigANNNeighborDataSet(dataset_path)
-
- if dataset_format == 'bigann':
- return BigANNVectorDataSet(dataset_path)
-
- raise Exception("Unsupported data-set format")
-
-
-def parse_string_param(key: str, first_map, second_map, default) -> str:
- value = first_map.get(key)
- if value is not None:
- if type(value) is str:
- return value
- raise ConfigurationError("Invalid type for {}".format(key))
-
- value = second_map.get(key)
- if value is not None:
- if type(value) is str:
- return value
- raise ConfigurationError("Invalid type for {}".format(key))
-
- if default is None:
- raise ConfigurationError("{} must be set".format(key))
- return default
-
-
-def parse_int_param(key: str, first_map, second_map, default) -> int:
- value = first_map.get(key)
- if value is not None:
- if type(value) is int:
- return value
- raise ConfigurationError("Invalid type for {}".format(key))
-
- value = second_map.get(key)
- if value is not None:
- if type(value) is int:
- return value
- raise ConfigurationError("Invalid type for {}".format(key))
-
- if default is None:
- raise ConfigurationError("{} must be set".format(key))
- return default
-
-
-def parse_bool_param(key: str, first_map, second_map, default) -> bool:
- value = first_map.get(key)
- if value is not None:
- if type(value) is bool:
- return value
- raise ConfigurationError("Invalid type for {}".format(key))
-
- value = second_map.get(key)
- if value is not None:
- if type(value) is bool:
- return value
- raise ConfigurationError("Invalid type for {}".format(key))
-
- if default is None:
- raise ConfigurationError("{} must be set".format(key))
- return default
-
-
-def parse_dict_param(key: str, first_map, second_map, default) -> dict:
- value = first_map.get(key)
- if value is not None:
- if type(value) is dict:
- return value
- raise ConfigurationError("Invalid type for {}".format(key))
-
- value = second_map.get(key)
- if value is not None:
- if type(value) is dict:
- return value
- raise ConfigurationError("Invalid type for {}".format(key))
-
- if default is None:
- raise ConfigurationError("{} must be set".format(key))
- return default
-
-
-def parse_list_param(key: str, first_map, second_map, default) -> list:
- value = first_map.get(key)
- if value is not None:
- if type(value) is list:
- return value
- raise ConfigurationError("Invalid type for {}".format(key))
-
- value = second_map.get(key)
- if value is not None:
- if type(value) is list:
- return value
- raise ConfigurationError("Invalid type for {}".format(key))
-
- if default is None:
- raise ConfigurationError("{} must be set".format(key))
- return default
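
A minimal sketch of the lookup precedence implemented above: the step's own config wins, then the implicit (test-level) config, then the default; a missing value with no default raises ConfigurationError. The example values are made up:

```
from okpt.io.config.parsers.util import parse_int_param

step_config = {'bulk_size': 500}
implicit_config = {'port': 9200}

print(parse_int_param('bulk_size', step_config, implicit_config, 300))  # 500, from the step config
print(parse_int_param('port', step_config, implicit_config, 80))        # 9200, from the implicit config
print(parse_int_param('timeout', step_config, implicit_config, 60))     # 60, the default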
diff --git a/benchmarks/perf-tool/okpt/io/config/schemas/test.yml b/benchmarks/perf-tool/okpt/io/config/schemas/test.yml
deleted file mode 100644
index 4d5c21a15..000000000
--- a/benchmarks/perf-tool/okpt/io/config/schemas/test.yml
+++ /dev/null
@@ -1,35 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-# defined using the cerberus validation API
-# https://docs.python-cerberus.org/en/stable/index.html
-endpoint:
- type: string
- default: "localhost"
-port:
- type: integer
- default: 9200
-timeout:
- type: integer
- default: 60
-test_name:
- type: string
-test_id:
- type: string
-num_runs:
- type: integer
- default: 1
- min: 1
- max: 10000
-show_runs:
- type: boolean
- default: false
-setup:
- type: list
-steps:
- type: list
-cleanup:
- type: list
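
For orientation, here is a hypothetical config shape the schema above accepts, expressed as the Python dict that cerberus validates (it fills in the defaults for endpoint, port, timeout, num_runs and show_runs). The step bodies are illustrative only; see okpt/test/steps/steps.py for the real parameters:

```
config = {
    'test_name': 'example-test',
    'test_id': '1',
    'steps': [
        {'name': 'create_index', 'index_name': 'target_index', 'index_spec': 'indices/index.json'},
        {'name': 'ingest', 'index_name': 'target_index', 'field_name': 'target_field',
         'dataset_format': 'hdf5', 'dataset_path': 'dataset/data.hdf5'},
        {'name': 'delete_index', 'index_name': 'target_index'},
    ],
}
```
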
diff --git a/benchmarks/perf-tool/okpt/io/dataset.py b/benchmarks/perf-tool/okpt/io/dataset.py
deleted file mode 100644
index 001563bab..000000000
--- a/benchmarks/perf-tool/okpt/io/dataset.py
+++ /dev/null
@@ -1,222 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-"""Defines DataSet interface and implements particular formats
-
-A DataSet provides the basic functionality of being read in chunks, or
-read completely and then reset to the start.
-
-Currently, we support HDF5 formats from ann-benchmarks and big-ann-benchmarks
-datasets.
-
-Classes:
- HDF5DataSet: Format used in ann-benchmarks
- BigANNNeighborDataSet: Neighbor format for big-ann-benchmarks
- BigANNVectorDataSet: Vector format for big-ann-benchmarks
-"""
-import os
-from abc import ABC, ABCMeta, abstractmethod
-from enum import Enum
-from typing import cast
-import h5py
-import numpy as np
-
-import struct
-
-
-class Context(Enum):
- """DataSet context enum. Can be used to add additional context for how a
- data-set should be interpreted.
- """
- INDEX = 1
- QUERY = 2
- NEIGHBORS = 3
- CUSTOM = 4
-
-
-class DataSet(ABC):
- """DataSet interface. Used for reading data-sets from files.
-
- Methods:
- read: Read a chunk of data from the data-set
- size: Gets the number of items in the data-set
- reset: Resets internal state of data-set to beginning
- """
- __metaclass__ = ABCMeta
-
- @abstractmethod
- def read(self, chunk_size: int):
- pass
-
- @abstractmethod
- def size(self):
- pass
-
- @abstractmethod
- def reset(self):
- pass
-
-
-class HDF5DataSet(DataSet):
- """ Data-set format corresponding to `ANN Benchmarks
-    <https://github.com/erikbern/ann-benchmarks#data-sets>`_
- """
-
- def __init__(self, dataset_path: str, context: Context, custom_context=None):
- file = h5py.File(dataset_path)
- self.data = cast(h5py.Dataset, file[self._parse_context(context, custom_context)])
- self.current = 0
-
- def read(self, chunk_size: int):
- if self.current >= self.size():
- return None
-
- end_i = self.current + chunk_size
- if end_i > self.size():
- end_i = self.size()
-
- v = cast(np.ndarray, self.data[self.current:end_i])
- self.current = end_i
- return v
-
- def size(self):
- return self.data.len()
-
- def reset(self):
- self.current = 0
-
- @staticmethod
- def _parse_context(context: Context, custom_context=None) -> str:
- if context == Context.NEIGHBORS:
- return "neighbors"
-
- if context == Context.INDEX:
- return "train"
-
- if context == Context.QUERY:
- return "test"
-
- if context == Context.CUSTOM:
- return custom_context
-
- raise Exception("Unsupported context")
-
-
-class BigANNNeighborDataSet(DataSet):
- """ Data-set format for neighbor data-sets for `Big ANN Benchmarks
-    <https://big-ann-benchmarks.com>`_"""
-
- def __init__(self, dataset_path: str):
- self.file = open(dataset_path, 'rb')
- self.file.seek(0, os.SEEK_END)
- num_bytes = self.file.tell()
- self.file.seek(0)
-
- if num_bytes < 8:
- raise Exception("File is invalid")
-
- self.num_queries = int.from_bytes(self.file.read(4), "little")
- self.k = int.from_bytes(self.file.read(4), "little")
-
- # According to the website, the number of bytes that will follow will
- # be: num_queries X K x sizeof(uint32_t) bytes + num_queries X K x
- # sizeof(float)
- if (num_bytes - 8) != 2 * (self.num_queries * self.k * 4):
- raise Exception("File is invalid")
-
- self.current = 0
-
- def read(self, chunk_size: int):
- if self.current >= self.size():
- return None
-
- end_i = self.current + chunk_size
- if end_i > self.size():
- end_i = self.size()
-
- v = [[int.from_bytes(self.file.read(4), "little") for _ in
- range(self.k)] for _ in range(end_i - self.current)]
-
- self.current = end_i
- return v
-
- def size(self):
- return self.num_queries
-
- def reset(self):
- self.file.seek(8)
- self.current = 0
-
-
-class BigANNVectorDataSet(DataSet):
- """ Data-set format for vector data-sets for `Big ANN Benchmarks
-    <https://big-ann-benchmarks.com>`_
- """
-
- def __init__(self, dataset_path: str):
- self.file = open(dataset_path, 'rb')
- self.file.seek(0, os.SEEK_END)
- num_bytes = self.file.tell()
- self.file.seek(0)
-
- if num_bytes < 8:
- raise Exception("File is invalid")
-
- self.num_points = int.from_bytes(self.file.read(4), "little")
- self.dimension = int.from_bytes(self.file.read(4), "little")
- bytes_per_num = self._get_data_size(dataset_path)
-
- if (num_bytes - 8) != self.num_points * self.dimension * bytes_per_num:
- raise Exception("File is invalid")
-
- self.reader = self._value_reader(dataset_path)
- self.current = 0
-
- def read(self, chunk_size: int):
- if self.current >= self.size():
- return None
-
- end_i = self.current + chunk_size
- if end_i > self.size():
- end_i = self.size()
-
- v = np.asarray([self._read_vector() for _ in
- range(end_i - self.current)])
- self.current = end_i
- return v
-
- def _read_vector(self):
- return np.asarray([self.reader(self.file) for _ in
- range(self.dimension)])
-
- def size(self):
- return self.num_points
-
- def reset(self):
- self.file.seek(8) # Seek to 8 bytes to skip re-reading metadata
- self.current = 0
-
- @staticmethod
- def _get_data_size(file_name):
- ext = file_name.split('.')[-1]
- if ext == "u8bin":
- return 1
-
- if ext == "fbin":
- return 4
-
- raise Exception("Unknown extension")
-
- @staticmethod
- def _value_reader(file_name):
- ext = file_name.split('.')[-1]
- if ext == "u8bin":
- return lambda file: float(int.from_bytes(file.read(1), "little"))
-
- if ext == "fbin":
-            return lambda file: struct.unpack('<f', file.read(4))[0]
-
-        raise Exception("Unknown extension")
diff --git a/benchmarks/perf-tool/okpt/io/utils/reader.py b/benchmarks/perf-tool/okpt/io/utils/reader.py
deleted file mode 100644
--- a/benchmarks/perf-tool/okpt/io/utils/reader.py
+++ /dev/null
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-"""Provides functions for reading and parsing files."""
-
-import json
-from io import TextIOWrapper
-from typing import Any, Dict
-
-import yaml
-
-from okpt.io.utils import reader
-
-
-def get_file_obj(path: str) -> TextIOWrapper:
- """Given a file path, get a readable file object.
-
-    Args:
-        path: file path
-
-    Returns:
-        Readable file object
- """
- return open(path, 'r', encoding='UTF-8')
-
-
-def parse_yaml(file: TextIOWrapper) -> Dict[str, Any]:
- """Parses YAML file from file object.
-
- Args:
- file: file object to parse
-
- Returns:
- A dict representing the YAML file.
- """
- return yaml.load(file, Loader=yaml.SafeLoader)
-
-
-def parse_yaml_from_path(path: str) -> Dict[str, Any]:
- """Parses YAML file from file path.
-
- Args:
- path: file path to parse
-
- Returns:
- A dict representing the YAML file.
- """
- file = reader.get_file_obj(path)
- return parse_yaml(file)
-
-
-def parse_json(file: TextIOWrapper) -> Dict[str, Any]:
- """Parses JSON file from file object.
-
- Args:
- file: file object to parse
-
- Returns:
- A dict representing the JSON file.
- """
- return json.load(file)
-
-
-def parse_json_from_path(path: str) -> Dict[str, Any]:
- """Parses JSON file from file path.
-
- Args:
- path: file path to parse
-
- Returns:
- A dict representing the JSON file.
- """
- file = reader.get_file_obj(path)
- return json.load(file)
diff --git a/benchmarks/perf-tool/okpt/io/utils/writer.py b/benchmarks/perf-tool/okpt/io/utils/writer.py
deleted file mode 100644
index 1f14bfd94..000000000
--- a/benchmarks/perf-tool/okpt/io/utils/writer.py
+++ /dev/null
@@ -1,40 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-"""Provides functions for writing to file.
-
-Functions:
- get_file_obj(): Get a writeable file object.
- write_json(): Writes a python dictionary to a JSON file
-"""
-
-import json
-from io import TextIOWrapper
-from typing import Any, Dict, TextIO, Union
-
-
-def get_file_obj(path: str) -> TextIOWrapper:
- """Get a writeable file object from a file path.
-
- Args:
-        path: file path
-
- Returns:
- Writeable file object
- """
- return open(path, 'w', encoding='UTF-8')
-
-
-def write_json(data: Dict[str, Any],
- file: Union[TextIOWrapper, TextIO],
- pretty=False):
- """Writes a dictionary to a JSON file.
-
- Args:
- data: A dict to write to JSON.
-        file: Output file object.
-        pretty: Whether to pretty-print the JSON with indentation.
- """
- indent = 2 if pretty else 0
- json.dump(data, file, indent=indent)
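
A tiny usage sketch for write_json above:

```
import sys

from okpt.io.utils import writer

# Pretty-print a small result dict to stdout; any writeable file object works.
writer.write_json({'recall@10': 0.92}, sys.stdout, pretty=True)
```
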
diff --git a/benchmarks/perf-tool/okpt/main.py b/benchmarks/perf-tool/okpt/main.py
deleted file mode 100644
index 3e6e022d4..000000000
--- a/benchmarks/perf-tool/okpt/main.py
+++ /dev/null
@@ -1,55 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-""" Runner script that serves as the main controller of the testing tool."""
-
-import logging
-import sys
-from typing import cast
-
-from okpt.diff import diff
-from okpt.io import args
-from okpt.io.config.parsers import test
-from okpt.io.utils import reader, writer
-from okpt.test import runner
-
-
-def main():
- """Main function of entry module."""
- cli_args = args.get_args()
- output = cli_args.output
- if cli_args.log:
- log_level = getattr(logging, cli_args.log.upper())
- logging.basicConfig(level=log_level)
-
- if cli_args.command == 'test':
- cli_args = cast(args.TestArgs, cli_args)
-
- # parse config
- parser = test.TestParser()
- test_config = parser.parse(cli_args.config)
- logging.info('Configs are valid.')
-
- # run tests
- test_runner = runner.TestRunner(test_config=test_config)
- test_result = test_runner.execute()
-
- # write test results
- logging.debug(
- f'Test Result:\n {writer.write_json(test_result, sys.stdout, pretty=True)}'
- )
- writer.write_json(test_result, output, pretty=True)
- elif cli_args.command == 'diff':
- cli_args = cast(args.DiffArgs, cli_args)
-
- # parse test results
- base_result = reader.parse_json(cli_args.base_result)
- changed_result = reader.parse_json(cli_args.changed_result)
-
- # get diff
- diff_result = diff.Diff(base_result, changed_result,
- cli_args.metadata).diff()
- writer.write_json(data=diff_result, file=output, pretty=True)
diff --git a/benchmarks/perf-tool/okpt/test/__init__.py b/benchmarks/perf-tool/okpt/test/__init__.py
deleted file mode 100644
index ff4fd04d1..000000000
--- a/benchmarks/perf-tool/okpt/test/__init__.py
+++ /dev/null
@@ -1,5 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
diff --git a/benchmarks/perf-tool/okpt/test/profile.py b/benchmarks/perf-tool/okpt/test/profile.py
deleted file mode 100644
index d96860f9a..000000000
--- a/benchmarks/perf-tool/okpt/test/profile.py
+++ /dev/null
@@ -1,86 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-"""Provides decorators to profile functions.
-
-The decorators work by adding a `measurable` (time, memory, etc.) field to a
-dictionary returned by the wrapped function. So the wrapped functions must
-return a dictionary in order to be profiled.
-"""
-import functools
-import time
-from typing import Callable
-
-
-class TimerStoppedWithoutStartingError(Exception):
- """Error raised when Timer is stopped without having been started."""
-
- def __init__(self):
- super().__init__()
- self.message = 'Timer must call start() before calling end().'
-
-
-class _Timer():
- """Timer class for timing.
-
- Methods:
- start: Starts the timer.
- end: Stops the timer and returns the time elapsed since start.
-
- Raises:
- TimerStoppedWithoutStartingError: Timer must start before ending.
- """
-
- def __init__(self):
- self.start_time = None
-
- def start(self):
- """Starts the timer."""
- self.start_time = time.perf_counter()
-
- def end(self) -> float:
- """Stops the timer.
-
- Returns:
- The time elapsed in milliseconds.
- """
- # ensure timer has started before ending
- if self.start_time is None:
- raise TimerStoppedWithoutStartingError()
-
- elapsed = (time.perf_counter() - self.start_time) * 1000
- self.start_time = None
- return elapsed
-
-
-def took(f: Callable):
-    """Profiles a function's execution time.
-
- Args:
- f: Function to profile.
-
- Returns:
- A function that wraps the passed in function and adds a time took field
- to the return value.
- """
-
- @functools.wraps(f)
- def wrapper(*args, **kwargs):
- """Wrapper function."""
- timer = _Timer()
- timer.start()
- result = f(*args, **kwargs)
- time_took = timer.end()
-
- # if result already has a `took` field, don't modify the result
- if isinstance(result, dict) and 'took' in result:
- return result
- # `result` may not be a dictionary, so it may not be unpackable
- elif isinstance(result, dict):
- return {**result, 'took': time_took}
- return {'took': time_took}
-
- return wrapper
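
A minimal sketch of the `took` decorator above: the wrapped function returns a dict, and the decorator adds the elapsed time in milliseconds under 'took'. The function name and its sleep-based body are made up:

```
import time

from okpt.test.profile import took


@took
def ingest_batch():
    time.sleep(0.05)  # stand-in for real work
    return {'docs': 300}


print(ingest_batch())  # e.g. {'docs': 300, 'took': 50.4}
```
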
diff --git a/benchmarks/perf-tool/okpt/test/runner.py b/benchmarks/perf-tool/okpt/test/runner.py
deleted file mode 100644
index 150154691..000000000
--- a/benchmarks/perf-tool/okpt/test/runner.py
+++ /dev/null
@@ -1,107 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-"""Provides a test runner class."""
-import logging
-import platform
-import sys
-from datetime import datetime
-from typing import Any, Dict, List
-
-import psutil
-
-from okpt.io.config.parsers import test
-from okpt.test.test import Test, get_avg
-
-
-def _aggregate_runs(runs: List[Dict[str, Any]]):
- """Aggregates and averages a list of test results.
-
- Args:
-        runs: A list of test results, one per test run.
-
- Returns:
- A dictionary containing the averages of the test results.
- """
- aggregate: Dict[str, Any] = {}
- for run in runs:
- for key, value in run.items():
- if key in aggregate:
- aggregate[key].append(value)
- else:
- aggregate[key] = [value]
-
- aggregate = {key: get_avg(value) for key, value in aggregate.items()}
- return aggregate
-
-
-class TestRunner:
- """Test runner class for running tests and aggregating the results.
-
- Methods:
- execute: Run the tests and aggregate the results.
- """
-
- def __init__(self, test_config: test.TestConfig):
-        """Initializes test state."""
- self.test_config = test_config
- self.test = Test(test_config)
-
- def _get_metadata(self):
-        """Retrieves the test metadata."""
- svmem = psutil.virtual_memory()
- return {
- 'test_name':
- self.test_config.test_name,
- 'test_id':
- self.test_config.test_id,
- 'date':
- datetime.now().strftime('%m/%d/%Y %H:%M:%S'),
- 'python_version':
- sys.version,
- 'os_version':
- platform.platform(),
- 'processor':
- platform.processor() + ', ' +
- str(psutil.cpu_count(logical=True)) + ' cores',
- 'memory':
- str(svmem.used) + ' (used) / ' + str(svmem.available) +
- ' (available) / ' + str(svmem.total) + ' (total)',
- }
-
- def execute(self) -> Dict[str, Any]:
- """Runs the tests and aggregates the results.
-
- Returns:
- A dictionary containing the aggregate of test results.
- """
- logging.info('Setting up tests.')
- self.test.setup()
- logging.info('Beginning to run tests.')
- runs = []
- for i in range(self.test_config.num_runs):
- logging.info(
- f'Running test {i + 1} of {self.test_config.num_runs}'
- )
- runs.append(self.test.execute())
-
- logging.info('Finished running tests.')
- aggregate = _aggregate_runs(runs)
-
- # add metadata to test results
- test_result = {
- 'metadata':
- self._get_metadata(),
- 'results':
- aggregate
- }
-
- # include info about all test runs if specified in config
- if self.test_config.show_runs:
- test_result['runs'] = runs
-
- return test_result
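
A sketch of the per-key aggregation performed above, assuming get_avg in okpt.test.test computes a plain mean; the metric names are made up:

```
from okpt.test.runner import _aggregate_runs

# Two fake runs; values for the same key are collected and averaged.
runs = [
    {'took_total': 100.0, 'recall@10': 0.5},
    {'took_total': 120.0, 'recall@10': 0.7},
]
print(_aggregate_runs(runs))  # -> {'took_total': 110.0, 'recall@10': 0.6}
```
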
diff --git a/benchmarks/perf-tool/okpt/test/steps/base.py b/benchmarks/perf-tool/okpt/test/steps/base.py
deleted file mode 100644
index 829980421..000000000
--- a/benchmarks/perf-tool/okpt/test/steps/base.py
+++ /dev/null
@@ -1,60 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-"""Provides base Step interface."""
-
-from dataclasses import dataclass
-from typing import Any, Dict, List
-
-from okpt.test import profile
-
-
-@dataclass
-class StepConfig:
- step_name: str
- config: Dict[str, object]
- implicit_config: Dict[str, object]
-
-
-class Step:
- """Test step interface.
-
- Attributes:
- label: Name of the step.
-
- Methods:
- execute: Run the step and return a step response with the label and
- corresponding measures.
- """
-
- label = 'base_step'
-
- def __init__(self, step_config: StepConfig):
- self.step_config = step_config
-
- def _action(self):
- """Step logic/behavior to be executed and profiled."""
- pass
-
- def _get_measures(self) -> List[str]:
- """Gets the measures for a particular test"""
- pass
-
- def execute(self) -> List[Dict[str, Any]]:
- """Execute step logic while profiling various measures.
-
- Returns:
-            A list with a dict containing the step label and its measures.
- """
- action = self._action
-
- # profile the action with measure decorators - add if necessary
- action = getattr(profile, 'took')(action)
-
- result = action()
- if isinstance(result, dict):
- return [{'label': self.label, **result}]
-
- raise ValueError('Invalid return by a step')
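
A minimal sketch of a custom step built on the interface above; the label and the sleep-based action are made up, but the shape (dict result plus a 'took' measure) matches what execute() expects:

```
import time
from typing import List

from okpt.test.steps.base import Step, StepConfig


class SleepStep(Step):
    """Hypothetical step that just sleeps for a configurable duration."""

    label = 'sleep'

    def _action(self):
        time.sleep(float(self.step_config.config.get('seconds', 1)))
        return {}

    def _get_measures(self) -> List[str]:
        return ['took']


step = SleepStep(StepConfig(step_name='sleep', config={'seconds': 0.1}, implicit_config={}))
print(step.execute())  # [{'label': 'sleep', 'took': ...}]
```
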
diff --git a/benchmarks/perf-tool/okpt/test/steps/factory.py b/benchmarks/perf-tool/okpt/test/steps/factory.py
deleted file mode 100644
index 2033f2672..000000000
--- a/benchmarks/perf-tool/okpt/test/steps/factory.py
+++ /dev/null
@@ -1,50 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-"""Factory for creating steps."""
-
-from okpt.io.config.parsers.base import ConfigurationError
-from okpt.test.steps.base import Step, StepConfig
-
-from okpt.test.steps.steps import CreateIndexStep, DisableRefreshStep, RefreshIndexStep, DeleteIndexStep, \
- TrainModelStep, DeleteModelStep, ForceMergeStep, ClearCacheStep, IngestStep, IngestMultiFieldStep, \
- IngestNestedFieldStep, QueryStep, QueryWithFilterStep, QueryNestedFieldStep, GetStatsStep, WarmupStep
-
-
-def create_step(step_config: StepConfig) -> Step:
- if step_config.step_name == CreateIndexStep.label:
- return CreateIndexStep(step_config)
- elif step_config.step_name == DisableRefreshStep.label:
- return DisableRefreshStep(step_config)
- elif step_config.step_name == RefreshIndexStep.label:
- return RefreshIndexStep(step_config)
- elif step_config.step_name == TrainModelStep.label:
- return TrainModelStep(step_config)
- elif step_config.step_name == DeleteModelStep.label:
- return DeleteModelStep(step_config)
- elif step_config.step_name == DeleteIndexStep.label:
- return DeleteIndexStep(step_config)
- elif step_config.step_name == IngestStep.label:
- return IngestStep(step_config)
- elif step_config.step_name == IngestMultiFieldStep.label:
- return IngestMultiFieldStep(step_config)
- elif step_config.step_name == IngestNestedFieldStep.label:
- return IngestNestedFieldStep(step_config)
- elif step_config.step_name == QueryStep.label:
- return QueryStep(step_config)
- elif step_config.step_name == QueryWithFilterStep.label:
- return QueryWithFilterStep(step_config)
- elif step_config.step_name == QueryNestedFieldStep.label:
- return QueryNestedFieldStep(step_config)
- elif step_config.step_name == ForceMergeStep.label:
- return ForceMergeStep(step_config)
- elif step_config.step_name == ClearCacheStep.label:
- return ClearCacheStep(step_config)
- elif step_config.step_name == GetStatsStep.label:
- return GetStatsStep(step_config)
- elif step_config.step_name == WarmupStep.label:
- return WarmupStep(step_config)
-
- raise ConfigurationError(f'Invalid step {step_config.step_name}')
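
The if/elif chain above could equally be written as a label-to-class registry; a sketch of that alternative follows (only a subset of the steps is shown, and this is not the deleted implementation):

```
from okpt.io.config.parsers.base import ConfigurationError
from okpt.test.steps.base import Step, StepConfig
from okpt.test.steps.steps import CreateIndexStep, DeleteIndexStep, IngestStep, QueryStep

# Map each step label to its class once, then look the label up on demand.
_STEP_REGISTRY = {cls.label: cls for cls in (CreateIndexStep, DeleteIndexStep, IngestStep, QueryStep)}


def create_step_from_registry(step_config: StepConfig) -> Step:
    step_cls = _STEP_REGISTRY.get(step_config.step_name)
    if step_cls is None:
        raise ConfigurationError(f'Invalid step {step_config.step_name}')
    return step_cls(step_config)
```
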
diff --git a/benchmarks/perf-tool/okpt/test/steps/steps.py b/benchmarks/perf-tool/okpt/test/steps/steps.py
deleted file mode 100644
index 99b2728dc..000000000
--- a/benchmarks/perf-tool/okpt/test/steps/steps.py
+++ /dev/null
@@ -1,987 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-"""Provides steps for OpenSearch tests.
-
-Some OpenSearch operations return a `took` field in the response body,
-so the profiling decorators aren't needed for some functions.
-"""
-import json
-from abc import abstractmethod
-from typing import Any, Dict, List
-
-import numpy as np
-import requests
-import time
-
-from opensearchpy import OpenSearch, RequestsHttpConnection
-
-from okpt.io.config.parsers.base import ConfigurationError
-from okpt.io.config.parsers.util import parse_string_param, parse_int_param, parse_dataset, parse_bool_param, \
- parse_list_param
-from okpt.io.dataset import Context
-from okpt.io.utils.reader import parse_json_from_path
-from okpt.test.steps import base
-from okpt.test.steps.base import StepConfig
-
-
-class OpenSearchStep(base.Step):
- """See base class."""
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
- self.endpoint = parse_string_param('endpoint', step_config.config,
- step_config.implicit_config,
- 'localhost')
- default_port = 9200 if self.endpoint == 'localhost' else 80
- self.port = parse_int_param('port', step_config.config,
- step_config.implicit_config, default_port)
- self.timeout = parse_int_param('timeout', step_config.config, {}, 60)
- self.opensearch = get_opensearch_client(str(self.endpoint),
- int(self.port), int(self.timeout))
-
-
-class CreateIndexStep(OpenSearchStep):
- """See base class."""
-
- label = 'create_index'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
- self.index_name = parse_string_param('index_name', step_config.config,
- {}, None)
- index_spec = parse_string_param('index_spec', step_config.config, {},
- None)
- self.body = parse_json_from_path(index_spec)
- if self.body is None:
- raise ConfigurationError('Index body must be passed in')
-
- def _action(self):
- """Creates an OpenSearch index, applying the index settings/mappings.
-
- Returns:
- An OpenSearch index creation response body.
- """
- self.opensearch.indices.create(index=self.index_name, body=self.body)
- return {}
-
- def _get_measures(self) -> List[str]:
- return ['took']
-
-
-class DisableRefreshStep(OpenSearchStep):
- """See base class."""
-
- label = 'disable_refresh'
-
- def _action(self):
- """Disables the refresh interval for an OpenSearch index.
-
- Returns:
- An OpenSearch index settings update response body.
- """
- self.opensearch.indices.put_settings(
- body={'index': {
- 'refresh_interval': -1
- }})
-
- return {}
-
- def _get_measures(self) -> List[str]:
- return ['took']
-
-
-class RefreshIndexStep(OpenSearchStep):
- """See base class."""
-
- label = 'refresh_index'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
- self.index_name = parse_string_param('index_name', step_config.config,
- {}, None)
-
- def _action(self):
- while True:
- try:
- self.opensearch.indices.refresh(index=self.index_name)
- return {'store_kb': get_index_size_in_kb(self.opensearch,
- self.index_name)}
- except:
- pass
-
- def _get_measures(self) -> List[str]:
- return ['took', 'store_kb']
-
-
-class ForceMergeStep(OpenSearchStep):
- """See base class."""
-
- label = 'force_merge'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
- self.index_name = parse_string_param('index_name', step_config.config,
- {}, None)
- self.max_num_segments = parse_int_param('max_num_segments',
- step_config.config, {}, None)
-
- def _action(self):
- while True:
- try:
- self.opensearch.indices.forcemerge(
- index=self.index_name,
- max_num_segments=self.max_num_segments)
- return {}
- except:
- pass
-
- def _get_measures(self) -> List[str]:
- return ['took']
-
-class ClearCacheStep(OpenSearchStep):
- """See base class."""
-
- label = 'clear_cache'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
- self.index_name = parse_string_param('index_name', step_config.config,
- {}, None)
-
- def _action(self):
- while True:
- try:
- self.opensearch.indices.clear_cache(
- index=self.index_name)
- return {}
- except:
- pass
-
- def _get_measures(self) -> List[str]:
- return ['took']
-
-
-class WarmupStep(OpenSearchStep):
- """See base class."""
-
- label = 'warmup_operation'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
- self.index_name = parse_string_param('index_name', step_config.config, {},
- None)
-
- def _action(self):
- """Performs warmup operation on an index."""
- warmup_operation(self.endpoint, self.port, self.index_name)
- return {}
-
- def _get_measures(self) -> List[str]:
- return ['took']
-
-
-class TrainModelStep(OpenSearchStep):
- """See base class."""
-
- label = 'train_model'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
-
- self.model_id = parse_string_param('model_id', step_config.config, {},
- 'Test')
- self.train_index_name = parse_string_param('train_index',
- step_config.config, {}, None)
- self.train_index_field = parse_string_param('train_field',
- step_config.config, {},
- None)
- self.dimension = parse_int_param('dimension', step_config.config, {},
- None)
- self.description = parse_string_param('description', step_config.config,
- {}, 'Default')
- self.max_training_vector_count = parse_int_param(
- 'max_training_vector_count', step_config.config, {}, 10000000000000)
-
- method_spec = parse_string_param('method_spec', step_config.config, {},
- None)
- self.method = parse_json_from_path(method_spec)
- if self.method is None:
- raise ConfigurationError('method must be passed in')
-
- def _action(self):
- """Train a model for an index.
-
-        Returns:
-            An empty dict once the model reaches the 'created' state
- """
-
- # Build body
- body = {
- 'training_index': self.train_index_name,
- 'training_field': self.train_index_field,
- 'description': self.description,
- 'dimension': self.dimension,
- 'method': self.method,
- 'max_training_vector_count': self.max_training_vector_count
- }
-
-        # Kick off training, then poll every 1/10 second until the model
-        # reaches the 'created' state
- requests.post('http://' + self.endpoint + ':' + str(self.port) +
- '/_plugins/_knn/models/' + str(self.model_id) + '/_train',
- json.dumps(body),
- headers={'content-type': 'application/json'})
-
- sleep_time = 0.1
- timeout = 100000
- i = 0
- while i < timeout:
- time.sleep(sleep_time)
- model_response = get_model(self.endpoint, self.port, self.model_id)
- if 'state' in model_response.keys() and model_response['state'] == \
- 'created':
- return {}
- i += 1
-
- raise TimeoutError('Failed to create model')
-
- def _get_measures(self) -> List[str]:
- return ['took']
-
-
-class DeleteModelStep(OpenSearchStep):
- """See base class."""
-
- label = 'delete_model'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
-
- self.model_id = parse_string_param('model_id', step_config.config, {},
- 'Test')
-
- def _action(self):
-        """Delete the model.
-
-        Returns:
-            An empty dict
- """
- delete_model(self.endpoint, self.port, self.model_id)
- return {}
-
- def _get_measures(self) -> List[str]:
- return ['took']
-
-
-class DeleteIndexStep(OpenSearchStep):
- """See base class."""
-
- label = 'delete_index'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
-
- self.index_name = parse_string_param('index_name', step_config.config,
- {}, None)
-
- def _action(self):
- """Delete the index
-
- Returns:
- An empty dict
- """
- delete_index(self.opensearch, self.index_name)
- return {}
-
- def _get_measures(self) -> List[str]:
- return ['took']
-
-
-class BaseIngestStep(OpenSearchStep):
- """See base class."""
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
- self.index_name = parse_string_param('index_name', step_config.config,
- {}, None)
- self.field_name = parse_string_param('field_name', step_config.config,
- {}, None)
- self.bulk_size = parse_int_param('bulk_size', step_config.config, {},
- 300)
- self.implicit_config = step_config.implicit_config
- dataset_format = parse_string_param('dataset_format',
- step_config.config, {}, 'hdf5')
- dataset_path = parse_string_param('dataset_path', step_config.config,
- {}, None)
- self.dataset = parse_dataset(dataset_format, dataset_path,
- Context.INDEX)
-
- self.input_doc_count = parse_int_param('doc_count', step_config.config, {},
- self.dataset.size())
- self.doc_count = min(self.input_doc_count, self.dataset.size())
-
- def _action(self):
-
- def action(doc_id):
- return {'index': {'_index': self.index_name, '_id': doc_id}}
-
-        # Maintain minimal state outside of this loop. For large data sets,
-        # keeping too much state in memory may cause an out-of-memory failure.
- for i in range(0, self.doc_count, self.bulk_size):
- partition = self.dataset.read(self.bulk_size)
- self._handle_data_bulk(partition, action, i)
- self.dataset.reset()
-
- return {}
-
- def _get_measures(self) -> List[str]:
- return ['took']
-
- @abstractmethod
- def _handle_data_bulk(self, partition, action, i):
- pass
-
-
-class IngestStep(BaseIngestStep):
- """See base class."""
-
- label = 'ingest'
-
- def _handle_data_bulk(self, partition, action, i):
- if partition is None:
- return
- body = bulk_transform(partition, self.field_name, action, i)
- bulk_index(self.opensearch, self.index_name, body)
-
-
-class IngestMultiFieldStep(BaseIngestStep):
- """See base class."""
-
- label = 'ingest_multi_field'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
-
- dataset_path = parse_string_param('dataset_path', step_config.config,
- {}, None)
-
- self.attributes_dataset_name = parse_string_param('attributes_dataset_name',
- step_config.config, {}, None)
-
- self.attributes_dataset = parse_dataset('hdf5', dataset_path,
- Context.CUSTOM, self.attributes_dataset_name)
-
- self.attribute_spec = parse_list_param('attribute_spec',
- step_config.config, {}, [])
-
- self.partition_attr = self.attributes_dataset.read(self.doc_count)
- self.action_buffer = None
-
- def _handle_data_bulk(self, partition, action, i):
- if partition is None:
- return
- body = self.bulk_transform_with_attributes(partition, self.partition_attr, self.field_name,
- action, i, self.attribute_spec)
- bulk_index(self.opensearch, self.index_name, body)
-
- def bulk_transform_with_attributes(self, partition: np.ndarray, partition_attr, field_name: str,
- action, offset: int, attributes_def) -> List[Dict[str, Any]]:
- """Partitions and transforms a list of vectors into OpenSearch's bulk
-        ingestion format.
- Args:
- partition: An array of vectors to transform.
- partition_attr: dictionary of additional data to transform
- field_name: field name for action
- action: Bulk API action.
- offset: to start counting from
- attributes_def: definition of additional doc fields
- Returns:
- An array of transformed vectors in bulk format.
- """
- actions = []
- _ = [
- actions.extend([action(i + offset), None])
- for i in range(len(partition))
- ]
- idx = 1
- part_list = partition.tolist()
- for i in range(len(partition)):
- actions[idx] = {field_name: part_list[i]}
- attr_idx = i + offset
- attr_def_idx = 0
- for attribute in attributes_def:
- attr_def_name = attribute['name']
- attr_def_type = attribute['type']
-
- if attr_def_type == 'str':
- val = partition_attr[attr_idx][attr_def_idx].decode()
- if val != 'None':
- actions[idx][attr_def_name] = val
- elif attr_def_type == 'int':
- val = int(partition_attr[attr_idx][attr_def_idx].decode())
- actions[idx][attr_def_name] = val
- attr_def_idx += 1
- idx += 2
-
- return actions
-
-
-class IngestNestedFieldStep(BaseIngestStep):
- """See base class."""
-
- label = 'ingest_nested_field'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
-
- dataset_path = parse_string_param('dataset_path', step_config.config,
- {}, None)
-
- self.attributes_dataset_name = parse_string_param('attributes_dataset_name',
- step_config.config, {}, None)
-
- self.attributes_dataset = parse_dataset('hdf5', dataset_path,
- Context.CUSTOM, self.attributes_dataset_name)
-
- self.attribute_spec = parse_list_param('attribute_spec',
- step_config.config, {}, [])
-
- self.partition_attr = self.attributes_dataset.read(self.doc_count)
-
- if self.dataset.size() != self.doc_count:
- raise ValueError("custom doc_count is not supported for nested field")
- self.action_buffer = None
- self.action_parent_id = None
- self.count = 0
-
- def _handle_data_bulk(self, partition, action, i):
- if partition is None:
- return
- body = self.bulk_transform_with_nested(partition, self.partition_attr, self.field_name,
- action, i, self.attribute_spec)
- if len(body) > 0:
- bulk_index(self.opensearch, self.index_name, body)
-
- def bulk_transform_with_nested(self, partition: np.ndarray, partition_attr, field_name: str,
- action, offset: int, attributes_def) -> List[Dict[str, Any]]:
- """Partitions and transforms a list of vectors into OpenSearch's bulk
-        ingestion format.
- Args:
- partition: An array of vectors to transform.
- partition_attr: dictionary of additional data to transform
- field_name: field name for action
- action: Bulk API action.
- offset: to start counting from
- attributes_def: definition of additional doc fields
- Returns:
- An array of transformed vectors in bulk format.
- """
-        # offset is the index of the first row of this partition. We need the
-        # number of parent docs minus 1, which can be derived from the
-        # parent_id values in partition_attr. The last parent doc is buffered
-        # so that nested vectors from the next partition can still be added to it.
- parent_id_idx = next((index for (index, d) in enumerate(attributes_def) if d.get('name') == 'parent_id'), None)
- if parent_id_idx is None:
- raise ValueError("parent_id should be provided as attribute spec")
- if attributes_def[parent_id_idx]['type'] != 'int':
- raise ValueError("parent_id should be int type")
-
- first_index = offset
- last_index = offset + len(partition) - 1
- num_of_actions = int(partition_attr[last_index][parent_id_idx].decode()) - int(partition_attr[first_index][parent_id_idx].decode())
- if self.action_buffer is None:
- self.action_buffer = {"nested_field": []}
- self.action_parent_id = int(partition_attr[first_index][parent_id_idx].decode())
-
- actions = []
- _ = [
- actions.extend([action(i + self.action_parent_id), None])
- for i in range(num_of_actions)
- ]
-
- idx = 1
- part_list = partition.tolist()
- for i in range(len(partition)):
- self.count += 1
- nested = {field_name: part_list[i]}
- attr_idx = i + offset
- attr_def_idx = 0
- current_parent_id = None
- for attribute in attributes_def:
- attr_def_name = attribute['name']
- attr_def_type = attribute['type']
- if attr_def_name == "parent_id":
- current_parent_id = int(partition_attr[attr_idx][attr_def_idx].decode())
- attr_def_idx += 1
- continue
-
- if attr_def_type == 'str':
- val = partition_attr[attr_idx][attr_def_idx].decode()
- if val != 'None':
- nested[attr_def_name] = val
- elif attr_def_type == 'int':
- val = int(partition_attr[attr_idx][attr_def_idx].decode())
- nested[attr_def_name] = val
- attr_def_idx += 1
-
- if self.action_parent_id == current_parent_id:
- self.action_buffer["nested_field"].append(nested)
- else:
- actions.extend([action(self.action_parent_id), self.action_buffer])
- self.action_buffer = {"nested_field": []}
- self.action_buffer["nested_field"].append(nested)
- self.action_parent_id = current_parent_id
- idx += 2
-
- if self.count == self.doc_count:
- actions.extend([action(self.action_parent_id), self.action_buffer])
-
- return actions
-
-
-class BaseQueryStep(OpenSearchStep):
- """See base class."""
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
- self.k = parse_int_param('k', step_config.config, {}, 100)
- self.r = parse_int_param('r', step_config.config, {}, 1)
- self.index_name = parse_string_param('index_name', step_config.config,
- {}, None)
- self.field_name = parse_string_param('field_name', step_config.config,
- {}, None)
- self.calculate_recall = parse_bool_param('calculate_recall',
- step_config.config, {}, False)
- dataset_format = parse_string_param('dataset_format',
- step_config.config, {}, 'hdf5')
- dataset_path = parse_string_param('dataset_path',
- step_config.config, {}, None)
- self.dataset = parse_dataset(dataset_format, dataset_path,
- Context.QUERY)
-
- input_query_count = parse_int_param('query_count',
- step_config.config, {},
- self.dataset.size())
- self.query_count = min(input_query_count, self.dataset.size())
-
- self.neighbors_format = parse_string_param('neighbors_format',
- step_config.config, {}, 'hdf5')
- self.neighbors_path = parse_string_param('neighbors_path',
- step_config.config, {}, None)
-
- def _action(self):
-
- results = {}
- query_responses = []
- for _ in range(self.query_count):
- query = self.dataset.read(1)
- if query is None:
- break
- query_responses.append(
- query_index(self.opensearch, self.index_name,
- self.get_body(query[0]) , self.get_exclude_fields()))
-
- results['took'] = [
- float(query_response['took']) for query_response in query_responses
- ]
- results['client_time'] = [
- float(query_response['client_time']) for query_response in query_responses
- ]
- results['memory_kb'] = get_cache_size_in_kb(self.endpoint, self.port)
-
- if self.calculate_recall:
- ids = [[int(hit['_id'])
- for hit in query_response['hits']['hits']]
- for query_response in query_responses]
- results['recall@K'] = recall_at_r(ids, self.neighbors,
- self.k, self.k, self.query_count)
- self.neighbors.reset()
- results[f'recall@{str(self.r)}'] = recall_at_r(
- ids, self.neighbors, self.r, self.k, self.query_count)
- self.neighbors.reset()
-
- self.dataset.reset()
-
- return results
-
- def _get_measures(self) -> List[str]:
- measures = ['took', 'memory_kb', 'client_time']
-
- if self.calculate_recall:
- measures.extend(['recall@K', f'recall@{str(self.r)}'])
-
- return measures
-
- @abstractmethod
- def get_body(self, vec):
- pass
-
- def get_exclude_fields(self):
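-        # Exclude the raw vector field from _source so query responses stay small.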
- return [self.field_name]
-
-class QueryStep(BaseQueryStep):
- """See base class."""
-
- label = 'query'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
- self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path,
- Context.NEIGHBORS)
- self.implicit_config = step_config.implicit_config
-
- def get_body(self, vec):
- return {
- 'size': self.k,
- 'query': {
- 'knn': {
- self.field_name: {
- 'vector': vec,
- 'k': self.k
- }
- }
- }
- }
-
-
-class QueryWithFilterStep(BaseQueryStep):
- """See base class."""
-
- label = 'query_with_filter'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
-
- neighbors_dataset = parse_string_param('neighbors_dataset',
- step_config.config, {}, None)
-
- self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path,
- Context.CUSTOM, neighbors_dataset)
-
- self.filter_type = parse_string_param('filter_type', step_config.config, {}, 'SCRIPT')
- self.filter_spec = parse_string_param('filter_spec', step_config.config, {}, None)
- self.score_script_similarity = parse_string_param('score_script_similarity', step_config.config, {}, 'l2')
-
- self.implicit_config = step_config.implicit_config
-
- def get_body(self, vec):
-        with open(self.filter_spec) as filter_file:
-            filter_json = json.load(filter_file)
- if self.filter_type == 'FILTER':
- return {
- 'size': self.k,
- 'query': {
- 'knn': {
- self.field_name: {
- 'vector': vec,
- 'k': self.k,
- 'filter': filter_json
- }
- }
- }
- }
- elif self.filter_type == 'SCRIPT':
- return {
- 'size': self.k,
- 'query': {
- 'script_score': {
- 'query': {
- 'bool': {
- 'filter': filter_json
- }
- },
- 'script': {
- 'source': 'knn_score',
- 'lang': 'knn',
- 'params': {
- 'field': self.field_name,
- 'query_value': vec,
- 'space_type': self.score_script_similarity
- }
- }
- }
- }
- }
- elif self.filter_type == 'BOOL_POST_FILTER':
- return {
- 'size': self.k,
- 'query': {
- 'bool': {
- 'filter': filter_json,
- 'must': [
- {
- 'knn': {
- self.field_name: {
- 'vector': vec,
- 'k': self.k
- }
- }
- }
- ]
- }
- }
- }
- else:
- raise ConfigurationError('Not supported filter type {}'.format(self.filter_type))
-
-class QueryNestedFieldStep(BaseQueryStep):
- """See base class."""
-
- label = 'query_nested_field'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
-
- neighbors_dataset = parse_string_param('neighbors_dataset',
- step_config.config, {}, None)
-
- self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path,
- Context.CUSTOM, neighbors_dataset)
-
- self.implicit_config = step_config.implicit_config
-
- def get_body(self, vec):
- return {
- 'size': self.k,
- 'query': {
- 'nested': {
- 'path': 'nested_field',
- 'query': {
- 'knn': {
- 'nested_field.' + self.field_name: {
- 'vector': vec,
- 'k': self.k
- }
- }
- }
- }
- }
- }
-
-class GetStatsStep(OpenSearchStep):
- """See base class."""
-
- label = 'get_stats'
-
- def __init__(self, step_config: StepConfig):
- super().__init__(step_config)
-
- self.index_name = parse_string_param('index_name', step_config.config,
- {}, None)
-
- def _action(self):
- """Get stats for cluster/index etc.
-
- Returns:
- Stats with following info:
- - number of committed and search segments in the index
- """
- results = {}
- segment_stats = get_segment_stats(self.opensearch, self.index_name)
- shards = segment_stats["indices"][self.index_name]["shards"]
- num_of_committed_segments = 0
-        num_of_search_segments = 0
- for shard_key in shards.keys():
- for segment in shards[shard_key]:
- num_of_committed_segments += segment["num_committed_segments"]
- num_of_search_segments += segment["num_search_segments"]
-
- results['committed_segments'] = num_of_committed_segments
- results['search_segments'] = num_of_search_segments
- return results
-
- def _get_measures(self) -> List[str]:
- return ['committed_segments', 'search_segments']
-
-# Helper functions (not steps)
-def bulk_transform(partition: np.ndarray, field_name: str, action,
- offset: int) -> List[Dict[str, Any]]:
- """Partitions and transforms a list of vectors into OpenSearch's bulk
-    ingestion format.
- Args:
- offset: to start counting from
- partition: An array of vectors to transform.
- field_name: field name for action
- action: Bulk API action.
- Returns:
- An array of transformed vectors in bulk format.
- """
- actions = []
- _ = [
- actions.extend([action(i + offset), None])
- for i in range(len(partition))
- ]
- actions[1::2] = [{field_name: vec} for vec in partition.tolist()]
- return actions
-
-
-def delete_index(opensearch: OpenSearch, index_name: str):
- """Deletes an OpenSearch index.
-
- Args:
- opensearch: An OpenSearch client.
- index_name: Name of the OpenSearch index to be deleted.
- """
- opensearch.indices.delete(index=index_name, ignore=[400, 404])
-
-
-def get_model(endpoint, port, model_id):
- """
- Retrieve a model from an OpenSearch cluster
- Args:
- endpoint: Endpoint OpenSearch is running on
- port: Port OpenSearch is running on
-        model_id: ID of the model to retrieve
- Returns:
- Get model response
- """
- response = requests.get('http://' + endpoint + ':' + str(port) +
- '/_plugins/_knn/models/' + model_id,
- headers={'content-type': 'application/json'})
- return response.json()
-
-
-def delete_model(endpoint, port, model_id):
- """
- Deletes a model from OpenSearch cluster
- Args:
- endpoint: Endpoint OpenSearch is running on
- port: Port OpenSearch is running on
- model_id: ID of model to be deleted
- Returns:
- Deleted model response
- """
- response = requests.delete('http://' + endpoint + ':' + str(port) +
- '/_plugins/_knn/models/' + model_id,
- headers={'content-type': 'application/json'})
- return response.json()
-
-
-def warmup_operation(endpoint, port, index):
- """
-    Performs a warmup operation on an index, loading its native library files
-    to reduce query latencies.
- Args:
- endpoint: Endpoint OpenSearch is running on
- port: Port OpenSearch is running on
- index: index name
- Returns:
- number of shards the plugin succeeded and failed to warm up.
- """
- response = requests.get('http://' + endpoint + ':' + str(port) +
- '/_plugins/_knn/warmup/' + index,
- headers={'content-type': 'application/json'})
- return response.json()
-
-
-def get_opensearch_client(endpoint: str, port: int, timeout=60):
- """
- Get an opensearch client from an endpoint and port
- Args:
- endpoint: Endpoint OpenSearch is running on
- port: Port OpenSearch is running on
- timeout: timeout for OpenSearch client, default value 60
- Returns:
- OpenSearch client
-
- """
- # TODO: fix for security in the future
- return OpenSearch(
- hosts=[{
- 'host': endpoint,
- 'port': port
- }],
- use_ssl=False,
- verify_certs=False,
- connection_class=RequestsHttpConnection,
- timeout=timeout,
- )
-
-
-def recall_at_r(results, neighbor_dataset, r, k, query_count):
- """
- Calculates the recall@R for a set of queries against a ground truth nearest
- neighbor set
- Args:
- results: 2D list containing ids of results returned by OpenSearch.
- results[i][j] i refers to query, j refers to
- result in the query
- neighbor_dataset: 2D dataset containing ids of the true nearest
- neighbors for a set of queries
- r: number of top results to check if they are in the ground truth k-NN
- set.
- k: k value for the query
- query_count: number of queries
- Returns:
- Recall at R
- """
- correct = 0.0
- total_num_of_results = 0
- for query in range(query_count):
- true_neighbors = neighbor_dataset.read(1)
- if true_neighbors is None:
- break
- true_neighbors_set = set(true_neighbors[0][:k])
- true_neighbors_set.discard(-1)
- min_r = min(r, len(true_neighbors_set))
- total_num_of_results += min_r
- for j in range(min_r):
- if results[query][j] in true_neighbors_set:
- correct += 1.0
-
- return correct / total_num_of_results
-
-
-def get_index_size_in_kb(opensearch, index_name):
- """
- Gets the size of an index in kilobytes
- Args:
- opensearch: opensearch client
- index_name: name of index to look up
- Returns:
- size of index in kilobytes
- """
- return int(
- opensearch.indices.stats(index_name, metric='store')['indices']
- [index_name]['total']['store']['size_in_bytes']) / 1024
-
-
-def get_cache_size_in_kb(endpoint, port):
- """
- Gets the size of the k-NN cache in kilobytes
- Args:
- endpoint: endpoint of OpenSearch cluster
- port: port of endpoint OpenSearch is running on
- Returns:
- size of cache in kilobytes
- """
- response = requests.get('http://' + endpoint + ':' + str(port) +
- '/_plugins/_knn/stats',
- headers={'content-type': 'application/json'})
- stats = response.json()
-
- keys = stats['nodes'].keys()
-
- total_used = 0
- for key in keys:
- total_used += int(stats['nodes'][key]['graph_memory_usage'])
- return total_used
-
-
-def query_index(opensearch: OpenSearch, index_name: str, body: dict,
- excluded_fields: list):
-    start_time = round(time.time() * 1000)
-    query_response = opensearch.search(index=index_name,
-                                       body=body,
-                                       _source_excludes=excluded_fields)
-    end_time = round(time.time() * 1000)
-    query_response['client_time'] = end_time - start_time
-    return query_response
-
-
-def bulk_index(opensearch: OpenSearch, index_name: str, body: List):
- return opensearch.bulk(index=index_name, body=body)
-
-def get_segment_stats(opensearch: OpenSearch, index_name: str):
- return opensearch.indices.segments(index=index_name)
diff --git a/benchmarks/perf-tool/okpt/test/test.py b/benchmarks/perf-tool/okpt/test/test.py
deleted file mode 100644
index c947545ad..000000000
--- a/benchmarks/perf-tool/okpt/test/test.py
+++ /dev/null
@@ -1,188 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-#
-# The OpenSearch Contributors require contributions made to
-# this file be licensed under the Apache-2.0 license or a
-# compatible open source license.
-
-"""Provides a base Test class."""
-from math import floor
-from typing import Any, Dict, List
-
-from okpt.io.config.parsers.test import TestConfig
-from okpt.test.steps.base import Step
-
-
-def get_avg(values: List[Any]):
- """Get average value of a list.
-
- Args:
- values: A list of values.
-
- Returns:
- The average value in the list.
- """
- valid_total = len(values)
- running_sum = 0.0
-
- for value in values:
- if value == -1:
- valid_total -= 1
- continue
- running_sum += value
-
- if valid_total == 0:
- return -1
- return running_sum / valid_total
-
-
-def _pxx(values: List[Any], p: float):
- """Calculates the pXX statistics for a given list.
-
- Args:
- values: List of values.
- p: Percentile (between 0 and 1).
-
- Returns:
- The corresponding pXX metric.
- """
- lowest_percentile = 1 / len(values)
- highest_percentile = (len(values) - 1) / len(values)
-
- # return -1 if p is out of range or if the list doesn't have enough elements
- # to support the specified percentile
- if p < 0 or p > 1:
- return -1.0
- elif p < lowest_percentile or p > highest_percentile:
- if p == 1.0 and len(values) > 1:
- return float(values[len(values) - 1])
- return -1.0
- else:
- return float(values[floor(len(values) * p)])
-
-
-def _aggregate_steps(step_results: List[Dict[str, Any]],
- measure_labels=None):
- """Aggregates the steps for a given Test.
-
- The aggregation process extracts the measures from each step and calculates
- the total time spent performing each step measure, including the
- percentile metrics, if possible.
-
- The aggregation process also extracts the test measures by simply summing
- up the respective step measures.
-
- A step measure is formatted as `{step_name}_{measure_name}`, for example,
- {bulk_index}_{took} or {query_index}_{memory}. The braces are not included
- in the actual key string.
-
-    Percentile/total step measures are given as
-    `{step_name}_{measure_name}_{percentile|total}`.
-
-    Test measures are simply sums of the corresponding step measures, so they
-    are given as `test_{measure_name}`.
-
- Args:
-        step_results: List of step results to be aggregated.
-        measure_labels: List of step measure labels to account for.
-
- Returns:
- A complete test result.
- """
- if measure_labels is None:
- measure_labels = ['took']
- test_measures = {
- f'test_{measure_label}': 0
- for measure_label in measure_labels
- }
- step_measures: Dict[str, Any] = {}
-
- # iterate over all test steps
- for step in step_results:
- step_label = step['label']
-
- step_measure_labels = list(step.keys())
- step_measure_labels.remove('label')
-
- # iterate over all measures in each test step
- for measure_label in step_measure_labels:
-
- step_measure = step[measure_label]
- step_measure_label = f'{measure_label}' if step_label == 'get_stats' else f'{step_label}_{measure_label}'
-
- # Add cumulative test measures from steps to test measures
- if measure_label in measure_labels:
- test_measures[f'test_{measure_label}'] += sum(step_measure) if \
- isinstance(step_measure, list) else step_measure
-
- if step_measure_label in step_measures:
- _ = step_measures[step_measure_label].extend(step_measure) \
- if isinstance(step_measure, list) else \
- step_measures[step_measure_label].append(step_measure)
- else:
- step_measures[step_measure_label] = step_measure if \
- isinstance(step_measure, list) else [step_measure]
-
- aggregate = {**test_measures}
- # calculate the totals and percentile statistics for each step measure
- # where relevant
- for step_measure_label, step_measure in step_measures.items():
- step_measure.sort()
-
- aggregate[step_measure_label + '_total'] = float(sum(step_measure))
-
- p50 = _pxx(step_measure, 0.50)
- if p50 != -1:
- aggregate[step_measure_label + '_p50'] = p50
- p90 = _pxx(step_measure, 0.90)
- if p90 != -1:
- aggregate[step_measure_label + '_p90'] = p90
- p99 = _pxx(step_measure, 0.99)
- if p99 != -1:
- aggregate[step_measure_label + '_p99'] = p99
- p99_9 = _pxx(step_measure, 0.999)
- if p99_9 != -1:
- aggregate[step_measure_label + '_p99.9'] = p99_9
- p100 = _pxx(step_measure, 1.00)
- if p100 != -1:
- aggregate[step_measure_label + '_p100'] = p100
-
- return aggregate
-
-
-class Test:
- """A base Test class, representing a collection of steps to profiled and
- aggregated.
-
- Methods:
- setup: Performs test setup. Usually for steps not intended to be
- profiled.
- run_steps: Runs the test steps, aggregating the results into the
- `step_results` instance field.
- cleanup: Perform test cleanup. Useful for clearing the state of a
- persistent process like OpenSearch. Cleanup steps are executed after
- each run.
- execute: Runs steps, cleans up, and aggregates the test result.
- """
- def __init__(self, test_config: TestConfig):
- """Initializes the test state.
- """
- self.test_config = test_config
- self.setup_steps: List[Step] = test_config.setup
- self.test_steps: List[Step] = test_config.steps
- self.cleanup_steps: List[Step] = test_config.cleanup
-
- def setup(self):
- _ = [step.execute() for step in self.setup_steps]
-
- def _run_steps(self):
- step_results = []
- _ = [step_results.extend(step.execute()) for step in self.test_steps]
- return step_results
-
- def _cleanup(self):
- _ = [step.execute() for step in self.cleanup_steps]
-
- def execute(self):
- results = self._run_steps()
- self._cleanup()
- return _aggregate_steps(results)
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/index.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/index.json
deleted file mode 100644
index 7e8ddda8e..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/index.json
+++ /dev/null
@@ -1,27 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1,
- "knn.algo_param.ef_search": 100
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": 128,
- "method": {
- "name": "hnsw",
- "space_type": "l2",
- "engine": "faiss",
- "parameters": {
- "ef_construction": 256,
- "m": 16
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json
deleted file mode 100644
index 3e04d12c4..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json
+++ /dev/null
@@ -1,42 +0,0 @@
-{
- "bool":
- {
- "should":
- [
- {
- "range":
- {
- "age":
- {
- "gte": 30,
- "lte": 70
- }
- }
- },
- {
- "term":
- {
- "color": "green"
- }
- },
- {
- "term":
- {
- "color": "blue"
- }
- },
- {
- "term":
- {
- "color": "yellow"
- }
- },
- {
- "term":
- {
- "taste": "sweet"
- }
- }
- ]
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml
deleted file mode 100644
index ba8850e1d..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml
+++ /dev/null
@@ -1,40 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Faiss HNSW Relaxed Filter Test"
-test_id: "Faiss HNSW Relaxed Filter Test"
-num_runs: 3
-show_runs: false
-steps:
- - name: delete_index
- index_name: target_index
- - name: create_index
- index_name: target_index
- index_spec: release-configs/faiss-hnsw/filtering/relaxed-filter/index.json
- - name: ingest_multi_field
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
- attributes_dataset_name: attributes
- attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: warmup_operation
- index_name: target_index
- - name: query_with_filter
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean-with-relaxed-filters.hdf5
- neighbors_dataset: neighbors_filter_5
- filter_spec: release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json
- filter_type: FILTER
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/index.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/index.json
deleted file mode 100644
index 7e8ddda8e..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/index.json
+++ /dev/null
@@ -1,27 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1,
- "knn.algo_param.ef_search": 100
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": 128,
- "method": {
- "name": "hnsw",
- "space_type": "l2",
- "engine": "faiss",
- "parameters": {
- "ef_construction": 256,
- "m": 16
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json
deleted file mode 100644
index 9e6356f1c..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json
+++ /dev/null
@@ -1,44 +0,0 @@
-{
- "bool":
- {
- "must":
- [
- {
- "range":
- {
- "age":
- {
- "gte": 30,
- "lte": 60
- }
- }
- },
- {
- "term":
- {
- "taste": "bitter"
- }
- },
- {
- "bool":
- {
- "should":
- [
- {
- "term":
- {
- "color": "blue"
- }
- },
- {
- "term":
- {
- "color": "green"
- }
- }
- ]
- }
- }
- ]
- }
-}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml
deleted file mode 100644
index 94f4073c7..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml
+++ /dev/null
@@ -1,40 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Faiss HNSW Restrictive Filter Test"
-test_id: "Faiss HNSW Restrictive Filter Test"
-num_runs: 3
-show_runs: false
-steps:
- - name: delete_index
- index_name: target_index
- - name: create_index
- index_name: target_index
- index_spec: release-configs/faiss-hnsw/filtering/restrictive-filter/index.json
- - name: ingest_multi_field
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
- attributes_dataset_name: attributes
- attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: warmup_operation
- index_name: target_index
- - name: query_with_filter
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean-with-restrictive-filters.hdf5
- neighbors_dataset: neighbors_filter_4
- filter_spec: release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json
- filter_type: FILTER
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/index.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/index.json
deleted file mode 100644
index 7e8ddda8e..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnsw/index.json
+++ /dev/null
@@ -1,27 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1,
- "knn.algo_param.ef_search": 100
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": 128,
- "method": {
- "name": "hnsw",
- "space_type": "l2",
- "engine": "faiss",
- "parameters": {
- "ef_construction": 256,
- "m": 16
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/nested/simple/index.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/nested/simple/index.json
deleted file mode 100644
index 338ceb1f4..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnsw/nested/simple/index.json
+++ /dev/null
@@ -1,35 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1,
- "knn.algo_param.ef_search": 100
- }
- },
- "mappings": {
- "_source": {
- "excludes": ["nested_field"]
- },
- "properties": {
- "nested_field": {
- "type": "nested",
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": 128,
- "method": {
- "name": "hnsw",
- "space_type": "l2",
- "engine": "faiss",
- "parameters": {
- "ef_construction": 256,
- "m": 16
- }
- }
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/nested/simple/simple-nested-test.yml b/benchmarks/perf-tool/release-configs/faiss-hnsw/nested/simple/simple-nested-test.yml
deleted file mode 100644
index 151b2014d..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnsw/nested/simple/simple-nested-test.yml
+++ /dev/null
@@ -1,37 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Faiss HNSW Nested Field Test"
-test_id: "Faiss HNSW Nested Field Test"
-num_runs: 3
-show_runs: false
-steps:
- - name: delete_index
- index_name: target_index
- - name: create_index
- index_name: target_index
- index_spec: release-configs/faiss-hnsw/nested/simple/index.json
- - name: ingest_nested_field
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-nested.hdf5
- attributes_dataset_name: attributes
- attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' }, { name: 'parent_id', type: 'int'} ]
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: warmup_operation
- index_name: target_index
- - name: query_nested_field
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-nested.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean-nested.hdf5
- neighbors_dataset: neighbour_nested
\ No newline at end of file
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/test.yml b/benchmarks/perf-tool/release-configs/faiss-hnsw/test.yml
deleted file mode 100644
index c4740acf5..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnsw/test.yml
+++ /dev/null
@@ -1,35 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Faiss HNSW Test"
-test_id: "Faiss HNSW Test"
-num_runs: 3
-show_runs: false
-steps:
- - name: delete_index
- index_name: target_index
- - name: create_index
- index_name: target_index
- index_spec: release-configs/faiss-hnsw/index.json
- - name: ingest
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: warmup_operation
- index_name: target_index
- - name: query
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean.hdf5
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnswpq/index.json b/benchmarks/perf-tool/release-configs/faiss-hnswpq/index.json
deleted file mode 100644
index 479703412..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnswpq/index.json
+++ /dev/null
@@ -1,17 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "model_id": "test-model"
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnswpq/method-spec.json b/benchmarks/perf-tool/release-configs/faiss-hnswpq/method-spec.json
deleted file mode 100644
index 2d67bf2df..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnswpq/method-spec.json
+++ /dev/null
@@ -1,15 +0,0 @@
-{
- "name":"hnsw",
- "engine":"faiss",
- "space_type": "l2",
- "parameters":{
- "ef_construction": 256,
- "m": 16,
- "encoder": {
- "name": "pq",
- "parameters": {
- "m": 16
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnswpq/test.yml b/benchmarks/perf-tool/release-configs/faiss-hnswpq/test.yml
deleted file mode 100644
index f573ede9c..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnswpq/test.yml
+++ /dev/null
@@ -1,59 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Faiss HNSW PQ Test"
-test_id: "Faiss HNSW PQ Test"
-num_runs: 3
-show_runs: false
-setup:
- - name: delete_index
- index_name: train_index
- - name: create_index
- index_name: train_index
- index_spec: release-configs/faiss-hnswpq/train-index-spec.json
- - name: ingest
- index_name: train_index
- field_name: train_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- doc_count: 50000
- - name: refresh_index
- index_name: train_index
-steps:
- - name: delete_model
- model_id: test-model
- - name: delete_index
- index_name: target_index
- - name: train_model
- model_id: test-model
- train_index: train_index
- train_field: train_field
- dimension: 128
- method_spec: release-configs/faiss-hnswpq/method-spec.json
- max_training_vector_count: 50000
- - name: create_index
- index_name: target_index
- index_spec: release-configs/faiss-hnswpq/index.json
- - name: ingest
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: warmup_operation
- index_name: target_index
- - name: query
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean.hdf5
diff --git a/benchmarks/perf-tool/release-configs/faiss-hnswpq/train-index-spec.json b/benchmarks/perf-tool/release-configs/faiss-hnswpq/train-index-spec.json
deleted file mode 100644
index 804a5707e..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-hnswpq/train-index-spec.json
+++ /dev/null
@@ -1,16 +0,0 @@
-{
- "settings": {
- "index": {
- "number_of_shards": 24,
- "number_of_replicas": 0
- }
- },
- "mappings": {
- "properties": {
- "train_field": {
- "type": "knn_vector",
- "dimension": 128
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/index.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/index.json
deleted file mode 100644
index ade7fa377..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/index.json
+++ /dev/null
@@ -1,17 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "model_id": "test-model"
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/method-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/method-spec.json
deleted file mode 100644
index 51ae89877..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/method-spec.json
+++ /dev/null
@@ -1,9 +0,0 @@
-{
- "name":"ivf",
- "engine":"faiss",
- "space_type": "l2",
- "parameters":{
- "nlist": 128,
- "nprobes": 8
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-spec.json
deleted file mode 100644
index 3e04d12c4..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-spec.json
+++ /dev/null
@@ -1,42 +0,0 @@
-{
- "bool":
- {
- "should":
- [
- {
- "range":
- {
- "age":
- {
- "gte": 30,
- "lte": 70
- }
- }
- },
- {
- "term":
- {
- "color": "green"
- }
- },
- {
- "term":
- {
- "color": "blue"
- }
- },
- {
- "term":
- {
- "color": "yellow"
- }
- },
- {
- "term":
- {
- "taste": "sweet"
- }
- }
- ]
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-test.yml b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-test.yml
deleted file mode 100644
index adb25a04d..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-test.yml
+++ /dev/null
@@ -1,64 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Faiss IVF Relaxed Filter Test"
-test_id: "Faiss IVF Relaxed Filter Test"
-num_runs: 3
-show_runs: false
-setup:
- - name: delete_index
- index_name: train_index
- - name: create_index
- index_name: train_index
- index_spec: release-configs/faiss-ivf/filtering/relaxed-filter/train-index-spec.json
- - name: ingest
- index_name: train_index
- field_name: train_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- doc_count: 50000
- - name: refresh_index
- index_name: train_index
-steps:
- - name: delete_model
- model_id: test-model
- - name: delete_index
- index_name: target_index
- - name: train_model
- model_id: test-model
- train_index: train_index
- train_field: train_field
- dimension: 128
- method_spec: release-configs/faiss-ivf/filtering/relaxed-filter/method-spec.json
- max_training_vector_count: 50000
- - name: create_index
- index_name: target_index
- index_spec: release-configs/faiss-ivf/filtering/relaxed-filter/index.json
- - name: ingest_multi_field
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
- attributes_dataset_name: attributes
- attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: warmup_operation
- index_name: target_index
- - name: query_with_filter
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean-with-relaxed-filters.hdf5
- neighbors_dataset: neighbors_filter_5
- filter_spec: release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-spec.json
- filter_type: FILTER
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/train-index-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/train-index-spec.json
deleted file mode 100644
index 137fac9d8..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/train-index-spec.json
+++ /dev/null
@@ -1,16 +0,0 @@
-{
- "settings": {
- "index": {
- "number_of_shards": 24,
- "number_of_replicas": 1
- }
- },
- "mappings": {
- "properties": {
- "train_field": {
- "type": "knn_vector",
- "dimension": 128
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/index.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/index.json
deleted file mode 100644
index ade7fa377..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/index.json
+++ /dev/null
@@ -1,17 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "model_id": "test-model"
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/method-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/method-spec.json
deleted file mode 100644
index 51ae89877..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/method-spec.json
+++ /dev/null
@@ -1,9 +0,0 @@
-{
- "name":"ivf",
- "engine":"faiss",
- "space_type": "l2",
- "parameters":{
- "nlist": 128,
- "nprobes": 8
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-spec.json
deleted file mode 100644
index 9e6356f1c..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-spec.json
+++ /dev/null
@@ -1,44 +0,0 @@
-{
- "bool":
- {
- "must":
- [
- {
- "range":
- {
- "age":
- {
- "gte": 30,
- "lte": 60
- }
- }
- },
- {
- "term":
- {
- "taste": "bitter"
- }
- },
- {
- "bool":
- {
- "should":
- [
- {
- "term":
- {
- "color": "blue"
- }
- },
- {
- "term":
- {
- "color": "green"
- }
- }
- ]
- }
- }
- ]
- }
-}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-test.yml b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-test.yml
deleted file mode 100644
index bad047eab..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-test.yml
+++ /dev/null
@@ -1,64 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Faiss IVF restrictive Filter Test"
-test_id: "Faiss IVF restrictive Filter Test"
-num_runs: 3
-show_runs: false
-setup:
- - name: delete_index
- index_name: train_index
- - name: create_index
- index_name: train_index
- index_spec: release-configs/faiss-ivf/filtering/restrictive-filter/train-index-spec.json
- - name: ingest
- index_name: train_index
- field_name: train_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- doc_count: 50000
- - name: refresh_index
- index_name: train_index
-steps:
- - name: delete_model
- model_id: test-model
- - name: delete_index
- index_name: target_index
- - name: train_model
- model_id: test-model
- train_index: train_index
- train_field: train_field
- dimension: 128
- method_spec: release-configs/faiss-ivf/filtering/restrictive-filter/method-spec.json
- max_training_vector_count: 50000
- - name: create_index
- index_name: target_index
- index_spec: release-configs/faiss-ivf/filtering/restrictive-filter/index.json
- - name: ingest_multi_field
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
- attributes_dataset_name: attributes
- attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: warmup_operation
- index_name: target_index
- - name: query_with_filter
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean-with-restrictive-filters.hdf5
- neighbors_dataset: neighbors_filter_4
- filter_spec: release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-spec.json
- filter_type: FILTER
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/train-index-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/train-index-spec.json
deleted file mode 100644
index 804a5707e..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/train-index-spec.json
+++ /dev/null
@@ -1,16 +0,0 @@
-{
- "settings": {
- "index": {
- "number_of_shards": 24,
- "number_of_replicas": 0
- }
- },
- "mappings": {
- "properties": {
- "train_field": {
- "type": "knn_vector",
- "dimension": 128
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/index.json b/benchmarks/perf-tool/release-configs/faiss-ivf/index.json
deleted file mode 100644
index 479703412..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/index.json
+++ /dev/null
@@ -1,17 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "model_id": "test-model"
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/method-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/method-spec.json
deleted file mode 100644
index 51ae89877..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/method-spec.json
+++ /dev/null
@@ -1,9 +0,0 @@
-{
- "name":"ivf",
- "engine":"faiss",
- "space_type": "l2",
- "parameters":{
- "nlist": 128,
- "nprobes": 8
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/test.yml b/benchmarks/perf-tool/release-configs/faiss-ivf/test.yml
deleted file mode 100644
index 367c42594..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/test.yml
+++ /dev/null
@@ -1,59 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Faiss IVF"
-test_id: "Faiss IVF"
-num_runs: 3
-show_runs: false
-setup:
- - name: delete_index
- index_name: train_index
- - name: create_index
- index_name: train_index
- index_spec: release-configs/faiss-ivf/train-index-spec.json
- - name: ingest
- index_name: train_index
- field_name: train_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- doc_count: 50000
- - name: refresh_index
- index_name: train_index
-steps:
- - name: delete_model
- model_id: test-model
- - name: delete_index
- index_name: target_index
- - name: train_model
- model_id: test-model
- train_index: train_index
- train_field: train_field
- dimension: 128
- method_spec: release-configs/faiss-ivf/method-spec.json
- max_training_vector_count: 50000
- - name: create_index
- index_name: target_index
- index_spec: release-configs/faiss-ivf/index.json
- - name: ingest
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: warmup_operation
- index_name: target_index
- - name: query
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean.hdf5
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/train-index-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/train-index-spec.json
deleted file mode 100644
index 804a5707e..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivf/train-index-spec.json
+++ /dev/null
@@ -1,16 +0,0 @@
-{
- "settings": {
- "index": {
- "number_of_shards": 24,
- "number_of_replicas": 0
- }
- },
- "mappings": {
- "properties": {
- "train_field": {
- "type": "knn_vector",
- "dimension": 128
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivfpq/index.json b/benchmarks/perf-tool/release-configs/faiss-ivfpq/index.json
deleted file mode 100644
index 479703412..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivfpq/index.json
+++ /dev/null
@@ -1,17 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "model_id": "test-model"
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivfpq/method-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivfpq/method-spec.json
deleted file mode 100644
index 204b0a653..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivfpq/method-spec.json
+++ /dev/null
@@ -1,16 +0,0 @@
-{
- "name":"ivf",
- "engine":"faiss",
- "space_type": "l2",
- "parameters":{
- "nlist": 128,
- "nprobes": 8,
- "encoder": {
- "name": "pq",
- "parameters": {
- "m": 16,
- "code_size": 8
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivfpq/test.yml b/benchmarks/perf-tool/release-configs/faiss-ivfpq/test.yml
deleted file mode 100644
index c3f63348b..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivfpq/test.yml
+++ /dev/null
@@ -1,59 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Faiss IVF PQ Test"
-test_id: "Faiss IVF PQ Test"
-num_runs: 3
-show_runs: false
-setup:
- - name: delete_index
- index_name: train_index
- - name: create_index
- index_name: train_index
- index_spec: release-configs/faiss-ivfpq/train-index-spec.json
- - name: ingest
- index_name: train_index
- field_name: train_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- doc_count: 50000
- - name: refresh_index
- index_name: train_index
-steps:
- - name: delete_model
- model_id: test-model
- - name: delete_index
- index_name: target_index
- - name: train_model
- model_id: test-model
- train_index: train_index
- train_field: train_field
- dimension: 128
- method_spec: release-configs/faiss-ivfpq/method-spec.json
- max_training_vector_count: 50000
- - name: create_index
- index_name: target_index
- index_spec: release-configs/faiss-ivfpq/index.json
- - name: ingest
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: warmup_operation
- index_name: target_index
- - name: query
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean.hdf5
diff --git a/benchmarks/perf-tool/release-configs/faiss-ivfpq/train-index-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivfpq/train-index-spec.json
deleted file mode 100644
index 804a5707e..000000000
--- a/benchmarks/perf-tool/release-configs/faiss-ivfpq/train-index-spec.json
+++ /dev/null
@@ -1,16 +0,0 @@
-{
- "settings": {
- "index": {
- "number_of_shards": 24,
- "number_of_replicas": 0
- }
- },
- "mappings": {
- "properties": {
- "train_field": {
- "type": "knn_vector",
- "dimension": 128
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/index.json b/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/index.json
deleted file mode 100644
index 7a9ff2890..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/index.json
+++ /dev/null
@@ -1,26 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": 128,
- "method": {
- "name": "hnsw",
- "space_type": "l2",
- "engine": "lucene",
- "parameters": {
- "ef_construction": 256,
- "m": 16
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json b/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json
deleted file mode 100644
index 3e04d12c4..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json
+++ /dev/null
@@ -1,42 +0,0 @@
-{
- "bool":
- {
- "should":
- [
- {
- "range":
- {
- "age":
- {
- "gte": 30,
- "lte": 70
- }
- }
- },
- {
- "term":
- {
- "color": "green"
- }
- },
- {
- "term":
- {
- "color": "blue"
- }
- },
- {
- "term":
- {
- "color": "yellow"
- }
- },
- {
- "term":
- {
- "taste": "sweet"
- }
- }
- ]
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml b/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml
deleted file mode 100644
index 3bbb99a0f..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml
+++ /dev/null
@@ -1,38 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Lucene HNSW Relaxed Filter Test"
-test_id: "Lucene HNSW Relaxed Filter Test"
-num_runs: 3
-show_runs: false
-steps:
- - name: delete_index
- index_name: target_index
- - name: create_index
- index_name: target_index
- index_spec: release-configs/lucene-hnsw/filtering/relaxed-filter/index.json
- - name: ingest_multi_field
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
- attributes_dataset_name: attributes
- attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: query_with_filter
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean-with-relaxed-filters.hdf5
- neighbors_dataset: neighbors_filter_5
- filter_spec: release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json
- filter_type: FILTER
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/index.json b/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/index.json
deleted file mode 100644
index 7a9ff2890..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/index.json
+++ /dev/null
@@ -1,26 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": 128,
- "method": {
- "name": "hnsw",
- "space_type": "l2",
- "engine": "lucene",
- "parameters": {
- "ef_construction": 256,
- "m": 16
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json b/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json
deleted file mode 100644
index 9e6356f1c..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json
+++ /dev/null
@@ -1,44 +0,0 @@
-{
- "bool":
- {
- "must":
- [
- {
- "range":
- {
- "age":
- {
- "gte": 30,
- "lte": 60
- }
- }
- },
- {
- "term":
- {
- "taste": "bitter"
- }
- },
- {
- "bool":
- {
- "should":
- [
- {
- "term":
- {
- "color": "blue"
- }
- },
- {
- "term":
- {
- "color": "green"
- }
- }
- ]
- }
- }
- ]
- }
-}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml b/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml
deleted file mode 100644
index aa4c5193f..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml
+++ /dev/null
@@ -1,38 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Lucene HNSW Restrictive Filter Test"
-test_id: "Lucene HNSW Restrictive Filter Test"
-num_runs: 3
-show_runs: false
-steps:
- - name: delete_index
- index_name: target_index
- - name: create_index
- index_name: target_index
- index_spec: release-configs/lucene-hnsw/filtering/restrictive-filter/index.json
- - name: ingest_multi_field
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
- attributes_dataset_name: attributes
- attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: query_with_filter
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean-with-restrictive-filters.hdf5
- neighbors_dataset: neighbors_filter_4
- filter_spec: release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json
- filter_type: FILTER
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/index.json b/benchmarks/perf-tool/release-configs/lucene-hnsw/index.json
deleted file mode 100644
index 7a9ff2890..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/index.json
+++ /dev/null
@@ -1,26 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": 128,
- "method": {
- "name": "hnsw",
- "space_type": "l2",
- "engine": "lucene",
- "parameters": {
- "ef_construction": 256,
- "m": 16
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/nested/simple/index.json b/benchmarks/perf-tool/release-configs/lucene-hnsw/nested/simple/index.json
deleted file mode 100644
index b41b51c77..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/nested/simple/index.json
+++ /dev/null
@@ -1,34 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1
- }
- },
- "mappings": {
- "_source": {
- "excludes": ["nested_field"]
- },
- "properties": {
- "nested_field": {
- "type": "nested",
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": 128,
- "method": {
- "name": "hnsw",
- "space_type": "l2",
- "engine": "lucene",
- "parameters": {
- "ef_construction": 256,
- "m": 16
- }
- }
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/nested/simple/simple-nested-test.yml b/benchmarks/perf-tool/release-configs/lucene-hnsw/nested/simple/simple-nested-test.yml
deleted file mode 100644
index be825487a..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/nested/simple/simple-nested-test.yml
+++ /dev/null
@@ -1,37 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Lucene HNSW Nested Field Test"
-test_id: "Lucene HNSW Nested Field Test"
-num_runs: 3
-show_runs: false
-steps:
- - name: delete_index
- index_name: target_index
- - name: create_index
- index_name: target_index
- index_spec: release-configs/lucene-hnsw/nested/simple/index.json
- - name: ingest_nested_field
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-nested.hdf5
- attributes_dataset_name: attributes
- attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' }, { name: 'parent_id', type: 'int'} ]
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: warmup_operation
- index_name: target_index
- - name: query_nested_field
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean-nested.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean-nested.hdf5
- neighbors_dataset: neighbour_nested
\ No newline at end of file
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/test.yml b/benchmarks/perf-tool/release-configs/lucene-hnsw/test.yml
deleted file mode 100644
index b253ee08e..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/test.yml
+++ /dev/null
@@ -1,33 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Lucene HNSW"
-test_id: "Lucene HNSW"
-num_runs: 3
-show_runs: false
-steps:
- - name: delete_index
- index_name: target_index
- - name: create_index
- index_name: target_index
- index_spec: release-configs/lucene-hnsw/index.json
- - name: ingest
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: query
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean.hdf5
diff --git a/benchmarks/perf-tool/release-configs/nmslib-hnsw/index.json b/benchmarks/perf-tool/release-configs/nmslib-hnsw/index.json
deleted file mode 100644
index eb714c5c8..000000000
--- a/benchmarks/perf-tool/release-configs/nmslib-hnsw/index.json
+++ /dev/null
@@ -1,27 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 24,
- "number_of_replicas": 1,
- "knn.algo_param.ef_search": 100
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": 128,
- "method": {
- "name": "hnsw",
- "space_type": "l2",
- "engine": "nmslib",
- "parameters": {
- "ef_construction": 256,
- "m": 16
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/release-configs/nmslib-hnsw/test.yml b/benchmarks/perf-tool/release-configs/nmslib-hnsw/test.yml
deleted file mode 100644
index 94ad9b131..000000000
--- a/benchmarks/perf-tool/release-configs/nmslib-hnsw/test.yml
+++ /dev/null
@@ -1,35 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Nmslib HNSW Test"
-test_id: "Nmslib HNSW Test"
-num_runs: 3
-show_runs: false
-steps:
- - name: delete_index
- index_name: target_index
- - name: create_index
- index_name: target_index
- index_spec: release-configs/nmslib-hnsw/index.json
- - name: ingest
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 1
- - name: warmup_operation
- index_name: target_index
- - name: query
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: dataset/sift-128-euclidean.hdf5
- neighbors_format: hdf5
- neighbors_path: dataset/sift-128-euclidean.hdf5
diff --git a/benchmarks/perf-tool/release-configs/run_all_tests.sh b/benchmarks/perf-tool/release-configs/run_all_tests.sh
deleted file mode 100755
index e65d5b5c4..000000000
--- a/benchmarks/perf-tool/release-configs/run_all_tests.sh
+++ /dev/null
@@ -1,102 +0,0 @@
-#!/bin/bash
-set -e
-
-# Description:
-# Run all release performance tests.
-# Datasets should be available in perf-tool/dataset before running this script.
-#
-# Example:
-# ./run_all_tests.sh --endpoint localhost
-#
-# Usage:
-# ./run_all_tests.sh \
-# --endpoint localhost \
-# --port 80 \
-# --num-runs 3 \
-# --outputs ~/outputs
-
-while [ "$1" != "" ]; do
- case $1 in
- -url | --endpoint ) shift
- ENDPOINT=$1
- ;;
- -p | --port ) shift
- PORT=$1
- ;;
- -n | --num-runs ) shift
- NUM_RUNS=$1
- ;;
- -o | --outputs ) shift
- OUTPUTS=$1
- ;;
- * ) echo "Unknown parameter"
- echo $1
- exit 1
- ;;
- esac
- shift
-done
-
-if [ ! -n "$ENDPOINT" ]; then
- echo "--endpoint should be specified"
- exit 1
-fi
-
-if [ ! -n "$PORT" ]; then
- PORT=80
- echo "--port is not specified. Using default values $PORT"
-fi
-
-if [ ! -n "$NUM_RUNS" ]; then
- NUM_RUNS=3
- echo "--num-runs is not specified. Using default values $NUM_RUNS"
-fi
-
-if [ ! -n "$OUTPUTS" ]; then
- OUTPUTS="$HOME/outputs"
- echo "--outputs is not specified. Using default values $OUTPUTS"
-fi
-
-
-curl -X PUT "http://$ENDPOINT:$PORT/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
-{
- "persistent" : {
- "knn.algo_param.index_thread_qty" : 4
- }
-}
-'
-
-TESTS="./release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml
-./release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml
-./release-configs/faiss-hnsw/nested/simple/simple-nested-test.yml
-./release-configs/faiss-hnsw/test.yml
-./release-configs/faiss-hnswpq/test.yml
-./release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-test.yml
-./release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-test.yml
-./release-configs/faiss-ivf/test.yml
-./release-configs/faiss-ivfpq/test.yml
-./release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml
-./release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml
-./release-configs/lucene-hnsw/nested/simple/simple-nested-test.yml
-./release-configs/lucene-hnsw/test.yml
-./release-configs/nmslib-hnsw/test.yml"
-
-if [ ! -d $OUTPUTS ]
-then
- mkdir $OUTPUTS
-fi
-
-for TEST in $TESTS
-do
- ORG_FILE=$TEST
- NEW_FILE="$ORG_FILE.tmp"
- OUT_FILE=$(grep test_id $ORG_FILE | cut -d':' -f2 | sed -r 's/^ "|"$//g' | sed 's/ /_/g')
- echo "cp $ORG_FILE $NEW_FILE"
- cp $ORG_FILE $NEW_FILE
- sed -i "/^endpoint:/c\endpoint: $ENDPOINT" $NEW_FILE
- sed -i "/^port:/c\port: $PORT" $NEW_FILE
- sed -i "/^num_runs:/c\num_runs: $NUM_RUNS" $NEW_FILE
- python3 knn-perf-tool.py test $NEW_FILE $OUTPUTS/$OUT_FILE
- # Sleep for 1 min to let the CPU cool down after the previous run
- sleep 60
-done
diff --git a/benchmarks/perf-tool/requirements.in b/benchmarks/perf-tool/requirements.in
deleted file mode 100644
index fd3555aab..000000000
--- a/benchmarks/perf-tool/requirements.in
+++ /dev/null
@@ -1,7 +0,0 @@
-Cerberus
-opensearch-py
-PyYAML
-numpy
-h5py
-requests
-psutil
diff --git a/benchmarks/perf-tool/requirements.txt b/benchmarks/perf-tool/requirements.txt
deleted file mode 100644
index 46cec00ed..000000000
--- a/benchmarks/perf-tool/requirements.txt
+++ /dev/null
@@ -1,37 +0,0 @@
-#
-# This file is autogenerated by pip-compile with python 3.9
-# To update, run:
-#
-# pip-compile
-#
-cerberus==1.3.4
- # via -r requirements.in
-certifi==2023.7.22
- # via
- # opensearch-py
- # requests
-charset-normalizer==2.0.4
- # via requests
-h5py==3.3.0
- # via -r requirements.in
-idna==3.7
- # via requests
-numpy==1.24.2
- # via
- # -r requirements.in
- # h5py
-opensearch-py==1.0.0
- # via -r requirements.in
-psutil==5.8.0
- # via -r requirements.in
-pyyaml==5.4.1
- # via -r requirements.in
-requests==2.31.0
- # via -r requirements.in
-urllib3==1.26.18
- # via
- # opensearch-py
- # requests
-
-# The following packages are considered to be unsafe in a requirements file:
-# setuptools
diff --git a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/index-spec.json b/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/index-spec.json
deleted file mode 100644
index 5542ef387..000000000
--- a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/index-spec.json
+++ /dev/null
@@ -1,17 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "number_of_shards": 3,
- "number_of_replicas": 0
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "model_id": "test-model"
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/method-spec.json b/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/method-spec.json
deleted file mode 100644
index 1aa7f809f..000000000
--- a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/method-spec.json
+++ /dev/null
@@ -1,8 +0,0 @@
-{
- "name":"ivf",
- "engine":"faiss",
- "parameters":{
- "nlist":16,
- "nprobes": 4
- }
-}
diff --git a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/test.yml b/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/test.yml
deleted file mode 100644
index 027ba8683..000000000
--- a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/test.yml
+++ /dev/null
@@ -1,62 +0,0 @@
-endpoint: localhost
-test_name: faiss_sift_ivf
-test_id: "Test workflow for faiss ivf"
-num_runs: 3
-show_runs: true
-setup:
- - name: delete_model
- model_id: test-model
- - name: delete_index
- index_name: target_index
- - name: delete_index
- index_name: train_index
- - name: create_index
- index_name: train_index
- index_spec: sample-configs/faiss-sift-ivf/train-index-spec.json
- - name: ingest
- index_name: train_index
- field_name: train_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: ../dataset/sift-128-euclidean.hdf5
- - name: refresh_index
- index_name: train_index
-steps:
- - name: train_model
- model_id: test-model
- train_index: train_index
- train_field: train_field
- dimension: 128
- method_spec: sample-configs/faiss-sift-ivf/method-spec.json
- max_training_vector_count: 1000000000
- - name: create_index
- index_name: target_index
- index_spec: sample-configs/faiss-sift-ivf/index-spec.json
- - name: ingest
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: ../dataset/sift-128-euclidean.hdf5
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 10
- - name: warmup_operation
- index_name: target_index
- - name: query
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: ../dataset/sift-128-euclidean.hdf5
- neighbors_format: hdf5
- neighbors_path: ../dataset/sift-128-euclidean.hdf5
-cleanup:
- - name: delete_model
- model_id: test-model
- - name: delete_index
- index_name: target_index
diff --git a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/train-index-spec.json b/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/train-index-spec.json
deleted file mode 100644
index 00a418e4f..000000000
--- a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/train-index-spec.json
+++ /dev/null
@@ -1,16 +0,0 @@
-{
- "settings": {
- "index": {
- "number_of_shards": 3,
- "number_of_replicas": 0
- }
- },
- "mappings": {
- "properties": {
- "train_field": {
- "type": "knn_vector",
- "dimension": 128
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-1-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-1-spec.json
deleted file mode 100644
index f529de4fe..000000000
--- a/benchmarks/perf-tool/sample-configs/filter-spec/filter-1-spec.json
+++ /dev/null
@@ -1,24 +0,0 @@
-{
- "bool":
- {
- "must":
- [
- {
- "range":
- {
- "age":
- {
- "gte": 20,
- "lte": 100
- }
- }
- },
- {
- "term":
- {
- "color": "red"
- }
- }
- ]
- }
-}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-2-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-2-spec.json
deleted file mode 100644
index 9d4514e62..000000000
--- a/benchmarks/perf-tool/sample-configs/filter-spec/filter-2-spec.json
+++ /dev/null
@@ -1,40 +0,0 @@
-{
- "bool":
- {
- "must":
- [
- {
- "term":
- {
- "taste": "salty"
- }
- },
- {
- "bool":
- {
- "should":
- [
- {
- "bool":
- {
- "must_not":
- {
- "exists":
- {
- "field": "color"
- }
- }
- }
- },
- {
- "term":
- {
- "color": "blue"
- }
- }
- ]
- }
- }
- ]
- }
-}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-3-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-3-spec.json
deleted file mode 100644
index d69f8768e..000000000
--- a/benchmarks/perf-tool/sample-configs/filter-spec/filter-3-spec.json
+++ /dev/null
@@ -1,30 +0,0 @@
-{
- "bool":
- {
- "must":
- [
- {
- "range":
- {
- "age":
- {
- "gte": 20,
- "lte": 80
- }
- }
- },
- {
- "exists":
- {
- "field": "color"
- }
- },
- {
- "exists":
- {
- "field": "taste"
- }
- }
- ]
- }
-}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-4-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-4-spec.json
deleted file mode 100644
index 822d63b37..000000000
--- a/benchmarks/perf-tool/sample-configs/filter-spec/filter-4-spec.json
+++ /dev/null
@@ -1,44 +0,0 @@
-{
- "bool":
- {
- "must":
- [
- {
- "range":
- {
- "age":
- {
- "gte": 30,
- "lte": 60
- }
- }
- },
- {
- "term":
- {
- "taste": "bitter"
- }
- },
- {
- "bool":
- {
- "should":
- [
- {
- "term":
- {
- "color": "blue"
- }
- },
- {
- "term":
- {
- "color": "green"
- }
- }
- ]
- }
- }
- ]
- }
-}
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-5-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-5-spec.json
deleted file mode 100644
index 3e04d12c4..000000000
--- a/benchmarks/perf-tool/sample-configs/filter-spec/filter-5-spec.json
+++ /dev/null
@@ -1,42 +0,0 @@
-{
- "bool":
- {
- "should":
- [
- {
- "range":
- {
- "age":
- {
- "gte": 30,
- "lte": 70
- }
- }
- },
- {
- "term":
- {
- "color": "green"
- }
- },
- {
- "term":
- {
- "color": "blue"
- }
- },
- {
- "term":
- {
- "color": "yellow"
- }
- },
- {
- "term":
- {
- "taste": "sweet"
- }
- }
- ]
- }
-}
diff --git a/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/index-spec.json b/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/index-spec.json
deleted file mode 100644
index 83ea79b15..000000000
--- a/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/index-spec.json
+++ /dev/null
@@ -1,27 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "refresh_interval": "10s",
- "number_of_shards": 30,
- "number_of_replicas": 0
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": 128,
- "method": {
- "name": "hnsw",
- "space_type": "l2",
- "engine": "lucene",
- "parameters": {
- "ef_construction": 100,
- "m": 16
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/test.yml b/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/test.yml
deleted file mode 100644
index aa2ee6389..000000000
--- a/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/test.yml
+++ /dev/null
@@ -1,41 +0,0 @@
-endpoint: localhost
-test_name: lucene_sift_hnsw
-test_id: "Test workflow for lucene hnsw"
-num_runs: 1
-show_runs: false
-setup:
- - name: delete_index
- index_name: target_index
-steps:
- - name: create_index
- index_name: target_index
- index_spec: sample-configs/lucene-sift-hnsw-filter/index-spec.json
- - name: ingest_multi_field
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
- attributes_dataset_name: attributes
- attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 10
- - name: query_with_filter
- k: 10
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
- neighbors_format: hdf5
- neighbors_path: ../dataset/sift-128-euclidean-with-attr-with-filters.hdf5
- neighbors_dataset: neighbors_filter_1
- filter_spec: sample-configs/filter-spec/filter-1-spec.json
- query_count: 100
-cleanup:
- - name: delete_index
- index_name: target_index
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/nmslib-sift-hnsw/index-spec.json b/benchmarks/perf-tool/sample-configs/nmslib-sift-hnsw/index-spec.json
deleted file mode 100644
index 75abe7baa..000000000
--- a/benchmarks/perf-tool/sample-configs/nmslib-sift-hnsw/index-spec.json
+++ /dev/null
@@ -1,28 +0,0 @@
-{
- "settings": {
- "index": {
- "knn": true,
- "knn.algo_param.ef_search": 512,
- "refresh_interval": "10s",
- "number_of_shards": 1,
- "number_of_replicas": 0
- }
- },
- "mappings": {
- "properties": {
- "target_field": {
- "type": "knn_vector",
- "dimension": 128,
- "method": {
- "name": "hnsw",
- "space_type": "l2",
- "engine": "nmslib",
- "parameters": {
- "ef_construction": 512,
- "m": 16
- }
- }
- }
- }
- }
-}
diff --git a/benchmarks/perf-tool/sample-configs/nmslib-sift-hnsw/test.yml b/benchmarks/perf-tool/sample-configs/nmslib-sift-hnsw/test.yml
deleted file mode 100644
index 6d96bf80c..000000000
--- a/benchmarks/perf-tool/sample-configs/nmslib-sift-hnsw/test.yml
+++ /dev/null
@@ -1,38 +0,0 @@
-endpoint: localhost
-test_name: nmslib_sift_hnsw
-test_id: "Test workflow for nmslib hnsw"
-num_runs: 2
-show_runs: false
-setup:
- - name: delete_index
- index_name: target_index
-steps:
- - name: create_index
- index_name: target_index
- index_spec: sample-configs/nmslib-sift-hnsw/index-spec.json
- - name: ingest
- index_name: target_index
- field_name: target_field
- bulk_size: 500
- dataset_format: hdf5
- dataset_path: ../dataset/sift-128-euclidean.hdf5
- - name: refresh_index
- index_name: target_index
- - name: force_merge
- index_name: target_index
- max_num_segments: 10
- - name: warmup_operation
- index_name: target_index
- - name: query
- k: 100
- r: 1
- calculate_recall: true
- index_name: target_index
- field_name: target_field
- dataset_format: hdf5
- dataset_path: ../dataset/sift-128-euclidean.hdf5
- neighbors_format: hdf5
- neighbors_path: ../dataset/sift-128-euclidean.hdf5
-cleanup:
- - name: delete_index
- index_name: target_index