
Add YAML config files to run parameter sweeps for ANN benchmarks #1929

Merged: 34 commits, Oct 31, 2023

Commits
f53f128
Adding some initial yaml param conf files
cjnolet Oct 20, 2023
0385543
Adding name
cjnolet Oct 20, 2023
1bf6fa7
Adding datasets.yml
cjnolet Oct 20, 2023
9e52c11
More cleanup
cjnolet Oct 20, 2023
14e2c5d
add cross products for yaml->json configs
divyegala Oct 24, 2023
b00942d
add algo-groups
divyegala Oct 25, 2023
1f81d45
Updated validatoes
cjnolet Oct 25, 2023
bcf92c5
remove only cagra yaml load
divyegala Oct 26, 2023
933732c
fix configs
divyegala Oct 26, 2023
32430c1
working yaml param sweeps
divyegala Oct 27, 2023
b65e6d3
add docs
divyegala Oct 28, 2023
1a810fd
add bench-ann cuda12 envs
divyegala Oct 28, 2023
e0e3ac1
merge upstream
divyegala Oct 28, 2023
a2e9b98
style fixes
divyegala Oct 28, 2023
fa193b8
fix filename
divyegala Oct 28, 2023
3b0a956
correct filename again
divyegala Oct 28, 2023
4345124
remove debug print
divyegala Oct 28, 2023
834f567
try to fix bad merge
divyegala Oct 28, 2023
4d764f0
remove comment
divyegala Oct 28, 2023
c7f9af3
Merge remote-tracking branch 'upstream/branch-23.12' into fea-2312-be…
divyegala Oct 30, 2023
f644dfe
address review comments
divyegala Oct 30, 2023
f470d86
add wiki datasets to datasets.yaml
divyegala Oct 30, 2023
2b21ad0
fix style
divyegala Oct 30, 2023
9ff0fbb
more style fixes
divyegala Oct 30, 2023
cf086c2
fix when --configuration is a file
divyegala Oct 31, 2023
0b6cc09
don't read json confs
divyegala Oct 31, 2023
227486c
add nvtx dependency
divyegala Oct 31, 2023
d80af8b
fix docs again
divyegala Oct 31, 2023
33349ce
again fix docs
divyegala Oct 31, 2023
879c744
fix style
divyegala Oct 31, 2023
d2ccf2a
fix wike datasets config
divyegala Oct 31, 2023
124d091
Changing FAISS M to M_ratio. Adding build validator for ivf-pq. Addin…
cjnolet Oct 31, 2023
683cebe
Merge branch 'fea-2312-bench-ann-conf' of github.com:divyegala/raft i…
cjnolet Oct 31, 2023
5538b05
More work on configs and validators
cjnolet Oct 31, 2023
39 changes: 39 additions & 0 deletions conda/environments/bench_ann_cuda-120_arch-aarch64.yaml
@@ -0,0 +1,39 @@
# This file is generated by `rapids-dependency-file-generator`.
# To make changes, edit ../../dependencies.yaml and run `rapids-dependency-file-generator`.
channels:
- rapidsai
- rapidsai-nightly
- dask/label/dev
- conda-forge
- nvidia
dependencies:
- benchmark>=1.8.2
- c-compiler
- clang-tools=16.0.6
- clang==16.0.6
- cmake>=3.26.4
- cuda-cudart-dev
- cuda-nvcc
- cuda-profiler-api
- cuda-version=12.0
- cxx-compiler
- cython>=3.0.0
- gcc_linux-aarch64=11.*
- glog>=0.6.0
- h5py>=3.8.0
- hnswlib=0.7.0
- libcublas-dev
- libcurand-dev
- libcusolver-dev
- libcusparse-dev
- matplotlib
- nccl>=2.9.9
- ninja
- nlohmann_json>=3.11.2
- openblas
- pandas
- pyyaml
- rmm==23.12.*
- scikit-build>=0.13.1
- sysroot_linux-aarch64==2.17
name: bench_ann_cuda-120_arch-aarch64
39 changes: 39 additions & 0 deletions conda/environments/bench_ann_cuda-120_arch-x86_64.yaml
@@ -0,0 +1,39 @@
# This file is generated by `rapids-dependency-file-generator`.
# To make changes, edit ../../dependencies.yaml and run `rapids-dependency-file-generator`.
channels:
- rapidsai
- rapidsai-nightly
- dask/label/dev
- conda-forge
- nvidia
dependencies:
- benchmark>=1.8.2
- c-compiler
- clang-tools=16.0.6
- clang==16.0.6
- cmake>=3.26.4
- cuda-cudart-dev
- cuda-nvcc
- cuda-profiler-api
- cuda-version=12.0
- cxx-compiler
- cython>=3.0.0
- gcc_linux-64=11.*
- glog>=0.6.0
- h5py>=3.8.0
- hnswlib=0.7.0
- libcublas-dev
- libcurand-dev
- libcusolver-dev
- libcusparse-dev
- matplotlib
- nccl>=2.9.9
- ninja
- nlohmann_json>=3.11.2
- openblas
- pandas
- pyyaml
- rmm==23.12.*
- scikit-build>=0.13.1
- sysroot_linux-64==2.17
name: bench_ann_cuda-120_arch-x86_64
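For reference, either of these generated environment files can be used to create the benchmarking environment with conda; a sketch, assuming the repository root as the working directory:

```bash
conda env create --file conda/environments/bench_ann_cuda-120_arch-x86_64.yaml
conda activate bench_ann_cuda-120_arch-x86_64
```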
2 changes: 1 addition & 1 deletion dependencies.yaml
@@ -22,7 +22,7 @@ files:
bench_ann:
output: conda
matrix:
cuda: ["11.8"]
cuda: ["11.8", "12.0"]
arch: [x86_64, aarch64]
includes:
- build
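As the header comment in the generated environment files notes, these files are produced from `dependencies.yaml`; a sketch of the regeneration step, assuming the `rapids-dependency-file-generator` tool is installed:

```bash
# Re-generate the conda environment files after editing dependencies.yaml
rapids-dependency-file-generator
```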
165 changes: 96 additions & 69 deletions docs/source/raft_ann_benchmarks.md
@@ -4,23 +4,28 @@ This project provides a benchmark program for various ANN search implementations

## Table of Contents

- [Installing the benchmarks](#installing-the-benchmarks)
- [Conda](#conda)
- [Docker](#docker)
- [How to run the benchmarks](#how-to-run-the-benchmarks)
- [Step 1: prepare dataset](#step-1-prepare-dataset)
- [Step 2: build and search index](#step-2-build-and-search-index)
- [Step 3: data export](#step-3-data-export)
- [Step 4: plot results](#step-4-plot-results)
- [Running the benchmarks](#running-the-benchmarks)
- [End to end: small-scale (<1M to 10M)](#end-to-end-small-scale-benchmarks-1m-to-10m)
- [End to end: large-scale (>10M)](#end-to-end-large-scale-benchmarks-10m-vectors)
- [Running with Docker containers](#running-with-docker-containers)
- [Evaluating the results](#evaluating-the-results)
- [Creating and customizing dataset configurations](#creating-and-customizing-dataset-configurations)
- [Adding a new ANN algorithm](#adding-a-new-ann-algorithm)
- [Parameter tuning guide](https://docs.rapids.ai/api/raft/nightly/ann_benchmarks_param_tuning/)
- [Wiki-all RAG/LLM Dataset](https://docs.rapids.ai/api/raft/nightly/wiki_all_dataset/)
- [RAFT ANN Benchmarks](#raft-ann-benchmarks)
- [Table of Contents](#table-of-contents)
- [Installing the benchmarks](#installing-the-benchmarks)
- [Conda](#conda)
- [Docker](#docker)
- [How to run the benchmarks](#how-to-run-the-benchmarks)
- [Step 1: Prepare Dataset](#step-1-prepare-dataset)
- [Step 2: Build and Search Index](#step-2-build-and-search-index)
- [Step 3: Data Export](#step-3-data-export)
- [Step 4: Plot Results](#step-4-plot-results)
- [Running the benchmarks](#running-the-benchmarks)
- [End to end: small-scale benchmarks (\<1M to 10M)](#end-to-end-small-scale-benchmarks-1m-to-10m)
- [End to end: large-scale benchmarks (\>10M vectors)](#end-to-end-large-scale-benchmarks-10m-vectors)
- [Running with Docker containers](#running-with-docker-containers)
- [End-to-end run on GPU](#end-to-end-run-on-gpu)
- [End-to-end run on CPU](#end-to-end-run-on-cpu)
- [Manually run the scripts inside the container](#manually-run-the-scripts-inside-the-container)
- [Evaluating the results](#evaluating-the-results)
- [Creating and customizing dataset configurations](#creating-and-customizing-dataset-configurations)
- [Adding a new ANN algorithm](#adding-a-new-ann-algorithm)
- [Implementation and Configuration](#implementation-and-configuration)
- [Adding a CMake Target](#adding-a-cmake-target)

## Installing the benchmarks

@@ -122,38 +127,52 @@ specified configuration.

The usage of the script `raft-ann-bench.run` is:
```bash
usage: run.py [-h] [-k COUNT] [-bs BATCH_SIZE] [--configuration CONFIGURATION] [--dataset DATASET] [--dataset-path DATASET_PATH] [--build] [--search] [--algorithms ALGORITHMS] [--indices INDICES]
[-f]
usage: __main__.py [-h] [--subset-size SUBSET_SIZE] [-k COUNT] [-bs BATCH_SIZE] [--dataset-configuration DATASET_CONFIGURATION] [--configuration CONFIGURATION] [--dataset DATASET]
[--dataset-path DATASET_PATH] [--build] [--search] [--algorithms ALGORITHMS] [--groups GROUPS] [--algo-groups ALGO_GROUPS] [-f] [-m SEARCH_MODE]

options:
-h, --help show this help message and exit
--subset-size SUBSET_SIZE
the number of subset rows of the dataset to build the index (default: None)
-k COUNT, --count COUNT
the number of nearest neighbors to search for (default: 10)
-bs BATCH_SIZE, --batch-size BATCH_SIZE
number of query vectors to use in each query trial (default: 10000)
--dataset-configuration DATASET_CONFIGURATION
path to YAML configuration file for datasets (default: None)
--configuration CONFIGURATION
path to configuration file for a dataset (default: None)
--dataset DATASET dataset whose configuration file will be used (default: glove-100-inner)
path to YAML configuration file or directory for algorithms Any run groups found in the specified file/directory will automatically override groups of the same name
present in the default configurations, including `base` (default: None)
--dataset DATASET name of dataset (default: glove-100-inner)
--dataset-path DATASET_PATH
path to dataset folder (default: ${RAPIDS_DATASET_ROOT_DIR})
path to dataset folder, by default will look in RAPIDS_DATASET_ROOT_DIR if defined, otherwise a datasets subdirectory from the calling directory (default:
os.getcwd()/datasets/)
--build
--search
--algorithms ALGORITHMS
run only comma separated list of named algorithms (default: None)
--indices INDICES run only comma separated list of named indices. parameter `algorithms` is ignored (default: None)
run only comma separated list of named algorithms. If parameters `groups` and `algo-groups are both undefined, then group `base` is run by default (default: None)
--groups GROUPS run only comma separated groups of parameters (default: base)
--algo-groups ALGO_GROUPS
add comma separated <algorithm>.<group> to run. Example usage: "--algo-groups=raft_cagra.large,hnswlib.large" (default: None)
-f, --force re-run algorithms even if their results already exist (default: False)
-m MODE, --search-mode MODE
run search in 'latency' (measure individual batches) or
'throughput' (pipeline batches and measure end-to-end) mode.
(default: 'latency')
-m SEARCH_MODE, --search-mode SEARCH_MODE
run search in 'latency' (measure individual batches) or 'throughput' (pipeline batches and measure end-to-end) mode (default: throughput)
```

`configuration` and `dataset` : `configuration` is a path to a configuration file for a given dataset.
The configuration file should be name as `<dataset>.json`. It is optional if the name of the dataset is
provided with the `dataset` argument, in which case
a configuration file will be searched for as `python/raft-ann-bench/src/raft-ann-bench/run/conf/<dataset>.json`.
For every algorithm run by this script, it outputs an index build statistics JSON file in `<dataset-path/<dataset>/result/build/<algo-k{k}-batch_size{batch_size}.json>`
and an index search statistics JSON file in `<dataset-path/<dataset>/result/search/<algo-k{k}-batch_size{batch_size}.json>`.
`dataset`: name of the dataset to be looked up in [datasets.yaml](#yaml-dataset-config)

`dataset-configuration`: optional filepath to a custom dataset YAML config which has an entry for the dataset passed via `dataset`

`configuration`: optional filepath to a YAML configuration for an algorithm, or to a directory that contains YAML configurations for several algorithms. [Here's how to configure an algorithm.](#yaml-algo-config)

`algorithms`: comma-separated list of named algorithms to run. If omitted, all algorithms found in the YAML configs located by `configuration` are run. By default, only the `base` group will be run.

`groups`: run only specific groups of parameter configurations for an algorithm. Groups are defined in the YAML configs (see `configuration`); by default only the `base` group is run.

`algo-groups`: appends specific algorithm+group combinations to run, in addition to those selected by `algorithms` and `groups`. Each entry has the format `<algorithm>.<group>`, for example `raft_cagra.large`.
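As an illustration, a hypothetical invocation (assuming the `raft-ann-bench.run` entry point described above; the dataset, algorithm, and group names are only examples):

```bash
python -m raft-ann-bench.run --dataset sift-128-euclidean \
                             --algorithms raft_cagra,hnswlib \
                             --groups base \
                             --algo-groups raft_ivf_pq.large \
                             -k 10 -bs 10000 --build --search
```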

For every algorithm run by this script, it outputs an index build statistics JSON file in `<dataset-path>/<dataset>/result/build/<algo>_{group}-{k}-{batch_size}.json`
and an index search statistics JSON file in `<dataset-path>/<dataset>/result/search/<algo>_{group}-{k}-{batch_size}.json`. NOTE: the filenames will not include `_{group}` if `group = "base"`.

`dataset-path` :
1. data is read from `<dataset-path>/<dataset>`
@@ -188,18 +207,21 @@ CSV file in `<dataset-path/<dataset>/result/search/<-k{k}-batch_size{batch_size}

The usage of this script is:
```bash
usage: plot.py [-h] [--dataset DATASET] [--dataset-path DATASET_PATH] [--output-filepath OUTPUT_FILEPATH] [--algorithms ALGORITHMS] [-k COUNT] [-bs BATCH_SIZE] [--build] [--search]
[--x-scale X_SCALE] [--y-scale {linear,log,symlog,logit}] [--raw]
usage: __main__.py [-h] [--dataset DATASET] [--dataset-path DATASET_PATH] [--output-filepath OUTPUT_FILEPATH] [--algorithms ALGORITHMS] [--groups GROUPS] [--algo-groups ALGO_GROUPS] [-k COUNT]
[-bs BATCH_SIZE] [--build] [--search] [--x-scale X_SCALE] [--y-scale {linear,log,symlog,logit}] [--raw]

options:
-h, --help show this help message and exit
--dataset DATASET dataset to download (default: glove-100-inner)
--dataset DATASET dataset to plot (default: glove-100-inner)
--dataset-path DATASET_PATH
path to dataset folder (default: ${RAPIDS_DATASET_ROOT_DIR})
path to dataset folder (default: os.getcwd()/datasets/)
--output-filepath OUTPUT_FILEPATH
directory for PNG to be saved (default: os.getcwd())
--algorithms ALGORITHMS
plot only comma separated list of named algorithms (default: None)
plot only comma separated list of named algorithms. If parameters `groups` and `algo-groups are both undefined, then group `base` is plot by default (default: None)
--groups GROUPS plot only comma separated groups of parameters (default: base)
--algo-groups ALGO_GROUPS, --algo-groups ALGO_GROUPS
add comma separated <algorithm>.<group> to plot. Example usage: "--algo-groups=raft_cagra.large,hnswlib.large" (default: None)
-k COUNT, --count COUNT
the number of nearest neighbors to search for (default: 10)
-bs BATCH_SIZE, --batch-size BATCH_SIZE
@@ -211,6 +233,11 @@ options:
Scale to use when drawing the Y-axis (default: linear)
--raw Show raw results (not just Pareto frontier) in faded colours (default: False)
```
`algorithms`: comma-separated list of named algorithms to plot. If omitted, all algorithms for which results exist for the specified `dataset` are plotted. By default, only the `base` group will be plotted.

`groups`: plot only specific groups of parameter configurations for an algorithm. Groups are defined in the YAML configs (see `configuration`); by default only the `base` group is plotted.

`algo-groups`: appends specific algorithm+group combinations to plot, in addition to those selected by `algorithms` and `groups`. Each entry has the format `<algorithm>.<group>`, for example `raft_cagra.large`.
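For example, a hypothetical invocation mirroring the run example above (assuming the `raft-ann-bench.plot` entry point; algorithm and group names are illustrative):

```bash
python -m raft-ann-bench.plot --dataset sift-128-euclidean \
                              --algorithms raft_cagra,hnswlib \
                              --groups base \
                              --algo-groups raft_ivf_pq.large \
                              -k 10 -bs 10000 --search
```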

The figure below is the resulting plot of running our benchmarks as of August 2023 for a batch size of 10, on an NVIDIA H100 GPU and an Intel Xeon Platinum 8480CL CPU. It presents the throughput (in Queries-Per-Second) performance for every level of recall.

@@ -394,40 +421,42 @@ Note that the actual table displayed on the screen may differ slightly as the hy

## Creating and customizing dataset configurations

A single configuration file will often define a set of algorithms, with associated index and search parameters, for a specific dataset. A configuration file uses json format with 4 major parts:
1. Dataset information
2. Algorithm information
3. Index parameters
4. Search parameters
A single configuration will often define a set of algorithms, with associated index and search parameters, that can be generalized across datasets. We use YAML to define dataset-specific and algorithm-specific configurations.

Below is a simple example configuration file for the 1M-scale `sift-128-euclidean` dataset:
<a id='yaml-dataset-config'></a>A default `datasets.yaml` is provided by RAFT in `${RAFT_HOME}/python/raft-ann-bench/src/raft-ann-bench/run/conf` with configurations available for several datasets. Here's a simple example entry for the `sift-128-euclidean` dataset:

```json
{
  "dataset": {
    "name": "sift-128-euclidean",
    "base_file": "sift-128-euclidean/base.fbin",
    "query_file": "sift-128-euclidean/query.fbin",
    "subset_size": 1000000,
    "groundtruth_neighbors_file": "sift-128-euclidean/groundtruth.neighbors.ibin",
    "distance": "euclidean"
  },
  "index": []
}
```
```yaml
- name: sift-128-euclidean
  base_file: sift-128-euclidean/base.fbin
  query_file: sift-128-euclidean/query.fbin
  groundtruth_neighbors_file: sift-128-euclidean/groundtruth.neighbors.ibin
  distance: euclidean
```
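A custom dataset can be described with the same fields and supplied through `--dataset-configuration`; a hypothetical entry (the dataset name and file paths are placeholders) might look like:

```yaml
- name: my-dataset-96-euclidean
  base_file: my-dataset-96-euclidean/base.fbin
  query_file: my-dataset-96-euclidean/query.fbin
  groundtruth_neighbors_file: my-dataset-96-euclidean/groundtruth.neighbors.ibin
  distance: euclidean
```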

The `index` section will contain a list of index objects, each of which will have the following form:
```json
{
  "name": "algo_name.unique_index_name",
  "algo": "algo_name",
  "file": "sift-128-euclidean/algo_name/param1_val1-param2_val2",
  "build_param": { "param1": "val1", "param2": "val2" },
  "search_params": [{ "search_param1": "search_val1" }]
}
```

<a id='yaml-algo-config'></a>Configuration files for ANN algorithms supported by `raft-ann-bench` are provided in `${RAFT_HOME}/python/raft-ann-bench/src/raft-ann-bench/run/conf`. The `raft_cagra` algorithm configuration looks like:
```yaml
name: raft_cagra
groups:
  base:
    build:
      graph_degree: [32, 64]
      intermediate_graph_degree: [64, 96]
    search:
      itopk: [32, 64, 128]

  large:
    build:
      graph_degree: [32, 64]
    search:
      itopk: [32, 64, 128]
```
The default parameters for which the benchmarks are run can be overridden by creating a custom YAML file for algorithms with a `base` group.
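For instance, a minimal sketch of such an override file (the parameter values are illustrative, not tuning recommendations):

```yaml
name: raft_cagra
groups:
  base:
    build:
      graph_degree: [64]
      intermediate_graph_degree: [96]
    search:
      itopk: [64, 128]
```

Passing this file via `--configuration` would replace the default `base` group for `raft_cagra`, while other algorithms keep their defaults.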

The table below contains the possible settings for the `algo` field. Each unique algorithm will have its own set of `build_param` and `search_params` settings. The [ANN Algorithm Parameter Tuning Guide](ann_benchmarks_param_tuning.md) contains detailed instructions on choosing build and search parameters for each supported algorithm.
The config above has two fields:
1. `name` - defines the name of the algorithm for which the parameters are being specified.
2. `groups` - defines run groups, each with a particular set of parameters. Each group expands into the cross-product of all hyper-parameter values listed under `build` and `search`.
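To make the cross-product concrete, here is a small standalone sketch (not the benchmark's actual implementation; the `expand` helper is hypothetical) of how the `base` group above expands:

```python
# Sketch: expand a group's parameters into every combination of listed values.
import itertools


def expand(params):
    keys = list(params)
    return [dict(zip(keys, values))
            for values in itertools.product(*(params[k] for k in keys))]


build = {"graph_degree": [32, 64], "intermediate_graph_degree": [64, 96]}
search = {"itopk": [32, 64, 128]}

print(expand(build))   # 2 x 2 = 4 build configurations
print(expand(search))  # 3 search configurations, applied to each built index
```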

The table below contains all algorithms supported by the RAFT ANN benchmarks. Each unique algorithm will have its own set of `build` and `search` settings. The [ANN Algorithm Parameter Tuning Guide](ann_benchmarks_param_tuning.md) contains detailed instructions on choosing build and search parameters for each supported algorithm.

| Library | Algorithms |
|-----------|------------------------------------------------------------------|
@@ -437,8 +466,6 @@ The table below contains the possible settings for the `algo` field. Each unique
| HNSWlib | `hnswlib` |
| RAFT | `raft_brute_force`, `raft_cagra`, `raft_ivf_flat`, `raft_ivf_pq` |

By default, the index will be placed in `bench/ann/data/<dataset_name>/index/<name>`. Using `sift-128-euclidean` for the dataset with the `algo` example above, the indexes would be placed in `bench/ann/data/sift-128-euclidean/index/algo_name/param1_val1-param2_val2`.

## Adding a new ANN algorithm

### Implementation and Configuration
@@ -41,7 +41,10 @@ def convert_json_to_csv_build(dataset, dataset_path):
"time": df["real_time"],
}
)
write.to_csv(file.replace(".json", ".csv"), index=False)
filepath = os.path.normpath(file).split(os.sep)
filename = filepath[-1].split("-")[0] + ".csv"
write.to_csv(os.path.join(f"{os.sep}".join(filepath[:-1]), filename),
index=False)


def convert_json_to_csv_search(dataset, dataset_path):
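To illustrate what the updated export logic above does to a result path (the path below is made up, following the `<algo>_{group}-{k}-{batch_size}.json` naming described earlier):

```python
import os

# Hypothetical build-result file produced by the run step.
file = os.path.join("datasets", "sift-128-euclidean", "result", "build",
                    "raft_cagra_large-10-10000.json")

filepath = os.path.normpath(file).split(os.sep)
filename = filepath[-1].split("-")[0] + ".csv"   # -> "raft_cagra_large.csv"
print(os.path.join(f"{os.sep}".join(filepath[:-1]), filename))
# On POSIX: datasets/sift-128-euclidean/result/build/raft_cagra_large.csv
```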