Simplify bench/ann scripts to Python based module #1642

Merged · 33 commits · Jul 26, 2023

Commits:
- d6b5f4e add utility to download and move files (divyegala, Jul 11, 2023)
- 0c6c33b run pre-commit (divyegala, Jul 11, 2023)
- 1be4eb1 add copyright (divyegala, Jul 11, 2023)
- a50cd97 start working on runner script (divyegala, Jul 11, 2023)
- e9b4eca working build and search (divyegala, Jul 13, 2023)
- 87ad2b3 Merge remote-tracking branch 'upstream/branch-23.08' into bench-ann-s… (divyegala, Jul 13, 2023)
- 3a12d40 fix spelling (divyegala, Jul 13, 2023)
- eb6d9a2 run flake8 manually (divyegala, Jul 13, 2023)
- cd86fa3 add data_export.py script (divyegala, Jul 13, 2023)
- a548b22 run flake8 manually (divyegala, Jul 13, 2023)
- df16559 Update cpp/bench/ann/scripts/run.py (divyegala, Jul 14, 2023)
- d4f30de review suggestions (divyegala, Jul 14, 2023)
- b665a64 add docs (divyegala, Jul 14, 2023)
- 079d8ef spelling check (divyegala, Jul 14, 2023)
- c62c423 address review (divyegala, Jul 14, 2023)
- 746c214 Merge remote-tracking branch 'upstream/branch-23.08' into bench-ann-s… (divyegala, Jul 14, 2023)
- 1430155 add faiss_gpu_ivf_sq (divyegala, Jul 18, 2023)
- a38f21c address review to use new string formatting, add plot.py (divyegala, Jul 19, 2023)
- 94ddec4 add end-to-end docs for b scale (divyegala, Jul 19, 2023)
- cf44279 add plotting (divyegala, Jul 19, 2023)
- 7b4711e correct executable=>algo strategy (divyegala, Jul 19, 2023)
- 7465b8d address review (divyegala, Jul 20, 2023)
- 9b978c9 Merge branch 'branch-23.08' into bench-ann-scripts (cjnolet, Jul 20, 2023)
- 76d45fd modify docs (divyegala, Jul 20, 2023)
- 494609e fix some typos (divyegala, Jul 20, 2023)
- d46d49d run benchmarks with conda package (divyegala, Jul 20, 2023)
- 5adbf36 fix spelling (divyegala, Jul 21, 2023)
- 2f1e8ca add build/search params to run.py (divyegala, Jul 21, 2023)
- 3ac8d76 add destructors to fix running raft benchmarks (divyegala, Jul 21, 2023)
- ee61877 move algos.yaml (divyegala, Jul 21, 2023)
- dbfae90 Merge branch 'branch-23.08' into bench-ann-scripts (cjnolet, Jul 24, 2023)
- a0bf789 address review (divyegala, Jul 25, 2023)
- 7c1a6cf add cmake example (divyegala, Jul 25, 2023)
1 change: 1 addition & 0 deletions conda/environments/bench_ann_cuda-118_arch-x86_64.yaml
@@ -30,6 +30,7 @@ dependencies:
- libcusparse-dev=11.7.5.86
- libcusparse=11.7.5.86
- libfaiss>=1.7.1
- matplotlib
- nccl>=2.9.9
- ninja
- nlohmann_json>=3.11.2
6 changes: 1 addition & 5 deletions cpp/bench/ann/conf/glove-100-inner.json
@@ -789,9 +789,5 @@

],
"search_result_file" : "result/glove-100-inner/ggnn/kbuild96-segment64-refine2-k10"
},


]

}]
}
2 changes: 2 additions & 0 deletions cpp/bench/ann/src/raft/raft_cagra_wrapper.h
@@ -79,6 +79,8 @@ class RaftCagra : public ANN<T> {
void save(const std::string& file) const override;
void load(const std::string&) override;

~RaftCagra() noexcept { rmm::mr::set_current_device_resource(mr_.get_upstream()); }

private:
raft::device_resources handle_;
BuildParam index_params_;
2 changes: 2 additions & 0 deletions cpp/bench/ann/src/raft/raft_ivf_flat_wrapper.h
@@ -79,6 +79,8 @@ class RaftIvfFlatGpu : public ANN<T> {
void save(const std::string& file) const override;
void load(const std::string&) override;

~RaftIvfFlatGpu() noexcept { rmm::mr::set_current_device_resource(mr_.get_upstream()); }

private:
raft::device_resources handle_;
BuildParam index_params_;
2 changes: 2 additions & 0 deletions cpp/bench/ann/src/raft/raft_ivf_pq_wrapper.h
@@ -79,6 +79,8 @@ class RaftIvfPQ : public ANN<T> {
void save(const std::string& file) const override;
void load(const std::string&) override;

~RaftIvfPQ() noexcept { rmm::mr::set_current_device_resource(mr_.get_upstream()); }

private:
raft::device_resources handle_;
BuildParam index_params_;
1 change: 1 addition & 0 deletions dependencies.yaml
@@ -169,6 +169,7 @@ dependencies:
- h5py>=3.8.0
- libfaiss>=1.7.1
- faiss-proc=*=cuda
- matplotlib

cudatoolkit:
specific:
48 changes: 48 additions & 0 deletions docs/source/ann_benchmarks_build.md
@@ -0,0 +1,48 @@
### Dependencies

CUDA 11 and a GPU with Pascal architecture or later are required to run the benchmarks.

Please refer to the [installation docs](https://docs.rapids.ai/api/raft/stable/build.html#cuda-gpu-requirements) for the base requirements to build RAFT.

In addition to the base requirements for building RAFT, additional dependencies needed to build the ANN benchmarks include:
1. FAISS GPU >= 1.7.1
2. Google Logging (GLog)
3. H5Py
4. HNSWLib
5. nlohmann_json
6. GGNN

[rapids-cmake](https://github.com/rapidsai/rapids-cmake) is used to build the ANN benchmarks so the code for dependencies not already supplied in the CUDA toolkit will be downloaded and built automatically.

The easiest (and most reproducible) way to install the dependencies needed to build the ANN benchmarks is to use the conda environment file located in the `conda/environments` directory of the RAFT repository. The following command will use `mamba` (which is preferred over `conda`) to build and activate a new environment for compiling the benchmarks:

```bash
mamba env create --name raft_ann_benchmarks -f conda/environments/bench_ann_cuda-118_arch-x86_64.yaml
conda activate raft_ann_benchmarks
```

The above conda environment will also reduce the compile times as dependencies like FAISS will already be installed and not need to be compiled with `rapids-cmake`.

### Compiling the Benchmarks

After the needed dependencies are satisfied, the easiest way to compile the ANN benchmarks is through the `build.sh` script in the root of the RAFT source code repository. The following will build the executables for all the supported algorithms:
```bash
./build.sh bench-ann
```

You can limit the algorithms that are built by providing a semicolon-delimited list of executable names (each algorithm is suffixed with `_ANN_BENCH`):
```bash
./build.sh bench-ann -n --limit-bench-ann="HNSWLIB_ANN_BENCH;RAFT_IVF_PQ_ANN_BENCH"
```

Available targets to use with `--limit-bench-ann` are:
- FAISS_IVF_FLAT_ANN_BENCH
- FAISS_IVF_PQ_ANN_BENCH
- FAISS_BFKNN_ANN_BENCH
- GGNN_ANN_BENCH
- HNSWLIB_ANN_BENCH
- RAFT_CAGRA_ANN_BENCH
- RAFT_IVF_PQ_ANN_BENCH
- RAFT_IVF_FLAT_ANN_BENCH

By default, the `*_ANN_BENCH` executables infer the dataset's datatype from the filename's extension. For example, an extension of `fbin` uses a `float` datatype, `f16bin` uses a `float16` datatype, `i8bin` uses an `int8_t` datatype, and `u8bin` uses a `uint8_t` datatype. Currently, only `float`, `float16`, `int8_t`, and `uint8_t` are supported.
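
For illustration, a dataset directory might contain files named as follows (the paths are hypothetical; only the extension drives the datatype inference):

```bash
# Hypothetical dataset files; the extension alone selects the parsed datatype
ls data/
#   glove-100-inner/base.fbin    -> float
#   deep-1B/base.f16bin          -> float16
#   sift-1B/base.u8bin           -> uint8_t
#   some-dataset/base.i8bin      -> int8_t
```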
@@ -1,65 +1,4 @@
# CUDA ANN Benchmarks

This project provides a benchmark program for various ANN search implementations. It's especially suitable for comparing GPU implementations as well as comparing GPU against CPU.

## Benchmark

### Dependencies

CUDA 11 and a GPU with Pascal architecture or later are required to run the benchmarks.

Please refer to the [installation docs](https://docs.rapids.ai/api/raft/stable/build.html#cuda-gpu-requirements) for the base requirements to build RAFT.

In addition to the base requirements for building RAFT, additional dependencies needed to build the ANN benchmarks include:
1. FAISS GPU >= 1.7.1
2. Google Logging (GLog)
3. H5Py
4. HNSWLib
5. nlohmann_json
6. GGNN

[rapids-cmake](https://github.com/rapidsai/rapids-cmake) is used to build the ANN benchmarks so the code for dependencies not already supplied in the CUDA toolkit will be downloaded and built automatically.

The easiest (and most reproducible) way to install the dependencies needed to build the ANN benchmarks is to use the conda environment file located in the `conda/environments` directory of the RAFT repository. The following command will use `mamba` (which is preferred over `conda`) to build and activate a new environment for compiling the benchmarks:

```bash
mamba env create --name raft_ann_benchmarks -f conda/environments/bench_ann_cuda-118_arch-x86_64.yaml
conda activate raft_ann_benchmarks
```

The above conda environment will also reduce the compile times as dependencies like FAISS will already be installed and not need to be compiled with `rapids-cmake`.

### Compiling the Benchmarks

After the needed dependencies are satisfied, the easiest way to compile the ANN benchmarks is through the `build.sh` script in the root of the RAFT source code repository. The following will build the executables for all the supported algorithms:
```bash
./build.sh bench-ann
```

You can limit the algorithms that are built by providing a semicolon-delimited list of executable names (each algorithm is suffixed with `_ANN_BENCH`):
```bash
./build.sh bench-ann -n --limit-bench-ann="HNSWLIB_ANN_BENCH;RAFT_IVF_PQ_ANN_BENCH"
```

Available targets to use with `--limit-bench-ann` are:
- FAISS_IVF_FLAT_ANN_BENCH
- FAISS_IVF_PQ_ANN_BENCH
- FAISS_BFKNN_ANN_BENCH
- GGNN_ANN_BENCH
- HNSWLIB_ANN_BENCH
- RAFT_CAGRA_ANN_BENCH
- RAFT_IVF_PQ_ANN_BENCH
- RAFT_IVF_FLAT_ANN_BENCH

By default, the `*_ANN_BENCH` executables infer the dataset's datatype from the filename's extension. For example, an extension of `fbin` uses a `float` datatype, `f16bin` uses a `float16` datatype, `i8bin` uses an `int8_t` datatype, and `u8bin` uses a `uint8_t` datatype. Currently, only `float`, `float16`, `int8_t`, and `uint8_t` are supported.

### Usage
There are 4 general steps to running the benchmarks:
1. Prepare Dataset
2. Build Index
3. Search Using Built Index
4. Evaluate Result

### Low-level Scripts and Executables
#### End-to-end Example
An end-to-end example (run from the RAFT source code root directory):
```bash
# ... (intervening lines collapsed in the diff view)
popd
# optional step: plot QPS-Recall figure using data in result.csv with your favorite tool
```

##### Step 1: Prepare Dataset
##### Step 1: Prepare Dataset <a id='bash-prepare-dataset'></a>
A dataset usually has 4 binary files containing database vectors, query vectors, ground truth neighbors and their corresponding distances. For example, the Glove-100 dataset has files `base.fbin` (database vectors), `query.fbin` (query vectors), `groundtruth.neighbors.ibin` (ground truth neighbors), and `groundtruth.distances.fbin` (ground truth distances). The first two files are for index building and searching, while the other two are associated with a particular distance and are used for evaluation.

The file suffixes `.fbin`, `.f16bin`, `.ibin`, `.u8bin`, and `.i8bin` denote that the data type of the vectors stored in the file is `float32`, `float16` (a.k.a. `half`), `int`, `uint8`, and `int8`, respectively.
@@ -128,7 +67,7 @@ Commonly used datasets can be downloaded from two websites:

Most datasets provided by `ann-benchmarks` use `Angular` or `Euclidean` distance. `Angular` denotes cosine distance. However, computing cosine distance reduces to computing inner product if the vectors are normalized beforehand. In practice, we can always do the normalization to decrease computation cost, so it's better to measure the performance of inner product rather than cosine distance. The `-n` option of `hdf5_to_fbin.py` can be used to normalize the dataset.
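
A sketch of that normalization step, assuming the HDF5 file was downloaded from `ann-benchmarks` (the exact file name and invocation are illustrative):

```bash
# Convert an ann-benchmarks HDF5 dataset to .fbin files; -n normalizes the
# vectors so inner product can stand in for cosine distance (file name assumed)
python cpp/bench/ann/scripts/hdf5_to_fbin.py -n glove-100-angular.hdf5
```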

2. Billion-scale datasets can be found at [`big-ann-benchmarks`](http://big-ann-benchmarks.com). The ground truth file contains both neighbors and distances, thus should be split. A script is provided for this:
2. <a id='billion-scale'></a>Billion-scale datasets can be found at [`big-ann-benchmarks`](http://big-ann-benchmarks.com). The ground truth file contains both neighbors and distances, thus should be split. A script is provided for this:
```bash
$ cpp/bench/ann/scripts/split_groundtruth.pl
usage: script/split_groundtruth.pl input output_prefix
# ... (intervening sections collapsed in the diff view)
usage: [-f] [-o output.csv] groundtruth.neighbors.ibin result_paths...
-f: force to recompute recall and update it in result file if needed
-o: also write result to a csv file
```
Note that there can be multiple arguments for paths of result files. Each argument can be either a file name or a path. If it's a directory, all files found under it recursively will be used as input files.
<a id='result-filepath-example'></a>Note that there can be multiple arguments for paths of result files. Each argument can be either a file name or a path. If it's a directory, all files found under it recursively will be used as input files.
An example:
```bash
cpp/bench/ann/scripts/eval.pl groundtruth.neighbors.ibin \
# ... (remainder collapsed in the diff view)
```

The benchmark program uses a JSON configuration file. To add a new algorithm to the benchmark, you need to be able to specify `build_param`, whose value is a JSON object, and `search_params`, whose value is an array of JSON objects, for this algorithm in the configuration file. Again taking the configuration for `HnswLib` as an example:
<a id='json-index-config'></a>The benchmark program uses a JSON configuration file. To add a new algorithm to the benchmark, you need to be able to specify `build_param`, whose value is a JSON object, and `search_params`, whose value is an array of JSON objects, for this algorithm in the configuration file. Again taking the configuration for `HnswLib` as an example:
```json
{
"name" : "...",
  ...
```
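
For reference, a complete index entry might look like the sketch below. Only `name`, `build_param`, `search_params`, and `search_result_file` are confirmed by this page; the `algo` and `file` keys and all parameter values are assumptions modeled on the shipped configuration files:

```json
{
  "name": "hnswlib.M12",
  "algo": "hnswlib",
  "build_param": {"M": 12, "efConstruction": 500, "numThreads": 32},
  "file": "index/glove-100-inner/hnswlib/M12",
  "search_params": [{"ef": 10}, {"ef": 100}, {"ef": 200}],
  "search_result_file": "result/glove-100-inner/hnswlib/M12"
}
```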
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -44,7 +44,7 @@ While not exhaustive, the following general categories help summarize the accele
developer_guide.md
cpp_api.rst
pylibraft_api.rst
cuda_ann_benchmarks.md
raft_ann_benchmarks.md
raft_dask_api.rst
using_comms.rst
using_libraft.md
169 changes: 169 additions & 0 deletions docs/source/raft_ann_benchmarks.md
@@ -0,0 +1,169 @@
# RAFT ANN Benchmarks

This project provides a benchmark program for various ANN search implementations. It's especially suitable for comparing GPU implementations as well as comparing GPU against CPU.

## Installing the benchmarks

You can easily install the benchmarks through conda with the following instructions:
```bash
mamba env create --name raft_ann_benchmarks -f conda/environments/bench_ann_cuda-118_arch-x86_64.yaml
conda activate raft_ann_benchmarks

mamba install -c rapidsai libraft-ann-bench
# Review (contributor), on lines +9 to +12: Would it be realistic to set up a conda
# environment that does not depend on CUDA conda packages and uses the system CUDA
# installation instead? I'd love to be able to use these scripts in docker containers
# with the latest CUDA drivers. In fact, I think this would be the main use case on
# the devtech side: to test and adjust the raft implementation for upcoming hardware.
#
# Reply (author): Sure, with some testing we can come up with a more minimal
# environment that does not include any CUDA conda packages. Let me know if we can
# work on this together.
```
The channel `rapidsai` can easily be substituted with `rapidsai-nightly` if nightly benchmarks are desired.
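For example, a nightly install would look like this (assuming the same package name is published to the nightly channel):

```bash
mamba install -c rapidsai-nightly libraft-ann-bench
```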

Please see the [build instructions](ann_benchmarks_build.md) to build the benchmarks from source.

## Running the benchmarks

### Usage
There are 4 general steps to running the benchmarks:
1. Prepare Dataset
2. Build Index and Search Index
3. Evaluate Results
4. Plot Results

### Python-based Scripts
We provide a collection of lightweight Python-based scripts that are wrappers over
lower-level scripts and executables to run our benchmarks. Either the Python scripts or the
[low-level scripts and executables](ann_benchmarks_low_level.md) are valid ways to run the benchmarks;
however, plots are only provided through the Python scripts.
#### End-to-end example: Million-scale
```bash
# All scripts are present in directory raft/scripts/ann-benchmarks

# (1) prepare dataset
python scripts/ann-benchmarks/get_dataset.py --name glove-100-angular --normalize

# (2) build and search index
python scripts/ann-benchmarks/run.py --configuration conf/glove-100-inner.json

# (3) evaluate results
python scripts/ann-benchmarks/data_export.py --output out.csv --groundtruth data/glove-100-inner/groundtruth.neighbors.ibin result/glove-100-inner/

# (4) plot results
python scripts/ann-benchmarks/plot.py --result_csv out.csv
```

#### End-to-end example: Billion-scale
The above example does not work at billion scale because [data preparation](#prep-dataset) is not yet
supported by `scripts/get_dataset.py`. To download and prepare [billion-scale datasets](ann_benchmarks_low_level.html#billion-scale),
please follow the linked section. All other Python scripts mentioned below work as intended once the
billion-scale dataset has been downloaded.
To download billion-scale datasets, visit [big-ann-benchmarks](http://big-ann-benchmarks.com/neurips21.html).

```bash
mkdir -p data/deep-1B && cd data/deep-1B
# (1) prepare dataset
# download manually "Ground Truth" file of "Yandex DEEP"
# suppose the file name is deep_new_groundtruth.public.10K.bin
../../scripts/split_groundtruth.pl deep_new_groundtruth.public.10K.bin groundtruth
# Review (member): We should wrap this in python for consistency. It's just
# confusing seeing a bunch of python scripts and then seeing perl.
#
# Reply (author): I think we should slip this task to a follow-up PR.
#
# Reply (member): I think that can work as an immediate follow-up. I'd prefer for it
# to still be worked into 23.08. We don't necessarily need to support every unique
# parameter combination initially - so long as we support what's needed to run a
# basic benchmark end-to-end for billion-scale.

# two files 'groundtruth.neighbors.ibin' and 'groundtruth.distances.fbin' should be produced

# (2) build and search index (return to the repo root first, since step 1 ran inside data/deep-1B)
cd ../../
python scripts/ann-benchmarks/run.py --configuration conf/deep-1B.json

# (3) evaluate results
python scripts/ann-benchmarks/data_export.py --output out.csv --groundtruth data/deep-1B/groundtruth.neighbors.ibin result/deep-1B/

# (4) plot results
python scripts/ann-benchmarks/plot.py --result_csv out.csv
```

#### Step 1: Prepare Dataset <a id='prep-dataset'></a>
The script `scripts/ann-benchmarks/get_dataset.py` will download and unpack the dataset in a directory
that the user provides. As of now, only million-scale datasets are supported by this
script. See the low-level docs for more information on [datasets and formats](ann_benchmarks_low_level.html#bash-prepare-dataset).

The usage of this script is:
```bash
usage: get_dataset.py [-h] [--name NAME] [--path PATH] [--normalize]

options:
-h, --help show this help message and exit
--name NAME dataset to download (default: glove-100-angular)
--path PATH path to download dataset (default: {os.getcwd()}/data)
--normalize normalize cosine distance to inner product (default: False)
```

When the `--normalize` option is provided, any dataset that uses cosine distance
will be normalized so that inner product can be used instead. For example, the dataset `glove-100-angular`
will then be written to `data/glove-100-inner/`.
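
A minimal sketch of that behavior, reusing the command from the end-to-end example:

```bash
# Downloads glove-100-angular, normalizes it, and writes the converted
# dataset under data/glove-100-inner/ (the name reflects the new metric)
python scripts/ann-benchmarks/get_dataset.py --name glove-100-angular --normalize
```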

#### Step 2: Build and Search Index
The script `scripts/ann-benchmarks/run.py` will build and search indices for a given dataset and its
specified configuration.
To configure which algorithms are available, we use `algos.yaml`.
To configure building/searching indices for a dataset, look at [index configuration](ann_benchmarks_low_level.html#json-index-config).
An entry in `algos.yaml` looks like:
```yaml
raft_ivf_pq:
executable: RAFT_IVF_PQ_ANN_BENCH
disabled: false
```
`executable` : specifies the binary that will build/search the index. It is assumed to be
available in `raft/cpp/build/`.
`disabled` : denotes whether an algorithm should be excluded from benchmark runs.

The usage of the script `scripts/ann-benchmarks/run.py` is:
```bash
usage: run.py [-h] --configuration CONFIGURATION [--build] [--search] [--algorithms ALGORITHMS] [--indices INDICES] [--force]

options:
-h, --help show this help message and exit
--configuration CONFIGURATION
path to configuration file for a dataset (default: None)
--build
--search
--algorithms ALGORITHMS
run only comma separated list of named algorithms (default: None)
--indices INDICES run only comma separated list of named indices. parameter `algorithms` is ignored (default: None)
--force re-run algorithms even if their results already exist (default: False)
```

`build` and `search` : if neither parameter is supplied to the script, then
both are assumed to be `True`.

`indices` and `algorithms` : these parameters ensure that the algorithm specified for an index
is available in `algos.yaml`, is not disabled, and has an associated executable.
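
Two illustrative invocations (the algorithm names must match entries in `algos.yaml`; the index name `hnswlib.M12` is hypothetical):

```bash
# Build-only, restricted to two named algorithms
python scripts/ann-benchmarks/run.py --configuration conf/glove-100-inner.json \
    --build --algorithms raft_ivf_pq,hnswlib

# Search-only, restricted to a single named index (--algorithms is ignored here)
python scripts/ann-benchmarks/run.py --configuration conf/glove-100-inner.json \
    --search --indices hnswlib.M12
```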

#### Step 3: Evaluating Results
The script `scripts/ann-benchmarks/data_export.py` will evaluate results for a dataset whose index has been built
and searched with at least one algorithm. The output from every result file supplied to the script
is combined and written to a single CSV file.

The usage of this script is:
```bash
usage: data_export.py [-h] --output OUTPUT [--recompute] --groundtruth GROUNDTRUTH <result_filepaths>

options:
-h, --help show this help message and exit
--output OUTPUT Path to the CSV output file (default: None)
--recompute Recompute metrics (default: False)
--groundtruth GROUNDTRUTH
Path to groundtruth.neighbors.ibin file for a dataset (default: None)
```

`result_filepaths` : a whitespace-delimited list of result files/directories, which can be captured via shell pattern matching. See the low-level docs for more [information and examples](ann_benchmarks_low_level.html#result-filepath-example).
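
A sketch passing several result paths at once (the result directories are illustrative; directories are searched recursively):

```bash
python scripts/ann-benchmarks/data_export.py --output out.csv \
    --groundtruth data/glove-100-inner/groundtruth.neighbors.ibin \
    result/glove-100-inner/hnswlib result/glove-100-inner/raft_ivf_pq
```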

#### Step 4: Plot Results
The script `scripts/ann-benchmarks/plot.py` will plot all results that were exported to a CSV file for a given dataset.

The usage of this script is:
```bash
usage: plot.py [-h] --result_csv RESULT_CSV [--output OUTPUT] [--x-scale X_SCALE] [--y-scale {linear,log,symlog,logit}] [--raw]

options:
-h, --help show this help message and exit
--result_csv RESULT_CSV
Path to CSV Results (default: None)
--output OUTPUT Path to the PNG output file (default: /home/nfs/dgala/raft/out.png)
--x-scale X_SCALE Scale to use when drawing the X-axis. Typically linear, logit or a2 (default: linear)
--y-scale {linear,log,symlog,logit}
Scale to use when drawing the Y-axis (default: linear)
--raw Show raw results (not just Pareto frontier) in faded colours (default: False)
```

All algorithms present in the CSV file supplied via `--result_csv`
will appear in the plot.
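
A sketch of a customized plot (the output file name is arbitrary):

```bash
# Logit-scaled recall axis, log-scaled QPS axis, raw (non-Pareto) points included
python scripts/ann-benchmarks/plot.py --result_csv out.csv --output qps-recall.png \
    --x-scale logit --y-scale log --raw
```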