# ANN-benchmarks: switch to use gbench (#1661)
Make the ANN benchmarks use the same Google Benchmark (gbench) infrastructure as the prim benchmarks, while keeping the functional changes minimal.

### Overview
  - The command-line API largely stays the same, but is enhanced with gbench-specific parameters, such as regex filtering of algo configs, control over the minimum run time, and flexible reporting to the console and files.
  - There is just one executable, `ANN_BENCH`; all of the algorithms are loaded as shared libraries. The CPU-only components (`ANN_BENCH` itself, hnswlib) do not require CUDA at runtime.
  - Some dependencies are linked statically, so it's possible to just copy the executable and the libs and run the benchmark on a Linux machine with very few packages installed.
  - Search benchmarks no longer produce any output files; they use ground truth files to compute and report the recall in place.
  - Search/build parameters visible in the config files are passed as benchmark counters/labels/context.
  - Extra functionality (illustrated in the sketch after this list):
    - `--data_prefix` to specify a custom path where the datasets are stored
    - `--index_prefix` to specify a custom path where the indexes are stored
    - `--override_kv=<key:value1:value2:...:valueN>` override one or more parameters of search/build for parameter-sweep benchmarks
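
A minimal sketch combining the flags above (the `/mnt/...` prefixes and the `nprobe` sweep values are placeholder examples, not defaults):

```bash
# Sketch: point ANN_BENCH at locally stored datasets/indexes and sweep one search parameter.
./cpp/build/ANN_BENCH --search        \
  --data_prefix=/mnt/datasets         \
  --index_prefix=/mnt/indexes         \
  --override_kv=nprobe:20:100:500     \
  cpp/bench/ann/conf/bigann-100M.json
```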

__Breaking change__: the behavior of the ANN benchmark executables changes (the library API is not touched). The executable CLI flags have changed, so the newer, adapted wrapper scripts won't work with the executables from the libraft-ann-bench-23.08 conda package.

### A primer
```bash
./cpp/build/ANN_BENCH                         `# benchmark executable`                    \
  --data_prefix=/datastore/my/local/data/path `# override (prefix) path to local data`    \
  --benchmark_min_warmup_time=0.001           `# spend some minimal time warming up`      \
  --benchmark_min_time=3s                     `# run minimum 3 seconds on each case`      \
  --benchmark_out=ivf_pq.csv                  `# duplicate output to this file`           \
  --benchmark_out_format=csv                  `# the file output should be in CSV format` \
  --benchmark_counters_tabular                `# the console output should be tabular`    \
  --benchmark_filter="raft_ivf_pq\..*"        `# use regex to filter benchmarks`          \
  --search                                    `# 'search' mode`                           \
  --override_kv=k:1:10:100:200:500            `# parameter-sweep over the top-k value`    \
  --override_kv=n_queries:1:10:10000          `# ... and the search batch size`           \
  --override_kv=smemLutDtype:"fp8"            `# override a search parameter`             \
  cpp/bench/ann/conf/bigann-100M.json          # specify the path to the config file
```

### Motivation

#### Eliminate huge bug-prone configs
The current config fixes the batch size and k to one value per config, so the whole config needs to be copied to try multiple values. With this PR, both of these parameters can be overridden in the search parameters and/or via the command line (`ANN_BENCH --override_kv=n_queries:1:100:1000 --override_kv=k:1:10:20:50:100:200:500:1000` would test all combinations in one go). Any of the build/search parameters can be overridden at the same time.
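
As a sketch, the same sweep can be narrowed to a single algorithm with the gbench regex filter (the filter value follows the naming used in the primer above):

```bash
# Sketch: sweep batch size and k over the raft_ivf_pq configurations only.
./cpp/build/ANN_BENCH --search                \
  --benchmark_filter="raft_ivf_pq\..*"        \
  --override_kv=n_queries:1:100:1000          \
  --override_kv=k:1:10:20:50:100:200:500:1000 \
  cpp/bench/ann/conf/bigann-100M.json
```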

#### Run the benchmarks and aggregate the data in the minimal environment
The new executable generates reports with QPS, recall, and other metrics using gbench. Hence there is no need to copy dozens of result files back and forth and no need to install a Python environment for running or evaluating the benchmarks. A single CSV or JSON file can be produced for all algorithms and run configurations per dataset+hardware pair.
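
For instance, a sketch of collecting a single machine-readable report for a whole config (the output file name is just an example):

```bash
# Sketch: run all search cases from the config and duplicate the results into one JSON file.
./cpp/build/ANN_BENCH --search              \
  --benchmark_out=bigann-100M.results.json  \
  --benchmark_out_format=json               \
  cpp/bench/ann/conf/bigann-100M.json
```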

#### Speed up the benchmarks
The current benchmark framework is extremely slow due to two factors:
  - The dataset and the index need to be loaded for every test case; for large datasets this takes orders of magnitude longer than the search test itself. In my tests, the preparation phase for bigann-1B took ten minutes, while the search could take anywhere between a few seconds and a minute.
  - The benchmark always goes through the whole query dataset. That is, if the query set contains 10K queries and the batch size is 1, the benchmark runs 10K iterations (in order to produce the result file for evaluating the recall).

In the proposed solution, a user can set the desired run time or number of iterations; the data is loaded only once and the index is cached between the search test cases. My subjective, conservative estimate is an overall speedup of more than 100x for running a typical large-scale benchmark.
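
A sketch of how this looks in practice (the filter value and the 10-second budget are arbitrary examples): gbench keeps repeating each selected case until the requested minimum time is reached, while the dataset and index stay loaded.

```bash
# Sketch: give each selected search case a 10-second time budget.
./cpp/build/ANN_BENCH --search            \
  --benchmark_min_time=10s                \
  --benchmark_filter="raft_ivf_flat\..*"  \
  cpp/bench/ann/conf/bigann-100M.json
```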

#### Better measurement of QPS
By default, the current benchmark reports the average execution time and does not perform warm-up iterations. As a result, the first test case on most of our plots is distorted (e.g. the first iteration of the first case takes a second or two to run, which significantly skews the average of the remaining 999 iterations of ~100us each). `gbench` provides the `--benchmark_min_warmup_time` parameter to skip the first iteration or few, which solves the problem.
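
A minimal sketch (one second of warm-up is an arbitrary choice; the primer above uses a much smaller value):

```bash
# Sketch: exclude a short warm-up period from the reported timings.
./cpp/build/ANN_BENCH --search        \
  --benchmark_min_warmup_time=1       \
  cpp/bench/ann/conf/bigann-100M.json
```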

#### Extra context in the reports
The new benchmark executable uses the gbench context to augment the report with some essential information: the base and query set names, their dimensionality and size, the distance metric, some CPU and GPU info, and the CUDA version. All of this is appended directly to the generated CSV/JSON files, which makes bookkeeping much easier.
In addition, a user may pass extra context via the command line with `--benchmark_context=<key>=<value>`; this could be, for example, the hostname, some environment variables, etc.
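
For example, a sketch with hypothetical keys (gbench accepts the flag multiple times):

```bash
# Sketch: attach extra key=value pairs to the report context; the keys are examples only.
./cpp/build/ANN_BENCH --search                         \
  --benchmark_context=hostname="$(hostname)"           \
  --benchmark_context=notes="run-after-driver-update"  \
  cpp/bench/ann/conf/bigann-100M.json
```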

#### Easier profiling
Thanks to flexible regex filtering and parameter overriding, it's now possible to specify a subset of cases and exactly how many times they should run. This makes profiling with tools such as `nsys` and `ncu` much easier.
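
As a sketch, a narrowly filtered run under Nsight Systems (the filter value, output name, and time budget are examples; an `ncu` invocation can wrap the same command in a similar way):

```bash
# Sketch: profile only the IVF-PQ search cases with a short, fixed time budget.
nsys profile -o ann_bench_ivf_pq          \
  ./cpp/build/ANN_BENCH --search          \
    --benchmark_filter="raft_ivf_pq\..*"  \
    --benchmark_min_time=1s               \
    cpp/bench/ann/conf/bigann-100M.json
```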

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1661
achirkin authored Aug 30, 2023
1 parent a4c9613 commit f6d35ae
Showing 34 changed files with 2,590 additions and 4,017 deletions.
248 changes: 119 additions & 129 deletions bench/ann/conf/bigann-100M.json
```diff
@@ -1,80 +1,90 @@
 {
-  "dataset" : {
-    "name" : "bigann-100M",
-    "base_file" : "data/bigann-1B/base.1B.u8bin",
-    "subset_size" : 100000000,
-    "query_file" : "data/bigann-1B/query.public.10K.u8bin",
-    "distance" : "euclidean"
+  "dataset": {
+    "name": "bigann-100M",
+    "base_file": "bigann-1B/base.1B.u8bin",
+    "subset_size": 100000000,
+    "query_file": "bigann-1B/query.public.10K.u8bin",
+    "groundtruth_neighbors_file": "bigann-100M/groundtruth.neighbors.ibin",
+    "distance": "euclidean"
   },
 
-  "search_basic_param" : {
-    "batch_size" : 10000,
-    "k" : 10,
-    "run_count" : 2
+  "search_basic_param": {
+    "batch_size": 10000,
+    "k": 10
   },
 
-  "index" : [
+  "index": [
     {
-      "name": "raft_ivf_pq.dimpq64-cluster5K-float-float",
+      "name": "raft_ivf_pq.dimpq64-cluster5K",
       "algo": "raft_ivf_pq",
-      "dataset_memtype": "host",
-      "build_param": {
-        "niter": 25,
-        "nlist": 5000,
-        "pq_dim": 64,
-        "ratio": 10
-      },
-      "file": "index/bigann-100M/raft_ivf_pq/dimpq64-cluster5K",
+      "build_param": {"niter": 25, "nlist": 5000, "pq_dim": 64, "ratio": 10},
+      "file": "bigann-100M/raft_ivf_pq/dimpq64-cluster5K",
       "search_params": [
-        {
-          "numProbes": 20,
-          "internalDistanceDtype": "float",
-          "smemLutDtype": "float"
-        },
-        {
-          "numProbes": 30,
-          "internalDistanceDtype": "float",
-          "smemLutDtype": "float"
-        },
-        {
-          "numProbes": 40,
-          "internalDistanceDtype": "float",
-          "smemLutDtype": "float"
-        },
-        {
-          "numProbes": 50,
-          "internalDistanceDtype": "float",
-          "smemLutDtype": "float"
-        },
-        {
-          "numProbes": 100,
-          "internalDistanceDtype": "float",
-          "smemLutDtype": "float"
-        },
-        {
-          "numProbes": 200,
-          "internalDistanceDtype": "float",
-          "smemLutDtype": "float"
-        },
-        {
-          "numProbes": 500,
-          "internalDistanceDtype": "float",
-          "smemLutDtype": "float"
-        },
-        {
-          "numProbes": 1000,
-          "internalDistanceDtype": "float",
-          "smemLutDtype": "float"
-        }
-      ],
-      "search_result_file": "result/bigann-100M/raft_ivf_pq/dimpq64-cluster5K-float-float"
+        { "nprobe": 20, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 30, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 40, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 50, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 100, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 200, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 500, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 1000, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 20, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 30, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 40, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 50, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 100, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 200, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 500, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 1000, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 20, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 30, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 40, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 50, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 100, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 200, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 500, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 1000, "internalDistanceDtype": "half", "smemLutDtype": "half" }
+      ]
     },
     {
-      "name" : "hnswlib.M12",
-      "algo" : "hnswlib",
+      "name": "raft_ivf_pq.dimpq64-cluster10K",
+      "algo": "raft_ivf_pq",
+      "build_param": {"niter": 25, "nlist": 10000, "pq_dim": 64, "ratio": 10},
+      "file": "bigann-100M/raft_ivf_pq/dimpq64-cluster5K",
+      "search_params": [
+        { "nprobe": 20, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 30, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 40, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 50, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 100, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 200, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 500, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 1000, "internalDistanceDtype": "float", "smemLutDtype": "float" },
+        { "nprobe": 20, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 30, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 40, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 50, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 100, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 200, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 500, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 1000, "internalDistanceDtype": "float", "smemLutDtype": "fp8" },
+        { "nprobe": 20, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 30, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 40, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 50, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 100, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 200, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 500, "internalDistanceDtype": "half", "smemLutDtype": "half" },
+        { "nprobe": 1000, "internalDistanceDtype": "half", "smemLutDtype": "half" }
+      ]
+    },
+    {
+      "name": "hnswlib.M12",
+      "algo": "hnswlib",
       "build_param": {"M":12, "efConstruction":500, "numThreads":32},
-      "file" : "index/bigann-100M/hnswlib/M12",
-      "search_params" : [
+      "file": "bigann-100M/hnswlib/M12",
+      "search_params": [
         {"ef":10, "numThreads":1},
         {"ef":20, "numThreads":1},
         {"ef":40, "numThreads":1},
@@ -85,15 +95,14 @@
         {"ef":400, "numThreads":1},
         {"ef":600, "numThreads":1},
         {"ef":800, "numThreads":1}
-      ],
-      "search_result_file" : "result/bigann-100M/hnswlib/M12"
+      ]
     },
     {
-      "name" : "hnswlib.M16",
-      "algo" : "hnswlib",
+      "name": "hnswlib.M16",
+      "algo": "hnswlib",
       "build_param": {"M":16, "efConstruction":500, "numThreads":32},
-      "file" : "index/bigann-100M/hnswlib/M16",
-      "search_params" : [
+      "file": "bigann-100M/hnswlib/M16",
+      "search_params": [
         {"ef":10, "numThreads":1},
         {"ef":20, "numThreads":1},
         {"ef":40, "numThreads":1},
@@ -104,15 +113,14 @@
         {"ef":400, "numThreads":1},
         {"ef":600, "numThreads":1},
         {"ef":800, "numThreads":1}
-      ],
-      "search_result_file" : "result/bigann-100M/hnswlib/M16"
+      ]
     },
     {
-      "name" : "hnswlib.M24",
-      "algo" : "hnswlib",
+      "name": "hnswlib.M24",
+      "algo": "hnswlib",
       "build_param": {"M":24, "efConstruction":500, "numThreads":32},
-      "file" : "index/bigann-100M/hnswlib/M24",
-      "search_params" : [
+      "file": "bigann-100M/hnswlib/M24",
+      "search_params": [
        {"ef":10, "numThreads":1},
        {"ef":20, "numThreads":1},
        {"ef":40, "numThreads":1},
@@ -123,15 +131,14 @@
         {"ef":400, "numThreads":1},
         {"ef":600, "numThreads":1},
         {"ef":800, "numThreads":1}
-      ],
-      "search_result_file" : "result/bigann-100M/hnswlib/M24"
+      ]
     },
     {
-      "name" : "hnswlib.M36",
-      "algo" : "hnswlib",
+      "name": "hnswlib.M36",
+      "algo": "hnswlib",
       "build_param": {"M":36, "efConstruction":500, "numThreads":32},
-      "file" : "index/bigann-100M/hnswlib/M36",
-      "search_params" : [
+      "file": "bigann-100M/hnswlib/M36",
+      "search_params": [
         {"ef":10, "numThreads":1},
         {"ef":20, "numThreads":1},
         {"ef":40, "numThreads":1},
@@ -142,65 +149,48 @@
         {"ef":400, "numThreads":1},
         {"ef":600, "numThreads":1},
         {"ef":800, "numThreads":1}
-      ],
-      "search_result_file" : "result/bigann-100M/hnswlib/M36"
+      ]
     },
-
-
     {
-      "name" : "raft_ivf_flat.nlist100K",
-      "algo" : "raft_ivf_flat",
-      "dataset_memtype": "host",
-      "build_param": {
-        "nlist" : 100000,
-        "niter" : 25,
-        "ratio" : 5
-      },
-      "file" : "index/bigann-100M/raft_ivf_flat/nlist100K",
-      "search_params" : [
-        {"nprobe":20},
-        {"nprobe":30},
-        {"nprobe":40},
-        {"nprobe":50},
-        {"nprobe":100},
-        {"nprobe":200},
-        {"nprobe":500},
-        {"nprobe":1000}
-      ],
-      "search_result_file" : "result/bigann-100M/raft_ivf_flat/nlist100K"
+      "name": "raft_ivf_flat.nlist100K",
+      "algo": "raft_ivf_flat",
+      "build_param": {"nlist": 100000, "niter": 25, "ratio": 5},
+      "dataset_memtype":"host",
+      "file": "bigann-100M/raft_ivf_flat/nlist100K",
+      "search_params": [
+        {"max_batch":10000, "max_k":10, "nprobe":20},
+        {"max_batch":10000, "max_k":10, "nprobe":30},
+        {"max_batch":10000, "max_k":10, "nprobe":40},
+        {"max_batch":10000, "max_k":10, "nprobe":50},
+        {"max_batch":10000, "max_k":10, "nprobe":100},
+        {"max_batch":10000, "max_k":10, "nprobe":200},
+        {"max_batch":10000, "max_k":10, "nprobe":500},
+        {"max_batch":10000, "max_k":10, "nprobe":1000}
+      ]
    },
 
     {
-      "name" : "raft_cagra.dim32",
-      "algo" : "raft_cagra",
+      "name": "raft_cagra.dim32",
+      "algo": "raft_cagra",
       "dataset_memtype": "host",
-      "build_param": {
-        "index_dim" : 32
-      },
-      "file" : "index/bigann-100M/raft_cagra/dim32",
-      "search_params" : [
+      "build_param": {"index_dim": 32},
+      "file": "bigann-100M/raft_cagra/dim32",
+      "search_params": [
         {"itopk": 32},
         {"itopk": 64},
         {"itopk": 128}
-      ],
-      "search_result_file" : "result/bigann-100M/raft_cagra/dim32"
+      ]
     },
-
-
     {
-      "name" : "raft_cagra.dim64",
-      "algo" : "raft_cagra",
-      "dataset_memtype": "host",
-      "build_param": {
-        "index_dim" : 64
-      },
-      "file" : "index/bigann-100M/raft_cagra/dim64",
-      "search_params" : [
+      "name": "raft_cagra.dim64",
+      "algo": "raft_cagra",
+      "dataset_memtype":"host",
+      "build_param": {"index_dim": 64},
+      "file": "bigann-100M/raft_cagra/dim64",
+      "search_params": [
         {"itopk": 32},
         {"itopk": 64},
         {"itopk": 128}
-      ],
-      "search_result_file" : "result/bigann-100M/raft_cagra/dim64"
+      ]
     }
   ]
 }
```