Dev enh google benchmarks #1

Merged


divyegala

No description provided.

cjnolet and others added 6 commits August 30, 2023 00:55
Make the ANN benchmarks use the same google benchmark infrastructure as the prim benchmarks while keeping the functional changes minimal.

### Overview
  - The command-line API largely stays the same, but is extended with gbench-specific parameters, such as using a regex to select algo configs, controlling the minimum run time, and flexible reporting to the console or to files.
  - There's just one executable, `ANN_BENCH`; all of the algorithms are loaded as shared libraries. The CPU-only components (ANN_BENCH itself, hnswlib) do not require CUDA at runtime.
  - Some dependencies are linked statically, so it's possible to just copy the executable and the libs and run the benchmark on a Linux machine with very few packages installed.
  - Search benchmarks no longer produce any output; they use ground truth files to compute and report the recall in-place.
  - Search/build parameters visible in the config files are passed as benchmark counters/labels/context.
  - Extra functionality:
    - `--data_prefix` to specify a custom path where the data sets are stored
    - `--index_prefix` to specify a custom path where the index sets are stored
    - `--override_kv=<key:value1:value2:...:valueN>` override one or more parameters of search/build for parameter-sweep benchmarks

__Breaking change__: this PR changes the behavior of the ANN benchmark executables (the library API is not touched). The executable CLI flags have changed, so the newer, adapted wrapper scripts won't work with the executables from the libraft-ann-bench-23.08 conda package.

### A primer
```bash
# Collect the arguments in an array so that each one can carry a comment
# (a comment placed after a trailing backslash would end the command early).
args=(
  --data_prefix=/datastore/my/local/data/path   # override (prefix) path to local data
  --benchmark_min_warmup_time=0.001             # spend some minimal time warming up
  --benchmark_min_time=3s                       # run minimum 3 seconds on each case
  --benchmark_out=ivf_pq.csv                    # duplicate output to this file
  --benchmark_out_format=csv                    # the file output should be in CSV format
  --benchmark_counters_tabular                  # the console output should be tabular
  --benchmark_filter="raft_ivf_pq\..*"          # use regex to filter benchmarks
  --search                                      # 'search' mode
  --override_kv=k:1:10:100:200:500              # parameter sweep over the top-k value
  --override_kv=n_queries:1:10:10000            #   and over the search batch size
  --override_kv=smemLutDtype:"fp8"              # override a search parameter
  cpp/bench/ann/conf/bigann-100M.json           # path to the config file
)
./cpp/build/ANN_BENCH "${args[@]}"              # benchmark executable
```
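
To give a rough idea of which cases a filter such as `raft_ivf_pq\..*` selects, here is a minimal Python sketch. The benchmark names are hypothetical, and Python's `re.search` only approximates the regex engine used by gbench, so treat this as an illustration rather than a description of the actual matching code.

```python
import re


def select_benchmarks(names, pattern):
    """Return the names in which `pattern` matches anywhere (search semantics)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


# Hypothetical benchmark names, for illustration only.
names = [
    "raft_ivf_pq.nlist1024/0",
    "raft_ivf_flat.nlist1024/0",
    "hnswlib.M12/0",
]

# The escaped dot keeps the pattern from also matching e.g. "raft_ivf_pq_v2".
selected = select_benchmarks(names, r"raft_ivf_pq\..*")
print(selected)
```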

### Motivation

#### Eliminate huge bug-prone configs
The current config fixes the batch size and k to one value per config, so the whole config needs to be copied to try multiple values. In this PR, both of these parameters can be overridden in the search parameters and/or via the command line (`ANN_BENCH --override_kv=n_queries:1:100:1000 --override_kv=k:1:10:20:50:100:200:500:1000` would test all combinations in one go). Any of the build/search parameters can be overridden at the same time.
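
The "all combinations" behaviour can be pictured as a Cartesian product over the overridden values. The sketch below is not the `ANN_BENCH` implementation; the helper names `parse_override` and `expand_overrides` are made up for illustration, and the order in which the real executable enumerates the cases is an assumption.

```python
from itertools import product


def parse_override(spec):
    """Split 'key:v1:v2:...:vN' into (key, [v1, ..., vN])."""
    key, *values = spec.split(":")
    if not key or not values:
        raise ValueError(f"expected key:value1[:value2...], got {spec!r}")
    return key, values


def expand_overrides(specs):
    """Yield one dict per combination of the overridden values."""
    parsed = [parse_override(spec) for spec in specs]
    keys = [key for key, _ in parsed]
    for combo in product(*(values for _, values in parsed)):
        yield dict(zip(keys, combo))


cases = list(expand_overrides([
    "n_queries:1:100:1000",
    "k:1:10:20:50:100:200:500:1000",
]))
print(len(cases))  # 3 batch sizes * 8 values of k = 24 combinations
```

Values are kept as strings here; a real implementation would presumably convert them to the type expected by each parameter.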

#### Run the benchmarks and aggregate the data in the minimal environment
The new executable generates reports with QPS, Recall, and other metrics using gbench. Hence there's no need to copy dozens of result files back and forth and no need to install a Python environment for running or evaluating. A single CSV or JSON can be produced for all algorithms and run configurations per dataset+hardware pair.

#### Speed up the benchmarks
The current benchmark framework is extremely slow due to two factors:
  - The dataset and the index need to be loaded for every test case; for large datasets this takes orders of magnitude longer than the search test itself. In my tests, the preparation phase for bigann-1B took ten minutes and the search could take anywhere between a few seconds and a minute.
  - The benchmark always goes through the whole query dataset. That is, if the query set is 10K and the batch size is 1, the benchmark repeats 10K times (to produce the result file for evaluating the recall).

In the proposed solution, a user can set the desired time or number of iterations to run; the data is loaded only once and the index is cached between the search test cases. My subjective, conservative estimate is an overall speedup of more than 100x for running a typical large-scale benchmark.

#### Better measurement of QPS
By default, the current benchmark reports the average execution time and does not run any warm-up iterations. As a result, the first test case on most of our plots is distorted (e.g. the first iteration of the first case takes about a second or two to run, and that significantly affects the average over the remaining 999 iterations of ~100us each). `gbench` provides the `--benchmark_min_warmup_time` parameter to skip the first one or few iterations, which solves the problem.
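
A quick back-of-the-envelope check of that distortion, using the figures quoted above (one cold iteration of about a second, followed by 999 iterations of about 100us). The helper `mean_latency` is purely illustrative.

```python
def mean_latency(samples, skip=0):
    """Average of the samples after dropping the first `skip` of them."""
    kept = samples[skip:]
    return sum(kept) / len(kept)


first = 1.0      # seconds; the cold first iteration ("about a second or two")
steady = 100e-6  # seconds; a steady-state iteration (~100us)
samples = [first] + [steady] * 999

with_cold = mean_latency(samples)             # cold iteration included
without_cold = mean_latency(samples, skip=1)  # cold iteration skipped

print(f"{with_cold * 1e6:.1f} us vs {without_cold * 1e6:.1f} us")
```

With these numbers, the single cold iteration inflates the reported average by roughly 11x, from about 100us to about 1.1ms.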

#### Extra context in the reports
The new benchmark executable uses gbench context to augment the report with some essential information: the base and query set names, dimensionality, and size; the distance metric; some CPU and GPU info; and the CUDA version. All this is appended directly to the generated CSV/JSON files, which makes the bookkeeping much easier.
In addition, a user may pass extra context via the command line with `--benchmark_context=<key>=<value>`; this could be e.g. the hostname, some ENV variables, etc.
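
As a sketch of how such a report might be consumed: gbench JSON output has a top-level `context` object and a `benchmarks` list, so the context can be attached to every result row. The report below is illustrative only and is NOT real output; the specific context keys (`dataset`, `hostname`) and the counter name (`Recall`) are assumptions.

```python
import json

# Illustrative report, not real output; keys other than "context",
# "benchmarks", and "name" are assumptions made for this example.
report_text = """
{
  "context": {"dataset": "bigann-100M", "hostname": "node01"},
  "benchmarks": [
    {"name": "raft_ivf_pq.case_a", "Recall": 0.5},
    {"name": "raft_ivf_pq.case_b", "Recall": 0.75}
  ]
}
"""


def rows_with_context(text):
    """Merge the shared context into every benchmark entry."""
    report = json.loads(text)
    context = report.get("context", {})
    return [{**context, **bench} for bench in report.get("benchmarks", [])]


rows = rows_with_context(report_text)
for row in rows:
    print(row["hostname"], row["dataset"], row["name"], row["Recall"])
```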

#### Easier profiling
Thanks to flexible regex filtering and parameter overriding, it's now possible to specify a subset of cases and the exact number of times they should run. This makes profiling with tools such as `nsys` and `ncu` much easier.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1661
An error occurs when using the CAGRA multi-CTA implementation with topk > 32. This PR fixes the bug.

Authors:
  - tsuki (https://github.com/enp1s0)

Approvers:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Divye Gala (https://github.com/divyegala)
  - Micka (https://github.com/lowener)

URL: rapidsai#1784
This PR adds the citation information for the CAGRA paper preprint to README.md.

Authors:
  - tsuki (https://github.com/enp1s0)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1787
…entations. (rapidsai#1769)

This is just fixing merge conflicts for rapidsai#1661 to continue making progress on new self-contained Python packaging. 

Closes rapidsai#1762

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Artem M. Chirkin (https://github.com/achirkin)
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Artem M. Chirkin (https://github.com/achirkin)

URL: rapidsai#1769
cjnolet merged commit d6757c1 into dantegd:dev-enh-google-benchmarks Aug 30, 2023
dantegd pushed a commit that referenced this pull request Jul 23, 2024
RAPIDS repos are using the `main` branch of https://github.com/actions/labeler which recently introduced [breaking changes](https://github.com/actions/labeler/releases/tag/v5.0.0).

This PR pins to the latest v4 release of the labeler action until we can evaluate the changes required for v5.

Authors:
   - Ray Douglass (https://github.com/raydouglass)

Approvers:
   - AJ Schmidt (https://github.com/ajschmidt8)