Fix ANN bench latency #1940

tfeher · 2023-10-31T01:09:22Z

This adds an explicit latency column to the benchmark.

tfeher · 2023-10-31T01:22:00Z

The initial version of this PR adds a helper ANN benchmark class: FixLatencyWorkload. It can be configured to spend
10 ms of CPU or GPU time sleeping, and it is useful for illustrating the shortcomings of the current benchmark counters.

./RAFT_IVF_PQ_ANN_BENCH --search --data_prefix=/data --benchmark_filter=fix_latency --benchmark_min_time=100x --benchmark_counters_tabular=true --mode=latency fix_latency.json
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second          k  n_queries total_queries    use_gpu
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fix_latency/0/process_time/real_time/threads:1       10.2 ms        0.032 ms          100  0.0101664  0.0101838          0    1.01838        982.414/s         10         10          1000          0
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second          k  n_queries sync_stream total_queries    use_gpu
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fix_latency/1/process_time/real_time/threads:1      0.024 ms        0.024 ms          100   13.5014u  0.0104866          0    1.04866       408.261k/s         10         10           0          1000          1
fix_latency/2/process_time/real_time/threads:1       10.5 ms         10.5 ms          100  0.0104682  0.0104769          0    1.04769        954.496/s         10         10           1          1000          1

We can see the following issues:

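CPU: measures user time plus system time, which excludes idle time. Idle time also accrues while we wait for a GPU sync, so CPU time is not a correct measure of latency.
GPU: if the GPU library does not synchronize the stream, the GPU timer does not capture the workload's duration (fix_latency/1, with sync_stream set to 0, reports about 13.5 µs of GPU time for a workload whose latency is about 10.5 ms).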
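
To illustrate the CPU point outside the benchmark harness, here is a minimal Python sketch (not code from this PR). It times a workload that only sleeps, using `time.perf_counter` for wall-clock time and `time.process_time` for user plus system CPU time. The names `measure` and `sleeping_workload` are made up for this illustration, and the 10 ms sleep merely stands in for the FixLatencyWorkload configuration.

```python
import time


def measure(workload):
    """Return (wall_seconds, cpu_seconds) spent in workload()."""
    wall_start = time.perf_counter()   # wall clock, includes idle time
    cpu_start = time.process_time()    # user + system CPU time only
    workload()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def sleeping_workload():
    # Stand-in for a fixed-latency workload that only waits (assumed 10 ms).
    time.sleep(0.010)


wall, cpu = measure(sleeping_workload)
print(f"wall: {wall * 1e3:.2f} ms, cpu: {cpu * 1e3:.2f} ms")
```

The CPU figure should come out as a small fraction of the wall-clock figure, which mirrors the fix_latency/0 row above (10.2 ms Time versus 0.032 ms CPU).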

tfeher · 2023-10-31T01:31:11Z

In throughput mode, Time is not a correct measure of latency. Time is the wall-clock time difference divided by the number of iterations. Since we run iterations in parallel, Time effectively becomes the latency of a single iteration divided by the number of threads.

I believe we want to define vector search Latency as the wall-clock time needed to finish a batch. This PR adds such a column to the benchmark output. Time gives us almost what we need, but it is divided by the number of threads. Unfortunately, the benchmark timers are private, so we cannot access them from the benchmark state. Instead, we can take the end_to_end duration variable and divide it by the iteration count to get the average latency (notice the kAvgThreads).

./RAFT_IVF_PQ_ANN_BENCH --search --data_prefix=/data --benchmark_filter=fix_latency --benchmark_min_time=100x --benchmark_counters_tabular=true --mode=throughput /workspace1/config/fix_latency.json

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second          k  n_queries total_queries    use_gpu
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fix_latency/0/process_time/real_time/threads:1        10.2 ms        0.032 ms          100  0.0101596  0.0101772          0    1.01772        983.058/s         10         10          1000          0
fix_latency/0/process_time/real_time/threads:2        5.10 ms        0.046 ms          200  0.0101801  0.0102127          0    1.02172       1.95936k/s         10         10            2k          0
fix_latency/0/process_time/real_time/threads:4        2.55 ms        0.057 ms          400  0.0101708   0.010227          0    1.02396       3.91614k/s         10         10            4k          0
fix_latency/0/process_time/real_time/threads:8        1.28 ms        0.064 ms          800  0.0101425  0.0102338          0    1.02196       7.84236k/s         10         10            8k          0
fix_latency/0/process_time/real_time/threads:16      0.633 ms        0.039 ms         1600  0.0101036  0.0101894          0    1.01989       15.8025k/s         10         10           16k          0
fix_latency/0/process_time/real_time/threads:32      0.316 ms        0.033 ms         3200  0.0100886  0.0102426          0    1.01987       31.6601k/s         10         10           32k          0
fix_latency/0/process_time/real_time/threads:48      0.211 ms        0.033 ms         4800   0.010088  0.0102979          0    1.02507       47.4775k/s         10         10           48k          0
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second          k  n_queries sync_stream total_queries    use_gpu
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fix_latency/1/process_time/real_time/threads:1       0.025 ms        0.025 ms          100   13.8858u  0.0104901          0    1.04901       404.242k/s         10         10           0          1000          1
fix_latency/1/process_time/real_time/threads:2       0.013 ms        0.023 ms          200   8.14784u  0.0104827          0    1.04866       797.795k/s         10         10           0            2k          1
fix_latency/1/process_time/real_time/threads:4       0.014 ms        0.041 ms          400   10.9731u  0.0104859          0    1.04857       709.049k/s         10         10           0            4k          1
fix_latency/1/process_time/real_time/threads:8       0.368 ms        0.657 ms          800   105.973u  0.0296742          0    2.96675       27.1756k/s         10         10           0            8k          1
fix_latency/1/process_time/real_time/threads:16      0.945 ms         12.0 ms         1600   1.43132m  0.0289549          0    2.89458       10.5867k/s         10         10           0           16k          1
fix_latency/1/process_time/real_time/threads:32      0.516 ms         16.1 ms         3200   1023.67u  0.0180644          0    1.80312       19.3707k/s         10         10           0           32k          1
fix_latency/1/process_time/real_time/threads:48      0.354 ms         16.4 ms         4800   790.397u   0.017988          0    1.79329        28.237k/s         10         10           0           48k          1
fix_latency/2/process_time/real_time/threads:1        10.5 ms         10.5 ms          100  0.0104754  0.0104842          0    1.04842        953.833/s         10         10           1          1000          1
fix_latency/2/process_time/real_time/threads:2        5.24 ms         10.5 ms          200  0.0104645  0.0104835          0    1.04876       1.90857k/s         10         10           1            2k          1
fix_latency/2/process_time/real_time/threads:4        2.62 ms         10.5 ms          400  0.0104443  0.0104819          0    1.04864       3.81809k/s         10         10           1            4k          1
fix_latency/2/process_time/real_time/threads:8        1.31 ms         10.4 ms          800   0.010427  0.0104937          0    1.04861       7.63035k/s         10         10           1            8k          1
fix_latency/2/process_time/real_time/threads:16      0.655 ms         10.4 ms         1600  0.0103763  0.0105217          0    1.05113       15.2569k/s         10         10           1           16k          1
fix_latency/2/process_time/real_time/threads:32      0.328 ms         10.3 ms         3200  0.0103018  0.0105683          0     1.0542       30.4933k/s         10         10           1           32k          1
fix_latency/2/process_time/real_time/threads:48      0.219 ms         10.3 ms         4800  0.0103413   0.010683          0    1.06139        45.568k/s         10         10           1           48k          1
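
As a sanity check on the relation between Time and Latency, the sketch below (Python, not part of the PR) multiplies the Time column by the thread count for the fix_latency/0 rows above and compares the result with the Latency column. The values are copied from the table; `latency_from_time` and `relative_error` are hypothetical helper names.

```python
# (threads, Time in ms, Latency in s) copied from the fix_latency/0 rows above.
ROWS = [
    (1, 10.2, 0.0101772),
    (2, 5.10, 0.0102127),
    (4, 2.55, 0.010227),
    (8, 1.28, 0.0102338),
    (16, 0.633, 0.0101894),
    (32, 0.316, 0.0102426),
    (48, 0.211, 0.0102979),
]


def latency_from_time(time_ms, threads):
    """Undo the per-thread division: approximate per-batch latency in seconds."""
    return time_ms * 1e-3 * threads


def relative_error(estimate, reference):
    """Relative deviation of estimate from reference."""
    return abs(estimate - reference) / reference


for threads, time_ms, latency_s in ROWS:
    est = latency_from_time(time_ms, threads)
    print(f"threads={threads:2d}  Time*threads={est * 1e3:6.2f} ms  "
          f"Latency={latency_s * 1e3:6.2f} ms  "
          f"rel.err={relative_error(est, latency_s):.1%}")
```

For these rows the two figures agree to within roughly 2%, so Time multiplied by the thread count stays near 10 ms regardless of the thread count, as does the Latency column.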

tfeher · 2023-10-31T01:49:15Z

TODO: add a command line argument to control the number of threads. This would be useful when running a larger set of benchmarks or when profiling.

cjnolet · 2023-11-01T01:30:42Z

> TODO: add command line arg to control the number of threads. This would be useful when running larger set of benchmarks / profiling.

It would be nice to even be able to set the min/max threads (and set them equal if one wants to run only a single-threaded experiment).

cjnolet

LGTM!

tfeher · 2023-11-02T20:51:01Z

~~I am still investigating a bug. Please do not merge until it's fixed.~~ Fixed.

tfeher · 2023-11-02T21:37:37Z

Fixed the problem. Pending CI, it is ready to merge. Example output:

./RAFT_IVF_FLAT_ANN_BENCH --search --data_prefix=/data --benchmark_filter=raft_ivf_flat --benchmark_out_format=csv --benchmark_out=res.csv  --override_kv=n_queries:10 --benchmark_counters_tabular=true --mode=throughput --threads=1:24 --benchmark_min_warmup_time=1 /workspace1/config/deep-10M.json
...
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second          k  n_queries     nprobe total_queries
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
raft_ivf_flat.nlist5K/0/process_time/real_time/threads:1       0.111 ms        0.111 ms         6335   101.801u   110.565u    0.90378   0.700427       90.4461k/s         10         10         50        63.35k
raft_ivf_flat.nlist5K/0/process_time/real_time/threads:2       0.075 ms        0.146 ms         9250   133.635u   151.153u    0.90378   0.699437       132.519k/s         10         10         50         92.5k
raft_ivf_flat.nlist5K/0/process_time/real_time/threads:4       0.052 ms        0.187 ms        13560   177.321u   206.465u    0.90378   0.699672       194.025k/s         10         10         50        135.6k
raft_ivf_flat.nlist5K/0/process_time/real_time/threads:8       0.040 ms        0.250 ms        17696   276.013u   324.323u    0.90378   0.717858       249.676k/s         10         10         50       176.96k
raft_ivf_flat.nlist5K/0/process_time/real_time/threads:16      0.037 ms        0.331 ms        18528    508.35u   596.448u    0.90378   0.689932       271.407k/s         10         10         50       185.28k
raft_ivf_flat.nlist5K/0/process_time/real_time/threads:24      0.037 ms        0.400 ms        18144   757.152u   892.142u    0.90377   0.671828       272.899k/s         10         10         50       181.44k

It is recommended to use --benchmark_min_warmup_time=1; otherwise the first benchmark is not warmed up, and the relative performance across different thread counts is misleading.
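
For readers wondering how `--threads=1:24` maps onto the six rows above, here is a small Python sketch. It is an assumption inferred from the sample output (thread counts 1, 2, 4, 8, 16, 24), not the benchmark's actual parsing code: it doubles from the minimum and always ends with the maximum. Both `parse_thread_range` and `thread_counts` are hypothetical names, and treating a bare number as min equal to max is also an assumption.

```python
def parse_thread_range(spec):
    """Parse a 'min:max' string into (min, max); hypothetical helper."""
    lo_text, sep, hi_text = spec.partition(":")
    lo = int(lo_text)
    # Assumption: a bare number such as "4" means min == max.
    hi = int(hi_text) if sep else lo
    if lo < 1 or hi < lo:
        raise ValueError(f"invalid thread range: {spec!r}")
    return lo, hi


def thread_counts(lo, hi):
    """Double from lo while below hi, then finish with hi itself."""
    counts = []
    n = lo
    while n < hi:
        counts.append(n)
        n *= 2
    counts.append(hi)
    return counts


print(thread_counts(*parse_thread_range("1:24")))
```

With this scheme, setting min equal to max (for example `1:1`) yields a single thread count, which covers the single-threaded experiment mentioned earlier in the thread.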

cjnolet · 2023-11-03T05:15:01Z

/merge

tfeher · 2023-11-03T13:24:19Z

cpp/bench/ann/src/common/benchmark.hpp

-    auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
-    state.counters.insert({{"end_to_end", duration}});
-  }
+  cudaDeviceSynchronize();


CPU-only builds fail at this point:

benchmark.hpp:299:3: error: there are no arguments to 'cudaDeviceSynchronize' that depend on a template parameter, so a declaration of 'cudaDeviceSynchronize' must be available [-fpermissive] 299 | cudaDeviceSynchronize()

I thought we had cuda_stub.hpp to help with this, but it does not work. If we are in a hurry, we could disable the sync here, but it would be better to fix the cuda_stubs.

This adds explicit latency column to the benchmark.

Authors:
- Tamas Bela Feher (https://github.com/tfeher)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1940

tfeher added 3 commits October 31, 2023 02:06

Define latency, add sync, add FixLatencyWorkload

98152ff

Remove manual timing from within the loop

9604b62

Update timer description

884b0bc

tfeher requested a review from a team as a code owner October 31, 2023 01:09

github-actions bot added the cpp label Oct 31, 2023

tfeher added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Vector Search and removed cpp labels Oct 31, 2023

Add configuration for ANN benchmark tests

fb81d3f

tfeher requested a review from a team as a code owner October 31, 2023 01:11

github-actions bot added cpp python labels Oct 31, 2023

tfeher mentioned this pull request Oct 31, 2023

Adding throughput and latency modes to raft-ann-bench #1920

Merged

update doc

b4f37f1

cjnolet reviewed Nov 1, 2023

cjnolet assigned tfeher Nov 1, 2023

tfeher added 4 commits November 2, 2023 19:46

Add --threads benchmark arg

5186c12

Add docstring for --threads arg

dc8fc0f

Merge remote-tracking branch 'origin/branch-23.12' into fix_ann_bench…

274283c

remove fixed_latency_workload that was used for debugging

fefc16c

github-actions bot removed the python label Nov 2, 2023

tfeher added 2 commits November 2, 2023 20:16

restore raft_benchmark.cu

52670f7

Extend comments on synchronization

c310095

cjnolet approved these changes Nov 2, 2023

tfeher added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Nov 2, 2023

Fix query_set initialization

777ded7

tfeher removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Nov 2, 2023

Adding search-threads option

71659a2

github-actions bot added the python label Nov 2, 2023

cjnolet added 2 commits November 2, 2023 18:23

Correctly passing search-thread

f87430b

Using for search

7645a3d

cjnolet and others added 5 commits November 3, 2023 02:04

Update __main__.py

2a8822c

Co-authored-by: Tamas Bela Feher <[email protected]>

Including cuda_stub

4a7aec5

Merge branch 'fix_ann_bench_latency' of github.com:tfeher/raft into f…

be90181

Fixing style

cca71e8

Merge branch 'branch-23.12' into fix_ann_bench_latency

a57700b

tfeher and others added 3 commits November 3, 2023 14:47

Remove cudaDeviceSynchronize() to fix CPU_ONLY build

d903245

Removing cudart calls from benchmark.hpp

b19fb44

Merge branch 'fix_ann_bench_latency' of github.com:tfeher/raft into f…

ab049c3

rapids-bot bot merged commit b21cad3 into rapidsai:branch-23.12 Nov 3, 2023
57 checks passed

tfeher mentioned this pull request Jan 10, 2024

ANN bench fix latency measurement overhead #2084

Merged
