
ANN bench fix latency measurement overhead #2084

Merged: 5 commits into rapidsai:branch-24.02, Jan 24, 2024

Conversation

@tfeher (Contributor) commented Jan 9, 2024

The CPU timestamp start is taken before the ANN algo is copied to all threads. This is fixed by initializing start a few lines later.
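A minimal, self-contained sketch of the issue and the fix (AnnAlgo, copy(), and search() are illustrative stand-ins, not the actual benchmark.hpp code):

#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the per-thread ANN algorithm wrapper.
struct AnnAlgo {
  AnnAlgo copy() const { return *this; }  // each benchmark thread gets its own copy
  void search() const { /* run one batch of queries */ }
};

int main()
{
  AnnAlgo shared;

  // Previously, start was initialized here, so the copy below was
  // (incorrectly) counted as part of the measured latency.
  AnnAlgo per_thread = shared.copy();

  // Fix: take the CPU timestamp only after the algo has been copied.
  auto start = std::chrono::high_resolution_clock::now();
  per_thread.search();
  auto end = std::chrono::high_resolution_clock::now();

  double seconds =
    std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
  std::printf("latency: %f s\n", seconds);
}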

@tfeher tfeher self-assigned this Jan 9, 2024
@github-actions github-actions bot added the cpp label Jan 9, 2024
@tfeher tfeher added the improvement, non-breaking, and Vector Search labels and removed the cpp label Jan 9, 2024
@tfeher tfeher requested a review from achirkin January 9, 2024 21:21
@achirkin (Contributor) left a comment


Moving the .lap() calls breaks the functionality. I'd suggest keeping only the change that moves the initialization of start to after the algo is copied.

Two review threads on cpp/bench/ann/src/common/benchmark.hpp (outdated, resolved)
@achirkin (Contributor) commented Jan 10, 2024

There also seems to be a problem with how we compute both the "Latency" and "GPU" counters. In both cases, we divide the values by the number of iterations, state.iterations(), to compute the average.
However, we don't know how many times the counter is stored, or what the value of state.iterations() is at those moments. gbench can probably call the same benchmark case multiple times within a single thread, for example to control the overall benchmark time. If that's the case, the counter values are summed up, which is not what we want to compute.

I've tried to add this change and compare the benchmark outputs:

...
    auto end      = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
    if (state.thread_index() == 0) { state.counters.insert({{"end_to_end", duration}}); }
    state.counters.insert(
      {"Latency",
       {duration, benchmark::Counter::kAvgThreads | benchmark::Counter::kAvgIterations}});
  }

  state.SetItemsProcessed(queries_processed);
  if (cudart.found()) {
    state.counters.insert({"GPU",
                           {gpu_timer.total_time(),
                            benchmark::Counter::kAvgThreads | benchmark::Counter::kAvgIterations}});
...

As a result, the Latency counter matches the real_time counter almost perfectly (make sure the start time is initialized after the index copy, though). This confirms that there is a redundancy between Latency and real_time, and that we have been measuring Latency and GPU time incorrectly until now.

NB: "items_per_second" metric has always been using "real_time" under the hood, so we can rely on it being correct in our previous benchmark results.

@tfeher (Contributor, Author) commented Jan 10, 2024

@achirkin gbench real_time is not the value that we want to define as latency. See details here: #1940 (comment)

@achirkin (Contributor) commented

Thanks, @tfeher, for the clarification! Indeed, now I see that the total iterations counter sums the iterations from all threads, whereas real_time counts only the time of the main thread; as a result, real_time shows the proper average latency divided by the number of threads.

This means the GPU counter is probably correct and the Latency counter is probably only slightly off due to the wrapper copy overheads.

In that case, I'll modify my suggestion: to be on the safe side, let's replace the explicit state.iterations() division with the kAvgIterations flag; it divides the total time by the total number of iterations (across all threads), hence gives the proper average latency for a single thread:

...
    auto end      = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
    if (state.thread_index() == 0) { state.counters.insert({{"end_to_end", duration}}); }
    state.counters.insert({"Latency", {duration, benchmark::Counter::kAvgIterations}});
  }

  state.SetItemsProcessed(queries_processed);
  if (cudart.found()) {
    state.counters.insert({"GPU", {gpu_timer.total_time(), benchmark::Counter::kAvgIterations}});
  }
...

@tfeher (Contributor, Author) commented Jan 23, 2024

The PR has been reduced to fixing the start of the timer; other overheads are not addressed. For reference, here is the issue that was discussed in a previous version of the PR:

The current latency measurement includes overhead on the order of 10 microseconds, which is visible when comparing the "GPU" and "Latency" columns of the benchmark output. This can be significant for benchmarks with small batch sizes.

This overhead is mainly due to the lap() call within the benchmark loop. It would probably be better to make measuring GPU latency optional.
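One possible shape for that, as a sketch only (search_batch, GpuTimer, and g_measure_gpu_time are hypothetical names, the real timer wraps CUDA events, and this is not the actual benchmark.hpp code):

#include <benchmark/benchmark.h>

// Hypothetical stand-ins for the real query call and the CUDA-event timer.
void search_batch() { /* run one batch of ANN queries */ }
struct GpuTimer {
  void lap() { /* record + sync a CUDA event pair: the ~10 us per-iteration cost */ }
  double total_time() const { return 0.0; }
};

static bool g_measure_gpu_time = false;  // opt-in, e.g. set from a CLI flag

static void bench_search(benchmark::State& state)
{
  GpuTimer gpu_timer;
  for (auto _ : state) {
    search_batch();
    // Pay the per-iteration synchronization cost only when GPU timing
    // was explicitly requested.
    if (g_measure_gpu_time) { gpu_timer.lap(); }
  }
}
BENCHMARK(bench_search);
BENCHMARK_MAIN();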

@github-actions github-actions bot added the cpp label Jan 23, 2024
@tfeher tfeher marked this pull request as ready for review January 23, 2024 22:50
@tfeher tfeher requested a review from a team as a code owner January 23, 2024 22:50
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
if (state.thread_index() == 0) { state.counters.insert({{"end_to_end", duration}}); }
state.counters.insert({"Latency", {duration, benchmark::Counter::kAvgIterations}});
@tfeher (Contributor, Author) commented:

I have also applied Artem's suggestion to store the latency values with the benchmark::Counter::kAvgIterations flag.

Earlier we manually divided by the number of iterations and let gbench average over threads using kAvgThreads. Since iterations are counted as the total iterations performed by all threads, using kAvgIterations leads to the same results (without the manual division by iterations).
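A numeric illustration of that equivalence (hypothetical numbers, assuming gbench sums the per-thread counter values before applying the flags): with 2 threads, 100 iterations per thread, and a 1.0 s loop per thread, the old code inserted 1.0 / 100 = 0.01 from each thread, and kAvgThreads divided the 0.02 sum by the 2 threads, giving 0.01; the new code inserts 1.0 from each thread, the sum is 2.0, and kAvgIterations divides by the 200 total iterations, again giving 0.01 s.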

@achirkin (Contributor) left a comment

LGTM!

@cjnolet (Member) commented Jan 24, 2024

/merge

@rapids-bot rapids-bot bot merged commit 3ce00d3 into rapidsai:branch-24.02 Jan 24, 2024
60 of 61 checks passed
Labels: cpp, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), Vector Search