
ANN bench fix latency measurement overhead #2084

Merged: 5 commits into rapidsai:branch-24.02, Jan 24, 2024

Conversation

@tfeher (Contributor) commented Jan 9, 2024

The CPU timestamp start is taken before the ANN algo is copied to all threads. This is fixed by initializing start a few lines later.
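A minimal, self-contained sketch of the issue and the fix (AnnAlgo, copy(), and search() are illustrative stand-ins, not the actual benchmark.hpp code):

#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the per-thread ANN algorithm wrapper.
struct AnnAlgo {
  AnnAlgo copy() const { return *this; }  // each benchmark thread gets its own copy
  void search() const { /* run one batch of queries */ }
};

int main()
{
  AnnAlgo shared;

  // Previously, start was initialized here, so the copy below was
  // (incorrectly) counted as part of the measured latency.
  AnnAlgo per_thread = shared.copy();

  // Fix: take the CPU timestamp only after the algo has been copied.
  auto start = std::chrono::high_resolution_clock::now();
  per_thread.search();
  auto end = std::chrono::high_resolution_clock::now();

  double seconds =
    std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
  std::printf("latency: %f s\n", seconds);
}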

@tfeher tfeher self-assigned this Jan 9, 2024
@github-actions github-actions bot added the cpp label Jan 9, 2024
@tfeher tfeher added the improvement, non-breaking, and Vector Search labels and removed the cpp label Jan 9, 2024
@tfeher tfeher requested a review from achirkin January 9, 2024 21:21
@achirkin (Contributor) left a comment


Moving the .lap() calls breaks the functionality. I'd suggest keeping only the change that moves the initialization of start to after the algo is copied.

Two review threads on cpp/bench/ann/src/common/benchmark.hpp (outdated, resolved)
@achirkin (Contributor) commented Jan 10, 2024

There also seems to be a problem with how we compute both the "Latency" and "GPU" counters. In both cases, we divide the values by the number of iterations, state.iterations(), to compute the average.
However, we don't know how many times the counter is stored, or what the value of state.iterations() is at those moments. gbench can probably call the same benchmark case multiple times within a single thread, for example to control the overall benchmark time. If that's the case, the counter values are summed up, which is not what we want to compute.

I've tried to add this change and compare the benchmark outputs:

...
    auto end      = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
    if (state.thread_index() == 0) { state.counters.insert({{"end_to_end", duration}}); }
    state.counters.insert(
      {"Latency",
       {duration, benchmark::Counter::kAvgThreads | benchmark::Counter::kAvgIterations}});
  }

  state.SetItemsProcessed(queries_processed);
  if (cudart.found()) {
    state.counters.insert({"GPU",
                           {gpu_timer.total_time(),
                            benchmark::Counter::kAvgThreads | benchmark::Counter::kAvgIterations}});
...

As a result, the Latency counter matches the real_time counter almost perfectly (make sure the start time is initialized after the index copy, though). This confirms that there is a redundancy between Latency and real_time, and that we have been measuring Latency and GPU time incorrectly until now.

NB: "items_per_second" metric has always been using "real_time" under the hood, so we can rely on it being correct in our previous benchmark results.

@tfeher (Contributor, Author) commented Jan 10, 2024

@achirkin gbench real_time is not the value that we want to define as latency. See details here: #1940 (comment)

@achirkin (Contributor) commented

Thanks, @tfeher, for the clarification! Indeed, now I see that the total iterations counter sums the iterations from all threads, whereas real_time counts only the time of the main thread; as a result, real_time shows the proper average latency divided by the number of threads.

This means the GPU counter is probably correct and the Latency counter is probably only slightly off due to the wrapper copy overheads.

In that case, I'll modify my suggestion: to be on the safe side, let's replace the explicit state.iterations() division with the kAvgIterations flag; it divides the total time by the total number of iterations (across all threads), hence gives the proper average latency for a single thread:

...
    auto end      = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
    if (state.thread_index() == 0) { state.counters.insert({{"end_to_end", duration}}); }
    state.counters.insert({"Latency", {duration, benchmark::Counter::kAvgIterations}});
  }

  state.SetItemsProcessed(queries_processed);
  if (cudart.found()) {
    state.counters.insert({"GPU", {gpu_timer.total_time(), benchmark::Counter::kAvgIterations}});
  }
...

@tfeher (Contributor, Author) commented Jan 23, 2024

The PR has been reduced to fixing the start of the timer; other overheads are not addressed. For reference, here is the issue that was discussed in a previous version of the PR:

The current latency measurement includes overhead on the order of 10 microseconds, which is visible when comparing the "GPU" and "Latency" columns of the benchmark output. This can be significant for benchmarks with small batch sizes.

This overhead is mainly due to the lap() call within the benchmark loop. It would probably be better to make measuring GPU latency optional.
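One possible shape for that, as a sketch only (search_batch, GpuTimer, and g_measure_gpu_time are hypothetical names, the real timer wraps CUDA events, and this is not the actual benchmark.hpp code):

#include <benchmark/benchmark.h>

// Hypothetical stand-ins for the real query call and the CUDA-event timer.
void search_batch() { /* run one batch of ANN queries */ }
struct GpuTimer {
  void lap() { /* record + sync a CUDA event pair: the ~10 us per-iteration cost */ }
  double total_time() const { return 0.0; }
};

static bool g_measure_gpu_time = false;  // opt-in, e.g. set from a CLI flag

static void bench_search(benchmark::State& state)
{
  GpuTimer gpu_timer;
  for (auto _ : state) {
    search_batch();
    // Pay the per-iteration synchronization cost only when GPU timing
    // was explicitly requested.
    if (g_measure_gpu_time) { gpu_timer.lap(); }
  }
}
BENCHMARK(bench_search);
BENCHMARK_MAIN();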

@github-actions github-actions bot added the cpp label Jan 23, 2024
@tfeher tfeher marked this pull request as ready for review January 23, 2024 22:50
@tfeher tfeher requested a review from a team as a code owner January 23, 2024 22:50
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
if (state.thread_index() == 0) { state.counters.insert({{"end_to_end", duration}}); }
state.counters.insert({"Latency", {duration, benchmark::Counter::kAvgIterations}});
@tfeher (Contributor, Author) commented:

I have also applied Artem's suggestion to store the latency values with the benchmark::Counter::kAvgIterations flag.

Earlier we manually divided by the number of iterations and let gbench average over threads using kAvgThreads. Since iterations are counted as the total iterations performed by all threads, using kAvgIterations leads to the same results (without the manual division by iterations).
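A numeric illustration of that equivalence (hypothetical numbers, assuming gbench sums the per-thread counter values before applying the flags): with 2 threads, 100 iterations per thread, and a 1.0 s loop per thread, the old code inserted 1.0 / 100 = 0.01 from each thread, and kAvgThreads divided the 0.02 sum by the 2 threads, giving 0.01; the new code inserts 1.0 from each thread, the sum is 2.0, and kAvgIterations divides by the 200 total iterations, again giving 0.01 s.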

@achirkin (Contributor) left a comment

LGTM!

@cjnolet (Member) commented Jan 24, 2024

/merge

@rapids-bot rapids-bot bot merged commit 3ce00d3 into rapidsai:branch-24.02 Jan 24, 2024
60 of 61 checks passed
Labels: cpp, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), Vector Search