ANN bench fix latency measurement overhead #2084
Conversation
The move of the .lap() calls breaks the functionality. I'd suggest keeping only the change that moves start to be initialized after the algo is copied.

There also seems to be a problem with how we compute both the "Latency" and "GPU" counters: in both cases, we divide the values by the number of iterations. I've tried to add this change and compare the benchmark outputs:
...
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
if (state.thread_index() == 0) { state.counters.insert({{"end_to_end", duration}}); }
state.counters.insert(
{"Latency",
{duration, benchmark::Counter::kAvgThreads | benchmark::Counter::kAvgIterations}});
}
state.SetItemsProcessed(queries_processed);
if (cudart.found()) {
state.counters.insert({"GPU",
{gpu_timer.total_time(),
benchmark::Counter::kAvgThreads | benchmark::Counter::kAvgIterations}});
...
As a result I've got ...

NB: the "items_per_second" metric has always been using "real_time" under the hood, so we can rely on it being correct in our previous benchmark results.
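To make the comparison easier to reproduce in isolation, here is a minimal, self-contained sketch of the same counter setup (hypothetical, not the actual raft ann-bench harness; the sleep is just a stand-in for the search call, and the benchmark name is made up):

#include <benchmark/benchmark.h>
#include <chrono>
#include <cstdint>
#include <thread>

static void BM_latency_sketch(benchmark::State& state)
{
  int64_t queries_processed = 0;
  auto start = std::chrono::high_resolution_clock::now();
  for (auto _ : state) {
    // Stand-in for the per-query search work; burns roughly 1 ms of wall time.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    queries_processed++;
  }
  auto end      = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
  if (state.thread_index() == 0) { state.counters.insert({{"end_to_end", duration}}); }
  state.counters.insert(
    {"Latency",
     {duration, benchmark::Counter::kAvgThreads | benchmark::Counter::kAvgIterations}});
  state.SetItemsProcessed(queries_processed);
}
BENCHMARK(BM_latency_sketch)->Threads(4)->UseRealTime();
BENCHMARK_MAIN();

Swapping the counter flags in this sketch between the two variants discussed in this thread makes it easy to see how the reported "Latency" value changes.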
@achirkin gbench ...
Thanks, @tfeher, for the clarification! Indeed, now I see that the total iterations counter sums the iterations from all threads, whereas real_time counts the time only on the main thread, so the two averaging schemes work out to the same value.

In that case, I'll modify my suggestion: to be on the safe side, let's replace the explicit division by the number of iterations with the kAvgIterations counter flag:
...
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
if (state.thread_index() == 0) { state.counters.insert({{"end_to_end", duration}}); }
state.counters.insert({"Latency", {duration, benchmark::Counter::kAvgIterations}});
}
state.SetItemsProcessed(queries_processed);
if (cudart.found()) {
state.counters.insert({"GPU", {gpu_timer.total_time(), benchmark::Counter::kAvgIterations}});
}
...
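For a concrete (made-up) example of why this should give the same numbers as the manual division: with 2 threads, 10 iterations per thread, and each thread measuring 1 s of wall time, dividing each thread's duration by its own iteration count and averaging over threads gives (1/10 + 1/10) / 2 = 0.1 s, while kAvgIterations applied to the raw summed durations gives (1 + 1) / 20 = 0.1 s, i.e. the same value without any explicit division in the benchmark code.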
The PR has been reduced to fixing where the timer is started; the other overheads are not addressed. For reference, here is the other issue that was discussed in a previous version of the PR.
The branch was force-pushed from b9736b2 to 6efd970.
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
if (state.thread_index() == 0) { state.counters.insert({{"end_to_end", duration}}); }
state.counters.insert({"Latency", {duration, benchmark::Counter::kAvgIterations}});
I have also applied Artem's suggestion to store latency values with the benchmark::Counter::kAvgIterations flag. Earlier we manually divided by the number of iterations and let gbench average over threads using kAvgThreads. Since iterations are counted as the total iterations performed by all threads, using kAvgIterations leads to the same results (without manual division by iterations).
LGTM!
/merge
The CPU time stamp start is taken before the ANN algo is copied to all threads. This is fixed by initializing start a few lines later.
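A minimal sketch of what this fix looks like in a standalone benchmark (FakeAlgo and the benchmark name are placeholders, not the actual raft ann-bench API):

#include <benchmark/benchmark.h>
#include <chrono>
#include <cstddef>
#include <memory>

struct FakeAlgo {  // placeholder for the per-thread copy of the ANN algorithm
  void search() { benchmark::DoNotOptimize(dummy_++); }
  std::size_t dummy_{0};
};

static void BM_search_sketch(benchmark::State& state)
{
  // Per-thread copy of the algorithm; for a real index this can take a noticeable amount of time.
  auto algo = std::make_unique<FakeAlgo>();

  // Before the fix, start was taken above the copy, so the copy overhead leaked into the
  // measured duration; the fix initializes the timestamp only after the copy is done.
  auto start = std::chrono::high_resolution_clock::now();

  for (auto _ : state) {
    algo->search();
  }

  auto end      = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
  state.counters.insert({"Latency", {duration, benchmark::Counter::kAvgIterations}});
}
BENCHMARK(BM_search_sketch)->Threads(4)->UseRealTime();
BENCHMARK_MAIN();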