Persistent CAGRA kernel #215
Conversation
Hi Artem, I am going through the implementation details. Here are my first few comments.
Is the only way to exit the kernel to wait until `kLiveInterval` time has passed? Could we have an additional explicit method to stop the kernel?
Co-authored-by: Tamas Bela Feher <[email protected]>
Update on benchmarks: after a bit of algorithm tweaking, it seems the QPS/sleep behavior is now rather stable for both the single-query case and other batch sizes. I added the plots to the PR description. Even with 8k threads on a 32-core machine, the system stays responsive and the cumulative CPU load stays at 90%-100% (i.e. all threads don't suddenly go to sleep as before). To strain the system even more, I reduced the CTA size to 64 threads (i.e. gridDim > 2k), which increased the maximum QPS by 2x. So far I've tested it with a few different configurations (H100/A10/RTX3090Ti); it's more or less stable.

With this, I'm not sure which compile-time constants in the code should be exposed. Here's the summary:

```cpp
// Artificially reduce the grid size by a constant < 1. Ideally we shouldn't need this,
// but it may be useful to make the system a bit more responsive (e.g. on workstations
// where the GPU is used for other purposes as well).
double kDeviceUsage = 1.0;
// The time to live for the kernel: if no calls happened in this time, the kernel stops.
// It may be useful to expose this in a production setting to balance
// high-percentile latency vs CPU/GPU load.
auto kLiveInterval = std::chrono::milliseconds(2000);
// The size of the jobs queue; it's not smaller than the max expected thread count.
// We could move it to the search parameters, but:
// 1. We'd have to change all arrays to std vectors, and a few compile-time checks
//    would not be possible.
// 2. There's no harm in setting this to a large value, such as 8192.
uint32_t kMaxJobsNum = 8192;
// The size of the worker queue = max number of CTAs.
// The same two arguments apply.
uint32_t kMaxWorkersNum = 4096;
// How many CTAs/workers can be held by one IO thread at the same time, and starting
// with how many the IO thread "should be more eager" to return them.
// I believe these should not be exposed, as their performance impact is unknown.
uint32_t kMaxWorkersPerThread = 256;
uint32_t kSoftMaxWorkersPerThread = 16;
// The default time to sleep. Maybe it makes sense to expose this, but maybe not.
auto kDefaultLatency = std::chrono::nanoseconds(50000);
// This is the base for computing the maximum time a thread is allowed to sleep.
// It's a multiple of the default latency.
auto kMaxExpectedLatency = kDefaultLatency * std::max<std::uint32_t>(10, kMaxJobsNum / 128);
// -----------------------------------------------------------------------
// The rest are constants local to functions, which, I think, are too specific
// for anyone to adequately use/change.

// Exponential moving average constant for tracking the configuration-specific expected latency
inline ~launcher_t() noexcept  // NOLINT
{
  // bookkeeping: update the expected latency to wait more efficiently later
  constexpr size_t kWindow = 100;  // moving average memory
  expected_latency = std::min<std::chrono::nanoseconds>(
    ((kWindow - 1) * expected_latency + now - start) / kWindow, kMaxExpectedLatency);
}

[[nodiscard]] inline auto sleep_limit() const
{
  constexpr auto kMinWakeTime  = std::chrono::nanoseconds(10000);
  constexpr double kSleepLimit = 0.6;
  return start + expected_latency * kSleepLimit - kMinWakeTime;
}

[[nodiscard]] inline auto calc_pause_factor(uint32_t n_queries) const -> uint32_t
{
  constexpr uint32_t kMultiplier = 10;
  return kMultiplier * raft::div_rounding_up_safe(n_queries, idle_worker_ids.capacity());
}

inline void pause()
{
  // Don't spin more than this many times hoping for a smoother run
  constexpr auto kSpinLimit = 3;
  // It doesn't make much sense to sleep less than this
  constexpr auto kPauseTimeMin = std::chrono::nanoseconds(1000);
  // Bound the sleeping time
  constexpr auto kPauseTimeMax = std::chrono::nanoseconds(50000);
  ...
}
```
Thanks Artem for the updates! I have looked into the implementation details. It is an elaborate mechanism, but in general the code is clean, and I appreciate the developer docstrings.
As you mention in the PR description, it would be great to consolidate the persistent kernel scheduling structures into a separate namespace. But this can be done as a follow up.
Apart from the small comments below, the main question is the API. I like the current design, which enables persistent mode by a bool flag. Thanks for listing the parameters that influence the behavior of the persistent kernel. I would suggest we expose just two of these: `kDeviceUsage` and `kLiveInterval`. And once we are allowed to change `kLiveInterval`, it would be great to have an API to explicitly stop the kernel.
Thanks for the suggestion, @cjnolet and @tfeher. I've added a comprehensive example of how to use the persistent kernel in examples/cpp/src/cagra_persistent_example.cu.
Thanks Artem for the elaborate example! While it does the job of illustrating persistent-kernel usage, I think it contains more than what we need, and therefore brings up a few questions. I would prefer to simplify this (although I would be fine keeping the current form as well if others think it is useful).
[persistent/async B] execution time: 1316.97 ms
[I] [15:56:55.756952] Destroyed the persistent runner.
```
Note, the persistent kernel time in async mode (1 query per job) is up to 2x
What does this mean? Isn't large batch kernel executing longer simply because of the batch size? Or do you mean time per query?
We measure the total wall time of processing the given number of queries here, either in one batch or one query at a time in an async loop.
Maybe clarify: "While the persistent kernel provides minimal latency for each small batch search
call, the wall time to process all the queries in async mode (1 query per job) is up to 2x slower than ..."
…idsai#227)

A small change that reduces the number of arguments in one of the wrapper layers in the detail namespace of CAGRA. The goal is twofold:
1) Simplify the overly long signature of `selet_and_run` (which has many instances)
2) Give access to all search parameters for future upgrades of the search kernel

This is to simplify the integration (and review) of the persistent kernel (rapidsai#215). No performance or functional changes expected.

Authors:
- Artem M. Chirkin (https://github.com/achirkin)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#227
Binary size increase is almost 60 MB: 620 MB -> 679 MB. However, it seems like we still have some unused search instances, e.g.
These are compiled, but not used, because there's no corresponding
Thanks Artem for the updates. I have reviewed the recent changes, and the PR looks good to me.
Please remove the unused instantiations.
/merge
@achirkin can you remove the unused instantiations in a follow-up PR?
Sure, I will remove the unused instantiations. Also, I've only got partial benchmark results because the machine was shut down by the runtime limit, but the results I got suggest there's no performance drop since the last results in August.
An experimental version of the single-CTA CAGRA kernel that runs persistently while allowing many CPU threads to submit queries in small batches very efficiently.
API
In the current implementation, the public API does not change. An extra parameter `persistent` is added to the `ann::cagra::search_params` (only valid when `algo == SINGLE_CTA`).

The persistent kernel is managed by a global runner object in a `shared_ptr`; the first CPU thread to call the kernel spawns the runner, and subsequent calls/threads only update a global "heartbeat" atomic variable (`runner_base_t::last_touch`). When there's no heartbeat in the last few seconds (`kLiveInterval`), the runner shuts down the kernel and cleans up the associated resources.

An alternative solution would be to control the kernel explicitly, in a client-server style. This would be more controllable, but would require significant rethinking of the RAFT/cuVS API.
Synchronization behavior and CUDA streams
The kernel is managed in a dedicated thread and a non-blocking stream; it's independent of any other (i.e. calling) threads.
Although we pass a CUDA stream to the search function to preserve the API, this CUDA stream is never used; in fact, no CUDA API calls happen in the calling thread.
All communication between the calling host thread and the GPU workers happens via atomic variables.
The search function blocks the CPU thread, i.e. it waits until the results are back before returning.
Exceptions and safety
The kernel runner object is stored in a shared pointer. Hence, it provides all the same safety guarantees as smart pointers. For example, if a C++ exception is raised in the runner thread, the kernel is stopped during the destruction of the runner/last shared pointer.
It's hard to detect if something happens to the kernel or the CUDA context. If the kernel does not return the results to the calling thread within the configured kernel lifetime (`persistent_lifetime`), the calling thread abandons the request and throws an exception. The intended behavior is that all components can gracefully shut down within the configured kernel lifetime independently.
Integration notes
lightweight_uvector
RMM memory resources and device buffers are not zero-cost, even when the allocation size is zero (a common pattern for conditionally-used buffers). They make at least a couple of `cudaGetDevice` calls during initialization. Normally, this overhead is negligible. However, when the number of concurrent threads is large (hundreds of threads), any CUDA call can become a bottleneck due to a single mutex guarding a critical section somewhere in the driver.

To work around this, I introduce a `lightweight_uvector` in `/detail/cagra/search_plan.cuh` for several buffers used in CAGRA kernels. This is a stripped, "multi-device-unsafe" version of `rmm::uvector`: it does not check during resize/destruction whether the current device has changed since construction.

We may consider putting this in a common folder to use across other RAFT/cuVS algorithms.
Shared resource queues / ring buffers
`resource_queue_t` is an atomic counter-based ring buffer used to distribute the worker resources (CTAs) and pre-allocated job descriptors across CPU I/O threads.

We may consider putting this in a common public namespace in RAFT if we envision more uses for it.
Persistent runner structs
`launcher_t` and `persistent_runner_base_t` look like they could be abstracted from the CAGRA kernel and reused in other algorithms. The code in its current state, however, is not ready for this.
Adjusted benchmarks