Persistent CAGRA kernel #215

Merged
merged 38 commits into rapidsai:branch-24.10 on Sep 27, 2024

Conversation

achirkin
Contributor

@achirkin achirkin commented Jul 8, 2024

An experimental version of the single-CTA CAGRA kernel that runs persistently while allowing many CPU threads to submit queries in small batches very efficiently.

[Plots: CAGRA throughput @ Recall = 0.94, for n_queries = 1 and for n_queries = 10]

API

In the current implementation, the public API does not change. An extra parameter persistent is added to the ann::cagra::search_params (only valid when algo == SINGLE_CTA).
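For illustration, opting in could look like this (a minimal sketch; namespace qualifiers are abbreviated and the search call is only indicated, since the rest of the API is unchanged):

```cpp
// Sketch: enabling the persistent kernel via the new search parameter.
cagra::search_params params;
params.algo       = cagra::search_algo::SINGLE_CTA;  // persistent mode requires the single-CTA algo
params.persistent = true;                            // the new flag added in this PR
// The search call itself is unchanged, e.g.:
// cagra::search(res, params, index, queries, neighbors, distances);
```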
The persistent kernel is managed by a global runner object held in a shared_ptr; the first CPU thread to call the kernel spawns the runner, and subsequent calls/threads only update a global "heartbeat" atomic variable (runner_base_t::last_touch). When there has been no heartbeat in the last few seconds (kLiveInterval), the runner shuts down the kernel and cleans up the associated resources.
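Schematically, the runner management works like this (a simplified sketch: runner_base_t, last_touch, and kLiveInterval are the real names, everything else is illustrative):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <memory>
#include <mutex>

// Schematic stand-in for the actual runner; the real one owns the kernel,
// its stream, and a watchdog thread that checks the heartbeat.
struct runner_base_t {
  std::atomic<std::int64_t> last_touch{0};  // heartbeat (ns since epoch)
  // ~runner_base_t() stops the kernel and releases its resources.
};

auto get_runner() -> std::shared_ptr<runner_base_t>
{
  static std::mutex lock;
  static std::weak_ptr<runner_base_t> weak;
  std::lock_guard<std::mutex> guard(lock);
  auto runner = weak.lock();
  if (!runner) {  // the first CPU thread to call spawns the runner
    runner = std::make_shared<runner_base_t>();
    weak   = runner;
  }
  // Subsequent calls only refresh the heartbeat; the runner's watchdog shuts
  // the kernel down once now() - last_touch exceeds kLiveInterval.
  runner->last_touch.store(
    std::chrono::steady_clock::now().time_since_epoch().count(),
    std::memory_order_relaxed);
  return runner;
}
```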

An alternative solution would be to control the kernel explicitly, in a client-server style. This would be more controllable, but would require significant re-thinking of the RAFT/cuVS API.

Synchronization behavior and CUDA streams

The kernel is managed in a dedicated thread & a non-blocking stream; it's independent of any other (i.e. calling) threads.

Although we pass a CUDA stream to the search function to preserve the API, this stream is never used; in fact, there are no CUDA API calls happening in the calling thread at all.
All communication between the host calling thread and GPU workers happens via atomic variables.

The search function blocks the CPU thread, i.e. it waits until the results are back before returning.

Exceptions and safety

The kernel runner object is stored in a shared pointer, hence it provides the same safety guarantees as smart pointers in general. For example, if a C++ exception is raised in the runner thread, the kernel is stopped during the destruction of the runner (when the last shared pointer to it is released).

It's hard to detect if something goes wrong with the kernel or the CUDA context. If the kernel does not return the results to the calling thread within the configured kernel lifetime (persistent_lifetime), the calling thread abandons the request and throws an exception.
The designed behavior here is that all components can shut down gracefully and independently within the configured kernel lifetime.
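Schematically, the calling thread's wait looks like this (a sketch, not the actual code; persistent_lifetime is the real search parameter, everything else is illustrative):

```cpp
#include <atomic>
#include <chrono>
#include <stdexcept>
#include <thread>

// Sketch: the calling thread polls a completion flag written by the GPU
// workers and abandons the request once the kernel lifetime has expired.
void wait_for_results(const std::atomic<bool>& done, std::chrono::seconds persistent_lifetime)
{
  auto deadline = std::chrono::steady_clock::now() + persistent_lifetime;
  while (!done.load(std::memory_order_acquire)) {
    if (std::chrono::steady_clock::now() > deadline) {
      throw std::runtime_error("persistent kernel did not respond within the configured lifetime");
    }
    std::this_thread::yield();  // the real code sleeps adaptively (see the latency tracking below)
  }
}
```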

Integration notes

lightweight_uvector

RMM memory resources and device buffers are not zero-cost, even when the allocation size is zero (a common pattern for conditionally-used buffers). They make at least a couple of cudaGetDevice calls during initialization. Normally, this overhead is negligible. However, when the number of concurrent threads is large (hundreds of threads), any CUDA call can become a bottleneck, because a single mutex guards a critical section somewhere in the driver.

To work around this, I introduce a lightweight_uvector in /detail/cagra/search_plan.cuh for several buffers used in the CAGRA kernels. This is a stripped-down, "multi-device-unsafe" version of rmm::uvector: it does not check during resize/destruction whether the current device has changed since construction.
We may consider putting this in a common folder to use across other RAFT/cuVS algorithms.
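For illustration, the core of the idea looks roughly like this (a simplified sketch assuming plain cudaMallocAsync; the real implementation allocates through the memory resource and keeps more of the uvector interface):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Sketch: a "multi-device-unsafe" device buffer. It makes no CUDA calls at
// all while the size stays zero, and it never queries or switches the current
// device on resize/destruction (unlike rmm::device_uvector).
template <typename T>
struct lightweight_uvector {
  T* data_          = nullptr;
  std::size_t size_ = 0;
  cudaStream_t stream_;

  explicit lightweight_uvector(cudaStream_t stream) : stream_(stream) {}

  void resize(std::size_t new_size)
  {
    if (new_size == size_) { return; }
    // NB: does not preserve the contents (sufficient for scratch buffers).
    if (data_ != nullptr) {
      cudaFreeAsync(data_, stream_);
      data_ = nullptr;
    }
    if (new_size > 0) {
      cudaMallocAsync(reinterpret_cast<void**>(&data_), new_size * sizeof(T), stream_);
    }
    size_ = new_size;
  }

  ~lightweight_uvector()
  {
    // NB: assumes the current device is still the one used at construction.
    if (data_ != nullptr) { cudaFreeAsync(data_, stream_); }
  }
};
```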

Shared resource queues / ring buffers

resource_queue_t is an atomic counter-based ring buffer used to distribute the worker resources (CTAs) and pre-allocated job descriptors across CPU I/O threads.
We may consider putting this in a common public namespace in raft if we envision more uses for it.
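The gist of it (a simplified sketch that assumes at most kCapacity operations in flight and glosses over the wrap-around races the real implementation must handle; the real resource_queue_t also supports sleeping/waiting instead of spinning):

```cpp
#include <atomic>
#include <cstdint>

// Sketch: an atomic-counter ring buffer handing out resource ids (CTA slots,
// job-descriptor indices) to many CPU threads. Capacity is a power of two, so
// the monotonically growing counters wrap naturally via the bitmask.
template <std::uint32_t kCapacity>
struct resource_queue_t {
  static_assert((kCapacity & (kCapacity - 1)) == 0, "capacity must be a power of two");
  std::atomic<std::uint32_t> head_{0};
  std::atomic<std::uint32_t> tail_{0};
  std::atomic<std::uint32_t> slots_[kCapacity];

  resource_queue_t()
  {
    for (auto& s : slots_) { s.store(0, std::memory_order_relaxed); }  // 0 == empty
  }

  void push(std::uint32_t id)  // return a resource to the queue
  {
    auto i = tail_.fetch_add(1, std::memory_order_acq_rel) & (kCapacity - 1);
    slots_[i].store(id + 1, std::memory_order_release);  // shift by one: 0 is reserved
  }

  auto pop() -> std::uint32_t  // take a resource; spins while momentarily empty
  {
    auto i = head_.fetch_add(1, std::memory_order_acq_rel) & (kCapacity - 1);
    std::uint32_t v;
    while ((v = slots_[i].exchange(0, std::memory_order_acq_rel)) == 0) { /* spin */ }
    return v - 1;
  }
};
```

At startup all worker/job ids would be push()-ed once; afterwards the I/O threads pop() to acquire a resource and push() to release it.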

Persistent runner structs

launcher_t and persistent_runner_base_t look like they could be abstracted from the cagra kernel and re-used in other algos. The code in its current state, however, is not ready for this.

Adjusted benchmarks

  1. I introduced a global temporary buffer for keeping intermediate results (e.g. neighbor candidates before refinement). This is needed to avoid unnecessary allocations alongside the persistent kernel (and it also positively affects the performance of the original non-persistent implementation).
  2. I adjusted the cuvs common benchmark utils to avoid extra d2h copies and syncs during refinement.

…ks adjusted to avoid unnecessary sync/copies
@achirkin achirkin requested a review from a team as a code owner July 8, 2024 14:51
@github-actions github-actions bot added the cpp label Jul 8, 2024
@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jul 11, 2024

@achirkin achirkin force-pushed the fea-persistent-cagra branch from b748ec8 to 73ca412 Compare July 12, 2024 10:58
Contributor

@tfeher tfeher left a comment

Hi Artem, I am going through the implementation details. Here are my first few comments.

Is the only way to exit the kernel to wait until kLiveInterval time has passed? Could we have an additional explicit method to stop the kernel?

@achirkin
Contributor Author

Update on benchmarks: after a bit of algorithm tweaking, it seems the QPS/sleep behavior is now rather stable, both for the single-query case and for other batch sizes. I added the plots to the PR description. Even with 8k threads on a 32-core machine, the system stays responsive and the cumulative CPU load stays at 90%-100% (i.e. the threads don't all suddenly go to sleep as before). To strain the system even more, I reduced the CTA size to 64 threads (i.e. gridDim > 2k), which increased the maximum QPS by 2x.

So far I have tested it with a few different configurations (H100/A10/RTX3090Ti), and it's more or less stable. With this, I'm not sure which compile-time constants in the code should better be exposed. Here's the summary:

// Artificially reduce the grid size by a constant factor < 1. Ideally we shouldn't need this,
// but it may be useful to make the system a bit more responsive (e.g. on workstations where
// the GPU is used for other purposes as well).
double kDeviceUsage = 1.0;
// The time-to-live for the kernel: if no calls happened within this time, the kernel stops.
// It may be useful to expose this in a production setting to balance high-percentile latency
// vs CPU/GPU load.
auto kLiveInterval = std::chrono::milliseconds(2000);
// The size of the jobs queue; it must not be smaller than the max expected thread count.
// Could move this to the search parameters, but:
//    1. All arrays would have to become std::vectors, and a few compile-time checks would not be possible
//    2. There's no harm in setting this to a large value, such as 8192
uint32_t kMaxJobsNum = 8192;
// The size of the worker queue = the max number of CTAs.
// The same two arguments apply.
uint32_t kMaxWorkersNum = 4096;
// How many CTAs/workers can be held by one I/O thread at the same time, and starting with how
// many the I/O thread "should be more eager" to return them. I believe these should not be
// exposed, as their performance impact is unknown.
uint32_t kMaxWorkersPerThread     = 256;
uint32_t kSoftMaxWorkersPerThread = 16;
// The default time to sleep. Maybe it makes sense to expose this, but maybe not.
auto kDefaultLatency = std::chrono::nanoseconds(50000);
// This is the base for computing the maximum time a thread is allowed to sleep.
// It's a good multiple of the default latency.
auto kMaxExpectedLatency = kDefaultLatency * std::max<std::uint32_t>(10, kMaxJobsNum / 128);

// -----------------------------------------------------------------------
// The rest are constants local to functions, which, I think, are too specific
// for anyone to adequately use/change

// Exponential moving average constant for tracking the configuration-specific expected latency
inline ~launcher_t() noexcept  // NOLINT
{
  // bookkeeping: update the expected latency to wait more efficiently later
  constexpr size_t kWindow = 100;  // moving average memory
  expected_latency         = std::min<std::chrono::nanoseconds>(
    ((kWindow - 1) * expected_latency + now - start) / kWindow, kMaxExpectedLatency);
}

[[nodiscard]] inline auto sleep_limit() const
{
  constexpr auto kMinWakeTime  = std::chrono::nanoseconds(10000);
  constexpr double kSleepLimit = 0.6;
  return start + expected_latency * kSleepLimit - kMinWakeTime;
}

[[nodiscard]] inline auto calc_pause_factor(uint32_t n_queries) const -> uint32_t
{
  constexpr uint32_t kMultiplier = 10;
  return kMultiplier * raft::div_rounding_up_safe(n_queries, idle_worker_ids.capacity());
}

inline void pause()
{
  // Don't sleep for the first few iterations, hoping for a smoother run
  constexpr auto kSpinLimit = 3;
  // It doesn't make much sense to sleep less than this
  constexpr auto kPauseTimeMin = std::chrono::nanoseconds(1000);
  // Bound the sleeping time
  constexpr auto kPauseTimeMax = std::chrono::nanoseconds(50000);
  ...
}

Contributor

@tfeher tfeher left a comment

Thanks Artem for the updates! I have looked into the implementation details. It is an elaborate mechanism, but in general the code is clean, and I appreciate the developer docstrings.

As you mention in the PR description, it would be great to consolidate the persistent kernel scheduling structures into a separate namespace. But this can be done as a follow up.

Apart from the small comments below, the main question is the API. I like the current design which enables persistent mode by a bool flag. Thanks for listing the parameters that influence the behavior of the persistent kernel. I would suggest we just expose two of these: kDeviceUsage and kLiveInterval. And once we are allowed to change kLiveInterval, it would be great to have an API to explicitly stop the kernel.

@achirkin achirkin requested a review from tfeher July 22, 2024 10:36
@achirkin
Contributor Author

Thanks for the suggestion, @cjnolet and @tfeher. I've added a comprehensive example of how to use the persistent kernel in examples/cpp/src/cagra_persistent_example.cu.

@achirkin achirkin requested a review from tfeher July 31, 2024 14:11
Contributor

@tfeher tfeher left a comment

Thanks Artem for the elaborate example! While it does the job of illustrating persistent-kernel usage, I think it contains more than what we need and therefore brings up a few questions. I would prefer to simplify it (although I would be fine keeping the current form as well if others think it is useful).

examples/cpp/src/cagra_persistent_example.cu
```
[persistent/async B] execution time: 1316.97 ms
[I] [15:56:55.756952] Destroyed the persistent runner.
```
Note, the persistent kernel time in async mode (1 query per job) is up to 2x
Contributor

What does this mean? Isn't the large-batch kernel executing longer simply because of the batch size? Or do you mean time per query?

Contributor Author

We measure the total wall time of processing the given number of queries here, either in one batch or one query at a time in an async loop.

Contributor

Maybe clarify: "While the persistent kernel provides minimal latency for each small batch search call, the wall time to process all the queries in async mode (1 query per job) is up to 2x slower than ..."

examples/cpp/src/cagra_persistent_example.cu
@achirkin achirkin requested a review from tfeher July 31, 2024 16:11
@tfeher
Contributor

tfeher commented Jul 31, 2024

divyegala pushed a commit to divyegala/cuvs that referenced this pull request Jul 31, 2024
…idsai#227)

A small change that reduces the number of arguments in one of the wrapper layers in the detail namespace of CAGRA. The goal is twofold:
  1) Simplify the overly long signature of `select_and_run` (which has many instances)
  2) Give access to all search parameters for future upgrades of the search kernel

This is to simplify the integration (and review) of the persistent kernel (rapidsai#215).
No performance or functional changes expected.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#227
@achirkin achirkin changed the base branch from branch-24.08 to branch-24.10 August 12, 2024 08:06
@achirkin
Contributor Author

achirkin commented Sep 26, 2024

Binary size increase is almost 60MB: 620 -> 679 MB

However, it seems like we still have some unused search instances! e.g.

neighbors/detail/cagra/search_single_cta_float_uint64.cu
neighbors/detail/cagra/search_single_cta_half_uint64.cu

These are compiled, but not used, because there's no corresponding neighbors/cagra_search_xxx.cu for them.

Contributor

@tfeher tfeher left a comment

Thanks Artem for the updates. I have reviewed the recent changes, and the PR looks good to me.

Please remove the unused instantiations.

@cjnolet
Member

cjnolet commented Sep 27, 2024

/merge

@rapids-bot rapids-bot bot merged commit b93b8f6 into rapidsai:branch-24.10 Sep 27, 2024
54 checks passed
@cjnolet
Member

cjnolet commented Sep 27, 2024

@achirkin can you remove the unused instantiations in a follow up PR?

@achirkin
Contributor Author

Sure, I will remove the unused instantiations.

Also, I've only got partial benchmark results because the machine was shut down by the runtime limit, but the results I did get suggest there's no performance drop since the last results in August:
https://docs.google.com/spreadsheets/d/1oe58EjBzKd1m3PTFsK7ON0ZsqQH0fzSxFhpMhWy8neg/edit?gid=17032142#gid=17032142

@achirkin
Contributor Author

@cjnolet I've removed these instances in #264, which is ready for review now. #264 uses many of the currently unused instances, but in total it leads to a small reduction in binary size.
