Persistent CAGRA kernel #215

Merged
merged 38 commits into rapidsai:branch-24.10 on Sep 27, 2024

Conversation

achirkin
Contributor

@achirkin achirkin commented Jul 8, 2024

An experimental version of the single-CTA CAGRA kernel that runs persistently while allowing many CPU threads to submit queries in small batches very efficiently.

[Plots: CAGRA throughput @ Recall = 0.94, for n_queries = 1 and for n_queries = 10]

API

In the current implementation, the public API does not change. An extra parameter persistent is added to the ann::cagra::search_params (only valid when algo == SINGLE_CTA).
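For illustration, opting in could look like this (a minimal sketch; namespace qualifiers are abbreviated and the search call is only indicated, since the rest of the API is unchanged):

```cpp
// Sketch: enabling the persistent kernel via the new search parameter.
cagra::search_params params;
params.algo       = cagra::search_algo::SINGLE_CTA;  // persistent mode requires the single-CTA algo
params.persistent = true;                            // the new flag added in this PR
// The search call itself is unchanged, e.g.:
// cagra::search(res, params, index, queries, neighbors, distances);
```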
The persistent kernel is managed by a global runner object held in a shared_ptr; the first CPU thread to call the kernel spawns the runner, and subsequent calls/threads only update a global "heartbeat" atomic variable (runner_base_t::last_touch). When there has been no heartbeat in the last few seconds (kLiveInterval), the runner shuts down the kernel and cleans up the associated resources.
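Schematically, the runner management works like this (a simplified sketch: runner_base_t, last_touch, and kLiveInterval are the real names, everything else is illustrative):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <memory>
#include <mutex>

// Schematic stand-in for the actual runner; the real one owns the kernel,
// its stream, and a watchdog thread that checks the heartbeat.
struct runner_base_t {
  std::atomic<std::int64_t> last_touch{0};  // heartbeat (ns since epoch)
  // ~runner_base_t() stops the kernel and releases its resources.
};

auto get_runner() -> std::shared_ptr<runner_base_t>
{
  static std::mutex lock;
  static std::weak_ptr<runner_base_t> weak;
  std::lock_guard<std::mutex> guard(lock);
  auto runner = weak.lock();
  if (!runner) {  // the first CPU thread to call spawns the runner
    runner = std::make_shared<runner_base_t>();
    weak   = runner;
  }
  // Subsequent calls only refresh the heartbeat; the runner's watchdog shuts
  // the kernel down once now() - last_touch exceeds kLiveInterval.
  runner->last_touch.store(
    std::chrono::steady_clock::now().time_since_epoch().count(),
    std::memory_order_relaxed);
  return runner;
}
```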

An alternative solution would be to control the kernel explicitly, in a client-server style. This would be more controllable, but would require significant re-thinking of the RAFT/cuVS API.

Synchronization behavior and CUDA streams

The kernel is managed in a dedicated thread & a non-blocking stream; it's independent of any other (i.e. calling) threads.

Although we pass a CUDA stream to the search function to preserve the API, this stream is never used; in fact, there are no CUDA API calls happening in the calling thread at all.
All communication between the host calling thread and GPU workers happens via atomic variables.

The search function blocks the CPU thread, i.e. it waits until the results are back before returning.

Exceptions and safety

The kernel runner object is stored in a shared pointer, hence it provides the same safety guarantees as smart pointers in general. For example, if a C++ exception is raised in the runner thread, the kernel is stopped during the destruction of the runner (when the last shared pointer to it is released).

It's hard to detect if something goes wrong with the kernel or the CUDA context. If the kernel does not return the results to the calling thread within the configured kernel lifetime (persistent_lifetime), the calling thread abandons the request and throws an exception.
The designed behavior here is that all components can shut down gracefully and independently within the configured kernel lifetime.
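Schematically, the calling thread's wait looks like this (a sketch, not the actual code; persistent_lifetime is the real search parameter, everything else is illustrative):

```cpp
#include <atomic>
#include <chrono>
#include <stdexcept>
#include <thread>

// Sketch: the calling thread polls a completion flag written by the GPU
// workers and abandons the request once the kernel lifetime has expired.
void wait_for_results(const std::atomic<bool>& done, std::chrono::seconds persistent_lifetime)
{
  auto deadline = std::chrono::steady_clock::now() + persistent_lifetime;
  while (!done.load(std::memory_order_acquire)) {
    if (std::chrono::steady_clock::now() > deadline) {
      throw std::runtime_error("persistent kernel did not respond within the configured lifetime");
    }
    std::this_thread::yield();  // the real code sleeps adaptively (see the latency tracking below)
  }
}
```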

Integration notes

lightweight_uvector

RMM memory resources and device buffers are not zero-cost, even when the allocation size is zero (a common pattern for conditionally-used buffers). They make at least a couple of cudaGetDevice calls during initialization. Normally, this overhead is negligible. However, when the number of concurrent threads is large (hundreds of threads), any CUDA call can become a bottleneck, because a single mutex guards a critical section somewhere in the driver.

To work around this, I introduce a lightweight_uvector in /detail/cagra/search_plan.cuh for several buffers used in the CAGRA kernels. This is a stripped-down, "multi-device-unsafe" version of rmm::uvector: it does not check during resize/destruction whether the current device has changed since construction.
We may consider putting this in a common folder to use across other RAFT/cuVS algorithms.
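For illustration, the core of the idea looks roughly like this (a simplified sketch assuming plain cudaMallocAsync; the real implementation allocates through the memory resource and keeps more of the uvector interface):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Sketch: a "multi-device-unsafe" device buffer. It makes no CUDA calls at
// all while the size stays zero, and it never queries or switches the current
// device on resize/destruction (unlike rmm::device_uvector).
template <typename T>
struct lightweight_uvector {
  T* data_          = nullptr;
  std::size_t size_ = 0;
  cudaStream_t stream_;

  explicit lightweight_uvector(cudaStream_t stream) : stream_(stream) {}

  void resize(std::size_t new_size)
  {
    if (new_size == size_) { return; }
    // NB: does not preserve the contents (sufficient for scratch buffers).
    if (data_ != nullptr) {
      cudaFreeAsync(data_, stream_);
      data_ = nullptr;
    }
    if (new_size > 0) {
      cudaMallocAsync(reinterpret_cast<void**>(&data_), new_size * sizeof(T), stream_);
    }
    size_ = new_size;
  }

  ~lightweight_uvector()
  {
    // NB: assumes the current device is still the one used at construction.
    if (data_ != nullptr) { cudaFreeAsync(data_, stream_); }
  }
};
```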

Shared resource queues / ring buffers

resource_queue_t is an atomic counter-based ring buffer used to distribute the worker resources (CTAs) and pre-allocated job descriptors across CPU I/O threads.
We may consider putting this in a common public namespace in raft if we envision more uses for it.
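The gist of it (a simplified sketch that assumes at most kCapacity operations in flight and glosses over the wrap-around races the real implementation must handle; the real resource_queue_t also supports sleeping/waiting instead of spinning):

```cpp
#include <atomic>
#include <cstdint>

// Sketch: an atomic-counter ring buffer handing out resource ids (CTA slots,
// job-descriptor indices) to many CPU threads. Capacity is a power of two, so
// the monotonically growing counters wrap naturally via the bitmask.
template <std::uint32_t kCapacity>
struct resource_queue_t {
  static_assert((kCapacity & (kCapacity - 1)) == 0, "capacity must be a power of two");
  std::atomic<std::uint32_t> head_{0};
  std::atomic<std::uint32_t> tail_{0};
  std::atomic<std::uint32_t> slots_[kCapacity];

  resource_queue_t()
  {
    for (auto& s : slots_) { s.store(0, std::memory_order_relaxed); }  // 0 == empty
  }

  void push(std::uint32_t id)  // return a resource to the queue
  {
    auto i = tail_.fetch_add(1, std::memory_order_acq_rel) & (kCapacity - 1);
    slots_[i].store(id + 1, std::memory_order_release);  // shift by one: 0 is reserved
  }

  auto pop() -> std::uint32_t  // take a resource; spins while momentarily empty
  {
    auto i = head_.fetch_add(1, std::memory_order_acq_rel) & (kCapacity - 1);
    std::uint32_t v;
    while ((v = slots_[i].exchange(0, std::memory_order_acq_rel)) == 0) { /* spin */ }
    return v - 1;
  }
};
```

At startup all worker/job ids would be push()-ed once; afterwards the I/O threads pop() to acquire a resource and push() to release it.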

Persistent runner structs

launcher_t and persistent_runner_base_t look like they could be abstracted from the cagra kernel and re-used in other algos. The code in its current state, however, is not ready for this.

Adjusted benchmarks

  1. I introduced a global temporary buffer for keeping intermediate results (e.g. neighbor candidates before refinement). This is needed to avoid unnecessary allocations alongside the persistent kernel (and it also positively affects the performance of the original non-persistent implementation).
  2. I adjusted the cuvs common benchmark utils to avoid extra d2h copies and syncs during refinement.

…ks adjusted to avoid unnecessary sync/copies
@achirkin achirkin requested a review from a team as a code owner July 8, 2024 14:51
@github-actions github-actions bot added the cpp label Jul 8, 2024
@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jul 11, 2024

@achirkin achirkin force-pushed the fea-persistent-cagra branch from b748ec8 to 73ca412 Compare July 12, 2024 10:58
Contributor

@tfeher tfeher left a comment

Hi Artem, I am going through the implementation details. Here are my first few comments.

Is the only way to exit the kernel to wait until kLiveInterval time has passed? Could we have an additional explicit method to stop the kernel?

@achirkin
Contributor Author

Update on benchmarks: after a bit of algorithm tweaking, it seems the QPS/sleep behavior is now rather stable, both for the single-query case and for other batch sizes. I added the plots to the PR description. Even with 8k threads on a 32-core machine, the system stays responsive and the cumulative CPU load stays at 90%-100% (i.e. the threads don't all suddenly go to sleep as before). To strain the system even more, I reduced the CTA size to 64 threads (i.e. gridDim > 2k), which increased the maximum QPS by 2x.

So far I have tested it with a few different configurations (H100/A10/RTX3090Ti), and it's more or less stable. With this, I'm not sure which compile-time constants in the code should better be exposed. Here's the summary:

// Artificially reduce the grid size by a constant factor < 1. Ideally we shouldn't need this,
// but it may be useful to make the system a bit more responsive (e.g. on workstations where
// the GPU is used for other purposes as well).
double kDeviceUsage = 1.0;
// The time-to-live for the kernel: if no calls happened within this time, the kernel stops.
// It may be useful to expose this in a production setting to balance high-percentile latency
// vs CPU/GPU load.
auto kLiveInterval = std::chrono::milliseconds(2000);
// The size of the jobs queue; it must not be smaller than the max expected thread count.
// Could move this to the search parameters, but:
//    1. All arrays would have to become std::vectors, and a few compile-time checks would not be possible
//    2. There's no harm in setting this to a large value, such as 8192
uint32_t kMaxJobsNum = 8192;
// The size of the worker queue = the max number of CTAs.
// The same two arguments apply.
uint32_t kMaxWorkersNum = 4096;
// How many CTAs/workers can be held by one I/O thread at the same time, and starting with how
// many the I/O thread "should be more eager" to return them. I believe these should not be
// exposed, as their performance impact is unknown.
uint32_t kMaxWorkersPerThread     = 256;
uint32_t kSoftMaxWorkersPerThread = 16;
// The default time to sleep. Maybe it makes sense to expose this, but maybe not.
auto kDefaultLatency = std::chrono::nanoseconds(50000);
// This is the base for computing the maximum time a thread is allowed to sleep.
// It's a good multiple of the default latency.
auto kMaxExpectedLatency = kDefaultLatency * std::max<std::uint32_t>(10, kMaxJobsNum / 128);

// -----------------------------------------------------------------------
// The rest are constants local to functions, which, I think, are too specific
// for anyone to adequately use/change

// Exponential moving average constant for tracking the configuration-specific expected latency
inline ~launcher_t() noexcept  // NOLINT
{
  // bookkeeping: update the expected latency to wait more efficiently later
  constexpr size_t kWindow = 100;  // moving average memory
  expected_latency         = std::min<std::chrono::nanoseconds>(
    ((kWindow - 1) * expected_latency + now - start) / kWindow, kMaxExpectedLatency);
}

[[nodiscard]] inline auto sleep_limit() const
{
  constexpr auto kMinWakeTime  = std::chrono::nanoseconds(10000);
  constexpr double kSleepLimit = 0.6;
  return start + expected_latency * kSleepLimit - kMinWakeTime;
}

[[nodiscard]] inline auto calc_pause_factor(uint32_t n_queries) const -> uint32_t
{
  constexpr uint32_t kMultiplier = 10;
  return kMultiplier * raft::div_rounding_up_safe(n_queries, idle_worker_ids.capacity());
}

inline void pause()
{
  // Don't sleep for the first few iterations, hoping for a smoother run
  constexpr auto kSpinLimit = 3;
  // It doesn't make much sense to sleep less than this
  constexpr auto kPauseTimeMin = std::chrono::nanoseconds(1000);
  // Bound the sleeping time
  constexpr auto kPauseTimeMax = std::chrono::nanoseconds(50000);
  ...
}

Contributor

@tfeher tfeher left a comment

Thanks Artem for the updates! I have looked into the implementation details. It is an elaborate mechanism, but in general the code is clean, and I appreciate the developer docstrings.

As you mention in the PR description, it would be great to consolidate the persistent kernel scheduling structures into a separate namespace. But this can be done as a follow up.

Apart from the small comments below, the main question is the API. I like the current design which enables persistent mode by a bool flag. Thanks for listing the parameters that influence the behavior of the persistent kernel. I would suggest we just expose two of these: kDeviceUsage and kLiveInterval. And once we are allowed to change kLiveInterval, it would be great to have an API to explicitly stop the kernel.

@achirkin achirkin requested a review from tfeher July 22, 2024 10:36
@achirkin
Contributor Author

Thanks for the suggestion, @cjnolet and @tfeher. I've added a comprehensive example of how to use the persistent kernel in examples/cpp/src/cagra_persistent_example.cu.

@achirkin achirkin requested a review from tfeher July 31, 2024 14:11
Contributor

@tfeher tfeher left a comment

Thanks Artem for the elaborate example! While it does the job of illustrating persistent-kernel usage, I think it contains more than what we need and therefore brings up a few questions. I would prefer to simplify it (although I would be fine keeping the current form as well if others think it is useful).

examples/cpp/src/cagra_persistent_example.cu
```
[persistent/async B] execution time: 1316.97 ms
[I] [15:56:55.756952] Destroyed the persistent runner.
```
Note, the persistent kernel time in async mode (1 query per job) is up to 2x
Contributor

What does this mean? Isn't the large-batch kernel executing longer simply because of the batch size? Or do you mean time per query?

Contributor Author

We measure the total wall time of processing the given number of queries here, either in one batch or one query at a time in an async loop.

Contributor

Maybe clarify: "While the persistent kernel provides minimal latency for each small batch search call, the wall time to process all the queries in async mode (1 query per job) is up to 2x slower than ..."

examples/cpp/src/cagra_persistent_example.cu
@achirkin achirkin requested a review from tfeher July 31, 2024 16:11
@tfeher
Contributor

tfeher commented Jul 31, 2024

divyegala pushed a commit to divyegala/cuvs that referenced this pull request Jul 31, 2024
…idsai#227)

A small change that reduces the number of arguments in one of the wrapper layers in the detail namespace of CAGRA. The goal is twofold:
  1) Simplify the overly long signature of `select_and_run` (which has many instances)
  2) Give access to all search parameters for future upgrades of the search kernel

This is to simplify the integration (and review) of the persistent kernel (rapidsai#215).
No performance or functional changes expected.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#227
@achirkin achirkin changed the base branch from branch-24.08 to branch-24.10 August 12, 2024 08:06
@achirkin
Contributor Author

achirkin commented Sep 26, 2024

Binary size increase is almost 60MB: 620 -> 679 MB

However, it seems like we still have some unused search instances! e.g.

neighbors/detail/cagra/search_single_cta_float_uint64.cu
neighbors/detail/cagra/search_single_cta_half_uint64.cu

These are compiled, but not used, because there's no corresponding neighbors/cagra_search_xxx.cu for them.

Contributor

@tfeher tfeher left a comment

Thanks Artem for the updates. I have reviewed the recent changes, and the PR looks good to me.

Please remove the unused instantiations.

@cjnolet
Member

cjnolet commented Sep 27, 2024

/merge

@rapids-bot rapids-bot bot merged commit b93b8f6 into rapidsai:branch-24.10 Sep 27, 2024
54 checks passed
@cjnolet
Member

cjnolet commented Sep 27, 2024

@achirkin can you remove the unused instantiations in a follow up PR?

@achirkin
Contributor Author

Sure, I will remove the unused instantiations.

Also, I've only got partial benchmark results because the machine was shut down by the runtime limit, but the results I did get suggest there's no performance drop since the last results in August:
https://docs.google.com/spreadsheets/d/1oe58EjBzKd1m3PTFsK7ON0ZsqQH0fzSxFhpMhWy8neg/edit?gid=17032142#gid=17032142

@achirkin
Contributor Author

@cjnolet I've removed these instances in #264, which is ready for review now. #264 uses many of the currently unused instances, but in total it leads to a small reduction in binary size.
