Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Abstract the decoupled-lookback pattern into a maintainable, stable, public API #226

Closed
cwharris opened this issue Oct 27, 2020 · 12 comments · Fixed by #700
Closed

Abstract the decoupled-lookback pattern into a maintainable, stable, public API #226

cwharris opened this issue Oct 27, 2020 · 12 comments · Fixed by #700
Labels
cub P2: nice to have Desired, but not necessary. type: enhancement New feature or request.

Comments

@cwharris
Copy link

Prefix sums are incredibly useful, and CUB provides both inclusive and exclusive variants of device-wide and block-wide public APIs for these tools. However, the need for device-wide scans is not limited to basic inclusive and exclusive scans. Other algorithms, such as syntax parsing, regex matching/finite state machines, copying ranges of inputs dependent up some state (ascending/descending), etc, would also benefit from device-wide scan building blocks. Some, if not all, of these use cases can be handled using the current public facing device-wide APIs. However, consuming these APIs all but requires the consumer to allocate N * sizeof(OutputT) bytes for N inputs, which is potentially a very large amount of memory.

The block-level scan APIs (BlockScan) already have a powerful callback parameter which can be (and is) used to implement device-wide scans. However, utilizing this callback is non-trivial, at best. Internally, CUB has an excellent implementation of this callback, and an associated tile state type which makes using the callback variants of BlockScan's APIs very easy. It would be, at the very least, nice to have building blocks such as these on the public facing APIs, as utilizing multiple device-level scans can consume large amounts of memory.

To be specific, public-API versions of TilePrefixCallbackOp and ScanTileState would be very useful.

@alliepiper
Copy link
Collaborator

alliepiper commented Oct 28, 2020

My concern is that these are implementation details for how the current scan algorithms are written. If we make them part of CUB's public API and our implementation changes, we'd be committed to maintaining them or putting them through deprecation. Deprecating wouldn't be too much of a burden for us, but still -- exposing our implementation details as public API doesn't seem like the right move here.

There's also the issue of where they'd live. We don't have an existing way to expose these sorts of algorithm-specific helpers, so that'd take some extra consideration.

So it's not impossible, but it's unusual and wouldn't be a straightforward fit into CUB's public APIs.

Alternative idea -- these utilities are about ~500 lines of code combined and only use public APIs. Why not just include a copy of them with your algorithm? That way you won't be tied to our implementation details, and you won't have to do anything if we need to change them.

@cwharris
Copy link
Author

Copy-paste isn't ideal, but sgtm. :)

@elstehle
Copy link
Collaborator

Not sure if I understand this correctly. But if this is about abstracting the mechanism of the single pass prefix scan [1] and exposing the functionality as a generic function, then I would agree that this indeed would be a useful building block. I have had the need for this a few times before as well.

Basically, it would present itself in many cases as a more efficient mechanism to "forward-propagate" information, all in a single pass, rather than materializing intermediate results (the prefix scan results), just due to lacking this functionality being provided.

[1] Single-pass Parallel Prefix Scan with Decoupled Look-back

@alliepiper
Copy link
Collaborator

A generic interface for this sort of thing would be useful, as the decoupled lookback technique is generally useful.

cc @canonizer, who recently used this technique in #204 and may have some input.

Let's discuss what that API would look like, if that's something ya'll would like to pursue. I can't fit this into my current workload, but I'd be happy to discuss it and land a PR.

@alliepiper alliepiper changed the title Provide public API versions of TilePrefixCallbackOp and ScanTileState Abstract the decoupled-lookback pattern into a maintainable, stable, public API Oct 28, 2020
@alliepiper alliepiper added the type: enhancement New feature or request. label Oct 28, 2020
@canonizer
Copy link
Contributor

In my case, the decoupled look-back implementation is different in at least 2 aspects from the generic one.

  • Instead of a single decoupled look-back, a large number of decoupled look-backs (currently 256) are performed, which are distributed among threads in the block.
  • My implementation uses 2 higher-order bits to indicate the value status (not ready / local sum ready / global sum ready), and as a result, doesn't cover the full range of values. However, it is fine for the sorting use case (multiple kernels are used if the input is too large, so there is no overflow), and may also be useful in other applications if the prefix sum is used for element counts.

Currently, my decoupled look-back implementation is part of the onesweep kernel agent. However, I'm considering making it standalone (within CUB).

@alliepiper
Copy link
Collaborator

However, I'm considering making it standalone (within CUB).

This would be cool.

Do you think the existing decoupled-lookback scan implementation would benefit from these changes?

@canonizer
Copy link
Contributor

No, I don't expect much benefit for the case where only a single decoupled look-back is performed.

@elstehle
Copy link
Collaborator

Probably the easiest would be to just expose TilePrefixCallbackOp and ScanTileState and provide documentation and examples on how to use them together with the BlockScan; then keep BlockScan compatible with it. I believe their usage is likely already as simple as it can be and I don't see a simpler way to support generic kernel fusion without more invasive changes to CUB. I could maybe take this on during the holidays towards the end of the year.

@canonizer do you have any insights from the sort? E.g., could this be extended to cover your use case as well?
I'll link @cwharris's PR to illustrate example usage:
https://github.com/rapidsai/cudf/pull/6593/files#diff-078c8ec60b1b0477dfebb86e00565fdcba5b56095aeb1cdf54337c49e3d84a2cR99-R124

@canonizer
Copy link
Contributor

At the very least, my implementation could use a similar interface.

Regarding using TilePrefixCallbackOp and ScanTileState (or versions thereof) in my case, it's possible provided that it doesn't negatively affect performance.

As described in one of the previous comments, there are 2 differences in my implementation of decoupled look-back:

  • supporting many independent decoupled look-backs per thread block and distributing them among threads;
  • using 2 bits of a 32-bit integer to indicate the status, with 30 bits used for the value.

@alliepiper alliepiper added the P2: nice to have Desired, but not necessary. label May 6, 2022
@alliepiper alliepiper added this to the Future Release milestone May 6, 2022
@jrhemstad jrhemstad added this to CCCL Aug 11, 2022
@jrhemstad jrhemstad moved this to Needs Triage in CCCL Aug 14, 2022
@jrhemstad jrhemstad removed the status in CCCL Aug 14, 2022
@jrhemstad jrhemstad added the cub label Feb 22, 2023
@alliepiper
Copy link
Collaborator

#699 and #700 may be of interest to folks in this conversation.

(@miscco, is your assignment here leftover from the triage marathon, or are you working on this?)

@miscco
Copy link
Collaborator

miscco commented May 24, 2023

That is definitely a leftover

@gevtushenko gevtushenko linked a pull request May 25, 2023 that will close this issue
@github-project-automation github-project-automation bot moved this to Done in CCCL May 26, 2023
@gevtushenko
Copy link
Collaborator

@cwharris the API is public now, please take a look at the example.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cub P2: nice to have Desired, but not necessary. type: enhancement New feature or request.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

7 participants