Abstract the decoupled-lookback pattern into a maintainable, stable, public API #226
Comments
My concern is that these are implementation details of how the current scan algorithms are written. If we make them part of CUB's public API and our implementation changes, we'd be committed to maintaining them or putting them through deprecation. Deprecating wouldn't be too much of a burden for us, but still -- exposing our implementation details as public API doesn't seem like the right move here. There's also the issue of where they'd live. We don't have an existing way to expose these sorts of algorithm-specific helpers, so that'd take some extra consideration. So it's not impossible, but it's unusual and wouldn't be a straightforward fit into CUB's public APIs. Alternative idea -- these utilities are roughly 500 lines of code combined and only use public APIs. Why not just include a copy of them with your algorithm? That way you won't be tied to our implementation details, and you won't have to do anything if we need to change them. |
Copy-paste isn't ideal, but sgtm. :) |
Not sure if I understand this correctly. But if this is about abstracting the mechanism of the single-pass prefix scan [1] and exposing the functionality as a generic function, then I agree that this would be a useful building block. I have needed this a few times before as well. In many cases it would be a more efficient way to "forward-propagate" information in a single pass, rather than materializing intermediate results (the prefix scan results) only because this functionality isn't provided. [1] Single-pass Parallel Prefix Scan with Decoupled Look-back |
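For readers unfamiliar with the technique, here is a minimal host-side sketch of decoupled look-back, written with C++ threads instead of CUDA so it can run anywhere. It is not CUB's implementation: `TileStatus`, `pack`, `status_of`, and `inclusive_scan_lookback` are hypothetical names, and the choice to claim tiles from an atomic counter is an assumption made so that a spinning worker only ever waits on a tile some other worker has already started.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-tile status flags (illustration only, not CUB's types).
enum TileStatus : std::uint32_t { kInvalid = 0, kPartial = 1, kInclusive = 2 };

// Status and value share one 64-bit word so both are published atomically.
inline std::uint64_t pack(TileStatus s, std::uint32_t v) {
    return (static_cast<std::uint64_t>(s) << 32) | v;
}
inline std::uint32_t status_of(std::uint64_t word) {
    return static_cast<std::uint32_t>(word >> 32);
}

std::vector<std::uint32_t> inclusive_scan_lookback(
    const std::vector<std::uint32_t>& in, std::size_t tile_size,
    unsigned num_workers = 4) {
    assert(tile_size > 0 && num_workers > 0);
    const std::size_t num_tiles = (in.size() + tile_size - 1) / tile_size;
    std::vector<std::uint32_t> out(in.size());
    std::vector<std::atomic<std::uint64_t>> state(num_tiles);
    for (auto& s : state) s.store(pack(kInvalid, 0));
    std::atomic<std::size_t> next_tile{0};

    auto worker = [&] {
        for (;;) {
            // Claim tiles in increasing order.
            const std::size_t t = next_tile.fetch_add(1);
            if (t >= num_tiles) return;
            const std::size_t begin = t * tile_size;
            const std::size_t end = std::min(in.size(), begin + tile_size);

            // 1. Reduce the tile locally.
            std::uint32_t aggregate = 0;
            for (std::size_t i = begin; i < end; ++i) aggregate += in[i];

            std::uint32_t exclusive = 0;
            if (t == 0) {
                // The first tile's aggregate is already its inclusive prefix.
                state[0].store(pack(kInclusive, aggregate));
            } else {
                // 2. Publish the aggregate so successors need not wait for us.
                state[t].store(pack(kPartial, aggregate));
                // 3. Look back over predecessors, accumulating their values,
                //    until one of them exposes a full inclusive prefix.
                for (std::size_t p = t; p-- > 0;) {
                    std::uint64_t word = state[p].load();
                    while (status_of(word) == kInvalid) {
                        std::this_thread::yield();
                        word = state[p].load();
                    }
                    exclusive += static_cast<std::uint32_t>(word);
                    if (status_of(word) == kInclusive) break;
                }
                // 4. Publish our inclusive prefix to shorten later look-backs.
                state[t].store(pack(kInclusive, exclusive + aggregate));
            }

            // 5. Scan the tile, seeded with the exclusive prefix.
            std::uint32_t running = exclusive;
            for (std::size_t i = begin; i < end; ++i) {
                running += in[i];
                out[i] = running;
            }
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < num_workers; ++w) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return out;
}
```

Calling `inclusive_scan_lookback(data, 64)` returns the inclusive prefix sums of `data`. The only extra storage is one 64-bit descriptor per tile; no intermediate per-element results are materialized, which is the "forward-propagate in a single pass" property described above.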
A generic interface for this sort of thing would be useful, as the decoupled lookback technique is generally useful. cc @canonizer, who recently used this technique in #204 and may have some input. Let's discuss what that API would look like, if that's something y'all would like to pursue. I can't fit this into my current workload, but I'd be happy to discuss it and land a PR. |
In my case, the decoupled look-back implementation is different in at least 2 aspects from the generic one.
Currently, my decoupled look-back implementation is part of the onesweep kernel agent. However, I'm considering making it standalone (within CUB). |
This would be cool. Do you think the existing decoupled-lookback scan implementation would benefit from these changes? |
No, I don't expect much benefit for the case where only a single decoupled look-back is performed. |
Probably the easiest would be to just expose @canonizer do you have any insights from the sort? E.g., could this be extended to cover your use case as well? |
At the very least, my implementation could use a similar interface. Regarding using As described in one of the previous comments, there are 2 differences in my implementation of decoupled look-back:
|
That is definitely a leftover |
Prefix sums are incredibly useful, and CUB provides both inclusive and exclusive variants of device-wide and block-wide public APIs for these tools. However, the need for device-wide scans is not limited to basic inclusive and exclusive scans. Other algorithms, such as syntax parsing, regex matching/finite state machines, copying ranges of inputs dependent on some state (ascending/descending), etc., would also benefit from device-wide scan building blocks. Some, if not all, of these use cases can be handled using the current public-facing device-wide APIs. However, consuming these APIs all but requires the consumer to allocate `N * sizeof(OutputT)` bytes for `N` inputs, which is potentially a very large amount of memory.

The block-level scan APIs (`BlockScan`) already have a powerful callback parameter which can be (and is) used to implement device-wide scans. However, utilizing this callback is non-trivial, at best. Internally, CUB has an excellent implementation of this callback, and an associated tile state type which makes using the callback variants of `BlockScan`'s APIs very easy. It would be, at the very least, nice to have building blocks such as these on the public-facing APIs, as utilizing multiple device-level scans can consume large amounts of memory.

To be specific, public-API versions of `TilePrefixCallbackOp` and `ScanTileState` would be very useful.
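
To make the callback contract concrete, here is a host-side C++ analogue (a sketch, not CUB code). It assumes the contract that `BlockScan`'s callback overloads document: the functor receives the tile's aggregate and returns the prefix that seeds that tile's scan. The names `tile_exclusive_sum`, `RunningPrefixOp`, and `chunked_exclusive_sum` are hypothetical stand-ins, and the sequential running total takes the place of what `TilePrefixCallbackOp` and `ScanTileState` do across thread blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tile-level exclusive sum with a prefix callback.
// The callback is handed the tile aggregate and returns the tile's prefix.
template <typename T, typename PrefixOp>
void tile_exclusive_sum(const T* in, T* out, std::size_t n, PrefixOp& prefix_op) {
    T aggregate = T{};
    for (std::size_t i = 0; i < n; ++i) aggregate += in[i];
    T running = prefix_op(aggregate);
    for (std::size_t i = 0; i < n; ++i) {
        const T x = in[i];  // read first so in and out may alias
        out[i] = running;
        running += x;
    }
}

// Stateful callback: returns the total of all earlier tiles, then folds in
// the current tile's aggregate for the next call.
struct RunningPrefixOp {
    int running_total = 0;
    int operator()(int tile_aggregate) {
        const int old_prefix = running_total;
        running_total += tile_aggregate;
        return old_prefix;
    }
};

// Scans the whole input tile by tile, carrying state only in the callback.
std::vector<int> chunked_exclusive_sum(const std::vector<int>& in,
                                       std::size_t tile_size) {
    assert(tile_size > 0);
    std::vector<int> out(in.size());
    RunningPrefixOp prefix_op;
    for (std::size_t begin = 0; begin < in.size(); begin += tile_size) {
        const std::size_t n = std::min(tile_size, in.size() - begin);
        tile_exclusive_sum(in.data() + begin, out.data() + begin, n, prefix_op);
    }
    return out;
}
```

Here the tiles run one after another, so a plain running total is enough. In a device-wide scan the tiles run concurrently, and the callback has to obtain the prefix from other tiles' published state, which is the non-trivial part this issue asks to have exposed.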