-
Notifications
You must be signed in to change notification settings - Fork 924
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] scan_copy_if #6593
[WIP] scan_copy_if #6593
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #6593 +/- ##
============================================
Coverage 82.54% 82.54%
============================================
Files 90 90
Lines 14928 14928
============================================
Hits 12322 12322
Misses 2606 2606 Continue to review full report at Codecov.
|
cpp/tests/io/algo/scan_select_if.cuh
Outdated
void* allocations[2]; | ||
size_t allocation_sizes[2]; | ||
|
||
CUDA_TRY(Policy::ItemsTileState::AllocationSize(num_tiles, allocation_sizes[0])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is a non-CUDART function wrapped in CUDA_TRY
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ItemsTileState
is an alias for cub::ScanTileState
, who's static AllocationSize
function returns cudaError_t
. The implementation might as well be constexpr
, but the signature of the method suggests we should check for errors... 🤷
Am I missing something? I was under the impression we should use CUDA_TRY
for any functions returning cudaError_t
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know that CUB returns cudaError_t
. Seems weird -- what if there are errors in CUB that are not CUDA errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only did a cursory pass, but my high level concern is that this implementation is too heavily based on CUB internals. It uses patterns and abstractions that look unfamiliar compared to the rest of the libcudf code base, e.g., the use of the policy
and agent
abstractions and the use of CUB internal functions.
cpp/tests/io/algo/scan_select_if.cuh
Outdated
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()) | ||
{ | ||
CUDF_FUNC_RANGE(); | ||
using Input = typename InputIterator::value_type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using Input = typename InputIterator::value_type; | |
using Input = typename std::iterator_traits<InputIterator>::value_type; |
cpp/tests/io/algo/scan_select_if.cuh
Outdated
bool do_initialize, | ||
bool do_scatter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool
params like these are a code smell. Usually this means there should be two functions. Perhaps two functions of a single function object to express that they are linked together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I was trying to figure out a clean way to solve this, but this seemed the cleanest. The problem is that the code paths are nearly identical. throwing in a couple of bools to enable/disable two sections seemed clean from the implementation standpoint, but utilizing the function requires three calls.
We could add an "is tile state initialized" flag to the temp memory, create the temp memory internally, initialize it to "tile state not initialized", remove bool do_initialize
, and return the temp memory such that it can be passed to the second and third call.
We could specialize the kernel to accept a "void" type for OutputIterator
, and disable scatter via that, but that would increase compilation time (trivially?) and make the implementation more complicated.
Or we can have two bools, both of which have no effect on the other (all four combinations are valid, though initializing on the scatter phase is practically useless), and only enable/disable an independent section of code.
Do you have an idea? I'd much rather have a better solution.
cpp/tests/io/algo/scan_select_if.cuh
Outdated
enum : uint32_t { | ||
THREADS_PER_INIT_BLOCK = 128, | ||
THREADS_PER_BLOCK = 128, | ||
ITEMS_PER_THREAD = 16, | ||
ITEMS_PER_TILE = ITEMS_PER_THREAD * THREADS_PER_BLOCK, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This enum
trick is no longer needed when we have static constexpr
.
@jrhemstad As I understand it, this is exactly the level of abstraction that CUB is meant to service, and using CUB has made this implementation succinct and efficient.
None of these APIs are in CUB's internal namespace. All correlate directly with well-defined patterns and are unlikely to change.
CUB uses
Perhaps we could contribute this code (with alterations) to http://github.com/nvidia/cub ? |
.Load(d_input + tile_offset, items); | ||
} | ||
|
||
__syncthreads(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's safer and better expresses intent when a __device__
functions that syncrhonizes requires a cooperative group as an input parameter as it lets the caller know that the group will be synchronized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hadn't heard of cooperatives groups until now. Seems useful for keeping things flexible or clearly expressing intent, but I'm having a hard time seeing how they'd improve this code. The CUB functions being called all __sync...
or __threadfence...
internally, so adding a thread_block_tile<32>
parameter to this function seems like it would actually be more misleading due to inconsistent paradigms. I should make this function private...
@cwharris can you add docs describing what this algorithm does? I'm still not really sure yet. |
@jrhemstad Will add docs. 30kft view is: inclusive_scan, followed by copy_if. |
…enum to constexpr
|
||
// Items Scan | ||
|
||
using ItemsTileState = cub::ScanTileState<Input>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to use copies of these instead of reusing CUB's helper classes. I reserve the right to change the agent
layer of CUB at anytime, since those are the performance-critical backends for the device algorithms.
Alternatively, is this algorithm general purpose enough that it should be added to CUB directly? Can you give me a high-level description of what it does?
For the most part, this looks ok from a CUB API standpoint -- it's an implementation of a CUB-style algorithm, and mostly uses public APIs (functors, block collectives, thread APIs, etc). I pointed out an exception inline -- the scan helper classes from the For the device algorithms, the
I agree that these are unlikely to change, but I'm not going to commit to that :)
What's the advantage of combining them, instead of just calling
This would be welcome if there's a strong advantage to combining these and if it's a common usecase. Is this operation commonly used? |
The general idea is fusion of A general use case is "I have lots of inputs I need to scan, but I only need some of the results of said scan.". A specific use case is parsing a csv file via a non-commutative binary reduction operator, however I'm beginning to believe this specific use case calls for a more specific "parsing" algorithm - something which scans "tokens" and emits "artifacts" conditionally - i.e. the offset of every record/field start, which can only be determined non-commutatively. A trivial use case is "scan a million integers, only collect the results which are greater than 500,000, and less than 200,000". A slightly less trivial use case is "find only the ascending values, and only if they ascend significantly" (this, again, requires a specific/specialized non-commutative binary reduction operator and a transformation of the inputs/outputs to/from numbers/state). Yes, you can do this as I will be the first to admit that |
The main thing I don't like about this algorithm is that isn't primary use cases require dedicated "state" types, and those states would be larger than the inputs ( |
Why CUB and not Thrust? |
Both Thrust and CUB have device-level |
Either way I guess this is not a public API of libcudf, so getting it into CUB/Thrust makes sense, but you should prove it out for CSV parsing first. I just prefer std-style APIs to CUB-style... |
No description provided.