[WIP] scan_copy_if #6593

Closed
wants to merge 50 commits into branch-0.19

Conversation

cwharris
Contributor

No description provided.

@GPUtester
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@codecov

codecov bot commented Oct 25, 2020

Codecov Report

Merging #6593 (3e5a6eb) into branch-0.19 (cbd2726) will not change coverage.
The diff coverage is n/a.


@@             Coverage Diff              @@
##           branch-0.19    #6593   +/-   ##
============================================
  Coverage        82.54%   82.54%           
============================================
  Files               90       90           
  Lines            14928    14928           
============================================
  Hits             12322    12322           
  Misses            2606     2606           


void* allocations[2];
size_t allocation_sizes[2];

CUDA_TRY(Policy::ItemsTileState::AllocationSize(num_tiles, allocation_sizes[0]));
Member

Why is a non-CUDART function wrapped in CUDA_TRY?

Contributor Author

ItemsTileState is an alias for cub::ScanTileState, whose static AllocationSize function returns cudaError_t. The implementation might as well be constexpr, but the signature of the method suggests we should check for errors... 🤷

Am I missing something? I was under the impression we should use CUDA_TRY for any functions returning cudaError_t.

Member

I didn't know that CUB returns cudaError_t. Seems weird -- what if there are errors in CUB that are not CUDA errors?

Contributor
@jrhemstad jrhemstad left a comment

I only did a cursory pass, but my high level concern is that this implementation is too heavily based on CUB internals. It uses patterns and abstractions that look unfamiliar compared to the rest of the libcudf code base, e.g., the use of the policy and agent abstractions and the use of CUB internal functions.

rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
{
CUDF_FUNC_RANGE();
using Input = typename InputIterator::value_type;
Contributor

Suggested change
using Input = typename InputIterator::value_type;
using Input = typename std::iterator_traits<InputIterator>::value_type;
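Presumably the motivation is that std::iterator_traits also works when InputIterator is a raw pointer, which has no nested value_type. For example (illustrative only):

// int const* has no nested value_type, but iterator_traits handles it
using Input = typename std::iterator_traits<int const*>::value_type;  // Input is int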

Comment on lines 290 to 291
bool do_initialize,
bool do_scatter,
Contributor

bool params like these are a code smell. Usually this means there should be two functions. Perhaps two functions of a single function object to express that they are linked together.
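For illustration, a rough sketch (not from this PR; all names here are hypothetical) of the single-function-object shape being suggested, where two member functions replace the do_initialize/do_scatter flags:

struct scan_copy_if_passes {
  // shared launch state (temp storage, tile state, stream, ...) would live here

  // first pass: initialize tile state and compute the number of selected outputs
  void initialize_and_count() { /* launch the kernel with the init section enabled */ }

  // second pass: scatter the selected results into the caller-allocated output buffer
  void scatter() { /* launch the kernel with the scatter section enabled */ }
};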

Contributor Author
@cwharris cwharris Oct 26, 2020

I agree. I was trying to figure out a clean way to solve this, but this seemed the cleanest. The problem is that the code paths are nearly identical. Throwing in a couple of bools to enable/disable two sections seemed clean from the implementation standpoint, but utilizing the function requires three calls.

We could add an "is tile state initialized" flag to the temp memory, create the temp memory internally, initialize it to "tile state not initialized", remove bool do_initialize, and return the temp memory such that it can be passed to the second and third call.

We could specialize the kernel to accept a "void" type for OutputIterator, and disable scatter via that, but that would increase compilation time (trivially?) and make the implementation more complicated.

Or we can have two bools, both of which have no effect on the other (all four combinations are valid, though initializing on the scatter phase is practically useless), and only enable/disable an independent section of code.

Do you have an idea? I'd much rather have a better solution.

Comment on lines 227 to 232
enum : uint32_t {
THREADS_PER_INIT_BLOCK = 128,
THREADS_PER_BLOCK = 128,
ITEMS_PER_THREAD = 16,
ITEMS_PER_TILE = ITEMS_PER_THREAD * THREADS_PER_BLOCK,
};
Contributor

This enum trick is no longer needed when we have static constexpr.
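For reference, a sketch of the same constants using static constexpr (mirroring the values in the diff above, assuming <cstdint> is available):

static constexpr uint32_t THREADS_PER_INIT_BLOCK = 128;
static constexpr uint32_t THREADS_PER_BLOCK      = 128;
static constexpr uint32_t ITEMS_PER_THREAD       = 16;
static constexpr uint32_t ITEMS_PER_TILE         = ITEMS_PER_THREAD * THREADS_PER_BLOCK;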

@cwharris
Contributor Author

@jrhemstad As I understand it, this is exactly the level of abstraction that CUB is meant to service, and using CUB has made this implementation succinct and efficient.

... too heavily based on CUB internals.

None of these APIs are in CUB's internal namespace. All correlate directly with well-defined patterns and are unlikely to change. TilePrefixCallbackOp and ScanTileState are the core of CUB's single-pass scan implementation, and have not changed significantly in seven years.

... the use of the policy and agent...

CUB uses Policy and Dispatch to fire off different implementations based on the architecture. While we're not doing that here, the use of policy does limit the number of template parameters (and aliases) that need to be passed, reducing room for error and increasing readability (though it admittedly requires many typenames). I've omitted dispatch as it contributes nothing to this implementation. agent is included because it reduces the number of arguments that need to be passed to __device__ functions within the kernel. I started by avoiding CUB paradigms, and introduced them as they became beneficial from a readability or performance standpoint.

It uses patterns and abstractions that look unfamiliar compared to the rest of the libcudf code base

Perhaps we could contribute this code (with alterations) to http://github.com/nvidia/cub ?

@cwharris cwharris changed the title [WIP] scan_select_if [WIP] scan_copy_if Oct 26, 2020
.Load(d_input + tile_offset, items);
}

__syncthreads();
Contributor

It's safer and better expresses intent when a __device__ function that synchronizes requires a cooperative group as an input parameter, as it lets the caller know that the group will be synchronized.
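A minimal sketch of the suggested signature (not from this PR; load_items and its parameters are hypothetical), where taking the group as a parameter makes the synchronization visible at the call site:

#include <cooperative_groups.h>

namespace cg = cooperative_groups;

template <typename InputIterator, typename T, int ITEMS_PER_THREAD>
__device__ void load_items(cg::thread_block block,
                           InputIterator d_input,
                           int tile_offset,
                           T (&items)[ITEMS_PER_THREAD])
{
  // ... block-wide load of the tile into items ...
  block.sync();  // the parameter tells the caller this function synchronizes the block
}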

Contributor Author

Hadn't heard of cooperative groups until now. Seems useful for keeping things flexible or clearly expressing intent, but I'm having a hard time seeing how they'd improve this code. The CUB functions being called all use __sync... or __threadfence... internally, so adding a thread_block_tile<32> parameter to this function seems like it would actually be more misleading due to inconsistent paradigms. I should make this function private...

@jrhemstad
Contributor

@cwharris can you add docs describing what this algorithm does? I'm still not really sure yet.

@cwharris
Contributor Author

@jrhemstad Will add docs. 30kft view is: inclusive_scan, followed by copy_if.


// Items Scan

using ItemsTileState = cub::ScanTileState<Input>;
Contributor

You might want to use copies of these instead of reusing CUB's helper classes. I reserve the right to change the agent layer of CUB at any time, since those are the performance-critical backends for the device algorithms.

Alternatively, is this algorithm general purpose enough that it should be added to CUB directly? Can you give me a high-level description of what it does?

@alliepiper
Contributor

alliepiper commented Oct 27, 2020

For the most part, this looks ok from a CUB API standpoint -- it's an implementation of a CUB-style algorithm, and mostly uses public APIs (functors, block collectives, thread APIs, etc). I pointed out an exception inline -- the scan helper classes from the Agent level are implementation details that may change.

For the device algorithms, the cub::Device* algorithms are the preferred/supported public APIs. I'm also planning to maintain stability for the cub::Dispatch* entry points since they're (unfortunately) used to work around some current limitations of the Device layer. But the agent layer is too deep of an abstraction for us to call stable -- that is where drastic changes may occur to keep things performing well on new hardware.

TilePrefixCallbackOp and ScanTileState are the core of CUB's single-pass scan implementation, and have not changed significantly in seven years.

I agree that these are unlikely to change, but I'm not going to commit to that :)

30kft view is: inclusive_scan, followed by copy_if.

What's the advantage of combining them, instead of just calling cub::DeviceScan::InclusiveScan and cub::DeviceSelect::If?

Perhaps we could contribute this code (with alterations) to http://github.com/nvidia/cub ?

This would be welcome if there's a strong advantage to combining these and if it's a common use case. Is this operation commonly used?

@cwharris
Contributor Author

cwharris commented Oct 27, 2020

The general idea is fusion of inclusive_scan and copy_if, thereby reducing global memory usage (assuming you run once to count, then allocate the result buffer, then run again to gather/scatter).

A general use case is "I have lots of inputs I need to scan, but I only need some of the results of said scan."

A specific use case is parsing a CSV file via a non-commutative binary reduction operator; however, I'm beginning to believe this specific use case calls for a more specific "parsing" algorithm - something which scans "tokens" and emits "artifacts" conditionally - i.e. the offset of every record/field start, which can only be determined non-commutatively.

A trivial use case is "scan a million integers, only collect the results which are greater than 500,000, and less than 200,000".

A slightly less trivial use case is "find only the ascending values, and only if they ascend significantly" (this, again, requires a specific/specialized non-commutative binary reduction operator and a transformation of the inputs/outputs to/from numbers/state). Yes, you can do this as inclusive_scan followed by adjacent_difference, and then select_if, but that's now three algorithms and two intermediate result sets.

I will be the first to admit that scan_copy_if is not a perfect abstraction for parsing. It can in theory, however, be used to determine the data types of each column of a csv, and/or parse the values of each of those columns. But again, scan_copy_if is a generalization of this, and a dedicated parse/finite-state-machine algorithm would be a better fit.

@cwharris
Contributor Author

The main thing I don't like about this algorithm is that its primary use cases require dedicated "state" types, and those states would be larger than the inputs (input => state via transform_iterator), and therefore require more shared memory. That could be mitigated by allowing a third operator to convert input => state immediately prior to the scan operator (agg_state = scan(agg_state, to_state(input))), but now we're leaving CUB land.
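A rough sketch (not part of this PR; ToState and ScanOp are hypothetical functor types) of that third-operator idea, folding the input => state conversion into the scan operator so only raw inputs need to occupy shared memory:

template <typename ToState, typename ScanOp>
struct state_scan_op {
  ToState to_state;
  ScanOp scan;

  template <typename State, typename Input>
  __device__ State operator()(State agg_state, Input input) const
  {
    // convert each raw input to its (larger) state only at the point of combination
    return scan(agg_state, to_state(input));
  }
};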

@harrism
Member

harrism commented Oct 28, 2020

Why CUB and not Thrust?

@cwharris
Contributor Author

Why CUB and not Thrust?

Both Thrust and CUB have device-level inclusive_scan and select_if/copy_if, but those are insufficient as they require intermediate allocations (scan, alloc, copy_if). That means a custom kernel is in order. Because CUB provides kernel-level abstractions, it is included here primarily to take advantage of the incredibly useful single-pass device-wide scan conventions. Unfortunately, that's exactly the functionality @allisonvacanti is saying is not public - thus the idea of contributing this functionality to CUB.
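For comparison, a sketch (not from this PR) of the unfused Thrust pipeline being described, which materializes every scanned value in an intermediate buffer before copy_if can filter it:

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

template <typename T, typename Predicate>
thrust::device_vector<T> unfused_scan_copy_if(thrust::device_vector<T> const& input,
                                              Predicate pred)
{
  // intermediate allocation: one scanned value per input element
  thrust::device_vector<T> scanned(input.size());
  thrust::inclusive_scan(input.begin(), input.end(), scanned.begin());

  // second pass over the scanned values, keeping only those that satisfy pred
  thrust::device_vector<T> selected(input.size());
  auto selected_end = thrust::copy_if(scanned.begin(), scanned.end(), selected.begin(), pred);
  selected.resize(selected_end - selected.begin());
  return selected;
}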

@harrism
Member

harrism commented Oct 28, 2020

Either way I guess this is not a public API of libcudf, so getting it into CUB/Thrust makes sense, but you should prove it out for CSV parsing first.

I just prefer std-style APIs to CUB-style...

@kkraus14 kkraus14 added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. labels Feb 2, 2021
@kkraus14 kkraus14 changed the base branch from branch-0.17 to branch-0.19 February 2, 2021 22:09
@cwharris cwharris closed this May 5, 2021