[WIP] scan_copy_if #6593

Closed
wants to merge 50 commits into branch-0.19

Conversation

cwharris
Contributor

No description provided.

@GPUtester
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@codecov

codecov bot commented Oct 25, 2020

Codecov Report

Merging #6593 (3e5a6eb) into branch-0.19 (cbd2726) will not change coverage.
The diff coverage is n/a.


@@             Coverage Diff              @@
##           branch-0.19    #6593   +/-   ##
============================================
  Coverage        82.54%   82.54%           
============================================
  Files               90       90           
  Lines            14928    14928           
============================================
  Hits             12322    12322           
  Misses            2606     2606           


void* allocations[2];
size_t allocation_sizes[2];

CUDA_TRY(Policy::ItemsTileState::AllocationSize(num_tiles, allocation_sizes[0]));
Member

Why is a non-CUDART function wrapped in CUDA_TRY?

Contributor Author

ItemsTileState is an alias for cub::ScanTileState, whose static AllocationSize function returns cudaError_t. The implementation might as well be constexpr, but the signature of the method suggests we should check for errors... 🤷

Am I missing something? I was under the impression we should use CUDA_TRY for any functions returning cudaError_t.

Member

I didn't know that CUB returns cudaError_t. Seems weird -- what if there are errors in CUB that are not CUDA errors?

Contributor
@jrhemstad jrhemstad left a comment

I only did a cursory pass, but my high level concern is that this implementation is too heavily based on CUB internals. It uses patterns and abstractions that look unfamiliar compared to the rest of the libcudf code base, e.g., the use of the policy and agent abstractions and the use of CUB internal functions.

rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
{
CUDF_FUNC_RANGE();
using Input = typename InputIterator::value_type;
Contributor

Suggested change
using Input = typename InputIterator::value_type;
using Input = typename std::iterator_traits<InputIterator>::value_type;
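Presumably the motivation is that std::iterator_traits also works when InputIterator is a raw pointer, which has no nested value_type. For example (illustrative only):

// int const* has no nested value_type, but iterator_traits handles it
using Input = typename std::iterator_traits<int const*>::value_type;  // Input is int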

Comment on lines 290 to 291
bool do_initialize,
bool do_scatter,
Contributor

bool params like these are a code smell. Usually this means there should be two functions. Perhaps two functions of a single function object to express that they are linked together.
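For illustration, a rough sketch (not from this PR; all names here are hypothetical) of the single-function-object shape being suggested, where two member functions replace the do_initialize/do_scatter flags:

struct scan_copy_if_passes {
  // shared launch state (temp storage, tile state, stream, ...) would live here

  // first pass: initialize tile state and compute the number of selected outputs
  void initialize_and_count() { /* launch the kernel with the init section enabled */ }

  // second pass: scatter the selected results into the caller-allocated output buffer
  void scatter() { /* launch the kernel with the scatter section enabled */ }
};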

Contributor Author
@cwharris cwharris Oct 26, 2020

I agree. I was trying to figure out a clean way to solve this, but this seemed the cleanest. The problem is that the code paths are nearly identical. Throwing in a couple of bools to enable/disable two sections seemed clean from the implementation standpoint, but utilizing the function requires three calls.

We could add an "is tile state initialized" flag to the temp memory, create the temp memory internally, initialize it to "tile state not initialized", remove bool do_initialize, and return the temp memory such that it can be passed to the second and third call.

We could specialize the kernel to accept a "void" type for OutputIterator, and disable scatter via that, but that would increase compilation time (trivially?) and make the implementation more complicated.

Or we can have two bools, both of which have no effect on the other (all four combinations are valid, though initializing on the scatter phase is practically useless), and only enable/disable an independent section of code.

Do you have an idea? I'd much rather have a better solution.

Comment on lines 227 to 232
enum : uint32_t {
THREADS_PER_INIT_BLOCK = 128,
THREADS_PER_BLOCK = 128,
ITEMS_PER_THREAD = 16,
ITEMS_PER_TILE = ITEMS_PER_THREAD * THREADS_PER_BLOCK,
};
Contributor

This enum trick is no longer needed when we have static constexpr.
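For reference, a sketch of the same constants using static constexpr (mirroring the values in the diff above, assuming <cstdint> is available):

static constexpr uint32_t THREADS_PER_INIT_BLOCK = 128;
static constexpr uint32_t THREADS_PER_BLOCK      = 128;
static constexpr uint32_t ITEMS_PER_THREAD       = 16;
static constexpr uint32_t ITEMS_PER_TILE         = ITEMS_PER_THREAD * THREADS_PER_BLOCK;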

@cwharris
Contributor Author

@jrhemstad As I understand it, this is exactly the level of abstraction that CUB is meant to service, and using CUB has made this implementation succinct and efficient.

... too heavily based on CUB internals.

None of these APIs are in CUB's internal namespace. All correlate directly with well-defined patterns and are unlikely to change. TilePrefixCallbackOp and ScanTileState are the core of CUB's single-pass scan implementation, and have not changed significantly in seven years.

... the use of the policy and agent...

CUB uses Policy and Dispatch to fire off different implementations based on the architecture. While we're not doing that here, the use of policy does limit the number of template parameters (and aliases) that need to be passed, reducing room for error and increasing readability (though it admittedly requires many typenames). I've omitted dispatch as it contributes nothing to this implementation. agent is included because it reduces the number of arguments that need to be passed to __device__ functions within the kernel. I started by avoiding CUB paradigms, and introduced them as they became beneficial from a readability or performance standpoint.

It uses patterns and abstractions that look unfamiliar compared to the rest of the libcudf code base

Perhaps we could contribute this code (with alterations) to http://github.com/nvidia/cub ?

@cwharris cwharris changed the title [WIP] scan_select_if [WIP] scan_copy_if Oct 26, 2020
.Load(d_input + tile_offset, items);
}

__syncthreads();
Contributor

It's safer and better expresses intent when a __device__ function that synchronizes requires a cooperative group as an input parameter, as it lets the caller know that the group will be synchronized.
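A minimal sketch of the suggested signature (not from this PR; load_items and its parameters are hypothetical), where taking the group as a parameter makes the synchronization visible at the call site:

#include <cooperative_groups.h>

namespace cg = cooperative_groups;

template <typename InputIterator, typename T, int ITEMS_PER_THREAD>
__device__ void load_items(cg::thread_block block,
                           InputIterator d_input,
                           int tile_offset,
                           T (&items)[ITEMS_PER_THREAD])
{
  // ... block-wide load of the tile into items ...
  block.sync();  // the parameter tells the caller this function synchronizes the block
}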

Contributor Author

Hadn't heard of cooperative groups until now. Seems useful for keeping things flexible or clearly expressing intent, but I'm having a hard time seeing how they'd improve this code. The CUB functions being called all use __sync... or __threadfence... internally, so adding a thread_block_tile<32> parameter to this function seems like it would actually be more misleading due to inconsistent paradigms. I should make this function private...

@jrhemstad
Contributor

@cwharris can you add docs describing what this algorithm does? I'm still not really sure yet.

@cwharris
Contributor Author

@jrhemstad Will add docs. 30kft view is: inclusive_scan, followed by copy_if.


// Items Scan

using ItemsTileState = cub::ScanTileState<Input>;
Contributor

You might want to use copies of these instead of reusing CUB's helper classes. I reserve the right to change the agent layer of CUB at any time, since those are the performance-critical backends for the device algorithms.

Alternatively, is this algorithm general purpose enough that it should be added to CUB directly? Can you give me a high-level description of what it does?

@alliepiper
Contributor

alliepiper commented Oct 27, 2020

For the most part, this looks ok from a CUB API standpoint -- it's an implementation of a CUB-style algorithm, and mostly uses public APIs (functors, block collectives, thread APIs, etc). I pointed out an exception inline -- the scan helper classes from the Agent level are implementation details that may change.

For the device algorithms, the cub::Device* algorithms are the preferred/supported public APIs. I'm also planning to maintain stability for the cub::Dispatch* entry points since they're (unfortunately) used to work around some current limitations of the Device layer. But the agent layer is too deep of an abstraction for us to call stable -- that is where drastic changes may occur to keep things performing well on new hardware.

TilePrefixCallbackOp and ScanTileState are the core of CUB's single-pass scan implementation, and have not changed significantly in seven years.

I agree that these are unlikely to change, but I'm not going to commit to that :)

30kft view is: inclusive_scan, followed by copy_if.

What's the advantage of combining them, instead of just calling cub::DeviceScan::InclusiveScan and cub::DeviceSelect::If?

Perhaps we could contribute this code (with alterations) to http://github.com/nvidia/cub ?

This would be welcome if there's a strong advantage to combining these and if it's a common use case. Is this operation commonly used?

@cwharris
Contributor Author

cwharris commented Oct 27, 2020

The general idea is fusion of inclusive_scan and copy_if, thereby reducing global memory usage (assuming you run once to count, then allocate the result buffer, then run again to gather/scatter).

A general use case is "I have lots of inputs I need to scan, but I only need some of the results of said scan."

A specific use case is parsing a CSV file via a non-commutative binary reduction operator; however, I'm beginning to believe this specific use case calls for a more specific "parsing" algorithm - something which scans "tokens" and emits "artifacts" conditionally - i.e. the offset of every record/field start, which can only be determined non-commutatively.

A trivial use case is "scan a million integers, only collect the results which are greater than 500,000, and less than 200,000".

A slightly less trivial use case is "find only the ascending values, and only if they ascend significantly" (this, again, requires a specific/specialized non-commutative binary reduction operator and a transformation of the inputs/outputs to/from numbers/state). Yes, you can do this as inclusive_scan followed by adjacent_difference, and then select_if, but that's now three algorithms and two intermediate result sets.

I will be the first to admit that scan_copy_if is not a perfect abstraction for parsing. It can in theory, however, be used to determine the data types of each column of a csv, and/or parse the values of each of those columns. But again, scan_copy_if is a generalization of this, and a dedicated parse/finite-state-machine algorithm would be a better fit.

@cwharris
Contributor Author

The main thing I don't like about this algorithm is that its primary use cases require dedicated "state" types, and those states would be larger than the inputs (input => state via transform_iterator), and therefore require more shared memory. That could be mitigated by allowing a third operator to convert input => state immediately prior to the scan operator (agg_state = scan(agg_state, to_state(input))), but now we're leaving CUB land.
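A rough sketch (not part of this PR; ToState and ScanOp are hypothetical functor types) of that third-operator idea, folding the input => state conversion into the scan operator so only raw inputs need to occupy shared memory:

template <typename ToState, typename ScanOp>
struct state_scan_op {
  ToState to_state;
  ScanOp scan;

  template <typename State, typename Input>
  __device__ State operator()(State agg_state, Input input) const
  {
    // convert each raw input to its (larger) state only at the point of combination
    return scan(agg_state, to_state(input));
  }
};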

@harrism
Member

harrism commented Oct 28, 2020

Why CUB and not Thrust?

@cwharris
Contributor Author

Why CUB and not Thrust?

Both Thrust and CUB have device-level inclusive_scan and select_if/copy_if, but those are insufficient as they require intermediate allocations (scan, alloc, copy_if). That means a custom kernel is in order. Because CUB provides kernel-level abstractions, it is included here primarily to take advantage of the incredibly useful single-pass device-wide scan conventions. Unfortunately, that's exactly the functionality @allisonvacanti is saying is not public - thus the idea of contributing this functionality to CUB.
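For comparison, a sketch (not from this PR) of the unfused Thrust pipeline being described, which materializes every scanned value in an intermediate buffer before copy_if can filter it:

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

template <typename T, typename Predicate>
thrust::device_vector<T> unfused_scan_copy_if(thrust::device_vector<T> const& input,
                                              Predicate pred)
{
  // intermediate allocation: one scanned value per input element
  thrust::device_vector<T> scanned(input.size());
  thrust::inclusive_scan(input.begin(), input.end(), scanned.begin());

  // second pass over the scanned values, keeping only those that satisfy pred
  thrust::device_vector<T> selected(input.size());
  auto selected_end = thrust::copy_if(scanned.begin(), scanned.end(), selected.begin(), pred);
  selected.resize(selected_end - selected.begin());
  return selected;
}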

@harrism
Member

harrism commented Oct 28, 2020

Either way I guess this is not a public API of libcudf, so getting it into CUB/Thrust makes sense, but you should prove it out for CSV parsing first.

I just prefer std-style APIs to CUB-style...

@kkraus14 kkraus14 added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. labels Feb 2, 2021
@kkraus14 kkraus14 changed the base branch from branch-0.17 to branch-0.19 February 2, 2021 22:09
@cwharris cwharris closed this May 5, 2021