Catch2 segmented sort #1484

alliepiper · 2024-03-04T21:39:33Z

Description

Ports CUB's device segmented sort test to Catch2.

- Allow explicit construction from any type to facilitate generic programming. A similar assignment operator already existed.
- Make ==/!= operators into friend functions. This fixes compat with thrust::device_reference in testing code.
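As a rough illustration of the two changes described above, here is a minimal sketch using a hypothetical `null_type` stand-in. It is not the actual `cub::NullType` definition; the name and exact signatures are assumptions made for the example.

```cpp
// Hypothetical stand-in for an empty placeholder type; not the real cub::NullType.
struct null_type
{
  null_type() = default;

  // Explicit construction from any type: the argument is simply ignored.
  template <typename T>
  explicit null_type(const T&)
  {}

  // Assignment from any type (the description says a similar operator already existed).
  template <typename T>
  null_type& operator=(const T&)
  {
    return *this;
  }

  // Comparisons as friend (non-member) functions: all values compare equal.
  friend bool operator==(const null_type&, const null_type&)
  {
    return true;
  }
  friend bool operator!=(const null_type&, const null_type&)
  {
    return false;
  }
};
```

Generic code can then write `ValueT value(some_input);` for both real value types and the placeholder, as long as the real types are also constructible from `some_input`.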

thrust::device_reference does not compile when operator== is a member function. Changing the operators to friend functions works around the issue.
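The sketch below shows one general reason a member `operator==` can fail with a proxy reference. It does not reproduce the actual thrust failure: `proxy_ref` and `wrapped_half` are hypothetical stand-ins, and the assumption is that the proxy reaches the comparison only through an implicit conversion to the wrapped type.

```cpp
// Hypothetical proxy that, like a device reference, is only convertible to T.
template <typename T>
struct proxy_ref
{
  const T* ptr;
  operator T() const
  {
    return *ptr;
  }
};

// Hypothetical 16-bit wrapper; stands in for half_t / bfloat16_t.
struct wrapped_half
{
  unsigned short bits;

  // As friends, either operand may be implicitly converted, so both
  // `proxy == value` and `value == proxy` compile. With a member
  // operator==, the left operand must already be a wrapped_half.
  friend bool operator==(const wrapped_half& a, const wrapped_half& b)
  {
    return a.bits == b.bits;
  }
  friend bool operator!=(const wrapped_half& a, const wrapped_half& b)
  {
    return !(a == b);
  }
};
```

With these definitions, `proxy_ref<wrapped_half>{&v} == v` compiles because the friend function is found through argument-dependent lookup and the proxy is converted on the left-hand side.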

This should fix some ambiguous overload issues we're seeing on CI.

This is intended to help with Catch2's CAPTURE macro:

```
CAPTURE(c2h::type_name<KeyT>(), c2h::type_name<ValueT>());

output on failure:

c2h::type_name<KeyT>() := "h"
c2h::type_name<ValueT>() := "N3cub25CUB_200300_600_700_800_NS8NullTypeE"
```

ABI demangling would be a nice improvement for later.
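For the demangling follow-up mentioned above, one possible approach is sketched here. `demangled_type_name` is a hypothetical helper, not part of c2h, and it assumes a GCC or Clang toolchain for `abi::__cxa_demangle`. On other compilers it falls back to the raw `typeid` name.

```cpp
#include <cstdlib>
#include <memory>
#include <string>
#include <typeinfo>

#if defined(__GNUC__) // GCC and Clang expose the Itanium ABI demangler
#  include <cxxabi.h>
#endif

// Hypothetical helper, not part of c2h: returns a readable name for T.
template <typename T>
std::string demangled_type_name()
{
  const char* raw = typeid(T).name();
#if defined(__GNUC__)
  int status = 0;
  // __cxa_demangle returns a malloc'd buffer, so release it with std::free.
  std::unique_ptr<char, void (*)(void*)> text(
    abi::__cxa_demangle(raw, nullptr, nullptr, &status), std::free);
  if (status == 0 && text)
  {
    return std::string(text.get());
  }
#endif
  return std::string(raw); // fall back to the implementation-defined name
}
```

With this sketch on GCC or Clang, the `"h"` in the failure output above would be reported as `unsigned char`.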

- Add macros that can be enabled using `-DC2H_DEBUG_TIMING`.
- Add RAII scoped_cpu_timer + macro.
- Increase precision of output from ms -> us.
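The items above can be sketched roughly as follows. This is an illustration only: `scoped_timer_sketch` and `SKETCH_TIME_SCOPE` are hypothetical names, and the output format is an assumption rather than the actual c2h implementation. Only the `C2H_DEBUG_TIMING` define comes from the description.

```cpp
#include <chrono>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical sketch; names and output format are illustrative, not the c2h API.
class scoped_timer_sketch
{
public:
  explicit scoped_timer_sketch(std::string label, std::ostream& out = std::cout)
      : label_(std::move(label))
      , out_(out)
      , start_(std::chrono::steady_clock::now())
  {}

  scoped_timer_sketch(const scoped_timer_sketch&)            = delete;
  scoped_timer_sketch& operator=(const scoped_timer_sketch&) = delete;

  // Report the elapsed time when the enclosing scope ends.
  ~scoped_timer_sketch()
  {
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    const auto us      = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    // Format seconds with six decimals, i.e. microsecond resolution.
    std::ostringstream line;
    line << std::fixed << std::setprecision(6) << (static_cast<double>(us) * 1e-6) << " s: " << label_ << '\n';
    out_ << line.str();
  }

private:
  std::string label_;
  std::ostream& out_;
  std::chrono::steady_clock::time_point start_;
};

// Expands to a no-op unless the build defines C2H_DEBUG_TIMING.
#define SKETCH_CONCAT_IMPL(a, b) a##b
#define SKETCH_CONCAT(a, b) SKETCH_CONCAT_IMPL(a, b)
#ifdef C2H_DEBUG_TIMING
#  define SKETCH_TIME_SCOPE(label) scoped_timer_sketch SKETCH_CONCAT(sketch_timer_, __LINE__)(label)
#else
#  define SKETCH_TIME_SCOPE(label) static_cast<void>(0)
#endif
```

Placing `SKETCH_TIME_SCOPE("host sort");` at the top of a block would then time that block in timing-enabled builds and cost nothing otherwise.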

gevtushenko · 2024-03-22T23:01:48Z

cub/test/catch2_segmented_sort_helper.cuh

+  for (bool stable_sort : {unstable, stable})
+  {
+    for (bool sort_buffers : {pointers, double_buffer})


suggestion: as part of the Catch2 transition, I'd like to get away from this kind of manual dispatch. The motivation is the following:

This code is invoked for a few type combinations and input patterns. Stable sort, for instance, is no different from the unstable one at the moment. For the buffers / pointers API alternatives, the code path difference is also trivial. I'd like to keep a wide test (covering more than one type combination) for double buffers / stable, and one type / a few inputs for everything else.

Tests of distinct functionality should be in distinct test suites, so that there's no way for one test to affect the state of another test.

I agree in general with these goals, but would prefer to keep this particular test as-is.

These are combined into the same test case to reduce execution time. The time needed to do the device sort + validation is trivial, while sorting the host reference is expensive. Combining the test cases like this allows the expensive reference solution to be reused for multiple test cases, so we can test the multiple variants very cheaply. Taking the slowest test that passes through this function:

0.000686 s: generate_random_offsets
0.003239 s: Allocate device memory
0.000466 s: generate_random_unsorted_inputs
0.007262 s: D->H input arrays
0.002009 s: Clone input arrays on device
0.514597 s: host_sort_random_inputs
0.006783 s: H->D reference arrays
0.004699 s: cub::DeviceSegmentedSort
0.001485 s: validate_sorted_random_outputs
0.000456 s: Reset input/output device arrays
0.002804 s: cub::DeviceSegmentedSort
0.000972 s: validate_sorted_random_outputs
0.000453 s: Reset input/output device arrays
0.004621 s: cub::DeviceSegmentedSort
0.001034 s: validate_sorted_random_outputs
0.000417 s: Reset input/output device arrays
0.002716 s: cub::DeviceSegmentedSort
0.001026 s: validate_sorted_random_outputs
0.557 s: "DeviceSegmentedSortPairs: Randomly sized segments, random keys/values"(0) - types_161 - 2

Preparing the input and reference arrays takes 535ms, while running the device algorithm and verifying the outputs 4 times with different API options takes 20ms total, roughly 1% of the total reference sort time each -- they're practically free :)

Also, this is only for the tests with true random inputs. The tests with derived inputs execute each device sort as a separate test case.

alliepiper · 2024-03-27T19:47:33Z

Verified the new unstable test logic against #1552.

alliepiper requested review from a team as code owners March 4, 2024 21:39

alliepiper requested review from elstehle and miscco March 4, 2024 21:39

alliepiper force-pushed the catch2_segmented_sort branch 3 times, most recently from 66c1b39 to 8983fbd March 5, 2024 02:20

alliepiper marked this pull request as draft March 5, 2024 02:21

alliepiper force-pushed the catch2_segmented_sort branch 4 times, most recently from a85dea2 to 6dce2a7 March 8, 2024 19:03

alliepiper added 8 commits March 11, 2024 17:02

Make NullType more convenient.

4b760ab

- Allow explicit construction from any type to facilitate generic programming. A similar assignment operator already existed.
- Make ==/!= operators into friend functions. This fixes compat with thrust::device_reference in testing code.

Make half_t and bfloat16_t device_reference compatible.

1e997a2

thrust::device_reference does not compile when operator== is a member function. Changing the operators to friend functions works around the issue.

Allow conversion of double -> half_t/bfloat16_t.

021b5ca

Make half/bfloat16 wrapper ctors explicit.

8025935

This should fix some ambiguous overload issues we're seeing on CI.

Add c2h::nosync_device_policy.

4ad4468

Improvements to c2h::cpu_timer.

81bb039

- Add macros that can be enabled using `-DC2H_DEBUG_TIMING`.
- Add RAII scoped_cpu_timer + macro.
- Increase precision of output from ms -> us.

Port DeviceSegmentedSort tests to catch2.

40eb5db

alliepiper force-pushed the catch2_segmented_sort branch from 6dce2a7 to 40eb5db March 11, 2024 17:02

alliepiper commented Mar 11, 2024

cub/test/catch2_segmented_sort_helper.cuh (outdated)

alliepiper commented Mar 11, 2024

cub/test/catch2_segmented_sort_helper.cuh (outdated)

alliepiper commented Mar 11, 2024

cub/test/c2h/cpu_timer.cuh

Address live-review feedback.

90e904b

alliepiper marked this pull request as ready for review March 11, 2024 20:37

Use void-cast instead of cuda::std::ignore to WAR warnings.

07a4bc1

/home/coder/cccl/cub/test/catch2_segmented_sort_helper.cuh(503): error NVIDIA#174-D: expression has no effect

Nyrio mentioned this pull request Mar 19, 2024

Add an efficient unstable thread sort, use it in unstable block/device merge/segmented sorts, and improve tests #1552

Open

2 tasks

alliepiper requested a review from gevtushenko March 22, 2024 15:19

gevtushenko approved these changes Mar 22, 2024

alliepiper added 4 commits March 27, 2024 17:21

Add support for unstable sort, address review feedback.

d06d66a

Merge remote-tracking branch 'origin/main' into catch2_segmented_sort

b8a5e45

Remove duplicate increment, leftover after while->for conversion.

adf7a1d

Update CUB_IF_CONSTEXPR to _CCCL_IF_CONSTEXPR

610ca3e

alliepiper enabled auto-merge (squash) March 27, 2024 19:47

alliepiper added 2 commits March 27, 2024 20:36

Address some CI failures.

d4f18fa

Fix unused variable warnings in catch2_segmented_sort_helper.cuh

d494c77

alliepiper merged commit e3758cf into NVIDIA:main Mar 27, 2024
584 checks passed

alliepiper deleted the catch2_segmented_sort branch March 28, 2024 00:45
