Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch2 segmented sort #1484

Merged
merged 16 commits into from
Mar 27, 2024
Merged

Conversation

alliepiper
Copy link
Collaborator

Description

closes #1380

Ports CUB's device segmented sort test to catch2

@alliepiper alliepiper requested review from a team as code owners March 4, 2024 21:39
@alliepiper alliepiper requested review from elstehle and miscco March 4, 2024 21:39
@alliepiper alliepiper force-pushed the catch2_segmented_sort branch 3 times, most recently from 66c1b39 to 8983fbd Compare March 5, 2024 02:20
@alliepiper alliepiper marked this pull request as draft March 5, 2024 02:21
@alliepiper alliepiper force-pushed the catch2_segmented_sort branch 4 times, most recently from a85dea2 to 6dce2a7 Compare March 8, 2024 19:03
- Allow explicit construction from any type to facilitate generic programming.
  A similar assignment operator already existed.
- Make ==/!= operators into friend functions. This fixes compat with thrust::device_reference
  in testing code.
thrust::device_reference does not compile when operator== is a member function.
Changing to friend functions WAR the issue.
This should fix some ambiguous overload issues we're seeing on CI.
This is intended to help with Catch2's CAPTURE macro:

```
CAPTURE(c2h::type_name<KeyT>(), c2h::type_name<ValueT>);

output on failure:
  c2h::type_name<KeyT>() := "h"
  c2h::type_name<ValueT>() := "N3cub25CUB_200300_600_700_800_NS8NullTypeE"
```

ABI demangling would be a nice improvement for later.
- Add macros that can be enabled using `-DC2H_DEBUG_TIMING`.
- Add RAII scoped_cpu_timer + macro.
- Increase precision of output from ms -> us.
@alliepiper alliepiper force-pushed the catch2_segmented_sort branch from 6dce2a7 to 40eb5db Compare March 11, 2024 17:02
@alliepiper alliepiper marked this pull request as ready for review March 11, 2024 20:37
 /home/coder/cccl/cub/test/catch2_segmented_sort_helper.cuh(503): error NVIDIA#174-D: expression has no effect
Comment on lines +1130 to +1132
for (bool stable_sort : {unstable, stable})
{
for (bool sort_buffers : {pointers, double_buffer})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: as part of the Catch2 transition, I'd like to get away from this kind of manual dispatch. The motivation is the following:

  1. This code is invoked for a few type combinations and input patterns. Stable sort, for instance, is no different from unstable one at the moment. For buffers / pointers API alternatives, the code path difference is also trivial. I'd like to keep a wide test (covering more than one type combination) for double buffers / stable, and one type / a few inputs for everything else.
  2. Tests of district functionality should be in distinct test suits, so that there's no way for one test to affect state of another test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree in general with these goals, but would prefer to keep this particular test as-is.

These are combined into the same test case to reduce execution time. The time needed to do the device sort + validation is trivial, while the time taken to sort the host reference is expensive. Combining the test cases like this allows the expensive reference solution to be reused for multiple test cases. We can test the multiple variants very cheaply. Taking the slowest test that passes through this function:

0.000686 s: generate_random_offsets
0.003239 s: Allocate device memory
0.000466 s: generate_random_unsorted_inputs
0.007262 s: D->H input arrays
0.002009 s: Clone input arrays on device
0.514597 s: host_sort_random_inputs
0.006783 s: H->D reference arrays
0.004699 s: cub::DeviceSegmentedSort
0.001485 s: validate_sorted_random_outputs
0.000456 s: Reset input/output device arrays
0.002804 s: cub::DeviceSegmentedSort
0.000972 s: validate_sorted_random_outputs
0.000453 s: Reset input/output device arrays
0.004621 s: cub::DeviceSegmentedSort
0.001034 s: validate_sorted_random_outputs
0.000417 s: Reset input/output device arrays
0.002716 s: cub::DeviceSegmentedSort
0.001026 s: validate_sorted_random_outputs
0.557 s: "DeviceSegmentedSortPairs: Randomly sized segments, random keys/values"(0) - types_161 - 2

Preparing the input and reference arrays takes 535ms, while running the device algorithm and verifying the outputs 4 times with different API options takes 20ms total, roughly 1% of the total reference sort time each -- they're practically free :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this is only for the tests with true random inputs. The tests with derived inputs execute each device sort as a separate test case.

@alliepiper
Copy link
Collaborator Author

Verified the new unstable test logic against #1552.

@alliepiper alliepiper enabled auto-merge (squash) March 27, 2024 19:47
@alliepiper alliepiper merged commit e3758cf into NVIDIA:main Mar 27, 2024
584 checks passed
@alliepiper alliepiper deleted the catch2_segmented_sort branch March 28, 2024 00:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Port CUB's test/test_device_segmented_sort.cu to catch2
2 participants