Catch2 segmented sort #1484
- Allow explicit construction from any type to facilitate generic programming. A similar assignment operator already existed.
- Make ==/!= operators into friend functions. This fixes compat with thrust::device_reference in testing code.
thrust::device_reference does not compile when operator== is a member function. Changing it to friend functions works around the issue.
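One plausible reading of why friend functions help, as a minimal sketch under stated assumptions: a proxy reference converts implicitly to the wrapped value, and a member `operator==` is not considered when the proxy is the left operand (at least before C++20's reversed candidates), whereas a non-member friend allows the conversion on either side. `proxy_ref`, `custom_key_t`, and `matches` are hypothetical stand-ins, not Thrust or c2h names.

```cpp
// Hypothetical stand-in for a proxy reference such as thrust::device_reference:
// it is not the wrapped type, but converts to it implicitly.
template <typename T>
struct proxy_ref
{
  T* ptr;
  operator T() const { return *ptr; }
};

// Hypothetical test type (not the real c2h type).
struct custom_key_t
{
  int value;

  // Non-member friends: implicit conversions apply to BOTH operands,
  // so `proxy == value` can convert the proxy on the left-hand side.
  friend bool operator==(const custom_key_t& a, const custom_key_t& b)
  {
    return a.value == b.value;
  }
  friend bool operator!=(const custom_key_t& a, const custom_key_t& b)
  {
    return !(a == b);
  }
};

// Generic comparison where the left operand may be a proxy or a plain value.
template <typename Ref, typename T>
bool matches(const Ref& ref, const T& expected)
{
  return ref == expected;
}
```

The friends are found through argument-dependent lookup, since `custom_key_t` is a template argument of the proxy type.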
This should fix some ambiguous overload issues we're seeing on CI.
This is intended to help with Catch2's CAPTURE macro:
```
CAPTURE(c2h::type_name<KeyT>(), c2h::type_name<ValueT>());
output on failure:
c2h::type_name<KeyT>() := "h"
c2h::type_name<ValueT>() := "N3cub25CUB_200300_600_700_800_NS8NullTypeE"
```
ABI demangling would be a nice improvement for later.
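A helper of this kind could look roughly like the sketch below. It assumes the name comes straight from `typeid`, which is consistent with the mangled output shown above but is not confirmed by this PR. `type_name_sketch` is a hypothetical name, not the actual `c2h::type_name`.

```cpp
#include <string>
#include <typeinfo>

// Hypothetical helper in the spirit of c2h::type_name; the real
// implementation is not shown in this conversation.
// The returned string is implementation-defined and is typically a
// mangled name on GCC/Clang (no ABI demangling is attempted here).
template <typename T>
std::string type_name_sketch()
{
  return typeid(T).name();
}
```

Returning a `std::string` rather than a `const char*` keeps the value easy to pass to logging macros such as `CAPTURE`, which stringify their arguments on failure.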
- Add macros that can be enabled using `-DC2H_DEBUG_TIMING`.
- Add RAII scoped_cpu_timer + macro.
- Increase precision of output from ms -> us.
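The RAII timer idea can be sketched as follows. This is an illustration under assumptions, not the c2h implementation: `scoped_cpu_timer_sketch`, `format_timing`, `SKETCH_CPU_TIMER`, and the `SKETCH_DEBUG_TIMING` switch are hypothetical names, and the output format only mimics the timing lines quoted later in this conversation.

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <utility>

// Format an elapsed time given in microseconds as "<seconds> s: <label>",
// keeping microsecond precision (six decimal places).
inline std::string format_timing(long long microseconds, const std::string& label)
{
  char buffer[32];
  std::snprintf(buffer, sizeof(buffer), "%.6f s: ",
                static_cast<double>(microseconds) / 1e6);
  return std::string(buffer) + label;
}

// Hypothetical RAII timer: starts on construction, prints on destruction.
class scoped_cpu_timer_sketch
{
  std::string label_;
  std::chrono::steady_clock::time_point start_;

public:
  explicit scoped_cpu_timer_sketch(std::string label)
      : label_(std::move(label))
      , start_(std::chrono::steady_clock::now())
  {}

  scoped_cpu_timer_sketch(const scoped_cpu_timer_sketch&)            = delete;
  scoped_cpu_timer_sketch& operator=(const scoped_cpu_timer_sketch&) = delete;

  ~scoped_cpu_timer_sketch()
  {
    const auto stop = std::chrono::steady_clock::now();
    const auto us =
      std::chrono::duration_cast<std::chrono::microseconds>(stop - start_).count();
    std::printf("%s\n", format_timing(static_cast<long long>(us), label_).c_str());
  }
};

// Hypothetical opt-in macro: compiles to a no-op unless the switch is defined.
#ifdef SKETCH_DEBUG_TIMING
#  define SKETCH_CPU_TIMER(label) scoped_cpu_timer_sketch sketch_timer_instance_(label)
#else
#  define SKETCH_CPU_TIMER(label) static_cast<void>(0)
#endif
```

Placing `SKETCH_CPU_TIMER("host sort");` at the top of a scope would then time that scope only when the build defines the switch, so regular test runs pay nothing.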
/home/coder/cccl/cub/test/catch2_segmented_sort_helper.cuh(503): error #174-D: expression has no effect
for (bool stable_sort : {unstable, stable})
{
  for (bool sort_buffers : {pointers, double_buffer})
suggestion: as part of the Catch2 transition, I'd like to get away from this kind of manual dispatch. The motivation is the following:
- This code is invoked for a few type combinations and input patterns. Stable sort, for instance, is no different from the unstable one at the moment. For the buffers / pointers API alternatives, the code path difference is also trivial. I'd like to keep a wide test (covering more than one type combination) for double buffers / stable, and one type / a few inputs for everything else.
- Tests of distinct functionality should be in distinct test suites, so that there's no way for one test to affect the state of another test.
I agree in general with these goals, but would prefer to keep this particular test as-is.
These are combined into the same test case to reduce execution time. The device sort + validation takes trivial time, while sorting the host reference is expensive. Combining the test cases like this allows the expensive reference solution to be reused across multiple variants, so each variant can be tested very cheaply. Taking the slowest test that passes through this function:
0.000686 s: generate_random_offsets
0.003239 s: Allocate device memory
0.000466 s: generate_random_unsorted_inputs
0.007262 s: D->H input arrays
0.002009 s: Clone input arrays on device
0.514597 s: host_sort_random_inputs
0.006783 s: H->D reference arrays
0.004699 s: cub::DeviceSegmentedSort
0.001485 s: validate_sorted_random_outputs
0.000456 s: Reset input/output device arrays
0.002804 s: cub::DeviceSegmentedSort
0.000972 s: validate_sorted_random_outputs
0.000453 s: Reset input/output device arrays
0.004621 s: cub::DeviceSegmentedSort
0.001034 s: validate_sorted_random_outputs
0.000417 s: Reset input/output device arrays
0.002716 s: cub::DeviceSegmentedSort
0.001026 s: validate_sorted_random_outputs
0.557 s: "DeviceSegmentedSortPairs: Randomly sized segments, random keys/values"(0) - types_161 - 2
Preparing the input and reference arrays takes 535ms, while running the device algorithm and verifying the outputs 4 times with different API options takes 20ms total, roughly 1% of the total reference sort time each -- they're practically free :)
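The structure behind those numbers (compute the expensive reference once, then loop over the cheap API variants) can be sketched in plain host C++ as below. This is only an illustration: `host_segmented_sort`, `sort_under_test`, and `count_passing_variants` are hypothetical names, and `sort_under_test` is a CPU stand-in for the `cub::DeviceSegmentedSort` calls, not the actual test helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Reference solution: sort each segment [offsets[i], offsets[i + 1]) independently.
inline std::vector<int> host_segmented_sort(std::vector<int> keys,
                                            const std::vector<std::size_t>& offsets)
{
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i)
  {
    std::sort(keys.begin() + static_cast<std::ptrdiff_t>(offsets[i]),
              keys.begin() + static_cast<std::ptrdiff_t>(offsets[i + 1]));
  }
  return keys;
}

// Hypothetical stand-in for the algorithm under test. The real test would
// launch the device sort here; the flags mirror the loop variables above.
inline std::vector<int> sort_under_test(std::vector<int> keys,
                                        const std::vector<std::size_t>& offsets,
                                        bool stable_sort,
                                        bool sort_buffers)
{
  (void) sort_buffers; // the stand-in has one code path for both API flavours
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i)
  {
    const auto first = keys.begin() + static_cast<std::ptrdiff_t>(offsets[i]);
    const auto last  = keys.begin() + static_cast<std::ptrdiff_t>(offsets[i + 1]);
    if (stable_sort) { std::stable_sort(first, last); }
    else             { std::sort(first, last); }
  }
  return keys;
}

// Compute the expensive reference once, then validate every variant against it.
inline int count_passing_variants(const std::vector<int>& input,
                                  const std::vector<std::size_t>& offsets)
{
  const std::vector<int> reference = host_segmented_sort(input, offsets);

  int passed = 0;
  for (bool stable_sort : {false, true})
  {
    for (bool sort_buffers : {false, true})
    {
      // Each variant starts from a fresh copy of the unsorted input.
      if (sort_under_test(input, offsets, stable_sort, sort_buffers) == reference)
      {
        ++passed;
      }
    }
  }
  return passed;
}
```

With the timings quoted above, splitting the four variants into separate test cases would repeat the roughly 535 ms of setup for each one, while the combined loop adds only about 5 ms per variant.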
Also, this is only for the tests with true random inputs. The tests with derived inputs execute each device sort as a separate test case.
Verified the new unstable test logic against #1552.
Description
closes #1380
Ports CUB's device segmented sort test to Catch2.