Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Retune radix sort, run length encoding, reduce by key, scan, select if, and histogram for SM70 and SM80 #208

Closed
wants to merge 1 commit into from

Conversation

brycelelbach
Copy link
Collaborator

No description provided.

@brycelelbach brycelelbach added this to the 1.11.0 milestone Oct 5, 2020
typedef AgentRadixSortDownsweepPolicy <192, 39, DominantT, BLOCK_LOAD_TRANSPOSE, LOAD_DEFAULT, RADIX_RANK_MEMOIZE, BLOCK_SCAN_WARP_SCANS, SEGMENTED_RADIX_BITS> SegmentedPolicy;
typedef AgentRadixSortDownsweepPolicy <384, 11, DominantT, BLOCK_LOAD_TRANSPOSE, LOAD_DEFAULT, RADIX_RANK_MEMOIZE, BLOCK_SCAN_WARP_SCANS, SEGMENTED_RADIX_BITS - 1> AltSegmentedPolicy;
typedef AgentRadixSortDownsweepPolicy <128, 39, DominantT, BLOCK_LOAD_TRANSPOSE, LOAD_DEFAULT, RADIX_RANK_MEMOIZE, BLOCK_SCAN_WARP_SCANS, SEGMENTED_RADIX_BITS> SegmentedPolicy;
typedef AgentRadixSortDownsweepPolicy <256, 11, DominantT, BLOCK_LOAD_TRANSPOSE, LOAD_DEFAULT, RADIX_RANK_MEMOIZE, BLOCK_SCAN_WARP_SCANS, SEGMENTED_RADIX_BITS - 1> AltSegmentedPolicy;
};
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should add an SM80 policy?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Andy's PR adds one. Dunno if he tuned the non-onesweep parameters or just pulled them from the SM70 tunings.

cc: @canonizer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the Policy800 in my PR, I copied the parameters from Policy700 and only tuned the onesweep-related parameters.

Retune radix sort, run length encoding, reduce by key, scan, select if, and
histogram for SM70 and SM80.
@alliepiper alliepiper force-pushed the feature/sm70-and-sm80-retune branch from 8f881de to a3d89dc Compare November 2, 2020 21:43
@alliepiper
Copy link
Collaborator

DVS CL: 29273912

@alliepiper alliepiper added the testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). label Nov 2, 2020
canonizer added a commit to canonizer/cub that referenced this pull request Nov 2, 2020
@alliepiper alliepiper added the testing: gpuCI in progress Started gpuCI testing. label Nov 3, 2020
@alliepiper
Copy link
Collaborator

The device_histogram.thorough tests are failing on GV100 with the new SM70 tunings.

7985: CUB cub::DeviceHistogramEven (pointer) 2073600 pixels (1080 height, 1920 width, 7693-byte row stride), 8308440 1-byte h samples (entropy reduction 0), i counters, 3/4 channels, max sample 256
7985:
7985:   Channel 0: 256 bins [0, 256)
7985:
7985:   Channel 1: 128 bins [64, 192)
7985:
7985:   Channel 2: 64 bins [96, 160)
7985: CUDA error 716 [/home/av/code/src/thrust/dependencies/cub/cub/device/dispatch/../../agent/../block/../iterator/../util_device.cuh, 564]: misaligned address
7985: CUDA error 716 [/home/av/code/src/thrust/dependencies/cub/cub/device/dispatch/dispatch_histogram.cuh, 672]: misaligned address
7985: CUDA error 716 [/home/av/code/src/thrust/dependencies/cub/cub/device/dispatch/dispatch_histogram.cuh, 1048]: misaligned address
7985: Invoking DeviceHistogramInitKernel<<<1, 256, 0, 0>>>()
7985: Invoking histogram_sweep_kernel<<<{1, 320, 1}, 384, 0, 0>>>(), 5 pixels per thread, 4 SM occupancy
7985: INCORRECT: [0]: 0 != 8
7985: (/home/av/code/src/thrust/dependencies/cub/test/test_device_histogram.cu: 795)
1/1 Test #7985: cub.cpp17.test.device_histogram.thorough ...***Failed   26.55 sec

@alliepiper alliepiper removed testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). testing: gpuCI in progress Started gpuCI testing. labels Nov 3, 2020
@alliepiper alliepiper modified the milestones: 1.11.0, 1.12.0 Nov 12, 2020
@alliepiper
Copy link
Collaborator

The radix sort tunings from this PR caused a perf regression, nvbug 200676467. Those parameters were merged in #204 and reverted by #237.

@alliepiper alliepiper marked this pull request as draft November 30, 2020 19:38
@alliepiper alliepiper modified the milestones: 1.12.0, Backlog Nov 30, 2020
@brycelelbach brycelelbach modified the milestones: Backlog, 1.12.0, 1.13.0 Dec 18, 2020
@brycelelbach brycelelbach added P1: should have Necessary, but not critical. area: performance Does not perform as intended. labels Mar 29, 2021
@jrhemstad jrhemstad added the helps: rapids Helps or needed by RAPIDS. label Apr 20, 2021
@alliepiper alliepiper modified the milestones: 1.13.0, 1.14.0 Jun 9, 2021
@alliepiper alliepiper removed their assignment Jul 1, 2021
@alliepiper alliepiper removed this from the 1.14.0 milestone Aug 17, 2021
@gevtushenko
Copy link
Collaborator

Most of the algorithms have been changed significantly since this PR was opened. We'll have to tune algorithms in a more systematic way. Further work is tracked by the following issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area: performance Does not perform as intended. helps: rapids Helps or needed by RAPIDS. P1: should have Necessary, but not critical.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants