This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Updated thrust shuffle to use improved bijective function #1566

Merged
merged 4 commits into from
Jan 25, 2022

Conversation

djns99
Contributor

@djns99 djns99 commented Nov 10, 2021

Updates the thrust shuffle to use the Variable Philox bijective function
with 24 rounds.

Updates the test suite to include a new test statistic based on maximum mean
discrepancy to enable more thorough testing of larger permutations.

@GPUtester
Collaborator

Can one of the admins verify this patch?

@djns99
Contributor Author

djns99 commented Nov 10, 2021

This merge includes the latest work from https://github.com/djns99/CUDA-Shuffle, which integrates the Variable Philox function. It is faster than the previous round function while maintaining equivalent quality. The merge also adds a new test statistic that targets larger permutation sizes than the existing tests.
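For readers unfamiliar with bijection-based shuffling, here is a minimal sketch of the general idea. It is not the code from this PR: a plain balanced Feistel network stands in for Variable Philox, and `round_function`, `feistel`, and `permuted_index` are hypothetical names chosen for illustration. The network is a bijection on a power-of-two domain, and cycle walking (re-applying the bijection until the result is below `n`) restricts it to `[0, n)`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical round function. A Feistel network is invertible no matter
// what this returns, so any deterministic mixing of (half, key) is valid.
inline std::uint32_t round_function(std::uint32_t half, std::uint32_t key) {
    std::uint32_t x = half ^ key;
    x *= 0x9E3779B1u;
    x ^= x >> 15;
    return x;
}

// Balanced Feistel network over 2 * half_bits bits, i.e. a bijection on
// [0, 2^(2 * half_bits)). One round is applied per key.
inline std::uint64_t feistel(std::uint64_t value, unsigned half_bits,
                             const std::vector<std::uint32_t>& keys) {
    assert(half_bits >= 1 && half_bits <= 16);
    const std::uint32_t mask = (1u << half_bits) - 1u;
    std::uint32_t left = static_cast<std::uint32_t>(value >> half_bits) & mask;
    std::uint32_t right = static_cast<std::uint32_t>(value) & mask;
    for (std::uint32_t key : keys) {
        // (left, right) -> (right, left ^ F(right)); each round is invertible.
        const std::uint32_t next = left ^ (round_function(right, key) & mask);
        left = right;
        right = next;
    }
    return (static_cast<std::uint64_t>(left) << half_bits) | right;
}

// Maps index i in [0, n) to a unique index in [0, n) by cycle walking:
// keep applying the bijection until the result lands inside the range.
inline std::uint64_t permuted_index(std::uint64_t i, std::uint64_t n,
                                    const std::vector<std::uint32_t>& keys) {
    assert(n >= 1 && n <= (1ull << 32) && i < n);
    unsigned half_bits = 1;
    while ((1ull << (2 * half_bits)) < n) {
        ++half_bits;
    }
    do {
        i = feistel(i, half_bits, keys);
    } while (i >= n);
    return i;
}
```

Calling `permuted_index(i, n, keys)` for every `i` in `[0, n)` visits each output position exactly once, so element `i` can be written to its output slot independently of every other element. The number of keys sets the number of rounds; the four-round choice in a quick experiment is arbitrary and says nothing about the statistical quality discussed in this PR.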

Contributor

@RAMitchell RAMitchell left a comment

Thanks for updating this. Are you also looking to figure out why the docs aren't rendering and run the thrust benchmarks to establish the improved performance?

I will take a look at these tests again.

testing/shuffle_mmd.cu (outdated, resolved)
thrust/system/detail/generic/shuffle.inl (outdated, resolved)
@alliepiper
Collaborator

Ping me when you're ready for us to review/merge -- I'll let you two iterate on this until you're happy with it.

Are you also looking to figure out why the docs aren't rendering

The docs are broken until #1475 is finished -- this isn't an issue with your contributions.

run the thrust benchmarks to establish the improved performance?

We'll be removing bench.cu at some point, and are in the process of switching everything over to use an NVBench-based benchmark project for perf testing. We have some basic thrust::shuffle tests already that might be useful.

Just a heads up -- feel free to keep using bench.cu for validation here if you're more comfortable with it, but I find the new framework a bit easier to use.

@alliepiper alliepiper added type: enhancement New feature or request. P2: nice to have Desired, but not necessary. labels Nov 10, 2021
@alliepiper alliepiper added this to the 1.16.0 milestone Nov 10, 2021
Collaborator

@alliepiper alliepiper left a comment

Just a couple of notes from a quick scan through the patch; I'll do another, more detailed pass once it settles.

I'm interested in seeing the perf results!

testing/shuffle_mmd.cu (outdated, resolved)
testing/shuffle_mmd.cu (outdated, resolved)
@djns99
Contributor Author

djns99 commented Nov 14, 2021

  • I have removed the MMD test as suggested by @RAMitchell
  • I have also removed the C++11 ifdefs where I saw them
  • I have updated the minimum number of bits to 4

Running the benchmarks, I found a ~10x performance gain on my 1050 Ti setup, from ~6E+07 elements/s to ~7.5E+08 elements/s, albeit with more variance.

Old:

Thrust Version,Algorithm,Element Type,Element Size,Elements per Trial,Total Input Size,STL Trials,STL Average Walltime,STL Walltime Uncertainty,STL Average Throughput,STL Throughput Uncertainty,Thrust Trials,Thrust Average Walltime,Thrust Walltime Uncertainty,Thrust Average Throughput,Thrust Throughput Uncertainty
,,,bits/element,elements,MiBs,trials,secs,secs,elements/sec,elements/sec,trials,secs,secs,elements/sec,elements/sec
101600,shuffle,char,8,67108864,64,4,2.37,0.04,2.83E+07,500000,16,1.092,0.003,6.15E+07,200000
101600,shuffle,int,32,16777216,64,4,0.61,0.02,2.80E+07,1.00E+06,16,0.276,0.001,6.08E+07,300000
101600,shuffle,int8_t,8,67108864,64,4,2.36,0.04,2.84E+07,500000,16,1.099,0.003,6.11E+07,200000
101600,shuffle,int16_t,16,33554432,64,4,1.21,0.05,2.80E+07,1.00E+06,16,0.555,0.002,6.05E+07,200000
101600,shuffle,int32_t,32,16777216,64,4,0.62,0.03,2.70E+07,1.00E+06,16,0.279,0.002,6.01E+07,500000
101600,shuffle,int64_t,64,8388608,64,4,0.296,0.002,2.84E+07,200000,16,0.145,0.005,5.80E+07,2.00E+06
101600,shuffle,float,32,16777216,64,4,0.61,0.03,2.80E+07,1.00E+06,16,0.279,0.001,6.02E+07,300000
101600,shuffle,double,64,8388608,64,4,0.2919,0.0009,2.87E+07,90000,16,0.145,0.005,5.80E+07,2.00E+06
101600,shuffle,char,8,134217728,128,4,5.37,0.02,2.50E+07,90000,16,2.21,0.02,6.06E+07,600000
101600,shuffle,int,32,33554432,128,4,1.37,0.01,2.45E+07,200000,16,0.558,0.001,6.01E+07,100000
101600,shuffle,int8_t,8,134217728,128,4,5.41,0.05,2.48E+07,200000,16,2.23,0.03,6.02E+07,700000
101600,shuffle,int16_t,16,67108864,128,4,2.8,0.04,2.39E+07,300000,16,1.12,0.02,5.98E+07,900000
101600,shuffle,int32_t,32,33554432,128,4,1.351,0.004,2.48E+07,80000,16,0.57,0.01,5.90E+07,1.00E+06
101600,shuffle,int64_t,64,16777216,128,4,0.7,0.01,2.41E+07,400000,16,0.29,0.01,5.70E+07,2.00E+06
101600,shuffle,float,32,33554432,128,4,1.38,0.01,2.43E+07,300000,16,0.562,0.003,5.97E+07,300000
101600,shuffle,double,64,16777216,128,4,0.69,0.02,2.44E+07,700000,16,0.285,0.002,5.89E+07,400000

New:

Thrust Version,Algorithm,Element Type,Element Size,Elements per Trial,Total Input Size,STL Trials,STL Average Walltime,STL Walltime Uncertainty,STL Average Throughput,STL Throughput Uncertainty,Thrust Trials,Thrust Average Walltime,Thrust Walltime Uncertainty,Thrust Average Throughput,Thrust Throughput Uncertainty
,,,bits/element,elements,MiBs,trials,secs,secs,elements/sec,elements/sec,trials,secs,secs,elements/sec,elements/sec
101600,shuffle,char,8,67108864,64,4,2.341,0.002,2.87E+07,20000,16,0.0793,0.0003,8.46E+08,3.00E+06
101600,shuffle,int,32,16777216,64,4,0.593,0.002,2.83E+07,80000,16,0.0226,0.0003,7.40E+08,1.00E+07
101600,shuffle,int8_t,8,67108864,64,4,2.43,0.09,2.77E+07,1.00E+06,16,0.0797,0.0009,8.42E+08,1.00E+07
101600,shuffle,int16_t,16,33554432,64,4,1.25,0.04,2.69E+07,800000,16,0.045,0.003,7.50E+08,6.00E+07
101600,shuffle,int32_t,32,16777216,64,4,0.61,0.03,2.80E+07,1.00E+06,16,0.0225,0.0001,7.47E+08,4.00E+06
101600,shuffle,int64_t,64,8388608,64,4,0.298,0.001,2.82E+07,100000,16,0.0132,0.0001,6.35E+08,6.00E+06
101600,shuffle,float,32,16777216,64,4,0.59,0.003,2.84E+07,100000,16,0.0227,0.0005,7.40E+08,2.00E+07
101600,shuffle,double,64,8388608,64,4,0.31,0.01,2.70E+07,1.00E+06,16,0.01305,4.00E-05,6.43E+08,2.00E+06
101600,shuffle,char,8,134217728,128,4,5.46,0.01,2.46E+07,60000,16,0.1596,0.0003,8.41E+08,2.00E+06
101600,shuffle,int,32,33554432,128,4,1.39,0.02,2.41E+07,300000,16,0.0446,0.0003,7.52E+08,5.00E+06
101600,shuffle,int8_t,8,134217728,128,4,5.437,0.007,2.47E+07,30000,16,0.1596,0.0003,8.41E+08,2.00E+06
101600,shuffle,int16_t,16,67108864,128,4,2.76,0.03,2.43E+07,300000,16,0.0828,0.0002,8.10E+08,2.00E+06
101600,shuffle,int32_t,32,33554432,128,4,1.39,0.02,2.42E+07,300000,16,0.0447,0.0003,7.51E+08,5.00E+06
101600,shuffle,int64_t,64,16777216,128,4,0.71,0.02,2.37E+07,700000,16,0.0256,0.0005,6.60E+08,1.00E+07
101600,shuffle,float,32,33554432,128,4,1.39,0.03,2.42E+07,400000,16,0.0448,0.0007,7.50E+08,1.00E+07
101600,shuffle,double,64,16777216,128,4,0.7,0.02,2.39E+07,700000,16,0.026,0.0009,6.50E+08,2.00E+07
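To make the comparison between the two tables easier to check, here is a small sketch of the arithmetic. The helper names `throughput` and `speedup` are hypothetical and are not part of bench.cu; the inputs are the "Elements per Trial" and "Thrust Average Walltime" columns above.

```cpp
#include <cassert>

// Throughput in elements per second for one benchmark row.
inline double throughput(double elements, double walltime_secs) {
    assert(walltime_secs > 0.0);
    return elements / walltime_secs;
}

// Speedup of the new run over the old run for the same input size.
inline double speedup(double old_walltime_secs, double new_walltime_secs) {
    assert(new_walltime_secs > 0.0);
    return old_walltime_secs / new_walltime_secs;
}
```

For the first `char` row (67108864 elements), the Thrust walltime drops from 1.092 s to 0.0793 s, which is a speedup of roughly 13.8x. For the 64 MiB `int64_t` row, 0.145 s to 0.0132 s is roughly 11x.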

Contributor

@RAMitchell RAMitchell left a comment

@alliepiper alliepiper assigned alliepiper and unassigned djns99 Nov 29, 2021
Collaborator

@alliepiper alliepiper left a comment

Some of the constructors/c-casts need to be replaced with static casts, otherwise this LGTM.

Once the suggested changes are finished I'll start testing.
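As an illustration of the cast request above, here is a hedged sketch. The functions `high_word` and `low_word` are hypothetical and are not taken from shuffle.inl; they only show the pattern of replacing a functional-style or C-style cast with `static_cast`.

```cpp
#include <cstdint>

// Functional-style casts such as std::uint32_t(x) and C-style casts such as
// (std::uint32_t)x may fall back to a reinterpret_cast or const_cast without
// any diagnostic. static_cast rejects those conversions at compile time.
inline std::uint32_t high_word(std::uint64_t value) {
    // Before (hypothetical): return std::uint32_t(value >> 32);
    return static_cast<std::uint32_t>(value >> 32);
}

inline std::uint32_t low_word(std::uint64_t value) {
    // Before (hypothetical): return (std::uint32_t)(value & 0xFFFFFFFFull);
    return static_cast<std::uint32_t>(value & 0xFFFFFFFFull);
}
```

The behavior is identical for these integer narrowing conversions; the change makes the intent explicit and easier to search for.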

thrust/system/detail/generic/shuffle.inl (outdated, resolved)
thrust/system/detail/generic/shuffle.inl (outdated, resolved)
thrust/system/detail/generic/shuffle.inl (outdated, resolved)
@alliepiper alliepiper assigned djns99 and unassigned alliepiper Nov 30, 2021
djns99 added 3 commits January 4, 2022 13:05
Updates the thrust shuffle to use the Variable Philox bijective function
with 24 rounds.

Updates the test suite to include a new test statistic based on maximum mean
discrepancy to enable more thorough testing of larger permutations.
@djns99
Contributor Author

djns99 commented Jan 4, 2022

Some of the constructors/c-casts need to be replaced with static casts, otherwise this LGTM.

Once the suggested changes are finished I'll start testing.

@allisonvacanti Sorry, it seems this notification slipped through the cracks. I have made those changes now.

@alliepiper alliepiper assigned alliepiper and unassigned djns99 Jan 10, 2022
@alliepiper
Collaborator

Thanks @djns99 -- this LGTM, I'll start testing.

run tests

@alliepiper alliepiper added testing: gpuCI in progress Started gpuCI testing. testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). labels Jan 12, 2022
@alliepiper
Collaborator

DVS CL: 30868010

@alliepiper
Collaborator

Tests are failing; it looks like some of the restrictions on C++11 constexpr functions are being violated. C++11 only allows a single return statement in the body of a constexpr function, and the function must return non-void.
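For anyone hitting the same restriction, here is a minimal sketch of a function written to stay within the C++11 rules. `bits_needed` is a hypothetical example and is not one of the functions from this PR: the loop is expressed as recursion and the branch as a conditional expression, so the body is a single return statement.

```cpp
#include <cstdint>

// C++11-compatible constexpr: the body is exactly one return statement.
// Returns the number of bits needed to represent n (0 for n == 0).
constexpr std::uint64_t bits_needed(std::uint64_t n, std::uint64_t bits = 0) {
    return (n == 0) ? bits : bits_needed(n >> 1, bits + 1);
}

// Evaluated at compile time, so a violation fails the build.
static_assert(bits_needed(0) == 0, "zero needs no bits");
static_assert(bits_needed(1) == 1, "one needs a single bit");
static_assert(bits_needed(255) == 8, "255 fits in 8 bits");
static_assert(bits_needed(256) == 9, "256 needs a ninth bit");

// A loop-based version with a local counter and a while statement would be
// accepted as constexpr from C++14 onward, but it is rejected under C++11.
```

The other option is to drop `constexpr` from the affected functions, which avoids the restriction entirely at the cost of compile-time evaluation.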

Check out the failing results (click "Details" next to a failing build, then "View as plain text" in the sidebar).

You can ignore the failures in icclatest and gcc10, those are currently broken.

@alliepiper alliepiper assigned djns99 and unassigned alliepiper Jan 19, 2022
@alliepiper alliepiper removed testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). testing: gpuCI in progress Started gpuCI testing. labels Jan 19, 2022
@djns99
Contributor Author

djns99 commented Jan 23, 2022

I removed constexpr from the functions that were failing. Is there a flag I need to supply when I run this locally to verify it is fixed? I didn't have any compilation issues on my machine before or after the change.

@djns99 djns99 removed their assignment Jan 23, 2022
@alliepiper
Collaborator

alliepiper commented Jan 24, 2022

The issue is only on C++11 (which is deprecated) and by default CMake only compiles for C++14. To build C++11 tests:

For the default (single config) builds, set the CMake variable THRUST_CPP_DIALECT:

cd <build dir>
cmake -DTHRUST_CPP_DIALECT=11 -DTHRUST_IGNORE_DEPRECATED_CPP_DIALECT=ON .

or for multiconfig builds (which build multiple host/device/dialect tests at once):

cd <build dir>
cmake -DTHRUST_MULTICONFIG_ENABLE_DIALECT_CPP11=ON -DTHRUST_IGNORE_DEPRECATED_CPP_DIALECT=ON .

See https://github.com/NVIDIA/thrust/blob/main/CONTRIBUTING.md#cmake-options for more info.

@alliepiper
Collaborator

DVS CL 30909633

run tests

@alliepiper alliepiper added testing: gpuCI in progress Started gpuCI testing. testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). labels Jan 24, 2022
@alliepiper
Collaborator

Looks good -- remaining test failures are unrelated.

@alliepiper alliepiper merged commit efc5fc8 into NVIDIA:main Jan 25, 2022
Labels
P2: nice to have Desired, but not necessary. testing: gpuCI in progress Started gpuCI testing. testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). type: enhancement New feature or request.
4 participants