Commit
Optimize string gather performance for large strings (#7980)
This PR intends to improve the string gather performance for large strings. There are two kernels implemented:

- String-parallel kernel assigns strings to warps, and each warp collectively copies the characters with a large data type. This kernel is best suited for large strings.
- Char-parallel kernel assigns characters to threads. This is similar to the existing implementation, except this PR uses shared memory and assigns a fixed number of strings per threadblock to improve binary search performance. This kernel is best suited for small strings.

This PR uses one of the two kernels depending on the average string size.

The following benchmark results are collected on V100 through `./gbenchmarks/STRINGS_BENCH --benchmark_filter=StringCopy/gather`

Before this PR at `8a504d19c725e0ff01e28f36e5f1daf02fbf86c4`:

```
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
StringCopy/gather/4096/32/manual_time          0.127 ms        0.151 ms         5269 bytes_per_second=545.365M/s
StringCopy/gather/4096/128/manual_time         0.130 ms        0.154 ms         5135 bytes_per_second=2.0329G/s
StringCopy/gather/4096/512/manual_time         0.156 ms        0.179 ms         4331 bytes_per_second=6.85358G/s
StringCopy/gather/4096/2048/manual_time        0.255 ms        0.277 ms         2731 bytes_per_second=16.711G/s
StringCopy/gather/4096/8192/manual_time        0.650 ms        0.673 ms         1076 bytes_per_second=26.0888G/s
StringCopy/gather/32768/32/manual_time         0.148 ms        0.171 ms         4602 bytes_per_second=3.64833G/s
StringCopy/gather/32768/128/manual_time        0.206 ms        0.228 ms         3345 bytes_per_second=10.3745G/s
StringCopy/gather/32768/512/manual_time        0.438 ms        0.462 ms         1599 bytes_per_second=19.421G/s
StringCopy/gather/32768/2048/manual_time        1.38 ms         1.40 ms          506 bytes_per_second=24.7168G/s
StringCopy/gather/32768/8192/manual_time        5.14 ms         5.16 ms          136 bytes_per_second=26.5093G/s
StringCopy/gather/262144/32/manual_time        0.336 ms        0.358 ms         2082 bytes_per_second=12.8318G/s
StringCopy/gather/262144/128/manual_time       0.878 ms        0.901 ms          795 bytes_per_second=19.4286G/s
StringCopy/gather/262144/512/manual_time        3.05 ms         3.07 ms          229 bytes_per_second=22.3358G/s
StringCopy/gather/262144/2048/manual_time       11.8 ms         11.8 ms           59 bytes_per_second=23.2139G/s
StringCopy/gather/2097152/32/manual_time        2.05 ms         2.07 ms          341 bytes_per_second=16.8261G/s
StringCopy/gather/2097152/128/manual_time       6.96 ms         6.99 ms          100 bytes_per_second=19.6048G/s
StringCopy/gather/2097152/512/manual_time       26.7 ms         26.7 ms           26 bytes_per_second=20.434G/s
StringCopy/gather/16777216/32/manual_time       19.0 ms         19.0 ms           37 bytes_per_second=14.5447G/s
StringCopy/gather/67108864/2/manual_time        34.1 ms         34.2 ms           20 bytes_per_second=2.01153G/s
```

This PR:

```
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
StringCopy/gather/4096/32/manual_time          0.105 ms        0.127 ms         6430 bytes_per_second=660.581M/s
StringCopy/gather/4096/128/manual_time         0.103 ms        0.125 ms         6383 bytes_per_second=2.57612G/s
StringCopy/gather/4096/512/manual_time         0.105 ms        0.126 ms         6249 bytes_per_second=10.2033G/s
StringCopy/gather/4096/2048/manual_time        0.114 ms        0.134 ms         6114 bytes_per_second=37.5549G/s
StringCopy/gather/4096/8192/manual_time        0.155 ms        0.178 ms         4547 bytes_per_second=109.744G/s
StringCopy/gather/32768/32/manual_time         0.109 ms        0.130 ms         6210 bytes_per_second=4.9546G/s
StringCopy/gather/32768/128/manual_time        0.124 ms        0.145 ms         5441 bytes_per_second=17.1911G/s
StringCopy/gather/32768/512/manual_time        0.137 ms        0.159 ms         5057 bytes_per_second=62.082G/s
StringCopy/gather/32768/2048/manual_time       0.209 ms        0.232 ms         3362 bytes_per_second=163.045G/s
StringCopy/gather/32768/8192/manual_time       0.526 ms        0.549 ms         1332 bytes_per_second=259.064G/s
StringCopy/gather/262144/32/manual_time        0.184 ms        0.205 ms         3777 bytes_per_second=23.4435G/s
StringCopy/gather/262144/128/manual_time       0.328 ms        0.349 ms         2132 bytes_per_second=51.986G/s
StringCopy/gather/262144/512/manual_time       0.400 ms        0.421 ms         1751 bytes_per_second=170.506G/s
StringCopy/gather/262144/2048/manual_time      0.965 ms        0.987 ms          725 bytes_per_second=282.969G/s
StringCopy/gather/2097152/32/manual_time        1.10 ms         1.12 ms          637 bytes_per_second=31.35G/s
StringCopy/gather/2097152/128/manual_time       1.92 ms         1.94 ms          364 bytes_per_second=71.1531G/s
StringCopy/gather/2097152/512/manual_time       2.48 ms         2.50 ms          282 bytes_per_second=220.297G/s
StringCopy/gather/16777216/32/manual_time       11.0 ms         11.0 ms           64 bytes_per_second=25.0771G/s
StringCopy/gather/67108864/2/manual_time        33.7 ms         33.7 ms           21 bytes_per_second=2.03768G/s
```

When there are enough strings and string sizes are large (e.g. `StringCopy/gather/262144/2048`), this PR improves throughput from 23.21 GB/s to 282.97 GB/s, which is a 12x improvement.

For large strings, the ncu profile on 524288 strings, with average string size of 2048, shows the kernel takes 3.48ms, so achieved throughput is 308.55GB/s (which is 68.7% of DRAM SOL on V100).

Authors:
- https://github.com/gaohao95

Approvers:
- David Wendt (https://github.com/davidwendt)
- Nikolay Sakharnykh (https://github.com/nsakharnykh)
- Jake Hemstad (https://github.com/jrhemstad)
- Nghia Truong (https://github.com/ttnghia)

URL: #7980
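
The two strategies described in the PR can be illustrated with a small host-side C++ reference. This is only a sketch of the work decomposition, not the PR's CUDA code: the `StringsColumn` struct, all function names, and the threshold `kAssumedLargeStringThreshold` are hypothetical, and the real cutoff value is not stated in the text. Each loop iteration stands in for one GPU work item: a warp in the string-parallel case, a thread in the char-parallel case.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical cutoff in bytes; the PR text does not give the actual value.
constexpr std::size_t kAssumedLargeStringThreshold = 64;

enum class GatherKernel { CharParallel, StringParallel };

// Choose a strategy from the average output string size.
GatherKernel choose_kernel(std::size_t total_chars, std::size_t num_strings) {
  if (num_strings == 0) return GatherKernel::CharParallel;
  const std::size_t avg = total_chars / num_strings;
  return avg > kAssumedLargeStringThreshold ? GatherKernel::StringParallel
                                            : GatherKernel::CharParallel;
}

// Strings stored as offsets plus flat chars:
// string i occupies chars[offsets[i], offsets[i + 1]).
struct StringsColumn {
  std::vector<std::int32_t> offsets;  // num_strings + 1 entries
  std::string chars;
};

StringsColumn make_column(const std::vector<std::string>& strs) {
  StringsColumn col;
  col.offsets.push_back(0);
  for (const auto& s : strs) {
    col.chars += s;
    col.offsets.push_back(static_cast<std::int32_t>(col.chars.size()));
  }
  return col;
}

// Output offsets: prefix sum of the gathered string lengths.
std::vector<std::int32_t> gathered_offsets(const StringsColumn& in,
                                           const std::vector<std::int32_t>& map) {
  std::vector<std::int32_t> out{0};
  for (std::int32_t row : map) {
    assert(row >= 0 && static_cast<std::size_t>(row) + 1 < in.offsets.size());
    out.push_back(out.back() + in.offsets[row + 1] - in.offsets[row]);
  }
  return out;
}

// String-parallel: one work item per output string copies the whole string.
std::string gather_string_parallel(const StringsColumn& in,
                                   const std::vector<std::int32_t>& map,
                                   const std::vector<std::int32_t>& out_offsets) {
  std::string out(static_cast<std::size_t>(out_offsets.back()), '\0');
  for (std::size_t s = 0; s < map.size(); ++s) {
    const std::int32_t src = in.offsets[map[s]];
    const std::int32_t len = out_offsets[s + 1] - out_offsets[s];
    std::copy_n(in.chars.begin() + src, len, out.begin() + out_offsets[s]);
  }
  return out;
}

// Char-parallel: one work item per output character; a binary search over
// the output offsets finds which string the character belongs to.
std::string gather_char_parallel(const StringsColumn& in,
                                 const std::vector<std::int32_t>& map,
                                 const std::vector<std::int32_t>& out_offsets) {
  std::string out(static_cast<std::size_t>(out_offsets.back()), '\0');
  for (std::int32_t c = 0; c < out_offsets.back(); ++c) {
    // First offset strictly greater than c; the owning string is one before it.
    const auto it = std::upper_bound(out_offsets.begin(), out_offsets.end(), c);
    const std::size_t s = static_cast<std::size_t>(it - out_offsets.begin()) - 1;
    const std::int32_t within = c - out_offsets[s];
    out[c] = in.chars[in.offsets[map[s]] + within];
  }
  return out;
}
```

Both functions produce the same output; they differ only in how work is divided. The per-character binary search is the cost that the PR's char-parallel kernel reduces by fixing the number of strings per threadblock and using shared memory, while the string-parallel form avoids the search entirely at the price of poor utilization when strings are short.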