Optimize string gather performance for large strings #7980
Conversation
Note that the string-parallel version with the large datatype could load from memory outside the input string, up to the next 4B-aligned location. I think this is safe since cudaMalloc/RMM allocations are 256B aligned, so the extra bytes outside the input string must belong to the same allocation. Edit: This assumption is dropped. See #7980 (comment).
Codecov Report

```
@@            Coverage Diff             @@
##           branch-21.08   #7980   +/- ##
===============================================
  Coverage              ?   82.83%
===============================================
  Files                 ?      109
  Lines                 ?    17901
  Branches              ?        0
===============================================
  Hits                  ?    14828
  Misses                ?     3073
  Partials              ?        0
===============================================
```

Continue to review the full report at Codecov.
This assumption has come up before. There is no guarantee that the input data is allocated from RMM, since it is just a view to some device memory wrapped in a
Sorry, I think my original comment was poorly worded. I meant to say that the device memory allocation should always be 256B aligned, even if it's not allocated through RMM. The programming guide says:
Does this mean we are safe to read up to the 256B-aligned location, even if it's beyond the end or the start of the input string?
Overall looks good - left some minor comments and a suggestion to try for the possible out-of-bounds issue based on our discussion today.
To avoid reading beyond string boundaries, the start and end locations for the large-datatype loads are calculated to stay within each string.

Performance before 93dd1a3:

Performance after 93dd1a3:
All my comments seem to be addressed, thanks @gaohao95! approving
```cpp
/**
 * @brief Gather characters from the input iterator, with string parallel strategy.
 *
 * This strategy assigns strings to warps so that each warp can cooperatively copy from the input
```
This can be future work, but we should explore using a sub-warp per string by using cooperative groups. Using a whole warp may be overkill compared to using 4, 8 or 16 threads.
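A sketch of what that sub-warp variant could look like with cooperative groups. This illustrates the suggestion only, not the PR's code; the kernel name, parameters, and the 8-thread tile size are all hypothetical:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical kernel: each 8-thread tile (instead of a full 32-thread warp)
// cooperatively copies one gathered string into the output buffer.
__global__ void gather_chars_subwarp(char const* in_chars,
                                     int const* in_offsets,
                                     int const* gather_map,
                                     int const* out_offsets,
                                     char* out_chars,
                                     int num_strings)
{
  constexpr int tile_size = 8;  // 4, 8 or 16, as suggested in the review
  auto tile = cg::tiled_partition<tile_size>(cg::this_thread_block());

  // Global index of this tile across the whole grid.
  int tile_idx = (blockIdx.x * blockDim.x + threadIdx.x) / tile_size;
  if (tile_idx >= num_strings) return;

  int src   = gather_map[tile_idx];
  int begin = in_offsets[src];
  int size  = in_offsets[src + 1] - begin;
  int dst   = out_offsets[tile_idx];

  // Threads of the tile stride over the string's characters.
  for (int i = tile.thread_rank(); i < size; i += tile_size) {
    out_chars[dst + i] = in_chars[begin + i];
  }
}
```

For short-to-medium strings this wastes fewer idle lanes than a full warp per string, at the cost of less coalescing per string.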
Very nice.
Rerun tests.
rerun tests

Pushed to 21.08 to be safe. This is a large rework that we don't want to squeeze in right before release.

rerun tests

@gaohao95, is this good to merge?

I think it's good to go :) Thanks!

@gpucibot merge
This PR intends to improve string gather performance for large strings. There are two kernels implemented.
This PR uses one of the two kernels depending on the average string size.
The following benchmark results are collected on V100 through

```shell
./gbenchmarks/STRINGS_BENCH --benchmark_filter=StringCopy/gather
```
Before this PR, at 8a504d19c725e0ff01e28f36e5f1daf02fbf86c4:

This PR:
When there are enough strings and string sizes are large (e.g. `StringCopy/gather/262144/2048`), this PR improves throughput from 23.21 GB/s to 282.97 GB/s, a 12x improvement. For large strings, the ncu profile on 524288 strings with an average string size of 2048 shows the kernel takes 3.48 ms, so the achieved throughput is 308.55 GB/s (68.7% of DRAM SOL on V100).
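The 308.55 GB/s figure follows directly from the data volume and kernel time reported above (524288 strings × 2048 B moved in 3.48 ms):

```python
# Sanity-check the reported throughput from the ncu numbers above.
num_strings = 524288
avg_size = 2048        # bytes per string
kernel_time_s = 3.48e-3

bytes_moved = num_strings * avg_size          # 1,073,741,824 bytes
throughput_gbs = bytes_moved / kernel_time_s / 1e9
print(round(throughput_gbs, 2))  # -> 308.55
```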