Change stack-based regex state data to use global memory #10600
Conversation
Codecov Report
@@             Coverage Diff              @@
##           branch-22.06   #10600   +/-  ##
================================================
+ Coverage        86.40%   86.45%   +0.04%
================================================
  Files              143      143
  Lines            22448    22491      +43
================================================
+ Hits             19396    19444      +48
+ Misses            3052     3047       -5
Continue to review full report at Codecov.
This is just a first pass of review; I think I'll need one more to fully grasp the import of the changes to the regex internals. I have some possibly naive questions that will probably make sense on a second read, but I'll ask them anyway in case they have non-obvious answers:
- Why did all the functors change from owning a reprog_device to having the operator accept one? Do you observe better memory access patterns (more L1 traffic due to the parameter being cached) by passing it as a const parameter this way?
- Am I correct that you are using prog_idx as the name for the third functor parameter in launch_transform_kernel, while using thread_idx as the name in launch_foreach_kernel, because that more accurately reflects what each thread is acting on? I'm trying to see if there are helpful naming conventions that could help improve the clarity here. (See the generic indexing sketch after this list.)
- I'm not sure that I fully understand the usage of the new load/store methods and the utilization of shared memory. It looks like the initial state of the reprog_device is stored into shared memory from one thread, then it's loaded onto all the threads. Is the hope that, in the cases where the object fits into shared memory, the parameter passed in will be discarded (or at least relegated to a much lower position in the cache hierarchy) once the code starts using the lane 0 version copied into shared memory? Is there no way to make that determination prior to the kernel launch?
This was to be able to put the object into shared-memory. Each thread essentially has its own 'copy' sitting on the stack, though all the data in it is stored in shared-memory. Copying to shared-memory and resolving the object there is done on each thread, so it needs to be passed as a thread parameter now.
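For illustration, a minimal CUDA sketch of that store/load pattern. The prog_view struct below is a simplified stand-in, not the real reprog_device, and its store/load signatures are assumptions made for the example.

```cpp
// Simplified stand-in for the regex program object -- the real reprog_device
// holds instructions, character classes, etc., and its load/store signatures
// differ from this sketch.
struct prog_view {
  int num_insts;
  int const* insts;  // points into global memory, or into shared memory after load()

  // Copy the instruction data into the given (shared-memory) buffer.
  __device__ void store(int* buffer) const
  {
    for (int i = 0; i < num_insts; ++i) { buffer[i] = insts[i]; }
  }
  // Build a per-thread copy whose data pointer references the shared-memory buffer.
  __device__ static prog_view load(prog_view const& d_prog, int const* buffer)
  {
    return prog_view{d_prog.num_insts, buffer};
  }
};

// Each thread ends up with its own stack-resident prog_view, but the data it
// points to lives in shared memory once the thread-0 copy has completed.
__global__ void regex_kernel(prog_view const d_prog, int* d_out, int num_rows)
{
  extern __shared__ int shmem[];
  if (threadIdx.x == 0) { d_prog.store(shmem); }
  __syncthreads();

  auto const prog = prog_view::load(d_prog, shmem);
  auto const idx  = static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);
  if (idx < num_rows) { d_out[idx] = prog.insts[idx % prog.num_insts]; }
}
```

A caller would launch this with a dynamic shared-memory size large enough for the program data, e.g. regex_kernel<<<grid, block, num_insts * sizeof(int), stream>>>(d_prog, d_out, num_rows), and would keep using the global-memory object when the program does not fit.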
Yes, that sounds right.
I'm not sure I'm following your question here.
I'm not following this question.
@davidwendt and I sorted most of this out offline; the tl;dr concerned the global memory buffers. I'm pretty much happy with the current state of the PR, but plan to give it another review with a fresh eye on Monday before approving.
This LGTM now. I would still be interested to see whether these kernels are ever occupancy bound. If they are, then making sure that every thread uses a pointer to the reprog_device in shared memory, rather than having to copy all of its members into a thread-local object, might help you see even more performance gains by reducing register usage. That change can be explored in a future PR though.
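A rough sketch of that alternative, reusing the hypothetical prog_view stand-in from the earlier sketch (again, not the actual reprog_device API): every thread keeps only a pointer to a single block-wide instance held in shared memory.

```cpp
// Variant of the earlier sketch: build one prog_view in shared memory and let
// every thread dereference it, trading per-thread registers for shared loads.
__global__ void regex_kernel_shared_ptr(prog_view const d_prog, int* d_out, int num_rows)
{
  __shared__ prog_view s_prog;    // single block-wide instance
  extern __shared__ int shmem[];  // dynamic region for the instruction data

  if (threadIdx.x == 0) {
    d_prog.store(shmem);
    s_prog = prog_view{d_prog.num_insts, shmem};
  }
  __syncthreads();

  prog_view const* prog = &s_prog;  // thread-local pointer only; members stay in shared memory
  auto const idx = static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);
  if (idx < num_rows) { d_out[idx] = prog->insts[idx % prog->num_insts]; }
}
```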
Thanks to @vyasr for adding a comment explaining the offline discussion. This looks good to me and is a nice win.
@gpucibot merge
All libcudf strings regex calls will use global device memory for state data when evaluating a regex on strings. Previously, separate templated kernels were used to store state data in fixed-size stack memory, depending on the number of instructions resolved from the provided regex pattern. This required the CUDA driver to allocate a large amount of device memory when launching the kernel. That memory is managed by the launcher in the driver and so is not under the control of RMM.
This has been changed to use memory-resource-allocated global device memory to hold and manage the state data, per string per instruction. This is an internal change only and results in no behavior changes. Overall, performance based on the current benchmarks has not changed, though much more memory may be required to execute any of the regex APIs, depending on the number of instructions in the pattern and the total number of strings in the column.
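A hedged sketch of that allocation pattern; the per-instruction state size and the helper name below are illustrative assumptions, not the actual libcudf code. The point is that the working memory scales with strings × instructions and is owned by an RMM memory resource rather than by the CUDA driver.

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cstddef>

// Hypothetical per-string, per-instruction state footprint; the real value
// depends on the state-list layout inside the regex engine.
constexpr std::size_t state_bytes_per_instruction = 2 * sizeof(int);

// Allocate one global-memory working buffer for the whole column instead of
// relying on per-thread stack (local) memory sized by the CUDA driver.
rmm::device_buffer make_regex_working_memory(std::size_t num_strings,
                                             std::size_t num_instructions,
                                             rmm::cuda_stream_view stream,
                                             rmm::mr::device_memory_resource* mr)
{
  auto const buffer_size = num_strings * num_instructions * state_bytes_per_instruction;
  return rmm::device_buffer(buffer_size, stream, mr);  // owned/managed by RMM, not the driver
}
```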
Every effort has been made to not reduce performance from the stack-based approach. Additional optimizations here include copying the reprog_device class data to shared-memory (when it fits). Further optimizations are expected in later PRs as well.
Overall, the compile time of the files that use regex is also faster since only a single kernel is generated instead of 4 in the templated, stack-based implementation.
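For illustration only (the bucket sizes and names below are invented, not the libcudf source), the compile-time win comes roughly from replacing several template instantiations like these with a single kernel:

```cpp
// Illustrative only: the previous approach instantiated the regex kernel once
// per stack-size bucket, multiplying both compile time and the driver-side
// stack reservation; the new approach needs a single instantiation.
template <int max_insts>
__global__ void regex_kernel_stack(/* ... */) { /* state in fixed-size local (stack) memory */ }

__global__ void regex_kernel_global(/* ... */) { /* state in an RMM-allocated global buffer */ }

void launch_regex_old(int num_instructions)
{
  // Old: choose one of four template instantiations based on instruction count.
  if (num_instructions <= 32)       { regex_kernel_stack<32><<<64, 256>>>(); }
  else if (num_instructions <= 64)  { regex_kernel_stack<64><<<64, 256>>>(); }
  else if (num_instructions <= 128) { regex_kernel_stack<128><<<64, 256>>>(); }
  else                              { regex_kernel_stack<256><<<64, 256>>>(); }
}

void launch_regex_new(int /*num_instructions*/)
{
  // New: one kernel regardless of instruction count.
  regex_kernel_global<<<64, 256>>>();
}
```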
This PR is dependent on PR #10573.