
Change stack-based regex state data to use global memory #10600

Merged

Conversation

davidwendt
Contributor

All libcudf strings regex calls will use global device memory for state data when evaluating regex on strings. Previously, separate templated kernels were used to store state data in fixed-size stack memory, selected by the number of instructions resolved from the provided regex pattern. This required the CUDA driver to allocate a large amount of device memory when launching the kernel. That memory is managed by the launcher in the driver and so is not under the control of RMM.

This has been changed to use memory-resource-allocated global device memory to hold and manage the state data, per string per instruction. This is an internal change only and results in no behavior changes. Overall, performance on the current benchmarks is unchanged, though much more memory may be required to execute any of the regex APIs, depending on the number of instructions in the pattern and the total number of strings in the column.

Every effort has been made to avoid reducing performance relative to the stack-based approach. Additional optimizations here include copying the reprog_device class data to shared memory (when it fits). Further optimizations are expected in later PRs.

Overall, compiling the files that use regex is also faster, since only a single kernel is generated instead of the four generated by the templated, stack-based implementation.
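
For illustration, a minimal sketch of the before/after (the kernel names, state layout, and two-bytes-per-instruction sizing are simplified assumptions, not the actual libcudf code):

```cpp
// Before: state lives in fixed-size stack (local) memory, requiring a
// separate kernel instantiation per supported instruction-count tier.
template <int MAX_INSTS>
__global__ void regex_kernel_stack(/* strings column, results, ... */)
{
  unsigned char state[MAX_INSTS * 2];  // sized at compile time
  // ... evaluate the regex for this thread's string using `state` ...
}

// After: a single kernel; state lives in an RMM-allocated global buffer,
// sliced per string (and, within each slice, per instruction).
__global__ void regex_kernel_global(unsigned char* d_state,
                                    size_t state_bytes_per_string)
{
  auto const idx = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned char* state = d_state + idx * state_bytes_per_string;
  // ... evaluate the regex for this thread's string using `state` ...
}
```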

This PR is dependent on PR #10573.

@davidwendt davidwendt added the 2 - In Progress (currently a work in progress), libcudf (affects libcudf C++/CUDA code), strings (strings issues, C++ and Python), improvement (improvement/enhancement to an existing function), and non-breaking (non-breaking change) labels Apr 5, 2022
@davidwendt davidwendt self-assigned this Apr 5, 2022
@codecov

codecov bot commented Apr 5, 2022

Codecov Report

Merging #10600 (a6922f4) into branch-22.06 (8d861ce) will increase coverage by 0.04%.
The diff coverage is 96.29%.

❗ Current head a6922f4 differs from the pull request's most recent head c2d6b05. Consider uploading reports for the commit c2d6b05 to get more accurate results.

@@               Coverage Diff                @@
##           branch-22.06   #10600      +/-   ##
================================================
+ Coverage         86.40%   86.45%   +0.04%     
================================================
  Files               143      143              
  Lines             22448    22491      +43     
================================================
+ Hits              19396    19444      +48     
+ Misses             3052     3047       -5     
Impacted Files Coverage Δ
python/cudf/cudf/core/indexed_frame.py 91.70% <ø> (ø)
python/cudf/cudf/core/dataframe.py 93.77% <96.29%> (+0.08%) ⬆️
python/cudf/cudf/core/column/string.py 89.21% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.79% <0.00%> (+0.22%) ⬆️
python/cudf/cudf/core/column/numerical.py 96.17% <0.00%> (+0.29%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
python/cudf/cudf/core/column/lists.py 92.91% <0.00%> (+0.83%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@davidwendt davidwendt requested a review from jrhemstad April 21, 2022 20:38
Contributor

@vyasr vyasr left a comment

This is just a first pass of review; I think I'll need one more to fully grasp the import of the changes to the regex internals. I have some probably naive questions that will likely make sense on a second read, but I'll ask them anyway in case they have non-obvious answers:

  • Why did all the functors change from owning a reprog_device to having the operator accept one? Do you observe better memory access patterns (more L1 traffic due to the parameter being cached) by passing it as a const parameter this way?
  • Am I correct that you are using prog_idx as the name for the third functor parameter in launch_transform_kernel, while using thread_idx in launch_foreach_kernel, because that more accurately reflects what each thread is acting on? I'm trying to see if there are helpful naming conventions that could help improve the clarity here.
  • I'm not sure that I fully understand the usage of the new load/store methods and the utilization of shared memory. It looks like the initial state of the reprog_device is stored into shared memory from one thread, then it's loaded onto all the threads. Is the hope that in the cases where the object fits into shared memory the parameter passed in will be discarded (or at least relegated to a much lower position in the cache hierarchy) once the code starts using the lane 0 version copied into shared memory? Is there no way to make that determination prior to the kernel launch?

@davidwendt
Contributor Author

davidwendt commented Apr 28, 2022

  • Why did all the functors change from owning a reprog_device to having the operator accept one? Do you observe better memory access patterns (more L1 traffic due to the parameter being cached) by passing it as a const parameter this way?

This was to be able to put the object into shared memory. Each thread essentially has its own 'copy' sitting on the stack, though all the data in it is stored in shared memory. Copying to shared memory and resolving the object there is done on each thread, so it needs to be passed as a thread parameter now.
Yes, we observed better memory access patterns after changing this to const, though it was done in conjunction with other similar changes.
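
A sketch of the shape of this change, with the functor and member names invented for illustration (the actual libcudf functors differ):

```cpp
// Stand-in for libcudf's reprog_device: a handful of integers and pointers
// into device-memory instruction/class buffers.
struct reprog_device {
  int num_insts;
  void const* insts;  // illustrative members only
};

// Before: each functor owned its own reprog_device, so every thread used
// the copy embedded in the functor (backed by global memory).
struct match_fn_before {
  reprog_device prog;                   // owned copy
  __device__ void operator()(int idx);  // uses the owned `prog`
};

// After: the call operator receives the program, so the kernel can first
// re-seat its data in shared memory and hand every thread a wrapper over
// that block-local copy.
struct match_fn_after {
  __device__ void operator()(int idx, reprog_device const prog);
};
```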

@davidwendt
Contributor Author

  • Am I correct that you are using prog_idx as the name for the third functor parameter in launch_transform_kernel, while using thread_idx in launch_foreach_kernel, because that more accurately reflects what each thread is acting on? I'm trying to see if there are helpful naming conventions that could help improve the clarity here.

Yes, that sounds right. The reprog_device is sort of ignorant of threads. It just needs an index to locate where to put its state data in the global state memory buffer.
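
For example, a hypothetical helper showing that use of the index (the names here are illustrative, not the actual code):

```cpp
// The index only selects this program instance's slice of the global
// state buffer; it carries no thread identity.
__device__ unsigned char* state_for(unsigned char* d_state, int prog_idx,
                                    unsigned long long bytes_per_instance)
{
  return d_state + prog_idx * bytes_per_instance;
}
```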

@davidwendt
Contributor Author

davidwendt commented Apr 28, 2022

  • I'm not sure that I fully understand the usage of the new load/store methods and the utilization of shared memory. It looks like the initial state of the reprog_device is stored into shared memory from one thread, then it's loaded onto all the threads. Is the hope that in the cases where the object fits into shared memory the parameter passed in will be discarded (or at least relegated to a much lower position in the cache hierarchy) once the code starts using the lane 0 version copied into shared memory? Is there no way to make that determination prior to the kernel launch?

I'm not sure I'm following your question here. The reprog_device data is stored in global memory and must be copied to shared-memory. The usual pattern is for the first thread of the block to do this work and sync. The extra step here is to create an appropriate stack variable that wraps this data for each of the block threads.
If the object data does not fit in shared memory, no shared memory is created at launch and the load/store functions do nothing.
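
Concretely, the pattern is roughly the following sketch (the actual store/load signatures live in cpp/src/strings/regex/regex.cuh and may differ; here they are assumed to no-op when no shared memory was requested at launch):

```cpp
__global__ void regex_kernel(reprog_device const d_prog /* , ... */)
{
  extern __shared__ unsigned char shmem[];  // zero bytes when d_prog won't fit

  // First thread of the block copies the program data into shared memory.
  if (threadIdx.x == 0) { d_prog.store(shmem); }
  __syncthreads();

  // Every thread builds a small stack-resident wrapper over the
  // shared-memory data, or falls back to d_prog if nothing was stored.
  auto const prog = reprog_device::load(d_prog, shmem);

  // ... evaluate the regex using `prog` ...
}
```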

@davidwendt davidwendt requested a review from vyasr April 28, 2022 21:57
@vyasr
Contributor

vyasr commented Apr 29, 2022

I'm not sure I'm following your question here.

When d_prog is not too big to fit in shared memory, thread 0 will store its copy into shared memory and then all the other threads will load that object back in. I have a couple of questions:

  1. Whether or not d_prog will fit in shared memory is a runtime constant, not something known at compile time. This optimization relies on the runtime recognizing that d_prog is not accessed on any thread > 0 and therefore not accessing it in global memory, right? Otherwise you'll still initially load d_prog from global memory on all threads. Is that something we can rely on the runtime doing correctly, or is the idea that even with that initial load of d_prog from global memory it's worthwhile to access shared memory in all subsequent accesses to s_prog because the original d_prog will immediately be evicted from the cache if it goes unused?
  2. load returns an object by value instead of returning a pointer. I think this is necessary to support the first scenario where the object isn't in shared memory so you have to return it by value. However, that also means that you're not maximizing memory savings in the shared memory case because now every thread has a local copy of all the members of reprog_device. Couldn't you rework this to avoid that extra memory usage by using a pointer to a reprog_device as the local variable in the kernel and then conditionally pointing that directly into the shared memory buffer if the write to shared memory succeeded? Are the buffers pointed to by _insts etc really so much larger that we don't care about the bytes that we could save by not having the other variables present on each thread?

@davidwendt
Contributor Author

  1. Whether or not d_prog will fit in shared memory is a runtime constant, not something known at compile time.

The d_prog object is generated based on the given regex pattern, so its size is only known at runtime.

This optimization relies on the runtime recognizing that d_prog is not accessed on any thread > 0 and therefore not accessing it in global memory, right?

I'm not following this question. The d_prog is accessed by all threads (after the syncthreads). The first thread in the block loads the d_prog data into shared-memory. This is a common usage pattern for loading shared-memory.

Otherwise you'll still initially load d_prog from global memory on all threads. Is that something we can rely on the runtime doing correctly, or is the idea that even with that initial load of d_prog from global memory it's worthwhile to access shared memory in all subsequent accesses to s_prog because the original d_prog will immediately be evicted from the cache if it goes unused?

The d_prog will either fit into shared memory or not. If it fits, the store copies the d_prog into the shared-memory buffer and the load will give a new d_prog using the shared memory. If it does not fit, the store does nothing and the load will just return whatever is passed to it. Either way, the d_prog returned from load is valid for every thread to use.
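
Host-side, that fit check happens once at launch time. A sketch under assumed names (`required_size` is a hypothetical accessor; `regex_kernel` is the sketch from above):

```cpp
// Request dynamic shared memory only when the program data fits.
int max_shmem = 0;
cudaDeviceGetAttribute(&max_shmem, cudaDevAttrMaxSharedMemoryPerBlock, /*device=*/0);

auto const prog_size  = d_prog.required_size();  // hypothetical accessor
auto const shmem_size = prog_size <= static_cast<size_t>(max_shmem) ? prog_size : 0;

// With 0 bytes requested, store()/load() degrade to no-ops and every thread
// simply uses the global-memory d_prog directly.
regex_kernel<<<num_blocks, block_size, shmem_size, stream>>>(d_prog /* , ... */);
```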

@vyasr
Contributor

vyasr commented Apr 29, 2022

@davidwendt and I sorted most of this out offline. tl;dr the global memory buffers that d_prog points to are much larger than the object itself, which is essentially a handful of integers and pointers. Most of my suggestions boiled down to asking whether we could avoid d_prog being copied/used on all threads, but since the object is so small it's not worthwhile to try to optimize that out. The shared memory buffer is already doing the important work. The important part is reducing global memory accesses, whereas my changes would just be helping to slightly reduce stack variables (and therefore register usage), which is not currently a bottleneck.

I'm pretty much happy with the current state of the PR, but plan to give it another review with a fresh eye on Monday before approving.

Contributor

@vyasr vyasr left a comment

This LGTM now. I would still be interested to see whether these kernels are ever occupancy bound. If they are, then making sure that every thread uses a pointer to the reprog_device in shared memory rather than having to copy all of its members into a thread-local object might help you see even more performance gains by reducing register usage. That change can be explored in a future PR though.

Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Thanks to @vyasr for adding a comment explaining the offline discussion. This looks good to me and is a nice win.

@davidwendt
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit de0f7e0 into rapidsai:branch-22.06 May 6, 2022
@davidwendt davidwendt deleted the regex-global-memory-state branch May 6, 2022 11:59
Labels
3 - Ready for Review (ready for review by team), improvement (improvement/enhancement to an existing function), libcudf (affects libcudf C++/CUDA code), non-breaking (non-breaking change), strings (strings issues, C++ and Python)