
Change stack-based regex state data to use global memory #10600

Merged

Conversation

davidwendt
Contributor

All libcudf strings regex calls will use global device memory for state data when evaluating regex on strings. Previously, separate templated kernels were used to store state data in fixed-size stack memory, selected by the number of instructions resolved from the provided regex pattern. This required the CUDA driver to allocate a large amount of device memory when launching the kernel. That memory is managed by the launcher in the driver and so is not under the control of RMM.

This has been changed to use memory-resource-allocated global device memory to hold and manage the state data, per string per instruction. This is an internal change only and results in no behavior changes. Overall, performance on the current benchmarks is unchanged, though much more memory may be required to execute any of the regex APIs, depending on the number of instructions in the pattern and the total number of strings in the column.

Every effort has been made to avoid reducing performance relative to the stack-based approach. Additional optimizations here include copying the reprog_device class data to shared memory (when it fits). Further optimizations are expected in later PRs.

Overall, compiling the files that use regex is also faster, since only a single kernel is generated instead of the four generated by the templated, stack-based implementation.
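
For illustration, a minimal sketch of the before/after (the kernel names, state layout, and two-bytes-per-instruction sizing are simplified assumptions, not the actual libcudf code):

```cpp
// Before: state lives in fixed-size stack (local) memory, requiring a
// separate kernel instantiation per supported instruction-count tier.
template <int MAX_INSTS>
__global__ void regex_kernel_stack(/* strings column, results, ... */)
{
  unsigned char state[MAX_INSTS * 2];  // sized at compile time
  // ... evaluate the regex for this thread's string using `state` ...
}

// After: a single kernel; state lives in an RMM-allocated global buffer,
// sliced per string (and, within each slice, per instruction).
__global__ void regex_kernel_global(unsigned char* d_state,
                                    size_t state_bytes_per_string)
{
  auto const idx = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned char* state = d_state + idx * state_bytes_per_string;
  // ... evaluate the regex for this thread's string using `state` ...
}
```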

This PR is dependent on PR #10573.

@davidwendt davidwendt added the 2 - In Progress (currently a work in progress), libcudf (affects libcudf C++/CUDA code), strings (strings issues, C++ and Python), improvement (improvement/enhancement to an existing function), and non-breaking (non-breaking change) labels Apr 5, 2022
@davidwendt davidwendt self-assigned this Apr 5, 2022
@codecov

codecov bot commented Apr 5, 2022

Codecov Report

Merging #10600 (a6922f4) into branch-22.06 (8d861ce) will increase coverage by 0.04%.
The diff coverage is 96.29%.

❗ Current head a6922f4 differs from the pull request's most recent head c2d6b05. Consider uploading reports for the commit c2d6b05 to get more accurate results.

@@               Coverage Diff                @@
##           branch-22.06   #10600      +/-   ##
================================================
+ Coverage         86.40%   86.45%   +0.04%     
================================================
  Files               143      143              
  Lines             22448    22491      +43     
================================================
+ Hits              19396    19444      +48     
+ Misses             3052     3047       -5     
Impacted Files Coverage Δ
python/cudf/cudf/core/indexed_frame.py 91.70% <ø> (ø)
python/cudf/cudf/core/dataframe.py 93.77% <96.29%> (+0.08%) ⬆️
python/cudf/cudf/core/column/string.py 89.21% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.79% <0.00%> (+0.22%) ⬆️
python/cudf/cudf/core/column/numerical.py 96.17% <0.00%> (+0.29%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
python/cudf/cudf/core/column/lists.py 92.91% <0.00%> (+0.83%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@davidwendt davidwendt requested a review from jrhemstad April 21, 2022 20:38
Contributor

@vyasr vyasr left a comment

This is just a first pass of review; I think I'll need one more to fully grasp the import of the changes to the regex internals. I have some probably naive questions that will likely make sense on a second read, but I'll ask them anyway in case they have non-obvious answers:

  • Why did all the functors change from owning a reprog_device to having the operator accept one? Do you observe better memory access patterns (more L1 traffic due to the parameter being cached) by passing it as a const parameter this way?
  • Am I correct that you are using prog_idx as the name for the third functor parameter in launch_transform_kernel, while using thread_idx in launch_foreach_kernel, because that more accurately reflects what each thread is acting on? I'm trying to see if there are helpful naming conventions that could help improve the clarity here.
  • I'm not sure that I fully understand the usage of the new load/store methods and the utilization of shared memory. It looks like the initial state of the reprog_device is stored into shared memory from one thread, then it's loaded onto all the threads. Is the hope that in the cases where the object fits into shared memory the parameter passed in will be discarded (or at least relegated to a much lower position in the cache hierarchy) once the code starts using the lane 0 version copied into shared memory? Is there no way to make that determination prior to the kernel launch?

@davidwendt
Contributor Author

davidwendt commented Apr 28, 2022

  • Why did all the functors change from owning a reprog_device to having the operator accept one? Do you observe better memory access patterns (more L1 traffic due to the parameter being cached) by passing it as a const parameter this way?

This was to be able to put the object into shared memory. Each thread essentially has its own 'copy' sitting on the stack, though all the data in it is stored in shared memory. Copying to shared memory and resolving the object there is done on each thread, so it needs to be passed as a thread parameter now.
Yes, we observed better memory access patterns after changing this to const, though it was done in conjunction with other similar changes.
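
A sketch of the shape of this change, with the functor and member names invented for illustration (the actual libcudf functors differ):

```cpp
// Stand-in for libcudf's reprog_device: a handful of integers and pointers
// into device-memory instruction/class buffers.
struct reprog_device {
  int num_insts;
  void const* insts;  // illustrative members only
};

// Before: each functor owned its own reprog_device, so every thread used
// the copy embedded in the functor (backed by global memory).
struct match_fn_before {
  reprog_device prog;                   // owned copy
  __device__ void operator()(int idx);  // uses the owned `prog`
};

// After: the call operator receives the program, so the kernel can first
// re-seat its data in shared memory and hand every thread a wrapper over
// that block-local copy.
struct match_fn_after {
  __device__ void operator()(int idx, reprog_device const prog);
};
```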

@davidwendt
Contributor Author

  • Am I correct that you are using prog_idx as the name for the third functor parameter in launch_transform_kernel, while using thread_idx in launch_foreach_kernel, because that more accurately reflects what each thread is acting on? I'm trying to see if there are helpful naming conventions that could help improve the clarity here.

Yes, that sounds right. The reprog_device is sort of ignorant of threads. It just needs an index to locate where to put its state data in the global state memory buffer.
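
For example, a hypothetical helper showing that use of the index (the names here are illustrative, not the actual code):

```cpp
// The index only selects this program instance's slice of the global
// state buffer; it carries no thread identity.
__device__ unsigned char* state_for(unsigned char* d_state, int prog_idx,
                                    unsigned long long bytes_per_instance)
{
  return d_state + prog_idx * bytes_per_instance;
}
```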

@davidwendt
Contributor Author

davidwendt commented Apr 28, 2022

  • I'm not sure that I fully understand the usage of the new load/store methods and the utilization of shared memory. It looks like the initial state of the reprog_device is stored into shared memory from one thread, then it's loaded onto all the threads. Is the hope that in the cases where the object fits into shared memory the parameter passed in will be discarded (or at least relegated to a much lower position in the cache hierarchy) once the code starts using the lane 0 version copied into shared memory? Is there no way to make that determination prior to the kernel launch?

I'm not sure I'm following your question here. The reprog_device data is stored in global memory and must be copied to shared-memory. The usual pattern is for the first thread of the block to do this work and sync. The extra step here is to create an appropriate stack variable that wraps this data for each of the block threads.
If the object data does not fit in shared memory, no shared memory is created at launch and the load/store functions do nothing.
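
Concretely, the pattern is roughly the following sketch (the actual store/load signatures live in cpp/src/strings/regex/regex.cuh and may differ; here they are assumed to no-op when no shared memory was requested at launch):

```cpp
__global__ void regex_kernel(reprog_device const d_prog /* , ... */)
{
  extern __shared__ unsigned char shmem[];  // zero bytes when d_prog won't fit

  // First thread of the block copies the program data into shared memory.
  if (threadIdx.x == 0) { d_prog.store(shmem); }
  __syncthreads();

  // Every thread builds a small stack-resident wrapper over the
  // shared-memory data, or falls back to d_prog if nothing was stored.
  auto const prog = reprog_device::load(d_prog, shmem);

  // ... evaluate the regex using `prog` ...
}
```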

@davidwendt davidwendt requested a review from vyasr April 28, 2022 21:57
@vyasr
Contributor

vyasr commented Apr 29, 2022

I'm not sure I'm following your question here.

When d_prog is not too big to fit in shared memory, thread 0 will store its copy into shared memory and then all the other threads will load that object back in. I have a couple of questions:

  1. Whether or not d_prog will fit in shared memory is a runtime constant, not something known at compile time. This optimization relies on the runtime recognizing that d_prog is not accessed on any thread > 0 and therefore not accessing it in global memory, right? Otherwise you'll still initially load d_prog from global memory on all threads. Is that something we can rely on the runtime doing correctly, or is the idea that even with that initial load of d_prog from global memory it's worthwhile to access shared memory in all subsequent accesses to s_prog because the original d_prog will immediately be evicted from the cache if it goes unused?
  2. load returns an object by value instead of returning a pointer. I think this is necessary to support the first scenario where the object isn't in shared memory so you have to return it by value. However, that also means that you're not maximizing memory savings in the shared memory case because now every thread has a local copy of all the members of reprog_device. Couldn't you rework this to avoid that extra memory usage by using a pointer to a reprog_device as the local variable in the kernel and then conditionally pointing that directly into the shared memory buffer if the write to shared memory succeeded? Are the buffers pointed to by _insts etc really so much larger that we don't care about the bytes that we could save by not having the other variables present on each thread?

@davidwendt
Contributor Author

  1. Whether or not d_prog will fit in shared memory is a runtime constant, not something known at compile time.

The d_prog object is generated based on the given regex pattern, so its size is only known at runtime.

This optimization relies on the runtime recognizing that d_prog is not accessed on any thread > 0 and therefore not accessing it in global memory, right?

I'm not following this question. The d_prog is accessed by all threads (after the syncthreads). The first thread in the block loads the d_prog data into shared-memory. This is a common usage pattern for loading shared-memory.

Otherwise you'll still initially load d_prog from global memory on all threads. Is that something we can rely on the runtime doing correctly, or is the idea that even with that initial load of d_prog from global memory it's worthwhile to access shared memory in all subsequent accesses to s_prog because the original d_prog will immediately be evicted from the cache if it goes unused?

The d_prog will either fit into shared memory or not. If it fits, the store copies the d_prog into the shared-memory buffer and the load will give a new d_prog using the shared memory. If it does not fit, the store does nothing and the load will just return whatever is passed to it. Either way, the d_prog returned from load is valid for every thread to use.
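
Host-side, that fit check happens once at launch time. A sketch under assumed names (`required_size` is a hypothetical accessor; `regex_kernel` is the sketch from above):

```cpp
// Request dynamic shared memory only when the program data fits.
int max_shmem = 0;
cudaDeviceGetAttribute(&max_shmem, cudaDevAttrMaxSharedMemoryPerBlock, /*device=*/0);

auto const prog_size  = d_prog.required_size();  // hypothetical accessor
auto const shmem_size = prog_size <= static_cast<size_t>(max_shmem) ? prog_size : 0;

// With 0 bytes requested, store()/load() degrade to no-ops and every thread
// simply uses the global-memory d_prog directly.
regex_kernel<<<num_blocks, block_size, shmem_size, stream>>>(d_prog /* , ... */);
```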

@vyasr
Contributor

vyasr commented Apr 29, 2022

@davidwendt and I sorted most of this out offline. tl;dr the global memory buffers that d_prog points to are much larger than the object itself, which is essentially a handful of integers and pointers. Most of my suggestions boiled down to asking whether we could avoid d_prog being copied/used on all threads, but since the object is so small it's not worthwhile to try to optimize that out. The shared memory buffer is already doing the important work. The important part is reducing global memory accesses, whereas my changes would just be helping to slightly reduce stack variables (and therefore register usage), which is not currently a bottleneck.

I'm pretty much happy with the current state of the PR, but plan to give it another review with a fresh eye on Monday before approving.

Contributor

@vyasr vyasr left a comment

This LGTM now. I would still be interested to see whether these kernels are ever occupancy bound. If they are, then making sure that every thread uses a pointer to the reprog_device in shared memory rather than having to copy all of its members into a thread-local object might help you see even more performance gains by reducing register usage. That change can be explored in a future PR though.

Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Thanks to @vyasr for adding a comment explaining the offline discussion. This looks good to me and is a nice win.

@davidwendt
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit de0f7e0 into rapidsai:branch-22.06 May 6, 2022
@davidwendt davidwendt deleted the regex-global-memory-state branch May 6, 2022 11:59
Labels
3 - Ready for Review (ready for review by team), improvement (improvement/enhancement to an existing function), libcudf (affects libcudf C++/CUDA code), non-breaking (non-breaking change), strings (strings issues, C++ and Python)