
[BUG] cudf::strings::split_record can be over 15x slower than a single thread on the CPU for some cases #12694

Closed
revans2 opened this issue Feb 3, 2023 · 5 comments · Fixed by #12729
Assignees: davidwendt
Labels: bug (Something isn't working), Performance (Performance related issue), Spark (Functionality that helps Spark RAPIDS)

Comments

revans2 (Contributor) commented Feb 3, 2023

Describe the bug
We have a customer complaining about a specific query that takes orders of magnitude longer to complete on the GPU than it does on the CPU.

I did some benchmarks with synthetic data (happy to upload some or all of it, but the 1% of the data that really shows the problem is 150 MiB compressed). The data involves a lot of very long strings. For most operators, like reading Parquet and even doing a regular expression replace on the long strings, the GPU was only 3x to 5x slower than a 12-thread, 6-core CPU doing the same thing in Java/Scala. But for cudf::strings::split_record the difference was enormous: 15x slower than a single CPU thread and over 238x slower than the 12-thread, 6-core CPU.

I did some profiling with Nsight Systems, and the slowness appears to come from the kernel that counts the number of matches per input string and from the kernel that outputs the pointer and length of each matched part.

The other thing I noticed is that these kernels appear to occupy the entire GPU even though the grid is so small that it should not (Nsight says it would in theory, but there are only a few thousand strings in the input data and the A6000 has a huge number of threads). On the larger full data set, running with multiple threads takes as long as running all of the data with a single thread, assuming the same number of batches.
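For context, here is a minimal sketch of the one-thread-per-string counting pattern described above (a rough approximation with made-up names such as count_delimiters_per_string, not the actual libcudf kernel). Each thread walks its entire string sequentially, so a handful of very long strings dominates the runtime, and the grid size depends only on the number of input strings, which is consistent with the tiny launches in the traces below.

  // Sketch only: d_chars/d_offsets mimic a strings column's character buffer
  // and per-row byte offsets; d_counts receives the delimiter count per row.
  __global__ void count_delimiters_per_string(char const* d_chars,
                                              int const* d_offsets,
                                              int num_strings,
                                              char const* d_delim,
                                              int delim_size,
                                              int* d_counts)
  {
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_strings) { return; }

    int const begin = d_offsets[idx];
    int const end   = d_offsets[idx + 1];
    int count       = 0;
    // One thread scans every byte of its own string, however long it is.
    for (int pos = begin; pos + delim_size <= end; ++pos) {
      bool match = true;
      for (int k = 0; k < delim_size; ++k) {
        if (d_chars[pos + k] != d_delim[k]) { match = false; break; }
      }
      if (match) {
        ++count;
        pos += delim_size - 1;  // consume the match (non-overlapping)
      }
    }
    d_counts[idx] = count;
  }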

First kernel

_kernel_agent
Begins: 6.67374s
Ends: 136.101s (+129.427 s)
grid:  <<<5, 1, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 38
Local Memory Per Thread: 0 bytes
Local Memory Total: 74,317,824 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 100 %
Launched from thread: 588148
Latency: ←12.216 μs
Correlation ID: 1229
Stream: Stream 13

Second kernel

_kernel_agent
Begins: 136.101s
Ends: 348.364s (+212.263 s)
grid:  <<<5, 1, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 39
Local Memory Per Thread: 0 bytes
Local Memory Total: 74,317,824 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 100 %
Launched from thread: 588148
Latency: ←11.004 μs
Correlation ID: 1289
Stream: Stream 13

For reference, here are some kernels doing a replace_re on the same data:

for_each_kernel
Begins: 31.0558s
Ends: 33.1574s (+2.102 s)
grid:  <<<10, 1, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 192 bytes
Registers Per Thread: 79
Local Memory Per Thread: 0 bytes
Local Memory Total: 85,327,872 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 50 %
Launched from thread: 498744
Latency: ←20.579 μs
Correlation ID: 28726
Stream: Stream 21
for_each_kernel
Begins: 33.1575s
Ends: 35.384s (+2.227 s)
grid:  <<<10, 1, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 192 bytes
Registers Per Thread: 79
Local Memory Per Thread: 0 bytes
Local Memory Total: 85,327,872 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 50 %
Launched from thread: 498744
Latency: ←21.218 μs
Correlation ID: 28775
Stream: Stream 21

4.329 seconds for the regex replace vs. 341+ seconds for split_record (79x slower).

Steps/Code to reproduce bug

  #include <cudf/io/parquet.hpp>
  #include <cudf/scalar/scalar.hpp>
  #include <cudf/strings/split/split.hpp>
  #include <cudf/strings/strings_column_view.hpp>

  #include <iostream>

  int main()
  {
    // Read only the "value" column (lots of very long strings).
    auto file_name = "/data/tmp/DUMP_2.parquet";
    auto source    = cudf::io::source_info(file_name);
    auto builder   = cudf::io::parquet_reader_options::builder(source);
    builder.columns({"value"});
    auto read = cudf::io::read_parquet(builder.build());
    std::cerr << "READ IN DATA" << std::endl;

    auto const strs_input = cudf::strings_column_view{read.tbl->get_column(0)};
    cudf::string_scalar sp("},{");
    std::cerr << "SETUP DONE... (RUNNING)" << std::endl;

    // This call is the slow path being reported.
    auto ret = cudf::strings::split_record(strs_input, sp, -1);
    std::cerr << "ALL DONE" << std::endl;
    return 0;
  }

Like I said, I am happy to upload DUMP_2.parquet for anyone who wants to work on it (it is too large for GitHub).

Expected behavior
I would love it if we could fix this so it was faster than the CPU by a significant amount, but I am happy if it just matches the other long-string expressions where it is only 5x slower than the entire CPU (i.e., make it at least 50x faster than it is today).

revans2 added the bug (Something isn't working), Needs Triage (Need team to review and classify), Performance (Performance related issue), and Spark (Functionality that helps Spark RAPIDS) labels on Feb 3, 2023
davidwendt self-assigned this on Feb 3, 2023
davidwendt (Contributor) commented:

The existing libcudf benchmarks for cudf::strings::split_record do show that performance degrades as strings get longer.
The cudf::strings::split algorithm was updated a while back to support parallelism over characters/bytes to help performance with longer strings. It seems possible at first glance to use the same approach here.
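For illustration, here is a minimal sketch of that byte-parallel idea (a rough approximation with a made-up kernel name, not the actual cudf::strings::split code): one thread per byte position flags where a delimiter starts, so the work scales with the total number of character bytes instead of with the length of the longest string. Turning the flags into per-string token counts and offsets (e.g. with a scan segmented by string boundaries) is omitted here.

  // Sketch only: flag every byte position where the delimiter begins.
  __global__ void flag_delimiter_starts(char const* d_chars,
                                        long num_bytes,
                                        char const* d_delim,
                                        int delim_size,
                                        bool* d_flags)
  {
    long const pos = blockIdx.x * static_cast<long>(blockDim.x) + threadIdx.x;
    if (pos >= num_bytes) { return; }

    bool match = (pos + delim_size <= num_bytes);  // delimiter must fit
    for (int k = 0; match && k < delim_size; ++k) {
      match = (d_chars[pos + k] == d_delim[k]);
    }
    d_flags[pos] = match;
  }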

revans2 (Contributor, Author) commented Feb 6, 2023

@davidwendt that sounds great. I'll run some performance tests to see if that helps.

revans2 (Contributor, Author) commented Feb 6, 2023

I ran the test with split instead of split_record and the runtime went from 355.8 seconds to 5.9 seconds (a 60x speedup). Seeing how just reading the input file took around 6.2 seconds, the split appears to be essentially free here.

So yes, @davidwendt, if we could do something similar it would help out a lot.

revans2 (Contributor, Author) commented Feb 6, 2023

Yup, nsys shows split taking 181 ms, and most of that time was spent creating the 1300 strings columns for the output. So split_record should end up even faster, because it only needs to create one strings column plus some offsets.
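To make the "one strings column plus offsets" point concrete, here is a small hypothetical example (the toy input and names are illustrative, not from the issue) contrasting the two output shapes using the public libcudf API and the test column wrapper:

  #include <cudf_test/column_wrapper.hpp>

  #include <cudf/scalar/scalar.hpp>
  #include <cudf/strings/split/split.hpp>
  #include <cudf/strings/strings_column_view.hpp>

  void demo_output_shapes()
  {
    cudf::test::strings_column_wrapper input({"a},{b},{c", "d},{e"});
    cudf::string_scalar delim("},{");

    // split_record: a single LIST<STRING> column -- list offsets [0, 3, 5]
    // plus one child strings column ["a", "b", "c", "d", "e"].
    auto lists = cudf::strings::split_record(cudf::strings_column_view{input}, delim);

    // split: a table with as many strings columns as the widest row has
    // tokens (3 here); shorter rows are padded with nulls.
    auto table = cudf::strings::split(cudf::strings_column_view{input}, delim);
  }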

ttnghia (Contributor) commented Feb 14, 2023

I tested split and split_record using the same input parquet file that @revans2 did:

split: 156.642541ms
split_record: 358426.275505ms

Using the patch in #12729:

split: 152.304989ms
split_record: 14.842435ms

Now I'm going to verify with the spark-rapids plugin to confirm that the patch indeed resolves our issue (it most likely does, but just to make sure). Will update the result later.

Update: the spark-rapids plugin with the new patch has a runtime consistent with my test above 👍.

rapids-bot (bot) pushed a commit that referenced this issue on Feb 21, 2023 (…12729):

Updates the `cudf::strings::split_record` logic to match the more optimized code in `cudf::strings::split`.
The optimized code performs much better for longer strings (>64 bytes) by parallelizing over the character bytes to find delimiters before determining split tokens.
This led to refactoring the code so that both APIs can share the optimized code.
Also fixes a bug found when using overlapping delimiters.
Additional tests were added for multi-byte delimiters which can overlap and span multiple adjacent strings.

Closes #12694

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)
  - https://github.com/nvdbaranec

URL: #12729
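For readers unfamiliar with the overlapping-delimiter case mentioned in the commit message above, here is a small hypothetical illustration (not one of the actual tests added in #12729), assuming the usual Python-style split semantics where matches are consumed left to right without overlap:

  // "a:::b" with the two-byte delimiter "::": it matches at byte 1 and again
  // at byte 2, but the second match overlaps the first and must be skipped,
  // so the expected row would be ["a", ":b"] rather than ["a", "", "b"].
  cudf::test::strings_column_wrapper overlapping({"a:::b"});
  cudf::string_scalar colons("::");
  auto tokens =
    cudf::strings::split_record(cudf::strings_column_view{overlapping}, colons);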
bdice removed the Needs Triage (Need team to review and classify) label on Mar 4, 2024