
[BUG] cudf::strings::split_record can be over 15x slower than a single thread on the CPU for some cases #12694

Closed
revans2 opened this issue Feb 3, 2023 · 5 comments · Fixed by #12729
Assignees: davidwendt
Labels: bug (Something isn't working), Performance (Performance related issue), Spark (Functionality that helps Spark RAPIDS)

Comments

revans2 (Contributor) commented Feb 3, 2023

Describe the bug
We have a customer complaining about a specific query that takes orders of magnitude longer to complete on the GPU than it does on the CPU.

I did some benchmarks with synthetic data (happy to upload some or all of it, but the 1% of the data that really shows the problem is 150 MiB compressed). The data involves a lot of very long strings. For most operators, like reading Parquet and even doing a regular expression replace on the long strings, the GPU was only 3x to 5x slower than a 12-thread, 6-core CPU doing the same thing in Java/Scala. But for cudf::strings::split_record the difference was enormous: 15x slower than a single CPU thread and over 238x slower than the 12-thread, 6-core CPU.

I did some profiling with Nsight Systems, and the slowness appears to come from the kernel that counts the number of matches per input string and from the kernel that outputs the pointer and length of each matched part.

The other thing I noticed is that these kernels appear to occupy the entire GPU even though the grid is so small that it should not (Nsight says it would in theory, but there are only a few thousand strings in the input data and the A6000 has a huge number of threads). On the larger full data set, running with multiple threads takes as long as running all of the data with a single thread, assuming the same number of batches.
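For context, here is a minimal sketch of the one-thread-per-string counting pattern described above (a rough approximation with made-up names such as count_delimiters_per_string, not the actual libcudf kernel). Each thread walks its entire string sequentially, so a handful of very long strings dominates the runtime, and the grid size depends only on the number of input strings, which is consistent with the tiny launches in the traces below.

  // Sketch only: d_chars/d_offsets mimic a strings column's character buffer
  // and per-row byte offsets; d_counts receives the delimiter count per row.
  __global__ void count_delimiters_per_string(char const* d_chars,
                                              int const* d_offsets,
                                              int num_strings,
                                              char const* d_delim,
                                              int delim_size,
                                              int* d_counts)
  {
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_strings) { return; }

    int const begin = d_offsets[idx];
    int const end   = d_offsets[idx + 1];
    int count       = 0;
    // One thread scans every byte of its own string, however long it is.
    for (int pos = begin; pos + delim_size <= end; ++pos) {
      bool match = true;
      for (int k = 0; k < delim_size; ++k) {
        if (d_chars[pos + k] != d_delim[k]) { match = false; break; }
      }
      if (match) {
        ++count;
        pos += delim_size - 1;  // consume the match (non-overlapping)
      }
    }
    d_counts[idx] = count;
  }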

First kernel

_kernel_agent
Begins: 6.67374s
Ends: 136.101s (+129.427 s)
grid:  <<<5, 1, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 38
Local Memory Per Thread: 0 bytes
Local Memory Total: 74,317,824 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 100 %
Launched from thread: 588148
Latency: ←12.216 μs
Correlation ID: 1229
Stream: Stream 13

Second kernel

_kernel_agent
Begins: 136.101s
Ends: 348.364s (+212.263 s)
grid:  <<<5, 1, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 39
Local Memory Per Thread: 0 bytes
Local Memory Total: 74,317,824 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 100 %
Launched from thread: 588148
Latency: ←11.004 μs
Correlation ID: 1289
Stream: Stream 13

For reference, here are some kernels doing a replace_re on the same data:

for_each_kernel
Begins: 31.0558s
Ends: 33.1574s (+2.102 s)
grid:  <<<10, 1, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 192 bytes
Registers Per Thread: 79
Local Memory Per Thread: 0 bytes
Local Memory Total: 85,327,872 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 50 %
Launched from thread: 498744
Latency: ←20.579 μs
Correlation ID: 28726
Stream: Stream 21
for_each_kernel
Begins: 33.1575s
Ends: 35.384s (+2.227 s)
grid:  <<<10, 1, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 192 bytes
Registers Per Thread: 79
Local Memory Per Thread: 0 bytes
Local Memory Total: 85,327,872 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 50 %
Launched from thread: 498744
Latency: ←21.218 μs
Correlation ID: 28775
Stream: Stream 21

4.329 seconds for the regex replace vs. 341+ seconds for split_record (79x slower).

Steps/Code to reproduce bug

  #include <cudf/io/parquet.hpp>
  #include <cudf/scalar/scalar.hpp>
  #include <cudf/strings/split/split.hpp>
  #include <cudf/strings/strings_column_view.hpp>

  #include <iostream>

  int main()
  {
    // Read only the "value" column (lots of very long strings).
    auto file_name = "/data/tmp/DUMP_2.parquet";
    auto source    = cudf::io::source_info(file_name);
    auto builder   = cudf::io::parquet_reader_options::builder(source);
    builder.columns({"value"});
    auto read = cudf::io::read_parquet(builder.build());
    std::cerr << "READ IN DATA" << std::endl;

    auto const strs_input = cudf::strings_column_view{read.tbl->get_column(0)};
    cudf::string_scalar sp("},{");
    std::cerr << "SETUP DONE... (RUNNING)" << std::endl;

    // This call is the slow path being reported.
    auto ret = cudf::strings::split_record(strs_input, sp, -1);
    std::cerr << "ALL DONE" << std::endl;
    return 0;
  }

Like I said, I am happy to upload DUMP_2.parquet for anyone who wants to work on it (it is too large for GitHub).

Expected behavior
I would love it if we could fix this so it was faster than the CPU by a significant amount, but I am happy if it just matches the other long-string expressions where it is only 5x slower than the entire CPU (i.e., make it at least 50x faster than it is today).

revans2 added the bug (Something isn't working), Needs Triage (Need team to review and classify), Performance (Performance related issue), and Spark (Functionality that helps Spark RAPIDS) labels on Feb 3, 2023
davidwendt self-assigned this on Feb 3, 2023
davidwendt (Contributor) commented:

The existing libcudf benchmarks for cudf::strings::split_record do show that performance degrades as strings get longer.
The cudf::strings::split algorithm was updated a while back to support parallelism over characters/bytes to help performance with longer strings. It seems possible at first glance to use the same approach here.
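For illustration, here is a minimal sketch of that byte-parallel idea (a rough approximation with a made-up kernel name, not the actual cudf::strings::split code): one thread per byte position flags where a delimiter starts, so the work scales with the total number of character bytes instead of with the length of the longest string. Turning the flags into per-string token counts and offsets (e.g. with a scan segmented by string boundaries) is omitted here.

  // Sketch only: flag every byte position where the delimiter begins.
  __global__ void flag_delimiter_starts(char const* d_chars,
                                        long num_bytes,
                                        char const* d_delim,
                                        int delim_size,
                                        bool* d_flags)
  {
    long const pos = blockIdx.x * static_cast<long>(blockDim.x) + threadIdx.x;
    if (pos >= num_bytes) { return; }

    bool match = (pos + delim_size <= num_bytes);  // delimiter must fit
    for (int k = 0; match && k < delim_size; ++k) {
      match = (d_chars[pos + k] == d_delim[k]);
    }
    d_flags[pos] = match;
  }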

revans2 (Contributor, Author) commented Feb 6, 2023

@davidwendt that sounds great. I'll run some performance tests to see if that helps.

revans2 (Contributor, Author) commented Feb 6, 2023

I ran the test with split instead of split_record and the runtime went from 355.8 seconds to 5.9 seconds (a 60x speedup). Seeing how just reading the input file took around 6.2 seconds, the split appears to be essentially free here.

So yes, @davidwendt, if we could do something similar it would help out a lot.

revans2 (Contributor, Author) commented Feb 6, 2023

Yup, nsys shows split taking 181 ms, and most of that time was spent creating the 1300 strings columns for the output. So split_record should end up even faster, because it only needs to create one strings column plus some offsets.
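To make the "one strings column plus offsets" point concrete, here is a small hypothetical example (the toy input and names are illustrative, not from the issue) contrasting the two output shapes using the public libcudf API and the test column wrapper:

  #include <cudf_test/column_wrapper.hpp>

  #include <cudf/scalar/scalar.hpp>
  #include <cudf/strings/split/split.hpp>
  #include <cudf/strings/strings_column_view.hpp>

  void demo_output_shapes()
  {
    cudf::test::strings_column_wrapper input({"a},{b},{c", "d},{e"});
    cudf::string_scalar delim("},{");

    // split_record: a single LIST<STRING> column -- list offsets [0, 3, 5]
    // plus one child strings column ["a", "b", "c", "d", "e"].
    auto lists = cudf::strings::split_record(cudf::strings_column_view{input}, delim);

    // split: a table with as many strings columns as the widest row has
    // tokens (3 here); shorter rows are padded with nulls.
    auto table = cudf::strings::split(cudf::strings_column_view{input}, delim);
  }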

ttnghia (Contributor) commented Feb 14, 2023

I tested split and split_record using the same input parquet file that @revans2 did:

split: 156.642541ms
split_record: 358426.275505ms

Using the patch in #12729:

split: 152.304989ms
split_record: 14.842435ms

Now I'm going to verify with the spark-rapids plugin to confirm that the patch indeed resolves our issue (it most likely does, but just to make sure). Will update the result later.

Update: the spark-rapids plugin with the new patch has a runtime consistent with my test above 👍.

rapids-bot (bot) pushed a commit that referenced this issue on Feb 21, 2023 (…12729):

Updates the `cudf::strings::split_record` logic to match the more optimized code in `cudf::strings::split`.
The optimized code performs much better for longer strings (>64 bytes) by parallelizing over the character bytes to find delimiters before determining split tokens.
This led to refactoring the code so that both APIs can share the optimized code.
Also fixes a bug found when using overlapping delimiters.
Additional tests were added for multi-byte delimiters which can overlap and span multiple adjacent strings.

Closes #12694

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)
  - https://github.com/nvdbaranec

URL: #12729
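For readers unfamiliar with the overlapping-delimiter case mentioned in the commit message above, here is a small hypothetical illustration (not one of the actual tests added in #12729), assuming the usual Python-style split semantics where matches are consumed left to right without overlap:

  // "a:::b" with the two-byte delimiter "::": it matches at byte 1 and again
  // at byte 2, but the second match overlaps the first and must be skipped,
  // so the expected row would be ["a", ":b"] rather than ["a", "", "b"].
  cudf::test::strings_column_wrapper overlapping({"a:::b"});
  cudf::string_scalar colons("::");
  auto tokens =
    cudf::strings::split_record(cudf::strings_column_view{overlapping}, colons);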
bdice removed the Needs Triage (Need team to review and classify) label on Mar 4, 2024