[BUG] cudf::strings::split_record can be over 15x slower than a single thread on the CPU for some cases #12694
Comments
The existing libcudf benchmarks for
@davidwendt that sounds great. I'll run some performance tests to see if that helps.
I ran the test with … So yes, @davidwendt, if we could do something similar it would help out a lot.
Yup, nsys shows split taking 181 ms, and most of that time was spent creating the 1300 strings columns for output. So it should be much faster, because we would only need to create one strings column and some offsets.
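For context, here is a rough sketch of the two call shapes being compared (the delimiter is just an example and the header paths are from memory, so treat this as an illustration rather than a drop-in snippet): `cudf::strings::split` materializes a table with one strings column per token position, while `cudf::strings::split_record` returns a single LIST<STRING> column, i.e. one offsets column plus one child strings column.

```cpp
#include <cudf/column/column.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/split/split.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/table/table.hpp>

#include <memory>

// `input` is a view over the column of long strings (illustration only).
std::unique_ptr<cudf::table> split_to_columns(cudf::strings_column_view const& input)
{
  // One strings column per token position -- roughly 1300 columns for this data set.
  return cudf::strings::split(input, cudf::string_scalar(","));
}

std::unique_ptr<cudf::column> split_to_lists(cudf::strings_column_view const& input)
{
  // A single LIST<STRING> column: one offsets column plus one child strings column.
  return cudf::strings::split_record(input, cudf::string_scalar(","));
}
```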
I tested
Using the patch in #12729:
Now I'm going to verify with the spark-rapids plugin to confirm that the patch indeed resolves our issue (it most likely does, but just to make sure). Will update the result later. Update: the spark-rapids plugin with the new patch has a runtime consistent with my test above 👍.
…12729) Updates the `cudf::strings::split_record` logic to match the more optimized code in `cudf::strings::split`. The optimized code performs much better for longer strings (>64 bytes) by parallelizing over the character bytes to find delimiters before determining split tokens. This led to refactoring the code so that both APIs can share the optimized code. Also fixes a bug found when using overlapped delimiters. Additional tests were added for multi-byte delimiters which can overlap and span multiple adjacent strings. Closes #12694

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Yunsong Wang (https://github.com/PointKernel)
- https://github.com/nvdbaranec

URL: #12729
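For anyone curious about the "parallelizing over the character bytes" idea mentioned in the fix, here is a stand-alone Thrust sketch of that search pattern. This is not libcudf's actual kernel and it assumes a single-byte delimiter; the point is only that the work is distributed over bytes of the chars buffer rather than one thread per string, so a handful of pathologically long strings no longer serializes the search.

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

// Predicate: is the (single-byte) delimiter located at byte offset `idx`?
struct is_delimiter {
  char const* d_chars;
  char delim;
  __device__ bool operator()(int idx) const { return d_chars[idx] == delim; }
};

// Byte-parallel delimiter search: work is spread over character bytes,
// independent of how long any individual string is.
thrust::device_vector<int> find_delimiter_offsets(thrust::device_vector<char> const& chars,
                                                  char delim)
{
  thrust::device_vector<int> offsets(chars.size());
  auto end = thrust::copy_if(thrust::device,
                             thrust::counting_iterator<int>(0),
                             thrust::counting_iterator<int>(static_cast<int>(chars.size())),
                             offsets.begin(),
                             is_delimiter{thrust::raw_pointer_cast(chars.data()), delim});
  offsets.resize(end - offsets.begin());
  return offsets;
}
```

The delimiter offsets found this way can then be mapped back to per-string token counts and boundaries in later steps; the real implementation also has to handle multi-byte and overlapping delimiters, which the patch adds tests for.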
Describe the bug
We have a customer complaining about a specific query that takes an extremely long time to complete compared to the CPU (orders of magnitude slower).
I did some benchmarks with synthetic data (happy to upload some or all of it, but the 1% of the data that really shows the problem is 150 MiB compressed). The data involves a lot of really long strings. We saw that for most operators, like reading parquet and even doing a regular expression replace on the long strings, the GPU was only 3x to 5x slower than a 12-thread, 6-core CPU doing the same thing in Java/Scala. But for `cudf::strings::split_record` the gap was massive: 15x slower than a single CPU thread and over 238x slower than the 12-thread, 6-core CPU.

I did some profiling with Nsight Systems, and the slowness appears to come from the kernel that counts the number of matches per input string and from the kernel that outputs the pointer and length of the matched parts.
The other thing I noticed is that these kernels appear to occupy the entire GPU even though the grid is so small that it should not (Nsight says that in theory it could, but there are only a few thousand strings in the input data and the A6000 has a huge number of threads). On the larger full data set, if I try to run with multiple threads it takes as long as running all of the data with a single thread, assuming the same number of batches.
First kernel (profiler screenshot)
Second kernel (profiler screenshot)
For reference, here are some kernels doing a replace_re on the same data:
4.329 seconds for regex_replace vs. 341+ seconds for split_record (roughly 79x slower).
Steps/Code to reproduce bug
Like I said, I am happy to upload DUMP_2.parquet for anyone who wants to work on it (it is too large for GitHub).
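Since the file itself cannot be attached here, this is roughly the shape of the standalone repro I would run against it (the file path, column index, and delimiter below are placeholders, not the customer's actual values):

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/split/split.hpp>
#include <cudf/strings/strings_column_view.hpp>

#include <cuda_runtime.h>

#include <chrono>
#include <iostream>

int main()
{
  // Placeholder path and column index: load the long-string column from the dump.
  auto const options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{"DUMP_2.parquet"}).build();
  auto const data    = cudf::io::read_parquet(options);
  auto const strings = cudf::strings_column_view(data.tbl->view().column(0));

  auto const start  = std::chrono::steady_clock::now();
  auto const result = cudf::strings::split_record(strings, cudf::string_scalar(","));
  cudaDeviceSynchronize();  // make sure all kernels have finished before stopping the clock
  auto const stop   = std::chrono::steady_clock::now();

  std::cout << "split_record: " << std::chrono::duration<double>(stop - start).count() << " s\n";
  return 0;
}
```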
Expected behavior
I would love it if we could fix this so it was faster than the CPU by a significant amount, but I would be happy if it just matched the other long-string expressions, where the GPU is only 5x slower than the entire CPU (i.e., make it at least 50x faster than it is today).