
[BUG] cuDF str.split performance appears to be orders of magnitude slower than Pandas. #4885

Closed
drobison00 opened this issue Apr 13, 2020 · 8 comments · Fixed by #4922
Labels: bug (Something isn't working) · libcudf (Affects libcudf (C++/CUDA) code) · Performance (Performance related issue) · Python (Affects Python cuDF API) · strings (strings issues (C++ and Python))

Comments

@drobison00

Describe the bug

cuDF Series str.split performance appears to be orders of magnitude slower than Pandas.

Steps/Code to reproduce bug

# Construct a single-row DataFrame containing one long comma-separated string
import cudf
import pandas as pd
import numpy as np

data = {
    "col_0": [','.join(map(str, np.linspace(-1, 1, 400))) for _ in range(1)]
}

df = pd.DataFrame(data=data)
cdf = cudf.from_pandas(df)

print("Pandas:", flush=True)
%time df['col_0'].str.split(",", expand=True).reset_index(drop=True)
Pandas:
CPU times: user 402 ms, sys: 35 ms, total: 437 ms
Wall time: 26.8 ms

print("cuDF:", flush=True)
%time cdf['col_0'].str.split(",", expand=True).reset_index(drop=True)
cuDF:
CPU times: user 42.2 s, sys: 16.7 s, total: 58.9 s
Wall time: 58.8 s

Expected behavior

Performance should be more closely aligned with Pandas.

Environment overview


Bare-metal with a Conda environment and a Conda-based cuDF installation.

Environment details

cudatoolkit               10.1.243             h6bb024c_0    nvidia
cudf                      0.13.0                   py37_0    rapidsai
cudnn                     7.6.0                cuda10.1_0    nvidia
cuml                      0.13.0          cuda10.1_py37_0    rapidsai
cupy                      7.3.0            py37h0632833_0    conda-forge
dask-cudf                 0.13.0                   py37_0    rapidsai
libcudf                   0.13.0               cuda10.1_0    rapidsai
libcuml                   0.13.0               cuda10.1_0    rapidsai
libcumlprims              0.13.0               cuda10.1_0    nvidia
libnvstrings              0.13.0               cuda10.1_0    rapidsai
librmm                    0.13.0               cuda10.1_0    rapidsai
ucx                       1.7.0+g9d06c3a       cuda10.1_0    rapidsai


@drobison00 added the Needs Triage (Need team to review and classify) and bug (Something isn't working) labels on Apr 13, 2020
@kkraus14 added the Python (Affects Python cuDF API) label and removed the Needs Triage (Need team to review and classify) label on Apr 13, 2020
@rgsl888prabhu
Contributor

It seems libcudf is taking most of the time:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   47.683   47.683   47.714   47.714 {cudf._lib.strings.split.split.split}
      400    0.167    0.000    0.221    0.001 {cudf._lib.copying.copy_column}
        1    0.018    0.018   47.732   47.732 string.py:804(split)
     5604    0.015    0.000    0.038    0.000 dtypes.py:79(is_categorical_dtype)
    73666    0.015    0.000    0.015    0.000 {built-in method builtins.hasattr}
    32026    0.013    0.000    0.027    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
     2400    0.008    0.000    0.046    0.000 column.py:1039(build_column)
    72948    0.007    0.000    0.007    0.000 {built-in method builtins.isinstance}
        1    0.007    0.007   47.965   47.965 <string>:1(<module>)
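
A minimal way to reproduce this kind of profile, assuming the `cdf` DataFrame from the reproduction code above is already constructed, is IPython's %prun (shown in a later comment) or the standard cProfile module; this is a sketch, not the exact command used here:

import cProfile

# Profile the split call; in the output above, nearly all of the cumulative
# time lands in cudf._lib.strings.split.split (the libcudf binding), so the
# slowdown sits below the Python layer.
cProfile.run(
    "cdf['col_0'].str.split(',', expand=True).reset_index(drop=True)",
    sort='tottime',
)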

@kkraus14
Collaborator

> It seems libcudf is taking most of the time (profiler output quoted above)

Can you try with the pool allocator? This looks like a situation where allocating 400 columns is likely the bottleneck, but I could be mistaken.

@kkraus14
Collaborator

cc @davidwendt

@kkraus14 added the libcudf (Affects libcudf (C++/CUDA) code) label on Apr 13, 2020
@rgsl888prabhu
Contributor

Tried with the pool allocator, and the problem still persists:

In [1]: import cudf
   ...:
   ...: # Use the pool allocator with a 2GiB initial pool size
   ...: cudf.set_allocator(pool=True, initial_pool_size=2<<30)

In [3]: import pandas as pd
   ...: import numpy as np

In [4]: data = {
   ...:     "col_0": [','.join(map(str, np.linspace(-1, 1, 400))) for _ in range(1)]
   ...: }

In [5]: df = pd.DataFrame(data=data)
   ...: cdf = cudf.from_pandas(df)

In [6]: print("cuDF:", flush=True)
   ...: %prun cdf['col_0'].str.split(",", expand=True).reset_index(drop=True)
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   47.646   47.646   47.676   47.676 {cudf._lib.strings.split.split.split}
      400    0.043    0.000    0.094    0.000 {cudf._lib.copying.copy_column}
     5604    0.015    0.000    0.037    0.000 dtypes.py:79(is_categorical_dtype)
    73666    0.015    0.000    0.015    0.000 {built-in method builtins.hasattr}
    32026    0.013    0.000    0.026    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
     2400    0.008    0.000    0.044    0.000 column.py:1039(build_column)
    72948    0.007    0.000    0.007    0.000 {built-in method builtins.isinstance}
        1    0.007    0.007   47.782   47.782 <string>:1(<module>)
     2400    0.005    0.000    0.008    0.000 column.py:41(__init__)
     8006    0.003    0.000    0.005    0.000 common.py:2020(pandas_dtype)
     2400    0.003    0.000    0.004    0.000 buffer.py:135(_buffer_data_from_array_interface)
     2400    0.002    0.000    0.007    0.000 buffer.py:87(_init_from_array_like)

@davidwendt
Contributor

Also, this is a single string (one row), and therefore only a single GPU thread is used to split it. There is no parallelism here, so this will run much slower than on the CPU. The GPU would likely only have an advantage over the CPU when thousands or tens of thousands of strings are being processed in parallel.

@kkraus14
Collaborator

> Also, this is a single string (one row), and therefore only a single GPU thread is used to split it. [...]

Yeah, I understand that, but ~47s seems crazy excessive, no?

@drobison00
Author

> Also, this is a single string (one row), and therefore only a single GPU thread is used to split it. [...]

My observations are that this is not the case. You can check by simply changing the range argument from 1 to however many strings (N) you want.

For example, with N=1000 and the number of linspace steps reduced to 200 (with 400 steps it runs much longer; a parameterized sketch of this benchmark follows the timings below):
Pandas:
CPU times: user 48.1 ms, sys: 4 ms, total: 52.1 ms
Wall time: 51.8 ms
cuDF:
CPU times: user 34.9 s, sys: 12.1 s, total: 47.1 s
Wall time: 47.1 s

N=10,000, steps=200
Pandas:
CPU times: user 824 ms, sys: 74.2 ms, total: 898 ms
Wall time: 897 ms
cuDF:
CPU times: user 57 s, sys: 19.5 s, total: 1min 16s
Wall time: 1min 16s
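
A minimal parameterized sketch of that benchmark, assuming the same setup as the original report; N and steps are the knobs described above, and time.perf_counter stands in for the %time magic so the script runs outside IPython:

import time

import cudf
import numpy as np
import pandas as pd

N = 10_000   # number of rows (strings) to split
steps = 200  # number of comma-separated values per string

data = {"col_0": [','.join(map(str, np.linspace(-1, 1, steps))) for _ in range(N)]}
df = pd.DataFrame(data=data)
cdf = cudf.from_pandas(df)

# Time the same split + reset_index on the CPU (pandas) ...
start = time.perf_counter()
df['col_0'].str.split(",", expand=True).reset_index(drop=True)
print(f"Pandas: {time.perf_counter() - start:.3f} s")

# ... and on the GPU (cuDF).
start = time.perf_counter()
cdf['col_0'].str.split(",", expand=True).reset_index(drop=True)
print(f"cuDF:   {time.perf_counter() - start:.3f} s")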

@davidwendt
Contributor

I see that! Wow. Thanks for the code snippet here, Devin. This will be very helpful. I should be able to profile this to see what is going on.
