
[BUG] cuDF str.split performance appears to be orders of magnitude slower than Pandas. #4885

Closed
drobison00 opened this issue Apr 13, 2020 · 8 comments · Fixed by #4922
Labels: bug (Something isn't working) · libcudf (Affects libcudf (C++/CUDA) code) · Performance (Performance related issue) · Python (Affects Python cuDF API) · strings (strings issues (C++ and Python))

Comments

@drobison00

Describe the bug

cuDF Series str.split performance appears to be orders of magnitude slower than Pandas.

Steps/Code to reproduce bug

# Construct a single-row DataFrame containing one long comma-separated string
import cudf
import pandas as pd
import numpy as np

data = {
    "col_0": [','.join(map(str, np.linspace(-1, 1, 400))) for _ in range(1)]
}

df = pd.DataFrame(data=data)
cdf = cudf.from_pandas(df)

print("Pandas:", flush=True)
%time df['col_0'].str.split(",", expand=True).reset_index(drop=True)
Pandas:
CPU times: user 402 ms, sys: 35 ms, total: 437 ms
Wall time: 26.8 ms

print("cuDF:", flush=True)
%time cdf['col_0'].str.split(",", expand=True).reset_index(drop=True)
cuDF:
CPU times: user 42.2 s, sys: 16.7 s, total: 58.9 s
Wall time: 58.8 s

Expected behavior

Performance should be more closely aligned with Pandas.

Environment overview


Bare-metal with a Conda environment and a Conda-based cuDF installation.

Environment details

cudatoolkit               10.1.243             h6bb024c_0    nvidia
cudf                      0.13.0                   py37_0    rapidsai
cudnn                     7.6.0                cuda10.1_0    nvidia
cuml                      0.13.0          cuda10.1_py37_0    rapidsai
cupy                      7.3.0            py37h0632833_0    conda-forge
dask-cudf                 0.13.0                   py37_0    rapidsai
libcudf                   0.13.0               cuda10.1_0    rapidsai
libcuml                   0.13.0               cuda10.1_0    rapidsai
libcumlprims              0.13.0               cuda10.1_0    nvidia
libnvstrings              0.13.0               cuda10.1_0    rapidsai
librmm                    0.13.0               cuda10.1_0    rapidsai
ucx                       1.7.0+g9d06c3a       cuda10.1_0    rapidsai


@drobison00 added the Needs Triage (Need team to review and classify) and bug (Something isn't working) labels on Apr 13, 2020
@kkraus14 added the Python (Affects Python cuDF API) label and removed the Needs Triage (Need team to review and classify) label on Apr 13, 2020
@rgsl888prabhu
Contributor

It seems libcudf is taking most of the time:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   47.683   47.683   47.714   47.714 {cudf._lib.strings.split.split.split}
      400    0.167    0.000    0.221    0.001 {cudf._lib.copying.copy_column}
        1    0.018    0.018   47.732   47.732 string.py:804(split)
     5604    0.015    0.000    0.038    0.000 dtypes.py:79(is_categorical_dtype)
    73666    0.015    0.000    0.015    0.000 {built-in method builtins.hasattr}
    32026    0.013    0.000    0.027    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
     2400    0.008    0.000    0.046    0.000 column.py:1039(build_column)
    72948    0.007    0.000    0.007    0.000 {built-in method builtins.isinstance}
        1    0.007    0.007   47.965   47.965 <string>:1(<module>)
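
A minimal way to reproduce this kind of profile, assuming the `cdf` DataFrame from the reproduction code above is already constructed, is IPython's %prun (shown in a later comment) or the standard cProfile module; this is a sketch, not the exact command used here:

import cProfile

# Profile the split call; in the output above, nearly all of the cumulative
# time lands in cudf._lib.strings.split.split (the libcudf binding), so the
# slowdown sits below the Python layer.
cProfile.run(
    "cdf['col_0'].str.split(',', expand=True).reset_index(drop=True)",
    sort='tottime',
)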

@kkraus14
Collaborator

> It seems libcudf is taking most of the time (profiler output quoted above)

Can you try with the pool allocator? This looks like a situation where allocating 400 columns is likely the bottleneck, but I could be mistaken.

@kkraus14
Collaborator

cc @davidwendt

@kkraus14 added the libcudf (Affects libcudf (C++/CUDA) code) label on Apr 13, 2020
@rgsl888prabhu
Contributor

Tried with the pool allocator, and the problem still persists:

In [1]: import cudf
   ...:
   ...: # Use the pool allocator with a 2GiB initial pool size
   ...: cudf.set_allocator(pool=True, initial_pool_size=2<<30)

In [3]: import pandas as pd
   ...: import numpy as np

In [4]: data = {
   ...:     "col_0": [','.join(map(str, np.linspace(-1, 1, 400))) for _ in range(1)]
   ...: }

In [5]: df = pd.DataFrame(data=data)
   ...: cdf = cudf.from_pandas(df)

In [6]: print("cuDF:", flush=True)
   ...: %prun cdf['col_0'].str.split(",", expand=True).reset_index(drop=True)
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   47.646   47.646   47.676   47.676 {cudf._lib.strings.split.split.split}
      400    0.043    0.000    0.094    0.000 {cudf._lib.copying.copy_column}
     5604    0.015    0.000    0.037    0.000 dtypes.py:79(is_categorical_dtype)
    73666    0.015    0.000    0.015    0.000 {built-in method builtins.hasattr}
    32026    0.013    0.000    0.026    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
     2400    0.008    0.000    0.044    0.000 column.py:1039(build_column)
    72948    0.007    0.000    0.007    0.000 {built-in method builtins.isinstance}
        1    0.007    0.007   47.782   47.782 <string>:1(<module>)
     2400    0.005    0.000    0.008    0.000 column.py:41(__init__)
     8006    0.003    0.000    0.005    0.000 common.py:2020(pandas_dtype)
     2400    0.003    0.000    0.004    0.000 buffer.py:135(_buffer_data_from_array_interface)
     2400    0.002    0.000    0.007    0.000 buffer.py:87(_init_from_array_like)

@davidwendt
Contributor

Also, this is a single string (one row), and therefore only a single GPU thread is used to split it. There is no parallelism here, so this will run much slower than on the CPU. The GPU would likely only have an advantage over the CPU when thousands or tens of thousands of strings are being processed in parallel.

@kkraus14
Collaborator

> Also, this is a single string (one row), and therefore only a single GPU thread is used to split it. [...]

Yeah, I understand that, but ~47s seems crazy excessive, no?

@drobison00
Author

> Also, this is a single string (one row), and therefore only a single GPU thread is used to split it. [...]

My observations are that this is not the case. You can check by simply changing the range argument from 1 to however many strings (N) you want.

For example, with N=1000 and the number of linspace steps reduced to 200 (with 400 steps it runs much longer; a parameterized sketch of this benchmark follows the timings below):
Pandas:
CPU times: user 48.1 ms, sys: 4 ms, total: 52.1 ms
Wall time: 51.8 ms
cuDF:
CPU times: user 34.9 s, sys: 12.1 s, total: 47.1 s
Wall time: 47.1 s

N=10,000, steps=200
Pandas:
CPU times: user 824 ms, sys: 74.2 ms, total: 898 ms
Wall time: 897 ms
cuDF:
CPU times: user 57 s, sys: 19.5 s, total: 1min 16s
Wall time: 1min 16s
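
A minimal parameterized sketch of that benchmark, assuming the same setup as the original report; N and steps are the knobs described above, and time.perf_counter stands in for the %time magic so the script runs outside IPython:

import time

import cudf
import numpy as np
import pandas as pd

N = 10_000   # number of rows (strings) to split
steps = 200  # number of comma-separated values per string

data = {"col_0": [','.join(map(str, np.linspace(-1, 1, steps))) for _ in range(N)]}
df = pd.DataFrame(data=data)
cdf = cudf.from_pandas(df)

# Time the same split + reset_index on the CPU (pandas) ...
start = time.perf_counter()
df['col_0'].str.split(",", expand=True).reset_index(drop=True)
print(f"Pandas: {time.perf_counter() - start:.3f} s")

# ... and on the GPU (cuDF).
start = time.perf_counter()
cdf['col_0'].str.split(",", expand=True).reset_index(drop=True)
print(f"cuDF:   {time.perf_counter() - start:.3f} s")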

@davidwendt
Contributor

I see that! Wow. Thanks for the code snippet here, Devin. This will be very helpful. I should be able to profile this to see what is going on.
