[BUG] cudf.Series str.replace performance appears slower than nvstrings #4937

Closed
Garfounkel opened this issue Apr 17, 2020 · 1 comment · Fixed by #4958
Labels: bug, libcudf, Performance, strings

Garfounkel commented Apr 17, 2020

Describe the bug
cudf.Series str.replace performance appears slower than nvstrings. Probably related to #4885.

Steps/Code to reproduce bug

import nvstrings
from cudf import Series
import string

corpus = [
     'this is the first document.',
     'this document is the second document.',
     'and this is the third one.',
     'is this the first document?',
]
n = 500

corpus_wide = [doc * n for doc in corpus]

d = nvstrings.to_device(corpus_wide)
s = Series(corpus_wide)

print("nvstrings:", flush=True)
%time d.replace('[{}]'.format(string.punctuation), '')  # 0.1s
# nvstrings:
# CPU times: user 84 ms, sys: 28 ms, total: 112 ms
# Wall time: 111 ms

print("cuDF:", flush=True)
%time s.str.replace('[{}]'.format(string.punctuation), '')  # 30s
# cuDF:
# CPU times: user 18.6 s, sys: 12.1 s, total: 30.8 s
# Wall time: 30.8 s
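
When comparing the two timings, it can help to confirm that both calls produce the same strings. The following is a minimal CPU sketch using Python's standard `re` module with the same pattern as the reproducer; the helper name `strip_punctuation_cpu` is hypothetical and not part of cuDF or nvstrings.

```python
import re
import string

# Same character-class pattern as in the reproducer above.
PATTERN = re.compile('[{}]'.format(string.punctuation))


def strip_punctuation_cpu(docs):
    """Hypothetical CPU reference: drop ASCII punctuation from each string."""
    return [PATTERN.sub('', doc) for doc in docs]


corpus = [
    'this is the first document.',
    'this document is the second document.',
    'and this is the third one.',
    'is this the first document?',
]

expected = strip_punctuation_cpu(corpus)
print(expected[0])  # this is the first document
```

Assuming the GPU results are copied back to host lists, they can be compared against `expected` to check that the slower path and the faster path agree on output.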

Expected behavior
Performance should be more closely aligned with nvstrings.
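
To track the gap outside IPython (where `%time` is unavailable), a small wall-clock helper along these lines could be used. This is only a sketch: `best_wall_time` and `slowdown` are hypothetical names, and it assumes each call returns only after its work has finished.

```python
import time


def best_wall_time(fn, repeat=3):
    """Return (best_seconds, last_result) over `repeat` calls of fn()."""
    best = float('inf')
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result


def slowdown(baseline_seconds, candidate_seconds):
    """How many times slower the candidate is than the baseline."""
    return candidate_seconds / baseline_seconds


# Wall times reported above: 111 ms for nvstrings, 30.8 s for cuDF.
print(round(slowdown(0.111, 30.8)))  # 277
```

In the reproducer, `fn` would be a zero-argument lambda wrapping the `d.replace(...)` or `s.str.replace(...)` call.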

Garfounkel added the Needs Triage and bug labels on Apr 17, 2020
kkraus14 added the libcudf, Performance, and strings labels and removed the Needs Triage label on Apr 17, 2020
kkraus14 (Collaborator) commented

cc @davidwendt in case you have any ideas
