[REVIEW] Add python/cython bindings for `str.join` API #8085

galipremsagar · 2021-04-27T21:53:59Z

Resolves #8079

This PR:

Introduces Cython bindings for `concatenate_list_elements` and plumbs them through to our Python API, `.str.join`.
Enables `str.join` and adds more test coverage for it.
Adds docstrings and miscellaneous docs cleanup.
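
For readers unfamiliar with the API, here is a minimal pure-Python sketch of the row-wise behaviour that `.str.join` is expected to follow, assuming it mirrors pandas' `Series.str.join`. It is not the cuDF implementation, which runs on the GPU through `concatenate_list_elements`. The helper name `join_rows` and the `Row` alias are hypothetical and exist only for this illustration.

```python
from typing import List, Optional, Sequence, Union

# Hypothetical alias: a row is a list of strings, a plain string, or null.
Row = Union[str, Sequence[str], None]


def join_rows(rows: Sequence[Row], sep: str) -> List[Optional[str]]:
    """Join the elements of each row with `sep` (illustrative model only)."""
    out: List[Optional[str]] = []
    for row in rows:
        if row is None:
            # A null row stays null.
            out.append(None)
        else:
            # For a list row this joins its elements; for a plain string
            # row it joins the individual characters.
            out.append(sep.join(row))
    return out


print(join_rows([["a", "b", "c"], ["d"], None], "-"))  # ['a-b-c', 'd', None]
print(join_rows(["abc", "de"], "-"))                   # ['a-b-c', 'd-e']
```

The second call shows the non-list case: each string is treated as a sequence of characters, which is why a column of plain strings has to be split by character before joining.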

codecov · 2021-04-28T00:39:22Z

Codecov Report

Merging #8085 (a378208) into branch-0.20 (51336df) will increase coverage by 0.00%.
The diff coverage is 86.52%.

❗ Current head a378208 differs from pull request most recent head 743981d. Consider uploading reports for the commit 743981d to get more accurate results

@@              Coverage Diff              @@
##           branch-0.20    #8085    +/-   ##
=============================================
  Coverage        82.88%   82.89%            
=============================================
  Files              103      103            
  Lines            17668    17849   +181     
=============================================
+ Hits             14645    14796   +151     
- Misses            3023     3053    +30

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/column/__init__.py | `100.00% <ø> (ø)` | |
| python/cudf/cudf/io/orc.py | `86.89% <ø> (ø)` | |
| python/cudf/cudf/utils/cudautils.py | `57.75% <25.00%> (ø)` | |
| python/cudf/cudf/utils/dtypes.py | `81.87% <41.66%> (-1.57%)` | ⬇️ |
| python/cudf/cudf/core/column/lists.py | `86.98% <66.66%> (-0.43%)` | ⬇️ |
| python/cudf/cudf/core/column/struct.py | `94.73% <66.66%> (-1.56%)` | ⬇️ |
| python/cudf/cudf/core/column/numerical.py | `94.43% <72.72%> (ø)` | |
| python/cudf/cudf/core/tools/datetimes.py | `80.42% <75.29%> (-4.11%)` | ⬇️ |
| python/cudf/cudf/core/groupby/groupby.py | `91.55% <76.92%> (+0.11%)` | ⬆️ |
| python/cudf/cudf/core/column/column.py | `88.64% <77.77%> (ø)` | |
| ... and 30 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6c66bdc...743981d.

kkraus14 · 2021-04-28T01:10:47Z

python/cudf/cudf/core/column/string.py

+            # If self._column is not a ListColumn, we will have to
+            # split each row by character and create a ListColumn out of it.
+            strings_column = self._split_by_character()


This is really expensive in both computation and memory. We may want to raise an issue for a future optimization so that we don't have to materialize the offsets here.

galipremsagar · 2021-04-28

Opened a feature request (FEA), #8094, and added a TODO here.
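
To make the cost concern concrete, the following is a rough sketch of what splitting by character materializes, assuming an Arrow-style nested layout with 32-bit offsets and ignoring null masks. The function `split_by_character_layout` and the constant `OFFSET_BYTES` are hypothetical names for illustration; this is not the code behind `_split_by_character`.

```python
OFFSET_BYTES = 4  # assumption: 32-bit offsets


def split_by_character_layout(strings):
    """Build list offsets, child string offsets, and chars for a
    list-of-single-character-strings column (illustrative model only)."""
    list_offsets = [0]   # one entry per row, plus one
    child_offsets = [0]  # one entry per character, plus one
    chars = bytearray()
    for s in strings:
        for ch in s:
            chars.extend(ch.encode("utf-8"))
            child_offsets.append(len(chars))
        # Each row ends at the current number of child strings.
        list_offsets.append(len(child_offsets) - 1)
    return list_offsets, child_offsets, bytes(chars)


list_offsets, child_offsets, chars = split_by_character_layout(["ab", "c"])
print(list_offsets)   # [0, 2, 3]
print(child_offsets)  # [0, 1, 2, 3]
print(len(child_offsets) * OFFSET_BYTES, "offset bytes for", len(chars), "char bytes")
```

Under these assumptions the child offsets alone take about four bytes per character, several times the size of ASCII character data, which is the overhead an optimization along the lines of #8094 would aim to avoid.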

kkraus14 · 2021-04-29T14:33:21Z

@gpucibot merge

This PR adds a benchmark to the current `tokenize_benchmark.cpp` to measure the `nvtext::character_tokenize` API. PR #8085 added code that uses the `nvtext::character_tokenize` function. The benchmark was also useful while investigating #8094. An unused variable found in the code logic was also removed.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- Nghia Truong (https://github.com/ttnghia)

URL: #8125

galipremsagar added 2 commits April 27, 2021 14:01

enable str.join API

dde5c8b

add docs

59a488f

galipremsagar added the labels `feature request`, `3 - Ready for Review`, `Python`, `4 - Needs cuDF (Python) Reviewer`, `strings`, and `non-breaking` on Apr 27, 2021

galipremsagar requested review from shwina, kkraus14 and davidwendt April 27, 2021 21:54

galipremsagar requested a review from a team as a code owner April 27, 2021 21:54

galipremsagar self-assigned this Apr 27, 2021

kkraus14 reviewed Apr 28, 2021

python/cudf/cudf/core/column/string.py (outdated)

galipremsagar added 2 commits April 28, 2021 07:17

Merge remote-tracking branch 'upstream/branch-0.20' into 8079

1677047

use empty column

49e9248

galipremsagar mentioned this pull request Apr 28, 2021

[FEA] Support for the ability to split strings by character #8094

Closed

add todo

743981d

kkraus14 approved these changes Apr 29, 2021

kkraus14 added the label `5 - Ready to Merge` and removed the labels `3 - Ready for Review` and `4 - Needs cuDF (Python) Reviewer` on Apr 29, 2021

rapids-bot bot merged commit ac25e97 into rapidsai:branch-0.20 Apr 29, 2021

davidwendt mentioned this pull request Apr 30, 2021

Add chars-tokenizer to nvtext tokenize_benchmark.cpp #8125

Merged
