[REVIEW] Add python/cython bindings for str.join API #8085

Merged
merged 5 commits into from
Apr 29, 2021


galipremsagar
Contributor

Resolves #8079

This PR:

  • Introduces Cython bindings for concatenate_list_elements and plumbs them into our Python API, .str.join.
  • Enables the existing str.join tests and adds more test coverage.
  • Adds a docstring and cleans up miscellaneous docs.

galipremsagar added the labels feature request, 3 - Ready for Review, Python, 4 - Needs cuDF (Python) Reviewer, strings, and non-breaking on Apr 27, 2021
galipremsagar requested a review from a team as a code owner on April 27, 2021 21:54
galipremsagar self-assigned this on Apr 27, 2021

codecov bot commented Apr 28, 2021

Codecov Report

Merging #8085 (a378208) into branch-0.20 (51336df) will increase coverage by 0.00%.
The diff coverage is 86.52%.

❗ Current head a378208 differs from pull request most recent head 743981d. Consider uploading reports for the commit 743981d to get more accurate results

@@              Coverage Diff              @@
##           branch-0.20    #8085    +/-   ##
=============================================
  Coverage        82.88%   82.89%            
=============================================
  Files              103      103            
  Lines            17668    17849   +181     
=============================================
+ Hits             14645    14796   +151     
- Misses            3023     3053    +30     
| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/core/column/__init__.py | 100.00% <ø> (ø) |
| python/cudf/cudf/io/orc.py | 86.89% <ø> (ø) |
| python/cudf/cudf/utils/cudautils.py | 57.75% <25.00%> (ø) |
| python/cudf/cudf/utils/dtypes.py | 81.87% <41.66%> (-1.57%) ⬇️ |
| python/cudf/cudf/core/column/lists.py | 86.98% <66.66%> (-0.43%) ⬇️ |
| python/cudf/cudf/core/column/struct.py | 94.73% <66.66%> (-1.56%) ⬇️ |
| python/cudf/cudf/core/column/numerical.py | 94.43% <72.72%> (ø) |
| python/cudf/cudf/core/tools/datetimes.py | 80.42% <75.29%> (-4.11%) ⬇️ |
| python/cudf/cudf/core/groupby/groupby.py | 91.55% <76.92%> (+0.11%) ⬆️ |
| python/cudf/cudf/core/column/column.py | 88.64% <77.77%> (ø) |
| ... and 30 more | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6c66bdc...743981d.

Comment on lines 600 to 602
# If self._column is not a ListColumn, we will have to
# split each row by character and create a ListColumn out of it.
strings_column = self._split_by_character()
Collaborator


This is really expensive in both computation and memory. We may want to raise an issue for a future optimization so that we don't have to materialize the offsets here.

Contributor Author


Opened a FEA (#8094) and added a TODO here.
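
To make the cost raised in this thread concrete, here is a small pure-Python sketch of what splitting every row by character has to materialize. It assumes an Arrow-style layout (a lists offsets buffer over a child strings column that has its own offsets buffer into UTF-8 bytes); the function name `character_offsets` is hypothetical and this is not the code path used in the PR.

```python
from typing import List, Tuple


def character_offsets(rows: List[str]) -> Tuple[List[int], List[int]]:
    """Build the two offsets buffers needed for rows split by character.

    Returns (list_offsets, char_offsets):
    - list_offsets has len(rows) + 1 entries and indexes into the
      child column of single-character strings.
    - char_offsets has total_characters + 1 entries and indexes into
      the UTF-8 byte buffer. This is the buffer that grows with the
      number of characters rather than the number of rows.
    """
    list_offsets = [0]
    char_offsets = [0]
    for row in rows:
        for ch in row:
            # Each character becomes its own child string, so it needs
            # its own offset entry (multi-byte characters advance by
            # more than one byte).
            char_offsets.append(char_offsets[-1] + len(ch.encode("utf-8")))
        # The list row ends at the current number of child strings.
        list_offsets.append(len(char_offsets) - 1)
    return list_offsets, char_offsets


if __name__ == "__main__":
    lists, chars = character_offsets(["abc", "", "hé"])
    print(lists)  # one entry per row, plus one
    print(chars)  # one entry per character, plus one
```

Under the assumption of 32-bit offsets, the per-character buffer alone costs about four bytes per character, which for mostly ASCII data is several times the size of the character data itself.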

kkraus14 added the label 5 - Ready to Merge and removed the labels 3 - Ready for Review and 4 - Needs cuDF (Python) Reviewer on Apr 29, 2021
@kkraus14
Collaborator

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ac25e97 into rapidsai:branch-0.20 Apr 29, 2021
rapids-bot bot pushed a commit that referenced this pull request May 6, 2021
This PR adds a benchmark to the current `tokenize_benchmark.cpp` to measure the `nvtext::character_tokenize` API.

PR #8085 added code for using the `nvtext::character_tokenize` function. 
The benchmark was also useful while investigating #8094.
Also found and removed an unused variable in the code logic.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Nghia Truong (https://github.com/ttnghia)

URL: #8125
Labels
5 - Ready to Merge, feature request, non-breaking, Python, strings
Development

Successfully merging this pull request may close these issues.

[FEA] Add Python bindings for concatenating lists as strings w/ separators