Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add nvtext::detokenize API #5739

Merged

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Jul 21, 2020

Closes #5630

This API accepts a strings column of tokens and recombines them using an indices column and a separator.

std::unique_ptr<cudf::column> detokenize(
  cudf::strings_column_view const& strings,
  cudf::column_view const& row_indices,
  cudf::string_scalar const& separator = cudf::string_scalar(" "),
  rmm::mr::device_memory_resource* mr  = rmm::mr::get_default_resource());

Example pseudo-code:

 s = ["hello", "world", "one", "two", "three"]
 r = [0, 0, 1, 1, 1]
 t = detokenize(s)
 t is now ["hello world", "one two three"]

@davidwendt davidwendt self-assigned this Jul 21, 2020
@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Jul 21, 2020
@codecov
Copy link

codecov bot commented Jul 22, 2020

Codecov Report

Merging #5739 into branch-0.15 will decrease coverage by 2.20%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.15    #5739      +/-   ##
===============================================
- Coverage        88.65%   86.44%   -2.21%     
===============================================
  Files               57       75      +18     
  Lines            10945    13024    +2079     
===============================================
+ Hits              9703    11259    +1556     
- Misses            1242     1765     +523     
Impacted Files Coverage Δ
python/cudf/cudf/core/dtypes.py 86.84% <0.00%> (-10.09%) ⬇️
python/cudf/cudf/core/abc.py 91.30% <0.00%> (-3.15%) ⬇️
python/cudf/cudf/core/frame.py 89.70% <0.00%> (-1.48%) ⬇️
python/dask_cudf/dask_cudf/backends.py 89.51% <0.00%> (-1.47%) ⬇️
python/cudf/cudf/utils/ioutils.py 86.13% <0.00%> (-1.21%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 93.12% <0.00%> (-0.99%) ⬇️
python/cudf/cudf/core/indexing.py 95.22% <0.00%> (-0.68%) ⬇️
python/cudf/cudf/utils/dtypes.py 85.98% <0.00%> (-0.45%) ⬇️
python/cudf/cudf/core/multiindex.py 80.53% <0.00%> (-0.10%) ⬇️
python/cudf/cudf/io/avro.py 81.81% <0.00%> (ø)
... and 56 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e4d8eb5...aaeb9ea. Read the comment docs.

@davidwendt davidwendt changed the title [WIP] Add nvtext::detokenize API [REVIEW] Add nvtext::detokenize API Jul 22, 2020
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 22, 2020
@davidwendt davidwendt marked this pull request as ready for review July 22, 2020 18:48
@davidwendt davidwendt requested review from a team as code owners July 22, 2020 18:48
@davidwendt davidwendt requested review from cwharris and vuule July 22, 2020 18:48
Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor docstring issue, otherwise Python / Cython / CMake LGTM

python/cudf/cudf/core/column/string.py Outdated Show resolved Hide resolved
@davidwendt davidwendt requested a review from vuule July 24, 2020 20:29
@davidwendt
Copy link
Contributor Author

rerun tests

@davidwendt davidwendt added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jul 28, 2020
@kkraus14
Copy link
Collaborator

rerun tests

@davidwendt davidwendt merged commit 9951f71 into rapidsai:branch-0.15 Jul 28, 2020
@davidwendt davidwendt deleted the fea-create-strings-from-tokens branch July 28, 2020 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Combine tokenized strings into a single string column
5 participants