Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add filter_tokens nvtext API #5658

Merged
merged 20 commits into from
Jul 16, 2020

Conversation

davidwendt
Copy link
Contributor

Closes #5521

Add function to filter tokens from strings based on the length of each token as described in #5521

std::unique_ptr<cudf::column> filter_tokens(
  cudf::strings_column_view const& strings,
  cudf::size_type min_token_length,
  cudf::string_scalar const& replacement = cudf::string_scalar{""},
  cudf::string_scalar const& delimiter   = cudf::string_scalar{""},
  rmm::mr::device_memory_resource* mr    = rmm::mr::get_default_resource());

This PR also includes the Python strings API and the required Cython interfaces.

    def filter_tokens(
        self, min_token_length, replacement=None, delimiter=None, **kwargs
    ):

@davidwendt davidwendt self-assigned this Jul 8, 2020
@davidwendt davidwendt added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Jul 8, 2020
@codecov
Copy link

codecov bot commented Jul 9, 2020

Codecov Report

Merging #5658 into branch-0.15 will decrease coverage by 0.27%.
The diff coverage is 100.00%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.15    #5658      +/-   ##
===============================================
- Coverage        86.21%   85.94%   -0.28%     
===============================================
  Files               72       72              
  Lines            12727    12439     -288     
===============================================
- Hits             10973    10691     -282     
+ Misses            1754     1748       -6     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/string.py 86.61% <100.00%> (+0.03%) ⬆️
python/cudf/cudf/io/hdf.py 50.00% <0.00%> (-7.15%) ⬇️
python/cudf/cudf/utils/cudautils.py 46.80% <0.00%> (-3.53%) ⬇️
python/cudf/cudf/core/abc.py 88.09% <0.00%> (-3.21%) ⬇️
python/cudf/cudf/comm/gpuarrow.py 87.67% <0.00%> (-1.62%) ⬇️
python/cudf/cudf/core/column/datetime.py 84.41% <0.00%> (-1.04%) ⬇️
python/cudf/cudf/core/multiindex.py 80.11% <0.00%> (-0.77%) ⬇️
python/cudf/cudf/core/buffer.py 82.00% <0.00%> (-0.70%) ⬇️
python/cudf/cudf/core/index.py 91.17% <0.00%> (-0.59%) ⬇️
python/cudf/cudf/utils/utils.py 84.40% <0.00%> (-0.41%) ⬇️
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1f4209c...39a4409. Read the comment docs.

@davidwendt davidwendt changed the title [WIP] Add filter_tokens nvtext API [WIP] Add filter_tokens nvtext API Jul 9, 2020
@davidwendt davidwendt marked this pull request as ready for review July 9, 2020 13:53
@davidwendt davidwendt requested review from a team as code owners July 9, 2020 13:53
@davidwendt davidwendt requested review from jrhemstad and vuule July 9, 2020 13:53
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 9, 2020
@davidwendt davidwendt changed the title [WIP] Add filter_tokens nvtext API [REVIEW] Add filter_tokens nvtext API Jul 9, 2020
Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python / Cython looks good

Copy link
Contributor

@trevorsm7 trevorsm7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Good use of 🔥 in the unit tests 😄. Just a minor suggestion to try empty_like

cpp/src/text/replace.cu Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor questions/suggestions, the PR looks great overall!

cpp/include/nvtext/replace.hpp Show resolved Hide resolved
cpp/src/text/replace.cu Outdated Show resolved Hide resolved
cpp/src/text/replace.cu Outdated Show resolved Hide resolved
@davidwendt davidwendt requested a review from vuule July 15, 2020 12:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Filtering words based on size
4 participants