Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add list output option to character_ngrams() function #9499

Merged
merged 12 commits into from
Oct 30, 2021

Conversation

davidwendt
Copy link
Contributor

Closes #8190

Depends on #9498

Adds an as_list output option to the cudf character_ngrams() API. The generated ngrams can be grouped within list offsets.

Example 1

import cudf
s = cudf.Series(["bumblebee", "transformers"])
print(s.str.character_ngrams(as_list=True))

0                [bu, um, mb, bl, le, eb, be, ee]
1    [tr, ra, an, ns, sf, fo, or, rm, me, er, rs]
dtype: list

Strings too small for the ngram size result in empty list rows.
And as always, null rows result in null rows.

Example 2

import cudf
s = cudf.Series(["bumblebee", "transformers", None, "", "a"])
print(s.str.character_ngrams(as_list=True))

0                [bu, um, mb, bl, le, eb, be, ee]
1    [tr, ra, an, ns, sf, fo, or, rm, me, er, rs]
2                                            None
3                                              []
4                                              []
dtype: list

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress Python Affects Python cuDF API. non-breaking Non-breaking change labels Oct 22, 2021
@davidwendt davidwendt self-assigned this Oct 22, 2021

# convert the output to a list by just generating the
# offsets for the output list column
s1 = self.len() - (n - 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe mypy complains here because the return type of self.len() is a Union[Series, BaseIndex], and BaseIndex does not support a __sub__() method.

@vyasr do you have any thoughts on how to resolve? (other than a type: ignore of course :))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the typing really accurate? Shouldn't it return Series for Series of strings or Int64Index for any index type? This seems like an opportunity to tighten up our annotations.

@codecov
Copy link

codecov bot commented Oct 22, 2021

Codecov Report

Merging #9499 (3e46136) into branch-21.12 (ab4bfaa) will decrease coverage by 0.12%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9499      +/-   ##
================================================
- Coverage         10.79%   10.66%   -0.13%     
================================================
  Files               116      117       +1     
  Lines             18869    19723     +854     
================================================
+ Hits               2036     2104      +68     
- Misses            16833    17619     +786     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.90% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 66 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dce38d4...3e46136. Read the comment docs.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Oct 27, 2021
@davidwendt davidwendt marked this pull request as ready for review October 27, 2021 13:40
@davidwendt davidwendt requested a review from a team as a code owner October 27, 2021 13:40
@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Oct 29, 2021
@galipremsagar
Copy link
Contributor

@davidwendt Your examples are always amazing

@galipremsagar
Copy link
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 77c6f1d into rapidsai:branch-21.12 Oct 30, 2021
@davidwendt davidwendt deleted the fea-char-ngrams-list branch November 1, 2021 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] ngram function that outputs a list
4 participants