Add BytePairEncoder class to cuDF #13891

Merged Nov 14, 2023 (123 commits)

@davidwendt davidwendt commented Aug 16, 2023

Description

Adds a new BytePairEncoder class to cuDF.

>>> import cudf
>>> from cudf.core.byte_pair_encoding import BytePairEncoder
>>> mps = cudf.read_text('merges.txt', delimiter='\n', strip_delimiters=True)
>>> bpe = BytePairEncoder(mps)
>>> str_series = cudf.Series(['This is a sentence', 'thisisit'])
>>> bpe(str_series)
0    This is a sent ence
1             this is it
dtype: object

This class wraps the existing nvtext::byte_pair_encoding APIs to load the merge-pairs data and encode a column of strings.
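
For readers unfamiliar with how merge-pairs data drives the encoding, the following is a minimal CPU-only sketch of a generic byte-pair-encoding merge loop in plain Python. It is not the libcudf implementation and does not use cuDF. The function names (`load_merge_ranks`, `bpe_encode_word`, `bpe_encode`) and the tiny merge list are hypothetical, chosen only for illustration. It assumes each merge line holds two space-separated tokens, that earlier lines take priority, and that the input is split on whitespace before encoding.

```python
def load_merge_ranks(lines):
    """Map each merge pair to its rank; earlier lines get lower (better) ranks."""
    ranks = {}
    for rank, line in enumerate(lines):
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            ranks.setdefault((parts[0], parts[1]), rank)
    return ranks


def bpe_encode_word(word, ranks):
    """Repeatedly merge the lowest-ranked adjacent pair until none is in the table."""
    tokens = list(word)  # start from single characters
    while len(tokens) > 1:
        # Index of the adjacent pair with the lowest rank (leftmost on ties).
        best = min(
            range(len(tokens) - 1),
            key=lambda i: ranks.get((tokens[i], tokens[i + 1]), float("inf")),
        )
        if (tokens[best], tokens[best + 1]) not in ranks:
            break  # no remaining pair appears in the merge table
        tokens[best:best + 2] = [tokens[best] + tokens[best + 1]]
    return tokens


def bpe_encode(text, ranks, separator=" "):
    """Encode whitespace-separated words and join the resulting tokens."""
    return separator.join(
        separator.join(bpe_encode_word(word, ranks)) for word in text.split()
    )


# Hypothetical merge list standing in for the contents of a merges file.
example_merges = ["t h", "i s", "th is", "i t"]
example_ranks = load_merge_ranks(example_merges)
print(bpe_encode("thisisit", example_ranks))  # this is it
```

With this toy merge list, `thisisit` is split into `this is it`, which mirrors the second row of the example output above. The real class runs the equivalent work on the GPU over a whole column of strings.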

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added the feature request, 2 - In Progress, libcudf, Python, and non-breaking labels Aug 16, 2023
@davidwendt davidwendt self-assigned this Aug 16, 2023
@github-actions github-actions bot added the CMake label Aug 16, 2023
@davidwendt davidwendt changed the title Add BytePairEncoder class to cuDF Add BytePairEncoder class to cuDF Aug 16, 2023
@davidwendt davidwendt changed the title Add BytePairEncoder class to cuDF Add BytePairEncoder class to cuDF Aug 18, 2023
@davidwendt davidwendt changed the title Add BytePairEncoder class to cuDF Add BytePairEncoder class to cuDF Aug 24, 2023
@github-actions github-actions bot removed the libcudf label Nov 6, 2023
@davidwendt davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Nov 6, 2023
@davidwendt davidwendt marked this pull request as ready for review November 6, 2023 21:09
@davidwendt davidwendt requested a review from a team as a code owner November 6, 2023 21:09
@davidwendt davidwendt requested review from shwina and bdice November 6, 2023 21:09
@GregoryKimball GregoryKimball requested a review from vyasr November 13, 2023 18:27

@bdice bdice left a comment

Approving. Requesting one small rename and one docstring summary.

python/cudf/cudf/_lib/nvtext/byte_pair_encode.pyx (outdated, resolved)
python/cudf/cudf/_lib/nvtext/byte_pair_encode.pyx (outdated, resolved)
python/cudf/cudf/core/byte_pair_encoding.py (outdated, resolved)
python/cudf/cudf/core/byte_pair_encoding.py (resolved)
@davidwendt

/merge

@rapids-bot rapids-bot bot merged commit b0c1b7b into rapidsai:branch-23.12 Nov 14, 2023
62 checks passed
@davidwendt davidwendt deleted the bpe-python-api branch November 14, 2023 17:48
Labels: 3 - Ready for Review, CMake, feature request, non-breaking, Python