[FEA] Move text processing APIs and implementation out of cudf into a separate library or package #9555

Closed
shwina opened this issue Oct 28, 2021 · 7 comments
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. strings strings issues (C++ and Python)

Comments

@shwina
Contributor

shwina commented Oct 28, 2021

The StringMethods accessor in cuDF contains a few complex text processing APIs, including n-gram generation and subword tokenization. I think those APIs should live in a separate repo, for two reasons:

  1. They have low discoverability: we advertise cuDF as a GPU DataFrame library, not as a tokenization library or text processing library.
  2. They stick out awkwardly in terms of the complexity of the operations that StringMethods supports (capitalize is a much simpler operation than subword_tokenize).

There used to be a separate nvtext repository that got merged into cuDF. Perhaps we should consider splitting these functions out again into their own library, especially given that gpuCI is much easier these days to integrate into a new project than it historically was.
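To make the complexity gap in point 2 concrete, here is a plain-Python sketch. It is not cuDF code: `capitalize_all` and `character_ngrams` are hypothetical names used only for illustration. The first function is element-wise (one output per input), while the second produces a variable number of outputs per input string, which is closer to a text-processing pipeline step than to a string method.

```python
def capitalize_all(strings):
    """Element-wise operation: exactly one output string per input string."""
    return [s.capitalize() for s in strings]


def character_ngrams(strings, n=2):
    """Hypothetical text-processing helper: each input string yields
    len(s) - n + 1 character n-grams, so the output length differs
    from the input length."""
    out = []
    for s in strings:
        out.extend(s[i:i + n] for i in range(len(s) - n + 1))
    return out


words = ["gpu", "text"]
print(capitalize_all(words))    # same number of rows as the input
print(character_ngrams(words))  # more rows than the input
```

Subword tokenization goes further still, since it also needs a vocabulary and produces token IDs rather than strings.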

@shwina shwina added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Oct 28, 2021
@davidwendt
Contributor

There was discussion early in developing nvtext that it should probably live in cuML. Perhaps an NLP or text processing repository would be more useful.
@randerzander @VibhuJawa @beckernick

@VibhuJawa
Member

I agree with the principle that putting everything in StringMethods seems to decrease discoverability, but I would be wary of moving it to a different repo, as that will likely mean that users have to install another dependency, which will hamper user experience and adoption.

I think a middle path might be to move the functions that we don't think belong in the StringMethods accessor into a namespace like cudf.text and create some separate documentation around it.
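A rough sketch of what that split could look like, in plain Python so it runs without a GPU. Everything here is an assumption for illustration: the toy `StringMethods` class, the `text` namespace object, and the `ngrams_tokenize` helper are hypothetical stand-ins and do not describe the actual cuDF layout.

```python
from types import SimpleNamespace


class StringMethods:
    """Toy stand-in for a pandas-style .str accessor holding only
    simple, element-wise string operations."""

    def __init__(self, data):
        self._data = list(data)

    def capitalize(self):
        return [s.capitalize() for s in self._data]


def ngrams_tokenize(strings, n=2, separator="_"):
    """Hypothetical text function: split each string on whitespace and
    join every run of n consecutive tokens with `separator`."""
    out = []
    for s in strings:
        tokens = s.split()
        out.extend(
            separator.join(tokens[i:i + n])
            for i in range(len(tokens) - n + 1)
        )
    return out


# Hypothetical `text` namespace: same package, separate entry point
# that can get its own documentation page.
text = SimpleNamespace(ngrams_tokenize=ngrams_tokenize)

docs = ["this is the best", "gpu dataframes"]
print(StringMethods(docs).capitalize())
print(text.ngrams_tokenize(docs))
```

With this shape users install nothing extra, and the accessor keeps only the simple string operations.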

@shwina
Contributor Author

shwina commented Nov 17, 2021

Thanks, @VibhuJawa - that's a good point. I'd say that's a reasonable approach also.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@jrhemstad jrhemstad added 0 - Backlog In queue waiting for assignment and removed inactive-30d labels May 12, 2022
@cwharris
Contributor

cwharris commented May 12, 2022

Morpheus depends on the nvtext subword tokenizer, so it would be nice to have this in its own library (or just installed in cudf in a proper namespace) so we can use it in a supported fashion.

In the meantime, we are probably going to have to copy-paste the relevant files into the Morpheus repo.

@vyasr
Contributor

vyasr commented May 13, 2024

The long-term home for this functionality will be pylibcudf. While some bits of cuDF functionality may remain a superset of pandas, most functions that are "extra" (in the sense of not being part of the pandas API) will likely be removed from cuDF and be primarily accessible from pylibcudf. At that point, if we deem it necessary we could create additional packages wrapping specific parts of pylibcudf functionality that don't fit in cuDF.

@vyasr vyasr closed this as completed May 13, 2024

7 participants