
[FEA]: Byte-Pair Encoder (BPE) support #507

Open

pdmack opened this issue Nov 29, 2022 · 0 comments
Labels: feature request (New feature or request)

pdmack (Contributor) commented Nov 29, 2022

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request?

Medium

Please provide a clear description of the problem this feature solves

There are projects such as BERTIN whose focus is to train and evaluate BERT-based models for the Spanish language. Models such as RoBERTa, GPT-2, and GPT-3 require a BPE tokenizer, but the Morpheus inference pipeline currently uses the cuDF BERT tokenizer.
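
For reference, a minimal sketch of what CPU-side BPE tokenization looks like with the Hugging Face `transformers` library; the `gpt2` checkpoint is used only as an illustration of a byte-level BPE vocabulary, not as a proposed model:

```python
# Minimal sketch: CPU BPE tokenization via Hugging Face `transformers`.
# The "gpt2" checkpoint is illustrative; RoBERTa/GPT-style models ship a
# byte-level BPE vocabulary rather than the WordPiece vocab BERT uses.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "El phishing es un tipo de ataque de ingeniería social."
print(tokenizer.tokenize(text))      # BPE subword pieces, e.g. 'Ġphishing'
print(tokenizer(text)["input_ids"])  # token ids the model consumes
```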

Describe your ideal solution

Develop a new CPU tokenizer/NLP-preprocessing stage for Morpheus, as an adaptation of Morpheus's current phishing training.
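
As a rough sketch (not an existing Morpheus API), the per-batch work such a stage might perform could look like the following; the function name, the `data` column, and the `gpt2` checkpoint are assumptions, and a real stage would first move the strings from the incoming cuDF payload to the host:

```python
# Hypothetical per-batch preprocessing for a CPU BPE stage. Names and
# defaults are illustrative assumptions, not part of the Morpheus API.
import pandas as pd
from transformers import AutoTokenizer

# Illustrative checkpoint; use the tokenizer matching the deployed model.
_tokenizer = AutoTokenizer.from_pretrained("gpt2")
_tokenizer.pad_token = _tokenizer.eos_token  # GPT-2 has no pad token by default


def bpe_preprocess(df: pd.DataFrame, column: str = "data", seq_len: int = 128):
    """Tokenize one batch of messages into fixed-length BPE id arrays."""
    enc = _tokenizer(
        df[column].tolist(),
        padding="max_length",
        truncation=True,
        max_length=seq_len,
        return_tensors="np",
    )
    # input_ids / attention_mask are the same pair of tensors the current
    # BERT-based preprocessing hands to the inference stage.
    return enc["input_ids"], enc["attention_mask"]


ids, mask = bpe_preprocess(pd.DataFrame({"data": ["Estimado cliente, verifique su cuenta"]}))
```

If this works well enough on the CPU, the tokenization step could later be swapped for a GPU implementation should cuDF add BPE support (see the alternative below).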

Describe any alternatives you have considered

The cuDF project has an open feature request for GPU-accelerated BPE tokenizer support, but with no known roadmap commitment.

Additional context

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
pdmack added the feature request (New feature or request) label on Nov 29, 2022
pdmack moved this to Todo in Morpheus Boards on Nov 29, 2022