
[FEA]: Byte-Pair Encoder (BPE) support #507

Open

pdmack opened this issue Nov 29, 2022 · 0 comments
Labels: feature request (New feature or request)

pdmack (Contributor) commented Nov 29, 2022

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request?

Medium

Please provide a clear description of the problem this feature solves

There are projects such as BERTIN whose focus is to train and evaluate BERT-based models for the Spanish language. Models such as RoBERTa, GPT-2, and GPT-3 require a BPE tokenizer, but the Morpheus inference pipeline currently uses the cuDF BERT tokenizer.
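
For reference, a minimal sketch of what CPU-side BPE tokenization looks like with the Hugging Face `transformers` library; the `gpt2` checkpoint is used only as an illustration of a byte-level BPE vocabulary, not as a proposed model:

```python
# Minimal sketch: CPU BPE tokenization via Hugging Face `transformers`.
# The "gpt2" checkpoint is illustrative; RoBERTa/GPT-style models ship a
# byte-level BPE vocabulary rather than the WordPiece vocab BERT uses.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "El phishing es un tipo de ataque de ingeniería social."
print(tokenizer.tokenize(text))      # BPE subword pieces, e.g. 'Ġphishing'
print(tokenizer(text)["input_ids"])  # token ids the model consumes
```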

Describe your ideal solution

Develop a new CPU tokenizer/NLP-preprocessing stage for Morpheus, as an adaptation of Morpheus's current phishing training.
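
As a rough sketch (not an existing Morpheus API), the per-batch work such a stage might perform could look like the following; the function name, the `data` column, and the `gpt2` checkpoint are assumptions, and a real stage would first move the strings from the incoming cuDF payload to the host:

```python
# Hypothetical per-batch preprocessing for a CPU BPE stage. Names and
# defaults are illustrative assumptions, not part of the Morpheus API.
import pandas as pd
from transformers import AutoTokenizer

# Illustrative checkpoint; use the tokenizer matching the deployed model.
_tokenizer = AutoTokenizer.from_pretrained("gpt2")
_tokenizer.pad_token = _tokenizer.eos_token  # GPT-2 has no pad token by default


def bpe_preprocess(df: pd.DataFrame, column: str = "data", seq_len: int = 128):
    """Tokenize one batch of messages into fixed-length BPE id arrays."""
    enc = _tokenizer(
        df[column].tolist(),
        padding="max_length",
        truncation=True,
        max_length=seq_len,
        return_tensors="np",
    )
    # input_ids / attention_mask are the same pair of tensors the current
    # BERT-based preprocessing hands to the inference stage.
    return enc["input_ids"], enc["attention_mask"]


ids, mask = bpe_preprocess(pd.DataFrame({"data": ["Estimado cliente, verifique su cuenta"]}))
```

If this works well enough on the CPU, the tokenization step could later be swapped for a GPU implementation should cuDF add BPE support (see the alternative below).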

Describe any alternatives you have considered

The cuDF project has an open feature request for GPU-accelerated BPE tokenizer support, but with no known roadmap commitment.

Additional context

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
pdmack added the feature request (New feature or request) label on Nov 29, 2022
pdmack moved this to Todo in Morpheus Boards on Nov 29, 2022