Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify language_modeling.py and tokenization.py #2704

Closed
masci opened this issue Jun 22, 2022 · 0 comments · Fixed by #2703
Closed

Simplify language_modeling.py and tokenization.py #2704

masci opened this issue Jun 22, 2022 · 0 comments · Fixed by #2703
Assignees
Labels
topic:modeling type:refactor Not necessarily visible to the users

Comments

@masci
Copy link
Contributor

masci commented Jun 22, 2022

Problem
The language_model.py module currently contains wrappers for HF models, one for each type of model supported, plus some additional utilities. The same happens in tokenization.py, but for tokenizers.

Most of these models are straight out copies of each other with minimal variable name changes. All of these could be reduced to a couple of classes. There are also long if-else chains that bring close to no value and slow down the codebase. In the case of tokenizers the simplifications can be even more drastic as AutoTokenizer can replace all the tokenizers we currently use, rendering a good share of the code of tokenization.py redundant.

In addition, many methods in these two modules take **kwargs needlessly, and this makes function calls and parameters passage very opaque and hard to understand. I believe most of these **kwargs can be safely removed or contained into explicit dictionary parameters.

Solution

  • Simplify language_model.py to reduce drastically the code duplication and remove the if-else switches
  • Simplify tokeinzation.py to use only AutoTokenizer
  • Remove usage of **kwargs across these two modules
  • Implement standalone factory methods for LanguageModel subclasses
  • Replace the Tokenizer class with a standalone factory method
  • Adapt the codebase to not pass **kwargs mindlessly, use the new factory methods, and fix the tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:modeling type:refactor Not necessarily visible to the users
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants