Problem
The language_model.py module currently contains one wrapper per supported HF model type, plus some additional utilities. tokenization.py does the same for tokenizers.
Most of these wrappers are near-verbatim copies of each other with minimal variable name changes, and all of them could be reduced to a couple of classes. There are also long if-else chains that add almost no value and slow down the codebase. For tokenizers the simplification can be even more drastic: AutoTokenizer can replace all the tokenizers we currently use, which makes a good share of the code in tokenization.py redundant.
In addition, many methods in these two modules take **kwargs needlessly, which makes function calls and parameter passing opaque and hard to follow. I believe most of these **kwargs can be safely removed or replaced with explicit dictionary parameters.
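As a rough illustration of collapsing the per-model copies and the if-else chains, the sketch below keeps the per-type differences in a lookup table and uses one generic wrapper. The names `HFLanguageModel`, `_HAS_POOLER` and `build_language_model` are hypothetical and not taken from the current code, and the table values are only illustrative.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class HFLanguageModel:
    """Single generic wrapper (hypothetical), parameterised by model type."""
    model_type: str
    has_pooler: bool


# Per-type differences live in data instead of in copied classes.
# The entries here are illustrative, not a list of supported models.
_HAS_POOLER: Dict[str, bool] = {
    "bert": True,
    "roberta": True,
    "distilbert": False,
}


def build_language_model(model_type: str) -> HFLanguageModel:
    # One dictionary lookup replaces the long if-else chain.
    try:
        has_pooler = _HAS_POOLER[model_type]
    except KeyError:
        raise ValueError(f"Unsupported model type: {model_type!r}") from None
    return HFLanguageModel(model_type=model_type, has_pooler=has_pooler)
```

Supporting a new model type then means adding one table entry rather than another copy of a class.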
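To make the **kwargs concern concrete, here is a minimal before/after sketch. The function names and the `max_seq_len` and `tokenizer_args` parameters are made up for illustration and do not mirror actual signatures in these modules.

```python
from typing import Any, Dict, Optional


def load_opaque(name: str, **kwargs: Any) -> Dict[str, Any]:
    # Before: any keyword is accepted, so a misspelled option is
    # silently forwarded instead of being rejected.
    return {"name": name, **kwargs}


def load_explicit(
    name: str,
    max_seq_len: int = 256,
    tokenizer_args: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # After: our own options are named parameters, and anything meant for
    # the underlying library travels in one explicit dictionary.
    return {
        "name": name,
        "max_seq_len": max_seq_len,
        "tokenizer_args": dict(tokenizer_args or {}),
    }
```

With the explicit signature, a misspelled keyword raises `TypeError` at the call site instead of travelling unnoticed down the call chain.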
Solution
Simplify language_model.py to drastically reduce code duplication and remove the if-else switches
Simplify tokenization.py to use only AutoTokenizer
Remove usage of **kwargs across these two modules
Implement standalone factory methods for LanguageModel subclasses
Replace the Tokenizer class with a standalone factory method
Adapt the codebase to stop passing **kwargs mindlessly, use the new factory methods, and fix the tests.
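A standalone tokenizer factory along these lines might look like the sketch below. The function name `get_tokenizer` and the `tokenizer_args` and `loader` parameters are assumptions for illustration; the only library call relied on is `AutoTokenizer.from_pretrained` from transformers, imported lazily so that the loader can be swapped out in tests.

```python
from typing import Any, Callable, Dict, Optional


def get_tokenizer(
    pretrained_model_name_or_path: str,
    use_fast: bool = True,
    tokenizer_args: Optional[Dict[str, Any]] = None,
    loader: Optional[Callable[..., Any]] = None,
) -> Any:
    """Hypothetical factory function standing in for the Tokenizer class."""
    if loader is None:
        # Default to AutoTokenizer; imported here so that tests can inject
        # a fake loader without needing transformers installed.
        from transformers import AutoTokenizer

        loader = AutoTokenizer.from_pretrained
    # Library-specific options arrive in one explicit dictionary,
    # not through an open-ended **kwargs on the factory itself.
    return loader(
        pretrained_model_name_or_path,
        use_fast=use_fast,
        **(tokenizer_args or {}),
    )
```

Callers would then write something like `get_tokenizer("bert-base-uncased")` and no longer need to know which tokenizer class belongs to which model.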