Simplify language_modeling.py and tokenization.py #2704

masci · 2022-06-22T13:16:33Z

Problem
The language_model.py module currently contains wrappers for HF models, one for each type of model supported, plus some additional utilities. The same happens in tokenization.py, but for tokenizers.

Most of these models are outright copies of each other with minimal variable name changes, and all of them could be reduced to a couple of classes. There are also long if-else chains that add almost no value and slow down the codebase. For tokenizers the simplification can be even more drastic: AutoTokenizer can replace all the tokenizers we currently use, making a good share of the code in tokenization.py redundant.

In addition, many methods in these two modules take **kwargs needlessly, which makes function calls and parameter passing opaque and hard to understand. I believe most of these **kwargs can be safely removed or collected into explicit dictionary parameters.
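
To illustrate the **kwargs point, here is a minimal sketch. The function names and parameters (load_opaque, load_explicit, revision, tokenizer_args) are hypothetical and are not taken from the actual modules; they only show the difference between swallowing arbitrary keywords and naming the supported options.

```python
from typing import Any, Dict, Optional


def load_opaque(name: str, **kwargs: Any) -> Dict[str, Any]:
    # Hypothetical "before": any keyword is accepted, so a misspelled option
    # passes silently and the signature says nothing about what is supported.
    return {"name": name, "options": kwargs}


def load_explicit(
    name: str,
    revision: Optional[str] = None,
    tokenizer_args: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # Hypothetical "after": supported options are named, and anything meant
    # for a downstream call travels in one explicit dictionary parameter.
    return {
        "name": name,
        "revision": revision,
        "tokenizer_args": dict(tokenizer_args or {}),
    }
```

With the explicit signature, a typo such as `revison="main"` raises a TypeError at the call site, while the **kwargs version accepts it without complaint.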

Solution

Simplify language_model.py to drastically reduce code duplication and remove the if-else switches
Simplify tokenization.py to use only AutoTokenizer
Remove usage of **kwargs across these two modules
Implement standalone factory methods for LanguageModel subclasses
Replace the Tokenizer class with a standalone factory method
Adapt the codebase to stop passing **kwargs mindlessly, use the new factory methods, and fix the tests.
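
As a rough sketch of what a standalone factory could look like, assuming a simple lookup table in place of the if-else chain: the class names, the build_language_model function, and the model type strings below are all hypothetical stand-ins, not the code from the linked pull request.

```python
from typing import Dict, Type


class LanguageModel:
    """Stand-in base class; the real one wraps an HF model."""

    def __init__(self, name: str) -> None:
        self.name = name


class BertLikeModel(LanguageModel):  # hypothetical subclass
    pass


class DistilBertLikeModel(LanguageModel):  # hypothetical subclass
    pass


# One lookup table takes the place of a long if-else chain on the model type.
_MODEL_CLASSES: Dict[str, Type[LanguageModel]] = {
    "bert": BertLikeModel,
    "distilbert": DistilBertLikeModel,
}


def build_language_model(name: str, model_type: str) -> LanguageModel:
    """Standalone factory with explicit parameters and no **kwargs."""
    try:
        model_class = _MODEL_CLASSES[model_type]
    except KeyError:
        raise ValueError(f"Unsupported model type: {model_type!r}") from None
    return model_class(name)
```

Supporting a new model type then means adding one entry to the table rather than another branch.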


masci linked a pull request Jun 22, 2022 that will close this issue

Simplify language_modeling.py and tokenization.py #2703

Merged

2 tasks

masci assigned ZanSara Jun 22, 2022

ZanSara mentioned this issue Jun 22, 2022

Simplify language_modeling.py and tokenization.py #2703

Merged

2 tasks

ZanSara added the topic:modeling and type:refactor labels Jun 23, 2022

ZanSara closed this as completed in #2703 Jul 22, 2022

This was referenced Sep 13, 2022

ONNX FARMReader model conversion is broken #3210

Closed

fix: ONNX FARMReader model conversion is broken #3211

Merged
