Improve tokenization API #1711
Comments
Isn't it already the case with #1022?
Right, #1022 introduces a consistent function interface for tokenizers. However, modelling tokenizers purely as functions is insufficient or disadvantageous in some situations. Take the integration of NER data sets in Flair as an example. In general, these data sets are downloaded and prepared upon first instantiation, which produces (fixed) CoNLL files. Any further instantiation just loads the prepared CoNLL files. If the data set isn't pre-tokenized, tokenization has to happen during the preparation steps. However, once this tokenization is persisted, the user can no longer change the desired tokenization after the first instantiation. But good point anyway. Maybe our view of how to solve this issue is too restricted. @alanakbik: Do you have any further ideas, arguments, or plans regarding why tokenizers should be changed to classes?
One advantage is that classes can be passed variables when they are initialized, which is more convenient if, for instance, a tokenizer requires parameterization (otherwise, you would have to pass the same parameters with every call to the tokenize method).
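To make the parameterization point concrete, here is a minimal sketch contrasting the two styles. The names `regex_tokenize` and `RegexTokenizer` and their parameters are hypothetical and only for illustration; they are not part of flair.

```python
import re
from typing import List


def regex_tokenize(text: str, pattern: str, lowercase: bool) -> List[str]:
    """Function style (hypothetical): every call repeats the same parameters."""
    tokens = re.findall(pattern, text)
    return [t.lower() for t in tokens] if lowercase else tokens


class RegexTokenizer:
    """Class style (hypothetical): parameters are fixed once at construction."""

    def __init__(self, pattern: str = r"\w+|[^\w\s]", lowercase: bool = False):
        self.pattern = re.compile(pattern)
        self.lowercase = lowercase

    def tokenize(self, text: str) -> List[str]:
        tokens = self.pattern.findall(text)
        return [t.lower() for t in tokens] if self.lowercase else tokens


# Configure once, then reuse without repeating the arguments.
tokenizer = RegexTokenizer(lowercase=True)
tokens = tokenizer.tokenize("Hello, world!")
```

With the function, the caller has to carry `pattern` and `lowercase` to every call site; with the class, the configured object can simply be handed around.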
Currently, flair supports tokenization only in a rather simple, non-standardised form, using different functions / callables. In the course of #1513, the need for a more sophisticated implementation emerged.
The goal of this issue is to provide a general tokenization programming interface based on an abstract superclass. The new implementation should enable users to define and use custom tokenizers easily.
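As a rough sketch of what such an interface could look like (an assumption for discussion, not the implementation this issue settled on): the class names `Tokenizer`, `WhitespaceTokenizer`, and `DelimiterTokenizer`, and the helper `prepare_sentences`, are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class Tokenizer(ABC):
    """Hypothetical abstract superclass: subclasses only implement tokenize()."""

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        """Split the text into a list of token strings."""
        raise NotImplementedError

    @property
    def name(self) -> str:
        # Handy for logging or caching which tokenizer prepared a data set.
        return self.__class__.__name__


class WhitespaceTokenizer(Tokenizer):
    """Splits on runs of whitespace."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


class DelimiterTokenizer(Tokenizer):
    """A custom tokenizer with a parameter fixed at construction time."""

    def __init__(self, delimiter: str):
        self.delimiter = delimiter

    def tokenize(self, text: str) -> List[str]:
        return [t for t in text.split(self.delimiter) if t]


def prepare_sentences(lines: List[str], tokenizer: Tokenizer) -> List[List[str]]:
    """Hypothetical data set preparation step that accepts any tokenizer."""
    return [tokenizer.tokenize(line) for line in lines]


sentences = prepare_sentences(["a b  c", "d e"], WhitespaceTokenizer())
```

Because the preparation step depends only on the abstract type, a user could swap in a different tokenizer without touching the data set code.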