Improve tokenization API #1711

Closed
mariosaenger opened this issue Jun 23, 2020 · 3 comments

@mariosaenger
Collaborator

Currently, flair supports tokenization only in a rather simple, non-standardised form, using different functions / callables. In the course of #1513, the need for a more sophisticated implementation emerged.

The goal of this issue is to provide a general tokenization programming interface based on an abstract superclass. The new implementation should enable users to define and use custom tokenizers easily.
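
For illustration, here is a minimal sketch of what such an abstract superclass could look like. The names `Tokenizer`, `tokenize`, `name`, and `WhitespaceTokenizer` are assumptions made for this example only; they are not an existing or agreed-upon flair API, and a real implementation would likely return flair token objects rather than plain strings.

```python
from abc import ABC, abstractmethod
from typing import List


class Tokenizer(ABC):
    """Hypothetical base class: a tokenizer turns a string into a list of token strings."""

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        raise NotImplementedError

    @property
    def name(self) -> str:
        # Default identifier derived from the class name (an assumption, e.g. for logging).
        return self.__class__.__name__


class WhitespaceTokenizer(Tokenizer):
    """Hypothetical custom tokenizer: splits on runs of whitespace."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


tokens = WhitespaceTokenizer().tokenize("I love Berlin .")
print(tokens)  # ['I', 'love', 'Berlin', '.']
```

With this shape, a custom tokenizer only has to subclass the base class and implement `tokenize`; instantiating the abstract base class directly raises a `TypeError`.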

mariosaenger self-assigned this Jun 23, 2020
@mauryaland
Contributor

Isn't that already the case with #1022?

@mariosaenger
Collaborator Author

Right, #1022 introduces a consistent function interface for tokenizers. However, modelling tokenizers purely as functions is insufficient or disadvantageous in some situations. Take the integration of NER data sets in Flair as an example. In general, these data sets are integrated by downloading and preparing them upon first instantiation, which produces fixed CoNLL files. Any further instantiation just loads the prepared CoNLL files. If the data set isn't pre-tokenized, tokenization has to be done during the preparation step. However, because this tokenization is persisted, the user cannot change the desired tokenization after the first instantiation.

But good point anyway. Maybe our view of how to solve this issue is too narrow.

@alanakbik: Do you have any further ideas, arguments, or plans regarding why tokenizers should be changed to classes?

@alanakbik
Collaborator

One advantage is that classes can be passed parameters at initialization, which is more convenient if, for instance, a tokenizer requires parameterization (otherwise, you would always have to pass the same parameters with every tokenize method call).
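
A small sketch of that point, using hypothetical names (`tokenize_fn`, `DelimiterTokenizer`, `prepare`) that are not part of flair: with a plain function, the parameters travel with every call, whereas a class fixes them once at construction, so downstream code only needs to call `tokenize(text)`.

```python
import re
from typing import List


# Function style: the caller must supply the same parameters on every call.
def tokenize_fn(text: str, delimiters: str, lowercase: bool) -> List[str]:
    if lowercase:
        text = text.lower()
    return [t for t in re.split(f"[{re.escape(delimiters)}]+", text) if t]


# Class style: parameters are set once in __init__ and reused by every call.
class DelimiterTokenizer:
    def __init__(self, delimiters: str = " ", lowercase: bool = False):
        self.lowercase = lowercase
        # Character class matching one or more of the configured delimiters.
        self._pattern = re.compile(f"[{re.escape(delimiters)}]+")

    def tokenize(self, text: str) -> List[str]:
        if self.lowercase:
            text = text.lower()
        return [t for t in self._pattern.split(text) if t]


def prepare(sentences: List[str], tokenizer: DelimiterTokenizer) -> List[List[str]]:
    # Downstream code does not need to know how the tokenizer was configured.
    return [tokenizer.tokenize(s) for s in sentences]


tokenizer = DelimiterTokenizer(delimiters=" -", lowercase=True)
print(prepare(["State-of-the-art NER", "New-York City"], tokenizer))
# [['state', 'of', 'the', 'art', 'ner'], ['new', 'york', 'city']]
```

Code such as `prepare` can then accept any configured tokenizer object without threading its parameters through every intermediate call.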
