
Make tokenisation modular (easy integration of 3rd party lib) #1022

Merged
15 commits merged into flairNLP:master on Sep 10, 2019

Conversation


@pommedeterresautee (Contributor) commented Aug 22, 2019

Replace the use_tokenizer parameter in the Sentence class with a tokenizer parameter to which a custom tokenizer can be provided. The tokenizer function signature is simple: take a string and return a list of Token objects.

The idea is to let users decide what they want (specialized tokenization, etc.) while still providing some basic options. A space tokenizer (the whitespace split formerly used when use_tokenizer was set to False) and segtok are provided as basic options (see the sketch below). More may come in the future without requiring further modifications.
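For illustration, a minimal sketch of the intended usage (the comma_tokenizer here is hypothetical; Token(text, start_position=...) follows the constructor used elsewhere in this PR):

```python
from typing import List

from flair.data import Sentence, Token

def comma_tokenizer(text: str) -> List[Token]:
    # hypothetical custom tokenizer: split on commas, tracking character offsets
    tokens: List[Token] = []
    offset = 0
    for piece in text.split(","):
        word = piece.strip()
        if word:
            tokens.append(Token(word, start_position=text.index(word, offset)))
        offset += len(piece) + 1  # +1 for the consumed comma
    return tokens

# pass any Callable[[str], List[Token]] to the new tokenizer parameter
sentence = Sentence("one, two, three", tokenizer=comma_tokenizer)
```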

I hope the API is clear. The only drawback I see is that several functions have their signatures modified. Hopefully, the new API is general enough not to require another tokenisation-related change in the future; the library is still at version 0.x, so it seems acceptable to make the modification now. Moreover, I don't think a modular approach is possible with the current signature.

Related to #640, #876, and #563

FWIW, the approach is inspired by https://github.com/dselivanov/text2vec, the main NLP library in the R world. I have used it a lot for quite some time and have never found myself limited by this approach.

@pommedeterresautee changed the title from "Make possible to use third party tokenizers" to "Make tokenization modular (easy integration of 3rd party lib)" on Aug 22, 2019
@pommedeterresautee changed the title from "Make tokenization modular (easy integration of 3rd party lib)" to "Make tokenisation modular (easy integration of 3rd party lib)" on Aug 22, 2019

@bluesheeptoken left a comment


I like this pull request.

Just a few comments about its style.

Also, I have noticed there are some duplicated lines; would it be worth refactoring them? They might lead to difficulties in maintenance.

flair/data.py (outdated):
    index = -1
    for index, char in enumerate(text):
        if char == " ":
            if len(word) > 0:

@bluesheeptoken

if word instead?

@pommedeterresautee (Contributor, Author)


Is it more readable? It is shorter, but it requires knowing the Python convention. I have not found any other occurrence of this pattern, but I may have missed something.

            word += char
    # increment for last token in sentence if not followed by whitespace
    index += 1
    if len(word) > 0:

@bluesheeptoken

if word

(Four more review comments on flair/data.py, since resolved.)
@pommedeterresautee (Contributor, Author)

@bluesheeptoken Thanks a lot for this review.

Which lines should be deduplicated / refactored?
For the tokenizer factories, some lines are duplicated on purpose: it keeps each factory independent and easy to read, and easy to reproduce for tokenizers not yet supported, without having to understand anything but the function signature. If the duplication is inside the Sentence class, let me know where it is.

@bluesheeptoken

@pommedeterresautee Thanks for replying! I was speaking about this; if it is done on purpose, that is fine with me :)

@pommedeterresautee (Contributor, Author)

@bluesheeptoken I switched to Go a few months ago; over there, readability is above everything else :-)
"A little copying is better than a little dependency."

@alanakbik (Collaborator)

@pommedeterresautee this looks great, thanks! Will do some testing and merge soon!

@bluesheeptoken

@pommedeterresautee, for the check of empty lists, the "pythonic" way is if l:. It is preferred because it is more readable (debatable, I guess, though if word is not misleading here, since we expect a string) and slightly faster (it avoids a len() call).

Anyway, this is not really important here, since I have seen more occurrences of if len(l) == 0: than of if l:.

Thanks for the pull request! :)
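For illustration, a minimal sketch of the two equivalent checks being discussed:

```python
word = "green"

if len(word) > 0:  # explicit length check, as in the code under review
    print("non-empty")

if word:  # idiomatic truthiness check: "" (and []) are falsy
    print("non-empty")
```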

@@ -83,6 +84,18 @@ This should print:
Sentence: "The grass is green ." - 5 Tokens
```

You can write and provide your own wrapper around the tokenizer you want to use.
@alanakbik (Collaborator)


After the PR is merged, this text will be on the web page, but most users install flair through pip, i.e. they will not have access to this feature. Perhaps point out in the text that this is a feature only available on master branch currently.
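For example, such a wrapper might look like the following sketch (it uses NLTK's word_tokenize as a stand-in third-party tokenizer; NLTK is not a flair dependency and this wrapper is not part of the PR, and the tokenizer parameter shown is the one this PR introduces, renamed to use_tokenizer later in the discussion):

```python
from typing import List

from nltk.tokenize import word_tokenize  # third-party tokenizer, assumed installed

from flair.data import Sentence, Token

def nltk_tokenizer(text: str) -> List[Token]:
    # map the third-party tokenizer's string tokens back to character offsets
    tokens: List[Token] = []
    cursor = 0
    for word in word_tokenize(text):
        start = text.find(word, cursor)
        if start < 0:
            start = cursor  # tokenizer altered the surface form; fall back
        tokens.append(Token(word, start_position=start))
        cursor = start + len(word)
    return tokens

sentence = Sentence("The grass is green.", tokenizer=nltk_tokenizer)
```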

flair/data.py (outdated):

     """

     def __init__(
         self,
         text: str = None,
-        use_tokenizer: bool = False,
+        tokenizer: Callable[[str], List[Token]] = space_tokenizer,
@alanakbik (Collaborator)


This breaks backwards compatibility for downstream code written against earlier Flair versions. Perhaps the signature could be changed to:

    def __init__(
        self,
        text: str = None,
        use_tokenizer: Union[bool, Callable[[str], List[Token]]] = space_tokenizer,
        labels: Union[List[Label], List[str]] = None,
        language_code: str = None,
    ):

i.e. call it use_tokenizer instead of tokenizer and allow passing a bool in addition to a callable. Then, a few lines down, one could add:

        tokenizer = use_tokenizer
        if type(use_tokenizer) == bool:
            tokenizer = segtok_tokenizer if use_tokenizer else space_tokenizer

i.e. by default, if a callable is passed, tokenizer = use_tokenizer; but if a bool is passed, the tokenizer is instead initialized with the earlier behavior, i.e. tokenizer = segtok_tokenizer if use_tokenizer is True.
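Under that suggestion, all three call styles would work (a sketch, not part of the diff; my_custom_tokenizer stands for any user-defined function, e.g. the wrapper sketched above):

```python
# default: plain whitespace splitting via space_tokenizer
s1 = Sentence("The grass is green .")

# earlier boolean behavior preserved: True selects segtok_tokenizer
s2 = Sentence("The grass is green .", use_tokenizer=True)

# opt-in complexity: any Callable[[str], List[Token]] works too
s3 = Sentence("The grass is green .", use_tokenizer=my_custom_tokenizer)
```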

-        if len(word) > 0:
-            token = Token(word, start_position=index - len(word))
-            self.add_token(token)
+        [self.add_token(token) for token in tokenizer(text)]
@alanakbik (Collaborator)


This is great :)

@alanakbik (Collaborator)

@pommedeterresautee thanks again for this great PR! I've put some annotations inline - I wonder if it can be adapted to preserve backwards compatibility, so that the original tokenization instructions in the online tutorial remain valid and using a different tokenizer becomes a case of "opt-in complexity". Otherwise this would cause the master branch (and the visible documentation online) to diverge from the last Flair release.

@pommedeterresautee (Contributor, Author)

You are right about backward compatibility; it would be disruptive for many users to break the API.
However, the API would then have a strange signature that is difficult to clean up later. What about adding back use_tokenizer as a bool only, and if someone sets it to True, segtok is used plus a warning indicating it is deprecated, so that in the future it can be safely removed?

@alanakbik (Collaborator)

Do you mean adding back use_tokenizer as an additional variable and having the two side by side? Yes, that would also be an option.

For now I think a deprecation warning may not be necessary since I think many users are ok with the default tokenization, so simply having a boolean use_tokenizer option might make it easier for many users to get started (this is also one import statement less). So this is something we might want to keep, while also offering those users who want to switch out tokenizers a good way to do so. Either having two separate variables use_tokenizer and tokenizer, or one use_tokenizer variable that is type-overloaded would achieve this, so either way is good for me. What do you think?

@pommedeterresautee (Contributor, Author)

Thanks for your ideas.
I get your point; I will stick to your original proposition of having use_tokenizer be both a bool and a function.

@alanakbik (Collaborator)

👍

@adizdari

👍

@alanakbik alanakbik merged commit 365223e into flairNLP:master Sep 10, 2019
@kashif (Contributor) commented Sep 10, 2019

👍

@alanakbik (Collaborator)

Thanks a lot @pommedeterresautee for this PR!!
