
Make tokenisation modular (easy integration of 3rd party lib) #1022

Merged
15 commits merged into flairNLP:master on Sep 10, 2019

Conversation


@pommedeterresautee (Contributor) commented Aug 22, 2019

Replace the use_tokenizer parameter in the Sentence class with a tokenizer parameter to which a custom tokenizer can be provided. The tokenizer function signature is simple: take a string and return a list of Token objects.

The idea is to let users decide what they want (specialized tokenization, etc.) while still providing some basic options. A space tokenizer (the whitespace split formerly used when use_tokenizer was set to False) and segtok are provided as basic options (see the sketch below). More may come in the future without requiring further modifications.
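For illustration, a minimal sketch of the intended usage (the comma_tokenizer here is hypothetical; Token(text, start_position=...) follows the constructor used elsewhere in this PR):

```python
from typing import List

from flair.data import Sentence, Token

def comma_tokenizer(text: str) -> List[Token]:
    # hypothetical custom tokenizer: split on commas, tracking character offsets
    tokens: List[Token] = []
    offset = 0
    for piece in text.split(","):
        word = piece.strip()
        if word:
            tokens.append(Token(word, start_position=text.index(word, offset)))
        offset += len(piece) + 1  # +1 for the consumed comma
    return tokens

# pass any Callable[[str], List[Token]] to the new tokenizer parameter
sentence = Sentence("one, two, three", tokenizer=comma_tokenizer)
```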

I hope the API is clear. The only drawback I see is that several functions have their signatures modified. Hopefully, the new API is general enough not to require another tokenisation-related change in the future; the library is still at version 0.x, so it seems acceptable to make the modification now. Moreover, I don't think a modular approach is possible with the current signature.

Related to #640, #876, and #563

FWIW, the approach is inspired by https://github.com/dselivanov/text2vec, the main NLP library in the R world. I have used it a lot for quite some time and have never found myself limited by this approach.

@pommedeterresautee changed the title from "Make possible to use third party tokenizers" to "Make tokenization modular (easy integration of 3rd party lib)" on Aug 22, 2019
@pommedeterresautee changed the title from "Make tokenization modular (easy integration of 3rd party lib)" to "Make tokenisation modular (easy integration of 3rd party lib)" on Aug 22, 2019

@bluesheeptoken left a comment


I like this pull request.

Just a few comments about its style.

Also, I have noticed there are some duplicated lines; would it be worth refactoring them? They might lead to difficulties in maintenance.

flair/data.py (outdated):
    index = -1
    for index, char in enumerate(text):
        if char == " ":
            if len(word) > 0:

@bluesheeptoken

if word instead?

@pommedeterresautee (Contributor, Author)


Is it more readable? It is shorter, but it requires knowing the Python convention. I have not found any other occurrence of this pattern, but I may have missed something.

            word += char
    # increment for last token in sentence if not followed by whitespace
    index += 1
    if len(word) > 0:

@bluesheeptoken

if word

(Four more review comments on flair/data.py, since resolved.)
@pommedeterresautee (Contributor, Author)

@bluesheeptoken Thanks a lot for this review.

Which lines should be deduplicated / refactored?
For the tokenizer factories, some lines are duplicated on purpose: it keeps each factory independent and easy to read, and easy to reproduce for tokenizers not yet supported, without having to understand anything but the function signature. If the duplication is inside the Sentence class, let me know where it is.

@bluesheeptoken

@pommedeterresautee Thanks for replying! I was speaking about this; if it is done on purpose, that is fine with me :)

@pommedeterresautee (Contributor, Author)

@bluesheeptoken I switched to Go a few months ago; over there, readability is above everything else :-)
"A little copying is better than a little dependency."

@alanakbik (Collaborator)

@pommedeterresautee this looks great, thanks! Will do some testing and merge soon!

@bluesheeptoken

@pommedeterresautee, for the check of empty lists, the "pythonic" way is if l:. It is preferred because it is more readable (debatable, I guess, though if word is not misleading here, since we expect a string) and slightly faster (it avoids a len() call).

Anyway, this is not really important here, since I have seen more occurrences of if len(l) == 0: than of if l:.

Thanks for the pull request! :)
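For illustration, a minimal sketch of the two equivalent checks being discussed:

```python
word = "green"

if len(word) > 0:  # explicit length check, as in the code under review
    print("non-empty")

if word:  # idiomatic truthiness check: "" (and []) are falsy
    print("non-empty")
```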

@@ -83,6 +84,18 @@ This should print:
Sentence: "The grass is green ." - 5 Tokens
```

You can write and provide your own wrapper around the tokenizer you want to use.
@alanakbik (Collaborator)


After the PR is merged, this text will be on the web page, but most users install flair through pip, i.e. they will not have access to this feature. Perhaps point out in the text that this is a feature only available on master branch currently.
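For example, such a wrapper might look like the following sketch (it uses NLTK's word_tokenize as a stand-in third-party tokenizer; NLTK is not a flair dependency and this wrapper is not part of the PR, and the tokenizer parameter shown is the one this PR introduces, renamed to use_tokenizer later in the discussion):

```python
from typing import List

from nltk.tokenize import word_tokenize  # third-party tokenizer, assumed installed

from flair.data import Sentence, Token

def nltk_tokenizer(text: str) -> List[Token]:
    # map the third-party tokenizer's string tokens back to character offsets
    tokens: List[Token] = []
    cursor = 0
    for word in word_tokenize(text):
        start = text.find(word, cursor)
        if start < 0:
            start = cursor  # tokenizer altered the surface form; fall back
        tokens.append(Token(word, start_position=start))
        cursor = start + len(word)
    return tokens

sentence = Sentence("The grass is green.", tokenizer=nltk_tokenizer)
```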

flair/data.py (outdated):

     """

     def __init__(
         self,
         text: str = None,
-        use_tokenizer: bool = False,
+        tokenizer: Callable[[str], List[Token]] = space_tokenizer,
@alanakbik (Collaborator)


This breaks backwards compatibility for downstream code written against earlier Flair versions. Perhaps the signature could be changed to:

    def __init__(
        self,
        text: str = None,
        use_tokenizer: Union[bool, Callable[[str], List[Token]]] = space_tokenizer,
        labels: Union[List[Label], List[str]] = None,
        language_code: str = None,
    ):

i.e. call it use_tokenizer instead of tokenizer and allow passing a bool in addition to a callable. Then, a few lines down, one could add:

        tokenizer = use_tokenizer
        if type(use_tokenizer) == bool:
            tokenizer = segtok_tokenizer if use_tokenizer else space_tokenizer

i.e. by default, if a callable is passed, tokenizer = use_tokenizer; but if a bool is passed, the tokenizer is instead initialized with the earlier behavior, i.e. tokenizer = segtok_tokenizer if use_tokenizer is True.
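Under that suggestion, all three call styles would work (a sketch, not part of the diff; my_custom_tokenizer stands for any user-defined function, e.g. the wrapper sketched above):

```python
# default: plain whitespace splitting via space_tokenizer
s1 = Sentence("The grass is green .")

# earlier boolean behavior preserved: True selects segtok_tokenizer
s2 = Sentence("The grass is green .", use_tokenizer=True)

# opt-in complexity: any Callable[[str], List[Token]] works too
s3 = Sentence("The grass is green .", use_tokenizer=my_custom_tokenizer)
```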

-        if len(word) > 0:
-            token = Token(word, start_position=index - len(word))
-            self.add_token(token)
+        [self.add_token(token) for token in tokenizer(text)]
@alanakbik (Collaborator)


This is great :)

@alanakbik (Collaborator)

@pommedeterresautee thanks again for this great PR! I've put some annotations inline - I wonder if it can be adapted to preserve backwards compatibility, so that the original tokenization instructions in the online tutorial remain valid and using a different tokenizer becomes a case of "opt-in complexity". Otherwise this would cause the master branch (and the visible documentation online) to diverge from the last Flair release.

@pommedeterresautee (Contributor, Author)

You are right about backward compatibility; it would be disruptive for many users to break the API.
However, the API would then have a strange signature that is difficult to clean up later. What about adding back use_tokenizer as a bool only, and if someone sets it to True, segtok is used plus a warning indicating it is deprecated, so that in the future it can be safely removed?

@alanakbik (Collaborator)

Do you mean adding back use_tokenizer as an additional variable and having the two side by side? Yes, that would also be an option.

For now I think a deprecation warning may not be necessary since I think many users are ok with the default tokenization, so simply having a boolean use_tokenizer option might make it easier for many users to get started (this is also one import statement less). So this is something we might want to keep, while also offering those users who want to switch out tokenizers a good way to do so. Either having two separate variables use_tokenizer and tokenizer, or one use_tokenizer variable that is type-overloaded would achieve this, so either way is good for me. What do you think?

@pommedeterresautee (Contributor, Author)

Thanks for your ideas.
I get your point; I will stick to your original proposition of having use_tokenizer be both a bool and a function.

@alanakbik (Collaborator)

👍

@adizdari

👍

@alanakbik alanakbik merged commit 365223e into flairNLP:master Sep 10, 2019
@kashif (Contributor) commented Sep 10, 2019

👍

@alanakbik (Collaborator)

Thanks a lot @pommedeterresautee for this PR!!
