Make tokenisation modular (easy integration of 3rd party lib) #1022
Conversation
add some documentation on Sentence class
I like this Pull Request. Just a few comments about style.

Also, I have noticed there are some duplicated lines; would it be worth refactoring them? Duplication might lead to difficulties when maintaining the code.
```python
index = -1
for index, char in enumerate(text):
    if char == " ":
        if len(word) > 0:
```
`if word` instead?
Is it more readable? It is shorter, but it requires knowing a Python convention. I have not found any other occurrence of this pattern in the codebase, but I may have missed something.
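For reference, a minimal standalone sketch of the convention under discussion (not part of the diff): an empty string is falsy in Python, so `if word:` and `if len(word) > 0:` are equivalent here.

```python
word = ""
if not word:  # empty strings are falsy
    print("empty")

word = "grass"
if word:  # equivalent to len(word) > 0 for a string
    print("non-empty")
```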
```python
word += char
# increment for last token in sentence if not followed by whitespace
index += 1
if len(word) > 0:
```
`if word` here too?
@bluesheeptoken Thanks a lot for this review. Which lines should be deduplicated / refactored?
@pommedeterresautee Thanks for replying! I was talking about these lines; if the duplication is done on purpose, that is fine by me :)
@bluesheeptoken I switched to Go a few months ago; over there, readability is above everything else :-)
@pommedeterresautee this looks great, thanks! Will do some testing and merge soon!
@pommedeterresautee, for the check of empty sequences, the "pythonic way" is a truthiness check (`if word:`). Anyway, this is not really important here I guess, since I have seen more occurrences of the explicit length check in the codebase. Thanks for the Pull Request! :)
```diff
@@ -83,6 +84,18 @@ This should print:
 Sentence: "The grass is green ." - 5 Tokens
+You can write and provide your own wrapper around the tokenizer you want to use.
```
After the PR is merged, this text will be on the web page, but most users install flair through pip, i.e. they will not have access to this feature. Perhaps point out in the text that this feature is currently only available on the master branch.
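For illustration, a minimal sketch of such a wrapper, matching the `Callable[[str], List[Token]]` contract introduced in this PR; the `comma_tokenizer` name is hypothetical, and the `Token` constructor usage mirrors the diff shown below:

```python
from typing import List

from flair.data import Token


def comma_tokenizer(text: str) -> List[Token]:
    # Hypothetical custom tokenizer: split on commas and record each
    # token's start position, as the new tokenizer contract expects.
    tokens = []
    offset = 0
    for part in text.split(","):
        stripped = part.strip()
        if stripped:
            tokens.append(Token(stripped, start_position=text.index(stripped, offset)))
        offset += len(part) + 1  # move past this chunk and the comma
    return tokens
```

A sentence could then be built with `Sentence("one, two, three", tokenizer=comma_tokenizer)` under the new API.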
flair/data.py
""" | ||
|
||
def __init__( | ||
self, | ||
text: str = None, | ||
use_tokenizer: bool = False, | ||
tokenizer: Callable[[str], List[Token]] = space_tokenizer, |
This breaks backwards compatibility for downstream code that uses Flair with earlier versions. Perhaps the signature could be changed to:
```python
def __init__(
    self,
    text: str = None,
    use_tokenizer: Union[bool, Callable[[str], List[Token]]] = space_tokenizer,
    labels: Union[List[Label], List[str]] = None,
    language_code: str = None,
):
```
i.e. call it `use_tokenizer` instead of `tokenizer` and allow passing a bool in addition to a callable. Then, a few lines down, one could add:

```python
tokenizer = use_tokenizer
if type(use_tokenizer) == bool:
    tokenizer = segtok_tokenizer if use_tokenizer else space_tokenizer
```

i.e. by default, if a callable is passed, `tokenizer = use_tokenizer`, but if a bool is passed the tokenizer is instead initialized with the earlier behavior, i.e. `tokenizer = segtok_tokenizer` if `use_tokenizer=True`.
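Under this proposal the three call styles would look as follows (a sketch assuming the suggested signature above; `my_tokenizer` stands for a hypothetical `Callable[[str], List[Token]]`):

```python
# Existing downstream code keeps its meaning: a bool selects the old behavior.
s1 = Sentence("The grass is green .")                     # space_tokenizer
s2 = Sentence("The grass is green.", use_tokenizer=True)  # segtok_tokenizer

# Opt-in complexity: pass any conforming callable as the tokenizer.
s3 = Sentence("The grass is green.", use_tokenizer=my_tokenizer)
```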
```diff
-        if len(word) > 0:
-            token = Token(word, start_position=index - len(word))
-            self.add_token(token)
+        [self.add_token(token) for token in tokenizer(text)]
```
This is great :)
@pommedeterresautee thanks again for this great PR! I've put some annotations inline. I wonder if it can be adapted to preserve backwards compatibility so that the original tokenization instructions in the online tutorial remain valid and using a different tokenizer becomes a case of "opt-in complexity". Otherwise this would cause the master branch (and the visible documentation online) to diverge from the last Flair release.
You are right about the backwards compatibility; it would be disturbing for many users to break the API.
Do you mean adding back `use_tokenizer`? For now I think a deprecation warning may not be necessary, since I think many users are ok with the default tokenization, so simply having a boolean `use_tokenizer` should be enough.
Thanks for your ideas.
👍

👍

👍
Thanks a lot @pommedeterresautee for this PR!!
This PR replaces the `use_tokenizer` parameter in the `Sentence` class by a `tokenizer` parameter where a custom tokenizer can be provided. The `tokenizer` function signature is simple: take a string and return a list of `Token`. The idea is to let users decide what they want (specialized tokenization, etc.) and still provide them some basic options.

The space tokenizer (the space split used when `use_tokenizer` is set to False) and `segtok` are provided as basic options. More may come in the future without having to make modifications.
I hope the API is clear. The only drawback I see is that several functions have their signatures modified. Hopefully, the new API is general enough not to require another change regarding tokenisation in the future; it is still version 0.x, so it seems acceptable to perform the modification now. Moreover, I don't think a modular approach is possible with the current signature.
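For concreteness, a sketch of how the new parameter reads, assuming `space_tokenizer` and `segtok_tokenizer` are exposed from `flair.data` as in the diff above (`my_tokenizer` stands for any custom `Callable[[str], List[Token]]`):

```python
from flair.data import Sentence, segtok_tokenizer, space_tokenizer

# Basic options shipped with this PR.
s1 = Sentence("The grass is green.", tokenizer=space_tokenizer)
s2 = Sentence("The grass is green.", tokenizer=segtok_tokenizer)

# Any conforming callable can be plugged in, e.g. a wrapper around a
# third-party tokenizer.
s3 = Sentence("The grass is green.", tokenizer=my_tokenizer)
```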
Related to #640, #876, and #563.
FWIW, the approach is inspired by https://github.com/dselivanov/text2vec, which is the main NLP lib in the R world. I have used it a lot for quite some time and have never found myself limited by this approach.