Wordpiece tokenizer #111
Conversation
and WordpiecePreprocessTokenizer
Replaced tabs with spaces; updated tests - all passing now. WordpiecePreprocessTokenizer is configurable. Added config param 'tokenizeChineseChars'.
Small things. I'd like wordpiece to be more immutable, and WordpieceTokenizer needs to implement clone(). The rest is just tidying up and adding comments & copyright.
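Since Tribuo tokenizers carry per-document mutable state, the requested clone() generally needs to produce a copy that shares the immutable configuration but starts with fresh state. The class and field names below are illustrative, not Tribuo's actual WordpieceTokenizer members; this is only a sketch of the pattern.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Illustrative sketch (not Tribuo's actual class) of a Cloneable tokenizer
// whose clone() shares the immutable vocabulary but resets per-call state.
public class SketchTokenizer implements Cloneable {
    private final Set<String> vocab;                 // immutable, safe to share
    private Iterator<String> currentTokens;          // per-document mutable state

    public SketchTokenizer(Set<String> vocab) {
        this.vocab = Collections.unmodifiableSet(new HashSet<>(vocab));
    }

    @Override
    public SketchTokenizer clone() {
        try {
            SketchTokenizer copy = (SketchTokenizer) super.clone();
            copy.currentTokens = null; // the clone starts with no in-flight state
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError("Cloneable class failed to clone", e);
        }
    }
}
```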
Files with review comments (resolved):
- Util/Tokenization/src/main/java/org/tribuo/util/tokens/Token.java (outdated)
- Util/Tokenization/src/main/java/org/tribuo/util/tokens/impl/SplitCharactersTokenizer.java (three threads, outdated)
- Util/Tokenization/src/main/java/org/tribuo/util/tokens/impl/SplitFunctionTokenizer.java (outdated)
- Util/Tokenization/src/test/java/org/tribuo/util/tokens/impl/WhitespaceTokenizerTest.java
- ...Tokenization/src/test/java/org/tribuo/util/tokens/impl/WordpiecePreprocessTokenizerTest.java (outdated)
- Util/Tokenization/src/test/java/org/tribuo/util/tokens/impl/WordpieceTokenizerTest.java (two threads, one outdated)
- .../test/resources/org/tribuo/util/tokens/impl/test/regression-text_bert-base-uncased_fails.txt
- Additionally, fixed an issue with the neverSplit strings.
- Copyrights, javadocs, config params, etc.
There are a few typos, and the clone method on WordpieceTokenizer needs some more work.
Files with review comments (resolved):
- Util/Tokenization/src/main/java/org/tribuo/util/tokens/impl/SplitCharactersTokenizer.java (outdated)
- Util/Tokenization/src/main/java/org/tribuo/util/tokens/impl/wordpiece/WordpieceTokenizer.java (four threads, three outdated)
LGTM, thanks.
Description
An implementation of the Wordpiece tokenizer that is consistent with the Hugging Face implementation.
Motivation
This is useful for tokenizing text in Java so that we can produce word IDs consistent with BERT models built on Wordpiece vocabularies.
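For context on what consistency with the Hugging Face implementation entails, the core of Wordpiece is a greedy longest-match-first split of each word against the vocabulary, with continuation pieces marked by a prefix (usually "##") and unmatched words mapped to an unknown token. The sketch below shows only that core algorithm; the class and method names are illustrative and are not Tribuo's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A minimal sketch of the greedy longest-match-first Wordpiece algorithm,
// following the behaviour of the Hugging Face implementation.
public final class WordpieceSketch {
    private final Set<String> vocab;
    private final String unknownToken;       // e.g. "[UNK]"
    private final String continuationPrefix; // e.g. "##"

    public WordpieceSketch(Set<String> vocab, String unknownToken, String continuationPrefix) {
        this.vocab = vocab;
        this.unknownToken = unknownToken;
        this.continuationPrefix = continuationPrefix;
    }

    // Splits a single whitespace-free word into wordpieces.
    public List<String> wordpiece(String word) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Greedily find the longest vocabulary entry starting at 'start'.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = continuationPrefix + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No prefix matched: the whole word maps to the unknown token.
                return Collections.singletonList(unknownToken);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able", "[UNK]"));
        WordpieceSketch wp = new WordpieceSketch(vocab, "[UNK]", "##");
        System.out.println(wp.wordpiece("unaffable")); // [un, ##aff, ##able]
        System.out.println(wp.wordpiece("xyz"));       // [[UNK]]
    }
}
```

Running the tokenizer against the same vocabulary file as a Hugging Face BERT model is what makes the resulting token IDs line up between the Java and Python sides.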