Commit
* initial commit of SplitFunctionTokenizer
* added TokenType values to support Wordpiece and BPE style tokenization
* initial commit of Wordpiece, WordpieceTokenizer and WordpiecePreprocessTokenizer
* initial commit of bert-base-uncased along with regression test data
* removed WordpieceBuilder in favor of directly making Wordpiece configurable
  - replaced tabs with spaces
  - updated tests; all passing now
  - WordpiecePreprocessTokenizer is configurable
  - added config param 'tokenizeChineseChars'
* resolves issues raised in comments in the pull request
  - additionally, fixed an issue with the neverSplit strings
  - copyrights, javadocs, config params, etc.
* fixes misc. typos and a problem with the clone() method
Showing 16 changed files with 51,843 additions and 157 deletions.