Wordpiece tokenizer #111
Conversation
and WordpiecePreprocessTokenizer
Replaced tabs with spaces; updated tests - all passing now. WordpiecePreprocessTokenizer is configurable. Added config param 'tokenizeChineseChars'.
Small things. I'd like wordpiece to be more immutable, and WordpieceTokenizer needs to implement clone(). The rest is just tidying up and adding comments & copyright.
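Since Tribuo tokenizers carry per-document mutable state, the requested clone() generally needs to produce a copy that shares the immutable configuration but starts with fresh state. The class and field names below are illustrative, not Tribuo's actual WordpieceTokenizer members; this is only a sketch of the pattern.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Illustrative sketch (not Tribuo's actual class) of a Cloneable tokenizer
// whose clone() shares the immutable vocabulary but resets per-call state.
public class SketchTokenizer implements Cloneable {
    private final Set<String> vocab;                 // immutable, safe to share
    private Iterator<String> currentTokens;          // per-document mutable state

    public SketchTokenizer(Set<String> vocab) {
        this.vocab = Collections.unmodifiableSet(new HashSet<>(vocab));
    }

    @Override
    public SketchTokenizer clone() {
        try {
            SketchTokenizer copy = (SketchTokenizer) super.clone();
            copy.currentTokens = null; // the clone starts with no in-flight state
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError("Cloneable class failed to clone", e);
        }
    }
}
```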
Files with review comments (resolved):
- Util/Tokenization/src/main/java/org/tribuo/util/tokens/Token.java (outdated)
- Util/Tokenization/src/main/java/org/tribuo/util/tokens/impl/SplitCharactersTokenizer.java (three threads, outdated)
- Util/Tokenization/src/main/java/org/tribuo/util/tokens/impl/SplitFunctionTokenizer.java (outdated)
- Util/Tokenization/src/test/java/org/tribuo/util/tokens/impl/WhitespaceTokenizerTest.java
- ...Tokenization/src/test/java/org/tribuo/util/tokens/impl/WordpiecePreprocessTokenizerTest.java (outdated)
- Util/Tokenization/src/test/java/org/tribuo/util/tokens/impl/WordpieceTokenizerTest.java (two threads, one outdated)
- .../test/resources/org/tribuo/util/tokens/impl/test/regression-text_bert-base-uncased_fails.txt
- Additionally, fixed an issue with the neverSplit strings.
- Copyrights, javadocs, config params, etc.
There are a few typos, and the clone method on WordpieceTokenizer needs some more work.
Files with review comments (resolved):
- Util/Tokenization/src/main/java/org/tribuo/util/tokens/impl/SplitCharactersTokenizer.java (outdated)
- Util/Tokenization/src/main/java/org/tribuo/util/tokens/impl/wordpiece/WordpieceTokenizer.java (four threads, three outdated)
LGTM, thanks.
Description
An implementation of the Wordpiece tokenizer that is consistent with the Hugging Face implementation.
Motivation
This is useful for tokenizing text in Java so that we can produce word IDs consistent with BERT models built on Wordpiece vocabularies.
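For context on what consistency with the Hugging Face implementation entails, the core of Wordpiece is a greedy longest-match-first split of each word against the vocabulary, with continuation pieces marked by a prefix (usually "##") and unmatched words mapped to an unknown token. The sketch below shows only that core algorithm; the class and method names are illustrative and are not Tribuo's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A minimal sketch of the greedy longest-match-first Wordpiece algorithm,
// following the behaviour of the Hugging Face implementation.
public final class WordpieceSketch {
    private final Set<String> vocab;
    private final String unknownToken;       // e.g. "[UNK]"
    private final String continuationPrefix; // e.g. "##"

    public WordpieceSketch(Set<String> vocab, String unknownToken, String continuationPrefix) {
        this.vocab = vocab;
        this.unknownToken = unknownToken;
        this.continuationPrefix = continuationPrefix;
    }

    // Splits a single whitespace-free word into wordpieces.
    public List<String> wordpiece(String word) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Greedily find the longest vocabulary entry starting at 'start'.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = continuationPrefix + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No prefix matched: the whole word maps to the unknown token.
                return Collections.singletonList(unknownToken);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able", "[UNK]"));
        WordpieceSketch wp = new WordpieceSketch(vocab, "[UNK]", "##");
        System.out.println(wp.wordpiece("unaffable")); // [un, ##aff, ##able]
        System.out.println(wp.wordpiece("xyz"));       // [[UNK]]
    }
}
```

Running the tokenizer against the same vocabulary file as a Hugging Face BERT model is what makes the resulting token IDs line up between the Java and Python sides.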