
Implement Tokenizer op #31

Merged
merged 12 commits into master from dmitrism/implement_tokenizer
Dec 6, 2018

Conversation

Member

@yuslepukhin yuslepukhin commented Nov 27, 2018

Inputs:
X: tensor, string
Outputs:
Y: tensor, string
Attribute:
mark: bool. Whether to mark the beginning/end character with start of text character (0x02)/end of text character (0x03).

padvalue: string. The string used to pad output tensors when the number of tokens extracted from an input string is less than the maximum number of tokens found across all input strings.

separators: list of strings (type: AttributeProto::STRINGS). The list of separators; two consecutive segments in X connected by a separator are divided into two tokens. For example, if the input is "Hello World!" and this attribute contains only one space character, the corresponding output would be ["Hello", "World!"]. To achieve character-level tokenization, the user should set the separators to [""], which contains only one empty string. This attribute is a 1-D string tensor and each single string in this attribute is a separator. If 'separators' is an L-element array, there will be L rounds of tokenization, each using one separator. More specifically, in the first round, the first element in 'separators' is used to tokenize each string in the input. Then, the second element in 'separators' is used to tokenize the resulting strings produced in the first round.

mincharnum: int. Minimum number of characters allowed in the output. For example, if mincharnum is 2, tokens such as “A” and “B” would be ignored.
Description:
Tokenizer divides each string in X into a vector of strings along the last axis. Allowed input shapes are [C] and [N, C]. If the maximum number of tokens found per input string is D, the output shape is [N, C, D] when the input shape is [N, C]. Similarly, if the input shape is [C], the output shape is [C, D]. For example, if the input is ["Hello World", "I love computer science !"] with shape [2], then the output would be [["Hello", "World", padvalue, padvalue, padvalue], ["I", "love", "computer", "science", "!"]] with shape [2, 5], because at most 5 tokens are found per input string. Note that the input can have at most two axes, so 3-D and higher-dimensional inputs are not supported.
Let's consider another example where mark is set to true. If the input is ["Hello", "World"], then the corresponding output would be [[0x02, "Hello", 0x03], [0x02, "World", 0x03]]. This implies that if mark is true, a [C]/[N, C] input's output shape becomes [C, D+2]/[N, C, D+2]. All input strings, including attributes, are UTF-8 encoded.

@yuslepukhin
Member Author

yuslepukhin commented Nov 27, 2018

Need to clarify the behavior if the output is empty.
Also, the current implementation searches for all patterns at once, which may not produce the same result as applying the patterns sequentially. #Closed

@yuslepukhin yuslepukhin force-pushed the dmitrism/implement_tokenizer branch 2 times, most recently from 037d439 to 3c200ee Compare November 28, 2018 20:10
@yuslepukhin yuslepukhin requested a review from a team as a code owner November 28, 2018 20:10
@yuslepukhin yuslepukhin force-pushed the dmitrism/implement_tokenizer branch from be90da6 to 7330e35 Compare November 29, 2018 22:03
@yuslepukhin yuslepukhin force-pushed the dmitrism/implement_tokenizer branch from 7330e35 to 0a06521 Compare December 4, 2018 02:05

// Returns the number of bytes in the utf8 character
// by analyzing its leading byte
inline bool utf8_bytes(unsigned char ch, size_t& len) {
Contributor

@pranavsharma pranavsharma Dec 5, 2018


Can we add unit tests for the functions in this file? #WontFix

Member Author


It is tested indirectly but sufficiently.


In reply to: 238894458 [](ancestors = 238894458)

@@ -92,6 +92,37 @@ Sample echo operator.)DOC");
.TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput)
.SetDoc(R"DOC(Returns which elements of the input are NaN.)DOC");

ONNX_CONTRIB_OPERATOR_SCHEMA(Tokenizer)
Contributor

@pranavsharma pranavsharma Dec 5, 2018


Should we mention about UTF-8 in the spec just to be super clear? #Closed

Member Author


Yes, will do. The author of the spec claimed that all strings in ONNX are UTF-8. Well, they use it for Bing. So there you go.


In reply to: 238894817 [](ancestors = 238894817)


// Create TST and insert separators
if (!char_tokenezation_) {
std::unique_ptr<SearchData> sd(new SearchData);
Contributor

@pranavsharma pranavsharma Dec 5, 2018


std::make_unique #Closed

Member Author


make_unique is not present in all GCC versions.


In reply to: 238895544 [](ancestors = 238895544)

Contributor

@pranavsharma pranavsharma Dec 5, 2018


Our code makes use of make_unique extensively and relies on its existence. It's safe to assume it'll be there. I think we even have a function where we define our own make_unique when it's not present. #Closed

"Invalid utf8 chars in the input: " + s);
}

struct Match {
Contributor

@pranavsharma pranavsharma Dec 5, 2018


can probably take it outside the while loop. #Closed

Contributor

@pranavsharma pranavsharma left a comment


Maybe later we can consider parallelizing the operation across the N batches since they're all independent; that might be premature optimization at this point since we don't know the nature of the input. Worth adding a TODO to identify opportunities for optimization upfront.

}
}

TEST(ContribOpTest, TokenizerWithSeparators) {
Member

@xadupre xadupre Dec 5, 2018


Why one test and not multiple ones for each case? #Closed

test.Run(OpTester::ExpectResult::kExpectSuccess);
}
}
} // namespace test
Member

@xadupre xadupre Dec 5, 2018


I added the following unit test; the issue happens when a separator is a prefix of another one. I don't know if this case is expected, but if not we should raise an exception or warning in the constructor.

TEST(ContribOpTest, TokenizerWithSeparators__) {
// Separators and strings with a mix of latin, Spanish, Cyrillic and Chinese
// characters and with start/end text markers
// [C] dimensions
// Output [C][D]
{
std::vector<std::string> separators = {
    u8";",
    u8";;;"};

OpTester test("Tokenizer", opset_ver, domain);
InitTestAttr(test, true, separators, 1);

std::vector<int64_t> dims{4};
std::vector<std::string> input{u8"a;b", u8"a;;;b", u8"b;c;;;d;e", u8"a;;b;;;c"};
test.AddInput<std::string>("T", dims, input);

std::vector<int64_t> output_dims(dims);
// Must split both in 2
output_dims.push_back(int64_t(6));
std::vector<std::string> output {
    start_mark, u8"a", u8"b", end_mark, padval, padval,
    start_mark, u8"a", u8"b", end_mark, padval, padval, 
    start_mark, u8"b", u8"c", u8"d", u8"e", end_mark,
    start_mark, u8"a", /* u8"", if we allow null string*/ u8"b", u8"c", end_mark, padval,
};

test.AddOutput<std::string>("Y", output_dims, output);

std::cout << "*****\n";
test.Run(OpTester::ExpectResult::kExpectSuccess);

}
}

And it returns the following:

a;b

a
b

b;c
d;e

a;;b
c #Closed

Member Author


Thank you for the test. I will investigate. The spec allows separators that are prefixes of each other. This is what it says: "If 'separators' is a L-element array, there will be L rounds of tokenization using one stop word. More specifically, in the first round, the first element in 'separators' is used to tokenize each string in the input. Then, the second element in 'separators' will be used to tokenize the resulted strings produced at the first round."


In reply to: 239077639 [](ancestors = 239077639)

Member Author


I have added this test and fixed the issue according to the spec. We do allow separators with common prefix. The one specified earlier in the sequence will win.


In reply to: 239194043 [](ancestors = 239194043,239077639)

@yuslepukhin yuslepukhin merged commit c52636e into master Dec 6, 2018
@yuslepukhin yuslepukhin deleted the dmitrism/implement_tokenizer branch December 6, 2018 01:52
xadupre added a commit that referenced this pull request Mar 7, 2019
… build.py, add flag PRIVATE for the python bindings (#544)

* add option numpy_version to build against the installed numpy version and not 1.15.0 (hardcoded version number), default is still 1.15.0
* add option skip_keras_test to skip keras test even if keras is installed (still enabled by default)
disable unnecessary warnings about ubuntu
* enable option PRIVATE for the compilation of the Python bindings (settings recommended on pybind11 documentation)
* test on debian 9
pranavsharma pushed a commit that referenced this pull request May 1, 2019
tmccrmck added a commit to tmccrmck/onnxruntime that referenced this pull request Aug 28, 2019
TedThemistokleous added a commit to TedThemistokleous/onnxruntime that referenced this pull request Jun 8, 2024
Ensure version is used instead of version-dev
5 participants