Fix SPM Case Encoding Training C++ API denorm flags #13

rjai · 2021-08-02T10:48:27Z

When case encoding is enabled for a model trained through the C++ Train API, a few whitespace-related flags in the denormalizer spec must be disabled. Otherwise the resulting SPM model is (a) no longer reversible and (b) inconsistent with the spec produced by spm_train.

rjai · 2021-08-02T10:51:04Z

These flags were not set in the denormalizer spec, which resulted in models where decode(encode(text)) != text: the text still contained _ (whitespace marker) characters after decoding.
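
A round-trip check along the lines of decode(encode(text)) == text can catch this kind of regression. The following is only a minimal sketch: `round_trip_failures`, `toy_encode`, `good_decode`, and `leaky_decode` are hypothetical names, and the toy functions merely imitate a whitespace marker (U+2581, the meta symbol SentencePiece uses for spaces) leaking through decode. They are not the SentencePiece implementation.

```python
from typing import Callable, List, Sequence, Tuple

# SentencePiece marks whitespace with U+2581; the toy functions below
# reuse it purely for illustration.
MARKER = "\u2581"


def round_trip_failures(
    encode: Callable[[str], str],
    decode: Callable[[str], str],
    samples: Sequence[str],
) -> List[Tuple[str, str]]:
    """Return (original, restored) pairs where decode(encode(text)) != text."""
    failures = []
    for text in samples:
        restored = decode(encode(text))
        if restored != text:
            failures.append((text, restored))
    return failures


def toy_encode(text: str) -> str:
    # Hypothetical stand-in for encoding: escape spaces with the marker.
    return text.replace(" ", MARKER)


def good_decode(encoded: str) -> str:
    # Reversible stand-in: turn the marker back into a space.
    return encoded.replace(MARKER, " ")


def leaky_decode(encoded: str) -> str:
    # Broken stand-in: leaves the marker in the output, which is the
    # symptom described in the comment above.
    return encoded


samples = ["This IS a TEST", "lowercase"]
ok = round_trip_failures(toy_encode, good_decode, samples)
broken = round_trip_failures(toy_encode, leaky_decode, samples)
```

With a real model, the `encode` and `decode` arguments would presumably wrap the trained processor's own encode and decode calls; the harness itself does not depend on how they are implemented.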

* Adding alternative project name for spm latest to prevent lib conflicts
* Update cmake
* Update CMakeFiles to allow for configurable artifact names
* Enables --encode_unicode_case option for case-aware sentence piece (marian-nmt#10)
* Enables --encode_unicode_case option for case-aware sentence piece
* Example: This IS a TEST OF THE CASING gets converted internally to Tthis Uis a Atest of the casing before segmentation.
* This is fully reversible.
* Enable toggling Case Encoding flag from C++ Train API (marian-nmt#11)
* Enable toggling Case Encoding flag from C++ Train API
* Fixing issue with hardcoding truth value of encode_decode_case flag
* Disable denormalizer flags (marian-nmt#13) Co-authored-by: Rohit Jain <[email protected]>
* Fix Surface String to Token Mappings for Case Encoding (marian-nmt#12) Co-authored-by: Marcin Junczys-Dowmunt <[email protected]> Co-authored-by: Rohit Jain <[email protected]>
* add one header file to installation
* Rename VERSION to VERSION.txt
* Rename VERSION to VERSION.txt

Installing python package fails with below error. This change addresses this issue

```
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [10 lines of output]
    Traceback (most recent call last):
      File "<string>", line 2, in <module>
      File "<pip-setuptools-caller>", line 34, in <module>
      File "/home/alferre/code/sentencepiece/python/setup.py", line 111, in <module>
        version=version(),
      File "/home/alferre/code/sentencepiece/python/setup.py", line 36, in version
        with codecs.open('VERSION.txt', 'r', 'utf-8') as f:
      File "/opt/conda/envs/ptca/lib/python3.8/codecs.py", line 905, in open
        file = builtins.open(filename, mode, buffering)
    FileNotFoundError: [Errno 2] No such file or directory: 'VERSION.txt'
    [end of output]
```

---------

Co-authored-by: Rohit Jain <[email protected]>
Co-authored-by: Rohit Jain <[email protected]>
Co-authored-by: Marcin Junczys-Dowmunt <[email protected]>
Co-authored-by: Roman Grundkiewicz <[email protected]>
Co-authored-by: alexandremuzio <[email protected]>

Disable denormalizer flags

d298334

rjai requested review from snukky and emjotde August 2, 2021 10:49

snukky approved these changes Aug 2, 2021

emjotde approved these changes Aug 27, 2021

emjotde merged commit 3921b9a into marian-nmt:master Aug 27, 2021
