Fix SPM Case Encoding Training C++ API denorm flags #13

Merged 1 commit on Aug 27, 2021

@rjai rjai commented Aug 2, 2021

When Case Encoding is used with model training through the C++ Train API, a few whitespace-related flags in the denormalizer spec must be disabled. Otherwise, the resulting SPM model trained through the C++ Train API is a) no longer reversible and b) inconsistent with the spm_train spec.

@rjai rjai requested review from snukky and emjotde August 2, 2021 10:49

rjai commented Aug 2, 2021

I missed specifying these flags in the denormalizer spec, which resulted in models where decode(encode(text)) != text: the text still contained _ characters after decoding.
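To illustrate the symptom described above, here is a minimal, library-independent sketch. It does not use SentencePiece, and the PR does not name the flags or spell out the mechanism, so everything here is an assumption: `ToySpec`, its `escape_whitespaces` field, and the `encode`/`decode` helpers are hypothetical stand-ins. The idea is that if the denormalizer re-applies whitespace escaping on decoded text, the `_` meta symbol leaks into the output and the round trip fails.

```python
from dataclasses import dataclass
from typing import List

META = "_"  # stand-in for the whitespace meta symbol left in the decoded text


@dataclass
class ToySpec:
    """Hypothetical (de)normalizer spec with a single whitespace flag."""
    escape_whitespaces: bool = True


def apply_spec(spec: ToySpec, text: str) -> str:
    # Rewrite spaces as the meta symbol only when the flag is enabled.
    return text.replace(" ", META) if spec.escape_whitespaces else text


def encode(normalizer: ToySpec, text: str) -> List[str]:
    # Toy segmentation: normalize, then emit one piece per character.
    return list(apply_spec(normalizer, text))


def decode(denormalizer: ToySpec, pieces: List[str]) -> str:
    # Join the pieces, restore spaces, then run the denormalizer on the result.
    surface = "".join(pieces).replace(META, " ")
    return apply_spec(denormalizer, surface)


def round_trips(denormalizer: ToySpec, text: str) -> bool:
    # Checks the reversibility property decode(encode(text)) == text.
    return decode(denormalizer, encode(ToySpec(), text)) == text
```

In this toy, `round_trips(ToySpec(escape_whitespaces=True), "This IS a TEST")` is false because the decoded text keeps its `_` characters, while the same check with `escape_whitespaces=False` passes. A similar round-trip assertion against a real trained model would be a reasonable regression test for this fix.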

@emjotde emjotde merged commit 3921b9a into marian-nmt:master Aug 27, 2021
XapaJIaMnu added a commit to browsermt/sentencepiece that referenced this pull request May 3, 2023
* Adding alternative project name for spm latest to prevent lib conflicts

* Update cmake

* Update CMakeFiles to allow for configurable artifact names

* Enables --encode_unicode_case option for case-aware sentence piece (marian-nmt#10)

* Enables --encode_unicode_case option for case-aware sentence piece
* Example: This IS a TEST OF THE CASING gets converted internally to Tthis Uis a Atest of the casing before segmentation.
* This is fully reversible.

* Enable toggling Case Encoding flag from C++ Train API (marian-nmt#11)

* Enable toggling Case Encoding flag from C++ Train API
* Fixing issue with hardcoding truth value of encode_decode_case flag

* Disable denormalizer flags (marian-nmt#13)

Co-authored-by: Rohit Jain <[email protected]>

* Fix Surface String to Token Mappings for Case Encoding (marian-nmt#12)

Co-authored-by: Marcin Junczys-Dowmunt <[email protected]>
Co-authored-by: Rohit Jain <[email protected]>

* add one header file to installation

* Rename VERSION to VERSION.txt

Installing the Python package fails with the error below.
This change addresses the issue.
```
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [10 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/alferre/code/sentencepiece/python/setup.py", line 111, in <module>
          version=version(),
        File "/home/alferre/code/sentencepiece/python/setup.py", line 36, in version
          with codecs.open('VERSION.txt', 'r', 'utf-8') as f:
        File "/opt/conda/envs/ptca/lib/python3.8/codecs.py", line 905, in open
          file = builtins.open(filename, mode, buffering)
      FileNotFoundError: [Errno 2] No such file or directory: 'VERSION.txt'
      [end of output]
```

---------

Co-authored-by: Rohit Jain <[email protected]>
Co-authored-by: Rohit Jain <[email protected]>
Co-authored-by: Marcin Junczys-Dowmunt <[email protected]>
Co-authored-by: Roman Grundkiewicz <[email protected]>
Co-authored-by: alexandremuzio <[email protected]>