Enable toggling Case Encoding flag from C++ Train API #11

Merged (2 commits) on Jul 15, 2021


@rjai rjai commented Jul 13, 2021

Allow configuring the encode_unicode_case flag from the C++ training API by fixing the argument parsing in SentencePieceTrainer::MergeSpecsFromArgs.

@rjai rjai requested review from emjotde and snukky July 13, 2021 19:36
@rjai rjai marked this pull request as draft July 14, 2021 07:27
src/sentencepiece_trainer.cc
@rjai rjai marked this pull request as ready for review July 14, 2021 09:53
@@ -153,6 +153,12 @@ util::Status SentencePieceTrainer::MergeSpecsFromArgs(
       CHECK_OR_RETURN(absl::SimpleAtoi(value, &v));
       absl::SetFlag(&FLAGS_minloglevel, v);
       continue;
     } else if(key == "encode_unicode_case") {
       bool encode_unicode_case;
       std::istringstream(value) >> std::boolalpha >> encode_unicode_case;

I would have thought there is a better way of doing that, like std::stoi but with bool, but not really. So this is OK.

@emjotde emjotde merged commit 28f9eb8 into master Jul 15, 2021
XapaJIaMnu added a commit to browsermt/sentencepiece that referenced this pull request May 3, 2023
* Adding alternative project name for spm latest to prevent lib conflicts

* Update cmake

* Update CMakeFiles to allow for configurable artifact names

* Enables --encode_unicode_case option for case-aware sentence piece (marian-nmt#10)

* Enables --encode_unicode_case option for case-aware sentence piece
* Example: "This IS a TEST OF THE CASING" gets converted internally to "Tthis Uis a Atest of the casing" before segmentation.
* This is fully reversible.

* Enable toggling Case Encoding flag from C++ Train API (marian-nmt#11)

* Enable toggling Case Encoding flag from C++ Train API
* Fixing issue with hardcoding truth value of encode_unicode_case flag

* Disable denormalizer flags (marian-nmt#13)

Co-authored-by: Rohit Jain <[email protected]>

* Fix Surface String to Token Mappings for Case Encoding (marian-nmt#12)

Co-authored-by: Marcin Junczys-Dowmunt <[email protected]>
Co-authored-by: Rohit Jain <[email protected]>

* add one header file to installation

* Rename VERSION to VERSION.txt

Installing the Python package fails with the error below.
This change fixes that failure.
```
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [10 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/alferre/code/sentencepiece/python/setup.py", line 111, in <module>
          version=version(),
        File "/home/alferre/code/sentencepiece/python/setup.py", line 36, in version
          with codecs.open('VERSION.txt', 'r', 'utf-8') as f:
        File "/opt/conda/envs/ptca/lib/python3.8/codecs.py", line 905, in open
          file = builtins.open(filename, mode, buffering)
      FileNotFoundError: [Errno 2] No such file or directory: 'VERSION.txt'
      [end of output]
```

---------

Co-authored-by: Rohit Jain <[email protected]>
Co-authored-by: Rohit Jain <[email protected]>
Co-authored-by: Marcin Junczys-Dowmunt <[email protected]>
Co-authored-by: Roman Grundkiewicz <[email protected]>
Co-authored-by: alexandremuzio <[email protected]>