Prevent Zero Width Joiner replaced with whitespace #630

Merged
merged 1 commit into from
Feb 25, 2021

Conversation

@sarubi sarubi commented Feb 23, 2021

Resolve #629

Currently, the Zero Width Joiner character is replaced by whitespace. As a result, Sinhala text, which requires the Zero Width Joiner, gets altered and produces wrong output.

Ideally, we shouldn't replace the Zero Width Joiner (U+200D) with whitespace or an empty string, since it indicates that two characters should be joined with zero width (no whitespace). It also needs to be present in order to decode the segmentation back to raw text successfully. We should keep these special characters as they are.
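
To illustrate the problem, here is a minimal Python sketch. It does not call SentencePiece; `replace_zwj_with_space` is a hypothetical stand-in for a normalization rule that maps U+200D to a space, and `plain_nfkc` uses the standard library's NFKC normalization as an assumed baseline that leaves U+200D untouched. The sample word is the Sinhala "ශ්‍රී", which contains a Zero Width Joiner.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Sinhala "ශ්‍රී": SHA + AL-LAKUNA (virama) + ZWJ + RA + vowel sign II
SRI = "\u0dc1\u0dca" + ZWJ + "\u0dbb\u0dd3"


def replace_zwj_with_space(text: str) -> str:
    """Hypothetical stand-in for a rule that rewrites ZWJ as whitespace."""
    return text.replace(ZWJ, " ")


def plain_nfkc(text: str) -> str:
    """Standard Unicode NFKC; U+200D has no decomposition, so it is kept."""
    return unicodedata.normalize("NFKC", text)


altered = replace_zwj_with_space(SRI)
kept = plain_nfkc(SRI)

# With the replacement rule, the conjunct is split by a space and the
# original text can no longer be recovered after decoding.
print([f"U+{ord(c):04X}" for c in altered])

# With ZWJ preserved, the round trip returns the original code points.
print([f"U+{ord(c):04X}" for c in kept])
```

Once the joiner has been turned into a space, nothing in the output distinguishes it from an ordinary word boundary, which is why detokenization cannot restore the original Sinhala text.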

This PR removes the places where ZWJ was replaced with whitespace, which I identified by referring to commit [1]:
data/nmt_nfkc.tsv
data/nmt_nfkc_cf.tsv
builder.cc

Even though the local build [2] was successful, src/normalization_rule.h unfortunately did not get updated, so I was not able to apply my changes.
I would appreciate any insight on this, since I want to make it work for the Sinhala language.

[1] 18c337f#diff-c0960ff1cd2e917394da837d2a75ca10e06abab61bbd2d36f174fa284c8d700bR57266
[2] https://github.com/google/sentencepiece#build-and-install-sentencepiece-command-line-tools-from-c-source

google-cla bot commented Feb 23, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

sarubi commented Feb 23, 2021

@googlebot I signed it!

Development

Successfully merging this pull request may close these issues.

SentencePiece is not working properly for the Sinhala Language due to Zero Width Joiner is getting removed