As reported in "Empty tokens in output vocabulary" (#276), the Zero Width Joiner (ZWJ) character is being replaced by whitespace. As a result, text in languages that rely on ZWJ is altered and the output is wrong. This affects languages and scripts such as Sinhala, Devanagari, Kannada and Malayalam [1].
Ideally, we shouldn't replace the Zero Width Joiner (U+200D) with whitespace or an empty string: it signals that the two adjacent characters should be joined, and it has zero width (it is not whitespace). It also has to be present for the segmentation to be decoded back to the raw text successfully. We should keep these special characters as they are.
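To illustrate the problem, here is a minimal sketch in plain Python. It does not call the library's own normalizer; `replace_zwj_with_space` is a hypothetical stand-in that mimics the reported behaviour, and the sample word (Sinhala "śrī", written with a ZWJ) is my own example rather than one from the original report.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Sinhala "sri": SHA + AL-LAKUNA (virama) + ZWJ + RA + vowel sign II.
word = "\u0dc1\u0dca\u200d\u0dbb\u0dd3"


def replace_zwj_with_space(text: str) -> str:
    """Hypothetical normalizer that mimics the reported behaviour."""
    return text.replace(ZWJ, " ")


# ZWJ is a format character, not whitespace, so the untouched word
# stays a single whitespace-delimited piece.
original_pieces = word.split()

# After the replacement the word is split in two at the former ZWJ.
lossy = replace_zwj_with_space(word)
lossy_pieces = lossy.split()

# Joining the pieces back together cannot recover the raw text,
# because the ZWJ itself is gone.
rejoined = "".join(lossy_pieces)

print(len(original_pieces), len(lossy_pieces), rejoined == word)
```

Under these assumptions the replaced text no longer round-trips: the word is broken into two pieces and the joiner cannot be restored from them, which is why the character needs to be preserved.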
[1] https://en.wikipedia.org/wiki/Zero-width_joiner