If you train only on Unicode data in NFC, you will never see certain character bigrams in the Unicode portion of your training data. For example, you will never see 1025 102E in the Unicode portion of the training data, since that string NFC-normalizes to 1026. Since NFC is not a meaningful notion for Zawgyi, you cannot normalize Zawgyi strings before training. So if you have zero instances of 1025 102E in the Unicode portion of your training data and nonzero instances of it in the Zawgyi portion, your training data would be biased and any classifier trained on it would incorrectly view 1025 102E as an indicator of Zawgyi.
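The normalization claim above can be checked directly. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `survives_nfc` is made up for illustration and is not part of any existing tool.

```python
import unicodedata

# MYANMAR LETTER U followed by MYANMAR VOWEL SIGN II
decomposed = "\u1025\u102E"
# MYANMAR LETTER UU, the precomposed form
composed = "\u1026"

nfc = unicodedata.normalize("NFC", decomposed)
print([f"{ord(c):04X}" for c in nfc])


def survives_nfc(text: str) -> bool:
    """Return True if `text` is unchanged by NFC normalization.

    A bigram for which this returns False can never occur in a
    corpus that was NFC-normalized before training.
    (Hypothetical helper, for illustration only.)
    """
    return unicodedata.normalize("NFC", text) == text
```

Any bigram for which `survives_nfc` returns `False` is a candidate for the kind of bias described above, since it can appear in the (unnormalizable) Zawgyi portion but never in an NFC-normalized Unicode portion.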
It gets more subtle for other combinations: 102D 102F is not canonically equivalent to 102F 102D, AFAICT. It is a long-standing, valid complaint against Zawgyi that both orderings are used interchangeably with no preferred ordering defined, whereas UTN-11 defines a preferred ordering for Unicode. Yet otherwise sensible Unicode text in the wild still uses the non-preferred ordering. Or consider that 101D (wa) and 1040 (zero) are visually confusable in many typefaces; it could be argued that, for purposes of Z/U classification, these non-equivalent substrings could be treated as interchangeable. Arguably the Z/U decision rests more on larger structural properties than on details of locally interchangeable substrings. E.g., "1040 1031" can really only be Unicode (however misguided), but "1031 1040" can really only be Zawgyi. Some bias will likely sneak into the training data, perhaps purely from the innocent preferences of common input methods or from inadvertent typos that do not fundamentally change the structure of the representation.
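To make the equivalence question concrete, here is a small sketch that tests canonical equivalence by comparing NFD forms, again with Python's `unicodedata`. The function name `canonically_equivalent` is an illustrative assumption, not an existing API.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms match.
    (Hypothetical helper, for illustration only.)
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


i_then_u = "\u102D\u102F"  # VOWEL SIGN I, then VOWEL SIGN U
u_then_i = "\u102F\u102D"  # VOWEL SIGN U, then VOWEL SIGN I

# Both vowel signs have canonical combining class 0, so normalization
# never reorders them and the two orderings remain distinct strings.
classes = [unicodedata.combining(c) for c in i_then_u]
print(classes, canonically_equivalent(i_then_u, u_then_i))

# Visually confusable, but unrelated as far as normalization is concerned:
wa, zero = "\u101D", "\u1040"
print(canonically_equivalent(wa, zero))
```

Because normalization does not collapse either pair, any decision to treat them as interchangeable for Z/U classification would have to be made explicitly in preprocessing or feature design.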
From Martin Jansche:
The state of NFC in the wild should be analyzed (e.g., is non-NFC data commonly found?). Then the training data should be analyzed to make sure it is consistent with the NFC status of typical Unicode text on the internet.
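One possible starting point for such an analysis is to measure what fraction of lines in a sample is not already in NFC. The sketch below assumes the sample is available as an iterable of strings; the function name `nfc_stats` and the sample data are hypothetical.

```python
import unicodedata
from typing import Iterable, Tuple


def nfc_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total_lines, non_nfc_lines) for a text sample.

    A line counts as non-NFC if NFC normalization would change it.
    (Hypothetical helper, for illustration only.)
    """
    total = 0
    non_nfc = 0
    for line in lines:
        total += 1
        if unicodedata.normalize("NFC", line) != line:
            non_nfc += 1
    return total, non_nfc


# Tiny made-up sample: one NFC line, one non-NFC line, one ASCII line.
sample = ["\u1026", "\u1025\u102E", "abc"]
total, non_nfc = nfc_stats(sample)
print(f"{non_nfc}/{total} lines are not in NFC")
```

Running the same count over a web sample and over the Unicode portion of the training data would show whether the two have comparable NFC rates.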