Investigate potential NFC bias #2

Open
sffc opened this issue Jan 28, 2018 · 0 comments

sffc (Collaborator) commented Jan 28, 2018

From Martin Jansche:

If you train only on Unicode data in NFC, you will never see certain character bigrams in the Unicode portion of your training data. For example, you will never see 1025 102E in the Unicode portion of the training data, since that string NFC-normalizes to 1026. Since NFC is not a meaningful notion for Zawgyi, you cannot normalize Zawgyi strings before training. So if you have zero instances of 1025 102E in the Unicode portion of your training data and nonzero instances of it in the Zawgyi portion, your training data would be biased and any classifier trained on it would incorrectly view 1025 102E as an indicator of Zawgyi.
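To illustrate the composition Martin describes, here is a quick check with Python's standard `unicodedata` module (a minimal sketch, not code from this repo), showing that NFC folds the two-character sequence into the single composed character:

```python
import unicodedata

s = "\u1025\u102E"  # MYANMAR LETTER U + MYANMAR VOWEL SIGN II
nfc = unicodedata.normalize("NFC", s)
print([f"U+{ord(c):04X}" for c in nfc])  # ['U+1026'] -> MYANMAR LETTER UU
```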

It gets more subtle for other combinations: 102D 102F is not canonically equivalent to 102F 102D, AFAICT. Here it's a long-standing, valid complaint against Zawgyi that both orderings are used interchangeably in Zawgyi with no preferred ordering defined, while UTN-11 defines a preferred ordering. But otherwise sensible Unicode text in the wild still uses the non-preferred ordering.

Or consider that 101D (wa) and 1040 (zero) are visually confusable in many typefaces; it could be argued that for purposes of Z/U classification, these non-equivalent substrings could be treated as interchangeable.

Arguably the Z/U decision is based more on larger structural properties than on details of locally interchangeable substrings. E.g. "1040 1031" can really only be Unicode (however misguided), but "1031 1040" can really only be Zawgyi. It's likely that some bias will sneak into the training data, maybe purely based on the innocent preferences of common input methods, or due to inadvertent but innocent typos that do not fundamentally change the structure of the representation.
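The non-equivalence of the two orderings can be verified the same way: two strings are canonically equivalent exactly when their NFD forms are identical, and since both vowel signs carry canonical combining class 0, normalization never reorders them (again a sketch using the standard `unicodedata` module):

```python
import unicodedata

a, b = "\u102D\u102F", "\u102F\u102D"  # VOWEL SIGN I / VOWEL SIGN U, both orders
# Canonically equivalent strings have identical NFD forms; these do not.
print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))  # False
# Both marks have combining class 0, so NFC/NFD never reorder them.
print(unicodedata.combining("\u102D"), unicodedata.combining("\u102F"))    # 0 0
```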

The state of NFC in the wild should be analyzed (e.g., is non-NFC data commonly found?). Then, the training data should be analyzed to make sure it is consistent with the NFC status of typical Unicode text on the internet.
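One way to run that first analysis, sketched in Python (the helper name and corpus path below are placeholders, not part of this repo): compute the fraction of corpus lines that are already in NFC, using `unicodedata.is_normalized` (Python 3.8+):

```python
import unicodedata

def nfc_fraction(lines):
    """Fraction of lines that are already in NFC form (hypothetical helper)."""
    hits = sum(unicodedata.is_normalized("NFC", line) for line in lines)
    return hits / max(len(lines), 1)

# Hypothetical usage on a crawled Burmese corpus:
# with open("corpus_my.txt", encoding="utf-8") as f:
#     print(f"{nfc_fraction(f.read().splitlines()):.1%} of lines already NFC")
```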
