Find out exact ordering of hiragana/katakana letters in native apple platforms #86636

Closed
Tracked by #80689
mkhamoyan opened this issue May 23, 2023 · 5 comments

@mkhamoyan
Contributor

mkhamoyan commented May 23, 2023

While working on #85965 we found out that the ordering of hiragana/katakana letters on native Apple platforms is not entirely clear.
There are 3 cases:

  1. Letters that have a small equivalent
    In this case the ordering is: hiragana small letter < katakana small letter < hiragana letter < katakana letter.

    | code | char | name |
    | --- | --- | --- |
    | \u3041 | ぁ | Hiragana letter small A |
    | \u3042 | あ | Hiragana letter A |
    | \u30A1 | ァ | Katakana letter small A |
    | \u30A2 | ア | Katakana letter A |
  2. Letters without a small equivalent
    In this case the katakana letter sorts before the hiragana letter (katakana letter < hiragana letter), but it is unclear whether these letters come after the small katakana letters or somewhere else.

    | code | char | name |
    | --- | --- | --- |
    | \u30C0 | ダ | Katakana letter DA |
    | \u3060 | だ | Hiragana letter DA |
  3. Letters that exist only in katakana
    It is unclear where these letters fall in the ordering.

    | code | char | name |
    | --- | --- | --- |
    | \u30F4 | ヴ | Katakana letter VU |

Find out the exact ordering rules, update hybrid-globalization.md with the details of the CompareString function on OSX platforms, and add more test cases that demonstrate the ordering.
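
As a rough illustration, the ordering reported in case 1 can be modeled as a sort key that puts small letters before full-size ones and hiragana before katakana. The `reported_case1_key` helper below is hypothetical and only encodes that single observation. It is not Apple's collation algorithm and says nothing about cases 2 and 3.

```python
import unicodedata

def reported_case1_key(ch):
    """Hypothetical key modeling the order observed in case 1 only."""
    name = unicodedata.name(ch)
    is_full_size = 0 if " SMALL " in name else 1           # small letters first
    is_katakana = 1 if name.startswith("KATAKANA") else 0  # hiragana first
    return (is_full_size, is_katakana)

letters = ["\u3042", "\u30A2", "\u3041", "\u30A1"]  # あ ア ぁ ァ
ordered = sorted(letters, key=reported_case1_key)

for ch in ordered:
    print(f"U+{ord(ch):04X} {ch} {unicodedata.name(ch)}")
```

Test cases for the native comparison could be checked against a model like this to see where the observed behavior starts to diverge.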

Contributes to #80689

@mkhamoyan mkhamoyan self-assigned this May 23, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 23, 2023
@ghost

ghost commented May 23, 2023

Tagging subscribers to 'os-ios': @steveisok, @akoeplinger
See info in area-owners.md if you want to be subscribed.

Issue Details

While working on #85965 we found out that the ordering of hiragana/katakana letters on native Apple platforms is not entirely clear.
There are 3 cases:

  1. Letters that have lowercase/uppercase
    In this case the ordering is: hiragana lowercase < katakana lowercase < hiragana uppercase < katakana uppercase.

    | code | char | name |
    | --- | --- | --- |
    | \u3041 | ぁ | Hiragana letter small A |
    | \u3042 | あ | Hiragana letter A |
    | \u30A1 | ァ | Katakana letter small A |
    | \u30A2 | ア | Katakana letter A |
  2. Letters without lowercase/uppercase
    In this case the katakana letter sorts before the hiragana letter (katakana letter < hiragana letter), but it is unclear whether these letters come after the lowercase katakana letters or somewhere else.

    | code | char | name |
    | --- | --- | --- |
    | \u30C0 | ダ | E38380 |
    | \u3060 | だ | E381A0 |
  3. Letters that exist only in katakana
    It is unclear where these letters fall in the ordering.

    | code | char | name |
    | --- | --- | --- |
    | \u30F4 | ヴ | Katakana letter VU |

Find out the exact ordering rules, update hybrid-globalization.md with the details of the CompareString function on OSX platforms, and add more test cases that demonstrate the ordering.

Contributes to #80689

Author: mkhamoyan
Assignees: mkhamoyan
Labels:

area-System.Globalization, os-ios, os-tvos, os-maccatalyst

Milestone: -

@Clockwork-Muse
Contributor

> Letters that have lowercase/uppercase

There's no such thing as "case" (in the English/Latin sense; I guess they might still be designated that way in Unicode, but I somehow doubt it) for kana. Small characters are normally used to modify the sounds of "normal sized" characters. For instance:

きよ (ki yo) -> きょ (kyo)

You're not supposed to have small characters on their own - formally that doesn't make any sense.

> Letters without lowercase/uppercase

I'm not sure why you're stating the ordering of hiragana/katakana flipped here, unless it's something specific to the original test data? Except looking at the Unicode blocks there's a complete match, so it should be possible to do this via offset from the start of the block (assuming an interleaved/phonetic ordering, rather than just two separate blocks).
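
A minimal sketch of the block-offset idea, assuming the main ranges U+3041..U+3096 (hiragana) and U+30A1..U+30F6 (katakana) line up one to one. The `interleaved_key` helper is hypothetical, and the choice to place hiragana before katakana within a pair is arbitrary (NLS and ICU disagree on that point).

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096  # ぁ .. ゖ
KATAKANA_START, KATAKANA_END = 0x30A1, 0x30F6  # ァ .. ヶ

def interleaved_key(ch):
    """Hypothetical key: (offset within the block, 0 for hiragana / 1 for katakana)."""
    code = ord(ch)
    if HIRAGANA_START <= code <= HIRAGANA_END:
        return (code - HIRAGANA_START, 0)
    if KATAKANA_START <= code <= KATAKANA_END:
        return (code - KATAKANA_START, 1)
    raise ValueError(f"U+{code:04X} is outside the assumed kana ranges")

sample = ["\u30C0", "\u3060", "\u30F4", "\u3042", "\u30A2"]  # ダ だ ヴ あ ア
interleaved = sorted(sample, key=interleaved_key)
print(interleaved)
```

With this key each hiragana letter sits directly next to its katakana counterpart, instead of the whole hiragana block sorting before the whole katakana block as it does in plain code point order.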

Note that the two characters chosen as an example have the (ten-ten) marks, which change the sound of the characters, as part of the character, but there are also additional separate and combining character versions. That is, there's both だ (た U+305F followed by the combining mark U+3099) and だ (precomposed U+3060), which are separate Unicode sequences!
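
The difference between the two sequences can be seen with Python's standard `unicodedata` module. This is only an illustration of Unicode canonical equivalence, not of how any particular platform compares the strings.

```python
import unicodedata

precomposed = "\u3060"       # だ HIRAGANA LETTER DA
decomposed = "\u305F\u3099"  # た + COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK

# The raw code point sequences differ, so an ordinal comparison sees two strings.
raw_equal = precomposed == decomposed

# Canonical normalization maps one form onto the other.
nfd_of_precomposed = unicodedata.normalize("NFD", precomposed)
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)

print(raw_equal, nfd_of_precomposed == decomposed, nfc_of_decomposed == precomposed)
```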

> Letters only existing in katakana

The Unicode block actually shows an equivalent entry for a hiragana character, ゔ (U+3094, Hiragana letter VU). There are some additional entries or other differences between the two blocks (the ten-ten marks are part of the hiragana block, for example).

I don't actually recognize everything in the blocks - they aren't in the commonly taught syllabary (at minimum when learning Japanese as a foreign language, but I don't think in Japanese schools either). I'm not sure of some of the uses of some of the characters.

Also, as an additional wrinkle, for historical reasons there's a half-width katakana block. Note this block does not include the pre-combined versions of characters, and has an additional set of (half-width) combining characters.
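
To illustrate the half-width wrinkle: a half-width katakana letter followed by the half-width voiced sound mark only maps to the full-width precomposed letter under compatibility normalization (NFKC). Canonical normalization (NFC) leaves it alone. The sketch below uses Python's `unicodedata` and makes no claim about what the platform APIs do with half-width input.

```python
import unicodedata

halfwidth_da = "\uFF80\uFF9E"  # ﾀ + HALFWIDTH KATAKANA VOICED SOUND MARK
fullwidth_da = "\u30C0"        # ダ KATAKANA LETTER DA

after_nfc = unicodedata.normalize("NFC", halfwidth_da)    # canonical only
after_nfkc = unicodedata.normalize("NFKC", halfwidth_da)  # compatibility mapping

print(after_nfc == halfwidth_da, after_nfkc == fullwidth_da)
```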

@mkhamoyan
Contributor Author

mkhamoyan commented May 24, 2023

Thanks for the explanation. My bad for using the words lowercase/uppercase; I changed them to small letter.

I wanted to give examples where the ICU compareString and Apple native compareString functions return different results when comparing hiragana and katakana letters.
We had test cases that expect, for example,
\u30C0 ダ to sort before \u3060 だ when using ICU, but the Apple native compareString function returns a different ordering.

It is known that on Windows NLS hiragana characters sort after katakana, while on ICU it is the opposite.
This task was created to investigate how hiragana and katakana characters are sorted on Apple platforms.

@tarekgh
Member

tarekgh commented Jul 25, 2023

@mkhamoyan @steveisok could you please triage this issue? Thanks!

@steveisok steveisok removed the untriaged New issue has not been triaged by the area owner label Jul 25, 2023
@steveisok steveisok added this to the 9.0.0 milestone Jul 25, 2023
@mkhamoyan mkhamoyan modified the milestones: 9.0.0, Future Jun 25, 2024
@mkhamoyan
Contributor Author

Further investigation revealed that the observed behavioral changes are a result of our approach to matching ICU results in hybrid mode.
Specifically, we use precomposedStringWithCanonicalMapping combined with stringByFoldingWithOptions:locale: from the NSString class.
While this approach has led to some behavioral changes, it is currently our best method for achieving results equivalent to those obtained with ICU.
The behavioral changes are already documented.
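
For readers unfamiliar with the first of those calls: precomposedStringWithCanonicalMapping normalizes a string to Unicode Normalization Form C. A rough Python analogue of that pre-normalization step is sketched below. The `canonically_equal` helper is hypothetical, and the folding step is deliberately left out because the folding options used are not stated in this thread.

```python
import unicodedata

def canonically_equal(a, b):
    """Hypothetical helper: compare two strings after NFC normalization only.

    This mirrors just the precomposed (NFC) step; it does not reproduce
    stringByFoldingWithOptions:locale: or any locale-aware collation.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

same_letter = canonically_equal("\u30BF\u3099", "\u30C0")  # タ + combining mark vs ダ
different_script = canonically_equal("\u3060", "\u30C0")   # だ vs ダ

print(same_letter, different_script)
```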

@github-actions github-actions bot locked and limited conversation to collaborators Aug 3, 2024