Unicode Case Folding #228
Comments
I was curious and it seems that JS may support this out-of-the-box. So at least we wouldn't have to copy those tables in 😅
"ß".localeCompare("SS", 'en', {sensitivity: 'base'}) // 0
Ah, here's the equivalent with JDK.
import java.text.Collator
val coll = Collator.getInstance
coll.setStrength(Collator.PRIMARY)
coll.compare("ß", "SS") // 0
@armanbilge I actually already have the lookup tables: https://github.com/isomarcte/case-insensitive/blob/full-unicode-case-folding/core/src/main/scala/org/typelevel/ci/CaseFolds.scala#L23 It only took a few minutes to get emacs to transform the text file into that. That said, I'm happy to use existing JS/JRE based implementations, but I think we need to actually have a case folded string: https://github.com/isomarcte/case-insensitive/blob/full-unicode-case-folding/core/src/main/scala/org/typelevel/ci/CIString.scala#L43 I didn't see any way to get that from the
Thanks, yes, I actually saw your lookup tables which is what prompted me to look for any stdlib solutions for this ... at least for JS, this is just one more thing to bloat the generated code 🙃 You're right though, if there's no way to get a folded string then your lookup tables are the way to go 👍
I've only poked around in the JRE stdlib. I can check around in JS land to see if we can get a case folded string there and then maybe lighten the generated code and use the lookup tables on the JRE.
No worries, you can stay focused on doing awesome work in cats-uri and I'll worry about the JS stuff. Correctness/functionality first, optimizations second :)
Edit: ignore me :)
It looks like you've implemented "full" case folding in #229. Is what we have currently consistent with "simple" case folding? (I could probably answer this myself...) This would be a semantic breaking change, even if it's binary compatible. If we proceed, should this be 2.0? I wonder whether our current implementation varies by Java version. JDK8 supported Unicode 6.2, and Java 17 supports Unicode 13.0. It looks like your proposed algorithm guarantees stability, but only if we also normalize to NFKC?
Ugh, there are four definitions of this.
It seems you're targeting the Default definition, which seems as good a starting point as any.
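For reference, the four definitions in the Unicode standard are default, canonical, compatibility, and identifier caseless matching, and only the default one skips normalization. Here is a minimal sketch of the "normalize, fold, normalize again" shape that the canonical definition has. The JDK exposes no case folding function, so `approxFold` is a hypothetical stand-in that round-trips through upper and lower case; it is not the Unicode toCasefold. The object and method names are made up for illustration.

```scala
import java.text.Normalizer
import java.util.Locale

object NormalizedCaselessSketch {
  // Hypothetical stand-in for toCasefold: upper-case, then lower-case, locale-independent.
  // It approximates full case folding but does not implement CaseFolding.txt.
  def approxFold(s: String): String =
    s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT)

  // Normalize, fold, normalize again, with the normalization form as a parameter.
  def matchKey(s: String, form: Normalizer.Form): String =
    Normalizer.normalize(approxFold(Normalizer.normalize(s, form)), form)

  def main(args: Array[String]): Unit = {
    val precomposed = "\u00C9"  // É as a single code point
    val decomposed  = "e\u0301" // e followed by a combining acute accent

    // Folding alone leaves the two spellings different.
    println(approxFold(precomposed) == approxFold(decomposed)) // false
    // Normalizing around the fold makes them match.
    println(matchKey(precomposed, Normalizer.Form.NFD) ==
      matchKey(decomposed, Normalizer.Form.NFD)) // true
  }
}
```

Passing `Normalizer.Form.NFKC` instead of `NFD` gives the same result for this pair. Which form to use is exactly the open question above.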
I'm not sure. It is a semantic change, but most users will be unaffected. What we currently implement appears to be semantically equivalent to the simple case folding rules plus the rules for some Turkic languages, with no normalization. The Turkic rules are supposed to be off by default. That is, this should yield false.
scala> val a: Char = 0x0049.toChar
val a: Char = I
scala> val b: Char = 0x0131.toChar
val b: Char = ı
scala> CIString(a.toString) == CIString(b.toString)
val res0: Boolean = true
I can probably adapt my PR to provide both explicit simple caseless matching and full caseless matching, and then maybe we can give users the option. Let me take a look. I want to look into the normalization as well.
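To make the Turkic point concrete, here is a small sketch of a fold with the Turkic mappings behind a flag. `foldChar` and `caselessEq` are hypothetical helpers that cover only the CaseFolding.txt entries for U+0049 and U+0130; they are not how CIString is implemented. The last line uses the real `String.equalsIgnoreCase`, which may explain the `true` above, assuming CIString delegates to similar per-char JDK logic.

```scala
object DotlessISketch {
  // Hypothetical fold covering only the entries for U+0049 and U+0130 from
  // CaseFolding.txt; every other character is returned unchanged.
  def foldChar(c: Char, turkic: Boolean): Char = c match {
    case '\u0049' => if (turkic) '\u0131' else '\u0069' // I -> ı (status T) or i (status C)
    case '\u0130' if turkic => '\u0069'                 // İ -> i (status T only)
    case other => other
  }

  def caselessEq(x: String, y: String, turkic: Boolean): Boolean =
    x.map(foldChar(_, turkic)) == y.map(foldChar(_, turkic))

  def main(args: Array[String]): Unit = {
    val capitalI = "\u0049" // I
    val dotlessI = "\u0131" // ı
    println(caselessEq(capitalI, dotlessI, turkic = false)) // false
    println(caselessEq(capitalI, dotlessI, turkic = true))  // true
    // The JDK compares per char via toUpperCase, and U+0131 upper-cases to U+0049.
    println(capitalI.equalsIgnoreCase(dotlessI)) // true
  }
}
```

With the flag off, U+0131 folds to itself while U+0049 folds to U+0069, so the two stay distinct, which is the default behaviour the standard describes.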
@rossabaker @armanbilge take a look here for an
While working on cats-uri, I ran into an issue with how CIString was handling certain Unicode values, which led me to notice it wasn't respecting caseless matching from the Unicode standard. As it turns out, neither does String.equalsIgnoreCase.
I'd just about completed a branch to implement full case folding as defined by the Unicode standard when I ran across this test.
Since under the Unicode standard's caseless matching these two strings would compare equal, I'm beginning to think we are intentionally not following the standard here. Is that the case? If so, why? Is it to maintain parity with what the Java standard library is doing with methods like equalsIgnoreCase?
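As an illustration of the gap between `String.equalsIgnoreCase` and full case folding, here is a minimal sketch using ß and SS, the pair discussed in the comments. `approxFullFold` is a hypothetical helper built only from JDK calls; it approximates full folding and is not the table-driven algorithm from the branch.

```scala
import java.util.Locale

object EqualsIgnoreCaseSketch {
  // Hypothetical approximation of full case folding using only JDK calls.
  // String.toUpperCase applies multi-character mappings such as ß -> SS.
  def approxFullFold(s: String): String =
    s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    val sharpS = "\u00DF" // ß
    // equalsIgnoreCase compares char by char, so strings of different length never match.
    println(sharpS.equalsIgnoreCase("SS"))                  // false
    // Folding first lets the one-char and two-char spellings meet at "ss".
    println(approxFullFold(sharpS) == approxFullFold("SS")) // true
  }
}
```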