Implement (Almost) All The Unicode Caseless Matching Systems #232
Draft: isomarcte wants to merge 11 commits into typelevel:series/2.x from isomarcte:all-the-case-folding
Changes from 9 commits
Commits (11)
b376e37 Implement Unicode Case Folding (isomarcte)
f0bd02f Fix Formatting Error (isomarcte)
cfebcb0 Fix Compare (isomarcte)
9ca6c36 Fix CIString Compare (isomarcte)
98f82e7 Rename CaseFolds To CaseFolding (isomarcte)
6ed8433 Add Simple Case Folding Tables (isomarcte)
dadac58 Implement (Almost) All The Unicode Caseless Matching Systems (isomarcte)
4a3fba6 Merge remote-tracking branch 'origin/main' into all-the-case-folding (isomarcte)
3bc849d Add Missing `new` Keywords (isomarcte)
619b36f Define CIString To Be A CanonicalFullCaseFoldedString (isomarcte)
47e5980 Add More Docs (isomarcte)
@@ -0,0 +1,41 @@
/*
 * Copyright 2020 Typelevel
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.typelevel.ci

final class CIStringCF private (
    override val toString: String,
    val asCanonicalFullCaseFoldedString: CanonicalFullCaseFoldedString)
    extends Serializable {
  override def equals(that: Any): Boolean =
    that match {
      case that: CIStringCF =>
        asCanonicalFullCaseFoldedString == that.asCanonicalFullCaseFoldedString
      case _ =>
        false
    }

  override def hashCode(): Int =
    asCanonicalFullCaseFoldedString.hashCode
}

object CIStringCF {
  def apply(value: String): CIStringCF =
    new CIStringCF(
      value,
      CanonicalFullCaseFoldedString(value)
    )
}
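The wrapper in this diff keeps the original spelling for `toString` while comparing and hashing on the folded form. A minimal, self-contained sketch of that pattern follows. `CaselessKey` and `approxFold` are hypothetical names, and upper-casing then lower-casing with `Locale.ROOT` is only a rough stand-in for the PR's `CaseFolding.fullCaseFoldString`, which is not reproduced here.

```scala
import java.util.Locale

object FoldSketch {
  // Assumption: approximates full case folding; this is NOT the Unicode
  // CaseFolding.txt algorithm and NOT the PR's implementation.
  def approxFold(s: String): String =
    s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT)
}

// Keeps the original spelling for display; equality and hashing use the folded key.
final class CaselessKey private (override val toString: String, val folded: String) {
  override def equals(that: Any): Boolean =
    that match {
      case other: CaselessKey => folded == other.folded
      case _ => false
    }

  override def hashCode(): Int = folded.hashCode
}

object CaselessKey {
  def apply(value: String): CaselessKey =
    new CaselessKey(value, FoldSketch.approxFold(value))
}
```

Because `equals` and `hashCode` agree on the folded key, such a value should behave as a case-insensitive map key, for example `Map(CaselessKey("Content-Type") -> 1).get(CaselessKey("CONTENT-TYPE"))`.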
@@ -0,0 +1,87 @@
/*
 * Copyright 2020 Typelevel
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.typelevel.ci

import cats._
import cats.kernel._
import cats.syntax.all._

final class CIStringCS private (
    override val toString: String,
    val asCanonicalSimpleCaseFoldedString: CanonicalSimpleCaseFoldedString)
    extends Serializable {

  override def equals(that: Any): Boolean =
    that match {
      case that: CIStringCS =>
        asCanonicalSimpleCaseFoldedString == that.asCanonicalSimpleCaseFoldedString
      case _ =>
        false
    }

  override def hashCode(): Int =
    asCanonicalSimpleCaseFoldedString.hashCode
}

object CIStringCS {

  def apply(value: String): CIStringCS =
    new CIStringCS(
      value,
      CanonicalSimpleCaseFoldedString(value)
    )

  val empty: CIStringCS = apply("")

  implicit val hashAndOrderForCIStringCS: Hash[CIStringCS] with Order[CIStringCS] =
    new Hash[CIStringCS] with Order[CIStringCS] {
      override def hash(x: CIStringCS): Int =
        x.hashCode

      override def compare(x: CIStringCS, y: CIStringCS): Int =
        x.asCanonicalSimpleCaseFoldedString.compare(y.asCanonicalSimpleCaseFoldedString)
    }

  implicit val orderingForCIStringCS: Ordering[CIStringCS] =
    hashAndOrderForCIStringCS.toOrdering

  implicit val showForCIStringCS: Show[CIStringCS] =
    Show.fromToString

  implicit val lowerBoundForCIStringCS: LowerBounded[CIStringCS] =
    new LowerBounded[CIStringCS] {
      override val partialOrder: PartialOrder[CIStringCS] =
        hashAndOrderForCIStringCS

      override val minBound: CIStringCS =
        empty
    }

  implicit val monoidForCIStringCS: Monoid[CIStringCS] =
    new Monoid[CIStringCS] {
      override val empty: CIStringCS = CIStringCS.empty

      override def combine(x: CIStringCS, y: CIStringCS): CIStringCS =
        CIStringCS(x.toString + y.toString)

      override def combineAll(xs: IterableOnce[CIStringCS]): CIStringCS = {
        val sb: StringBuilder = new StringBuilder
        xs.iterator.foreach(cfs => sb.append(cfs.toString))
        CIStringCS(sb.toString)
      }
    }
}
@@ -0,0 +1,87 @@
/*
 * Copyright 2020 Typelevel
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.typelevel.ci

import cats._
import cats.kernel._
import cats.syntax.all._

final class CIStringS private (
    override val toString: String,
    val asSimpleCaseFoldedString: SimpleCaseFoldedString)
    extends Serializable {

  override def equals(that: Any): Boolean =
    that match {
      case that: CIStringS =>
        asSimpleCaseFoldedString == that.asSimpleCaseFoldedString
      case _ =>
        false
    }

  override def hashCode(): Int =
    asSimpleCaseFoldedString.hashCode
}

object CIStringS {

  def apply(value: String): CIStringS =
    new CIStringS(
      value,
      SimpleCaseFoldedString(value)
    )

  val empty: CIStringS = apply("")

  implicit val hashAndOrderForCIStringS: Hash[CIStringS] with Order[CIStringS] =
    new Hash[CIStringS] with Order[CIStringS] {
      override def hash(x: CIStringS): Int =
        x.hashCode

      override def compare(x: CIStringS, y: CIStringS): Int =
        x.asSimpleCaseFoldedString.compare(y.asSimpleCaseFoldedString)
    }

  implicit val orderingForCIStringS: Ordering[CIStringS] =
    hashAndOrderForCIStringS.toOrdering

  implicit val showForCIStringS: Show[CIStringS] =
    Show.fromToString

  implicit val lowerBoundForCIStringS: LowerBounded[CIStringS] =
    new LowerBounded[CIStringS] {
      override val partialOrder: PartialOrder[CIStringS] =
        hashAndOrderForCIStringS

      override val minBound: CIStringS =
        empty
    }

  implicit val monoidForCIStringS: Monoid[CIStringS] =
    new Monoid[CIStringS] {
      override val empty: CIStringS = CIStringS.empty

      override def combine(x: CIStringS, y: CIStringS): CIStringS =
        CIStringS(x.toString + y.toString)

      override def combineAll(xs: IterableOnce[CIStringS]): CIStringS = {
        val sb: StringBuilder = new StringBuilder
        xs.iterator.foreach(cfs => sb.append(cfs.toString))
        CIStringS(sb.toString)
      }
    }
}
core/src/main/scala/org/typelevel/ci/CanonicalFullCaseFoldedString.scala
40 changes: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
/*
 * Copyright 2020 Typelevel
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.typelevel.ci

import java.text.Normalizer
import scala.annotation.tailrec

final case class CanonicalFullCaseFoldedString private (override val toString: String)
    extends AnyVal

object CanonicalFullCaseFoldedString {
  def apply(value: String): CanonicalFullCaseFoldedString =
    new CanonicalFullCaseFoldedString(
      Normalizer.normalize(
        if (Normalizer.isNormalized(value, Normalizer.Form.NFD)) {
          CaseFolding.fullCaseFoldString(value)
        } else {
          CaseFolding.fullCaseFoldString(Normalizer.normalize(value, Normalizer.Form.NFD))
        },
        Normalizer.Form.NFD
      )
    )

  val empty: CanonicalFullCaseFoldedString =
    apply("")
}
http4s-0.22 and 0.23 would have to pin, because this is a big part of its public API, and even if we nowarned all the usages, the deprecation would be viral into everyone's apps. I would really rather avoid a source-breaking change on this type if we can avoid it.
Yeah, I'm struggling with what to do here. I'd like `CIString` to be the one most users will want out of the box, but we can't do that unless we break the semantics by changing it to Unicode canonical caseless matching with full case-folded strings.
Most users probably won't notice, so maybe it's okay? That would also help with the concern about advertising which type is the best to use by default.
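As a rough illustration of where simple and full matching can disagree, here is a small Scala sketch. It makes two assumptions: `String.equalsIgnoreCase` stands in for a per-character (simple-style) comparison, and upper-casing then lower-casing with `Locale.ROOT` stands in for full case folding. Neither is the PR's `CaseFolding` implementation, and `SimpleVsFull` is a hypothetical name.

```scala
import java.util.Locale

object SimpleVsFull {
  // Per-character comparison: one character can never match two.
  def simpleMatch(a: String, b: String): Boolean =
    a.equalsIgnoreCase(b)

  // Assumption: approximates full case folding, where one character may
  // expand to several (for example "ß" becomes "ss").
  def approxFullFold(s: String): String =
    s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT)

  def fullMatch(a: String, b: String): Boolean =
    approxFullFold(a) == approxFullFold(b)
}
```

Under these stand-ins, `"straße"` and `"STRASSE"` match only with the full variant, while ASCII inputs such as `"Content-Type"` and `"content-type"` match either way, which is consistent with the expectation that most users would not notice the change.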
Well, are we breaking semantics? Or was our current implementation buggy, and we are fixing the semantics?
The CI microsite states that "locale independence" is a design goal. Is that what @isomarcte is achieving here?
Edit: hmm, I see that "locale independence" is a technical term meaning that, e.g., `Locale.setDefault(Locale.forLanguageTag("tr"))` should not change the result.
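For readers unfamiliar with the locale issue, a minimal Scala sketch of the JVM behaviour in question follows. It passes the locale explicitly instead of calling `Locale.setDefault`, so it does not mutate global state. `LocaleDemo` and `lower` are hypothetical names used only for illustration.

```scala
import java.util.Locale

object LocaleDemo {
  val turkish: Locale = Locale.forLanguageTag("tr")

  // Locale-sensitive lowercasing: the result depends on the locale argument.
  // The no-argument String.toLowerCase() uses the JVM default locale instead,
  // which is why changing the default can change results.
  def lower(s: String, locale: Locale): String =
    s.toLowerCase(locale)
}
```

With `Locale.ROOT`, `"TITLE"` lowercases to `"title"`. With the Turkish locale, the capital `I` becomes the dotless `ı` (U+0131), giving `"tıtle"`.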
@armanbilge that's a tricky question. Yes and no. The Turkic special cases are obviously not locale independent. However, and forgive me for splitting hairs here, I'm not sure you can say that any Unicode caseless matching is locale independent. Whether or not `"ı" == "I"` in a caseless match is really only answerable in the context of some locale.
Inasmuch as one can have locale independence in this context, the non-Turkic variants achieve that.
I can drop the Turkic ones, but...I don't know...maybe some Turkic language users will want to use them?
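The `"ı" == "I"` question can be sketched with two tiny folding functions. This is an illustration only: it covers just the dotted and dotless I characters, using the default mapping (I to i) and the Turkic mappings (I to ı, İ to i) listed in Unicode's CaseFolding.txt. `TurkicFoldSketch` and its members are hypothetical names, not the PR's API.

```scala
object TurkicFoldSketch {
  // Default folding for the characters of interest: I -> i; dotless ı is unchanged.
  def foldDefault(c: Char): Char =
    c match {
      case 'I' => 'i'
      case other => other
    }

  // Turkic folding: I -> ı (U+0131), İ (U+0130) -> i.
  def foldTurkic(c: Char): Char =
    c match {
      case 'I' => '\u0131'
      case '\u0130' => 'i'
      case other => other
    }

  // Caseless comparison under whichever folding function is supplied.
  def matches(a: String, b: String, fold: Char => Char): Boolean =
    a.map(fold) == b.map(fold)
}
```

Under the Turkic function, `"ı"` matches `"I"` but `"i"` does not. Under the default function it is the other way around, so the answer really does depend on which convention is chosen.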
I did not realize that. 👀
@armanbilge is it clear to you if they are implying default full caseless matching or canonical full caseless matching? It is not to me.
@isomarcte canonical?
https://www.w3.org/TR/charmod-norm/#matching_CanonicalFoldNormalizationStep
How does "compatibility" fit into this?
https://www.w3.org/TR/charmod-norm/#matching_CompatibilityFoldNormalizationStep
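To show roughly how the canonical and compatibility steps differ, here is a Scala sketch built on `java.text.Normalizer`. Two assumptions apply: `approxFold` (upper-case then lower-case with `Locale.ROOT`) is only a stand-in for full case folding, and `compatibilityKey` is a simplification that swaps NFD for NFKD, whereas the Unicode definition of compatibility caseless matching applies an additional folding pass. `CaselessSketch` is a hypothetical name.

```scala
import java.text.Normalizer
import java.util.Locale

object CaselessSketch {
  // Assumption: approximates full case folding; not the PR's CaseFolding.
  private def approxFold(s: String): String =
    s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT)

  // Canonical: NFD, fold, NFD again (mirrors the shape of the diff above).
  def canonicalKey(s: String): String =
    Normalizer.normalize(
      approxFold(Normalizer.normalize(s, Normalizer.Form.NFD)),
      Normalizer.Form.NFD)

  // Simplified compatibility variant: the same shape with NFKD.
  def compatibilityKey(s: String): String =
    Normalizer.normalize(
      approxFold(Normalizer.normalize(s, Normalizer.Form.NFKD)),
      Normalizer.Form.NFKD)
}
```

With these keys, precomposed `Å` (U+00C5), the Angstrom sign (U+212B), and `a` followed by a combining ring (U+030A) all compare equal canonically. A fullwidth `Ａ` (U+FF21) compares equal to `a` only under the compatibility key.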
@armanbilge I don't think compatibility comes into play in our http4s context. (Thanks for finding that, btw!)
I'm becoming increasingly convinced we should change `CIString` to be full canonical caseless matching and update the docs. I suspect most of our current user base works with web-based technologies and probably should be using full canonical caseless matching anyway.
Nothing here contradicts that: https://mvnrepository.com/artifact/org.typelevel/case-insensitive/usages
It's probably fine 🙃
I would challenge the claim that this is a bug in http4s. From the next paragraph in the spec:
I can't think of anywhere that http4s is using it that isn't an ASCII context. The one place I'm less sure of is the URI, but even there, encodings are by convention, so I think the answer is ambiguous. So, while `CIString` is less general purpose than we would like, is there a smoking gun that its use in the http4s ecosystem is wrong?
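For comparison, an ASCII-only caseless match (the kind of context described above for header names) can be sketched as follows. This is an illustrative helper under that assumption, not code from http4s or from this PR. `AsciiCaseless` is a hypothetical name.

```scala
object AsciiCaseless {
  // Folds only A-Z to a-z; every other character is left untouched.
  private def foldAscii(c: Char): Char =
    if (c >= 'A' && c <= 'Z') (c + 32).toChar else c

  def matches(a: String, b: String): Boolean =
    a.length == b.length &&
      a.indices.forall(i => foldAscii(a.charAt(i)) == foldAscii(b.charAt(i)))
}
```

`AsciiCaseless.matches("Content-Type", "content-type")` holds, while the Kelvin sign (U+212A) does not match `"k"` here even though Unicode-aware lowercasing maps it to `k`. For pure ASCII inputs the two approaches agree, which is the crux of the question above.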