Implement (Almost) All The Unicode Caseless Matching Systems #232

Draft
wants to merge 11 commits into
base: series/2.x
5 changes: 4 additions & 1 deletion build.sbt
@@ -133,12 +133,15 @@ lazy val bench = project
   .enablePlugins(JmhPlugin)
   .settings(
     name := "case-insensitive-bench",
+    libraryDependencies ++= List(
+      "org.scalacheck" %% "scalacheck" % scalacheckV
+    ),
     console / initialCommands := {
       fullImports(List("cats", "cats.syntax.all", "org.typelevel.ci"), wildcardImport.value)
     },
     consoleQuick / initialCommands := ""
   )
-  .dependsOn(core.jvm)
+  .dependsOn(core.jvm, testing.jvm)

 lazy val docs = project
   .in(file("site"))
64 changes: 25 additions & 39 deletions core/src/main/scala/org/typelevel/ci/CIString.scala
@@ -22,54 +22,30 @@ import java.io.Serializable
 import org.typelevel.ci.compat._
 import scala.math.Ordered

-/** A case-insensitive String.
-  *
-  * Two CI strings are equal if and only if they are the same length, and each corresponding
-  * character is equal after calling either `toUpper` or `toLower`.
-  *
-  * Ordering is based on a string comparison after folding each character to uppercase and then back
-  * to lowercase.
-  *
-  * All comparisons are insensitive to locales.
-  *
-  * @param toString
-  *   The original value the CI String was constructed with.
-  */
-final class CIString private (override val toString: String)
+@deprecated(
+  message =
+    "Please use either CIStringCF, CIStringCS, or CIStringS instead. CIString/CIStringS implement Unicode default caseless matching on simple case folded strings. For most applications you probably want to use CIStringCF which implements Unicode canonical caseless matching on full case folded strings.",
+  since = "1.3.0")
+final class CIString private (override val toString: String, val asCIStringS: CIStringS)
Member

http4s-0.22 and 0.23 would have to pin, because this is a big part of its public API, and even if we nowarned all the usages, the deprecation would be viral into everyone's apps. I would really rather avoid a source-breaking change on this type if we can avoid it.

Member Author

Yeah, I'm struggling with what to do here. I'd like CIString to be the one most users will want out of the box, but we can't do that unless we break the semantics by changing it to be Unicode canonical caseless matching with full folded strings.

For most users, they probably won't notice, so maybe it's okay? That also would help with the concern about advertising which is the best type to use by default.

Member

@armanbilge armanbilge Feb 6, 2022

Well, are we breaking semantics? Or was our current implementation buggy, and we are fixing the semantics?

The CI microsite states that "locale independence" is a design goal. Is that what @isomarcte is achieving here?

Edit: hmm, I see that "locale-independence" is a technical term meaning that e.g. Locale.setDefault(Locale.forLanguageTag("tr")) should not change the result
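
For illustration only, here is a minimal JVM sketch of that definition (it is not part of this PR, and `LocaleSensitivityDemo` is a hypothetical name). It contrasts an operation whose result depends on the locale it is given with one that never consults a locale.

```scala
import java.util.Locale

// Hypothetical demo object; not part of case-insensitive.
object LocaleSensitivityDemo {
  private val turkish: Locale = Locale.forLanguageTag("tr")

  // Locale-sensitive: under Turkish rules "I" lowercases to dotless i (U+0131).
  val lowerTurkish: String = "I".toLowerCase(turkish)

  // Locale-independent: with the root locale the result is always "i".
  val lowerRoot: String = "I".toLowerCase(Locale.ROOT)

  // String.equalsIgnoreCase compares char by char and never consults a locale.
  val headerMatch: Boolean = "TITLE".equalsIgnoreCase("title")
}
```

The no-argument `toLowerCase()` uses the JVM default locale, which is exactly what `Locale.setDefault` changes, so a locale-independent implementation has to avoid it.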

Member Author

@armanbilge that's a tricky question. Yes and no. The Turkic special cases are obviously not locale independent. However, and forgive me for splitting hairs here, I'm not sure you can say that any Unicode caseless matching is locale independent. Whether or not "ı" == "I" in a caseless match is really only answerable in the context of some locale.

Inasmuch as one can have locale independence in this context, the non-Turkic variants achieve that.

I can drop the Turkic ones, but...I don't know...maybe some Turkic language users will want to use them?
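
A rough sketch of what the Turkic option changes, assuming only the dotted/dotless I rows of Unicode's `CaseFolding.txt` (status `T` and `F`). `TurkicFoldSketch` and its methods are hypothetical names, and the fallback branch is a crude stand-in for the rest of the folding table, not what this PR implements.

```scala
// Hypothetical sketch; the real mappings live in Unicode's CaseFolding.txt.
object TurkicFoldSketch {

  /** Folds one char. Only the dotted/dotless I rows are modelled faithfully. */
  def foldChar(c: Char, turkic: Boolean): String =
    c match {
      case 'I' if turkic      => "\u0131"  // T: U+0049 -> U+0131
      case '\u0130' if turkic => "i"       // T: U+0130 -> U+0069
      case '\u0130'           => "i\u0307" // F: U+0130 -> U+0069 U+0307
      case other              => other.toLower.toString // stand-in for the rest of the table
    }

  def fold(s: String, turkic: Boolean): String =
    s.iterator.map(c => foldChar(c, turkic)).mkString

  def matches(a: String, b: String, turkic: Boolean): Boolean =
    fold(a, turkic) == fold(b, turkic)
}
```

Under this sketch `"ı"` and `"I"` match only when `turkic = true`, while `"i"` and `"I"` match only when `turkic = false`, which is the locale dependence described above.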

Member Author

> Edit: hmm, I see that "locale-independence" is a technical term meaning that e.g. Locale.setDefault(Locale.forLanguageTag("tr")) should not change the result

I did not realize that. 👀

Member Author

@isomarcte isomarcte Feb 6, 2022

Ok. Idk if we can declare this a bug in case-insensitive specifically, but IIUC it is a bug in http4s.

> The [Unicode simple](https://www.w3.org/TR/charmod-norm/#dfn-unicode-simple) casefolding form is not appropriate for string identity matching on the Web.

https://www.w3.org/TR/charmod-norm/#sec_unicode_cs

@armanbilge is it clear to you if they are implying default full caseless matching or canonical full caseless matching? It is not to me.
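
To illustrate the simple versus full distinction, here is a small JVM sketch. `SimpleVsFullSketch` is a hypothetical name, and `approxFullFold` is an assumption: it leans on `String.toUpperCase` applying the special casing ß to "SS" and is only an approximation of Unicode full case folding, not the folding tables this PR adds.

```scala
import java.util.Locale

// Hypothetical demo object; not part of case-insensitive.
object SimpleVsFullSketch {
  val lower: String = "stra\u00DFe" // "straße"
  val upper: String = "STRASSE"

  // Per-character comparison, as the existing CIString.equals does.
  // It cannot match because the two strings have different lengths.
  val perCharMatch: Boolean = lower.equalsIgnoreCase(upper)

  // Crude stand-in for full case folding: one char may expand to several.
  private def approxFullFold(s: String): String =
    s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT)

  val fullMatch: Boolean = approxFullFold(lower) == approxFullFold(upper)
}
```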

Member

@isomarcte canonical?

> Case sensitivity is not recommended for most specifications but, in the case of an exception where the vocabulary allows non-ASCII characters and which does not want to be sensitive to case distinctions, the 'Unicode canonical case fold' approach SHOULD be used.

https://www.w3.org/TR/charmod-norm/#matching_CanonicalFoldNormalizationStep

How does "compatibility" fit into this?

> A 'Unicode compatibility case fold' approach should not be used.

https://www.w3.org/TR/charmod-norm/#matching_CompatibilityFoldNormalizationStep
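
As a sketch of what the canonical variant adds, the Unicode Standard defines a canonical caseless match as comparing `NFD(toCasefold(NFD(X)))` for both strings. The snippet below follows that shape with `java.text.Normalizer`; `CanonicalCaselessSketch` is a hypothetical name, and `approxFold` is an assumed stand-in for real case folding (upper then lower under the root locale).

```scala
import java.text.Normalizer
import java.util.Locale

// Hypothetical demo object; not part of case-insensitive.
object CanonicalCaselessSketch {
  private def nfd(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFD)

  // Assumption: approximates case folding; a real implementation uses CaseFolding.txt.
  private def approxFold(s: String): String =
    s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT)

  /** Comparison key: decompose, fold, then decompose again. */
  def key(s: String): String = nfd(approxFold(nfd(s)))

  def matches(a: String, b: String): Boolean = key(a) == key(b)
}
```

With this, a precomposed `"\u00C9cole"` matches `"e\u0301COLE"` (e followed by a combining acute accent), whereas `equalsIgnoreCase` on the same pair does not because the strings differ in length.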

Member Author

@armanbilge I don't think compatibility comes into play in our http4s context. (thanks for finding that btw!)

I'm becoming increasingly convinced we should change CIString to be full canonical caseless matching and update the docs. I suspect most of our current user base is working around web based technologies and probably should be using full canonical caseless matching anyway.

Member

Nothing here contradicts that: https://mvnrepository.com/artifact/org.typelevel/case-insensitive/usages

It's probably fine 🙃

Member

I would challenge that this is a bug in http4s. From the next paragraph in the spec:

> Specifications that define case-insensitive matching in vocabularies limited to the Basic Latin (ASCII) subset of Unicode MAY specify ASCII case-insensitive matching.

I can't think of anywhere that http4s is using it that isn't an ASCII context. The one place I'm less sure of is the URI, but even there, encodings are by convention, so I think the answer is ambiguous. So, while CIString is less general purpose than we would like, is there a smoking gun that its use in the http4s ecosystem is wrong?
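
For reference, ASCII case-insensitive matching as quoted above can be sketched as follows. `AsciiCaseInsensitive` and its methods are hypothetical names and are not part of case-insensitive or http4s; only `A` to `Z` are folded, and every other character must match exactly.

```scala
// Hypothetical helper; not part of case-insensitive or http4s.
object AsciiCaseInsensitive {
  // Lowercases A-Z only; all other chars are returned unchanged.
  private def asciiLower(c: Char): Char =
    if (c >= 'A' && c <= 'Z') (c + 32).toChar else c

  def equalsIgnoreAsciiCase(a: String, b: String): Boolean =
    a.length == b.length &&
      a.indices.forall(i => asciiLower(a.charAt(i)) == asciiLower(b.charAt(i)))
}
```

Under this definition `"Content-Type"` matches `"content-type"`, while `"K"` does not match the Kelvin sign U+212A, since non-ASCII characters are never folded.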

     extends Ordered[CIString]
     with Serializable {

+  @deprecated(message = "Please provide a CaseFoldedString directly.", since = "1.3.0")
+  private def this(toString: String) =
+    this(toString, CIStringS(toString))
+
   override def equals(that: Any): Boolean =
     that match {
       case that: CIString =>
-        this.toString.equalsIgnoreCase(that.toString)
+        this.asCIStringS == that.asCIStringS
       case _ => false
     }

-  @transient private[this] var hash = 0
-  override def hashCode(): Int = {
-    if (hash == 0)
-      hash = calculateHash
-    hash
-  }
-
-  private[this] def calculateHash: Int = {
-    var h = 17
-    var i = 0
-    val len = toString.length
-    while (i < len) {
-      // Strings are equal ignoring case if either their uppercase or lowercase
-      // forms are equal. Equality of one does not imply the other, so we need
-      // to go in both directions. A character is not guaranteed to make this
-      // round trip, but it doesn't matter as long as all equal characters
-      // hash the same.
-      h = h * 31 + toString.charAt(i).toUpper.toLower
-      i += 1
-    }
-    h
-  }
+  override def hashCode(): Int =
+    this.asCIStringS.hashCode

   override def compare(that: CIString): Int =
-    this.toString.compareToIgnoreCase(that.toString)
+    Order[CIStringS].compare(asCIStringS, that.asCIStringS)

   def transform(f: String => String): CIString = CIString(f(toString))

@@ -87,8 +63,18 @@ final class CIString private (override val toString: String)

 @suppressUnusedImportWarningForCompat
 object CIString {
-  def apply(value: String): CIString = new CIString(value)
+
+  @deprecated(
+    message =
+      "Please use either CIStringCF, CIStringCS, or CIStringS instead. CIString/CIStringS implement Unicode default caseless matching on simple case folded strings. For most applications you probably want to use CIStringCF which implements Unicode canonical caseless matching on full case folded strings.",
+    since = "1.3.0")
+  def apply(value: String): CIString =
+    new CIString(value, CIStringS(value))

+  @deprecated(
+    message =
+      "Please use either CIStringCF, CIStringCS, or CIStringS instead. CIString/CIStringS implement Unicode default caseless matching on simple case folded strings. For most applications you probably want to use CIStringCF which implements Unicode canonical caseless matching on full case folded strings.",
+    since = "1.3.0")
   val empty = CIString("")

   implicit val catsInstancesForOrgTypelevelCIString: Order[CIString]
41 changes: 41 additions & 0 deletions core/src/main/scala/org/typelevel/ci/CIStringCF.scala
@@ -0,0 +1,41 @@
/*
* Copyright 2020 Typelevel
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.typelevel.ci

final class CIStringCF private (
override val toString: String,
val asCanonicalFullCaseFoldedString: CanonicalFullCaseFoldedString)
extends Serializable {
override def equals(that: Any): Boolean =
that match {
case that: CIStringCF =>
asCanonicalFullCaseFoldedString == that.asCanonicalFullCaseFoldedString
case _ =>
false
}

override def hashCode(): Int =
asCanonicalFullCaseFoldedString.hashCode
}

object CIStringCF {
def apply(value: String): CIStringCF =
new CIStringCF(
value,
CanonicalFullCaseFoldedString(value)
)
}
87 changes: 87 additions & 0 deletions core/src/main/scala/org/typelevel/ci/CIStringCS.scala
@@ -0,0 +1,87 @@
/*
* Copyright 2020 Typelevel
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.typelevel.ci

import cats._
import cats.kernel._
import cats.syntax.all._

final class CIStringCS private (
override val toString: String,
val asCanonicalSimpleCaseFoldedString: CanonicalSimpleCaseFoldedString)
extends Serializable {

override def equals(that: Any): Boolean =
that match {
case that: CIStringCS =>
asCanonicalSimpleCaseFoldedString == that.asCanonicalSimpleCaseFoldedString
case _ =>
false
}

override def hashCode(): Int =
asCanonicalSimpleCaseFoldedString.hashCode
}

object CIStringCS {

def apply(value: String): CIStringCS =
new CIStringCS(
value,
CanonicalSimpleCaseFoldedString(value)
)

val empty: CIStringCS = apply("")

implicit val hashAndOrderForCIStringCS: Hash[CIStringCS] with Order[CIStringCS] =
new Hash[CIStringCS] with Order[CIStringCS] {
override def hash(x: CIStringCS): Int =
x.hashCode

override def compare(x: CIStringCS, y: CIStringCS): Int =
x.asCanonicalSimpleCaseFoldedString.compare(y.asCanonicalSimpleCaseFoldedString)
}

implicit val orderingForCIStringCS: Ordering[CIStringCS] =
hashAndOrderForCIStringCS.toOrdering

implicit val showForCIStringCS: Show[CIStringCS] =
Show.fromToString

implicit val lowerBoundForCIStringCS: LowerBounded[CIStringCS] =
new LowerBounded[CIStringCS] {
override val partialOrder: PartialOrder[CIStringCS] =
hashAndOrderForCIStringCS

override val minBound: CIStringCS =
empty
}

implicit val monoidForCIStringCS: Monoid[CIStringCS] =
new Monoid[CIStringCS] {
override val empty: CIStringCS = CIStringCS.empty

override def combine(x: CIStringCS, y: CIStringCS): CIStringCS =
CIStringCS(x.toString + y.toString)

override def combineAll(xs: IterableOnce[CIStringCS]): CIStringCS = {
val sb: StringBuilder = new StringBuilder
xs.iterator.foreach(cfs => sb.append(cfs.toString))
CIStringCS(sb.toString)
}
}
}
87 changes: 87 additions & 0 deletions core/src/main/scala/org/typelevel/ci/CIStringS.scala
@@ -0,0 +1,87 @@
/*
* Copyright 2020 Typelevel
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.typelevel.ci

import cats._
import cats.kernel._
import cats.syntax.all._

final class CIStringS private (
override val toString: String,
val asSimpleCaseFoldedString: SimpleCaseFoldedString)
extends Serializable {

override def equals(that: Any): Boolean =
that match {
case that: CIStringS =>
asSimpleCaseFoldedString == that.asSimpleCaseFoldedString
case _ =>
false
}

override def hashCode(): Int =
asSimpleCaseFoldedString.hashCode
}

object CIStringS {

def apply(value: String): CIStringS =
new CIStringS(
value,
SimpleCaseFoldedString(value)
)

val empty: CIStringS = apply("")

implicit val hashAndOrderForCIStringS: Hash[CIStringS] with Order[CIStringS] =
new Hash[CIStringS] with Order[CIStringS] {
override def hash(x: CIStringS): Int =
x.hashCode

override def compare(x: CIStringS, y: CIStringS): Int =
x.asSimpleCaseFoldedString.compare(y.asSimpleCaseFoldedString)
}

implicit val orderingForCIStringS: Ordering[CIStringS] =
hashAndOrderForCIStringS.toOrdering

implicit val showForCIStringS: Show[CIStringS] =
Show.fromToString

implicit val lowerBoundForCIStringS: LowerBounded[CIStringS] =
new LowerBounded[CIStringS] {
override val partialOrder: PartialOrder[CIStringS] =
hashAndOrderForCIStringS

override val minBound: CIStringS =
empty
}

implicit val monoidForCIStringS: Monoid[CIStringS] =
new Monoid[CIStringS] {
override val empty: CIStringS = CIStringS.empty

override def combine(x: CIStringS, y: CIStringS): CIStringS =
CIStringS(x.toString + y.toString)

override def combineAll(xs: IterableOnce[CIStringS]): CIStringS = {
val sb: StringBuilder = new StringBuilder
xs.iterator.foreach(cfs => sb.append(cfs.toString))
CIStringS(sb.toString)
}
}
}
@@ -0,0 +1,40 @@
/*
* Copyright 2020 Typelevel
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.typelevel.ci

import java.text.Normalizer
import scala.annotation.tailrec

final case class CanonicalFullCaseFoldedString private (override val toString: String)
extends AnyVal

object CanonicalFullCaseFoldedString {
def apply(value: String): CanonicalFullCaseFoldedString =
new CanonicalFullCaseFoldedString(
Normalizer.normalize(
if (Normalizer.isNormalized(value, Normalizer.Form.NFD)) {
CaseFolding.fullCaseFoldString(value)
} else {
CaseFolding.fullCaseFoldString(Normalizer.normalize(value, Normalizer.Form.NFD))
},
Normalizer.Form.NFD
)
)

val empty: CanonicalFullCaseFoldedString =
apply("")
}