clarify Unicode handling #423

hudlow · 2024-12-14T02:07:30Z

fixes #420

hudlow · 2024-12-14T02:08:32Z

doc/langdef.md

-    [BMP](https://en.wikipedia.org/wiki/Plane_\(Unicode\)#Basic_Multilingual_Plane).
-    Characters in other Unicode planes can be represented with surrogate pairs.
-    Valid only for string literals.
+    in the [BMP](https://www.unicode.org/roadmaps/bmp/).


It feels a little inappropriate to link to a tertiary source in a specification.

hudlow · 2024-12-14T02:10:44Z

doc/langdef.md

-    Characters in other Unicode planes can be represented with surrogate pairs.
-    Valid only for string literals.


I can't find any evidence of support for surrogate pairs in cel-go or cel-java (I can't get cel-cpp to build). And really, why would they be supported, since:

- they are not a UTF-8 construct
- they cannot express anything that isn't more explicitly expressed via a \U-prefixed 32-bit code point
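
To make the two points above concrete, here is a minimal Python sketch (not CEL, and not taken from any CEL implementation). `combine_surrogates` and `utf8_or_none` are hypothetical helper names introduced only for illustration. It shows that a UTF-16 surrogate pair is just arithmetic over one supplementary code point, and that surrogate code points themselves have no UTF-8 encoding.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf8_or_none(text: str):
    """Return the UTF-8 encoding of text, or None if it cannot be encoded."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        # Python's strict UTF-8 encoder rejects surrogate code points.
        return None


# The pair D83D/DC36 names the same character as the escape \U0001F436.
dog = chr(combine_surrogates(0xD83D, 0xDC36))
```

Under these assumptions, `utf8_or_none("\ud83d\udc36")` returns `None`, while `utf8_or_none(dog)` returns the four-byte sequence `b"\xf0\x9f\x90\xb6"`.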

hudlow · 2024-12-14T02:12:25Z

doc/langdef.md

@@ -549,7 +556,7 @@ The "interoperable" range of integer values is `-(2^53-1)` to `2^53 - 1`.

 CEL is a dynamically-typed language, meaning that the types of the values of the
 variables and expressions might not be known until runtime. However, CEL has an
-optional type-checking phase that takes annotation giving the types of all
+optional type-checking phase that takes the types declared for all functions and


Honestly not sure how intentional the word "annotation" is here and whether the word "declared" will pass muster, but I was hoping to build on these ideas and "annotation" did not seem quite right to me.

hudlow · 2024-12-14T02:13:01Z

doc/langdef.md

+Where feasible, CEL implementations ensure that a value bound to a variable name
+or returned by a custom function conforms to the CEL type declared for that
+value or, for dynamic typed values, _a_ CEL type. Where implementations allow
+nonconforming values, (e.g. strings with invalid Unicode code points) to be
+provided to a CEL program, conformance must be enforced by the application
+embedding the CEL program in order to ensure type safety is maintained.


My best effort to address the dangers head on.
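
As a rough illustration of what "conformance must be enforced by the application" could look like, here is a Python sketch of a guard an embedding application might run before binding a host string to a CEL variable. `require_valid_string` is a hypothetical name, not part of any CEL library, and the check covers only lone surrogates as one example of a nonconforming value.

```python
def require_valid_string(name: str, value: str) -> str:
    """Reject host strings containing surrogate code points before binding.

    Hypothetical guard: a real embedding would call something like this
    before handing `value` to the CEL runtime under variable `name`.
    """
    for index, ch in enumerate(value):
        if 0xD800 <= ord(ch) <= 0xDFFF:
            raise ValueError(
                f"variable {name!r}: surrogate U+{ord(ch):04X} at index {index}"
            )
    return value
```

The design choice here is to fail at the boundary, so that the CEL program itself never has to reason about ill-formed strings.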

hudlow · 2024-12-14T02:14:30Z

doc/langdef.md

+Like standard functions, extension functions must be free from observable side
+effects in order to prevent expressions from having undefined results, since CEL
+does not guarantee evaluation order of sub-expressions.


Clarifying this here so we can remove qualifying language later on. (Though there are already other places that do not have qualifying language and simply assert that CEL expressions are free of side effects.)

hudlow · 2024-12-14T02:14:51Z

doc/langdef.md

+It is possible to add extension functions to CEL, which then behave consistently
+with standard functions. The mechanism for doing this is implementation
+dependent and usually highly curated. For example, an application domain of CEL
+can add a new overload to the `size` function above, provided this overload's
+argument types do not overlap with any existing overload. For methodological
+reasons, CEL does not allow overloading operators.


Maybe I got carried away here, but this paragraph had some weird phrasing.

hudlow · 2024-12-14T02:15:38Z

doc/langdef.md

+Strings and bytes obey lexicographic ordering of the byte values. Because
+strings are encoded in UTF-8, strings consequently also obey lexicographic
+ordering of their Unicode code points.


It seemed to me better not to imply that strings and bytes have different lexicographic ordering behavior.
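
A quick Python check of the claim in the added text, offered as a sketch rather than as CEL behavior: sorting by UTF-8 bytes and sorting by code points give the same order, whereas sorting by UTF-16 code units does not. The sample strings are arbitrary.

```python
words = ["z", "\u00e9", "\uff5e", "\U0001F436", "a", "ab"]

# Order by sequences of Unicode code points.
by_code_point = sorted(words, key=lambda s: [ord(c) for c in s])

# Order by the raw bytes of the UTF-8 encoding.
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))

# Order by big-endian UTF-16 code units, for contrast only.
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))
```

In UTF-16 the supplementary character U+1F436 starts with the code unit D83D, which sorts before U+FF5E even though its code point is larger, so `by_utf16` differs from the other two orderings.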

hudlow · 2024-12-14T02:16:22Z

doc/langdef.md

@@ -1822,6 +1842,8 @@ codepoints
 ```
 "hello".size() // 5
 size("world!") // 6
+"fiance\u0301".size() // 7
+size(string(b'\xF0\x9F\xA4\xAA')) // 1


This seems really important!
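
For readers who want to verify the two new examples outside of CEL, here is a small Python sketch. Python's `len` on `str` also counts code points, so it can stand in for the behavior the diff documents. `code_point_count` is a hypothetical name used only here.

```python
def code_point_count(value: str) -> int:
    """Count Unicode code points, mirroring the documented size() examples."""
    return len(value)


# "fiance" plus a combining acute accent: 7 code points, 8 UTF-8 bytes.
decomposed = "fiance\u0301"

# Four UTF-8 bytes that decode to the single code point U+1F92A.
emoji = b"\xF0\x9F\xA4\xAA".decode("utf-8")
```

The first case shows that size counts code points rather than user-perceived characters. The second shows that it counts code points rather than bytes.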

hudlow · 2024-12-14T02:17:02Z

doc/langdef.md

+*   `string(bytes) -> string` converts a byte sequence to a UTF-8 string, errors
+    for invalid code points


This is the behavior I've observed in cel-go and cel-java and it certainly seems correct to me.
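
The behavior described can be approximated in Python with a strict UTF-8 decode. This is a sketch under the assumption that "invalid" means "not well-formed UTF-8". `bytes_to_string` is a hypothetical helper, not the cel-go or cel-java implementation.

```python
def bytes_to_string(data: bytes) -> str:
    """Decode bytes as UTF-8, raising ValueError on ill-formed input."""
    try:
        return data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8: {exc.reason}") from exc


def converts(data: bytes) -> bool:
    """Report whether bytes_to_string accepts the input."""
    try:
        bytes_to_string(data)
        return True
    except ValueError:
        return False
```

With Python's decoder, a stray `0xFF` byte, the overlong form `C0 80`, and the encoded surrogate `ED A0 80` are all rejected, while `F0 9F A4 AA` is accepted.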

hudlow · 2024-12-14T02:47:24Z

tests/simple/testdata/comparisons.textproto

-  test {
-    name: "no_string_normalization_surrogate"
-    description: "Should not replace surrogate pairs."
-    expr: "'\\U0001F436' == '\\xef\\xbf\\xbd\\xef\\xbf\\bd'"
-    value: { bool_value: false }
-  }


So many things wrong with this test:

- no surrogate pairs in sight
- the byte sequence on the right is two replacement characters, perhaps the result of a failed attempt to convert a surrogate pair to UTF-8
- which makes no sense, because UTF-8 has no surrogate pairs
- meaning the test itself also makes no sense because there's no way a surrogate pair could be preserved in a CEL string
- also, the byte sequence on the right isn't a byte sequence since string literals interpret byte escapes as code points

It's possible I got some of that wrong, but I am pretty sure it's safe to remove this test.
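
The last point, that byte-style escapes in a string literal denote code points, can be illustrated by analogy in Python, whose `str` literals treat `\x` escapes the same way. This is only an analogy under that assumption and does not run any CEL code.

```python
# In a text string, each \x escape is one code point: U+00EF, U+00BF, U+00BD.
as_code_points = "\xef\xbf\xbd"

# In a bytes literal, the same escapes are raw bytes: the UTF-8 encoding
# of the single replacement character U+FFFD.
as_bytes = b"\xef\xbf\xbd"
replacement = as_bytes.decode("utf-8")

# The left-hand side of the removed test is a single supplementary code point.
dog_face = "\U0001F436"
```

Either reading of the right-hand side, three Latin-1 range code points or one replacement character, compares unequal to `dog_face`, so the removed test would pass without exercising surrogate pairs at all.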

jnthntatum · 2025-01-02T18:28:34Z

/gcbrun

clarify Unicode handling

4dbc3c9

remove surrogate pair conformance test

eb3d89d
