-
Notifications
You must be signed in to change notification settings - Fork 231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clarify Unicode handling #423
base: master
Are you sure you want to change the base?
Conversation
[BMP](https://en.wikipedia.org/wiki/Plane_\(Unicode\)#Basic_Multilingual_Plane). | ||
Characters in other Unicode planes can be represented with surrogate pairs. | ||
Valid only for string literals. | ||
in the [BMP](https://www.unicode.org/roadmaps/bmp/). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels a little inappropriate to link to a tertiary source in a specification.
Characters in other Unicode planes can be represented with surrogate pairs. | ||
Valid only for string literals. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find any evidence of support for surrogate pairs in the cel-go
or cel-java
(I can't get cel-cpp
to build). And really, why would they be supported since:
- they are not a UTF-8 construct
- they cannot express anything that isn't more explicitly expressed via a
\U
-prefixed 32-bit code point
@@ -549,7 +556,7 @@ The "interoperable" range of integer values is `-(2^53-1)` to `2^53 - 1`. | |||
|
|||
CEL is a dynamically-typed language, meaning that the types of the values of the | |||
variables and expressions might not be known until runtime. However, CEL has an | |||
optional type-checking phase that takes annotation giving the types of all | |||
optional type-checking phase that takes the types declared for all functions and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly not sure how intentional the word "annotation" is here and whether the word "declared" will pass muster, but I was hoping to build on these ideas and "annotation" did not seem quite right to me.
Where feasible, CEL implementations ensure that a value bound to a variable name | ||
or returned by a custom function conforms to the CEL type declared for that | ||
value or, for dynamic typed values, _a_ CEL type. Where implementations allow | ||
nonconforming values, (e.g. strings with invalid Unicode code points) to be | ||
provided to a CEL program, conformance must be enforced by the application | ||
embedding the CEL program in order to ensure type safety is maintained. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My best effort to address the dangers head on.
Like standard functions, extension functions must be free from observable side | ||
effects in order to prevent expressions from having undefined results, since CEL | ||
does not guarantee evaluation order of sub-expressions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarifying this here so we can remove qualifying language later on. (Though there are already other places that do not have qualifying language and simply assert that CEL expressions are free of side effects.)
It is possible to add extension functions to CEL, which then behave consistently | ||
with standard functions. The mechanism for doing this is implementation | ||
dependent and usually highly curated. For example, an application domain of CEL | ||
can add a new overload to the `size` function above, provided this overload's | ||
argument types do not overlap with any existing overload. For methodological | ||
reasons, CEL does not allow overloading operators. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I got carried away here, but this paragraph had some weird phrasing.
Strings and bytes obey lexicographic ordering of the byte values. Because | ||
strings are encoded in UTF-8, strings consequently also obey lexicographic | ||
ordering of their Unicode code points. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seemed to me better not to imply that strings and bytes have different lexicographic ordering behavior.
@@ -1822,6 +1842,8 @@ codepoints | |||
``` | |||
"hello".size() // 5 | |||
size("world!") // 6 | |||
"fiance\u0301".size() // 7 | |||
size(string(b'\xF0\x9F\xA4\xAA')) // 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems really important!
* `string(bytes) -> string` converts a byte sequence to a UTF-8 string, errors | ||
for invalid code points |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the behavior I've observed in cel-go
and cel-java
and it certainly seems correct to me.
test { | ||
name: "no_string_normalization_surrogate" | ||
description: "Should not replace surrogate pairs." | ||
expr: "'\\U0001F436' == '\\xef\\xbf\\xbd\\xef\\xbf\\bd'" | ||
value: { bool_value: false } | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So many things wrong with this test:
- no surrogate pairs in sight
- the byte sequence on the right is two replacement characters, perhaps the result of a failed attempt to convert a surrogate pair to UTF-8
- which makes no sense, because UTF-8 has no surrogate pairs
- meaning the test itself also makes no sense because there's no way a surrogate pair could be preserved in a CEL string
- also, the byte sequence on the right isn't a byte sequence since string literals interpret byte escapes as code points
It's possible I got some of that wrong, but I am pretty sure it's safe to remove this test.
/gcbrun |
fixes #420