Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clarify Unicode handling #423

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

hudlow
Copy link
Contributor

@hudlow hudlow commented Dec 14, 2024

fixes #420

[BMP](https://en.wikipedia.org/wiki/Plane_\(Unicode\)#Basic_Multilingual_Plane).
Characters in other Unicode planes can be represented with surrogate pairs.
Valid only for string literals.
in the [BMP](https://www.unicode.org/roadmaps/bmp/).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a little inappropriate to link to a tertiary source in a specification.

Comment on lines -313 to -314
Characters in other Unicode planes can be represented with surrogate pairs.
Valid only for string literals.
Copy link
Contributor Author

@hudlow hudlow Dec 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any evidence of support for surrogate pairs in the cel-go or cel-java (I can't get cel-cpp to build). And really, why would they be supported since:

  • they are not a UTF-8 construct
  • they cannot express anything that isn't more explicitly expressed via a \U-prefixed 32-bit code point

@@ -549,7 +556,7 @@ The "interoperable" range of integer values is `-(2^53-1)` to `2^53 - 1`.

CEL is a dynamically-typed language, meaning that the types of the values of the
variables and expressions might not be known until runtime. However, CEL has an
optional type-checking phase that takes annotation giving the types of all
optional type-checking phase that takes the types declared for all functions and
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly not sure how intentional the word "annotation" is here and whether the word "declared" will pass muster, but I was hoping to build on these ideas and "annotation" did not seem quite right to me.

Comment on lines +646 to +651
Where feasible, CEL implementations ensure that a value bound to a variable name
or returned by a custom function conforms to the CEL type declared for that
value or, for dynamic typed values, _a_ CEL type. Where implementations allow
nonconforming values, (e.g. strings with invalid Unicode code points) to be
provided to a CEL program, conformance must be enforced by the application
embedding the CEL program in order to ensure type safety is maintained.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My best effort to address the dangers head on.

Comment on lines +1021 to +1023
Like standard functions, extension functions must be free from observable side
effects in order to prevent expressions from having undefined results, since CEL
does not guarantee evaluation order of sub-expressions.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarifying this here so we can remove qualifying language later on. (Though there are already other places that do not have qualifying language and simply assert that CEL expressions are free of side effects.)

Comment on lines +1014 to +1019
It is possible to add extension functions to CEL, which then behave consistently
with standard functions. The mechanism for doing this is implementation
dependent and usually highly curated. For example, an application domain of CEL
can add a new overload to the `size` function above, provided this overload's
argument types do not overlap with any existing overload. For methodological
reasons, CEL does not allow overloading operators.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I got carried away here, but this paragraph had some weird phrasing.

Comment on lines +1182 to +1184
Strings and bytes obey lexicographic ordering of the byte values. Because
strings are encoded in UTF-8, strings consequently also obey lexicographic
ordering of their Unicode code points.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed to me better not to imply that strings and bytes have different lexicographic ordering behavior.

@@ -1822,6 +1842,8 @@ codepoints
```
"hello".size() // 5
size("world!") // 6
"fiance\u0301".size() // 7
size(string(b'\xF0\x9F\xA4\xAA')) // 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems really important!

Comment on lines +2126 to +2127
* `string(bytes) -> string` converts a byte sequence to a UTF-8 string, errors
for invalid code points
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the behavior I've observed in cel-go and cel-java and it certainly seems correct to me.

Comment on lines -150 to -155
test {
name: "no_string_normalization_surrogate"
description: "Should not replace surrogate pairs."
expr: "'\\U0001F436' == '\\xef\\xbf\\xbd\\xef\\xbf\\bd'"
value: { bool_value: false }
}
Copy link
Contributor Author

@hudlow hudlow Dec 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So many things wrong with this test:

  • no surrogate pairs in sight
  • the byte sequence on the right is two replacement characters, perhaps the result of a failed attempt to convert a surrogate pair to UTF-8
  • which makes no sense, because UTF-8 has no surrogate pairs
  • meaning the test itself also makes no sense because there's no way a surrogate pair could be preserved in a CEL string
  • also, the byte sequence on the right isn't a byte sequence since string literals interpret byte escapes as code points

It's possible I got some of that wrong, but I am pretty sure it's safe to remove this test.

@jnthntatum
Copy link
Collaborator

/gcbrun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Specify that malformed Unicode code points result in an error
2 participants