clarify Unicode handling #423

Open
wants to merge 2 commits into
base: master
77 changes: 51 additions & 26 deletions doc/langdef.md
@@ -265,8 +265,8 @@ properties are different.

### String and Bytes Values

Strings are sequences of Unicode code points. Bytes are sequences of octets
(eight-bit data).
Strings are valid sequences of Unicode code points. Bytes are arbitrary
sequences of octets (eight-bit data).

Quoted string literals are delimited by either single- or double-quote
characters, where the closing delimiter must match the opening one, and can
@@ -308,16 +308,15 @@ Escape sequences are a backslash (`\ `) followed by one of the following:
* `t`: horizontal tab
* `v`: vertical tab
* A `u` followed by four hexadecimal characters, encoding a Unicode code point
in the
[BMP](https://en.wikipedia.org/wiki/Plane_\(Unicode\)#Basic_Multilingual_Plane).
Characters in other Unicode planes can be represented with surrogate pairs.
Valid only for string literals.
Comment on lines -313 to -314
@hudlow (Contributor, Author), Dec 14, 2024:

I can't find any evidence of support for surrogate pairs in cel-go or cel-java (I can't get cel-cpp to build). And really, why would they be supported, since:

  • they are not a UTF-8 construct
  • they cannot express anything that isn't more explicitly expressed via a \U-prefixed 32-bit code point

in the [BMP](https://www.unicode.org/roadmaps/bmp/).

It feels a little inappropriate to link to a tertiary source in a specification.

* A `U` followed by eight hexadecimal characters, encoding a Unicode code
point. Valid only for string literals.
point (in any plane). Valid only for string literals.
* A `x` or `X` followed by two hexadecimal characters. For strings, it denotes
the unicode code point. For bytes, it represents an octet value.
* Three octal digits, in the range `000` to `377`. For strings, it denotes the
unicode code point. For bytes, it represents an octet value.
a Unicode code point. For bytes, it represents an octet value.
* Three octal digits, in the range `000` to `377`. For strings, it denotes a
Unicode code point. For bytes, it represents an octet value.

All hexadecimal digits in escape sequences are case-insensitive.

Examples:

@@ -336,7 +335,15 @@ CEL Literal | Meaning
`"\377"` | String of "ÿ" (code point 255)
`b"\377"` | Sequence of byte 255 (*not* UTF-8 of ÿ)
`"\xFF"` | String of "ÿ" (code point 255)
`b"\xFF"` | Sequence of byte 255 (*not* UTF-8 of ÿ)
`b"\xff"` | Sequence of byte 255 (*not* UTF-8 of ÿ)
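
The difference between the string and bytes rows above can be sketched in Python, which here only stands in for a CEL implementation. The helper names `cel_string_escape` and `cel_bytes_escape` are hypothetical and exist purely for illustration: the same three octal digits yield a code point in a string literal but a raw octet in a bytes literal.

```python
# Illustration only: Python stands in for a CEL implementation.
# Both helper names are hypothetical, not part of any CEL library.

def cel_string_escape(octal_digits: str) -> str:
    """Interpret three octal digits as a Unicode code point (string literal)."""
    return chr(int(octal_digits, 8))

def cel_bytes_escape(octal_digits: str) -> bytes:
    """Interpret three octal digits as a single octet (bytes literal)."""
    return bytes([int(octal_digits, 8)])

as_string = cel_string_escape("377")  # one code point, U+00FF
as_bytes = cel_bytes_escape("377")    # one octet, 0xFF

print(as_string.encode("utf-8"))  # b'\xc3\xbf' (two octets in UTF-8)
print(as_bytes)                   # b'\xff' (one octet)
```

The UTF-8 encoding of the string takes two octets, while the bytes value is a single octet, which is why the table notes that `b"\377"` is *not* the UTF-8 encoding of ÿ.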

The following constructions are syntactically invalid and will result in a parse
error:

* A backslash (`` \ ``) outside of a valid escape sequence, e.g. `\s`.
* An invalid Unicode code point, e.g. `\u2FE0`.
* A UTF-16 surrogate code point, even if in a valid UTF-16 surrogate pair,
e.g. `\uD83D\uDE03` or `\UD83DDE03`.
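
As a rough sketch of the surrogate rule in the list above, the following Python checks whether the hex digits of a `\u` or `\U` escape name a Unicode scalar value, i.e. a code point in range that is not a UTF-16 surrogate. The function names are hypothetical, and the sketch deliberately does not model the separate "invalid code point" rule (the `\u2FE0` example), which would need Unicode character data.

```python
import string

def is_unicode_scalar_value(code_point: int) -> bool:
    """True if the code point is in range and is not a UTF-16 surrogate."""
    in_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_range and not is_surrogate

def check_hex_escape(hex_digits: str) -> bool:
    """Check the digits that follow \\u (four digits) or \\U (eight digits)."""
    if len(hex_digits) not in (4, 8):
        return False
    if not all(c in string.hexdigits for c in hex_digits):
        return False
    return is_unicode_scalar_value(int(hex_digits, 16))

print(check_hex_escape("D83D"))      # False: high surrogate
print(check_hex_escape("DE03"))      # False: low surrogate
print(check_hex_escape("D83DDE03"))  # False: beyond U+10FFFF
print(check_hex_escape("0001F603"))  # True: a supplementary-plane code point
```

Under these assumptions, a supplementary-plane character is written with a single eight-digit `\U` escape rather than with a pair of `\u` escapes.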

While strings must be sequences of valid Unicode code points, no Unicode
normalization is attempted on strings, as there are several normal forms, they
@@ -549,7 +556,7 @@ The "interoperable" range of integer values is `-(2^53-1)` to `2^53 - 1`.

CEL is a dynamically-typed language, meaning that the types of the values of the
variables and expressions might not be known until runtime. However, CEL has an
optional type-checking phase that takes annotation giving the types of all
optional type-checking phase that takes the types declared for all functions and

Honestly not sure how intentional the word "annotation" is here and whether the word "declared" will pass muster, but I was hoping to build on these ideas and "annotation" did not seem quite right to me.

variables and tries to deduce the type of the expression and of all its
sub-expressions. This is not always possible, due to the dynamic expansion of
certain messages like `Struct`, `Value`, and `Any` (see
@@ -636,6 +643,13 @@ The CEL implementation provides mechanisms for adding bindings of variable names
to either values or errors. The implementation will also provide function
bindings for at least all the standard functions listed below.

Where feasible, CEL implementations ensure that a value bound to a variable name
or returned by a custom function conforms to the CEL type declared for that
value or, for dynamically typed values, _a_ CEL type. Where implementations allow
nonconforming values (e.g. strings with invalid Unicode code points) to be
provided to a CEL program, conformance must be enforced by the application
embedding the CEL program in order to ensure type safety is maintained.
Comment on lines +646 to +651

My best effort to address the dangers head on.
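
One way an embedding application might enforce the conformance described in the added paragraph is to validate string values before binding them. The sketch below is an assumption-laden illustration in Python, not the API of any CEL implementation: `is_well_formed_string` and `checked_bindings` are hypothetical helpers, and Python's `str` type is used because it can hold lone surrogates, which are not valid in a CEL string.

```python
def is_well_formed_string(value: str) -> bool:
    """True if the string holds only Unicode scalar values (no lone surrogates)."""
    try:
        # Strict UTF-8 encoding rejects surrogate code points.
        value.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True

def checked_bindings(bindings: dict) -> dict:
    """Hypothetical host-side gate: reject nonconforming strings before binding."""
    for name, value in bindings.items():
        if isinstance(value, str) and not is_well_formed_string(value):
            raise ValueError(f"variable {name!r} is not a valid Unicode string")
    return bindings

print(is_well_formed_string("hello"))   # True
print(is_well_formed_string("\ud83d"))  # False: lone high surrogate
```

A host would call something like `checked_bindings` on its inputs before handing them to the evaluator, so that nonconforming values never reach the CEL program.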


Some implementations might make use of a _context proto_, where a single
protocol buffer message represents all variable bindings: each field in the
message is a binding of the field name to the field value. This provides a
@@ -997,12 +1011,16 @@ functions and operators.

### Extension Functions

It is possible to add extension functions to CEL, which then behave in no way
different than standard functions. The mechanism how to do this is
implementation dependent and usually highly curated. For example, an application
domain of CEL can add a new overload to the `size` function above, provided this
overload's argument types do not overlap with any existing overload. For
methodological reasons, CEL disallows to add overloads to operators.
It is possible to add extension functions to CEL, which then behave consistently
with standard functions. The mechanism for doing this is implementation
dependent and usually highly curated. For example, an application domain of CEL
can add a new overload to the `size` function above, provided this overload's
argument types do not overlap with any existing overload. For methodological
reasons, CEL does not allow overloading operators.
Comment on lines +1014 to +1019

Maybe I got carried away here, but this paragraph had some weird phrasing.


Like standard functions, extension functions must be free from observable side
effects in order to prevent expressions from having undefined results, since CEL
does not guarantee evaluation order of sub-expressions.
Comment on lines +1021 to +1023

Clarifying this here so we can remove qualifying language later on. (Though there are already other places that do not have qualifying language and simply assert that CEL expressions are free of side effects.)
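
To illustrate why side effects and an unspecified evaluation order do not mix, here is a small Python sketch (not CEL). It assumes a hypothetical extension function `next_id` that increments a counter, and evaluates the expression `next_id() - next_id()` with either operand first.

```python
def evaluate_difference(left_first: bool) -> int:
    """Evaluate `next_id() - next_id()` with a chosen operand order."""
    counter = 0

    def next_id() -> int:
        # Observable side effect: each call changes shared state.
        nonlocal counter
        counter += 1
        return counter

    if left_first:
        left = next_id()
        right = next_id()
    else:
        right = next_id()
        left = next_id()
    return left - right

print(evaluate_difference(left_first=True))   # -1
print(evaluate_difference(left_first=False))  # 1
```

The same expression produces two different results depending only on the order chosen by the evaluator, which is the kind of undefined result the added paragraph rules out.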


### Receiver Call Style

@@ -1161,11 +1179,12 @@ Ordering operators are defined for `int`, `uint`, `double`, `string`, `bytes`,
supported across `int`, `uint`, and `double` for consistency with the runtime
equality definition for numeric types.

Strings obey lexicographic ordering of the code points, and bytes obey
lexicographic ordering of the byte values. The ordering operators obey the
usual algebraic properties, i.e. `e1 <= e2` gives the same result as
`!(e1 > e2)` as well as `(e1 < e2) || (e1 == e2)` when the expressions
involved do not have side effects.
Strings and bytes obey lexicographic ordering of the byte values. Because
strings are encoded in UTF-8, strings consequently also obey lexicographic
ordering of their Unicode code points.
Comment on lines +1182 to +1184

It seemed to me better not to imply that strings and bytes have different lexicographic ordering behavior.


The ordering operators obey the usual algebraic properties, i.e. `e1 <= e2`
gives the same result as `!(e1 > e2)` as well as `(e1 < e2) || (e1 == e2)`.
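
The claim that byte-wise ordering of UTF-8 agrees with code point ordering can be checked with a short Python sketch. The sample strings are arbitrary choices for illustration. UTF-16 is included only as a contrast, since its byte order does not preserve code point order for characters outside the BMP.

```python
# Sample values: ASCII, Latin-1 range, a BMP character near the top of the
# plane (U+FF5A), and a supplementary-plane character (U+1F92A).
words = ["z", "\u00e9", "\U0001F92A", "\uFF5A", "a"]

by_code_point = sorted(words)  # Python compares str by code point
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8)   # True
print(by_code_point == by_utf16)  # False: U+1F92A sorts before U+FF5A
```

In UTF-16, U+1F92A is encoded with surrogates starting at `0xD83E`, which compares lower than `0xFF5A`, so the two orderings diverge there while the UTF-8 ordering does not.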

### Overflow

@@ -1743,6 +1762,7 @@ size({1: true, 2: false}) // 2
```
b'hello'.size() // 5
size(b'world!') // 6
size(b'\xF0\x9F\xA4\xAA') // 4
```

#### String Functions
@@ -1822,6 +1842,8 @@ codepoints
```
"hello".size() // 5
size("world!") // 6
"fiance\u0301".size() // 7
size(string(b'\xF0\x9F\xA4\xAA')) // 1

This seems really important!

```
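
The sizes in the examples above can be reproduced in Python, whose `len` counts code points for `str` and octets for `bytes`. This is only an analogy to CEL's `size`, not a CEL implementation. The NFC step is an extra illustration of why the absence of normalization matters.

```python
import unicodedata

combining = "fiance\u0301"       # 'e' followed by U+0301 COMBINING ACUTE ACCENT
zany_bytes = b"\xF0\x9F\xA4\xAA"  # UTF-8 encoding of U+1F92A
zany = zany_bytes.decode("utf-8")

print(len(combining))   # 7 code points, although it renders as 6 characters
print(len(zany_bytes))  # 4 octets
print(len(zany))        # 1 code point

# Normalizing to NFC composes 'e' + U+0301 into U+00E9, changing the size.
print(len(unicodedata.normalize("NFC", combining)))  # 6
```

Because no normalization is applied, two strings that render identically can report different sizes.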

#### Date/Time Functions
@@ -2014,6 +2036,7 @@ bool("FALSE") // false

```
bytes("hello") // b'hello'
bytes("🤪") // b'\xF0\x9F\xA4\xAA'
```

**double** `type(double)` \- Type denotation
@@ -2100,7 +2123,8 @@ int("123") // 123 (if successful, otherwise an error)
* `string(uint) -> string` converts unsigned integer values to base 10
representation
* `string(double) -> string` converts a double to a string
* `string(bytes) -> string` converts a byte sequence to a utf-8 string
* `string(bytes) -> string` converts a byte sequence to a UTF-8 string, errors
for invalid code points
Comment on lines +2126 to +2127

This is the behavior I've observed in cel-go and cel-java and it certainly seems correct to me.

* `string(timestamp) -> string` converts a timestamp value to
[RFC3339](https://datatracker.ietf.org/doc/html/rfc3339) format
* `string(duration) -> string` converts a duration value to seconds and
@@ -2112,8 +2136,9 @@ int("123") // 123 (if successful, otherwise an error)
string(123) // "123"
string(123u) // "123u"
string(3.14) // "3.14"
string(b'hello') // 'hello'
string(duration('1m1ms')) // '60.001s'
string(b'hello') // "hello"
string(b'\xf0\x9f\xa4\xaa') // "🤪"
string(duration('1m1ms')) // "60.001s"
```
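
The erroring behavior described for `string(bytes)` resembles strict UTF-8 decoding. The Python sketch below is an analogy under that assumption, with `bytes_to_string` as a hypothetical stand-in for the conversion. It is not taken from cel-go or cel-java.

```python
def bytes_to_string(data: bytes) -> str:
    """Decode UTF-8 strictly; malformed input raises UnicodeDecodeError."""
    return data.decode("utf-8")  # errors="strict" is the default

def converts(data: bytes) -> bool:
    """Report whether the conversion succeeds instead of raising."""
    try:
        bytes_to_string(data)
    except UnicodeDecodeError:
        return False
    return True

print(bytes_to_string(b"hello") == "hello")                 # True
print(bytes_to_string(b"\xf0\x9f\xa4\xaa") == "\U0001F92A")  # True
print(converts(b"\xff"))          # False: 0xFF never appears in UTF-8
print(converts(b"\xed\xa0\xbd"))  # False: encodes surrogate U+D83D
```

The last case shows that a surrogate code point is rejected even when it is encoded in a UTF-8-like form, consistent with strings being sequences of valid code points.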

**timestamp**