Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

properly support malformed char literals #44765

Closed
wants to merge 5 commits into from
Closed

Conversation

simeonschaub
Copy link
Member

@simeonschaub simeonschaub commented Mar 27, 2022

Fixes the parsing of char literals like '\xc0\x80'. At first, I tried
to replicate the behavior of getindex on a string in Julia here, but
then I noticed that we probably also want to support cases like
'\xff\xff', which would give two characters in a Julia string. Now
this also supports non-sensible UTF-8, but only if written out using
\x...

fixes #25072

@simeonschaub simeonschaub added parser Language parsing and surface syntax unicode Related to unicode characters and encodings labels Mar 27, 2022
Fixes the parsing of char literals like `'\xc0\x80'`. At first, I tried
to replicate the behavior of `getindex` on a string in Julia here, but
then I noticed that we probably also want to support cases like
`'\xff\xff'`, which would give two characters in a Julia string. Now
this supports any combination of characters as long as they are between
1 and 4 bytes, so even literals like `'abcd'` are allowed. I think this
makes sense because otherwise we wouldn't be able to reparse every valid
char in Julia, but I could see Python users being confused about why
Julia only supports strings up to length 4... 😄

fixes #25072
@Keno
Copy link
Member

Keno commented Mar 27, 2022

'abcd'

I don't like that. I feel like for Char purposes, literal, non-escaped chars are basically zero-padded.

@simeonschaub
Copy link
Member Author

Yeah, I get what you mean. Unfortunately, to allow this only for escaped char literals, we can't piggyback off the parsing for strings anymore, we'd have to have custom parsing rules for char literals.

Malformed chars now need to be written out explicitly with (multiple)
`\x...`.
@StefanKarpinski
Copy link
Member

@simeonschaub, I want to thank you for fixing a bunch of syntaxy issues like this that have been outstanding for many years and make the language much smoother and more cohesive. Bravo 👏🏻

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Mar 28, 2022

Messing about with this I found an interesting corner case: '\xbf\xbf\xbf'. This PR currently allows this and handles it as:

'\xbf\xbf\xbf': Malformed UTF-8 (category Ma: Malformed, bad data)

While you can construct a Char object that contains this data, you can also create one that contains multiple valid characters, such as what @Keno objected to:

julia> c = reinterpret(Char, reverse(map(UInt8, ['a', 'b', 'c', 'd'])))[1]
'\x61\x62\x63\x64': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> println(c)
abcd

IMO allowing one but not the other is inconsistent. If we allow the one with invalid data why do we disallow the one with valid data? I think both should be disallowed. But then what is a sensible criterion for which character literals are allowed? The key issue with both of these is that iterating string data can never produce the character 'abcd' or '\xbf\xbf\xbf'. In other words, a simple approach here would be to parse the character literal as a string and then check if it's a single character (valid or invalid). In other words basically make '...' the equivalent of only("...").

@simeonschaub
Copy link
Member Author

simeonschaub commented Mar 28, 2022

Messing about with this I found an interesting corner case: '\xbf\xbf\xbf'. This PR currently allows this and handles it as:

'\xbf\xbf\xbf': Malformed UTF-8 (category Ma: Malformed, bad data)

Yes, that is the intended behavior!

The key issue with both of these is that iterating string data can never produce the character 'abcd' or '\xbf\xbf\xbf'. In other words, a simple approach here would be to parse the character literal as a string and then check if it's a single character (valid or invalid). In other words basically make '...' the equivalent of only("...").

While that's true, I personally really like that with this PR, we can represent any possible Julia Char as a literal. In fact, we already assume this when printing these characters, so it does seem useful to me to have syntax for this.

I thought what Keno was saying is that something like 'abcd' to everyone looks like 4 characters, whereas for '\xbf\xbf\xbf', you already need a good understanding of UTF-8 to be able to tell whether that gives you one character. So I think at that point it isn't too confusing to extend the syntax to non-sensible UTF-8 in the case of character literals.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Mar 28, 2022

I think it's worth stepping back and thinking about why we want the ability to express invalid Char values as literals. It's useful to be able to write invalid Char values like '\xff' or '\x80' or '\0xc0\x80' because you can encounter them when working with actual strings. You might, for example, read some data in that is UTF-8 with '\xff' as a field separator, in which case you could read lines and splits them on the '\xff' character, even though it's not a valid UTF-8 character. I could come up with more examples, but the point is that the goal is not to have a literal syntax for every possible memory that a Char could take in memory, it is to have literal syntax for Char values that might be useful to work with as characters. And that is precisely the ones that might be iterated by a string.

Moreover, there is an implicit invariant for the Char type, which is that it only ever holds data for a single character as defined by string iteration. You can violate that invariant and construct Char values like 'abcd' or \x80\x80' by using reinterpret, but we really don't want to encourage it and we certainly shouldn't provide syntax to construct Char values like that. The same invariant argument doesn't apply to characters like '\xff' or '\x80' or '\0xc0\x80' because there is already an official way to get these Char values without reinterpret, i.e. "\xff"[1], "\x80"[1] and "\0xc0\x80"[1]. All that this PR should do, IMO, is align the set of Char values that you can express with character liter and string literal syntax.

@simeonschaub
Copy link
Member Author

simeonschaub commented Mar 28, 2022

I think I still disagree with this. I really mostly think of Char as a type on its own right, which represents a superset of values you might get from iterating a UTF-8 encoded string. Just because you can't get a certain Char from iterating a string doesn't mean it's not useful, they may be used as a sentinel value for example.

OTOH, I don't really see what we gain from declaring these Chars to be generally ill-defined. We'd also have to change their printing to something like reinterpret(Char, 0x80800000), which is a lot more verbose and IMO will read very awkwardly among other valid Chars in an array for example.

Maybe another way of thinking about it: to me \x80 just represents a byte. If you put those bytes into a string, that string will of course iterate a certain way, but I would say that's not really something fundamental about the \x80. The same way, you can also fill a Char/UInt32 with bytes -- you start with the top-most of the 4 bytes and fill the remaining bytes with zeros.

@simeonschaub simeonschaub added the triage This should be discussed on a triage call label Mar 29, 2022
@JeffBezanson
Copy link
Member

I completely agree with Stefan. Char is already sketchy-but-very-useful for representing invalid utf-8, we should not make it even more complex by allowing things that wouldn't even be possible in that (highly permissive) context. There are likely to be negative consequences down the road for adding more weird corner cases that Char code needs to support.

@vtjnash
Copy link
Member

vtjnash commented Mar 31, 2022

IIUC, Char already supports those, since its domain is strictly 0..2^32. That is pre-decided. The question posed here is whether the parser should support forming those characters directly, or require forming them via reinterpret. Relatedly, there is the question of whether to show those non-unicode characters using character syntax with '\xff\xff\xff\xff' (as we show it now) or require writing it as typemax(Char) == reinterpret(Char, typemax(UInt32)) (as we parse it now)

@StefanKarpinski
Copy link
Member

I would be in favor of also showing those characters as reinterpret calls. I'd suggest erroring but that would slow a lot of code down (even if the error doesn't happen). Another option would be to ignore trailing junk in Char values and consider 'abcd' equivalent to 'a' but that would complicate equality checking etc.

@JeffBezanson
Copy link
Member

I will try to implement this to work the way we want.

@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Apr 14, 2022
@simeonschaub
Copy link
Member Author

I will try to implement this to work the way we want.

I think I already implemented most of that in a separate branch. Let me fix that up

simeonschaub added a commit that referenced this pull request Apr 15, 2022
Alternative to #44765. This disallows character literals that can not be
created from iterating a UTF-8 string.

fixes #25072
@JeffBezanson
Copy link
Member

Great. Closing in favor of #44989

@DilumAluthge DilumAluthge deleted the sds/char_literals branch April 26, 2022 22:04
@simeonschaub simeonschaub restored the sds/char_literals branch April 27, 2022 08:07
@giordano giordano deleted the sds/char_literals branch May 13, 2022 22:42
JeffBezanson pushed a commit that referenced this pull request May 19, 2022
Make the syntax for character literals the same as what is allowed in
single-character string literals.

Alternative to #44765

fixes #25072
JeffBezanson pushed a commit that referenced this pull request May 24, 2022
Make the syntax for character literals the same as what is allowed in
single-character string literals.

Alternative to #44765

fixes #25072
KristofferC pushed a commit that referenced this pull request May 25, 2022
Make the syntax for character literals the same as what is allowed in
single-character string literals.

Alternative to #44765

fixes #25072
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parser Language parsing and surface syntax unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'\xc0\x80' should either error or make an overlong Char
5 participants