properly support malformed char literals #44765
Fixes the parsing of char literals like `'\xc0\x80'`. At first, I tried to replicate the behavior of `getindex` on a string in Julia here, but then I noticed that we probably also want to support cases like `'\xff\xff'`, which would give two characters in a Julia string. Now this supports any combination of characters as long as they are between 1 and 4 bytes, so even literals like `'abcd'` are allowed. I think this makes sense because otherwise we wouldn't be able to reparse every valid char in Julia, but I could see Python users being confused about why Julia only supports strings up to length 4... 😄

Fixes #25072
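To make the difference between the two byte sequences concrete, here is a minimal sketch that uses only string literals and iteration, so it does not depend on the char-literal parsing changed in this PR. The variable names are illustrative and not taken from the PR.

```julia
# The two byte sequences from the description, written as string literals.
overlong = "\xc0\x80"   # overlong two-byte encoding of U+0000
two_bad  = "\xff\xff"   # 0xff can never start a UTF-8 sequence

# Iterating the first string yields a single invalid Char ...
chars_overlong = collect(overlong)
# ... while the second string splits into two separate invalid Chars.
chars_two_bad = collect(two_bad)

println(length(chars_overlong), " vs. ", length(chars_two_bad))
println(isvalid(chars_overlong[1]), " ", all(isvalid, chars_two_bad))
```

This is why `'\xff\xff'` cannot be obtained by indexing into a string: string iteration always breaks those two bytes into separate characters.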
f20a3a7 to 6c955c6
I don't like that. I feel like for |
Yeah, I get what you mean. Unfortunately, to allow this only for escaped char literals, we can't piggyback off the parsing for strings anymore; we'd need custom parsing rules for char literals.
Malformed chars now need to be written out explicitly with (multiple) `\x...`.
@simeonschaub, I want to thank you for fixing a bunch of syntaxy issues like this that have been outstanding for many years and make the language much smoother and more cohesive. Bravo 👏🏻
Messing about with this I found an interesting corner case:
While you can construct a
julia> c = reinterpret(Char, reverse(map(UInt8, ['a', 'b', 'c', 'd'])))[1]
'\x61\x62\x63\x64': Malformed UTF-8 (category Ma: Malformed, bad data)
julia> println(c)
abcd
IMO allowing one but not the other is inconsistent. If we allow the one with invalid data why do we disallow the one with valid data? I think both should be disallowed. But then what is a sensible criterion for which character literals are allowed? The key issue with both of these is that iterating string data can never produce the character |
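The corner case can be checked with a short sketch. It assumes that reinterpreting the `UInt32` value `0x61626364` yields the same `Char` as the array-based construction in the REPL session above; the names `s` and `chars` are illustrative.

```julia
# Build the Char directly from its 32-bit pattern
# (assumed equivalent to the array-based reinterpret shown in the comment).
c = reinterpret(Char, 0x61626364)

s = string(c)        # writing c emits its four raw bytes
chars = collect(s)   # iterating those bytes back as a string

println(s)
println(isvalid(c))
println(c in chars)
```

Printing `c` produces four ordinary ASCII bytes, but iterating the resulting string yields four separate characters, none of which equals `c`.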
Yes, that is the intended behavior!
While that's true, I personally really like that with this PR, we can represent any possible Julia I thought what Keno was saying is that something like |
I think it's worth stepping back and thinking about why we want the ability to express invalid Moreover, there is an implicit invariant for the |
I think I still disagree with this. I really mostly think of OTOH, I don't really see what we gain from declaring these Maybe another way of thinking about it: to me |
I completely agree with Stefan.
IIUC, |
I would be in favor of also showing those characters as reinterpret calls. I'd suggest erroring but that would slow a lot of code down (even if the error doesn't happen). Another option would be to ignore trailing junk in Char values and consider |
I will try to implement this to work the way we want.
I think I already implemented most of that in a separate branch. Let me fix that up.
Great. Closing in favor of #44989.
Fixes the parsing of char literals like `'\xc0\x80'`. At first, I tried to replicate the behavior of `getindex` on a string in Julia here, but then I noticed that we probably also want to support cases like `'\xff\xff'`, which would give two characters in a Julia string. Now this also supports non-sensible UTF-8, but only if written out using `\x...`.

Fixes #25072