I noticed in #48281 that various search routines are giving incorrect results because they check for ASCII chars via c ≤ '\x7f' rather than isascii(c), which can be wrong for a malformed Char:
julia> c = reinterpret(Char, 0x7f000000 - 0x01)
'\x7e\xff\xff\xff': Malformed UTF-8 (category Ma: Malformed, bad data)
julia> isascii(c)
false
julia> c ≤ '\x7f'
true
julia> s = "foo~bar~\xff\xff\xff~baz"
"foo~bar~\xff\xff\xff~baz"
julia> readuntil(IOBuffer(s), c)
"foo"
julia> s[1:findfirst(==(c), s)]
"foo~"
julia> s[findlast(==(c), s):end]
"~baz"
These search results seem clearly wrong to me — they are treating c as if it were == Char(0x7e) == '~'.
It's not entirely clear to me what the correct behavior should be, however, since Julia never treats '\x7e\xff\xff\xff' as a character in a string: "$c" == "~\xff\xff\xff". Maybe these routines should throw an error if the character is malformed?
See also #44989, where such character literals were disallowed, since it is impossible(?) to obtain such a Char by iterating a String.
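A minimal sketch of why the two checks can disagree, assuming (as the results above suggest) that `Char` ordering compares the raw 32-bit representation, with the first encoded byte stored in the most significant position. The variable names `bits` and `limit` are illustrative only.

```julia
# Same malformed Char as above: the byte 0x7e followed by three 0xff bytes.
c = reinterpret(Char, 0x7effffff)

# Raw 32-bit payload; the first encoded byte sits in the top 8 bits.
bits = reinterpret(UInt32, c)
limit = reinterpret(UInt32, '\x7f')   # 0x7f000000

# An ordering on the raw bits is dominated by the leading byte, so the
# trailing 0xff bytes cannot push `c` past '\x7f'.
bits ≤ limit          # true, matching c ≤ '\x7f'

# A genuinely ASCII Char has nothing after its leading byte. Byte-swapping
# moves the trailing bytes to the top, where they fail a `< 0x80` test.
bswap(bits) < 0x80    # false, matching isascii(c)
```

Under that assumption, `c ≤ '\x7f'` only inspects the leading byte in practice, whereas an ASCII test also needs the remaining three bytes to be zero.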
Using reinterpret is bound to be able to create instances that violate any assumptions, and thus can break code.
Is there any other way of creating such a Char, without using reinterpret? If not, we could simply consider using such a Char undefined behaviour?
See also #44989, where such character literals were allowed:
julia> `'\x7e\xff\xff\xff'`
`'\x7e\xff\xff\xff'`
I advocated for allowing those, but it was eventually decided against. This just creates a command literal; if you try this with a character literal, it throws a syntax error:
Moreover, there is an implicit invariant for the Char type, which is that it only ever holds data for a single character as defined by string iteration. You can violate that invariant and construct Char values like 'abcd' or '\x80\x80' by using reinterpret, but we really don't want to encourage it and we certainly shouldn't provide syntax to construct Char values like that.
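The invariant described above can be tested by round-tripping a `Char` through a `String`. The following is a rough sketch only: `is_string_char` is a hypothetical helper name, not something provided by Base, and it assumes that `string(c)` writes out exactly the bytes stored in `c`, which is consistent with `"$c" == "~\xff\xff\xff"` in the original report.

```julia
# Hypothetical helper, not part of Base: true when `c` survives a round trip
# through a String as exactly one character.
function is_string_char(c::Char)
    s = string(c)                 # writes the bytes stored in `c`
    return length(s) == 1 && first(s) == c
end

is_string_char('~')                               # true
is_string_char('\xff')                            # true: iterating "\xff" yields this Char
is_string_char(reinterpret(Char, 0x7effffff))     # false: the string has four characters
is_string_char(reinterpret(Char, 0x80800000))     # false: '\x80\x80' in the notation above
```

Note that `'\xff'` passes even though it is malformed, because it is a value that string iteration can actually produce.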
If not, we could simply consider using such a Char undefined behaviour?
That's certainly an option. (It doesn't crash, it just gives a surprising result.)
It might be nicer to throw an exception, since it shouldn't be too costly in the context of a search routine to check for a malformed Char. But currently we don't seem to have a predicate to check for malformed Char values? (isvalid(c) is different: it returns false for anything that does not correspond to a valid UTF-8-encoded character.)
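One way the "throw an exception" option might look is sketched below. `checked_findfirst` is a hypothetical wrapper, not a Base function or a proposed API, and the guard assumes that a `Char` is searchable exactly when `string(c)` iterates as a single character. The last two lines illustrate why `isvalid` would be too strict as the guard.

```julia
# Hypothetical wrapper, not a Base API: refuse to search for a Char that
# cannot occur as a single character in any String.
function checked_findfirst(c::Char, s::AbstractString)
    if length(string(c)) != 1
        throw(ArgumentError("Char does not encode a single string character"))
    end
    return findfirst(==(c), s)
end

checked_findfirst('~', "foo~bar")    # 4

# isvalid is stricter than this guard: '\xff' is not valid UTF-8, yet it is
# exactly what iterating "\xff" produces, so searching for it is legitimate.
isvalid('\xff')                      # false
first("\xff") == '\xff'              # true
```

The cost of the guard here is one small string allocation per call. A real implementation would presumably inspect the bits of `c` directly instead.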
cc @StefanKarpinski, who wrote this code in #24999.