wip: string overhaul #24439
Conversation
base/iostream.jl (outdated)
@@ -336,12 +336,12 @@ function read(s::IOStream, nb::Integer; all::Bool=true)
 end

 ## Character streams ##
-const _chtmp = Ref{Char}()
+const _chtmp = Ref{UInt32}()
Unrelated, but this should now be inside the function :)
Nice! For testing purposes I have written a verbose version of the decoder here: https://gist.github.com/bkamins/5a2f89cab14d434e3f3e4a23c121a8a2. It checks the validity of UTF-8 (it is slower, of course).
base/char.jl (outdated)
     ((u & 0x00ff0000) >> 4) ⊻ ((u & 0xff000000) >> 6)
 end

 function convert(::Type{Char}, u::UInt32)
Tested, and this works OK for all valid Unicode (up to the last code point, U+10FFFF). The conversion breaks for invalid Unicode, as it overflows at 0x00400000.

I understand that we want to avoid branching here, so adding a test that u is not too large is not an option? If so, this should probably be documented somewhere.
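For illustration, here is a minimal sketch of the kind of bounds check being discussed, assuming the conversion should reject code points beyond U+10FFFF; the function name and error type are placeholders, not the PR's code:

```julia
# Hedged sketch, not the PR's implementation: reject out-of-range values
# before the bit-shuffling transformation, at the cost of a branch.
function char_from_codepoint(u::UInt32)
    u > 0x0010ffff && throw(ArgumentError("not a valid Unicode code point: $u"))
    return Char(u)  # assumed to delegate to the unchecked conversion
end
```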
I'm still contemplating how best to handle invalid code points. There was previously an error check here but it caused a massive build time regression (it was also calling a deprecated error type), so I just took it out for now. Will have to figure out what to do here.
With respect to validity checking: […]
Either that or change the transformations to be bijective. I’m considering how to do that.
My thoughts on the bijection (I hope there is no error in this 😄, but I have tested it not only in my head but also by implementation): the best I could come up with is doable but a bit messy (fortunately, only for invalid values). Here is an outline of how this can be done (giving a description of the mapping […]).

There are exactly 1112064 valid Unicode code points.
base/filesystem.jl (outdated)
     while 1 < n
         p = position(f)
         b = read(f, UInt8)
         if b & 0xc0 != 0x80
Here we should also handle sequences starting with 0xC0, 0xC1, 0xE0, or 0xF0 that would be overlong, sequences starting with 0xED that would be reserved (surrogates), and sequences starting with 0xF4 that would encode a code point that is too large.
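For reference, a small sketch of the strict constraints being described here, following Unicode's well-formed byte sequence rules; the function and its two-byte view of the problem are illustrative only, not the PR's code:

```julia
# Hedged sketch: given a lead byte and the first continuation byte, decide
# whether the sequence can still be strictly well-formed UTF-8.
function strict_continuation_ok(lead::UInt8, c1::UInt8)
    # Bytes below 0xC2 (ASCII, continuation bytes, and the always-overlong
    # leads 0xC0/0xC1) or above 0xF4 never start a well-formed multibyte sequence.
    (lead < 0xc2 || lead > 0xf4) && return false
    lead == 0xe0 && return 0xa0 <= c1 <= 0xbf   # otherwise overlong
    lead == 0xed && return 0x80 <= c1 <= 0x9f   # otherwise a UTF-16 surrogate
    lead == 0xf0 && return 0x90 <= c1 <= 0xbf   # otherwise overlong
    lead == 0xf4 && return 0x80 <= c1 <= 0x8f   # otherwise above U+10FFFF
    return 0x80 <= c1 <= 0xbf                   # generic continuation byte
end

strict_continuation_ok(0xe0, 0x9f)  # false: would be an overlong encoding
strict_continuation_ok(0xe0, 0xa0)  # true
```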
I'm not sure about that. I think that parsing overlong encodings as a single char is a much friendlier behavior. They occur fairly often in real data; many systems (e.g. Java) use overlong encodings to avoid embedded NUL bytes.
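As a concrete illustration of the overlong case mentioned above (example values only, not from the PR): Java-style "modified UTF-8" writes U+0000 as a two-byte sequence so the encoded data contains no zero byte.

```julia
# Overlong two-byte encoding of NUL (0xC0 0x80), as used by Java's modified
# UTF-8; strict decoders reject it, a lenient decoder reads it as '\0'.
overlong_nul = UInt8[0xc0, 0x80]
# Applying the generic two-byte decoding formula recovers code point 0:
cp = ((UInt32(overlong_nul[1]) & 0x1f) << 6) | (UInt32(overlong_nul[2]) & 0x3f)
@assert cp == 0x00000000
```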
base/io.jl (outdated)
     c = UInt32(b0)
     if n <= 4
         while 1 < n
             peek(s) & 0xc0 == 0x80 || break
The same comment as above regarding handling invalid sequences.
 function peekchar(s::IOStream)
-    if ccall(:ios_peekutf8, Cint, (Ptr{Void}, Ptr{Char}), s, _chtmp) < 0
+    chref = Ref{UInt32}()
+    if ccall(:ios_peekutf8, Cint, (Ptr{Void}, Ptr{UInt32}), s, chref) < 0
The current implementation of ios_peekutf8 also does not check for invalid sequences.
Seems a bit wasteful for ios_peekutf8 to convert the UTF-8 encoding to UInt32, and then to convert it back when returning Char(chref[]). Since this seems to be the only function in all of Julia that calls ios_peekutf8, we should just re-write ios_peekutf8 to return Char.

ios_peekutf8 is also used in src/flisp, so we will need to keep that as-is and maybe write an ios_peekrawutf8 function to get Char.
Agree, this stuff should be optimized to avoid converting back and forth. This PR is WIP and the first step is just to get everything working and all tests passing.
This is still in my new PR, which is annoying, but I think that all the ios_ support code needs an overhaul in the 1.x timeframe; ideally we would move most of the buffering code etc. into Julia, where it can understand the new Char representation and avoid the inefficiency here. Until then I think it's OK to just leave this as it is.
In the above comments I have followed a strict, Unicode 10 compliant approach to reading UTF-8. The reason is that if they were created they would have […]. This will probably speed up […]. If this approach is taken, the only issue to fix is […].
As an additional comment to keep in mind: if we use a strict parsing rule, we can then be sure that e.g. string length is consistent with Unicode 10 compliant parsers (as @StefanKarpinski mentioned earlier). If we use a simplified […]
We should definitely not ever read 4-5 continuation bytes; that's from a very old UTF-8 specification and hasn't been allowed for a long time. It's also not a thing that happens much in real life (aside from CESU-8 maybe?), and would require a larger […]
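For context on those longer sequences (general UTF-8 history, not something stated in this thread): the obsolete 1993-era scheme extended the same prefix pattern to 5- and 6-byte sequences, which RFC 3629 later disallowed. A quick sketch of the length each lead byte announced under that scheme:

```julia
# Under the obsolete scheme, lead bytes 0xF8..0xFB announced 5-byte and
# 0xFC..0xFD announced 6-byte sequences; RFC 3629 restricted UTF-8 to at
# most 4 bytes (code points up to U+10FFFF). Assumes a lead byte, not a
# continuation byte.
old_seq_length(lead::UInt8) =
    lead < 0x80 ? 1 :
    lead < 0xe0 ? 2 :
    lead < 0xf0 ? 3 :
    lead < 0xf8 ? 4 :
    lead < 0xfc ? 5 : 6

@assert old_seq_length(0xf8) == 5 && old_seq_length(0xfc) == 6
```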
See also WTF-8: https://en.wikipedia.org/wiki/UTF-8#WTF-8, which ignores invalidity of surrogate pairs, since UTF-8 has no intrinsic issue with them – they are purely invalid because of how UTF-16 works (and it only works that way because of its legacy history with UCS-2).
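For concreteness (example values only, not from the PR): a lone surrogate such as U+D800 goes through the generic 3-byte UTF-8 formula without any difficulty, which is exactly what WTF-8 exploits.

```julia
# Encoding U+D800 with the ordinary 3-byte UTF-8 pattern yields ED A0 80;
# strict UTF-8 forbids it solely because UTF-16 reserves the surrogate range.
cp = 0x0000d800
bytes = UInt8[0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)]
@assert bytes == UInt8[0xed, 0xa0, 0x80]
```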
Some cases I can see for invalid UTF-8 data, together with the most useful "character" parsing: […]
It's hard to imagine where the Unicode 10 partial character decoding business is actually a useful interpretation. Of course, the byte interpretation for binary and Latin-1 data isn't really very useful either since you don't get the right characters out, but at least the number and size of the units are correct. And yes, there's a chance that binary data or Latin-1 will look like valid UTF-8, but that's very improbable, especially if you require a complete UTF-8 character to group bytes. So I guess I think the most useful […]
This doesn't give 100% Unicode 10 conformance, but as I said above, I'm not entirely sure the standard really applies. The only observable discrepancy we might encounter is counting the number of characters in invalid strings differently, and in those cases, I think our answers will be what people actually expect/want. That seems like a pretty acceptable tradeoff between handling common UTF-8 variants smoothly and being approximately Unicode 10 conformant.
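A rough sketch of the grouping idea described above (not the PR's decoder; the function name and details are illustrative): a lead byte claims however many continuation bytes its prefix announces and actually finds, and anything else stands alone as a one-byte character.

```julia
# Hedged sketch: split a byte vector into "character" groups by lead-byte
# length prefix, falling back to single bytes for stray or invalid data.
function utf8_group_lengths(data::Vector{UInt8})
    lens = Int[]
    i = 1
    while i <= length(data)
        b = data[i]
        want = b < 0x80 ? 1 :       # ASCII
               b < 0xc0 ? 1 :       # stray continuation byte stands alone
               b < 0xe0 ? 2 :
               b < 0xf0 ? 3 : 4
        n = 1
        # consume continuation bytes, but only as many as are actually there
        while n < want && i + n <= length(data) && data[i + n] & 0xc0 == 0x80
            n += 1
        end
        push!(lens, n)
        i += n
    end
    return lens
end

utf8_group_lengths(UInt8[0x41, 0xce, 0xb1, 0xff, 0xc0, 0x80])  # [1, 2, 1, 2]
```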
First, about reading 4-5 byte sequences: as I have shown in #24420, currently Julia does this, e.g. run […].

I am OK with parsing valid overlong encodings and valid reserved surrogates as if they were allowed, and with treating all other invalid UTF-8 following Unicode 10. I understand this is what you propose, and it will lead to the fastest implementation of splitting. We should simply document that we work like this.

However, the other issue is conversion from […]. So the question is: do we want to have this bijection or not? And even if we do not want a full bijection, I propose to decode all other than overlong and surrogate invalid […]. Apart from that, then […].
I think I've changed my mind on the bijection business, after contemplating it. When converting from […]
Maybe I am stating the obvious, but I have just realized that, internally in Julia, we will actually not need to convert from […]
Yes, that's exactly the idea – you can do most things with […]
(force-pushed from 15d5497 to 915538c)
I'm puzzling through ordering here and thought maybe you have some thoughts, @bkamins. What ordering should the entire space of characters – valid and invalid – have? It's fairly common to write code like […].

The other consideration is that we might want sorting lexicographically by character to match sorting lexicographically by bytes – which is a property that valid UTF-8 is designed to have. However, I don't think that's actually possible for any ordering of partial characters, so perhaps that shouldn't be a consideration after all. Any thoughts?
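The concrete snippet was lost from the comment above, but a typical example of the kind of range-comparison code meant here might look like this (illustrative only):

```julia
# Code like this only behaves sensibly if Char ordering matches code point
# ordering, at least for valid characters.
isasciidigit(c::Char) = '0' <= c <= '9'
@assert isasciidigit('7') && !isasciidigit('x')
```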
I think that we can order […]
In short: outside of the ASCII range I would not worry too much about comparing […].

As a side comment: supporting https://en.wikipedia.org/wiki/Unicode_collation_algorithm somehow in Julia would be great.
I think we're in agreement here. The conclusion I came to is that the invariant we want is that […]
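The exact statement of the invariant was truncated above; one property along these lines (my reading of the discussion, not a quote) is that for valid characters, comparing Chars agrees with comparing their UTF-8 encodings byte-wise:

```julia
# For valid characters, code point order and byte-wise (String) order agree,
# a property UTF-8 was designed to have.
a, b = 'é', '€'                      # U+00E9 and U+20AC
@assert (a < b) == (string(a) < string(b))
```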
Also: I think that all fancy Unicode stuff like the collation algorithm can be implemented in packages.
The property you have indicated would be nice to have. Two things to consider, though: […]
Very nice! I agree that consistency between […].

Regarding over-long character encodings, I'd say it's fine to treat them as different from their correct counterparts for […].
Just to highlight the issue with comparison I see, using analogous notation: […]

The consistency […]. So I would say that a major decision between left and right padding would be how they compare in performance.

As for conversion from […]
If I understand it correctly, @StefanKarpinski wants to implement something else. Correct?

As another thought: maybe it would be useful to define […]
(force-pushed from 765d6eb to e54e4c0)
Making a PR to run CI.