-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document invalid UTF-8 indexing and concatenation #26952
Conversation
base/strings/basic.jl
Outdated
|
||
# Examples | ||
```jldoctest | ||
julia> thisind("αβγdef", -5) | ||
-5 | ||
julia> thisind("α", -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps start with an example that "works" and leave the BoundsError
examples to last?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved (I wanted to sow what happens if we increase the index but I agree with your reasoning).
base/strings/basic.jl
Outdated
|
||
julia> thisind("αβγdef", 20) | ||
20 | ||
julia> thisind("α", 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line has no output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed (a copy-paste glitch)
base/strings/basic.jl
Outdated
@@ -394,27 +396,41 @@ end | |||
""" | |||
prevind(str::AbstractString, i::Integer, n::Integer=1) -> Int | |||
|
|||
Case `n == 1`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps make this a bullet point list instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
base/strings/basic.jl
Outdated
@@ -436,25 +452,42 @@ end | |||
""" | |||
nextind(str::AbstractString, i::Integer, n::Integer=1) -> Int | |||
|
|||
Case `n == 1`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, perhaps better with a bulleted list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
doc/src/manual/strings.md
Outdated
@@ -348,6 +348,35 @@ x | |||
y | |||
``` | |||
|
|||
Strings in Julia can contain invalid UTF-8 code unit sequences. This is rule allows to accept |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is rule allows
? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
doc/src/manual/strings.md
Outdated
@@ -371,6 +400,34 @@ julia> string(greet, ", ", whom, ".\n") | |||
"Hello, world.\n" | |||
``` | |||
|
|||
An important to be aware of situation is when invalid UTF-8 strings are concatenated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps better as A situation which is important to be aware of is when...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. @fredrikekre thank you for a review 😄,
@StefanKarpinski: I have documented it in this PR but the question is whether it is intended or we want to throw an error in such situations. |
I've been on vacation but will take a look at this now that I'm back. |
doc/src/manual/strings.md
Outdated
@@ -348,6 +348,35 @@ x | |||
y | |||
``` | |||
|
|||
Strings in Julia can contain invalid UTF-8 code unit sequences. This convention allows to | |||
treat any byte sequence as a `String`. In such situations a rule is that characters are formed | |||
by longest possibly valid sequences of code points. This rule is best explained by an example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"possibly valid" isn't very explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tried to improve it (but it is hard).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. The new version much more precise, but is "the longest sequence of code units that could be a start of some valid code point" really correct? e.g. an overlong encoding isn't a start of a valid character. Sorry, I don't know what the best description could be, but Stefan can probably help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made another shot (more verbose) and with one additional example.
doc/src/manual/strings.md
Outdated
|
||
We can see that first two code units in `s` form an overlong encoding of space character. | ||
It is invalid, but is accepted in a string as a single character. | ||
Next two code units form a valid start of a three byte UTF-8 sequence. However, fifth code unit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The next two". "three-byte". "The fifth".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
doc/src/manual/strings.md
Outdated
Next two code units form a valid start of a three byte UTF-8 sequence. However, fifth code unit | ||
`\xe2` is not its valid continuation. Therefore code units 3 and 4 form a second malformed | ||
character in this string. Similarly code unit 5 forms a malformed character because | ||
because `|` is not a valid continuation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Twice "because".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
doc/src/manual/strings.md
Outdated
@@ -371,6 +400,34 @@ julia> string(greet, ", ", whom, ".\n") | |||
"Hello, world.\n" | |||
``` | |||
|
|||
A situation which is important to be aware of is when invalid UTF-8 strings are concatenated. | |||
In that case string may contain different characters than those that constitute concatenated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the resulting string" and "that constitute input strings"?
Below, typo "sting". "such a string" could just be "its [number of characters]".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. thank you for a review.
doc/src/manual/strings.md
Outdated
true | ||
``` | ||
|
||
We can see that first two code units in `s` form an overlong encoding of space character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"We can see that +the+ first two..."
Also, I'd suggest ...in `s`...
--> ...in the string `s`...
, to make the sentence structure more evident.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
doc/src/manual/strings.md
Outdated
The next two code units form a valid start of a three-byte UTF-8 sequence. However, the fifth | ||
code unit `\xe2` is not its valid continuation. Therefore code units 3 and 4 form a second | ||
malformed character in this string. Similarly code unit 5 forms a malformed character because | ||
`|` is not a valid continuation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"because `|` is not a valid continuation +to it+"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
doc/src/manual/strings.md
Outdated
@@ -371,6 +401,34 @@ julia> string(greet, ", ", whom, ".\n") | |||
"Hello, world.\n" | |||
``` | |||
|
|||
A situation which is important to be aware of is when invalid UTF-8 strings are concatenated. | |||
In that case the resulting string may contain different characters than those that constitute | |||
input strings and its number of characters may be lower than sum of numbers of characters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"than those that constitute input strings and" --> "than the input strings, and"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
a6489a1
to
de20606
Compare
Any opinion on this PR? |
Definitely a big improvement. I should have merged long ago. I fixed the merge conflict, we can merge as soon as CI passes. |
@fredrikekre Here, CI mostly passes, but I have investigated Travis failure and it is related to the same problem as why #26802 fails. Documenter.jl throws an error when digesting malformed characters. Again - we could disable doctests here, but maybe there is some workaround? In general, in both cases it is crucial that we show in the manual how Julia works with invalid UTF-8 so I have to leave those offending lines in both PRs. |
CI status is that 2 builds on CircleCI were canceled, and AppVeyor and Travis builds partially passed and partially failed due to some strange reason (not due to the error from Documenter.jl); freebsd ci passes. |
base/strings/basic.jl
Outdated
If `i` is in bounds in `s` return the index of the start of the character whose | ||
encoding starts after index `i`. If `i` is out of bounds in `s` return `i + 1`. | ||
If `n == 0` return `i`. | ||
encoding starts after index `i`. In other words, if `i` is the start of a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this section need to be indented (and same below) if they should render as bullet points, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, thanks.
Thanks for this—it's a really good explanation of how this works. |
#27109 on Travis 64-bit |
This PR is a take to address all documentation points from #25478.
Changes:
thisind
,nextind
,prevind
.I am not a native speaker so if there are grammatical mistakes just push a patch to the descriptions 😄.
@StefanKarpinski the descriptions are complex unfortunately but I hope I have managed to cover all cases.