Document invalid UTF-8 indexing and concatenation #26952
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -352,31 +352,33 @@ end | |
If `i` is in bounds in `s` return the index of the start of the character whose | ||
encoding code unit `i` is part of. In other words, if `i` is the start of a | ||
character, return `i`; if `i` is not the start of a character, rewind until the | ||
start of a character and return that index. If `i` is out of bounds in `s` | ||
return `i`. | ||
start of a character and return that index. If `i` is equal to 0 or `ncodeunits(s)+1` | ||
return `i`. In all other cases throw `BoundsError`. | ||
|
||
# Examples | ||
```jldoctest | ||
julia> thisind("αβγdef", -5) | ||
-5 | ||
julia> thisind("α", -1) | ||
ERROR: BoundsError: attempt to access "α" | ||
at index [-1] | ||
[...] | ||
|
||
julia> thisind("α", 0) | ||
0 | ||
|
||
julia> thisind("αβγdef", 1) | ||
julia> thisind("α", 1) | ||
1 | ||
|
||
julia> thisind("αβγdef", 3) | ||
3 | ||
julia> thisind("α", 2) | ||
1 | ||
|
||
julia> thisind("αβγdef", 4) | ||
julia> thisind("α", 3) | ||
3 | ||
|
||
julia> thisind("αβγdef", 9) | ||
9 | ||
|
||
julia> thisind("αβγdef", 10) | ||
10 | ||
|
||
julia> thisind("αβγdef", 20) | ||
20 | ||
julia> thisind("α", 4) | ||
Review comment: This line has no output
Reply: fixed (a copy-paste glitch) |
||
julia> thisind("α", -1) | ||
ERROR: BoundsError: attempt to access "α" | ||
at index [-1] | ||
[...] | ||
``` | ||
""" | ||
thisind(s::AbstractString, i::Integer) = thisind(s, Int(i)) | ||
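
The three cases in the revised `thisind` docstring can be checked with a small sketch. It assumes a Julia 1.x session; `throws_bounds_error` is a hypothetical helper written for this illustration and is not part of Base.

```julia
# Hypothetical helper (not part of Base): true if calling f throws a BoundsError.
function throws_bounds_error(f)
    try
        f()
        return false
    catch err
        return err isa BoundsError
    end
end

s = "α"                       # one character stored in two code units
n = ncodeunits(s)             # 2

# Inside a character, thisind rewinds to the character's first code unit.
inside = [thisind(s, i) for i in 1:n]          # [1, 1]

# The two sentinel positions just outside the string are returned unchanged.
edges = (thisind(s, 0), thisind(s, n + 1))     # (0, 3)

# Any index further out raises BoundsError.
outside = [throws_bounds_error(() -> thisind(s, i)) for i in (-1, n + 2)]
```

The sentinel positions `0` and `ncodeunits(s)+1` are the values that `prevind` and `nextind` hand back when they step off either end of the string, which is presumably why `thisind` accepts them.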
|
@@ -394,27 +396,41 @@ end | |
""" | ||
prevind(str::AbstractString, i::Integer, n::Integer=1) -> Int | ||
|
||
Case `n == 1`. | ||
Review comment: Perhaps make this a bullet point list instead?
Reply: fixed |
||
If `i` is in bounds in `str` return the index of the start of the character whose | ||
encoding starts before index `i`. In other words, if `i` is the start of a | ||
character, return the start of the previous character; if `i` is not the start | ||
of a character, rewind until the start of a character and return that index. | ||
If `i` is out of bounds in `s` return `i - 1`. If `n == 0` return `i`. | ||
If `i` is equal to `1` return `0`. | ||
If `i` is equal to `ncodeunits(str)+1` return `lastindex(str)`. | ||
Otherwise throw `BoundsError`. | ||
|
||
Case `n > 1`. Behaves like applying `n` times `prevind` for `n==1`. The only difference | ||
is that if `n` is so large that applying `prevind` would reach `0` then each remaining | ||
iteration decreases the returned value by `1`. | ||
This means that in this case `prevind` can return a negative value. | ||
|
||
Case `n == 0`. | ||
Return `i` only if `i` is a valid index in `str` or is equal to `ncodeunits(str)+1`. | ||
Otherwise `StringIndexError` or `BoundsError` is thrown. | ||
|
||
# Examples | ||
```jldoctest | ||
julia> prevind("αβγdef", 3) | ||
julia> prevind("α", 3) | ||
1 | ||
|
||
julia> prevind("αβγdef", 1) | ||
julia> prevind("α", 1) | ||
0 | ||
|
||
julia> prevind("αβγdef", 0) | ||
ERROR: BoundsError: attempt to access "αβγdef" | ||
julia> prevind("α", 0) | ||
ERROR: BoundsError: attempt to access "α" | ||
at index [0] | ||
Stacktrace: | ||
[...] | ||
|
||
julia> prevind("αβγdef", 3, 2) | ||
julia> prevind("α", 2, 2) | ||
0 | ||
|
||
julia> prevind("α", 2, 3) | ||
-1 | ||
``` | ||
""" | ||
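
As a rough check of the `prevind` cases described above, the following sketch exercises `n == 1`, `n > 1` and `n == 0` on the one-character string `"α"`. It assumes Julia 1.x behavior; `caught` is a hypothetical helper introduced only for this example.

```julia
# Hypothetical helper (not part of Base): return the exception thrown by f, or nothing.
function caught(f)
    try
        f()
        return nothing
    catch err
        return err
    end
end

str = "α"                    # one character occupying code units 1 and 2
z = ncodeunits(str) + 1      # 3, the position one past the end

# Case n == 1
@assert prevind(str, z) == lastindex(str) == 1
@assert prevind(str, 2) == 1          # index 2 is mid-character: rewind to its start
@assert prevind(str, 1) == 0
@assert caught(() -> prevind(str, 0)) isa BoundsError

# Case n > 1: once 0 is reached, each remaining step subtracts 1
@assert prevind(str, 2, 2) == 0
@assert prevind(str, 2, 3) == -1

# Case n == 0: only valid indices and ncodeunits(str)+1 are accepted
@assert prevind(str, 1, 0) == 1
@assert prevind(str, z, 0) == z
@assert caught(() -> prevind(str, 2, 0)) isa StringIndexError
```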
|
@@ -436,25 +452,42 @@ end | |
""" | ||
nextind(str::AbstractString, i::Integer, n::Integer=1) -> Int | ||
|
||
Case `n == 1`. | ||
Review comment: Same here, perhaps better with a bulleted list.
Reply: fixed |
||
If `i` is in bounds in `str` return the index of the start of the character whose | ||
encoding starts after index `i`. If `i` is out of bounds in `s` return `i + 1`. | ||
If `n == 0` return `i`. | ||
encoding starts after index `i`. In other words, if `i` is the start of a | ||
Review comment: I think this section need to be indented (and same below) if they should render as bullet points, right?
Reply: fixed, thanks. |
||
character, return the start of the next character; if `i` is not the start | ||
of a character, move forward until the start of a character and return that index. | ||
If `i` is equal to `0` return `1`. | ||
If `i` is in bounds but greater than or equal to `lastindex(str)` return `ncodeunits(str)+1`. | ||
Otherwise throw `BoundsError`. | ||
|
||
Case `n > 1`. Behaves like applying `n` times `nextind` for `n==1`. The only difference | ||
is that if `n` is so large that applying `nextind` would reach `ncodeunits(str)+1` then each | ||
remaining iteration increases the returned value by `1`. | ||
This means that in this case `nextind` can return a value greater than `ncodeunits(str)+1`. | ||
|
||
Case `n == 0`. | ||
Return `i` only if `i` is a valid index in `str` or is equal to `0`. | ||
Otherwise `StringIndexError` or `BoundsError` is thrown. | ||
|
||
# Examples | ||
```jldoctest | ||
julia> str = "αβγdef"; | ||
julia> nextind("α", 0) | ||
1 | ||
|
||
julia> nextind(str, 1) | ||
julia> nextind("α", 1) | ||
3 | ||
|
||
julia> nextind(str, 1, 2) | ||
5 | ||
julia> nextind("α", 3) | ||
ERROR: BoundsError: attempt to access "α" | ||
at index [3] | ||
[...] | ||
|
||
julia> lastindex(str) | ||
9 | ||
julia> nextind("α", 0, 2) | ||
3 | ||
|
||
julia> nextind(str, 9) | ||
10 | ||
julia> nextind("α", 1, 2) | ||
4 | ||
``` | ||
""" | ||
nextind(s::AbstractString, i::Integer, n::Integer) = nextind(s, Int(i), Int(n)) | ||
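
One way to see how the `nextind` rules fit together is to walk a string from the sentinel index `0` up to `ncodeunits(str)+1`. The sketch below assumes Julia 1.x; `char_starts` is a hypothetical helper written for illustration and would normally be replaced by `eachindex`.

```julia
# Hypothetical helper: collect the start index of every character using only nextind.
function char_starts(str::AbstractString)
    starts = Int[]
    i = nextind(str, 0)            # nextind(str, 0) == 1 for any string
    while i <= ncodeunits(str)     # stop once ncodeunits(str)+1 is reached
        push!(starts, i)
        i = nextind(str, i)
    end
    return starts
end

str = "αβγdef"                     # three 2-byte characters followed by three 1-byte ones
starts = char_starts(str)          # [1, 3, 5, 7, 8, 9]

# Case n > 1: stepping past the end keeps adding 1 per remaining step.
overshoot = (nextind("α", 0, 2), nextind("α", 1, 2))   # (3, 4)
```

The loop relies on `nextind(str, lastindex(str))` returning `ncodeunits(str)+1` rather than throwing, which is the behavior the `n == 1` case documents.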
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -348,6 +348,35 @@ x | |
y | ||
``` | ||
|
||
Strings in Julia can contain invalid UTF-8 code unit sequences. This is rule allows to accept | ||
Review comment: [suggested change not shown]
Reply: fixed |
||
any byte sequence as a string. In such situations the rule is that characters are formed by the longest | ||
possibly valid sequences of code units. This rule is best explained by an example: | ||
|
||
```jldoctest unicodestring | ||
julia> s = "\xc0\xa0\xe2\x88\xe2|" | ||
"\xc0\xa0\xe2\x88\xe2|" | ||
|
||
julia> foreach(display, s) | ||
'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space) | ||
'\xe2\x88': Malformed UTF-8 (category Ma: Malformed, bad data) | ||
'\xe2': Malformed UTF-8 (category Ma: Malformed, bad data) | ||
'|': ASCII/Unicode U+007c (category Sm: Symbol, math) | ||
|
||
julia> isvalid.(collect(s)) | ||
4-element BitArray{1}: | ||
false | ||
false | ||
false | ||
true | ||
``` | ||
|
||
We can see that first two code units in `s` form an overlong encoding of space character. | ||
Review comment: "We can see that +the+ first two..." Also, I'd suggest
Reply: fixed |
||
It is invalid, but is accepted in a string as a single character. | ||
Next two code units form a valid start of a three byte UTF-8 sequence. However, fifth code unit | ||
Review comment: "The next two". "three-byte". "The fifth".
Reply: fixed |
||
`\xe2` is not its valid continuation. Therefore code units 3 and 4 form a second malformed | ||
character in this string. Similarly code unit 5 forms a malformed character because | ||
because `|` is not a valid continuation. | ||
Review comment: Twice "because".
Reply: fixed |
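
The grouping of code units into characters described in this hunk can also be inspected through string indices. This is a minimal sketch assuming Julia 1.x; the variable names `starts`, `widths` and `valid` are chosen for illustration only.

```julia
s = "\xc0\xa0\xe2\x88\xe2|"        # the six code units from the example above

# Index of the first code unit of each character.
starts = collect(eachindex(s))                 # [1, 3, 5, 6]

# Number of code units each character occupies, measured with nextind.
widths = [nextind(s, i) - i for i in starts]   # [2, 2, 1, 1]

# Only the final '|' is a valid character.
valid = [isvalid(s[i]) for i in starts]        # [false, false, false, true]
```

The widths add up to `ncodeunits(s)`, so every code unit belongs to exactly one character even though three of the four characters are invalid.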
||
|
||
Julia uses the UTF-8 encoding by default, and support for new encodings can be added by packages. | ||
For example, the [LegacyStrings.jl](https://github.com/JuliaArchive/LegacyStrings.jl) package | ||
implements `UTF16String` and `UTF32String` types. Additional discussion of other encodings and | ||
|
@@ -371,6 +400,34 @@ julia> string(greet, ", ", whom, ".\n") | |
"Hello, world.\n" | ||
``` | ||
|
||
An important to be aware of situation is when invalid UTF-8 strings are concatenated. | ||
Review comment: Perhaps better as
Reply: fixed. @fredrikekre thank you for a review 😄, |
||
In that case string may contain different characters than those that constitute concatenated | ||
Review comment: "the resulting string" and "that constitute input strings"? Below, typo "sting". "such a string" could just be "its [number of characters]".
Reply: fixed. thank you for a review. |
||
stings and number of characters in such string may be lower than sum of numbers of characters | ||
of the concatenated strings, e.g.: | ||
|
||
```jldoctest stringconcat | ||
julia> a, b = "\xe2\x88", "\x80" | ||
("\xe2\x88", "\x80") | ||
|
||
julia> c = a*b | ||
"∀" | ||
|
||
julia> collect.([a, b, c]) | ||
3-element Array{Array{Char,1},1}: | ||
['\xe2\x88'] | ||
['\x80'] | ||
['∀'] | ||
|
||
julia> length.([a, b, c]) | ||
3-element Array{Int64,1}: | ||
1 | ||
1 | ||
1 | ||
``` | ||
|
||
This situation can happen only for invalid UTF-8 strings. For valid UTF-8 strings | ||
concatenation preserves all characters in strings and additivity of string lengths. | ||
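
The claim about concatenation can be checked at the code unit level. The sketch below assumes Julia 1.x; it shows that code units are always preserved, while character counts are additive only when the inputs are valid UTF-8.

```julia
a, b = "\xe2\x88", "\x80"          # two invalid fragments of the encoding of '∀'
c = a * b

# Neither input is valid UTF-8, but the concatenation is.
@assert !isvalid(a) && !isvalid(b) && isvalid(c)
@assert c == "∀"

# Code units are simply appended...
@assert codeunits(c) == vcat(codeunits(a), codeunits(b))

# ...but character counts are not additive for the invalid inputs.
@assert length(a) + length(b) == 2
@assert length(c) == 1

# For valid inputs, lengths add up as expected.
x, y = "αβ", "γ"
@assert length(x * y) == length(x) + length(y)
```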
|
||
Julia also provides `*` for string concatenation: | ||
|
||
```jldoctest stringconcat | ||
|
Review comment: Perhaps start with an example that "works" and leave the BoundsError examples to last?
Reply: moved (I wanted to show what happens if we increase the index but I agree with your reasoning).