Document invalid UTF-8 indexing and concatenation #26952
Changes from 2 commits
47f307b
0c080fd
60395c7
32d8d65
f6dc4ca
c0f0509
a563e33
14b157a
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -352,31 +352,32 @@ end | |
If `i` is in bounds in `s` return the index of the start of the character whose | ||
encoding code unit `i` is part of. In other words, if `i` is the start of a | ||
character, return `i`; if `i` is not the start of a character, rewind until the | ||
start of a character and return that index. If `i` is out of bounds in `s` | ||
return `i`. | ||
start of a character and return that index. If `i` is equal to 0 or `ncodeunits(s)+1` | ||
return `i`. In all other cases throw `BoundsError`. | ||
|
||
# Examples | ||
```jldoctest | ||
julia> thisind("αβγdef", -5) | ||
-5 | ||
julia> thisind("α", 0) | ||
0 | ||
|
||
julia> thisind("αβγdef", 1) | ||
julia> thisind("α", 1) | ||
1 | ||
|
||
julia> thisind("αβγdef", 3) | ||
3 | ||
julia> thisind("α", 2) | ||
1 | ||
|
||
julia> thisind("αβγdef", 4) | ||
julia> thisind("α", 3) | ||
3 | ||
|
||
julia> thisind("αβγdef", 9) | ||
9 | ||
|
||
julia> thisind("αβγdef", 10) | ||
10 | ||
julia> thisind("α", 4) | ||
ERROR: BoundsError: attempt to access "α" | ||
at index [4] | ||
[...] | ||
|
||
julia> thisind("αβγdef", 20) | ||
20 | ||
julia> thisind("α", -1) | ||
ERROR: BoundsError: attempt to access "α" | ||
at index [-1] | ||
[...] | ||
``` | ||
""" | ||
thisind(s::AbstractString, i::Integer) = thisind(s, Int(i)) | ||
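The revised `thisind` contract in this hunk can be checked with a short script. This is only a sketch, assuming a Julia version in which this change has landed; the variable names `results` and `threw` are illustrative and not part of the PR.

```julia
s = "α"                 # one character encoded as two code units
n = ncodeunits(s)       # 2

# thisind accepts every index from 0 through ncodeunits(s) + 1.
results = [thisind(s, i) for i in 0:n+1]    # [0, 1, 1, 3]

# Any index outside that range is expected to throw BoundsError.
threw = try
    thisind(s, n + 2)
    false
catch err
    err isa BoundsError
end
```

Index 2 falls inside the encoding of `α`, so it rewinds to 1, while 0 and 3 are returned unchanged as the two sentinel positions.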
|
@@ -394,27 +395,45 @@ end | |
""" | ||
prevind(str::AbstractString, i::Integer, n::Integer=1) -> Int | ||
|
||
* Case `n == 1` | ||
|
||
If `i` is in bounds in `s` return the index of the start of the character whose | ||
encoding starts before index `i`. In other words, if `i` is the start of a | ||
character, return the start of the previous character; if `i` is not the start | ||
of a character, rewind until the start of a character and return that index. | ||
If `i` is out of bounds in `s` return `i - 1`. If `n == 0` return `i`. | ||
If `i` is equal to `1` return `0`. | ||
If `i` is equal to `ncodeunits(str)+1` return `lastindex(str)`. | ||
Otherwise throw `BoundsError`. | ||
|
||
* Case `n > 1` | ||
|
||
Behaves like applying `n` times `prevind` for `n==1`. The only difference | ||
is that if `n` is so large that applying `prevind` would reach `0` then each remaining | ||
iteration decreases the returned value by `1`. | ||
This means that in this case `prevind` can return a negative value. | ||
|
||
* Case `n == 0` | ||
|
||
Return `i` only if `i` is a valid index in `str` or is equal to `ncodeunits(str)+1`. | ||
Otherwise `StringIndexError` or `BoundsError` is thrown. | ||
|
||
# Examples | ||
```jldoctest | ||
julia> prevind("αβγdef", 3) | ||
julia> prevind("α", 3) | ||
1 | ||
|
||
julia> prevind("αβγdef", 1) | ||
julia> prevind("α", 1) | ||
0 | ||
|
||
julia> prevind("αβγdef", 0) | ||
ERROR: BoundsError: attempt to access "αβγdef" | ||
julia> prevind("α", 0) | ||
ERROR: BoundsError: attempt to access "α" | ||
at index [0] | ||
Stacktrace: | ||
[...] | ||
|
||
julia> prevind("αβγdef", 3, 2) | ||
julia> prevind("α", 2, 2) | ||
0 | ||
|
||
julia> prevind("α", 2, 3) | ||
-1 | ||
``` | ||
""" | ||
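One way to read the `n > 1` case is as a loop over single steps that keeps counting down after reaching `0`. The helper below is hypothetical (`prevind_by_steps` is not part of Base or of this PR) and is only a sketch of the described semantics under that reading.

```julia
# Hypothetical helper (not part of Base): emulate prevind(s, i, n) for n >= 1
# by taking single steps, as the "Case n > 1" text describes.
function prevind_by_steps(s::AbstractString, i::Int, n::Int)
    n >= 1 || throw(ArgumentError("n must be at least 1"))
    # Same starting range that the single-step form accepts.
    0 < i <= ncodeunits(s) + 1 || throw(BoundsError(s, i))
    for _ in 1:n
        # A single-step prevind would throw at index 0, so once 0 is
        # reached every remaining step just subtracts 1.
        i = i <= 0 ? i - 1 : prevind(s, i)
    end
    return i
end

prevind_by_steps("α", 3, 1)   # 1
prevind_by_steps("α", 2, 2)   # 0
prevind_by_steps("α", 2, 3)   # -1
```

Under this reading, a step count larger than the number of preceding characters yields a negative result rather than an error.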
|
@@ -436,25 +455,46 @@ end | |
""" | ||
nextind(str::AbstractString, i::Integer, n::Integer=1) -> Int | ||
|
||
* Case `n == 1` | ||
|
||
If `i` is in bounds in `s` return the index of the start of the character whose | ||
encoding starts after index `i`. If `i` is out of bounds in `s` return `i + 1`. | ||
If `n == 0` return `i`. | ||
encoding starts after index `i`. In other words, if `i` is the start of a | ||
Review comment: I think this section need to be indented (and same below) if they should render as bullet points, right?
Reply: fixed, thanks.
||
character, return the start of the next character; if `i` is not the start | ||
of a character, move forward until the start of a character and return that index. | ||
If `i` is equal to `0` return `1`. | ||
If `i` is in bounds but greater or equal to `lastindex(str)` return `ncodeunits(str)+1`. | ||
Otherwise throw `BoundsError`. | ||
|
||
* Case `n > 1` | ||
|
||
Behaves like applying `n` times `nextind` for `n==1`. The only difference | ||
is that if `n` is so large that applying `nextind` would reach `ncodeunits(str)+1` then each | ||
remaining iteration increases the returned value by `1`. | ||
This means that in this case `nextind` can return a value greater than `ncodeunits(str)+1`. | ||
|
||
* Case `n == 0` | ||
|
||
Return `i` only if `i` is a valid index in `s` or is equal to `0`. | ||
Otherwise `StringIndexError` or `BoundsError` is thrown. | ||
|
||
# Examples | ||
```jldoctest | ||
julia> str = "αβγdef"; | ||
julia> nextind("α", 0) | ||
1 | ||
|
||
julia> nextind(str, 1) | ||
julia> nextind("α", 1) | ||
3 | ||
|
||
julia> nextind(str, 1, 2) | ||
5 | ||
julia> nextind("α", 3) | ||
ERROR: BoundsError: attempt to access "α" | ||
at index [3] | ||
[...] | ||
|
||
julia> lastindex(str) | ||
9 | ||
julia> nextind("α", 0, 2) | ||
3 | ||
|
||
julia> nextind(str, 9) | ||
10 | ||
julia> nextind("α", 1, 2) | ||
4 | ||
``` | ||
""" | ||
nextind(s::AbstractString, i::Integer, n::Integer) = nextind(s, Int(i), Int(n)) | ||
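Because `nextind(str, 0)` returns `1` and stepping past the last character returns `ncodeunits(str)+1`, the single-step form is enough to walk over every character start. The function `char_starts` below is a hypothetical illustration, not code from this PR, and assumes the behavior documented in this hunk.

```julia
# Hypothetical helper: collect the start index of every character in `s`
# using only the single-step form of nextind.
function char_starts(s::AbstractString)
    starts = Int[]
    i = nextind(s, 0)              # first character start (1), even for ""
    while i <= ncodeunits(s)
        push!(starts, i)
        i = nextind(s, i)          # ends at ncodeunits(s) + 1
    end
    return starts
end

char_starts("αβγdef")   # [1, 3, 5, 7, 8, 9]
```

For a `String` this should agree with `collect(eachindex(s))`; the loop is written out only to show how the `0` and `ncodeunits(str)+1` sentinels fit together.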
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -348,6 +348,35 @@ x | |
y | ||
``` | ||
|
||
Strings in Julia can contain invalid UTF-8 code unit sequences. This convention allows to | ||
treat any byte sequence as a `String`. In such situations a rule is that characters are formed | ||
by longest possibly valid sequences of code points. This rule is best explained by an example: | ||
Review comment: "possibly valid" isn't very explicit.
Reply: I have tried to improve it (but it is hard).
Reply: Yeah. The new version much more precise, but is "the longest sequence of code units that could be a start of some valid code point" really correct? e.g. an overlong encoding isn't a start of a valid character. Sorry, I don't know what the best description could be, but Stefan can probably help.
Reply: I have made another shot (more verbose) and with one additional example.
||
|
||
```jldoctest unicodestring | ||
julia> s = "\xc0\xa0\xe2\x88\xe2|" | ||
"\xc0\xa0\xe2\x88\xe2|" | ||
|
||
julia> foreach(display, s) | ||
'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space) | ||
'\xe2\x88': Malformed UTF-8 (category Ma: Malformed, bad data) | ||
'\xe2': Malformed UTF-8 (category Ma: Malformed, bad data) | ||
'|': ASCII/Unicode U+007c (category Sm: Symbol, math) | ||
|
||
julia> isvalid.(collect(s)) | ||
4-element BitArray{1}: | ||
false | ||
false | ||
false | ||
true | ||
``` | ||
|
||
We can see that first two code units in `s` form an overlong encoding of space character. | ||
Review comment: "We can see that +the+ first two..." Also, I'd suggest
Reply: fixed
||
It is invalid, but is accepted in a string as a single character. | ||
Next two code units form a valid start of a three byte UTF-8 sequence. However, fifth code unit | ||
Review comment: "The next two". "three-byte". "The fifth".
Reply: fixed
||
`\xe2` is not its valid continuation. Therefore code units 3 and 4 form a second malformed | ||
character in this string. Similarly code unit 5 forms a malformed character because | ||
because `|` is not a valid continuation. | ||
Review comment: Twice "because".
Reply: fixed
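The grouping of code units into characters described above can also be read off from the string's indices. The following is a small sketch using the same `s`; the names `starts` and `widths` are illustrative.

```julia
s = "\xc0\xa0\xe2\x88\xe2|"

# Start index of each character, valid or not.
starts = collect(eachindex(s))                       # [1, 3, 5, 6]

# Width of each character in code units: the gap to the next start.
widths = diff(vcat(starts, ncodeunits(s) + 1))       # [2, 2, 1, 1]

length(s)     # 4 characters over 6 code units
isvalid(s)    # false, since three of the four characters are invalid
```

The widths match the walkthrough: the overlong space and the truncated three-byte sequence each take two code units, while the lone `\xe2` and `|` take one each.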
||
|
||
Julia uses the UTF-8 encoding by default, and support for new encodings can be added by packages. | ||
For example, the [LegacyStrings.jl](https://github.com/JuliaArchive/LegacyStrings.jl) package | ||
implements `UTF16String` and `UTF32String` types. Additional discussion of other encodings and | ||
|
@@ -371,6 +400,34 @@ julia> string(greet, ", ", whom, ".\n") | |
"Hello, world.\n" | ||
``` | ||
|
||
A situation which is important to be aware of is when invalid UTF-8 strings are concatenated. | ||
In that case string may contain different characters than those that constitute concatenated | ||
Review comment: "the resulting string" and "that constitute input strings"? Below, typo "sting". "such a string" could just be "its [number of characters]".
Reply: fixed. thank you for a review.
||
stings and number of characters in such a string may be lower than sum of numbers of | ||
characters of the concatenated strings, e.g.: | ||
|
||
```jldoctest stringconcat | ||
julia> a, b = "\xe2\x88", "\x80" | ||
("\xe2\x88", "\x80") | ||
|
||
julia> c = a*b | ||
"∀" | ||
|
||
julia> collect.([a, b, c]) | ||
3-element Array{Array{Char,1},1}: | ||
['\xe2\x88'] | ||
['\x80'] | ||
['∀'] | ||
|
||
julia> length.([a, b, c]) | ||
3-element Array{Int64,1}: | ||
1 | ||
1 | ||
1 | ||
``` | ||
|
||
This situation can happen only for invalid UTF-8 strings. For valid UTF-8 strings | ||
concatenation preserves all characters in strings and additivity of string lengths. | ||
|
||
Julia also provides `*` for string concatenation: | ||
|
||
```jldoctest stringconcat | ||
|
Review comment: This line has no output
Reply: fixed (a copy-paste glitch)