Document invalid UTF-8 indexing and concatenation #26952

Merged (8 commits) on Jun 1, 2018
Changes from 7 commits
100 changes: 73 additions & 27 deletions base/strings/basic.jl
@@ -365,25 +365,32 @@ end
If `i` is in bounds in `s` return the index of the start of the character whose
encoding code unit `i` is part of. In other words, if `i` is the start of a
character, return `i`; if `i` is not the start of a character, rewind until the
start of a character and return that index. If `i` is out of bounds in `s`
return `i`.
start of a character and return that index. If `i` is equal to 0 or `ncodeunits(s)+1`
return `i`. In all other cases throw `BoundsError`.

# Examples
```jldoctest
julia> thisind("αβγdef", 1)
julia> thisind("α", 0)
0

julia> thisind("α", 1)
1

julia> thisind("αβγdef", 3)
3
julia> thisind("α", 2)
1

julia> thisind("αβγdef", 4)
julia> thisind("α", 3)
3

julia> thisind("αβγdef", 9)
9
julia> thisind("α", 4)
[Review comment, Member] This line has no output

[Reply, Member Author] fixed (a copy-paste glitch)

ERROR: BoundsError: attempt to access "α"
at index [4]
[...]

julia> thisind("αβγdef", 10)
10
julia> thisind("α", -1)
ERROR: BoundsError: attempt to access "α"
at index [-1]
[...]
```
"""
thisind(s::AbstractString, i::Integer) = thisind(s, Int(i))
@@ -401,28 +408,46 @@ end
"""
prevind(str::AbstractString, i::Integer, n::Integer=1) -> Int

* Case `n == 1`

If `i` is in bounds in `s` return the index of the start of the character whose
encoding starts before index `i`. In other words, if `i` is the start of a
character, return the start of the previous character; if `i` is not the start
of a character, rewind until the start of a character and return that index.
If `i` is out of bounds in `s` return `i - 1`. If `n == 0` return `i`.
If `i` is equal to `1` return `0`.
If `i` is equal to `ncodeunits(str)+1` return `lastindex(str)`.
Otherwise throw `BoundsError`.

* Case `n > 1`

Behaves like applying `n` times `prevind` for `n==1`. The only difference
is that if `n` is so large that applying `prevind` would reach `0` then each remaining
iteration decreases the returned value by `1`.
This means that in this case `prevind` can return a negative value.

* Case `n == 0`

Return `i` only if `i` is a valid index in `str` or is equal to `ncodeunits(str)+1`.
Otherwise `StringIndexError` or `BoundsError` is thrown.

# Examples
```jldoctest
julia> prevind("αβγdef", 3)
julia> prevind("α", 3)
1

julia> prevind("αβγdef", 1)
julia> prevind("α", 1)
0

julia> prevind("αβγdef", 0)
ERROR: BoundsError: attempt to access "αβγdef"
julia> prevind("α", 0)
ERROR: BoundsError: attempt to access "α"
at index [0]
Stacktrace:
[...]

julia> prevind("αβγdef", 3, 2)
julia> prevind("α", 2, 2)
0

julia> prevind("α", 2, 3)
-1
```
"""
prevind(s::AbstractString, i::Integer, n::Integer) = prevind(s, Int(i), Int(n))
@@ -443,25 +468,46 @@ end
"""
nextind(str::AbstractString, i::Integer, n::Integer=1) -> Int

* Case `n == 1`

If `i` is in bounds in `s` return the index of the start of the character whose
encoding starts after index `i`. If `i` is out of bounds in `s` return `i + 1`.
If `n == 0` return `i`.
encoding starts after index `i`. In other words, if `i` is the start of a
[Review comment, Member] I think this section needs to be indented (and the same below) if it should render as bullet points, right?

[Reply, Member Author] fixed, thanks.

character, return the start of the next character; if `i` is not the start
of a character, move forward until the start of a character and return that index.
If `i` is equal to `0` return `1`.
If `i` is in bounds but greater than or equal to `lastindex(str)` return `ncodeunits(str)+1`.
Otherwise throw `BoundsError`.

* Case `n > 1`

Behaves like applying `n` times `nextind` for `n==1`. The only difference
is that if `n` is so large that applying `nextind` would reach `ncodeunits(str)+1` then each
remaining iteration increases the returned value by `1`.
This means that in this case `nextind` can return a value greater than `ncodeunits(str)+1`.

* Case `n == 0`

Return `i` only if `i` is a valid index in `str` or is equal to `0`.
Otherwise `StringIndexError` or `BoundsError` is thrown.

# Examples
```jldoctest
julia> str = "αβγdef";
julia> nextind("α", 0)
1

julia> nextind(str, 1)
julia> nextind("α", 1)
3

julia> nextind(str, 1, 2)
5
julia> nextind("α", 3)
ERROR: BoundsError: attempt to access "α"
at index [3]
[...]

julia> lastindex(str)
9
julia> nextind("α", 0, 2)
3

julia> nextind(str, 9)
10
julia> nextind("α", 1, 2)
4
```
"""
nextind(s::AbstractString, i::Integer, n::Integer) = nextind(s, Int(i), Int(n))
76 changes: 76 additions & 0 deletions doc/src/manual/strings.md
@@ -348,6 +348,54 @@ x
y
```

Strings in Julia can contain invalid UTF-8 code unit sequences. This convention allows any
byte sequence to be treated as a `String`. In such situations, the rule is that when parsing
a sequence of code units from left to right, characters are formed by the longest sequence of
8-bit code units that matches the start of one of the following bit patterns
(each `x` can be `0` or `1`):

* `0xxxxxxx`;
* `110xxxxx` `10xxxxxx`;
* `1110xxxx` `10xxxxxx` `10xxxxxx`;
* `11110xxx` `10xxxxxx` `10xxxxxx` `10xxxxxx`;
* `10xxxxxx`;
* `11111xxx`.

In particular, this implies that overlong and too-high code unit sequences are accepted.
This rule is best explained by an example:

```julia-repl
julia> s = "\xc0\xa0\xe2\x88\xe2|"
"\xc0\xa0\xe2\x88\xe2|"

julia> foreach(display, s)
'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)
'\xe2\x88': Malformed UTF-8 (category Ma: Malformed, bad data)
'\xe2': Malformed UTF-8 (category Ma: Malformed, bad data)
'|': ASCII/Unicode U+007c (category Sm: Symbol, math)

julia> isvalid.(collect(s))
4-element BitArray{1}:
false
false
false
true

julia> s2 = "\xf7\xbf\xbf\xbf"
"\U1fffff"

julia> foreach(display, s2)
'\U1fffff': Unicode U+1fffff (category In: Invalid, too high)
```

We can see that the first two code units in the string `s` form an overlong encoding of
the space character. It is invalid, but is accepted in a string as a single character.
The next two code units form a valid start of a three-byte UTF-8 sequence. However, the fifth
code unit `\xe2` is not a valid continuation of it. Therefore code units 3 and 4 together are
interpreted as a single malformed character in this string. Similarly, code unit 5 forms a
malformed character on its own because `|` is not a valid continuation of it. Finally, the
string `s2` contains one code point that is too high.
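
The longest-match rule described above can be sketched in a few lines of Julia. This is only an illustration of the rule as stated in the text, not the implementation used in Base; the names `expected_length` and `split_code_units` are hypothetical helpers introduced for this example.

```julia
# Total length implied by a lead code unit, following the bit patterns listed above.
function expected_length(b::UInt8)
    b & 0x80 == 0x00 && return 1  # 0xxxxxxx
    b & 0xe0 == 0xc0 && return 2  # 110xxxxx
    b & 0xf0 == 0xe0 && return 3  # 1110xxxx
    b & 0xf8 == 0xf0 && return 4  # 11110xxx
    return 1                      # 10xxxxxx or 11111xxx: always a single unit
end

# Split code units into per-character groups using the longest-match rule.
function split_code_units(units::Vector{UInt8})
    groups = Vector{UInt8}[]
    i = 1
    while i <= length(units)
        stop = min(i + expected_length(units[i]) - 1, length(units))
        j = i
        # Extend the group only while the next unit is a continuation unit (10xxxxxx).
        while j < stop && units[j + 1] & 0xc0 == 0x80
            j += 1
        end
        push!(groups, units[i:j])
        i = j + 1
    end
    return groups
end

s = "\xc0\xa0\xe2\x88\xe2|"
groups = split_code_units(collect(codeunits(s)))
length.(groups)  # [2, 2, 1, 1]
```

The group sizes agree with the four characters that iteration over `s` produces in the example above: two code units each for the overlong space and the truncated three-byte sequence, then one each for the stray `\xe2` and for `|`.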

Julia uses the UTF-8 encoding by default, and support for new encodings can be added by packages.
For example, the [LegacyStrings.jl](https://github.com/JuliaArchive/LegacyStrings.jl) package
implements `UTF16String` and `UTF32String` types. Additional discussion of other encodings and
@@ -371,6 +419,34 @@ julia> string(greet, ", ", whom, ".\n")
"Hello, world.\n"
```

It is important to be aware of what happens when invalid UTF-8 strings are concatenated.
In that case the resulting string may contain different characters than the input strings,
and its number of characters may be lower than the sum of the numbers of characters
of the concatenated strings, e.g.:

```julia-repl
julia> a, b = "\xe2\x88", "\x80"
("\xe2\x88", "\x80")

julia> c = a*b
"∀"

julia> collect.([a, b, c])
3-element Array{Array{Char,1},1}:
['\xe2\x88']
['\x80']
['∀']

julia> length.([a, b, c])
3-element Array{Int64,1}:
1
1
1
```

This situation can happen only for invalid UTF-8 strings. For valid UTF-8 strings,
concatenation preserves all characters in the strings and the additivity of string lengths.
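
As a rough sketch of how this could be checked in practice, the snippet below reuses the strings from the example and compares validity, character counts, and code unit counts before and after concatenation. It assumes nothing beyond `isvalid`, `length`, and `ncodeunits` from Base.

```julia
a, b = "\xe2\x88", "\x80"
c = a * b

# Neither input is valid UTF-8 on its own, but their concatenation is.
validity = (isvalid(a), isvalid(b), isvalid(c))

# Character counts are not additive for these invalid inputs.
chars_additive = length(a) + length(b) == length(c)

# Code unit counts are additive regardless of validity.
units_additive = ncodeunits(a) + ncodeunits(b) == ncodeunits(c)

println(validity)        # (false, false, true)
println(chars_additive)  # false
println(units_additive)  # true
```

Checking `isvalid` on both inputs before concatenating is one way to know in advance whether character counts will add up; code unit counts add up in either case.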

Julia also provides `*` for string concatenation:

```jldoctest stringconcat