Document invalid UTF-8 indexing and concatenation #26952

Merged
merged 8 commits into from
Jun 1, 2018
Changes from 2 commits
104 changes: 72 additions & 32 deletions base/strings/basic.jl
@@ -352,31 +352,32 @@ end
If `i` is in bounds in `s` return the index of the start of the character whose
encoding code unit `i` is part of. In other words, if `i` is the start of a
character, return `i`; if `i` is not the start of a character, rewind until the
start of a character and return that index. If `i` is out of bounds in `s`
return `i`.
start of a character and return that index. If `i` is equal to 0 or `ncodeunits(s)+1`
return `i`. In all other cases throw `BoundsError`.

# Examples
```jldoctest
julia> thisind("αβγdef", -5)
-5
julia> thisind("α", 0)
0

julia> thisind("αβγdef", 1)
julia> thisind("α", 1)
1

julia> thisind("αβγdef", 3)
3
julia> thisind("α", 2)
1

julia> thisind("αβγdef", 4)
julia> thisind("α", 3)
3

julia> thisind("αβγdef", 9)
9

julia> thisind("αβγdef", 10)
10
julia> thisind("α", 4)
Member

This line has no output

Member Author

fixed (a copy-paste glitch)

ERROR: BoundsError: attempt to access "α"
at index [4]
[...]

julia> thisind("αβγdef", 20)
20
julia> thisind("α", -1)
ERROR: BoundsError: attempt to access "α"
at index [-1]
[...]
```
"""
thisind(s::AbstractString, i::Integer) = thisind(s, Int(i))
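
As a quick check of the revised `thisind` contract, here is a minimal sketch. It assumes a Julia build that already includes this change; `throws_bounds` is a hypothetical helper introduced only for this illustration and is not part of Base.

```julia
# Hypothetical helper (not part of Base): true if calling `f` throws a BoundsError.
function throws_bounds(f)
    try
        f()
        return false
    catch err
        return err isa BoundsError
    end
end

s = "α"                       # one character stored in two code units
@assert ncodeunits(s) == 2

@assert thisind(s, 0) == 0    # 0 is returned unchanged
@assert thisind(s, 1) == 1    # start of the character
@assert thisind(s, 2) == 1    # continuation unit: rewind to the start
@assert thisind(s, 3) == 3    # ncodeunits(s) + 1 is returned unchanged

# Anything outside 0:ncodeunits(s)+1 is expected to throw.
@assert throws_bounds(() -> thisind(s, 4))
@assert throws_bounds(() -> thisind(s, -1))
```

The asserted values are the ones listed in the new doctest above, so the sketch only restates that table in executable form.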
@@ -394,27 +395,45 @@ end
"""
prevind(str::AbstractString, i::Integer, n::Integer=1) -> Int

* Case `n == 1`

If `i` is in bounds in `s` return the index of the start of the character whose
encoding starts before index `i`. In other words, if `i` is the start of a
character, return the start of the previous character; if `i` is not the start
of a character, rewind until the start of a character and return that index.
If `i` is out of bounds in `s` return `i - 1`. If `n == 0` return `i`.
If `i` is equal to `1` return `0`.
If `i` is equal to `ncodeunits(str)+1` return `lastindex(str)`.
Otherwise throw `BoundsError`.

* Case `n > 1`

Behaves like applying `n` times `prevind` for `n==1`. The only difference
is that if `n` is so large that applying `prevind` would reach `0` then each remaining
iteration decreases the returned value by `1`.
This means that in this case `prevind` can return a negative value.

* Case `n == 0`

Return `i` only if `i` is a valid index in `str` or is equal to `ncodeunits(str)+1`.
Otherwise `StringIndexError` or `BoundsError` is thrown.

# Examples
```jldoctest
julia> prevind("αβγdef", 3)
julia> prevind("α", 3)
1

julia> prevind("αβγdef", 1)
julia> prevind("α", 1)
0

julia> prevind("αβγdef", 0)
ERROR: BoundsError: attempt to access "αβγdef"
julia> prevind("α", 0)
ERROR: BoundsError: attempt to access "α"
at index [0]
Stacktrace:
[...]

julia> prevind("αβγdef", 3, 2)
julia> prevind("α", 2, 2)
0

julia> prevind("α", 2, 3)
-1
```
"""
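
The three `prevind` cases can be exercised with a short sketch. It assumes a Julia build that includes this change; `throws_bounds` is a hypothetical helper written for this illustration, not a Base function.

```julia
# Hypothetical helper (not part of Base): true if calling `f` throws a BoundsError.
function throws_bounds(f)
    try
        f()
        return false
    catch err
        return err isa BoundsError
    end
end

s = "α"                          # 2 code units, 1 character

# Case n == 1
@assert prevind(s, 3) == 1       # ncodeunits(s)+1 -> lastindex(s)
@assert prevind(s, 1) == 0       # first index -> 0
@assert throws_bounds(() -> prevind(s, 0))

# Case n > 1: once 0 is reached, each remaining step subtracts 1
@assert prevind(s, 2, 1) == 1
@assert prevind(s, 2, 2) == 0
@assert prevind(s, 2, 3) == -1   # the result can go negative
```

Reading the `n > 1` rule literally, `prevind(s, 2, 3)` continues one step past `0`, which is why the sketch expects `-1` there.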
@@ -436,25 +455,46 @@ end
"""
nextind(str::AbstractString, i::Integer, n::Integer=1) -> Int

* Case `n == 1`

If `i` is in bounds in `s` return the index of the start of the character whose
encoding starts after index `i`. If `i` is out of bounds in `s` return `i + 1`.
If `n == 0` return `i`.
encoding starts after index `i`. In other words, if `i` is the start of a
Member

I think this section needs to be indented (and the same below) if these should render as bullet points, right?

Member Author

fixed, thanks.

character, return the start of the next character; if `i` is not the start
of a character, move forward until the start of a character and return that index.
If `i` is equal to `0` return `1`.
If `i` is in bounds but greater or equal to `lastindex(str)` return `ncodeunits(str)+1`.
Otherwise throw `BoundsError`.

* Case `n > 1`

Behaves like applying `n` times `nextind` for `n==1`. The only difference
is that if `n` is so large that applying `nextind` would reach `ncodeunits(str)+1` then each
remaining iteration increases the returned value by `1`.
This means that in this case `nextind` can return a value greater than `ncodeunits(str)+1`.

* Case `n == 0`

Return `i` only if `i` is a valid index in `s` or is equal to `0`.
Otherwise `StringIndexError` or `BoundsError` is thrown.

# Examples
```jldoctest
julia> str = "αβγdef";
julia> nextind("α", 0)
1

julia> nextind(str, 1)
julia> nextind("α", 1)
3

julia> nextind(str, 1, 2)
5
julia> nextind("α", 3)
ERROR: BoundsError: attempt to access "α"
at index [3]
[...]

julia> lastindex(str)
9
julia> nextind("α", 0, 2)
3

julia> nextind(str, 9)
10
julia> nextind("α", 1, 2)
4
```
"""
nextind(s::AbstractString, i::Integer, n::Integer) = nextind(s, Int(i), Int(n))
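
A matching sketch for `nextind`, mirroring the doctest values above. It assumes a Julia build that includes this change; `throws_bounds` is a hypothetical helper for this illustration only.

```julia
# Hypothetical helper (not part of Base): true if calling `f` throws a BoundsError.
function throws_bounds(f)
    try
        f()
        return false
    catch err
        return err isa BoundsError
    end
end

s = "α"                          # 2 code units, 1 character

# Case n == 1
@assert nextind(s, 0) == 1       # 0 -> first index
@assert nextind(s, 1) == 3       # at lastindex(s) -> ncodeunits(s)+1
@assert throws_bounds(() -> nextind(s, 3))

# Case n > 1: once ncodeunits(s)+1 is reached, each remaining step adds 1
@assert nextind(s, 0, 2) == 3
@assert nextind(s, 1, 2) == 4    # the result can exceed ncodeunits(s)+1
```

Note the asymmetry with `prevind`: `nextind` accepts `0` but rejects `ncodeunits(s)+1`, while `prevind` accepts `ncodeunits(s)+1` but rejects `0`.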
57 changes: 57 additions & 0 deletions doc/src/manual/strings.md
@@ -348,6 +348,35 @@ x
y
```

Strings in Julia can contain invalid UTF-8 code unit sequences. This convention allows to
treat any byte sequence as a `String`. In such situations a rule is that characters are formed
by longest possibly valid sequences of code points. This rule is best explained by an example:
Member

"possibly valid" isn't very explicit.

Member Author

I have tried to improve it (but it is hard).

Member

Yeah. The new version is much more precise, but is "the longest sequence of code units that could be a start of some valid code point" really correct? E.g. an overlong encoding isn't the start of a valid character. Sorry, I don't know what the best description would be, but Stefan can probably help.

Member Author

I have made another shot (more verbose) and with one additional example.


```jldoctest unicodestring
julia> s = "\xc0\xa0\xe2\x88\xe2|"
"\xc0\xa0\xe2\x88\xe2|"

julia> foreach(display, s)
'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)
'\xe2\x88': Malformed UTF-8 (category Ma: Malformed, bad data)
'\xe2': Malformed UTF-8 (category Ma: Malformed, bad data)
'|': ASCII/Unicode U+007c (category Sm: Symbol, math)

julia> isvalid.(collect(s))
4-element BitArray{1}:
false
false
false
true
```

We can see that first two code units in `s` form an overlong encoding of space character.
Contributor

"We can see that +the+ first two..."

Also, I'd suggest ...in `s`... --> ...in the string `s`..., to make the sentence structure more evident.

Member Author

fixed

It is invalid, but is accepted in a string as a single character.
Next two code units form a valid start of a three byte UTF-8 sequence. However, fifth code unit
Member

"The next two". "three-byte". "The fifth".

Member Author

fixed

`\xe2` is not its valid continuation. Therefore code units 3 and 4 form a second malformed
character in this string. Similarly code unit 5 forms a malformed character because
because `|` is not a valid continuation.
Member

Twice "because".

Member Author

fixed
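
To make the character boundaries in the six-byte example string concrete, here is a minimal sketch that checks where each character starts. It assumes a Julia build with the semantics described in this PR; the variable names are illustrative only.

```julia
s = "\xc0\xa0\xe2\x88\xe2|"          # 6 code units, not valid UTF-8 as a whole
chars = collect(s)

@assert ncodeunits(s) == 6
@assert length(chars) == 4            # four characters, three of them invalid
@assert !isvalid(s)

# Characters start at code units 1, 3, 5 and 6:
#   1-2: overlong encoding of a space
#   3-4: truncated three-byte sequence
#   5  : lone lead byte, since '|' cannot continue it
#   6  : the valid ASCII character '|'
@assert collect(eachindex(s)) == [1, 3, 5, 6]
@assert isvalid.(chars) == [false, false, false, true]
@assert chars[end] == '|'
```

Iterating with `eachindex` rather than `1:ncodeunits(s)` is what keeps indexing safe here, since indices 2 and 4 fall inside a character.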


Julia uses the UTF-8 encoding by default, and support for new encodings can be added by packages.
For example, the [LegacyStrings.jl](https://github.com/JuliaArchive/LegacyStrings.jl) package
implements `UTF16String` and `UTF32String` types. Additional discussion of other encodings and
@@ -371,6 +400,34 @@ julia> string(greet, ", ", whom, ".\n")
"Hello, world.\n"
```

A situation which is important to be aware of is when invalid UTF-8 strings are concatenated.
In that case string may contain different characters than those that constitute concatenated
Member

"the resulting string" and "that constitute input strings"?

Below, typo "sting". "such a string" could just be "its [number of characters]".

Member Author

fixed. thank you for a review.

stings and number of characters in such a string may be lower than sum of numbers of
characters of the concatenated strings, e.g.:

```jldoctest stringconcat
julia> a, b = "\xe2\x88", "\x80"
("\xe2\x88", "\x80")

julia> c = a*b
"∀"

julia> collect.([a, b, c])
3-element Array{Array{Char,1},1}:
['\xe2\x88']
['\x80']
['∀']

julia> length.([a, b, c])
3-element Array{Int64,1}:
1
1
1
```

This situation can happen only for invalid UTF-8 strings. For valid UTF-8 strings
concatenation preserves all characters in strings and additivity of string lengths.
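
The loss of length additivity can be checked directly. The sketch below reuses the strings from the doctest and assumes a Julia build with the semantics described in this PR; it also shows that the code unit count, unlike the character count, always adds up.

```julia
a, b = "\xe2\x88", "\x80"     # a truncated lead sequence and a lone continuation byte
c = a * b                     # the bytes E2 88 80 together encode '∀' (U+2200)

@assert c == "∀"
@assert !isvalid(a) && !isvalid(b)
@assert isvalid(c)            # two invalid inputs produce a valid result

# Character counts are not additive for invalid inputs...
@assert length(a) + length(b) == 2
@assert length(c) == 1

# ...but code unit counts are, because concatenation only joins bytes.
@assert ncodeunits(a) + ncodeunits(b) == ncodeunits(c)
```

Code that must preallocate or track offsets across concatenations of untrusted byte data may therefore be better served by `ncodeunits` than by `length`.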

Julia also provides `*` for string concatenation:

```jldoctest stringconcat