JuliaLang · fredrikekre · Jun 1, 2018 · May 1, 2018 · May 5, 2018 · May 12, 2018
diff --git a/base/strings/basic.jl b/base/strings/basic.jl
@@ -365,25 +365,32 @@ end
 If `i` is in bounds in `s` return the index of the start of the character whose
 encoding code unit `i` is part of. In other words, if `i` is the start of a
 character, return `i`; if `i` is not the start of a character, rewind until the
-start of a character and return that index. If `i` is out of bounds in `s`
-return `i`.
+start of a character and return that index. If `i` is equal to 0 or `ncodeunits(s)+1`
+return `i`. In all other cases throw `BoundsError`.
 
 # Examples
 ```jldoctest
-julia> thisind("αβγdef", 1)
+julia> thisind("α", 0)
+0
+
+julia> thisind("α", 1)
 1
 
-julia> thisind("αβγdef", 3)
-3
+julia> thisind("α", 2)
+1
 
-julia> thisind("αβγdef", 4)
+julia> thisind("α", 3)
 3
 
-julia> thisind("αβγdef", 9)
-9
+julia> thisind("α", 4)
+ERROR: BoundsError: attempt to access "α"
+  at index [4]
+[...]
 
-julia> thisind("αβγdef", 10)
-10
+julia> thisind("α", -1)
+ERROR: BoundsError: attempt to access "α"
+  at index [-1]
+[...]
 ```
 """
 thisind(s::AbstractString, i::Integer) = thisind(s, Int(i))
@@ -401,28 +408,46 @@ end
 """
     prevind(str::AbstractString, i::Integer, n::Integer=1) -> Int
 
-If `i` is in bounds in `s` return the index of the start of the character whose
-encoding starts before index `i`. In other words, if `i` is the start of a
-character, return the start of the previous character; if `i` is not the start
-of a character, rewind until the start of a character and return that index.
-If `i` is out of bounds in `s` return `i - 1`. If `n == 0` return `i`.
+* Case `n == 1`
+
+  If `i` is in bounds in `str` return the index of the start of the character whose
+  encoding starts before index `i`. In other words, if `i` is the start of a
+  character, return the start of the previous character; if `i` is not the start
+  of a character, rewind until the start of a character and return that index.
+  If `i` is equal to `1` return `0`.
+  If `i` is equal to `ncodeunits(str)+1` return `lastindex(str)`.
+  Otherwise throw `BoundsError`.
+
+* Case `n > 1`
+
+  Behaves like applying `prevind` with `n == 1` repeatedly, `n` times. The only difference
+  is that if `n` is so large that applying `prevind` would reach `0`, then each remaining
+  iteration decreases the returned value by `1`.
+  This means that in this case `prevind` can return a negative value.
+
+* Case `n == 0`
+
+  Return `i` only if `i` is a valid index in `str` or is equal to `ncodeunits(str)+1`.
+  Otherwise `StringIndexError` or `BoundsError` is thrown.
 
 # Examples
 ```jldoctest
-julia> prevind("αβγdef", 3)
+julia> prevind("α", 3)
 1
 
-julia> prevind("αβγdef", 1)
+julia> prevind("α", 1)
 0
 
-julia> prevind("αβγdef", 0)
-ERROR: BoundsError: attempt to access "αβγdef"
+julia> prevind("α", 0)
+ERROR: BoundsError: attempt to access "α"
   at index [0]
-Stacktrace:
 [...]
 
-julia> prevind("αβγdef", 3, 2)
+julia> prevind("α", 2, 2)
 0
+
+julia> prevind("α", 2, 3)
+-1
 ```
 """
 prevind(s::AbstractString, i::Integer, n::Integer) = prevind(s, Int(i), Int(n))
@@ -443,25 +468,46 @@ end
 """
     nextind(str::AbstractString, i::Integer, n::Integer=1) -> Int
 
-If `i` is in bounds in `s` return the index of the start of the character whose
-encoding starts after index `i`. If `i` is out of bounds in `s` return `i + 1`.
-If `n == 0` return `i`.
+* Case `n == 1`
+
+  If `i` is in bounds in `str` return the index of the start of the character whose
+  encoding starts after index `i`. In other words, if `i` is the start of a
+  character, return the start of the next character; if `i` is not the start
+  of a character, move forward until the start of a character and return that index.
+  If `i` is equal to `0` return `1`.
+  If `i` is in bounds but greater than or equal to `lastindex(str)` return `ncodeunits(str)+1`.
+  Otherwise throw `BoundsError`.
+
+* Case `n > 1`
+
+  Behaves like applying `nextind` with `n == 1` repeatedly, `n` times. The only difference
+  is that if `n` is so large that applying `nextind` would reach `ncodeunits(str)+1`, then
+  each remaining iteration increases the returned value by `1`. This means that in this
+  case `nextind` can return a value greater than `ncodeunits(str)+1`.
+
+* Case `n == 0`
+
+  Return `i` only if `i` is a valid index in `str` or is equal to `0`.
+  Otherwise `StringIndexError` or `BoundsError` is thrown.
 
 # Examples
 ```jldoctest
-julia> str = "αβγdef";
+julia> nextind("α", 0)
+1
 
-julia> nextind(str, 1)
+julia> nextind("α", 1)
 3
 
-julia> nextind(str, 1, 2)
-5
+julia> nextind("α", 3)
+ERROR: BoundsError: attempt to access "α"
+  at index [3]
+[...]
 
-julia> lastindex(str)
-9
+julia> nextind("α", 0, 2)
+3
 
-julia> nextind(str, 9)
-10
+julia> nextind("α", 1, 2)
+4
 ```
 """
 nextind(s::AbstractString, i::Integer, n::Integer) = nextind(s, Int(i), Int(n))
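
The boundary behavior documented in the three docstrings above can be checked with a short script. This is a minimal sketch, assuming a Julia version that includes this change (0.7/1.0 or later); `throws_bounds` is a hypothetical helper, not part of Base, and every asserted value is taken from the doctest examples in the diff.

```julia
# Assumes the thisind/prevind/nextind semantics described in the docstrings above.
s = "α"   # a single character encoded in 2 code units, so ncodeunits(s) == 2

# thisind: 0 and ncodeunits(s)+1 are returned unchanged.
@assert thisind(s, 0) == 0
@assert thisind(s, 2) == 1                  # code unit 2 belongs to the character starting at 1
@assert thisind(s, ncodeunits(s) + 1) == 3

# prevind/nextind with n == 1 move between character starts and the two end positions.
@assert prevind(s, 3) == 1
@assert prevind(s, 1) == 0
@assert nextind(s, 0) == 1
@assert nextind(s, 1) == 3

# With n > 1 the result may fall outside 0:ncodeunits(s)+1.
@assert prevind(s, 2, 3) == -1
@assert nextind(s, 1, 2) == 4

# Hypothetical helper: true if calling f() throws a BoundsError.
function throws_bounds(f)
    try
        f()
        return false
    catch err
        return err isa BoundsError
    end
end

# Indices beyond the accepted range throw instead of being passed through.
@assert throws_bounds(() -> thisind(s, 4))
@assert throws_bounds(() -> prevind(s, 0))
@assert throws_bounds(() -> nextind(s, 3))
```

The single-character string keeps every case small enough to verify by hand against the doctests.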

diff --git a/doc/src/manual/strings.md b/doc/src/manual/strings.md
@@ -348,6 +348,54 @@ x
 y
 ```
 
+Strings in Julia can contain invalid UTF-8 code unit sequences. This convention allows
+any byte sequence to be treated as a `String`. In such situations the rule is that, when parsing
+a sequence of code units from left to right, characters are formed by the longest sequence of
+8-bit code units that matches the start of one of the following bit patterns
+(each `x` can be `0` or `1`):
+
+* `0xxxxxxx`;
+* `110xxxxx` `10xxxxxx`;
+* `1110xxxx` `10xxxxxx` `10xxxxxx`;
+* `11110xxx` `10xxxxxx` `10xxxxxx` `10xxxxxx`;
+* `10xxxxxx`;
+* `11111xxx`.
+
+In particular this implies that overlong and too high code unit sequences are accepted.
+This rule is best explained by an example:
+
+```julia-repl
+julia> s = "\xc0\xa0\xe2\x88\xe2|"
+"\xc0\xa0\xe2\x88\xe2|"
+
+julia> foreach(display, s)
+'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)
+'\xe2\x88': Malformed UTF-8 (category Ma: Malformed, bad data)
+'\xe2': Malformed UTF-8 (category Ma: Malformed, bad data)
+'|': ASCII/Unicode U+007c (category Sm: Symbol, math)
+
+julia> isvalid.(collect(s))
+4-element BitArray{1}:
+ false
+ false
+ false
+  true
+
+julia> s2 = "\xf7\xbf\xbf\xbf"
+"\U1fffff"
+
+julia> foreach(display, s2)
+'\U1fffff': Unicode U+1fffff (category In: Invalid, too high)
+```
+
+We can see that the first two code units in the string `s` form an overlong encoding of
+the space character. It is invalid, but is accepted in a string as a single character.
+The next two code units form a valid start of a three-byte UTF-8 sequence. However, the fifth
+code unit `\xe2` is not a valid continuation of it. Therefore code units 3 and 4 are
+interpreted as a single malformed character in this string. Similarly, code unit 5 forms a malformed
+character because `|` is not a valid continuation of it. Finally, the string `s2` contains
+one code point that is too high.
+
 Julia uses the UTF-8 encoding by default, and support for new encodings can be added by packages.
 For example, the [LegacyStrings.jl](https://github.com/JuliaArchive/LegacyStrings.jl) package
 implements `UTF16String` and `UTF32String` types. Additional discussion of other encodings and
@@ -371,6 +419,34 @@ julia> string(greet, ", ", whom, ".\n")
 "Hello, world.\n"
 ```
 
+It is important to be aware of what happens when invalid UTF-8 strings are concatenated.
+In that case the resulting string may contain different characters than the input strings,
+and its number of characters may be lower than the sum of the numbers of characters
+of the concatenated strings, e.g.:
+
+```julia-repl
+julia> a, b = "\xe2\x88", "\x80"
+("\xe2\x88", "\x80")
+
+julia> c = a*b
+"∀"
+
+julia> collect.([a, b, c])
+3-element Array{Array{Char,1},1}:
+ ['\xe2\x88']
+ ['\x80']
+ ['∀']
+
+julia> length.([a, b, c])
+3-element Array{Int64,1}:
+ 1
+ 1
+ 1
+```
+
+This situation can happen only for invalid UTF-8 strings. For valid UTF-8 strings,
+concatenation preserves all characters in the strings and the additivity of string lengths.
+
 Julia also provides `*` for string concatenation:
 
 ```jldoctest stringconcat