JuliaLang · fredrikekre · Jun 1, 2018 · May 1, 2018 · May 5, 2018 · May 12, 2018
diff --git a/base/strings/basic.jl b/base/strings/basic.jl
@@ -352,31 +352,32 @@ end
 If `i` is in bounds in `s` return the index of the start of the character whose
 encoding code unit `i` is part of. In other words, if `i` is the start of a
 character, return `i`; if `i` is not the start of a character, rewind until the
-start of a character and return that index. If `i` is out of bounds in `s`
-return `i`.
+start of a character and return that index. If `i` is equal to 0 or `ncodeunits(s)+1`
+return `i`. In all other cases throw `BoundsError`.
 
 # Examples
 ```jldoctest
-julia> thisind("αβγdef", -5)
--5
+julia> thisind("α", 0)
+0
 
-julia> thisind("αβγdef", 1)
+julia> thisind("α", 1)
 1
 
-julia> thisind("αβγdef", 3)
-3
+julia> thisind("α", 2)
+1
 
-julia> thisind("αβγdef", 4)
+julia> thisind("α", 3)
 3
 
-julia> thisind("αβγdef", 9)
-9
-
-julia> thisind("αβγdef", 10)
-10
+julia> thisind("α", 4)
+ERROR: BoundsError: attempt to access "α"
+  at index [4]
+[...]
 
-julia> thisind("αβγdef", 20)
-20
+julia> thisind("α", -1)
+ERROR: BoundsError: attempt to access "α"
+  at index [-1]
+[...]
 ```
 """
 thisind(s::AbstractString, i::Integer) = thisind(s, Int(i))
@@ -394,27 +395,45 @@ end
 """
     prevind(str::AbstractString, i::Integer, n::Integer=1) -> Int
 
+* Case `n == 1`
+
 If `i` is in bounds in `s` return the index of the start of the character whose
 encoding starts before index `i`. In other words, if `i` is the start of a
 character, return the start of the previous character; if `i` is not the start
 of a character, rewind until the start of a character and return that index.
-If `i` is out of bounds in `s` return `i - 1`. If `n == 0` return `i`.
+If `i` is equal to `1` return `0`.
+If `i` is equal to `ncodeunits(str)+1` return `lastindex(str)`.
+Otherwise throw `BoundsError`.
+
+* Case `n > 1`
+
+Behaves like applying `prevind` with `n == 1` a total of `n` times. The only difference
+is that if `n` is so large that applying `prevind` would reach `0`, then each remaining
+iteration decreases the returned value by `1`.
+This means that in this case `prevind` can return a negative value.
+
+* Case `n == 0`
+
+Return `i` only if `i` is a valid index in `str` or is equal to `ncodeunits(str)+1`.
+Otherwise `StringIndexError` or `BoundsError` is thrown.
 
 # Examples
 ```jldoctest
-julia> prevind("αβγdef", 3)
+julia> prevind("α", 3)
 1
 
-julia> prevind("αβγdef", 1)
+julia> prevind("α", 1)
 0
 
-julia> prevind("αβγdef", 0)
-ERROR: BoundsError: attempt to access "αβγdef"
+julia> prevind("α", 0)
+ERROR: BoundsError: attempt to access "α"
   at index [0]
-Stacktrace:
 [...]
 
-julia> prevind("αβγdef", 3, 2)
+julia> prevind("α", 2, 2)
+0
+
+julia> prevind("α", 2, 3)
-0
+-1
 ```
 """
@@ -436,25 +455,46 @@ end
 """
     nextind(str::AbstractString, i::Integer, n::Integer=1) -> Int
 
+* Case `n == 1`
+
 If `i` is in bounds in `s` return the index of the start of the character whose
-encoding starts after index `i`. If `i` is out of bounds in `s` return `i + 1`.
-If `n == 0` return `i`.
+encoding starts after index `i`. In other words, if `i` is the start of a
+character, return the start of the next character; if `i` is not the start
+of a character, move forward until the start of a character and return that index.
+If `i` is equal to `0` return `1`.
+If `i` is in bounds but greater than or equal to `lastindex(str)` return `ncodeunits(str)+1`.
+Otherwise throw `BoundsError`.
+
+* Case `n > 1`
+
+Behaves like applying `nextind` with `n == 1` a total of `n` times. The only difference
+is that if `n` is so large that applying `nextind` would reach `ncodeunits(str)+1`, then
+each remaining iteration increases the returned value by `1`.
+This means that in this case `nextind` can return a value greater than `ncodeunits(str)+1`.
+
+* Case `n == 0`
+
+Return `i` only if `i` is a valid index in `str` or is equal to `0`.
+Otherwise `StringIndexError` or `BoundsError` is thrown.
 
 # Examples
 ```jldoctest
-julia> str = "αβγdef";
+julia> nextind("α", 0)
+1
 
-julia> nextind(str, 1)
+julia> nextind("α", 1)
 3
 
-julia> nextind(str, 1, 2)
-5
+julia> nextind("α", 3)
+ERROR: BoundsError: attempt to access "α"
+  at index [3]
+[...]
 
-julia> lastindex(str)
-9
+julia> nextind("α", 0, 2)
+3
 
-julia> nextind(str, 9)
-10
+julia> nextind("α", 1, 2)
+4
 ```
 """
 nextind(s::AbstractString, i::Integer, n::Integer) = nextind(s, Int(i), Int(n))

diff --git a/doc/src/manual/strings.md b/doc/src/manual/strings.md
@@ -348,6 +348,35 @@ x
 y
 ```
 
+Strings in Julia can contain invalid UTF-8 code unit sequences. This convention allows any
+byte sequence to be treated as a `String`. In such situations the rule is that characters are
+formed by the longest possibly valid sequences of code units. An example explains it best:
+
+```jldoctest unicodestring
+julia> s = "\xc0\xa0\xe2\x88\xe2|"
+"\xc0\xa0\xe2\x88\xe2|"
+
+julia> foreach(display, s)
+'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)
+'\xe2\x88': Malformed UTF-8 (category Ma: Malformed, bad data)
+'\xe2': Malformed UTF-8 (category Ma: Malformed, bad data)
+'|': ASCII/Unicode U+007c (category Sm: Symbol, math)
+
+julia> isvalid.(collect(s))
+4-element BitArray{1}:
+ false
+ false
+ false
+  true
+```
+
+We can see that the first two code units in `s` form an overlong encoding of the space
+character. It is invalid, but is accepted in a string as a single character.
+The next two code units form a valid start of a three-byte UTF-8 sequence. However, the fifth
+code unit `\xe2` is not a valid continuation of it. Therefore code units 3 and 4 form a
+malformed character in this string. Similarly, code unit 5 forms a malformed character
+because `|` is not a valid continuation of it.
+
 Julia uses the UTF-8 encoding by default, and support for new encodings can be added by packages.
 For example, the [LegacyStrings.jl](https://github.com/JuliaArchive/LegacyStrings.jl) package
 implements `UTF16String` and `UTF32String` types. Additional discussion of other encodings and
@@ -371,6 +400,34 @@ julia> string(greet, ", ", whom, ".\n")
 "Hello, world.\n"
 ```
 
+An important situation to be aware of is the concatenation of invalid UTF-8 strings.
+In that case the resulting string may contain different characters than the concatenated
+strings do, and the number of characters in it may be lower than the sum of the numbers of
+characters of the concatenated strings, e.g.:
+
+```jldoctest stringconcat
+julia> a, b = "\xe2\x88", "\x80"
+("\xe2\x88", "\x80")
+
+julia> c = a*b
+"∀"
+
+julia> collect.([a, b, c])
+3-element Array{Array{Char,1},1}:
+ ['\xe2\x88']
+ ['\x80']
+ ['∀']
+
+julia> length.([a, b, c])
+3-element Array{Int64,1}:
+ 1
+ 1
+ 1
+```
+
+This situation can happen only for invalid UTF-8 strings. For valid UTF-8 strings,
+concatenation preserves all characters in the strings and the additivity of string lengths.
+
 Julia also provides `*` for string concatenation:
 
 ```jldoctest stringconcat