Document invalid UTF-8 indexing and concatenation #26952

bkamins · 2018-05-01T21:33:26Z

This PR is an attempt to address all documentation points from #25478.

Changes:

- documentation of how Julia identifies characters in malformed UTF-8;
- documentation of string concatenation for malformed UTF-8;
- documentation of `thisind`, `nextind`, `prevind`.

I am not a native speaker so if there are grammatical mistakes just push a patch to the descriptions 😄.

@StefanKarpinski the descriptions are complex unfortunately but I hope I have managed to cover all cases.

fredrikekre · 2018-05-05T15:57:50Z

base/strings/basic.jl


 # Examples
 ```jldoctest
-julia> thisind("αβγdef", -5)
-5
+julia> thisind("α", -1)


Perhaps start with an example that "works" and leave the BoundsError examples to last?

Moved (I wanted to show what happens if we increase the index, but I agree with your reasoning).
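
The ordering suggested in this thread can be sketched as follows. This is only an illustration, assuming the Julia 1.x behavior of `thisind` on the two-code-unit string `"α"`: the calls that return a value come first and the `BoundsError` cases come last. The helper `throws_bounds_error` is hypothetical and not part of the PR.

```julia
s = "α"                          # one character encoded in two code units

# In-bounds calls: thisind returns the start of the character containing index i.
a = thisind(s, 1)                # index 1 is the start of 'α'
b = thisind(s, 2)                # index 2 is a continuation byte of 'α'
println((a, b))

# The two positions just outside the string are returned unchanged.
c = thisind(s, 0)
d = thisind(s, ncodeunits(s) + 1)
println((c, d))

# Hypothetical helper: report whether calling f throws a BoundsError.
function throws_bounds_error(f)
    try
        f()
        return false
    catch err
        return err isa BoundsError
    end
end

# Anything further out of bounds throws.
println(throws_bounds_error(() -> thisind(s, -1)))
println(throws_bounds_error(() -> thisind(s, 4)))
```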

fredrikekre · 2018-05-05T15:58:15Z

base/strings/basic.jl

-
-julia> thisind("αβγdef", 20)
-20
+julia> thisind("α", 4)


This line has no output

fixed (a copy-paste glitch)

fredrikekre · 2018-05-05T15:58:49Z

base/strings/basic.jl

@@ -394,27 +396,41 @@ end
 """
    prevind(str::AbstractString, i::Integer, n::Integer=1) -> Int

+Case `n == 1`.


Perhaps make this a bullet point list instead?
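
To make the two cases of the `prevind` docstring concrete, here is a small sketch. The string `"αβγ"` is an assumption chosen for illustration (three characters of two code units each, starting at indices 1, 3 and 5); it is not taken from the PR.

```julia
s = "αβγ"                    # characters start at code unit indices 1, 3 and 5

# Case n == 1: move to the start of the character that begins before index i.
p1 = prevind(s, 5)           # from the start of 'γ' back to the start of 'β'
p2 = prevind(s, 6)           # index 6 is inside 'γ', so the result is the start of 'γ'
p3 = prevind(s, 1)           # stepping back from the first character gives 0

# Case n > 1: equivalent to n single steps as long as those steps stay in bounds.
p4 = prevind(s, ncodeunits(s) + 1, 2)   # back over 'γ' and 'β'

println((p1, p2, p3, p4))
```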

fredrikekre · 2018-05-05T15:59:37Z

base/strings/basic.jl

@@ -436,25 +452,42 @@ end
 """
    nextind(str::AbstractString, i::Integer, n::Integer=1) -> Int

+Case `n == 1`.


Same here, perhaps better with a bulleted list.
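
The corresponding cases for `nextind` can be sketched the same way. Again, the string `"αβγ"` is an illustrative assumption rather than an example from the PR.

```julia
s = "αβγ"                    # characters start at code unit indices 1, 3 and 5

# Case n == 1: move to the start of the character that begins after index i.
n1 = nextind(s, 0)           # from before the string to the first character
n2 = nextind(s, 1)           # from the start of 'α' to the start of 'β'
n3 = nextind(s, 2)           # index 2 is inside 'α'; the next character still starts at 3
n4 = nextind(s, 5)           # from the last character to ncodeunits(s) + 1

# Case n > 1: equivalent to n single steps as long as those steps stay in bounds.
n5 = nextind(s, 1, 2)        # from 'α' over 'β' to the start of 'γ'

println((n1, n2, n3, n4, n5))
```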

fredrikekre · 2018-05-05T16:00:50Z

doc/src/manual/strings.md

@@ -348,6 +348,35 @@ x
 y
 ```

+Strings in Julia can contain invalid UTF-8 code unit sequences. This is rule allows to accept


This is rule allows? :)

fredrikekre · 2018-05-05T16:01:41Z

doc/src/manual/strings.md

@@ -371,6 +400,34 @@ julia> string(greet, ", ", whom, ".\n")
 "Hello, world.\n"
 ```

+An important to be aware of situation is when invalid UTF-8 strings are concatenated.


Perhaps better as "A situation which is important to be aware of is when..."

Fixed. @fredrikekre thank you for the review 😄.

bkamins · 2018-05-05T17:37:46Z

@StefanKarpinski:
Following a discussion I had with @fredrikekre, there is one question regarding the functionality of `nextind`/`prevind`. Currently, if `n > 1`, they can return indices that cannot be obtained by calling them repeatedly with `n = 1`. E.g. `prevind` can produce a negative index.

I have documented this in the PR, but the question is whether it is intended or whether we want to throw an error in such situations.
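
The behavior in question can be reproduced with a short sketch, assuming Julia 1.x semantics. The helpers `step_n` and `outcome` are hypothetical and exist only for this illustration: `step_n` applies the one-step function `n` times, and `outcome` returns either the result or the exception that was thrown.

```julia
s = "α"                      # one character encoded in two code units

# Hypothetical helper: apply a one-step index function n times in a row.
function step_n(step, s, i, n)
    for _ in 1:n
        i = step(s, i)
    end
    return i
end

# Hypothetical helper: return the value of f(), or the exception it threw.
function outcome(f)
    try
        return f()
    catch err
        return err
    end
end

# Going back two characters in one call yields a negative index ...
back_direct = prevind(s, 1, 2)
# ... while two single steps fail, because the second step starts out of bounds.
back_stepped = outcome(() -> step_n(prevind, s, 1, 2))

# The same asymmetry exists in the forward direction.
fwd_direct = nextind(s, 1, 2)
fwd_stepped = outcome(() -> step_n(nextind, s, 1, 2))

println((back_direct, typeof(back_stepped)))
println((fwd_direct, typeof(fwd_stepped)))
```

In this sketch the single-call form returns `i - n` or `i + n` style indices past the ends of the string, which is exactly the case the question above asks about.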

StefanKarpinski · 2018-05-08T12:06:30Z

I've been on vacation but will take a look at this now that I'm back.

nalimilan · 2018-05-12T20:35:41Z

doc/src/manual/strings.md

@@ -348,6 +348,35 @@ x
 y
 ```

+Strings in Julia can contain invalid UTF-8 code unit sequences. This convention allows to
+treat any byte sequence as a `String`. In such situations a rule is that characters are formed
+by longest possibly valid sequences of code points. This rule is best explained by an example:


"possibly valid" isn't very explicit.

I have tried to improve it (but it is hard).

Yeah. The new version is much more precise, but is "the longest sequence of code units that could be a start of some valid code point" really correct? E.g. an overlong encoding isn't a start of a valid character. Sorry, I don't know what the best description could be, but Stefan can probably help.

I have made another attempt (more verbose), with one additional example.
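
The overlong case raised here can be checked directly. The following is a minimal sketch, assuming Julia 1.x; `Base.isoverlong` and `Base.ismalformed` are unexported helpers in `Base`, so their use here is an assumption about the internals rather than public API.

```julia
overlong = "\xc0\xa0"            # two-byte (overlong) encoding of the space character
c = first(overlong)

println(ncodeunits(overlong))    # the string has two code units ...
println(length(overlong))        # ... that are grouped into a single character
println(isvalid(c))              # the character is not valid UTF-8
println(Base.isoverlong(c))      # it is classified as overlong ...
println(Base.ismalformed(c))     # ... which is distinct from malformed
```

So an overlong sequence is neither split into separate characters nor treated as a valid one, which is why the wording of the rule is hard to get right.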

nalimilan · 2018-05-12T20:36:32Z

doc/src/manual/strings.md

+
+We can see that first two code units in `s` form an overlong encoding of space character.
+It is invalid, but is accepted in a string as a single character.
+Next two code units form a valid start of a three byte UTF-8 sequence. However, fifth code unit


"The next two". "three-byte". "The fifth".

nalimilan · 2018-05-12T20:37:11Z

doc/src/manual/strings.md

+Next two code units form a valid start of a three byte UTF-8 sequence. However, fifth code unit
+`\xe2` is not its valid continuation. Therefore code units 3 and 4 form a second malformed
+character in this string. Similarly code unit 5 forms a malformed character because
+because `|` is not a valid continuation.


Twice "because".
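
The paragraph under review can be followed with a short sketch. The exact bytes of `s` are an assumption reconstructed from the description (an overlong space, an incomplete three-byte sequence, a lone `\xe2`, then `|`); the hunk quoted here does not show the string itself.

```julia
# Assumed input: \xc0\xa0 (overlong space), \xe2\x88 (incomplete three-byte
# sequence), \xe2 (lone lead byte), and '|'.
s = "\xc0\xa0\xe2\x88\xe2|"

println(ncodeunits(s))             # six code units in total
println(length(s))                 # grouped into four characters
println(collect(eachindex(s)))     # code unit index where each character starts
println(isvalid.(collect(s)))      # only the final '|' is a valid character
println(isvalid(s))                # so the string as a whole is not valid UTF-8
```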

nalimilan · 2018-05-12T20:38:15Z

doc/src/manual/strings.md

@@ -371,6 +400,34 @@ julia> string(greet, ", ", whom, ".\n")
 "Hello, world.\n"
 ```

+A situation which is important to be aware of is when invalid UTF-8 strings are concatenated.
+In that case string may contain different characters than those that constitute concatenated


"the resulting string" and "that constitute input strings"?

Below, typo "sting". "such a string" could just be "its [number of characters]".

Fixed. Thank you for the review.
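
The effect described in this paragraph can be sketched as follows. The two input strings are an assumption chosen for illustration: the first holds the first two bytes of the three-byte encoding of `∀` (U+2200) and the second holds the remaining continuation byte.

```julia
a = "\xe2\x88"     # incomplete three-byte sequence: one malformed character
b = "\x80"         # lone continuation byte: one malformed character
c = a * b          # the bytes \xe2\x88\x80 together encode '∀'

println(length.([a, b, c]))      # each string has exactly one character
println(isvalid.([a, b, c]))     # only the concatenation is valid UTF-8
println(c == "∀")
```

Here the result has one character while the inputs have two in total, and that character appears in neither input.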

waldyrious · 2018-05-13T05:41:28Z

doc/src/manual/strings.md

+  true
+```
+
+We can see that first two code units in `s` form an overlong encoding of space character.


"We can see that +the+ first two..."

Also, I'd suggest ...in `s`... --> ...in the string `s`..., to make the sentence structure more evident.

waldyrious · 2018-05-13T05:50:13Z

doc/src/manual/strings.md

+The next two code units form a valid start of a three-byte UTF-8 sequence. However, the fifth
+code unit `\xe2` is not its valid continuation. Therefore code units 3 and 4 form a second
+malformed character in this string. Similarly code unit 5 forms a malformed character because
+`|` is not a valid continuation.


"because `|` is not a valid continuation +to it+"

waldyrious · 2018-05-13T05:51:58Z

doc/src/manual/strings.md

@@ -371,6 +401,34 @@ julia> string(greet, ", ", whom, ".\n")
 "Hello, world.\n"
 ```

+A situation which is important to be aware of is when invalid UTF-8 strings are concatenated.
+In that case the resulting string may contain different characters than those that constitute
+input strings and its number of characters may be lower than sum of numbers of characters


"than those that constitute input strings and" --> "than the input strings, and"

bkamins · 2018-05-29T12:02:40Z

Any opinion on this PR?

StefanKarpinski · 2018-05-29T12:09:44Z

Definitely a big improvement. I should have merged it long ago. I fixed the merge conflict; we can merge as soon as CI passes.

bkamins · 2018-05-30T07:14:25Z

@fredrikekre Here, CI mostly passes, but I have investigated the Travis failure, and it is caused by the same problem that makes #26802 fail: Documenter.jl throws an error when digesting malformed characters. Again, we could disable doctests here, but maybe there is some workaround?

In general, in both cases it is crucial that we show in the manual how Julia works with invalid UTF-8, so I have to leave those offending lines in both PRs.

bkamins · 2018-05-30T20:59:18Z

CI status: 2 builds on CircleCI were canceled; the AppVeyor and Travis builds partially passed and partially failed for some strange reason (not because of the error from Documenter.jl); the FreeBSD CI passes.

fredrikekre · 2018-05-31T20:54:46Z

base/strings/basic.jl

 If `i` is in bounds in `s` return the index of the start of the character whose
-encoding starts after index `i`. If `i` is out of bounds in `s` return `i + 1`.
-If `n == 0` return `i`.
+encoding starts after index `i`. In other words, if `i` is the start of a


I think this section needs to be indented (and the same below) if the items should render as bullet points, right?

fixed, thanks.

StefanKarpinski · 2018-05-31T21:46:44Z

Thanks for this—it's a really good explanation of how this works.

fredrikekre · 2018-06-01T06:06:28Z

#27109 on Travis 64-bit

Document invalid UTF-8 indexing, concatenation, thisind, prevind, nex…

47f307b

…tind

ararslan added the unicode ("Related to unicode characters and encodings") and strings ("Strings!") labels May 2, 2018

ararslan requested a review from StefanKarpinski May 2, 2018 18:19

fredrikekre reviewed May 5, 2018

language corrections after a review

0c080fd

nalimilan reviewed May 12, 2018

language corrections after review

60395c7

waldyrious reviewed May 13, 2018

bkamins force-pushed the str_indexing_doc branch 2 times, most recently from a6489a1 to de20606 Compare May 13, 2018 07:23

iproved character parsing description

32d8d65

bkamins force-pushed the str_indexing_doc branch from de20606 to 32d8d65 Compare May 13, 2018 07:24

Merge branch 'master' into str_indexing_doc

f6dc4ca

bkamins added 2 commits May 30, 2018 12:34

use julia-repl

c0f0509

clean up thisind documentation

a563e33

bkamins force-pushed the str_indexing_doc branch from 1d63ad3 to a563e33 Compare May 30, 2018 14:12

fredrikekre reviewed May 31, 2018

indent bullet points

14b157a

StefanKarpinski approved these changes May 31, 2018

fredrikekre merged commit eb51673 into JuliaLang:master Jun 1, 2018

bkamins deleted the str_indexing_doc branch June 1, 2018 07:13
