Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document invalid UTF-8 indexing and concatenation #26952

Merged
merged 8 commits into from
Jun 1, 2018

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented May 1, 2018

This PR is a take to address all documentation points from #25478.

Changes:

  • documentation how Julia identifies characters in malformed UTF-8;
  • documentation of string concatenation for malformed UTF-8;
  • documentation of thisind, nextind, prevind.

I am not a native speaker so if there are grammatical mistakes just push a patch to the descriptions 😄.

@StefanKarpinski the descriptions are complex unfortunately but I hope I have managed to cover all cases.

@ararslan ararslan added unicode Related to unicode characters and encodings strings "Strings!" labels May 2, 2018
@ararslan ararslan requested a review from StefanKarpinski May 2, 2018 18:19

# Examples
```jldoctest
julia> thisind("αβγdef", -5)
-5
julia> thisind("α", -1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps start with an example that "works" and leave the BoundsError examples to last?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved (I wanted to sow what happens if we increase the index but I agree with your reasoning).


julia> thisind("αβγdef", 20)
20
julia> thisind("α", 4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line has no output

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed (a copy-paste glitch)

@@ -394,27 +396,41 @@ end
"""
prevind(str::AbstractString, i::Integer, n::Integer=1) -> Int

Case `n == 1`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps make this a bullet point list instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -436,25 +452,42 @@ end
"""
nextind(str::AbstractString, i::Integer, n::Integer=1) -> Int

Case `n == 1`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, perhaps better with a bulleted list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -348,6 +348,35 @@ x
y
```

Strings in Julia can contain invalid UTF-8 code unit sequences. This is rule allows to accept
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is rule allows? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -371,6 +400,34 @@ julia> string(greet, ", ", whom, ".\n")
"Hello, world.\n"
```

An important to be aware of situation is when invalid UTF-8 strings are concatenated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps better as A situation which is important to be aware of is when...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. @fredrikekre thank you for a review 😄,

@bkamins
Copy link
Member Author

bkamins commented May 5, 2018

@StefanKarpinski:
Following a discussion I had with @fredrikekre there is one question regarding functionality of nextind/prevind. Currently if n>1 it is possible that they return indices that would not be possible to get when iterating them for n=1. E.g. prevind can produce a negative index.

I have documented it in this PR but the question is whether it is intended or we want to throw an error in such situations.

@StefanKarpinski
Copy link
Member

I've been on vacation but will take a look at this now that I'm back.

@@ -348,6 +348,35 @@ x
y
```

Strings in Julia can contain invalid UTF-8 code unit sequences. This convention allows to
treat any byte sequence as a `String`. In such situations a rule is that characters are formed
by longest possibly valid sequences of code points. This rule is best explained by an example:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"possibly valid" isn't very explicit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried to improve it (but it is hard).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. The new version much more precise, but is "the longest sequence of code units that could be a start of some valid code point" really correct? e.g. an overlong encoding isn't a start of a valid character. Sorry, I don't know what the best description could be, but Stefan can probably help.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made another shot (more verbose) and with one additional example.


We can see that first two code units in `s` form an overlong encoding of space character.
It is invalid, but is accepted in a string as a single character.
Next two code units form a valid start of a three byte UTF-8 sequence. However, fifth code unit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The next two". "three-byte". "The fifth".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Next two code units form a valid start of a three byte UTF-8 sequence. However, fifth code unit
`\xe2` is not its valid continuation. Therefore code units 3 and 4 form a second malformed
character in this string. Similarly code unit 5 forms a malformed character because
because `|` is not a valid continuation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Twice "because".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -371,6 +400,34 @@ julia> string(greet, ", ", whom, ".\n")
"Hello, world.\n"
```

A situation which is important to be aware of is when invalid UTF-8 strings are concatenated.
In that case string may contain different characters than those that constitute concatenated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the resulting string" and "that constitute input strings"?

Below, typo "sting". "such a string" could just be "its [number of characters]".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. thank you for a review.

true
```

We can see that first two code units in `s` form an overlong encoding of space character.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"We can see that +the+ first two..."

Also, I'd suggest ...in `s`... --> ...in the string `s`..., to make the sentence structure more evident.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

The next two code units form a valid start of a three-byte UTF-8 sequence. However, the fifth
code unit `\xe2` is not its valid continuation. Therefore code units 3 and 4 form a second
malformed character in this string. Similarly code unit 5 forms a malformed character because
`|` is not a valid continuation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"because `|` is not a valid continuation +to it+"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -371,6 +401,34 @@ julia> string(greet, ", ", whom, ".\n")
"Hello, world.\n"
```

A situation which is important to be aware of is when invalid UTF-8 strings are concatenated.
In that case the resulting string may contain different characters than those that constitute
input strings and its number of characters may be lower than sum of numbers of characters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"than those that constitute input strings and" --> "than the input strings, and"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@bkamins bkamins force-pushed the str_indexing_doc branch 2 times, most recently from a6489a1 to de20606 Compare May 13, 2018 07:23
@bkamins bkamins force-pushed the str_indexing_doc branch from de20606 to 32d8d65 Compare May 13, 2018 07:24
@bkamins
Copy link
Member Author

bkamins commented May 29, 2018

Any opinion on this PR?

@StefanKarpinski
Copy link
Member

Definitely a big improvement. I should have merged long ago. I fixed the merge conflict, we can merge as soon as CI passes.

@bkamins
Copy link
Member Author

bkamins commented May 30, 2018

@fredrikekre Here, CI mostly passes, but I have investigated Travis failure and it is related to the same problem as why #26802 fails. Documenter.jl throws an error when digesting malformed characters. Again - we could disable doctests here, but maybe there is some workaround?

In general, in both cases it is crucial that we show in the manual how Julia works with invalid UTF-8 so I have to leave those offending lines in both PRs.

@bkamins bkamins force-pushed the str_indexing_doc branch from 1d63ad3 to a563e33 Compare May 30, 2018 14:12
@bkamins
Copy link
Member Author

bkamins commented May 30, 2018

CI status is that 2 builds on CircleCI were canceled, and AppVeyor and Travis builds partially passed and partially failed due to some strange reason (not due to the error from Documenter.jl); freebsd ci passes.

If `i` is in bounds in `s` return the index of the start of the character whose
encoding starts after index `i`. If `i` is out of bounds in `s` return `i + 1`.
If `n == 0` return `i`.
encoding starts after index `i`. In other words, if `i` is the start of a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section need to be indented (and same below) if they should render as bullet points, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks.

@StefanKarpinski
Copy link
Member

Thanks for this—it's a really good explanation of how this works.

@fredrikekre
Copy link
Member

#27109 on Travis 64-bit

@fredrikekre fredrikekre merged commit eb51673 into JuliaLang:master Jun 1, 2018
@bkamins bkamins deleted the str_indexing_doc branch June 1, 2018 07:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!" unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants