Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UTF encoding validity functions #11575

Merged
merged 1 commit into from
Jun 19, 2015

Conversation

ScottPJones
Copy link
Contributor

This introduces the Base.check_string function, with methods for handling UTF-8, UTF-16, and UTF-32 as vectors of UInt8, UInt16, and UInt32, respectively, as well as an AbstractString version that operates on Unicode characters (making sure that they are all valid code points, 0 <= ch < 0xd800, 0xe000 <= ch < 0x10ffff).
There are options to accept or not accept things like Modified UTF-8 encoding, or CESU-8 encoding,
or "overly long" encodings. (See utfcheck.jl for documentation)
These methods either throw a UnicodeError, with information about which character was invalid, and
its position in the input, or they return a tuple, of the number of logical characters in the string, a bit flag to indicate what types of data were found (i.e. all ASCII, all Latin1, surrogates present, overlong characters present, etc.), the number of characters that would take 2 bytes to encode in UTF-8, the number that would take 3 bytes, and the number that would take 4 bytes.
This information is designed to be useful for future conversion routines, because they allow one to calculate the exact size needed to represent a valid string in UTF-8, UTF-16, or UTF-32, and also
indicate whether an optimized "widening" or "narrowing" conversion can be done.

@tkelman tkelman added the unicode Related to unicode characters and encodings label Jun 4, 2015
@ScottPJones
Copy link
Contributor Author

Prepare your 🍅s! I'm off for dinner, please let me know what you think of this and #11573

@@ -84,6 +84,8 @@ include("iterator.jl")
include("osutils.jl")

# strings & printing
include("utferror.jl")
include("utftypes.jl")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'll need to include("utfcheck.jl") here too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... somehow that got lost... was definitely part of my source locally...

@ScottPJones ScottPJones force-pushed the spj/checkstring branch 2 times, most recently from 6714044 to f1c1583 Compare June 5, 2015 10:43
@tkelman
Copy link
Contributor

tkelman commented Jun 5, 2015

Please rebase relative to latest master now that #11573 is merged

@ScottPJones
Copy link
Contributor Author

Of course! Thanks very much. Should I squash everything also?

@tkelman
Copy link
Contributor

tkelman commented Jun 5, 2015

It's only 2 commits here, and rebasing would remove 1 since that's now on master.

If you want to amend your commit message while rebasing to make it a little more descriptive that would be welcomed.

@tkelman
Copy link
Contributor

tkelman commented Jun 5, 2015

As I mentioned at #11551 (comment), I think this code is essentially ready now and good to merge, assuming the tests pass.

@ScottPJones
Copy link
Contributor Author

Is that what you wanted, as far as the commit message? Thanks!

@tkelman
Copy link
Contributor

tkelman commented Jun 5, 2015

The indentation is a little odd, but it's fine. Nice and descriptive, I like that part.

@ScottPJones
Copy link
Contributor Author

Ah, I see about what happened to the indentation... I updated that directly on GitHub.
The test failure is the dreaded OOM, not the fault of this change, so hopefully somebody can merge this.
Thanks!

elseif ch < 0x800
num2byte += 1
flags |= UTF_UNICODE2
elseif T != Vector{UInt16} && ch > 0x0ffff
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the type check here actually necessary? if T==UInt16, then typemax(T) == 0xffff && ch > 0xffff should be false. no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's checking against Vector{UInt16}, not UInt16, so I don't think Julia can figure it out by itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

julia> function t16(x)
         local a::UInt32
         @inbounds a, i = next(x, 1)
         if a > 0xffff
           return 1
         else
           return 2
         end
       end
t16 (generic function with 1 method)

julia> code_native(t16,Tuple{Vector{UInt16}})
    .section    __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 4
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $2, %eax
Source line: 4
    popq    %rbp
    ret

julia> code_llvm(t16,Tuple{Vector{UInt16}})

define i64 @julia_t16_20941(%jl_value_t*) {
top:
  ret i64 2
}

that looks pretty successful to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I changed that. I know that at one point (before Jeff put in the change that fixed the big performance regression compared to 0.3.x I'd seen) it seemed not to get that correctly... (unless my eyes where playing tricks on me!), but I hadn't thought to try again (also my test wasn't the same as yours, it was using an AbstractString)

ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 6, 2015
@ScottPJones
Copy link
Contributor Author

The only failures are due to the OOM problems affecting everybody.
I believe I've addressed the remaining issues. Anything else preventing this from going in?
Thanks!

ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 7, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 7, 2015
Added new convert methods that use the check_string function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 7, 2015
Added new `convert` methods that use the `check_string` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
end

function check_string(dat::Vector{UInt8}, len = sizeof(dat), pos = 0 ; options::Integer=0)
" Validates and calculates number of characters in a UTF-8 encoded vector of UInt8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe @nalimilan said this in one of the other PR's, but the doc string convention is more likely to be docs immediately preceding the function, not just inside the function. Not positive though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't seen that, but, why introduce a different convention from what's very commonly in use (i.e. Python documentation)? I'd heard that void strings (which I took to mean any string inside or outside of a method, since it wasn't at all specified, that wasn't used (as an argument, to be returned, or set into a variable), would in the future be picked out for documentation.
Since it seems there is a lot of overlap between Python and Julia programmers (all the SciPy, NumPy, etc. people), I think it would be a very good think to be consistent with that, if there is no compelling reason not to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compelling reason being Julia has multiple dispatch, with multiple method implementations for the same generic function but different argument types, which Python doesn't have. That and the @doc macro works for docstring-before-function, so I believe that's the direction that things are going.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of the @doc macro apparently will no longer be necessary (or even the doc string prefix, if the latest version of Docile were put in Base. I'm not sure why multiple dispatch feels like a compelling reason to you... This doesn't remove the ability to (later in the build process, at least) use @doc doc"""...""" -> syntax before the first function of a set of methods...
I think you are going to see this style whether you like it or not... just this week I had to tell a Python programmer learning Julia that he had to move his comments out of the function and use @doc doc"""...""" -> in order to get Help to pick it up. Allowing this style will make a lot of people moving from Python happy (or less confused, at least)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People coming from Doxygen, Javadoc or gtk-doc are used to having docs before the body of the function, so you can't please everybody. Anyway, didn't you support a solution similar to Doxygen? :-p

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doxygen doesn't have anything to do with it... you can use doxygen with Python, which has the docstring inside the function.
Also, I'm not saying that the default shouldn't be to have the documentation before... but why couldn't both be allowed?
About doxygen - I just like having things documented, and using doxygen pushes that a bit, but I've never been particularly attached to the syntax, just to having the necessary information preserved to make cross-references, etc. Since nobody here seems to like the doxygen syntax, and somebody on another thread talked about parsing the comments to build the metadata, (which could then be used to produce output for doxygen), I decided that that was the better solution for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just pick one convention, and use it without arguing please? Docstrings still need some work, but for code outside of base, docstring before function has been implemented, docstring inside of function hasn't. If someone wants to make a PR to support docstrings inside of functions in Docile or the base docs system, then that convention could be an alternate to accommodate people coming from python. For now, wouldn't it be easier to just make this change than waste more bandwidth resisting trivial changes that multiple reviewers are asking you to make?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with changing it... I was responding via my phone until a couple of hours ago... out enjoying the day... I'm back at the laptop, doing it now.

ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 8, 2015
Updated comment to go before function, not indented by 4
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 8, 2015
Added new `convert` methods that use the `check_string` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
@ScottPJones
Copy link
Contributor Author

I believe I've handled everything... please take a look!

ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 15, 2015
Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().

Added over 150 lines of testing code to detect the above conversion problems

Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc

Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity
checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84

Updated based on review comments

Changes to error handling and check_string

Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4

Updated to use unsafe_checkstring

Removed redundant argument documentation
@nalimilan
Copy link
Member

Ah, something that @mbauman's comment on the other thread reminded me of: to check bounds in checkstring, you can call checkbounds, which is slightly shorter.

@ScottPJones
Copy link
Contributor Author

I'm already changing it - it doesn't really matter that it's shorter though, it is pretty important that it is correct! 😀 Thanks!

@ScottPJones
Copy link
Contributor Author

@nalimilan Turns out, there would still be problems, using checkbounds, so I think for now, this should just be left as is, and a separate PR done to deal with bounds checking issues throughout base.
So, anything left to do?

@ScottPJones
Copy link
Contributor Author

@mbauman Could you please take a look at my use of next(), start(), endof(), pos, and endpos in this code? The back and forth review suggestions are driving me a bit crazy, and I'd like some sort of definitive answer as to what should be done. I've just been following patterns I've seen elsewhere in the code, but if that code is not correct, then I'm lost...

@mbauman
Copy link
Member

mbauman commented Jun 15, 2015

Yes, I'm sorry, I was missing the context of this PR, and was answering all your questions (both here and in #11713) in the general case for all AbstractArrays. Your usages of indexing and iteration over arrays here looks like its entirely restricted to Vector, and in that case, indexing and iteration state happen to be the same. So everything here looks to be correct (albeit dependent upon the internal Array iteration implementation).

I think that's just fine for internal code like this, but this is outside of my purview and I won't be helping to maintain this code directly. Edit: I will be working more on array indexing, and I do not foresee the implementation of iteration for Array changing anytime soon.

@ScottPJones
Copy link
Contributor Author

@mbauman Thanks very much. I'll be looking forward to your planned Interfaces manual page!

    Adds `check_string` function, which checks a vector of bytes,
    16-bit or 32-bit words, or an AbstractString for validity,
    either for UTF-8, UTF-16, or UTF-32 encoding.
    By default, `Modified UTF-8 (long \0 encoding)` and
    `CESU-8 (surrogate pairs encoded as 2 UTF-8 3-byte sequences)`
    are allowed, but other over long encoded sequences are not allowed,
    but this can be changed by the keyword options argument.
Add unit tests of all the errors found by `check_string`
Updated documentation to not use doxygen tags.

Move documentation strings from line after to line before

Add testing of valid strings
Improve/consolidate documentation

Add bounds checking

Change name to unsafe_checkstring, warn that doesn't check bounds
Add checkstring, which does check bounds

Add tests of bounds checking

Change order of start/end positions

Update bounds checking tests

Change 1 to start(dat)

Use checkbounds()
@ScottPJones
Copy link
Contributor Author

OK, I've changed this according to the very last round of suggestions in #11713.
Thanks everybody for your time and thorough reviews!
Anything left to fix?

@tkelman
Copy link
Contributor

tkelman commented Jun 17, 2015

Unless anyone objects strongly, I'll merge this in a day or two so we can continue on to the actual bug fixes. This part of the code looks pretty much done to me, so for the sake of future unicode bugfix backporting if nothing else let's get it on master.

tkelman added a commit that referenced this pull request Jun 19, 2015
Add UTF encoding validity functions
@tkelman tkelman merged commit bbb8764 into JuliaLang:master Jun 19, 2015
@ScottPJones
Copy link
Contributor Author

🎉 Thank you very much, @tkelman! I'll go rebase the rest of my PRs now...
Thanks to everybody for the very thorough reviews, and all the time you spent on them. I'm very impressed with and grateful for all the thought, great suggestions, and help from people!

@ScottPJones ScottPJones deleted the spj/checkstring branch June 19, 2015 01:03
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 19, 2015
Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().

Added over 150 lines of testing code to detect the above conversion problems

Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc

Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity
checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84

Updated based on review comments

Changes to error handling and check_string

Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4

Updated to use unsafe_checkstring

Removed redundant argument documentation
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 20, 2015
Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().

Added over 150 lines of testing code to detect the above conversion problems

Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc

Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity
checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84

Updated based on review comments

Changes to error handling and check_string

Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4

Updated to use unsafe_checkstring

Removed redundant argument documentation
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 22, 2015
Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().

Added over 150 lines of testing code to detect the above conversion problems

Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc

Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity
checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84

Updated based on review comments

Changes to error handling and check_string

Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4

Updated to use unsafe_checkstring

Removed redundant argument documentation
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 22, 2015
Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().

Added over 150 lines of testing code to detect the above conversion problems

Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc

Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity
checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84

Updated based on review comments

Changes to error handling and check_string

Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4

Updated to use unsafe_checkstring

Removed redundant argument documentation

Fix AbstractVector{UInt16} conversion

Remove support for converting Vector{UInt16} to UTF8String

Add Unicode validation function and fix UTF-16 conversion bugs
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 22, 2015
Added new `convert` methods that use the `check_string` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 25, 2015
Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().

Added over 150 lines of testing code to detect the above conversion problems

Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc

Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity
checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84

Updated based on review comments

Changes to error handling and check_string

Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4

Updated to use unsafe_checkstring

Removed redundant argument documentation
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 1, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 1, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 6, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 9, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 9, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575

Updated to use unsafe_checkstring, fix comments

Remove conversions from Vector{UInt32}

Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 9, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575

Updated to use unsafe_checkstring, fix comments

Remove conversions from Vector{UInt32}

Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 10, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 12, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575

Updated to use unsafe_checkstring, fix comments

Remove conversions from Vector{UInt32}

Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants