Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii #11203

ScottPJones · 2015-05-08T18:09:16Z

Rewrote u8_isvalid to fix cases of not detecting single or out of order surrogate characters.
Also improved performance, generally around 2-3x faster on simple tests with different types
of characters (ASCII, Latin1, 2-byte/3-byte/4-byte UTF-8 sequences).
Fixed some tests that were depending on being able to make invalid string literals.

StefanKarpinski · 2015-05-08T18:11:51Z

src/support/utf8.c


   length is in bytes, since without knowing whether the string is valid
   it's hard to know how many characters there are! */
-int u8_isvalid(const char *str, size_t length)
+int u8_isvalid(const char *iStr, size_t iLength)


What's with these argument names? Is this Hungarian notation?

Just the coding convention where I'm consulting... not Hungarian...
i for input, io for input/output, o for output, for parameters...

Is using that convention a problem? (I dislike Hungarian myself, this was something that the other developers had wanted, just for the parameters, similar to Google's requirement of putting a "p" on formal parameter names...). I hadn't found any C/C++ style guide for Julia internals, just the guide for Julia code. Thanks!

I think it would make sense to follow the style of the surrounding code, which is not to do this.

OK, I just copied this from my own UTF handling code, and changed the name to match Jeff's name, and removed the documentation :sad:
When I have all of u8 rewritten to run faster, can I use my standard conventions, since there will be no other surrounding code?

@StefanKarpinski Because when I had copied code of mine previously into utf8.c, I was told to remove the documentation, because nobody had decided yet how code should be documented 😢

@tknopp I don't mean to sound snide, but, if I'd actually found any consistent conventions in the code I've been looking at, I would have followed it...
I've been working mostly in utf8.c, utf8proc.c, and gc.c. Each one has different conventions for how the code is formatted, [2 vs 4 spaces for indentation, for example, different ways of indenting {}'s, etc.], and different ways things are named. There are things like jlegal_func (is that a legal function? hmmm...) or should that really have been jl_egal_func, as were some of the other functions, such as jl_alloc_svec_func.

Having authored the coding / style / portability standard for a company with hundreds of programmers, that is still in use after 28 years, and co-authored the guide for the start-up I'm at now, I feel not having some sort of guide for C/C++ programming is a serious lack for Julia.
I've found that consistency, external and internal documentation, good readable style, as well as having portable code, is incredibly important when you have lots of contributors over large periods of time.
The code now is not very consistent, there doesn't seem to be any internal documentation or comments to explain whatever tricks or design / implementation decisions were made, use of too short variable names makes it hard to read parts of the code (using a single letter variable name, that is a substring of several other variable names in the same function, for example)...
lots of static variables that will make adding threading very difficult... I could go on, but it would just frustrate me more...

code formatting is described in

https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md#general-formatting-guidelines-for-c-code-contributions

Yes, I'd seen that paragraph, but it says nothing about variable naming... and isn't followed in the code I've seen.

Yes you are right about variable naming. I just wanted to reply to your complaint about formatting.

I can absolutely see some of your other points. For instance it would be a good move to put globals into structs so that there is a julia context that can be independently hold. But one has also to acknowledge that the C core of julia is moving fast and maintained/developed by a small group of people. Taking this into account development has been pretty efficient IMHO.

If you want to help improving the C core I suggest opening a new issue and making concrete proposals that the core hackers can judge.

@tknopp to be fair, I think @ScottPJones has gotten the message about opening concrete, independent, small issues by now. See #11123 for an issue about doc comments. On doc comments and variable naming I think it might be best to just start proposing small PR's to solely add doc comments or replace variable names with clearer, more consistent versions. They may stand out as not matching the surrounding code at first, but if a consistent style starts getting gradually applied to more of the C code then I think it could be fair to start holding newly added code to a standard (which we would add to CONTRIBUTING.md).

tkelman · 2015-05-09T01:13:02Z

+ a bunch for the level of comments here.

For indentation and purely cosmetic code formatting conventions I'd be in favor of adopting a "run clang-format on it" policy, ideally with formatting-only changes split into separate commits from functional changes. Variable naming is harder to enforce.

ScottPJones · 2015-05-09T01:42:47Z

I did try to follow all of the formatting stuff I'd found (4 space indentation, no tab characters, no trailing spaces on a line, { } placement...). I've never used "clang-format", I'll have to investigate that... thanks!

ScottPJones · 2015-05-13T13:19:53Z

Anything left keeping this from being merged into base? Thanks!

tkelman · 2015-05-13T20:22:32Z

I don't think the argument names matter enough to prevent this from being merged, but I don't know enough of the unicode details to be qualified to review the functional changes here. @StefanKarpinski or @stevengj or @jiahao can you take a look and hit the button if everything is in order?

kmsquire · 2015-05-13T21:19:20Z

Any particular reason for the test changes? They look mostly cosmetic, and to me, the changes are not clearly better to me, but I might be missing something.

Also, when all edits are done, the commits should be squashed.

ScottPJones · 2015-05-14T06:59:13Z

What do you mean? I added a very complete set of tests, with all of the edge conditions, and tests for UTF8String, which didn't have any before...

tkelman · 2015-05-14T07:01:28Z

I think you might be mixing up the different PR's, here 88a8071 makes a few small changes to a test file - as described up top

Fixed some tests that were depending on being able to make invalid string literals.

ScottPJones · 2015-05-14T07:13:22Z

The changes to test/strings.jl in 88a8071 were not cosmetic at all... they are necessary when the string literals no longer accept invalid UTF-8.

tkelman · 2015-05-14T07:17:36Z

In this case, while I don't usually like having commits that fail tests present in the history of master, separating out the commits is actually good for license's sake. Note the comment based on the valid_utf8 routine from the PCRE library by Philip Hazel, about the old implementation which is being deleted here.

ScottPJones · 2015-05-14T07:42:45Z

@tkelman It's too early for my brain... what would you like me to change?

tkelman · 2015-05-14T07:49:26Z

Nothing from me yet, unless someone who knows unicode says otherwise. Was mostly responding to

Also, when all edits are done, the commits should be squashed.

where here it's actually useful to have the clean separation of new from-scratch version in a separate commit from deleting the old one.

kmsquire · 2015-05-14T14:39:36Z

@ScottPJones, regarding the tests, thanks for the explanation--that's what I was missing.

@tkelman, re:squashing, I agree that separate commits are sometimes useful, and you make a good point.

Scott, can you squash only the first and third commits? You can do this with git rebase -i HEAD~3 on this branch, swapping the second and third commits, and then squashing the second into the first.

(For why broken commits can be an issue, look up "git bisect".)

I wound also suggest renaming the parameters to match the surrounding code, and consider clarifications to the coding conventions separately. (Ideally the rename would also be squashed into the first commit.)

(If you're concerned about screwing up git history, you could make a branch off of this branch first, and practice these changes there.)

Cheers!

ScottPJones · 2015-05-14T15:17:32Z

OK, all done! Please let me know if I’ve somehow screwed this up again! 😀 Thanks everybody!

stevengj · 2015-05-14T17:07:59Z

Looks good, but it seems that there are currently no test cases of is_valid_utf8 in test/strings.jl, and I wouldn't like to see changes to such a basic routine merged unless there are more extensive tests.

Maybe something like:

let tests = (("abcd".data,true), [("α".data[i:i],false) for i in 1:2]..., ("🐨".data[1:3],false), ("🐨".data, true), ("\ud800".data, false))
    for (d1,v1) in tests
        for (d2,v2) in tests
            for (d3,v3) in tests
                @test is_valid_utf8(vcat(d1,d2,d3)) == v1 && v2 && v3
            end
        end
    end
end

(and add any other edge cases as (data, isvalid) pairs to the tests variable)

ScottPJones · 2015-05-14T17:23:16Z

OK, I put them in as part of the is_valid_* -> invalid(T change, I’ll rename them and put them in here as well. Your method is nice... I’ll have to change it a bit though (“\ud800” gives a error now with one of my changes, it is an invalid string that wasn’t caught before).

jakebolewski · 2015-05-14T17:24:22Z

looks great

stevengj · 2015-05-14T17:26:08Z

Just replace "\ud800".data with [0xed,0xa0,0x80]

stevengj · 2015-05-14T17:29:32Z

There should also be a

@test is_valid_utf8(UInt8[])

Change parameters in utf8.c

ScottPJones · 2015-05-14T18:03:47Z

OK, should be all set, and resquashed

tkelman · 2015-05-14T20:38:27Z

Looks to be failing the new tests?

ScottPJones · 2015-05-14T20:59:37Z

Yes, just saw that... don't know what's different from on my system... will try to get that fixed shortly

ScottPJones · 2015-05-15T00:47:21Z

OK, I was playing with too many branches at the same time, and I didn't actually test those particular lines from @stevengj. After looking at it further, I realized that his tests simply are not correct, you can't mix and match incomplete UTF-8 sequences like that, because you can have two sequences that are invalid on their own, which put back together, are a valid sequence of bytes, which is esp. likely because of the way the code takes apart the bytes of the UTF-8 representation of a valid character, and then vcat's things together... in this test, sometimes two wrongs do make a right (which then causes the test to fail!) I'll have the tests in order shortly...

ScottPJones · 2015-05-15T03:36:27Z

I’ve updated the test cases... I hope you’ll all agree that I did a fairly comprehensive set of test cases for UTF-8! Let me know if there is anything else to prevent this from being merged... Thanks!

Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii

stevengj · 2015-05-15T14:35:38Z

Looks good, thanks!

stevengj · 2015-05-15T14:36:37Z

Ah shoot, I just noticed you hadn't squashed the final commit; I probably should have waited for that before merging.

ScottPJones · 2015-05-15T14:50:38Z

Ah, I missed squashing that puppy... my other two that I think are totally ready to go (#11186 and #11241) are both squashed down to 1 commit. Thanks!

ScottPJones changed the title ~~Spj/fixvalidutf8~~ Fix #11141 and improve performance of is_validate_utf8/is_validate_ascii May 8, 2015

StefanKarpinski reviewed May 8, 2015
View reviewed changes

ScottPJones changed the title ~~Fix #11141 and improve performance of is_validate_utf8/is_validate_ascii~~ Fix #11141/#10973 and improve performance of is_validate_utf8/is_validate_ascii May 8, 2015

tkelman added the unicode Related to unicode characters and encodings label May 9, 2015

ScottPJones changed the title ~~Fix #11141/#10973 and improve performance of is_validate_utf8/is_validate_ascii~~ Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii May 9, 2015

ScottPJones mentioned this pull request May 13, 2015

Add isvalid(Type, value) methods, to replace is_valid_* #11241

Merged

ScottPJones added a commit to ScottPJones/julia that referenced this pull request May 14, 2015

Make tests easier to handle fix JuliaLang#11203

5bb8828

Fix JuliaLang#11141 and improve performance greatly

778810b

ScottPJones force-pushed the spj/fixvalidutf8 branch from 7eed834 to 29bb4bf Compare May 14, 2015 15:14

Remove old u8_isvalid

f473f5f

Change parameters in utf8.c

ScottPJones force-pushed the spj/fixvalidutf8 branch from 29bb4bf to f473f5f Compare May 14, 2015 18:02

Fix is_valid_utf8 tests

14e289d

stevengj added a commit that referenced this pull request May 15, 2015

Merge pull request #11203 from ScottPJones/spj/fixvalidutf8

2f019d7

Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii

stevengj merged commit 2f019d7 into JuliaLang:master May 15, 2015

ScottPJones deleted the spj/fixvalidutf8 branch May 15, 2015 14:41

ScottPJones mentioned this pull request Jun 15, 2015

"invalid UTF-8 character index" error in writetable JuliaData/DataFrames.jl#813

Closed

ScottPJones mentioned this pull request Jul 28, 2015

Invalid UTF-8 literal strings with leading surrogate character at end of string #10973

Closed

Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii #11203

Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii #11203

Conversation

ScottPJones commented May 8, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tkelman commented May 9, 2015

ScottPJones commented May 9, 2015

ScottPJones commented May 13, 2015

tkelman commented May 13, 2015

kmsquire commented May 13, 2015

ScottPJones commented May 14, 2015

tkelman commented May 14, 2015

ScottPJones commented May 14, 2015

tkelman commented May 14, 2015

ScottPJones commented May 14, 2015

tkelman commented May 14, 2015

kmsquire commented May 14, 2015

ScottPJones commented May 14, 2015

stevengj commented May 14, 2015

ScottPJones commented May 14, 2015

jakebolewski commented May 14, 2015

stevengj commented May 14, 2015

stevengj commented May 14, 2015

ScottPJones commented May 14, 2015

tkelman commented May 14, 2015

ScottPJones commented May 14, 2015

ScottPJones commented May 15, 2015

ScottPJones commented May 15, 2015

stevengj commented May 15, 2015

stevengj commented May 15, 2015

ScottPJones commented May 15, 2015