-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii #11203
Conversation
|
||
length is in bytes, since without knowing whether the string is valid | ||
it's hard to know how many characters there are! */ | ||
int u8_isvalid(const char *str, size_t length) | ||
int u8_isvalid(const char *iStr, size_t iLength) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's with these argument names? Is this Hungarian notation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the coding convention where I'm consulting... not Hungarian...
i for input, io for input/output, o for output, for parameters...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is using that convention a problem? (I dislike Hungarian myself, this was something that the other developers had wanted, just for the parameters, similar to Google's requirement of putting a "p" on formal parameter names...). I hadn't found any C/C++ style guide for Julia internals, just the guide for Julia code. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make sense to follow the style of the surrounding code, which is not to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I just copied this from my own UTF handling code, and changed the name to match Jeff's name, and removed the documentation :sad:
When I have all of u8 rewritten to run faster, can I use my standard conventions, since there will be no other surrounding code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@StefanKarpinski Because when I had copied code of mine previously into utf8.c
, I was told to remove the documentation, because nobody had decided yet how code should be documented 😢
@tknopp I don't mean to sound snide, but, if I'd actually found any consistent conventions in the code I've been looking at, I would have followed it...
I've been working mostly in utf8.c
, utf8proc.c
, and gc.c
. Each one has different conventions for how the code is formatted, [2 vs 4 spaces for indentation, for example, different ways of indenting {}'s, etc.], and different ways things are named. There are things like jlegal_func
(is that a legal function? hmmm...) or should that really have been jl_egal_func
, as were some of the other functions, such as jl_alloc_svec_func
.
Having authored the coding / style / portability standard for a company with hundreds of programmers, that is still in use after 28 years, and co-authored the guide for the start-up I'm at now, I feel not having some sort of guide for C/C++ programming is a serious lack for Julia.
I've found that consistency, external and internal documentation, good readable style, as well as having portable code, is incredibly important when you have lots of contributors over large periods of time.
The code now is not very consistent, there doesn't seem to be any internal documentation or comments to explain whatever tricks or design / implementation decisions were made, use of too short variable names makes it hard to read parts of the code (using a single letter variable name, that is a substring of several other variable names in the same function, for example)...
lots of static variables that will make adding threading very difficult... I could go on, but it would just frustrate me more...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code formatting is described in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd seen that paragraph, but it says nothing about variable naming... and isn't followed in the code I've seen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are right about variable naming. I just wanted to reply to your complaint about formatting.
I can absolutely see some of your other points. For instance it would be a good move to put globals into structs so that there is a julia context that can be independently hold. But one has also to acknowledge that the C core of julia is moving fast and maintained/developed by a small group of people. Taking this into account development has been pretty efficient IMHO.
If you want to help improving the C core I suggest opening a new issue and making concrete proposals that the core hackers can judge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tknopp to be fair, I think @ScottPJones has gotten the message about opening concrete, independent, small issues by now. See #11123 for an issue about doc comments. On doc comments and variable naming I think it might be best to just start proposing small PR's to solely add doc comments or replace variable names with clearer, more consistent versions. They may stand out as not matching the surrounding code at first, but if a consistent style starts getting gradually applied to more of the C code then I think it could be fair to start holding newly added code to a standard (which we would add to CONTRIBUTING.md).
+ a bunch for the level of comments here. For indentation and purely cosmetic code formatting conventions I'd be in favor of adopting a "run clang-format on it" policy, ideally with formatting-only changes split into separate commits from functional changes. Variable naming is harder to enforce. |
I did try to follow all of the formatting stuff I'd found (4 space indentation, no tab characters, no trailing spaces on a line, { } placement...). I've never used "clang-format", I'll have to investigate that... thanks! |
Anything left keeping this from being merged into base? Thanks! |
I don't think the argument names matter enough to prevent this from being merged, but I don't know enough of the unicode details to be qualified to review the functional changes here. @StefanKarpinski or @stevengj or @jiahao can you take a look and hit the button if everything is in order? |
Any particular reason for the test changes? They look mostly cosmetic, and to me, the changes are not clearly better to me, but I might be missing something. Also, when all edits are done, the commits should be squashed. |
What do you mean? I added a very complete set of tests, with all of the edge conditions, and tests for UTF8String, which didn't have any before... |
I think you might be mixing up the different PR's, here 88a8071 makes a few small changes to a test file - as described up top
|
The changes to |
In this case, while I don't usually like having commits that fail tests present in the history of master, separating out the commits is actually good for license's sake. Note the comment |
@tkelman It's too early for my brain... what would you like me to change? |
Nothing from me yet, unless someone who knows unicode says otherwise. Was mostly responding to
where here it's actually useful to have the clean separation of new from-scratch version in a separate commit from deleting the old one. |
@ScottPJones, regarding the tests, thanks for the explanation--that's what I was missing. @tkelman, re:squashing, I agree that separate commits are sometimes useful, and you make a good point. Scott, can you squash only the first and third commits? You can do this with (For why broken commits can be an issue, look up "git bisect".) I wound also suggest renaming the parameters to match the surrounding code, and consider clarifications to the coding conventions separately. (Ideally the rename would also be squashed into the first commit.) (If you're concerned about screwing up git history, you could make a branch off of this branch first, and practice these changes there.) Cheers! |
7eed834
to
29bb4bf
Compare
OK, all done! Please let me know if I’ve somehow screwed this up again! 😀 Thanks everybody! |
Looks good, but it seems that there are currently no test cases of Maybe something like:
(and add any other edge cases as |
OK, I put them in as part of the |
looks great |
Just replace |
There should also be a
|
Change parameters in utf8.c
29bb4bf
to
f473f5f
Compare
OK, should be all set, and resquashed |
Looks to be failing the new tests? |
Yes, just saw that... don't know what's different from on my system... will try to get that fixed shortly |
OK, I was playing with too many branches at the same time, and I didn't actually test those particular lines from @stevengj. After looking at it further, I realized that his tests simply are not correct, you can't mix and match incomplete UTF-8 sequences like that, because you can have two sequences that are invalid on their own, which put back together, are a valid sequence of bytes, which is esp. likely because of the way the code takes apart the bytes of the UTF-8 representation of a valid character, and then vcat's things together... in this test, sometimes two wrongs do make a right (which then causes the test to fail!) I'll have the tests in order shortly... |
I’ve updated the test cases... I hope you’ll all agree that I did a fairly comprehensive set of test cases for UTF-8! Let me know if there is anything else to prevent this from being merged... Thanks! |
Looks good, thanks! |
Ah shoot, I just noticed you hadn't squashed the final commit; I probably should have waited for that before merging. |
Rewrote u8_isvalid to fix cases of not detecting single or out of order surrogate characters.
Also improved performance, generally around 2-3x faster on simple tests with different types
of characters (ASCII, Latin1, 2-byte/3-byte/4-byte UTF-8 sequences).
Fixed some tests that were depending on being able to make invalid string literals.