Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #34 handle 66 Unicode non-characters and surrogates correctly #35

Merged
merged 2 commits into from
May 30, 2015

Conversation

ScottPJones
Copy link
Contributor

This change also improves the performance of the functions utf8proc_iterate and utf8proc_codepoint_valid, and fixes surrogate handling in utf8proc_encode_char.

@pao
Copy link

pao commented May 9, 2015

make check failed on Travis, possible encoding problem? (Fancy that in a Unicode library.)

@pao
Copy link

pao commented May 9, 2015

Or possibly it's now testing non-characters because the definition changed.

@ScottPJones
Copy link
Contributor Author

The errors are something to do with graphemes, which I didn't think my changes would even affect.
There are also a number of warnings about a wcwidth function, on both gcc and clang, that I didn't touch :sad:

@nalimilan
Copy link
Member

Given that both Travis builds failed (AppVeyor does not run the tests), it doesn't look like a random failure. It doesn't look unlikely to me that modifying an essential function like utf8proc_encode_char could change the behaviour of tests. Even if you didn't change these functions, they probably call the modified code indirectly.

@ScottPJones
Copy link
Contributor Author

@nalimilan What makes me wonder a bit, is all the warning messages in the travis compilation...
I'll dig into it though

@stevengj
Copy link
Member

Would also be good to add a regression test to the tests/ directory.

@ScottPJones
Copy link
Contributor Author

Note: I figured out why the grapheme test failed... but it doesn't make me very happy about utf8proc.
I managed to work around it, but basically, the tests depended on the utf8 encoding being broken!
(it encoded some values as 0xFF and 0xFE, which it then used as markers, but left that bad encoding
for all callers to the utf8proc_encode_char function).
@stevengj This already has fairly complete tests..., are you talking about additional tests in the julia tests unit tests?

@stevengj
Copy link
Member

Yes, the CHARBOUND option was documented as inserting 0xFF bytes as delimiters. I'd be happy to deprecate this functionality, since I find utf8proc_grapheme_break to be much more useful (because it doesn't require making a copy of the string etcetera just to detect grapheme breaks).

Alternatively, we could change it to insert NUL bytes as delimiters, but it would be good to change the API (e.g. to UTF8PROC_GRAPHEME) in that case to force a breaking change for code relying on this feature.

By test case, I mean creating an additional test program that checks for regressions on handling of these non-characters and surrogates.

@@ -49,7 +51,7 @@ int main(int argc, char **argv)
else
i++;
}
glen = utf8proc_map(utf8, j, &g, UTF8PROC_CHARBOUND);
glen = utf8proc_map(utf8, j, &g, UTF8PROC_CHARBOUND | UTF8PROC_OLDENCODE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't UTF8PROC_CHARBOUND imply UTF8PROC_OLDENCODE? Why have a separate flag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd missed that (about CHARBOUND...).
OK, I'll try to get some testing in... (in my copious spare time! 😀)
Would it be possible to have extra testing as a follow-up PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, it's good practice to merge patches only given regression tests and (where needed) documentation (though none is needed in this case if there are no API changes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, true... I'll add some further testing, and make sure it gets all of the edge cases that were broken before... there are no API changes though.

@ScottPJones
Copy link
Contributor Author

@stevengj Could you also please take a look at my new PR #39 here?
I added some reasonably extensive testing of both the codepoint_valid and iterate functions...
I did this as a separate PR, so you could easily see the issues I found with the current utf8proc.c code...
Thanks greatly!

}

/* internal "unsafe" version that does not check whether uc is in range */
static const utf8proc_property_t *utf8proc_unsafe_get_property(utf8proc_int32_t uc) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a static function, there is no need to put the utf8proc_ prefix. It's good to preserve the convention that exported (non-static) functions get the prefix, whereas non-exported functions have no prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added those because I mistakenly thought that's what you were asking me to do... I'll go back and put those back... (this is what you get for programming on 3 days of only 2 hours of sleep!)

@stevengj
Copy link
Member

Once 39 is merged, can you update the tests to also test for noncharacter validity and include the updated tests in this PR?

@ScottPJones
Copy link
Contributor Author

#39 already does test for noncharacter validity... If you run it now, it will show you all the problems you have if you don't merge this one 😀

@stevengj
Copy link
Member

@ScottPJones, then it should be merged into this PR (and squashed/rebased).

The tests should be part of the PR on general principle, but also as a practical matter: unless the tests are part of the PR, we can't tell on github (until after everything is merged) that the PR actually passes the tests.

@stevengj
Copy link
Member

Oh, you should also update the .gitignore file with the names of the new test programs (so that git status will ignore them).

@ScottPJones
Copy link
Contributor Author

But then, isn't it harder to show issues that haven't been fixed yet?
Actually, that's something that I think is a problem with the way the current unit testing works...
I think that you should be able to independently of fixes, be able to add unit tests for any problems that have been found... and the unit test framework should check for regressions on any particular test, while saving and displaying information about those that fail, instead of terminating the testing process...
(I worked this way for many years, it was very effective in not losing track of issues in the product, while showing where progress was being made in cleaning up old issues)

@ScottPJones
Copy link
Contributor Author

Oh, didn't know about .gitignore... Thanks!

@stevengj
Copy link
Member

Committing tests that fail is not really compatible with the Travis/Github workflow. If you don't combine the PRs, then we are forced to either (a) merge 39 first, in which case you break the build (tests passing/failing is binary on Github) or (b) merge 35 first, in which case we are merging before testing.

A more fine-grained testing framework might be nice, but that's not an issue we can take up here.

@timholy
Copy link

timholy commented May 29, 2015

I think that you should be able to independently of fixes, be able to add unit tests for any problems that have been found.

I agree this would be a major improvement. There have been quite a few times where I've had to omit a test that should have been there simply because some other bug made it fail. It would be really nice to have a "todo list" of failing tests.

That said, certainly our current framework doesn't support this.

@ScottPJones
Copy link
Contributor Author

@stevengj The way the unit tests would work within the Travis/GitHub framework is fairly simple... A "failure" for Travis would only be when there is a regression on a test that passed previously...
New tests show what the behavior is supposed to be, and the ones that don't match, get counted up,
and so you'd see something like:

Unicode Tests:
     Line 30, 'isvalid(UTF8String, b"\xff") -> false' failed, returned true
     .... 
  1000 passed, 50 failed, 3.53 seconds

What do you think?

@stevengj
Copy link
Member

@ScottPJones, that's reasonable, but since you have to build this facility "manually" on github I wouldn't bother on a project as small as utf8proc. For Julia, it might be worthwhile to extend Base.Test with a @todo test (for something that is expected to fail) or similar.

@ScottPJones
Copy link
Contributor Author

@stevengj Thanks for all the review! Hope I got the move of #39 here done correctly...

size_t encode(char *dest, const char *buf)
{
size_t i = 0, j, d = 0;
do {
for (;;) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, is there some substantive reason you changed this? Don't most compilers generate exactly the same code for for (;;) {...} and do {...} while(1)?

Usually, it's best to avoid purely stylistic changes in patches (unless you have a patch that is purely for a global style change); on the contrary, one strives for a patch that matches the surrounding code style.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't even remember doing that... that sort of thing happens on autopilot... I date myself 😀, but when I started, not all C compilers optimized while(1) away!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, do I need to change that back and resquash?

@ScottPJones
Copy link
Contributor Author

@stevengj You're quite right, but maybe this should be raised as an issue? To improve unit testing for Julia?
We actually had a system that kept track of the results of previous unit tests, in CSV files... so it wasn't really "manual", but 1) I didn't build the unit testing framework, 2) it's not the sort of programming I really want to do 😀, but I know it can be done and is very very useful, and would be good if somebody smart could implement it... (sort of intermediate beginner project? or Summer of Code?)

@tkelman
Copy link
Contributor

tkelman commented May 29, 2015

There are lots of nice small usable test frameworks out there that are better than what we have now - we could just pick one and start using it. I like the one that libgit2 uses for example: https://github.com/vmg/clar

@ScottPJones
Copy link
Contributor Author

@tkelman I'm pretty ignorant of a lot of what's out there in OSS land... (too many years with "not invented here" mentality where I worked... unfortunately I wasn't able to change that...)
Does vmg/clar have the capabilities that I described, of checking for regressions on each individual test?

dst[0] = 0xFE;
return 1;
// Note: we allow encoding 0xd800-0xdfff here, so as not to change
// the API, however, these are actually invalid in UTF-8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, do we still need utf8proc_unsafe_encode_char? Oh, never mind, I see that it is still used for the CHARBOUND thing.

@tkelman
Copy link
Contributor

tkelman commented May 29, 2015

checking for regressions on each individual test?

Not sure exactly what you mean, but the main thing a nicer test framework like clar (and there are even fancier ones out there, I'm sure) would do is run all the tests and give you a summary of failures at the end.

@ScottPJones
Copy link
Contributor Author

I just started looking at clar, but (in my really quick 50,000 feet flying look), it didn't seem to have built in tracking of previous results... but 1) I'm jetlagged, 2) need to look further...
The way the InterSystems test suite worked (IIRC) is that it would tag a potential release point (for the nightly build, for example), using Perforce, and run the tests, saving the results of all the tests...
(it also had files that had the correct results). It would then compare (and it had to check if tests were added or deleted) each item, to see if the result changed... if it changed, and didn't match the correct result, that was deemed a regression, and that build would be considered failed.
You'd get a nice report with how many new tests passed, how many new tests failed (along with more information to help debug each individual one), how many old tests passed, and how many failed (but did not change the result), and how many changed behavior and failed... you could also run the test suite locally... or subsections of it...

@ScottPJones
Copy link
Contributor Author

Oh, I forgot to say, it compared with the last known good build along that branch... then if there are no regressions detected, the build is marked OK, and it's results (if different), are saved and become the new point for future comparisons in the branch.

stevengj added a commit that referenced this pull request May 30, 2015
Fix #34 handle 66 Unicode non-characters and surrogates correctly
@stevengj stevengj merged commit 35ec8e3 into JuliaStrings:master May 30, 2015
@ScottPJones
Copy link
Contributor Author

Thanks!

@ScottPJones ScottPJones deleted the spj/valid branch May 30, 2015 08:25
tkelman added a commit that referenced this pull request May 30, 2015
#35 and #40 added new tests that #38 did not take into account

this is one case where it would be good if Travis re-tested the PR
after new commits get pushed to master
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants