utf8proc does not correctly handle the 66 Unicode "noncharacters" #34

Closed
ScottPJones opened this issue May 7, 2015 · 8 comments

Comments

@ScottPJones
Contributor

utf8proc treats the 66 Unicode "noncharacters" as invalid. However, the Unicode standard specifically says that they are valid code points and need to be handled correctly by conforming software.
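
For reference, the 66 noncharacters are the 32 code points U+FDD0..U+FDEF plus the last two code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF). A minimal sketch that enumerates them is below; `is_noncharacter` and `count_noncharacters` are hypothetical helper names for illustration and are not part of utf8proc.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a utf8proc function): true if cp is one of
 * the 66 Unicode noncharacters. */
static bool is_noncharacter(int32_t cp) {
    if (cp < 0 || cp > 0x10FFFF)
        return false;                 /* outside the Unicode code space */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;                  /* contiguous block of 32 */
    /* Last two code points of every plane: low 16 bits are FFFE or FFFF. */
    return (cp & 0xFFFE) == 0xFFFE;
}

/* Hypothetical helper: counts noncharacters over the whole code space. */
static int count_noncharacters(void) {
    int count = 0;
    for (int32_t cp = 0; cp <= 0x10FFFF; cp++) {
        if (is_noncharacter(cp))
            count++;
    }
    return count;
}
```

Calling `count_noncharacters()` gives 32 + 2 * 17 = 66, which matches the number in the issue title.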

@stevengj
Member

stevengj commented May 7, 2015

According to the Unicode FAQ they should still be category Cn, but we shouldn't return UTF8PROC_ERROR_INVALIDUTF8 for these.

@ScottPJones
Contributor Author

Yes, precisely!

@stevengj
Member

stevengj commented May 7, 2015

Looking quickly through the code, the only place this seems to come up is in `utf8proc_iterate`, where `if (uc < 0 || ((uc & 0xFFFF) >= 0xFFFE))` should be replaced with `if (uc < 0)`. The `(uc >= 0xFDD0 && uc < 0xFDF0)` check in the lines above that should also be removed.

Oh, and also in `utf8proc_codepoint_valid`.
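
To make the effect of that change concrete, here is a rough sketch that models the two conditions as standalone predicates. It is not the actual utf8proc source: `old_check_rejects` and `new_check_rejects` are hypothetical names, and the surrounding decoding logic of `utf8proc_iterate` is left out. Only the conditions quoted in the comment above are reproduced.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the checks before the change: a decoded value
 * is rejected if it is a decode error (negative) or a noncharacter. */
static bool old_check_rejects(int32_t uc) {
    if (uc >= 0xFDD0 && uc < 0xFDF0)
        return true;                  /* the U+FDD0..U+FDEF block */
    return uc < 0 || ((uc & 0xFFFF) >= 0xFFFE);
}

/* Hypothetical model after the change: only decode errors are rejected,
 * so noncharacters pass through as ordinary code points. */
static bool new_check_rejects(int32_t uc) {
    return uc < 0;
}
```

Under this model, U+FFFE, U+FDD0, and U+10FFFF are rejected by the old predicate and accepted by the new one, while a negative error value is rejected by both.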

@ScottPJones
Contributor Author

I think that's all I'd seen also... will fix as soon as I find that bloody "round tuit"!

@ScottPJones
Contributor Author

OK, I made the changes and tested them by rebuilding Julia and running all the unit tests... but then it turns out that deps/utf8proc/utf8proc.c is not the same as in the JuliaLang/utf8proc repository... lots of simple differences, like UTF8PROC_DLLEXPORT vs. DLLEXPORT.
Can anybody please explain why there are differences, and how I put my changes in so they get used by Julia? Thanks!

@pao

pao commented May 9, 2015

Julia doesn't track master of external dependencies, like utf8proc, except when absolutely necessary. I think you can modify your local build to use master (you can probably figure it out from deps/Makefile) for testing purposes.

@stevengj
Member

stevengj commented May 9, 2015

I would clone utf8proc into a separate repository before making changes. Editing git submodules is a recipe for trouble if you aren't a git guru.

@ScottPJones
Contributor Author

@stevengj Thanks, luckily, I'd already done that last night, after going crazy trying to figure out why deps/utf8proc didn't match my ScottPJones/utf8proc fork...

ScottPJones added a commit to ScottPJones/utf8proc that referenced this issue May 9, 2015
stevengj added a commit that referenced this issue May 30, 2015
Fix #34 handle 66 Unicode non-characters and surrogates correctly

3 participants