Discrepancy between foliapy and libfolia in stripping control characters in normalize_spaces() #55
Comments
Yes, you are right. Strange oversight. But until now it never caused problems.
@proycon This introduces another interesting issue: should we preserve (some?) BiDi information?
I was having the same thoughts, yeah. Preserving the bidi information
would indeed be best, so I'm not entirely happy with our solution now.
One can also argue that FoLiA itself could have explicit constructs for
bidi information (in markup annotation), rather than leaving it to
Unicode (like HTML does).
But unless there are real use cases for mixed bidirectional text I don't
really want to make an issue out of this.
Aren't the files from @martinreynaert examples of a use case?
normalize_spaces() is used in text validation. Currently foliapy (v2.5.11) and libfolia behave differently here regarding control characters.
This issue arose from @martinreynaert's data, where we see for example:
The character in question is 0x7f (DELETE).
It also happens in an instance of Hebrew text (I transliterate the Hebrew because browsers are too smart in RTL rendering and obscure the point):
<0x202d>Tun-<0x202d>Idash
which libfolia turns into
Tun- Idash
(inserts an unwanted space). 0x202d is the left-to-right override (LRO) control character.
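To make the discrepancy concrete, here is a minimal Python sketch of the two behaviours. It is not the actual code of foliapy or libfolia: the function names normalize_strip and normalize_replace are hypothetical, and treating the Unicode categories Cc and Cf as "control characters" is an assumption made for illustration.

```python
import unicodedata

# Assumption: "control characters" means Unicode categories Cc (control)
# and Cf (format, which includes bidi controls such as U+202D).
CONTROL_CATEGORIES = ("Cc", "Cf")


def normalize_strip(text: str) -> str:
    """Hypothetical variant: drop control characters, then collapse whitespace.

    Tab, newline and carriage return are kept so that they still count
    as whitespace when the text is split.
    """
    kept = "".join(
        ch
        for ch in text
        if unicodedata.category(ch) not in CONTROL_CATEGORIES or ch in "\t\n\r"
    )
    return " ".join(kept.split())


def normalize_replace(text: str) -> str:
    """Hypothetical variant: turn control characters into spaces, then collapse."""
    mapped = "".join(
        " " if unicodedata.category(ch) in CONTROL_CATEGORIES else ch
        for ch in text
    )
    return " ".join(mapped.split())


sample = "\u202dTun-\u202dIdash"
print(repr(normalize_strip(sample)))    # 'Tun-Idash'
print(repr(normalize_replace(sample)))  # 'Tun- Idash'
```

The second variant reproduces the unwanted space described above, and the same difference shows up for 0x7f (DELETE) inside a word. Whichever behaviour is chosen, text validation only works if both libraries agree on it.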