Fixing RTL ligatures decomposition order #1589
Comments
Here's another explanation from an SO answer for viewers that are not familiar with Arabic:
@pubpub-zz
I'll open a PR for the first point, and I'll look into the second point further as soon as I can.
There's an issue with Here's how the PDF is rendered (link to the file here): Here's what happens if you copy the "h" char from the PDF and paste it in a browser url bar: Here's the content stream and the cmap:
Notice how
Let me know where I should publish the new sample Arabic PDF(s), please: inside the resources folder, in network, in the samples submodule, or somewhere else?
The best option is to put the files in the issue. You can then reference them from the tests using the URL.
Fixed
@pubpub-zz
@pubpub-zz
Added P.S. We'll also still have to test bidirectional text (Arabic+Latin), including Arabic inside Latin, Latin inside Arabic, Arabic then Latin, Latin then Arabic, text ending with punctuation, and text ending with a diacritic.
@pubpub-zz Awesome, the RTL ligature order has been fixed in f5ac79b. I'll be closing the draft PR.
I'm closing this issue as fixed.
Explanation
When you extract Arabic text, it is returned in reverse order, which is normal behavior for RTL languages, and you need to apply the bidi algorithm to display it correctly in UIs/GUIs.
The problem arises with ligature glyphs that map to two or more characters: the characters inside such a ligature are not reversed, which makes the extracted text inaccurate.
Here's an example:
Let's suppose we have a font with a ligature glyph "لا" that maps to "uni0644 uni0627". The pdf is rendered like this:
When you extract the PDF text using page.extract_text(), you get this:
كارتــــــشلاا
Notice how all chars are in reverse order except "لا".
And here's the final result after applying the bidi algorithm:
االشــــــتراك
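The mismatch above can be reproduced with plain strings. This is only a sketch: a full string reversal stands in for the bidi algorithm (a reasonable simplification for a single pure-RTL run), and the `intended` spelling is inferred from the example rather than quoted from the issue.

```python
TATWEEL = "\u0640"  # the elongation character that appears six times in the example

# Text as returned by extraction: every character is in reverse order,
# except the lam-alef pair (U+0644 U+0627) that came from the ligature glyph.
extracted = "\u0643\u0627\u0631\u062a" + TATWEEL * 6 + "\u0634\u0644\u0627\u0627"

# Stand-in for bidi reordering of a pure RTL run: reverse the whole string.
naive = extracted[::-1]

# Assumed intended logical order: alef, lam, alef, sheen, ...
intended = "\u0627\u0644\u0627\u0634" + TATWEEL * 6 + "\u062a\u0631\u0627\u0643"

print(naive == intended)                  # the two strings differ
print(sorted(naive) == sorted(intended))  # same characters, different order
```

Only the two characters produced by the ligature end up swapped; everything else is restored correctly by the reversal.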
Sample PDF:
https://drive.google.com/file/d/1SYi4aDRGsgwQydukwfLrkSXoQgH7oPBn/view?usp=sharing
Implementation
Before joining
[cmap[1][x] if x in cmap[1] else x for x in t]
into a single string, we can check whether any items contain more than one character and whether those characters are in the RTL range. If so, we can reverse their order before joining the list. This could be done by extracting the RTL range check into a method and using it twice (for ligature decomposition and for text appending), or by refactoring how RTL extraction is handled.
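A minimal sketch of this idea follows. It is not the pypdf implementation: the `is_rtl` and `decompose` helpers and the glyph names are hypothetical, and the RTL test uses `unicodedata.bidirectional` from the standard library in place of the library's own range check.

```python
import unicodedata


def is_rtl(ch: str) -> bool:
    # Assumption: treat bidi classes "R" and "AL" as right-to-left.
    return unicodedata.bidirectional(ch) in ("R", "AL")


def decompose(glyphs, mapping):
    """Map glyphs to text, reversing multi-character RTL expansions."""
    parts = []
    for glyph in glyphs:
        text = mapping.get(glyph, glyph)
        # A ligature expands to several characters in logical order,
        # while the surrounding stream is in visual (reversed) order.
        if len(text) > 1 and all(is_rtl(c) for c in text):
            text = text[::-1]
        parts.append(text)
    return "".join(parts)


# Hypothetical glyph names, listed in the visual order of the example word.
mapping = {
    "kaf": "\u0643", "alef": "\u0627", "reh": "\u0631", "teh": "\u062a",
    "sheen": "\u0634", "lam_alef": "\u0644\u0627",
}
glyphs = ["kaf", "alef", "reh", "teh", "sheen", "lam_alef", "alef"]

visual = decompose(glyphs, mapping)
logical = visual[::-1]  # stand-in for bidi reordering of a pure RTL run
print(logical == "\u0627\u0644\u0627\u0634\u062a\u0631\u0627\u0643")
```

Because the `all(...)` check requires every character to be strongly RTL, an expansion that includes a diacritic (bidi class "NSM") would be left untouched here; whether that is the desired behavior would need separate testing.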