Fixing Text Extraction Order For Arabic+Digits+Punctuation #1629
Comments
@naourass, tell me if you want to try to propose a PR.
@pubpub-zz
That's clearly an option to look at.
Not sure all the programs will handle that. I would prefer to not use this if possible.
There's also a decoding issue for some characters. To focus on inspecting the concatenation order issue, I'm manually overriding them by adding a temporary
@pubpub-zz I'm not a BiDi expert (yet), but after further inspection, here's my humble conclusion so far:
There still might be some heuristic indicators or other approaches to handle/detect the overall direction which I couldn't find at the moment. I'll be investigating this further when possible and I'll report if I find anything useful.
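One candidate heuristic for the overall direction is the "first strong character" rule from the Unicode Bidirectional Algorithm (UAX #9, rules P2 and P3). Below is a minimal sketch using only the standard library. The function name `guess_base_direction` is hypothetical and not part of pypdf, and the sketch ignores isolate and embedding control characters, which the full algorithm handles.

```python
import unicodedata


def guess_base_direction(text: str) -> str:
    """Guess the base direction from the first strong bidi character.

    Returns "rtl" if the first strong character is right-to-left
    (bidi class R or AL), "ltr" if it is left-to-right (class L),
    and "neutral" if the text has no strong character at all.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    # Only digits, punctuation, whitespace, or other weak/neutral characters.
    return "neutral"
```

Digits and punctuation are skipped because they are weak or neutral in the bidi classification, so a fragment such as `5161 (18)` gives no direction on its own and would need context from neighbouring text.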
Thank you for looking into this topic 💙
Adding machine learning to pypdf seems out of scope to me. Adding a hook for external code / another library would be fine to me.
@pubpub-zz @MartinThoma I've started working on an implementation example, I'll let you know when it's ready for review.
@naourass Are you still willing to provide a corresponding PR for this?
Explanation
When you have Arabic text mixed with digits, the text extraction order is messed up. Below is an example.
القسم الرئيسي - عدد 5161
2 جمادى الآخرة 1444 (18 يناير 2023)
page.extract_text()
:(2023 0 ﻳﻨﺎﻳ18) 1444 ة0 ﺟﻤﺎدى اﻵﺧ2 5161 ﺋﻴﴘ - ﻋﺪد0ﻢ اﻟ5اﻟﻘ
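To see why this kind of line is awkward to reorder, it can help to split the expected text into runs of the same Unicode bidi class. The sketch below is only an inspection aid under that assumption, not a fix. `bidi_runs` is a hypothetical helper name and uses only the standard library.

```python
import unicodedata
from itertools import groupby


def bidi_runs(text: str) -> list[tuple[str, str]]:
    """Split text into consecutive (bidi_class, substring) runs."""
    return [
        (bidi_class, "".join(chars))
        for bidi_class, chars in groupby(text, key=unicodedata.bidirectional)
    ]


# Inspect one fragment of the expected line from the example above.
for bidi_class, run in bidi_runs("عدد 5161"):
    print(bidi_class, repr(run))
```

Arabic letters are reported as class `AL`, the space as `WS`, and the European digits as `EN`, so a single line alternates between right-to-left letter runs and digit runs that keep their left-to-right order.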
Attachments: