Wrong RTL language text direction when English numbers exist in the text #1638
Labels: is-feature, workflow-arabic-text-extraction, workflow-text-extraction
I was using pypdf to extract Arabic text from a PDF. An earlier problem with the order of characters within words has been fixed. However, when the text contains any English numbers (Western digits, 0-9), the order of the sentences is completely mixed up. A fix would be highly appreciated, because pypdf is the best package I have found so far for extracting Arabic text. This is critical because the majority of Arabic speakers use English numbers.
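To illustrate what might be going on, here is a minimal sketch. It is not pypdf's actual implementation. It assumes an extractor that converts a visually ordered right-to-left line into logical order by reversing it. In that case every run of Western digits is reversed too and has to be flipped back. The helper name `visual_to_logical` and the digit-run pattern are hypothetical and exist only for this illustration.

```python
import re

# Runs of ASCII digits, optionally joined by common numeric separators.
# [0-9] is used on purpose: \d would also match Arabic-Indic digits.
_DIGIT_RUN = re.compile(r"[0-9]+(?:[.,/:-][0-9]+)*")


def visual_to_logical(line: str) -> str:
    """Turn a visual-order RTL line (characters as drawn, left to right)
    into logical (reading) order.

    Reversing the whole line also reverses the digits, so every digit
    run is reversed a second time to restore it.
    """
    reversed_line = line[::-1]
    return _DIGIT_RUN.sub(lambda m: m.group(0)[::-1], reversed_line)


# Hypothetical input: the Arabic word for "year" followed by 2023,
# stored in the order the glyphs appear on the page.
visual = "2023 \u0629\u0646\u0633"
logical = visual_to_logical(visual)
```

Without the second step the year would come out as 3202, which is similar to the mixed-up order described here.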
For example, the correct order of the following extracted line
should be
Some of the numbers are also not extracted at all, as you can see.
Also, no space is inserted between the last word of one text and the first word of the next when a large gap separates two texts on the same line. For example, the header of the PDF
is extracted as
The text above has two additional problems: the two page numbers at the bottom of the PDF page are picked up, and part of the text is moved to the next line.
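The missing-space problem could in principle be handled by looking at horizontal positions. The following is only a sketch under stated assumptions: each fragment of a line is available as a hypothetical `(x_start, x_end, text)` tuple, and the `min_gap` threshold of 3.0 PDF units is an arbitrary choice. The function `join_with_gaps` is not part of pypdf.

```python
from typing import List, Optional, Tuple

# (x_start, x_end, text) for one fragment on a single line, in PDF units.
Fragment = Tuple[float, float, str]


def join_with_gaps(fragments: List[Fragment], min_gap: float = 3.0) -> str:
    """Join the fragments of one line from left to right, inserting a
    space wherever the horizontal gap between neighbours exceeds min_gap."""
    ordered = sorted(fragments, key=lambda f: f[0])
    parts: List[str] = []
    prev_end: Optional[float] = None
    for x_start, x_end, text in ordered:
        if prev_end is not None and x_start - prev_end > min_gap:
            parts.append(" ")
        parts.append(text)
        prev_end = x_end
    return "".join(parts)


# Two header texts far apart on the same line get a space between them.
header = join_with_gaps([(200.0, 260.0, "B"), (10.0, 60.0, "A")])
```

The result is in visual (left-to-right) order, so for Arabic text it would still need to be converted to reading order afterwards.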
I already tried to work around this by extracting individual words and combining them myself, but there is no such option. Thanks for your contributions!
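For reference, this is roughly the kind of combining step I had in mind, as a sketch only. It assumes positioned fragments are available as hypothetical `(x, y, text)` tuples. Recent pypdf releases accept a `visitor_text` callback in `extract_text`, which may be one way to collect them, but whether that applies to 3.4.1 and what its exact arguments are should be checked against the documentation. The name `group_into_lines` and the tolerance value are my own assumptions.

```python
from typing import List, Tuple

# (x, y, text): origin of a text fragment plus its content.
Positioned = Tuple[float, float, str]


def group_into_lines(
    fragments: List[Positioned], y_tolerance: float = 2.0
) -> List[List[Positioned]]:
    """Group fragments whose baselines are within y_tolerance of each
    other into lines, top of the page first, each line sorted by x."""
    lines: List[Tuple[float, List[Positioned]]] = []
    # PDF y coordinates grow upward, so sort descending to start at the top.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - frag[1]) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag[1], [frag]))
    return [sorted(items, key=lambda f: f[0]) for _, items in lines]


grouped = group_into_lines([(50.0, 700.0, "b"), (10.0, 700.5, "a"), (10.0, 680.0, "c")])
texts = [[text for _, _, text in line] for line in grouped]
```

Grouping by baseline like this would keep text that belongs to one line together, and fragments near the bottom of the page (such as page numbers) could be filtered out by their `y` value.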
Ubuntu 22.04
pypdf 3.4.1
https://drive.google.com/file/d/1UAcXUqPpu1WVzlcdeuVBGWODAMmb3lkq/view?usp=sharing