Wrong RTL language text direction when English numbers exist in the text #1638

esraa-abdelmaksoud · 2023-02-16T00:47:52Z

I am using pypdf to extract Arabic text from a PDF. An earlier problem with the order of characters within words has been fixed. However, when the text contains any English numbers (Western digits, 0 to 9), the order of the sentence is completely mixed up. A fix would be highly appreciated, because pypdf is the best package I have found so far for extracting Arabic text. This is critical because the majority of Arabic speakers use English numbers.

For example, the correct order of the following extracted line

should be

and some of the numbers are not read as you can see.

Also, no space is added between the last and first words when there is a large gap between two pieces of text on the same line, as in the header of the PDF.

is extracted as

There are also two additional problems in the text above: the two page numbers at the bottom of the PDF page are picked up in the extracted text, and part of the text is moved to the next line.

I already tried to work around this by extracting individual words and combining them myself, but there is no such option. Thanks for your contributions!

Ubuntu 22.04
pypdf 3.4.1

https://drive.google.com/file/d/1UAcXUqPpu1WVzlcdeuVBGWODAMmb3lkq/view?usp=sharing

from pypdf import PdfReader

reader = PdfReader(fname)  # fname: path to the PDF file linked above
full_text = ""
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    # Append a page marker after each page's text
    text += f"\n Page{i}" + "\n"
    full_text += text
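
As background for the digit-ordering symptom described above, here is a minimal sketch that uses only the standard library (it is not pypdf code and makes no claim about how pypdf orders text). It shows that Western digits carry a different Unicode bidirectional class (`EN`) than Arabic letters (`AL`), so any logic that reorders right-to-left text has to treat digit runs separately. The helper name `bidi_classes` and the sample string are hypothetical, chosen only for illustration.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, bidi class) pairs for every non-space character."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text if not ch.isspace()]


# Hypothetical sample: the Arabic word for "page" followed by the Western digits 12.
# Escapes are used so the source itself is not affected by bidi rendering.
sample = "\u0635\u0641\u062d\u0629 12"

for ch, cls in bidi_classes(sample):
    # Arabic letters report "AL"; Western digits report "EN".
    print(f"U+{ord(ch):04X} {cls}")
```

Whether this distinction is what triggers the misordering in pypdf is an assumption; the sketch only makes the character classes involved visible.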


pubpub-zz · 2023-02-16T19:42:03Z

@esraa-abdelmaksoud
This looks like a duplicate of #1629.
Can you add your case to that thread and also provide the PDF file? Without such data we cannot conduct a proper analysis.

pubpub-zz · 2023-02-16T19:42:17Z

closed as duplicate

pubpub-zz closed this as completed Feb 16, 2023

MartinThoma added labels on Mar 25, 2023:
workflow-arabic-text-extraction (Related to text extraction, but with a focus on Arabic text)
workflow-text-extraction (From a user's perspective, text extraction is the affected feature/workflow)
is-feature (A feature request)
