Wrong RTL language text direction when English numbers exist in the text #1638

Closed
esraa-abdelmaksoud opened this issue Feb 16, 2023 · 2 comments
Labels
is-feature: A feature request
workflow-arabic-text-extraction: Related to text extraction, but with a focus on Arabic text
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

Comments

esraa-abdelmaksoud commented Feb 16, 2023

I was using pypdf to extract Arabic text from a PDF. An earlier problem with the order of characters within words has been fixed. However, when the text contains any English numbers (the digits 0-9), the order of the sentences gets completely mixed up. A fix would be highly appreciated, because pypdf is the best package for extracting Arabic text so far. This is critical because the majority of Arabic speakers use English numbers.

For example, the following extracted line
[Screenshot from 2023-02-16 02-42-13]
should read
[Screenshot from 2023-02-16 02-43-04]
Some of the numbers are also missing from the extracted text, as you can see.
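
One possible explanation for the reordering described above is that the line is stored in display (visual) order and the digit runs are reversed together with the Arabic letters. The following is only a simplified sketch of that idea, not pypdf's implementation and not the full Unicode Bidirectional Algorithm. The function name `visual_to_logical` and the set `LTR_CLASSES` are hypothetical names chosen for illustration. Numbers containing separators such as "3.14" are not handled.

```python
import unicodedata

# Bidi classes treated as left-to-right runs: Latin letters (L),
# European digits (EN) and Arabic-Indic digits (AN).
# Assumption: this small set is enough for the illustration.
LTR_CLASSES = {"L", "EN", "AN"}


def visual_to_logical(line: str) -> str:
    """Convert a right-to-left line from visual order to logical order.

    The whole line is reversed, then every run of left-to-right
    characters (digits, Latin letters) is reversed back so that it
    reads in its natural direction.
    """
    out = []
    run = []  # current run of left-to-right characters
    for ch in reversed(line):
        if unicodedata.bidirectional(ch) in LTR_CLASSES:
            run.append(ch)
        else:
            # Flush the pending run, restoring its left-to-right order.
            out.extend(reversed(run))
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)


# Visual order as displayed, left to right: "2023", space, meem, alef, ain.
visual = "2023 \u0645\u0627\u0639"
logical = visual_to_logical(visual)
# logical is now ain, alef, meem, space, "2023"
```

Reversing the line without restoring the digit runs would yield "3202" here, which matches the kind of scrambling reported in this issue.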

Also, no space is added between the last word of one piece of text and the first word of the next when a wide gap separates two texts on the same line, as in the header of the PDF. For example,
[Screenshot from 2023-02-16 02-44-16]
is extracted as
[Screenshot from 2023-02-16 02-44-43]
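
To illustrate the expected behavior for the missing space, here is a minimal sketch that joins text fragments on one line and inserts a space wherever the horizontal gap exceeds a threshold. The `Fragment` type, the `join_with_gaps` function, and the `gap_threshold` default are all hypothetical; this is not how pypdf represents text internally. Fragments are sorted by their left edge, so for right-to-left text the result is in visual order.

```python
from typing import Iterable, NamedTuple


class Fragment(NamedTuple):
    """Hypothetical text fragment positioned on a single line."""
    text: str
    x: float      # left edge, in PDF units
    width: float  # rendered width, in PDF units


def join_with_gaps(fragments: Iterable[Fragment], gap_threshold: float = 1.0) -> str:
    """Join fragments left to right, adding a space across wide gaps."""
    parts = []
    prev_end = None  # right edge of the previous fragment
    for frag in sorted(fragments, key=lambda f: f.x):
        if prev_end is not None and frag.x - prev_end > gap_threshold:
            parts.append(" ")
        parts.append(frag.text)
        prev_end = frag.x + frag.width
    return "".join(parts)


# Two header texts far apart on the same line.
header = [Fragment("Report", 400.0, 40.0), Fragment("2023", 50.0, 25.0)]
joined = join_with_gaps(header)  # "2023 Report"
```

A suitable threshold would likely depend on the font size, which this sketch ignores.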

There are also 2 additional problems in the text above. The two page numbers at the bottom of the PDF page are pulled up into the text, and part of the text is moved to the next line.

I already tried to work around this by extracting individual words and combining them myself, but I found no such option. Thanks for your contributions!
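
For readers looking for a similar workaround: pypdf's `extract_text()` accepts a `visitor_text` callback that receives each text fragment together with its transformation matrices. The sketch below collects those fragments with rough coordinates. The names `FragmentCollector` and `collect_fragments` are hypothetical, and the fragments are not guaranteed to be whole words. Reading the position from `tm[4]` and `tm[5]` alone is an assumption; some PDFs may require combining `tm` with `cm`. This has not been tested against the PDF attached to this issue.

```python
class FragmentCollector:
    """Callable that records each text fragment reported by pypdf."""

    def __init__(self):
        self.fragments = []  # list of (text, x, y) tuples

    def __call__(self, text, cm, tm, font_dict, font_size):
        # Skip fragments that contain only whitespace.
        if text.strip():
            # Assumption: tm[4] and tm[5] hold the translation part
            # of the text matrix, used here as an approximate position.
            self.fragments.append((text, tm[4], tm[5]))


def collect_fragments(page):
    """Run pypdf text extraction on a page and return the fragments."""
    collector = FragmentCollector()
    page.extract_text(visitor_text=collector)
    return collector.fragments
```

The collected tuples could then be grouped by `y` and reordered manually, for example to drop page numbers near the bottom of the page.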

Ubuntu 22.04
pypdf 3.4.1

https://drive.google.com/file/d/1UAcXUqPpu1WVzlcdeuVBGWODAMmb3lkq/view?usp=sharing

from pypdf import PdfReader

# fname is the path to the PDF file
reader = PdfReader(fname)
full_text = ""
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    # Append a page marker after the text of each page
    text += f"\n Page{i}\n"
    full_text += text

pubpub-zz (Collaborator) commented

@esraa-abdelmaksoud
This looks like a duplicate of #1629.
Could you add your case to that thread and also provide the PDF file? Without such data we cannot conduct a proper analysis.

pubpub-zz (Collaborator) commented

Closed as duplicate.

MartinThoma added the workflow-arabic-text-extraction, workflow-text-extraction, and is-feature labels on Mar 25, 2023