Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing RTL ligatures decomposition order #1589

Closed
naourass opened this issue Jan 31, 2023 · 11 comments
Closed

Fixing RTL ligatures decomposition order #1589

naourass opened this issue Jan 31, 2023 · 11 comments
Assignees
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF workflow-arabic-text-extraction Related to text extraction, but with a focus on Arabic text

Comments

@naourass
Copy link

naourass commented Jan 31, 2023

Explanation

When you extract Arabic text, the words are returned in backward order which is a normal behavior for RTL languages, and you need to use bidi algorithm to be able to display it correctly across UI/GUIs.

The problem is when you have ligatures chars that are composed of two or more chars, these ligatures are not reversed which makes the extracted text inaccurate.

Here's an example:

Let's suppose we have a font with a ligature glyph "لا" that maps to "uni0644 uni0627". The pdf is rendered like this:

image

When you extract the pdf text using page.extract_text() you get this:
كارتــــــشلاا

Notice how all chars are in reverse order except "لا".

And here's the final result after applying bidi algorithm:
االشــــــتراك

Sample PDF:
https://drive.google.com/file/d/1SYi4aDRGsgwQydukwfLrkSXoQgH7oPBn/view?usp=sharing

Implementation

# pypdf/_page.py

def _extract_text():
    ...
    elif operator == b"Tj":
        ...
        for x in "".join(
            [cmap[1][x] if x in cmap[1] else x for x in t]
        ):
            ...

Before joining [cmap[1][x] if x in cmap[1] else x for x in t] into a single string, we can check if there's items with more than one character and if these characters are in the RTL range. If it's the case, we can reverse their order before joining the list.

This could be done by extracting the RTL Range Check method and using it twice (for ligature decomposition and for text appending), or by refactoring the logic of how RTL extraction is handled.

@naourass
Copy link
Author

Here's another explanation from an SO answer for viewers that are not familiar with Arabic:

An example using RTL with english (sorry), suppose the word "fire" is rendering RTL as "erif" with 3 glyphs: e, r, and fi (through arbitrary CIDs, perhaps \001\002\003). If the CIDs are used to get the Unicode information, and the "fi" ligature is de-ligatured, you'll get "erfi" as the data.

@MartinThoma MartinThoma added the workflow-arabic-text-extraction Related to text extraction, but with a focus on Arabic text label Jan 31, 2023
@pubpub-zz
Copy link
Collaborator

@naourass
can you review #1578 and confirm that the side effect is the same. I've started the analysis there and my first thoughts (not validated yet) is that the RTL should be consider when merging Tj (line 1886 and further). Do you want to confirm my analysis and propose a PR? 😉

@naourass
Copy link
Author

naourass commented Feb 1, 2023

@pubpub-zz
In addition to the issue of this thread which is RTL de-ligaturing, #1578 reflects another issue with Diacritics positioning during RTL extraction. To summarize, there's currently two apparent issues with RTL text extraction:

  1. Ligature decomposition: the ligatures should be reversed before joining each TJ operand set of chars into one string.
  2. Diacritics positioning: It seems that the diacritics are appended to the wrong extremity, which screw up the structure of the entire line of text.

I'll initiate a PR for the first point, and I'll start inspecting the second point further once possible.

@naourass
Copy link
Author

naourass commented Feb 2, 2023

@pubpub-zz

There's an issue with 015-arabic/habibi.pdf. I've deleted all the chars except "h". It turned out that "h" is a glyph that is mapped to the whole unicode representation of "حَبيبي h".

Here's how the PDF is rendered (link to the file here):

image

Here's what happens if you copy the "h" char from the PDF and paste it in a browser url bar:

image

image

Here's the content stream and the cmap:

BT
0 0 0 rg
/GS0 gs
/C2_0 12 Tf
386.3 747.978 Td
<004B>Tj
ET
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
4 beginbfchar
<004b> <062d064e0628064a0628064a00200068>
<0044> <0061>
<0045> <0062>
<004c> <0069>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

Notice how <004b> is mapped to <062d064e0628064a0628064a00200068>. I've almost pulled my hair to figure out what's happening here, fortunately I've found it 😆.

Let me know where should I publish the new sample Arabic PDF(s) please. Inside resources folder, in network, in samples submodule or somewhere else?

@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Feb 2, 2023

The best is to put the files in the issues. You can then reference them from the tests using the url

@naourass
Copy link
Author

naourass commented Feb 2, 2023

Fixed Habibi.pdf for test_minimal_arab_text_extraction():
https://github.com/py-pdf/pypdf/files/10567398/habibi-fixed.pdf

@naourass
Copy link
Author

naourass commented Feb 2, 2023

@pubpub-zz
I've made a draft PR to fix the main arabic extraction test. I'll make another one for ligature decomposition. Let me know if there's any problem.

@naourass
Copy link
Author

naourass commented Feb 7, 2023

@pubpub-zz
Thanks for your feedback. I need some confirmation if it's better to keep the odd habibi.pdf inside the main Arabic extraction test, or to move it into a new Ligatures Decomposition test since it contains an uncommon and very special case ligature (single glyph "h" mapped to a set of Arabic+Latin+White characters). Can you review this please: #1597 (comment)

@naourass
Copy link
Author

naourass commented Feb 10, 2023

Added arabic-ligature.pdf for test_ligature_arab_text_extraction():
https://github.com/py-pdf/pypdf/files/10707635/arabic-ligature.pdf

P.S. We'll still have to test also bidirectional text (Arabic+Latin) including Arabic inside Latin, Latin inside Arabic, Arabic then Latin, Latin then Arabic, end with punctuation, and end with diacritic.

@naourass
Copy link
Author

@pubpub-zz Awesome, RTL ligatures order has been fixed in f5ac79b. I'll be closing the draft PR.

@pubpub-zz
Copy link
Collaborator

I close this issue as Fixed

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Mar 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF workflow-arabic-text-extraction Related to text extraction, but with a focus on Arabic text
Projects
None yet
Development

No branches or pull requests

3 participants