Fixing RTL ligatures decomposition order #1589

naourass · 2023-01-31T05:32:20Z

Explanation

When you extract Arabic text, the words are returned in backward order which is a normal behavior for RTL languages, and you need to use bidi algorithm to be able to display it correctly across UI/GUIs.

The problem is when you have ligatures chars that are composed of two or more chars, these ligatures are not reversed which makes the extracted text inaccurate.

Here's an example:

Let's suppose we have a font with a ligature glyph "لا" that maps to "uni0644 uni0627". The pdf is rendered like this:

When you extract the pdf text using page.extract_text() you get this:
كارتــــــشلاا

Notice how all chars are in reverse order except "لا".

And here's the final result after applying bidi algorithm:
االشــــــتراك

Sample PDF:
https://drive.google.com/file/d/1SYi4aDRGsgwQydukwfLrkSXoQgH7oPBn/view?usp=sharing

Implementation

# pypdf/_page.py

def _extract_text():
    ...
    elif operator == b"Tj":
        ...
        for x in "".join(
            [cmap[1][x] if x in cmap[1] else x for x in t]
        ):
            ...

Before joining [cmap[1][x] if x in cmap[1] else x for x in t] into a single string, we can check if there's items with more than one character and if these characters are in the RTL range. If it's the case, we can reverse their order before joining the list.

This could be done by extracting the RTL Range Check method and using it twice (for ligature decomposition and for text appending), or by refactoring the logic of how RTL extraction is handled.

The text was updated successfully, but these errors were encountered:

naourass · 2023-01-31T05:42:09Z

Here's another explanation from an SO answer for viewers that are not familiar with Arabic:

An example using RTL with english (sorry), suppose the word "fire" is rendering RTL as "erif" with 3 glyphs: e, r, and fi (through arbitrary CIDs, perhaps \001\002\003). If the CIDs are used to get the Unicode information, and the "fi" ligature is de-ligatured, you'll get "erfi" as the data.

pubpub-zz · 2023-01-31T18:38:07Z

@naourass
can you review #1578 and confirm that the side effect is the same. I've started the analysis there and my first thoughts (not validated yet) is that the RTL should be consider when merging Tj (line 1886 and further). Do you want to confirm my analysis and propose a PR? 😉

naourass · 2023-02-01T02:12:04Z

@pubpub-zz
In addition to the issue of this thread which is RTL de-ligaturing, #1578 reflects another issue with Diacritics positioning during RTL extraction. To summarize, there's currently two apparent issues with RTL text extraction:

Ligature decomposition: the ligatures should be reversed before joining each TJ operand set of chars into one string.
Diacritics positioning: It seems that the diacritics are appended to the wrong extremity, which screw up the structure of the entire line of text.

I'll initiate a PR for the first point, and I'll start inspecting the second point further once possible.

naourass · 2023-02-02T05:21:27Z

@pubpub-zz

There's an issue with 015-arabic/habibi.pdf. I've deleted all the chars except "h". It turned out that "h" is a glyph that is mapped to the whole unicode representation of "حَبيبي h".

Here's how the PDF is rendered (link to the file here):

Here's what happens if you copy the "h" char from the PDF and paste it in a browser url bar:

Here's the content stream and the cmap:

BT
0 0 0 rg
/GS0 gs
/C2_0 12 Tf
386.3 747.978 Td
<004B>Tj
ET

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
4 beginbfchar
<004b> <062d064e0628064a0628064a00200068>
<0044> <0061>
<0045> <0062>
<004c> <0069>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

Notice how <004b> is mapped to <062d064e0628064a0628064a00200068>. I've almost pulled my hair to figure out what's happening here, fortunately I've found it 😆.

Let me know where should I publish the new sample Arabic PDF(s) please. Inside resources folder, in network, in samples submodule or somewhere else?

pubpub-zz · 2023-02-02T06:08:00Z

The best is to put the files in the issues. You can then reference them from the tests using the url

naourass · 2023-02-02T09:31:30Z

Fixed Habibi.pdf for test_minimal_arab_text_extraction():
https://github.com/py-pdf/pypdf/files/10567398/habibi-fixed.pdf

naourass · 2023-02-02T12:36:39Z

@pubpub-zz
I've made a draft PR to fix the main arabic extraction test. I'll make another one for ligature decomposition. Let me know if there's any problem.

naourass · 2023-02-07T12:03:06Z

@pubpub-zz
Thanks for your feedback. I need some confirmation if it's better to keep the odd habibi.pdf inside the main Arabic extraction test, or to move it into a new Ligatures Decomposition test since it contains an uncommon and very special case ligature (single glyph "h" mapped to a set of Arabic+Latin+White characters). Can you review this please: #1597 (comment)

naourass · 2023-02-10T13:11:53Z

Added arabic-ligature.pdf for test_ligature_arab_text_extraction():
https://github.com/py-pdf/pypdf/files/10707635/arabic-ligature.pdf

P.S. We'll still have to test also bidirectional text (Arabic+Latin) including Arabic inside Latin, Latin inside Arabic, Arabic then Latin, Latin then Arabic, end with punctuation, and end with diacritic.

naourass · 2023-02-10T15:06:28Z

@pubpub-zz Awesome, RTL ligatures order has been fixed in f5ac79b. I'll be closing the draft PR.

pubpub-zz · 2023-02-12T11:49:01Z

I close this issue as Fixed

naourass assigned MartinThoma Jan 31, 2023

MartinThoma added the workflow-arabic-text-extraction Related to text extraction, but with a focus on Arabic text label Jan 31, 2023

pubpub-zz closed this as completed Feb 12, 2023

esraa-abdelmaksoud mentioned this issue Feb 16, 2023

Wrong RTL language text direction when English numbers exist in the text #1638

Closed

MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Mar 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing RTL ligatures decomposition order #1589

Fixing RTL ligatures decomposition order #1589

naourass commented Jan 31, 2023 •

edited

Loading

naourass commented Jan 31, 2023

pubpub-zz commented Jan 31, 2023

naourass commented Feb 1, 2023 •

edited

Loading

naourass commented Feb 2, 2023 •

edited

Loading

pubpub-zz commented Feb 2, 2023 •

edited by MartinThoma

Loading

naourass commented Feb 2, 2023 •

edited

Loading

naourass commented Feb 2, 2023 •

edited

Loading

naourass commented Feb 7, 2023 •

edited

Loading

naourass commented Feb 10, 2023 •

edited

Loading

naourass commented Feb 10, 2023

pubpub-zz commented Feb 12, 2023

Fixing RTL ligatures decomposition order #1589

Fixing RTL ligatures decomposition order #1589

Comments

naourass commented Jan 31, 2023 • edited Loading

Explanation

Implementation

naourass commented Jan 31, 2023

pubpub-zz commented Jan 31, 2023

naourass commented Feb 1, 2023 • edited Loading

naourass commented Feb 2, 2023 • edited Loading

pubpub-zz commented Feb 2, 2023 • edited by MartinThoma Loading

naourass commented Feb 2, 2023 • edited Loading

naourass commented Feb 2, 2023 • edited Loading

naourass commented Feb 7, 2023 • edited Loading

naourass commented Feb 10, 2023 • edited Loading

naourass commented Feb 10, 2023

pubpub-zz commented Feb 12, 2023

naourass commented Jan 31, 2023 •

edited

Loading

naourass commented Feb 1, 2023 •

edited

Loading

naourass commented Feb 2, 2023 •

edited

Loading

pubpub-zz commented Feb 2, 2023 •

edited by MartinThoma

Loading

naourass commented Feb 2, 2023 •

edited

Loading

naourass commented Feb 2, 2023 •

edited

Loading

naourass commented Feb 7, 2023 •

edited

Loading

naourass commented Feb 10, 2023 •

edited

Loading