Fixing Text Extraction Order For Arabic+Digits+Punctuation #1629

naourass · 2023-02-12T20:00:41Z

Explanation

When you have Arabic text mixed with digits, the text extraction order is messed up. Below is an example.

Reading from right to left, here's the ground truth of a file with two blocks:

القسم الرئيسي - عدد 5161
2 جمادى الآخرة 1444 (18 يناير 2023)

Here's how the pdf is rendered:

Here's the result of page.extract_text():
(2023 0 ﻳﻨﺎﻳ18) 1444 ة0 ﺟﻤﺎدى اﻵﺧ2 5161 ﺋﻴﴘ - ﻋﺪد0ﻢ اﻟ5اﻟﻘ

Attachments:

Complete sample - arabic-plus-digits-h-blocks.pdf
Minimal sample - digits-after-arabic.pdf
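
To see why this sample is hard to order, it can help to look at the Unicode bidirectional class of each character. The sketch below is only an illustration using the standard library; the helper name `bidi_classes` is made up for this example and is not part of pypdf.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, bidirectional class) pairs for every character."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# Arabic letters report "AL", European digits "EN", spaces "WS".
for ch, cls in bidi_classes("عدد 5161"):
    print(repr(ch), cls)
```

Digits are weak characters: their final position depends on the surrounding strong characters, which is where the concatenation order goes wrong here.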


pubpub-zz · 2023-02-12T20:52:18Z

@naourass,
At first sight (but maybe I'm wrong), you should have a look at the concatenation of output and text (below if check_crlf_space, lines 1871 and below).

Tell me if you want to try to propose a PR.

naourass · 2023-02-12T20:58:17Z

@pubpub-zz
From my first analysis, I think that the concatenation flow should be changed to handle more cases. I'm also inspecting whether it would be possible to fix this using Control Characters.

pubpub-zz · 2023-02-12T21:01:05Z

> @pubpub-zz
> From my first analysis, I think that the concatenation flow should be changed to handle more cases.

That's clearly an option to look at.

> I'm also inspecting whether it would be possible to fix this using Control Characters.

I'm not sure all programs will handle that. I would prefer not to use this if possible.
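
For context, the control-character approach discussed here would amount to something like the sketch below: wrapping a run in RIGHT-TO-LEFT ISOLATE / POP DIRECTIONAL ISOLATE so that a bidi-aware renderer treats it as RTL. The helper names are hypothetical and this is not pypdf code. The second helper shows what a consumer that cannot handle these characters would need to do.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Explicit bidi formatting characters: LRM, RLM, ALM, the embedding and
# override controls (U+202A..U+202E), and the isolates (U+2066..U+2069).
BIDI_CONTROLS = (
    {"\u200e", "\u200f", "\u061c"}
    | {chr(c) for c in range(0x202A, 0x202F)}
    | {chr(c) for c in range(0x2066, 0x206A)}
)


def wrap_rtl(text):
    """Mark a run as right-to-left for bidi-aware renderers."""
    return f"{RLI}{text}{PDI}"


def strip_bidi_controls(text):
    """Remove explicit bidi controls for programs that cannot handle them."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)
```

The need for something like `strip_bidi_controls` on the consumer side is the compatibility concern raised above.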

naourass · 2023-02-13T10:25:32Z

There's also a decoding issue for some characters. To focus on inspecting the concatenation order issue, I'm manually overriding them by adding a temporary cmap_override argument to extract_text():

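# _page.py (temporary debugging patch inside the text-decoding loop)
for x in t:
    hex_x = hex(ord(x))
    # Replace the decoded value when the caller supplied an override.
    if hex_x in cmap_override:
        cmap[1][x] = cmap_override[hex_x]
    # Print code point, hex code, raw character, and its decoded value.
    print(ord(x), hex_x, x, cmap[1].get(x, "-"), sep="\t")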

# my-app.py
cmap_override = {
    "0x27f": "سي",
    # "0x3": " ",
    # "0x206": "ا",
    # "0x273": "ن",
}
text = page.extract_text(cmap_override=cmap_override)
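
Outside pypdf, the same idea can be tried on a plain dictionary. The following is a standalone sketch, not the patch itself: `apply_cmap_override` and `decode` are hypothetical helpers, and the character map is a simple `dict` standing in for the one used in `_page.py`. Unlike the patch above, it applies every override up front instead of only for characters that occur in the text.

```python
def apply_cmap_override(char_map, cmap_override):
    """Return a copy of char_map with overrides keyed by hex(ord(char))."""
    patched = dict(char_map)
    for hex_code, replacement in cmap_override.items():
        # int() accepts the "0x" prefix when the base is 16.
        patched[chr(int(hex_code, 16))] = replacement
    return patched


def decode(raw, char_map):
    """Map each raw character through char_map, keeping unmapped ones."""
    return "".join(char_map.get(ch, ch) for ch in raw)


# Illustrative data only: a wrong mapping for code 0x27f, then the fix.
char_map = {"\u027f": "?"}
patched = apply_cmap_override(char_map, {"0x27f": "سي"})
print(decode("\u027f", patched))
```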

naourass · 2023-02-14T16:38:01Z

@pubpub-zz
I have an update regarding this issue.

I'm not a BiDi expert (yet), but after further inspection, here's my humble conclusion so far:

How to handle bidi concatenation depends on the main direction of the whole sentence (independently of the character-level direction handling). The main direction can only be evaluated at the end of the sentence, in our case after the ET operator.
The main direction of the sentence can't always be inferred from its first words, its last words, or its ratio of RTL to LTR words. The only reliable way to detect it is from the semantics of the sentence. So we can either let the caller explicitly pass the extraction direction as an argument to the extraction method, or implement a machine learning model to predict it.
It may be possible to do this without Bidirectional Control Characters, but that may require adding a double white space in some cases so as not to break the display order.
We could either keep the current extraction flow (with minor changes) and add a separate layer that processes the sentence when the direction is set to RTL, or rewrite it to handle the extraction with the main direction in mind from the start of the process.

There still might be some heuristic indicators or other approaches to handle/detect the overall direction which I couldn't find at the moment. I'll be investigating this further when possible and I'll report if I find anything useful.
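
As a rough illustration of the "explicit argument, otherwise heuristic" option, here is a minimal sketch. The function name and the `direction` parameter are hypothetical, not an existing pypdf API, and the fallback is exactly the kind of ratio heuristic that, as noted above, cannot be right in every case.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left (Hebrew, Arabic letters)
LTR_CLASSES = {"L"}        # strong left-to-right


def guess_base_direction(text, direction=None):
    """Return "rtl" or "ltr"; an explicit direction always wins."""
    if direction in ("ltr", "rtl"):
        return direction
    # Fallback: count strong characters only. Digits and punctuation are
    # weak or neutral, so they do not vote.
    rtl = sum(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)
    ltr = sum(unicodedata.bidirectional(ch) in LTR_CLASSES for ch in text)
    return "rtl" if rtl > ltr else "ltr"


print(guess_base_direction("القسم الرئيسي - عدد 5161"))
print(guess_base_direction("Section 5161", direction="rtl"))
```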

MartinThoma · 2023-02-14T16:53:04Z

Thank you for looking into this topic 💙

> or implement a machine learning model to predict it

Adding machine learning to pypdf seems out of scope to me. Adding a hook for external code / another library would be fine with me.

naourass · 2023-02-17T17:23:57Z

@pubpub-zz @MartinThoma
After more experimentation, it looks like it's much simpler to just drop the RTL direction checks, process everything as LTR to provide the "logical" version of the text (except for ligatures and paired punctuation like ()[]{}«»), and let the user call bidi.get_display() to easily get the visual order!

I've started working on an implementation example. I'll let you know when it's ready for review.
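
On the user side, the post-processing step could look roughly like this. It is a sketch that assumes the third-party python-bidi package, whose `get_display` function is importable from `bidi.algorithm`; `to_visual` is a hypothetical helper, and the input is assumed to be text that is already in logical order.

```python
try:
    # Third-party dependency: pip install python-bidi
    from bidi.algorithm import get_display
except ImportError:  # keep the sketch usable without the package
    get_display = None


def to_visual(logical_text):
    """Convert logical-order text to visual order, one line at a time."""
    if get_display is None:
        # Without python-bidi the text is returned unchanged.
        return logical_text
    return "\n".join(get_display(line) for line in logical_text.split("\n"))


print(to_visual("القسم الرئيسي - عدد 5161"))
```

Keeping the conversion outside the extraction step means callers who want logical order (for search or indexing) get it directly, and only callers who need display order pay for the extra dependency.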

stefan6419846 · 2024-12-20T10:19:56Z

@naourass Are you still willing to provide a corresponding PR for this?

naourass assigned MartinThoma Feb 12, 2023

naourass changed the title ~~Fixing Text Extraction Order For Arabic+Digits~~ Fixing Text Extraction Order For Arabic+Digits+Punctuation Feb 14, 2023

pubpub-zz mentioned this issue Feb 16, 2023

Wrong RTL language text direction when English numbers exist in the text #1638

Closed

MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Mar 14, 2023

MartinThoma added the is-feature A feature request label Mar 25, 2023

stefan6419846 unassigned MartinThoma Feb 20, 2024
