Line returns missing in text_extraction() #2138

pubpub-zz · 2023-08-31T20:52:54Z

PDF file:
https://github.com/py-pdf/pypdf/files/12483807/AEO.1172.pdf

Can you also test the page.extract_text() function? It always seems to combine the sentences of multi-line text without a space.
See the first page of my attached file.

Originally posted by @yonglee7015 in #2135 (reply in thread)

pubpub-zz · 2023-09-02T10:05:29Z

I've removed the whitespace label: this issue deals with line returns.

MartinThoma · 2023-09-02T10:54:11Z

Whitespace includes newlines. I just edited the description of the tag to make that explicit.

To me, those space/newline issues look related: I think we touch similar parts of the code, and the types of issues users have are similar. Am I wrong about that?

pubpub-zz · 2023-09-02T19:01:57Z

The issue comes from cm being modified at the "same time" as Tm:

      q
        1 0 0 1 2.125 0 cm
        0 g
        BT
          /F3 8 Tf
          1 0 0 -1 0 8.969 Tm
          [ (Company:) ] TJ
        ET
      Q
      q
        1 0 0 1 83.125 0 cm
        0 g
        BT
          /F1 8 Tf
          1 0 0 -1 0 8.969 Tm
          [ (AMERICAN EAGLE OUTFITTERS) ] TJ
        ET
      Q
      q
        1 0 0 1 2.125 13.85 cm
        0 g
        BT
          /F3 8 Tf
          1 0 0 -1 0 8.969 Tm
          [ (Division / Dept:) ] TJ
        ET
      Q
      q
        1 0 0 1 83.125 13.85 cm
        0 g
        BT
          /F1 8 Tf
          1 0 0 -1 0 8.969 Tm
          [ (50 / 170) ] TJ
        ET
      Q

In order to get the actual text position, we need to compare tm.cm to tm_prev.cm_prev (cm_prev is currently not saved). The big point is about the change merged from #2060: we are passing tm_prev but the current cm_matrix, which is not consistent.
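To illustrate the point with the four blocks from the content stream above, here is a minimal sketch. The helper names `mult` and `origin` are made up for this example and are not taken from the pypdf code base; matrices use the PDF `(a, b, c, d, e, f)` layout with row-vector multiplication.

```python
from typing import List, Tuple

# A PDF affine matrix [a b 0; c d 0; e f 1] stored as (a, b, c, d, e, f).
Matrix = Tuple[float, float, float, float, float, float]


def mult(m: Matrix, n: Matrix) -> Matrix:
    """Return the matrix product m x n (hypothetical helper for this sketch)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def origin(tm: Matrix, cm: Matrix) -> Tuple[float, float]:
    """Text origin after applying the text matrix and then the cm matrix."""
    combined = mult(tm, cm)
    return combined[4], combined[5]


# (text, cm, Tm) copied from the content stream quoted above.
BLOCKS: List[Tuple[str, Matrix, Matrix]] = [
    ("Company:", (1, 0, 0, 1, 2.125, 0), (1, 0, 0, -1, 0, 8.969)),
    ("AMERICAN EAGLE OUTFITTERS", (1, 0, 0, 1, 83.125, 0), (1, 0, 0, -1, 0, 8.969)),
    ("Division / Dept:", (1, 0, 0, 1, 2.125, 13.85), (1, 0, 0, -1, 0, 8.969)),
    ("50 / 170", (1, 0, 0, 1, 83.125, 13.85), (1, 0, 0, -1, 0, 8.969)),
]

# Looking at Tm alone, every block has the same y, so no line change is visible.
tm_only_y = [tm[5] for _, _, tm in BLOCKS]

# Combining Tm with the cm of the same block separates the two lines.
actual_y = [origin(tm, cm)[1] for _, cm, tm in BLOCKS]

print(tm_only_y)
print(actual_y)
```

With Tm alone, the y value is identical for all four blocks; once each Tm is combined with its own cm, the last two blocks sit about 13.85 units away from the first two, which is where the line return should be detected.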

closes py-pdf#2138

pubpub-zz · 2023-09-03T08:33:26Z

@yonglee7015, the PR is now OK if you want to test it.

yonglee7015 · 2023-09-03T08:36:09Z

Yes, how can I test it?

MartinThoma · 2023-09-03T09:28:58Z

#2142 is the PR

Get the git repository: git clone https://github.com/pubpub-zz/PyPDF2.git pypdf-pubpub
Go into the directory: cd pypdf-pubpub
Checkout the branch: git checkout iss2138
Install that version: pip install -e .
Run your code with that version. Make sure you really use that version and do not, e.g., have a different environment active.
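One way to check which copy of a package your interpreter actually imports is sketched below. The function name `describe_install` is made up for this example; it only uses the standard library (`importlib.util.find_spec` and `importlib.metadata.version`).

```python
import importlib.metadata
import importlib.util
from typing import Dict, Optional


def describe_install(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Report where a top-level package is imported from and its version.

    Returns None if the package cannot be found in the current environment.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    try:
        version: Optional[str] = importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        # Importable, but no distribution metadata (e.g. a stdlib module).
        version = None
    return {"location": spec.origin, "version": version}


if __name__ == "__main__":
    # For an editable install, the location should point into the cloned
    # directory (pypdf-pubpub in the steps above).
    print(describe_install("pypdf"))
```

If the printed location is inside `site-packages` of another environment rather than the clone, the code is not running against the branch under test.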

I think I should document this somewhere 🤔

pubpub-zz · 2023-09-03T09:35:33Z

@MartinThoma does this trick work?
pip install git+https://github.com/pubpub-zz/PyPDF2.git@iss2138

MartinThoma · 2023-09-03T11:37:57Z

Yes! I completely forgot about that!

MartinThoma · 2023-09-03T11:39:10Z

By the way: Could you please rename it from PyPDF2 to pypdf? It might be confusing to others if they see PyPDF2.

pubpub-zz · 2023-09-03T11:59:59Z

Oops, I did not know how to do it before.
@yonglee7015
The instructions should now be:
pip install git+https://github.com/pubpub-zz/pypdf.git@iss2138

yonglee7015 · 2023-09-03T12:06:48Z

Thanks 😊 I will test it.

yonglee7015 · 2023-09-04T04:09:57Z

Hi @pubpub-zz, yes, it works. Thanks for your help.

Can you also test this PDF file, page 3 in particular?
You will find that the order of the extracted text is not correct:

1. The first line of text in the PDF ends up as the last line of the output text.

2. The order of the text in the table is not correct.

Can you fix this?

I tried another library, tika-python.
The order of its table text is correct, but the first line also ends up as the last line of the output text, as with yours.

test.pdf

pubpub-zz · 2023-09-04T20:27:08Z

You have reached the limit of pypdf's current implementation:
Strings are extracted in the order in which they have been "inserted" into the document. When you print a document, it is printed from top to bottom, but a PDF works more like a 2D plotter, which can draw at the top left, then at the bottom right, before reaching the middle. extract_text gets the text in the order in which it is plotted, so the order is not guaranteed. It is much more difficult in your case, given the documents you are working on.

Sorry, there is no solution for the moment with pypdf. 😞

yonglee7015 · 2023-09-05T05:29:19Z

Oh no, that's a pity. Thank you.

stefan6419846 · 2023-09-05T06:51:31Z

I have not tested it, but shouldn't a visitor be able to fix the order on the user side in this case? https://pypdf.readthedocs.io/en/latest/user/extract-text.html#using-a-visitor

pubpub-zz · 2023-09-05T14:38:49Z

It is more complex: you need to know whether there are columns, what the coordinates are... Maybe AI could help...
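For reference, a naive version of the visitor idea could look like the sketch below. It is only an illustration under several assumptions: `collect_fragments` and `reorder` are made-up names, `cm` and `tm` are assumed to reach the visitor as separate six-element matrices (as in the linked documentation), y is assumed to grow upwards, and the page is assumed to have a single column. It does not handle tables or multi-column layouts, which is exactly the hard part mentioned above.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text)


def collect_fragments(page) -> List[Fragment]:
    """Collect non-empty text fragments with their combined tm x cm origin."""
    fragments: List[Fragment] = []

    def visitor(text, cm, tm, font_dict, font_size):
        if text.strip():
            # Translation part of the product tm x cm.
            x = tm[4] * cm[0] + tm[5] * cm[2] + cm[4]
            y = tm[4] * cm[1] + tm[5] * cm[3] + cm[5]
            fragments.append((x, y, text.strip()))

    page.extract_text(visitor_text=visitor)
    return fragments


def reorder(fragments: List[Fragment], line_tol: float = 2.0) -> str:
    """Sort fragments top to bottom, then left to right within a line.

    Fragments whose y values differ by at most line_tol are treated as
    belonging to the same line (an arbitrary threshold for this sketch).
    """
    lines: List[Tuple[float, List[Tuple[float, str]]]] = []
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tol:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return "\n".join(
        " ".join(text for _, text in sorted(parts)) for _, parts in lines
    )
```

Usage would be along the lines of `reorder(collect_fragments(reader.pages[2]))`, where `reader` is a `pypdf.PdfReader` opened on the file in question.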

Closes #2138

MartinThoma added the whitespace and workflow-text-extraction labels Sep 2, 2023

pubpub-zz removed the whitespace label Sep 2, 2023

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 3, 2023

BUG : Missing new line in extract_text with cm operations

1f1ffbf

closes py-pdf#2138

pubpub-zz mentioned this issue Sep 3, 2023

BUG: Missing new line in extract_text with cm operations #2142

Merged

MartinThoma closed this as completed in #2142 Sep 17, 2023

MartinThoma pushed a commit that referenced this issue Sep 17, 2023

BUG: Missing new line in extract_text with cm operations (#2142)

5b45785

Closes #2138
