Line returns missing in text_extraction() #2138

Closed
pubpub-zz opened this issue Aug 31, 2023 · 16 comments · Fixed by #2142
Labels
workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments

@pubpub-zz
Collaborator

pubpub-zz commented Aug 31, 2023

PDF file:
https://github.com/py-pdf/pypdf/files/12483807/AEO.1172.pdf

Can you also test the page.extract_text() function? It always seems to join multi-line sentences without a space.
See the first page of my attached file.
image

Originally posted by @yonglee7015 in #2135 (reply in thread)

@MartinThoma added the whitespace ("While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.") and workflow-text-extraction ("From a users perspective, text extraction is the affected feature/workflow") labels on Sep 2, 2023
@pubpub-zz removed the whitespace label on Sep 2, 2023
@pubpub-zz
Collaborator Author

I've removed the whitespace label: this issue deals with line returns.

@MartinThoma
Member

Whitespace includes newlines. I just edited the description of the tag to make that explicit.

To me those space / newline issues look related, as I think we touch similar parts of the code and the types of issues the users have are similar. Am I wrong about that?

@pubpub-zz
Collaborator Author

The issue comes from cm being modified at the "same time" as Tm:

      q
        1 0 0 1 2.125 0 cm
        0 g
        BT
          /F3 8 Tf
          1 0 0 -1 0 8.969 Tm
          [ (Company:) ] TJ
        ET
      Q
      q
        1 0 0 1 83.125 0 cm
        0 g
        BT
          /F1 8 Tf
          1 0 0 -1 0 8.969 Tm
          [ (AMERICAN EAGLE OUTFITTERS) ] TJ
        ET
      Q
      q
        1 0 0 1 2.125 13.85 cm
        0 g
        BT
          /F3 8 Tf
          1 0 0 -1 0 8.969 Tm
          [ (Division / Dept:) ] TJ
        ET
      Q
      q
        1 0 0 1 83.125 13.85 cm
        0 g
        BT
          /F1 8 Tf
          1 0 0 -1 0 8.969 Tm
          [ (50 / 170) ] TJ
        ET
      Q

To get the actual text position, we need to compare tm.cm to tm_prev.cm_prev (cm_prev is currently not saved). The main problem comes from the change merged in #2060: we pass tm_prev alongside cm_matrix, and the two are not consistent.
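
To illustrate with the numbers from the stream above, here is a minimal sketch in plain Python. The helper `mult` and the variable names are illustrative assumptions, not pypdf's internal code. It multiplies each text matrix by the cm in force and compares the resulting positions:

```python
from typing import Sequence, Tuple

Matrix = Tuple[float, float, float, float, float, float]


def mult(m: Sequence[float], n: Sequence[float]) -> Matrix:
    """Return the product m x n of two PDF matrices given as [a b c d e f]."""
    return (
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    )


# Values copied from the content stream above.
tm_company = (1, 0, 0, -1, 0, 8.969)       # Tm for "Company:"
cm_company = (1, 0, 0, 1, 2.125, 0)        # cm for "Company:"
tm_division = (1, 0, 0, -1, 0, 8.969)      # Tm for "Division / Dept:"
cm_division = (1, 0, 0, 1, 2.125, 13.85)   # cm for "Division / Dept:"

# Looking at Tm alone, both strings seem to share the same vertical position,
# so no line return would be detected.
same_line_by_tm_only = tm_company[5] == tm_division[5]

# Combining Tm with the matching cm gives distinct vertical positions.
pos_company = mult(tm_company, cm_company)
pos_division = mult(tm_division, cm_division)
same_line_by_tm_cm = pos_company[5] == pos_division[5]

print("Tm only, same line:", same_line_by_tm_only)
print("Tm x cm, same line:", same_line_by_tm_cm)
```

The comparison is only meaningful when each Tm is paired with the cm that was active when it was set, which is why the previous cm has to be stored along with the previous Tm.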

@pubpub-zz
Collaborator Author

@yonglee7015, the PR is now OK if you want to test it

@yonglee7015

Yes, how can I test it?

@MartinThoma
Member

#2142 is the PR

  1. Get the git repository: git clone https://github.com/pubpub-zz/PyPDF2.git pypdf-pubpub
  2. Go into the directory: cd pypdf-pubpub
  3. Check out the branch: git checkout iss2138
  4. Install that version: pip install -e .
  5. Run your code with that version. Make sure you really use that version and are not, e.g., running in a different environment
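
For the last step, one possible way to check which copy of the package Python actually picks up is a small script like the one below. This is only a sketch: the helper name `describe_install` is made up for illustration, and it relies solely on the standard library.

```python
import importlib.metadata
import importlib.util
from typing import Dict, Optional


def describe_install(package: str) -> Dict[str, Optional[str]]:
    """Report the installed version and import location of a package."""
    try:
        version: Optional[str] = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        # No installed distribution with that name in this environment.
        version = None
    spec = importlib.util.find_spec(package)
    location = spec.origin if spec is not None else None
    return {"version": version, "location": location}


# Run this with the same interpreter that runs your own code.
print(describe_install("pypdf"))
```

After `pip install -e .`, the reported location should point into the cloned directory rather than into site-packages. If both values are `None`, the package is not visible from the environment you are running in.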

I think I should document this somewhere 🤔

@pubpub-zz
Collaborator Author

pubpub-zz commented Sep 3, 2023

@MartinThoma does this trick work?
pip install git+https://github.com/pubpub-zz/PyPDF2.git@iss2138

@MartinThoma
Member

Yes! I completely forgot about that!

@MartinThoma
Member

By the way: Could you please rename it from PyPDF2 to pypdf? It might be confusing to others if they see PyPDF2.

@pubpub-zz
Collaborator Author

Oops, I didn't know how to do that before.
@yonglee7015
the instructions should now be:
pip install git+https://github.com/pubpub-zz/pypdf.git@iss2138

@yonglee7015

Thanks 😊 I will test it.

@yonglee7015

Hi @pubpub-zz, yes, it works. Thanks for your help.

Can you also test this PDF file, page 3?
You will find that the order of the extracted text is not correct.

1. The first line of text in the PDF ends up as the last line of the output text.
image
image

2. The order of the text in the table is not correct.
image

Can you fix this?

I tried another library, tika-python.
Its text order in the table is correct, but the first line also ends up as the last line of the output text, as with yours.
image

test.pdf

@pubpub-zz
Collaborator Author

You have reached the limit of pypdf's current implementation:
a) Strings are extracted in the order they were "inserted" into the document. When you print a document, it is printed from top to bottom, but a PDF is more like a 2D plotter, which can draw top left, then bottom right, before reaching the middle. extract_text gets the text in the order it is plotted, so the order is not guaranteed. It is much more difficult in your case as you are working on documents.

Sorry, there is no solution for the moment with pypdf. 😞

@yonglee7015

Oh no, that's a pity. Thank you.

@stefan6419846
Collaborator

I have not tested it, but shouldn't a visitor be able to fix the order on the user side in this case? https://pypdf.readthedocs.io/en/latest/user/extract-text.html#using-a-visitor

@pubpub-zz
Collaborator Author

It is more complex: you need to know whether there are columns, what their coordinates are... Maybe AI could help...
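
For the simple single-column case, the visitor idea could look roughly like the sketch below. It is untested against the PDF from this thread and does not address the column problem mentioned above. `reorder_fragments` and `extract_sorted` are hypothetical helper names; the only pypdf feature used is the documented `visitor_text` callback of `extract_text`, which receives `(text, cm, tm, font_dict, font_size)`.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text)


def reorder_fragments(
    fragments: List[Fragment],
    y_tolerance: float = 2.0,
    y_grows_down: bool = False,
) -> str:
    """Group fragments into lines by y, then sort each line by x."""
    # Default PDF user space has y growing upward, so the top line has the
    # largest y; pass y_grows_down=True for pages with a flipped axis.
    ordered = sorted(fragments, key=lambda f: f[1], reverse=not y_grows_down)
    lines: List[Tuple[float, List[Tuple[float, str]]]] = []
    for x, y, text in ordered:
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return "\n".join(
        " ".join(text for _, text in sorted(row, key=lambda p: p[0]))
        for _, row in lines
    )


def extract_sorted(page, **kwargs) -> str:
    """Collect text fragments through a visitor and reorder them."""
    fragments: List[Fragment] = []

    def visitor(text, cm, tm, font_dict, font_size):
        if text.strip():
            # Translation part of tm x cm, i.e. the text origin with the
            # current transformation matrix applied.
            x = tm[4] * cm[0] + tm[5] * cm[2] + cm[4]
            y = tm[4] * cm[1] + tm[5] * cm[3] + cm[5]
            fragments.append((x, y, text.strip()))

    page.extract_text(visitor_text=visitor)
    return reorder_fragments(fragments, **kwargs)
```

With a real file this would be called as `extract_sorted(PdfReader("test.pdf").pages[2])`, assuming page 3 is the one of interest. The `y_tolerance` value is a guess and would need tuning per document, and multi-column pages or tables would require extra logic to detect column boundaries first.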
