"extract_text" doesn't output the same transformation matrix in version 3.17 as in 3.16. #2353

Open
ghbm-itk opened this issue Dec 20, 2023 · 9 comments
Labels
needs-pdf The issue needs a PDF file to show the problem workflow-advanced-text-extraction Getting coordinates, font weight, font type, ...

Comments

@ghbm-itk

I'm trying to extract text from a pdf together with the position of the text.
When I do it in pypdf 3.16 I get the expected result, but I don't in 3.17.

Environment

Windows-10-10.0.19045-SP0
pypdf==3.16.0, crypt_provider=('cryptography', '41.0.3'), PIL=9.5.0
AND
pypdf==3.17.3, crypt_provider=('cryptography', '41.0.7'), PIL=9.5.0

Code + PDF

This is a minimal, complete example that shows the issue:

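import pypdf

file_path = "list.pdf"
reader = pypdf.PdfReader(file_path)

# Collects (cm, tm, text) for every matching text fragment.
text_parts = []

def visitor(text, cm, tm, fd, fs):
    # cm: current transformation matrix, tm: text matrix,
    # fd: font dictionary, fs: font size.
    if text.strip() == "Flyttesagsnr.:":
        text_parts.append((cm, tm, text))

reader.pages[0].extract_text(visitor_text=visitor)

print(text_parts)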

Unfortunately, I can't share the PDF since it's confidential. I haven't been able to declassify the document while keeping the bug.
I know this might make the bug hard to reproduce.

Results

In version 3.17 I get:

[([0.75, 0.0, 0.0, -0.75, 0.0, 841.68], [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], ' Flyttesagsnr.:')]

In version 3.16 I get:

[([0.75, 0.0, 0.0, -0.75, 0.0, 841.68], [1.0, 0.0, 0.0, -1.0, 448.313, 352.05], ' Flyttesagsnr.:')]

As you can see tm[4] and tm[5] are both 0 in version 3.17, which is definitely wrong.
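To make the difference between the two reported text matrices explicit, here is a small sketch that compares them index by index. It only uses the values printed above; the variable and function names are illustrative and not part of pypdf.

```python
# Text matrices copied from the two outputs above.
tm_317 = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
tm_316 = [1.0, 0.0, 0.0, -1.0, 448.313, 352.05]

IDENTITY = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]


def is_identity(matrix):
    """Return True if the 6-element PDF matrix is the identity matrix."""
    return list(matrix) == IDENTITY


# Indices at which the two matrices disagree.
differing = [i for i, (old, new) in enumerate(zip(tm_316, tm_317)) if old != new]

print("differing indices:", differing)
print("3.17 tm is identity:", is_identity(tm_317))
print("3.16 tm is identity:", is_identity(tm_316))
```

Besides the translation entries tm[4] and tm[5], the sign of tm[3] also differs between the two outputs, and the matrix reported by 3.17 is exactly the identity.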

@stefan6419846
Collaborator

If you have a look at the changelog, you will see that there have been some changes and improvements to the text extraction in the meantime. This is probably related to those changes, and the new behavior is most likely either intended or a fix for a previous bug.

@ghbm-itk
Author

But 3.17 outputs a wrong answer, when 3.16 outputs the correct answer. Seems like a new bug.

@stefan6419846
Collaborator

Are you able to pinpoint this to one of the versions in-between to further see which change actually introduced this?

@MartinThoma MartinThoma added the workflow-advanced-text-extraction Getting coordinates, font weight, font type, ... label Dec 20, 2023
@pubpub-zz
Collaborator

To be more consistent, you should use the CM matrix to get the absolute position regardless of which transformation is applied, rather than TM, which should be considered an intermediate matrix.

@ghbm-itk
Author

Are you able to pinpoint this to one of the versions in-between to further see which change actually introduced this?

I will try this when I have some time.

To be more consistent, you should use the CM matrix to get the absolute position regardless of which transformation is applied, rather than TM, which should be considered an intermediate matrix.

I don't think this is true. The actual transformation matrix is a combination of cm and tm as far as I understand. At least for the PDF I was reading here the cm was the same for all text on the page, but the tm wasn't.
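As a rough illustration of how cm and tm combine: in the PDF model, the text rendering matrix is the product of the text matrix and the current transformation matrix (font size, horizontal scaling and text rise are ignored in this sketch). The `multiply` helper below is hypothetical and not a pypdf function; the input values are the ones reported for 3.16 above.

```python
def multiply(m1, m2):
    """Multiply two PDF matrices given as [a, b, c, d, e, f].

    Each list stands for the 3x3 matrix
    [[a, b, 0], [c, d, 0], [e, f, 1]]; the result is m1 x m2.
    """
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


# Values reported by pypdf 3.16 in this issue.
cm = [0.75, 0.0, 0.0, -0.75, 0.0, 841.68]
tm = [1.0, 0.0, 0.0, -1.0, 448.313, 352.05]

# Text matrix first, then the current transformation matrix.
combined = multiply(tm, cm)
x, y = combined[4], combined[5]
print(combined)
print("text origin: x=%.3f, y=%.3f" % (x, y))
```

With these inputs the text origin comes out at roughly (336.23, 577.64). With the identity tm reported by 3.17, the same calculation would place every fragment at the cm translation (0, 841.68), which is why tm is needed in addition to cm.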

@ghbm-itk
Author

@stefan6419846
I tested the code snippet in different versions with the following results:
3.16.0: Correct
3.16.1: Correct
3.16.2: Correct
3.16.3: Wrong
3.17.3: Wrong

I suspect the change happened with #2206
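The test results above can be turned into a small check that reports where the behavior flips. This is only a sketch over the versions listed in this comment; releases that were not tested are not considered, and the helper names are made up for illustration.

```python
def parse_version(version):
    """Turn '3.16.2' into (3, 16, 2) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def find_transition(results):
    """Return (last_good, first_bad) among the tested versions.

    `results` maps a version string to True (correct) or False (wrong).
    """
    last_good = None
    for version in sorted(results, key=parse_version):
        if not results[version]:
            return last_good, version
        last_good = version
    return last_good, None


# Results as reported in this comment.
results = {
    "3.16.0": True,
    "3.16.1": True,
    "3.16.2": True,
    "3.16.3": False,
    "3.17.3": False,
}

last_good, first_bad = find_transition(results)
print("last good:", last_good, "first bad:", first_bad)
```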

@pubpub-zz
Collaborator

I don't think this is true. The actual transformation matrix is a combination of cm and tm as far as I understand. At least for the PDF I was reading here the cm was the same for all text on the page, but the tm wasn't.

Oops, you are right. I had to keep the existing definitions, even though they are more complex to use.

I suspect the change happened with #2206

The change was made because the TM was not captured at the beginning of the line. Would you be willing to share the file privately by emailing it to @MartinThoma?

@ghbm-itk
Author

I'm sorry but it would be illegal for me to share the document with anyone outside my org.
Is there a good way where I can remove all other text from the pdf without affecting the "Flyttesagsnr.:" text?

Whenever I try to edit the pdf, the matrices change completely.

@stefan6419846
Collaborator

In general, there is no easy/general purpose approach to do this as far as I know. A possible way would be to manually mess with the internal page source, but this requires some deeper understanding of the PDF format.
