Fitz freezes on some PDFs when calling the fitz.Page.get_text_blocks method. #2548

Closed
AndreyRomanyukov opened this issue Jul 17, 2023 · 5 comments
Labels
upstream bug (bug outside this package)

Comments

@AndreyRomanyukov

Describe the bug (mandatory)

Fitz freezes on some PDFs when calling the fitz.Page.get_text_blocks method.

To Reproduce (mandatory)

Download the pdf that causes a freeze.

original https://aacr.figshare.com/articles/journal_contribution/Supplementary_Data_from_Targeting_Therapeutic_Resistance_and_Multinucleate_Giant_Cells_in_CCNE1-Amplified_HR-Proficient_Ovarian_Cancer/22523824/1/files/39986620.pdf
mirror https://www.dropbox.com/s/s7zjp7a8ys5ibh0/mct-21-0873_supplementary_data_s1_supps1.pdf?dl=0

Run the python code

from io import BytesIO

import fitz


path = 'mct-21-0873_supplementary_data_s1_supps1.pdf'

with open(path, 'rb') as opened:
    stream = BytesIO(opened.read())

with fitz.open(stream=stream, filetype="pdf") as pdf:
    for page in pdf:
        print(page)
        blocks = page.get_text_blocks()
        print('blocks:', len(blocks))

The program will print

page 0 of <memory, doc# 1>
blocks: 2
page 1 of <memory, doc# 1>
blocks: 2
page 2 of <memory, doc# 1>

and then freeze.

Additional context (optional)

Reproduces on PyMuPDF==1.22.3 and PyMuPDF==1.22.5. Reproduces on macOS 12.6.5 and Ubuntu 20.04.2
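
Until a fix is available, one way to keep a hang like this from blocking a whole batch job is to run the extraction in a separate interpreter with a timeout. The following is only a minimal sketch: `run_child` and `CHILD_CODE` are hypothetical names, the 30-second default is an arbitrary assumption, and `get_text("blocks")` is used on the assumption that it returns the same block list as `get_text_blocks()` in the script above.

```python
import subprocess
import sys

# Child script: the loop from the report, run in a separate interpreter
# so that the parent can kill it if text extraction never returns.
CHILD_CODE = """
import sys
import fitz

with fitz.open(sys.argv[1]) as pdf:
    for page in pdf:
        print(page.number, len(page.get_text("blocks")), flush=True)
"""


def run_child(code, args=(), timeout=30.0):
    """Run `code` in a fresh interpreter (hypothetical helper).

    Returns (status, stdout), where status is "ok", "error" or "timeout".
    """
    cmd = [sys.executable, "-c", code, *args]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        # Partial output may be missing or still be bytes at this point.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return "timeout", out
    return ("ok" if proc.returncode == 0 else "error"), proc.stdout


# Usage (not executed here):
# status, out = run_child(CHILD_CODE, ["mct-21-0873_supplementary_data_s1_supps1.pdf"])
# print(status)  # "timeout" would indicate the hang described above
```

A subprocess is used rather than a thread because a thread stuck inside native code cannot be interrupted from Python, while a child process can simply be killed.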

JorjMcKie added the "upstream bug" label (bug outside this package) on Jul 18, 2023
@JorjMcKie
Collaborator

The problem occurs when extracting text in general. More precisely, it is an infinite loop.
It is also not a problem inside PyMuPDF but in the base library MuPDF.
mutool draw -o test.txt file.pdf also loops.
I will forward the problem to MuPDF's issue system.

@AndreyRomanyukov
Author

> I will forward the problem to MuPDF's issue system.

Thanks!

@JorjMcKie
Collaborator

Sorry for the delay, here is the MuPDF bug report: https://bugs.ghostscript.com/show_bug.cgi?id=707074.

@JorjMcKie
Collaborator

The problem has been fixed in MuPDF.
The file is damaged in the sense that it contains circular references within its objects - that hadn't been detected before.
Now the circular references are detected and your script will raise an exception - as it should.
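
Since extraction on such a file now raises instead of looping, a script that processes many pages may want to catch the error per page and continue. A minimal sketch follows. The exact exception class is not stated in this thread, so the code catches the broad `Exception` as an assumption, and `count_blocks_per_page` is a hypothetical helper name.

```python
def count_blocks_per_page(doc):
    """Return the number of text blocks per page, or None for pages
    where text extraction raised (hypothetical helper).

    `doc` is anything iterable over page objects that offer
    get_text("blocks") and a .number attribute, e.g. a fitz.Document.
    """
    counts = []
    for page in doc:
        try:
            counts.append(len(page.get_text("blocks")))
        except Exception as exc:  # assumption: exact class not given in the thread
            print(f"page {page.number}: text extraction failed: {exc}")
            counts.append(None)
    return counts


# Usage (not executed here):
# import fitz
# with fitz.open("mct-21-0873_supplementary_data_s1_supps1.pdf") as pdf:
#     print(count_blocks_per_page(pdf))
```

Recording `None` for a failed page keeps the result aligned with page numbers, so damaged pages can be reported separately.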

JorjMcKie added a commit that referenced this issue Oct 5, 2023
julian-smith-artifex-com added a commit that referenced this issue Oct 6, 2023
Fixed test failure in rebased. We were not converting fz exception into C++
exception in src/extra.i:page_get_textpage(). Also fixed other cases where we
leaked fz exception.
JorjMcKie pushed a commit that referenced this issue Oct 6, 2023
@julian-smith-artifex-com
Collaborator

Fixed in 1.23.5.
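
To check whether an installed PyMuPDF is at least the version named above, a simple version comparison can be used. This is a sketch only: `has_fix` is a hypothetical helper, and the usage line assumes that `fitz.VersionBind` (the PyMuPDF version string) is available in your installation.

```python
import re

FIXED_IN = (1, 23, 5)  # version named in this thread


def has_fix(version_string):
    """Return True if `version_string` (e.g. "1.23.5") is >= 1.23.5.

    Hypothetical helper: only the first three numeric fields are compared.
    """
    parts = tuple(int(p) for p in re.findall(r"\d+", version_string)[:3])
    return parts >= FIXED_IN


# Usage (not executed here):
# import fitz
# print(has_fix(fitz.VersionBind))
```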
