page.get_text("blocks") not working with PyMuPDF version 1.24.0 #3300

3051360 · 2024-03-23T11:19:02Z

page.get_text("blocks") has stopped working with the latest PyMuPDF version.

Below is a simple code snippet:

import fitz  # PyMuPDF

input_file = "my_pdf.pdf"
doc = fitz.open(input_file)

# Print the extracted blocks of every page
for page in doc:
    print(page.get_text("blocks"))

With PyMuPDF version 1.22.2 the code prints:

[(0.0, 0.0, 273.6000061035156, 728.1599731445312, '<image: ICCBased(RGB,GIMP built-in sRGB), width: 1140, height: 3034, bpc: 8>', 0, 1)]

With PyMuPDF version 1.24.0, the same code does not return the image block.

PyMuPDF version: 1.24.0

Operating system: MacOS

Python version: 3.12

JorjMcKie · 2024-03-23T14:17:22Z

We cannot accept a bug report without a reproducing file.

JorjMcKie · 2024-03-24T00:00:26Z

Please consult the documentation on text extraction flags and include TEXT_PRESERVE_IMAGES or use TEXTFLAGS_DICT as your flags value.

3051360 · 2024-03-24T11:54:50Z

Thank you for your response. I see that this flag is now required for documents containing images.

Earlier, with version 1.22.2, the flag did not have to be set explicitly.

With version 1.24.0, the same code block does not extract the image information by default.

But it does when the flag is supplied.
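
For readers landing here, a minimal sketch of what supplying the flag might look like. It assumes the `fitz` import name and the `my_pdf.pdf` file from the snippet above; `split_blocks` and `report_blocks` are hypothetical helper names used only for illustration. It relies on the block tuple layout `(x0, y0, x1, y1, content, block_no, block_type)`, where a `block_type` of 1 marks an image block, as in the 1.22.2 output quoted earlier.

```python
def split_blocks(blocks):
    """Separate get_text("blocks") tuples into text and image blocks.

    Each tuple is (x0, y0, x1, y1, content, block_no, block_type);
    block_type is 0 for a text block and 1 for an image block.
    """
    text_blocks = [b for b in blocks if b[6] == 0]
    image_blocks = [b for b in blocks if b[6] == 1]
    return text_blocks, image_blocks


def report_blocks(path):
    """Print per-page counts of text and image blocks (hypothetical helper)."""
    import fitz  # PyMuPDF, imported here so split_blocks works without it

    # Default "blocks" flags plus the bit that keeps image blocks
    flags = fitz.TEXTFLAGS_BLOCKS | fitz.TEXT_PRESERVE_IMAGES

    doc = fitz.open(path)
    for page in doc:
        blocks = page.get_text("blocks", flags=flags)
        text_blocks, image_blocks = split_blocks(blocks)
        print(page.number, len(text_blocks), len(image_blocks))
```

Calling `report_blocks("my_pdf.pdf")` should then list the image block again on 1.24.0. Combining `TEXT_PRESERVE_IMAGES` with `TEXTFLAGS_BLOCKS` keeps the other default settings for block extraction unchanged.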

JorjMcKie added example required Waiting for information labels Mar 23, 2024

JorjMcKie added not a bug not a bug / user error / unable to reproduce and removed example required Waiting for information labels Mar 24, 2024

JorjMcKie closed this as completed Mar 24, 2024
