image.get_pos() returns wrong values for images nested in Form XObjects #277

Closed
3 tasks done
PasaOpasen opened this issue Nov 20, 2023 · 6 comments
Labels
pdfium This issue may be caused by (or related to) pdfium itself

Comments

@PasaOpasen

PasaOpasen commented Nov 20, 2023

Checklist

  • I confirm this is not a question, or helpers feature request. Otherwise, use the Discussions page.
  • I confirm this is not an issue encountered with an installed build of pypdfium2, but about some other aspect of the project (specify below). Otherwise, use one of the package templates (PyPA/conda).
  • I confirm this is not about an unofficial build of pypdfium2, or an effort to create a such. We do not support third-party builds, and they are not eligible for a bug report. Please use our official packages instead.

Reason for Generic issue (keyword/topic)

Seems like a logic problem, not a build problem.

Description

I found that the .get_pos() method sometimes returns a wrong bounding box for images.

Example document

To reproduce:

import pypdfium2 as pdfium
from pypdfium2 import PdfDocument

pdf_path='color_lines_bad.pdf'

doc = PdfDocument(pdf_path, autoclose=True)
image = list(doc[0].get_objects(filter=[pdfium.raw.FPDF_PAGEOBJ_IMAGE], max_depth=50))[0]
image.get_pos()
# Out[7]: (0.0, 0.0, 981.0, 1256.25)   # wrong value!
doc[0].get_size()
# Out[8]: (627.8399658203125, 804.0)

import fitz  # pymupdf
doc = fitz.open(pdf_path)
images = doc[0].get_images(full=True)
doc[0].get_image_rects(images[0])
# Out[12]: [Rect(0.0, 0.0, 627.8399658203125, 804.0)]  # right value
images[0]
#  (17, 0, 1308, 1675, 8, 'DeviceRGB', '', 'Img3', 'DCTDecode', 16)
doc[0].mediabox_size
# Out[13]: Point(627.84, 804.0)

I also found that the image metadata reports a DPI of 96 instead of the actual 150, which can be confirmed with pdfimages (poppler):

> pdfimages -list .\color_lines_bad.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1308  1675  rgb     3   8  jpeg   no        17  0   150   150  166K 2.6%
@mara004
Member

mara004 commented Nov 20, 2023

Unfortunately, I can't really comment on the values returned by get_pos(); we're just forwarding them from FPDFPageObj_GetBounds().

I acknowledge that the DPI value is confusing, but it is not really a bug: pdfium calculates it from the image's pixel size relative to the canvas area it occupies, so it is not the DPI metadata embedded in the image. The docs for get_metadata() mention this.
While I know it's confusing, that's just how pdfium's API is designed, and we're only providing the bindings.
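
To make the arithmetic concrete, here is a minimal sketch of that calculation using the numbers reported in this thread. The function name `effective_dpi` is made up for illustration and is not part of pypdfium2 or pdfium; the only assumption is the usual 72 PDF points per inch.

```python
POINTS_PER_INCH = 72  # size of the default PDF user space unit


def effective_dpi(pixels, extent_pt):
    """Pixels per inch for an image drawn across `extent_pt` PDF points."""
    return pixels * POINTS_PER_INCH / extent_pt


# Pixel size from pdfimages: 1308 x 1675
# Bounds reported by get_pos(): 981.0 x 1256.25 pt
dpi_from_bounds = (effective_dpi(1308, 981.0), effective_dpi(1675, 1256.25))

# Rectangle reported by PyMuPDF (the full page): 627.84 x 804.0 pt
dpi_from_page = (effective_dpi(1308, 627.84), effective_dpi(1675, 804.0))
```

With the bounds from get_pos() this comes out at 96 in both directions, while the page-sized rectangle gives roughly 150. That suggests the DPI discrepancy follows from the bounds discrepancy rather than being a separate problem.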

(Also note, I would have expected people to use one of the package-specific templates for an issue like this, just to have version info available and so on. I plan to clarify point 2 of the checklist as this seems to be unclear.)

@mara004
Member

mara004 commented Nov 20, 2023

I also checked with FPDFPageObj_GetRotatedBounds(), it effectively returns the same info:
((0.0, 0.0), (981.0, 0.0), (981.0, 1256.25), (0.0, 1256.25))

(If you're confident what pdfium returns is wrong, then feel free to ask about this on pdfium's mailing list or file a pdfium bug report.) - Update: see finding below

@mara004 mara004 added the "pdfium" label (This issue may be caused by (or related to) pdfium itself) Nov 20, 2023
@mara004
Member

mara004 commented Nov 20, 2023

@PasaOpasen Ah, I figured something out. The image seems to be recursively nested in Form XObjects (twice, actually).
That's probably what confuses pdfium. It reminds me a lot of https://crbug.com/pdfium/2073
I would hazard a guess that pdfium currently returns dimensions relative to the nearest Form XObject or something.

>>> import pypdfium2 as pdfium
>>> pdf = pdfium.PdfDocument("color_lines_bad.pdf")
>>> pdf
<PdfDocument uuid:9198c957 from '/home/me/Downloads/color_lines_bad.pdf'>
>>> page = pdf[0]
>>> list(page.get_objects(filter=[pdfium.raw.FPDF_PAGEOBJ_IMAGE], max_depth=1))
[]
>>> list(page.get_objects(filter=[pdfium.raw.FPDF_PAGEOBJ_IMAGE], max_depth=2))
[]
>>> list(page.get_objects(filter=[pdfium.raw.FPDF_PAGEOBJ_IMAGE], max_depth=3))
[<PdfImage uuid:0fc6e6e2>]
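
A quick plausibility check for the guess above, as a sketch only: if the bounds are expressed in the coordinate space of an enclosing Form XObject, mapping them through that form's matrix should yield page coordinates. The matrix below is an assumption inferred from the numbers in this thread (627.84 / 981 = 804 / 1256.25 = 0.64), not a value read from the PDF, and `transform_bbox` is a hypothetical helper, not a pypdfium2 function.

```python
def transform_bbox(bbox, matrix):
    """Map a (left, bottom, right, top) box through a PDF matrix (a, b, c, d, e, f)."""
    left, bottom, right, top = bbox
    a, b, c, d, e, f = matrix
    corners = [(left, bottom), (right, bottom), (right, top), (left, top)]
    # PDF convention: x' = a*x + c*y + e, y' = b*x + d*y + f
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    # Axis-aligned box enclosing the transformed corners
    return (min(xs), min(ys), max(xs), max(ys))


# Assumed combined matrix of the enclosing forms: uniform scale by 0.64
assumed_form_matrix = (0.64, 0.0, 0.0, 0.64, 0.0, 0.0)
page_bbox = transform_bbox((0.0, 0.0, 981.0, 1256.25), assumed_form_matrix)
```

Under that assumption the result is approximately (0, 0, 627.84, 804), which matches the page size and the rectangle reported by PyMuPDF.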

@mara004 mara004 changed the title from "wrong image pos for some images" to "image.get_pos() returns wrong values for images nested in Form XObjects" Nov 20, 2023
@mara004
Member

mara004 commented Nov 20, 2023

I just filed https://bugs.chromium.org/p/pdfium/issues/detail?id=2100 for this.

@PasaOpasen
Author

@mara004 thank you!

@mara004
Member

mara004 commented Nov 25, 2023

Would you mind closing this issue? I don't think we can do much else now except wait for pdfium.

Or do you reckon we should prevent get_pos() calls on nested objects by raising an exception?
However, we'd have to remember to remove that again once the pdfium issue is fixed...
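
For reference, such a guard could look roughly like the sketch below. It assumes page objects expose a `level` attribute giving the number of parent Form XObjects (0 for top-level objects); `checked_get_pos` and `NestedObjectError` are hypothetical names used for illustration, not existing pypdfium2 API.

```python
class NestedObjectError(RuntimeError):
    """Raised when bounds are requested for an object nested in a Form XObject."""


def checked_get_pos(obj):
    """Return obj.get_pos(), refusing objects that sit inside a Form XObject."""
    # Assumption: `obj.level` counts enclosing Form XObjects.
    # A missing attribute is treated as 0 (not nested).
    level = getattr(obj, "level", 0)
    if level > 0:
        raise NestedObjectError(
            f"get_pos() may be unreliable for objects nested in Form XObjects (level={level})"
        )
    return obj.get_pos()
```

Keeping the check in a separate helper like this, rather than inside get_pos() itself, would make it easy to drop once pdfium returns page-relative bounds.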

@PasaOpasen PasaOpasen closed this as not planned Nov 26, 2023
@mara004 mara004 closed this as completed Nov 26, 2023