Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PageObject.transfer_rotation_to_content() hides some content since pypdf 4.3.0 #2927

Closed
stefan6419846 opened this issue Oct 30, 2024 · 5 comments · Fixed by #3002
Closed
Labels
is-regression Regression introduced as a side-effect of another change PdfWriter The PdfWriter component is affected

Comments

@stefan6419846
Copy link
Collaborator

stefan6419846 commented Oct 30, 2024

Calling page.transfer_rotation_to_content() changes the visibility of some content after upgrading from version 4.2.0 to 4.3.0 for some PDF files. The corresponding text layer is invisible, but can be selected.

When viewing the diff, two Q operators are missing in version 4.3.0.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.4.0-150600.23.25-default-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfWriter

writer = PdfWriter(clone_from='file.pdf')
for page in writer.pages:
    page.transfer_rotation_to_content()
writer.write('out.pdf')

I do not have a suitable PDF file at the moment, but I am working on getting one.

@stefan6419846 stefan6419846 added PdfWriter The PdfWriter component is affected is-regression Regression introduced as a side-effect of another change labels Oct 30, 2024
@stefan6419846 stefan6419846 changed the title PageObject.transfer_rotation_to_content() hides content since pypdf 4.3.0 PageObject.transfer_rotation_to_content() hides some content since pypdf 4.3.0 Oct 30, 2024
@stefan6419846
Copy link
Collaborator Author

I managed to create a standalone example in the meantime: test_clean.pdf Please note that this might show further issues due to the cleanup done by me.

After running the above code with pypdf version 4.2.0 and 4.3.0, I get the following diff:

diff --git a/result_4.2.0.pdf b/result_4.3.0.pdf
index 04d3347..72ec47e 100644
--- a/result_4.2.0.pdf
+++ b/result_4.3.0.pdf
@@ -72,7 +72,7 @@ endstream
 endobj
 8 0 obj
 <<
-/Length 992
+/Length 990
 >>
 stream
 q
@@ -122,7 +122,6 @@ BI
 ID /221̎215346^PT^PBS377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377360^A^@^P
 EI
 Q
-Q
 q
 110 170 5520 7850 re
 W
@@ -177,8 +176,8 @@ xref
 0000000576 00000 n 
 0000000845 00000 n 
 0000001785 00000 n 
-0000002828 00000 n 
-0000002865 00000 n 
+0000002826 00000 n 
+0000002863 00000 n 
 trailer
 <<
 /Size 11
@@ -186,5 +185,5 @@ trailer
 /Info 10 0 R
 >>
 startxref
-2929
+2927
 %%EOF

The most apparent change seems to be that there is one Q operator less than before.

The output files: result_4.2.0.pdf result_4.3.0.pdf

You can already see that the "abc" text disappeared. When rendering this as PNG through Ghostscript, we can see that the white circles disappear as well.

For 4.2.0:

result_4 2 0

For 4.3.0:

result_4 3 0

@stefan6419846
Copy link
Collaborator Author

The offending commit appears to be 23a81ba, which makes sense as the offending image is an inline image (although never requesting it explicitly).

@stefan6419846
Copy link
Collaborator Author

Further debugging shows the following behavior:

  • In version 4.3.0, pypdf.generic._data_structures.ContentStream._read_inline_image would point the input stream stream at the Q\nQ\nq sequence, where the first operator is indeed the one that is missing now.
  • In version 5.1.0,
    data = extract_inline_default(stream)
    ei = stream.read(3)
    stream.seek(-1, 1)
    if ei[0:2] != b"EI" or ei[2:3] not in WHITESPACES:
    stream.seek(savpos, 0)
    data = extract_inline_default(stream)
    is really weird:
    • Line 1367 extracts the correct image data.
    • Line 1369 sets ei = b'\nQ\n', where the missing Q operator can already be seen.
    • Line 1371 does not see the EI = end image tag.
    • Line 1372 resets the input stream to the same position as before running line 1367.
    • Line 1373 does exactly the same as line 1367.

From this, some questions arise for me regarding the new implementation:

  • Do we always extract inline images twice? At first sight, it does indeed look like this.
  • Why do we actually need the second extraction run with basically the same input stream? Isn't this just the same as the first one, just without checking the input stream afterwards?
  • Are there any side effects of doing stream.seek(saved_pos - 1, 0) when truncating to the actual image data?
    stream_out.truncate(sav_pos_ei)
    • This would always satisfy line 1371 of _read_inline_image if I am not mistaken. Do we still have to add any additional handling if this is not the case here?

@stefan6419846
Copy link
Collaborator Author

Partially answering my own questions after changing the input stream position on EI discovery as proposed in my previous comment:

  • tika-935066.pdf runs through

    cs = settings.get("/CS", "")
    if "RGB" in cs:
    lcs = 3
    elif "CMYK" in cs:
    lcs = 4
    else:
    bits = settings.get(
    "/BPC",
    8 if cs in {"/I", "/G", "/Indexed", "/DeviceGray"} else -1,
    )
    if bits > 0:
    lcs = bits / 8.0
    else:
    data = extract_inline_default(stream)
    lcs = -1
    if lcs > 0:
    data = stream.read(
    ceil(cast(int, settings["/W"]) * lcs) * cast(int, settings["/H"])
    )
    ei = read_non_whitespace(stream)
    stream.seek(-1, 1)
    The relevant section from the input is

    \r\nBI /W 8 /H 3 /BPC 1 /IM true ID \xf3\xe0\xc0\x80\r\nEI Q\r\n
    

    We do indeed run into line 1371 here, as with lcs = 0.125, only 0.125 * 8 * 3 = 3 bytes are being read, but returning b'\xf3\xe0\xc0\x80\r\n' as the image data here does not look right either ...

  • The last example of test_extra_test_iss1541 runs into line 1371 here as well (it is the same special case without an extra filter). The error might be a bit nicer here, but there is no real benefit of the extraction run here either.

@stefan6419846
Copy link
Collaborator Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-regression Regression introduced as a side-effect of another change PdfWriter The PdfWriter component is affected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant