`PageObject.transfer_rotation_to_content()` hides some content since pypdf 4.3.0 #2927

stefan6419846 · 2024-10-30T12:15:28Z

Calling page.transfer_rotation_to_content() changes the visibility of some content after upgrading from version 4.2.0 to 4.3.0 for some PDF files. The corresponding text layer is invisible, but can be selected.

When viewing the diff, two Q operators are missing in version 4.3.0.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.4.0-150600.23.25-default-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfWriter

writer = PdfWriter(clone_from='file.pdf')
for page in writer.pages:
    page.transfer_rotation_to_content()
writer.write('out.pdf')

I do not have a suitable PDF file at the moment, but I am working on getting one.


stefan6419846 · 2024-11-04T07:30:52Z

I managed to create a standalone example in the meantime: test_clean.pdf. Please note that this file might show further issues due to the cleanup I did.

After running the above code with pypdf version 4.2.0 and 4.3.0, I get the following diff:

diff --git a/result_4.2.0.pdf b/result_4.3.0.pdf
index 04d3347..72ec47e 100644
--- a/result_4.2.0.pdf
+++ b/result_4.3.0.pdf
@@ -72,7 +72,7 @@ endstream
 endobj
 8 0 obj
 <<
-/Length 992
+/Length 990
 >>
 stream
 q
@@ -122,7 +122,6 @@ BI
 ID /221̎215346^PT^PBS377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377360^A^@^P
 EI
 Q
-Q
 q
 110 170 5520 7850 re
 W
@@ -177,8 +176,8 @@ xref
 0000000576 00000 n 
 0000000845 00000 n 
 0000001785 00000 n 
-0000002828 00000 n 
-0000002865 00000 n 
+0000002826 00000 n 
+0000002863 00000 n 
 trailer
 <<
 /Size 11
@@ -186,5 +185,5 @@ trailer
 /Info 10 0 R
 >>
 startxref
-2929
+2927
 %%EOF

The most apparent change seems to be that there is one Q operator less than before.
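Since q saves the graphics state and Q restores it, a dropped Q leaves the two unbalanced. A minimal sketch of such a balance check follows; the helper name and both operator sequences are hypothetical illustrations, not the actual content of the attached files.

```python
def graphics_state_depths(operators):
    """Return (final_depth, lowest_depth) of the q/Q nesting.

    `operators` is any iterable of operator names as bytes.
    A balanced content stream ends with final_depth == 0.
    """
    depth = 0
    lowest = 0
    for op in operators:
        if op == b"q":      # save graphics state
            depth += 1
        elif op == b"Q":    # restore graphics state
            depth -= 1
            lowest = min(lowest, depth)
    return depth, lowest


# Hypothetical operator sequences, loosely modelled on the diff above:
# the second one lacks a single Q after the inline image.
ops_before = [b"q", b"q", b"BI", b"Q", b"Q", b"q", b"re", b"W", b"Q"]
ops_after = [b"q", b"q", b"BI", b"Q", b"q", b"re", b"W", b"Q"]

balance_before = graphics_state_depths(ops_before)
balance_after = graphics_state_depths(ops_after)
print(balance_before, balance_after)
```

With pypdf, the operator names could presumably be taken from the second element of each entry in ContentStream(page.get_contents(), writer).operations, assuming the page has a content stream.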

The output files: result_4.2.0.pdf and result_4.3.0.pdf

You can already see that the "abc" text disappeared. When rendering this as PNG through Ghostscript, we can see that the white circles disappear as well.

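For 4.2.0: [image: Ghostscript PNG rendering of result_4.2.0.pdf]

For 4.3.0: [image: Ghostscript PNG rendering of result_4.3.0.pdf]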

stefan6419846 · 2024-11-12T14:07:08Z

The offending commit appears to be 23a81ba, which makes sense as the offending image is an inline image (although it is never requested explicitly).

stefan6419846 · 2024-11-28T08:58:17Z

Further debugging shows the following behavior:

In version 4.3.0, pypdf.generic._data_structures.ContentStream._read_inline_image would point the input stream `stream` at the Q\nQ\nq sequence, where the first operator is indeed the one that is missing now.

In version 5.1.0,

pypdf/pypdf/generic/_data_structures.py

Lines 1367 to 1373 in bd26922

    
        data = extract_inline_default(stream)

    ei = stream.read(3)
    stream.seek(-1, 1)
    if ei[0:2] != b"EI" or ei[2:3] not in WHITESPACES:
        stream.seek(savpos, 0)
        data = extract_inline_default(stream)

is really weird:

Line 1367 extracts the correct image data.
Line 1369 sets ei = b'\nQ\n', where the missing Q operator can already be seen.
Line 1371 does not see the EI = end image tag.
Line 1372 resets the input stream to the same position as before running line 1367.
Line 1373 does exactly the same as line 1367.
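The walkthrough above can be reproduced in isolation with io.BytesIO. This is only a sketch: toy_extract is a hypothetical stand-in that leaves the stream right after EI (which is what the observed ei value suggests), not pypdf's extract_inline_default, and the WHITESPACES tuple and the input bytes are assumptions made for the example.

```python
import io

# Assumed set of whitespace bytes for this sketch.
WHITESPACES = (b" ", b"\n", b"\r", b"\t", b"\x00", b"\x0c")


def toy_extract(stream):
    # Hypothetical extractor: return the bytes before the "\nEI" marker
    # and leave the stream positioned directly after "EI".
    start = stream.tell()
    buf = stream.read()
    end = buf.index(b"\nEI")
    stream.seek(start + end + 3, 0)
    return buf[:end]


# Made-up image bytes followed by EI and the Q, Q, q operators.
stream = io.BytesIO(b"\x01\x02\x03\nEI\nQ\nQ\nq\n")
savpos = stream.tell()

data = toy_extract(stream)       # first extraction run
ei = stream.read(3)              # reads what follows EI, not EI itself
stream.seek(-1, 1)
needs_second_run = ei[0:2] != b"EI" or ei[2:3] not in WHITESPACES

if needs_second_run:
    stream.seek(savpos, 0)       # rewind to the start of the image data
    data_again = toy_extract(stream)  # second run, same input, same result

print(ei, needs_second_run, data == data_again)
```

Under these assumptions the check can never see EI, because the extractor has already consumed it, so the second run is always triggered and returns the same data.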

From this, some questions arise for me regarding the new implementation:

Do we always extract inline images twice? At first sight, it does indeed look like this.
Why do we actually need the second extraction run with basically the same input stream? Isn't this just the same as the first one, just without checking the input stream afterwards?
Are there any side effects of doing stream.seek(saved_pos - 1, 0) when truncating to the actual image data?

pypdf/pypdf/generic/_image_inline.py

Line 232 in bd26922

stream_out.truncate(sav_pos_ei)
- This would always satisfy line 1371 of _read_inline_image if I am not mistaken. Do we still have to add any additional handling if this is not the case here?

stefan6419846 · 2024-12-02T17:02:42Z

Partially answering my own questions after changing the input stream position on EI discovery as proposed in my previous comment:

tika-935066.pdf runs through

pypdf/pypdf/generic/_data_structures.py

Lines 1345 to 1365 in bd26922

    
    cs = settings.get("/CS", "")
    if "RGB" in cs:
        lcs = 3
    elif "CMYK" in cs:
        lcs = 4
    else:
        bits = settings.get(
            "/BPC",
            8 if cs in {"/I", "/G", "/Indexed", "/DeviceGray"} else -1,
        )
        if bits > 0:
            lcs = bits / 8.0
        else:
            data = extract_inline_default(stream)
            lcs = -1
    if lcs > 0:
        data = stream.read(
            ceil(cast(int, settings["/W"]) * lcs) * cast(int, settings["/H"])
        )
    ei = read_non_whitespace(stream)
    stream.seek(-1, 1)

The relevant section from the input is

\r\nBI /W 8 /H 3 /BPC 1 /IM true ID \xf3\xe0\xc0\x80\r\nEI Q\r\n

We do indeed run into line 1371 here, as with lcs = 0.125, only 0.125 * 8 * 3 = 3 bytes are being read, but returning b'\xf3\xe0\xc0\x80\r\n' as the image data here does not look right either ...
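The byte count can be checked by hand. The sketch below applies the same ceil(W * lcs) * H formula to the settings from the excerpt and splits the bytes following "ID " accordingly; the variable names are chosen for this example only.

```python
from math import ceil

# Settings taken from the BI ... ID header in the excerpt above.
settings = {"/W": 8, "/H": 3, "/BPC": 1}

lcs = settings["/BPC"] / 8.0                         # 1 bit per pixel -> 0.125 bytes
expected_length = ceil(settings["/W"] * lcs) * settings["/H"]

# Bytes that follow "ID " in the excerpt.
payload = b"\xf3\xe0\xc0\x80\r\nEI Q\r\n"
image_data = payload[:expected_length]
remainder = payload[expected_length:]

print(expected_length, image_data, remainder)
```

The remainder starts with \x80 rather than with whitespace followed by EI, which is why the EI check fails for this file.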

The last example of test_extra_test_iss1541 runs into line 1371 here as well (it is the same special case without an extra filter). The error might be a bit nicer here, but there is no real benefit of the extraction run here either.

stefan6419846 · 2024-12-18T13:13:25Z

Reference file: https://github.com/user-attachments/assets/abe16f48-9afa-4179-b1e8-62be27b95c26

stefan6419846 added the PdfWriter (The PdfWriter component is affected) and is-regression (Regression introduced as a side-effect of another change) labels Oct 30, 2024

stefan6419846 changed the title ~~PageObject.transfer_rotation_to_content() hides content since pypdf 4.3.0~~ PageObject.transfer_rotation_to_content() hides some content since pypdf 4.3.0 Oct 30, 2024

stefan6419846 mentioned this issue Nov 12, 2024

ROB: improve inline image extraction #2622

Merged

stefan6419846 mentioned this issue Dec 18, 2024

BUG: Avoid extracting inline images twice and dropping other operators #3002

Merged

pubpub-zz closed this as completed in #3002 Dec 18, 2024
