Splitting PDF files resulting in larger-than-expected output PDF files #1322

xilopaint · 2022-09-03T17:53:38Z

I've noticed weird behavior when splitting some PDF files that have a table of contents with outlines on their first pages. The output file containing the table of contents has the same file size as the input PDF, even though it has fewer pages.

For example, splitting a 9.6 MB PDF with 199 pages using the code snippet below, I get two output files: one with 100 pages and 9.6 MB, and another with 99 pages and 3 MB.

Running similar code with pikepdf, I have no issues: I get one file with 100 pages and 6.7 MB and another with 99 pages and 3 MB.

Environment

$ python -m platform
macOS-12.5.1-x86_64-i386-64bit

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.4

Code

from PyPDF2 import PageRange, PdfMerger, PdfReader

reader = PdfReader("sample.pdf")

# One PageRange per chunk of at most 100 pages.
num_pages = len(reader.pages)
page_ranges = [PageRange(slice(n, n + 100)) for n in range(0, num_pages, 100)]

# Write each chunk to its own file: "sample [part 1].pdf", "sample [part 2].pdf", ...
for n, page_range in enumerate(page_ranges, 1):
    merger = PdfMerger()
    merger.append(reader, pages=page_range)
    merger.write(f"sample [part {n}].pdf")
    merger.close()

PDF

Unfortunately, I can't share the PDF file.

MartinThoma · 2022-09-04T15:51:17Z

It's interesting that this might be connected to outlines. Thank you for the hint!

I don't know when anybody will pick this one up, but in the meantime you might be interested in https://pypdf2.readthedocs.io/en/latest/user/file-size.html

xilopaint · 2022-09-04T18:11:02Z

HowtoMakeAccessiblePDF.pdf

Hey @MartinThoma, I just found the above PDF file, which you can use to reproduce the issue. The file is 6.8 MB and has 34 pages. Splitting it into files of at most 10 pages each with PyPDF2, I get:

File 1: 10 pages / 6.8 MB
File 2: 10 pages / 2 MB
File 3: 10 pages / 3 MB
File 4: 4 pages / 1.7 MB

Using pikepdf I get:

File 1: 10 pages / 588 KB
File 2: 10 pages / 2 MB
File 3: 10 pages / 3 MB
File 4: 4 pages / 1.7 MB

As you can see I don't get the issue with the first output file using pikepdf.
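
For reference, the page counts listed above follow directly from how the ranges are cut; only the sizes differ between the two libraries. A minimal sketch in plain Python, with no PDF library involved (`chunk_bounds` is a hypothetical helper name, not part of PyPDF2 or pikepdf):

```python
def chunk_bounds(num_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, stop) page index pairs, one per output file."""
    return [
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


# The 34-page sample split into files of at most 10 pages each.
for part, (start, stop) in enumerate(chunk_bounds(34, 10), 1):
    print(f"File {part}: {stop - start} pages")
```

The last chunk is clamped explicitly here; the snippet in the report gets the same effect from slice semantics.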

xilopaint · 2022-09-09T14:30:15Z

Hey @MartinThoma, could you reproduce the issue with the PDF file I provided?

pubpub-zz · 2022-09-09T18:45:24Z

Reproduced!
I did some analysis, and the output contains some unexpected objects, such as pages that should not be there.
I'm still analyzing which part of the code or which parameters cause the extra pages.

xilopaint · 2022-09-09T19:50:21Z

Thanks for looking into it!

pubpub-zz · 2022-09-11T08:18:24Z

I think I've got it. When writing, a process runs to reference/adjust the objects in the writer object. During this step, PyPDF2 walks through the pages to be written and "collects" all the indirect objects they reference. In HowtoMakeAccessiblePDF.pdf, the fifth page contains some (internet) link annotations. In these annotations, the "/P" field references the reader's page and not the modified page. This causes the reader's page to be collected as well and, through its "/Parent", the other objects. Those objects are only collected; they are not listed in the "/Pages" tree and are therefore not displayed.
To fix this, I've started to work on a "cloning" capability (identified in #1194).
Work is in progress.

Note: I did not check which other fields could induce the same effect as the annotation.
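
The mechanism described in this comment can be illustrated with a toy object graph. This is only a sketch under stated assumptions: the dictionaries, object numbers, and the `reachable` helper are invented for illustration and are not PyPDF2 internals. Integers stand for indirect references, and the copied page keeps pointing at the source file's annotation.

```python
from collections import deque

# Hypothetical source file: object id -> fields. An int (or a list of ints)
# stands for an indirect reference to another object of the source file.
source = {
    1: {"/Type": "/Pages", "/Kids": [2, 3, 4]},
    2: {"/Type": "/Page", "/Parent": 1, "/Contents": 20},
    3: {"/Type": "/Page", "/Parent": 1, "/Contents": 30, "/Annots": [31]},
    4: {"/Type": "/Page", "/Parent": 1, "/Contents": 40},
    20: {"/Type": "/Stream"},
    30: {"/Type": "/Stream"},
    31: {"/Type": "/Annot", "/Subtype": "/Link", "/P": 3},
    40: {"/Type": "/Stream"},
}


def refs(value):
    """Yield the object ids referenced by a field value."""
    if isinstance(value, int):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from refs(item)


def reachable(graph, page_fields, skip_fields=()):
    """Collect every source object reachable from the fields of a copied page."""
    seen = set()
    queue = deque()
    for key, value in page_fields.items():
        if key not in skip_fields:
            queue.extend(refs(value))
    while queue:
        obj_id = queue.popleft()
        if obj_id in seen:
            continue
        seen.add(obj_id)
        for value in graph[obj_id].values():
            queue.extend(refs(value))
    return seen


# The writer re-points "/Parent" to its own page tree, but "/Annots" still
# references the source annotation, whose "/P" leads back to the source page.
copied_page = {k: v for k, v in source[3].items() if k != "/Parent"}

with_annots = reachable(source, copied_page)
without_annots = reachable(source, copied_page, skip_fields=("/Annots",))
print(len(with_annots), len(without_annots))
```

In this model the single copied page drags in all eight source objects through "/P" and "/Parent", while skipping "/Annots" leaves only the page's own content stream.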

xilopaint · 2022-09-11T14:50:33Z

Please, let me know when I can test the fix in other PDF files I have.

xilopaint · 2022-09-24T00:39:20Z

Any progress here @pubpub-zz?

pubpub-zz · 2022-09-24T07:20:42Z

Work in progress... The cloning is not so easy...🤔

pubpub-zz · 2022-09-27T18:26:51Z

@xilopaint
A draft PR is available. I did some tests on HowtoMakeAccessiblePDF.pdf. In this file, I noticed that the problem is linked to "/Annots" entries that contain links to other pages. Passing ["/Annots"] to add_page prevents "/Annots" from being copied and avoids the size increase. An easy way to get a rough idea of the size is to monitor len(w._objects) and check that it increases slowly.
This is just a first draft, but it is a good basis for improvement, isn't it?

ps:
@MartinThoma some advice in order to clear the mypy errors would be appreciated.

xilopaint · 2022-09-27T19:26:38Z

A PR still draft is available.

I still have the issue using your fork with my code. Should I change anything to make it work?

pubpub-zz · 2022-09-27T19:44:00Z

You should use PdfWriter (PdfMerger has not been modified yet):

import PyPDF2
r=PyPDF2.PdfReader("e:/HowtoMakeAccessiblePDF.pdf")
w=PyPDF2.PdfWriter()
for i in range(10):
    _=w.add_page(r.pages[i],("/Annots","/B"))                  # _= is not required if this coded in 
w.write("e:/extract1-10.pdf")

xilopaint · 2022-09-27T20:47:56Z

w.add_page(r.pages[i],("/Annots","/B"))

This feels a bit weird to me. In pikepdf we don't need to pass any extra parameter to make it work. It just works.

pubpub-zz · 2022-09-27T20:52:15Z

As said, this is a first draft. We now have a solution where the objects can be modified properly. We now need to find the best way to encapsulate it.

pubpub-zz · 2022-09-27T20:52:42Z

This issue is not yet closed😉

MartinThoma · 2022-09-27T21:01:13Z

I'm away only with my phone until 5th of October. I'll look into it after that (please remind me if I don't answer on 6th 😅)

fixes py-pdf#1322 cope with cmap where the range contains first and last code are on variable length. Also fix cases where the code is on 3 characters only (not standard) no test data available

pubpub-zz · 2022-09-28T09:07:58Z

Mistake in the issue referenced. This issue should stay open for the moment

xilopaint · 2022-10-09T07:11:07Z

I'm away only with my phone until 5th of October. I'll look into it after that (please remind me if I don't answer on 6th 😅)

@MartinThoma

Done!

pubpub-zz · 2022-10-10T17:34:56Z

@xilopaint,
If you want to try it, I've completed the PR with all the functions from PdfMerger.
You just need to replace PdfMerger with PdfWriter (no other change required):

import PyPDF2

reader = PyPDF2.PdfReader("e:/HowtoMakeAccessiblePDF.pdf")
num_pages = len(reader.pages)
page_ranges = [PyPDF2.PageRange(slice(n, n + 10)) for n in range(0, num_pages, 10)]

for n, page_range in enumerate(page_ranges, 1):
    merger = PyPDF2.PdfWriter()  # PdfWriter used in place of PdfMerger
    merger.append(reader, pages=page_range)
    merger.write(f"e:/Downloads/sample [part {n}].pdf")

result of dir :
10/10/2022 19:29 580 716 sample [part 1].pdf
10/10/2022 19:29 1 963 843 sample [part 2].pdf
10/10/2022 19:29 2 996 937 sample [part 3].pdf
10/10/2022 19:29 1 680 228 sample [part 4].pdf

xilopaint · 2022-10-10T20:45:41Z

@xilopaint,
If you want to try,I've completed the PR with all the functions from PdfMerger.
You just need to change PdfMerger by PdfWriter (no other change required)

It worked! Will this PR deprecate PdfMerger as PdfWriter is covering all its methods?

pubpub-zz · 2022-10-10T20:53:50Z

This should ease maintainability.
For compatibility purposes, PdfMerger should be kept as a synonym of PdfWriter, maybe with a deprecation warning. @MartinThoma, what is your opinion?

MartinThoma · 2022-10-10T20:56:11Z

Will this PR deprecate PdfMerger as PdfWriter is covering all its methods?

I would actually be super happy about deprecating PdfMerger 😄 I always thought that the PdfMerger is confusing.

I would need to check carefully if PdfMerger can be replaced easily by PdfWriter.

pubpub-zz · 2022-10-10T20:58:50Z

Before releasing, some extra tests should be done. @xilopaint, if you can, please carry on with your tests. Some cleanup (mypy) will also be required.

xilopaint · 2022-10-10T21:48:38Z

Before Issuing, some extra test should be done. @xilopaint, If you can carry on your test. And some cleanup (mypy) will be required.

@pubpub-zz

It looks like the PR introduced a bug. Please, run the test suite of my project. One of the tests is failing since I pushed your PR. You just need to run python3 -m unittest discover tests -b .

pubpub-zz · 2022-10-11T16:50:22Z

@xilopaint,
with your project I'm getting ModuleNotFoundError: No module named 'fcntl'.
I'm working under Windows. Can you please propose a workaround or, failing that, at least report the stack trace at the failure?

xilopaint · 2022-10-16T14:52:31Z

@pubpub-zz you can reproduce the issue with the following sample code and PDF file:

foo.pdf

#!/usr/bin/env python3
from PyPDF2 import PageObject, PdfReader, PdfWriter

reader = PdfReader("foo.pdf")
writer = PdfWriter()

for page in reader.pages:
    out_page = PageObject.create_blank_page(None, 8.3 * 72, 11.7 * 72)
    out_page.merge_page(page)

    writer.add_page(out_page)

with open("bar.pdf", "wb") as f:
    writer.write(f)

reader = PdfReader("bar.pdf")

for n, page in enumerate(reader.pages, 1):
    print(int(page.extract_text()) == n)

The code works with the latest release but not with your fork.

pubpub-zz · 2022-10-16T21:00:23Z

@xilopaint
thanks for the trouble report. the problem seems to be solved, can you confirm?

xilopaint · 2022-10-16T21:04:30Z

@xilopaint
thanks for the trouble report. the problem seems to be solved, can you confirm?

@pubpub-zz yes, it's fixed.

xilopaint · 2022-10-16T21:05:53Z

Is the PR ready to be merged now?

pubpub-zz · 2022-10-16T21:08:28Z

Not yet,
I need to fix a few points about merging annotations and articles

The method `.clone(pdf_dest,[force_duplicate])` clones the object and all referenced objects. If an object has already been cloned, the existing clone is returned (unless force_duplicate is set). It is mainly for internal use but can be used on a page. For PageObject/DictionaryObject/[Encoded/Decoded/Content]Stream, an extra parameter ignore_fields provides the list of fields that should not be cloned.

When available, the pointer to an object is available in the `indirect_obj` attribute.

New API for add_page/insert_page that:
* returns the cloned page object
* accepts ignore_fields as a parameter

## Others
* file is closed at the end of PdfWriter.write when a filename is provided
* Breaking Change: `add_outline_item` now has a parameter `before`, which is not the last parameter

## Update
* The public API of PdfMerger has been added to PdfWriter (ready to make PdfMerger an alias of it)
* Process outline merging properly
* Process named destinations properly

Deals with #1194, #1322, #471, #1337
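
The cloning behavior summarized in this PR description (reuse an existing clone unless forced, optionally skip some fields) can be sketched with a toy model. Everything here is an assumption made for illustration: `ToyWriter`, its attributes, and the dictionary-based objects are hypothetical stand-ins, not the PyPDF2 implementation. Integers stand for indirect references, and `ignore_fields` is applied only to the top-level object being cloned.

```python
class ToyWriter:
    """Hypothetical destination file: stores cloned objects under new ids."""

    def __init__(self):
        self.objects = {}        # new id -> cloned object (dict of fields)
        self.id_translated = {}  # source id -> new id, filled while cloning

    def clone(self, source, src_id, ignore_fields=(), force_duplicate=False):
        """Clone source[src_id] and everything it references; return the new id."""
        if not force_duplicate and src_id in self.id_translated:
            return self.id_translated[src_id]  # reuse the existing clone
        new_id = len(self.objects) + 1
        self.objects[new_id] = {}              # reserve the slot before recursing
        self.id_translated[src_id] = new_id    # so that cycles resolve to this clone
        for key, value in source[src_id].items():
            if key in ignore_fields:
                continue                       # field is left out of the clone
            self.objects[new_id][key] = self._clone_value(source, value)
        return new_id

    def _clone_value(self, source, value):
        if isinstance(value, int):             # an int stands for an indirect reference
            return self.clone(source, value)
        if isinstance(value, list):
            return [self._clone_value(source, item) for item in value]
        return value                           # plain data is copied as-is


# Hypothetical source: a page whose link annotation points back at it via "/P".
source = {
    3: {"/Type": "/Page", "/Contents": 30, "/Annots": [31]},
    30: {"/Type": "/Stream"},
    31: {"/Type": "/Annot", "/P": 3},
}

writer = ToyWriter()
page_id = writer.clone(source, 3)
annot_id = writer.objects[page_id]["/Annots"][0]
print(writer.objects[annot_id]["/P"] == page_id)
```

Because the translation table is filled before recursing, the cloned annotation's "/P" resolves to the cloned page rather than to the source page, which is the property that keeps the source page tree out of the output.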

pubpub-zz · 2023-02-05T15:09:08Z

@xilopaint
The fixes introduced in PdfWriter.append() should have fixed the issue about size. Can you update the status of this issue?

xilopaint · 2023-02-05T15:13:50Z

The issue has been fixed.

xilopaint · 2023-02-12T15:31:02Z

@pubpub-zz if I split a multi-page PDF file into PDF files of one page each, in some cases the sum of the output file sizes is larger than the input file size, and in other cases it is smaller. Why is that? Is this the expected behavior?

pubpub-zz · 2023-02-12T15:38:00Z

Just general assumptions without checking (sorry, I will have no time to analyze them):
If smaller: some objects may be unused and discarded (typically, with append, /Info is not copied).
If bigger: some objects may be linked from several pages and are then duplicated in the different files.

Therefore the behavior you are observing is not abnormal.
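
Both cases can be reproduced with simple arithmetic. The sketch below is a simplified model with made-up numbers, not measurements: `split_size_sum` is a hypothetical helper, and per-file overhead (header, cross-reference table, trailer) is ignored.

```python
def split_size_sum(shared: dict[str, int],
                   users: dict[str, set[int]],
                   unique_per_page: list[int]) -> int:
    """Sum of single-page file sizes when every file carries its own copy
    of each shared object that its page references."""
    total = 0
    for page, unique in enumerate(unique_per_page):
        total += unique
        total += sum(size for name, size in shared.items() if page in users[name])
    return total


# Made-up sizes in bytes: one font and one object that no page references.
shared = {"font": 50_000, "unused": 20_000}
unique_per_page = [10_000, 12_000, 8_000]
input_size = sum(shared.values()) + sum(unique_per_page)

# Case 1: every page uses the font, so it is duplicated in each output file.
bigger = split_size_sum(shared, {"font": {0, 1, 2}, "unused": set()}, unique_per_page)

# Case 2: only the first page uses the font, and the unused object is dropped.
smaller = split_size_sum(shared, {"font": {0}, "unused": set()}, unique_per_page)

print(input_size, bigger, smaller)
```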

xilopaint · 2023-02-12T16:12:42Z

@pubpub-zz could you run this script when you have some time?

#!/usr/bin/env python3

import os

from pypdf import PdfReader, PdfWriter

reader = PdfReader("test.pdf")
pg_sizes = []

for n, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)

    with open(f"page{n}", "wb") as f:
        writer.write(f)

    file_size = os.path.getsize(f"page{n}")
    pg_sizes.append(file_size)
    os.remove(f"page{n}")

inp_file_size = os.path.getsize("test.pdf")
sum_pg_sizes = sum(pg_sizes)
print(f"input file size: {inp_file_size}")
print(f"output file sizes sum: {sum_pg_sizes}")

Here you have the test PDF file.

The output of the script is:

input file size: 541984
output file sizes sum: 944794

As you can see the sum of the output file sizes is much larger than the size of the input file. Maybe I could improve the performance of my application if I can understand why this happens.

pubpub-zz · 2023-02-12T16:46:40Z

@xilopaint
First, redo the test using writer.append(reader, [page_num]): .add_page() is not optimized.

xilopaint · 2023-02-12T16:56:55Z

@xilopaint
first redo the test using writer.append(reader,[page_num]) : .add_page()is not optimized

@pubpub-zz running this code:

#!/usr/bin/env python3

import os

from pypdf import PdfReader, PdfWriter

reader = PdfReader("test.pdf")
pg_sizes = []

for i in range(len(reader.pages)):
    writer = PdfWriter()
    writer.append(reader, pages=(i, i + 1))

    with open(f"page{i + 1}", "wb") as f:
        writer.write(f)

    file_size = os.path.getsize(f"page{i + 1}")
    pg_sizes.append(file_size)
    os.remove(f"page{i + 1}")

inp_file_size = os.path.getsize("test.pdf")
sum_pg_sizes = sum(pg_sizes)
print(f"input file size: {inp_file_size}")
print(f"output file sizes sum: {sum_pg_sizes}")

I get:

input file size: 541984
output file sizes sum: 693229

Still a significant difference between the input file size and the sum.

pubpub-zz · 2023-02-12T17:03:31Z

can you try to identify duplicate objects using https://github.com/Rossi1337/pdf_vole

xilopaint · 2023-02-12T17:11:18Z

can you try to identify duplicate objects using https://github.com/Rossi1337/pdf_vole

@pubpub-zz I'm afraid I don't know how to run this thing. Also, I don't have Java installed on my Mac. Are you able to run it with the PDF I've uploaded?

pubpub-zz · 2023-02-12T21:40:26Z

I slightly modified the program to show the translation (ID in reader -> ID in writer):

#!/usr/bin/env python3

import os

from pypdf import PdfReader, PdfWriter

reader = PdfReader("test.pdf")
pg_sizes = []

for i in range(len(reader.pages)):
    print(f"page{i}")
    writer = PdfWriter()
    writer.append(reader, pages=(i, i + 1))
    print(writer._id_translated[id(reader)])

    with open(f"page{i + 1}", "wb") as f:
        writer.write(f)

    file_size = os.path.getsize(f"page{i + 1}")
    pg_sizes.append(file_size)
    os.remove(f"page{i + 1}")

inp_file_size = os.path.getsize("test.pdf")
sum_pg_sizes = sum(pg_sizes)
print(f"input file size: {inp_file_size}")
print(f"output file sizes sum: {sum_pg_sizes}")

the output shows:

page0
{3: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 11: 11, 12: 12, 415: 13, 414: 14, 4: 15}
page1
{10: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 32: 11, 33: 12, 34: 13, 35: 14, 36: 15, 410: 16, 411: 17, 409: 18, 37: 19, 38: 20, 39: 21, 40: 22, 41: 23, 42: 24, 31: 25}
page2
{15: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 32: 11, 33: 12, 34: 13, 35: 14, 36: 15, 410: 16, 411: 17, 409: 18, 44: 19, 45: 20, 47: 21, 48: 22, 49: 23, 50: 24, 51: 25, 52: 26, 43: 27, 46: 31}
page3
{18: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 44: 11, 45: 12, 54: 13, 55: 14, 56: 15, 57: 16, 58: 17, 53: 18}
page4
{19: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 44: 11, 60: 12, 59: 13}
page5
{20: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 54: 11, 62: 12, 63: 13, 61: 14}
page6
{21: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 45: 11, 65: 12, 66: 13, 67: 14, 64: 15}
page7
{22: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 69: 11, 68: 12}
page8
{23: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 71: 11, 72: 12, 73: 13, 70: 14}
page9
{24: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 45: 11, 75: 12, 76: 13, 77: 14, 74: 15}
page10
{25: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 44: 11, 45: 12, 79: 13, 78: 14}
page11
{26: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 89: 11, 90: 12, 91: 13, 80: 14, 81: 18, 82: 19, 83: 20, 84: 21, 85: 22, 86: 23, 87: 24, 88: 25}
page12
{29: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 44: 11, 45: 12, 93: 13, 94: 14, 95: 15, 96: 16, 97: 17, 98: 18, 99: 19, 92: 20}
page13
{100: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 44: 11, 45: 12, 93: 13, 102: 14, 103: 15, 104: 16, 105: 17, 106: 18, 107: 19, 108: 20, 109: 21, 101: 22}
input file size: 541984
output file sizes sum: 693229

You can see that some reader IDs appear many times. ID 412, for example, corresponds to the Widths table of the font, which is quite big.
The size increase looks absolutely normal.
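
To spot the repeated IDs without reading the dictionaries by eye, the printed translation tables can be counted. A small sketch using three of the tables copied from the output above (page4, page5 and page7); the variable names are only illustrative:

```python
from collections import Counter

# Reader ID -> writer ID tables, copied from the output above.
translations = {
    "page4": {19: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 44: 11, 60: 12, 59: 13},
    "page5": {20: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 54: 11, 62: 12, 63: 13, 61: 14},
    "page7": {22: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 69: 11, 68: 12},
}

# Count in how many output files each reader object ends up.
counts = Counter(src_id for table in translations.values() for src_id in table)

# Reader objects that were copied into every one of these files.
in_every_file = sorted(i for i, n in counts.items() if n == len(translations))
print(in_every_file)
```

Each ID in that list is written once per output file, so its size is paid once per file instead of once overall.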

MartinThoma · 2023-02-13T06:33:06Z

That reminds me of something: is it possible to make embedded fonts accessible via the reader? This would potentially help people to understand the size of their document better.

It might also be cool if we had an option to strip embedded fonts

MartinThoma added the nf-performance, is-bug, and needs-pdf labels Sep 4, 2022

pubpub-zz mentioned this issue Sep 27, 2022

ENH: Add Cloning #1371

Merged

pubpub-zz mentioned this issue Sep 27, 2022

ROB: Cope with cmap from #1370 #1372

Merged

MartinThoma closed this as completed in f3b6d0e Sep 28, 2022

MartinThoma reopened this Sep 28, 2022

MartinThoma added a commit that referenced this issue Oct 10, 2022

REL: 2.11.1

d14f1de

Bug Fixes (BUG): - td matrix (#1373) - Cope with cmap from #1322 (#1372) Robustness (ROB): - Cope with str returned from get_data in cmap (#1380) Full Changelog: 2.11.0...2.11.1

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Oct 16, 2022

fix xylopaint

a9449a6

py-pdf#1322

xilopaint closed this as completed Feb 5, 2023
