Splitting PDF files resulting in larger-than-expected output PDF files #1322

Closed
xilopaint opened this issue Sep 3, 2022 · 41 comments
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF needs-pdf The issue needs a PDF file to show the problem nf-performance Non-functional change: Performance

Comments

@xilopaint
Contributor

xilopaint commented Sep 3, 2022

I've noticed a weird behavior when splitting some PDF files that contain a table of contents with outlines on their first pages. The output file containing the table of contents has the same file size as the input PDF, although it has fewer pages because it has been split.

For example, splitting a 9.6 MB PDF with 199 pages using the code snippet below, I get two output files: one of 9.6 MB with 100 pages and another of 3 MB with 99 pages.

Running similar code with pikepdf I have no issues: I get one file of 6.7 MB with 100 pages and another of 3 MB with 99 pages.

Environment

$ python -m platform
macOS-12.5.1-x86_64-i386-64bit

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.4

Code

from PyPDF2 import PdfReader, PdfMerger, PageRange

reader = PdfReader("sample.pdf")

num_pages = len(reader.pages)
page_ranges = [PageRange(slice(n, n + 100)) for n in range(0, num_pages, 100)]

for n, page_range in enumerate(page_ranges, 1):
    merger = PdfMerger()
    merger.append(reader, pages=page_range)
    merger.write(f"{'sample'} [{'part'} {n}].pdf")

PDF

Unfortunately, I can't share the PDF file.

@MartinThoma
Member

It's interesting that this might be connected to outlines. Thank you for the hint!

I don't know when anybody will pick this one up, but in the meantime you might be interested in https://pypdf2.readthedocs.io/en/latest/user/file-size.html

@MartinThoma MartinThoma added nf-performance Non-functional change: Performance is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF needs-pdf The issue needs a PDF file to show the problem labels Sep 4, 2022
@xilopaint
Contributor Author

xilopaint commented Sep 4, 2022

HowtoMakeAccessiblePDF.pdf

Hey @MartinThoma, I just found the above PDF file, which you can use to reproduce the issue. The file is 6.8 MB and has 34 pages. Splitting it into files of at most 10 pages each using PyPDF2, I get:

File 1: 10 pages / 6.8 MB
File 2: 10 pages / 2 MB
File 3: 10 pages / 3 MB
File 4: 4 pages / 1.7 MB

Using pikepdf I get:

File 1: 10 pages / 588 KB
File 2: 10 pages / 2 MB
File 3: 10 pages / 3 MB
File 4: 4 pages / 1.7 MB

As you can see I don't get the issue with the first output file using pikepdf.

@xilopaint
Contributor Author

Hey @MartinThoma, could you reproduce the issue with the PDF file I provided?

@pubpub-zz
Collaborator

Reproduced!
I did some analysis, and the output contains objects, such as pages, that are not expected to be there.
I am still analyzing which part of the code or which parameters bring in the extra pages.

@xilopaint
Contributor Author

Thanks for looking into it!

@pubpub-zz
Collaborator

pubpub-zz commented Sep 11, 2022

I think I've got it. When writing, PyPDF2 runs a step that resolves and adjusts the object references in the writer. During this step, it walks through the pages to be written and collects all the indirect objects they reference. In HowtoMakeAccessiblePDF.pdf, the fifth page contains some (internet) link annotations. In these annotations, the "/P" field references the reader's page and not the modified page. This causes the reader's page to be collected as well and, through its "/Parent", the other objects. Those objects are collected but are not listed in the "/Pages" tree, and are therefore not displayed.
To fix this I've started to work on a "cloning" capability (identified in #1194).
Work is in progress.

Note: I did not check which other fields would induce the same effect as the annotations.
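
The mechanism described above can be illustrated with a toy model. This is a minimal sketch, not PyPDF2's actual writer code: objects are plain dicts keyed by made-up IDs, `Ref` is a hypothetical stand-in for an indirect reference, and `collect` is a hypothetical helper that mimics the "collect everything reachable" step. It shows how a single stale "/P" back-reference can drag the whole source page tree into the output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference to object `oid`."""
    oid: int


def refs_in(value):
    """Yield the target ID of every Ref nested inside dicts and lists."""
    if isinstance(value, Ref):
        yield value.oid
    elif isinstance(value, dict):
        for item in value.values():
            yield from refs_in(item)
    elif isinstance(value, list):
        for item in value:
            yield from refs_in(item)


def collect(objects, roots):
    """Return the IDs of every object reachable from `roots` via Ref values."""
    seen, stack = set(), list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(refs_in(objects[oid]))
    return seen


objects = {
    # Source side: a page tree (1) with two pages (2, 3) and a link annotation (4).
    1: {"/Type": "/Pages", "/Kids": [Ref(2), Ref(3)]},
    2: {"/Type": "/Page", "/Parent": Ref(1), "/Annots": [Ref(4)]},
    3: {"/Type": "/Page", "/Parent": Ref(1), "/Contents": "large stream"},
    4: {"/Subtype": "/Link", "/P": Ref(2)},  # back-reference to the source page
    # Output side: a new tree (10) holding a shallow copy (11) of page 2.
    10: {"/Type": "/Pages", "/Kids": [Ref(11)]},
    11: {"/Type": "/Page", "/Parent": Ref(10), "/Annots": [Ref(4)]},
}

# The annotation's "/P" leads back to page 2, then "/Parent" to the old tree.
leaked = collect(objects, roots=[10])

# With "/Annots" left out of the copy, nothing from the source is reachable.
objects[11] = {"/Type": "/Page", "/Parent": Ref(10)}
clean = collect(objects, roots=[10])

print(sorted(leaked), sorted(clean))
```

In this model the first collection contains every source object, including page 3, which was never requested, while the second contains only the two output objects.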

@xilopaint
Contributor Author

Please, let me know when I can test the fix in other PDF files I have.

@xilopaint
Contributor Author

Any progress here @pubpub-zz?

@pubpub-zz
Collaborator

Work in progress... The cloning is not so easy...🤔

@pubpub-zz
Collaborator

@xilopaint
A draft PR is available. I ran some tests on HowtoMakeAccessiblePDF.pdf,
and in this file the problem is linked to /Annots entries that contain links to other pages. Passing ["/Annots"] to add_page prevents "/Annots" from being copied and avoids the size increase. An easy way to get a rough idea of the size is to monitor len(w._objects) and check that it increases slowly.
This is just a first draft, but a good basis for improvement, isn't it?

ps:
@MartinThoma some advice in order to clear the mypy errors would be appreciated.

@xilopaint
Contributor Author

A PR still draft is available.

I still have the issue using your fork with my code. Should I change anything to make it work?

@pubpub-zz
Collaborator

You should use PdfWriter (PdfMerger has not been modified yet):

import PyPDF2
r=PyPDF2.PdfReader("e:/HowtoMakeAccessiblePDF.pdf")
w=PyPDF2.PdfWriter()
for i in range(10):
    _=w.add_page(r.pages[i],("/Annots","/B"))                  # _= is not required if this coded in 
w.write("e:/extract1-10.pdf")

@xilopaint
Contributor Author

w.add_page(r.pages[i],("/Annots","/B"))

This feels a bit weird to me. In pikepdf we don't need to pass any extra parameter to make it work. It just works.

@pubpub-zz
Collaborator

As said, this is a first draft. We now have a solution in which the objects can be modified properly. We now need to find the best way to encapsulate it.

@pubpub-zz
Collaborator

This issue is not yet closed😉

@MartinThoma
Member

I'm away only with my phone until 5th of October. I'll look into it after that (please remind me if I don't answer on 6th 😅)

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 27, 2022
fixes py-pdf#1322
cope with cmap where the range contains first and last   code are on variable length.
Also fix cases where the code is on 3 characters only (not standard)
no test data available
@pubpub-zz
Collaborator

The commit above referenced the wrong issue by mistake. This issue should stay open for the moment.

@MartinThoma MartinThoma reopened this Sep 28, 2022
@xilopaint
Contributor Author

I'm away only with my phone until 5th of October. I'll look into it after that (please remind me if I don't answer on 6th 😅)

@MartinThoma

Done!

@pubpub-zz
Collaborator

@xilopaint,
If you want to try it, I've completed the PR with all the functions from PdfMerger.
You just need to replace PdfMerger with PdfWriter (no other change required):

import PyPDF2
reader = PyPDF2.PdfReader("e:/HowtoMakeAccessiblePDF.pdf")  # PdfFileReader is the deprecated name
num_pages = len(reader.pages)
page_ranges = [PyPDF2.PageRange(slice(n, n + 10)) for n in range(0, num_pages, 10)]

for n, page_range in enumerate(page_ranges, 1):
    merger = PyPDF2.PdfWriter()
    merger.append(reader, pages=page_range)
    merger.write(f"e:/Downloads/{'sample'} [{'part'} {n}].pdf")

result of dir :
10/10/2022 19:29 580 716 sample [part 1].pdf
10/10/2022 19:29 1 963 843 sample [part 2].pdf
10/10/2022 19:29 2 996 937 sample [part 3].pdf
10/10/2022 19:29 1 680 228 sample [part 4].pdf

MartinThoma added a commit that referenced this issue Oct 10, 2022
Bug Fixes (BUG):
- td matrix (#1373)
- Cope with cmap from #1322 (#1372)

Robustness (ROB):
-  Cope with str returned from get_data in cmap (#1380)

Full Changelog: 2.11.0...2.11.1
@xilopaint
Contributor Author

@xilopaint,
If you want to try,I've completed the PR with all the functions from PdfMerger.
You just need to change PdfMerger by PdfWriter (no other change required)

It worked! Will this PR deprecate PdfMerger as PdfWriter is covering all its methods?

@pubpub-zz
Collaborator

This should ease maintainability.
For compatibility, PdfMerger should be kept as an alias of PdfWriter, perhaps with a deprecation warning. @MartinThoma, what is your opinion?
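
One common way to keep an old class name alive while steering users to the new one is sketched below. It is only an illustration of the idea, not the change that was made in the library: `Writer` and `Merger` are hypothetical stand-ins, and the real classes have a much larger API.

```python
import warnings


class Writer:
    """Hypothetical stand-in for the class that keeps the full API."""

    def __init__(self):
        self.pages = []

    def append(self, page):
        self.pages.append(page)


class Merger(Writer):
    """Hypothetical legacy name kept only for backward compatibility."""

    def __init__(self):
        warnings.warn(
            "Merger is deprecated; use Writer instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        super().__init__()


# Existing code keeps working, but the warning tells users to migrate.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = Merger()
    legacy.append("page 1")

print(caught[0].category.__name__, legacy.pages)
```

A subclass rather than a plain `Merger = Writer` assignment is used here so that the warning is emitted only when the old name is instantiated.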

@MartinThoma
Member

Will this PR deprecate PdfMerger as PdfWriter is covering all its methods?

I would actually be super happy about deprecating PdfMerger 😄 I always thought that the PdfMerger is confusing.

I would need to check carefully if PdfMerger can be replaced easily by PdfWriter.

@pubpub-zz
Collaborator

Before releasing, some extra tests should be done. @xilopaint, could you carry on with your tests? Some cleanup (mypy) will also be required.

@xilopaint
Contributor Author

xilopaint commented Oct 10, 2022

Before Issuing, some extra test should be done. @xilopaint, If you can carry on your test. And some cleanup (mypy) will be required.

@pubpub-zz

It looks like the PR introduced a bug. Please, run the test suite of my project. One of the tests is failing since I pushed your PR. You just need to run python3 -m unittest discover tests -b .

@pubpub-zz
Collaborator

@xilopaint,
With your project I'm getting ModuleNotFoundError: No module named 'fcntl'.
I'm working on Windows. Can you please propose a workaround or, failing that, at least report the stack trace at the point of failure?

@xilopaint
Contributor Author

@pubpub-zz you can reproduce the issue with the following sample code and PDF file:

foo.pdf

#!/usr/bin/env python3
from PyPDF2 import PageObject, PdfReader, PdfWriter

reader = PdfReader("foo.pdf")
writer = PdfWriter()

for page in reader.pages:
    out_page = PageObject.create_blank_page(None, 8.3 * 72, 11.7 * 72)
    out_page.merge_page(page)

    writer.add_page(out_page)

with open("bar.pdf", "wb") as f:
    writer.write(f)

reader = PdfReader("bar.pdf")

for n, page in enumerate(reader.pages, 1):
    print(int(page.extract_text()) == n)

The code works with the latest release but not with your fork.

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Oct 16, 2022
@pubpub-zz
Collaborator

@xilopaint
Thanks for the bug report. The problem seems to be solved; can you confirm?

@xilopaint
Contributor Author

@xilopaint
thanks for the trouble report. the problem seems to be solved, can you confirm?

@pubpub-zz yes, it's fixed.

@xilopaint
Contributor Author

xilopaint commented Oct 16, 2022

Is the PR ready to be merged now?

@pubpub-zz
Collaborator

Not yet,
I need to fix a few points about merging annotations and articles

MartinThoma pushed a commit that referenced this issue Dec 11, 2022
The method `.clone(pdf_dest,[force_duplicate])` clones the objects and all referenced objects.

If an object is already cloned, the already cloned object is returned (unless force_duplicate is set).
This is mainly for internal use, but it can be used on a page.
For PageObject/DictionaryObject/[Encoded/Decoded/Content]Stream, an extra parameter `ignore_fields` provides the list of fields that should not be cloned.

When available, the pointer to an object is available in `indirect_obj` attribute.

New API for add_page/insert_page that :

* returns the cloned page object
* ignore_fields can be provided as a parameter.

## Others

* file is closed at the end of PdfWriter.write when a filename is provided
* Breaking Change: `add_outline_item` now has a parameter `before`, which is not the last parameter

## Update
* The public API of PdfMerger has been added to PdfWriter (ready to make PdfMerger an alias of it)
* Properly process outline merging
* Properly process named destinations

Deals with #1194, #1322, #471, #1337
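
The cloning behaviour described in this commit message can be sketched with a toy model. This is not the library's implementation: `Node`, `clone_into`, and the `dest` dictionary are hypothetical names, and only three ideas from the text are modelled. Referenced objects are cloned recursively, an object that was already cloned is reused unless `force_duplicate` is set, and fields listed in `ignore_fields` are skipped (in this toy model, at every level).

```python
class Node:
    """Hypothetical object with named fields that may point to other Nodes."""

    def __init__(self, **fields):
        self.fields = fields


def clone_into(obj, dest, force_duplicate=False, ignore_fields=()):
    """Clone `obj` and everything it references into `dest`.

    `dest` maps id(original) -> clone, so each original is cloned once
    unless `force_duplicate` asks for a fresh copy of the top-level object.
    """
    if not force_duplicate and id(obj) in dest:
        return dest[id(obj)]
    copy = Node()
    dest[id(obj)] = copy  # register before recursing so cycles terminate
    for name, value in obj.fields.items():
        if name in ignore_fields:
            continue  # skipped fields are never followed, so never cloned
        if isinstance(value, Node):
            value = clone_into(value, dest, ignore_fields=ignore_fields)
        copy.fields[name] = value
    return copy


font = Node(name="F1")
page_a = Node(font=font, annots=Node(note="link"))
page_b = Node(font=font)

dest = {}
a = clone_into(page_a, dest, ignore_fields=("annots",))
b = clone_into(page_b, dest)                   # reuses the cloned font
again = clone_into(page_a, dest)               # returns the existing clone
fresh = clone_into(page_a, dest, force_duplicate=True)

print(a.fields["font"] is b.fields["font"], again is a, fresh is a)
```

Because the shared font is cloned only once per destination, two pages appended to the same writer do not duplicate it, which is the property that keeps the output small.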
@pubpub-zz
Collaborator

@xilopaint
The fixes introduced in PdfWriter.append() should have fixed the size issue. Can you update the status of this PR?

@xilopaint
Contributor Author

xilopaint commented Feb 5, 2023

The issue has been fixed.

@xilopaint
Contributor Author

xilopaint commented Feb 12, 2023

@pubpub-zz if I split a multi-page PDF file into PDF files with one page each, in some cases the sum of the output file sizes is larger than the input file size, and in other cases it is smaller. Why is that? Is this the expected behavior?

@pubpub-zz
Collaborator

pubpub-zz commented Feb 12, 2023

Just general assumptions, without checking (sorry, I will have no time to analyze them):
if smaller, some objects may be unused and discarded (typically, with append, /Info is not copied);
if bigger, some objects may be referenced from several pages and are then duplicated in the different files.

Therefore the behavior you are observing is not abnormal.
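
Both cases can be checked with simple arithmetic. The sketch below uses made-up object names and byte sizes, and the hypothetical helper `compare_sizes` ignores the fixed per-file overhead (header, cross-reference table, trailer), so it only illustrates the direction of the effect.

```python
def compare_sizes(sizes, pages):
    """Return (size of the single input file, sum of the one-page files).

    `sizes` maps an object name to its size in bytes; `pages` maps a page
    to the objects it references. The input stores every object once,
    while each one-page file stores every object its page references.
    """
    whole = sum(sizes.values())
    parts = sum(sizes[name] for refs in pages.values() for name in refs)
    return whole, parts


# Case 1: a font shared by both pages is duplicated in each output file.
shared = compare_sizes(
    {"font": 300, "content1": 100, "content2": 100},
    {"page1": ["font", "content1"], "page2": ["font", "content2"]},
)

# Case 2: an object no page references is dropped from every output file.
orphaned = compare_sizes(
    {"content1": 100, "content2": 100, "orphan": 500},
    {"page1": ["content1"], "page2": ["content2"]},
)

print(shared, orphaned)
```

In the first case the parts add up to more than the whole, in the second to less, matching the two situations described above.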

@xilopaint
Contributor Author

xilopaint commented Feb 12, 2023

@pubpub-zz could you run this script when you have some time?

#!/usr/bin/env python3

import os

from pypdf import PdfReader, PdfWriter

reader = PdfReader("test.pdf")
pg_sizes = []

for n, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)

    with open(f"page{n}", "wb") as f:
        writer.write(f)

    file_size = os.path.getsize(f"page{n}")
    pg_sizes.append(file_size)
    os.remove(f"page{n}")

inp_file_size = os.path.getsize("test.pdf")
sum_pg_sizes = sum(pg_sizes)
print(f"input file size: {inp_file_size}")
print(f"output file sizes sum: {sum_pg_sizes}")

Here you have the test PDF file.

The output of the script is:

input file size: 541984
output file sizes sum: 944794

As you can see the sum of the output file sizes is much larger than the size of the input file. Maybe I could improve the performance of my application if I can understand why this happens.

@pubpub-zz
Collaborator

@xilopaint
First, redo the test using writer.append(reader, [page_num]): .add_page() is not optimized.

@xilopaint
Contributor Author

xilopaint commented Feb 12, 2023

@xilopaint
first redo the test using writer.append(reader,[page_num]) : .add_page()is not optimized

@pubpub-zz running this code:

#!/usr/bin/env python3

import os

from pypdf import PdfReader, PdfWriter

reader = PdfReader("test.pdf")
pg_sizes = []

for i in range(len(reader.pages)):
    writer = PdfWriter()
    writer.append(reader, pages=(i, i + 1))

    with open(f"page{i + 1}", "wb") as f:
        writer.write(f)

    file_size = os.path.getsize(f"page{i + 1}")
    pg_sizes.append(file_size)
    os.remove(f"page{i + 1}")

inp_file_size = os.path.getsize("test.pdf")
sum_pg_sizes = sum(pg_sizes)
print(f"input file size: {inp_file_size}")
print(f"output file sizes sum: {sum_pg_sizes}")

I get:

input file size: 541984
output file sizes sum: 693229

Still a significant difference between the input file size and the sum.

@pubpub-zz
Collaborator

Can you try to identify duplicate objects using https://github.com/Rossi1337/pdf_vole?

@xilopaint
Contributor Author

can you try to identify duplicate objects using https://github.com/Rossi1337/pdf_vole

@pubpub-zz I'm afraid I don't know how to run this thing. Also, I don't have Java installed on my Mac. Are you able to run it with the PDF I've uploaded?

@pubpub-zz
Collaborator

I slightly modified the program to show the translation (ID in reader -> ID in writer):

#!/usr/bin/env python3

import os

from pypdf import PdfReader, PdfWriter

reader = PdfReader("test.pdf")
pg_sizes = []

for i in range(len(reader.pages)):
    print(f"page{i}")
    writer = PdfWriter()
    writer.append(reader, pages=(i, i + 1))
    print(writer._id_translated[id(reader)])

    with open(f"page{i + 1}", "wb") as f:
        writer.write(f)

    file_size = os.path.getsize(f"page{i + 1}")
    pg_sizes.append(file_size)
    os.remove(f"page{i + 1}")

inp_file_size = os.path.getsize("test.pdf")
sum_pg_sizes = sum(pg_sizes)
print(f"input file size: {inp_file_size}")
print(f"output file sizes sum: {sum_pg_sizes}")

the output shows:

page0
{3: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 11: 11, 12: 12, 415: 13, 414: 14, 4: 15}
page1
{10: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 32: 11, 33: 12, 34: 13, 35: 14, 36: 15, 410: 16, 411: 17, 409: 18, 37: 19, 38: 20, 39: 21, 40: 22, 41: 23, 42: 24, 31: 25}
page2
{15: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 32: 11, 33: 12, 34: 13, 35: 14, 36: 15, 410: 16, 411: 17, 409: 18, 44: 19, 45: 20, 47: 21, 48: 22, 49: 23, 50: 24, 51: 25, 52: 26, 43: 27, 46: 31}
page3
{18: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 44: 11, 45: 12, 54: 13, 55: 14, 56: 15, 57: 16, 58: 17, 53: 18}
page4
{19: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 44: 11, 60: 12, 59: 13}
page5
{20: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 54: 11, 62: 12, 63: 13, 61: 14}
page6
{21: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 45: 11, 65: 12, 66: 13, 67: 14, 64: 15}
page7
{22: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 69: 11, 68: 12}
page8
{23: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 71: 11, 72: 12, 73: 13, 70: 14}
page9
{24: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 45: 11, 75: 12, 76: 13, 77: 14, 74: 15}
page10
{25: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 44: 11, 45: 12, 79: 13, 78: 14}
page11
{26: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 89: 11, 90: 12, 91: 13, 80: 14, 81: 18, 82: 19, 83: 20, 84: 21, 85: 22, 86: 23, 87: 24, 88: 25}
page12
{29: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 44: 11, 45: 12, 93: 13, 94: 14, 95: 15, 96: 16, 97: 17, 98: 18, 99: 19, 92: 20}
page13
{100: 4, 5: 5, 6: 6, 412: 7, 7: 8, 8: 9, 413: 10, 44: 11, 45: 12, 93: 13, 102: 14, 103: 15, 104: 16, 105: 17, 106: 18, 107: 19, 108: 20, 109: 21, 101: 22}
input file size: 541984
output file sizes sum: 693229

You can see that some reader IDs appear in many of the output files. For example, ID 412 corresponds to the Widths table of the font, which is quite big.
The size increase therefore looks absolutely normal.
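
The repetition can also be counted rather than read off by eye. The following sketch takes the reader-side IDs (the dictionary keys) of three of the pages listed in the output above and counts in how many output files each one appears; `copies_per_object` is a hypothetical helper name.

```python
from collections import Counter

# Reader-side object IDs for three pages, copied from the output above.
reader_ids = {
    "page3": [18, 5, 6, 412, 7, 8, 413, 44, 45, 54, 55, 56, 57, 58, 53],
    "page4": [19, 5, 6, 412, 7, 8, 413, 44, 60, 59],
    "page5": [20, 5, 6, 412, 7, 8, 413, 54, 62, 63, 61],
}


def copies_per_object(ids_by_page):
    """Count in how many output files each reader object ID ends up."""
    counts = Counter()
    for ids in ids_by_page.values():
        counts.update(set(ids))  # one count per file, even if listed twice
    return counts


counts = copies_per_object(reader_ids)
duplicated = sorted(oid for oid, n in counts.items() if n > 1)
print(duplicated)
```

Every ID in `duplicated` is written once per output file that needs it, so a large shared object such as 412 is paid for in each of those files.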

@MartinThoma
Member

That reminds me of something: is it possible to make embedded fonts accessible via the reader? This would potentially help people to understand the size of their document better.

It might also be cool if we had an option to strip embedded fonts
