Splitting PDF files resulting in larger-than-expected output PDF files #1322
Comments
It's interesting that this might be connected to outlines. Thank you for the hint! I don't know when anybody will pick this one up, but in the meantime you might be interested in https://pypdf2.readthedocs.io/en/latest/user/file-size.html |
Hey @MartinThoma I just found the above PDF file that you can use to reproduce the issue. The file is 6.8 MB and has 34 pages. Splitting it into files of at most 10 pages each, with PyPDF2 I get: File 1: 10 pages / 6.8 MB. With pikepdf I get: File 1: 10 pages / 588 KB. As you can see, the first output file does not show the issue with pikepdf. |
Hey @MartinThoma, could you reproduce the issue with the PDF file I provided? |
Reproduced! |
Thanks for looking into it! |
I think I've got it. When writing, PyPDF2 runs a step that resolves and adjusts the object references in the writer. During this step, it walks through the pages to be written and "collects" all the indirect objects they reference. In HowtoMakeAccessiblePDF.pdf, the fifth page contains some (internet) link annotations. In these annotations, the "/P" field references the reader's page and not the modified page. This causes the reader's page to be collected as well and, through its "/Parent", the other objects. Those objects are only collected; they are not listed in the "/Pages" tree and are therefore not displayed. Note: I did not check which other fields could induce the same effect as the annotation. |
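To illustrate the mechanism described above, here is a minimal sketch in plain Python. It does not use PyPDF2; the `Ref` class, the `collect` function, and the object numbers are all hypothetical stand-ins for indirect references and the writer's collection step. It only shows how a single "/P" back-reference from an annotation can drag the whole source page tree into the output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference to another object."""
    target: int


def collect(objects, start, ignore_fields=()):
    """Return the ids of all objects reachable from `start` through references."""
    seen = set()
    stack = [start]
    while stack:
        obj_id = stack.pop()
        if obj_id in seen:
            continue
        seen.add(obj_id)
        for field, value in objects[obj_id].items():
            if field in ignore_fields:
                continue
            values = value if isinstance(value, list) else [value]
            stack.extend(v.target for v in values if isinstance(v, Ref))
    return seen


# Toy object table (all ids are made up).
# 1..8: the source document; 100/101: the page copied into the writer and its new tree.
objects = {
    1: {"/Type": "/Pages", "/Kids": [Ref(2), Ref(3), Ref(4)]},
    2: {"/Type": "/Page", "/Parent": Ref(1), "/Annots": [Ref(5)], "/Contents": Ref(6)},
    3: {"/Type": "/Page", "/Parent": Ref(1), "/Contents": Ref(7)},
    4: {"/Type": "/Page", "/Parent": Ref(1), "/Contents": Ref(8)},
    5: {"/Subtype": "/Link", "/P": Ref(2)},  # still points at the source page
    6: {"/Length": 100},
    7: {"/Length": 100},
    8: {"/Length": 100},
    100: {"/Type": "/Page", "/Parent": Ref(101), "/Annots": [Ref(5)], "/Contents": Ref(6)},
    101: {"/Type": "/Pages", "/Kids": [Ref(100)]},
}

everything = collect(objects, 100)
without_p = collect(objects, 100, ignore_fields=("/P",))
print(len(everything), len(without_p))
```

In this toy model, following "/P" makes every object of the source document reachable from the single copied page, while skipping that one field leaves only the page, its tree node, its annotation, and its content stream.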
Please, let me know when I can test the fix in other PDF files I have. |
Any progress here @pubpub-zz? |
Work in progress... The cloning is not so easy...🤔 |
@xilopaint ps: |
I still have the issue using your fork with my code. Should I change anything to make it work? |
You should use PdfWriter (For the moment PdfMerger has not been modified yet):
|
This feels a bit weird for me. In pikepdf we don't need to use any different parameter to make it work. It just works. |
As said, this is a first draft. We now have a solution where the objects can be modified in a proper manner. We now need to find the best way to encapsulate it. |
This issue is not yet closed😉 |
I'm away only with my phone until 5th of October. I'll look into it after that (please remind me if I don't answer on 6th 😅) |
Fixes py-pdf#1322. Cope with cmaps where the first and last codes of a range have variable length. Also fix cases where the code is only 3 characters long (not standard). No test data available.
Mistake in the issue referenced. This issue should stay open for the moment |
Done! |
@xilopaint,
result of dir : |
Bug Fixes (BUG): - td matrix (#1373) - Cope with cmap from #1322 (#1372) Robustness (ROB): - Cope with str returned from get_data in cmap (#1380) Full Changelog: 2.11.0...2.11.1
It worked! Will this PR deprecate |
This should ease maintainability. |
I would actually be super happy about deprecating PdfMerger 😄 I always thought that the PdfMerger is confusing. I would need to check carefully if PdfMerger can be replaced easily by PdfWriter. |
Before this is released, some extra tests should be done. @xilopaint, if you can, please carry on with your tests. Some cleanup (mypy) will also be required. |
It looks like the PR introduced a bug. Please, run the test suite of my project. One of the tests is failing since I pushed your PR. You just need to run |
@xilopaint, |
@pubpub-zz you can reproduce the issue with the following sample code and PDF file:

#!/usr/bin/env python3
from PyPDF2 import PageObject, PdfReader, PdfWriter

reader = PdfReader("foo.pdf")
writer = PdfWriter()
for page in reader.pages:
    # Merge each source page onto a blank A4-sized page (sizes in points).
    out_page = PageObject.create_blank_page(None, 8.3 * 72, 11.7 * 72)
    out_page.merge_page(page)
    writer.add_page(out_page)
with open("bar.pdf", "wb") as f:
    writer.write(f)

# Each page of foo.pdf contains only its own page number as text.
reader = PdfReader("bar.pdf")
for n, page in enumerate(reader.pages, 1):
    print(int(page.extract_text()) == n)

The code works with the latest release but not with your fork. |
@xilopaint |
@pubpub-zz yes, it's fixed. |
Is the PR ready to be merged now? |
Not yet, |
The method `.clone(pdf_dest,[force_duplicate])` clones the object and all the objects it references. If an object has already been cloned, the existing clone is returned (unless force_duplicate is set). It is mainly for internal use but can be used on a page.
For PageObject/DictionaryObject/[Encoded/Decoded/Content]Stream, an extra parameter ignore_fields provides the list of fields that should not be cloned.
When available, the pointer to an object is available in the `indirect_obj` attribute.
New API for add_page/insert_page that:
* returns the cloned page object
* accepts ignore_fields as a parameter
## Others
* The file is closed at the end of PdfWriter.write when a filename is provided
* Breaking Change: `add_outline_item` now has a parameter before which is not the last parameter
## Update
* The public API of PdfMerger has been added to PdfWriter (ready to make PdfMerger an alias of it)
* Outline merging is processed properly
* Named destinations are processed properly
Deals with #1194, #1322, #471, #1337
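The clone semantics described in this PR text (reuse an existing clone, duplicate on request, skip listed fields) can be sketched with a toy model. This is not the library's implementation: the `Obj` and `Writer` classes and the `_already_cloned` table are hypothetical, and as a simplification `force_duplicate` here only duplicates the top-level object while referenced objects still reuse existing clones.

```python
class Writer:
    """Hypothetical destination that owns cloned objects."""

    def __init__(self):
        self.objects = []          # clones owned by this writer
        self._already_cloned = {}  # id(source object) -> its clone


class Obj:
    """Hypothetical dictionary-like PDF object."""

    def __init__(self, **fields):
        self.fields = fields

    def clone(self, dest, force_duplicate=False, ignore_fields=()):
        key = id(self)
        if not force_duplicate and key in dest._already_cloned:
            return dest._already_cloned[key]
        new = Obj()
        # Register before recursing so reference cycles terminate.
        dest._already_cloned[key] = new
        dest.objects.append(new)
        for name, value in self.fields.items():
            if name in ignore_fields:
                continue
            if isinstance(value, Obj):
                value = value.clone(dest, False, ignore_fields)
            new.fields[name] = value
        return new


font = Obj(name="F1")
page_a = Obj(font=font)
page_b = Obj(font=font)

w = Writer()
a = page_a.clone(w)
b = page_b.clone(w)
count_shared = len(w.objects)  # two pages plus one shared font clone

a_again = page_a.clone(w)                      # reuses the existing clone
a_dup = page_a.clone(w, force_duplicate=True)  # makes a second page clone
no_font = page_b.clone(Writer(), ignore_fields=("font",))
```

The point of the lookup table is that two pages sharing a font end up sharing one cloned font in the destination, rather than each carrying its own copy.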
@xilopaint |
The issue has been fixed. |
@pubpub-zz if I split a multi-page PDF file into PDF files with one page each, in some cases the sum of the output PDF file sizes is larger, and in other cases, it is smaller than the input PDF file size. Why is that? Is this the expected behavior? |
Just some general assumptions, without checking (sorry, I will have no time to analyze them): Therefore, the behavior you are observing is not abnormal. |
@pubpub-zz could you run this script when you have some time?

#!/usr/bin/env python3
import os

from pypdf import PdfReader, PdfWriter

reader = PdfReader("test.pdf")
pg_sizes = []
for n, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page{n}", "wb") as f:
        writer.write(f)
    file_size = os.path.getsize(f"page{n}")
    pg_sizes.append(file_size)
    os.remove(f"page{n}")

inp_file_size = os.path.getsize("test.pdf")
sum_pg_sizes = sum(pg_sizes)
print(f"input file size: {inp_file_size}")
print(f"output file sizes sum: {sum_pg_sizes}")

Here you have the test PDF file. The output of the script is:
As you can see the sum of the output file sizes is much larger than the size of the input file. Maybe I could improve the performance of my application if I can understand why this happens. |
@xilopaint |
@pubpub-zz running this code:

#!/usr/bin/env python3
import os

from pypdf import PdfReader, PdfWriter

reader = PdfReader("test.pdf")
pg_sizes = []
for i in range(len(reader.pages)):
    writer = PdfWriter()
    writer.append(reader, pages=(i, i + 1))
    with open(f"page{i + 1}", "wb") as f:
        writer.write(f)
    file_size = os.path.getsize(f"page{i + 1}")
    pg_sizes.append(file_size)
    os.remove(f"page{i + 1}")

inp_file_size = os.path.getsize("test.pdf")
sum_pg_sizes = sum(pg_sizes)
print(f"input file size: {inp_file_size}")
print(f"output file sizes sum: {sum_pg_sizes}")

I get:
Still a significant difference between the input file size and the sum. |
Can you try to identify duplicate objects using https://github.com/Rossi1337/pdf_vole? |
@pubpub-zz I'm afraid I don't know how to run this thing. Also, I don't have Java installed on my Mac. Are you able to run it with the PDF I've uploaded? |
I slightly modified the program to show the translation (ID in reader -> ID in writer).
The output shows:
You can see that some reader IDs are shown many times. ID 412, for example, corresponds to the Widths table of the font, which is quite big. |
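A rough way to see why the per-page files add up to more than the input: an object shared by several pages (such as a font's Widths table) is stored once in the original but once per output file after splitting. The sketch below uses made-up byte sizes and a hypothetical `file_size` helper, and ignores compression and file overhead.

```python
# Hypothetical object sizes in bytes; "font" stands for a large shared resource.
sizes = {"page1": 2_000, "page2": 2_000, "page3": 2_000, "font": 50_000}

# Which shared objects each page references.
uses = {"page1": {"font"}, "page2": {"font"}, "page3": {"font"}}


def file_size(pages):
    """Size of a file holding `pages` plus each referenced object exactly once."""
    needed = set(pages)
    for page in pages:
        needed |= uses[page]
    return sum(sizes[name] for name in needed)


whole = file_size(["page1", "page2", "page3"])
split_sum = sum(file_size([page]) for page in uses)
print(f"input file size: {whole}")
print(f"output file sizes sum: {split_sum}")
```

With these numbers the shared font is counted once in the input and three times across the split files, which accounts for the whole difference.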
That reminds me of something: is it possible to make embedded fonts accessible via the reader? This would potentially help people to understand the size of their document better. It might also be cool if we had an option to strip embedded fonts |
I've noticed weird behavior when trying to split some PDF files containing a table of contents with outlines on their first pages. The issue is that the output file containing the table of contents has the same file size as the input PDF, even though it has fewer pages after the split.
For example, splitting a 9.6 MB PDF with 199 pages with the following code snippet, I get two output files: one with 9.6 MB and 100 pages and another one with 3 MB and 99 pages.
Running similar code with pikepdf I have no issues: I get one file with 6.7 MB and 100 pages and another one with 3 MB and 99 pages.
Environment
$ python -m platform
macOS-12.5.1-x86_64-i386-64bit
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.4
Code
PDF
Unfortunately, I can't share the PDF file.