Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix merge of a cropped page #879

Closed
wants to merge 9 commits into from

Conversation

pubpub-zz
Copy link
Collaborator

tracked in #636

thee smaller box between cropBox is and trimBox(== mediaBox by default) is used

tracked in py-pdf#636

thee smaller box between cropBox is and  trimBox(== mediaBox by default) is used
@codecov
Copy link

codecov bot commented May 16, 2022

Codecov Report

Merging #879 (9cf5171) into main (759cbc3) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #879   +/-   ##
=======================================
  Coverage   92.15%   92.16%           
=======================================
  Files          24       24           
  Lines        4948     4950    +2     
  Branches     1024     1025    +1     
=======================================
+ Hits         4560     4562    +2     
  Misses        244      244           
  Partials      144      144           
Impacted Files Coverage Δ
PyPDF2/_page.py 92.65% <100.00%> (+0.02%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma ,
any issue with this PR ?

@MartinThoma
Copy link
Member

@pubpub-zz I simply didn't have the time to review it so far 😅

@pubpub-zz
Copy link
Collaborator Author

😳

PyPDF2/_page.py Outdated Show resolved Hide resolved
tests/test_page.py Outdated Show resolved Hide resolved
tests/test_page.py Outdated Show resolved Hide resolved
tests/test_page.py Outdated Show resolved Hide resolved
tests/test_page.py Outdated Show resolved Hide resolved
@MartinThoma
Copy link
Member

I've just fixed the merge conflicts I've caused. For the review, I still need to do those things:

  1. Read BUG: page.mediabox not updating correctly when merging #636 and understand the issue
  2. Inspect a couple of examples of merges where this change makes a difference

This takes some time. Getting the 2.0.0-dev branch back into main has priority to me as this unblocks other changes. Still, I'm confident that this PR will be handled in the next 4 weeks.

@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF soon PRs that are almost ready to be merged, issues that get solved pretty soon labels May 23, 2022
@MartinThoma MartinThoma changed the base branch from 2.0.0-dev to main May 26, 2022 08:13
@MartinThoma
Copy link
Member

I was reading "10.10.1 Page Boundaries":

  • The media box defines the boundaries of the physical medium on which the page is to be printed. It may include any extended area surrounding the finished page for bleed, printing marks, or other such purposes. It may also include areas close to the edges of the medium that cannot be marked because of physical limitations of the output device. Content falling outside this boundary can safely be discarded without affecting the meaning of the PDF file.
  • The crop box defines the region to which the contents of the page are to be clipped (cropped) when displayed or printed. Unlike the other boxes, the crop box has no defined meaning in terms of physical page geometry or intended use; it merely imposes clipping on the page contents. However, in the absence of additional information (such as imposition instructions specified in a JDF or PJTF job ticket), the crop box determines how the page’s contents are to be positioned on the output medium. The default value is the page’s media box.
  • The bleed box (PDF 1.3) defines the region to which the contents of the page should be clipped when output in a production environment. [...] The default value is the page’s crop box.
  • The trim box (PDF 1.3) defines the intended dimensions of the finished page after trimming. It may be smaller than the media box to allow for production-related content, such as printing instructions, cut marks, or color bars. The default value is the page’s crop box.
  • The art box (PDF 1.3) defines the extent of the page’s meaningful content (including potential white space) as intended by the page’s creator. The default value is the page’s crop box.

There is also a pretty helpful image.

@MartinThoma
Copy link
Member

This PR changes the behavior of _merge_page.

It changes which rectangle is appened to the path (re operation; TABLE 4.9 Path construction operators).

If the page that is about to be merged has a crop box that is smaller in any dimension than its trim box, the crop box will be added. Otherwise the trim box will be added.

@MartinThoma
Copy link
Member

@pubpub-zz Why don't we simply always take the cropbox?

@MartinThoma
Copy link
Member

MartinThoma commented May 26, 2022

Pdfjam seems to also use the cropbox, but the positioning is different (I have no clue what pdfjam does there; PyPDF2 pins the mediabox to the lower-left corner. PyPDF2 essentially just uses the same coordinate system to overlay):

pdfjam crazyones.pdf git-cropped.pdf --outfile merged-pdfjam.pdf --nup "1x2"

Pdftk also seems to use the cropbox, but the positioning is different from PyPDF2 and pdfjam (pinning the merged pages cropbox to the upper right corner):

pdftk git-cropped.pdf background crazyones.pdf output merged-pdftk.pdf

@pubpub-zz
Copy link
Collaborator Author

pubpub-zz commented May 26, 2022

@pubpub-zz Why don't we simply always take the cropbox?

@MartinThoma
Update : after reviewing the code I confirm my proposal : The issue comes from cases where the Cropbox is not defined but trim yes.
default values are already taken into account

@MartinThoma
Copy link
Member

The part that I'm really confused about is what if the crop box and the trim box are completely disjunct? If they don't overlap at all? What result would you expect?

It seems weird to me to switch from trim box to crop box just because one of the dimensions of the crop box is smaller. What is the logic behind that?

@MartinThoma
Copy link
Member

By the way, this is how I checked the results:

from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.generic import RectangleObject

reader = PdfReader("box.pdf")
crop_page = reader.pages[0]
print(crop_page.mediabox)
print(crop_page.cropbox)
print(crop_page.trimbox)

writer = PdfWriter()
crop_page.cropbox = RectangleObject((0, 0, 400, 400))
writer.add_page(crop_page)
with open("git-cropped.pdf", "wb") as fp:
    writer.write(fp)

reader = PdfReader("crazyones.pdf")
crazy_page = reader.pages[0]

crazy_page.merge_page(crop_page)

writer = PdfWriter()
writer.add_page(crazy_page)

with open("merged-pypdf2-crop.pdf", "wb") as fp:
    writer.write(fp)

@MartinThoma MartinThoma removed the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Jun 14, 2022
@MartinThoma MartinThoma changed the title BUG : Fix merge of a cropped page BUG: Fix merge of a cropped page Jun 25, 2022
MartinThoma added a commit that referenced this pull request Jul 3, 2022
MartinThoma added a commit that referenced this pull request Jul 3, 2022
@MartinThoma
Copy link
Member

@pubpub-zz I'm sorry that I didn't merge this PR so far. I'm simply really uncertain if this is doing the right thing. The test also currently succeeds (without your change). This makes it hard for me to judge.

Let's make an example:

Scenario 1

page

Assume the gray part is where we actually have content.

The green rectangle is the cropbox and the red one is the trimbox. What should we use for rect and why?

Scenario 2

grow-2

The green rectangle is the cropbox and the red one is the trimbox. What should we use for rect and why?

Scenario 3

grow-2

The green rectangle is the trimbox and the red one is the cropbox. What should we use for rect and why?

MartinThoma added a commit that referenced this pull request Jul 5, 2022
New Features (ENH):
-  Add PdfReader.xfa attribute (#1026)

Bug Fixes (BUG):
-  Wrong page inserted when PdfMerger.merge is done (#1063)
-  Resolve IndirectObject when it refers to a free entry (#1054)

Developer Experience (DEV):
-  Added {posargs} to tox.ini (#1055)

Maintenance (MAINT):
-  Remove PyPDF2._utils.bytes_type (#1053)

Testing (TST):
-  Scale page (indirect rect object) (#1057)
-  Simplify pathlib PdfReader test (#1056)
-  IndexError of VirtualList (#1052)
-  Invalid XML in xmp information (#1051)
-  No pycryptodome (#1050)
-  Increase test coverage (#1045)

Code Style (STY):
-  DOC of compress_content_streams (#1061)
-  Minimize diff for #879 (#1049)

Full Changelog: 2.4.1...2.4.2
@pubpub-zz pubpub-zz closed this Aug 20, 2022
@pubpub-zz pubpub-zz deleted the MergeCrop-636 branch August 20, 2022 14:19
MartinThoma added a commit that referenced this pull request Feb 11, 2023
While the old behavior can be considered a bug, people might rely on trimbox being used.
To allow them to switch back from cropbox to trimbox, they can set

    pypdf._page.MERGE_CROP_BOX = "trimbox"

See discussions in #879 and #1426
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants