BUG: Avoid isolating the graphics state multiple times #2224
The test failures appear to be unrelated to this PR. |
We have to be careful: we may be in a situation where we have two isolated sections next to each other. |
Do you have an example? Judging from the fact that PyMuPDF basically recommends the approach implemented in this PR, I would assume that this indeed is the correct way. Citing from https://pymupdf.readthedocs.io/en/latest/recipes-common-issues-and-their-solutions.html#id3:
|
Here is an example: the two q/Q sequences are at the same level and are even separated by some operations. |
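To make the case above concrete, here is a small sketch in Python. The content stream is invented for illustration (it is not the file attached to this PR), and `depth_returns_to_zero_early` is a hypothetical helper, not part of pypdf. It tokenizes naively on whitespace, so it ignores strings and inline images.

```python
# Hypothetical content stream: two sibling q ... Q blocks on the root level,
# separated by an unrelated operation (`0.5 g` sets the fill gray level).
stream = b"q 1 0 0 1 10 10 cm Q 0.5 g q 1 0 0 1 20 20 cm Q"


def depth_returns_to_zero_early(data: bytes) -> bool:
    """Return True if the q/Q nesting depth reaches zero before the last token."""
    tokens = data.split()
    depth = 0
    for index, token in enumerate(tokens):
        if token == b"q":
            depth += 1
        elif token == b"Q":
            depth -= 1
            if depth == 0 and index < len(tokens) - 1:
                # The outer pair closed, but more content follows.
                return True
    return False


# A check that only looks at the first and last token accepts this stream ...
starts_and_ends_isolated = stream.startswith(b"q") and stream.endswith(b"Q")
# ... although the graphics state is not isolated as a whole.
has_sibling_blocks = depth_returns_to_zero_early(stream)
```

In this sketch the `0.5 g` operation between the two blocks runs outside any `q ... Q` pair, so its effect would leak into whatever is appended to the page afterwards.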
This PDF file will not be considered to have an isolated graphics state at the moment due to The current check is not cheap, but rather simple to do. What you are proposing seems to be that this check should look at every byte and wrap under the following conditions:
Am I correct? While this is doable, this will have to scan the whole page content in the worst case (= correct PDF). With the current solution (without the fix from this PR), the PDF file might grow each time I add another layer, which does not seem to be a good solution either, although much easier to implement. |
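The growth mentioned above can be sketched as follows. This is only an illustration of unconditional wrapping, assuming each added layer wraps the existing content once; `wrap_unconditionally` and `max_nesting_depth` are hypothetical names and do not mirror pypdf's internals.

```python
def wrap_unconditionally(data: bytes) -> bytes:
    """Isolate the graphics state without checking whether it already is."""
    return b"q\n" + data + b"\nQ\n"


def max_nesting_depth(data: bytes) -> int:
    """Deepest q/Q nesting level, using naive whitespace tokenization."""
    depth = 0
    deepest = 0
    for token in data.split():
        if token == b"q":
            depth += 1
            deepest = max(deepest, depth)
        elif token == b"Q":
            depth -= 1
    return deepest


original = b"0 0 100 100 re f"
content = original
for _ in range(3):  # Three overlays, each wrapping the existing content again.
    content = wrap_unconditionally(content)

growth_in_bytes = len(content) - len(original)
nesting = max_nesting_depth(content)
```

Each wrap here adds only five bytes, so the size overhead is small. The nesting depth grows by one per layer, though.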
The following code snippets would correctly detect the most common variations:

from typing import Any, List, Tuple


def has_isolated_graphics_state_in_operations(operations: List[Tuple[Any, Any]]) -> bool:
    level = 0
    last_operation_index = len(operations) - 1
    if last_operation_index == -1:
        # There are no operations, thus no isolation required.
        return True
    for index, (_, operator) in enumerate(operations):
        if index == 0 and operator != b"q":
            # Not isolated at the start.
            return False
        if index == last_operation_index and operator != b"Q":
            # Not isolated at the end.
            return False
        if operator not in {b"q", b"Q"}:
            continue
        # `q` opens a new level, `Q` closes the current one.
        level += 1 if operator == b"q" else -1
        if level <= 0:
            # Returning to the root level is only fine for the final `Q`.
            # Anywhere else, a second `q ... Q` pair or further content follows.
            return index == last_operation_index and level == 0
    # The last `Q` did not close the outermost `q` (unbalanced).
    return False
import re

# `WHITESPACES` and `WHITESPACES_AS_REGEXP` are assumed to be the
# corresponding constants from `pypdf._utils`.


def has_isolated_graphics_state_in_bytes(data: bytes) -> bool:
    data = data.strip(b"".join(WHITESPACES))
    byte_count = len(data)
    if byte_count == 0:
        # Whitespace-only data is considered isolated.
        return True
    isolation_commands = re.finditer(rb"(?P<character>[qQ])" + WHITESPACES_AS_REGEXP + rb"*", data)
    # Fast path: If it does not start isolated, return directly.
    first_command = next(isolation_commands, None)
    if first_command is None:
        # We have some non-whitespace content, but no isolation call at all.
        return False
    if first_command.start("character") != 0 or first_command.group("character") != b"q":
        # The first non-whitespace character should be the isolation call.
        return False
    # We have consumed the first `q` command, thus start at level 1.
    level = 1
    for command in isolation_commands:
        character = command.group("character")
        if command.end("character") == byte_count and character == b"Q":
            # We have a `Q` command at the end. Due to having stopped for non-`q` at the
            # start, this is the expected case, as long as it closes the outermost `q`.
            return level == 1
        # `q` opens a new level, `Q` closes the current one.
        level += 1 if character == b"q" else -1
        if level <= 0:
            # We have detected a second `q ... Q` pair on the root level.
            return False
    return False
|
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2224      +/-   ##
==========================================
+ Coverage   94.37%   94.42%   +0.05%
==========================================
  Files          43       43
  Lines        7660     7604      -56
  Branches     1515     1502      -13
==========================================
- Hits         7229     7180      -49
+ Misses        267      260       -7
  Partials      164      164
|
@MartinThoma |
@pubpub-zz Thank you for your assessment of the situation 🙏 @stefan6419846 What pubpub-zz is writing makes sense to me. I am inclined to close this PR. Is there anything I am missing or should additionally consider? |
pubpub-zz usually has more insights into the PDF spec, so I am going to trust the aforementioned comments in this case.
As some context, which is outlined in #2219: I have seen this behavior, with the graphics state isolation only being done once, in PyMuPDF. From my experience, MuPDF is one of the most stable/fault-tolerant PDF libraries, so I considered this to be a good solution. There they only check the first and last character of the content stream, although I have not looked at how they handle content streams which are split up. In some of my use cases, I am using page merges with three or more overlays for one page, so this will quickly lead to quite a few nested graphics state isolation calls, which might not always be necessary.
Having said that, I am fine with closing this PR and the corresponding issue as long as there is no real need and you as the maintainer decide that it is not worth the hassle or would lead to too much unnecessary overhead.
As a side effect, implementing this PR allowed me to have a quick look at the inner workings of PDF/pypdf and to get a basic grasp of the low-level primitives of PDF, so this PR will always have a positive learning effect for me nevertheless. |
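The first/last check described in this comment could look roughly like the sketch below. It is written from the description above and is not MuPDF's or PyMuPDF's actual code; `looks_isolated` and `wrap_if_needed` are hypothetical names, and the check compares whole whitespace-separated tokens rather than single characters.

```python
def looks_isolated(data: bytes) -> bool:
    """Cheap heuristic: the first token is `q` and the last token is `Q`."""
    tokens = data.split()
    if not tokens:
        # Nothing to isolate.
        return True
    return tokens[0] == b"q" and tokens[-1] == b"Q"


def wrap_if_needed(data: bytes) -> bytes:
    """Wrap the content in `q ... Q` unless the heuristic says it already is."""
    if looks_isolated(data):
        return data
    return b"q\n" + data + b"\nQ\n"


once = wrap_if_needed(b"BT ET")
twice = wrap_if_needed(once)  # No second wrapper is added.

# Known weakness: sibling blocks on the root level also pass the heuristic.
false_positive = looks_isolated(b"q 1 0 0 1 0 0 cm Q q 0.5 g Q")
```

With repeated overlays, this keeps the nesting depth at one. The trade-off is the false positive on the last line, which is the case raised earlier in this conversation.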
Thank you for your work with this PR and thank you for taking the time to respond. I do agree with your comment: MuPDF is good and pubpub-zz has a very good understanding of the PDF specs and of pypdf. For this reason, I'm closing this PR and the issue now. If people stumble over that issue, we can re-evaluate / re-open :-) |
Fixes #2219.