Account for inline images in formatContent() #693
`formatContent()` now accounts for inline image `BI ... ID ... EI` commands in document streams.
Include the `BI` command in the regexp, and move inline image detection after string replacement to prevent false-positives.
Converting this to a draft for now. @iGrog supplied another PDF that still had the issue, and in fixing it, I'm sure there is an edge case: if a [...] The internal content of the captured [...]
Add the /s modifier so the `.` token matches newlines as well. Thanks to @iGrog for supplying another PDF that demonstrated this issue. Add the same modifier for dictionaries as well, fixing this oversight. Move the inline image replacement before string replacement. Parentheses in binary image data may be interpreted as the start of a string. Move the inline images test to its own function and add a newline to the sample data to test for the dotall modifier change.
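The effect of the dotall change can be illustrated with a small sketch. This is Python rather than the library's PHP, and the pattern below is a simplified assumption made for illustration, not the expression used in PdfParser. Python's `re.DOTALL` plays the role of the PCRE `/s` modifier here.

```python
import re

# Hypothetical, simplified pattern for an inline image block.
# It is NOT the regexp used in PdfParser; it only demonstrates the flag.
PATTERN = rb"\bBI\s(.*?)\sID\s(.*?)\sEI\b"

WITH_DOTALL = re.compile(PATTERN, re.DOTALL)  # "." also matches newlines
WITHOUT_DOTALL = re.compile(PATTERN)          # "." stops at newlines

# Sample stream: the image dictionary and the binary data both contain
# a newline, and the data contains an unbalanced "(".
stream = b"q\nBI /W 2 /H 2\n/BPC 8 ID \x00\n\xff(\x01 EI\nQ"

print(WITH_DOTALL.search(stream) is not None)     # True
print(WITHOUT_DOTALL.search(stream) is not None)  # False
```

Without the flag, the block is missed as soon as either the dictionary or the image data spans more than one line.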
I really appreciate you taking the time!
`BI` "commands" within strings should not be parsed as the beginning of inline image blocks. Detect if the `BI` we found is inside a (string) and if it is, note the offset and move past it for the next match.
In the case where a valid inline image dictionary isn't found, or if the dictionary doesn't include the required parameters Height and Width, also bump the search offset forward by the current match position so we don't fall into a loop here.
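The search loop described in the two notes above might look roughly like the following. This is a Python sketch under stated assumptions, not the PHP code from this PR: the function names, the patterns, and the flat (non-nested) string detection are all hypothetical simplifications.

```python
import re

# All names and patterns below are illustrative assumptions.
BI = re.compile(rb"\bBI\s")
IMAGE = re.compile(rb"BI\s(.*?)\sID\s(.*?)\sEI\b", re.DOTALL)
STRING = re.compile(rb"\((?:\\.|[^\\()])*\)", re.DOTALL)  # ignores nesting
HAS_W = re.compile(rb"/W(?:idth)?\s+\d+")
HAS_H = re.compile(rb"/H(?:eight)?\s+\d+")


def enclosing_string_end(stream, start, pos):
    """Return the end offset of the (string) containing pos, else None."""
    for s in STRING.finditer(stream, start):
        if s.start() > pos:
            break
        if s.start() < pos < s.end():
            return s.end()
    return None


def find_inline_images(stream):
    """Return (start, end) spans of inline images that declare W and H."""
    images, offset = [], 0
    while True:
        bi = BI.search(stream, offset)
        if bi is None:
            return images
        end = enclosing_string_end(stream, offset, bi.start())
        if end is not None:
            offset = end  # "BI" is plain text inside a (string): skip it
            continue
        img = IMAGE.match(stream, bi.start())
        if img and HAS_W.search(img.group(1)) and HAS_H.search(img.group(1)):
            images.append(img.span())
            offset = img.end()
        else:
            offset = bi.end()  # always advance so the loop cannot stall


stream = (b"BT (not BI an image) Tj ET\n"
          b"BI /W 1 /H 1 ID \x00( EI\n"
          b"BI /CS /G ID x EI\nQ")
spans = find_inline_images(stream)
```

Every branch moves `offset` forward, which is what prevents the loop mentioned above. In the sample, the `BI` inside the string and the block without Width and Height are both skipped, leaving one detected image.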
So, the last thing left here that the code wouldn't cover is a proper inline image, that doesn't have a proper image-properties dictionary with a width and height. The code in this PR then skips over it, but the potential is there for such an inline image (probably very rare if it happens at all) to contain binary content that can potentially cause errors in the way PdfParser interprets the document stream. (Like unbalanced Q/q etc.) We can:
I've no data to back it up, but I believe the second case, where [...]
Regardless, I would recommend keeping the dictionary check just in case. If it gets released and users find the array-access error again, then we can always remove it. In this case, this PR is ready to be taken out of draft status as-is.
> Regardless, I would recommend keeping the dictionary check just in case. If it gets released and users find the array-access error again, then we can always remove it. In this case, this PR is ready to be taken out of draft status as-is.
In my opinion, if something is not according to the specification, you can go yolo and do whatever you want. As you described in great detail, the only thing we can do with ill-formed PDFs is to try to make the best of it. As a user/developer I certainly appreciate it when software can handle ill-formed data to some extent. It keeps me sane. On the other hand, we are a community that maintains the library in our spare time, so there must be a balance.
That being said, your arguments make sense and I will follow your advice here @GreyWyvern. Please do the final preparations and mark the PR ready for review.
Below are just a few remarks and suggestions.
Add "Step X:" to the comments to better define what the inline image replacement code is doing. Small adjustment to the balanced parentheses regexp to also exclude open parenthesis '(' from the matching. This will ensure replacing balanced parentheses from the innermost to the outermost.
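To see why excluding `(` from the match forces innermost-first replacement, here is a minimal Python sketch. The placeholder format and the function name are hypothetical, and the pattern is an assumption that only mirrors the idea described above, not the regexp in the PR.

```python
import re

# One balanced pair with no unescaped parenthesis between "(" and ")".
# Excluding "(" from the body means only an innermost pair can match.
INNERMOST = re.compile(rb"\((?:\\.|[^\\()])*\)", re.DOTALL)


def mask_strings(stream):
    """Replace balanced (strings) with placeholders, innermost first."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return b"@@%d@@" % (len(saved) - 1)  # hypothetical placeholder

    previous = None
    while previous != stream:  # repeat until nothing is left to replace
        previous = stream
        stream = INNERMOST.sub(stash, stream)
    return stream, saved


masked, saved = mask_strings(b"(outer (inner) text) Tj")
print(masked)  # b'@@1@@ Tj'
print(saved)   # [b'(inner)', b'(outer @@0@@ text)']
```

The first pass can only match `(inner)`, because the outer pair still contains a parenthesis. Once the inner pair has been replaced, the outer pair becomes the innermost one and is matched on the next pass.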
Thank you very much @GreyWyvern
About
`formatContent()` now accounts for inline image `BI ... ID ... EI` commands in document streams. Resolves #691.