Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Account for inline images in formatContent() #693

Merged
merged 7 commits into from
May 13, 2024

Conversation

GreyWyvern
Copy link
Contributor

Type of pull request

  • Bug fix (involves code and configuration changes)

About

formatContent() now accounts for inline image BI ... ID ... EI commands in document streams. Resolves #691.

Checklist for code / configuration changes

In case you changed the code/configuration, please read each of the following checkboxes as they contain valuable information:

  • Please add at least one test case (unit test, system test, ...) to demonstrate that the change is working. If existing code was changed, your tests cover these code parts as well.
  • Please run PHP-CS-Fixer before committing, to confirm with our coding styles. See https://github.com/smalot/pdfparser/blob/master/.php-cs-fixer.php for more information about our coding styles.
  • In case you fix an existing issue, please do one of the following:
    • Write in this text something like fixes #1234 to outline that you are providing a fix for the issue #1234.

`formatContent()` now accounts for inline image `BI ... ID ... EI` commands in document streams.
Include the `BI` command in the regexp, and move inline image detection after string replacement to prevent false-positives.
@k00ni k00ni added the fix label Mar 25, 2024
Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests/PHPUnit/Integration/PDFObjectTest.php Outdated Show resolved Hide resolved
@GreyWyvern GreyWyvern marked this pull request as draft March 25, 2024 15:33
@GreyWyvern
Copy link
Contributor Author

Converting this to a draft for now. @iGrog supplied another PDF that still had the issue, and in fixing it, I'm sure there is an edge case: if a (string) contains the BI keyword, and then ID and EI can be found further on in the document, the potential is there for a large chunk of the document to be ignored. Very small chance this happens, but it's there.

The internal content of the captured BI ... ID ... EI needs to be checked to verify that it is indeed inline image content before allowing the replace. I'll work on this and update this PR when ready.

Add the /s modifier so the `.` token matches newlines as well. Thanks to @iGrog for supplying another PDF that demonstrated this issue. Add the same modifier for dictionaries as well, fixing this oversight.

Move the inline image replacement before string replacement. Parentheses in binary image data may be interpreted as the start of a string.

Move the inline images test to its own function and add a newline to the sample data to test for the dotall modifier change.
@k00ni
Copy link
Collaborator

k00ni commented Mar 25, 2024

I really appreciate you taking the time!

GreyWyvern and others added 2 commits March 25, 2024 16:10
`BI` "commands" within strings should not be parsed as the beginning of inline image blocks. Detect if the `BI` we found is inside a (string) and if it is, note the offset and move past it for the next match.
src/Smalot/PdfParser/PDFObject.php Outdated Show resolved Hide resolved
In the case where a valid inline image dictionary isn't found, or if the dictionary doesn't include the required parameters Height and Width, also bump the search offset forward by the current match position so we don't fall into a loop here.
@GreyWyvern
Copy link
Contributor Author

So, the last thing left here that the code wouldn't cover is a proper inline image, that doesn't have a proper image-properties dictionary with a width and height. The code in this PR then skips over it, but the potential is there for such an inline image (probably very rare if it happens at all) to contain binary content that can potentially cause errors in the way PdfParser interprets the document stream. (Like unbalanced Q/q etc.)

We can:

  • Just accept it as is; the document with such an inline image is malformed anyways. There should be no expectation of an error-free parsing in such a case.
  • Not check for the height and width in the dictionary at all, and just accept all BI ... ID ... EI sequences outside of strings as "valid" inline images. This allows the possibility (miniscule?) of finding false-positive inline image sequences.

I've no data to back it up, but I believe the second case, where formatContent() finds a BI ... ID ... EI sequence outside of a string, but in error, is a probably rarer than an inline image dictionary not containing a height and width. But then again I could be wrong!

Regardless, I would recommend keeping the dictionary check just in case. If it gets released and users find the array-access error again, then we can always remove it. In this case, this PR is ready to be taken out of draft status as-is.

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless, I would recommend keeping the dictionary check just in case. If it gets released and users find the array-access error again, then we can always remove it. In this case, this PR is ready to be taken out of draft status as-is.

In my opinion, if something is not according to the specification, you can go yolo and do whatever you want. As you described in high detail, the only thing we can do with ill-formed PDFs is to try to make the best of it. As a user/developer I surely appreciate if software can handle ill-formed data to some extent. It keeps me sane. On the other hand we are a community which maintains the library in our sparetime, so there must be a balance.

That being said, your arguments make sense and I will follow your advice here @GreyWyvern. Please do the final preparations and mark the PR ready for review.

In the following just a few remarks/suggestions.

src/Smalot/PdfParser/PDFObject.php Show resolved Hide resolved
src/Smalot/PdfParser/PDFObject.php Show resolved Hide resolved
src/Smalot/PdfParser/PDFObject.php Outdated Show resolved Hide resolved
Add "Step X:" to the comments to better define what the inline image replacement code is doing.

Small adjustment to the balanced parentheses regexp to also exclude open parenthesis '(' from the matching. This will ensure replacing balanced parentheses from the innermost to the outermost.
@GreyWyvern GreyWyvern marked this pull request as ready for review May 10, 2024 15:16
@k00ni k00ni merged commit a19d555 into smalot:master May 13, 2024
29 checks passed
@k00ni
Copy link
Collaborator

k00ni commented May 13, 2024

Thank you very much @GreyWyvern

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trying to access array offset on value of type null (PDFObject.php line 795)
2 participants