Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major Update to PDFObject.php + Ancillary #634

Merged
merged 37 commits into from
Nov 7, 2023
Merged

Conversation

GreyWyvern
Copy link
Contributor

@GreyWyvern GreyWyvern commented Aug 18, 2023

Okay, here it is! I fully expect this will take quite some time to review and merge and likely require many more commits before it's ready to go; please mark it as a draft if that fits better.

It fully passes all unit tests here (PHP 8.2.7), even a couple that were marked "linux only" from which I removed that criteria. Several existing unit test assertions have been altered simply because of the way the update now parses document stream data differently, and thus generates arrays of commands differently. As examples: BT commands (as well as several others) are now stored instead of discarded, and outside of (strings) and <<dictionaries>> whitespace is normalized.

However, since it parses document stream data in a completely new way, what I'm most interested in is whether or not it causes new errors in PDFs that aren't in the test suite. So I hope quite a few people decide to test drive this.


PHPObject.php

This is a major update to the PHPObject.php file. Where previously PdfParser would attempt to gather document stream data using a series of multiline regular expressions focusing on BT ... ET blocks, this update changes the behaviour of cleanContent() adds a new function formatContent() that considers the entire document stream. It takes the following steps:

  • Hide all (strings)
  • Remove all newlines and carriage-returns
  • Hide all <<dictionaries>>
  • Perform a MIME-type check for 'text/plain'1
  • Normalize all whitespace
  • Using a list of valid Operators from the PDF Reference2, add \r\n back to the remaining text so that there is a single valid PDF command on every line
  • Restore the <<dictionaries>>
  • Encode newlines and carriage returns in (strings) as \n and \r and restore them as well

By using this system, it is then much easier to examine and parse the document stream in a line-by-line manner, instead of multiline PCRE extraction. getSectionsText() has been updated to do just this, stepping through the output of cleanContent() formatContent() line-by-line and returning an array of only the relevant commands needed to position and display text.

The guts of getText() have been moved to getTextArray() to reduce code duplication. getTextArray() now takes into account both the current graphics position cm and the scaling factors of the text matrix Tm when adding \n, \t and space " " whitespace for positioning. Positioning is only taken into account at the point of inserting text, rather than whenever a Tm or Td/TD is found.

getTextArray() now also treats Q and q as a stack of stored states rather than a single stored state. Both font Tf and graphics positioning cm are stored.

The presence of ActualText BDC commands is also taken into account and the contents of the ActualText will replace the formerly output text in both content and position. This requires the new parseDictionary() method to reliably extract such commands as well as any others PdfParser may take into account in the future.

Font.php

decodeText() in Font.php now takes into account the current text matrix font size and scale when considering whether or not to define strings of text as "words" that require spaces between them.

In decodeContent() in Font.php add a check to see if the string to decode has the UTF-16BE BOM and decode it directly as Unicode if so.

Page.php

In Page.php remove the addition of a "fake" BT command as the content stream now records them.

Add a check to see if there are remaining texts to use from PDFObject::getTextArray() before proceeding in getDataTm() which prevents "undefined array key" PHP errors.

Also prevent ET commands from resetting the font Tf as PR #629 did for BT commands.

Issues affected

Resolves #219.
Resolves #353.
Resolves #398. Current v2.7.0 fixes text direction, but this PR fixes all spacing issues.
Resolves #464. Removes duplicated text by examining ActualText commands.
Resolves #474.
Resolves #508.
Resolves #528. Fixes spacing issues.
Resolves #537.
Resolves #564. Current v2.7.0 fixes text extraction, but this PR fixes spacing issues.
Resolves #568. Fixes spacing issues.
Resolves #575.
Resolves #576.
Resolves #585.
Resolves #608. Fixes headings by taking into account the graphics position cm.
Resolves #628.
Resolves #637.

Footnotes

  1. This prevents non-document-stream data from being passed to cleanContent() formatContent() such as JPEG data in file '12249.pdf' from https://github.com/smalot/pdfparser/issues/458

  2. https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A; https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A

This is a major update to the PHPObject.php file. Where previously PdfParser would attempt to gather document stream data using a series of multiline regular expressions, this update changes the behaviour of `cleanContent()` to the following:
- Hide all (strings)
- Remove all newlines and carriage-returns
- Hide all <<dictionaries>>
- Normalize all whitespace
- Using a list of valid Operators from the PDF Reference, add \r\n back to the remaining text so that there is a single PDF command on every line
- Restore the <<dictionaries>> and (strings)

By using this system, it is then much easier to examine and parse the document stream in a line-by-line manner, instead of PREG extraction. `getSections()` text has been updated to do just this, stepping through the output of `cleanContent()` line-by-line and returning an array of only the relevant commands needed to position and display text.

The guts of `getText()` have been moved to `getTextArray()` to reduce code duplication. `getTextArray()` now takes into account both the current graphics position `cm` and the scaling factors of the text matrix `Tm` when adding \n and \t whitespace for positioning. Positioning is only taken into account at the point of inserting text, rather than whenever a `Tm` or `Td`/`TD` was found.

It also treats `Q` and `q` as a stack of stored states rather than a single stored state.

The presence of `ActualText` `BDC` commands is also taken into account and the contents of the `ActualText` will replace the formerly output text in both content and position. This requires the new `parseDictionary()` method to reliably extract such commands as well as any others PdfParser may take into account in the future.

`decodeText()` in **Font.php** now takes into account the current text matrix when considering whether or not to add spaces between words. Instead of `implode()`-ing the result array with a space joiner, rely on the positioning check to add all required spacing.

In `decodeContent()` in **Font.php** add a check to see if the string to decode has the UTF-16BE BOM and decode it directly as Unicode if so.
Add a unit test for correctly decoding an emdash in Cyrillic text. Use sample PDF from issue smalot#585 User @se-ti allowed use of this file in issue smalot#586 (comment)

In `cleanContent()` once all strings and dictionaries are hidden, do a MIME-type check on the remaining content. If it doesn't register as text/plain, then return an empty string. This prevents non-document-stream data from being passed to `cleanContent()` such as JPEG data in file '12249.pdf' from smalot#458

Remove whitespace-adding code from **Font.php**. I originally added this code as a "shim" because `decodeText()` did not take into account the current Text Matrix when considering what counted as "words". Now that it does, the previous code of just `implode()`-ing with a space character works.
Modify several code comments to be clearer.

Remove the `$key => ` from `decodeText()` in **Font.php** as it's no longer needed.

Also, now that `cleanContent()` is ignoring non `text/plain` content, there should be no errant `q` or `Q` commands that cause the stored-state stack to try restoring a state that doesn't exist. Remove the kludgy code that prevented this.
Remove unnecessary `$whitespace` line.
@GreyWyvern
Copy link
Contributor Author

GreyWyvern commented Aug 18, 2023

Hrms. It seems all the Windows PHP checks are failing because the Fileinfo functions haven't been installed. They're installed by default on Linux, but the Installation manual says: 'Windows users must include the bundled php_fileinfo.dll DLL file in php.ini to enable this extension.' So Windows users have the DLL file, but I guess it's just not enabled in a "default" install?

Edit: This has been resolved by using a different method to detect whether a content stream is binary and can be safely ignored.

For "PHPUnit coverage (PHP 7.4)" it is failing an assertion where it tests to see if PDFParser's memory usage is higher than a checked value, but with my edits, the memory usage is actually lower than it expects, so that's probably actually a win? :D

https://github.com/smalot/pdfparser/blob/f3a5a3ea20f1f7d4f70f909c70a6c9c0a4a6fc63/tests/PHPUnit/Integration/ParserTest.php#L379C22-L379C22

Edit: This has been resolved by checking for a fixed numerical value for memory use above the $baselineMemory instead of a multiple of it in the ParserTest::testRetainImageContentImpact() method.

Edit: It's very possible that the Fileinfo MIME-type mb_detect_encoding() check is what's reducing the memory use since non-document-stream data is rejected before the various interpreting functions in PHPObject.php and Font.php tries to work on it. As an example, There are 4 (I think?) streams that get passed to getSectionsText() from 12249.pdf in #458 and one of them is actually binary JPEG data.

@k00ni
Copy link
Collaborator

k00ni commented Aug 21, 2023

@GreyWyvern Thank you very much for all your work. I will see how I manage to give more feedback soon, but I hope that our community has time to comment too. Surprisingly I can't mark this PR as draft.

The correct matrix elements to use for scaling the x-axis are actually the first *column*, so 'a' and 'i', not 'a' and 'b'. My bad! It worked before because almost always the x-axis scaling is equal to the y-axis scaling.
@GreyWyvern GreyWyvern marked this pull request as draft August 21, 2023 17:06
The Fileinfo functions are not installed by default on Windows, so use a different method to determine whether the stream is valid or binary.
PHP CS Fixer native_function_invocation
Make the cases a little bit more alphabetical. Remove cases/commands that aren't relevant to getting and positioning text.
@k00ni
Copy link
Collaborator

k00ni commented Aug 24, 2023

However, since it parses document stream data in a completely new way, what I'm most interested in is whether or not it causes new errors in PDFs that aren't in the test suite. So I hope quite a few people decide to test drive this.

Can you elaborate a bit about which kind of PDFs are not there yet? We could "reproduce" some missing things by using unit tests instead.

@GreyWyvern
Copy link
Contributor Author

Can you elaborate a bit about which kind of PDFs are not there yet? We could "reproduce" some missing things by using unit tests instead.

I just meant unknown PDFs "in-the-wild" in general. I run it on my own org's collection of PDFs (~400) for searching and it works fine. Almost all of them worked in v2.7.0 too, but those are PDFs all generated by one entity. There's just no telling, with a change this large, if there might be a PDF out there that works in v2.7.0 but not with this PR.

We can't really tell without people actually using it, so I think putting out a release candidate is a very good idea.

Add some documentation comment text to PDFObject.php and fix a comment typo in Font.php.

Add a test accounting for text-matrix scaling in Font::decodeText().

Add a test verifying that a string prefixed by a UTF-16BE BOM is decoded directly by Font::decodeContent().

Move "ET in font name" test from testCleanContent() to testGetSectionsText() as that is the function the test uses.

Add a test that verifies cleanContent() returns an empty string for binary content.

Remove unnecessary variable reset from ET command in Page::getDataTm. Only needed under BT.
Account for the entire font-factor (font-size multiplied by the horizontal scaling factor of the text-matrix) when estimating the width of the current text block.

Insert a fix when decoding octal strings in Font::decodeOctal(), check further ahead for escaped backslashes.

Remove tests for images in DocumentGeneratorFocusTest.php. These also fail in the current v2.7.0 release and they should be looked at in a separate PR.
GreyWyvern added a commit to GreyWyvern/pdfparser that referenced this pull request Sep 19, 2023
Octal strings can include series of backslashes of arbitrary length. If there is an odd number of backslashes, a following octal code is valid, but if there's an even number, the following octal code should not be translated.

Previously PdfParser would only account for two backslashes directly preceding an octal code. A commit from in-progress PR smalot#634 extended this to three which probably covers 99.99% of all cases. This change ups that to 100% in that there could be a string with any number of backslashes in a row, and codes will be correctly translated.

Also update decodeEntities() to use a preg_replace_callback() instead of the bulkier preg_split() + foreach loop. Make sure it matches all hexadecimal digits including a-f.

Add new tests for both of these.
@k00ni
Copy link
Collaborator

k00ni commented Sep 20, 2023

@GreyWyvern

I'll rely on your judgement @k00ni, if you think they should be marked as @internal I'll do that.

Please only mark the functions you added in this PR with @internal. If you are unsure, leave it as it is. I would like to keep our public API as small as possible for now or at least not let it expand it uncontrolled.

About the images: Just remove test-code then, if you didn't change any image related code.


I thought about the size of this PR and would like to know, if there is a reasonable way to split this PR up, so we can start to integrate it iteratively? There is no rush here, but on the other hand there is only feedback if people can try it out in the wild (using release candidates). It is up to you @GreyWyvern, its just a suggestion.

@GreyWyvern
Copy link
Contributor Author

Please only mark the functions you added in this PR with @internal. If you are unsure, leave it as it is. I would like to keep our public API as small as possible for now or at least not let it expand it uncontrolled.

Will do! These functions are not really something a regular user would use, so @internal makes enough sense.

About the images: Just remove test-code then, if you didn't change any image related code.

Removed.

I thought about the size of this PR and would like to know, if there is a reasonable way to split this PR up, so we can start to integrate it iteratively? There is no rush here, but on the other hand there is only feedback if people can try it out in the wild (using release candidates). It is up to you @GreyWyvern, its just a suggestion.

Well... the main change of function is in the two functions formatContent() and getSectionsText(). Especially with the latter, this is an all-or-nothing switch. Without the new way the document stream is formatted by these two functions, the rest of the changes (especially getTextArray()) won't really work.

The BOM-related changes in Font.php, and all the changes in Page.php (except removing lines 403-404) are really the only changes that are essentially entirely unrelated to the meat of this PR. I will revert my change to Font::decodeOctal() in prep for the other PR I've made.

I think at most I would be able to split this into two PRs, the first of which would be all the additions that don't actually change execution flow (like adding formatContent() and parseDictionary() but not calling them anywhere). And then the ones that do affect execution flow second. Personally, that just feels like unnecessary extra paperwork to me. I would rather it be merged all at once, and then we do release candidates instead of a full release.

I'm not in a hurry. I'm happy to allow you all the time you need to review :)

Revert the change to Font::decodeOctal() as it's been superceded by PR smalot#640.

Add @internal notes to formatContent() and parseDictionary().
The `@internal` tag hides the content that comes _after_ it from the documentation, so adjust these comments as appropriate. See: https://manual.phpdoc.org/HTMLSmartyConverter/HandS/phpDocumentor/tutorial_tags.internal.pkg.html
This test will succeed once PR smalot#640 is merged. It doesn't have anything to do with the current PR, so disable it for now.
@k00ni
Copy link
Collaborator

k00ni commented Sep 21, 2023

Personally, that just feels like unnecessary extra paperwork to me. I would rather it be merged all at once, and then we do release candidates instead of a full release.

Alright. If there are no further objections, we can merge it all at once.

How is your timetable for this PR?

CC @j0k3r @smalot

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the following I am proposing a few changes in the function headers. As far as I understand the doc, @internal should reflect why a function is for internal use only. It was accompanied with general information here in some cases.

src/Smalot/PdfParser/PDFObject.php Outdated Show resolved Hide resolved
src/Smalot/PdfParser/PDFObject.php Outdated Show resolved Hide resolved
src/Smalot/PdfParser/PDFObject.php Outdated Show resolved Hide resolved
src/Smalot/PdfParser/PDFObject.php Outdated Show resolved Hide resolved
src/Smalot/PdfParser/PDFObject.php Outdated Show resolved Hide resolved
src/Smalot/PdfParser/PDFObject.php Outdated Show resolved Hide resolved
Switch to tagging method for `@internal`. Adjust comments.
PHP-CS-Fixer requires spaces between `@` statements I guess.
k00ni added a commit that referenced this pull request Sep 26, 2023
* Better octal and hex-entity decode

Octal strings can include series of backslashes of arbitrary length. If there is an odd number of backslashes, a following octal code is valid, but if there's an even number, the following octal code should not be translated.

Previously PdfParser would only account for two backslashes directly preceding an octal code. A commit from in-progress PR #634 extended this to three which probably covers 99.99% of all cases. This change ups that to 100% in that there could be a string with any number of backslashes in a row, and codes will be correctly translated.

Also update decodeEntities() to use a preg_replace_callback() instead of the bulkier preg_split() + foreach loop. Make sure it matches all hexadecimal digits including a-f.

Add new tests for both of these.

* Use #2D to ensure we're capturing hex letters

* Change order of special string replacement

Move the special string replacement after the unescaping of parentheses so we don't unescape any parentheses we shouldn't.

Add more tests to make sure this is working.

* Apply suggestions from code review

Co-authored-by: Konrad Abicht <[email protected]>

---------

Co-authored-by: Konrad Abicht <[email protected]>
In some edge cases, the formatContent() method may return a document stream row containing an invalid command. Make sure we just ignore these commands instead of triggering warnings for missing $matches array elements.
Re-enable this assertion, now that we have merged smalot#640.
Copy link
Collaborator

@j0k3r j0k3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is massive rewrite and it's hard to follow up 😅

Comment on lines 211 to 213
* @internal
*/
public function formatContent(?string $content): string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you define it as private instead? It'll avoid adding the @internal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's called explicitly in PDFObjectTest.php, but I can work around that with ReflectionMethod. Initially I thought it would be useful to be able to use this method publicly to format any old document stream, but that's probably not necessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this can be applied to other @internal method?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The less public methods we have the better, because @internal is just a label and can't be enforced.

About the testing of private methods: I support the view that private functionality is not to test, because it is the objects obligation. One should only test public methods. In practice this might lead to complicated situations in which it is hard to cover a certain case sometimes.

Initially I thought it would be useful to be able to use this method publicly to format any old document stream, but that's probably not necessary.

I was thinking that a method like that should be extracted from PDFObject and moved to a utility class or something. It is useful and might be handy outside of PDFObject context. But we should finalize this one first and then see. Because it is private now, we could extract it from PDFObject and make it available later on, if we want.

Copy link
Contributor Author

@GreyWyvern GreyWyvern Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this can be applied to other @internal method?

Almost certainly. I wouldn't want to do it in this PR though. :)

About the testing of private methods: I support the view that private functionality is not to test, because it is the objects obligation. One should only test public methods. In practice this might lead to complicated situations in which it is hard to cover a certain case sometimes.

Since we can test private methods by making them accessible via Reflection (well supported for pretty much exactly this purpose since PHP 5) I don't see why we shouldn't, personally. The more targeted the tests, the easier they are to isolate and fix. More tests!!!

I was thinking that a method like that should be extracted from PDFObject and moved to a utility class or something. It is useful and might be handy outside of PDFObject context.

Definitely. For instance, I feel like my added function parseDictionary() duplicates much of the protected parsing functionality from Parser.php that breaks apart PDF 'dictionary' structures, which aren't only used in document streams, but trailer info, etc. In the future, there should probably be one global class or function that parses dictionaries (a fundamental PDF data structure) for all situations.

Make the formatContent() method private to PDFObject so that `@internal` isn't required. Adjust the unit tests with `ReflectionMethod` to account for this.
@k00ni
Copy link
Collaborator

k00ni commented Nov 1, 2023

@GreyWyvern A quick follow up: Are you planning any updates on this one in the near future? I won't have that much time in the next weeks/months most likely. I remember your suggestion to "throw further PDFs" at the code.

I suggest the following next steps:

  • You finalize your work
  • You mark the PR as ready to review
  • I will merge it (if there are no objections)
  • I will create a release candidate, so our community can start to test it

This way your work gets out to more people and we can observe, if there are any remaining bugs in your code. I can help organize further steps regarding releases or issues.

WDYT?

@GreyWyvern
Copy link
Contributor Author

WDYT?

I think this is a good plan. I have actually found a couple more PDFs that PdfParser has trouble with since I last visited, but related to embedded images rather than text-extraction.

Putting out a release candidate will certainly help get the new code more testing. The most important thing to find are PDFs that v2.7.0 parses "correctly" while this new version does not. That's where the most tweaking information will come from. :)

I will mark this PR as Ready for review.

@GreyWyvern GreyWyvern marked this pull request as ready for review November 1, 2023 14:26
@k00ni
Copy link
Collaborator

k00ni commented Nov 1, 2023

@j0k3r any objections? I would merge it right away and proceed as suggested.

@j0k3r
Copy link
Collaborator

j0k3r commented Nov 2, 2023

@k00ni This is good to me

@k00ni k00ni merged commit feaf39e into smalot:master Nov 7, 2023
29 checks passed
@k00ni
Copy link
Collaborator

k00ni commented Nov 7, 2023

Big thanks to @GreyWyvern and all commentators.

@k00ni k00ni mentioned this pull request Nov 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment