Major Update to PDFObject.php + Ancillary #634

GreyWyvern · 2023-08-18T16:56:48Z

Okay, here it is! I fully expect this will take quite some time to review and merge and likely require many more commits before it's ready to go; please mark it as a draft if that fits better.

It fully passes all unit tests here (PHP 8.2.7), even a couple that were marked "linux only", from which I removed that criterion. Several existing unit test assertions have been altered simply because the update now parses document stream data differently, and thus generates arrays of commands differently. As examples: BT commands (as well as several others) are now stored instead of discarded, and outside of (strings) and <<dictionaries>>, whitespace is normalized.

However, since it parses document stream data in a completely new way, what I'm most interested in is whether or not it causes new errors in PDFs that aren't in the test suite. So I hope quite a few people decide to test drive this.

PDFObject.php

This is a major update to the PDFObject.php file. Where previously PdfParser would attempt to gather document stream data using a series of multiline regular expressions focusing on BT ... ET blocks, this update ~~changes the behaviour of cleanContent()~~ adds a new function formatContent() that considers the entire document stream. It takes the following steps:

Hide all (strings)
Remove all newlines and carriage-returns
Hide all <<dictionaries>>
Perform a MIME-type check for 'text/plain'¹
Normalize all whitespace
Using a list of valid Operators from the PDF Reference², add \r\n back to the remaining text so that there is a single valid PDF command on every line
Restore the <<dictionaries>>
Encode newlines and carriage returns in (strings) as \n and \r and restore them as well
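
The steps above can be sketched roughly as follows. This is a simplified illustration, not the PR's formatContent(): the names `hideMatches()` and `formatContentSketch()` are hypothetical, the binary/MIME check is left out, only a handful of operators are recognised, and it assumes strings have no nested parentheses and dictionaries are not nested.

```php
<?php
// Hypothetical helper: swap every match for a placeholder and remember the original.
function hideMatches(string $pattern, string $text, array &$store, string $tag): string
{
    return preg_replace_callback($pattern, static function (array $m) use (&$store, $tag): string {
        $key = '@@' . $tag . count($store) . '@@';
        $store[$key] = $m[0];

        return $key;
    }, $text);
}

function formatContentSketch(string $content): string
{
    $strings = [];
    $dicts = [];

    // Hide (strings) so their contents cannot be mistaken for operators.
    $content = hideMatches('/\((?:[^()\\\\]|\\\\.)*\)/s', $content, $strings, 'S');
    // Remove newlines and carriage returns (replaced by spaces so tokens stay separated).
    $content = str_replace(["\r", "\n"], ' ', $content);
    // Hide <<dictionaries>> (assumed not to be nested).
    $content = hideMatches('/<<.*?>>/s', $content, $dicts, 'D');
    // Normalize all whitespace.
    $content = trim(preg_replace('/\s+/', ' ', $content));
    // Add \r\n after each known operator so there is one command per line.
    $ops = 'BT|ET|Tf|Td|TD|Tm|Tj|TJ|cm|q|Q|BDC|EMC';
    $content = rtrim(preg_replace('/(?<![^ ])(' . $ops . ')(?: |$)/', "$1\r\n", $content));
    // Restore the dictionaries first, since they may contain string placeholders.
    $content = strtr($content, $dicts);
    // Encode real newlines inside strings as \n and \r, then restore the strings.
    $strings = array_map(
        static fn (string $s): string => str_replace(["\r", "\n"], ['\r', '\n'], $s),
        $strings
    );

    return strtr($content, $strings);
}

$formatted = formatContentSketch("BT\n/F1 12 Tf\n(Hello\nWorld) Tj ET");
// $formatted now holds four lines: BT, /F1 12 Tf, (Hello\nWorld) Tj, ET
```

Hiding strings before touching whitespace is what keeps a newline inside a (string) from being treated as a command separator.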

By using this system, it is then much easier to examine and parse the document stream in a line-by-line manner, instead of multiline PCRE extraction. getSectionsText() has been updated to do just this, stepping through the output of ~~cleanContent()~~ formatContent() line-by-line and returning an array of only the relevant commands needed to position and display text.
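
As a rough illustration of the line-by-line approach (not the actual getSectionsText() code), a stream that already has one command per line can be reduced to the text-related commands like this. The function name, the `operator`/`operands` array keys and the list of kept operators are assumptions made for the example.

```php
<?php
// Hypothetical sketch: keep only commands needed to position and display text.
function relevantCommandsSketch(string $formatted): array
{
    $keep = ['BT', 'ET', 'Tf', 'Td', 'TD', 'Tm', 'Tj', 'TJ', 'cm', 'q', 'Q'];
    $commands = [];

    foreach (explode("\r\n", $formatted) as $line) {
        // The operator is the last token on the line; any operands precede it.
        if (!preg_match('/^(?:(.*) )?([A-Za-z\'"*]+)$/s', $line, $m)) {
            continue; // row does not look like "operands operator": ignore it
        }
        if (in_array($m[2], $keep, true)) {
            $commands[] = ['operator' => $m[2], 'operands' => $m[1]];
        }
    }

    return $commands;
}

$commands = relevantCommandsSketch("BT\r\n/F1 12 Tf\r\n0 0 0 rg\r\n(Hello) Tj\r\n???\r\nET");
// Keeps BT, Tf, Tj and ET; drops the colour command "rg" and the invalid row "???".
```

Because each row is matched on its own, a malformed row is simply skipped rather than disturbing the commands around it.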

The guts of getText() have been moved to getTextArray() to reduce code duplication. getTextArray() now takes into account both the current graphics position cm and the scaling factors of the text matrix Tm when adding \n, \t and space " " whitespace for positioning. Positioning is only taken into account at the point of inserting text, rather than whenever a Tm or Td/TD is found.

getTextArray() now also treats Q and q as a stack of stored states rather than a single stored state. Both font Tf and graphics positioning cm are stored.
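
A minimal sketch of the stack idea, assuming commands arrive as `[operator, operands]` pairs. `fontsAtTextSketch()` is a hypothetical name, and `cm` is simply overwritten here rather than combined with the previous matrix, which is a simplification.

```php
<?php
// Hypothetical helper: returns the font operands in effect at each Tj.
function fontsAtTextSketch(array $commands): array
{
    $state = ['Tf' => null, 'cm' => null]; // current font and graphics position
    $stack = [];                            // saved states, innermost last
    $seen = [];

    foreach ($commands as [$operator, $operands]) {
        switch ($operator) {
            case 'q': // save a copy of the current state
                $stack[] = $state;
                break;
            case 'Q': // restore the most recently saved state, if there is one
                if ([] !== $stack) {
                    $state = array_pop($stack);
                }
                break;
            case 'Tf':
            case 'cm': // simplification: overwrite instead of combining matrices
                $state[$operator] = $operands;
                break;
            case 'Tj':
                $seen[] = $state['Tf'];
                break;
        }
    }

    return $seen;
}

$fonts = fontsAtTextSketch([
    ['Tf', '/F1 12'], ['q', ''],
    ['Tf', '/F2 8'], ['q', ''],
    ['Tf', '/F3 9'], ['Tj', '(a)'], ['Q', ''],
    ['Tj', '(b)'], ['Q', ''],
    ['Tj', '(c)'],
]);
// $fonts is ['/F3 9', '/F2 8', '/F1 12']
```

With a single stored state, the second `Q` would have nothing correct to restore and the last `Tj` would not get `/F1 12` back.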

The presence of ActualText BDC commands is also taken into account and the contents of the ActualText will replace the formerly output text in both content and position. This requires the new parseDictionary() method to reliably extract such commands as well as any others PdfParser may take into account in the future.

Font.php

decodeText() in Font.php now takes into account the current ~~text matrix~~ font size and scale when considering whether or not to define strings of text as "words" that require spaces between them.

In decodeContent() in Font.php add a check to see if the string to decode has the UTF-16BE BOM and decode it directly as Unicode if so.
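
The BOM check can be illustrated like this. It is a sketch that assumes the mbstring extension is available; `decodeIfUtf16Be()` is a hypothetical name, not the code in Font.php.

```php
<?php
// Hypothetical sketch: decode directly as Unicode when the UTF-16BE BOM is present.
function decodeIfUtf16Be(string $text): string
{
    // The UTF-16BE byte order mark is the two bytes 0xFE 0xFF.
    if (0 === strncmp($text, "\xFE\xFF", 2)) {
        return mb_convert_encoding(substr($text, 2), 'UTF-8', 'UTF-16BE');
    }

    return $text; // no BOM: leave the string for the normal decoding path
}

// BOM, then U+041F (Cyrillic capital Pe), then U+2014 (em dash).
$decoded = decodeIfUtf16Be("\xFE\xFF\x04\x1F\x20\x14");
```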

Page.php

In Page.php remove the addition of a "fake" BT command as the content stream now records them.

Add a check to see if there are remaining texts to use from PDFObject::getTextArray() before proceeding in getDataTm() which prevents "undefined array key" PHP errors.

Also prevent ET commands from resetting the font Tf as PR #629 did for BT commands.

Issues affected

Resolves #219.
Resolves #353.
Resolves #398. Current v2.7.0 fixes text direction, but this PR fixes all spacing issues.
Resolves #464. Removes duplicated text by examining ActualText commands.
Resolves #474.
Resolves #508.
Resolves #528. Fixes spacing issues.
Resolves #537.
Resolves #564. Current v2.7.0 fixes text extraction, but this PR fixes spacing issues.
Resolves #568. Fixes spacing issues.
Resolves #575.
Resolves #576.
Resolves #585.
Resolves #608. Fixes headings by taking into account the graphics position cm.
Resolves #628.
Resolves #637.

¹ This prevents non-document-stream data from being passed to ~~cleanContent()~~ formatContent() such as JPEG data in file '12249.pdf' from https://github.com/smalot/pdfparser/issues/458
² https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A; https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A

This is a major update to the PDFObject.php file. Where previously PdfParser would attempt to gather document stream data using a series of multiline regular expressions, this update changes the behaviour of `cleanContent()` to the following:

- Hide all (strings)
- Remove all newlines and carriage-returns
- Hide all <<dictionaries>>
- Normalize all whitespace
- Using a list of valid Operators from the PDF Reference, add \r\n back to the remaining text so that there is a single PDF command on every line
- Restore the <<dictionaries>> and (strings)

By using this system, it is then much easier to examine and parse the document stream in a line-by-line manner, instead of PREG extraction. `getSectionsText()` has been updated to do just this, stepping through the output of `cleanContent()` line-by-line and returning an array of only the relevant commands needed to position and display text.

The guts of `getText()` have been moved to `getTextArray()` to reduce code duplication. `getTextArray()` now takes into account both the current graphics position `cm` and the scaling factors of the text matrix `Tm` when adding \n and \t whitespace for positioning. Positioning is only taken into account at the point of inserting text, rather than whenever a `Tm` or `Td`/`TD` was found. It also treats `Q` and `q` as a stack of stored states rather than a single stored state.

The presence of `ActualText` `BDC` commands is also taken into account and the contents of the `ActualText` will replace the formerly output text in both content and position. This requires the new `parseDictionary()` method to reliably extract such commands as well as any others PdfParser may take into account in the future.

`decodeText()` in **Font.php** now takes into account the current text matrix when considering whether or not to add spaces between words. Instead of `implode()`-ing the result array with a space joiner, rely on the positioning check to add all required spacing.

In `decodeContent()` in **Font.php** add a check to see if the string to decode has the UTF-16BE BOM and decode it directly as Unicode if so.

@se-ti

Add a unit test for correctly decoding an emdash in Cyrillic text. Use sample PDF from issue smalot#585. User @se-ti allowed use of this file in issue smalot#586 (comment).

In `cleanContent()`, once all strings and dictionaries are hidden, do a MIME-type check on the remaining content. If it doesn't register as text/plain, then return an empty string. This prevents non-document-stream data from being passed to `cleanContent()` such as JPEG data in file '12249.pdf' from smalot#458.

Remove whitespace-adding code from **Font.php**. I originally added this code as a "shim" because `decodeText()` did not take into account the current Text Matrix when considering what counted as "words". Now that it does, the previous code of just `implode()`-ing with a space character works.

Modify several code comments to be clearer. Remove the `$key => ` from `decodeText()` in **Font.php** as it's no longer needed. Also, now that `cleanContent()` is ignoring non `text/plain` content, there should be no errant `q` or `Q` commands that cause the stored-state stack to try restoring a state that doesn't exist. Remove the kludgy code that prevented this.

Remove unnecessary `$whitespace` line.

GreyWyvern · 2023-08-18T17:21:49Z

Hrms. It seems all the Windows PHP checks are failing because the Fileinfo functions haven't been installed. They're installed by default on Linux, but the Installation manual says: 'Windows users must include the bundled php_fileinfo.dll DLL file in php.ini to enable this extension.' So Windows users have the DLL file, but I guess it's just not enabled in a "default" install?

Edit: This has been resolved by using a different method to detect whether a content stream is binary and can be safely ignored.

For "PHPUnit coverage (PHP 7.4)" it is failing an assertion where it tests to see if PDFParser's memory usage is higher than a checked value, but with my edits, the memory usage is actually lower than it expects, so that's probably actually a win? :D

~~https://github.com/smalot/pdfparser/blob/f3a5a3ea20f1f7d4f70f909c70a6c9c0a4a6fc63/tests/PHPUnit/Integration/ParserTest.php#L379C22-L379C22~~

Edit: This has been resolved by checking for a fixed numerical value for memory use above the $baselineMemory instead of a multiple of it in the ParserTest::testRetainImageContentImpact() method.

Edit: It's very possible that the ~~Fileinfo MIME-type~~ mb_detect_encoding() check is what's reducing the memory use, since non-document-stream data is rejected before the various interpreting functions in PDFObject.php and Font.php try to work on it. As an example, there are 4 (I think?) streams that get passed to getSectionsText() from 12249.pdf in #458 and one of them is actually binary JPEG data.
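
One way such a check could look, assuming the mbstring extension is available. The exact arguments and surrounding logic used in the PR are not shown in this thread, so both the call below and the name `looksLikeBinaryStream()` are assumptions for illustration only.

```php
<?php
// Hypothetical sketch: treat a stream as binary when it is not valid UTF-8.
function looksLikeBinaryStream(string $content): bool
{
    // In strict mode mb_detect_encoding() returns false when the string is
    // not valid in any of the candidate encodings (only UTF-8 here).
    return false === mb_detect_encoding($content, 'UTF-8', true);
}

$jpegHeader = "\xFF\xD8\xFF\xE0\x00\x10JFIF"; // first bytes of a typical JPEG file
$isBinary = looksLikeBinaryStream($jpegHeader);       // true
$isText = !looksLikeBinaryStream('BT (Hello) Tj ET'); // true
```

Unlike the Fileinfo functions, mbstring does not depend on php_fileinfo.dll being enabled, though it is still an extension that has to be present.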

src/Smalot/PdfParser/Font.php

src/Smalot/PdfParser/PDFObject.php

k00ni · 2023-08-21T06:10:13Z

@GreyWyvern Thank you very much for all your work. I will see how I manage to give more feedback soon, but I hope that our community has time to comment too. Surprisingly I can't mark this PR as draft.

The correct matrix elements to use for scaling the x-axis are actually the first *column*, so 'a' and 'i', not 'a' and 'b'. My bad! It worked before because almost always the x-axis scaling is equal to the y-axis scaling.

The Fileinfo functions are not installed by default on Windows, so use a different method to determine whether the stream is valid or binary.

PHP CS Fixer native_function_invocation

Make the cases a little bit more alphabetical. Remove cases/commands that aren't relevant to getting and positioning text.

k00ni · 2023-08-24T05:19:00Z

> However, since it parses document stream data in a completely new way, what I'm most interested in is whether or not it causes new errors in PDFs that aren't in the test suite. So I hope quite a few people decide to test drive this.

Can you elaborate a bit about which kind of PDFs are not there yet? We could "reproduce" some missing things by using unit tests instead.

GreyWyvern · 2023-08-24T13:58:41Z

> Can you elaborate a bit about which kind of PDFs are not there yet? We could "reproduce" some missing things by using unit tests instead.

I just meant unknown PDFs "in-the-wild" in general. I run it on my own org's collection of PDFs (~400) for searching and it works fine. Almost all of them worked in v2.7.0 too, but those are PDFs all generated by one entity. There's just no telling, with a change this large, if there might be a PDF out there that works in v2.7.0 but not with this PR.

We can't really tell without people actually using it, so I think putting out a release candidate is a very good idea.

Add some documentation comment text to PDFObject.php and fix a comment typo in Font.php. Add a test accounting for text-matrix scaling in Font::decodeText(). Add a test verifying that a string prefixed by a UTF-16BE BOM is decoded directly by Font::decodeContent(). Move "ET in font name" test from testCleanContent() to testGetSectionsText() as that is the function the test uses. Add a test that verifies cleanContent() returns an empty string for binary content. Remove unnecessary variable reset from ET command in Page::getDataTm. Only needed under BT.

Account for the entire font-factor (font-size multiplied by the horizontal scaling factor of the text-matrix) when estimating the width of the current text block. Insert a fix when decoding octal strings in Font::decodeOctal(), check further ahead for escaped backslashes. Remove tests for images in DocumentGeneratorFocusTest.php. These also fail in the current v2.7.0 release and they should be looked at in a separate PR.

Octal strings can include series of backslashes of arbitrary length. If there is an odd number of backslashes, a following octal code is valid, but if there's an even number, the following octal code should not be translated. Previously PdfParser would only account for two backslashes directly preceding an octal code. A commit from in-progress PR smalot#634 extended this to three which probably covers 99.99% of all cases. This change ups that to 100% in that there could be a string with any number of backslashes in a row, and codes will be correctly translated. Also update decodeEntities() to use a preg_replace_callback() instead of the bulkier preg_split() + foreach loop. Make sure it matches all hexadecimal digits including a-f. Add new tests for both of these.
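
The odd/even backslash rule and the hex-entity callback can be sketched as below. These are illustrations of the idea under stated assumptions, not the code in Font.php: the function names are hypothetical, and the sketch leaves the escaped backslash pairs themselves untouched for a later unescaping step.

```php
<?php
// Hypothetical sketch: translate an octal code only after an odd run of backslashes.
function decodeOctalSketch(string $text): string
{
    return preg_replace_callback(
        '/(\\\\+)([0-7]{1,3})/', // a run of backslashes followed by 1-3 octal digits
        static function (array $m): string {
            if (0 === strlen($m[1]) % 2) {
                return $m[0]; // even run: every backslash is escaped, digits are literal
            }
            // Odd run: the last backslash introduces the octal code.
            return substr($m[1], 1) . chr(octdec($m[2]) % 256);
        },
        $text
    );
}

// Hypothetical sketch: decode #xx entities, accepting digits 0-9 and letters a-f / A-F.
function decodeEntitiesSketch(string $text): string
{
    return preg_replace_callback(
        '/#([0-9a-fA-F]{2})/',
        static fn (array $m): string => chr(hexdec($m[1])),
        $text
    );
}

$one = decodeOctalSketch('\\101');         // one backslash: becomes "A"
$two = decodeOctalSketch('\\\\101');       // two backslashes: left unchanged
$three = decodeOctalSketch('\\\\\\101');   // three: two backslashes, then "A"
$name = decodeEntitiesSketch('Foo#2DBar'); // "Foo-Bar"
```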

k00ni · 2023-09-20T06:09:07Z

@GreyWyvern

> I'll rely on your judgement @k00ni, if you think they should be marked as @internal I'll do that.

Please only mark the functions you added in this PR with @internal. If you are unsure, leave it as it is. I would like to keep our public API as small as possible for now, or at least not let it expand uncontrolled.

About the images: Just remove test-code then, if you didn't change any image related code.

I thought about the size of this PR and would like to know if there is a reasonable way to split this PR up, so we can start to integrate it iteratively. There is no rush here, but on the other hand there is only feedback if people can try it out in the wild (using release candidates). It is up to you @GreyWyvern, it's just a suggestion.

GreyWyvern · 2023-09-20T14:11:42Z

> Please only mark the functions you added in this PR with @internal. If you are unsure, leave it as it is. I would like to keep our public API as small as possible for now, or at least not let it expand uncontrolled.

Will do! These functions are not really something a regular user would use, so @internal makes enough sense.

> About the images: Just remove test-code then, if you didn't change any image related code.

Removed.

> I thought about the size of this PR and would like to know if there is a reasonable way to split this PR up, so we can start to integrate it iteratively. There is no rush here, but on the other hand there is only feedback if people can try it out in the wild (using release candidates). It is up to you @GreyWyvern, it's just a suggestion.

Well... the main change of function is in the two functions formatContent() and getSectionsText(). Especially with the latter, this is an all-or-nothing switch. Without the new way the document stream is formatted by these two functions, the rest of the changes (especially getTextArray()) won't really work.

The BOM-related changes in Font.php, and all the changes in Page.php (except removing lines 403-404) are really the only changes that are essentially entirely unrelated to the meat of this PR. I will revert my change to Font::decodeOctal() in prep for the other PR I've made.

I think at most I would be able to split this into two PRs, the first of which would be all the additions that don't actually change execution flow (like adding formatContent() and parseDictionary() but not calling them anywhere). And then the ones that do affect execution flow second. Personally, that just feels like unnecessary extra paperwork to me. I would rather it be merged all at once, and then we do release candidates instead of a full release.

I'm not in a hurry. I'm happy to allow you all the time you need to review :)

Revert the change to Font::decodeOctal() as it's been superseded by PR smalot#640. Add @internal notes to formatContent() and parseDictionary().

The `@internal` tag hides the content that comes _after_ it from the documentation, so adjust these comments as appropriate. See: https://manual.phpdoc.org/HTMLSmartyConverter/HandS/phpDocumentor/tutorial_tags.internal.pkg.html

This test will succeed once PR smalot#640 is merged. It doesn't have anything to do with the current PR, so disable it for now.

k00ni · 2023-09-21T06:19:15Z

> Personally, that just feels like unnecessary extra paperwork to me. I would rather it be merged all at once, and then we do release candidates instead of a full release.

Alright. If there are no further objections, we can merge it all at once.

How is your timetable for this PR?

CC @j0k3r @smalot

k00ni

In the following I am proposing a few changes in the function headers. As far as I understand the doc, @internal should reflect why a function is for internal use only. It was accompanied with general information here in some cases.

src/Smalot/PdfParser/PDFObject.php

Switch to tagging method for `@internal`. Adjust comments.

PHP-CS-Fixer requires spaces between `@` statements I guess.

* Better octal and hex-entity decode

  Octal strings can include series of backslashes of arbitrary length. If there is an odd number of backslashes, a following octal code is valid, but if there's an even number, the following octal code should not be translated. Previously PdfParser would only account for two backslashes directly preceding an octal code. A commit from in-progress PR #634 extended this to three which probably covers 99.99% of all cases. This change ups that to 100% in that there could be a string with any number of backslashes in a row, and codes will be correctly translated.

  Also update decodeEntities() to use a preg_replace_callback() instead of the bulkier preg_split() + foreach loop. Make sure it matches all hexadecimal digits including a-f. Add new tests for both of these.

* Use #2D to ensure we're capturing hex letters

* Change order of special string replacement

  Move the special string replacement after the unescaping of parentheses so we don't unescape any parentheses we shouldn't. Add more tests to make sure this is working.

* Apply suggestions from code review

  Co-authored-by: Konrad Abicht <[email protected]>

---------

Co-authored-by: Konrad Abicht <[email protected]>

In some edge cases, the formatContent() method may return a document stream row containing an invalid command. Make sure we just ignore these commands instead of triggering warnings for missing $matches array elements.

Re-enable this assertion, now that we have merged smalot#640.

j0k3r

This is a massive rewrite and it's hard to follow 😅

j0k3r · 2023-09-27T17:16:52Z

src/Smalot/PdfParser/PDFObject.php

+     * @internal
+     */
+    public function formatContent(?string $content): string


Why don't you define it as private instead? It'll avoid adding the @internal

Well, it's called explicitly in PDFObjectTest.php, but I can work around that with ReflectionMethod. Initially I thought it would be useful to be able to use this method publicly to format any old document stream, but that's probably not necessary.

Maybe this can be applied to other @internal method?

The less public methods we have the better, because @internal is just a label and can't be enforced.

About the testing of private methods: I support the view that private functionality should not be tested directly, because it is the object's obligation. One should only test public methods. In practice this can lead to situations in which it is hard to cover a certain case.

> Initially I thought it would be useful to be able to use this method publicly to format any old document stream, but that's probably not necessary.

I was thinking that a method like that should be extracted from PDFObject and moved to a utility class or something. It is useful and might be handy outside of PDFObject context. But we should finalize this one first and then see. Because it is private now, we could extract it from PDFObject and make it available later on, if we want.

> Maybe this can be applied to other @internal method?

Almost certainly. I wouldn't want to do it in this PR though. :)

> About the testing of private methods: I support the view that private functionality should not be tested directly, because it is the object's obligation. One should only test public methods. In practice this can lead to situations in which it is hard to cover a certain case.

Since we can test private methods by making them accessible via Reflection (well supported for pretty much exactly this purpose since PHP 5) I don't see why we shouldn't, personally. The more targeted the tests, the easier they are to isolate and fix. More tests!!!

> I was thinking that a method like that should be extracted from PDFObject and moved to a utility class or something. It is useful and might be handy outside of PDFObject context.

Definitely. For instance, I feel like my added function parseDictionary() duplicates much of the protected parsing functionality from Parser.php that breaks apart PDF 'dictionary' structures, which aren't only used in document streams, but trailer info, etc. In the future, there should probably be one global class or function that parses dictionaries (a fundamental PDF data structure) for all situations.

Make the formatContent() method private to PDFObject so that `@internal` isn't required. Adjust the unit tests with `ReflectionMethod` to account for this.
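
For readers unfamiliar with the technique, calling a private method from a test via `ReflectionMethod` looks roughly like this. `StreamFormatterStub` and `callPrivate()` are hypothetical stand-ins so the example is self-contained; they are not the PDFObject class or the PR's test code.

```php
<?php
// Hypothetical stand-in for a class with a private method under test.
final class StreamFormatterStub
{
    private function formatContent(?string $content): string
    {
        return trim((string) $content);
    }
}

// Hypothetical test helper: invoke a private or protected method by name.
function callPrivate(object $object, string $method, ...$args)
{
    $reflection = new ReflectionMethod($object, $method);
    if (PHP_VERSION_ID < 80100) {
        // Needed before PHP 8.1; later versions allow invoke() without it.
        $reflection->setAccessible(true);
    }

    return $reflection->invoke($object, ...$args);
}

$result = callPrivate(new StreamFormatterStub(), 'formatContent', "  BT ET \n");
// $result is 'BT ET'
```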

k00ni · 2023-11-01T07:27:19Z

@GreyWyvern A quick follow up: Are you planning any updates on this one in the near future? I won't have that much time in the next weeks/months most likely. I remember your suggestion to "throw further PDFs" at the code.

I suggest the following next steps:

You finalize your work
You mark the PR as ready to review
I will merge it (if there are no objections)
I will create a release candidate, so our community can start to test it

This way your work gets out to more people and we can observe, if there are any remaining bugs in your code. I can help organize further steps regarding releases or issues.

WDYT?

GreyWyvern · 2023-11-01T14:26:27Z

> WDYT?

I think this is a good plan. I have actually found a couple more PDFs that PdfParser has trouble with since I last visited, but related to embedded images rather than text-extraction.

Putting out a release candidate will certainly help get the new code more testing. The most important things to find are PDFs that v2.7.0 parses "correctly" while this new version does not. That's where the most tweaking information will come from. :)

I will mark this PR as Ready for review.

k00ni · 2023-11-01T14:34:07Z

@j0k3r any objections? I would merge it right away and proceed as suggested.

j0k3r · 2023-11-02T09:10:31Z

@k00ni This is good to me

k00ni · 2023-11-07T07:04:13Z

Big thanks to @GreyWyvern and all commentators.

GreyWyvern added 5 commits August 14, 2023 15:01

Update Font.php

bced73a

Remove unnecessary `$whitespace` line.

PHP-cs-fixer edits

7e78a24