getText() result contains annacceptable unicode chars #585

se-ti · 2023-03-18T08:49:29Z

PHP Version: 7.3.33
PDFParser Version: 2.3.0

Description:

text, returned by getText() is contaminated with unicode control characters (0x97 hex, 151 dec -- https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C1_controls )

PDF input

page92.pdf

Expected

actual: "слева по ходу � веревка"
expected: "слева по ходу — веревка"

Same text with correct em-dash symbol.

Code

use Smalot\PdfParser\Parser;
include 'alt_autoload.php-dist';

$path = 'page92.pdf';
$content = file_get_contents($path);

$config = new \Smalot\PdfParser\Config();
$parser = new Parser([], $config);
$pdf = $parser->parseContent($content);
$text = $pdf->getText();
$sub = substr($text, 1658, 43);              // just to extract essential

$json = json_encode($sub);
// json === false, errorcode 5 : Malformed UTF-8 characters, possibly incorrectly encoded

se-ti · 2023-03-18T08:53:26Z

You can see this em-dash in the 4-th line of a table.

@se-ti

Add a unit test for correctly decoding an emdash in Cyrillic text. Use sample PDF from issue smalot#585 User @se-ti allowed use of this file in issue smalot#586 (comment) In `cleanContent()` once all strings and dictionaries are hidden, do a MIME-type check on the remaining content. If it doesn't register as text/plain, then return an empty string. This prevents non-document-stream data from being passed to `cleanContent()` such as JPEG data in file '12249.pdf' from smalot#458 Remove whitespace-adding code from **Font.php**. I originally added this code as a "shim" because `decodeText()` did not take into account the current Text Matrix when considering what counted as "words". Now that it does, the previous code of just `implode()`-ing with a space character works.

@se-ti

* Rewrite PDFObject.php + ancillary This is a major update to the PHPObject.php file. Where previously PdfParser would attempt to gather document stream data using a series of multiline regular expressions, this update changes the behaviour of `cleanContent()` to the following: - Hide all (strings) - Remove all newlines and carriage-returns - Hide all <<dictionaries>> - Normalize all whitespace - Using a list of valid Operators from the PDF Reference, add \r\n back to the remaining text so that there is a single PDF command on every line - Restore the <<dictionaries>> and (strings) By using this system, it is then much easier to examine and parse the document stream in a line-by-line manner, instead of PREG extraction. `getSections()` text has been updated to do just this, stepping through the output of `cleanContent()` line-by-line and returning an array of only the relevant commands needed to position and display text. The guts of `getText()` have been moved to `getTextArray()` to reduce code duplication. `getTextArray()` now takes into account both the current graphics position `cm` and the scaling factors of the text matrix `Tm` when adding \n and \t whitespace for positioning. Positioning is only taken into account at the point of inserting text, rather than whenever a `Tm` or `Td`/`TD` was found. It also treats `Q` and `q` as a stack of stored states rather than a single stored state. The presence of `ActualText` `BDC` commands is also taken into account and the contents of the `ActualText` will replace the formerly output text in both content and position. This requires the new `parseDictionary()` method to reliably extract such commands as well as any others PdfParser may take into account in the future. `decodeText()` in **Font.php** now takes into account the current text matrix when considering whether or not to add spaces between words. Instead of `implode()`-ing the result array with a space joiner, rely on the positioning check to add all required spacing. In `decodeContent()` in **Font.php** add a check to see if the string to decode has the UTF-16BE BOM and decode it directly as Unicode if so. * Add test, check for text/plain Add a unit test for correctly decoding an emdash in Cyrillic text. Use sample PDF from issue #585 User @se-ti allowed use of this file in issue #586 (comment) In `cleanContent()` once all strings and dictionaries are hidden, do a MIME-type check on the remaining content. If it doesn't register as text/plain, then return an empty string. This prevents non-document-stream data from being passed to `cleanContent()` such as JPEG data in file '12249.pdf' from #458 Remove whitespace-adding code from **Font.php**. I originally added this code as a "shim" because `decodeText()` did not take into account the current Text Matrix when considering what counted as "words". Now that it does, the previous code of just `implode()`-ing with a space character works. * Modify some code comments + q and Q update Modify several code comments to be clearer. Remove the `$key => ` from `decodeText()` in **Font.php** as it's no longer needed. Also, now that `cleanContent()` is ignoring non `text/plain` content, there should be no errant `q` or `Q` commands that cause the stored-state stack to try restoring a state that doesn't exist. Remove the kludgy code that prevented this. * Update Font.php Remove unnecessary `$whitespace` line. * PHP-cs-fixer edits * Address some reviewed changes * Use text matrix 'i' instead of 'b' The correct matrix elements to use for scaling the x-axis are actually the first *column*, so 'a' and 'i', not 'a' and 'b'. My bad! It worked before because almost always the x-axis scaling is equal to the y-axis scaling. * Use mb_detect_encoding() instead of finfo() The Fileinfo functions are not installed by default on Windows, so use a different method to determine whether the stream is valid or binary. * Update Font.php PHP CS Fixer native_function_invocation * Sort switch statement cases in getTextArray() Make the cases a little bit more alphabetical. Remove cases/commands that aren't relevant to getting and positioning text. * Edit documentation comments, add tests Add some documentation comment text to PDFObject.php and fix a comment typo in Font.php. Add a test accounting for text-matrix scaling in Font::decodeText(). Add a test verifying that a string prefixed by a UTF-16BE BOM is decoded directly by Font::decodeContent(). Move "ET in font name" test from testCleanContent() to testGetSectionsText() as that is the function the test uses. Add a test that verifies cleanContent() returns an empty string for binary content. Remove unnecessary variable reset from ET command in Page::getDataTm. Only needed under BT. * Update ParserTest.php Add a fixed numerical value instead of a multiple of $baselineMemory so there's less dependence on the initial memory value and more focus on how much memory the loading of the PDF actually takes. Hopefully fixes the last unit-test failure? Change tests to assertGreaterThan and assertLessThan so it provides better failure info than just assertTrue (will tell us the values). In my PHP 8.2.7 test env, the $usedMemory exceeds $baselineMemory by ~209,000,000 bytes at this point (line 379). 180,000,000 should be safe? * Restore original cleanContent() Rename the new function formatContent() and restore the original cleanContent() function and tests. If anyone relied on these functions before then this will not be a backwards incompatible change. * added a few PDFs generated with various tools to check conformance * fixed CS issue * Remove textMatrix from decodeText() As it turns out, the current text matrix *can* be used to position text in decodeText(), but it is actually applied **both** to the $offset and the user-selected font space limit. In this way it cancels out, and importing the text matrix into the function is not necessary. Remove two tests in FontTest.php that were associated with this, as they're no longer valid. In decodeText() account for strings of text that have spaces with large offsets, such as 'a line of text that is full-width justified.' Treat these as single spaces instead of honouring the offsets given and spacing them out unnecessarily. Save and use the current font-size when guessing the "fudge" factor for text block width. Make sure this is saved as well when the q command is seen. Adjust the position factor threshold up from 10 to 24 for deciding when to insert tabs or spaces. Also update the current horizontal position when storing the last-written-position for use by an ActualText. Add a test for Microsoft Print-to-PDF v1.7. This change actually reverts the output value in a ParserTest assertion back to the original value before this PR. So that's good! * Account for text leading, font-size factor Keep track of, and account for, the current text leading (space between lines) value when determining whether or not to insert newlines \n. Use the current font size as a factor in determining whether or not to insert positioning whitespace. Take more care wrt adding whitespace. Examine some points where whitespace might be added, and remove it if it's unnecessary. Implode words in Font.php, but examine the words to see if they already have space characters on their fronts and ends. TD and Td "move to the start of the next line" so use a $current_position_td['x'] of zero (0) to start with. This change shouldn't really affect much since enough of a y-axis leading should be added that a newline \n results. In Integration\FontTest.php remove "group linux-only" from testDecodeTextForFontWithIndirectEncodingWithoutTypeEncoding() as this now succeeds in any environment. Before where PdfParser was adding spaces after the "zůstatek:" lines whereas now PdfParser adds tabs, which makes far more sense when looking at the document /samples/bugs/PullRequest500.pdf So change the expected results to match the output. * Update FontTest.php Don't use HEREDOC syntax to compose the expected value for the testDecodeTextForFontWithIndirectEncodingWithoutTypeEncoding() test in FontTest.php. This should allow it to work regardless of the PHP file line-endings. * Misc fixes - When assigning $current_text_leading, take the negative as described in the PDF Reference. Initialize it to zero, also as in the Reference. - Add back $current_position_td['x'] when considering a Td or TD command as this appears to be the way it's supposed to be handled according to the reference. - Add another check to remove unnecessary whitespace by examining the previous string of text for a trailing space. - $current_position_tm['x'] and $current_position_tm['y'] aren't tested for an initial false value, so initialize them to 0. - Use double null bytes to implode the string in Font.php to avoid catching UTF-8 bytes that might include a single null byte. - Base the text-box width guess on the current font size instead of a constant value of 4. - In PageTest.php, PdfParser now correctly inserts a newline between Aardappelen and Zak, so change the test. - In ParserTest.php, we are now removing unnecessary whitespace, including double spaces, so change this test to match the output. (Double spaces changed to single spaces) * Add SmallPDF sample PDF and unit test * Pass horizontal fontFactor to decodeText() Pass the fontFactor value (font size multiplied by the text-matrix scale value) to decodeText() so it can more accurately separate text into words by position. A font factor divided by four seems to give the best results for both horizontal and vertical positioning, although this value is selected arbitrarily. In Font::decodeText() some PDFs may have text-matrix scale values so large that directly encoded horizontal tabs are used in place of spaces. Remove internal tabs from texts and replace them with plain spaces. * Update PDFObject.php Restore the $offset parameter to public function getCommandsText() for backwards compatibility. * moved new PDFs into special folder adapted paths in tests * split DocumentTest class spit into a general one, one that is issue related and one which is PDF generator related * added 3 PDFs generated by different generators + tests * More accounting for horizontal font factor Account for the entire font-factor (font-size multiplied by the horizontal scaling factor of the text-matrix) when estimating the width of the current text block. Insert a fix when decoding octal strings in Font::decodeOctal(), check further ahead for escaped backslashes. Remove tests for images in DocumentGeneratorFocusTest.php. These also fail in the current v2.7.0 release and they should be looked at in a separate PR. * Revert decodeOctal(), add @internal Revert the change to Font::decodeOctal() as it's been superceded by PR #640. Add @internal notes to formatContent() and parseDictionary(). * Proper use of PHPDoc "internal" The `@internal` tag hides the content that comes _after_ it from the documentation, so adjust these comments as appropriate. See: https://manual.phpdoc.org/HTMLSmartyConverter/HandS/phpDocumentor/tutorial_tags.internal.pkg.html * Disable a failing test This test will succeed once PR #640 is merged. It doesn't have anything to do with the current PR, so disable it for now. * Update PDFObject.php Switch to tagging method for `@internal`. Adjust comments. * Update PDFObject.php PHP-CS-Fixer requires spaces between `@` statements I guess. * Return empty array if no valid command detected In some edge cases, the formatContent() method may return a document stream row containing an invalid command. Make sure we just ignore these commands instead of triggering warnings for missing $matches array elements. * Update DocumentGeneratorFocusTest.php Re-enable this assertion, now that we have merged #640. * Make formatContent() private Make the formatContent() method private to PDFObject so that `@internal` isn't required. Adjust the unit tests with `ReflectionMethod` to account for this. --------- Co-authored-by: Konrad Abicht <[email protected]>

se-ti · 2024-01-03T20:57:00Z

Thank you, Brian @GreyWyvern, for a tremendous job, You've done!

I have tested changes, You did. You have eliminated not only the problems from my issues, but also removed chaotic extra whitespaces. They weren't critical for my purposes, but now texts look much more eye-friendly.

Thank You!
I'm really impressed!

GreyWyvern · 2024-01-04T16:16:12Z

Thanks @se-ti! While it was very fun and challenging to work on, if I've learned at least two things from this, it is these:

The PDF format sucks and should be illegal.
There's always more work to be done!

😆

k00ni added bug de-/encoding issue labels Mar 20, 2023

GreyWyvern mentioned this issue Jun 27, 2023

Read XMP Metadata and add it to data returned by getDetails() #606

Merged

se-ti mentioned this issue Jul 13, 2023

text encoding breaks in the middle of the line #586

Closed

GreyWyvern mentioned this issue Aug 10, 2023

PdfParser does not consider the entire document stream #628

Closed

GreyWyvern mentioned this issue Aug 18, 2023

Major Update to PDFObject.php + Ancillary #634

Merged

k00ni closed this as completed in #634 Nov 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

getText() result contains annacceptable unicode chars #585

getText() result contains annacceptable unicode chars #585

se-ti commented Mar 18, 2023

se-ti commented Mar 18, 2023

se-ti commented Jan 3, 2024 •

edited

Loading

GreyWyvern commented Jan 4, 2024

getText() result contains annacceptable unicode chars #585

getText() result contains annacceptable unicode chars #585

Comments

se-ti commented Mar 18, 2023

Description:

PDF input

Expected

Code

se-ti commented Mar 18, 2023

se-ti commented Jan 3, 2024 • edited Loading

GreyWyvern commented Jan 4, 2024

se-ti commented Jan 3, 2024 •

edited

Loading