PdfParser does not consider the entire document stream #628

GreyWyvern · 2023-08-05T18:02:24Z

pdfparser/src/Smalot/PdfParser/PDFObject.php

Lines 195 to 201 in 2608ac3

    
           // Extract text blocks.
           if (preg_match_all('/(\sQ)?\s+BT[\s|\(|\[]+(.*?)\s*ET(\sq)?/s', $textCleaned, $matches, \PREG_OFFSET_CAPTURE)) {
               foreach ($matches[2] as $pos => $part) {
                   $text = $part[0];
                   if ('' === $text) {
                       continue;
                   }

Above is the code in PDFObject.php that extracts text blocks from a document stream to determine what to display. It treats only the content between BT and ET commands (plus an optional Q or q on either side) as relevant. However, many valid commands such as cm (graphics position, which affects the initial position of BT) and Tf (font changes) can and do occur outside of BT ... ET blocks. Even q and Q occur regularly in streams, and not just adjacent to BT ... ET. To display the content of a PDF more correctly, the entire stream must be parsed; it is mainly the graphics-related commands that can be ignored.
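To illustrate, here is a minimal sketch that runs the regex from the excerpt against a small content stream. The stream is invented for this example, and the regex is applied to the raw stream instead of PdfParser's pre-cleaned `$textCleaned`, so treat it as a demonstration of the pattern only.

```php
<?php
// Invented sample stream: cm and Tf appear before BT, wrapped in q ... Q.
$stream = "q\n1 0 0 1 50 700 cm\n/F1 12 Tf\nBT\n(Hello) Tj\nET\nQ";

// Regex copied verbatim from the PDFObject.php excerpt.
preg_match_all(
    '/(\sQ)?\s+BT[\s|\(|\[]+(.*?)\s*ET(\sq)?/s',
    $stream,
    $matches,
    \PREG_OFFSET_CAPTURE
);

// With PREG_OFFSET_CAPTURE each entry is [text, offset]; keep only the text.
$captured = array_column($matches[2], 0);

foreach ($captured as $block) {
    // Neither the cm nor the Tf operator shows up in any captured block.
    echo var_export($block, true), "\n";
}
```

Anything outside the BT ... ET pair, including the position and font set just before it, never reaches the later parsing steps.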

As well, q and Q are currently handled in a two state manner. If a q is encountered, the state is saved; if a Q is encountered, the saved state is restored. This does not account for the fact that multiple states can be saved and restored in a stack in a push/pop manner. Both fonts (Tf) and graphics positions (cm) should be stored in this fashion.
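A minimal sketch of the push/pop behaviour, assuming the stream has already been split into operator/operand pairs. The function `trackFontState()` and the command format are hypothetical and are not part of PdfParser.

```php
<?php
// Hypothetical helper: returns the font in effect at each Tj, honouring
// nested q (push state) and Q (pop state) commands.
function trackFontState(array $commands): array
{
    $state = ['font' => null]; // current state
    $stack = [];               // saved states: q pushes, Q pops
    $fontsAtTj = [];

    foreach ($commands as [$operator, $operand]) {
        switch ($operator) {
            case 'q':
                $stack[] = $state;
                break;
            case 'Q':
                // An unbalanced Q is ignored here; real streams may need stricter handling.
                if ([] !== $stack) {
                    $state = array_pop($stack);
                }
                break;
            case 'Tf':
                $state['font'] = $operand;
                break;
            case 'Tj':
                $fontsAtTj[] = $state['font'];
                break;
        }
    }

    return $fontsAtTj;
}

// Two nested q ... Q levels, each with its own font.
$commands = [
    ['Tf', 'F1'], ['q', null], ['Tf', 'F2'], ['q', null], ['Tf', 'F3'],
    ['Tj', 'a'], ['Q', null], ['Tj', 'b'], ['Q', null], ['Tj', 'c'],
];
$fonts = trackFontState($commands);
```

Each Q returns to the font that was active at the matching q, so the three Tj commands see F3, F2 and F1 in turn. A single saved slot cannot unwind more than one level. The same stack could carry the cm position alongside the font.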

pdfparser/src/Smalot/PdfParser/PDFObject.php

Lines 387 to 390 in 2608ac3

    
           case 'Tm':
               $args = preg_split('/\s/s', $command[self::COMMAND]);
               $y = array_pop($args);
               $x = array_pop($args);

Effect on Positioning

In addition to ignoring cm positioning commands, PdfParser's treatment of Tm (set text matrix) and Td/TD (set text current point) does not take into account the full matrix position of 6 values. In the following example stream commands:

0.8 0 0 0.8 100 100 Tm
200 200 Td
(Hello World)Tj

... PdfParser only considers the 100 100 from the Tm command and sets that as the current text position. Then it sees the 200 200 from the Td and overwrites the current text position so it is now 200 200. The correct positioning interpretation is the following:

1. Set the current text position to 100 100.
2. Also set the current text size ratio to 0.8 0.8.
3. Take the values from the Td command, multiply them by the text size ratio, then add them to the current text position: 200 x 0.8 + 100 = 260

Therefore the current text position to display "Hello World" is actually 260 260 and not 200 200.
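The arithmetic above can be sketched as follows. `applyTd()` is a hypothetical helper and not PdfParser's API. It uses the general form of the Td translation, which also covers the b and c matrix entries; when those are 0, as in this example, it reduces to the "multiply by the ratio, then add" steps listed above.

```php
<?php
// Hypothetical helper for illustration only.
// $tm holds the six Tm operands [a, b, c, d, e, f]; (e, f) is the position.
function applyTd(array $tm, float $tx, float $ty): array
{
    [$a, $b, $c, $d, $e, $f] = $tm;

    // Td moves by (tx, ty) in text space, so the offset is scaled by the matrix.
    return [
        $a, $b, $c, $d,
        $tx * $a + $ty * $c + $e,
        $tx * $b + $ty * $d + $f,
    ];
}

$tm = [0.8, 0, 0, 0.8, 100, 100];  // 0.8 0 0 0.8 100 100 Tm
$moved = applyTd($tm, 200, 200);   // 200 200 Td
$position = [$moved[4], $moved[5]];

printf("%.1f %.1f\n", $position[0], $position[1]); // prints "260.0 260.0"
```

Keeping all six values around, and not only the last two, is what makes the Td offset come out scaled.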

Fortunately we can ignore the graphics size ratios from the cm commands as they only affect graphics commands. :)

I'm preparing a PR that will essentially completely re-write the cleanContent(), getSectionsText(), getText(), and getCommandsText() methods from PDFObject.php (as well as a couple minor changes in Font.php and Page.php) to switch to this new way of interpreting the document stream. It is an extensive change which I hope gets a lot of scrutiny! Already in my test environment it is passing all unit tests except one, and resolves a large number of open issues.

Opening this issue for discussion purposes, and I may start tagging issues here that will (hopefully) be resolved by the change.

GreyWyvern · 2023-08-09T22:31:12Z

Putting this down just so I don't lose it...

Okay, so I've just gone through the whole list of issues, and where sample PDFs were available, I tested them using my new setup. The updated code I'm working on resolves the following issues (not tagged for cleanliness).

110, 149, 261, 353, 387, 398, 458, 508, 527, 528, 542, 551, 564, 568, 575, 576, 585, 607, 608, 628

In addition to the above, I can't verify whether it's my update that has fixed these, but they are resolved.

474 - May have been fixed by 533, but my code definitely changes how ET commands are handled.
491 - May have been fixed by 597.
537
541 - Sample PDF is mostly text stored as curves so unreadable, but there is no memory error.
578 - At least the second texte.pdf file is working with my setup.

All tests are now passing, however I did have to modify several since the way the script handles and parses the document stream is now different.

GreyWyvern · 2023-08-10T19:57:29Z

I've gone through the list again, once with my updated setup, and once with the latest release v2.7.0 to get a definitive list of all the issues this change will fix. And here it is:

#353, #398, #464, #474, ~~491~~, #508, #528, #537, #564, #568, #575, #576, #585, #608 and this issue 628.

Now I just need to write tests for these, lol.

GreyWyvern · 2023-08-14T19:03:19Z

I've committed my first set of changes here: https://github.com/GreyWyvern/pdfparser

There are still more tests needed, but at least you can try out the changes from the repo. :)

k00ni · 2023-08-16T05:37:15Z

> There are still more tests needed, but at least you can try out the changes from the repo. :)

I suggest you create a PR regardless, because it will be easier to follow changes and discuss them. A draft PR will be sufficient at the beginning.

k00ni added the enhancement and help wanted labels on Aug 7, 2023

GreyWyvern mentioned this issue on Aug 18, 2023 in "Major Update to PDFObject.php + Ancillary" #634 (merged)

k00ni closed this as completed in #634 Nov 7, 2023
