Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PdfParser does not consider the entire document stream #628

Closed
GreyWyvern opened this issue Aug 5, 2023 · 4 comments · Fixed by #634
Closed

PdfParser does not consider the entire document stream #628

GreyWyvern opened this issue Aug 5, 2023 · 4 comments · Fixed by #634

Comments

@GreyWyvern
Copy link
Contributor

// Extract text blocks.
if (preg_match_all('/(\sQ)?\s+BT[\s|\(|\[]+(.*?)\s*ET(\sq)?/s', $textCleaned, $matches, \PREG_OFFSET_CAPTURE)) {
foreach ($matches[2] as $pos => $part) {
$text = $part[0];
if ('' === $text) {
continue;
}

Above is the code in PDFObject.php that extracts lines from a document stream to determine what to display. It only considers content between BT and ET commands (and maybe a Q or q on either side) to be valid commands. However, many valid commands such as cm (graphics position affecting initial position of BT) and Tf (font changes) can and do occur outside of BT ... ET blocks. Even q and Q occur regularly in streams and not just adjacent to BT ... ET. In order to more correctly display the content of a PDF, the entire stream must be used, with mainly graphics-related commands able to be ignored.

As well, q and Q are currently handled in a two state manner. If a q is encountered, the state is saved; if a Q is encountered, the saved state is restored. This does not account for the fact that multiple states can be saved and restored in a stack in a push/pop manner. Both fonts (Tf) and graphics positions (cm) should be stored in this fashion.

case 'Tm':
$args = preg_split('/\s/s', $command[self::COMMAND]);
$y = array_pop($args);
$x = array_pop($args);

Affect on Positioning

In addition to ignoring cm positioning commands, PdfParser's treatment of Tm (set text matrix) and Td/TD (set text current point) does not take into account the full matrix position of 6 values. In the following example stream commands:

0.8 0 0 0.8 100 100 Tm
200 200 Td
(Hello World)Tj

... PdfParser only considers the 100 100 from the Tm command and sets that as the current text position. Then it sees the 200 200 from the Td and overwrites the current text position so it is now 200 200. The correct positioning interpretation is the following:

  • Set the current text position to 100 100.
  • Also set the current text size ratio to 0.8 0.8.
  • Take the values from the Td command, multiply them by the text size ratio, then add them to the current text position: 200 x 0.8 + 100 = 260
  • Therefore the current text position to display "Hello World" is actually 260 260 and not 200 200.

Fortunately we can ignore the graphics size ratios from the cm commands as they only affect graphics commands. :)

I'm preparing a PR that will essentially completely re-write the cleanContent(), getSectionsText(), getText(), and getCommandsText() methods from PDFObject.php (as well as a couple minor changes in Font.php and Page.php) to switch to this new way of interpreting the document stream. It is an extensive change which I hope gets a lot of scrutiny! Already in my test environment it is passing all unit tests except one, and resolves a large number of open issues.

Opening this issue for discussion purposes, and I may start tagging issues here that will (hopefully) be resolved by the change.

@GreyWyvern
Copy link
Contributor Author

Putting this down just so I don't lose it...

Okay, so I've just gone through the whole list of issues, and where sample PDFs were available, I tested them using my new setup. The updated code I'm working on resolves the following issues (not tagged for cleanliness).

110, 149, 261, 353, 387, 398, 458, 508, 527, 528, 542, 551, 564, 568, 575, 576, 585, 607, 608, 628

In addition to the above, I can't verify whether it's my update that has fixed these, but they are resolved.

  • 474 - May have been fixed by 533, but my code definitely changes how ET commands are handled.
  • 491 - May have been fixed by 597.
  • 537
  • 541 - Sample PDF is mostly text stored as curves so unreadable, but there is no memory error.
  • 578 - At least the second texte.pdf file is working with my setup.

All tests are now passing, however I did have to modify several since the way the script handles and parses the document stream is now different.

@GreyWyvern
Copy link
Contributor Author

GreyWyvern commented Aug 10, 2023

I've gone through the list again, once with my updated setup, and once with the latest release v2.7.0 to get a definitive list of all the issues this change will fix. And here it is:

#353, #398, #464, #474, 491, #508, #528, #537, #564, #568, #575, #576, #585, #608 and this issue 628.

Now I just need to write tests for these, lol.

@GreyWyvern
Copy link
Contributor Author

I've committed my first set of changes here: https://github.com/GreyWyvern/pdfparser

There are still more tests needed, but at least you can try out the changes from the repo. :)

@k00ni
Copy link
Collaborator

k00ni commented Aug 16, 2023

There are still more tests needed, but at least you can try out the changes from the repo. :)

I suggest you create a PR regardless, because it will be easier to follow changes and discuss them. A draft PR will be sufficient at the beginning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants