PdfParser does not consider the entire document stream #628
Comments
Putting this down just so I don't lose it... Okay, so I've just gone through the whole list of issues, and where sample PDFs were available, I tested them using my new setup. The updated code I'm working on resolves the following issues (not tagged for cleanliness): 110, 149, 261, 353, 387, 398, 458, 508, 527, 528, 542, 551, 564, 568, 575, 576, 585, 607, 608, 628. In addition to the above, I can't verify whether it's my update that has fixed these, but they are resolved.
All tests are now passing, however I did have to modify several since the way the script handles and parses the document stream is now different.
I've gone through the list again, once with my updated setup, and once with the latest release v2.7.0 to get a definitive list of all the issues this change will fix. And here it is: #353, #398, #464, #474, Now I just need to write tests for these, lol.
I've committed my first set of changes here: https://github.com/GreyWyvern/pdfparser There are still more tests needed, but at least you can try out the changes from the repo. :)
I suggest you create a PR regardless, because it will be easier to follow changes and discuss them. A draft PR will be sufficient at the beginning.
pdfparser/src/Smalot/PdfParser/PDFObject.php
Lines 195 to 201 in 2608ac3
Above is the code in PDFObject.php that extracts lines from a document stream to determine what to display. It only considers content between `BT` and `ET` commands (and maybe a `Q` or `q` on either side) to be valid commands. However, many valid commands such as `cm` (graphics position affecting initial position of `BT`) and `Tf` (font changes) can and do occur outside of `BT ... ET` blocks. Even `q` and `Q` occur regularly in streams and not just adjacent to `BT ... ET`. In order to more correctly display the content of a PDF, the entire stream must be used, with mainly graphics-related commands able to be ignored.

As well, `q` and `Q` are currently handled in a two-state manner. If a `q` is encountered, the state is saved; if a `Q` is encountered, the saved state is restored. This does not account for the fact that multiple states can be saved and restored in a stack in a push/pop manner. Both fonts (`Tf`) and graphics positions (`cm`) should be stored in this fashion.

pdfparser/src/Smalot/PdfParser/PDFObject.php
Lines 387 to 390 in 2608ac3
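
To illustrate the push/pop handling described above, here is a minimal sketch in Python (chosen only for brevity; it is not PdfParser's PHP code). The `State` class, the `fonts_at_text_blocks` function, and the pre-tokenized `(operator, operands)` input format are all hypothetical names made up for this example. For simplicity, `cm` just stores its operands instead of concatenating them with the previous matrix as a real renderer would.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class State:
    # Hypothetical state record: only the two things the issue says
    # must be saved and restored by q / Q.
    font: Optional[str] = None                    # last font set by Tf
    cm: Tuple[float, ...] = (1, 0, 0, 1, 0, 0)    # last cm operands (not concatenated)


def fonts_at_text_blocks(commands):
    """Walk the WHOLE stream, not just BT ... ET, and report the font
    and cm offset in effect at the start of each text block."""
    state = State()
    stack = []          # q pushes, Q pops: arbitrary nesting depth
    seen = []
    for op, args in commands:
        if op == "q":
            stack.append(state)
        elif op == "Q":
            if stack:   # tolerate an unbalanced Q
                state = stack.pop()
        elif op == "Tf":
            state = replace(state, font=args[0])
        elif op == "cm":
            state = replace(state, cm=tuple(args))
        elif op == "BT":
            seen.append((state.font, state.cm[4:]))
        # all other (mainly graphics) operators are ignored
    return seen


# Assumed, hand-written command list with two nested q ... Q levels.
commands = [
    ("Tf", ["/F1", 12]),               # font change outside BT ... ET
    ("q", []),
    ("cm", [1, 0, 0, 1, 50, 700]),
    ("q", []),
    ("Tf", ["/F2", 10]),
    ("BT", []), ("ET", []),            # sees /F2 at (50, 700)
    ("Q", []),
    ("BT", []), ("ET", []),            # /F1 restored, cm still (50, 700)
    ("Q", []),
    ("BT", []), ("ET", []),            # cm restored to identity
]
result = fonts_at_text_blocks(commands)
```

With a single saved state instead of a stack, the second `Q` in this list would have nothing correct to restore, which is the failure mode described above.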
Effect on Positioning
In addition to ignoring `cm` positioning commands, PdfParser's treatment of `Tm` (set text matrix) and `Td` / `TD` (set text current point) does not take into account the full matrix position of 6 values. In the following example stream commands:

...

PdfParser only considers the `100 100` from the `Tm` command and sets that as the current text position. Then it sees the `200 200` from the `Td` and overwrites the current text position so it is now `200 200`. The correct positioning interpretation is the following:

1. Set the current text position to `100 100`.
2. Set the text size ratio to `0.8 0.8`.
3. Take the values from the `Td` command, multiply them by the text size ratio, then add them to the current text position: 200 x 0.8 + 100 = 260
4. The resulting text position is `260 260` and not `200 200`.

Fortunately we can ignore the graphics size ratios from the `cm` commands as they only affect graphics commands. :)

I'm preparing a PR that will essentially completely re-write the `cleanContent()`, `getSectionsText()`, `getText()`, and `getCommandsText()` methods from PDFObject.php (as well as a couple minor changes in Font.php and Page.php) to switch to this new way of interpreting the document stream. It is an extensive change which I hope gets a lot of scrutiny! Already in my test environment it is passing all unit tests except one, and resolves a large number of open issues.

Opening this issue for discussion purposes, and I may start tagging issues here that will (hopefully) be resolved by the change.
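
For reference, the `Tm` / `Td` arithmetic from the positioning section can be sketched as follows. This is a small Python illustration under stated assumptions, not the code from the planned PR: the function names `set_text_matrix` and `move_text_position` are hypothetical, and because the example stream is not reproduced here, the `Tm` operands `0.8 0 0 0.8 100 100` are an assumption chosen to match the `100 100` position and `0.8 0.8` ratio in the description. The offset formula is the general matrix rule for `Td`, which reduces to "multiply by the ratio, then add" when there is no rotation or skew.

```python
def set_text_matrix(a, b, c, d, e, f):
    """Tm: keep all six values, not just the e/f position."""
    return (a, b, c, d, e, f)


def move_text_position(matrix, tx, ty):
    """Td: offset the position by (tx, ty) expressed in the matrix's
    own scaled coordinates, instead of overwriting it."""
    a, b, c, d, e, f = matrix
    return (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)


# Assumed operands: "0.8 0 0 0.8 100 100 Tm" followed by "200 200 Td".
matrix = set_text_matrix(0.8, 0, 0, 0.8, 100, 100)
matrix = move_text_position(matrix, 200, 200)
x, y = matrix[4], matrix[5]   # 200 x 0.8 + 100 on each axis
```

Overwriting the position with the raw `Td` operands would instead leave `x` and `y` at 200, which is the current behaviour the issue describes.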