parsed incorectly, wrong symbols #592

Closed
NazarSolovei opened this issue Apr 2, 2023 · 5 comments · Fixed by #622

Comments

@NazarSolovei

  • PHP Version: 8.0
  • PDFParser Version: Latest: 2.5.5

Description: The PDF file is parsed, but the extracted text is output as unreadable characters.

PDF input: inv.pdf

Expected output & actual output

EXPECTED:
Buyer Receipt
Receipt # 19509334
Receipt Date 10/26/2022
Received By AC/ASAP S...... (and so on as in pdf file)

ACTUAL:
%X\HU�5HFHLSW 5HFHLSW�� �������� 5HFHLSW�'DWH ���������� 5HFHLYHG�%\ $&�$6$3�6 6ROG�$W�%UDQFK ������'UHD

Code

include "vendor/autoload.php";
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile("inv.pdf");
$text = $pdf->getText();
echo $text;

@NazarSolovei
Author

@k00ni have you tried to fix it?
Thank you for your answer.

@k00ni
Collaborator

k00ni commented Apr 17, 2023

@NazarSolovei No, my spare time is very limited so I have to triage what issues I focus on.

@GreyWyvern
Contributor

There are actually two issues here. The first is that the font CIDMaps are zlib-compressed with a CRC32 checksum instead of the Adler-32 checksum that PHP's gzuncompress() expects. See: https://www.php.net/manual/en/function.gzuncompress.php#79042

The code at the link above fixes most of the PdfParser extracted output by at least setting the default page font (F14). However, one section of the file uses a different font (F17) and still displays undecoded bytes. That is the second issue: F17 is never applied because the font-change commands (/F17 9.744095 Tf, for example) have been placed outside of the text blocks, between them. See this excerpt of the stream:

S
/F17 9.744095 Tf
/GS0 gs
0 0 0 rg
BT
54.035439 -109.771729 Td
[______________________________________________________________________________________________] TJ
ET
/F17 9.744095 Tf
/GS0 gs
0 0 0 rg
BT
236.515762 -110.657532 Td
[________________________] TJ
ET
/GS0 gs
0 0 0 RG 0.885827 w 0 J 0 j 1.414 M [] 0 d
121.358276 -110.657532 m
224.114197 -110.657532 l
S
/GS0 gs
0 0 0 RG 0.885827 w 0 J 0 j 1.414 M [] 0 d
271.948822 -110.657532 m
403.937042 -110.657532 l
S
/F14 7.972442 Tf
/GS0 gs
0 0 0 rg
BT
53.149609 -128.374084 Td
[_____________________________________________________________________________] TJ
ET
/F14 7.972442 Tf
/GS0 gs
0 0 0 rg
BT
318.011841 -130.145752 Td
[_________________________________________________________________________________] TJ
ET

PdfParser considers everything between a BT and an ET as a "text block". Normally font commands Tf only occur inside these blocks. But in the stream above, they don't.

The current getSectionsText() function in PDFObject.php only extracts commands that are within these text blocks and ignores all other lines of the stream. I think this function will need to be heavily overhauled to account for commands that potentially happen outside of text blocks BT ... ET.
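
To illustrate the kind of tracking such an overhaul would need, here is a minimal sketch that remembers the most recent Tf command wherever it appears and attaches it to each text-showing line inside BT ... ET. The function name fontsForTextRuns() is hypothetical and this is not PdfParser's getSectionsText(); it also assumes one operator per line, as in the excerpt above.

```php
<?php
// Hypothetical helper (not part of PdfParser): pair every TJ/Tj line
// with the font that was selected most recently, even if the Tf command
// appeared outside of a BT ... ET text block.
function fontsForTextRuns(string $stream): array
{
    $currentFont = null;   // last font name seen in a Tf command
    $inTextBlock = false;  // true between BT and ET
    $runs = [];

    foreach (preg_split('/\R/', $stream) as $line) {
        $line = trim($line);
        if (preg_match('~^/(\S+)\s+-?[\d.]+\s+Tf$~', $line, $m)) {
            // Font changes are tracked inside and outside text blocks.
            $currentFont = $m[1];
        } elseif ('BT' === $line) {
            $inTextBlock = true;
        } elseif ('ET' === $line) {
            $inTextBlock = false;
        } elseif ($inTextBlock && preg_match('~^(.+?)\s+(TJ|Tj)$~', $line, $m)) {
            $runs[] = ['font' => $currentFont, 'text' => $m[1]];
        }
    }

    return $runs;
}

// Shortened stand-in for the stream excerpt; the string operands are made up.
$stream = <<<'PDF'
/F17 9.744095 Tf
/GS0 gs
BT
54.035439 -109.771729 Td
[(AB)] TJ
ET
/F14 7.972442 Tf
BT
53.149609 -128.374084 Td
[(CD)] TJ
ET
PDF;

foreach (fontsForTextRuns($stream) as $run) {
    echo $run['font'], ' => ', $run['text'], PHP_EOL;
}
```

A real fix would also have to handle several operators on one line and the rest of the graphics state, which this sketch ignores.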

@GreyWyvern
Contributor

Hi @NazarSolovei. May I use your inv.pdf as an addition to the PdfParser test suite? Is it free to use? Thanks.

@NazarSolovei
Author

NazarSolovei commented Jul 31, 2023 via email

GreyWyvern added a commit to GreyWyvern/pdfparser that referenced this issue Jul 31, 2023
A zlib compressed stream may have a CRC32 checksum instead of Adler-32 which the PHP gzuncompress() function expects. Add a second zlib decompression attempt if the first one fails. See: https://www.php.net/manual/en/function.gzuncompress.php#79042
Partially resolves smalot#592.
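
The "second attempt" idea can be sketched roughly as below. This is not the code from the commit or from the linked php.net comment; it swaps in a simpler fallback that assumes a 2-byte zlib header and a 4-byte trailer, strips both, and inflates the raw DEFLATE data without verifying any checksum. The helper name inflateLenient() is hypothetical.

```php
<?php
// Hypothetical helper (not PdfParser's decodeFilterFlateDecode()).
// Returns the decompressed string, or false if both attempts fail.
function inflateLenient(string $data)
{
    // First attempt: a regular zlib stream with an Adler-32 trailer.
    // The @ suppresses the E_WARNING that gzuncompress() emits on failure.
    $out = @gzuncompress($data);
    if (false !== $out) {
        return $out;
    }

    // Second attempt (assumption): drop the 2-byte zlib header and the
    // 4-byte trailer, then inflate the raw DEFLATE data. No checksum is
    // verified on this path.
    if (strlen($data) <= 6) {
        return false;
    }

    return @gzinflate(substr($data, 2, -4));
}

// Build a stream whose trailer holds a CRC32 instead of an Adler-32.
$text = 'Buyer Receipt';
$bad = substr(gzcompress($text), 0, -4) . pack('N', crc32($text));

var_dump(@gzuncompress($bad));   // the strict call rejects the stream
var_dump(inflateLenient($bad));  // the fallback recovers the text
```

Skipping the checksum trades integrity checking for tolerance, so a caller may want to log when the fallback path is taken.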
k00ni closed this as completed in #622 on Aug 5, 2023
k00ni pushed a commit that referenced this issue Aug 5, 2023
* Allow for CRC32 checksum gzuncompress()

A zlib compressed stream may have a CRC32 checksum instead of Adler-32 which the PHP gzuncompress() function expects. Add a second zlib decompression attempt if the first one fails. See: https://www.php.net/manual/en/function.gzuncompress.php#79042
Partially resolves #592.

* Simplify decodeFilterFlateDecode() error-handling

Instead of setting an error handler to catch the E_WARNING's that gzuncompress() emits, suppress it with an @ so we can do away with the try/catch. Make a note of this in the comments.
Switch from using tempnam() to tmpfile() because tempnam() can emit E_NOTICE's and would have to be suppressed as well. tmpfile() just returns a handle or false.
Limit file_get_contents() by the $decodeMemoryLimit. Unlike gzuncompress() for which a limit value of zero (0) means "no limit", file_get_contents() takes null to mean "no limit".

* Update FilterHelper.php

Fix for PHP < 8.0 that doesn't like a length limit of null for file_get_contents().
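
One way to reconcile the two "no limit" conventions on both old and new PHP versions is to leave out the length argument rather than pass null. The sketch below assumes a limit of zero means "no limit", as with gzuncompress(); the helper name readLimited() is hypothetical and is not taken from FilterHelper.php.

```php
<?php
// Hypothetical helper: $limit = 0 means "no limit", like gzuncompress().
function readLimited(string $path, int $limit = 0)
{
    if ($limit > 0) {
        // Read at most $limit bytes from the start of the file.
        return file_get_contents($path, false, null, 0, $limit);
    }

    // Omit the length argument entirely instead of passing null,
    // which PHP < 8.0 does not treat as "no limit".
    return file_get_contents($path);
}

$path = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents($path, 'abcdef');

var_dump(readLimited($path, 3)); // limited read
var_dump(readLimited($path));    // whole file

unlink($path);
```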