Parsed incorrectly, wrong symbols #592

NazarSolovei · 2023-04-02T19:12:28Z

PHP Version: 8.0
PDFParser Version: 2.5.5 (latest)

Description: The PDF file is parsed, but the extracted text is output as unreadable characters.

PDF input: inv.pdf

Expected output & actual output

EXPECTED:
Buyer Receipt
Receipt # 19509334
Receipt Date 10/26/2022
Received By AC/ASAP S...... (and so on as in pdf file)

ACTUAL:
%X\HU�5HFHLSW 5HFHLSW�� 5HFHLSW�'DWH �� 5HFHLYHG�%\ $&�$6$3�6 6ROG�$W�%UDQFK ��'UHD

Code

include "vendor/autoload.php";
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile("inv.pdf");
$text = $pdf->getText();
echo $text;

NazarSolovei · 2023-04-14T13:27:42Z

@k00ni have you tried to fix it?
Thank you for your answer.

k00ni · 2023-04-17T07:26:42Z

@NazarSolovei No, my spare time is very limited so I have to triage what issues I focus on.

GreyWyvern · 2023-07-18T17:37:21Z

There are actually two issues here. One is that the font CIDMaps are gzcompressed with a CRC32 checksum instead of the Adler-32 PHP expects. See: https://www.php.net/manual/en/function.gzuncompress.php#79042

The code at the link above fixes most of the PdfParser extracted output by at least setting the default page font (F14). However, one section of the file uses a different font (F17) and still displays undecoded bytes. F17 is never applied because the font-change commands (/F17 9.744095 Tf, for example) have been placed outside of the text blocks, between them. See this excerpt of the stream:

S
/F17 9.744095 Tf
/GS0 gs
0 0 0 rg
BT
54.035439 -109.771729 Td
[______________________________________________________________________________________________] TJ
ET
/F17 9.744095 Tf
/GS0 gs
0 0 0 rg
BT
236.515762 -110.657532 Td
[________________________] TJ
ET
/GS0 gs
0 0 0 RG 0.885827 w 0 J 0 j 1.414 M [] 0 d
121.358276 -110.657532 m
224.114197 -110.657532 l
S
/GS0 gs
0 0 0 RG 0.885827 w 0 J 0 j 1.414 M [] 0 d
271.948822 -110.657532 m
403.937042 -110.657532 l
S
/F14 7.972442 Tf
/GS0 gs
0 0 0 rg
BT
53.149609 -128.374084 Td
[_____________________________________________________________________________] TJ
ET
/F14 7.972442 Tf
/GS0 gs
0 0 0 rg
BT
318.011841 -130.145752 Td
[_________________________________________________________________________________] TJ
ET

PdfParser considers everything between a BT and an ET as a "text block". Normally font commands Tf only occur inside these blocks. But in the stream above, they don't.

The current getSectionsText() function in PDFObject.php only extracts commands that are within these text blocks and ignores all other lines of the stream. I think this function will need to be heavily overhauled to account for commands that potentially happen outside of text blocks BT ... ET.
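To make the idea concrete, here is a minimal sketch of tracking the font across the whole stream instead of only inside BT ... ET. It is not PdfParser code: the function name fontsPerTextBlock() is hypothetical, and it assumes one operator per line, as in the excerpt above, which real content streams do not guarantee.

```php
<?php
/**
 * Hypothetical helper, not part of PdfParser: pairs each text-showing
 * command inside BT ... ET with the font selected by the most recent
 * Tf operator, wherever that operator occurred in the stream.
 */
function fontsPerTextBlock(string $stream): array
{
    $currentFont = null; // name from the last Tf seen, e.g. "F17"
    $inBlock = false;    // true while between BT and ET
    $blocks = [];

    foreach (preg_split('/\R/', trim($stream)) as $line) {
        $line = trim($line);
        if (preg_match('/^\/(\S+)\s+-?[\d.]+\s+Tf$/', $line, $m)) {
            // Honour font changes both inside and outside text blocks.
            $currentFont = $m[1];
        } elseif ('BT' === $line) {
            $inBlock = true;
        } elseif ('ET' === $line) {
            $inBlock = false;
        } elseif ($inBlock && preg_match('/\b(TJ|Tj)$/', $line)) {
            $blocks[] = ['font' => $currentFont, 'command' => $line];
        }
    }

    return $blocks;
}

// Shortened version of the stream excerpt: Tf sits between the blocks.
$stream = <<<'PDF'
/F17 9.744095 Tf
BT
54.035439 -109.771729 Td
[____] TJ
ET
/F14 7.972442 Tf
BT
53.149609 -128.374084 Td
[____] TJ
ET
PDF;

foreach (fontsPerTextBlock($stream) as $block) {
    echo $block['font'], ': ', $block['command'], PHP_EOL;
}
```

The point of the sketch is only that the font state has to survive across ET and the next BT. A real fix would also need to handle several operators on one line and the graphics state operators q and Q.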

GreyWyvern · 2023-07-31T15:51:44Z

Hi @NazarSolovei. May I use your inv.pdf as an addition to the PdfParser test suite? Is it free to use? Thanks.

NazarSolovei · 2023-07-31T16:09:52Z

Hello. Yes, it's free. Could you notify me once the issue is solved?


A zlib compressed stream may have a CRC32 checksum instead of Adler-32 which the PHP gzuncompress() function expects. Add a second zlib decompression attempt if the first one fails. See: https://www.php.net/manual/en/function.gzuncompress.php#79042 Partially resolves smalot#592.

* Allow for CRC32 checksum gzuncompress()

  A zlib compressed stream may have a CRC32 checksum instead of Adler-32 which the PHP gzuncompress() function expects. Add a second zlib decompression attempt if the first one fails. See: https://www.php.net/manual/en/function.gzuncompress.php#79042 Partially resolves #592.

* Simplify decodeFilterFlateDecode() error-handling

  Instead of setting an error handler to catch the E_WARNINGs that gzuncompress() emits, suppress them with an @ so we can do away with the try/catch. Make a note of this in the comments. Switch from using tempnam() to tmpfile() because tempnam() can emit E_NOTICEs and would have to be suppressed as well. tmpfile() just returns a handle or false. Limit file_get_contents() by the $decodeMemoryLimit. Unlike gzuncompress(), for which a limit value of zero (0) means "no limit", file_get_contents() takes null to mean "no limit".

* Update FilterHelper.php

  Fix for PHP < 8.0 that doesn't like a length limit of null for file_get_contents().
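For readers who want to see the "second attempt" idea in isolation, here is a minimal sketch. It does not reproduce the merged change, which per the commit message above goes through tmpfile() and file_get_contents(). Instead it swaps in a simpler technique: strip the 2-byte zlib header and the 4-byte trailing checksum, then inflate the raw deflate data with gzinflate(), which verifies no checksum at all. The function name inflateZlibLenient() and the sample payload are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper, not PdfParser's decodeFilterFlateDecode().
 * Returns the decoded string, or false if both attempts fail.
 */
function inflateZlibLenient(string $data, int $maxLength = 0)
{
    // First attempt: a regular zlib stream with an Adler-32 trailer.
    // The @ suppresses the E_WARNING gzuncompress() emits on failure.
    $decoded = @gzuncompress($data, $maxLength);
    if (false !== $decoded) {
        return $decoded;
    }

    // Too short to hold a header, a payload and a checksum.
    if (strlen($data) <= 6) {
        return false;
    }

    // Second attempt: drop the zlib header (2 bytes) and the trailing
    // checksum (4 bytes), then inflate the raw deflate payload.
    // Note: this skips integrity verification entirely.
    return @gzinflate(substr($data, 2, -4), $maxLength);
}

// Build a zlib-framed stream whose trailer is CRC32 instead of Adler-32.
$payload = str_repeat('Buyer Receipt ', 20);
$stream = "\x78\x9c" . gzdeflate($payload) . pack('N', crc32($payload));

var_dump(false === @gzuncompress($stream));         // plain gzuncompress() rejects it
var_dump($payload === inflateZlibLenient($stream)); // the fallback recovers the text
```

As in gzuncompress(), a $maxLength of 0 means "no limit" here. The trade-off of this variant is that corrupted data is no longer detected by a checksum, so it only makes sense as a fallback after the strict attempt has failed.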

k00ni added bug de-/encoding issue labels Apr 3, 2023

This was referenced Jul 31, 2023

Allow for CRC32 checksum gzuncompress() #622

Merged

calculateTextWidth throws an error for some fonts #570

Open

k00ni closed this as completed in #622 Aug 5, 2023

NickHahac mentioned this issue Dec 20, 2023

gzuncompress(): data error #659

Closed
