Parsed incorrectly, wrong symbols #592
Comments
@k00ni have you tried to fix it?
@NazarSolovei No, my spare time is very limited, so I have to triage which issues I focus on.
There are actually two issues here. One is that the font CIDMaps are gzcompressed with a CRC32 checksum instead of the Adler-32 PHP expects. See: https://www.php.net/manual/en/function.gzuncompress.php#79042 The code at the link above fixes the majority of the PdfParser extracted output by at least setting the default page font (F14). However, there is one section of the file in a different font (F17) which still displays undecoded bytes, and the reason F17 is not getting used is because the change font commands (
PdfParser considers everything between a The current |
Hi @NazarSolovei. May I use your inv.pdf as an addition to the PdfParser test suite? Is it free to use? Thanks.
Hello,
Yes, it’s free. Could you notify me once the issue is solved?
On Mon, Jul 31, 2023 at 18:51, Brian Huisman ***@***.***> wrote:
Hi @NazarSolovei <https://github.com/NazarSolovei>. May I use your
inv.pdf as an addition to the PdfParser test suite? Is it free to use?
Thanks.
--
Nazar Solovei
A zlib compressed stream may have a CRC32 checksum instead of Adler-32 which the PHP gzuncompress() function expects. Add a second zlib decompression attempt if the first one fails. See: https://www.php.net/manual/en/function.gzuncompress.php#79042 Partially resolves smalot#592.
* Allow for CRC32 checksum gzuncompress()
  A zlib compressed stream may have a CRC32 checksum instead of the Adler-32 that the PHP gzuncompress() function expects. Add a second zlib decompression attempt if the first one fails. See: https://www.php.net/manual/en/function.gzuncompress.php#79042 Partially resolves #592.
* Simplify decodeFilterFlateDecode() error-handling
  Instead of setting an error handler to catch the E_WARNINGs that gzuncompress() emits, suppress them with an @ so we can do away with the try/catch. Make a note of this in the comments. Switch from tempnam() to tmpfile() because tempnam() can emit E_NOTICEs, which would have to be suppressed as well; tmpfile() just returns a handle or false. Limit file_get_contents() by the $decodeMemoryLimit. Unlike gzuncompress(), for which a limit value of zero (0) means "no limit", file_get_contents() takes null to mean "no limit".
* Update FilterHelper.php
  Fix for PHP < 8.0, which doesn't accept a length limit of null for file_get_contents().
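The "second attempt if the first one fails" idea can be illustrated with a minimal sketch. Note that this is not the code from the commits above, which, per their description, go through tmpfile() and file_get_contents(). As a swapped-in alternative, the sketch assumes a standard 2-byte zlib header without a preset dictionary and a 4-byte checksum trailer, strips both, and inflates the raw deflate payload with gzinflate(), so the trailing checksum is never verified. The function name `inflateWithChecksumFallback` is hypothetical, and `$decodeMemoryLimit` is borrowed from the commit message, with 0 meaning "no limit" for both gzuncompress() and gzinflate().

```php
<?php

/**
 * Hypothetical helper: decode a zlib stream, tolerating a non-Adler-32 trailer.
 *
 * @param string $stream            Raw bytes of the compressed stream
 * @param int    $decodeMemoryLimit Maximum decoded length, 0 for no limit
 *
 * @return string|false Decoded data, or false if both attempts fail
 */
function inflateWithChecksumFallback(string $stream, int $decodeMemoryLimit = 0)
{
    // First attempt: regular zlib decoding. gzuncompress() emits an
    // E_WARNING and returns false on a checksum mismatch, so suppress it.
    $decoded = @gzuncompress($stream, $decodeMemoryLimit);
    if (false !== $decoded) {
        return $decoded;
    }

    // A zlib stream needs at least a 2-byte header and a 4-byte trailer.
    if (strlen($stream) < 6) {
        return false;
    }

    // Second attempt (assumption: plain 2-byte header, 4-byte trailer):
    // drop both and inflate the raw deflate data. The checksum is NOT
    // verified here, so corrupted input may go unnoticed.
    return @gzinflate(substr($stream, 2, -4), $decodeMemoryLimit);
}
```

Calling `inflateWithChecksumFallback($bytes)` on a stream whose trailer holds a CRC32 value should return the decoded data where gzuncompress() alone returns false. The trade-off of this variant is that it gives up integrity checking entirely in the fallback path.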
Description: The PDF file is parsed and output as unreadable characters.
PDF input: inv.pdf
Expected output & actual output
EXPECTED:
Buyer Receipt
Receipt # 19509334
Receipt Date 10/26/2022
Received By AC/ASAP S...... (and so on, as in the PDF file)
ACTUAL:
%X\HU�5HFHLSW 5HFHLSW�� �������� 5HFHLSW�'DWH ���������� 5HFHLYHG�%\ $&�$6$3�6 6ROG�$W�%UDQFK ������'UHD
Code
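<?php
// Load the Composer autoloader; require fails loudly if it is missing.
require "vendor/autoload.php";
$parser = new \Smalot\PdfParser\Parser();
// Parse the attached file and print the extracted text.
$pdf = $parser->parseFile("inv.pdf");
$text = $pdf->getText();
echo $text;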