Does not parse text from pdf file #564

Closed
mapexpert opened this issue Nov 30, 2022 · 9 comments · Fixed by #634

@mapexpert

$content = Storage::get('iaa/receipt1.pdf');
$parser = new \Smalot\PdfParser\Parser;
$data = $parser->parseContent($content);
dd($data->getText());

Output:
b"%\t\n\x00X\x00\x00H\x00U\x00\x03\x005\x00H\x00F\x00H\x00L\x00S\x00W\t\n\x005\t\n\n\x00H\x00F\x00H\x00L\x00S\x00W\x00\x03\x00\x06\t\n\x00\x14\t\n \x00\x1C\x00\x19\x00\x17\x00\x1A\x00\x19\x00\x17\x00\x13\t\n\x005\t\n\n\x00H\x00F\x00H\x00L\x00S\x00W\x00\x03\x00'\x00D\x00W\x00H\t\n\x00\x14\t\n \x00\x14\x00\x12\x00\x15\x00\x16\x00\x12\x00\x15\x00\x13\x00\x15\x00\x15\t\n\x005\t\n\n\x00H\x00F\x00H\x00L\x00Y\x00H\x00G\x00\x03\x00%\x00\t\n\x00$\t\n \x006\x00$\x003\x00\x03\x006\t\n\x006\t\n\n\x00R\x00O\x00G\x00\x03\x00$\x00W\x00\x03\x00%\x00U\x00D\x00Q\x00F\x00K\t\n\x00\x19\t\n

receipt1.pdf
IMHO the parser does not parse the font correctly and does not load the translate tables.
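
For illustration, here is a minimal sketch of what a translate table would do with output like the one above. The bytes look like raw 2-byte glyph codes emitted without any mapping. The table entries below are an assumption inferred from a single fragment of that output (each code appears to sit 29 below the ASCII value of the intended character); they are not read from the PDF's actual ToUnicode CMap.

```php
<?php
// One fragment copied from the output above: seven 2-byte codes.
$raw = "\x005\x00H\x00F\x00H\x00L\x00S\x00W";

// Hypothetical translate table (code => character), inferred from the sample only.
$table = [
    0x35 => 'R', 0x48 => 'e', 0x46 => 'c',
    0x4C => 'i', 0x53 => 'p', 0x57 => 't',
];

$text = '';
foreach (str_split($raw, 2) as $pair) {
    $code = unpack('n', $pair)[1];   // read a big-endian 16-bit code
    $text .= $table[$code] ?? '?';   // codes missing from the table become "?"
}

echo $text, "\n"; // Receipt
```

With an empty table every lookup misses, which matches the symptom: the codes come through untranslated.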

k00ni added the bug label Dec 2, 2022
@Uplink03

Uplink03 commented Jan 18, 2023

I've been having a similar problem with a PDF, but it turns out it's different from yours, @mapexpert. The problem in your case is in how Font::loadTranslateTable tries to build the Unicode table: it ends up with an empty table. Because of that, Font::decodeContentByToUnicodeCMapOrDescendantFonts also fails. Not that this method is guaranteed to work anyway, as it carries this comment: "@todo Seems this is invalid algorithm that do not follow pdf-format specification. Must be rewritten."

My similar problem was a lot easier to fix, as it was caused by Encoding::init. If I replace this:

if ($this->has('BaseEncoding')) {
    $this->encoding = EncodingLocator::getEncoding($this->getEncodingClass())->getTranslations();

    // the code that loads Differences
}

with

if ($this->has('BaseEncoding')) {
    $this->encoding = EncodingLocator::getEncoding($this->getEncodingClass())->getTranslations();
}

// the code that loads Differences

which is basically to take // the code that loads Differences out of the big if block, then my problem is fixed. This seems to be the problem described in #462 .
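
To make the idea concrete, here is a standalone sketch of why the Differences array should be applied whether or not a BaseEncoding is present. `buildEncodingTable` is a hypothetical helper written for this comment, not part of PdfParser; it only assumes the PDF layout of a Differences array, where a code is followed by one or more glyph names that take consecutive codes.

```php
<?php
/**
 * Hypothetical helper (not PdfParser's API): build a code => glyph-name map.
 * $base is the BaseEncoding table, or null when the dictionary has none.
 */
function buildEncodingTable(?array $base, array $differences): array
{
    $table = $base ?? [];
    $code = 0;
    foreach ($differences as $item) {
        if (is_int($item)) {
            $code = $item;            // an integer starts a new run at that code
        } else {
            $table[$code++] = $item;  // names take consecutive codes
        }
    }

    return $table;
}

// No BaseEncoding at all: the Differences entries must still be applied.
$noBase = buildEncodingTable(null, [32, 'space', 65, 'A', 'B']);

// With a base table, Differences override the matching codes.
$withBase = buildEncodingTable([65 => 'X'], [65, 'A']);
```

If the Differences loop only ran when a base table exists, `$noBase` would come back empty.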

I'm not capable enough at the moment to understand how to fix the problem you're having though.

I gave your PDF to pdf2txt.py (from the pdfminer Python project) and to pdftotext (from the poppler-utils package on Ubuntu 22.04), and they both barfed at it, while both decoded my own PDF's text just fine.

I'm writing this hoping it will give someone a hint of where to look when they make an attempt to fix this.

@NazarSolovei

@mapexpert, have you solved the issue?

@mapexpert
Author

> @mapexpert, have you solved the issue?

No, I did not. I still have the issue.

@k00ni
Collaborator

k00ni commented Jul 10, 2023

@mapexpert can you try @Uplink03's suggestion and let us know whether it works: #564 (comment)

@mapexpert
Author

I tried @Uplink03's suggestion and it does not work in my case.

@GreyWyvern
Contributor

GreyWyvern commented Jul 10, 2023

Definitely something weird going on here with fonts. If I save the file as a reduced size PDF in Acrobat the text issue remains. If I select all the text in the PDF in Adobe Acrobat, convert the font to Arial, then save the file, PdfParser parses the text properly.

Edit: The translate table used by translateChar() in Font.php is indeed empty. In loadTranslateTable(), the call that fetches $content ($this->get('ToUnicode')->getContent()) returns an undecoded binary string instead of the plain text that the rest of the function clearly expects, judging by its preg_match_all() calls. Any idea what this encoding might be?

x�\�ϊ�@�����a�=�����Y������}�Ğ�Cc��{����ת�����L�'��U}���n�Wߧ�9�9�t};���>5)?�k�g��ۮ��>���4f����q����_�l��W?���yz�/�v8�O���Ԧ���˯��<����w��~΋L$o�e��/������{=���n~�.g����cL�����4C���Iө��lS���"ג����:��:_>|��B���t�R��y���tk�j1]��^LWk�,�t;�J1�g���n��6		��^L*Qb1ɡ�Q !P�Q !P�����JL
(mŤ�����&Q� a��NLڢ�c���8�@%L�y1=2:M�=2:�K���.�#�[���Ubzdt��zdt�����^L����c�T��������%���Zb1��)���HQLFFZ���H���������j1��	�R�K�^̀ux�3�U��h�G1�Z�X��Ъ��h�c�j@�^�l0�'�k1��_�m�7��b�,��1��ƺ.��4vb�,�Ï�1��1	�bFL�����`�K���WbFL��bFL�u�0b��
���{�߅������]���4-7���z�>/׮O��`���|��������
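
One possible reading, offered as an assumption rather than a confirmed diagnosis: a stream that starts with the byte `x` (0x78) followed by binary data is consistent with zlib compression, which PDF uses for streams marked with the FlateDecode filter. The sketch below builds its own compressed sample from a made-up CMap fragment, inflates it with `gzuncompress()` (needs PHP's zlib extension), and then runs the kind of `preg_match_all()` that expects plain text. `decodeStreamIfCompressed` and the sample entries are hypothetical and are not PdfParser code.

```php
<?php
// Made-up ToUnicode fragment, used only to produce test input.
$cmapText = "2 beginbfchar\n<0035> <0052>\n<0048> <0065>\nendbfchar\n";

// Simulate a stream stored with FlateDecode (zlib format).
$rawStream = gzcompress($cmapText);

/**
 * Hypothetical helper: inflate the stream when it looks zlib-compressed,
 * otherwise return it unchanged.
 */
function decodeStreamIfCompressed(string $raw): string
{
    // zlib data written with the usual settings starts with 0x78 ("x").
    if ('' !== $raw && "\x78" === $raw[0]) {
        $inflated = @gzuncompress($raw);
        if (false !== $inflated) {
            return $inflated;
        }
    }

    return $raw;
}

$decoded = decodeStreamIfCompressed($rawStream);

// Once the text is plain, hex pairs such as "<0035> <0052>" can be matched.
preg_match_all('/<([0-9a-f]{4})>\s*<([0-9a-f]{4})>/i', $decoded, $matches, PREG_SET_ORDER);

$table = [];
foreach ($matches as $pair) {
    $table[hexdec($pair[1])] = hexdec($pair[2]); // source code => Unicode code point
}
```

Running the same regular expression on `$rawStream` directly would find nothing, which would leave the translate table empty.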

@GreyWyvern
Contributor

This appears to be fixed in the latest release v2.7.0. Although a lot of spacing issues remain, the text is extracted successfully.

@ephrin

ephrin commented Sep 8, 2023

@GreyWyvern @mapexpert close this then?

@mapexpert
Author

> @GreyWyvern @mapexpert close this then?

Yes

k00ni closed this as completed Sep 11, 2023