Does not parse text from pdf file #564

mapexpert · 2022-11-30T14:07:55Z

$content = Storage::get('iaa/receipt1.pdf');
$parser = new \Smalot\PdfParser\Parser;
$data = $parser->parseContent($content);
dd($data->getText());

Output:
b"%\t\n\x00X\x00\x00H\x00U\x00\x03\x005\x00H\x00F\x00H\x00L\x00S\x00W\t\n\x005\t\n\n\x00H\x00F\x00H\x00L\x00S\x00W\x00\x03\x00\x06\t\n\x00\x14\t\n \x00\x1C\x00\x19\x00\x17\x00\x1A\x00\x19\x00\x17\x00\x13\t\n\x005\t\n\n\x00H\x00F\x00H\x00L\x00S\x00W\x00\x03\x00'\x00D\x00W\x00H\t\n\x00\x14\t\n \x00\x14\x00\x12\x00\x15\x00\x16\x00\x12\x00\x15\x00\x13\x00\x15\x00\x15\t\n\x005\t\n\n\x00H\x00F\x00H\x00L\x00Y\x00H\x00G\x00\x03\x00%\x00\t\n\x00$\t\n \x006\x00$\x003\x00\x03\x006\t\n\x006\t\n\n\x00R\x00O\x00G\x00\x03\x00$\x00W\x00\x03\x00%\x00U\x00D\x00Q\x00F\x00K\t\n\x00\x19\t\n

receipt1.pdf
IMHO the parser does not handle the font correctly and does not load the translate tables.


Uplink03 · 2023-01-18T03:41:38Z

I've been having a similar problem with a PDF, but it turns out to be different from yours, @mapexpert. In your case the problem is how Font::loadTranslateTable tries to work out the Unicode table: it ends up with an empty table, and because of that Font::decodeContentByToUnicodeCMapOrDescendantFonts also fails. Even with a populated table that method is not guaranteed to work, as it carries this comment: @todo Seems this is invalid algorithm that do not follow pdf-format specification. Must be rewritten.
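
To illustrate why an empty table matters, here is a minimal sketch, not the library's actual code. buildTranslateTable() and translate() are hypothetical helpers. The sketch only handles bfchar entries, assumes two-byte character codes, assumes that codes missing from the table pass through unchanged, and needs the mbstring extension.

```php
<?php
// Hypothetical helper: build a code => character table from the
// "beginbfchar ... endbfchar" sections of a plain-text ToUnicode CMap.
function buildTranslateTable(string $cmap): array
{
    $table = [];
    preg_match_all('/beginbfchar(.*?)endbfchar/s', $cmap, $sections);
    foreach ($sections[1] as $section) {
        // Entries look like "<0001> <0048>": source code, then UTF-16BE target.
        preg_match_all('/<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>/', $section, $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            $table[hexdec($pair[1])] = mb_convert_encoding(hex2bin($pair[2]), 'UTF-8', 'UTF-16BE');
        }
    }
    return $table;
}

// Hypothetical helper: translate a string of two-byte codes with the table.
// Codes missing from the table are passed through as raw bytes (an assumption).
function translate(string $bytes, array $table): string
{
    $out = '';
    foreach (str_split($bytes, 2) as $chunk) {
        $code = hexdec(bin2hex($chunk));
        $out .= $table[$code] ?? $chunk;
    }
    return $out;
}

// Made-up CMap and text, for illustration only.
$cmap = "2 beginbfchar\n<0001> <0048>\n<0002> <0069>\nendbfchar";
$text = "\x00\x01\x00\x02";

$withTable = translate($text, buildTranslateTable($cmap)); // readable text
$withoutTable = translate($text, []);                      // raw codes, unchanged
```

With an empty table every code falls through untouched, which is consistent with the \x00-prefixed bytes in the output at the top of this issue.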

My similar problem was a lot easier to fix, as it was caused by Encoding::init. If I replace this:

if ($this->has('BaseEncoding')) {
    $this->encoding = EncodingLocator::getEncoding($this->getEncodingClass())->getTranslations();

    // the code that loads Differences
}

with

if ($this->has('BaseEncoding')) {
    $this->encoding = EncodingLocator::getEncoding($this->getEncodingClass())->getTranslations();
}

// the code that loads Differences

In other words, moving // the code that loads Differences out of the big if block fixes my problem. This seems to be the problem described in #462.
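
A minimal sketch of the idea, not the library's actual code: the Differences array is applied whether or not a base encoding exists. applyDifferences() is a hypothetical helper. It relies on the PDF convention that a Differences array is a flat list in which each integer sets the next code and each following glyph name is assigned to consecutive codes.

```php
<?php
// Hypothetical helper: overlay a PDF /Differences array on a (possibly empty)
// base encoding. Integers reset the current code; names fill consecutive codes.
function applyDifferences(array $base, array $differences): array
{
    $encoding = $base;
    $code = 0;
    foreach ($differences as $item) {
        if (is_int($item)) {
            $code = $item;            // start of a new run of codes
        } else {
            $encoding[$code] = $item; // glyph name for the current code
            $code++;
        }
    }
    return $encoding;
}

// No BaseEncoding at all: the differences still have to be honoured.
$noBase = applyDifferences([], [65, 'A', 'B', 97, 'a']);

// With a base encoding, only the listed codes are overridden.
$withBase = applyDifferences([65 => 'A', 66 => 'B'], [66, 'Euro']);
```

Keeping the overlay step outside the BaseEncoding check means a font that defines only Differences still gets a usable encoding.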

I'm not capable enough at the moment to understand how to fix the problem you're having though.

I gave your PDF to pdf2txt.py (from the pdfminer Python project) and to pdftotext (from the poppler-utils package on Ubuntu 22.04), and they both barfed at it, while both decoded my own PDF's text just fine.

I'm writing this hoping it will give someone a hint of where to look when they make an attempt to fix this.

NazarSolovei · 2023-04-02T18:32:35Z

@mapexpert, have you solved the issue?

mapexpert · 2023-07-09T09:00:20Z

@mapexpert, have you solved the issue?

No, I did not. I still have the issue.

k00ni · 2023-07-10T06:51:47Z

@mapexpert, can you try @Uplink03's suggestion and let us know whether it works? #564 (comment)

mapexpert · 2023-07-10T07:26:59Z

I tried @Uplink03's suggestion and it does not work in my case.

GreyWyvern · 2023-07-10T13:28:01Z

Definitely something weird going on here with fonts. If I save the file as a reduced-size PDF in Acrobat, the text issue remains. If I select all the text in the PDF in Adobe Acrobat, convert the font to Arial, and then save the file, PdfParser parses the text properly.

Edit: The translate table used by translateChar() in Font.php is indeed empty. In loadTranslateTable(), the call that fetches $content ($this->get('ToUnicode')->getContent()) returns an undecoded binary string instead of the plain text that the rest of the function clearly expects, judging by its preg_match_all() calls. Any idea what this encoding might be?

[Binary dump of the undecoded stream. The bytes did not survive as text; the dump begins with the byte "x".]
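
One possible reading, offered as an assumption rather than something this thread confirms: the dump above starts with the byte x (0x78), which is the usual first byte of a zlib stream, so this may be a FlateDecode-compressed stream that was never inflated. Below is a minimal sketch of how that could be checked with PHP's zlib extension. looksLikeZlib() is a hypothetical helper, and the sample CMap is made up to stand in for the real stream.

```php
<?php
// Hypothetical helper: does the data start with a plausible zlib header?
// Per RFC 1950, the low nibble of byte 0 is 8 (deflate), and the first two
// bytes, read as a big-endian 16-bit number, are a multiple of 31.
function looksLikeZlib(string $data): bool
{
    if (strlen($data) < 2) {
        return false;
    }
    $cmf = ord($data[0]);
    $flg = ord($data[1]);
    return ($cmf & 0x0F) === 8 && (($cmf << 8) | $flg) % 31 === 0;
}

// Made-up stand-in for the string returned by getContent().
$plain = "1 beginbfchar\n<0001> <0041>\nendbfchar";
$raw = gzcompress($plain);

$content = $raw;
if (looksLikeZlib($content)) {
    $inflated = @gzuncompress($content);
    if ($inflated !== false) {
        $content = $inflated; // plain text that regex-based parsing can read
    }
}
```

The header check alone is only a heuristic, which is why the sketch keeps the original bytes when gzuncompress() fails.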

GreyWyvern · 2023-08-10T18:14:12Z

This appears to be fixed in the latest release, v2.7.0. Although a lot of spacing issues remain, the text is extracted successfully.

ephrin · 2023-09-08T09:40:48Z

@GreyWyvern @mapexpert close this then?

mapexpert · 2023-09-08T09:43:04Z

@GreyWyvern @mapexpert close this then?

Yes

k00ni added the bug label Dec 2, 2022

GreyWyvern mentioned this issue Aug 10, 2023

PdfParser does not consider the entire document stream #628

Closed

GreyWyvern mentioned this issue Aug 18, 2023

Major Update to PDFObject.php + Ancillary #634

Merged

k00ni closed this as completed Sep 11, 2023
