Solved: getText() returns some portions of the PDF with "unintelligible" text #389

Closed
TheCyberMike opened this issue Feb 12, 2021 · 3 comments · Fixed by #401

Comments


TheCyberMike commented Feb 12, 2021

I had a problem extracting text from several PDF invoice files from the same source. Some portions of the text contained "unintelligible" character strings, which are in fact just character codes that were never decoded to Unicode. Similar problems have been reported several times in these Issues. I don't have a non-proprietary example PDF to share, but I found the bug in the PDFParser code and corrected it in my copy, and it is now parsing all these PDFs.

I debugged and solved the problem for at least my PDFs. The PDF has a /Font object that uses a ToUnicode entry. The translate table object was properly included; however, its BFCHAR section had the following subset of mappings:
<22><0072>
<23><0076>
<24><0069> ...
Note there is NO space between the from and to components in each row. Most examples online of the mapping table structure and contents DO have a space between the two components.

In Font.php line 167, the regular expression currently is:
'/<(?P<from>[0-9A-F]+)> +<(?P<to>[0-9A-F]+)>[ \r\n]+/is'
Note the space-plus in the middle of the expression: it allows one or more spaces, but not ZERO spaces.
A simple change to the regular expression fixed the decoding problem by allowing the translate table to load:
'/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)>[ \r\n]+/is'
Now zero, one, or more spaces are allowed. The translate table actually loads. All the text gets properly decoded from the PDF.
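
To make the difference concrete, here is a minimal sketch that runs both patterns against the no-space rows quoted above. It is written in Python rather than PHP purely for illustration; the named-group syntax is the same in both engines, and the PCRE flags `i` and `s` are mapped to `re.I` and `re.S`. The helper name `pairs` and the `with_space` sample are assumptions made for this example, not part of PDFParser.

```python
import re

# Python ports of the two patterns quoted above (old: space-plus, new: space-asterisk).
OLD = re.compile(r'<(?P<from>[0-9A-F]+)> +<(?P<to>[0-9A-F]+)>[ \r\n]+', re.I | re.S)
NEW = re.compile(r'<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)>[ \r\n]+', re.I | re.S)

# Rows as they appear in the affected PDFs (no space between the components).
no_space = "<22><0072>\n<23><0076>\n<24><0069>\n"
# Hypothetical rows in the more common layout, with a space between the components.
with_space = "<22> <0072>\n<23> <0076>\n<24> <0069>\n"


def pairs(pattern, section):
    """Return every (from, to) pair the pattern finds in a BFCHAR section."""
    return [(m.group("from"), m.group("to")) for m in pattern.finditer(section)]


print(pairs(OLD, no_space))    # the old pattern finds nothing here
print(pairs(NEW, no_space))    # the new pattern finds all three rows
print(pairs(NEW, with_space))  # rows with a space still match
```

With the old pattern the table stays empty for these PDFs, which is why the text came out undecoded; the relaxed pattern accepts both layouts.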

Note that in the same loadTranslateTable() function in Font.php, the other regular expressions on lines 151 and 193 already use space-asterisk instead of space-plus, so they should work fine with PDFs whose /Font translate tables have no spaces between the mapping elements. This change may also resolve some of the other reports of this issue.

Solution: change Font.php line 167 to
$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)>[ \r\n]+/is';

Collaborator

k00ni commented Feb 12, 2021

Thank you for the report and a potential solution!

Can you please run tests using 0.18.1 with your changes and get back to me?

Also if you can't provide a non-proprietary example PDF, could you use echo right before the function and paste some binary PDF output here? That could also help to write at least a basic test for it.

Author

TheCyberMike commented Feb 12, 2021

PHP version: 7.3
PDFParser version: 0.18.1
So my own testing is on the latest code base.

I could not find a shareable sample PDF that is affected by this bug, and of course I cannot edit the PDF to remove the proprietary information such as account numbers. However, here are the relevant encoded and decoded object sections from the PDFs in question:

Objects in question:

28_0:  Font  points to objects 29_0 and 30_0
29_0:  Unicode Mapping (contains the <from><to> mappings without spaces)
30_0:  FontDescriptor
31_0:  FontFile2 (contains the <from><to> mappings without spaces)

Decoded sample BT sections from the PDF whose character codes were not translated to Unicode, since the translate tables from the Font objects shown further down were not loading:

BT 12.000 697.425 Td /F0201 10.000 Tf ( !"#$%!&'(("!))&&&) Tj ET
BT 453.076 697.425 Td /F0201 10.000 Tf (07#8$%!&9:;!
) Tj ET
BT 561.074 697.425 Td /F0201 10.000 Tf (<=>,+>?=) Tj ET
BT 12.000 682.177 Td /F0201 10.000 Tf ('%%8@7;&1@AB!"*&&&) Tj ET

Font object 28_0 direct from the PDF:

28 0 obj
<< /Type /Font
/Subtype /TrueType
/BaseFont /ArialMT
/FirstChar 32
/LastChar 255
/Widths [ 667 556 333 500 222 500 278 667 556 500 278 556 556 556 722 833 278 722 611 778 611 722 667 556 556 722 556 278 556 556 278 556
556 833 556 667 556 556 667 556 556 333 556 556 278 278 944 278 722 1015 556 584 556 222 278 667 667 722 778 584 500 500 722 556
500 889 389 333 333 500 556 556 222 778 556 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750
750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750
750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750
750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750
750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 ]
/ToUnicode 29 0 R
/FontDescriptor 30 0 R
/Name /F0201 >>
endobj

Still encoded 29_0 Unicode Mapping object direct from the PDF:

29 0 obj
<< /Length 546 /Filter [ /FlateDecode ] >>
stream
[546 bytes of FlateDecode-compressed binary data; not representable as text]
endstream
endobj
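
For anyone wanting to reproduce the decoding step, a /FlateDecode stream is zlib-compressed data, so it can be inflated with a standard zlib library. The sketch below is a rough illustration in Python, assuming no predictor or additional filters are applied; `decode_flate_stream` is a hypothetical helper name, and the sample bytes are generated in the example itself because the real stream cannot be pasted reliably as text.

```python
import zlib


def decode_flate_stream(raw: bytes) -> str:
    """Inflate the bytes found between 'stream' and 'endstream' of a /FlateDecode object."""
    # latin-1 maps every byte to one character, so no decode error can occur here.
    return zlib.decompress(raw).decode("latin-1")


# Stand-in stream built for this example; these are NOT the bytes from the invoice PDF.
sample_cmap = "2 beginbfchar\n<20><2013>\n<21><25CF>\nendbfchar\n"
raw_stream = zlib.compress(sample_cmap.encode("latin-1"))

decoded = decode_flate_stream(raw_stream)
print(decoded)
```

The raw bytes have to be read from the file in binary mode; copying them through a text field, as in the dump above, corrupts them.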

Decoded 29_0 Unicode Mapping object from the PDF. This is the object whose BFCHAR subsection has no spaces between the elements and therefore fails the regular expression in Font.php line 167:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-UCS-000 def
/CMapVersion 1.000 def
/CMapType 2 def
1 begincodespacerange
<00>
endcodespacerange
2 beginbfchar
<20><2013>
<21><25CF>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

The 30_0 FontDescriptor's 31_0 FontFile2 object also has the much larger full mapping table, and it is likewise missing the space between the from and to components on every line in the BFCHAR subsection.

Direct from the PDF:

30 0 obj
<< /Type /FontDescriptor
/FontName /ArialMT
/FontBBox [-665 -325 2000 1006]
/Ascent 905
/Descent -212
/CapHeight 660
/StemV 110
/ItalicAngle 0
/MissingWidth 278
/Flags 4
/Style << /Panose <080502110604020202020204> >>
/FontFile2 31 0 R >>
endobj

Decoded object 31_0 (partial). This is another object whose BFCHAR subsection has no spaces between the elements and therefore fails the regular expression in Font.php line 167:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-UCS-000 def
/CMapVersion 1.000 def
/CMapType 2 def
1 begincodespacerange
<00>
endcodespacerange
75 beginbfchar
<20><0053>
<21><0065>
<22><0072>
<23><0076>
<24><0069>
<25><0063>
<26><0020>
<27><0041>
<28><0064>
<29><0073>
<2A><003A>
<2B><0034>
<2C><0032>
<2D><0036>
<2E><0043>
<2F><004D>
<30><0049>
<31><004E>
<32><0054>
<33><004F>
<34><0046>
<35><0055>
<36><0045>
<37><006E>
<38><006F>
<39><0044>
<3A><0061>
<3B><0074>
<3C><0030>
<3D><0039>
<3E><002F>
<3F><0031>
<40><0075>
<41><006D>
<42><0062>
:
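
As a sanity check on the mapping rows listed above, the following sketch loads an excerpt of them with the relaxed pattern and applies the resulting table to the first BT sample string from earlier in this comment. It is a Python illustration, not PDFParser's implementation: `load_translate_table` and `decode_text` are hypothetical names, the excerpt keeps only the ten rows that string needs (hence the count of 10), and it assumes single-byte source codes that each map to one UTF-16 code unit.

```python
import re

BFCHAR_SECTION = re.compile(r'beginbfchar(?P<body>.*?)endbfchar', re.S)
# Relaxed row pattern from the proposed fix: zero or more spaces between the components.
BFCHAR_ROW = re.compile(r'<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)>[ \r\n]+', re.I)


def load_translate_table(cmap: str) -> dict:
    """Build {source code: Unicode character} from the BFCHAR sections of a CMap."""
    table = {}
    for section in BFCHAR_SECTION.finditer(cmap):
        for row in BFCHAR_ROW.finditer(section.group("body")):
            # Assumption: each target is a single UTF-16 code unit.
            table[int(row.group("from"), 16)] = chr(int(row.group("to"), 16))
    return table


def decode_text(raw: str, table: dict) -> str:
    """Translate each character code through the table; unmapped codes pass through."""
    return "".join(table.get(ord(ch), ch) for ch in raw)


# Excerpt of the rows above, in the same no-space layout.
cmap = """10 beginbfchar
<20><0053>
<21><0065>
<22><0072>
<23><0076>
<24><0069>
<25><0063>
<26><0020>
<27><0041>
<28><0064>
<29><0073>
endbfchar
"""

table = load_translate_table(cmap)
raw = ' !"#$%!&\'(("!))&&&'  # contents of the first Tj string in the BT samples
print(repr(decode_text(raw, table)))
```

Once the table loads, the "unintelligible" string turns into ordinary readable text, which matches the behaviour described at the top of this issue.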

Owner

smalot commented Mar 9, 2021

Hi @TheCyberMike
Can you try using the "issue-387" branch?

@k00ni k00ni linked a pull request Mar 10, 2021 that will close this issue