Solved: getText() returns some portions of the PDF with "unintelligible" text #389
Comments
Thank you for the report and a potential solution! Can you please run tests using 0.18.1 with your changes and get back to me? Also if you can't provide a non-proprietary example PDF, could you use |
PHP version: 7.3
I could not find a sample PDF that is affected by this bug, and of course I cannot edit the affected PDFs to remove proprietary information such as account numbers. However, here are the relevant encoded and decoded object sections from the PDFs in question:
Objects in question:
Decoded sample BT section from the PDF whose Unicode bytecodes were not translated, because the translate tables were not loading from the later Font objects:
BT 12.000 697.425 Td /F0201 10.000 Tf ( !"#$%!&'(("!))&&&) Tj ET
Font object 28_0, direct from the PDF:
28 0 obj
Still-encoded 29_0 Unicode mapping object, direct from the PDF:
29 0 obj
Decoded 29_0 Unicode mapping object in the PDF. This is the object with no spaces between elements in the BFCHAR subsection, which fails the regular expression in Font.php line 167:
/CIDInit /ProcSet findresource begin
The 30_0 FontDescriptor and its 31_0 FontFile2 also contain the much larger full mapping table, which likewise has no space between the elements on every line of the BFCHAR subsection. Direct from the PDF:
30 0 obj
Decoded object 31_0 (partial). This is another object with no spaces between elements in the BFCHAR subsection, which fails the regular expression in Font.php line 167:
/CIDInit /ProcSet findresource begin |
Hi @TheCyberMike |
I had a problem getting text from various PDF invoice files from the same source. Some portions of the text would contain "unintelligible" character strings, which are in fact just undecoded Unicode bytecodes. Similar problems have been reported several times in these issues. I don't have a non-proprietary example PDF to share, but I did find the bug in the PDFParser code and corrected it in my implementation, and it's now parsing all these PDFs.
I successfully debugged and solved the problem, at least for my PDFs. The PDF has a /Font object that uses a ToUnicode element. The translate table object was properly included; however, its BFCHAR section had the following subset of mappings:
<22><0072>
<23><0076>
<24><0069> ...
Note there is NO space between the from and to components in each row. Most examples online of the mapping table structure and contents DO have a space between the two components.
In Font.php line 167, the regular expression currently is:
'/<(?P<from>[0-9A-F]+)> +<(?P<to>[0-9A-F]+)>[ \r\n]+/is'
Note the space-plus in the middle of the expression ... that means 1 or more spaces are allowed, but not ZERO spaces.
A simple change to the regular expression fixed the decoding problem by allowing the translate table to load:
'/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)>[ \r\n]+/is'
Now zero, one, or more spaces are allowed. The translate table actually loads. All the text gets properly decoded from the PDF.
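For anyone who wants to check this without a proprietary PDF, here is a minimal standalone sketch that runs both patterns against the three BFCHAR rows quoted above. It is not PDFParser code: the variable names and the `$table` lookup array are hypothetical, made up for illustration only, and the real loadTranslateTable() may store its mappings differently.

```php
<?php
// BFCHAR rows copied from this issue: no space between <from> and <to>.
$bfchar = "<22><0072>\n<23><0076>\n<24><0069>\n";

// Pattern as quoted from Font.php line 167 (needs at least one space).
$strict = '/<(?P<from>[0-9A-F]+)> +<(?P<to>[0-9A-F]+)>[ \r\n]+/is';
// Proposed pattern (zero or more spaces).
$relaxed = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)>[ \r\n]+/is';

// preg_match_all() returns the number of full matches found.
$strictCount = preg_match_all($strict, $bfchar, $strictMatches);
$relaxedCount = preg_match_all($relaxed, $bfchar, $relaxedMatches);

// Hypothetical lookup table: source code (int) => Unicode code point (int).
$table = [];
foreach ($relaxedMatches['from'] as $i => $from) {
    $table[hexdec($from)] = hexdec($relaxedMatches['to'][$i]);
}

printf("strict: %d rows, relaxed: %d rows\n", $strictCount, $relaxedCount);
foreach ($table as $code => $codePoint) {
    printf("0x%02X => U+%04X\n", $code, $codePoint);
}
```

With the space-plus pattern none of the rows match, so the table stays empty; with space-asterisk all three rows are picked up.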
Note that in the same loadTranslateTable() function in Font.php, the other regular expressions on lines 151 and 193 already use space-asterisk instead of space-plus, so they should work fine with PDFs whose /Font translate tables have no spaces between the mapping elements. This change may also resolve some of the other reports of this issue.
Solution: change Font.php line 167 to
$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)>[ \r\n]+/is';