Font Fallback Issue #657
Comments
@GreyWyvern this one may interest you. I was just thinking to make the mentioned parameter of getTJUsingFontFallback also accepting null. But further research might be needed here.
It would be useful to see the data from the PDF in question. Any of a number of things might be happening. The document might be trying to define a font that PdfParser doesn't accept, or a mismatched set of I would definitely want to see what was happening before allowing
@GreyWyvern - I understand, but as stated, the PDF in question is my payslip, and I wouldn't be comfortable sharing that document. Perhaps I can try and edit some key values and see if the issue still exists, then I would be happy to share. I'll try and come back to you.
EDIT: And of course now it's working, so no clue what was wrong before. But it does happen on other documents, which I also cannot share. Perhaps I can help. I have the same issue with the
Yep, your file is working for me too in 2.8.0-RC2. :( If you can figure out how to get it to display the error using a PDF you can post, please share!
Hi!
Running
I just tried it and you're right. I don't know what happened. I'll post a new comment with more details if it happens again.
Is this issue solved now? @bleigh-gemnisw and @paytah232, please give us a short ping.
@k00ni I still have files that it occurs in but unfortunately cannot share them for troubleshooting. I'm of the opinion that your previous suggestion, "I was just thinking to make the mentioned parameter of getTJUsingFontFallback also accepting null. But further research might be needed here.", is the solution. It allows files with the problem to not error out without having to know what's wrong with their font, and it shouldn't interfere with anything else as long as downstream code is made to handle the same condition. Then I can deal with those files as needed on the backend by analyzing the produced JSON (i.e. giving it a default or replacing whatever bad font is causing it). As it stands I can't process those files at all.
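To make the suggestion above concrete, here is a minimal sketch of what a null-tolerant fallback could look like. It does not use PdfParser's real classes: `StubFont` and `decodeWithFontFallback()` are hypothetical names invented for illustration, and the "decoding" is a placeholder. It only shows the shape of the change (a nullable parameter plus a flag that downstream code can check).

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a font object; NOT Smalot\PdfParser\Font.
final class StubFont
{
    public function decodeText(string $raw): string
    {
        // Placeholder for a real decoding step.
        return strtoupper($raw);
    }
}

/**
 * Sketch of a fallback that tolerates a missing font.
 * Because the parameter is nullable (?StubFont), passing null no longer
 * raises a TypeError; the raw text is returned undecoded and flagged.
 */
function decodeWithFontFallback(?StubFont $font, string $raw): array
{
    if ($font === null) {
        return ['text' => $raw, 'fontMissing' => true];
    }

    return ['text' => $font->decodeText($raw), 'fontMissing' => false];
}

$withFont = decodeWithFontFallback(new StubFont(), 'abc');
$withoutFont = decodeWithFontFallback(null, 'abc');
```

With a flag like `fontMissing` in the output, a backend could decide per entry whether to substitute a default font or skip the text, which is the downstream handling described above.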
I suspect this might be another inline image issue, the same as #691, where binary image data containing 'q' or 'Q' is unbalancing the stored state of the document, which includes fonts. @bleigh-gemnisw if it is at all possible to send the affected PDFs to bhuisman at greywyvern dot com so I can verify this privately, I'd appreciate it.
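For readers unfamiliar with the failure mode described above, here is a simplified sketch of how a stray 'Q' inside inline image data can unbalance a save/restore count. This is not PdfParser's code: the function names and the sample stream are made up for illustration, and the regex that skips from ID to EI is a simplification (real binary data can itself contain something that looks like EI).

```php
<?php
declare(strict_types=1);

/** Net q/Q depth when every standalone token is counted (naive). */
function naiveDepth(string $stream): int
{
    $depth = 0;
    foreach (preg_split('/\s+/', trim($stream)) as $token) {
        if ($token === 'q') {
            $depth++; // save graphics state
        } elseif ($token === 'Q') {
            $depth--; // restore graphics state
        }
    }

    return $depth;
}

/** Same count, but with the bytes between ID and EI blanked out first. */
function depthSkippingInlineImages(string $stream): int
{
    $withoutImages = preg_replace('/\bID\s.*?\sEI\b/s', 'ID EI', $stream) ?? $stream;

    return naiveDepth($withoutImages);
}

// Made-up content stream: one q ... Q pair, with an inline image whose
// "binary" payload happens to contain a standalone Q.
$stream = "q BT /F1 12 Tf (Hi) Tj ET BI /W 1 /H 1 ID \x01 Q \x02 EI Q";

$naive = naiveDepth($stream);                  // the Q in the image data is counted
$skipped = depthSkippingInlineImages($stream); // image data is ignored
```

In this sample the naive count ends at -1 while the image-aware count ends at 0. An extra restore like that is the kind of imbalance that could leave a parser with no current font when it reaches the next text operator.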
@k00ni @GreyWyvern - Sorry for being absent from this for so long, but whatever was causing my files not to work now seems to be resolved when running on v2.11. Both of the examples I have still have a very interesting looking text output (i.e. the encoding seems odd - mostly legible, but weird - characters swapped, missing or just wrong), but it now at least outputs the data from getDataTm() without erroring out. In its current state, this is now usable for me on those original documents, but I understand others like @bleigh-gemnisw may still be having other issues. I did also try it on a graphic-heavy NRMA insurance certificate, and it died stating an infinite loop. I'm assuming this is due to the complexity, rather than the content, but I do not know. I have a small snippet if it is at all helpful: This seems to come from the data and dies in FilterHelper.php (according to my log):
@k00ni @GreyWyvern - I ran the original payslip (left) pdf through an online editor (right) to redact key data, see the image below: For whatever reason, after running through the online editor, the content at least makes sense - there are no weird characters anymore, that are visible on the left side. As I have now redacted private data, I have attached the edited file. Perhaps it may be helpful to identify a cause, or possibly understand why the outputs are so different in 'seemingly random' places. Ran on v2.11
Description:
PDF input
Personal payslip, so unable to provide, but will do what I can
Expected output & actual output
Get text seems to work, although there is some odd encoding here or there.
When trying to run getDataTm, it fails - seems it's due to a font issue.
Fatal error: Uncaught TypeError: Smalot\PdfParser\PDFObject::getTJUsingFontFallback(): Argument #1 ($font) must be of type Smalot\PdfParser\Font, null given, called in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 531 and defined in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 252
TypeError: Smalot\PdfParser\PDFObject::getTJUsingFontFallback(): Argument #1 ($font) must be of type Smalot\PdfParser\Font, null given, called in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 531 in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 252
Call Stack:
0.0019 370824 1. {main}() /volume1/web/devel/scripts/testing/pdf.php:0
0.1753 1337680 2. Smalot\PdfParser\Page->getDataTm($dataCommands = ???) /volume1/web/devel/scripts/testing/pdf.php:25
0.1861 1510200 3. Smalot\PdfParser\Page->getTextArray($page = ???) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php:701
0.1861 1547256 4. Smalot\PdfParser\PDFObject->getTextArray($page = class Smalot\PdfParser\Page { protected $document = class Smalot\PdfParser\Document { protected $objects = [...]; protected $dictionary = [...]; protected $trailer = class Smalot\PdfParser\Header { ... }; protected $metadata = [...]; protected $details = [...] }; protected $header = class Smalot\PdfParser\Header { protected $document = class Smalot\PdfParser\Document { ... }; protected $elements = [...] }; protected $content = ''; protected $config = class Smalot\PdfParser\Config { private $fontSpaceLimit = -50; private $horizontalOffset = ' '; private $pdfWhitespaces = '\000\t\n\f\r '; private $pdfWhitespacesRegex = '[\\0\\t\\n\\f\\r ]'; private $retainImageContent = TRUE; private $decodeMemoryLimit = 0; private $dataTmFontInfoHasToBeIncluded = TRUE }; protected $fonts = []; protected $xobjects = NULL; protected $dataTm = NULL }) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php:365
0.1900 1578944 5. Smalot\PdfParser\PDFObject->getTJUsingFontFallback($font = NULL, $command = [0 => ['t' => '(', 'o' => '\'', 'c' => '\000,']], $page = class Smalot\PdfParser\Page { protected $document = class Smalot\PdfParser\Document { protected $objects = [...]; protected $dictionary = [...]; protected $trailer = class Smalot\PdfParser\Header { ... }; protected $metadata = [...]; protected $details = [...] }; protected $header = class Smalot\PdfParser\Header { protected $document = class Smalot\PdfParser\Document { ... }; protected $elements = [...] }; protected $content = ''; protected $config = class Smalot\PdfParser\Config { private $fontSpaceLimit = -50; private $horizontalOffset = ' '; private $pdfWhitespaces = '\000\t\n\f\r '; private $pdfWhitespacesRegex = '[\\0\\t\\n\\f\\r ]'; private $retainImageContent = TRUE; private $decodeMemoryLimit = 0; private $dataTmFontInfoHasToBeIncluded = TRUE }; protected $fonts = []; protected $xobjects = NULL; protected $dataTm = NULL }) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php:531
It does work on another invoice I have, just not this payslip.
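Since the TypeError above is uncaught, it stops the whole script. As a stopgap (not a fix for the underlying font problem), a caller could catch it per page and keep going. The sketch below is an assumption-laden illustration: `collectDataTm()` is a hypothetical helper, and the two anonymous classes are stubs standing in for page objects so the example runs without the library.

```php
<?php
declare(strict_types=1);

/**
 * Collect getDataTm() output page by page, recording pages that throw a
 * TypeError instead of aborting the whole run.
 *
 * @param iterable $pages objects exposing a getDataTm() method
 * @return array{data: array, failed: array}
 */
function collectDataTm(iterable $pages): array
{
    $result = ['data' => [], 'failed' => []];
    foreach ($pages as $index => $page) {
        try {
            $result['data'][$index] = $page->getDataTm();
        } catch (\TypeError $e) {
            // Keep the message so the affected page can be inspected later.
            $result['failed'][$index] = $e->getMessage();
        }
    }

    return $result;
}

// Stub pages (hypothetical); a real run would pass the parsed document's pages.
$goodPage = new class {
    public function getDataTm(): array
    {
        return [[[1, 0, 0, 1, 10, 20], 'text']];
    }
};
$badPage = new class {
    public function getDataTm(): array
    {
        throw new \TypeError('Argument #1 ($font) must be of type Font, null given');
    }
};

$report = collectDataTm([$goodPage, $badPage]);
```

With the library installed, the same helper should accept the array returned by `$pdf->getPages()` on a parsed document, so pages that parse cleanly are still returned while the failing ones are listed under `failed`.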
Code
use Smalot\PdfParser\Parser;
use Smalot\PdfParser\Config;