Font Fallback Issue #657

paytah232 · 2023-12-16T13:15:18Z

PHP Version: 8.2.5
PDFParser Version: 2.7.0

Description:

PDF input

Personal payslip, so unable to provide, but will do what I can

Expected output & actual output

getText() seems to work, although there is some odd encoding here and there.
Running getDataTm() fails, apparently because of a font issue.

Fatal error: Uncaught TypeError: Smalot\PdfParser\PDFObject::getTJUsingFontFallback(): Argument #1 ($font) must be of type Smalot\PdfParser\Font, null given, called in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 531 and defined in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 252
TypeError: Smalot\PdfParser\PDFObject::getTJUsingFontFallback(): Argument #1 ($font) must be of type Smalot\PdfParser\Font, null given, called in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 531 in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 252
Call Stack:
0.0019 370824 1. {main}() /volume1/web/devel/scripts/testing/pdf.php:0
0.1753 1337680 2. Smalot\PdfParser\Page->getDataTm($dataCommands = ???) /volume1/web/devel/scripts/testing/pdf.php:25
0.1861 1510200 3. Smalot\PdfParser\Page->getTextArray($page = ???) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php:701
0.1861 1547256 4. Smalot\PdfParser\PDFObject->getTextArray($page = class Smalot\PdfParser\Page { protected $document = class Smalot\PdfParser\Document { protected $objects = [...]; protected $dictionary = [...]; protected $trailer = class Smalot\PdfParser\Header { ... }; protected $metadata = [...]; protected $details = [...] }; protected $header = class Smalot\PdfParser\Header { protected $document = class Smalot\PdfParser\Document { ... }; protected $elements = [...] }; protected $content = ''; protected $config = class Smalot\PdfParser\Config { private $fontSpaceLimit = -50; private $horizontalOffset = ' '; private $pdfWhitespaces = '\000\t\n\f\r '; private $pdfWhitespacesRegex = '[\\0\\t\\n\\f\\r ]'; private $retainImageContent = TRUE; private $decodeMemoryLimit = 0; private $dataTmFontInfoHasToBeIncluded = TRUE }; protected $fonts = []; protected $xobjects = NULL; protected $dataTm = NULL }) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php:365
0.1900 1578944 5. Smalot\PdfParser\PDFObject->getTJUsingFontFallback($font = NULL, $command = [0 => ['t' => '(', 'o' => '\'', 'c' => '\000,']], $page = class Smalot\PdfParser\Page { protected $document = class Smalot\PdfParser\Document { protected $objects = [...]; protected $dictionary = [...]; protected $trailer = class Smalot\PdfParser\Header { ... }; protected $metadata = [...]; protected $details = [...] }; protected $header = class Smalot\PdfParser\Header { protected $document = class Smalot\PdfParser\Document { ... }; protected $elements = [...] }; protected $content = ''; protected $config = class Smalot\PdfParser\Config { private $fontSpaceLimit = -50; private $horizontalOffset = ' '; private $pdfWhitespaces = '\000\t\n\f\r '; private $pdfWhitespacesRegex = '[\\0\\t\\n\\f\\r ]'; private $retainImageContent = TRUE; private $decodeMemoryLimit = 0; private $dataTmFontInfoHasToBeIncluded = TRUE }; protected $fonts = []; protected $xobjects = NULL; protected $dataTm = NULL }) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php:531

It does work on another invoice I have, just not this payslip.

Code

```php
use Smalot\PdfParser\Parser;
use Smalot\PdfParser\Config;

// Ask getDataTm() to include font information with each text element.
$config = new Config();
$config->setDataTmFontInfoHasToBeIncluded(true);
$parser = new Parser([], $config);

$pdf = $parser->parseFile('paySlip.pdf');
//$pdf = $parser->parseFile('Invoice INV-0007.pdf');

// Plain text extraction works on the payslip.
$text = $pdf->getText();

// $debugger is the reporter's own output helper, not part of PdfParser.
$debugger->force_out($text, 'Text');

$metaData = $pdf->getDetails();

$debugger->force_out($metaData, 'Meta');

$pages = $pdf->getPages();
$debugger->force_out($pages);

// This call throws the TypeError shown above on the payslip.
$pos = $pdf->getPages()[0]->getDataTm();

$debugger->force_out($pos, 'Data');
```


k00ni · 2023-12-17T09:42:12Z

@GreyWyvern this one may interest you.

I was just thinking of making the mentioned parameter of getTJUsingFontFallback also accept null. But further research might be needed here.

GreyWyvern · 2023-12-18T23:08:27Z

It would be useful to see the data from the PDF in question. Any of a number of things might be happening. The document might be trying to define a font that PdfParser doesn't accept, or a mismatched set of q and Q commands are leading to a null value for the current font, or... it could be a lot of things.

I would definitely want to see what was happening before allowing getTJUsingFontFallback to accept a null value. It should always be a valid font in the current context when it's called. Allowing null might fix the issue, but it would be akin to putting a band-aid on the problem instead of fixing it at the source.
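The q/Q scenario described above can be illustrated with a toy model. This is only a sketch of the general idea of a graphics-state stack, not PdfParser's actual code; `fontAtEachText()` and the command arrays are hypothetical names made up for illustration. It shows how a restore (`Q`) without a matching save (`q`) can leave the current font as null by the time a text command runs.

```php
<?php
// Toy model of a PDF graphics-state stack (NOT PdfParser's implementation).
// 'q' saves the current state, 'Q' restores the last saved state,
// 'Tf' selects a font, and 'Tj' shows text using the current font.
function fontAtEachText(array $commands): array
{
    $stack = [];   // saved font states
    $font = null;  // current font; null means no font is selected
    $seen = [];    // font in effect at each text-showing command

    foreach ($commands as [$operator, $operand]) {
        switch ($operator) {
            case 'q':
                $stack[] = $font;
                break;
            case 'Q':
                // array_pop() on an empty stack returns null,
                // so an unmatched Q silently discards the font.
                $font = array_pop($stack);
                break;
            case 'Tf':
                $font = $operand;
                break;
            case 'Tj':
                $seen[] = $font;
                break;
        }
    }

    return $seen;
}

// Balanced: the font is still set when the text is shown.
$balanced = [['q', null], ['Tf', 'F1'], ['Tj', 'Hello'], ['Q', null]];

// Unbalanced: an extra Q arrives before the text, so the font is null.
$unbalanced = [['Tf', 'F1'], ['Q', null], ['Tj', 'Hello']];
```

In this model, `fontAtEachText($balanced)` yields `['F1']` while `fontAtEachText($unbalanced)` yields `[null]`, which is the kind of state that would end in a null `$font` argument.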

paytah232 · 2023-12-27T01:25:59Z

@GreyWyvern - I understand, but as stated, the PDF in question is my payslip, and I wouldn't be comfortable sharing that document. Perhaps I can try and edit some key values and see if the issue still exists, then I would be happy to share. I'll try and come back to you.

bleigh-gemnisw · 2024-01-22T18:24:43Z

EDIT: And of course now it's working, so no clue what was wrong before. But it does happen on other documents, which I also cannot share.

Perhaps I can help. I have the same issue with the attached, very simple PDF file:
output.pdf

GreyWyvern · 2024-01-24T16:53:06Z

"EDIT: And of course now it's working, so no clue what was wrong before. But it does happen on other documents, which I also cannot share."

Yep, your file is working for me too in 2.8.0-RC2. :( If you can figure out how to get it to display the error using a PDF you can post, please share!

thomasage · 2024-03-14T23:08:52Z

Hi!
I have the same issue. After re-opening the file in Adobe and saving it again, the error is gone.
I can provide both files (with and without the error).
I hope this helps.
file-error.pdf
file-success.pdf

GreyWyvern · 2024-03-26T14:48:12Z

"Hi! I have the same issue. After re-opening the file in Adobe and saving it again, the error is gone. I can provide both files (with and without the error). I hope this helps."

Running getDataTm() on both files gives output without any errors for me in 2.9.0.

thomasage · 2024-03-29T10:48:06Z

I just tried it and you're right. I don't know what happened. I'll post a new comment with more details if it happens again.

k00ni · 2024-03-29T13:29:12Z

Is this issue solved now? @bleigh-gemnisw and @paytah232, please give us a short ping.

bleigh-gemnisw · 2024-03-29T14:25:22Z

@k00ni I still have files that it occurs in but unfortunately cannot share them for troubleshooting.

I'm of the opinion that your previous suggestion is the solution:

"I was just thinking of making the mentioned parameter of getTJUsingFontFallback also accept null. But further research might be needed here."

It lets files with the problem parse without erroring out, without having to know what's wrong with their fonts, and it shouldn't interfere with anything else as long as downstream code is made to handle the same condition.

Then I can deal with those files as needed on the backend by analyzing the produced JSON (i.e. giving it a default or replacing whatever bad font is causing it). As it stands, I can't process those files at all.
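Until the library's behaviour changes, a caller-side workaround along these lines may be enough to keep a batch job running. This is a minimal sketch, assuming the failure surfaces as a `\TypeError` from `getDataTm()` (as in the original report) and that `getText()` still works on the affected page; `collectPageData()` is a hypothetical helper, not part of PdfParser.

```php
<?php
// Hypothetical helper: try getDataTm() on each page and fall back to
// getText() when the null-font TypeError is thrown.
function collectPageData(iterable $pages): array
{
    $result = [];

    foreach ($pages as $index => $page) {
        try {
            $result[$index] = [
                'dataTm' => $page->getDataTm(),
                'text'   => null,
                'error'  => null,
            ];
        } catch (\TypeError $e) {
            // Positioned data is unavailable; keep plain text and the
            // error message so the page can be handled downstream.
            $result[$index] = [
                'dataTm' => null,
                'text'   => $page->getText(),
                'error'  => $e->getMessage(),
            ];
        }
    }

    return $result;
}

// Usage with PdfParser (file name taken from the original report):
// $pages = (new \Smalot\PdfParser\Parser())->parseFile('paySlip.pdf')->getPages();
// $data  = collectPageData($pages);
```

This only hides the symptom for the caller; it does not address whatever leaves the font unset in the first place.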

GreyWyvern · 2024-04-01T17:22:54Z

I suspect this might be another inline image issue, the same as #691, where binary image data containing 'q' or 'Q' is unbalancing the stored state of the document, which includes fonts.

@bleigh-gemnisw if it is at all possible to send the affected PDFs to bhuisman at greywyvern dot com so I can verify this privately, I'd appreciate it.
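The inline image hypothesis can be illustrated with a small, heuristic sketch. It is not how PdfParser parses content streams, and `countStateOps()` and `stripInlineImages()` are hypothetical helpers; the regular expression is a rough approximation, since real image data may itself contain the bytes `EI`. The point is only that bytes inside a `BI ... ID ... EI` section can look like standalone `q` or `Q` operators to a naive scanner.

```php
<?php
// Count standalone q (save) and Q (restore) tokens in a content stream.
// A token counts only when it is delimited by whitespace or the stream edges.
function countStateOps(string $stream): array
{
    preg_match_all('/(?<!\S)q(?!\S)/', $stream, $saves);
    preg_match_all('/(?<!\S)Q(?!\S)/', $stream, $restores);

    return ['q' => count($saves[0]), 'Q' => count($restores[0])];
}

// Heuristically remove inline images (BI <dict> ID <data> EI) so that their
// binary payload cannot be mistaken for operators.
function stripInlineImages(string $stream): string
{
    $pattern = '/(?<!\S)BI(?!\S).*?(?<!\S)ID\s.*?\sEI(?!\S)/s';

    return preg_replace($pattern, ' ', $stream) ?? $stream;
}

// A made-up stream: the image payload happens to contain two " Q " sequences.
$stream = "q /F1 12 Tf BI /W 2 /H 2 ID \x00 Q \xff Q \nEI (Hi) Tj Q";

$naive   = countStateOps($stream);                    // q: 1, Q: 3
$cleaned = countStateOps(stripInlineImages($stream)); // q: 1, Q: 1
```

With the payload left in place the scanner sees three restores for one save, which is the kind of imbalance that could discard the current font.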

paytah232 · 2024-09-16T12:56:56Z

@k00ni @GreyWyvern - Sorry for being absent from this for so long, but whatever was causing my files not to work now seems to be resolved when running on v2.11.

Both of the examples I have still produce very odd-looking text output (i.e. the encoding seems off: mostly legible, but with characters swapped, missing or just wrong), but getDataTm() now at least outputs the data without erroring out.

In its current state, this is now usable for me on those original documents, but I understand others like @bleigh-gemnisw may still be having other issues.

I also tried it on a graphics-heavy NRMA insurance certificate, and it died reporting an infinite loop. I'm assuming this is due to the complexity rather than the content, but I do not know. I have a small snippet if it is at all helpful:
Fatal error: Uncaught Error: Xdebug has detected a possible infinite loop, and aborted your script with a stack depth of '256' frames in /volume1/web/devel/includes/database.php on line 60
Error: Xdebug has detected a possible infinite loop, and aborted your script with a stack depth of '256' frames in /volume1/web/devel/includes/database.php on line 60
Call Stack:
0.0002 371400 1. {main}() /volume1/web/devel/scripts/testing/pdf.php:0
0.0227 1657608 2. Smalot\PdfParser\Parser->parseFile($filename = 'nrma.pdf') /volume1/web/devel/scripts/testing/pdf.php:13
0.0228 1727240 3. Smalot\PdfParser\Parser->parseContent($content = '%PDF-1.4\n%��\n1 0 obj\n<<\n/Creator <2800EFAC7483BAB7AF48191E3A90BA50354B84CD9B75A7C2665FAE>\n/Producer <2800EFAC7483BAB7AF48191E3A90BA50354B84CD9B75A7C2665FAE>\n/CreationDate <3F57BCEE3297C9E1F3145D4767C6E9167715DA998B68A0>\n>>\nendobj\n2 0 obj\n<<\n /N 3\n /Length 3 0 R\n /Filter

This seems to come from the data and dies in FilterHelper.php (according to my log):
/volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
0.0235 1748480 10. {closure:/volume1/web/devel/includes/load.php:94-107}($errno = 2, $errstr = 'gzuncompress(): data error', $errfile = '/volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php', $errline = 239) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
0.0235 1749216 11. logFailure($action = 'Error: #: 2 Message: gzuncompress(): data error File: /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php Line: 239 ', $backtrace_error = ???

paytah232 · 2024-12-21T13:50:56Z

@k00ni @GreyWyvern - I ran the original payslip PDF (left) through an online editor (right) to redact key data, see the image below:

For whatever reason, after running it through the online editor, the content at least makes sense: the weird characters visible on the left side are gone.

As I have now redacted the private data, I have attached the edited file. Perhaps it will help identify a cause, or explain why the outputs differ in seemingly random places.

Ran on v2.11

paySlip_edit.pdf

k00ni added the bug label Dec 17, 2023

k00ni added the stale needs decision label Mar 29, 2024

k00ni removed the stale needs decision label Apr 2, 2024
