Font Fallback Issue #657

Open
paytah232 opened this issue Dec 16, 2023 · 13 comments
@paytah232

  • PHP Version: 8.2.5
  • PDFParser Version: 2.7.0

Description:

PDF input

Personal payslip, so unable to provide, but will do what I can

Expected output & actual output

Get text seems to work, although there is some odd encoding here or there.
When trying to run getDataTm, it fails - seems it's due to a font issue.

Fatal error: Uncaught TypeError: Smalot\PdfParser\PDFObject::getTJUsingFontFallback(): Argument #1 ($font) must be of type Smalot\PdfParser\Font, null given, called in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 531 and defined in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 252

TypeError: Smalot\PdfParser\PDFObject::getTJUsingFontFallback(): Argument #1 ($font) must be of type Smalot\PdfParser\Font, null given, called in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 531 in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 252

Call Stack:
0.0019 370824 1. {main}() /volume1/web/devel/scripts/testing/pdf.php:0
0.1753 1337680 2. Smalot\PdfParser\Page->getDataTm($dataCommands = ???) /volume1/web/devel/scripts/testing/pdf.php:25
0.1861 1510200 3. Smalot\PdfParser\Page->getTextArray($page = ???) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php:701
0.1861 1547256 4. Smalot\PdfParser\PDFObject->getTextArray($page = class Smalot\PdfParser\Page { protected $document = class Smalot\PdfParser\Document { protected $objects = [...]; protected $dictionary = [...]; protected $trailer = class Smalot\PdfParser\Header { ... }; protected $metadata = [...]; protected $details = [...] }; protected $header = class Smalot\PdfParser\Header { protected $document = class Smalot\PdfParser\Document { ... }; protected $elements = [...] }; protected $content = ''; protected $config = class Smalot\PdfParser\Config { private $fontSpaceLimit = -50; private $horizontalOffset = ' '; private $pdfWhitespaces = '\000\t\n\f\r '; private $pdfWhitespacesRegex = '[\\0\\t\\n\\f\\r ]'; private $retainImageContent = TRUE; private $decodeMemoryLimit = 0; private $dataTmFontInfoHasToBeIncluded = TRUE }; protected $fonts = []; protected $xobjects = NULL; protected $dataTm = NULL }) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php:365
0.1900 1578944 5. Smalot\PdfParser\PDFObject->getTJUsingFontFallback($font = NULL, $command = [0 => ['t' => '(', 'o' => '\'', 'c' => '\000,']], $page = class Smalot\PdfParser\Page { protected $document = class Smalot\PdfParser\Document { protected $objects = [...]; protected $dictionary = [...]; protected $trailer = class Smalot\PdfParser\Header { ... }; protected $metadata = [...]; protected $details = [...] }; protected $header = class Smalot\PdfParser\Header { protected $document = class Smalot\PdfParser\Document { ... }; protected $elements = [...] }; protected $content = ''; protected $config = class Smalot\PdfParser\Config { private $fontSpaceLimit = -50; private $horizontalOffset = ' '; private $pdfWhitespaces = '\000\t\n\f\r '; private $pdfWhitespacesRegex = '[\\0\\t\\n\\f\\r ]'; private $retainImageContent = TRUE; private $decodeMemoryLimit = 0; private $dataTmFontInfoHasToBeIncluded = TRUE }; protected $fonts = []; protected $xobjects = NULL; protected $dataTm = NULL }) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php:531

It does work on another invoice I have, just not this payslip.

Code

```php
<?php
// Composer autoloader; adjust the path to your own vendor directory.
require 'vendor/autoload.php';

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

// Include font information in the getDataTm() output.
$config = new Config();
$config->setDataTmFontInfoHasToBeIncluded(true);
$parser = new Parser([], $config);

$pdf = $parser->parseFile('paySlip.pdf');
//$pdf = $parser->parseFile('Invoice INV-0007.pdf');

// $debugger is a custom output helper defined elsewhere.
$text = $pdf->getText();
$debugger->force_out($text, 'Text');

$metaData = $pdf->getDetails();
$debugger->force_out($metaData, 'Meta');

$pages = $pdf->getPages();
$debugger->force_out($pages);

// This is the call that raises the TypeError.
$pos = $pages[0]->getDataTm();
$debugger->force_out($pos, 'Data');
```
@k00ni k00ni added the bug label Dec 17, 2023
@k00ni
Collaborator

k00ni commented Dec 17, 2023

@GreyWyvern this one may interest you.

I was thinking of making the mentioned parameter of getTJUsingFontFallback also accept null, but further research might be needed here.

@GreyWyvern
Contributor

It would be useful to see the data from the PDF in question. Any of a number of things might be happening. The document might be trying to define a font that PdfParser doesn't accept, or a mismatched set of q and Q commands is leading to a null value for the current font, or... it could be a lot of things.

I would definitely want to see what was happening before allowing getTJUsingFontFallback to accept a null value. It should always be a valid font in the current context when it's called. Allowing null might fix the issue, but it would be akin to putting a band-aid on the problem instead of fixing it at the source.
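
To illustrate the q/Q possibility mentioned above, here is a toy model of a graphics-state stack. It is not PdfParser's code: the `fontAfter` function, the `Tf:` token format, and the choice to fall back to null on an unmatched Q are all assumptions made for the sketch.

```php
<?php
// Toy model only: 'q' saves the current state (here just the font name),
// 'Q' restores the most recently saved one, 'Tf:<name>' selects a font.
function fontAfter(array $operators, ?string $initialFont = null): ?string
{
    $current = $initialFont;
    $saved = [];
    foreach ($operators as $op) {
        if ($op === 'q') {
            $saved[] = $current;
        } elseif ($op === 'Q') {
            // An unmatched Q has nothing to restore; this sketch falls back to null.
            $current = $saved ? array_pop($saved) : null;
        } elseif (str_starts_with($op, 'Tf:')) {
            $current = substr($op, 3);
        }
    }
    return $current;
}

var_dump(fontAfter(['Tf:F1', 'q', 'Tf:F2', 'Q']));      // balanced q/Q
var_dump(fontAfter(['Tf:F1', 'q', 'Tf:F2', 'Q', 'Q'])); // one Q too many
```

In the balanced case the font set before `q` comes back after `Q`. With one extra `Q`, the model ends up with no current font, which is the kind of state that would reach getTJUsingFontFallback as null.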

@paytah232
Author

@GreyWyvern - I understand, but as stated, the PDF in question is my payslip, and I wouldn't be comfortable sharing that document. Perhaps I can try and edit some key values and see if the issue still exists, then I would be happy to share. I'll try and come back to you.

@bleigh-gemnisw

bleigh-gemnisw commented Jan 22, 2024

EDIT: And of course now it's working, so no clue what was wrong before. But it does happen on other documents, which I also cannot share.

Perhaps I can help. I have the same issue with the attached, very simple PDF file:
output.pdf

@GreyWyvern
Contributor

> EDIT: And of course now it's working, so no clue what was wrong before. But it does happen on other documents, which I also cannot share.

Yep, your file is working for me too in 2.8.0-RC2. :( If you can figure out how to get it to display the error using a PDF you can post, please share!

@thomasage

Hi!
I have the same issue. After re-opening the file in Adobe and saving it again, the error is gone.
I can provide the two files (with and without the error).
I hope it can help.
file-error.pdf
file-success.pdf

@GreyWyvern
Contributor

> Hi! I have the same issue. After re-opening the file in Adobe and saving it again, the error is gone. I can provide the two files (with and without the error). I hope it can help.

Running getDataTm() on both files gives output without any errors for me in 2.9.0.

@thomasage

I just tried it and you're right. I don't know what happened. I'll post a new comment with more details if it happens again.

@k00ni
Collaborator

k00ni commented Mar 29, 2024

Is this issue solved now? @bleigh-gemnisw and @paytah232, please give us a short ping.

@k00ni k00ni added the stale needs decision label Mar 29, 2024
@bleigh-gemnisw

@k00ni I still have files that it occurs in but unfortunately cannot share them for troubleshooting.

I'm of the opinion that your previous suggestion is the solution:

> I was just thinking to make the mentioned parameter of getTJUsingFontFallback also accepting null. But further research might be needed here.

It lets files with this problem be processed without erroring out, without having to know what's wrong with their font, and it shouldn't interfere with anything else as long as downstream code is made to handle the same condition.

Then I can deal with those files as needed on the backend by analyzing the produced JSON (e.g. giving it a default font or replacing whatever bad font is causing the problem). As it stands I can't process those files at all.
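
Until the library behaviour changes, a caller-side workaround might look like the sketch below. This is not the proposed change to getTJUsingFontFallback: `extractOrDefault` is a hypothetical helper, and the failing callback is only a stand-in that raises the same kind of TypeError as in the report.

```php
<?php
// Hypothetical helper, not part of PdfParser: run an extraction callback and
// return a default instead of letting a TypeError abort the whole script.
function extractOrDefault(callable $extract, array $default = []): array
{
    try {
        return $extract();
    } catch (\TypeError $e) {
        error_log('Extraction skipped: ' . $e->getMessage());
        return $default;
    }
}

// Stand-in for a page whose extraction fails: passing null to a parameter
// typed as a non-nullable object raises a TypeError, as in the report.
$failing = function (): array {
    (function (\stdClass $font) {})(null);
    return [];
};

var_dump(extractOrDefault($failing));
```

With PdfParser, the callback would presumably be something like `fn () => $page->getDataTm()`, so affected pages yield an empty result and the rest of the document can still be processed.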

@GreyWyvern
Contributor

I suspect this might be another inline image issue, the same as #691, where binary image data containing 'q' or 'Q' is unbalancing the stored state of the document, which includes fonts.
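
A rough sketch of how that could happen, assuming a tokenizer that splits the content stream on whitespace without treating inline image data specially. Both functions and the sample stream are invented for illustration and are not PdfParser's implementation.

```php
<?php
// Toy illustration only: count save (q) and restore (Q) operators naively.
function countStateOps(string $stream): array
{
    $tokens = preg_split('/\s+/', trim($stream));
    $counts = array_count_values($tokens);
    return ['q' => $counts['q'] ?? 0, 'Q' => $counts['Q'] ?? 0];
}

// Drop everything from BI through the closing EI before counting. Real parsers
// need more care, since the bytes "EI" can themselves occur inside image data.
function stripInlineImages(string $stream): string
{
    return preg_replace('/\bBI\b.*?\bID\b.*?\bEI\b/s', ' ', $stream);
}

// One q, one real Q, plus a "Q" byte that is part of the inline image data.
$stream = "q BT /F1 12 Tf (Hi) Tj ET BI /W 2 /H 2 ID \x01 Q \x02 EI Q";

var_dump(countStateOps($stream));                    // naive count sees two Q
var_dump(countStateOps(stripInlineImages($stream))); // balanced once image data is skipped
```

The naive count reports one more Q than q, so a state stack driven by it would pop once too often and could lose the current font.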

@bleigh-gemnisw if it is at all possible to send the affected PDFs to bhuisman at greywyvern dot com so I can verify this privately, I'd appreciate it.

@k00ni k00ni removed the stale needs decision label Apr 2, 2024
@paytah232
Author

@k00ni @GreyWyvern - Sorry for being absent from this for so long, but whatever was causing my files not to work now seems to be resolved when running on v2.11.

Both of my examples still produce odd-looking text output (the encoding seems off: mostly legible, but with characters swapped, missing, or just wrong), but getDataTm() now at least returns data without erroring out.

In its current state, this is now usable for me on those original documents, but I understand others like @bleigh-gemnisw may still be having other issues.

I also tried it on a graphics-heavy NRMA insurance certificate, and it died reporting an infinite loop. I'm assuming this is due to the complexity rather than the content, but I do not know. I have a small snippet in case it is helpful:
Fatal error: Uncaught Error: Xdebug has detected a possible infinite loop, and aborted your script with a stack depth of '256' frames in /volume1/web/devel/includes/database.php on line 60

Error: Xdebug has detected a possible infinite loop, and aborted your script with a stack depth of '256' frames in /volume1/web/devel/includes/database.php on line 60

Call Stack:
0.0002 371400 1. {main}() /volume1/web/devel/scripts/testing/pdf.php:0
0.0227 1657608 2. Smalot\PdfParser\Parser->parseFile($filename = 'nrma.pdf') /volume1/web/devel/scripts/testing/pdf.php:13
0.0228 1727240 3. Smalot\PdfParser\Parser->parseContent($content = '%PDF-1.4\n%����\n1 0 obj\n<<\n/Creator <2800EFAC7483BAB7AF48191E3A90BA50354B84CD9B75A7C2665FAE>\n/Producer <2800EFAC7483BAB7AF48191E3A90BA50354B84CD9B75A7C2665FAE>\n/CreationDate <3F57BCEE3297C9E1F3145D4767C6E9167715DA998B68A0>\n>>\nendobj\n2 0 obj\n<<\n /N 3\n /Length 3 0 R\n /Filter

This seems to come from the data and dies in FilterHelper.php (according to my log):
/volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
0.0235 1748480 10. {closure:/volume1/web/devel/includes/load.php:94-107}($errno = 2, $errstr = 'gzuncompress(): data error', $errfile = '/volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php', $errline = 239) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
0.0235 1749216 11. logFailure($action = 'Error: #: 2 Message: gzuncompress(): data error File: /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php Line: 239 ', $backtrace_error = ???
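
The trace shows the `gzuncompress(): data error` warning being passed to a custom error handler (the closure in load.php) and on to logFailure(). Whether that handler is what recurses is not established here. As a point of comparison only, the sketch below handles the same warning with a handler that just throws; `tryUncompress` is a hypothetical name and the zlib extension is assumed to be available.

```php
<?php
// Sketch: turn the gzuncompress() warning into an exception handled locally.
// The temporary handler only throws; it does no logging or database work.
function tryUncompress(string $data): ?string
{
    set_error_handler(function (int $errno, string $errstr, string $errfile, int $errline): bool {
        throw new \ErrorException($errstr, 0, $errno, $errfile, $errline);
    });
    try {
        $result = gzuncompress($data);
        return $result === false ? null : $result;
    } catch (\ErrorException $e) {
        return null;
    } finally {
        // Always put the previous handler back.
        restore_error_handler();
    }
}

var_dump(tryUncompress('not zlib data'));       // invalid input
var_dump(tryUncompress(gzcompress('hello')));   // valid input round-trips
```

Keeping the handler free of side effects means a failure inside logging or database code cannot trigger the handler again.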

@paytah232
Author

@k00ni @GreyWyvern - I ran the original payslip (left) pdf through an online editor (right) to redact key data, see the image below:
image

For whatever reason, after running the file through the online editor, the content at least makes sense: the weird characters visible on the left side are gone.

As I have now redacted private data, I have attached the edited file. Perhaps it may be helpful to identify a cause, or possibly understand why the outputs are so different in 'seemingly random' places.

Ran on v2.11

paySlip_edit.pdf
