calculateTextWidth throws an error for some fonts #570

Open
benlongstaff opened this issue Jan 23, 2023 · 8 comments
@benlongstaff

benlongstaff commented Jan 23, 2023

Undefined array key "Widths"
in vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:279

Not all fonts have the Widths data in the font header. For example, $font->getDetails() returns:

Array
(
    [Name] => WPJPNX+E13BP
    [Type] => Type0
    [Encoding] => Identity-H
    [BaseFont] => WPJPNX+E13BP
    [DescendantFonts] => Array
        (
            [0] => Array
                (
                    [Name] => WPJPNX+E13BP
                    [Type] => CIDFontType2
                    [Encoding] => Ansi
                    [BaseFont] => WPJPNX+E13BP
                    [CIDToGIDMap] => Identity
                    [DW] => 1000
                    [Subtype] => CIDFontType2
                )

        )

    [Subtype] => Type0
    [ToUnicode] => Array
        (
            [Filter] => FlateDecode
            [Length1] => 1024
            [Length] => 303
        )

)

versus the expected output for a font that does include Widths:

Array
(
    [Name] => MyriadPro-Regular
    [Type] => Type1
    [Encoding] => WinAnsiEncoding
    [BaseFont] => MyriadPro-Regular
    [FirstChar] => 1
    [FontDescriptor] => Array
        (
            [Ascent] => 750
            [CapHeight] => 674
            [Descent] => -250
            [Flags] => 32
            [FontName] => MyriadPro-Regular
            [ItalicAngle] => 0
            [StemV] => 80
            [Type] => FontDescriptor
        )

    [LastChar] => 255
    [Subtype] => Type1
    [Widths] => Array
        (
            [0] => 0
            ...
            [254] => 471
        )

)
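
The two dumps above differ in where width information can live: the Type1 font has a top-level Widths array, while the Type0 font only exposes a default width (DW) inside DescendantFonts. A minimal sketch of telling the two shapes apart follows. The helper name widthSource() is hypothetical and not part of PdfParser; it only assumes an array shaped like the getDetails() output shown in this report.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): report where glyph widths
 * could come from, given an array shaped like $font->getDetails().
 */
function widthSource(array $details): string
{
    // Simple fonts (e.g. Type1) carry a top-level 'Widths' array.
    if (isset($details['Widths']) && is_array($details['Widths'])) {
        return 'Widths';
    }

    // Composite (Type0) fonts delegate metrics to a descendant CIDFont,
    // which may provide a default width under 'DW'.
    $descendants = $details['DescendantFonts'] ?? [];
    if (is_array($descendants)) {
        foreach ($descendants as $descendant) {
            if (is_array($descendant) && isset($descendant['DW'])) {
                return 'DescendantFonts.DW';
            }
        }
    }

    return 'none';
}

$type0 = [
    'Type' => 'Type0',
    'DescendantFonts' => [['Type' => 'CIDFontType2', 'DW' => 1000]],
];
$type1 = ['Type' => 'Type1', 'Widths' => [0, 471]];

echo widthSource($type0), PHP_EOL; // DescendantFonts.DW
echo widthSource($type1), PHP_EOL; // Widths
```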
k00ni added the bug label Jan 23, 2023
@k00ni
Collaborator

k00ni commented Jan 23, 2023

Can you provide us the PDF?

@benlongstaff
Author

Unfortunately, the files are bank statements, so I will need to find a way to remove the elements with sensitive information.

Is there other information about the font I could provide in the meantime?

@k00ni
Collaborator

k00ni commented Jan 24, 2023

The most helpful thing would be PHP example code that triggers the error. The following is a rough (untested) example. Please have a look.

use Smalot\PdfParser\Document;
use Smalot\PdfParser\Font;
use Smalot\PdfParser\Header;

/*
 * $elements must contain faulty data to trigger the error.
 * $header->getDetails() is used inside "calculateTextWidth".
 * If it doesn't return an array with the key "Widths", the error occurs.
 *
 * You can build $elements yourself, or you can place a var_dump near
 * https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/Font.php#L278
 * and use that.
 */
$elements = [
    'Name' => '...',
    'Type' => '...',
    'Encoding' => '...',
    // 'Widths' => '...'       <=== must be missing to trigger the error
];
$header = new Header($elements);

$font = new Font(new Document(), $header);
// The second argument is passed by reference, so a literal null is not allowed; omit it.
$font->calculateTextWidth(''); // call this to trigger error
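
The failure mode itself can be shown without PdfParser or a PDF file. The sketch below uses a plain array as a stand-in for the getDetails() result (an assumption for illustration) and converts PHP diagnostics into exceptions so the unguarded read is observable. The exact message text depends on the PHP version, so it is printed rather than asserted.

```php
<?php

declare(strict_types=1);

// Turn PHP notices/warnings into exceptions so the failure is observable.
set_error_handler(static function (int $severity, string $message, string $file = '', int $line = 0): bool {
    throw new ErrorException($message, 0, $severity, $file, $line);
});

// Stand-in for the array returned by getDetails() on a font without widths.
$details = ['Name' => 'WPJPNX+E13BP', 'Type' => 'Type0'];

$unguarded = 'no error';
try {
    $widths = $details['Widths']; // unguarded read of a missing key
} catch (ErrorException $e) {
    $unguarded = $e->getMessage();
}

// Guarded read: isset() never triggers the diagnostic.
$guarded = isset($details['Widths']) ? $details['Widths'] : null;

restore_error_handler();

echo $unguarded, PHP_EOL;
var_dump($guarded); // NULL
```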

@k00ni
Collaborator

k00ni commented Jan 25, 2023

@benlongstaff Ignore my last comment. I realized it is more of a basis for a unit test: first trigger the error and, after fixing it, make sure it doesn't happen again.

We would need two things to fix it:

  1. a unit test which triggers the error (see my comment with example code above)
  2. and a check inside the function to avoid the error in case no width is given.

Would you take the time to prepare a pull request? I will lead/assist you until it's merged.

Does the PDF specification allow a font without Widths? If so, a simple check should be fine (and/or even setting a default value). If it doesn't, it's more of an anomaly. In that case, what is the best approach? Stick with the check?

@GreyWyvern
Contributor

Fortunately, the PDF for issue #592 has this font-without-Widths problem as well, and we already have permission to use it: /samples/bugs/Issue592.pdf

The key thing is, what do we want PdfParser to do in this case? Return zero (0)? Something like (in Font.php):

    /**
     * Calculate text width with data from header 'Widths'. If width of character is not found then character is added to missing array.
     */
    public function calculateTextWidth(string $text, array &$missing = null): ?float
    {
        $index_map = array_flip($this->table);
        $details = $this->getDetails();

        // If 'Widths' is not defined for this font, return 0
        // See: https://github.com/smalot/pdfparser/issues/570
        if (!isset($details['Widths'])) return 0;

        $widths = $details['Widths'];
...

@k00ni
Collaborator

k00ni commented Aug 3, 2023

> The key thing is, what do we want PdfParser to do in this case? Return zero (0)?

I suggest -1 or null because it is an invalid width which is easy to check for. Whatever is returned in this case, the function header should be extended to reflect this behavior.
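
A minimal sketch of the null-returning convention follows. The function widthOrNull() is a hypothetical stand-in, not PdfParser's calculateTextWidth(). Its summing logic is a deliberate simplification that indexes Widths from FirstChar, and it takes character codes directly instead of decoding text.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the guarded part of calculateTextWidth():
 * returns null when the font details carry no 'Widths' array.
 * The summing logic is a simplification, not PdfParser's implementation.
 *
 * @param array<string, mixed> $details   array shaped like getDetails()
 * @param int[]                $charCodes character codes of the text
 */
function widthOrNull(array $details, array $charCodes): ?float
{
    if (!isset($details['Widths']) || !is_array($details['Widths'])) {
        return null; // no usable width data: signal "unknown", not 0
    }

    $first = (int) ($details['FirstChar'] ?? 0);
    $total = 0.0;
    foreach ($charCodes as $code) {
        // Codes outside the table contribute nothing in this sketch.
        $total += (float) ($details['Widths'][$code - $first] ?? 0);
    }

    return $total;
}

$width = widthOrNull(['Type' => 'Type0'], [65, 66]);
if (null === $width) {
    echo 'width unavailable', PHP_EOL;
}
```

Returning null keeps "unknown" distinct from a genuine zero width, and the existing ?float return type already permits it, so only the docblock would need to mention the new case.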

@GreyWyvern
Contributor

After a quick search, this function doesn't seem to be used by any other function in PdfParser, so I think returning null, -1 or even false would be okay.

@mbideau-atreal
Contributor

I also have the same issue: a font with no Widths generates a PHP Notice, and the text width calculation fails.
The test PDF is the same as provided in the issue #629.

The following code triggers the PHP Notice using the mentioned PDF sample.

<?php

require_once __DIR__.'/pdfparser/alt_autoload.php-dist';

$config = new \Smalot\PdfParser\Config();
$config->setDataTmFontInfoHasToBeIncluded(true);
$parser = new \Smalot\PdfParser\Parser(array(), $config);

$pdf = $parser->parseFile('/tmp/doc.pdf');

$pages = $pdf->getPages();
$lastpage = end($pages);
$data = $lastpage->getDataTm();

echo "Items:".PHP_EOL;
$current_text = null;
foreach($data as $item) {
    if(is_array($item)) {
        $text = $item[1];
        if ($text != $current_text) {
            echo "- '$text'".PHP_EOL;
            $font = $lastpage->getFont($item[2]);
            echo "  font: ".$font->getName()." (".$font->getType().")"." size: ".$item[3].PHP_EOL;
            $missing = array();
            echo "  text width: ".$font->calculateTextWidth($text, $missing)." (missing: ".implode(',', $missing).")".PHP_EOL;
            $current_text = $text;
        }
    }
}

PS: This code needs the fix for issue #629 in order to detect the font properly.

Is there something I can do when generating the PDF to fix this issue in the PDF? I have (a little) control over the PDF generation.

I am mainly interested in making the text width calculation work rather than preventing a PHP Notice.

Thanks again for your software and to the contributors.
Best regards
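
For the goal of getting a usable text width from a font like the Type0 one in the original report, one possible fallback is the descendant font's default width (DW). The sketch below is an assumption-laden estimate, not PdfParser behaviour: estimateWidthFromDefault() is a hypothetical name, it treats every glyph as having the default width, and it ignores any per-glyph width data the PDF may contain.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical fallback (not part of PdfParser): estimate a text width from
 * the descendant font's default width 'DW' when no 'Widths' array exists.
 * Assumes every glyph has the default width, which is only exact for
 * monospaced fonts; per-glyph width data, if the PDF has any, is ignored.
 *
 * @param array<string, mixed> $details array shaped like getDetails()
 * @return float|null width in glyph-space units (thousandths of a text-space unit)
 */
function estimateWidthFromDefault(array $details, string $text): ?float
{
    $descendants = $details['DescendantFonts'] ?? [];
    if (!is_array($descendants)) {
        return null;
    }

    // Count characters, not bytes, so multi-byte UTF-8 text is handled.
    $count = preg_match_all('/./su', $text);
    if (false === $count) {
        return null; // text is not valid UTF-8
    }

    foreach ($descendants as $descendant) {
        if (is_array($descendant) && isset($descendant['DW']) && is_numeric($descendant['DW'])) {
            return (float) $descendant['DW'] * $count;
        }
    }

    return null;
}

$details = ['Type' => 'Type0', 'DescendantFonts' => [['DW' => 1000]]];
$units = estimateWidthFromDefault($details, 'Hello');

// Scale glyph-space units to text-space units for a 12 pt font.
echo $units / 1000 * 12, PHP_EOL; // 60
```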
