Error thrown on getDataTm() - Call to a member function decodeText() on null #450

eddturtle · 2021-08-17T09:48:27Z

Hello, I'm trying to find the X, Y coords for a specific piece of text inside a PDF. I'm trying to use getDataTm() (correct me if that's the wrong method to use).

This works for many PDFs, but throws an error for this one example PDF.

myfile.pdf

Example code:

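// Assumes a Composer install; adjust the autoloader path to your project.
require __DIR__ . '/vendor/autoload.php';

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('myfile.pdf');

// Take the first page and request its text together with the text matrix (Tm) data.
$pages = $pdf->getPages();
$page = $pages[0];
$dataTm = $page->getDataTm(); // this call throws the Error for this file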

Error thrown:

EXCEPTION (Error): vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php line 484 user 1234 -- Call to a member function decodeText() on null

array(3) {
  [0]=>
  array(5) {
    ["file"]=>
    string(62) "/vagrant/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php"
    ["line"]=>
    int(257)
    ["function"]=>
    string(12) "getTextArray"
    ["class"]=>
    string(26) "Smalot\PdfParser\PDFObject"
    ["type"]=>
    string(2) "->"
  }
  [1]=>
  array(5) {
    ["file"]=>
    string(62) "/vagrant/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php"
    ["line"]=>
    int(561)
    ["function"]=>
    string(12) "getTextArray"
    ["class"]=>
    string(21) "Smalot\PdfParser\Page"
    ["type"]=>
    string(2) "->"
  },
  // removed
}

I've tried this on PHP 7.4 and PHP 8.0 (running through Apache 2) on Ubuntu 18.04.

Any ideas on how to get this pdf to process?
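For readers with the same goal (finding the X, Y coordinates of a given piece of text), the lookup could be sketched as below once getDataTm() returns data. This is a minimal sketch, not library code: it assumes each entry has the shape [[a, b, c, d, x, y], text], with x and y at indexes 4 and 5 of the matrix. The helper name findTextCoordinates and the sample array are hypothetical and used only for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Return [x, y] pairs for every entry whose text contains $needle.
 *
 * ASSUMPTION: each entry looks like [[a, b, c, d, x, y], text].
 * Entries that do not match this shape are skipped.
 */
function findTextCoordinates(array $dataTm, string $needle): array
{
    $matches = [];
    foreach ($dataTm as $entry) {
        // Skip anything that lacks the matrix x/y slots or the text slot.
        if (!isset($entry[0][4], $entry[0][5], $entry[1])) {
            continue;
        }
        if (false !== strpos((string) $entry[1], $needle)) {
            $matches[] = [(float) $entry[0][4], (float) $entry[0][5]];
        }
    }

    return $matches;
}

// Hypothetical sample data standing in for the result of $page->getDataTm().
$sample = [
    [['1', '0', '0', '1', '72', '720.5'], 'Invoice'],
    [['1', '0', '0', '1', '100.25', '650'], '{signature:signer505906:Please+Sign+Here}'],
];

$coords = findTextCoordinates($sample, 'signature');
```

In real use, $sample would be replaced by the array returned from getDataTm() for the page in question.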


k00ni · 2021-08-17T12:08:45Z

Maybe @Connum or @izabala can help out here?

izabala · 2021-08-18T18:40:24Z

@k00ni I will have a look.

@eddturtle I had a look at the file; it has just one line, containing {signature:signer505906:Please+Sign+Here}. Is that correct?

izabala · 2021-08-19T02:26:04Z

The problem in this case is that the PDF file doesn't behave like "normal" PDF files. I was actually about to open another issue for a file I am working with, created using FPDI, that also doesn't behave like "normal" PDF files.

In both cases Page::getTextArray() doesn't return the right data. I already have a workaround for this case that uses Page::getTextArray() with a small change.

I will open a new issue to discuss the FPDI file so it doesn't get mixed up with this case.

izabala · 2021-08-19T20:40:40Z

Hi, I have already made the fix/workaround, but when I open the pull request, the automatic validation gives me some errors. Can someone help me with that? (By the way, directions for a Windows machine would work best for me.)

eddturtle · 2021-08-20T09:11:13Z

I tried again after applying the changes from the linked commit to my local copy of pdfparser, and it works for me: getDataTm() now returns data and text. Thanks for the fix, @izabala.

k00ni · 2021-08-20T09:18:30Z

Please bear with me; the related pull request is #453. I will have a look next week to move the fix forward. @eddturtle, it would be great if you could help us test these changes.

izabala · 2021-08-20T17:01:18Z

You're welcome, @eddturtle! I am just waiting for help from @k00ni so the code can finally be merged into the master branch.

@j0k3r

* Workaround for issue #450
  The file makes two of the Page methods fail. Page->extractDecodedRawData was not returning the correct string; this was corrected. Page->getTextArray breaks when Page->get('Contents') returns a PDFObject for which PDFObject->getTextArray($this) throws an Error. If you detect that and call PDFObject->getTextArray() instead, it returns the correct data. This is a workaround, because what exactly differs in the format of this PDF, and why it fails, needs a deeper investigation. I ran all the PageTests and they pass. This happens because the sample PDF file is not formatted the way we usually see in other files. I have a similar (not exactly the same) case for a file created with FPDI that also broke the getTextArray and getDataTm methods, but I am researching what actually happens before I open an issue for it. As soon as I know what is happening in that case, I will open the issue, hopefully with the workaround or fix already done.
* PageTest: attempt to fix cs issues
* Page.php: fixed cs issues
* ParserTest: fixed failing test testRetainImageContentImpact
  This test is a bit wonky because it relies on memory values which may differ from system to system and run to run. Adjusted values to fix it.
  Ref: https://github.com/smalot/pdfparser/pull/453/checks?check_run_id=3397695916#step:6:22
* Refined memory threshold in ParserTest::testRetainImageContentImpact
* Update Page.php
* Taking out line
  Taking out the line: $decodedText = ''; This was not needed. Thanks @j0k3r
* Changing the catch of the Error
  To catching Throwable.

Co-authored-by: Konrad Abicht <[email protected]>
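The fallback described in the commit message above (try getTextArray with the page, and on failure call it without one, catching Throwable) can be illustrated roughly as follows. This is a sketch of the pattern only, not the merged code: StubContents is a hypothetical stand-in for the real PDFObject, and textArrayWithFallback is an illustrative name that does not exist in the library.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the page's content object; NOT the real PDFObject.
final class StubContents
{
    public function getTextArray(?object $page = null): array
    {
        if (null !== $page) {
            // Simulates the reported failure: a font lookup yields null,
            // so calling decodeText() on it throws an Error.
            $font = null;

            return [$font->decodeText('x')];
        }

        return ['{signature:signer505906:Please+Sign+Here}'];
    }
}

// Try the page-aware call first; if it throws, retry without the page.
function textArrayWithFallback(StubContents $contents, object $page): array
{
    try {
        return $contents->getTextArray($page);
    } catch (\Throwable $e) {
        // Throwable covers both Error and Exception.
        return $contents->getTextArray();
    }
}

$result = textArrayWithFallback(new StubContents(), new stdClass());
```

Catching Throwable rather than Exception matters here because "Call to a member function ... on null" is raised as an Error, which is not an Exception subclass.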

@j0k3r

* Workaround for issue #450
  The file makes two of the Page methods fail. Page->extractDecodedRawData was not returning the correct string; this was corrected. Page->getTextArray breaks when Page->get('Contents') returns a PDFObject for which PDFObject->getTextArray($this) throws an Error. If you detect that and call PDFObject->getTextArray() instead, it returns the correct data. This is a workaround, because what exactly differs in the format of this PDF, and why it fails, needs a deeper investigation. I ran all the PageTests and they pass. This happens because the sample PDF file is not formatted the way we usually see in other files. I have a similar (not exactly the same) case for a file created with FPDI that also broke the getTextArray and getDataTm methods, but I am researching what actually happens before I open an issue for it. As soon as I know what is happening in that case, I will open the issue, hopefully with the workaround or fix already done.
* PageTest: attempt to fix cs issues
* Page.php: fixed cs issues
* ParserTest: fixed failing test testRetainImageContentImpact
  This test is a bit wonky because it relies on memory values which may differ from system to system and run to run. Adjusted values to fix it.
  Ref: https://github.com/smalot/pdfparser/pull/453/checks?check_run_id=3397695916#step:6:22
* Refined memory threshold in ParserTest::testRetainImageContentImpact
* Update Page.php
* Taking out line
  Taking out the line: $decodedText = ''; This was not needed. Thanks @j0k3r
* Changing the catch of the Error
  To catching Throwable.
* Fix/workaround for issue #454
  When the PDF file is produced by setasign/fpdi/fpdi or FPDF, this corrects the methods returning nothing. To do that, it needs to know that the producer is FPDF and the page number, which are used in conjunction with getXObjects.
* Update Page.php
  Some of the changes requested on GitHub by @k00ni
* Update Page.php
  Other changes requested by @k00ni
* Some other recommendations
  Some other @k00ni recommendations
* After manually doing php-cs-fixer
  I manually ran dev-tools\vendor\bin\php-cs-fixer fix
* Correcting the phpstan error
* Update Page.php
  Just a code enhancement
* Removing vscode\lauch.json and some corrections
  Some corrections mentioned by @k00ni.
* Creating some functions to make this clearer
  Followed the recommendation of @k00ni to use extra functions to make the code clearer.
* After applying some @k00ni recommendations
  Many changes following @k00ni recommendations.
* Updating the comment for the isFpdf function
  Better explanation for the function
* Changes for correcting phpstan errors
* Some changes
  Changes in comments, function names and variable names.
* Reformatted some code parts

Co-authored-by: Konrad Abicht <[email protected]>

izabala · 2021-10-18T17:43:21Z

@k00ni @eddturtle This problem should be closed, shouldn't it?

k00ni added the bug label Aug 17, 2021

izabala mentioned this issue Aug 24, 2021

extractRawData, extractDecodedRawData, getDataTm and getDataXY do not work with a Pdf file produced by FPDI/FPDF #454

Closed

k00ni closed this as completed Oct 19, 2021

eddturtle mentioned this issue Oct 21, 2021

Issue loading pdf generated from FPDI #472

Open
