Error thrown on getDataTm() - Call to a member function decodeText() on null #450

Closed
eddturtle opened this issue Aug 17, 2021 · 8 comments

@eddturtle

Hello, I'm trying to find the X, Y coords for a specific piece of text inside a PDF. I'm trying to use getDataTm() (correct me if that's the wrong method to use).

This works for many PDFs, but throws an error for this one example PDF.

myfile.pdf

Example code:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('myfile.pdf');

$pages = $pdf->getPages();
$page = $pages[0];
$dataTm = $page->getDataTm();

Error thrown:

EXCEPTION (Error): vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php line 484 user 1234 -- Call to a member function decodeText() on null

array(3) {
  [0]=>
  array(5) {
    ["file"]=>
    string(62) "/vagrant/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php"
    ["line"]=>
    int(257)
    ["function"]=>
    string(12) "getTextArray"
    ["class"]=>
    string(26) "Smalot\PdfParser\PDFObject"
    ["type"]=>
    string(2) "->"
  }
  [1]=>
  array(5) {
    ["file"]=>
    string(62) "/vagrant/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php"
    ["line"]=>
    int(561)
    ["function"]=>
    string(12) "getTextArray"
    ["class"]=>
    string(21) "Smalot\PdfParser\Page"
    ["type"]=>
    string(2) "->"
  },
  // removed
}

I've tried this on PHP 7.4 and PHP 8.0 (running through Apache 2) on Ubuntu 18.04.

Any ideas on how to get this pdf to process?
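
For readers with the same goal (the X, Y position of a specific piece of text), here is a minimal sketch of how the result of getDataTm() might be searched. It assumes each returned entry has the shape [[a, b, c, d, x, y], text], where the first element is the text matrix and its last two values are the position. The function name findTextPositions and the sample data are made up for illustration; they are not part of the library.

```php
<?php
declare(strict_types=1);

/**
 * Return the position of every entry whose text contains $needle.
 *
 * Assumption: each entry looks like [[a, b, c, d, x, y], 'text'].
 * Hypothetical helper, not part of smalot/pdfparser.
 */
function findTextPositions(array $dataTm, string $needle): array
{
    $matches = [];
    foreach ($dataTm as $entry) {
        // Skip anything that does not have the assumed shape.
        if (!isset($entry[0][4], $entry[0][5], $entry[1]) || !is_string($entry[1])) {
            continue;
        }
        if (false !== strpos($entry[1], $needle)) {
            $matches[] = [
                'x' => (float) $entry[0][4],
                'y' => (float) $entry[0][5],
                'text' => $entry[1],
            ];
        }
    }

    return $matches;
}

// Hand-written sample data standing in for $page->getDataTm().
$sample = [
    [['1', '0', '0', '1', '72.0', '720.5'], 'Invoice'],
    [['1', '0', '0', '1', '72.0', '100.25'], '{signature:signer505906:Please+Sign+Here}'],
];

$hits = findTextPositions($sample, 'Please+Sign+Here');
```

With the real library, the array passed in would be the return value of $page->getDataTm() from the example code above. The coordinates are in PDF units, so they may need converting before being used for rendering.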

@k00ni
Collaborator

k00ni commented Aug 17, 2021

Maybe @Connum or @izabala can help out here?

k00ni added the bug label Aug 17, 2021
@izabala
Contributor

izabala commented Aug 18, 2021

@k00ni I will have a look.

@eddturtle I had a look at the file. It has just one line, containing: {signature:signer505906:Please+Sign+Here}. Is that OK?

@izabala
Contributor

izabala commented Aug 19, 2021

The problem in this case is that the PDF file doesn't behave like "normal" PDF files. I was actually going to open another issue for a file I am working with, created using FPDI, that also doesn't behave like "normal" PDF files.

In both cases Page::getTextArray() doesn't return the right data. I already have a workaround for this case that uses Page::getTextArray() with a small change.

I will open a new issue to discuss the FPDI case separately, so it doesn't get mixed up with this one.

@izabala
Contributor

izabala commented Aug 19, 2021

Hi, I have already made the fix/workaround, but when I open the pull request, the automated validation gives me some errors. Can someone help me with that? (By the way, instructions for a Windows machine would suit me best.)

@eddturtle
Author

I've tried the same thing again, but changed the pdfparser code on my local computer to copy the changes you made in the linked commit and it looks like it works to me. It's returning data + text through getDataTm(). Thanks for the fix @izabala

@k00ni
Collaborator

k00ni commented Aug 20, 2021

Please bear with me; the related pull request is #453. I will have a look next week to move the fix forward. @eddturtle it would be great if you could help us test these changes.

@izabala
Contributor

izabala commented Aug 20, 2021

You're welcome @eddturtle! I'm just waiting for help from @k00ni so the code can finally be merged into the master branch.

k00ni added a commit that referenced this issue Aug 27, 2021
* workaround for the Issue #450

The file causes two of the Page methods to fail.

Page->extractDecodedRawData was not returning the correct string. This was corrected.

Page->getTextArray breaks when Page->get('Contents') returns a PDFObject for which PDFObject->getTextArray($this) throws an Error. If you detect that and call PDFObject->getTextArray() without the argument instead, it returns the correct data. This is a workaround: exactly how the format of this PDF differs and why it fails needs deeper investigation. I ran all the PageTests and they pass.

This happens because the sample PDF file is not formatted the way we usually see in other files. I have a similar (not exactly the same) case for a file created with FPDI that also broke the getTextArray and getDataTm methods, but I am researching what is actually happening before I open an issue for that. As soon as I know, I will open the issue, hopefully with the workaround or fix already done.

* PageTest: attempt to fix cs issues

* Page.php: fixed cs issues

* ParserTest: fixed failing test testRetainImageContentImpact 

This test is a bit wonky because it relies on memory values which may differ from system to system and run to run.
Adjusted values to fix it.

Ref: https://github.com/smalot/pdfparser/pull/453/checks?check_run_id=3397695916#step:6:22

* refined memory threshold in ParserTest::testRetainImageContentImpact

* Update Page.php

* Taking out line

Taking out the line:
$decodedText = '';
This was not needed. Thanks @j0k3r

* Changing the catch of the Error

To catching Throwable.

Co-authored-by: Konrad Abicht <[email protected]>
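
The workaround described in this commit message (try the call with the page argument, fall back to the call without it, and catch Throwable rather than Exception) can be sketched roughly as below. This is an illustration under stated assumptions, not the code that was merged: StubContents and getTextArrayWithFallback are hypothetical stand-ins that only imitate the failure reported in this issue.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the object returned by Page->get('Contents').
final class StubContents
{
    public function getTextArray(?object $page = null): array
    {
        if (null !== $page) {
            // Imitates the reported failure: a method call on null raises \Error.
            $font = null;

            return [$font->decodeText('x')];
        }

        return ['{signature:signer505906:Please+Sign+Here}'];
    }
}

// Hypothetical helper showing the fallback pattern.
function getTextArrayWithFallback(StubContents $contents, object $page): array
{
    try {
        return $contents->getTextArray($page);
    } catch (\Throwable $e) {
        // "Call to a member function ... on null" is an \Error, not an
        // \Exception, so catch (\Exception $e) would not intercept it.
        return $contents->getTextArray();
    }
}

$result = getTextArrayWithFallback(new StubContents(), new stdClass());

// Show which class of throwable the failing call actually raises.
$caught = null;
try {
    (new StubContents())->getTextArray(new stdClass());
} catch (\Exception $e) {
    $caught = 'exception';
} catch (\Error $e) {
    $caught = 'error';
}
```

Catching Throwable covers both Error and Exception. The cost is that the fallback also hides unrelated failures, which is one reason the commit message calls this a workaround rather than a fix.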
k00ni added a commit that referenced this issue Oct 18, 2021
* Fix/workaround for Issue #454

When the PDF file is produced by setasign/fpdi/fpdi or FPDF, this fixes the methods returning nothing.
To do that, it needs to know that the producer is FPDF and what the page number is; both are used in conjunction with getXObjects.

* Update Page.php

Some of the changes requested on GitHub by @k00ni

* Update Page.php

Other changes requested by @k00ni

* Some other recommendations

Some other @k00ni recommendations

* After manually running php-cs-fixer

I manually ran dev-tools\vendor\bin\php-cs-fixer fix

* Correcting the phpstan error

* Update Page.php

Just a code enhancement

* Removing vscode\launch.json and some corrections

Some corrections mentioned by @k00ni.

* Creating some functions to make this clearer

Followed the recommendation of @k00ni to use extra functions to make the code clearer.

* After applying some @k00ni recommendations

Many changes following @k00ni recommendations.

* Updating the comment for the isFpdf function

Better explanation for the function

* Changes for correcting phpstan errors

* some changes

Changes in comments, function names and variable names.

* Reformatted some code parts

Co-authored-by: Konrad Abicht <[email protected]>
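
The commit message above mentions an isFpdf function that needs to know the producer is FPDF. A minimal sketch of such a check is shown below, assuming the document metadata is available as an array with a 'Producer' key and that FPDF producer strings begin with "FPDF". Both are assumptions, and looksLikeFpdf is a hypothetical name, not the library's implementation.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: does the metadata name FPDF as the producer?
 *
 * Assumption: $details is an array of document metadata in which
 * 'Producer', if present, is a string such as 'FPDF 1.83'.
 */
function looksLikeFpdf(array $details): bool
{
    return isset($details['Producer'])
        && is_string($details['Producer'])
        && 0 === strncmp($details['Producer'], 'FPDF', 4);
}

$fpdfFile = looksLikeFpdf(['Producer' => 'FPDF 1.83']);
$otherFile = looksLikeFpdf(['Producer' => 'Some Other Writer']);
$noProducer = looksLikeFpdf([]);
```

A prefix check like this is deliberately narrow: it only selects files that identify themselves as FPDF output, so other documents keep going through the normal code path.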
@izabala
Contributor

izabala commented Oct 18, 2021

@k00ni @eddturtle This problem should be closed, shouldn't it?
