PDF only being partly read. #474

bclarkson72 · 2021-10-27T12:11:48Z

I have many PDF invoices that the library reads fine, but the odd one does not return the full document.

Reads Fine.pdf
Bad Read.pdf

I cannot see any difference in the files at all.

k00ni · 2021-10-27T13:55:20Z

Maybe related to #473 and #471.

Was it always this way or did it work in the past?

bclarkson72 · 2021-10-27T13:58:38Z

To be honest, I only started working with these invoice PDFs recently. Previous files were fine, but I noticed issues when parsing these particular client files. I was using an old version, but after upgrading to a recent build the issue is the same.

bclarkson72 · 2021-10-28T12:32:52Z

Found the problem: ET is present elsewhere in $textCleaned, which breaks the preg_match_all:

The following change gets around the problem, but I don't think it is a suitable fix.

public function getSectionsText(?string $content): array
{
    $sections = [];
    $content = ' '.$content.' ';
    $textCleaned = $this->cleanContent($content, '_');
    // Workaround added here: "ET" also occurs inside other tokens (e.g. "PET"),
    // which makes the lazy match in preg_match_all below stop too early.
    $textCleaned = str_replace('PET', 'TEP', $textCleaned);
    // echo $textCleaned;
    // Extract text blocks.
    if (preg_match_all('/(\sQ)?\s+BT[\s|\(|\[]+(.*?)\s*ET(\sq)?/s', $textCleaned, $matches, \PREG_OFFSET_CAPTURE)) {

bclarkson72 · 2021-10-28T12:48:49Z

Output from cleanContent, which shows ET inside a font name (/FTxkPETkkj):

58.03 288.9 485.01 10.01 re
f
58.03 242.8 485.01 10.01 re
f
58.03 242.8 485.01 10.01 re
f
BT
0 g
0 Tr
/FTxkPETkkj 8 Tf
1 0 0 1 535.55 627.4 Tm
[_________________________________________________________________________]TJ
ET
q
0 38.88 -24 0 42.1 452.40002 cm
/IMcGhHwtqz Do
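To make the failure concrete, here is a minimal sketch that runs the pattern quoted earlier in this thread against a shortened copy of the output above. The input string is a hand-made excerpt (the run of underscores is abbreviated), not the real file, so treat it as an illustration only.

```php
<?php
// Shortened, hand-made excerpt of the cleanContent() output quoted above.
$textCleaned = " f\nBT\n0 g\n0 Tr\n/FTxkPETkkj 8 Tf\n1 0 0 1 535.55 627.4 Tm\n[____]TJ\nET\nq\n";

// Pattern as quoted from getSectionsText() earlier in this thread.
$pattern = '/(\sQ)?\s+BT[\s|\(|\[]+(.*?)\s*ET(\sq)?/s';

preg_match_all($pattern, $textCleaned, $matches);

// Group 2 is the body of the text block. Because (.*?) is lazy and \s* may
// match nothing, the match ends at the first "ET" it meets, which here is
// the one inside the font name /FTxkPETkkj.
$block = $matches[2][0];
var_dump($block); // the block stops at "/FTxkP"; the TJ line is lost
```

In this excerpt the captured block ends at `/FTxkP`, so the `Tm` and `TJ` lines never reach the text extraction, which would be consistent with the document being only partly read.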

bclarkson72 · 2021-10-28T13:00:10Z

This change seems to be the solution, assuming ET is always followed by a newline?

if (preg_match_all('/\s+BT[\s|\(|\[]+(.*?)\s*ET\n/s', $textCleaned, $matches, \PREG_OFFSET_CAPTURE)) {
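A rough way to check the newline assumption is to compare two patterns on small inputs. The sketch below assumes the proposal is the original getSectionsText() pattern with a trailing `\n` in place of the optional `(\sq)?` group. The second pattern, the helper `firstTextBlock()` and both input strings are hypothetical and are not taken from the library.

```php
<?php
// Hypothetical helper: returns the body of the first BT ... ET block, or null.
function firstTextBlock(string $pattern, string $content): ?string
{
    return preg_match($pattern, $content, $m) === 1 ? $m[1] : null;
}

// Proposal from this thread: ET must be followed directly by a newline.
$newlineAnchored = '/\s+BT[\s|\(|\[]+(.*?)\s*ET\n/s';

// Alternative sketch (an assumption): ET must be a standalone token, with
// whitespace before it and whitespace or end of input after it.
$tokenAnchored = '/\s+BT[\s|\(|\[]+(.*?)\s+ET(?=\s|$)/s';

// Hand-made inputs: one modelled on the output in this thread, one where
// another operator follows ET on the same line.
$excerpt  = "f\nBT\n0 g\n/FTxkPETkkj 8 Tf\n[____]TJ\nET\nq\n";
$sameLine = "q\nBT\n/F1 8 Tf\n[(Hi)]TJ\nET Q\n";

foreach ([$excerpt, $sameLine] as $input) {
    var_dump(firstTextBlock($newlineAnchored, $input));
    var_dump(firstTextBlock($tokenAnchored, $input));
}
```

Both patterns get past the `ET` inside the font name in the first input. Only the token-based one still finds the block when `ET` is followed by a space, which PDF content streams permit between operators. Whether cleanContent() can actually produce such output is not established in this thread.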

k00ni · 2021-10-29T08:46:17Z

If you think it's a sustainable solution, please send us a pull request and we can discuss the details there.

k00ni · 2022-07-12T07:39:57Z

@bclarkson72 can you please test if #533 fixes your problem?

k00ni added the "missing or incomplete functionality" and "bug" labels Oct 27, 2021

k00ni added the fix label Oct 29, 2021

PrinsFrank mentioned this issue May 5, 2022

Handle duplicate Xrefs in documents (Fixes missing pages bugs) #533

Closed

GreyWyvern mentioned this issue Aug 10, 2023

PdfParser does not consider the entire document stream #628

Closed

GreyWyvern mentioned this issue Aug 18, 2023

Major Update to PDFObject.php + Ancillary #634

Merged

k00ni closed this as completed in #634 Nov 7, 2023
