workaround for the Issue #450 (#453)
* workaround for the Issue #450

The sample file causes two of the Page methods to fail.

Page->extractDecodedRawData was not returning the correct string: when no current font was set, it produced an empty string instead of the raw text. This has been corrected.

Page->getTextArray breaks when Page->get('Contents') returns a PDFObject for which PDFObject->getTextArray($this) throws an Error. If the error is detected and PDFObject->getTextArray() is called instead, it returns the correct data. This is a workaround: exactly how the format of this PDF differs, and why the call fails, needs deeper investigation. I ran all the PageTests and they pass.

This happens because the sample PDF file is not formatted the way we usually see in other files. I have a similar (not identical) case with a file created with FPDI that also broke the getTextArray and getDataTm methods, but I am still researching what actually happens there before opening an issue. As soon as I know what is going on in that case, I will open the issue, hopefully with a workaround or fix already done.
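
The fallback described above for Page->getTextArray can be sketched in isolation roughly as follows. This is a simplified illustration, not the code in Page.php: FakeContents and textArrayWithFallback are hypothetical names, and the simulated Error only stands in for whatever the real PDFObject raises on this file.

```php
<?php

// Hypothetical stand-in for a content object; this is NOT the real PDFObject.
final class FakeContents
{
    /** @return string[] */
    public function getTextArray(?object $page = null): array
    {
        if (null !== $page) {
            // Simulate the failure seen when a page context is passed in.
            throw new \Error('cannot resolve text with page context');
        }

        return ['text'];
    }
}

/**
 * Try the page-aware call first; fall back to the context-free call
 * if anything is thrown.
 *
 * @return string[]
 */
function textArrayWithFallback(FakeContents $contents, object $page): array
{
    try {
        return $contents->getTextArray($page);
    } catch (\Throwable $e) {
        return $contents->getTextArray();
    }
}

$result = textArrayWithFallback(new FakeContents(), new \stdClass());
```

The trade-off is that the original failure is swallowed, which matches the intent of a workaround but hides the root cause until it is investigated.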

* PageTest: attempt to fix cs issues

* Page.php: fixed cs issues

* ParserTest: fixed failing test testRetainImageContentImpact 

This test is a bit wonky because it relies on memory values which may differ from system to system and run to run.
Adjusted values to fix it.

Ref: https://github.com/smalot/pdfparser/pull/453/checks?check_run_id=3397695916#step:6:22

* refined memory threshold in ParserTest::testRetainImageContentImpact

* Update Page.php

* Taking out line

Taking out the line:
$decodedText = '';
This was not needed. Thanks @j0k3r

* Changing the catch of the Error

The catch now targets Throwable instead of Error.
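
As a small illustration of why this change matters: in PHP, \Error and \Exception are separate branches under \Throwable, so a catch on \Error lets exceptions pass through. The sketch below uses hypothetical helper names (callCatchingError, callCatchingThrowable) and is not taken from the library.

```php
<?php

/** Runs $fn, handling only \Error (engine-level failures such as TypeError). */
function callCatchingError(callable $fn): string
{
    try {
        $fn();

        return 'completed';
    } catch (\Error $e) {
        return 'caught';
    }
}

/** Runs $fn, handling anything throwable: both \Error and \Exception. */
function callCatchingThrowable(callable $fn): string
{
    try {
        $fn();

        return 'completed';
    } catch (\Throwable $e) {
        return 'caught';
    }
}

$thrower = static function (): void {
    throw new \RuntimeException('library-level failure');
};

$withThrowable = callCatchingThrowable($thrower);

try {
    $withError = callCatchingError($thrower);
} catch (\RuntimeException $e) {
    // The \Error-only handler did not intercept the exception.
    $withError = 'escaped';
}
```

Here $withThrowable ends up as 'caught', while $withError ends up as 'escaped' because the RuntimeException is not an \Error.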

Co-authored-by: Konrad Abicht
izabala and k00ni authored Aug 27, 2021
1 parent 5667bdf commit 5dd2329
Showing 4 changed files with 51 additions and 8 deletions.
Binary file added samples/bugs/Issue450.pdf
Binary file not shown.
12 changes: 7 additions & 5 deletions src/Smalot/PdfParser/Page.php
@@ -240,6 +240,12 @@ public function getTextArray(self $page = null): array

$header = new Header([], $this->document);
$contents = new PDFObject($this->document, $header, $new_content, $this->config);
+} else {
+try {
+$contents->getTextArray($this);
+} catch (\Throwable $e) {
+return $contents->getTextArray();
+}
}
} elseif ($contents instanceof ElementArray) {
// Create a virtual global content.
@@ -342,11 +348,7 @@ public function extractDecodedRawData(array $extractedRawData = null): array
continue;
}
$tmpText = $data[$i]['c'];
-$decodedText = '';
-if (isset($currentFont)) {
-$decodedText = $currentFont->decodeOctal($tmpText);
-//$tmpText = $currentFont->decodeHexadecimal($tmpText, false);
-}
+$decodedText = isset($currentFont) ? $currentFont->decodeOctal($tmpText) : $tmpText;
$decodedText = str_replace(
['\\\\', '\(', '\)', '\n', '\r', '\t', '\ '],
['\\', '(', ')', "\n", "\r", "\t", ' '],
41 changes: 41 additions & 0 deletions tests/Integration/PageTest.php
@@ -570,4 +570,45 @@ public function testGetTextXY()
$result = $page->getTextXY(174, 827, 1, 1);
$this->assertStringContainsString('Purchase 2', $result[0][1]);
}

+public function testExtractDecodedRawDataIssue450()
+{
+$filename = $this->rootDir.'/samples/bugs/Issue450.pdf';
+$parser = $this->getParserInstance();
+$document = $parser->parseFile($filename);
+$pages = $document->getPages();
+$page = $pages[0];
+$extractedDecodedRawData = $page->extractDecodedRawData();
+$this->assertIsArray($extractedDecodedRawData);
+$this->assertGreaterThan(3, \count($extractedDecodedRawData));
+$this->assertIsArray($extractedDecodedRawData[3]);
+$this->assertEquals('TJ', $extractedDecodedRawData[3]['o']);
+$this->assertIsArray($extractedDecodedRawData[3]['c']);
+$this->assertIsArray($extractedDecodedRawData[3]['c'][0]);
+$this->assertEquals(3, \count($extractedDecodedRawData[3]['c'][0]));
+$this->assertEquals('{signature:signer505906:Please+Sign+Here}', $extractedDecodedRawData[3]['c'][0]['c']);
+}
+
+public function testGetDataTmIssue450()
+{
+$filename = $this->rootDir.'/samples/bugs/Issue450.pdf';
+$parser = $this->getParserInstance();
+$document = $parser->parseFile($filename);
+$pages = $document->getPages();
+$page = $pages[0];
+$dataTm = $page->getDataTm();
+$this->assertIsArray($dataTm);
+$this->assertEquals(1, \count($dataTm));
+$this->assertIsArray($dataTm[0]);
+$this->assertEquals(2, \count($dataTm[0]));
+$this->assertIsArray($dataTm[0][0]);
+$this->assertEquals(6, \count($dataTm[0][0]));
+$this->assertEquals(1, $dataTm[0][0][0]);
+$this->assertEquals(0, $dataTm[0][0][1]);
+$this->assertEquals(0, $dataTm[0][0][2]);
+$this->assertEquals(1, $dataTm[0][0][3]);
+$this->assertEquals(67.5, $dataTm[0][0][4]);
+$this->assertEquals(756.25, $dataTm[0][0][5]);
+$this->assertEquals('{signature:signer505906:Please+Sign+Here}', $dataTm[0][1]);
+}
}
6 changes: 3 additions & 3 deletions tests/Integration/ParserTest.php
@@ -321,7 +321,7 @@ public function testRetainImageContentImpact()
}

$filename = $this->rootDir.'/samples/bugs/Issue104a.pdf';
-$iterations = 1;
+$iterations = 2;

/*
* check default (= true)
@@ -335,7 +335,7 @@ public function testRetainImageContentImpact()
}

$usedMemory = memory_get_usage(true);
-$this->assertTrue($usedMemory > 100000000, 'Memory is only '.$usedMemory);
+$this->assertTrue($usedMemory > 200000000, 'Memory is only '.$usedMemory);
$this->assertTrue(null != $document && 0 < \strlen($document->getText()));

// force garbage collection
@@ -359,7 +359,7 @@ public function testRetainImageContentImpact()
* note: the following memory value is set manually and may differ from system to system.
* it must be high enough to not produce a false negative though.
*/
-$this->assertTrue($usedMemory < 106000000, 'Memory is '.$usedMemory);
+$this->assertTrue($usedMemory < 107000000, 'Memory is '.$usedMemory);
$this->assertTrue(0 < \strlen($document->getText()));
}
}