Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for #434. Reworked the Document's object cache dictionary. #435

Merged
merged 34 commits into from
Aug 30, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
874af8e
Fix for #434. Reworked the Document's object cache dictionary. The ge…
jee7 Jun 13, 2021
ef89b92
Added type declarations.
jee7 Jun 28, 2021
467c99a
Testing performance test workflow
Jul 12, 2021
a880d7f
Testing performance test workflow
Jul 12, 2021
8de0ac6
Testing performance test workflow
Jul 12, 2021
426c062
Testing performance test workflow
Jul 12, 2021
3f2a14a
Testing performance test workflow
Jul 12, 2021
b8304bf
Testing performance test workflow
Jul 12, 2021
9653ad6
Added performance testing as requested for PR to fix the issue #434
Jul 12, 2021
ddbd300
Style fix
Jul 12, 2021
663d07e
File require fix.
Jul 12, 2021
ce0ace3
File require fix. Could not get autoload to work.
Jul 12, 2021
1c480e1
GitHub performance is lower than in localhost.
Jul 12, 2021
25de6b7
Style fix
Jul 12, 2021
5d2db06
Performance tests GitHub Action name change.
jee7 Jul 13, 2021
7c397b7
Autoload test (pretty sure this did not work before).
jee7 Jul 13, 2021
bb6df21
Yep, autoload does not work. Revert.
jee7 Jul 13, 2021
1d5b6c5
Performance tests run name change.
jee7 Jul 13, 2021
e9ca8d5
Removed unnecessary PHPDocs and refactored methods to use Type Declar…
jee7 Jul 13, 2021
d9ff2dc
Style fix.
jee7 Jul 13, 2021
468b0a6
Performance test also succeeds, when time is exactly the same as requ…
jee7 Jul 13, 2021
f031337
More PHPDoc removal in favour of Type Declarations.
jee7 Jul 13, 2021
b488ef2
Document cache dictionary performance test tweak.
jee7 Jul 13, 2021
c15135d
Removed unused parameters.
jee7 Jul 13, 2021
a30f2d4
Another Type Declarations fix.
jee7 Jul 13, 2021
b7a07e1
Another Type Declarations fix.
jee7 Jul 13, 2021
5ae30ef
Merge branch 'master' into master
jee7 Jul 13, 2021
d393566
Autoload test with composer update.
jee7 Jul 14, 2021
40ff4ef
Merge remote-tracking branch 'origin/master'
jee7 Jul 14, 2021
3e2b4a2
Autoload test with composer update.
jee7 Jul 14, 2021
bc520f4
Added the thesis document used in the document cache dictionary perfo…
jee7 Jul 14, 2021
36802e9
Merge branch 'smalot:master' into master
jee7 Jul 28, 2021
508f5a6
Merge branch 'master' into master
jee7 Aug 25, 2021
9ccc204
Automatic code style fix.
jee7 Aug 25, 2021
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 30 additions & 0 deletions .github/workflows/performance.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
name: "Performance Tests"

on:
pull_request:
push:
branches:
- "master"

env:
fail-fast: true

jobs:
performance-tests:
name: "Tests for the performance testing the PDF parsing"
runs-on: "ubuntu-20.04"

strategy:
matrix:
php:
- "7.4"
k00ni marked this conversation as resolved.
Show resolved Hide resolved

steps:
- name: "Checkout"
uses: "actions/checkout@v2"

- name: "Run composer for further autoloading"
run: "composer update"

- name: "Run performance tests"
run: "php tests/Performance/runPerformanceTests.php"
Binary file added samples/DocumentWithLotsOfObjects.pdf
Binary file not shown.
59 changes: 45 additions & 14 deletions src/Smalot/PdfParser/Document.php
Original file line number Diff line number Diff line change
Expand Up @@ -91,10 +91,26 @@ protected function buildDictionary()
$this->dictionary = [];

foreach ($this->objects as $id => $object) {
// Cache objects by type and subtype
$type = $object->getHeader()->get('Type')->getContent();

if (!empty($type)) {
$this->dictionary[$type][$id] = $id;
if (null != $type) {
k00ni marked this conversation as resolved.
Show resolved Hide resolved
if (!isset($this->dictionary[$type])) {
$this->dictionary[$type] = [
'all' => [],
'subtype' => [],
];
}

$this->dictionary[$type]['all'][$id] = $object;

$subtype = $object->getHeader()->get('Subtype')->getContent();
if (null != $subtype) {
if (!isset($this->dictionary[$type]['subtype'][$subtype])) {
$this->dictionary[$type]['subtype'][$subtype] = [];
}
$this->dictionary[$type]['subtype'][$subtype][$id] = $object;
}
}
}
}
Expand Down Expand Up @@ -164,19 +180,26 @@ public function getObjectById(string $id)
return null;
}

public function hasObjectsByType(string $type, ?string $subtype = null): bool
{
return 0 < \count($this->getObjectsByType($type, $subtype));
}

public function getObjectsByType(string $type, ?string $subtype = null): array
{
$objects = [];
if (!isset($this->dictionary[$type])) {
return [];
}

foreach ($this->objects as $id => $object) {
if ($object->getHeader()->get('Type') == $type &&
(null === $subtype || $object->getHeader()->get('Subtype') == $subtype)
) {
$objects[$id] = $object;
if (null != $subtype) {
if (!isset($this->dictionary[$type]['subtype'][$subtype])) {
return [];
}

return $this->dictionary[$type]['subtype'][$subtype];
}

return $objects;
return $this->dictionary[$type]['all'];
}

/**
Expand All @@ -187,25 +210,33 @@ public function getFonts()
return $this->getObjectsByType('Font');
}

public function getFirstFont(): ?Font
{
$fonts = $this->getFonts();

return reset($fonts);
}

/**
* @return Page[]
*
* @throws \Exception
*/
public function getPages()
{
if (isset($this->dictionary['Catalog'])) {
if ($this->hasObjectsByType('Catalog')) {
// Search for catalog to list pages.
$id = reset($this->dictionary['Catalog']);
$catalogues = $this->getObjectsByType('Catalog');
$catalogue = reset($catalogues);

/** @var Pages $object */
$object = $this->objects[$id]->get('Pages');
$object = $catalogue->get('Pages');
k00ni marked this conversation as resolved.
Show resolved Hide resolved
if (method_exists($object, 'getPages')) {
return $object->getPages(true);
}
}

if (isset($this->dictionary['Pages'])) {
if ($this->hasObjectsByType('Pages')) {
// Search for pages to list kids.
$pages = [];

Expand All @@ -218,7 +249,7 @@ public function getPages()
return $pages;
}

if (isset($this->dictionary['Page'])) {
if ($this->hasObjectsByType('Page')) {
// Search for 'page' (unordered pages).
$pages = $this->getObjectsByType('Page');

Expand Down
4 changes: 1 addition & 3 deletions src/Smalot/PdfParser/Encoding.php
Original file line number Diff line number Diff line change
Expand Up @@ -120,9 +120,7 @@ public function translateChar($dec): int
}

/**
* Returns the name of the encoding class, if available.
*
* @return string Returns encoding class name if available or empty string (only prior PHP 7.4).
* Returns encoding class name if available or empty string (only prior PHP 7.4).
*
* @throws \Exception On PHP 7.4+ an exception is thrown if encoding class doesn't exist.
*/
Expand Down
2 changes: 1 addition & 1 deletion src/Smalot/PdfParser/Header.php
Original file line number Diff line number Diff line change
Expand Up @@ -120,7 +120,7 @@ public function getDetails(bool $deep = true): array
/**
* Indicate if an element name is available in header.
*
* @param string $name the name of the element
* @param string $name the name of the element
*/
public function has(string $name): bool
{
Expand Down
2 changes: 1 addition & 1 deletion src/Smalot/PdfParser/PDFObject.php
Original file line number Diff line number Diff line change
Expand Up @@ -222,7 +222,7 @@ private function getDefaultFont(Page $page = null): Font
$fonts = $page->getFonts();
}

$fonts = array_merge($fonts, array_values($this->document->getFonts()));
$fonts[] = $this->document->getFirstFont();

if (\count($fonts) > 0) {
return reset($fonts);
Expand Down
2 changes: 1 addition & 1 deletion src/Smalot/PdfParser/Page.php
Original file line number Diff line number Diff line change
Expand Up @@ -263,7 +263,7 @@ public function getTextArray(self $page = null): array
/**
* Gets all the text data with its internal representation of the page.
*
* @return array An array with the data and the internal representation
* Returns an array with the data and the internal representation
*/
public function extractRawData(): array
{
Expand Down
4 changes: 2 additions & 2 deletions tests/Integration/DocumentTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -117,8 +117,8 @@ public function testDictionary()

$objects = $document->getDictionary();
$this->assertEquals(1, \count($objects));
$this->assertEquals(1, \count($objects['Page']));
$this->assertEquals(2, $objects['Page'][2]);
$this->assertEquals(1, \count($objects['Page']['all']));
$this->assertEquals($object2, $objects['Page']['all'][2]);
}

public function testGetObjectsByType()
Expand Down
7 changes: 7 additions & 0 deletions tests/Performance/Exception/PerformanceFailException.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
<?php

namespace Tests\Smalot\PdfParser\Performance\Exception;

class PerformanceFailException extends \Exception
{
}
21 changes: 21 additions & 0 deletions tests/Performance/Test/AbstractPerformanceTest.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
<?php

namespace Tests\Smalot\PdfParser\Performance\Test;

abstract class AbstractPerformanceTest
{
/**
* Initializes the test (eg, fetches the files etc).
*/
abstract public function init(): void;

/**
* Executes the test.
*/
abstract public function run(): void;

/**
* Returns the time over which the test is considered a fail.
*/
abstract public function getMaxEstimatedTime(): int;
}
83 changes: 83 additions & 0 deletions tests/Performance/Test/DocumentDictionaryCacheTest.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
<?php

/**
* @file This file is part of the PdfParser library.
*
* @author Konrad Abicht <[email protected]>
* @date 2020-06-01
*
* @author Sébastien MALOT <[email protected]>
* @date 2017-01-03
*
* @license LGPLv3
* @url <https://github.com/smalot/pdfparser>
*
* PdfParser is a pdf library written in PHP, extraction oriented.
* Copyright (C) 2017 - Sébastien MALOT <[email protected]>
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU Lesser General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public License
* along with this program.
* If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
*/

namespace Tests\Smalot\PdfParser\Performance\Test;

use Smalot\PdfParser\Page;
use Smalot\PdfParser\Parser;

/**
* This test checks does a performance test with certain PDF files that extensively use
* the getFirstFont() method of Document.php. If Document.php correctly uses a dictionary
* to cache the objects inside the PDF file, then the parsing should be quick.
* If it does not, the parsing can be extensively slow or even crash.
*/
class DocumentDictionaryCacheTest extends AbstractPerformanceTest
{
/**
* @var Parser
*/
protected $parser;
protected $data;

public function init(): void
{
$this->parser = new Parser();

// load PDF file content
$this->data = file_get_contents(__DIR__.'/../../../samples/DocumentWithLotsOfObjects.pdf');
}

public function run(): void
{
// give PDF content to function and parse it
$pdf = $this->parser->parseContent($this->data);

$pages = $pdf->getPages();

foreach ($pages as $i => $page) { /* @var $page Page */
if ($i < 77) {
continue;
}
if ($i > 78) {
continue;
}

$page->getText(); // Test this method
}
}

public function getMaxEstimatedTime(): int
{
return 20;
}
}
21 changes: 21 additions & 0 deletions tests/Performance/runPerformanceTests.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
<?php

require __DIR__.'/../../vendor/autoload.php';

$tests = [
new \Tests\Smalot\PdfParser\Performance\Test\DocumentDictionaryCacheTest(),
];

foreach ($tests as $test) { /* @var $test \Tests\Smalot\PdfParser\Performance\Test\AbstractPerformanceTest */
$test->init();

$startTime = microtime(true);
$test->run();
$endTime = microtime(true);

$time = $endTime - $startTime;

if ($test->getMaxEstimatedTime() <= $time) {
throw new \Tests\Smalot\PdfParser\Performance\Exception\PerformanceFailException(sprintf('Performance failed on test "%s". Time taken was %.2f seconds, expected less than %d seconds.', get_class($test), $time, $test->getMaxEstimatedTime()));
}
}