Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document's object dictionary not used when asking for default font #434

Closed
jee7 opened this issue Jun 13, 2021 · 2 comments
Closed

Document's object dictionary not used when asking for default font #434

jee7 opened this issue Jun 13, 2021 · 2 comments

Comments

@jee7
Copy link
Contributor

jee7 commented Jun 13, 2021

From v0.17.0 a PDFObject::getDefaultFont() was asked in the beginning of PDFObject::getText() method. The getDefaultFont() asks for all the fonts inside a document and the first one is picked. The Document::getFonts() method uses Document::getObjectsByType(), which is uncached. Meaning it does not use the dictionary built inside the Document class, but rather loops through all the objects again, checking their type to find the fonts.

For PDF files with a lot of objects, this causes significant overhead. For example with the PDF file from this page. I apologize for not having a custom-made PDF here. With that PDF there is a drastic difference when parsing pages 77 and 78. With the v0.16.2 version both pages take < 0.1 seconds to parse. With versions starting from 0.17.0 (including the current latest version) they take ~300 seconds together.

image

Test code:

<?php

include('vendor/autoload.php');

set_time_limit(720);

$parser = new \Smalot\PdfParser\Parser();

// load PDF file content
$data = file_get_contents('https://comserv.cs.ut.ee/home/files/Shoush_ComputerScience_2020.pdf?study=ATILoputoo&reference=76F6FAFD4C9E6981D9A434D32D2E7EE2AE9C49E7');


// give PDF content to function and parse it
$pdf = $parser->parseContent($data); 

$pages = $pdf->getPages();

foreach ($pages as $i => $page) { /** @var $page Page */
	if ($i < 77) continue;
	if ($i > 78) continue;

	$startTime = microtime(true);
	$pageText = $page->getText();
	$endTime = microtime(true);
	
	echo '<b>Page ' . $i . ' (took ' . ($endTime - $startTime) . ' seconds, ' . round(memory_get_usage() / (1000 * 1000), 2) . ' MB RAM)</b><br>';
	var_dump($pageText);
	echo '<br><br>';
}

?>

When getting the text for all the pages (not just 77, 78), then the script doesn't seem to complete (Apache crashed at one point). The main issues here seems to be that the dictionary cache is not used and thus this huge number of calls to the Header::get() and other functions create an overhead. This is even more unnecessary as only the first font is actually required.

PR coming up.

jee7 added a commit to jee7/pdfparser that referenced this issue Jun 13, 2021
…The getObjectsByType() method now uses it correctly. The dictionary also should support subtype searches. Only one font is asked for and returned to get the default font.
jee7 pushed a commit to jee7/pdfparser that referenced this issue Jul 12, 2021
k00ni pushed a commit that referenced this issue Aug 30, 2021
* Fix for #434. Reworked the Document's object cache dictionary. The getObjectsByType() method now uses it correctly. The dictionary also should support subtype searches. Only one font is asked for and returned to get the default font.

* Added type declarations.

* Testing performance test workflow

* Testing performance test workflow

* Testing performance test workflow

* Testing performance test workflow

* Testing performance test workflow

* Testing performance test workflow

* Added performance testing as requested for PR to fix the issue #434

* Style fix

* File require fix.

* File require fix. Could not get autoload to work.

* GitHub performance is lower than in localhost.

* Style fix

* Performance tests GitHub Action name change.

* Autoload test (pretty sure this did not work before).

* Yep, autoload does not work. Revert.

* Performance tests run name change.

* Removed unnecessary PHPDocs and refactored methods to use Type Declarations instead when able.

* Style fix.

* Performance test also succeeds, when time is exactly the same as required (although this will likely never happen).

* More PHPDoc removal in favour of Type Declarations.

* Document cache dictionary performance test tweak.

* Removed unused parameters.

* Another Type Declarations fix.

* Another Type Declarations fix.

* Autoload test with composer update.

* Autoload test with composer update.

* Added the thesis document used in the document cache dictionary performance test to the repository. The author gave his approval.

* Automatic code style fix.

Co-authored-by: vagrant <[email protected]>
@k00ni
Copy link
Collaborator

k00ni commented Sep 10, 2021

Whats the status of this issue after the PR was merged?

@k00ni k00ni added the stale needs decision label Sep 10, 2021
@jee7
Copy link
Contributor Author

jee7 commented Sep 12, 2021

Yes, tested it and dev-master now runs in comparable time performance with v0.16.2 for me.

@jee7 jee7 closed this as completed Sep 12, 2021
@k00ni k00ni removed the stale needs decision label Sep 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants