Document's object dictionary not used when asking for default font #434
Comments
jee7 added a commit to jee7/pdfparser that referenced this issue (Jun 13, 2021):
…The getObjectsByType() method now uses it correctly. The dictionary also should support subtype searches. Only one font is asked for and returned to get the default font.
jee7 pushed a commit to jee7/pdfparser that referenced this issue (Jul 12, 2021)
k00ni pushed a commit that referenced this issue (Aug 30, 2021):
* Fix for #434. Reworked the Document's object cache dictionary. The getObjectsByType() method now uses it correctly. The dictionary also should support subtype searches. Only one font is asked for and returned to get the default font.
* Added type declarations.
* Testing performance test workflow
* Testing performance test workflow
* Testing performance test workflow
* Testing performance test workflow
* Testing performance test workflow
* Testing performance test workflow
* Added performance testing as requested for PR to fix the issue #434
* Style fix
* File require fix.
* File require fix. Could not get autoload to work.
* GitHub performance is lower than in localhost.
* Style fix
* Performance tests GitHub Action name change.
* Autoload test (pretty sure this did not work before).
* Yep, autoload does not work. Revert.
* Performance tests run name change.
* Removed unnecessary PHPDocs and refactored methods to use Type Declarations instead when able.
* Style fix.
* Performance test also succeeds, when time is exactly the same as required (although this will likely never happen).
* More PHPDoc removal in favour of Type Declarations.
* Document cache dictionary performance test tweak.
* Removed unused parameters.
* Another Type Declarations fix.
* Another Type Declarations fix.
* Autoload test with composer update.
* Autoload test with composer update.
* Added the thesis document used in the document cache dictionary performance test to the repository. The author gave his approval.
* Automatic code style fix.
Co-authored-by: vagrant <[email protected]>
What's the status of this issue after the PR was merged?
Yes, I tested it, and dev-master now runs in time comparable to v0.16.2 for me.
Since v0.17.0, PDFObject::getDefaultFont() is called at the beginning of the PDFObject::getText() method. getDefaultFont() asks for all the fonts inside a document and picks the first one. The Document::getFonts() method uses Document::getObjectsByType(), which is uncached: it does not use the dictionary built inside the Document class, but rather loops through all the objects again, checking their type to find the fonts.

For PDF files with a lot of objects, this causes significant overhead, for example with the PDF file from this page. I apologize for not having a custom-made PDF here. With that PDF there is a drastic difference when parsing pages 77 and 78. With v0.16.2, both pages take < 0.1 seconds to parse. With versions starting from 0.17.0 (including the current latest version), they take ~300 seconds together.
Test code:
When getting the text for all the pages (not just 77 and 78), the script doesn't seem to complete (Apache crashed at one point). The main issue here seems to be that the dictionary cache is not used, so the huge number of calls to Header::get() and other functions creates an overhead. This is all the more unnecessary as only the first font is actually required.

PR coming up.
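To make the difference concrete, here is a minimal, self-contained sketch of the two lookup strategies. It is not pdfparser's actual code: SketchObject, SketchDocument, scanByType(), lookupByType() and firstOfType() are hypothetical names made up for illustration, and the real Document class may organise its dictionary differently.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a parsed PDF object (not a pdfparser class).
final class SketchObject
{
    public string $id;
    public string $type;

    public function __construct(string $id, string $type)
    {
        $this->id = $id;
        $this->type = $type;
    }
}

// Hypothetical stand-in for a document that indexes its objects by type.
final class SketchDocument
{
    /** @var array<string, SketchObject> all objects, keyed by id */
    private array $objects = [];

    /** @var array<string, array<string, SketchObject>> type => id => object */
    private array $dictionary = [];

    /** @param SketchObject[] $objects */
    public function __construct(array $objects)
    {
        foreach ($objects as $object) {
            $this->objects[$object->id] = $object;
            // Build the type index once, while the objects are loaded.
            $this->dictionary[$object->type][$object->id] = $object;
        }
    }

    // Uncached variant: walks every object on every call.
    public function scanByType(string $type): array
    {
        return array_filter(
            $this->objects,
            static fn (SketchObject $o): bool => $o->type === $type
        );
    }

    // Cached variant: a single array lookup in the prebuilt index.
    public function lookupByType(string $type): array
    {
        return $this->dictionary[$type] ?? [];
    }

    // Returns only the first object of a type, e.g. for a default font.
    public function firstOfType(string $type): ?SketchObject
    {
        $matches = $this->dictionary[$type] ?? [];

        return $matches === [] ? null : $matches[array_key_first($matches)];
    }
}

$doc = new SketchDocument([
    new SketchObject('1_0', 'Page'),
    new SketchObject('2_0', 'Font'),
    new SketchObject('3_0', 'Font'),
]);
$defaultFont = $doc->firstOfType('Font');
```

Both variants return the same objects here; the difference is that scanByType() costs one pass over all objects per call, so calling it once per text extraction scales badly for documents with many objects, while lookupByType() and firstOfType() only touch the index.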