Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for #434. Reworked the Document's object cache dictionary. #435

Merged
merged 34 commits into from
Aug 30, 2021

Conversation

jee7
Copy link
Contributor

@jee7 jee7 commented Jun 13, 2021

The getObjectsByType() method now uses it correctly. The dictionary also should support subtype searches. Only one font is asked for and returned to get the default font.

This solution parses the 2 pages of the big PDF document linked in #434 in about the same time as v0.16.2 did (<0.1 sec per page). On my machine, it's 0.005 s for page 77 and 0.113 s for page 78 (comparable numbers to v0.16.2 speed).

…The getObjectsByType() method now uses it correctly. The dictionary also should support subtype searches. Only one font is asked for and returned to get the default font.
Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jee7 thanks for your pull request! 👍🏿

Besides my comments below, is there a way to add at least 1 test which check changes or outlines performance boost? Maybe based on your test code from here.

src/Smalot/PdfParser/Document.php Show resolved Hide resolved
src/Smalot/PdfParser/Document.php Outdated Show resolved Hide resolved
src/Smalot/PdfParser/Document.php Outdated Show resolved Hide resolved
src/Smalot/PdfParser/Document.php Show resolved Hide resolved
@k00ni k00ni self-assigned this Jun 15, 2021
@k00ni
Copy link
Collaborator

k00ni commented Jun 28, 2021

Hey @jee7 how is it going?

@jee7
Copy link
Contributor Author

jee7 commented Jun 28, 2021

Hey @jee7 how is it going?

Apologies, I did not get any notifications from your comments before.

  • The type declarations are added as per recommendations. Although, if this will be the new convention for this project, then it should be done everywhere, right? Currently, these two methods are now different from the rest of the code because of the PHP type declarations.
  • I cannot come up with a way to implement the performance test in a good manner. I'm new to unit tests and as far as I understood they should not be used for measuring performance. Your suggestion?

@k00ni
Copy link
Collaborator

k00ni commented Jul 12, 2021

  • The type declarations are added as per recommendations. Although, if this will be the new convention for this project, then it should be done everywhere, right? Currently, these two methods are now different from the rest of the code because of the PHP type declarations.

No, in my opinion we should be as explicit about new things as possible (e.g. add return values etc.), but that doesn't mean we have to adapt the old code too. It can be updated later on during another PR or some refactoring. This way we have better code quality and understanding and don't need to rely on PHP doc.

  • I cannot come up with a way to implement the performance test in a good manner. I'm new to unit tests and as far as I understood they should not be used for measuring performance. Your suggestion?

Another Github Action could be used which only contains performance tests. You can copy .github/workflows/continuous-integration.yml to performance-tests.yml (or something better than that) and write 1, 2 simple tests (maybe based on your previos work?). We can start small here to get this going. I am not an expert in performance testing, but we do the best we can for the moment. Everyone is welcomed to comment. Each performance test should load a PDF and should fail if it needs longer than x seconds to parse, for instance.

@jee7
Copy link
Contributor Author

jee7 commented Jul 12, 2021

Okay, how about now?

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your changes. Looks good, we are getting there!

tests/Performance/runPerformanceTests.php Outdated Show resolved Hide resolved
.github/workflows/performance.yml Outdated Show resolved Hide resolved
.github/workflows/performance.yml Outdated Show resolved Hide resolved
.github/workflows/performance.yml Show resolved Hide resolved
src/Smalot/PdfParser/Document.php Outdated Show resolved Hide resolved
tests/Performance/Test/AbstractPerformanceTest.php Outdated Show resolved Hide resolved
tests/Performance/Test/AbstractPerformanceTest.php Outdated Show resolved Hide resolved
tests/Performance/Test/DocumentDictionaryCacheTest.php Outdated Show resolved Hide resolved
tests/Performance/Test/DocumentDictionaryCacheTest.php Outdated Show resolved Hide resolved
tests/Performance/runPerformanceTests.php Outdated Show resolved Hide resolved
@jee7
Copy link
Contributor Author

jee7 commented Jul 13, 2021

The Type Declaration should be now refactored to be better throughout. Other smaller fixes were done as well. Autoloading does not work. I asked the author of the PDF file if we can also host it. I'll let you know what he responds.

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your adaptions.

But why did you refactored unrelated files too? It bloats your pull request and mixes improvements and unnecessary changes (like refactorings). We do such changes separately.

Here are some information why pull requests should be kept small: https://essenceofcode.com/2019/10/29/the-art-of-small-pull-requests/

Creating small, focused PRs creates a good invitation to your reviewer to provide quality feedback. Smaller PRs usually result in:

  • Improved code quality
  • Better feedback
  • Faster feedback
  • Increased understandability
  • Better communication (with team members)

Please revert these changes so your pull request only contains optimizations towards a better object cache dictionary.

@jee7
Copy link
Contributor Author

jee7 commented Jul 14, 2021

Autoloading with running composer update before works, thanks. The author of the thesis e-mailed me back with the approval of using his thesis PDF in this repo. Added it to the samples folder. You told me to remove the PHPDoc in favor of Type Declarations. It only makes sense to then unify the entire project to use the same convention. I find it bothersome to work with code, which mixes different conventions together. If we want new code to use Type Declarations instead of PHPDoc, then the rest of the project must follow suit before this.

@k00ni
Copy link
Collaborator

k00ni commented Jul 14, 2021

I find it bothersome to work with code, which mixes different conventions together. If we want new code to use Type Declarations instead of PHPDoc, then the rest of the project must follow suit before this.

I understand your point. My suggestion to solve this issue would be that you move these changes into a separate branch and make another pull request. Finish this PR first or do refactoring before, as you like. What do you think?

@jee7
Copy link
Contributor Author

jee7 commented Jul 14, 2021

Okay. PR #440 now includes the PHPDoc / Type Declaration changes.

@k00ni
Copy link
Collaborator

k00ni commented Jul 15, 2021

Is this PR now dependent on #440?

@jee7
Copy link
Contributor Author

jee7 commented Jul 15, 2021

Is this PR now dependent on #440?

That would seem efficient, yes. When that gets approved, I can pull and merge it back into this PR. This will hopefully save a few hours of trying to put all the PHPDocs back here the way they were.

@k00ni
Copy link
Collaborator

k00ni commented Jul 22, 2021

@jee7 can you check if #441 affects your performance test?

@jee7
Copy link
Contributor Author

jee7 commented Jul 28, 2021

@jee7 can you check if #441 affects your performance test?

Seems not to affect it to me.

@k00ni
Copy link
Collaborator

k00ni commented Aug 25, 2021

@jee7 your PR with refactored code was merged. Can you solve merge conflicts so we can continue here?

@jee7
Copy link
Contributor Author

jee7 commented Aug 25, 2021

@jee7 your PR with refactored code was merged. Can you solve merge conflicts so we can continue here?

Should be good now.

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏿 Thank you, looks good!

I will keep this open for around week then I will merge.

@k00ni k00ni merged commit e2ca3d2 into smalot:master Aug 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants