-
Notifications
You must be signed in to change notification settings - Fork 551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix encoding for encoding dictionary without Type item. #500
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@likemusic First of all, happy new year and thank you for your pull request!
Please check my questions and remarks below.
- You added a lot of code. Please describe inline (e.g. function headers) what each code part does and why you added/changed it.
- Add tests to cover your changes.
- Explain
Also deleted for $unicode passed by reference parameters in Font class because seems that it have no sense (not sure).
Let me know if you need help.
@k00ni Happy New Year to you, too! And thank you too for active maintenance this helpfull project! |
I must add comment like you do with you reamarks and question and I shouldn't commit any comment to source code. Did I understand you correctly? |
@k00ni Unfotrunaly I can't commit pdf-file bacause it's a bank statement :) But I'll try to find suitable pdf-file in some of opened issues related to incorect output encodings. By the way I will test those issues and mark wich one this PR will fix. |
@k00ni I want to mark encoding related issued by I want to add #encoding tag in comments for issues related to encoding and point you (for adding label Is it good idea? Can I do this? |
This project is a personal project of @smalot, I as a collaborator have no rights to enable this for you. @smalot is also very rarely seen around here, so we have to help ourselves. What do you think about this: You post in related issues a short hint and I will add the label for you. I already created "encoding issues", which is more clear. You agree?
I think there is a misunderstanding. In my opinion documentation should be inline next to the code it is about. Additional documentation can be used in, for instance, README.md or DEVERLOPER.md. Please add information, links, thoughts etc. next to your new code (via further commits). You can commit as much as you want, in the end I will squash all your commits anyway.
That is wonderful and will help us too! You can test your changes with mocks and stubs for instance, you don't need a working PDF file. Can you create a minimal string which represents a "broken" PDF? You then pass it to the Parser and see how it reacts. Take this line for instance: https://github.com/smalot/pdfparser/pull/500/files#diff-27bc594b24bd5e2779e8d81ee79810d0ffda03f200f0d53be007a44a9d2cb2deR391 You can inject
This way you can inject correct and incorrect data and see how it reacts. Let me know if you need help. |
@k00ni It would be great! I meant the same, but my English is not as good as I would like :) At the moment I sterted and hangs on the investigeting for first issue, not related to but fixed in this PR. |
@k00ni I made small refactoring and add coments for not unobvious logic and methods names. |
@k00ni I don't like this idea because It allows create combinations not compartible with real PDF-file format. Therefore real pdf file would be preferable, bacause it can be tested visually by Adobe Acrobat Reader. Frst I will try to find suitable file in opened encoding-related issues. If I have no luck I'll try to "rebuld" manually my bank statement's pdf to small suitable file with one string. If it would be too difficult I'll make synthetic test as you suggested. |
Just some background information: There are different types of tests, like unit and integration tests. Each one includes another/wider range of system parts. For instance, synthetic tests can be implemented as unit or integration tests and allow very narrow views/checks. PDF files are usually complex and include many parts of the system under test, which might lead to a "polluted" result. Your plan is sound, looking forward to your next commits. |
Fixed all files in 0.018 seconds, 12.000 MB memory used
If anyone finds the time, the test file samples/InternationalChars.pdf I created a while ago would be a starting point for additional testing. There are tests for 6 of the languages in
|
Actually writing I have been following this PR from the beginning. I didn't want to write from the beginning. However, I am writing as a courtesy since I was invited, thanks for @k00ni. The reason I wanted to write was that I could not bring a new thought to the subject. I disagree with @j0k3r about ":void noise". It may not be correct to add this as an addition here, but it doesn't make sense to do other PR. If it is already done here, there is no need to do it again. After all, it's not such a big change that it needs to be separated. It's fine, except that it makes it harder to skim over this change later on. If this is seen as a problem, necessary actions can be taken. Like my dear friend @Connum, I do not have much idea about the corrections made here. Looking at the PRs in the past, I can't find any change in this direction. That's why I find this work of @likemusic very valuable. I hope he continues these efforts. I agree with @Connum on tests. In this regard, some tests that have been added before can be passed. It can also be added in a few tests. There are many tests I want to add, but unfortunately I can't find the time. Finally, I can say that @likemusic increases the readability considerably in some encodings. I see no obstacle to the acceptance of this PR in its current form. I hope @likemusic makes more corrections like this with valuable expertise. |
A bit off topic: I hope I didn't make the impression, that anyone needs my permission to contribute. We welcome everyone to contribute, comment and ask on issues and pull requests, invitations not needed here :) |
@likemusic how are you doing? Can you tell us your plans for this PR? @j0k3r @Connum and @Reqrefusion thank you for your feedback. I can live with the It would be great if we can finish this PR so I have more time starting documentation changes (#498). |
@k00ni I'll get back to this PR on the Weekend or during next week. |
@Connum This PR doesn't relate to these issues. As far as I understand issues in InternationalChars.pdf are related to Identity-h/Identity-v encodings handling. Thank you for this useful notice! It will help to test Identity-h/Identity-v fixes. |
@k00ni I don't really understand why this could happen. I've run tests on my laptop under PHP v7.4 without errors. If this happens again I will try to reproduce it, profile by xDebug, and fix it. |
There is one remaining failing test, could you fix it locally?
|
@likemusic you made the right call to do these off-topic changes in a separate PR. I approved and merged it right away. Please merge in After performance tests were fixed, this PR is good to go. |
I updated the code to handle pdfparser/src/Smalot/PdfParser/Font.php Lines 491 to 496 in 43ca68f
This then fix the |
I think there has been a problem with old coding related to performance testing, I hope something will be done about this in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good addition @j0k3r
I merged in master
branch which leads to a way smaller PR.
PR is accepted. Is there anything left to say or do @likemusic?
@likemusic sounds good. Thank you for being patient and your work here. |
In PDF-file in Font internal encryption point to dictionary without Type item, therefore encryption treated not like encryption but just like PDFObject and as result text incorrectly decoded (because both BaseEncoding item and differences array item ignored).
This PR fixes it.
Also has been deleted unnecessary unicode string decode test for file with WinAnsiEncoding text encoding.
Also deleted for $unicode passed by reference parameters in Font class because seems that it have no sense (not sure).
Potentially this PR can fix some of the many other opened issues related to incorrect result text encoding.