Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix encoding for encoding dictionary without Type item. #500

Merged
merged 23 commits into from
Feb 3, 2022

Conversation

likemusic
Copy link
Contributor

@likemusic likemusic commented Dec 25, 2021

image

In PDF-file in Font internal encryption point to dictionary without Type item, therefore encryption treated not like encryption but just like PDFObject and as result text incorrectly decoded (because both BaseEncoding item and differences array item ignored).

This PR fixes it.

Also has been deleted unnecessary unicode string decode test for file with WinAnsiEncoding text encoding.

Also deleted for $unicode passed by reference parameters in Font class because seems that it have no sense (not sure).

Potentially this PR can fix some of the many other opened issues related to incorrect result text encoding.

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@likemusic First of all, happy new year and thank you for your pull request!

Please check my questions and remarks below.

  • You added a lot of code. Please describe inline (e.g. function headers) what each code part does and why you added/changed it.
  • Add tests to cover your changes.
  • Explain Also deleted for $unicode passed by reference parameters in Font class because seems that it have no sense (not sure).

Let me know if you need help.

@likemusic
Copy link
Contributor Author

@k00ni Happy New Year to you, too!

And thank you too for active maintenance this helpfull project!

@likemusic
Copy link
Contributor Author

@k00ni

Please describe inline (e.g. function headers) what each code part does and why you added/changed it.

I must add comment like you do with you reamarks and question and I shouldn't commit any comment to source code. Did I understand you correctly?

@likemusic
Copy link
Contributor Author

Add tests to cover your changes.

@k00ni Unfotrunaly I can't commit pdf-file bacause it's a bank statement :) But I'll try to find suitable pdf-file in some of opened issues related to incorect output encodings. By the way I will test those issues and mark wich one this PR will fix.

@likemusic
Copy link
Contributor Author

@k00ni I want to mark encoding related issued by encoding label to have quick filter and mayby to fix all of them in future.

I want to add #encoding tag in comments for issues related to encoding and point you (for adding label encoding to issue).

Is it good idea? Can I do this?

@k00ni
Copy link
Collaborator

k00ni commented Jan 4, 2022

@k00ni I want to mark encoding related issued by encoding label to have quick filter and mayby to fix all of them in future.
I want to add #encoding tag in comments for issues related to encoding and point you (for adding label encoding to issue).
Is it good idea? Can I do this?

This project is a personal project of @smalot, I as a collaborator have no rights to enable this for you. @smalot is also very rarely seen around here, so we have to help ourselves. What do you think about this: You post in related issues a short hint and I will add the label for you. I already created "encoding issues", which is more clear. You agree?

I must add comment like you do with you reamarks and question and I shouldn't commit any comment to source code. Did I understand you correctly?

I think there is a misunderstanding. In my opinion documentation should be inline next to the code it is about. Additional documentation can be used in, for instance, README.md or DEVERLOPER.md. Please add information, links, thoughts etc. next to your new code (via further commits). You can commit as much as you want, in the end I will squash all your commits anyway.

Unfotrunaly I can't commit pdf-file bacause it's a bank statement :) But I'll try to find suitable pdf-file in some of opened issues related to incorect output encodings. By the way I will test those issues and mark wich one this PR will fix.

That is wonderful and will help us too! You can test your changes with mocks and stubs for instance, you don't need a working PDF file. Can you create a minimal string which represents a "broken" PDF? You then pass it to the Parser and see how it reacts.

Take this line for instance: https://github.com/smalot/pdfparser/pull/500/files#diff-27bc594b24bd5e2779e8d81ee79810d0ffda03f200f0d53be007a44a9d2cb2deR391

You can inject Encoding by preparing a Header instance, which is given via constructor here:

?Header $header = null,

This way you can inject correct and incorrect data and see how it reacts.

Let me know if you need help.

@likemusic
Copy link
Contributor Author

What do you think about this: You post in related issues a short hint and I will add the label for you. I already created "encoding issues", which is more clear. You agree?

@k00ni It would be great! I meant the same, but my English is not as good as I would like :)

At the moment I sterted and hangs on the investigeting for first issue, not related to but fixed in this PR.

@likemusic
Copy link
Contributor Author

Please add information, links, thoughts etc. next to your new code (via further commits).

@k00ni I made small refactoring and add coments for not unobvious logic and methods names.

@likemusic
Copy link
Contributor Author

Can you create a minimal string which represents a "broken" PDF? You then pass it to the Parser and see how it reacts.

@k00ni I don't like this idea because It allows create combinations not compartible with real PDF-file format. Therefore real pdf file would be preferable, bacause it can be tested visually by Adobe Acrobat Reader.

Frst I will try to find suitable file in opened encoding-related issues. If I have no luck I'll try to "rebuld" manually my bank statement's pdf to small suitable file with one string. If it would be too difficult I'll make synthetic test as you suggested.

@k00ni
Copy link
Collaborator

k00ni commented Jan 5, 2022

Frst I will try to find suitable file in opened encoding-related issues. If I have no luck I'll try to "rebuld" manually my bank statement's pdf to small suitable file with one string. If it would be too difficult I'll make synthetic test as you suggested.

Just some background information: There are different types of tests, like unit and integration tests. Each one includes another/wider range of system parts. For instance, synthetic tests can be implemented as unit or integration tests and allow very narrow views/checks. PDF files are usually complex and include many parts of the system under test, which might lead to a "polluted" result. Your plan is sound, looking forward to your next commits.

Fixed all files in 0.018 seconds, 12.000 MB memory used
@Connum
Copy link
Contributor

Connum commented Jan 10, 2022

If anyone finds the time, the test file samples/InternationalChars.pdf I created a while ago would be a starting point for additional testing. There are tests for 6 of the languages in

public function testUnicodeDecoding()
but Hindi, Simplified Chinese, Japanese, Korean, Arabic and Hebrew didn't work back then so I left them out of the test. Though I'm not sure this fix will work for those language as well, as they use UTF-16 and as far as I can tell, this fix specifically handles UTF-8 strings?

@Reqrefusion
Copy link

Actually writing I have been following this PR from the beginning. I didn't want to write from the beginning. However, I am writing as a courtesy since I was invited, thanks for @k00ni. The reason I wanted to write was that I could not bring a new thought to the subject. I disagree with @j0k3r about ":void noise". It may not be correct to add this as an addition here, but it doesn't make sense to do other PR. If it is already done here, there is no need to do it again. After all, it's not such a big change that it needs to be separated. It's fine, except that it makes it harder to skim over this change later on. If this is seen as a problem, necessary actions can be taken.

Like my dear friend @Connum, I do not have much idea about the corrections made here. Looking at the PRs in the past, I can't find any change in this direction. That's why I find this work of @likemusic very valuable. I hope he continues these efforts. I agree with @Connum on tests. In this regard, some tests that have been added before can be passed. It can also be added in a few tests. There are many tests I want to add, but unfortunately I can't find the time.

Finally, I can say that @likemusic increases the readability considerably in some encodings. I see no obstacle to the acceptance of this PR in its current form. I hope @likemusic makes more corrections like this with valuable expertise.

@k00ni
Copy link
Collaborator

k00ni commented Jan 17, 2022

A bit off topic: I hope I didn't make the impression, that anyone needs my permission to contribute. We welcome everyone to contribute, comment and ask on issues and pull requests, invitations not needed here :)

@k00ni
Copy link
Collaborator

k00ni commented Jan 20, 2022

@likemusic how are you doing? Can you tell us your plans for this PR?

@j0k3r @Connum and @Reqrefusion thank you for your feedback.

I can live with the : void noise too. I made another commit suggestion, performance tests keep failing and there was a question from @Connum. Can you get back to us in these regards @likemusic?

It would be great if we can finish this PR so I have more time starting documentation changes (#498).

@k00ni k00ni mentioned this pull request Jan 20, 2022
@likemusic
Copy link
Contributor Author

@k00ni I'll get back to this PR on the Weekend or during next week.

@likemusic
Copy link
Contributor Author

I've updated the test pdf file. Now it opens without errors both in Sumatra PDF Viewer and Adobe Acrobat Reader.

Sumatra Pdf Viewer:
SumatraPdfViewer

Adobe Acrobat Reader
AdobeAcrobatReader

@likemusic
Copy link
Contributor Author

Here is extracted text differences for the test file before and after this PR:
image

@likemusic
Copy link
Contributor Author

but Hindi, Simplified Chinese, Japanese, Korean, Arabic and Hebrew didn't work back then so I left them out of the test. Though I'm not sure this fix will work for those language as well, as they use UTF-16 and as far as I can tell, this fix specifically handles UTF-8 strings?

@Connum This PR doesn't relate to these issues. As far as I understand issues in InternationalChars.pdf are related to Identity-h/Identity-v encodings handling.

Thank you for this useful notice! It will help to test Identity-h/Identity-v fixes.

@likemusic
Copy link
Contributor Author

To not mix in this PR changes with tests return types, todos, and methods descriptions I've made separate PR #509 with cherry-picked 1ee8577

If it would be merged to master-branch firstly, we could squash commits in this PR to not have not related void-noise in return types.

@likemusic
Copy link
Contributor Author

performance tests keep failing

@k00ni I don't really understand why this could happen. I've run tests on my laptop under PHP v7.4 without errors.

If this happens again I will try to reproduce it, profile by xDebug, and fix it.

@j0k3r
Copy link
Collaborator

j0k3r commented Jan 28, 2022

There is one remaining failing test, could you fix it locally?

php tests/Performance/runPerformanceTests.php

@k00ni
Copy link
Collaborator

k00ni commented Jan 28, 2022

@likemusic you made the right call to do these off-topic changes in a separate PR. I approved and merged it right away. Please merge in master branch.

After performance tests were fixed, this PR is good to go.

@j0k3r
Copy link
Collaborator

j0k3r commented Jan 31, 2022

I updated the code to handle $text when the encoding isn't instanceof something. It was the previous behavior so I think we must keep it:

} elseif (!mb_check_encoding($text, 'UTF-8')) {
// don't double-encode strings already in UTF-8
$text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}
return $text;

This then fix the tests/Performance/runPerformanceTests.php script.

@j0k3r j0k3r requested a review from k00ni January 31, 2022 13:19
@Reqrefusion
Copy link

I think there has been a problem with old coding related to performance testing, I hope something will be done about this in the future.

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good addition @j0k3r

I merged in master branch which leads to a way smaller PR.

PR is accepted. Is there anything left to say or do @likemusic?

@k00ni k00ni self-assigned this Feb 1, 2022
@likemusic
Copy link
Contributor Author

likemusic commented Feb 3, 2022

@k00ni I don't like @j0k3r 's fix because it seems like suppressing logic error. But in any way, you could merge this PR, and I will make a new one after researching why Header instance is passed during the performance test.

@k00ni k00ni merged commit 4551cd0 into smalot:master Feb 3, 2022
@k00ni
Copy link
Collaborator

k00ni commented Feb 3, 2022

@likemusic sounds good.

Thank you for being patient and your work here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants