Fix encoding for encoding dictionary without Type item. #500

likemusic · 2021-12-25T03:49:38Z

In my PDF file, a Font's Encoding entry points to a dictionary without a Type item. As a result, the encoding is treated not as an Encoding but as a plain PDFObject, and the text is decoded incorrectly (because both the BaseEncoding item and the Differences array are ignored).

This PR fixes it.
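To illustrate the situation described above, here is a minimal, self-contained sketch. It is not the code from this PR: the dictionary is modelled as a plain PHP array, and `looksLikeEncodingDictionary()` and `expandDifferences()` are hypothetical helper names chosen for this example. It only assumes what the PDF format allows, namely that the Type entry of an encoding dictionary is optional and that a Differences array alternates character codes and glyph names.

```php
<?php
// Hypothetical helpers for illustration; they are NOT part of pdfparser.

// An encoding dictionary may omit /Type, so also accept dictionaries
// that carry the entries only an encoding dictionary would have.
function looksLikeEncodingDictionary(array $dict): bool
{
    if (($dict['Type'] ?? null) === 'Encoding') {
        return true;
    }

    return isset($dict['BaseEncoding']) || isset($dict['Differences']);
}

// Expand a Differences array such as [128, 'Euro', 'bullet', 200, 'Agrave']
// into a map from character code to glyph name.
function expandDifferences(array $differences): array
{
    $map = [];
    $code = 0;
    foreach ($differences as $item) {
        if (is_int($item)) {
            $code = $item;       // a number resets the current character code
        } else {
            $map[$code] = $item; // a name is assigned to the current code
            ++$code;             // following names take consecutive codes
        }
    }

    return $map;
}

// Example data (assumed, not taken from the bank statement mentioned in this PR).
$withoutType = ['BaseEncoding' => 'WinAnsiEncoding', 'Differences' => [128, 'Euro']];
$unrelated = ['Length' => 42];

var_dump(looksLikeEncodingDictionary($withoutType)); // dictionary without Type
var_dump(looksLikeEncodingDictionary($unrelated));   // some other dictionary
print_r(expandDifferences([128, 'Euro', 'bullet', 200, 'Agrave']));
```

If such a dictionary is not recognised as an encoding, neither the base encoding nor the per-code overrides are applied, which matches the symptom reported here.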

I also deleted an unnecessary Unicode string decoding test for a file with WinAnsiEncoding text encoding.

I also removed pass-by-reference for the $unicode parameters in the Font class, because it seems to make no sense (not sure).

Potentially, this PR can fix some of the many other open issues related to incorrectly encoded output text.


k00ni

@likemusic First of all, happy new year and thank you for your pull request!

Please check my questions and remarks below.

You added a lot of code. Please describe inline (e.g. function headers) what each code part does and why you added/changed it.
Add tests to cover your changes.
Explain: "I also removed pass-by-reference for the $unicode parameters in the Font class, because it seems to make no sense (not sure)."

Let me know if you need help.


likemusic · 2022-01-03T16:39:22Z

@k00ni Happy New Year to you, too!

And thank you too for actively maintaining this helpful project!

likemusic · 2022-01-03T16:39:46Z

@k00ni

Please describe inline (e.g. function headers) what each code part does and why you added/changed it.

So I should add comments the way you do with your remarks and questions, and I shouldn't commit any comments to the source code. Did I understand you correctly?

likemusic · 2022-01-03T16:41:02Z

Add tests to cover your changes.

@k00ni Unfortunately I can't commit the PDF file because it's a bank statement :) But I'll try to find a suitable PDF file in one of the open issues related to incorrect output encodings. Along the way I will test those issues and mark which ones this PR fixes.

likemusic · 2022-01-03T16:48:26Z

@k00ni I want to mark encoding-related issues with an encoding label to have a quick filter and maybe to fix all of them in the future.

I want to add an #encoding tag in comments on issues related to encoding and ping you (so you can add the encoding label to the issue).

Is it a good idea? Can I do this?

k00ni · 2022-01-04T09:20:12Z

@k00ni I want to mark encoding-related issues with an encoding label to have a quick filter and maybe to fix all of them in the future.
I want to add an #encoding tag in comments on issues related to encoding and ping you (so you can add the encoding label to the issue).
Is it a good idea? Can I do this?

This project is a personal project of @smalot, I as a collaborator have no rights to enable this for you. @smalot is also very rarely seen around here, so we have to help ourselves. What do you think about this: You post in related issues a short hint and I will add the label for you. I already created "encoding issues", which is more clear. You agree?

So I should add comments the way you do with your remarks and questions, and I shouldn't commit any comments to the source code. Did I understand you correctly?

I think there is a misunderstanding. In my opinion, documentation should be inline, next to the code it is about. Additional documentation can go in, for instance, README.md or DEVELOPER.md. Please add information, links, thoughts etc. next to your new code (via further commits). You can commit as much as you want; in the end I will squash all your commits anyway.

Unfortunately I can't commit the PDF file because it's a bank statement :) But I'll try to find a suitable PDF file in one of the open issues related to incorrect output encodings. Along the way I will test those issues and mark which ones this PR fixes.

That is wonderful and will help us too! You can test your changes with mocks and stubs for instance, you don't need a working PDF file. Can you create a minimal string which represents a "broken" PDF? You then pass it to the Parser and see how it reacts.

Take this line for instance: https://github.com/smalot/pdfparser/pull/500/files#diff-27bc594b24bd5e2779e8d81ee79810d0ffda03f200f0d53be007a44a9d2cb2deR391

You can inject Encoding by preparing a Header instance, which is given via constructor here:

pdfparser/src/Smalot/PdfParser/PDFObject.php

Line 76 in 6da093d

?Header $header = null,

This way you can inject correct and incorrect data and see how it reacts.
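The injection idea can be sketched without any PDF file. The classes below are hypothetical stand-ins written only for this example (`FakeHeader`, `FakeFont`, and `describeEncoding()` are assumed names, not pdfparser classes or methods); they just mirror the pattern of handing a prepared header to the constructor and then checking how the object reacts to correct and incorrect data.

```php
<?php
// Hypothetical stand-ins for illustration only; these are NOT the pdfparser classes.
final class FakeHeader
{
    private array $elements;

    public function __construct(array $elements = [])
    {
        $this->elements = $elements;
    }

    // Returns the stored element, or null when the key is missing.
    public function get(string $name)
    {
        return $this->elements[$name] ?? null;
    }
}

final class FakeFont
{
    private ?FakeHeader $header;

    // The header is optional, like the nullable constructor argument quoted above.
    public function __construct(?FakeHeader $header = null)
    {
        $this->header = $header;
    }

    // Reports which kind of Encoding entry was injected.
    public function describeEncoding(): string
    {
        $encoding = $this->header !== null ? $this->header->get('Encoding') : null;

        if (is_array($encoding)) {
            return 'dictionary';
        }
        if (is_string($encoding)) {
            return 'name:' . $encoding;
        }

        return 'none';
    }
}

// Inject a named encoding, a dictionary without Type, and nothing at all.
$named = new FakeFont(new FakeHeader(['Encoding' => 'WinAnsiEncoding']));
$dict = new FakeFont(new FakeHeader(['Encoding' => ['BaseEncoding' => 'WinAnsiEncoding']]));
$empty = new FakeFont();

echo $named->describeEncoding(), "\n";
echo $dict->describeEncoding(), "\n";
echo $empty->describeEncoding(), "\n";
```

A real test would build the library's own Header and Font objects instead; the point here is only that each case can be exercised in isolation.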

Let me know if you need help.

likemusic · 2022-01-05T03:02:41Z

What do you think about this: You post in related issues a short hint and I will add the label for you. I already created "encoding issues", which is more clear. You agree?

@k00ni It would be great! I meant the same, but my English is not as good as I would like :)

At the moment I have started on, and am stuck investigating, the first issue, which is not related to this PR but is fixed by it.

likemusic · 2022-01-05T04:31:05Z

Please add information, links, thoughts etc. next to your new code (via further commits).

@k00ni I did a small refactoring and added comments for non-obvious logic and method names.

likemusic · 2022-01-05T04:47:16Z

Can you create a minimal string which represents a "broken" PDF? You then pass it to the Parser and see how it reacts.

@k00ni I don't like this idea because it allows creating combinations that are not compatible with the real PDF file format. Therefore a real PDF file would be preferable, because it can be checked visually in Adobe Acrobat Reader.

First I will try to find a suitable file in the open encoding-related issues. If I have no luck, I'll try to manually "rebuild" my bank statement's PDF into a small suitable file with one string. If that turns out to be too difficult, I'll make a synthetic test as you suggested.

k00ni · 2022-01-05T08:33:37Z

First I will try to find a suitable file in the open encoding-related issues. If I have no luck, I'll try to manually "rebuild" my bank statement's PDF into a small suitable file with one string. If that turns out to be too difficult, I'll make a synthetic test as you suggested.

Just some background information: There are different types of tests, like unit and integration tests. Each one includes another/wider range of system parts. For instance, synthetic tests can be implemented as unit or integration tests and allow very narrow views/checks. PDF files are usually complex and include many parts of the system under test, which might lead to a "polluted" result. Your plan is sound, looking forward to your next commits.


Connum · 2022-01-10T16:44:46Z

If anyone finds the time, the test file samples/InternationalChars.pdf I created a while ago would be a starting point for additional testing. There are tests for 6 of the languages in

pdfparser/tests/Integration/ParserTest.php

Line 92 in 1b3b6eb

public function testUnicodeDecoding()

but Hindi, Simplified Chinese, Japanese, Korean, Arabic and Hebrew didn't work back then, so I left them out of the test. Though I'm not sure this fix will work for those languages as well, as they use UTF-16 and, as far as I can tell, this fix specifically handles UTF-8 strings?

Reqrefusion · 2022-01-16T16:08:28Z

Actually, I have been following this PR from the beginning. I didn't want to write at first. However, I am writing as a courtesy since I was invited, thanks to @k00ni. The reason I didn't want to write was that I could not bring a new thought to the subject. I disagree with @j0k3r about ":void noise". It may not be correct to add this as an addition here, but it doesn't make sense to open another PR for it. If it is already done here, there is no need to do it again. After all, it's not such a big change that it needs to be separated. It's fine, except that it makes it harder to skim over this change later on. If this is seen as a problem, the necessary actions can be taken.

Like my dear friend @Connum, I do not have much idea about the corrections made here. Looking at the PRs in the past, I can't find any change in this direction. That's why I find this work of @likemusic very valuable. I hope he continues these efforts. I agree with @Connum on tests. In this regard, some tests that were added before may now pass. A few tests could also be added. There are many tests I want to add, but unfortunately I can't find the time.

Finally, I can say that @likemusic increases the readability considerably in some encodings. I see no obstacle to the acceptance of this PR in its current form. I hope @likemusic makes more corrections like this with valuable expertise.

k00ni · 2022-01-17T08:24:44Z

A bit off topic: I hope I didn't give the impression that anyone needs my permission to contribute. We welcome everyone to contribute, comment and ask on issues and pull requests, invitations not needed here :)

k00ni · 2022-01-20T06:53:45Z

@likemusic how are you doing? Can you tell us your plans for this PR?

@j0k3r @Connum and @Reqrefusion thank you for your feedback.

I can live with the : void noise too. I made another commit suggestion, performance tests keep failing and there was a question from @Connum. Can you get back to us in these regards @likemusic?

It would be great if we can finish this PR so I have more time starting documentation changes (#498).

likemusic · 2022-01-20T16:25:07Z

@k00ni I'll get back to this PR on the weekend or during next week.


likemusic · 2022-01-28T02:00:46Z

I've updated the test pdf file. Now it opens without errors both in Sumatra PDF Viewer and Adobe Acrobat Reader.

Sumatra Pdf Viewer:

Adobe Acrobat Reader

likemusic · 2022-01-28T02:02:23Z

Here are the extracted text differences for the test file before and after this PR:

likemusic · 2022-01-28T02:14:37Z

but Hindi, Simplified Chinese, Japanese, Korean, Arabic and Hebrew didn't work back then, so I left them out of the test. Though I'm not sure this fix will work for those languages as well, as they use UTF-16 and, as far as I can tell, this fix specifically handles UTF-8 strings?

@Connum This PR doesn't relate to these issues. As far as I understand, the issues in InternationalChars.pdf are related to the handling of Identity-H/Identity-V encodings.

Thank you for this useful notice! It will help to test Identity-H/Identity-V fixes.

likemusic · 2022-01-28T02:43:21Z

To avoid mixing the changes in this PR with test return types, todos, and method descriptions, I've made a separate PR #509 with the cherry-picked commit 1ee8577.

If it is merged into the master branch first, we could squash the commits in this PR so that it has no unrelated void noise in return types.

likemusic · 2022-01-28T02:50:20Z

performance tests keep failing

@k00ni I don't really understand why this could happen. I've run the tests on my laptop under PHP v7.4 without errors.

If this happens again, I will try to reproduce it, profile it with Xdebug, and fix it.

j0k3r · 2022-01-28T06:13:41Z

There is one remaining failing test, could you fix it locally?

php tests/Performance/runPerformanceTests.php

k00ni · 2022-01-28T08:10:26Z

@likemusic you made the right call to do these off-topic changes in a separate PR. I approved and merged it right away. Please merge in the master branch.

Once the performance tests are fixed, this PR is good to go.

j0k3r · 2022-01-31T13:19:45Z

I updated the code to handle $text when the encoding isn't instanceof something. It was the previous behavior so I think we must keep it:

pdfparser/src/Smalot/PdfParser/Font.php

Lines 491 to 496 in 43ca68f

           } elseif (!mb_check_encoding($text, 'UTF-8')) {
               // don't double-encode strings already in UTF-8
               $text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
           }

           return $text;

This then fixes the tests/Performance/runPerformanceTests.php script.
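The behaviour of that fallback branch can be tried in isolation. The following is a small sketch, assuming the mbstring extension is available; `toUtf8Fallback()` is a hypothetical wrapper name used only here, and the sample strings are made up for the example.

```php
<?php
// Hypothetical wrapper around the fallback quoted above; not a pdfparser method.
function toUtf8Fallback(string $text): string
{
    // Only convert when the bytes are not already valid UTF-8,
    // so strings that are already UTF-8 are not double-encoded.
    if (!mb_check_encoding($text, 'UTF-8')) {
        $text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
    }

    return $text;
}

// "café" with the last letter as the single Windows-1252 byte 0xE9.
$legacy = "caf\xE9";
// "café" already encoded as UTF-8 (0xC3 0xA9).
$already = "caf\xC3\xA9";

echo bin2hex(toUtf8Fallback($legacy)), "\n";  // converted to UTF-8 bytes
echo bin2hex(toUtf8Fallback($already)), "\n"; // passed through unchanged
```

Note that this is a heuristic: a Windows-1252 byte sequence that happens to also be valid UTF-8 is passed through unchanged, which may be part of why it can be seen as masking an underlying logic error.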

Reqrefusion · 2022-01-31T17:10:14Z

I think there has been a problem with old coding related to performance testing, I hope something will be done about this in the future.

k00ni

Good addition @j0k3r

I merged in master branch which leads to a way smaller PR.

PR is accepted. Is there anything left to say or do @likemusic?

likemusic · 2022-02-03T12:59:21Z

@k00ni I don't like @j0k3r's fix because it seems to suppress a logic error. But in any case, you can merge this PR, and I will make a new one after researching why a Header instance is passed during the performance test.

k00ni · 2022-02-03T13:31:24Z

@likemusic sounds good.

Thank you for being patient and for your work here.

likemusic added 7 commits December 25, 2021 00:22

Font::decodeContent() fixes to support Encoding dictionates without Type header.

3b61b9a

Pass $unicode by reference in Font::decodeContentByEncoding().

1a77588

Add encoding initialization.

cd7d05f

Font's initialized encoding in private property.

47cf15f

Add "ext-iconv" to "require" in composer.json

9758d6e

Delete unnecessary unicode string test for smalot#95 test.

87ca438

Fix misprint.

6da093d

k00ni added enhancement fix labels Dec 28, 2021

k00ni requested changes Jan 3, 2022


k00ni added needs more info needs work labels Jan 3, 2022

k00ni added de-/encoding issue and removed needs more info labels Jan 4, 2022

Add comments + small refactoring.

c5dcd06

Run dev-tools/vendor/bin/php-cs-fixer fix

21c4797

Fixed all files in 0.018 seconds, 12.000 MB memory used