Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly decode ANSI encodings #257

Closed
wants to merge 3 commits into from
Closed

Conversation

davispuh
Copy link

This PR fixes several bugs which prevented from correctly decoding ANSI encodings.
For example with this PR now can correctly get text of PDF file which doesn't have ToUnicode but has WinAnsiEncoding with differences for Windows-1257 code page

Most likely these issues are fixed: #42 #44 #88 #151 #152 #202 #241

Copy link

@bywciu bywciu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works very well for me, the issues have gone.

@fbett
Copy link

fbett commented Apr 9, 2020

Thx for this PR.

This also fixes a PHP 7.4 error:

chr() expects parameter 1 to be int, string given at smalot/pdfparser/src/Smalot/PdfParser/Font.php:494

@smalot
Copy link
Owner

smalot commented Apr 21, 2020

That's a great work but some unit tests should be really appreciated

@smalot
Copy link
Owner

smalot commented Apr 21, 2020

I just run "unit test" on your PR and some are broken.

Can you try on your side ?

./vendor/bin/atoum -d src/Smalot/PdfParser/Tests/ -ncc

@davispuh
Copy link
Author

I actually didn't even check them so most likely they need fixing.

@k00ni

This comment has been minimized.

@k00ni k00ni self-requested a review May 29, 2020 13:32
Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @davispuh, i wanted to help you with the merge, but i couldn't push to https://github.com/skyfms/pdfparser/tree/fix_encoding.

Therefore i created https://github.com/smalot/pdfparser/tree/pr/257. It is based on skyfms/fix_encoding and i merged in master branch and fixed coding style violations.

Remaining TODOs:

Can you have a look at this?

@davispuh
Copy link
Author

davispuh commented Jun 8, 2020

I'm not using this library anymore and I don't really have time to complete this so you can just make new PR and fix whatever needs to be fixed.
That test fails because $currentFont seems to be null but I don't really see how that could happen.

PS. You could have just rebased it rather than merged.

@k00ni
Copy link
Collaborator

k00ni commented Aug 5, 2020

Closed due to no progress. I checked some of the issues mentioned here, there was no indication that this PR helped them.

@k00ni k00ni closed this Aug 5, 2020
@davispuh
Copy link
Author

davispuh commented Aug 5, 2020

there was no indication that this PR helped them.

That's weird, I didn't saw anyone actually testing this. It did fix a lot of issues for me. Anyway I guess not my problem anymore.

k00ni pushed a commit that referenced this pull request Oct 8, 2020
* Get correct default font

* Create header elements with it's respective class

* Properly decode ANSI encodings

* allow for line breaks when splitting xrefs for id and position

* extend TestCase.php with functionality to "catch" E_NOTICE and E_WARNING

* added test case for this fix

* only reset error handler when the current handler is the handler we had set before

* work around for failing CI build with PHP 5.6

* added comment and link to the workaround getting the current error handler

* removed unnecessary ini_set call

* remove error level constant name before error message

* restore error from the error handler itself, to prevent PHPUnit's "THE ERROR HANDLER HAS CHANGED!" message

* reverse the changes made to the TestCase class and the code in the test case depending on it

* simplified test case, now checking if object has been parsed correctly

* code linting

* applied linting

* handle failed font lookup

* look up unfiltered font resource name first, then fall back to filtered resource name

* added unit test for #202 bugfix, code linting

* fallback for decoding single-byte ANSI characters that are not in the lookup table

* added test file and unit test for international unicode characters

* don't double-encode strings already in UTF-8

* code linting

* removed remnants from old decodeContent() function signature

* parseHeaderElement() should not return a PDFObject

* some minor changes as requested by the review

* keep $unicode as deprecated parameter in decodeContent function signature

* forgot to add default value for $unicode to make it optional

* added proper doc blocks to PostScriptGlyphs.php

* return array from PostScriptGlyphs::getGlyphs() directly instead of using and parsing JSON

* changed @deprecated to parameter description

Co-authored-by: Dāvis Mosāns <[email protected]>
partulaj pushed a commit to partulaj/pdfparser that referenced this pull request Dec 21, 2020
* Get correct default font

* Create header elements with it's respective class

* Properly decode ANSI encodings

* allow for line breaks when splitting xrefs for id and position

* extend TestCase.php with functionality to "catch" E_NOTICE and E_WARNING

* added test case for this fix

* only reset error handler when the current handler is the handler we had set before

* work around for failing CI build with PHP 5.6

* added comment and link to the workaround getting the current error handler

* removed unnecessary ini_set call

* remove error level constant name before error message

* restore error from the error handler itself, to prevent PHPUnit's "THE ERROR HANDLER HAS CHANGED!" message

* reverse the changes made to the TestCase class and the code in the test case depending on it

* simplified test case, now checking if object has been parsed correctly

* code linting

* applied linting

* handle failed font lookup

* look up unfiltered font resource name first, then fall back to filtered resource name

* added unit test for smalot#202 bugfix, code linting

* fallback for decoding single-byte ANSI characters that are not in the lookup table

* added test file and unit test for international unicode characters

* don't double-encode strings already in UTF-8

* code linting

* removed remnants from old decodeContent() function signature

* parseHeaderElement() should not return a PDFObject

* some minor changes as requested by the review

* keep $unicode as deprecated parameter in decodeContent function signature

* forgot to add default value for $unicode to make it optional

* added proper doc blocks to PostScriptGlyphs.php

* return array from PostScriptGlyphs::getGlyphs() directly instead of using and parsing JSON

* changed @deprecated to parameter description

Co-authored-by: Dāvis Mosāns <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Several Issues with greek characters.
5 participants