Properly decode ANSI encodings #257

davispuh · 2019-09-17T13:07:38Z

This PR fixes several bugs that prevented ANSI encodings from being decoded correctly.
For example, with this PR it is now possible to correctly extract the text of a PDF file that has no ToUnicode but has WinAnsiEncoding with differences for the Windows-1257 code page.

Most likely these issues are fixed: #42 #44 #88 #151 #152 #202 #241
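
To illustrate the case described above, here is a minimal sketch of how a base single-byte encoding combined with a differences list can change what a byte decodes to. It is written in Python for brevity and is not the code from this PR (the library itself is PHP). The helper names `parse_differences` and `decode_bytes`, the tiny glyph table, and the plain-list form of the differences array are all assumptions made for the example.

```python
# Illustrative sketch only: not the PR's implementation.
# Tiny hand-picked subset of glyph names mapped to Unicode characters.
GLYPH_TO_UNICODE = {
    "aogonek": "\u0105",  # ą
    "ccaron": "\u010d",   # č
    "scaron": "\u0161",   # š
    "zcaron": "\u017e",   # ž
}


def parse_differences(differences):
    """Expand a differences-style list into {byte_code: glyph_name}.

    An integer sets the next code; every following name is assigned
    to the current code, which is then incremented by one.
    """
    mapping = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("glyph name appears before any code")
            mapping[code] = item
            code += 1
    return mapping


def decode_bytes(data, differences, base="cp1252"):
    """Decode single-byte text: overrides first, base encoding otherwise."""
    overrides = parse_differences(differences)
    out = []
    for byte in data:
        name = overrides.get(byte)
        if name in GLYPH_TO_UNICODE:
            out.append(GLYPH_TO_UNICODE[name])
        else:
            # Unknown glyph names and unmapped bytes fall back to the base.
            out.append(bytes([byte]).decode(base, errors="replace"))
    return "".join(out)


raw = b"\xf0\xe8a"
print(decode_bytes(raw, []))                                # ðèa
print(decode_bytes(raw, [232, "ccaron", 240, "scaron"]))    # šča
print(raw.decode("cp1257"))                                 # šča
```

With no overrides the bytes are read as Windows-1252; with the two overrides the result matches what Windows-1257 would give for the same bytes.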

bywciu

It works very well for me; the issues are gone.

fbett · 2020-04-09T09:59:16Z

Thx for this PR.

This also fixes a PHP 7.4 error:

chr() expects parameter 1 to be int, string given at smalot/pdfparser/src/Smalot/PdfParser/Font.php:494

smalot · 2020-04-21T06:58:25Z

That's great work, but some unit tests would be really appreciated.

smalot · 2020-04-21T07:32:57Z

I just ran the unit tests on your PR and some are broken.

Can you try on your side?

./vendor/bin/atoum -d src/Smalot/PdfParser/Tests/ -ncc

davispuh · 2020-04-21T16:38:38Z

I actually didn't even check them so most likely they need fixing.

k00ni

Hey @davispuh, I wanted to help you with the merge, but I couldn't push to https://github.com/skyfms/pdfparser/tree/fix_encoding.

Therefore I created https://github.com/smalot/pdfparser/tree/pr/257. It is based on skyfms/fix_encoding; I merged in the master branch and fixed coding style violations.

Remaining TODOs:

Fix failing tests - result is https://travis-ci.org/github/smalot/pdfparser/builds/696039249

Can you have a look at this?

davispuh · 2020-06-08T22:49:48Z

I'm not using this library anymore and I don't really have time to complete this, so you can just make a new PR and fix whatever needs to be fixed.
That test fails because $currentFont seems to be null, but I don't really see how that could happen.

PS. You could have just rebased it rather than merged.

k00ni · 2020-08-05T08:15:08Z

Closed due to no progress. I checked some of the issues mentioned here; there was no indication that this PR helped them.

davispuh · 2020-08-05T13:47:10Z

"there was no indication that this PR helped them."

That's weird, I didn't see anyone actually testing this. It did fix a lot of issues for me. Anyway, I guess it's not my problem anymore.

@deprecated

* Get correct default font
* Create header elements with it's respective class
* Properly decode ANSI encodings
* allow for line breaks when splitting xrefs for id and position
* extend TestCase.php with functionality to "catch" E_NOTICE and E_WARNING
* added test case for this fix
* only reset error handler when the current handler is the handler we had set before
* work around for failing CI build with PHP 5.6
* added comment and link to the workaround getting the current error handler
* removed unnecessary ini_set call
* remove error level constant name before error message
* restore error from the error handler itself, to prevent PHPUnit's "THE ERROR HANDLER HAS CHANGED!" message
* reverse the changes made to the TestCase class and the code in the test case depending on it
* simplified test case, now checking if object has been parsed correctly
* code linting
* applied linting
* handle failed font lookup
* look up unfiltered font resource name first, then fall back to filtered resource name
* added unit test for #202 bugfix, code linting
* fallback for decoding single-byte ANSI characters that are not in the lookup table
* added test file and unit test for international unicode characters
* don't double-encode strings already in UTF-8
* code linting
* removed remnants from old decodeContent() function signature
* parseHeaderElement() should not return a PDFObject
* some minor changes as requested by the review
* keep $unicode as deprecated parameter in decodeContent function signature
* forgot to add default value for $unicode to make it optional
* added proper doc blocks to PostScriptGlyphs.php
* return array from PostScriptGlyphs::getGlyphs() directly instead of using and parsing JSON
* changed @deprecated to parameter description

Co-authored-by: Dāvis Mosāns <[email protected]>

Dāvis Mosāns added 3 commits September 17, 2019 15:43

Get correct default font

a74f5a5

Create header elements with it's respective class

49ba4ee

Properly decode ANSI encodings

33a1746

bywciu approved these changes Feb 6, 2020

smalot added the unit tests / CI label Apr 21, 2020

k00ni added the needs work label May 29, 2020

k00ni self-requested a review May 29, 2020 13:32

k00ni requested changes Jun 8, 2020

k00ni removed the unit tests / CI label Jun 8, 2020

k00ni closed this Aug 5, 2020

Connum mentioned this pull request Sep 30, 2020

revived #257: Properly decode ANSI encodings #349

Merged
