-
Notifications
You must be signed in to change notification settings - Fork 540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Properly decode ANSI encodings #257
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works very well for me, the issues have gone.
Thx for this PR. This also fixes a PHP 7.4 error:
|
That's a great work but some unit tests should be really appreciated |
I just run "unit test" on your PR and some are broken. Can you try on your side ?
|
I actually didn't even check them so most likely they need fixing. |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @davispuh, i wanted to help you with the merge, but i couldn't push to https://github.com/skyfms/pdfparser/tree/fix_encoding.
Therefore i created https://github.com/smalot/pdfparser/tree/pr/257. It is based on skyfms/fix_encoding
and i merged in master
branch and fixed coding style violations.
Remaining TODOs:
- Fix failing tests - result is https://travis-ci.org/github/smalot/pdfparser/builds/696039249
Can you have a look at this?
I'm not using this library anymore and I don't really have time to complete this so you can just make new PR and fix whatever needs to be fixed. PS. You could have just rebased it rather than merged. |
Closed due to no progress. I checked some of the issues mentioned here, there was no indication that this PR helped them. |
That's weird, I didn't saw anyone actually testing this. It did fix a lot of issues for me. Anyway I guess not my problem anymore. |
* Get correct default font * Create header elements with it's respective class * Properly decode ANSI encodings * allow for line breaks when splitting xrefs for id and position * extend TestCase.php with functionality to "catch" E_NOTICE and E_WARNING * added test case for this fix * only reset error handler when the current handler is the handler we had set before * work around for failing CI build with PHP 5.6 * added comment and link to the workaround getting the current error handler * removed unnecessary ini_set call * remove error level constant name before error message * restore error from the error handler itself, to prevent PHPUnit's "THE ERROR HANDLER HAS CHANGED!" message * reverse the changes made to the TestCase class and the code in the test case depending on it * simplified test case, now checking if object has been parsed correctly * code linting * applied linting * handle failed font lookup * look up unfiltered font resource name first, then fall back to filtered resource name * added unit test for #202 bugfix, code linting * fallback for decoding single-byte ANSI characters that are not in the lookup table * added test file and unit test for international unicode characters * don't double-encode strings already in UTF-8 * code linting * removed remnants from old decodeContent() function signature * parseHeaderElement() should not return a PDFObject * some minor changes as requested by the review * keep $unicode as deprecated parameter in decodeContent function signature * forgot to add default value for $unicode to make it optional * added proper doc blocks to PostScriptGlyphs.php * return array from PostScriptGlyphs::getGlyphs() directly instead of using and parsing JSON * changed @deprecated to parameter description Co-authored-by: Dāvis Mosāns <[email protected]>
* Get correct default font * Create header elements with it's respective class * Properly decode ANSI encodings * allow for line breaks when splitting xrefs for id and position * extend TestCase.php with functionality to "catch" E_NOTICE and E_WARNING * added test case for this fix * only reset error handler when the current handler is the handler we had set before * work around for failing CI build with PHP 5.6 * added comment and link to the workaround getting the current error handler * removed unnecessary ini_set call * remove error level constant name before error message * restore error from the error handler itself, to prevent PHPUnit's "THE ERROR HANDLER HAS CHANGED!" message * reverse the changes made to the TestCase class and the code in the test case depending on it * simplified test case, now checking if object has been parsed correctly * code linting * applied linting * handle failed font lookup * look up unfiltered font resource name first, then fall back to filtered resource name * added unit test for smalot#202 bugfix, code linting * fallback for decoding single-byte ANSI characters that are not in the lookup table * added test file and unit test for international unicode characters * don't double-encode strings already in UTF-8 * code linting * removed remnants from old decodeContent() function signature * parseHeaderElement() should not return a PDFObject * some minor changes as requested by the review * keep $unicode as deprecated parameter in decodeContent function signature * forgot to add default value for $unicode to make it optional * added proper doc blocks to PostScriptGlyphs.php * return array from PostScriptGlyphs::getGlyphs() directly instead of using and parsing JSON * changed @deprecated to parameter description Co-authored-by: Dāvis Mosāns <[email protected]>
This PR fixes several bugs which prevented from correctly decoding ANSI encodings.
For example with this PR now can correctly get text of PDF file which doesn't have
ToUnicode
but hasWinAnsiEncoding
with differences for Windows-1257 code pageMost likely these issues are fixed: #42 #44 #88 #151 #152 #202 #241