-
Notifications
You must be signed in to change notification settings - Fork 540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
revived #257: Properly decode ANSI encodings #349
Conversation
…E ERROR HANDLER HAS CHANGED!" message
…st case depending on it
…to skyfms-fix_encoding # Conflicts: # src/Smalot/PdfParser/Encoding.php # src/Smalot/PdfParser/Font.php # src/Smalot/PdfParser/PDFObject.php # src/Smalot/PdfParser/Parser.php
Interesting... The CI build fails due to a test that passes on my environment. edit: Freshly checked out the branch after updating my git config to force LF line endings... And now I can reproduce the failing test, strangely enough... It's caused by the initial code and I need to look into it. |
Got it! The code checking for UTF-8 strings was removed via the original PR, and its author forgot to check if it's already an UTF-8 encoded string before converting the resulting string to UTF-8 in ANSI environment (this can seemingly happen when ANSI and UTF-8 strings are mixed, which is exactly what the failing test was catching!) |
PHPstan issues resolved and ready for review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the effort of reviving #257 and fixing it.
I have a few things:
- I added Undefined Offset in Parser.php causing FollowUp Fatal Error #19 and UTF-8 autodetection not working? #202 as being closed when this PR gets merged. Was that correct?
- Please move
src/Smalot/PdfParser/Encoding/PostScriptGlyphs.json
to another place, becausesrc
should only contain PHP files. Maybeencoding/
is a good place? And isn't it more consistent to use PHP instead of JSON as file format? - When using PHP instead of JSON for
PostScriptGlyphs.json
the filesrc/Smalot/PdfParser/Encoding/PostScriptGlyphs.php
can be removed, because no JSON decoding is required as well asgetCodePoint
can be done at the place, when its needed. - See my comment about
src/Smalot/PdfParser/Font.php
::decodeContent
Ths is correct!
What about keeping PostScriptGlyphs.php, but instead of having
See my comment in the conversation above! :) |
That's even better. Handle the licence information as you see fit. |
…sing and parsing JSON
All points are done from my side, thanks again for your input! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you again for your effort @Connum! Looks good.
I will merge it near end of the week if there are no objections by the community.
* Get correct default font * Create header elements with it's respective class * Properly decode ANSI encodings * allow for line breaks when splitting xrefs for id and position * extend TestCase.php with functionality to "catch" E_NOTICE and E_WARNING * added test case for this fix * only reset error handler when the current handler is the handler we had set before * work around for failing CI build with PHP 5.6 * added comment and link to the workaround getting the current error handler * removed unnecessary ini_set call * remove error level constant name before error message * restore error from the error handler itself, to prevent PHPUnit's "THE ERROR HANDLER HAS CHANGED!" message * reverse the changes made to the TestCase class and the code in the test case depending on it * simplified test case, now checking if object has been parsed correctly * code linting * applied linting * handle failed font lookup * look up unfiltered font resource name first, then fall back to filtered resource name * added unit test for smalot#202 bugfix, code linting * fallback for decoding single-byte ANSI characters that are not in the lookup table * added test file and unit test for international unicode characters * don't double-encode strings already in UTF-8 * code linting * removed remnants from old decodeContent() function signature * parseHeaderElement() should not return a PDFObject * some minor changes as requested by the review * keep $unicode as deprecated parameter in decodeContent function signature * forgot to add default value for $unicode to make it optional * added proper doc blocks to PostScriptGlyphs.php * return array from PostScriptGlyphs::getGlyphs() directly instead of using and parsing JSON * changed @deprecated to parameter description Co-authored-by: Dāvis Mosāns <[email protected]>
This is a revival of PR #257, which was closed due to inactivity and its author no longer using pdfparser.
The original PR listed several issues that it supposedly fixed. However, most of them were already fixed due to other code changes or didn't provide any sample PDF files and were subsequently closed.
However, I can now confirm that this definitely fixes issue #202, without breaking any other files or unit tests for me, after I made some additional changes to the original code. I also added a unit test specific to this decoding issue.
It also fixes missing decoded text in a file provided in #19 (depending on PR #345 for another problem).