Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

revived #257: Properly decode ANSI encodings #349

Merged
merged 37 commits into from
Oct 8, 2020
Merged

Conversation

Connum
Copy link
Contributor

@Connum Connum commented Sep 30, 2020

This is a revival of PR #257, which was closed due to inactivity and its author no longer using pdfparser.

The original PR listed several issues that it supposedly fixed. However, most of them were already fixed due to other code changes or didn't provide any sample PDF files and were subsequently closed.

However, I can now confirm that this definitely fixes issue #202, without breaking any other files or unit tests for me, after I made some additional changes to the original code. I also added a unit test specific to this decoding issue.

It also fixes missing decoded text in a file provided in #19 (depending on PR #345 for another problem).

Dāvis Mosāns and others added 26 commits September 17, 2019 15:43
…to skyfms-fix_encoding

# Conflicts:
#	src/Smalot/PdfParser/Encoding.php
#	src/Smalot/PdfParser/Font.php
#	src/Smalot/PdfParser/PDFObject.php
#	src/Smalot/PdfParser/Parser.php
@Connum
Copy link
Contributor Author

Connum commented Sep 30, 2020

Interesting... The CI build fails due to a test that passes on my environment.

edit: Freshly checked out the branch after updating my git config to force LF line endings... And now I can reproduce the failing test, strangely enough... It's caused by the initial code and I need to look into it.

@Connum
Copy link
Contributor Author

Connum commented Sep 30, 2020

Got it! The code checking for UTF-8 strings was removed via the original PR, and its author forgot to check if it's already an UTF-8 encoded string before converting the resulting string to UTF-8 in ANSI environment (this can seemingly happen when ANSI and UTF-8 strings are mixed, which is exactly what the failing test was catching!)

@Connum
Copy link
Contributor Author

Connum commented Sep 30, 2020

PHPstan issues resolved and ready for review!

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the effort of reviving #257 and fixing it.

I have a few things:

  • I added Undefined Offset in Parser.php causing FollowUp Fatal Error #19 and UTF-8 autodetection not working? #202 as being closed when this PR gets merged. Was that correct?
  • Please move src/Smalot/PdfParser/Encoding/PostScriptGlyphs.json to another place, because src should only contain PHP files. Maybe encoding/ is a good place? And isn't it more consistent to use PHP instead of JSON as file format?
  • When using PHP instead of JSON for PostScriptGlyphs.json the file src/Smalot/PdfParser/Encoding/PostScriptGlyphs.php can be removed, because no JSON decoding is required as well as getCodePoint can be done at the place, when its needed.
  • See my comment about src/Smalot/PdfParser/Font.php::decodeContent

src/Smalot/PdfParser/Font.php Outdated Show resolved Hide resolved
src/Smalot/PdfParser/Font.php Outdated Show resolved Hide resolved
src/Smalot/PdfParser/Font.php Outdated Show resolved Hide resolved
src/Smalot/PdfParser/Header.php Show resolved Hide resolved
src/Smalot/PdfParser/Page.php Outdated Show resolved Hide resolved
tests/Integration/ParserTest.php Outdated Show resolved Hide resolved
@Connum
Copy link
Contributor Author

Connum commented Oct 1, 2020

* [ ]  I added #19 and #202 as being closed when this PR gets merged. Was that correct?

Ths is correct!

* [ ]  Please move `src/Smalot/PdfParser/Encoding/PostScriptGlyphs.json` to another place, because `src` should only contain PHP files. Maybe `encoding/` is a good place? And isn't it more consistent to use PHP instead of JSON as file format?
* [ ]  When using PHP instead of JSON for `PostScriptGlyphs.json` the file `src/Smalot/PdfParser/Encoding/PostScriptGlyphs.php` can be removed, because no JSON decoding is required as well as `getCodePoint` can be done at the place, when its needed.

What about keeping PostScriptGlyphs.php, but instead of having getGlyphs() parse the data from the JSON file, have it return the data directly as an array, like the other Encoding classes do with getTranslations()?
But we should add the doc block with the licensing information to the file in case we decide to keep it!

* [ ]  See my comment about `src/Smalot/PdfParser/Font.php`::decodeContent

See my comment in the conversation above! :)

@k00ni
Copy link
Collaborator

k00ni commented Oct 2, 2020

What about keeping PostScriptGlyphs.php, but instead of having getGlyphs() parse the data from the JSON file, have it return the data directly as an array, like the other Encoding classes do with getTranslations()?

That's even better. Handle the licence information as you see fit.

@Connum Connum requested a review from k00ni October 3, 2020 17:45
@Connum
Copy link
Contributor Author

Connum commented Oct 3, 2020

All points are done from my side, thanks again for your input!

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again for your effort @Connum! Looks good.

I will merge it near end of the week if there are no objections by the community.

@k00ni k00ni added the fix label Oct 5, 2020
@k00ni k00ni self-assigned this Oct 5, 2020
@Connum
Copy link
Contributor Author

Connum commented Oct 6, 2020

Merged in the current master, resolving a conflict after #345 was merged.
Also found out this fixes #250.

@k00ni k00ni linked an issue Oct 6, 2020 that may be closed by this pull request
@k00ni k00ni merged commit 1f4056d into smalot:master Oct 8, 2020
partulaj pushed a commit to partulaj/pdfparser that referenced this pull request Dec 21, 2020
* Get correct default font

* Create header elements with it's respective class

* Properly decode ANSI encodings

* allow for line breaks when splitting xrefs for id and position

* extend TestCase.php with functionality to "catch" E_NOTICE and E_WARNING

* added test case for this fix

* only reset error handler when the current handler is the handler we had set before

* work around for failing CI build with PHP 5.6

* added comment and link to the workaround getting the current error handler

* removed unnecessary ini_set call

* remove error level constant name before error message

* restore error from the error handler itself, to prevent PHPUnit's "THE ERROR HANDLER HAS CHANGED!" message

* reverse the changes made to the TestCase class and the code in the test case depending on it

* simplified test case, now checking if object has been parsed correctly

* code linting

* applied linting

* handle failed font lookup

* look up unfiltered font resource name first, then fall back to filtered resource name

* added unit test for smalot#202 bugfix, code linting

* fallback for decoding single-byte ANSI characters that are not in the lookup table

* added test file and unit test for international unicode characters

* don't double-encode strings already in UTF-8

* code linting

* removed remnants from old decodeContent() function signature

* parseHeaderElement() should not return a PDFObject

* some minor changes as requested by the review

* keep $unicode as deprecated parameter in decodeContent function signature

* forgot to add default value for $unicode to make it optional

* added proper doc blocks to PostScriptGlyphs.php

* return array from PostScriptGlyphs::getGlyphs() directly instead of using and parsing JSON

* changed @deprecated to parameter description

Co-authored-by: Dāvis Mosāns <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants