revived #257: Properly decode ANSI encodings #349

Connum · 2020-09-30T16:16:18Z

This is a revival of PR #257, which was closed due to inactivity and its author no longer using pdfparser.

The original PR listed several issues that it supposedly fixed. However, most of them had either already been fixed by other code changes, or lacked sample PDF files and were subsequently closed.

I can now confirm that, after some additional changes to the original code, this definitely fixes issue #202 without breaking any other files or unit tests for me. I also added a unit test specific to this decoding issue.

It also fixes missing decoded text in a file provided in #19 (that file additionally depends on PR #345 for a separate problem).

Connum · 2020-09-30T17:27:25Z

Interesting... The CI build fails due to a test that passes on my environment.

edit: Freshly checked out the branch after updating my git config to force LF line endings... And now I can reproduce the failing test, strangely enough... It's caused by the initial code and I need to look into it.

Connum · 2020-09-30T18:45:44Z

Got it! The original PR removed the code that checked for UTF-8 strings, and its author forgot to check whether a string is already UTF-8 encoded before converting the result to UTF-8 in an ANSI environment. This can apparently happen when ANSI and UTF-8 strings are mixed, which is exactly what the failing test was catching!
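
For illustration, a guard against this kind of double encoding could look roughly like the sketch below. It is not the code from this PR: the function name `toUtf8Once` is hypothetical, `CP1252` is assumed as the ANSI source encoding, and validity is checked with a PCRE `//u` match, which may differ from what the library actually does.

```php
<?php

/**
 * Hypothetical helper (not part of pdfparser): convert a single-byte
 * ANSI string to UTF-8 unless it is already valid UTF-8.
 */
function toUtf8Once(string $text, string $sourceEncoding = 'CP1252'): string
{
    // A pattern with the /u modifier only matches when the subject is valid UTF-8.
    if (1 === preg_match('//u', $text)) {
        // Already UTF-8 (or plain ASCII): converting again would double-encode.
        return $text;
    }

    $converted = iconv($sourceEncoding, 'UTF-8', $text);

    // iconv() returns false on failure; fall back to the input in that case.
    return false === $converted ? $text : $converted;
}
```

With this guard, `"caf\xE9"` (CP1252) is converted, while `"caf\xC3\xA9"` (already UTF-8) passes through unchanged. Some byte sequences are valid in both encodings, so a validity check like this is a heuristic rather than a guarantee.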

Connum · 2020-09-30T19:00:07Z

PHPStan issues resolved and ready for review!

k00ni

Thank you for the effort of reviving #257 and fixing it.

I have a few things:

* [ ]  I added #19 ("Undefined Offset in Parser.php causing FollowUp Fatal Error") and #202 ("UTF-8 autodetection not working?") as being closed when this PR gets merged. Was that correct?
* [ ]  Please move `src/Smalot/PdfParser/Encoding/PostScriptGlyphs.json` to another place, because `src` should only contain PHP files. Maybe `encoding/` is a good place? And isn't it more consistent to use PHP instead of JSON as file format?
* [ ]  When using PHP instead of JSON for `PostScriptGlyphs.json`, the file `src/Smalot/PdfParser/Encoding/PostScriptGlyphs.php` can be removed, because no JSON decoding is required and `getCodePoint` can be done at the place where it is needed.
* [ ]  See my comment about `src/Smalot/PdfParser/Font.php::decodeContent`

src/Smalot/PdfParser/Font.php

src/Smalot/PdfParser/Header.php

src/Smalot/PdfParser/Page.php

tests/Integration/ParserTest.php

Connum · 2020-10-01T16:41:58Z

* [ ]  I added #19 and #202 as being closed when this PR gets merged. Was that correct?

This is correct!

* [ ]  Please move `src/Smalot/PdfParser/Encoding/PostScriptGlyphs.json` to another place, because `src` should only contain PHP files. Maybe `encoding/` is a good place? And isn't it more consistent to use PHP instead of JSON as file format?
* [ ]  When using PHP instead of JSON for `PostScriptGlyphs.json`, the file `src/Smalot/PdfParser/Encoding/PostScriptGlyphs.php` can be removed, because no JSON decoding is required and `getCodePoint` can be done at the place where it is needed.

What about keeping `PostScriptGlyphs.php`, but instead of having `getGlyphs()` parse the data from the JSON file, have it return the data directly as an array, like the other Encoding classes do with `getTranslations()`?
But if we decide to keep it, we should add the doc block with the licensing information to the file!
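
A minimal sketch of that idea, for illustration only: the class name `GlyphTableSketch` and its three entries are placeholders rather than the actual file from this PR, and the hex values are the ones the Adobe Glyph List assigns to these glyph names.

```php
<?php

/**
 * Hypothetical stand-in for a glyph lookup class: the table lives in PHP
 * source, so no JSON file has to be read or decoded at runtime.
 */
class GlyphTableSketch
{
    /**
     * @return array<string, string> glyph name => hex code point
     */
    public static function getGlyphs(): array
    {
        return [
            'A' => '0041',
            'Aacute' => '00C1',
            'space' => '0020',
        ];
    }

    /**
     * @return int|null code point, or null for an unknown glyph name
     */
    public static function getCodePoint(string $glyph)
    {
        $glyphs = self::getGlyphs();

        return isset($glyphs[$glyph]) ? hexdec($glyphs[$glyph]) : null;
    }
}
```

Returning the array from a method keeps the data in the same shape as a `getTranslations()`-style class, and the licensing doc block can sit at the top of the same PHP file.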

* [ ]  See my comment about `src/Smalot/PdfParser/Font.php::decodeContent`

See my comment in the conversation above! :)

k00ni · 2020-10-02T07:34:14Z

> What about keeping PostScriptGlyphs.php, but instead of having getGlyphs() parse the data from the JSON file, have it return the data directly as an array, like the other Encoding classes do with getTranslations()?

That's even better. Handle the licence information as you see fit.

Connum · 2020-10-03T21:38:38Z

All points are done from my side, thanks again for your input!

k00ni

Thank you again for your effort @Connum! Looks good.

I will merge it near the end of the week if there are no objections from the community.

Connum · 2020-10-06T10:25:58Z

Merged in the current master, resolving a conflict after #345 was merged.
Also found out this fixes #250.

* Get correct default font
* Create header elements with it's respective class
* Properly decode ANSI encodings
* allow for line breaks when splitting xrefs for id and position
* extend TestCase.php with functionality to "catch" E_NOTICE and E_WARNING
* added test case for this fix
* only reset error handler when the current handler is the handler we had set before
* work around for failing CI build with PHP 5.6
* added comment and link to the workaround getting the current error handler
* removed unnecessary ini_set call
* remove error level constant name before error message
* restore error from the error handler itself, to prevent PHPUnit's "THE ERROR HANDLER HAS CHANGED!" message
* reverse the changes made to the TestCase class and the code in the test case depending on it
* simplified test case, now checking if object has been parsed correctly
* code linting
* applied linting
* handle failed font lookup
* look up unfiltered font resource name first, then fall back to filtered resource name
* added unit test for smalot#202 bugfix, code linting
* fallback for decoding single-byte ANSI characters that are not in the lookup table
* added test file and unit test for international unicode characters
* don't double-encode strings already in UTF-8
* code linting
* removed remnants from old decodeContent() function signature
* parseHeaderElement() should not return a PDFObject
* some minor changes as requested by the review
* keep $unicode as deprecated parameter in decodeContent function signature
* forgot to add default value for $unicode to make it optional
* added proper doc blocks to PostScriptGlyphs.php
* return array from PostScriptGlyphs::getGlyphs() directly instead of using and parsing JSON
* changed @deprecated to parameter description

Co-authored-by: Dāvis Mosāns <[email protected]>

Dāvis Mosāns and others added 26 commits September 17, 2019 15:43

Get correct default font

a74f5a5

Create header elements with it's respective class

49ba4ee

Properly decode ANSI encodings

33a1746

allow for line breaks when splitting xrefs for id and position

b5ddbf7

extend TestCase.php with functionality to "catch" E_NOTICE and E_WARNING

c800c65

added test case for this fix

d87b51a

only reset error handler when the current handler is the handler we had set before

b12df3b

work around for failing CI build with PHP 5.6

e1673b2

added comment and link to the workaround getting the current error handler

5403abb

removed unnecessary ini_set call

5b5b480

remove error level constant name before error message

2319f85

restore error from the error handler itself, to prevent PHPUnit's "THE ERROR HANDLER HAS CHANGED!" message

51b8ea3

reverse the changes made to the TestCase class and the code in the test case depending on it

86525f6

simplified test case, now checking if object has been parsed correctly

4e4b3e2

Merge branch 'master' into fix-19

8cb6ed2

code linting

c2cb436

Merge branch 'fix_encoding' of https://github.com/skyfms/pdfparser into skyfms-fix_encoding # Conflicts: # src/Smalot/PdfParser/Encoding.php # src/Smalot/PdfParser/Font.php # src/Smalot/PdfParser/PDFObject.php # src/Smalot/PdfParser/Parser.php

8416c42

applied linting

637eae3

Merge branch 'skyfms-fix_encoding' into 257-revived

ce70a4d

handle failed font lookup

48c472b

look up unfiltered font resource name first, then fall back to filtered resource name

4042e74

Merge branch 'master' into 257-revived

4994a46

added unit test for smalot#202 bugfix, code linting

6c74459

mb_convert_encoding does not support 'Mac', replace with iconv()

cb1299b

fallback for decoding single-byte ANSI characters that are not in the lookup table

4f8e812

added test file and unit test for international unicode characters

d9b1c1c

Connum mentioned this pull request Sep 30, 2020

Undefined Offset in Parser.php causing FollowUp Fatal Error #19

Closed

Connum added 2 commits September 30, 2020 20:42

don't double-encode strings already in UTF-8

9811fac

code linting

7203d42

Connum added 2 commits September 30, 2020 20:55

removed remnants from old decodeContent() function signature

1c0208d

parseHeaderElement() should not return a PDFObject

c620aeb

This was linked to issues Oct 1, 2020

UTF-8 autodetection not working? #202

Closed

Undefined Offset in Parser.php causing FollowUp Fatal Error #19

Closed

k00ni requested changes Oct 1, 2020

some minor changes as requested by the review

7957624

Connum added 4 commits October 3, 2020 17:56

keep $unicode as deprecated parameter in decodeContent function signature

1f41ccc

forgot to add default value for $unicode to make it optional

c8e05fc

added proper doc blocks to PostScriptGlyphs.php

9e41fbb

return array from PostScriptGlyphs::getGlyphs() directly instead of using and parsing JSON

a9c33b4

Connum requested a review from k00ni October 3, 2020 17:45

changed @deprecated to parameter description

09d718d

k00ni approved these changes Oct 5, 2020

k00ni added the fix label Oct 5, 2020

k00ni self-assigned this Oct 5, 2020

Connum mentioned this pull request Oct 6, 2020

issue in page->getText() function #250

Closed

Merge branch 'master' into 257-revived

54e953f

k00ni linked an issue Oct 6, 2020 that may be closed by this pull request

issue in page->getText() function #250

Closed

k00ni merged commit 1f4056d into smalot:master Oct 8, 2020
