Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes are breaking something commit 1f4056d #359

Closed
Karmel0x opened this issue Oct 23, 2020 · 11 comments · Fixed by #362
Closed

Changes are breaking something commit 1f4056d #359

Karmel0x opened this issue Oct 23, 2020 · 11 comments · Fixed by #362
Labels

Comments

@Karmel0x
Copy link

Unfortunately I can't provide sample PDF this time.
Output is correct before commit 1f4056d.

Output with latest commit / release:

Notice: Undefined offset: 67 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 117 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 114 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 114 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 105 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 99 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 117 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 108 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 117 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 109 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 32 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 86 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 105 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 116 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 97 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 101 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 32 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 48 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 55 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 46 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 48 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 53 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091

... more than 5000 similary errors here

text:

ż ś ą ą żą ń ą ą ą ą ą ą ą ą ę ś ę ń ą ą ą ń ę ż ą ś ś ż ą ą ż ą ą ą ż ą ę ą ń ż ń ą ą ś ą ą ą ę ż ęż ń ę ę ą ą ą Ś Ś ą ę ż żą ż ż ż ą ś ąś ń ń ń ń ą ę ś ść Ś ść ść ść ść ś ść ę ść ą ść ść ść ść ę ś ż ę ą ą ą

@Karmel0x Karmel0x changed the title Changes are breaking something (again) commit 1f4056d Changes are breaking something commit 1f4056d Oct 23, 2020
@k00ni k00ni added the bug label Oct 23, 2020
@k00ni
Copy link
Collaborator

k00ni commented Oct 23, 2020

It seems that there are glyphs in use which are not part of PostScriptGlyphs.php. Because this functionality is from @Connum, lets see what (s)he has to say.

@Connum
Copy link
Contributor

Connum commented Oct 23, 2020

It would be really great to have a sample, otherwise I'm not sure how to debug this properly. Maybe you can remove any sensitive content and just leave in a few sentences or paragraphs that cause the issue?

@Connum
Copy link
Contributor

Connum commented Oct 23, 2020

Also, could you please try whether the issue happens with this branch as well? https://github.com/skyfms/pdfparser/tree/fix_encoding
Because I didn't implement the functionality originally, but rather adapted it to integrate it into the current state of the library. That way I could see if I introduced the breaking change myself or if the original code was already faulty.

@Karmel0x
Copy link
Author

Karmel0x commented Oct 23, 2020

Thought after editing (I am not the author of this file) problem will disappear but it's still there.
I've deleted some pages, so I can provide PDF now: cv_Zj.pdf

Problem occurs in this branch too.

@k00ni
Copy link
Collaborator

k00ni commented Oct 23, 2020

Can we add the PDF file (https://github.com/smalot/pdfparser/files/5429471/cv_Zj.pdf) to our test environment? Is it free of charge and has no obligations?

@Karmel0x
Copy link
Author

I am not the author but I've removed sensitive informations, so.. probably.

@k00ni
Copy link
Collaborator

k00ni commented Oct 24, 2020

@Connum: #360 is another issue which reports a problem with PostScriptGlyphs, which was added lately. What needs to be done to fix the underlying problem?

@Connum
Copy link
Contributor

Connum commented Oct 24, 2020

@Connum: #360 is another issue which reports a problem with PostScriptGlyphs, which was added lately. What needs to be done to fix the underlying problem?

Sorry for not responding, I'm quite busy at the moment with several work projects in parallel, but I haven't forgotten about the issue and will look at it as soon as I find the time!

@Connum
Copy link
Contributor

Connum commented Oct 26, 2020

Finally got around looking into this. I just created PR #362 to handle this issue.
The PDF content appears to be some kind of legal text or at least a factual report and also seems to be a small enough excerpt, so I don't see why we couldn't use it for a test case (which I did).

Please note the following, which I also added as a comment in the related test case:

Note that the "ł" in przepływu is decoded as a space character. This was already
the case before the PR that caused this issue and is not currently covered by this
test case. However, this issue should be addressed in the future and its fix
can then be incorporated into this test by uncommenting the third assertion below.

@k00ni
Copy link
Collaborator

k00ni commented Oct 27, 2020

@Karmel0x can you test again with the patch @Connum provided? Do you approve the fix?

@Karmel0x
Copy link
Author

Karmel0x commented Oct 29, 2020

If it's reading provided pdf then it's fine and it looks like it is, original one too.

k00ni pushed a commit that referenced this issue Oct 29, 2020
…ookup table (#362)

* hotfix for #359 and #360: fallback for glyphs not in the postscript lookup table

* test comment and assertion: actually just one character being decoded incorrectly

* added @todo keyword to test case comment so we can keep track of this

* moved comment before code as requested in the review

* fix code linting
partulaj pushed a commit to partulaj/pdfparser that referenced this issue Dec 21, 2020
…postscript lookup table (smalot#362)

* hotfix for smalot#359 and smalot#360: fallback for glyphs not in the postscript lookup table

* test comment and assertion: actually just one character being decoded incorrectly

* added @todo keyword to test case comment so we can keep track of this

* moved comment before code as requested in the review

* fix code linting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants