[Bug]: pdfminer.pdfexceptions.PDFTypeError: invalid length: 6 #1361
Comments
Probably corrupt font, but will need test file. |
Test file: in.pdf |
If I rewrite this file using GhostScript (with the below command) and then use ocrmypdf, the issue disappears.
But still, the quality of the OCR is very poor. OCRmyPDF barely changes any of the original (incorrect) text when using the
I see this warning/advice in the terminal:
1 some text on this page cannot be mapped to characters: consider using --force-ocr instead
Now, if I use
Or is the issue (of
The issue was with pdfminer interpreting the Unicode mapping data. If Ghostscript rewrote it, it could have worked around the issue. Even a one byte adjustment could have been a workaround.
Regarding this
That means the mapping to Unicode is incomplete - this can cause characters to appear correctly when selected, but they will copy-paste as gibberish, and also the behavior will vary based on the PDF viewer since some try heuristics to detect the text encoding. That's why it's best to throw out everything and force OCR for this file. |
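To make the point above concrete, here is a toy model of an incomplete Unicode mapping. It is not pdfminer's or OCRmyPDF's actual code; the glyph codes, the dictionaries, and the helper names `extract_text` and `unmapped_fraction` are all made up for illustration. It only shows why glyphs that render fine can still extract as gibberish.

```python
# Toy model of a font's glyph-code-to-Unicode map (hypothetical data,
# not how pdfminer or OCRmyPDF represent fonts internally).
REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character


def extract_text(glyph_codes, to_unicode):
    """Map each glyph code to text, substituting U+FFFD where no mapping exists."""
    return "".join(to_unicode.get(code, REPLACEMENT) for code in glyph_codes)


def unmapped_fraction(glyph_codes, to_unicode):
    """Return the share of glyph codes on the page that have no Unicode mapping."""
    if not glyph_codes:
        return 0.0
    missing = sum(1 for code in glyph_codes if code not in to_unicode)
    return missing / len(glyph_codes)


# Hypothetical glyph codes for a page that visually renders the word "Hello".
page_glyphs = [1, 2, 3, 3, 4]
complete_map = {1: "H", 2: "e", 3: "l", 4: "o"}
incomplete_map = {1: "H", 4: "o"}  # entries for "e" and "l" are missing

print(extract_text(page_glyphs, complete_map))
print(extract_text(page_glyphs, incomplete_map))
print(unmapped_fraction(page_glyphs, incomplete_map))
```

The page looks identical in both cases because rendering uses the glyph outlines, not the map. Only text extraction (selection, copy-paste, search) depends on the mapping, which is why discarding the existing text layer and running OCR sidesteps the problem.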
Is it not possible for OCRmyPDF to correct the mapping of the characters based on the characters detected by OCR? To clarify, my question is not whether OCRmyPDF is currently able to correct the mapping (which I assume it can't). My question is whether OCRmyPDF can be modified to be able to correct the mapping. The main reasons for which I don't want to use
If there is a way around, I would really like to avoid using |
Possible but hard. That's pretty major surgery and the results from doing something like force-ocr are often better. Ghostscript recently added a mode that attempts to fix broken font mappings (whether the font is OCR-derived or some other origin). |
You can avoid lossy recompression using |
I would really appreciate it if such a feature were eventually added to OCRmyPDF (since you said it's hard, I don't expect it anytime soon).
Can you please tell how to activate that mode?
Isn't --optimize 1 the default? |
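For reference, a minimal sketch of an invocation that combines the two flags named in this thread, `--force-ocr` and `--optimize 1`. The helper name `build_ocrmypdf_command` and the output filename `out.pdf` are assumptions for illustration. The sketch only builds and prints the command line; it does not run OCRmyPDF.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, force_ocr=True, optimize=1):
    """Build an ocrmypdf argument list (hypothetical helper, not part of OCRmyPDF)."""
    cmd = ["ocrmypdf"]
    if force_ocr:
        # Rasterize every page and redo OCR, discarding the existing text layer.
        cmd.append("--force-ocr")
    # Pass the optimization level explicitly rather than relying on the default.
    cmd += ["--optimize", str(optimize)]
    cmd += [input_pdf, output_pdf]
    return cmd


# "in.pdf" is the test file from this thread; "out.pdf" is an assumed output name.
cmd = build_ocrmypdf_command("in.pdf", "out.pdf")
print(shlex.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where OCRmyPDF is installed.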
Describe the bug
OCR failed to complete.
Steps to reproduce
Files
Let me know if you need the file (if the issue is not clear from the error message)
How did you download and install the software?
PyPI (pip, poetry, pipx, etc.)
OCRmyPDF version
16.4.2
Relevant log output