Once more unto Ghostscript mangling Tesseract-produced PDFs #712
Comments
As I think you know, Ken was already instrumental in our most recent invisible font iteration. Can you confirm that the problems you are seeing are true with HEAD (either the 3.0.5 or 4.x branch) as opposed to something older like 3.0.4? I want to make sure you are working with our very latest compatibility tweaks to font metrics. Attaching an example document to this bug doesn't hurt. |
Yes, I should have included an example, but it seems to affect just about everything so it didn't seem that hard to come up with one.... Tesseract 4.00alpha (commit 2f10be5)
Using testing/phototest.tif
After Tesseract, before Ghostscript
After Ghostscript
After Ghostscript, streams uncompressed with qpdf for easy viewing
I also tried omitting
Before Ghostscript, here is Acrobat XI showing that text search for words works normally.
After Ghostscript, here is Acrobat showing that a search for the word "p o i n t" matches because it is now convinced that there are spaces between each character. The highlighting is now misaligned as well. |
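The "p o i n t" symptom described above can be checked mechanically on extracted text. The following is a minimal sketch, not something used in this thread: the function name `looks_letter_spaced` and the 0.8 threshold are assumptions chosen for illustration, and the input is presumed to be text already extracted by a tool such as pdftotext.

```python
def looks_letter_spaced(text: str, threshold: float = 0.8) -> bool:
    """Heuristic: True when most whitespace-separated tokens are single characters.

    The threshold is an arbitrary illustrative value, not a tested one.
    """
    tokens = text.split()
    if len(tokens) < 4:
        return False  # too little text to judge either way
    singles = sum(1 for token in tokens if len(token) == 1)
    return singles / len(tokens) >= threshold


mangled = "T h i s i s a p o i n t"
normal = "This is a point about fonts"
print(looks_letter_spaced(mangled), looks_letter_spaced(normal))
```

Running this on the text extracted before and after Ghostscript would give a quick pass/fail signal without opening a viewer, although it says nothing about highlight alignment.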
Thanks, that is very clear. I'm always happy to tweak things on the Tesseract side to improve compatibility, but it does require careful testing. The tool of choice is |
Well, I tried
--- pdf.ttx_original 2017-02-09 22:43:11.000000000 -0800
+++ pdf.ttx 2017-02-09 22:40:31.000000000 -0800
@@ -121,8 +121,8 @@
</OS_2>
<hmtx>
- <mtx name=".notdef" width="0" lsb="0"/>
- <mtx name=".null" width="0" lsb="0"/>
+ <mtx name=".notdef" width="1024" lsb="0"/>
+ <mtx name=".null" width="1024" lsb="0"/>
</hmtx>
<cmap>
I took Ken's remark that Ghostscript didn't like individual glyphs with a width of 0, so I gave them a width equal to the full glyph box. (1024 in .ttx units, 500 in PDF font units, from what I infer) pdftotext works, search works, even macOS Preview works. |
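For anyone who wants to repeat the edit without a text editor, here is a rough sketch of the same hmtx change applied to TTX XML with the Python standard library. The helper names `set_hmtx_width` and `ttx_to_pdf_units` are hypothetical, the sample XML is a cut-down stand-in for the real pdf.ttx, and the 2048 units-per-em default is an assumption that merely agrees with the inferred 1024 to 500 conversion.

```python
import xml.etree.ElementTree as ET


def set_hmtx_width(ttx_xml: str, glyph_names: set, width: int) -> str:
    """Return TTX XML with the advance width of the named glyphs replaced."""
    root = ET.fromstring(ttx_xml)
    for mtx in root.iter("mtx"):
        if mtx.get("name") in glyph_names:
            mtx.set("width", str(width))
    return ET.tostring(root, encoding="unicode")


def ttx_to_pdf_units(width: int, units_per_em: int = 2048) -> float:
    """Convert font design units to PDF glyph-space units (1/1000 em).

    units_per_em=2048 is assumed here, not read from the font.
    """
    return width * 1000 / units_per_em


# Cut-down stand-in for the <hmtx> table shown in the diff above.
SAMPLE = """<ttFont><hmtx>
<mtx name=".notdef" width="0" lsb="0"/>
<mtx name=".null" width="0" lsb="0"/>
</hmtx></ttFont>"""

patched = set_hmtx_width(SAMPLE, {".notdef", ".null"}, 1024)
print(patched)
print(ttx_to_pdf_units(1024))
```

The patched XML would still need to be compiled back to a .ttf with the ttx tool before it could be embedded.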
Chrome uses pdfium, Firefox uses pdf.js. Will take a closer look when I get a chance. Thanks for investigating. |
Only the null character is used. Here's a control vs. experiment for compatibility testing. I took a quick look at Acroread, Chrome, Firefox, evince on Linux and did not notice a difference. Need testing on all the other popular platforms (including the mobile PDF viewers) to feel comfortable. I'd also like to know exactly what you did when you said "Search ... seems to be broken."
--- pdf.ttx.orig 2017-02-10 09:35:03.000000000 -0800
+++ pdf.ttx 2017-02-10 09:25:06.000000000 -0800
@@ -122,7 +122,7 @@
<hmtx>
<mtx name=".notdef" width="0" lsb="0"/>
- <mtx name=".null" width="0" lsb="0"/>
+ <mtx name=".null" width="1024" lsb="0"/>
</hmtx>
<cmap> |
on both pdfs pdf.js is broken: image: copy & paste to gedit:
search: |
I'm trying to sort similar issues. I am working with poppler, pdf2htmlEX (which uses poppler for extraction iirc), and Acrobat Pro 10. I have been fighting exactly the same issues described here. I see the same results with pdf.js that @amitdo mentioned in the reply above on the following file... This file started as a PDF from 300dpi scans, I extracted it to PNGs with ImageMagick, and OCR'd those with Tess v4 LSTM into a new PDF. Here's a copy/paste of the first paragraph of the first page from OSX Preview
and Acrobat Pro (also on OSX)...
And pdf.js...
In Chrome...
pdftotext via poppler...
Versions.. Preview 909.12 |
For control.pdf and experiment.pdf, I checked that:
Both files passed. I then created control_gs.pdf and experiment_gs.pdf using Ghostscript 9.20. control_gs.pdf For these two files, control_gs.pdf failed all tests, and experiment_gs.pdf passed all tests. The change in the experiment, assigning a width to the .null glyph, is therefore an improvement without any known regressions (yay!). The outputs of pdftotext on experiment.pdf and experiment_gs.pdf are binary identical. I must have been mistaken in my earlier remark that there was a search functionality regression on "pdf.js" (by which I meant pdfium). I cannot replicate whatever problem I found with either my test files or experiment_gs.pdf. |
@amitdo With the way this experiment is set up, finding that pdf.js gives the same result on control and experiment is not a regression. It just means there are more cases of text extraction not working perfectly unrelated to running them through Ghostscript. I confirmed that experiment.pdf, experiment_gs.pdf and control.pdf all have the problem you identified with "making use of the theory". Maybe there's something else we can do.
@RNCTX In this issue we're discussing how Ghostscript's pdfwrite utility seems to utterly ruin spacing in Tesseract-produced PDFs that previously appeared correctly in most viewers, rather than the general issue of spacing between characters not working in Tesseract PDFs. The real problem is the PDF spec itself:
|
I read that with Windows 10 the default pdf reader is the Edge browser. Someone should test it. |
Checked Win10/Edge. Same thing, control_gs.pdf fails and the others are acceptable. Someone other than me should check things though.
|
Yes, I understand the context; perhaps I should have clarified my post a bit better. In working with the output in other tools, as you say in the OP, "ghostscript is used in many utilities, perhaps without even the knowledge that it is being used by the user." In my case the PDF output of Tesseract is fine; in fact, in terms of cleanliness as input for other tools it fares better than any other. But I arrived at this thread after attempting to resize a Tesseract output PDF with ImageMagick (which, of course, uses Ghostscript). I am looking at your files in my various tools... OSX Preview, pdf.js, and poppler output remain unusable, but I agree that the others are improved. Interestingly, OSX Preview is different for the two files you posted. The run-on words are in different places. Your change leaves us with Chrome and Acrobat working flawlessly, which is a pretty good start. Chrome.txt |
Can someone please test on iOS? |
Here ya go.. Also tried iBooks on iOS but predictably the same output as Safari. Acrobat reader on iOS does not allow text highlighting, but it does not find multi-word searches in either control_gs or experiment_gs, so Acrobat on iOS seems to be using a different renderer than it does on the desktop apps. Acrobat Pro X on the Mac desktop app does find multi-word searches on experiment_gs, but not control_gs. The dropbox viewer on iOS is apparently using Chrome's desktop renderer, but Chrome on iOS is using Safari's/Apple's instead of the Chrome desktop pdf renderer going by these results. Firefox on iOS has very poor touch recognition in pdf files, so all I could do was pick the first word and "select all" which gave me 'some' text from each file but not all text on a page in either control_gs or experiment_gs This is on an iPad Air2 with iOS 10.x Chrome_iOS.txt |
Sorry for not being more clear. I need testing of control.pdf against experiment.pdf on iOS before we can submit the change. |
DropboxViewer_Control-Experiment_iOS.txt To clarify, these are from the PDF files in this post... |
Thank you very much. Okay, no known regressions, so let's get that revised font (pdf.ttf) in, snapshot the 3.0.5 branch, and ship to millions of users. |
Jeff,
Does this new ttf need to be added to both 3.05 and master branches?
ShreeDevi
On Tue, Feb 14, 2017 at 12:07 AM, jbreiden ***@***.***> wrote:
pdf.ttf.zip
<https://github.com/tesseract-ocr/tesseract/files/772035/pdf.ttf.zip>
|
@zdenop added it to the 3.05 branch, and I guess he will add it to 'master' soon... |
done |
Sorry to re-open, but it seems tesseract 4.0.0 shows the same behaviour:
|
@WillemJansen Please open a new issue. Include your input, intermediate, and output files, the command lines you use to produce each, and note what PDF viewer you are using to extract text from the PDF. |
To recap, when a Tesseract PDF (3.0x or 4.x) is run through Ghostscript the OCR layer will be mangled. Ghostscript's pdfwrite (gs -sDEVICE=pdfwrite -o out.pdf in.pdf) will display spaces between every character and get confused about word boundaries. Other PDF viewers tend to work but usually have problems with searching for text, because they read the text as having spaces in between.
Before (pdftotext)
After
(A related issue I reported was fixed in Ghostscript 9.20, but unfortunately that is not a complete solution. Ghostscript <9.20 also corrupts any characters above U+00FF that happen to be present.)
There are lots of reasons someone might run a Tesseract PDF through Ghostscript pdfwrite: producing lower DPI renderings, PDF/A conversion, merging PDFs, changing paper sizes, sanitizing potential security holes like Javascript. There are also a lot of programs and services that use Ghostscript internally, sometimes without the user being aware of this. It's unfortunate that Tesseract PDFs don't play nicely with Ghostscript.
Ken Sharp (Ghostscript PDF dev) swears up and down that he can't do anything about it, essentially because Ghostscript interprets the input PDF into a page description language that is then rendered using pdfwrite. The output is visually identical, but otherwise the file is rewritten. Artifex also views preserving OCR text or other metadata as a bonus; if pdfwrite produces visually identical output they are satisfied.
See this comment from 2015:
https://bugs.ghostscript.com/show_bug.cgi?id=696116
Ken Sharp explains the essential difference is that the /DW (default glyph width) parameter on the GlyphLessFont is not understood by GhostPDL so it sets /DW 0 and manually positions each glyph (the -500).
In English, Tesseract renders OCR with a font whose glyphs are 500 arbitrary units wide. Ghostscript reinterprets this as glyphs that are 0 units wide and moves the cursor 500 units between characters, and insists that it's the same thing.
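To see why the two encodings place glyphs identically, it may help to plug both into the horizontal displacement formula from the PDF specification's text-space rules, tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th. The sketch below does only that arithmetic; the function name and the 12-unit font size are illustrative assumptions, not values taken from any file in this issue.

```python
def horizontal_displacement(glyph_width, tj_adjust, font_size,
                            char_spacing=0.0, word_spacing=0.0, horiz_scale=1.0):
    """Advance of the text cursor after one glyph, in text-space units.

    glyph_width and tj_adjust are in thousandths of a text-space unit,
    as glyph widths and TJ array numbers are in a PDF.
    """
    w0 = glyph_width / 1000.0
    return ((w0 - tj_adjust / 1000.0) * font_size
            + char_spacing + word_spacing) * horiz_scale


# Tesseract style: glyph is 500 wide, no explicit adjustment.
tesseract_style = horizontal_displacement(500, 0, 12)
# pdfwrite style: glyph is 0 wide, followed by a -500 adjustment in the TJ array.
ghostscript_style = horizontal_displacement(0, -500, 12)
print(tesseract_style, ghostscript_style)
```

The cursor ends up in the same place either way, which is why the rendering is unchanged. A text extractor that relies on glyph widths, however, sees zero-width glyphs separated by gaps, which would explain the inserted spaces.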
I tried surgery on a pdfwrite-mangled file. I removed all of the -500 offsets, set /DW 500 on the main font object, and removed the individual glyph width array /W [...] from the same. That works. (pdfwrite makes other minor changes to the PDF output too, but these don't matter as far as I know.) Writing a little script to fix mangled PDFs is possible, but it would be better to find a workaround.
So, is there any possibility of adjusting the glyphless font to work more like what Ghostscript expects so it survives the trip... without losing all of the other considerable and much appreciated effort that has gone into making glyphless work great with most other interpreters?
What are the commercial OCR tools doing to avoid similar issues?
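As an illustration of the manual surgery described above, here is a rough sketch of the two text substitutions involved. It is not a working PDF repair tool: it assumes the content stream and font dictionary have already been pulled out as uncompressed text, that glyphs are written as hex strings (so no `]` or `-500` can occur inside a string), and that the adjustments appear literally as `-500`. The function names are made up for this example, and nothing here updates stream lengths or cross-reference offsets.

```python
import re

# A TJ operand array, assuming hex strings only (no "]" inside strings).
TJ_ARRAY = re.compile(r"\[([^\]]*)\]\s*TJ")
# A standalone -500 adjustment, not part of a longer number or name.
KERN_500 = re.compile(r"(?<![\w.])-500(?:\.0+)?(?![\w.])")


def strip_tj_offsets(content_stream: str) -> str:
    """Remove the -500 positioning adjustments from every TJ array."""
    def fix(match):
        return "[" + KERN_500.sub("", match.group(1)) + "] TJ"
    return TJ_ARRAY.sub(fix, content_stream)


def restore_default_width(font_dict: str) -> str:
    """Drop the /W array (one nesting level) and set /DW back to 500."""
    without_w = re.sub(r"/W\s*\[(?:[^\[\]]|\[[^\]]*\])*\]", "", font_dict)
    return re.sub(r"/DW\s+-?\d+", "/DW 500", without_w)


stream = "BT /F1 12 Tf [<0070>-500<006F>-500<0069>] TJ ET"
font = "<< /Type /Font /Subtype /CIDFontType2 /DW 0 /W [0 [0 0 0]] >>"
print(strip_tj_offsets(stream))
print(restore_default_width(font))
```

A real script would need a PDF library or a tool such as qpdf to decompress the file first and to rebuild a valid file afterwards.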