Once more unto Ghostscript mangling Tesseract-produced PDFs #712

Closed
jbarlow83 opened this issue Feb 8, 2017 · 24 comments

Comments

@jbarlow83

To recap, when a Tesseract PDF (3.0x or 4.x) is run through Ghostscript, the OCR layer is mangled. After Ghostscript's pdfwrite (gs -sDEVICE=pdfwrite -o out.pdf in.pdf), extracted text shows spaces between every character and word boundaries are confused. Other PDF viewers tend to work but usually have problems with searching for text, because they read the text as having spaces in between.

Before (pdftotext)

Portez ce vieux whisky au juge
blond qui fume sur son île

After

P o r t e z c e v i e u x w h i s k y a u j u g e
o n d q u i f u m e s u r s o n île
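
The damage shown above can be detected mechanically. The following is a minimal sketch, not part of the original report: it flags extracted text in which most whitespace-separated tokens are a single character. The function names and the 0.8 threshold are assumptions chosen for illustration.

```python
def single_char_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(tok) == 1 for tok in tokens) / len(tokens)


def looks_letter_spaced(text: str, threshold: float = 0.8) -> bool:
    """Heuristic: True when most tokens are single characters.

    The 0.8 threshold is an arbitrary assumption, not a measured value.
    """
    return single_char_ratio(text) >= threshold


# The "Before" and "After" samples quoted in this report.
before = "Portez ce vieux whisky au juge\nblond qui fume sur son île"
after = (
    "P o r t e z c e v i e u x w h i s k y a u j u g e\n"
    "o n d q u i f u m e s u r s o n île"
)

print(looks_letter_spaced(before), looks_letter_spaced(after))
```

A check like this could be run on pdftotext output before and after a Ghostscript pass to notice the regression automatically.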

(A related issue I reported was fixed in Ghostscript 9.20, but unfortunately that is not a complete solution. Ghostscript <9.20 also corrupts any characters above U+00FF that happen to be present.)

There are lots of reasons someone might run a Tesseract PDF through Ghostscript pdfwrite: producing lower DPI renderings, PDF/A conversion, merging PDFs, changing paper sizes, sanitizing potential security holes like Javascript. There are also a lot of programs and services that use Ghostscript internally, sometimes without the user being aware of this. It's unfortunate that Tesseract PDFs don't play nicely with Ghostscript.

Ken Sharp (Ghostscript PDF dev) swears up and down that he can't do anything about it, essentially because Ghostscript interprets the input PDF into a page description language that is then rendered using pdfwrite. The output is visually identical, but otherwise the file is rewritten. Artifex also views preserving OCR text or other metadata as a bonus; if pdfwrite produces visually identical output they are satisfied.

See this comment from 2015:
https://bugs.ghostscript.com/show_bug.cgi?id=696116

Ken Sharp explains the essential difference is that the /DW (default glyph width) parameter on the GlyphLessFont is not understood by GhostPDL so it sets /DW 0 and manually positions each glyph (the -500).

[(T)-500(h)-500(e)-500]TJ

In English, Tesseract renders OCR with a font whose glyphs are 500 arbitrary units wide. Ghostscript reinterprets this as glyphs that are 0 units wide and moves the cursor 500 units between characters, and insists that it's the same thing.
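
The equivalence Ghostscript relies on can be checked with a little arithmetic. In the PDF text model, each glyph advances the cursor by its width minus the TJ adjustment that follows it, both expressed in thousandths of a text space unit and scaled by the font size. The sketch below assumes character spacing and word spacing are zero and horizontal scaling is 100%; the function name is made up for illustration.

```python
def total_advance(glyph_widths, tj_adjustments, font_size=1.0):
    """Horizontal advance of a TJ array, in text space units.

    glyph_widths:   per-glyph widths in thousandths of a unit
    tj_adjustments: the number following each glyph in the TJ array
                    (thousandths of a unit; subtracted from the position)
    Assumes zero character/word spacing and 100% horizontal scaling.
    """
    total = 0.0
    for width, adjustment in zip(glyph_widths, tj_adjustments):
        total += (width / 1000 - adjustment / 1000) * font_size
    return total


# Tesseract: three glyphs, each 500 units wide, no explicit offsets.
tesseract_style = total_advance([500, 500, 500], [0, 0, 0])

# Ghostscript: [(T)-500(h)-500(e)-500]TJ with zero-width glyphs.
ghostscript_style = total_advance([0, 0, 0], [-500, -500, -500])

print(tesseract_style, ghostscript_style)
```

Both forms move the cursor by the same amount, which is why the output renders identically. The difference only matters to text extractors, which may treat a large explicit offset after a zero-width glyph as a word gap.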

I tried surgery on a pdfwrite-mangled file. I removed all of the -500 offsets, set /DW 500 on the main font object, and removed the individual glyph width array /W [...] from the same. That works. (pdfwrite makes other minor changes to the PDF output too, but these don't matter as far as I know.) Writing a little script to fix mangled PDFs is possible, but it would be better to find a workaround.
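
As a rough illustration of the first step of that surgery, here is a sketch that drops the -500 offsets from TJ arrays in an uncompressed content stream. It is an assumption-laden toy, not the script used for this report: it relies on regular expressions rather than a real PDF parser, does not handle nested parentheses or a `]` inside a string, and does not touch the /DW or /W entries on the font object.

```python
import re

# One element of a TJ array: a literal string, a hex string, or a number.
_TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>|[-+]?\d*\.?\d+")
# A whole TJ operation; assumes no ']' appears inside the strings.
_TJ = re.compile(r"\[([^\]]*)\]\s*TJ")


def strip_tj_offsets(content: str, offset: float = -500) -> str:
    """Remove numeric elements equal to `offset` from every TJ array."""

    def rewrite(match: re.Match) -> str:
        kept = [
            tok
            for tok in _TOKEN.findall(match.group(1))
            if tok[0] in "(<" or float(tok) != offset
        ]
        # Whitespace between array elements is not significant in PDF.
        return "[" + " ".join(kept) + "]TJ"

    return _TJ.sub(rewrite, content)


print(strip_tj_offsets("[(T)-500(h)-500(e)-500]TJ"))
```

Other adjustment values are left alone, since only the -500 offsets were removed in the experiment described above.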

So, is there any possibility of adjusting the glyphless font to work more like what Ghostscript expects so it survives the trip... without losing all of the other considerable and much appreciated effort that has gone into making glyphless work great with most other interpreters?

What are the commercial OCR tools doing to avoid similar issues?

@jbreiden
Contributor

As I think you know, Ken was already instrumental in our most recent invisible font iteration. Can you confirm that the problems you are seeing are true with HEAD (either the 3.0.5 or 4.x branch) as opposed to something older like 3.0.4? I want to make sure you are working with our very latest compatibility tweaks to font metrics. Attaching an example document to this bug doesn't hurt.

@jbarlow83
Author

jbarlow83 commented Feb 10, 2017

Yes, I should have included an example, but it seems to affect just about everything so it didn't seem that hard to come up with one....

Tesseract 4.00alpha (commit 2f10be5)

$ sha1sum tessdata/pdf.ttf
ac5300b169c99e90e9825dd8859b8a850edde22f  tessdata/pdf.ttf

Using testing/phototest.tif

$ tesseract --tessdata-dir . --oem 1  testing/phototest.tif _phototest pdf
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
$ pdftotext _phototest.pdf  -
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

$ gs -sDEVICE=pdfwrite -o _phototest_gs.pdf _phototest.pdf
GPL Ghostscript 9.20 (2016-09-26)
Copyright (C) 2016 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1

$ pdftotext _phototest_gs.pdf -
T h i s
o c r

i s a

l o t o f

c o d e

a n d

1 2

p o i n t

s e e

t e x t

t o

if it w o r k s
t e s t

o n

t h e

a l l t y p e s

o f f i l e f o r m a t .
T h e

q u i c k

l a z y

f o x .

T h e

q u i c k

o v e r

t h e

l a z y

f o x .

T h e

q u i c k

j u m p e d

o v e r

t h e

l a z y

f o x .

b r o w n

b r o w n

d o g

j u m p e d

d o g

j u m p e d

b r o w n

o v e r

o v e r

d o g

t h e

b r o w n

T h e

t h e

j u m p e d
d o g

q u i c k

l a z y

f o x .

After Tesseract, before Ghostscript
_phototest.pdf

After Ghostscript
_phototest_gs.pdf

After Ghostscript, streams uncompressed with qpdf for easy viewing
_phototest_gs_uncompress.pdf

I also tried omitting --oem, not that we would expect this to make a difference. Tesseract 3.04.01 behavior is similar. I replicated this in a Docker container with Tesseract 3.04.01 and Ghostscript 9.19.

Before Ghostscript, here is Acrobat XI showing that text search for words works normally.
image

After Ghostscript, here is Acrobat showing that a search for the word "p o i n t" matches because it is now convinced that there are spaces between each character. The highlighting is now misaligned as well.
image

@jbreiden
Contributor

Thanks, that is very clear. I'm always happy to tweak things on the Tesseract side to improve compatibility, but it does require careful testing. The tool of choice is ttx from fonttools which can transform the font pdf.ttf into an editable XML representation and back. I don't think we had anything missing with respect to font metrics, but you never know. Not sure when I might have time to play with this, but anyone is welcome to try. I'm somewhat hesitant to bother Ken more after all his contributions, but maybe that is just shyness.

@jbarlow83
Author

jbarlow83 commented Feb 10, 2017

Well, I tried fonttools for fun, and to my surprise I might have found a fix – with the important caveat that I have no idea what I'm doing.

--- pdf.ttx_original	2017-02-09 22:43:11.000000000 -0800
+++ pdf.ttx	2017-02-09 22:40:31.000000000 -0800
@@ -121,8 +121,8 @@
   </OS_2>
 
   <hmtx>
-    <mtx name=".notdef" width="0" lsb="0"/>
-    <mtx name=".null" width="0" lsb="0"/>
+    <mtx name=".notdef" width="1024" lsb="0"/>
+    <mtx name=".null" width="1024" lsb="0"/>
   </hmtx>
 
   <cmap>

I took Ken's remark that Ghostscript didn't like individual glyphs with a width of 0, so I gave them a width equal to the full glyph box. (1024 in .ttx units, 500 in PDF font units, from what I infer)
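
The same edit can be scripted against the XML that ttx produces. The sketch below uses only the Python standard library and works on a trimmed-down stand-in for pdf.ttx rather than the real file; the function names are invented for illustration. The unit conversion assumes the font has 2048 units per em, which is not stated in this thread but is consistent with 1024 font units mapping to 500 PDF units.

```python
import xml.etree.ElementTree as ET


def set_advance_width(ttx_xml: str, glyph_names, width: int) -> str:
    """Set the advance width of the named glyphs in a TTX <hmtx> table."""
    root = ET.fromstring(ttx_xml)
    for mtx in root.iter("mtx"):
        if mtx.get("name") in glyph_names:
            mtx.set("width", str(width))
    return ET.tostring(root, encoding="unicode")


def to_pdf_units(width: int, units_per_em: int) -> float:
    """Convert a font-unit width to PDF glyph space (1000 units per em)."""
    return width * 1000 / units_per_em


# Trimmed-down stand-in for the <hmtx> section of pdf.ttx.
sample = """<ttFont>
  <hmtx>
    <mtx name=".notdef" width="0" lsb="0"/>
    <mtx name=".null" width="0" lsb="0"/>
  </hmtx>
</ttFont>"""

patched = set_advance_width(sample, {".notdef", ".null"}, 1024)
print(patched)
print(to_pdf_units(1024, 2048))  # 2048 units per em is an assumption
```

In practice the patched XML would be compiled back into pdf.ttf with ttx, and the result checked in each viewer.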

pdftotext works, search works, even macOS Preview works. Search in Chrome pdf.js seems to be broken however. (Edit: was mistaken)

@jbreiden
Contributor

Chrome uses pdfium, Firefox uses pdf.js. Will take a closer look when I get a chance. Thanks for investigating.

@jbreiden
Contributor

jbreiden commented Feb 10, 2017

Only the null character is used. Here's a control vs. experiment for compatibility testing. I took a quick look at Acroread, Chrome, Firefox, evince on Linux and did not notice a difference. Need testing on all the other popular platforms (including the mobile PDF viewers) to feel comfortable. I'd also like to know exactly what you did when you said "Search ... seems to be broken."

--- pdf.ttx.orig	2017-02-10 09:35:03.000000000 -0800
+++ pdf.ttx	2017-02-10 09:25:06.000000000 -0800
@@ -122,7 +122,7 @@
 
   <hmtx>
     <mtx name=".notdef" width="0" lsb="0"/>
-    <mtx name=".null" width="0" lsb="0"/>
+    <mtx name=".null" width="1024" lsb="0"/>
   </hmtx>
 
   <cmap>

control.pdf

experiment.pdf

@amitdo
Collaborator

amitdo commented Feb 10, 2017

on both pdfs pdf.js is broken:

image:
making use of the theory

copy & paste to gedit:

making
use
of  
the
theory

search:
makinguseof thetheory

@RNCTX
Contributor

RNCTX commented Feb 10, 2017

I'm trying to sort similar issues. I am working with poppler, pdf2htmlEX (which uses poppler for extraction iirc), and Acrobat Pro 10.

I have been fighting exactly the same issues described here.

I see the same results with pdf.js that @amitdo mentioned in the reply above on the following file...

asdf.pdf

This file started as a PDF from 300dpi scans. I extracted it to PNGs with ImageMagick and OCR'd those with Tess v4 LSTM into a new PDF.

Here's a copy/paste of the first paragraph of the first page from OSX preview

take i atll back, and sure enough that's going to come but itwill take time. Firstofallletus ask a rather simple question. How can we be sure, how can we tell, whether any utterance is to be classed as a performative or not? Surely, we feel, we ought to be able to do that. And we should obviously very muchliketobeabletosaythatthereisagrammaticalcriterionforthis, some grammatical means ofdeciding whether an utterance isperformative. All the examples I have given hitherto do in fact have the same grammatical form;theyallofthem beginwith theverbinthefirstpersonsingularpresent indicative active-not just any kind of verb of course, but still they all are in fact of that form. Furthermore, with these verbs that I have used there is a typical asymmetry between the use of this person and tense of the verb and the use of the same verb in other persons and other tenses, and this asym- metry is rather an important clue.

and Acrobat Pro (also on OSX)...

take it all back, and sure enough that's going to come but it will take time.
First of all let us ask a rather simple question. How can we be sure, how can
we tell, whether any utterance is to be classed as a performative or not?
Surely, we feel, we ought to be able to do that. And we should obviously very
much like to be able to say that there is a grammatical criterion for this,
some grammatical means of deciding whether an utterance is performative.
All the examples I have given hitherto do in fact have the same grammatical
form; they all of them begin with the verb in the first person singular present
indicative active-not just any kind of verb of course, but still they all are in
fact of that form. Furthermore, with these verbs that I have used there is a
typical asymmetry between the use of this person and tense of the verb and
the use of the same verb in other persons and other tenses, and this asymmetry
is rather an important clue.

And pdf.js...

it
all
back,
and
sure
enough
that's
going
to
(ad infinitum, all words on a separate line)

In Chrome...

take i allt back, and sure enough that's going to come but it will take time.
First of all let us ask a rather simple question. How can we be sure, how can
we tell, whether any utterance is to be classed as a performative or not?
Surely, we feel, we ought to be able to do that. And we should obviously very
much like to be able to say that there is a grammatical criterion for this,
some grammatical means of deciding whether an utterance is performative.
All the examples I have given hitherto do in fact have the same grammatical
form; they all of them begin with the verb in the first person singular present
indicative active-not just any kind of verb of course, but still they all are in
fact of that form. Furthermore, with these verbs that I have used there is a
typical asymmetry between the use of this person and tense of the verb and
the use of the same verb in other persons and other tenses, and this asymmetry
is rather an important clue.

pdftotext via poppler...

take it all back, and sure enough that's going to come but it will take time.
First of all let us ask a rather simple question. How can we be sure, how can
we tell, whether any utterance is to be classed as a performative or not?
Surely, we feel, we ought to be able to do that. And we should obviously very
much like to be able to say that there is a grammatical criterion for this,
some grammatical means of deciding whether an utterance is performative.
All the examples I have given hitherto do in fact have the same grammatical
form; they all of them begin with the verb in the first person singular present
indicative active-not just any kind of verb of course, but still they all are in
fact of that form. Furthermore, with these verbs that I have used there is a
typical asymmetry between the use of this person and tense of the verb and
the use of the same verb in other persons and other tenses, and this asymmetry is rather an important clue.

Versions..

Preview 909.12
Firefox 51.0.1
Adobe Acrobat Pro 10.1.3
Chrome 55.0.2883.95
poppler 0.51.0

@jbarlow83
Author

For control.pdf and experiment.pdf, I checked that:

  • using Acrobat XI, Chrome (pdfium), Firefox (pdf.js):
    • selecting a word will completely highlight that word rather than missing about half of the final character
    • random text that is copied and pasted preserves word breaks
    • searching for "rela" will select all occurrences of "relativity"
  • pdftotext produces a reasonable representation of the document contents with no extra spaces

Both files passed.

I then created control_gs.pdf and experiment_gs.pdf using Ghostscript 9.20.

control_gs.pdf
experiment_gs.pdf

For these two files, control_gs.pdf failed all tests, and experiment_gs.pdf passed all tests. The change in the experiment, assigning a width to the .null glyph, is therefore an improvement without any known regressions (yay!). The outputs of pdftotext on experiment.pdf and experiment_gs.pdf are binary identical.
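
The round-trip comparison described above could be automated along these lines. This is a sketch only: it assumes the `gs` and `pdftotext` executables are on the PATH, uses the same command lines quoted earlier in this thread, and the helper names are made up for illustration.

```python
import subprocess


def gs_pdfwrite_cmd(src: str, dst: str) -> list:
    """Command line that rewrites src through Ghostscript's pdfwrite."""
    return ["gs", "-sDEVICE=pdfwrite", "-o", dst, src]


def pdftotext_cmd(pdf: str) -> list:
    """Command line that writes the extracted text to stdout."""
    return ["pdftotext", pdf, "-"]


def extracted_text(pdf: str) -> bytes:
    """Run pdftotext and return its raw output."""
    result = subprocess.run(pdftotext_cmd(pdf), check=True, capture_output=True)
    return result.stdout


def survives_ghostscript(src: str, dst: str) -> bool:
    """True when pdftotext output is byte-identical before and after gs."""
    subprocess.run(gs_pdfwrite_cmd(src, dst), check=True)
    return extracted_text(src) == extracted_text(dst)
```

Calling `survives_ghostscript("experiment.pdf", "experiment_gs.pdf")` would regenerate the Ghostscript output and report whether the text layer came through unchanged.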

I must have been mistaken in my earlier remark that there was a search functionality regression in "pdf.js" (by which I meant pdfium). I cannot replicate whatever problem I found with either my test files or experiment_gs.pdf.

@jbarlow83
Author

@amitdo With the way this experiment is set up, finding that pdf.js gives the same result on control and experiment is not a regression. It just means there are more cases of text extraction not working perfectly unrelated to running them through Ghostscript. I confirmed that experiment.pdf, experiment_gs.pdf and control.pdf all have the problem you identified with "making use of the theory". Maybe there's something else we can do.

@RNCTX In this issue we're discussing how Ghostscript's pdfwrite utility seems to utterly ruin spacing in Tesseract-produced PDFs that previously appeared correctly in most viewers, rather than the general issue of spacing between characters not working in Tesseract PDFs. The real problem is the PDF spec itself:

Identifying Word Breaks
A document’s text stream defines not only the characters in a page’s text but also the words. Unlike a character, the notion of a word is not precisely defined but depends on the purpose for which the text is being processed. [...] applications all have their own ideas of what constitutes a word.

@amitdo
Collaborator

amitdo commented Feb 12, 2017

I read that with Windows 10 the default pdf reader is the Edge browser. Someone should test it.

@jbarlow83
Author

jbarlow83 commented Feb 12, 2017 via email

@RNCTX
Contributor

RNCTX commented Feb 13, 2017

@jbarlow83:

In this issue we're discussing how Ghostscript's pdfwrite utility seems to utterly ruin spacing in Tesseract-produced PDFs that previously appeared correctly in most viewers, rather than the general issue of spacing between characters not working in Tesseract PDFs. The real problem is the PDF spec itself:

Yes, I understand the context, perhaps I should have clarified my post a bit better. In working with the output in other tools, as you say in the OP, "ghostscript is used in many utilities, perhaps without even the knowledge that it is being used by the user." In my case the PDF output of Tesseract is fine, in fact in terms of cleanliness as input for other tools it fares better than any other. But I arrived at this thread after attempting to resize a tesseract output PDF with Imagemagick (which, of course, uses ghostscript).

I am looking at your files in my various tools...

OSX Preview, pdf.js, and poppler output remain un-usable, but I agree that the others are improved. Interestingly, OSX Preview is different for the two files you posted. The run-on words are in different places.

Your change leaves us with Chrome and Acrobat working flawlessly, which is a pretty good start.

Chrome.txt
pdftotext.txt
pdf.js.txt
OSX Preview.txt
Acrobat X.txt

@jbreiden
Contributor

Can someone please test on iOS?

@RNCTX
Contributor

RNCTX commented Feb 13, 2017

Here ya go..

Also tried iBooks on iOS but predictably the same output as Safari.

Acrobat reader on iOS does not allow text highlighting, but it does not find multi-word searches in either control_gs or experiment_gs, so Acrobat on iOS seems to be using a different renderer than it does on the desktop apps. Acrobat Pro X on the Mac desktop app does find multi-word searches on experiment_gs, but not control_gs.

The Dropbox viewer on iOS is apparently using Chrome's desktop renderer, but going by these results, Chrome on iOS is using Safari's/Apple's renderer instead of the Chrome desktop PDF renderer.

Firefox on iOS has very poor touch recognition in PDF files, so all I could do was pick the first word and "select all", which gave me some text from each file but not all of the text on a page in either control_gs or experiment_gs.

This is on an iPad Air2 with iOS 10.x

Chrome_iOS.txt
Safari_iOS.txt
Dropbox_Viewer_iOS.txt
Firefox_iOS.txt

@jbreiden
Contributor

Sorry for not being more clear. I need testing of control.pdf against experiment.pdf on iOS before we can submit the change.

@RNCTX
Contributor

RNCTX commented Feb 13, 2017

@jbreiden
Contributor

Thank you very much. Okay, no known regressions, so let's get that revised font (pdf.ttf) in, snapshot the 3.0.5 branch, and ship to millions of users.

@jbreiden
Contributor

pdf.ttf.zip

@Shreeshrii
Collaborator

Shreeshrii commented Feb 15, 2017 via email

@amitdo
Collaborator

amitdo commented Feb 15, 2017

@zdenop added it to the 3.05 branch, and I guess he will add it to 'master' soon...

zdenop closed this as completed in a011b15 Feb 15, 2017

@zdenop
Contributor

zdenop commented Feb 15, 2017

done

@WillemJansen

WillemJansen commented Feb 8, 2019

Sorry to re-open, but it seems tesseract 4.0.0 shows the same behaviour:

  • Use Tesseract to ocr pdf (4.0.0)
  • Use gs to reduce quality of images after ocr (latest version 28.11.2018)
  • Receive a s p a c e a f t e r e v e r y l e t t e r

@jbarlow83
Author

@WillemJansen Please open a new issue. Include your input, intermediate, and output files, the command lines you use to produce each, and note what PDF viewer you are using to extract text from the PDF.

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021