Significant change to invisible font system

to improve correctness and compatibility with external programs, particularly Ghostscript. We will start mapping every character to a single glyph, rather than allowing characters to run off the end of the font. A more detailed design discussion is embedded in the pdfrenderer.cpp comments. The font, the source code that produces the font, and the design comments were contributed by Ken Sharp from Artifex Software.
tesseract-ocr · May 13, 2015 · 6b63417
1 parent 2924d3a
Showing 5 changed files with 1,039 additions and 1,765 deletions.
diff --git a/api/pdfrenderer.cpp b/api/pdfrenderer.cpp
@@ -14,6 +14,139 @@
 #include "mathfix.h"
 #endif
 
+/*
+
+Design notes from Ken Sharp, with light editing.
+
+We think one solution is a font with a single glyph (.notdef) and a
+CIDToGIDMap which maps all the CIDs to 0. That map would then be
+stored as a stream in the PDF file, and when flate compressed should
+be pretty small. The font, of course, will be approximately the same
+size as the one you currently use.
+
+I'm working on such a font now, the CIDToGIDMap is trivial, you just
+create a stream object which contains 128k bytes (2 bytes per possible
+CID and your CIDs range from 0 to 65535) and where you currently have
+"/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".
+
+Note that if, in future, you were to use a different (ie not 2 byte)
+CMap for character codes you could trivially extend the CIDToGIDMap.
+
+The following is an explanation of how some of the font stuff works,
+this may be too simple for you in which case please accept my
+apologies, it's hard to know how much knowledge someone has. You can
+skip all this anyway, it's just for information.
+
+The font embedded in a PDF file is usually intended just to be
+rendered, but extensions allow for at least some ability to locate (or
+copy) text from a document. This isn't something which was an original
+goal of the PDF format, but it's been retro-fitted, presumably due to
+popular demand.
+
+To do this reliably the PDF file must contain a ToUnicode CMap, a
+device for mapping character codes to Unicode code points. If one of
+these is present, then this will be used to convert the character
+codes into Unicode values. If it's not present then the reader will
+fall back through a series of heuristics to try and guess the
+result. This is, as you would expect, prone to failure.
+
+This doesn't concern you of course, since you always write a ToUnicode
+CMap. Because you are writing the text in text rendering mode 3
+(invisible), it would seem that you don't really need to worry about
+the font. But in the PDF spec you cannot have an isolated ToUnicode
+CMap; it has to be attached to a font, so in order to get even
+copy/paste to work you need to define a font.
+
+This is what leads to problems: tools like pdfwrite assume that they
+are going to be able to (or even have to) modify the font entries, so
+they require that the font being embedded be valid, and to be honest
+the font Tesseract embeds isn't valid (for this purpose).
+
+
+To see why, let's look at how text is specified in a PDF file:
+
+(Test) Tj
+
+Now that looks like text but actually it isn't. Each of those bytes is
+a 'character code'. When it comes to rendering the text a complex
+sequence of events takes place, which converts the character code into
+'something' which the font understands. It's entirely possible via
+character mappings to have that text render as 'Sftu'.
+
+For simple fonts (PostScript type 1), we use the character code as the
+index into an Encoding array (256 elements), each element of which is
+a glyph name, so this gives us a glyph name. We then consult the
+CharStrings dictionary in the font, that's a complex object which
+contains pairs of keys and values, you can use the key to retrieve a
+given value. So we have a glyph name, we then use that as the key to
+the dictionary and retrieve the associated value. For a type 1 font,
+the value is a glyph program that describes how to draw the glyph.
+
+For CIDFonts, it's a little more complicated. Because CIDFonts can be
+large, using a glyph name as the key is unreasonable (it would also
+lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
+as the key. CIDs are just numbers.
+
+But.... We don't use the character code as the CID. What we do is use
+a CMap to convert the character code into a CID. We then use the CID
+to key the CharStrings dictionary and proceed as before. So the 'CMap'
+is the equivalent of the Encoding array, but it's a more compact and
+flexible representation.
+
+Note that you have to use the CMap just to find out how many bytes
+constitute a character code, and it can be variable. For example you
+can say if the first byte is 0x00->0x7f then it's just one byte, if it's
+0x80->0xef then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
+have seen CMaps defining character codes up to 5 bytes wide.
+
+Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
+TrueType CIDFonts. The thing is that TrueType fonts are accessed using
+a Glyph ID (GID) (and the LOCA table) which may well not be anything
+like the CID. So for this case PDF includes a CIDToGIDMap. That maps
+the CIDs to GIDs, and we can then use the GID to get the glyph
+description from the GLYF table of the font.
+
+So for a TrueType CIDFont, character-code->CID->GID->glyf-program.
+
+Looking at the PDF file I was supplied with we see that it contains
+text like :
+
+<0x0075> Tj
+
+So we start by taking the character code (117) and look it up in the
+CMap. Well you don't supply a CMap, you just use the Identity-H one
+which is predefined. So character code 117 maps to CID 117. Then we
+use the CIDToGIDMap, again you don't supply one, you just use the
+predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
+were supplied with only contains 116 glyphs.
+
+Now for Latin that's not a huge problem, you can just supply a bigger
+font. But for more complex languages that *is* going to be more of a
+problem. Either you need to supply a font which contains glyphs for
+all the possible CID->GID mappings, or we need to think laterally.
+
+Our solution using a TrueType CIDFont is to intervene at the
+CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
+font with just one glyph, the .notdef glyph at GID 0. This is what I'm
+looking into now.
+
+It would also be possible to have a 'PostScript' (ie type 1 outlines)
+CIDFont which contained 1 glyph, and a CMap which mapped all character
+codes to CID 0. The effect would be the same.
+
+It's possible (I haven't checked) that the PostScript CIDFont and
+associated CMap would be smaller than the TrueType font and associated
+CIDToGIDMap.
+
+--- in a followup ---
+
+OK there is a small problem there, if I use GID 0 then Acrobat gets
+upset about it and complains it cannot extract the font. If I set the
+CIDToGIDMap so that all the entries are 1 instead, its happy. Totally
+mad......
+
+*/
+
 namespace tesseract {
 
 // Use for PDF object fragments. Must be large enough
@@ -334,7 +467,8 @@ bool TessPDFRenderer::BeginDocumentHandler() {
                "  /Type /Catalog\n"
                "  /Pages %ld 0 R\n"
                ">>\n"
-               "endobj\n", 2L);
+               "endobj\n",
+               2L);
   if (n >= sizeof(buf)) return false;
   AppendPDFObject(buf);
 
@@ -355,8 +489,8 @@ bool TessPDFRenderer::BeginDocumentHandler() {
                "  /Type /Font\n"
                ">>\n"
                "endobj\n",
-               4L,          // CIDFontType2 font
-               5L           // ToUnicode
+               4L,         // CIDFontType2 font
+               6L          // ToUnicode
                );
   if (n >= sizeof(buf)) return false;
   AppendPDFObject(buf);
@@ -366,7 +500,7 @@ bool TessPDFRenderer::BeginDocumentHandler() {
                "4 0 obj\n"
                "<<\n"
                "  /BaseFont /GlyphLessFont\n"
-               "  /CIDToGIDMap /Identity\n"
+               "  /CIDToGIDMap %ld 0 R\n"
                "  /CIDSystemInfo\n"
                "  <<\n"
                "     /Ordering (Identity)\n"
@@ -379,11 +513,44 @@ bool TessPDFRenderer::BeginDocumentHandler() {
                "  /DW %d\n"
                ">>\n"
                "endobj\n",
-               6L,         // Font descriptor
+               5L,         // CIDToGIDMap
+               7L,         // Font descriptor
                1000 / kCharWidth);
   if (n >= sizeof(buf)) return false;
   AppendPDFObject(buf);
 
+  // CIDTOGIDMAP
+  const int kCIDToGIDMapSize = 2 * (1 << 16);
+  unsigned char *cidtogidmap = new unsigned char[kCIDToGIDMapSize];
+  for (int i = 0; i < kCIDToGIDMapSize; i++) {
+    cidtogidmap[i] = (i % 2) ? 1 : 0;
+  }
+  size_t len;
+  unsigned char *comp =
+      zlibCompress(cidtogidmap, kCIDToGIDMapSize, &len);
+  delete[] cidtogidmap;
+  n = snprintf(buf, sizeof(buf),
+               "5 0 obj\n"
+               "<<\n"
+               "  /Length %ld /Filter /FlateDecode\n"
+               ">>\n"
+               "stream\n", static_cast<long>(len));
+  if (n >= sizeof(buf)) {
+    lept_free(comp);
+    return false;
+  }
+  AppendString(buf);
+  long objsize = strlen(buf);
+  AppendData(reinterpret_cast<char *>(comp), len);
+  objsize += len;
+  lept_free(comp);
+  const char *endstream_endobj =
+      "endstream\n"
+      "endobj\n";
+  AppendString(endstream_endobj);
+  objsize += strlen(endstream_endobj);
+  AppendPDFObjectDIY(objsize);
+
   const char *stream =
       "/CIDInit /ProcSet findresource begin\n"
       "12 dict begin\n"
@@ -409,7 +576,7 @@ bool TessPDFRenderer::BeginDocumentHandler() {
 
   // TOUNICODE
   n = snprintf(buf, sizeof(buf),
-               "5 0 obj\n"
+               "6 0 obj\n"
                "<< /Length %lu >>\n"
                "stream\n"
                "%s"
@@ -421,7 +588,7 @@ bool TessPDFRenderer::BeginDocumentHandler() {
   // FONT DESCRIPTOR
   const int kCharHeight = 2;  // Effect: highlights are half height
   n = snprintf(buf, sizeof(buf),
-               "6 0 obj\n"
+               "7 0 obj\n"
                "<<\n"
                "  /Ascent %d\n"
                "  /CapHeight %d\n"
@@ -439,7 +606,7 @@ bool TessPDFRenderer::BeginDocumentHandler() {
                1000 / kCharHeight,
                1000 / kCharWidth,
                1000 / kCharHeight,
-               7L      // Font data
+               8L      // Font data
                );
   if (n >= sizeof(buf)) return false;
   AppendPDFObject(buf);
@@ -461,23 +628,20 @@ bool TessPDFRenderer::BeginDocumentHandler() {
   fclose(fp);
   // FONTFILE2
   n = snprintf(buf, sizeof(buf),
-               "7 0 obj\n"
+               "8 0 obj\n"
                "<<\n"
                "  /Length %ld\n"
                "  /Length1 %ld\n"
                ">>\n"
                "stream\n", size, size);
   if (n >= sizeof(buf)) return false;
   AppendString(buf);
-  size_t objsize  = strlen(buf);
+  objsize  = strlen(buf);
   AppendData(buffer, size);
   delete[] buffer;
   objsize += size;
-  const char *b2 =
-      "endstream\n"
-      "endobj\n";
-  AppendString(b2);
-  objsize += strlen(b2);
+  AppendString(endstream_endobj);
+  objsize += strlen(endstream_endobj);
   AppendPDFObjectDIY(objsize);
   return true;
 }
@@ -679,9 +843,7 @@ bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
   unsigned char *pdftext_casted = reinterpret_cast<unsigned char *>(pdftext);
   size_t len;
   unsigned char *comp_pdftext =
-      zlibCompress(pdftext_casted,
-                   pdftext_len,
-                   &len);
+      zlibCompress(pdftext_casted, pdftext_len, &len);
   long comp_pdftext_len = len;
   n = snprintf(buf, sizeof(buf),
                "%ld 0 obj\n"

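The design notes in the pdfrenderer.cpp comment describe the lookup chain character code -> CID -> GID. As a minimal sketch of that chain, assuming the predefined Identity-H CMap and an uncompressed map in which every entry is GID 1 (the value the new loop writes), the following stand-alone C++ builds the 128 KiB table and resolves a 2-byte character code. The names `BuildCIDToGIDMap`, `IdentityH` and `LookupGID` are hypothetical and are not part of Tesseract; the zlib compression step is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Build an uncompressed CIDToGIDMap: one big-endian 16-bit GID per CID,
// for all 65536 possible CIDs. Every CID maps to the same GID.
std::vector<unsigned char> BuildCIDToGIDMap(uint16_t gid) {
  const std::size_t kNumCIDs = 1 << 16;
  std::vector<unsigned char> map(2 * kNumCIDs);
  for (std::size_t cid = 0; cid < kNumCIDs; ++cid) {
    map[2 * cid] = static_cast<unsigned char>(gid >> 8);        // high byte
    map[2 * cid + 1] = static_cast<unsigned char>(gid & 0xff);  // low byte
  }
  return map;
}

// Identity-H treats each 2-byte character code as its own CID.
uint16_t IdentityH(unsigned char hi, unsigned char lo) {
  return static_cast<uint16_t>((hi << 8) | lo);
}

// Read the GID stored for a CID (two bytes, high byte first).
uint16_t LookupGID(const std::vector<unsigned char>& map, uint16_t cid) {
  const std::size_t i = 2 * static_cast<std::size_t>(cid);
  return static_cast<uint16_t>((map[i] << 8) | map[i + 1]);
}
```

For the `<0x0075> Tj` example in the notes, `LookupGID(BuildCIDToGIDMap(1), IdentityH(0x00, 0x75))` resolves CID 117 to GID 1, as it does for every other CID, so no character code can run off the end of the font.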
diff --git a/tessdata/pdf.ttf b/tessdata/pdf.ttf
diff --git a/tessdata/pdf.ttx b/tessdata/pdf.ttx
diff --git a/training/GlyphLessFont.c b/training/GlyphLessFont.c
diff --git a/training/GlyphLessFont.h b/training/GlyphLessFont.h
@@ -0,0 +1,228 @@
+/* I don't expect anyone to run this program, ever again.  It is
+ * included primarily as documentation for how the GlyphLessFont was
+ * created.
+ */
+
+/* The OpenType data types, we'll duplicate the definitions so that
+ * the code shall be (as far as possible) self-documenting simply by
+ * referencing the OpenType specification. Note that the specification
+ * is somewhat inconsistent with regards to usage, naming and capitalisation
+ * of the names for these data types.
+ */
+typedef char BYTE;
+typedef char CHAR;
+typedef unsigned short USHORT;
+typedef short SHORT;
+typedef struct _uint24 {char top8;unsigned short bottom16;} UINT24;
+typedef unsigned long ULONG;
+typedef long LONG;
+typedef unsigned long Fixed;
+typedef SHORT FWORD;
+typedef USHORT UFWORD;
+typedef unsigned short F2DOT14;
+typedef struct _datetime {long upper;long lower;} LONGDATETIME;
+typedef char Tag[4];
+typedef USHORT GlyphId;
+typedef USHORT Offset;
+typedef struct _longHorMetric {USHORT advanceWidth;SHORT lsb;} longHorMetric;
+
+/* And now definitions for each of the OpenType tables we will wish to use */
+
+typedef struct {
+    Fixed sfnt_version;
+    USHORT numTables;
+    USHORT searchRange;
+    USHORT entrySelector;
+    USHORT rangeShift;
+} Offset_Table;
+
+typedef struct {
+    Tag tag;        /* The spec defines this as a ULONG,
+                       but also as a 'Tag' in its own right */
+    ULONG checkSum;
+    ULONG offset;
+    ULONG length;
+} TableRecord;
+
+typedef struct {
+    USHORT version;
+    USHORT numTables;
+} cmap_header;
+
+typedef struct {
+    USHORT platformID;
+    USHORT encodingID;
+    ULONG Offset;
+} cmap_record;
+
+typedef struct {
+    USHORT format;
+    USHORT length;
+    USHORT language;
+    BYTE glyphIDArray[256];
+} format0_cmap_table;
+
+/* This structure only works for single segment format 4 tables,
+   for multiple segments it must be constructed */
+typedef struct {
+    USHORT format;
+    USHORT length;
+    USHORT language;
+    USHORT segCountx2;
+    USHORT searchRange;
+    USHORT entrySelector;
+    USHORT rangeShift;
+    USHORT endcount;
+    USHORT reservedPad;
+    USHORT startCount;
+    SHORT idDelta;
+    USHORT idRangeOffset;
+    USHORT glyphIdArray[2];
+} format4_cmap_table;
+
+typedef struct {
+    USHORT format;
+    USHORT length;
+    USHORT language;
+    USHORT firstCode;
+    USHORT entryCount;
+    USHORT glyphIDArray;
+} format6_cmap_table;
+
+typedef struct {
+    cmap_header header;
+    cmap_record records[2];
+    format6_cmap_table AppleTable;
+    format6_cmap_table MSTable;
+} cmap_table;
+
+typedef struct {
+    Fixed version;
+    Fixed FontRevision;
+    ULONG checkSumAdjustment;
+    ULONG MagicNumber;
+    USHORT Flags;
+    USHORT unitsPerEm;
+    LONGDATETIME created;
+    LONGDATETIME modified;
+    SHORT xMin;
+    SHORT yMin;
+    SHORT xMax;
+    SHORT yMax;
+    USHORT macStyle;
+    USHORT lowestRecPPEM;
+    SHORT FontDirectionHint;
+    SHORT indexToLocFormat;
+    SHORT glyphDataFormat;
+    SHORT PAD;
+} head_table;
+
+typedef struct {
+    Fixed version;
+    FWORD Ascender;
+    FWORD Descender;
+    FWORD LineGap;
+    UFWORD advanceWidthMax;
+    FWORD minLeftSideBearing;
+    FWORD minRightSideBearing;
+    FWORD xMaxExtent;
+    SHORT caretSlopeRise;
+    SHORT caretSlopeRun;
+    SHORT caretOffset;
+    SHORT reserved1;
+    SHORT reserved2;
+    SHORT reserved3;
+    SHORT reserved4;
+    SHORT metricDataFormat;
+    USHORT numberOfHMetrics;
+} hhea_table;
+
+typedef struct {
+    longHorMetric hMetrics[2];
+} hmtx_table;
+
+typedef struct {
+    Fixed version;
+    USHORT numGlyphs;
+    USHORT maxPoints;
+    USHORT maxContours;
+    USHORT maxCompositePoints;
+    USHORT maxCompositeContours;
+    USHORT maxZones;
+    USHORT maxTwilightPoints;
+    USHORT maxStorage;
+    USHORT maxFunctionDefs;
+    USHORT maxInstructionDefs;
+    USHORT maxStackElements;
+    USHORT maxSizeOfInstructions;
+    USHORT maxComponentElements;
+    USHORT maxComponentDepth;
+} maxp_table;
+
+typedef struct {
+    USHORT platformID;
+    USHORT encodingID;
+    USHORT languageID;
+    USHORT nameID;
+    USHORT length;
+    USHORT offset;
+} NameRecord;
+
+typedef struct {
+    USHORT format;
+    USHORT count;
+    USHORT stringOffset;
+    NameRecord nameRecord[3];
+} name_table;
+
+typedef struct {
+    USHORT version;
+    SHORT xAvgCharWidth;
+    USHORT usWeightClass;
+    USHORT usWidthClass;
+    USHORT fsType;
+    SHORT ySubscriptXSize;
+    SHORT ySubscriptYSize;
+    SHORT ySubscriptXOffset;
+    SHORT ySubscriptYOffset;
+    SHORT ySuperscriptXSize;
+    SHORT ySuperscriptYSize;
+    SHORT ySuperscriptXOffset;
+    SHORT ySuperscriptYOffset;
+    SHORT yStrikeoutSize;
+    SHORT yStrikeoutPosition;
+    SHORT sFamilyClass;
+    BYTE panose[10];
+    ULONG ulUnicodeRange1;
+    ULONG ulUnicodeRange2;
+    ULONG ulUnicodeRange3;
+    ULONG ulUnicodeRange4;
+    CHAR achVendID[4];
+    USHORT fsSelection;
+    USHORT usFirstCharIndex;
+    USHORT usLastCharIndex;
+    SHORT sTypoAscender;
+    SHORT sTypoDescender;
+    SHORT sTypoLineGap;
+    USHORT usWinAscent;
+    USHORT usWinDescent;
+    ULONG ulCodePageRange1;
+    ULONG ulCodePageRange2;
+    SHORT sxHeight;
+    SHORT sCapHeight;
+    USHORT usDefaultChar;
+    USHORT usBreakChar;
+    USHORT usMaxContent;
+} OS2_table;
+
+typedef struct {
+    Fixed version;
+    Fixed italicAngle;
+    FWORD underlinePosition;
+    FWORD underlineThickness;
+    ULONG isFixedPitch;
+    ULONG minMemType42;
+    ULONG maxMemType42;
+    ULONG minMemType1;
+    ULONG maxMemType1;
+} post_table;
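OpenType stores every multi-byte field big-endian, and the typedefs above (for example `unsigned long` for ULONG) can be wider than the 32 bits the specification calls for on some platforms, so the structs do not necessarily match the on-disk layout byte for byte. The sketch below, which is an illustration and not taken from GlyphLessFont.c, shows one way to write fields explicitly and to derive the `searchRange`, `entrySelector` and `rangeShift` values of `Offset_Table` from `numTables`. The helper names `PutU16`, `PutU32` and `ComputeSearchFields` are hypothetical, and `num_tables` is assumed to be at least 1.

```cpp
#include <cstdint>

// Write a 16-bit value in big-endian byte order.
void PutU16(unsigned char* out, uint16_t v) {
  out[0] = static_cast<unsigned char>(v >> 8);
  out[1] = static_cast<unsigned char>(v & 0xff);
}

// Write a 32-bit value in big-endian byte order.
void PutU32(unsigned char* out, uint32_t v) {
  out[0] = static_cast<unsigned char>(v >> 24);
  out[1] = static_cast<unsigned char>((v >> 16) & 0xff);
  out[2] = static_cast<unsigned char>((v >> 8) & 0xff);
  out[3] = static_cast<unsigned char>(v & 0xff);
}

struct SearchFields {
  uint16_t searchRange;
  uint16_t entrySelector;
  uint16_t rangeShift;
};

// Binary-search helper fields of the offset table. Each table record
// is 16 bytes (tag, checkSum, offset, length).
SearchFields ComputeSearchFields(uint16_t num_tables) {
  uint32_t pow2 = 1;  // largest power of two <= num_tables
  uint16_t log2 = 0;  // log2 of that power of two
  while (pow2 * 2 <= num_tables) {
    pow2 *= 2;
    ++log2;
  }
  SearchFields f;
  f.searchRange = static_cast<uint16_t>(pow2 * 16);
  f.entrySelector = log2;
  f.rangeShift = static_cast<uint16_t>(num_tables * 16 - pow2 * 16);
  return f;
}
```

With ten tables, for instance, `ComputeSearchFields(10)` gives a `searchRange` of 128, an `entrySelector` of 3 and a `rangeShift` of 32, and `PutU32(buf, 0x00010000)` writes the TrueType `sfnt_version` as the bytes 00 01 00 00.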