Stefan Weil edited this page May 28, 2020 · 13 revisions

Training Fraktur with Neue Zürcher Zeitung

See https://github.com/impresso/NZZ-black-letter-ground-truth.

Data set

The transcription does not use the long s (so it roughly corresponds to OCR-D level 1) and has some special rules.

It is based on OCR results (ABBYY Finereader) which were corrected manually. Not every OCR artifact was fixed, so a number of transcription errors remain.

End-of-line hyphenation is encoded as ¬, but - and occasionally a soft hyphen (U+00AD) can also be found. Whitespace is used inconsistently: some lines have leading or trailing spaces, and sequences of more than one space are very common.
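The whitespace irregularities described above could be cleaned up before training. The following is only a minimal sketch, assuming each ground-truth line is available as a Python string; the function name `normalize_line` is hypothetical and not part of the data set or of tesstrain.

```python
import re


def normalize_line(text: str) -> str:
    """Collapse runs of spaces and strip leading/trailing whitespace."""
    # Replace any run of two or more spaces with a single space.
    collapsed = re.sub(r" {2,}", " ", text)
    # Remove whitespace at the start and end of the line.
    return collapsed.strip()


print(normalize_line("  Neue   Zürcher  Zeitung "))
```

Whether such a normalization is desirable depends on the use case, since it changes the ground truth text without changing the line images.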

The ground truth text contains many glyphs which occur fewer than 10 times, 17 of them only once. Training for such characters will work poorly, and not at all if they occur only in the evaluation set.

    Analyse-Report Version 0.1
    Input: [PosixPath('NZZ-black-letter-ground-truth/xml/NZZ_groundtruth/gt')]
    
------------------------------------------------------------

    Statistics combined
    
            261143 : ASCII Spacing Symbols
             18436 : ASCII Digits
           1681630 : ASCII Letters
           1584996 : ASCII Lowercase Letters
             96634 : ASCII Uppercase Letters
             58251 : Punctuation & Symbols
           2029283 : Total Glyphes
    
------------------------------------------------------------

    {'Overall unicode character statistics'}
    
        character
        ---------
                        278275  {e} LATIN SMALL LETTER E
                        261143  { } SPACE
                        171401  {n} LATIN SMALL LETTER N
                        126325  {i} LATIN SMALL LETTER I
                        122983  {r} LATIN SMALL LETTER R
                         96211  {t} LATIN SMALL LETTER T
                         93176  {s} LATIN SMALL LETTER S
                         84502  {a} LATIN SMALL LETTER A
                         79678  {d} LATIN SMALL LETTER D
                         72574  {h} LATIN SMALL LETTER H
                         62064  {u} LATIN SMALL LETTER U
                         58013  {l} LATIN SMALL LETTER L
                         50688  {g} LATIN SMALL LETTER G
                         47243  {c} LATIN SMALL LETTER C
                         41059  {o} LATIN SMALL LETTER O
                         35108  {m} LATIN SMALL LETTER M
                         27061  {b} LATIN SMALL LETTER B
                         23589  {f} LATIN SMALL LETTER F
                         22609  {,} COMMA
                         22544  {.} FULL STOP
                         21081  {w} LATIN SMALL LETTER W
                         18877  {z} LATIN SMALL LETTER Z
                         16450  {k} LATIN SMALL LETTER K
                         12154  {v} LATIN SMALL LETTER V
                         11833  {ü} LATIN SMALL LETTER U WITH DIAERESIS
                          9684  {¬} NOT SIGN
                          9519  {ä} LATIN SMALL LETTER A WITH DIAERESIS
                          9273  {p} LATIN SMALL LETTER P
                          9011  {S} LATIN CAPITAL LETTER S
                          7801  {A} LATIN CAPITAL LETTER A
                          6999  {B} LATIN CAPITAL LETTER B
                          6886  {ß} LATIN SMALL LETTER SHARP S
                          6532  {D} LATIN CAPITAL LETTER D
                          5521  {G} LATIN CAPITAL LETTER G
                          5109  {M} LATIN CAPITAL LETTER M
                          5053  {F} LATIN CAPITAL LETTER F
                          5019  {E} LATIN CAPITAL LETTER E
                          4857  {K} LATIN CAPITAL LETTER K
                          4455  {ö} LATIN SMALL LETTER O WITH DIAERESIS
                          4191  {V} LATIN CAPITAL LETTER V
                          4098  {R} LATIN CAPITAL LETTER R
                          4069  {P} LATIN CAPITAL LETTER P
                          3985  {W} LATIN CAPITAL LETTER W
                          3730  {1} DIGIT ONE
                          3486  {H} LATIN CAPITAL LETTER H
                          3469  {0} DIGIT ZERO
                          3455  {Z} LATIN CAPITAL LETTER Z
                          2953  {L} LATIN CAPITAL LETTER L
                          2779  {N} LATIN CAPITAL LETTER N
                          2729  {T} LATIN CAPITAL LETTER T
                          2532  {I} LATIN CAPITAL LETTER I
                          2193  {„} DOUBLE LOW-9 QUOTATION MARK
                          2172  {U} LATIN CAPITAL LETTER U
                          2135  {2} DIGIT TWO
                          1896  {j} LATIN SMALL LETTER J
                          1822  {5} DIGIT FIVE
                          1688  {-} HYPHEN-MINUS
                          1666  {J} LATIN CAPITAL LETTER J
                          1529  {8} DIGIT EIGHT
                          1510  {"} QUOTATION MARK
                          1453  {O} LATIN CAPITAL LETTER O
                          1438  {y} LATIN SMALL LETTER Y
                          1357  {3} DIGIT THREE
                          1333  {—} EM DASH
                          1320  {:} COLON
                          1273  {)} RIGHT PARENTHESIS
                          1256  {4} DIGIT FOUR
                          1246  {6} DIGIT SIX
                          1240  {;} SEMICOLON
                          1128  {(} LEFT PARENTHESIS
                          1007  {C} LATIN CAPITAL LETTER C
                           997  {9} DIGIT NINE
                           895  {7} DIGIT SEVEN
                           687  {x} LATIN SMALL LETTER X
                           400  {!} EXCLAMATION MARK
                           363  {?} QUESTION MARK
                           321  {'} APOSTROPHE
                           235  {q} LATIN SMALL LETTER Q
                           166  {é} LATIN SMALL LETTER E WITH ACUTE
                            93  {*} ASTERISK
                            90  {Q} LATIN CAPITAL LETTER Q
                            59  {/} SOLIDUS
                            47  {½} VULGAR FRACTION ONE HALF
                            45  {à} LATIN SMALL LETTER A WITH GRAVE
                            34  {&} AMPERSAND
                            34  {Y} LATIN CAPITAL LETTER Y
                            26  {’} RIGHT SINGLE QUOTATION MARK
                            25  {§} SECTION SIGN
                            25  {X} LATIN CAPITAL LETTER X
                            22  {“} LEFT DOUBLE QUOTATION MARK
                            22  {è} LATIN SMALL LETTER E WITH GRAVE
                            20  {%} PERCENT SIGN
                            14  {=} EQUALS SIGN
                             8  {<} LESS-THAN SIGN
                             8  {”} RIGHT DOUBLE QUOTATION MARK
                             8  {­} SOFT HYPHEN
                             7  {»} RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
                             7  {Δ} GREEK CAPITAL LETTER DELTA
                             7  {♂} MALE SIGN
                             6  {>} GREATER-THAN SIGN
                             6  {Ä} LATIN CAPITAL LETTER A WITH DIAERESIS
                             6  {ë} LATIN SMALL LETTER E WITH DIAERESIS
                             6  {″} DOUBLE PRIME
                             5  {ç} LATIN SMALL LETTER C WITH CEDILLA
                             5  {+} PLUS SIGN
                             4  {⁕} FLOWER PUNCTUATION MARK
                             4  {□} WHITE SQUARE
                             4  {ô} LATIN SMALL LETTER O WITH CIRCUMFLEX
                             4  {°} DEGREE SIGN
                             4  {ó} LATIN SMALL LETTER O WITH ACUTE
                             3  {ù} LATIN SMALL LETTER U WITH GRAVE
                             3  {‟} DOUBLE HIGH-REVERSED-9 QUOTATION MARK
                             3  {«} LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
                             3  {•} BULLET
                             3  {ê} LATIN SMALL LETTER E WITH CIRCUMFLEX
                             3  {[} LEFT SQUARE BRACKET
                             3  {]} RIGHT SQUARE BRACKET
                             3  {‧} HYPHENATION POINT
                             3  {♀} FEMALE SIGN
                             2  {ψ} GREEK SMALL LETTER PSI
                             2  {⅓} VULGAR FRACTION ONE THIRD
                             2  {△} WHITE UP-POINTING TRIANGLE
                             2  {♱} EAST SYRIAC CROSS
                             2  {¼} VULGAR FRACTION ONE QUARTER
                             2  {ϯ} COPTIC SMALL LETTER DEI
                             2  {–} EN DASH
                             2  {♎} LIBRA
                             2  {¾} VULGAR FRACTION THREE QUARTERS
                             2  {^} CIRCUMFLEX ACCENT
                             2  {‚} SINGLE LOW-9 QUOTATION MARK
                             1  {î} LATIN SMALL LETTER I WITH CIRCUMFLEX
                             1  {⅐} VULGAR FRACTION ONE SEVENTH
                             1  {á} LATIN SMALL LETTER A WITH ACUTE
                             1  {Ω} GREEK CAPITAL LETTER OMEGA
                             1  {Ü} LATIN CAPITAL LETTER U WITH DIAERESIS
                             1  {â} LATIN SMALL LETTER A WITH CIRCUMFLEX
                             1  {⅔} VULGAR FRACTION TWO THIRDS
                             1  {♉} TAURUS
                             1  {#} NUMBER SIGN
                             1  {✝} LATIN CROSS
                             1  {÷} DIVISION SIGN
                             1  {†} DAGGER
                             1  {‰} PER MILLE SIGN
                             1  {ñ} LATIN SMALL LETTER N WITH TILDE
                             1  {Ƨ} LATIN CAPITAL LETTER TONE TWO
                             1  {♋} CANCER
                             1  {¹} SUPERSCRIPT ONE
    
------------------------------------------------------------ 
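Statistics of this kind can be approximated with a few lines of Python. This is a sketch under assumptions, not the tool that produced the report above: the helper names `char_statistics` and `rare_characters` are hypothetical, and the input is assumed to be an iterable of plain text lines already extracted from the ground truth.

```python
import unicodedata
from collections import Counter


def char_statistics(lines):
    """Count every character in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line)
    return counts


def rare_characters(counts, threshold=10):
    """Return (char, count) pairs seen fewer than `threshold` times, rarest first."""
    rare = [(c, n) for c, n in counts.items() if n < threshold]
    return sorted(rare, key=lambda item: (item[1], item[0]))


sample = ["Zürich ½", "Zeitung †"]  # toy input, not taken from the data set
counts = char_statistics(sample)
for char, n in counts.most_common():
    # Print in the same "count {char} NAME" style as the report above.
    print(f"{n:8d}  {{{char}}} {unicodedata.name(char, 'UNKNOWN')}")
```

A list like the one from `rare_characters` helps to decide which glyphs need more samples or should be mapped to a more common character before training.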

Training runs

Training set 1

All original lines were randomly split into 90 % for training and 10 % for evaluation. Three line images were skipped because they exceeded the width limit of the current lstmtraining. Training runs for 10 epochs.

make -r MODEL_NAME=nzz-new GROUND_TRUTH_DIR=NZZ_groundtruth/gt MAX_ITERATIONS=388550 training

Current CER: 1.384 %, CPU time: 28:33 h
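The random 90/10 split and the iteration count can be illustrated as follows. This is a sketch, not the code used by tesstrain. It assumes that one training iteration processes one line image, so that 10 epochs over N training lines need 10 × N iterations. Under that assumption, MAX_ITERATIONS=388550 corresponds to 38855 training lines. That number is inferred from the command above and is not stated on this page. The function names are hypothetical.

```python
import random


def split_lines(line_ids, train_ratio=0.9, seed=0):
    """Shuffle line identifiers and split them into training and evaluation lists."""
    shuffled = list(line_ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def max_iterations(num_training_lines, epochs):
    """Iterations needed to see every training line `epochs` times (one line per iteration)."""
    return num_training_lines * epochs


train, evaluation = split_lines(range(1000))  # toy example with 1000 line ids
print(len(train), len(evaluation))
print(max_iterations(38855, 10))  # assumed line count, inferred from MAX_ITERATIONS
```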

Training set 2

17 pages (the same as in the original test) were used for evaluation. The code for lstmtraining was modified to allow line images with a width of up to 4096 px. Training runs for 10 epochs, once with the default network specification and once with an alternative specification which scales the lines to 64 px height. Scaling to 64 px height also makes the images wider, so even the new limit of 4096 px is too small for some lines, which are skipped.
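Why a larger target height leads to more skipped lines can be seen from a small calculation. The sketch below assumes that line images are scaled to the target height with the aspect ratio preserved. It also assumes a default height of 36 px, which is suggested by the model name nzz-36 further down but is not stated here. The 3000 × 40 px image is a made-up example.

```python
MAX_WIDTH_PX = 4096  # width limit of the modified training code mentioned above


def scaled_width(width_px, height_px, target_height_px):
    """Width after scaling a line image to the target height, keeping the aspect ratio."""
    return round(width_px * target_height_px / height_px)


def fits(width_px, height_px, target_height_px, limit=MAX_WIDTH_PX):
    """True if the scaled line image stays within the width limit."""
    return scaled_width(width_px, height_px, target_height_px) <= limit


# Hypothetical line image of 3000 x 40 px.
print(scaled_width(3000, 40, 36), fits(3000, 40, 36))  # assumed default height
print(scaled_width(3000, 40, 64), fits(3000, 40, 64))  # 64 px specification
```

The same line is about 64/36 ≈ 1.8 times wider at 64 px height than at 36 px, so lines which fit comfortably at the smaller height can exceed the limit at the larger one.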

make -r MODEL_NAME=nzz-ref GROUND_TRUTH_DIR=NZZ_groundtruth/gt MAX_ITERATIONS=387780 training

Current CER: 1.115 %, CPU time: 33 h

make -r MODEL_NAME=nzz-64 GROUND_TRUTH_DIR=NZZ_groundtruth/gt MAX_ITERATIONS=387780 NET_SPEC="[1,64,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c148]" training
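In the VGSL syntax used by Tesseract, the leading `[1,64,0,1` of the NET_SPEC describes the input (batch size 1, height 64, variable width, depth 1). The trailing `O1c148` is a CTC-trained softmax output layer with 148 classes. The sketch below only assembles such a string for a given height and class count, reusing the layer stack from the command above. The helper `net_spec` is hypothetical.

```python
def net_spec(height_px, num_classes):
    """Build a VGSL network specification string with the layer stack used above."""
    layers = "Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256"
    # Input: batch 1, given height, variable width (0), depth 1 (grayscale).
    return f"[1,{height_px},0,1 {layers} O1c{num_classes}]"


print(net_spec(64, 148))
```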

Current CER: 2.852 %, CPU time: 15:43 h

Intermediate results

nzz-64_2.852_48798_92600

An intermediate model nzz-64_2.852_48798_92600.traineddata achieves a character accuracy of 95.48 % on the evaluation set of 17 pages. The confusion list shows that there are some likely transcription errors in the ground truth data. These will be fixed using GTCheck before we try the next iteration.

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
  207870   Characters
    9396   Errors
   95.48%  Accuracy

       0   Reject Characters
       0   Suspect Markers
       0   False Marks
    0.00%  Characters Marked
   95.48%  Accuracy After Correction

     Ins    Subst      Del   Errors
       0        0        0        0   Marked
    1640     5829     1927     9396   Unmarked
    1640     5829     1927     9396   Total

   Count   Missed   %Right
   26448      309    98.83   ASCII Spacing Characters
    6038      504    91.65   ASCII Special Symbols
    2234      238    89.35   ASCII Digits
    9991      712    92.87   ASCII Uppercase Letters
  163159     5706    96.50   ASCII Lowercase Letters
  207870     7469    96.41   Total

  Errors   Marked   Correct-Generated
      89        0   {}-{ }
      83        0   {}-{,}
      80        0   {h}-{b}
      72        0   {f}-{s}
      67        0   { }-{}
      51        0   {u}-{n}
      49        0   {}-{n}
      49        0   {}-{s}
      47        0   {}-{b}
      46        0   {l}-{t}
      46        0   {n}-{u}
      43        0   {,}-{}
      43        0   {.}-{,}
      42        0   {i}-{}
      42        0   {}-{.}
      41        0   {}-{t}
      40        0   {r}-{n}
      39        0   {.}-{}
      38        0   {g}-{a}
      36        0   {B}-{V}
      35        0   {e}-{}
      33        0   {l}-{}
      33        0   {}-{i}
      32        0   {N}-{R}
      32        0   {r}-{}
      31        0   {t}-{}
      30        0   {}-{a}
      30        0   {}-{u}
      28        0   {s}-{f}
      28        0   {s}-{}
      27        0   {c}-{e}
      27        0   {k}-{t}
      26        0   {i}-{t}
      26        0   {}-{8}
      26        0   {}-{e}
      25        0   {d}-{b}
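The headline numbers of the report follow directly from its counts: the errors are the sum of insertions, substitutions and deletions, and the accuracy is the share of ground-truth characters not counted as errors. A short sketch of that arithmetic, using the figures from the report above (the function name is hypothetical):

```python
def character_accuracy(characters, errors):
    """Accuracy in percent: share of ground-truth characters not counted as errors."""
    return 100.0 * (characters - errors) / characters


insertions, substitutions, deletions = 1640, 5829, 1927
errors = insertions + substitutions + deletions
print(errors)                                       # 9396
print(f"{character_accuracy(207870, errors):.2f}")  # 95.48
print(f"{character_accuracy(26448, 309):.2f}")      # 98.83 (ASCII spacing characters)
```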

nzz-new_1.115_131575_404000

A newer model nzz-new_1.115_131575_404000 results in an even higher character accuracy of 97.55 %. The OCR for the evaluation set was produced like this:

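model=nzz-new_1.115_131575_404000
mkdir -p "$model"  # output directory for the OCR text files
# Run line-level OCR (--psm 13, raw line) on every line image which belongs to a test-set page.
for i in $(for xml in $(cat ../../../test-set-filenames.txt); do p=${xml/.xml/}; ls *"$p"*.png; done); do
  echo "$i"
  # The empty page_separator suppresses the form feed at the end of each output file.
  tesseract "$i" "$model/${i/.png/}" -l "$model" --psm 13 -c page_separator=
done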

Here is the result:

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
  207870   Characters
    5090   Errors
   97.55%  Accuracy

       0   Reject Characters
       0   Suspect Markers
       0   False Marks
    0.00%  Characters Marked
   97.55%  Accuracy After Correction

     Ins    Subst      Del   Errors
       0        0        0        0   Marked
    1530     2977      583     5090   Unmarked
    1530     2977      583     5090   Total

   Count   Missed   %Right
   26448      270    98.98   ASCII Spacing Characters
    6038      339    94.39   ASCII Special Symbols
    2234       95    95.75   ASCII Digits
    9991      333    96.67   ASCII Uppercase Letters
  163159     3470    97.87   ASCII Lowercase Letters
  207870     4507    97.83   Total

  Errors   Marked   Correct-Generated
      61        0   {,}-{}
      54        0   { }-{}
      46        0   {auch mit äußersten, kata...}-{W}
      45        0   {isch schöner Motive. Die...}-{}
      39        0   {.}-{}
      37        0   {h}-{b}
      34        0   {}-{,}
      32        0   {n (von Staaten, Kirchen,...}-{1}
      32        0   {u}-{n}
      29        0   {b}-{h}
      29        0   {i}-{}
      28        0   {Poren durchschlüpfen, de...}-{}
      27        0   {s}-{f}
      25        0   {-}-{¬}
      25        0   {us entspricht vergleichs...}-{.}
      23        0   {}-{.}
      22        0   {lichem. Hierher gehöre}-{n}
      22        0   {r}-{n}
      22        0   {urchmesser viel gerin¬}-{}
      21        0   {s}-{}
      20        0   {a}-{g}
      20        0   {}-{n}
      20        0   {}-{s}
      19        0   { Miscellaneen, Bio¬}-{}
      19        0   {e Notizen, die sich}-{}
      19        0   {e}-{}
      19        0   {t}-{i}

A different evaluation with Text Eval from PRImA Research, which performs a bag-of-words comparison, shows a character accuracy of 98.1 % and a word accuracy of 94.1 %.

The currently best bag-of-words F1-measure has a mean value of 0.926 with a standard deviation of 0.062.
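The idea behind a bag-of-words F1-measure can be sketched as follows. This is not the implementation used by Text Eval, whose tokenization and exact definitions may differ. Here words are simply split at whitespace, and the sample standard deviation is used, which may not be the variant behind the numbers above. The page pairs are toy data.

```python
from collections import Counter
from statistics import mean, stdev


def bag_of_words_f1(reference: str, hypothesis: str) -> float:
    """F1 over word multisets: word positions are ignored, only word counts matter."""
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    matched = sum((ref & hyp).values())  # size of the multiset intersection
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy page pairs (ground truth, OCR output), not taken from the data set.
pages = [
    ("die neue Zeitung", "die neue Zeitung"),
    ("der Rat der Stadt", "der Nat der Stadt"),
]
scores = [bag_of_words_f1(gt, ocr) for gt, ocr in pages]
print(scores, mean(scores), stdev(scores))
```

Computing the measure per page and then reporting mean and standard deviation shows how much the quality varies between pages.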

Latest results

nzz-36_0.37_249849_1294100

Training needed about 160 h of CPU time.

wordCountFMeasure 0.930523232890304 ± 0.0751233788950039

nzz-64_0.781_197743_751200

Training needed about 160 h of CPU time.

wordCountFMeasure 0.936328899143037 ± 0.05966589199092726

nzz-64_0.591_233126_1008500

Training needed about 250 h of CPU time.

wordCountFMeasure 0.9369579509758102 ± 0.05511434574259581