See https://github.com/impresso/NZZ-black-letter-ground-truth.
The transcription does not use the long s (so it roughly corresponds to OCR-D level 1) and has some special rules.
It is based on OCR results (ABBYY FineReader) which were corrected manually. Not every OCR artifact was fixed, so a certain number of transcription errors remain.
Line endings are encoded as ¬, but - and -¬ can also be found.
Whitespace is used inconsistently: some lines have leading or trailing spaces.
Sequences of more than one space are also very common.
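The whitespace irregularities described above can be detected with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and whether the ground truth should be normalized at all before training is a decision this page does not make.

```python
import re


def whitespace_issues(line: str) -> list[str]:
    """Return the kinds of whitespace irregularities found in one text line."""
    issues = []
    if line != line.lstrip(" "):
        issues.append("leading")
    if line != line.rstrip(" "):
        issues.append("trailing")
    # Only look for runs of two or more spaces inside the line.
    if re.search(r" {2,}", line.strip(" ")):
        issues.append("multiple")
    return issues


def normalize_whitespace(line: str) -> str:
    """Strip outer spaces and collapse inner runs of spaces to a single space."""
    return re.sub(r" {2,}", " ", line.strip(" "))
```

Running `whitespace_issues` over all ground truth lines would show how common each kind of irregularity is before deciding on a normalization.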
The ground truth text contains many glyphs which occur fewer than 10 times, 17 of them only once. Training for those characters will work poorly, or not at all if they occur only in the evaluation set.
Analyse-Report Version 0.1
Input: [PosixPath('NZZ-black-letter-ground-truth/xml/NZZ_groundtruth/gt')]
------------------------------------------------------------
Statistics combined
261143 : ASCII Spacing Symbols
18436 : ASCII Digits
1681630 : ASCII Letters
1584996 : ASCII Lowercase Letters
96634 : ASCII Uppercase Letters
58251 : Punctuation & Symbols
2029283 : Total Glyphes
------------------------------------------------------------
{'Overall unicode character statistics'}
character
---------
278275 {e} LATIN SMALL LETTER E
261143 { } SPACE
171401 {n} LATIN SMALL LETTER N
126325 {i} LATIN SMALL LETTER I
122983 {r} LATIN SMALL LETTER R
96211 {t} LATIN SMALL LETTER T
93176 {s} LATIN SMALL LETTER S
84502 {a} LATIN SMALL LETTER A
79678 {d} LATIN SMALL LETTER D
72574 {h} LATIN SMALL LETTER H
62064 {u} LATIN SMALL LETTER U
58013 {l} LATIN SMALL LETTER L
50688 {g} LATIN SMALL LETTER G
47243 {c} LATIN SMALL LETTER C
41059 {o} LATIN SMALL LETTER O
35108 {m} LATIN SMALL LETTER M
27061 {b} LATIN SMALL LETTER B
23589 {f} LATIN SMALL LETTER F
22609 {,} COMMA
22544 {.} FULL STOP
21081 {w} LATIN SMALL LETTER W
18877 {z} LATIN SMALL LETTER Z
16450 {k} LATIN SMALL LETTER K
12154 {v} LATIN SMALL LETTER V
11833 {ü} LATIN SMALL LETTER U WITH DIAERESIS
9684 {¬} NOT SIGN
9519 {ä} LATIN SMALL LETTER A WITH DIAERESIS
9273 {p} LATIN SMALL LETTER P
9011 {S} LATIN CAPITAL LETTER S
7801 {A} LATIN CAPITAL LETTER A
6999 {B} LATIN CAPITAL LETTER B
6886 {ß} LATIN SMALL LETTER SHARP S
6532 {D} LATIN CAPITAL LETTER D
5521 {G} LATIN CAPITAL LETTER G
5109 {M} LATIN CAPITAL LETTER M
5053 {F} LATIN CAPITAL LETTER F
5019 {E} LATIN CAPITAL LETTER E
4857 {K} LATIN CAPITAL LETTER K
4455 {ö} LATIN SMALL LETTER O WITH DIAERESIS
4191 {V} LATIN CAPITAL LETTER V
4098 {R} LATIN CAPITAL LETTER R
4069 {P} LATIN CAPITAL LETTER P
3985 {W} LATIN CAPITAL LETTER W
3730 {1} DIGIT ONE
3486 {H} LATIN CAPITAL LETTER H
3469 {0} DIGIT ZERO
3455 {Z} LATIN CAPITAL LETTER Z
2953 {L} LATIN CAPITAL LETTER L
2779 {N} LATIN CAPITAL LETTER N
2729 {T} LATIN CAPITAL LETTER T
2532 {I} LATIN CAPITAL LETTER I
2193 {„} DOUBLE LOW-9 QUOTATION MARK
2172 {U} LATIN CAPITAL LETTER U
2135 {2} DIGIT TWO
1896 {j} LATIN SMALL LETTER J
1822 {5} DIGIT FIVE
1688 {-} HYPHEN-MINUS
1666 {J} LATIN CAPITAL LETTER J
1529 {8} DIGIT EIGHT
1510 {"} QUOTATION MARK
1453 {O} LATIN CAPITAL LETTER O
1438 {y} LATIN SMALL LETTER Y
1357 {3} DIGIT THREE
1333 {—} EM DASH
1320 {:} COLON
1273 {)} RIGHT PARENTHESIS
1256 {4} DIGIT FOUR
1246 {6} DIGIT SIX
1240 {;} SEMICOLON
1128 {(} LEFT PARENTHESIS
1007 {C} LATIN CAPITAL LETTER C
997 {9} DIGIT NINE
895 {7} DIGIT SEVEN
687 {x} LATIN SMALL LETTER X
400 {!} EXCLAMATION MARK
363 {?} QUESTION MARK
321 {'} APOSTROPHE
235 {q} LATIN SMALL LETTER Q
166 {é} LATIN SMALL LETTER E WITH ACUTE
93 {*} ASTERISK
90 {Q} LATIN CAPITAL LETTER Q
59 {/} SOLIDUS
47 {½} VULGAR FRACTION ONE HALF
45 {à} LATIN SMALL LETTER A WITH GRAVE
34 {&} AMPERSAND
34 {Y} LATIN CAPITAL LETTER Y
26 {’} RIGHT SINGLE QUOTATION MARK
25 {§} SECTION SIGN
25 {X} LATIN CAPITAL LETTER X
22 {“} LEFT DOUBLE QUOTATION MARK
22 {è} LATIN SMALL LETTER E WITH GRAVE
20 {%} PERCENT SIGN
14 {=} EQUALS SIGN
8 {<} LESS-THAN SIGN
8 {”} RIGHT DOUBLE QUOTATION MARK
8 {} SOFT HYPHEN
7 {»} RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
7 {Δ} GREEK CAPITAL LETTER DELTA
7 {♂} MALE SIGN
6 {>} GREATER-THAN SIGN
6 {Ä} LATIN CAPITAL LETTER A WITH DIAERESIS
6 {ë} LATIN SMALL LETTER E WITH DIAERESIS
6 {″} DOUBLE PRIME
5 {ç} LATIN SMALL LETTER C WITH CEDILLA
5 {+} PLUS SIGN
4 {⁕} FLOWER PUNCTUATION MARK
4 {□} WHITE SQUARE
4 {ô} LATIN SMALL LETTER O WITH CIRCUMFLEX
4 {°} DEGREE SIGN
4 {ó} LATIN SMALL LETTER O WITH ACUTE
3 {ù} LATIN SMALL LETTER U WITH GRAVE
3 {‟} DOUBLE HIGH-REVERSED-9 QUOTATION MARK
3 {«} LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
3 {•} BULLET
3 {ê} LATIN SMALL LETTER E WITH CIRCUMFLEX
3 {[} LEFT SQUARE BRACKET
3 {]} RIGHT SQUARE BRACKET
3 {‧} HYPHENATION POINT
3 {♀} FEMALE SIGN
2 {ψ} GREEK SMALL LETTER PSI
2 {⅓} VULGAR FRACTION ONE THIRD
2 {△} WHITE UP-POINTING TRIANGLE
2 {♱} EAST SYRIAC CROSS
2 {¼} VULGAR FRACTION ONE QUARTER
2 {ϯ} COPTIC SMALL LETTER DEI
2 {–} EN DASH
2 {♎} LIBRA
2 {¾} VULGAR FRACTION THREE QUARTERS
2 {^} CIRCUMFLEX ACCENT
2 {‚} SINGLE LOW-9 QUOTATION MARK
1 {î} LATIN SMALL LETTER I WITH CIRCUMFLEX
1 {⅐} VULGAR FRACTION ONE SEVENTH
1 {á} LATIN SMALL LETTER A WITH ACUTE
1 {Ω} GREEK CAPITAL LETTER OMEGA
1 {Ü} LATIN CAPITAL LETTER U WITH DIAERESIS
1 {â} LATIN SMALL LETTER A WITH CIRCUMFLEX
1 {⅔} VULGAR FRACTION TWO THIRDS
1 {♉} TAURUS
1 {#} NUMBER SIGN
1 {✝} LATIN CROSS
1 {÷} DIVISION SIGN
1 {†} DAGGER
1 {‰} PER MILLE SIGN
1 {ñ} LATIN SMALL LETTER N WITH TILDE
1 {Ƨ} LATIN CAPITAL LETTER TONE TWO
1 {♋} CANCER
1 {¹} SUPERSCRIPT ONE
------------------------------------------------------------
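A report like the one above can be approximated with the Python standard library. The following is a sketch, not the tool which produced the report: the function names are hypothetical, and the assumption that the transcriptions are stored as `*.gt.txt` files in the ground truth directory is not confirmed by this page.

```python
import unicodedata
from collections import Counter
from pathlib import Path


def read_lines(gt_dir):
    """Yield all text lines from *.gt.txt files (assumed file layout)."""
    for path in sorted(Path(gt_dir).glob("*.gt.txt")):
        yield from path.read_text(encoding="utf-8").splitlines()


def count_glyphs(lines):
    """Count how often each Unicode character occurs in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.rstrip("\n"))
    return counts


def rare_glyphs(counts, threshold=10):
    """Return (count, character) pairs for characters below the threshold."""
    return sorted((n, ch) for ch, n in counts.items() if n < threshold)


def format_report(counts):
    """Format the counts like the table above, most frequent character first."""
    return [
        f"{n} {{{ch}}} {unicodedata.name(ch, 'UNNAMED')}"
        for ch, n in counts.most_common()
    ]
```

Calling `rare_glyphs(count_glyphs(read_lines(directory)))` would list the characters which occur fewer than 10 times.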
All original lines were randomly split into 90 % for training and 10 % for evaluation.
Three line images were skipped because they exceeded the width limit in the current lstmtrain.
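A random 90/10 split and a check for glyphs which end up only in the evaluation part can be sketched as follows. The function names and the fixed seed are assumptions made for illustration; this is not the code which created the split used here.

```python
import random


def split_lines(lines, eval_fraction=0.1, seed=0):
    """Shuffle the lines reproducibly and split off the evaluation part."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    # The first n_eval shuffled lines are used for evaluation, the rest for training.
    return shuffled[n_eval:], shuffled[:n_eval]


def glyphs_only_in_eval(train_lines, eval_lines):
    """Return the characters which never occur in the training lines."""
    return set("".join(eval_lines)) - set("".join(train_lines))
```

A non-empty result from `glyphs_only_in_eval` names characters which the model cannot learn from this split.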
Training is running for 10 epochs.
make -r MODEL_NAME=nzz-new GROUND_TRUTH_DIR=NZZ_groundtruth/gt MAX_ITERATIONS=388550 training
Current CER: 1.384 %, CPU time: 28:33 h
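The relation between epochs and MAX_ITERATIONS can be written down as a small calculation. It assumes that one training iteration processes exactly one line image, which this page does not state explicitly; the function names are hypothetical.

```python
def max_iterations(training_lines: int, epochs: int) -> int:
    """Iterations needed to see every training line `epochs` times.

    Assumption: one iteration processes one line image.
    """
    return training_lines * epochs


def training_lines_from(iterations: int, epochs: int) -> float:
    """Invert the relation: number of lines implied by an iteration count."""
    return iterations / epochs
```

If the assumption holds, MAX_ITERATIONS=388550 with 10 epochs corresponds to 38855 training lines.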
17 pages (same as in original test) were used for evaluation.
The code for lstmtrain was modified to allow line images with a width of up to 4096 px.
Training is running for 10 epochs with the default network specification and
an alternate specification which scales to 64 px height.
Scaling to 64 px height also results in wider images,
so the new limit of 4096 px is still too small, resulting in skipped lines.
make -r MODEL_NAME=nzz-ref GROUND_TRUTH_DIR=NZZ_groundtruth/gt MAX_ITERATIONS=387780 training
Current CER: 1.115 %, CPU time: 33 h
make -r MODEL_NAME=nzz-64 GROUND_TRUTH_DIR=NZZ_groundtruth/gt MAX_ITERATIONS=387780 NET_SPEC="[1,64,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c148]" training
Current CER: 2.852 %, CPU time: 15:43 h
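Why scaling to 64 px height leads to skipped lines can be illustrated with a short calculation. The sketch assumes that scaling keeps the aspect ratio; the image sizes in the usage note are hypothetical and not taken from the data set.

```python
def scaled_width(width: int, height: int, target_height: int) -> int:
    """Width after scaling an image to target_height, keeping the aspect ratio."""
    return round(width * target_height / height)


def exceeds_limit(width: int, height: int, target_height: int,
                  max_width: int = 4096) -> bool:
    """True if the scaled line image is wider than the allowed maximum."""
    return scaled_width(width, height, target_height) > max_width
```

A hypothetical line image of 3500 x 48 px fits into the limit unscaled, but scaled to 64 px height it becomes 4667 px wide and would be skipped.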
An intermediate model nzz-64_2.852_48798_92600.traineddata achieves a character accuracy of 95.48 % on the evaluation set of 17 pages. The confusion list shows that there are some likely transcription errors in the ground truth data. These will be fixed using GTCheck before we try the next iteration.
UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
207870 Characters
9396 Errors
95.48% Accuracy
0 Reject Characters
0 Suspect Markers
0 False Marks
0.00% Characters Marked
95.48% Accuracy After Correction
Ins Subst Del Errors
0 0 0 0 Marked
1640 5829 1927 9396 Unmarked
1640 5829 1927 9396 Total
Count Missed %Right
26448 309 98.83 ASCII Spacing Characters
6038 504 91.65 ASCII Special Symbols
2234 238 89.35 ASCII Digits
9991 712 92.87 ASCII Uppercase Letters
163159 5706 96.50 ASCII Lowercase Letters
207870 7469 96.41 Total
Errors Marked Correct-Generated
89 0 {}-{ }
83 0 {}-{,}
80 0 {h}-{b}
72 0 {f}-{s}
67 0 { }-{}
51 0 {u}-{n}
49 0 {}-{n}
49 0 {}-{s}
47 0 {}-{b}
46 0 {l}-{t}
46 0 {n}-{u}
43 0 {,}-{}
43 0 {.}-{,}
42 0 {i}-{}
42 0 {}-{.}
41 0 {}-{t}
40 0 {r}-{n}
39 0 {.}-{}
38 0 {g}-{a}
36 0 {B}-{V}
35 0 {e}-{}
33 0 {l}-{}
33 0 {}-{i}
32 0 {N}-{R}
32 0 {r}-{}
31 0 {t}-{}
30 0 {}-{a}
30 0 {}-{u}
28 0 {s}-{f}
28 0 {s}-{}
27 0 {c}-{e}
27 0 {k}-{t}
26 0 {i}-{t}
26 0 {}-{8}
26 0 {}-{e}
25 0 {d}-{b}
A newer model nzz-new_1.115_131575_404000 results in an even higher character accuracy of 97.55 %.
The OCR for the evaluation set was produced like this:
model=nzz-new_1.115_131575_404000
mkdir -p "$model"  # output directory for the OCR results
# Process all line images which belong to the pages of the test set.
for i in $(for xml in $(cat ../../../test-set-filenames.txt); do p=${xml/.xml/}; ls *"$p"*.png; done); do
  echo "$i"
  tesseract "$i" "$model/${i/.png/}" -l "$model" --psm 13 -c page_separator=
done
Here is the result:
UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
207870 Characters
5090 Errors
97.55% Accuracy
0 Reject Characters
0 Suspect Markers
0 False Marks
0.00% Characters Marked
97.55% Accuracy After Correction
Ins Subst Del Errors
0 0 0 0 Marked
1530 2977 583 5090 Unmarked
1530 2977 583 5090 Total
Count Missed %Right
26448 270 98.98 ASCII Spacing Characters
6038 339 94.39 ASCII Special Symbols
2234 95 95.75 ASCII Digits
9991 333 96.67 ASCII Uppercase Letters
163159 3470 97.87 ASCII Lowercase Letters
207870 4507 97.83 Total
Errors Marked Correct-Generated
61 0 {,}-{}
54 0 { }-{}
46 0 {auch mit äußersten, kata...}-{W}
45 0 {isch schöner Motive. Die...}-{}
39 0 {.}-{}
37 0 {h}-{b}
34 0 {}-{,}
32 0 {n (von Staaten, Kirchen,...}-{1}
32 0 {u}-{n}
29 0 {b}-{h}
29 0 {i}-{}
28 0 {Poren durchschlüpfen, de...}-{}
27 0 {s}-{f}
25 0 {-}-{¬}
25 0 {us entspricht vergleichs...}-{.}
23 0 {}-{.}
22 0 {lichem. Hierher gehöre}-{n}
22 0 {r}-{n}
22 0 {urchmesser viel gerin¬}-{}
21 0 {s}-{}
20 0 {a}-{g}
20 0 {}-{n}
20 0 {}-{s}
19 0 { Miscellaneen, Bio¬}-{}
19 0 {e Notizen, die sich}-{}
19 0 {e}-{}
19 0 {t}-{i}
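The accuracy figures in both reports follow directly from the character and error counts, where the error count is the sum of insertions, substitutions and deletions. A minimal sketch with hypothetical function names:

```python
def total_errors(insertions: int, substitutions: int, deletions: int) -> int:
    """Errors as listed in the report: Ins + Subst + Del."""
    return insertions + substitutions + deletions


def character_accuracy(characters: int, errors: int) -> float:
    """Character accuracy in percent: share of characters without error."""
    return 100.0 * (characters - errors) / characters


# Values taken from the two reports above.
first = character_accuracy(207870, total_errors(1640, 5829, 1927))
second = character_accuracy(207870, total_errors(1530, 2977, 583))
print(f"{first:.2f} % -> {second:.2f} %")
```

The newer model reduces the number of errors from 9396 to 5090 on the same 207870 characters.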
A different evaluation with Text Eval from PRImA Research, which does a bag-of-words comparison, shows a character accuracy of 98.1 % and a word accuracy of 94.1 %.
The currently best bag-of-words F1-measure has a mean value of 0.926 with a standard deviation of 0.062.
Training needed about 160 h CPU time.
wordCountFMeasure 0.930523232890304 ± 0.0751233788950039
Training needed about 160 h CPU time.
wordCountFMeasure 0.936328899143037 ± 0.05966589199092726
Training needed about 250 h CPU time.
wordCountFMeasure 0.9369579509758102 ± 0.05511434574259581
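A bag-of-words F1-measure compares the word counts of ground truth and OCR result and ignores the word order. The following sketch shows the general idea only. It is not the implementation used by Text Eval: the tokenization by whitespace and the use of the population standard deviation are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def bag_of_words_f1(reference: str, hypothesis: str) -> float:
    """F1-measure of the word multisets of two texts (word order is ignored)."""
    ref = Counter(reference.split())
    hyp = Counter(hypothesis.split())
    # The multiset intersection counts each word as often as it occurs in both.
    matched = sum((ref & hyp).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def summarize(scores):
    """Return mean and standard deviation of the per-page scores."""
    return mean(scores), pstdev(scores)
```

Calling `summarize` on the per-page values of `bag_of_words_f1` gives a mean and a standard deviation in the same form as the results above.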