This repository contains:
- training and evaluation data from allographetic transcriptions of various Old French and Old Occitan manuscripts, in various states of correctness, in Kraken training format;
- HTR models trained and tested using this data.
If you plan to use this data or the provided model in a publication, please cite it as:
Jean-Baptiste Camps (éd.), FROC-MSS: Old French and Old Occitan Medieval Manuscripts HTR Data and Models, Paris: École nationale des chartes (PSL), 2018, https://github.com/Jean-Baptiste-Camps/FROC-MSS.
The data is organised as follows:
- each line image in a `.png` file;
- each transcription in a `.gt.txt` file.
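As an illustration of this layout, the following minimal sketch pairs each ground-truth file with its line image. The function name `pair_lines` and the assumption that both files share the same base name (for example `line_0001.png` and `line_0001.gt.txt`) are hypothetical; adjust them to the actual folder contents.

```python
from pathlib import Path


def pair_lines(data_dir):
    """Yield (image_path, transcription) for each ground-truth file with a matching image."""
    for gt_path in sorted(Path(data_dir).rglob("*.gt.txt")):
        # Assumed naming convention: "name.gt.txt" sits next to "name.png".
        base_name = gt_path.name[: -len(".gt.txt")]
        image_path = gt_path.with_name(base_name + ".png")
        if image_path.exists():
            # Drop only the trailing newline so that spaces in the text are kept.
            text = gt_path.read_text(encoding="utf-8").rstrip("\n")
            yield image_path, text
```

Ground-truth files without a matching image are skipped silently in this sketch; a stricter loader might report them instead.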
Unicode NFD normalisation has been applied on the ground-truth text.
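New transcriptions should probably be normalised the same way before they are compared with, or added to, this ground truth. A minimal sketch using Python's standard `unicodedata` module follows; the helper name `to_nfd` is only illustrative.

```python
import unicodedata


def to_nfd(text):
    """Return text in Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


composed = "\u00e9"  # "é" as a single precomposed code point
decomposed = to_nfd(composed)
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
```

Under NFD, a base letter and its diacritic are separate code points, so the model has to predict combining marks as symbols of their own.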
The root folder contains a vanilla Kraken model (`model_froc.mlmodel`), trained with default settings and without any additional data (e.g. no artificially noised data).
The data was randomly divided into 80% for training (`train.txt`), 10% for in-training validation (`val.txt`) and 10% for final testing of the model (`test.txt`).
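A random 80/10/10 split of this kind could be reproduced along the following lines. This is a sketch only: the function name `split_lines` and the fixed seed are assumptions, and the procedure actually used to produce the provided lists is not documented here.

```python
import random


def split_lines(paths, seed=0):
    """Shuffle paths and split them 80/10/10 into train, validation and test lists."""
    paths = sorted(paths)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(paths)
    n_train = len(paths) * 8 // 10
    n_val = len(paths) // 10
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    return train, val, test
```

Each returned list could then be written to a manifest file, assuming one image path per line.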
It achieved a character error rate (CER) of:
- **8.11%** on validation data (7.03% ignoring spaces);
- **7.83%** on test data (6.92% ignoring spaces).
There were 13540 characters and 1061 errors on test data.
Overall, the errors break down as follows:
- 536 characters from the ground truth were not predicted by the model;
- 132 characters absent from the ground truth were wrongly predicted;
- 393 character substitutions.
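The three error types above add up to the reported total, and dividing that total by the number of characters gives the error rate. The short check below assumes that the 13540 characters are the length of the ground truth; the function name is illustrative.

```python
def character_error_rate(deletions, insertions, substitutions, n_chars):
    """Return (total_errors, error_rate) from edit-operation counts."""
    errors = deletions + insertions + substitutions
    return errors, errors / n_chars


# Counts reported for the test data.
errors, cer = character_error_rate(
    deletions=536,      # ground-truth characters not predicted
    insertions=132,     # predicted characters absent from the ground truth
    substitutions=393,
    n_chars=13540,
)
print(errors)        # 1061
print(f"{cer:.1%}")  # 7.8%
```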
The most frequent confusions concerned spacing.
The 20 most frequent confusions are:
Errors Ground Truth-Prediction
70 { SPACE } - { }
54 { } - { SPACE }
48 { ı } - { }
43 { n } - { }
43 { COMBINING ACUTE ACCENT } - { }
27 { e } - { }
24 { l } - { }
24 { u } - { }
21 { . } - { }
20 { u } - { n }
18 { ſ } - { }
18 { a } - { }
17 { r } - { }
14 { t } - { }
13 { COMBINING TILDE } - { }
13 { } - { ı }
12 { o } - { e }
12 { o } - { }
12 { ı } - { m }
11 { e } - { c }
The data comes from partial allographetic transcriptions of the following manuscripts:
- Clermont-Ferrand, archives départementales, 1F2 (XIII 1/3, Anglo-Norman praegothica script; Chanson d'Aspremont); 52 lines.
- Paris, Bibliothèque nationale de France, fr. 854 (XIII 4/4, Venice or Venetian area; gothic textualis; Occitan chansonnier I); 1112 lines.
- Cologny-Genève, fondation Martin-Bodmer, cod. Bodm. 168 (XIII 3/3, Anglo-Norman gothic textualis; Chanson d'Otinel); 1908 lines.
- Oxford, Bodleian Library, Digby 23 (XII 1/2, Anglo-Norman praegothica; Chanson de Roland); 564 lines.
For these transcriptions, see: Jean-Baptiste Camps, La ‘Chanson d’Otinel’: édition complète du corpus manuscrit et prolégomènes à l’édition critique, PhD thesis, dir. Dominique Boutet, Paris-Sorbonne, 2016, DOI: https://doi.org/10.5281/zenodo.1116735.
This work is made available under the terms of the Creative Commons Attribution 4.0 International License.
If you want to contribute training data or models, you can do so by cloning the repository and sending us a pull request, or by sending an email to jbcamps at hotmail.com.