FROC-MSS: Old French and Old Occitan Medieval Manuscripts HTR Data and Models

This repository contains:

training and evaluation data from allographetic transcriptions of various Old French and Old Occitan manuscripts, in various states of correctness, in Kraken training format;
HTR models trained and tested using this data.

If you plan to use this data or the provided model in a publication, please cite it as:

Jean-Baptiste Camps (éd.), FROC-MSS: Old French and Old Occitan Medieval Manuscripts HTR Data and Models, Paris: École nationale des chartes (PSL), 2018, https://github.com/Jean-Baptiste-Camps/FROC-MSS.

Data format

The data is organised as follows:

each line image in a .png file;
each transcription in a .gt.txt file.
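
As an illustration of this layout, the following minimal sketch pairs each line image with its transcription. It assumes that an image named `foo.png` sits next to a transcription named `foo.gt.txt` (the usual convention for this kind of training data, but not stated explicitly here); the function names are hypothetical and are not part of the repository.

```python
from pathlib import Path


def ground_truth_path(image_path):
    """Return the assumed transcription path for a line image (foo.png -> foo.gt.txt)."""
    image_path = Path(image_path)
    return image_path.with_name(image_path.stem + ".gt.txt")


def list_pairs(folder):
    """Yield (image, transcription) pairs for every .png that has a matching .gt.txt."""
    for image in sorted(Path(folder).rglob("*.png")):
        gt = ground_truth_path(image)
        if gt.exists():  # skip images without a transcription
            yield image, gt
```

Calling `list(list_pairs("Bodmer_168"))` would then list the pairs found under that folder, provided the naming assumption holds.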

Unicode NFD normalisation has been applied to the ground-truth text.
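
For readers unfamiliar with NFD, here is a short sketch using Python's standard `unicodedata` module. It shows how a precomposed character is decomposed into a base letter followed by a combining mark, which is why combining characters are counted as separate characters in the ground truth. The helper name `to_nfd` is only illustrative.

```python
import unicodedata


def to_nfd(text):
    """Return text in Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


composed = "\u00e9"  # "é" as a single precomposed code point
decomposed = to_nfd(composed)

# The decomposed form has two code points: the base letter and the accent.
print([unicodedata.name(char) for char in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
```

New transcriptions should be normalised the same way before being compared with the existing ground truth.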

Models

Summary and C.E.R.

The root folder contains a vanilla Kraken model (model_froc.mlmodel), trained with default settings and without any additional data (e.g. no artificially noised data).

The data was randomly divided into 80% for training (train.txt), 10% for in-training validation (val.txt) and 10% for final testing of the model (test.txt).
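
A random 80/10/10 split of this kind can be sketched as follows. This is not the script used to produce train.txt, val.txt and test.txt; the function name, the fixed seed and the use of integer percentages are assumptions made for the example.

```python
import random


def split_dataset(items, train_pct=80, val_pct=10, seed=0):
    """Shuffle items and split them into train, validation and test lists.

    The test set receives whatever remains after the train and validation shares.
    """
    items = list(items)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(items)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Each returned list could then be written to a text file, one image path per line.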

The model achieved a C.E.R. of:

8.11% on validation data (7.03% ignoring spaces);
7.83% on test data (6.92% ignoring spaces).
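
For reference, a character error rate is commonly computed as the edit distance between prediction and ground truth, divided by the length of the ground truth. The sketch below follows that common definition. How exactly the figures above were computed (for instance, how "ignoring spaces" was implemented) is not documented here, so the `ignore_spaces` option is an assumption.

```python
def edit_distance(reference, prediction):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(prediction) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, pred_char in enumerate(prediction, start=1):
            cost = 0 if ref_char == pred_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the prediction
                current[j - 1] + 1,      # extra character in the prediction
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def cer(reference, prediction, ignore_spaces=False):
    """Character error rate: edit distance divided by reference length."""
    if ignore_spaces:
        # Assumption: spaces are simply removed from both strings.
        reference = reference.replace(" ", "")
        prediction = prediction.replace(" ", "")
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, prediction) / len(reference)
```

For example, `cer("abcd", "abed")` gives 0.25, since one of the four reference characters is substituted.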

Errors and most frequent confusions on test data

There were 13540 characters and 1061 errors on test data.

Overall, the errors break down as follows:

536 characters from the ground truth were not predicted by the model;
132 characters absent from the ground truth were wrongly predicted;
393 character substitutions.
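
The three counts above add up to the total number of errors, and dividing by the number of characters gives the test C.E.R. The short check below uses only the figures given in this section; the variable names are illustrative.

```python
deletions = 536       # ground-truth characters not predicted by the model
insertions = 132      # predicted characters absent from the ground truth
substitutions = 393   # characters replaced by another character
total_characters = 13540

errors = deletions + insertions + substitutions
cer_percent = 100 * errors / total_characters

print(errors)                # 1061
print(f"{cer_percent:.3f}")  # 7.836
```

The result, about 7.836%, is consistent with the 7.83% reported for the test data, which appears to be truncated rather than rounded.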

The most frequent confusions concerned spacing.

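The 20 most frequent confusions are listed below. An empty {  } on the prediction side marks a character the model missed, and an empty {  } on the ground-truth side marks a character it wrongly added: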

Errors	Ground Truth-Prediction
70	{ SPACE } - {  }
54	{  } - { SPACE }
48	{ ı } - {  }
43	{ n } - {  }
43	{ COMBINING ACUTE ACCENT } - {  }
27	{ e } - {  }
24	{ l } - {  }
24	{ u } - {  }
21	{ . } - {  }
20	{ u } - { n }
18	{ ſ } - {  }
18	{ a } - {  }
17	{ r } - {  }
14	{ t } - {  }
13	{ COMBINING TILDE } - {  }
13	{  } - { ı }
12	{ o } - { e }
12	{ o } - {  }
12	{ ı } - { m }
11	{ e } - { c }

List of manuscripts

The data comes from partial allographetic transcriptions of the following manuscripts:

Clermont-Ferrand, archives départementales, 1F2 (XIII 1/3, Anglo-Norman praegothica script; Chanson d'Aspremont); 52 lines.
Paris, Bibliothèque nationale de France, fr. 854 (XIII 4/4, Venice or Venetian area; gothic textualis; Occitan chansonnier I); 1112 lines.
Cologny-Genève, fondation Martin-Bodmer, cod. Bodm. 168 (XIII 3/3, Anglo-Norman gothic textualis; Chanson d'Otinel); 1908 lines.
Oxford, Bodleian Library, Digby 23 (XII 1/2, Anglo-Norman praegothica; Chanson de Roland); 564 lines.

For these transcriptions, see: Jean-Baptiste Camps, La ‘Chanson d’Otinel’: édition complète du corpus manuscrit et prolégomènes à l’édition critique, PhD thesis, dir. Dominique Boutet, Paris-Sorbonne, 2016, DOI: https://doi.org/10.5281/zenodo.1116735.

License

This work is made available under the terms of the Creative Commons Attribution 4.0 International License.

Contribute

If you want to contribute training data or models, you can do so by cloning the repository and sending us a pull request, or by sending an email to jbcamps at hotmail.com.

Cite this repository

Jean-Baptiste Camps (éd.), FROC-MSS: Old French and Old Occitan Medieval Manuscripts HTR Data and Models, Paris: École nationale des chartes (PSL), 2018, https://github.com/Jean-Baptiste-Camps/FROC-MSS.
