Adding dataset Ground Truth data for printed Devanagari #89

nidame · 2022-11-11T13:59:14Z

Hello! I'd like to include the metadata for my GT dataset on HTR-United. The ALTO XML files and the images are archived at FID4SA@heiDATA, the research data repository of Heidelberg University. The DOI of the dataset is included in the metadata.
Hope it works! Please get in touch in case there are any questions. Best wishes, Nicole

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: Ground truth data for printed Devanagari
url: https://doi.org/10.11588/data/EGOKEI
authors:
 - name: Nicole
   surname: Merkel-Hilf
   orcid: 0000-0002-0344-6169
   roles:
     - transcriber
     - project-manager
institutions: Heidelberg University Library
description: >-
 Ground truth (GT) data (jpg and alto xml files) for an OCR model that
 recognizes printed text in Devanagari script.

 A model was trained on the GT data in Transkribus with the HTR+ engine. The
 training used approx. 220 pages with approx. 27,000 words. The validation set
 was 10% of the training set.

 The training material consists of letterpress printings from the Naval
 Kishore Press (Lakhnau, North India) from the late 19th and early 20th century
 in the Hindi, Sanskrit, Braj Bhasha and Awadhi languages.

 Transcription was performed by Nicole Merkel-Hilf (CATS Library / Heidelberg
 University Library) with support from Daria Peshcherova (CATS Library /
 Heidelberg University Library).
project-name: Naval Kishore Press - digital
project-website: https://digi.ub.uni-heidelberg.de/en/sammlungen/suedasien/navalkishore.html
language:
 - hin
 - san
 - bra
production-software: Transkribus
script:
 - iso: Deva
script-type: only-typed
time:
 notBefore: '1880'
 notAfter: '1953'
hands:
 count: less-than-11
 precision: exact
license:
 - name: CC-BY 4.0
   url: https://creativecommons.org/licenses/by/4.0/
format: Alto-XML
volume:
 - metric: lines
   count: 4333
transcription-guidelines: Diplomatic transcription, no correction of misspellings
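
For anyone who wants to check the `volume` entry (4333 lines) against the archived ALTO files, here is a minimal sketch that counts `TextLine` elements. It uses only the Python standard library and matches elements by local name, since the ALTO namespace version in the Transkribus export is not stated in this thread. The function names and the directory argument are hypothetical, chosen for illustration only.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Union


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix, e.g. '{ns}TextLine' -> 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def count_text_lines(alto_xml: Union[str, bytes]) -> int:
    """Count the TextLine elements in a single ALTO document."""
    root = ET.fromstring(alto_xml)
    return sum(
        1
        for element in root.iter()
        if isinstance(element.tag, str) and local_name(element.tag) == "TextLine"
    )


def count_lines_in_dir(directory: str) -> int:
    """Sum the TextLine counts of every .xml file below `directory`.

    `directory` is a hypothetical local path to the unpacked dataset.
    Any non-ALTO .xml file in that tree is parsed too; it simply adds 0
    unless it happens to contain elements named TextLine.
    """
    return sum(
        count_text_lines(path.read_bytes())
        for path in sorted(Path(directory).rglob("*.xml"))
    )
```

Calling `count_lines_in_dir` on the folder that holds the downloaded XML files gives a total to compare with the `count` value in the record.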

alix-tz · 2022-11-11T14:45:15Z

Hello! Thank you very much!

It looks pretty good to me :)

alix-tz · 2022-11-11T14:45:21Z

@PonteIneptique will we be able to use HUMG on this dataset?

PonteIneptique · 2022-11-11T14:47:48Z

If it's PageXML, yes absolutely :)

nidame · 2022-11-11T14:48:59Z

It's ALTO XML, but I can also export PAGE XML from the Transkribus website if necessary.

PonteIneptique · 2022-11-11T14:49:32Z

ALTO XML is even better for HUMG :)

PonteIneptique · 2022-11-11T14:49:56Z

(Next time I'll read the proposed record before commenting)

nidame · 2022-11-11T14:50:34Z

:-))

nidame · 2023-02-27T11:17:41Z

@PonteIneptique @alix-tz Hi, I just wanted to ask if you could include the metadata of the Devanagari GT in the HTR-United catalogue. I couldn't find it when searching.
And I've got new data: GT for the South Indian script Malayalam, provided by Tuebingen University Library. Would you be interested in that as well? If yes, I'll open a new issue. Best wishes, Nicole

alix-tz · 2023-02-27T14:47:22Z

Hello, I just checked the content of the dataset in Malayalam script and it looks good, so yes, it would be really interesting to add it. Can you open another issue for it?

Just a note: importing the PAGE files into eScriptorium works, but importing the ALTO files does not (because one piece of information is missing from the files exported by Transkribus), so can you make sure to keep the PAGE version in the dataset?

nidame · 2023-02-27T14:52:44Z

Before I start a new issue, could you please kindly give me any information on the Devanagari dataset I submitted in November?

alix-tz · 2023-03-01T14:08:27Z

I think this issue can be closed; the remaining discussion about the second dataset will continue in #104.

alix-tz added a commit that referenced this issue Feb 28, 2023

Adding YML file corresponding to #89

9e84d24

alix-tz mentioned this issue Mar 1, 2023

Adding YML file for Naval Kishore dataset (printed Devanagari) #105

Merged

alix-tz closed this as completed Mar 1, 2023
