Adding dataset Ground Truth data for printed Devanagari #89

Closed
nidame opened this issue Nov 11, 2022 · 11 comments

Comments

@nidame

nidame commented Nov 11, 2022

Hello! I'd like to include the metadata for my GT dataset on HTR-United. The ALTO XML files and the images are archived at FID4SA@heiDATA, the research data repository of Heidelberg University. The DOI of the dataset is included in the metadata.
Hope it works! Please get in touch in case there are any questions. Best wishes, Nicole

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: Ground truth data for printed Devanagari
url: https://doi.org/10.11588/data/EGOKEI
authors:
 - name: Nicole
   surname: Merkel-Hilf
   orcid: 0000-0002-0344-6169
   roles:
     - transcriber
     - project-manager
institutions: Heidelberg University Library
description: >-
 Ground truth (GT) data (JPG and ALTO XML files) for an OCR model that
 recognizes printed text in Devanagari script.

 The model was trained on the GT data in Transkribus with the HTR+ engine. The
 training was performed on approx. 220 pages with approx. 27,000 words. The
 validation set was 10% of the training set.

 The training material comprises letterpress printings from the Naval
 Kishore Press (Lakhnau, North India) from the late 19th and early 20th century
 in the Hindi, Sanskrit, Braj Bhasha and Awadhi languages.

 Transcription was performed by Nicole Merkel-Hilf (CATS Library / Heidelberg
 University Library) with support by Daria Peshcherova (CATS Library /
 Heidelberg University Library).
project-name: Naval Kishore Press - digital
project-website: https://digi.ub.uni-heidelberg.de/en/sammlungen/suedasien/navalkishore.html
language:
 - hin
 - san
 - bra
 - awa
production-software: Transkribus
script:
 - iso: Deva
script-type: only-typed
time:
 notBefore: '1880'
 notAfter: '1953'
hands:
 count: less-than-11
 precision: exact
license:
 - name: CC-BY 4.0
   url: https://creativecommons.org/licenses/by/4.0/
format: Alto-XML
volume:
 - metric: lines
   count: 4333
transcription-guidelines: Diplomatic transcription, no correction of misspellings
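
The `volume` entry in the record above reports 4333 lines. A minimal sketch for cross-checking such a count, assuming the ALTO files have been downloaded to a local folder and that each transcribed line is stored as a `TextLine` element, as in standard ALTO. The function names and the folder are illustrative only and are not part of HTR-United or Transkribus tooling:

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def count_text_lines(alto_xml) -> int:
    """Count TextLine elements in one ALTO document (str or bytes).

    Matching on the local tag name keeps this independent of the
    ALTO namespace version used by the exporting software.
    """
    root = ET.fromstring(alto_xml)
    return sum(1 for element in root.iter() if local_name(element.tag) == "TextLine")


def count_lines_in_folder(folder: str) -> int:
    """Sum the TextLine counts of every .xml file directly inside `folder`."""
    return sum(
        count_text_lines(path.read_bytes())
        for path in sorted(Path(folder).glob("*.xml"))
    )
```

Calling `count_lines_in_folder("alto")` on a hypothetical `alto/` folder would give a figure to compare with the `count` value in the record. Empty `TextLine` elements are counted too, so a small difference from the declared volume is possible.
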
@alix-tz
Member

alix-tz commented Nov 11, 2022

Hello! Thank you very much!

It looks pretty good to me :)

@alix-tz
Member

alix-tz commented Nov 11, 2022

@PonteIneptique will we be able to use HUMG on this dataset?

@PonteIneptique
Member

If it's PageXML, yes absolutely :)

@nidame
Author

nidame commented Nov 11, 2022

It's ALTO XML, but I can also export PAGE XML from the Transkribus website if necessary.

@PonteIneptique
Member

ALTO XML is even better for HUMG :)

@PonteIneptique
Member

(Next time I'll read the proposed record before commenting)

@nidame
Author

nidame commented Nov 11, 2022

:-))

@nidame
Author

nidame commented Feb 27, 2023

@PonteIneptique @alix-tz Hi, I just wanted to ask if you could include the metadata of the Devanagari GT in the HTR-United catalogue. I couldn't find it when searching.
I've also got new data: GT for the South Indian script Malayalam, provided by Tuebingen University Library. Would you be interested in that as well? If yes, I'll open a new issue. Best wishes, Nicole

@alix-tz
Member

alix-tz commented Feb 27, 2023

Hello, I just checked the content of the dataset in Malayalam script and it looks good, so yes, it would be really interesting to add it. Can you open another issue for it?

Just a note: importing the PAGE XML into eScriptorium works, but importing the ALTO does not (because one piece of information is missing from the file exported by Transkribus), so can you make sure to keep the PAGE version in the dataset?

@nidame
Author

nidame commented Feb 27, 2023

Before I start a new issue, could you please kindly give me any information on the Devanagari dataset I submitted in November?

@alix-tz
Member

alix-tz commented Mar 1, 2023

I think this issue can be closed, the remaining discussion about the second dataset will be in #104

@alix-tz alix-tz closed this as completed Mar 1, 2023