This repository has been archived by the owner on May 11, 2020. It is now read-only.

Working with OCR-D-(Ground-Truth)-Repository

Upload bagit container from scratch to OCR-D(-GT)-Repository

Example: uploading a scanned page to the OCR-D repository.

Preparation: Create Workspace

Requirements: ocrd (version 1.0.0). See Setup OCR-D Stack.

Activate virtualenv

user@hostname:~$ source ~/env-ocrd/bin/activate
(env-ocrd) user@hostname:~$

Initialize Workspace

(env-ocrd) user@hostname:~$ ocrd workspace init communist_manifesto
(env-ocrd) user@hostname:~$ cd communist_manifesto

Create Folder for Scanned Page

(env-ocrd) user@hostname:~/communist_manifesto$ mkdir OCR-D-IMG

Download Image (Google)

(env-ocrd) user@hostname:~/communist_manifesto$ wget https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Manifesto_of_the_Communist_Party.djvu/page15-2745px-Manifesto_of_the_Communist_Party.djvu.jpg -O OCR-D-IMG/OCR-D-IMG_0015.jpg

Add Image to Workspace

(env-ocrd) user@hostname:~/communist_manifesto$ ocrd workspace add -g P0015 -G OCR-D-IMG -i OCR-D-IMG_0015 -m image/jpeg OCR-D-IMG/OCR-D-IMG_0015.jpg
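
When a workspace holds more than one scan, the add step above can be generated per file. The following is a minimal sketch, assuming file names follow the `<file group>_<page number>.jpg` pattern of the example. The helper name `add_image_cmd` is hypothetical, and it only prints the commands instead of running them. It uses the registered MIME type `image/jpeg`.

```shell
# Print the 'ocrd workspace add' command for one image (dry run; nothing is executed).
# Assumption: file names look like OCR-D-IMG_0015.jpg, as in the example above.
add_image_cmd() {
  local path=$1
  local file_id page_nr
  file_id=$(basename -- "$path" .jpg)   # e.g. OCR-D-IMG_0015
  page_nr=${file_id##*_}                 # e.g. 0015
  printf 'ocrd workspace add -g P%s -G OCR-D-IMG -i %s -m image/jpeg %s\n' \
    "$page_nr" "$file_id" "$path"
}

# Print one command per image found in OCR-D-IMG/.
for img in OCR-D-IMG/*.jpg; do
  [ -e "$img" ] || continue   # skip when the glob matches nothing
  add_image_cmd "$img"
done
```

Run this from inside the workspace directory and review the printed commands before executing them.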

Set Unique ID for Workspace

(env-ocrd) user@hostname:~/communist_manifesto$ ocrd workspace set-id 'communist_manifesto'

Validate Workspace

Some images carry no resolution metadata. To avoid validation errors in that case, skip the pixel density check. For further details, see 'ocrd workspace validate --help'.

(env-ocrd) user@hostname:~/communist_manifesto$ ocrd workspace validate --skip pixel_density mets.xml

Create BagIt Container

(env-ocrd) user@hostname:~/communist_manifesto$ cd ..
(env-ocrd) user@hostname:~/$ ocrd zip bag -i communist_manifesto -d communist_manifesto/

Validate BagIt Container

(env-ocrd) user@hostname:~/$ ocrd zip validate communist_manifesto.ocrd.zip
[...]
OK

Upload BagIt Container

user@hostname:~/$ curl -u ingest:GENERATED_PASSWORD -v -F "file=@communist_manifesto.ocrd.zip" http://localhost:8080/api/v1/metastore/bagit 
[...]
OK
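
For repeated uploads, the curl call above can be wrapped so that a missing file or a missing password is caught before any request is sent. This is a sketch under assumptions: the function name `upload_bag` and the environment variable `OCRD_REPO_PASSWORD` are hypothetical, and the endpoint is the localhost URL from the example.

```shell
# Upload endpoint from the example above; adjust host and port to your installation.
UPLOAD_URL="http://localhost:8080/api/v1/metastore/bagit"

# Upload one BagIt container as user 'ingest'.
# The password is read from the (hypothetical) variable OCRD_REPO_PASSWORD.
upload_bag() {
  local bag=$1
  if [ ! -f "$bag" ]; then
    echo "no such container: $bag" >&2
    return 1
  fi
  if [ -z "${OCRD_REPO_PASSWORD:-}" ]; then
    echo "OCRD_REPO_PASSWORD is not set" >&2
    return 1
  fi
  # --fail makes curl exit non-zero on HTTP error responses
  curl --fail -sS -u "ingest:${OCRD_REPO_PASSWORD}" -F "file=@${bag}" "$UPLOAD_URL"
}
```

A call would then look like `OCRD_REPO_PASSWORD=GENERATED_PASSWORD upload_bag communist_manifesto.ocrd.zip`.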

Download all BagIt Containers

user@hostname:~/Download$ wget -O listOfContainers.json https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit

user@hostname:~/Download$ ocrdzips=$(cat listOfContainers.json | tr ",[]\"" "\n")

user@hostname:~/Download$ for addr in $ocrdzips
do
  wget "$addr"
  filename=$(basename -- "$addr")
  directory="${filename%.*}"

  # Unpack each container into a directory named after it
  mkdir -p "$directory"
  unzip "$filename" -d "$directory"
done
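
The splitting step above can also be written as two small helpers that are easy to try on sample data before touching the network. This is a sketch, assuming the listing is a flat JSON array of URL strings (which the `tr` call above implies); the function names `extract_urls` and `container_dir` are hypothetical.

```shell
# Print one URL per line from a JSON array of strings read on stdin.
# Assumption: every entry is a quoted http(s) URL.
extract_urls() {
  grep -oE 'https?://[^"]+'
}

# Derive the unpack directory from a container URL:
# take the file name and strip its last extension.
container_dir() {
  local filename
  filename=$(basename -- "$1")
  printf '%s\n' "${filename%.*}"
}
```

For example, `extract_urls < listOfContainers.json` prints the download addresses one per line, ready to be read by a `while read -r addr` loop.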

List all Documents (in Browser)

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit

The list shows all ingested documents, each with its

  • 'Upload Date'
  • 'Version'
  • 'OCR-D Identifier'
  • 'Link for Download'
  • 'Referenced Files'
  • 'Metadata'
  • and 'Semantic Labeling' (Upload is only available for authorized users)

Download Document

https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/71e19490-343a-4d68-a5a7-7cf4c725c843/data/arent_dichtercharaktere_1885.zip

Downloads the complete document as a BagIt container.

List all Files inside Document

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/71e19490-343a-4d68-a5a7-7cf4c725c843/files

Lists all files of the given resourceID that are referenced in the mets.xml.

Download Single File

https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/71e19490-343a-4d68-a5a7-7cf4c725c843/data/bagit/data/DEFAULT/DEFAULT_0002

Downloads or displays a single file (here a TIFF) for the given resourceID, file group and fileID.
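
The single-file address is made up of the resourceID, the file group and the fileID. A minimal sketch of that composition follows; the URL pattern is read off the example above, and the function name `file_url` is hypothetical.

```shell
# Base URL of the repository API, as used in the examples in this document.
REPO_API="https://ocr-d-repo.scc.kit.edu/api/v1"

# Build the download URL of a single file.
# $1: resourceID, $2: file group, $3: fileID
file_url() {
  printf '%s/dataresources/%s/data/bagit/data/%s/%s\n' "$REPO_API" "$1" "$2" "$3"
}
```

The result can be passed straight to a downloader, for example `wget "$(file_url 71e19490-343a-4d68-a5a7-7cf4c725c843 DEFAULT DEFAULT_0002)"`.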

List Metadata

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/71e19490-343a-4d68-a5a7-7cf4c725c843/metadata

Lists the metadata of the document with the given resourceID (e.g. title, author, year, identifier, languages, classifications).

List Ground Truth Metadata

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/71e19490-343a-4d68-a5a7-7cf4c725c843/groundtruth

Lists all semantic labels of the given resourceID.
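
The three per-document listings (files, metadata, groundtruth) differ only in the last path segment. The sketch below builds them from a resourceID; the URL pattern is taken from the examples above, and the function name `mets_url` is hypothetical.

```shell
# Base URL of the repository API, as used in the examples in this document.
REPO_API="https://ocr-d-repo.scc.kit.edu/api/v1"

# Build the URL of one per-document listing.
# $1: resourceID, $2: one of files | metadata | groundtruth
mets_url() {
  case "$2" in
    files|metadata|groundtruth)
      printf '%s/metastore/mets/%s/%s\n' "$REPO_API" "$1" "$2"
      ;;
    *)
      echo "unknown listing: $2" >&2
      return 1
      ;;
  esac
}
```

On the command line, a listing could then be fetched with `curl -s "$(mets_url 71e19490-343a-4d68-a5a7-7cf4c725c843 metadata)"`.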

Search Inside Repository

Every search returns a list of matching resourceIDs. Use the listings above to inspect the resources found.

Search via browser

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit/search

Search on command line

Search for Semantic Label

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/labeling?label=condition/acquisition/method-flaws/imaging/uneven-illumination

Searches for documents with a given label, e.g. uneven illumination.

Search for Documents Containing Multiple Semantic Labels at Once

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/labeling?label=condition/acquisition/method-flaws/imaging/uneven-illumination,condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding

Search for Documents with Classification 'Fachtext'

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/classification?class=Fachtext

Search for Documents with Language 'deu'

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/language?lang=deu

Search for Documents with Identifier '16529'

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/identifier?identifier=16529

Search for Documents with Specific Identifier and Type

https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/identifier?identifier=urn:nbn:de:kobv:b4-200905196929&type=urn

Searches for a document by an identifier of a specific type. Possible types are:

  • purl
  • urn
  • handle
  • url
  • dtaid
  • ...
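
On the command line, the search URLs contain `?` and `&`, which the shell interprets unless the URL is quoted. The sketch below builds the search URLs shown above so they can be passed quoted to curl or wget. The function name `search_url` is hypothetical, and values are inserted as given, without URL encoding.

```shell
# Base URL of the repository API, as used in the examples in this document.
REPO_API="https://ocr-d-repo.scc.kit.edu/api/v1"

# Build a search URL.
# $1: labeling | classification | language | identifier
# $2: search value, $3: optional identifier type (e.g. urn)
search_url() {
  local kind=$1 value=$2 type=${3:-}
  case "$kind" in
    labeling)       printf '%s/metastore/mets/labeling?label=%s\n' "$REPO_API" "$value" ;;
    classification) printf '%s/metastore/mets/classification?class=%s\n' "$REPO_API" "$value" ;;
    language)       printf '%s/metastore/mets/language?lang=%s\n' "$REPO_API" "$value" ;;
    identifier)
      if [ -n "$type" ]; then
        printf '%s/metastore/mets/identifier?identifier=%s&type=%s\n' "$REPO_API" "$value" "$type"
      else
        printf '%s/metastore/mets/identifier?identifier=%s\n' "$REPO_API" "$value"
      fi
      ;;
    *)
      echo "unknown search kind: $kind" >&2
      return 1
      ;;
  esac
}
```

For example, `curl -s "$(search_url language deu)"` would request the list of resourceIDs for German-language documents; the double quotes keep the shell from touching the query string.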

Download selected BagIt Containers

Example: all containers with classification 'Belletristik'

# Get all containers
user@hostname:~/Download$ wget -O listOfAllContainers.json https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit

user@hostname:~/Download$ allocrdzips=$(cat listOfAllContainers.json | tr ",[]\"" "\n")

# Get IDs of matching containers (the URL is quoted so the shell leaves '?' alone)
user@hostname:~/Download$ wget -O filteredList.json "https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/classification?class=Belletristik"

user@hostname:~/Download$ filteredIds=$(cat filteredList.json | tr ",[]\"" "\n")

user@hostname:~/Download$ for bagitid in $filteredIds
do
  for addr in $allocrdzips
  do
    # Download only containers whose address contains a matching ID
    if echo "$addr" | grep -qF -- "$bagitid"; then
      wget "$addr"
      filename=$(basename -- "$addr")
      directory="${filename%.*}"

      # Unpack the container into a directory named after it
      mkdir -p "$directory"
      unzip "$filename" -d "$directory"
    fi
  done
done
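
The nested loop compares every ID with every address. As an alternative sketch, the same selection can be done in one pass with `grep -F -f`, assuming both lists have first been written to files with one entry per line. The function name `filter_containers` and the file names in the usage note are hypothetical.

```shell
# Print every container URL (one per line in file $1) that contains
# at least one of the IDs listed one per line in file $2.
# Empty lines in the ID file are dropped first, because an empty
# fixed-string pattern would match every URL.
filter_containers() {
  grep -v '^$' "$2" | grep -F -f - -- "$1"
}
```

With `tr ',[]"' '\n' < listOfAllContainers.json > urls.txt` and `tr ',[]"' '\n' < filteredList.json > ids.txt`, the call `filter_containers urls.txt ids.txt` prints only the addresses to download.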