Example: upload a scanned page to the OCR-D-Repo.
Requirements: ocrd (version 1.0.0); see 'Setup OCR-D Stack'.
user@hostname:~$ source ~/env-ocrd/bin/activate
(env-ocrd) user@hostname:~$
(env-ocrd) user@hostname:~$ ocrd workspace init communist_manifesto
(env-ocrd) user@hostname:~$ cd communist_manifesto
(env-ocrd) user@hostname:~/communist_manifesto$ mkdir OCR-D-IMG
(env-ocrd) user@hostname:~/communist_manifesto$ wget https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Manifesto_of_the_Communist_Party.djvu/page15-2745px-Manifesto_of_the_Communist_Party.djvu.jpg -O OCR-D-IMG/OCR-D-IMG_0015.jpg
(env-ocrd) user@hostname:~/communist_manifesto$ ocrd workspace add -g P0015 -G OCR-D-IMG -i OCR-D-IMG_0015 -m image/jpeg OCR-D-IMG/OCR-D-IMG_0015.jpg
(env-ocrd) user@hostname:~/communist_manifesto$ ocrd workspace set-id 'communist_manifesto'
Some images do not have their resolution set. To avoid validation errors, the resolution check is skipped with '--skip pixel_density'. For further details see 'ocrd workspace validate --help'.
(env-ocrd) user@hostname:~/communist_manifesto$ ocrd workspace validate --skip pixel_density mets.xml
(env-ocrd) user@hostname:~/communist_manifesto$ cd ..
(env-ocrd) user@hostname:~/$ ocrd zip bag -i communist_manifesto -d communist_manifesto/
(env-ocrd) user@hostname:~/$ ocrd zip validate communist_manifesto.ocrd.zip
[...]
OK
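The preparation steps above can be collected into one script. The following is a minimal sketch, not part of the OCR-D tooling: `run` and `prepare_bag` are hypothetical helper names, and the `DRY_RUN` variable is an assumption added here so the command sequence can be printed without executing anything. The sketch passes `image/jpeg`, the registered MIME type for JPEG files.

```shell
#!/bin/sh
# run CMD...: execute a command, or only print it when DRY_RUN is set
# (hypothetical helper, introduced for this sketch).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# prepare_bag NAME IMAGE_URL PAGE: hypothetical wrapper around the ocrd
# commands shown in the transcript above (workspace name, image, page number).
prepare_bag() {
  name=$1
  image_url=$2
  page=$3
  file_id="OCR-D-IMG_$page"

  run ocrd workspace init "$name" || return 1
  run mkdir -p "$name/OCR-D-IMG" || return 1
  run wget "$image_url" -O "$name/OCR-D-IMG/$file_id.jpg" || return 1
  (
    # The workspace commands run inside the workspace directory;
    # in dry-run mode the directory does not exist, so cd is skipped.
    [ -n "${DRY_RUN:-}" ] || cd "$name" || exit 1
    run ocrd workspace add -g "P$page" -G OCR-D-IMG -i "$file_id" -m image/jpeg "OCR-D-IMG/$file_id.jpg" &&
      run ocrd workspace set-id "$name" &&
      run ocrd workspace validate --skip pixel_density mets.xml
  ) || return 1
  run ocrd zip bag -i "$name" -d "$name/" &&
    run ocrd zip validate "$name.ocrd.zip"
}
```

Setting DRY_RUN=1 before calling `prepare_bag communist_manifesto IMAGE_URL 0015` prints the eight commands in order; without it they are executed and the function stops at the first failure.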
user@hostname:~/$ curl -u ingest:GENERATED_PASSWORD -v -F "file=@communist_manifesto.ocrd.zip" http://localhost:8080/api/v1/metastore/bagit
[...]
OK
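When the upload is scripted, the HTTP status can be checked instead of reading the verbose output. The following is a minimal sketch under one assumption: any 2xx status means the ingest was accepted (the exact status code the repository returns is not stated here). `http_ok` and `upload_bag` are hypothetical helper names; the URL and credentials are the placeholders from the command above.

```shell
# http_ok CODE: succeed if CODE is a 2xx HTTP status (hypothetical helper).
http_ok() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# upload_bag ZIP URL USER:PASSWORD: POST the container as form field "file"
# and report the result. Assumption: any 2xx response means success.
upload_bag() {
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code
  code=$(curl -s -o /dev/null -w '%{http_code}' -u "$3" -F "file=@$1" "$2") || {
    echo "upload failed (no HTTP response)" >&2
    return 1
  }
  if http_ok "$code"; then
    echo "upload accepted (HTTP $code)"
  else
    echo "upload failed (HTTP $code)" >&2
    return 1
  fi
}
```

Example call: upload_bag communist_manifesto.ocrd.zip http://localhost:8080/api/v1/metastore/bagit ingest:GENERATED_PASSWORD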
user@hostname:~/Download$ wget -O listOfContainers.json https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit
# Split the JSON array into one URL per line
user@hostname:~/Download$ ocrdzips=$(tr ',[]"' '\n\n\n\n' < listOfContainers.json)
# Download each container and unpack it into a directory named after the file
user@hostname:~/Download$ for addr in $ocrdzips
do
  wget "$addr"
  filename=$(basename -- "$addr")
  directory="${filename%.*}"
  mkdir -p "$directory"
  unzip "$filename" -d "$directory"
done
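The splitting step used in the loop above can be isolated into a small filter. This is a minimal sketch, assuming the response is a flat JSON array of URL strings (as the loop implies) and that no URL contains a comma, bracket or double quote; `extract_urls` is a hypothetical helper name.

```shell
# extract_urls: read a JSON array of URL strings on stdin and print
# one URL per line. Blank lines and other fragments left over from
# the split are dropped by keeping only lines that start with http(s)://.
extract_urls() {
  tr ',[]"' '\n\n\n\n' | grep -E '^https?://'
}
```

For example, the list can be piped straight from the server: wget -q -O - https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit | extract_urls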
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit The list shows all ingested documents, each with its:
- 'Upload Date'
- 'Version'
- 'OCR-D Identifier'
- 'Link for Download'
- 'Referenced Files'
- 'Metadata'
- and 'Semantic Labeling' (Upload is only available for authorized users)
https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/71e19490-343a-4d68-a5a7-7cf4c725c843/data/arent_dichtercharaktere_1885.zip Download the complete document as a BagIt container.
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/71e19490-343a-4d68-a5a7-7cf4c725c843/files List all files of the given resourceID that are referenced in mets.xml.
https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/71e19490-343a-4d68-a5a7-7cf4c725c843/data/bagit/data/DEFAULT/DEFAULT_0002 Download or view a single file (TIFF) for the given resourceID, file group and fileID.
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/71e19490-343a-4d68-a5a7-7cf4c725c843/metadata List the metadata of the document (e.g. title, author, year, identifier, languages, classifications) for the given resourceID.
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/71e19490-343a-4d68-a5a7-7cf4c725c843/groundtruth List all semantic labels of the given resourceID.
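The three metastore listings above share one URL pattern that differs only in the resourceID and the last path segment. A minimal sketch of a helper that builds these URLs; `resource_url` and the `OCRD_REPO` variable are hypothetical names introduced here, and only the three listing kinds shown above are covered.

```shell
# Base URL taken from the examples above.
OCRD_REPO=https://ocr-d-repo.scc.kit.edu/api/v1

# resource_url KIND RESOURCE_ID: print the listing URL for one resource.
# KIND is one of: files, metadata, groundtruth (hypothetical helper).
resource_url() {
  case "$1" in
    files|metadata|groundtruth)
      printf '%s/metastore/mets/%s/%s\n' "$OCRD_REPO" "$2" "$1"
      ;;
    *)
      echo "unknown kind: $1" >&2
      return 1
      ;;
  esac
}
```

The result can be passed to wget, e.g.: wget -O files.json "$(resource_url files 71e19490-343a-4d68-a5a7-7cf4c725c843)"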
All searches return a list of matching resourceIDs. To inspect the resources found, use the listing URLs above.
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit/search
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/labeling?label=condition/acquisition/method-flaws/imaging/uneven-illumination Search for documents with e.g. uneven illumination.
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/classification?class=Fachtext
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/language?lang=deu
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/identifier?identifier=16529
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/identifier?identifier=urn:nbn:de:kobv:b4-200905196929&type=urn Search for a document with a specific identifier of a specific type. Possible types are:
- purl
- urn
- handle
- url
- dtaid
- ...
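The search URLs above follow a common pattern: a search kind plus one query parameter, with an optional type for identifier searches. A minimal sketch of a URL builder; `search_url` and `OCRD_REPO` are hypothetical names introduced here. Values are inserted verbatim, as in the examples above, so a value containing spaces or '&' would need URL encoding first, which this sketch does not do.

```shell
# Base URL taken from the examples above.
OCRD_REPO=https://ocr-d-repo.scc.kit.edu/api/v1

# search_url KIND VALUE [TYPE]: print the search URL (hypothetical helper).
# KIND is one of: labeling, classification, language, identifier.
# TYPE is only used for identifier searches (e.g. purl, urn, handle).
search_url() {
  base="$OCRD_REPO/metastore/mets"
  case "$1" in
    labeling)       printf '%s/labeling?label=%s\n' "$base" "$2" ;;
    classification) printf '%s/classification?class=%s\n' "$base" "$2" ;;
    language)       printf '%s/language?lang=%s\n' "$base" "$2" ;;
    identifier)
      if [ -n "${3:-}" ]; then
        printf '%s/identifier?identifier=%s&type=%s\n' "$base" "$2" "$3"
      else
        printf '%s/identifier?identifier=%s\n' "$base" "$2"
      fi
      ;;
    *)
      echo "unknown search kind: $1" >&2
      return 1
      ;;
  esac
}
```

Because the URLs contain '?' and '&', quote them when passing them to a command: wget -O result.json "$(search_url identifier urn:nbn:de:kobv:b4-200905196929 urn)"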
Example: download all documents with the classification 'Belletristik'.
# Get all containers
user@hostname:~/Download$ wget -O listOfAllContainers.json https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit
user@hostname:~/Download$ allocrdzips=$(tr ',[]"' '\n\n\n\n' < listOfAllContainers.json)
# Get IDs of fitting containers
user@hostname:~/Download$ wget -O filteredList.json "https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/classification?class=Belletristik"
user@hostname:~/Download$ filteredIds=$(tr ',[]"' '\n\n\n\n' < filteredList.json)
# Download and unpack every container whose URL contains a matching ID
user@hostname:~/Download$ for bagitid in $filteredIds
do
  for addr in $allocrdzips
  do
    if echo "$addr" | grep -qF -- "$bagitid"; then
      wget "$addr"
      filename=$(basename -- "$addr")
      directory="${filename%.*}"
      mkdir -p "$directory"
      unzip "$filename" -d "$directory"
    fi
  done
done
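The nested loop above compares every ID with every URL. The same selection can be done in one pass by handing all IDs to grep as fixed-string patterns. This is a minimal sketch, assuming both lists have been written to files with one entry per line; `filter_urls_by_ids` and the file names in the usage comment are hypothetical.

```shell
# filter_urls_by_ids IDS_FILE URLS_FILE: print every URL that contains
# at least one of the IDs (hypothetical helper).
filter_urls_by_ids() {
  patterns=$(mktemp) || return 1
  # Drop blank lines first: an empty pattern would match every URL.
  grep -v '^[[:space:]]*$' "$1" > "$patterns" || true
  # -F: treat IDs as fixed strings, -f: read the patterns from a file
  if grep -F -f "$patterns" "$2"; then
    status=0
  else
    status=1
  fi
  rm -f "$patterns"
  return "$status"
}

# Usage sketch (ids.txt and urls.txt are hypothetical file names):
#   tr ',[]"' '\n\n\n\n' < filteredList.json > ids.txt
#   tr ',[]"' '\n\n\n\n' < listOfAllContainers.json > urls.txt
#   filter_urls_by_ids ids.txt urls.txt | while read -r addr; do wget "$addr"; done
```

The function returns a non-zero status when no URL matches, so a calling script can detect an empty search result.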