diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml index eac90c000..1f35aea43 100644 --- a/.github/workflows/test.yml +++ b/.github/workflows/test.yml @@ -1,8 +1,8 @@ name: Lint, test, build, and publish -on: +on: push: - + jobs: lint_and_test: @@ -130,7 +130,7 @@ jobs: pypi/* publish-gh-pages: - name: Update kraken.re github pages + name: Update kraken.re github pages needs: lint_and_test runs-on: ubuntu-latest if: | @@ -147,7 +147,7 @@ jobs: python-version: 3.9 - name: Install sphinx-multiversion run: python -m pip install sphinx-multiversion sphinx-autoapi - - name: Create docs + - name: Create docs run: sphinx-multiversion docs build/html - name: Create redirect run: cp docs/redirect.html build/html/index.html diff --git a/README.rst b/README.rst index 8c9ef62b2..747b1036c 100644 --- a/README.rst +++ b/README.rst @@ -55,7 +55,7 @@ branch as well: :: - $ git clone https://github.com/mittagessen/kraken.git + $ git clone https://github.com/mittagessen/kraken.git $ cd kraken $ conda env create -f environment.yml @@ -63,7 +63,7 @@ or: :: - $ git clone https://github.com/mittagessen/kraken.git + $ git clone https://github.com/mittagessen/kraken.git $ cd kraken $ conda env create -f environment_cuda.yml @@ -75,7 +75,7 @@ in the kraken directory for the current user: :: - $ kraken get 10.5281/zenodo.10592716 + $ kraken get 10.5281/zenodo.10592716 A list of libre models available in the central repository can be retrieved by running: @@ -105,7 +105,7 @@ To segment an image (binarized or not) with the new baseline segmenter: :: $ kraken -i image.tif lines.json segment -bl - + To segment and OCR an image using the default model(s): diff --git a/docs/alto.xml b/docs/alto.xml index dbf0ca0a1..70185b516 100644 --- a/docs/alto.xml +++ b/docs/alto.xml @@ -13,18 +13,18 @@ - ... 
diff --git a/docs/api.rst b/docs/api.rst index ec1d22a0f..56d0fca81 100644 --- a/docs/api.rst +++ b/docs/api.rst @@ -1,10 +1,10 @@ -API Quickstart +API Quickstart ============== Kraken provides routines which are usable by third party tools to access all functionality of the OCR engine. Most functional blocks, binarization, segmentation, recognition, and serialization are encapsulated in one high -level method each. +level method each. Simple use cases of the API which are mostly useful for debugging purposes are contained in the `contrib` directory. In general it is recommended to look at @@ -493,7 +493,7 @@ handling and verbosity options for the CLI. .. code-block:: python - >>> from kraken.lib.train import RecognitionModel, KrakenTrainer + >>> from kraken.lib.train import RecognitionModel, KrakenTrainer >>> ground_truth = glob.glob('training/*.xml') >>> training_files = ground_truth[:250] # training data is shuffled internally >>> evaluation_files = ground_truth[250:] @@ -522,14 +522,14 @@ can be attached to the trainer object: .. code-block:: python >>> from pytorch_lightning.callbacks import Callback - >>> from kraken.lib.train import RecognitionModel, KrakenTrainer + >>> from kraken.lib.train import RecognitionModel, KrakenTrainer >>> class MyPrintingCallback(Callback): def on_init_start(self, trainer): print("Starting to init trainer!") - + def on_init_end(self, trainer): print("trainer is init now") - + def on_train_end(self, trainer, pl_module): print("do something when training ends") >>> ground_truth = glob.glob('training/*.xml') diff --git a/docs/index.rst b/docs/index.rst index 2c2a0e81b..dda41e35e 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -30,7 +30,7 @@ kraken's main features are: - :ref:`Public repository ` of model files - :ref:`Variable recognition network architectures ` -Pull requests and code contributions are always welcome. +Pull requests and code contributions are always welcome. 
Installation ============ @@ -86,7 +86,7 @@ The git repository contains some environment files that aid in setting up the la .. code-block:: console - $ git clone https://github.com/mittagessen/kraken.git + $ git clone https://github.com/mittagessen/kraken.git $ cd kraken $ conda env create -f environment.yml @@ -94,7 +94,7 @@ or: .. code-block:: console - $ git clone https://github.com/mittagessen/kraken.git + $ git clone https://github.com/mittagessen/kraken.git $ cd kraken $ conda env create -f environment_cuda.yml @@ -109,7 +109,7 @@ in the kraken directory for the current user: :: - $ kraken get 10.5281/zenodo.10592716 + $ kraken get 10.5281/zenodo.10592716 A list of libre models available in the central repository can be retrieved by @@ -125,9 +125,9 @@ Model metadata can be extracted using: $ kraken show 10.5281/zenodo.10592716 name: 10.5281/zenodo.10592716 - + CATMuS-Print (Large, 2024-01-30) - Diachronic model for French prints and other languages - +

CATMuS-Print (Large) - Diachronic model for French prints and other West European languages

CATMuS (Consistent Approach to Transcribing ManuScript) Print is a Kraken HTR model trained on data produced by several projects, dealing with different languages (French, Spanish, German, English, Corsican, Catalan, Latin, Italian…) and different centuries (from the first prints of the 16th c. to digital documents of the 21st century).

Transcriptions follow graphematic principles and try to be as compatible as possible with guidelines previously published for French: no ligature (except those that still exist), no allographetic variants (except the long s), and preservation of the historical use of some letters (u/v, i/j). Abbreviations are not resolved. Inconsistencies might be present, because transcriptions have been done over several years and the norms have slightly evolved.

diff --git a/docs/ketos.rst b/docs/ketos.rst index b1c23ae30..b2b2b00e8 100644 --- a/docs/ketos.rst +++ b/docs/ketos.rst @@ -5,12 +5,12 @@ Training This page describes the training utilities available through the ``ketos`` command line utility in depth. For a gentle introduction on model training -please refer to the :ref:`tutorial `. +please refer to the :ref:`tutorial `. There are currently three trainable components in the kraken processing pipeline: * Segmentation: finding lines and regions in images * Reading Order: ordering lines found in the previous segmentation step. Reading order models are closely linked to segmentation models and both are usually trained on the same dataset. -* Recognition: recognition models transform images of lines into text. +* Recognition: recognition models transform images of lines into text. Depending on the use case it is not necessary to manually train new models for each material. The default segmentation model works well on quite a variety of @@ -246,7 +246,7 @@ would be: A better configuration for large and complicated datasets such as handwritten texts: -.. code-block:: console +.. code-block:: console $ ketos train --augment --workers 4 -d cuda -f binary --min-epochs 20 -w 0 -s '[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do]' -r 0.0001 dataset_large.arrow @@ -273,10 +273,10 @@ an exact match. 
Otherwise an error will be raised: $ ketos train -i model_5.mlmodel kamil/*.png Building training set [####################################] 100% Building validation set [####################################] 100% - [0.8616] alphabet mismatch {'~', '»', '8', '9', 'ـ'} + [0.8616] alphabet mismatch {'~', '»', '8', '9', 'ـ'} Network codec not compatible with training set - [0.8620] Training data and model codec alphabets mismatch: {'ٓ', '؟', '!', 'ص', '،', 'ذ', 'ة', 'ي', 'و', 'ب', 'ز', 'ح', 'غ', '~', 'ف', ')', 'د', 'خ', 'م', '»', 'ع', 'ى', 'ق', 'ش', 'ا', 'ه', 'ك', 'ج', 'ث', '(', 'ت', 'ظ', 'ض', 'ل', 'ط', '؛', 'ر', 'س', 'ن', 'ء', 'ٔ', '«', 'ـ', 'ٕ'} - + [0.8620] Training data and model codec alphabets mismatch: {'ٓ', '؟', '!', 'ص', '،', 'ذ', 'ة', 'ي', 'و', 'ب', 'ز', 'ح', 'غ', '~', 'ف', ')', 'د', 'خ', 'م', '»', 'ع', 'ى', 'ق', 'ش', 'ا', 'ه', 'ك', 'ج', 'ث', '(', 'ت', 'ظ', 'ض', 'ل', 'ط', '؛', 'ر', 'س', 'ن', 'ء', 'ٔ', '«', 'ـ', 'ٕ'} + There are two modes dealing with mismatching alphabets, ``union`` and ``new``. ``union`` resizes the output layer and codec of the loaded model to include all characters in the new training set without removing any characters. ``new`` @@ -340,10 +340,10 @@ layers we define a network stub and index for appending: .. code-block:: console - $ ketos train -i model_1.mlmodel --append 7 -s '[Lbx256 Do]' syr/*.png + $ ketos train -i model_1.mlmodel --append 7 -s '[Lbx256 Do]' syr/*.png Building training set [####################################] 100% Building validation set [####################################] 100% - [0.8014] alphabet mismatch {'8', '3', '9', '7', '܇', '݀', '݂', '4', ':', '0'} + [0.8014] alphabet mismatch {'8', '3', '9', '7', '܇', '݀', '݂', '4', ':', '0'} Slicing and dicing model ✓ The new model will behave exactly like a new one, except potentially training a @@ -599,7 +599,7 @@ It is also possible to filter out baselines/regions selectively: Finally, we can merge baselines and regions into each other: -.. code-block:: console +.. 
code-block:: console $ ketos segtrain -f xml --merge-baselines default:foo training_data/*.xml Training line types: @@ -653,7 +653,7 @@ with their segmentation model in a subsequent step. The general sequence is therefore: .. code-block:: console - + $ ketos segtrain -o fr_manu_seg.mlmodel -f xml french/*.xml ... $ ketos rotrain -o fr_manu_ro.mlmodel -f xml french/*.xml @@ -671,8 +671,8 @@ serialized in the final XML output (in ALTO/PAGE XML). Reading order models work purely on the typology and geometric features of the lines and regions. They construct an approximate ordering matrix by feeding feature vectors of two lines (or regions) into the network - to decide which of those two lines precedes the other. - + to decide which of those two lines precedes the other. + These feature vectors are quite simple; just the lines' types, and their start, center, and end points. Therefore they can *not* reliably learn any ordering relying on graphical features of the input page such @@ -705,10 +705,10 @@ sufficiently large training datasets: │ 3 │ ro_net.relu │ ReLU │ 0 │ │ 4 │ ro_net.fc2 │ Linear │ 45 │ └───┴─────────────┴───────────────────┴────────┘ - Trainable params: 1.1 K - Non-trainable params: 0 - Total params: 1.1 K - Total estimated model params size (MB): 0 + Trainable params: 1.1 K + Non-trainable params: 0 + Total params: 1.1 K + Total estimated model params size (MB): 0 stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/35 0:00:00 • -:--:-- 0.00it/s val_spearman: 0.912 val_loss: 0.701 early_stopping: 0/300 inf During validation a metric called Spearman's footrule is computed. 
To calculate @@ -756,20 +756,20 @@ adding a number of image files as the final argument: Evaluating $model Evaluating [####################################] 100% === report test_model.mlmodel === - + 7012 Characters 6022 Errors 14.12% Accuracy - + 5226 Insertions 2 Deletions 794 Substitutions - + Count Missed %Right 1567 575 63.31% Common 5230 5230 0.00% Arabic 215 215 0.00% Inherited - + Errors Correct-Generated 773 { ا } - { } 536 { ل } - { } diff --git a/docs/training.rst b/docs/training.rst index aa63338f5..704727aa5 100644 --- a/docs/training.rst +++ b/docs/training.rst @@ -142,7 +142,7 @@ that can be adjusted: Training a network will take some time on a modern computer, even with the default parameters. While the exact time required is unpredictable as training is a somewhat random process a rough guide is that accuracy seldom improves -after 50 epochs reached between 8 and 24 hours of training. +after 50 epochs reached between 8 and 24 hours of training. When to stop training is a matter of experience; the default setting employs a fairly reliable approach known as `early stopping @@ -150,10 +150,10 @@ fairly reliable approach known as `early stopping the error rate on the validation set doesn't improve anymore. This will prevent `overfitting `_, i.e. fitting the model to recognize only the training data properly instead of the -general patterns contained therein. +general patterns contained therein. .. code-block:: console - + $ ketos train output_dir/*.png Building training set [####################################] 100% Building validation set [####################################] 100% @@ -164,7 +164,7 @@ general patterns contained therein. 
Accuracy report (1) 0.0245 3504 3418 epoch 1/-1 [####################################] 788/788 Accuracy report (2) 0.8445 3504 545 - epoch 2/-1 [####################################] 788/788 + epoch 2/-1 [####################################] 788/788 Accuracy report (3) 0.9541 3504 161 epoch 3/-1 [------------------------------------] 13/788 0d 00:22:09 ... @@ -212,8 +212,8 @@ information by appending one or more ``-v`` to the command: .. code-block:: console $ ketos -vv train syr/*.png - [0.7272] Building ground truth set from 876 line images - [0.7281] Taking 88 lines from training for evaluation + [0.7272] Building ground truth set from 876 line images + [0.7281] Taking 88 lines from training for evaluation ... [0.8479] Training set 788 lines, validation set 88 lines, alphabet 48 symbols [0.8481] alphabet mismatch {'\xa0', '0', ':', '݀', '܇', '݂', '5'} @@ -314,20 +314,20 @@ After all lines have been processed a evaluation report will be printed: .. code-block:: console === report === - + 35619 Characters 336 Errors 99.06% Accuracy - + 157 Insertions 81 Deletions 98 Substitutions - + Count Missed %Right 27046 143 99.47% Syriac 7015 52 99.26% Common 1558 60 96.15% Inherited - + Errors Correct-Generated 25 { } - { COMBINING DOT BELOW } 25 { COMBINING DOT BELOW } - { } @@ -433,16 +433,16 @@ Retrieving model metadata for a particular model: $ kraken show arabic-alam-al-kutub name: arabic-alam-al-kutub.mlmodel - + An experimental model for Classical Arabic texts. - + Network trained on 889 lines of [0] as a test case for a general Classical Arabic model. Ground truth was prepared by Sarah Savant and Maxim Romanov . - + Vocalization was omitted in the ground truth. Training was stopped at ~35000 iterations with an accuracy of 97%. - + [0] Ibn al-Faqīh (d. 365 AH). Kitāb al-buldān. Edited by Yūsuf al-Hādī, 1st edition. Bayrūt: ʿĀlam al-kutub, 1416 AH/1996 CE. 
alphabet: !()-.0123456789:[] «»،؟ءابةتثجحخدذرزسشصضطظعغفقكلمنهوىي ARABIC diff --git a/docs/vgsl.rst b/docs/vgsl.rst index 8a956b213..6a0c42de4 100644 --- a/docs/vgsl.rst +++ b/docs/vgsl.rst @@ -55,11 +55,11 @@ Examples [1,1,0,48 Lbx100 Do 01c59] - Creating new model [1,1,0,48 Lbx100 Do] with 59 outputs - layer type params + Creating new model [1,1,0,48 Lbx100 Do] with 59 outputs + layer type params 0 rnn direction b transposed False summarize False out 100 legacy None - 1 dropout probability 0.5 dims 1 - 2 linear augmented False out 59 + 1 dropout probability 0.5 dims 1 + 2 linear augmented False out 59 A simple recurrent recognition model with a single LSTM layer classifying lines normalized to 48 pixels in height. @@ -68,18 +68,18 @@ normalized to 48 pixels in height. [1,48,0,1 Cr3,3,32 Do0.1,2 Mp2,2 Cr3,3,64 Do0.1,2 Mp2,2 S1(1x12)1,3 Lbx100 Do 01c59] - Creating new model [1,48,0,1 Cr3,3,32 Do0.1,2 Mp2,2 Cr3,3,64 Do0.1,2 Mp2,2 S1(1x12)1,3 Lbx100 Do] with 59 outputs - layer type params - 0 conv kernel 3 x 3 filters 32 activation r - 1 dropout probability 0.1 dims 2 - 2 maxpool kernel 2 x 2 stride 2 x 2 - 3 conv kernel 3 x 3 filters 64 activation r - 4 dropout probability 0.1 dims 2 - 5 maxpool kernel 2 x 2 stride 2 x 2 - 6 reshape from 1 1 x 12 to 1/3 - 7 rnn direction b transposed False summarize False out 100 legacy None - 8 dropout probability 0.5 dims 1 - 9 linear augmented False out 59 + Creating new model [1,48,0,1 Cr3,3,32 Do0.1,2 Mp2,2 Cr3,3,64 Do0.1,2 Mp2,2 S1(1x12)1,3 Lbx100 Do] with 59 outputs + layer type params + 0 conv kernel 3 x 3 filters 32 activation r + 1 dropout probability 0.1 dims 2 + 2 maxpool kernel 2 x 2 stride 2 x 2 + 3 conv kernel 3 x 3 filters 64 activation r + 4 dropout probability 0.1 dims 2 + 5 maxpool kernel 2 x 2 stride 2 x 2 + 6 reshape from 1 1 x 12 to 1/3 + 7 rnn direction b transposed False summarize False out 100 legacy None + 8 dropout probability 0.5 dims 1 + 9 linear augmented False out 59 A model with a small convolutional 
stack before a recurrent LSTM layer. The extended dropout layer syntax is used to reduce drop probability on the depth @@ -129,7 +129,7 @@ other branch simply passes through the output of the first convolution layer. The input of the last convolutional layer is then the output of the two branches of the parallel block concatenated, i.e. the output of the first convolutional layer together with the output of the transposed convolutional layer, -giving `32 + 32 = 64` feature dimensions. +giving `32 + 32 = 64` feature dimensions. Convolutional Layers -------------------- diff --git a/kraken/ketos/recognition.py b/kraken/ketos/recognition.py index edc321f80..e4e0b76ea 100644 --- a/kraken/ketos/recognition.py +++ b/kraken/ketos/recognition.py @@ -311,7 +311,7 @@ def train(ctx, batch_size, pad, output, spec, append, load, freq, quit, epochs, codec=codec, resize=resize, legacy_polygons=legacy_polygons) - + # Force upgrade to new polygon extractor if model was not trained with it if model.nn and model.nn.use_legacy_polygons: if not legacy_polygons and not model.legacy_polygons: diff --git a/kraken/templates/layout.html b/kraken/templates/layout.html index 4a7d14dbd..9ca212e0b 100644 --- a/kraken/templates/layout.html +++ b/kraken/templates/layout.html @@ -1,6 +1,6 @@ - + diff --git a/kraken/templates/style.css b/kraken/templates/style.css index 30e3ba30f..117657f74 100644 --- a/kraken/templates/style.css +++ b/kraken/templates/style.css @@ -111,7 +111,7 @@ nav a { } nav a:hover { - text-decoration: underline; + text-decoration: underline; } button.download { diff --git a/tests/resources/FineReader10-schema-v1.xml b/tests/resources/FineReader10-schema-v1.xml index d98b46ce7..80dda2ace 100644 --- a/tests/resources/FineReader10-schema-v1.xml +++ b/tests/resources/FineReader10-schema-v1.xml @@ -1,626 +1,626 @@ - - - Schema for representing OCR results exported from FineReader 10.0 SDK. Copyright 2001-2011 ABBYY, Inc. 
- - - - - - - - - Global document data - - - - - - - Paragraph formatting styles collection - - - - - - - Paragraph formatting style - - - - - - - - - Document sections collection - - - - - - - Section - - - - - - - - - - - - Recognized page - - - - - - - Recognized block - - - - - - Page Section - - - - - - Running titles and artefacts - - - - - - - - - - If true, all coordinates are relative to original image before opening, otherwise they are relative to the opened (deskewed) image - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Page section is the sequence of page streams - - - - - - - - - - - Page Stream is the sequence of page elements - - - - - - - - - - - - text - - - - - Table - - - - - Barcode - - - - - Picture - - - - - - - - - - - Table captions - - - - - - - Table cells - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Picture captions - - - - - - - - - - - - - - - - - - - - - - - Text Stream is the sequence of paragraphs and/or blocks - - - - - - - - - - - - - - - - - Id of page element - - - - - - - - - - - - - - - - - - - - - - - - - - - Block region, the set of rectangles - - - - - - - - - - - - - - - - - - Recognized block text, presents if blockType attribute is Text - - - - - The set of table rows, presents if blockType attribute is Table - - - - - Separators box block, presents if blockType attribute is SeparatorsBox - - - - - - - - - - - Separator block, presents if blockType attribute is Separator - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Text paragraph - - - - - - - - - - - - - - - - - - - - - - - - - Table cell - - - - - - Cell text - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Text paragraph line - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Group of characters with uniform formatting - - - - - - - - - - - - - - - Attributes of characters are alternated with word's 
recognition variants. The variants of recognition of the word are written before the word - - - - Attributes of single character - - - - - Variants of recognition of the next word - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Starting point of the separator - - - - - Ending point of the separator - - - - - - - - - - - - - - - - - - - - - - + + + Schema for representing OCR results exported from FineReader 10.0 SDK. Copyright 2001-2011 ABBYY, Inc. + + + + + + + + + Global document data + + + + + + + Paragraph formatting styles collection + + + + + + + Paragraph formatting style + + + + + + + + + Document sections collection + + + + + + + Section + + + + + + + + + + + + Recognized page + + + + + + + Recognized block + + + + + + Page Section + + + + + + Running titles and artefacts + + + + + + + + + + If true, all coordinates are relative to original image before opening, otherwise they are relative to the opened (deskewed) image + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Page section is the sequence of page streams + + + + + + + + + + + Page Stream is the sequence of page elements + + + + + + + + + + + + text + + + + + Table + + + + + Barcode + + + + + Picture + + + + + + + + + + + Table captions + + + + + + + Table cells + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Picture captions + + + + + + + + + + + + + + + + + + + + + + + Text Stream is the sequence of paragraphs and/or blocks + + + + + + + + + + + + + + + + + Id of page element + + + + + + + + + + + + + + + + + + + + + + + + + + + Block region, the set of rectangles + + + + + + + + + + + + + + + + + + Recognized block text, presents if blockType attribute is Text + + + + + The set of table rows, presents if blockType attribute is Table + + + + + 
Separators box block, presents if blockType attribute is SeparatorsBox + + + + + + + + + + + Separator block, presents if blockType attribute is Separator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Text paragraph + + + + + + + + + + + + + + + + + + + + + + + + + Table cell + + + + + + Cell text + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Text paragraph line + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Group of characters with uniform formatting + + + + + + + + + + + + + + + Attributes of characters are alternated with word's recognition variants. The variants of recognition of the word are written before the word + + + + Attributes of single character + + + + + Variants of recognition of the next word + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Starting point of the separator + + + + + Ending point of the separator + + + + + + + + + + + + + + + + + + + + + + diff --git a/tests/resources/alto-4-3.xsd b/tests/resources/alto-4-3.xsd index f02195b03..cb8daf94f 100644 --- a/tests/resources/alto-4-3.xsd +++ b/tests/resources/alto-4-3.xsd @@ -1,1248 +1,1248 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ALTO (analyzed layout and text object) stores layout information and - OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. - ALTO is a standardized XML format to store layout and content information. - It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), - where METS provides metadata and structural information while ALTO contains content and physical information. - - - - - - - - Describes general settings of the alto file like measurement units and metadata - - - - - Styles define properties of layout elements. 
A style defined in a parent element is used as default style for all related children elements. - - - - - - Tag define properties of additional characteristic. The tags are referenced from related content element on Block or String element by attribute TAGREF via the tag ID. - This container element contains the individual elements for LayoutTags, StructureTags, RoleTags, NamedEntityTags and OtherTags - - - - - - - Describes alternative hierarchical orderings of the page (i.e. total orders over its segments, for linear text flow), - in addition to the explicit flat reading order defined by @IDNEXT on the block level, - and the implicit flat reading order implied by the segment element ordering. - - - - - - The root layout element. - - - - - - Schema version of the ALTO file. - - - - - - - - - - Element deprecated. 'Processing' should be used instead. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - There are following variation of tag types available: - LayoutTag – criteria about arrangement or graphical appearance - StructureTag – criteria about grouping or formation - RoleTag – criteria about function or mission - NamedEntityTag – criteria about assignment of terms to their relationship / meaning (NER) - OtherTag – criteria about any other characteristic not listed above, the TYPE attribute is intended to be used for classification within those. - - - - - - - - - - - - - - - - Defines one or more reading orders within the - page. Groups may be either unordered or ordered and can - contain other groups, e.g. a page containing - unrelated texts that are ordered individually - would be encoded as an UnorderedGroup containing - multiple OrderedGroups. The granularity of - elements can vary inside groups. - - - - - - - - - - - - - A reference to an element such as a block, TextLine, String, or Glyph. - - - - - - - A link to the referenced element. Valid - target elements are any block type, - TextLine, String, or Glyph. 
- - - - - - - Optionally annotates the role of the - referenced element in the reading order - with one or more tags. Examples could be - interlinear additions or marginalia. - - - - - - - - A group containing ordered elements (i.e. the sequence of OrderedGroup, UnorderedGroup or ElementRef subelements is ordered). - - - - - - - - - - - - - - Optionally annotates the role of the - group in the reading order - with one or more tags. Examples could be - distinguishing - parallel texts or apparatus criticus and - main text. - - - - - - - A link to the referenced element. Valid - target elements are any block type, - TextLine, or String. - - - - - - - - A group containing unordered elements (i.e. the sequence of OrderedGroup, UnorderedGroup or ElementRef subelements is arbitrary). - - - - - - - - - - - - - - - A link to the referenced element. Valid - target elements are any block type, - TextLine, or String. - - - - - - - Gives brief information about original page quality - - - - - - - - - - - - - - Gives more details about the original page quality, since QUALITY attribute gives only brief and restrictive information - - - - - - Position of the page. Could be lefthanded, righthanded, cover, foldout or single if it has no special position. - - - - - - - - - - - - Page Confidence: Confidence level of the ocr for this page. A value between 0 (unsure) and 1 (sure). - - - - - - - - - One page of a book or journal. - - - - - The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. - - - - - The area between the printspace and the left border of a page. May contain margin notes. - - - - - The area between the printspace and the right border of a page. May contain margin notes. - - - - - The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. - - - - - Rectangle covering the printed area of a page. 
Page number and running title are not part of the print space. - - - - - - - Any user-defined class like title page. - - - - - - - - - The number of the page within the document. - - - - - The page number that is printed on the page. - - - - - - - - A link to the processing description that has been used for this page. - - - - - Estimated percentage of OCR Accuracy in range from 0 to 100 - - - - - - - - - - - - - A text style defines font properties of text. - - - - - - - A paragraph style defines formatting properties of text blocks. - - - - - Indicates the alignement of the paragraph. Could be left, right, center or justify. - - - - - - - - - - - - - Left indent of the paragraph in relation to the column. - - - - - Right indent of the paragraph in relation to the column. - - - - - Line spacing between two lines of the paragraph. Measurement calculated from baseline to baseline. - - - - - Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right. - - - - - - - - - - - - - - - - - - - - - - - - - - - Group of available block types - - - - - A block of text. - - - - - A picture or image. - - - - - A graphic used to separate blocks. Usually a line or rectangle. - - - - - A block that consists of other blocks - - - - - - - Base type for any kind of block on the page. - - - - - - - - - - - - - - - Tells the rotation of e.g. text or illustration within the block. The value is in degree counterclockwise. - - - - - The next block in reading order of the page (if ReadingOrder is not specified, and elements are not in order). - - - - - Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). - - - - - - - A white space. - - - - - - - - - - Type of the substitution (if any). 
- - - - - - - - - - - - - - - Word Confidence: Confidence level of the ocr for this string. A value between 0 (unsure) and 1 (sure). - - - - - - - - - - Any alternative for the word. - Alternative can outline a variant of writing by new typing / spelling rules, typically manually done or by dictionary replacements. - The above sample is an old composed character "Æ" of ancient time, which is replaced now by "Ä". - As variant are meant alternatives of the real printed content which are options outlined by the text recognition process. - Similar sample: "Straße" vs. "Strasse". Such alternatives are not coming from text recognition. - - - - - - - Identifies the purpose of the alternative. - - - - - - - - A sequence of chars. Strings are separated by white spaces or hyphenation chars. - - - - - - - - - - - - - - - - - - - - Content of the substitution. - - - - - - Confidence level of each character in that string. A list of numbers, one number between 0 (sure) and 9 (unsure) for each character. - - - - - Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). - - - - - Attribute to record language of the string. The language should be recorded at the highest level possible. - - - - - - A region on a page - - - - - - - - - - - - - - - - - - A list of points - - - - - - Describes the bounding shape of a block, if it is not rectangular. - - - - - - - - - - Describes the inline base direction and line orientation of a line or of all lines inside a text block. - The meaning of these terms is defined by the W3C writing modes document: - These values should correspond to the base direction set in the BiDi algorithm to the respective elements during Unicode encoding. A value of "ttb" (top-to-bottom) implies a base direction of left-to-right, a value of "btt" (bottom-to-top) a base direction of right-to-left. - - - - - - - - - - - A polygon shape. 
- - - - - - An ellipse shape. HPOS and VPOS describe the center of the ellipse. - HLENGTH and VLENGTH are the width and height of the described ellipse. - The attribute ROTATION tells the rotation of the e.g. text or - illustration within the block. The value is in degrees counterclockwise. - - - - - - - - - - A circle shape. HPOS and VPOS describe the center of the circle. - - - - - - - - Formatting attributes. Note that these attributes are assumed to be inherited from ancestor elements of the document hierarchy. - - - - The font name. - - - - - - - The font size, in points (1/72 of an inch). - - - - - Font color as RGB value - - - - - - - Serif or Sans-Serif - - - - - - - - - fixed or proportional - - - - - - - - - - - All measurement values inside the alto file are related to - this unit, except the font size. - Coordinates as being used in HPOS and VPOS are absolute coordinates referring to the upper-left corner of a page. - The upper left corner of the page is defined as coordinate (0/0). - - values meaning: - mm10: 1/10th of millimeter - inch1200: 1/1200th of inch - pixel: 1 pixel - - The values for pixel will be related to the resolution of the image based - on which the layout is described. Incase the original image is not known - the scaling factor can be calculated based on total width and height of - the image and the according information of the PAGE element. - - - - - - - - - - - Information to identify the image file from which the OCR text was created. - - - - - - - - - - - - - - - - - - - A unique identifier for the image file. This is drawn from MIX. - This identifier must be unique within the local system. - To facilitate file sharing or interoperability with other systems, fileIdentifierLocation may be added to designate the system or application where the identifier is unique. - - - - - - A location qualifier, i.e., a namespace. - - - - - - - - - - - - - - A unique identifier for the document. 
- This identifier must be unique within the local system. - To facilitate file sharing or interoperability with other systems, documentIdentifierLocation may be added to designate the system or application where the identifier is unique. - - - - - - A location qualifier, i.e., a namespace. - - - - - - - - Deprecated. processingStepType should be used instead. - Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. Where possible, this draws from MIX's change history. - - - - - - - - - - Description of the processing step. - - - - - Classification of the category of operation, how the file was created, including generation, modification, preprocessing, postprocessing or any other steps. - - - - - Date or DateTime the image was processed. - - - - - Identifies the organizationlevel producer(s) of the processed image. - - - - - An ordinal listing of the image processing steps performed. For example, "image despeckling." - - - - - A description of any setting of the processing application. For example, for a multi-engine OCR application this might include the engines which were used. Ideally, this description should be adequate so that someone else using the same application can produce identical results. - - - - - - - - - - - - - - - - - - - - - Information about a software application. Where applicable, the preferred method for determining this information is by selecting Help -- About. - - - - - The name of the organization or company that created the application. - - - - - The name of the application. - - - - - The version of the application. - - - - - A description of any important characteristics of the application, especially for non-commercial applications. For example, if a non-commercial application is built using commercial components, e.g., an OCR engine SDK. Those components should be mentioned here. 
- - - - - - - - - - List of any combination of font styles - - - - - - - - - - - - - - - - - - - - - - - A block that consists of other blocks - - - - - - - - - A user defined string to identify the type of composed block (e.g. table, advertisement, ...) - - - - - An ID to link to an image which contains only the composed block. The ID and the file link is defined in the related METS file. - - - - - - - - A picture or image. - - - - - - A user defined string to identify the type of illustration like photo, map, drawing, chart, ... - - - - - A link to an image which contains only the illustration. - - - - - - - - A graphic used to separate blocks. Usually a line or rectangle. - - - - - - - - A block of text. - - - - - - - A single line of text. - - - - - - - - - - - - - A hyphenation char. Can appear only at the end of a line. - - - - - - - - - - - - - - - - - - - - - Pixel coordinates based on the left-hand top corner of an image which define a polyline on which a line of text rests. - - - - - Attribute to record language of the textline. - - - - - Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). - - - - - Indicates the inline base direction of this TextLine. Overrides the value on elements higher in the hierarchy. - - - - - - - - Attribute deprecated. LANG should be used instead. - - - - - Attribute to record language of the textblock. - - - - - Indicates the inline base direction of the TextBlock. - - - - - - - - - - - The xml data wrapper element XmlData is used to contain XML encoded metadata. - The content of an XmlData element can be in any namespace or in no namespace. - As permitted by the XML Schema Standard, the processContents attribute value for the - metadata in an XmlData is set to “lax”. 
Therefore, if the source schema and its location are - identified by means of an XML schemaLocation attribute, then an XML processor will validate - the elements for which it can find declarations. If a source schema is not identified, or cannot be - found at the specified schemaLocation, then an XML validator will check for well-formedness, - but otherwise skip over the elements appearing in the XmlData element. - - - - - - - - - - - - - Type can be used to classify and group the information within each tag element type. - - - - - Content / information value of the tag. - - - - - Description text for tag information for clarification. - - - - - Any URI for authority or description relevant information. - - - - - - - Modern OCR software stores information on glyph level. A glyph is essentially a character or ligature. - Accordingly the value for the glyph element will be defined as follows: - Pre-composed representation = base + combining character(s) (decomposed representation) - See http://www.fileformat.info/info/unicode/char/0101/index.htm - "U+0101" = (U+0061) + (U+0304) - "combining characters" ("base characters" in combination with non-spacing marks or characters which are combined to one) are represented as one "glyph", e.g. áàâ. - - Each glyph has its own coordinate information and must be separately addressable as a distinct object. - Correction and verification processes can be carried out for individual characters. - - Post-OCR analysis of the text as well as adaptive OCR algorithm must be able to record information on glyph level. - In order to reproduce the decision of the OCR software, optional characters must be recorded. These are called variants. - The OCR software evaluates each variant and picks the one with the highest confidence score as the glyph. - The confidence score expresses how confident the OCR software is that a single glyph had been recognized correctly. - - The glyph elements are in order of the word. 
Each glyph need to be recorded to built up the whole word sequence. - - The glyph’s CONTENT attribute is no replacement for the string’s CONTENT attribute. - Due to post-processing steps such as correction the values of both attributes may be inconsistent. - - - - - - - - - - - CONTENT contains the precomposed representation (combining character) of the character from the parent String element. - The sequence position of the Gylph element matches the position of the character in the String. - - - - - - - - - - - - - This GC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the glyph where 1 is certain. - This attribute is optional. If it is not available, the default value for the glyph is “0”. - The GC attribute semantic is the same as the WC attribute on the String element and VC on Variant element. - - - - - - - - - - - - - - - - - - Alternative (combined) character for the glyph, outlined by OCR engine or similar recognition processes. - In case the variant are two (combining) characters, two characters are outlined in one Variant element. - E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". - Details for different use-cases see on the samples on GitHub. - - - - - - Each Variant represents an option for the glyph that the OCR software detected as possible alternatives. - In case the variant are two (combining) characters, two characters are outlined in one Variant element. - E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". - Details for different use-cases see on the samples on GitHub. - - - - - - - - - - - - - This VC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the variant where is 1 is certain. - This attribute is optional. If it is not available, the default value for the variant is “0”. - The VC attribute semantic is the same as the GC attribute on the Glyph element. 
- - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ALTO (analyzed layout and text object) stores layout information and + OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. + ALTO is a standardized XML format to store layout and content information. + It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), + where METS provides metadata and structural information while ALTO contains content and physical information. + + + + + + + + Describes general settings of the alto file like measurement units and metadata + + + + + Styles define properties of layout elements. A style defined in a parent element is used as default style for all related children elements. + + + + + + Tag define properties of additional characteristic. The tags are referenced from related content element on Block or String element by attribute TAGREF via the tag ID. + This container element contains the individual elements for LayoutTags, StructureTags, RoleTags, NamedEntityTags and OtherTags + + + + + + + Describes alternative hierarchical orderings of the page (i.e. total orders over its segments, for linear text flow), + in addition to the explicit flat reading order defined by @IDNEXT on the block level, + and the implicit flat reading order implied by the segment element ordering. + + + + + + The root layout element. + + + + + + Schema version of the ALTO file. + + + + + + + + + + Element deprecated. 'Processing' should be used instead. 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + There are following variation of tag types available: + LayoutTag – criteria about arrangement or graphical appearance + StructureTag – criteria about grouping or formation + RoleTag – criteria about function or mission + NamedEntityTag – criteria about assignment of terms to their relationship / meaning (NER) + OtherTag – criteria about any other characteristic not listed above, the TYPE attribute is intended to be used for classification within those. + + + + + + + + + + + + + + + + Defines one or more reading orders within the + page. Groups may be either unordered or ordered and can + contain other groups, e.g. a page containing + unrelated texts that are ordered individually + would be encoded as an UnorderedGroup containing + multiple OrderedGroups. The granularity of + elements can vary inside groups. + + + + + + + + + + + + + A reference to an element such as a block, TextLine, String, or Glyph. + + + + + + + A link to the referenced element. Valid + target elements are any block type, + TextLine, String, or Glyph. + + + + + + + Optionally annotates the role of the + referenced element in the reading order + with one or more tags. Examples could be + interlinear additions or marginalia. + + + + + + + + A group containing ordered elements (i.e. the sequence of OrderedGroup, UnorderedGroup or ElementRef subelements is ordered). + + + + + + + + + + + + + + Optionally annotates the role of the + group in the reading order + with one or more tags. Examples could be + distinguishing + parallel texts or apparatus criticus and + main text. + + + + + + + A link to the referenced element. Valid + target elements are any block type, + TextLine, or String. + + + + + + + + A group containing unordered elements (i.e. the sequence of OrderedGroup, UnorderedGroup or ElementRef subelements is arbitrary). + + + + + + + + + + + + + + + A link to the referenced element. 
Valid + target elements are any block type, + TextLine, or String. + + + + + + + Gives brief information about original page quality + + + + + + + + + + + + + + Gives more details about the original page quality, since QUALITY attribute gives only brief and restrictive information + + + + + + Position of the page. Could be lefthanded, righthanded, cover, foldout or single if it has no special position. + + + + + + + + + + + + Page Confidence: Confidence level of the ocr for this page. A value between 0 (unsure) and 1 (sure). + + + + + + + + + One page of a book or journal. + + + + + The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. + + + + + The area between the printspace and the left border of a page. May contain margin notes. + + + + + The area between the printspace and the right border of a page. May contain margin notes. + + + + + The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. + + + + + Rectangle covering the printed area of a page. Page number and running title are not part of the print space. + + + + + + + Any user-defined class like title page. + + + + + + + + + The number of the page within the document. + + + + + The page number that is printed on the page. + + + + + + + + A link to the processing description that has been used for this page. + + + + + Estimated percentage of OCR Accuracy in range from 0 to 100 + + + + + + + + + + + + + A text style defines font properties of text. + + + + + + + A paragraph style defines formatting properties of text blocks. + + + + + Indicates the alignement of the paragraph. Could be left, right, center or justify. + + + + + + + + + + + + + Left indent of the paragraph in relation to the column. + + + + + Right indent of the paragraph in relation to the column. + + + + + Line spacing between two lines of the paragraph. 
Measurement calculated from baseline to baseline. + + + + + Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right. + + + + + + + + + + + + + + + + + + + + + + + + + + + Group of available block types + + + + + A block of text. + + + + + A picture or image. + + + + + A graphic used to separate blocks. Usually a line or rectangle. + + + + + A block that consists of other blocks + + + + + + + Base type for any kind of block on the page. + + + + + + + + + + + + + + + Tells the rotation of e.g. text or illustration within the block. The value is in degree counterclockwise. + + + + + The next block in reading order of the page (if ReadingOrder is not specified, and elements are not in order). + + + + + Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). + + + + + + + A white space. + + + + + + + + + + Type of the substitution (if any). + + + + + + + + + + + + + + + Word Confidence: Confidence level of the ocr for this string. A value between 0 (unsure) and 1 (sure). + + + + + + + + + + Any alternative for the word. + Alternative can outline a variant of writing by new typing / spelling rules, typically manually done or by dictionary replacements. + The above sample is an old composed character "Æ" of ancient time, which is replaced now by "Ä". + As variant are meant alternatives of the real printed content which are options outlined by the text recognition process. + Similar sample: "Straße" vs. "Strasse". Such alternatives are not coming from text recognition. + + + + + + + Identifies the purpose of the alternative. + + + + + + + + A sequence of chars. Strings are separated by white spaces or hyphenation chars. + + + + + + + + + + + + + + + + + + + + Content of the substitution. 
+ + + + + + Confidence level of each character in that string. A list of numbers, one number between 0 (sure) and 9 (unsure) for each character. + + + + + Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). + + + + + Attribute to record language of the string. The language should be recorded at the highest level possible. + + + + + + A region on a page + + + + + + + + + + + + + + + + + + A list of points + + + + + + Describes the bounding shape of a block, if it is not rectangular. + + + + + + + + + + Describes the inline base direction and line orientation of a line or of all lines inside a text block. + The meaning of these terms is defined by the W3C writing modes document: + These values should correspond to the base direction set in the BiDi algorithm to the respective elements during Unicode encoding. A value of "ttb" (top-to-bottom) implies a base direction of left-to-right, a value of "btt" (bottom-to-top) a base direction of right-to-left. + + + + + + + + + + + A polygon shape. + + + + + + An ellipse shape. HPOS and VPOS describe the center of the ellipse. + HLENGTH and VLENGTH are the width and height of the described ellipse. + The attribute ROTATION tells the rotation of the e.g. text or + illustration within the block. The value is in degrees counterclockwise. + + + + + + + + + + A circle shape. HPOS and VPOS describe the center of the circle. + + + + + + + + Formatting attributes. Note that these attributes are assumed to be inherited from ancestor elements of the document hierarchy. + + + + The font name. + + + + + + + The font size, in points (1/72 of an inch). + + + + + Font color as RGB value + + + + + + + Serif or Sans-Serif + + + + + + + + + fixed or proportional + + + + + + + + + + + All measurement values inside the alto file are related to + this unit, except the font size. 
+ Coordinates as being used in HPOS and VPOS are absolute coordinates referring to the upper-left corner of a page. + The upper left corner of the page is defined as coordinate (0/0). + + values meaning: + mm10: 1/10th of millimeter + inch1200: 1/1200th of inch + pixel: 1 pixel + + The values for pixel will be related to the resolution of the image based + on which the layout is described. Incase the original image is not known + the scaling factor can be calculated based on total width and height of + the image and the according information of the PAGE element. + + + + + + + + + + + Information to identify the image file from which the OCR text was created. + + + + + + + + + + + + + + + + + + + A unique identifier for the image file. This is drawn from MIX. + This identifier must be unique within the local system. + To facilitate file sharing or interoperability with other systems, fileIdentifierLocation may be added to designate the system or application where the identifier is unique. + + + + + + A location qualifier, i.e., a namespace. + + + + + + + + + + + + + + A unique identifier for the document. + This identifier must be unique within the local system. + To facilitate file sharing or interoperability with other systems, documentIdentifierLocation may be added to designate the system or application where the identifier is unique. + + + + + + A location qualifier, i.e., a namespace. + + + + + + + + Deprecated. processingStepType should be used instead. + Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. Where possible, this draws from MIX's change history. + + + + + + + + + + Description of the processing step. + + + + + Classification of the category of operation, how the file was created, including generation, modification, preprocessing, postprocessing or any other steps. + + + + + Date or DateTime the image was processed. 
+ + + + + Identifies the organizationlevel producer(s) of the processed image. + + + + + An ordinal listing of the image processing steps performed. For example, "image despeckling." + + + + + A description of any setting of the processing application. For example, for a multi-engine OCR application this might include the engines which were used. Ideally, this description should be adequate so that someone else using the same application can produce identical results. + + + + + + + + + + + + + + + + + + + + + Information about a software application. Where applicable, the preferred method for determining this information is by selecting Help -- About. + + + + + The name of the organization or company that created the application. + + + + + The name of the application. + + + + + The version of the application. + + + + + A description of any important characteristics of the application, especially for non-commercial applications. For example, if a non-commercial application is built using commercial components, e.g., an OCR engine SDK. Those components should be mentioned here. + + + + + + + + + + List of any combination of font styles + + + + + + + + + + + + + + + + + + + + + + + A block that consists of other blocks + + + + + + + + + A user defined string to identify the type of composed block (e.g. table, advertisement, ...) + + + + + An ID to link to an image which contains only the composed block. The ID and the file link is defined in the related METS file. + + + + + + + + A picture or image. + + + + + + A user defined string to identify the type of illustration like photo, map, drawing, chart, ... + + + + + A link to an image which contains only the illustration. + + + + + + + + A graphic used to separate blocks. Usually a line or rectangle. + + + + + + + + A block of text. + + + + + + + A single line of text. + + + + + + + + + + + + + A hyphenation char. Can appear only at the end of a line. 
+ + + + + + + + + + + + + + + + + + + + + Pixel coordinates based on the left-hand top corner of an image which define a polyline on which a line of text rests. + + + + + Attribute to record language of the textline. + + + + + Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). + + + + + Indicates the inline base direction of this TextLine. Overrides the value on elements higher in the hierarchy. + + + + + + + + Attribute deprecated. LANG should be used instead. + + + + + Attribute to record language of the textblock. + + + + + Indicates the inline base direction of the TextBlock. + + + + + + + + + + + The xml data wrapper element XmlData is used to contain XML encoded metadata. + The content of an XmlData element can be in any namespace or in no namespace. + As permitted by the XML Schema Standard, the processContents attribute value for the + metadata in an XmlData is set to “lax”. Therefore, if the source schema and its location are + identified by means of an XML schemaLocation attribute, then an XML processor will validate + the elements for which it can find declarations. If a source schema is not identified, or cannot be + found at the specified schemaLocation, then an XML validator will check for well-formedness, + but otherwise skip over the elements appearing in the XmlData element. + + + + + + + + + + + + + Type can be used to classify and group the information within each tag element type. + + + + + Content / information value of the tag. + + + + + Description text for tag information for clarification. + + + + + Any URI for authority or description relevant information. + + + + + + + Modern OCR software stores information on glyph level. A glyph is essentially a character or ligature. 
+ Accordingly the value for the glyph element will be defined as follows: + Pre-composed representation = base + combining character(s) (decomposed representation) + See http://www.fileformat.info/info/unicode/char/0101/index.htm + "U+0101" = (U+0061) + (U+0304) + "combining characters" ("base characters" in combination with non-spacing marks or characters which are combined to one) are represented as one "glyph", e.g. áàâ. + + Each glyph has its own coordinate information and must be separately addressable as a distinct object. + Correction and verification processes can be carried out for individual characters. + + Post-OCR analysis of the text as well as adaptive OCR algorithm must be able to record information on glyph level. + In order to reproduce the decision of the OCR software, optional characters must be recorded. These are called variants. + The OCR software evaluates each variant and picks the one with the highest confidence score as the glyph. + The confidence score expresses how confident the OCR software is that a single glyph had been recognized correctly. + + The glyph elements are in order of the word. Each glyph need to be recorded to built up the whole word sequence. + + The glyph’s CONTENT attribute is no replacement for the string’s CONTENT attribute. + Due to post-processing steps such as correction the values of both attributes may be inconsistent. + + + + + + + + + + + CONTENT contains the precomposed representation (combining character) of the character from the parent String element. + The sequence position of the Gylph element matches the position of the character in the String. + + + + + + + + + + + + + This GC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the glyph where 1 is certain. + This attribute is optional. If it is not available, the default value for the glyph is “0”. + The GC attribute semantic is the same as the WC attribute on the String element and VC on Variant element. 
+ + + + + + + + + + + + + + + + + + Alternative (combined) character for the glyph, outlined by OCR engine or similar recognition processes. + In case the variant are two (combining) characters, two characters are outlined in one Variant element. + E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". + Details for different use-cases see on the samples on GitHub. + + + + + + Each Variant represents an option for the glyph that the OCR software detected as possible alternatives. + In case the variant are two (combining) characters, two characters are outlined in one Variant element. + E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". + Details for different use-cases see on the samples on GitHub. + + + + + + + + + + + + + This VC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the variant where is 1 is certain. + This attribute is optional. If it is not available, the default value for the variant is “0”. + The VC attribute semantic is the same as the GC attribute on the Glyph element. 
+ + + + + + + + + + + diff --git a/tests/resources/bsb00084914_00007.xml b/tests/resources/bsb00084914_00007.xml index 311751ad1..538e4a107 100644 --- a/tests/resources/bsb00084914_00007.xml +++ b/tests/resources/bsb00084914_00007.xml @@ -6,10 +6,10 @@ pixel bsb00084914_00007.jpg - + - + @@ -154,7 +154,7 @@ VPOS="0" WIDTH="3177" HEIGHT="4308"> - + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - - + + - + - + - - + + - + - + - - + + - + - + - - + + - - + + - + - + - - + + - + - - + + diff --git a/tests/resources/cPAS-2000.xml b/tests/resources/cPAS-2000.xml index d9f844121..adc36de41 100644 --- a/tests/resources/cPAS-2000.xml +++ b/tests/resources/cPAS-2000.xml @@ -1,410 +1,410 @@ - - - - TRP - 2018-12-24T11:28:19+07:00 - 2019-02-05T09:16:48Z - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + + TRP + 2018-12-24T11:28:19+07:00 + 2019-02-05T09:16:48Z + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/tests/resources/merge_tests/0014.xml b/tests/resources/merge_tests/0014.xml index 1752f7117..c801094b5 100644 --- a/tests/resources/merge_tests/0014.xml +++ b/tests/resources/merge_tests/0014.xml @@ -6,15 +6,15 @@ pixel 0014.jpg - + - + - + - + - - + + - - + + - - + + - + - + diff --git a/tests/resources/pagecontent.xsd b/tests/resources/pagecontent.xsd index 5874131c8..d45d51680 100644 --- a/tests/resources/pagecontent.xsd +++ b/tests/resources/pagecontent.xsd @@ -1,2644 +1,2644 @@ - - - - - - Page Content - Ground Truth and Storage - - - - - - - - - - - - - - - - The timestamp has to be in UTC (Coordinated - Universal Time) and not local time. - - - - - - - The timestamp has to be in UTC - (Coordinated Universal Time) - and not local time. - - - - - - - - - - - - - External reference of any kind - - - - - - - - Semantic labels / tags - - - - - - - Type of metadata (e.g. author) - - - - - - - - - - - - - - - E.g. imagePhotometricInterpretation - - - - - - E.g. RGB - - - - - - - - - - A semantic label / tag - - - - - - - - Reference to external model / ontology / schema - - - - - - - E.g. an RDF resource identifier - (to be used as subject or object of an RDF triple) - - - - - - - Prefix for all labels (e.g. first part of an URI) - - - - - - - - Semantic label - - - - - The label / tag (e.g. 'person'). - Can be an RDF resource identifier - (e.g. object of an RDF triple). 
- - - - - - - Additional information on the label - (e.g. 'YYYY-mm-dd' for a date label). - Can be used as predicate of an RDF triple. - - - - - - - - - - - - Alternative document page images - (e.g. black-and-white). - - - - - - - - - - Order of blocks within the page. - - - - - - Unassigned regions are considered to be in the - (virtual) default layer which is to be treated - as below any other layers. - - - - - - - - Default text style - - - - - - - Semantic labels / tags - - - - - - - - - - - - - - - - - - - - - - - - Contains the image file name including the file extension. - - - - - - Specifies the width of the image. - - - - - Specifies the height of the image. - - - - - Specifies the image resolution in width. - - - - - Specifies the image resolution in height. - - - - - - Specifies the unit of the resolution information - referring to a standardised unit of measurement - (pixels per inch, pixels per centimeter or other). - - - - - - - - - - - - - For generic use - - - - - - The angle the rectangle encapsulating the page - (or its Border) has to be rotated in clockwise direction - in order to correct the present skew - (negative values indicate anti-clockwise rotation). - (The rotated image can be further referenced - via “AlternativeImage”.) - Range: -179.999,180 - - - - - - - The type of the page within the document - (e.g. cover page). - - - - - - - The primary language used in the page - (lower-level definitions override the page-level definition). - - - - - - - The secondary language used in the page - (lower-level definitions override the page-level definition). - - - - - - - The primary script used in the page - (lower-level definitions override the page-level definition). - - - - - - - The secondary script used in the page - (lower-level definitions override the page-level definition). 
- - - - - - - The direction in which text within lines - should be read (order of words and characters), - in addition to “textLineOrder” - (lower-level definitions override the page-level definition). - - - - - - - The order of text lines within a block, - in addition to “readingDirection” - (lower-level definitions override the page-level definition). - - - - - - Confidence value for whole page (between 0 and 1) - - - - - - - Pure text is represented as a text region. This includes - drop capitals, but practically ornate text may be - considered as a graphic. - - - - - - - - - - - - - The angle the rectangle encapsulating the region - has to be rotated in clockwise direction - in order to correct the present skew - (negative values indicate anti-clockwise rotation). - (The rotated image can be further referenced - via “AlternativeImage”.) - Range: -179.999,180 - - - - - - - The nature of the text in the region - - - - - - - The degree of space in points between the lines of - text (line spacing) - - - - - - - The direction in which text within lines - should be read (order of words and characters), - in addition to “textLineOrder”. - - - - - - - The order of text lines within the block, - in addition to “readingDirection”. - - - - - - - The angle the baseline of text within the region - has to be rotated (relative to the rectangle - encapsulating the region) in clockwise direction - in order to correct the present skew, - in addition to “orientation” - (negative values indicate anti-clockwise rotation). - Range: -179.999,180 - - - - - - - Defines whether a region of text is indented or not - - - - - - Text align - - - - - - The primary language used in the region - - - - - - - The secondary language used in the region - - - - - - - The primary script used in the region - - - - - - - The secondary script used in the region - - - - - - - - - - - - Polygon outline of the element as a path of points. 
- No points may lie outside the outline of its parent, - which in the case of Border is the bounding rectangle - of the root image. Paths are closed by convention, - i.e. the last point logically connects with the first - (and at least 3 points are required to span an area). - Paths must be planar (i.e. must not self-intersect). - - - - - - Confidence value (between 0 and 1) - - - - - - - - - Alternative text line images (e.g. - black-and-white) - - - - - - - - Multiple connected points that mark the baseline - of the glyphs - - - - - - - - - - - - - - Semantic labels / tags - - - - - - - - Overrides primaryLanguage attribute of parent text - region - - - - - - - The primary script used in the text line - - - - - - - The secondary script used in the text line - - - - - - - The direction in which text within the line - should be read (order of words and characters). - - - - - - - Overrides the production attribute of the parent - text region - - - - - - For generic use - - - - - - - Position (order number) of this text line within the - parent text region. - - - - - - - - - - Alternative word images (e.g. - black-and-white) - - - - - - - - - - - - - - - Semantic labels / tags - - - - - - - - Overrides primaryLanguage attribute of parent line - and/or text region - - - - - - - The primary script used in the word - - - - - - - The secondary script used in the word - - - - - - - The direction in which text within the word - should be read (order of characters). - - - - - - - Overrides the production attribute of the parent - text line and/or text region. - - - - - - For generic use - - - - - - - - - - Alternative glyph images (e.g. - black-and-white) - - - - - - - - Container for graphemes, grapheme groups and - non-printing characters - - - - - - - - - - - - Semantic labels / tags - - - - - - - - - - The script used for the glyph - - - - - - - Overrides the production attribute of the parent - word / text line / text region. 
- - - - - - For generic use - - - - - - - - - - Text in a "simple" form (ASCII or extended ASCII - as mostly used for typing). I.e. no use of - special characters for ligatures (should be - stored as two separate characters) etc. - - - - - - - Correct encoding of the original, always using - the corresponding Unicode code point. I.e. - ligatures have to be represented as one - character etc. - - - - - - - - Used for sort order in case multiple TextEquivs are defined. - The text content with the lowest index should be interpreted - as the main text content. - - - - - - - - - - - OCR confidence value (between 0 and 1) - - - - - - Type of text content (is it free text or a number, for instance). - This is only a descriptive attribute, the text type - is not checked during XML validation. - - - - - - - Refinement for dataType attribute. Can be a regular expression, for instance. - - - - - - - - - - An image is considered to be more intricate and complex - than a graphic. These can be photos or drawings. - - - - - - - - The angle the rectangle encapsulating a region - has to be rotated in clockwise direction - in order to correct the present skew - (negative values indicate anti-clockwise rotation). - Range: -179.999,180 - - - - - - - The colour bit depth required for the region - - - - - - - The background colour of the region - - - - - - - Specifies whether the region also contains - text - - - - - - - - - - A line drawing is a single colour illustration without - solid areas. - - - - - - - - The angle the rectangle encapsulating a region - has to be rotated in clockwise direction - in order to correct the present skew - (negative values indicate anti-clockwise rotation). 
- Range: -179.999,180 - - - - - - - The pen (foreground) colour of the region - - - - - - - The background colour of the region - - - - - - - Specifies whether the region also contains - text - - - - - - - - - - Regions containing simple graphics, such as a company - logo, should be marked as graphic regions. - - - - - - - - The angle the rectangle encapsulating a region - has to be rotated in clockwise direction - in order to correct the present skew - (negative values indicate anti-clockwise rotation). - Range: -179.999,180 - - - - - - - The type of graphic in the region - - - - - - - An approximation of the number of colours - used in the region - - - - - - - Specifies whether the region also contains - text. - - - - - - - - - - Tabular data in any form is represented with a table - region. Rows and columns may or may not have separator - lines; these lines are not separator regions. - - - - - - - - Table grid (visible or virtual grid lines) - - - - - - - The angle the rectangle encapsulating a region - has to be rotated in clockwise direction - in order to correct the present skew - (negative values indicate anti-clockwise rotation). - Range: -179.999,180 - - - - - - - The number of rows present in the table - - - - - - - The number of columns present in the table - - - - - - - The colour of the lines used in the region - - - - - - - The background colour of the region - - - - - - - Specifies the presence of line separators - - - - - - - Specifies whether the region also contains - text - - - - - - - - - - Matrix of grid points defining the table grid on the page. - - - - - - - One row in the grid point matrix. - Points with x,y coordinates. - (note: for a table with n table rows there should be n+1 grid rows) - - - - - - - - Points with x,y coordinates. - - - - - The grid row index - - - - - - - - - Regions containing charts or graphs of any type, should - be marked as chart regions. 
- - - - - - - - The angle the rectangle encapsulating a region - has to be rotated in clockwise direction - in order to correct the present skew - (negative values indicate anti-clockwise rotation). - Range: -179.999,180 - - - - - - - The type of chart in the region - - - - - - - An approximation of the number of colours - used in the region - - - - - - - The background colour of the region - - - - - - - Specifies whether the region also contains - text - - - - - - - - - - Separators are lines that lie between columns and - paragraphs and can be used to logically separate - different articles from each other. - - - - - - - - The angle the rectangle encapsulating a region - has to be rotated in clockwise direction - in order to correct the present skew - (negative values indicate anti-clockwise rotation). - Range: -179.999,180 - - - - - - - The colour of the separator - - - - - - - - - - Regions containing equations and mathematical symbols - should be marked as maths regions. - - - - - - - - The angle the rectangle encapsulating a region - has to be rotated in clockwise direction - in order to correct the present skew - (negative values indicate anti-clockwise rotation). - Range: -179.999,180 - - - - - - - The background colour of the region - - - - - - - - - - Regions containing chemical formulas. - - - - - - - - The angle the rectangle encapsulating a - region has to be rotated in clockwise - direction in order to correct the present - skew (negative values indicate - anti-clockwise rotation). Range: - -179.999,180 - - - - - - - The background colour of the region - - - - - - - - - - Regions containing maps. - - - - - - - - The angle the rectangle encapsulating a - region has to be rotated in clockwise - direction in order to correct the present - skew (negative values indicate - anti-clockwise rotation). Range: - -179.999,180 - - - - - - - - - - Regions containing musical notations. 
- - - - - - - - The angle the rectangle encapsulating a region - has to be rotated in clockwise direction - in order to correct the present skew - (negative values indicate anti-clockwise rotation). - Range: -179.999,180 - - - - - - - The background colour of the region - - - - - - - - - - Regions containing advertisements. - - - - - - - - The angle the rectangle encapsulating a region - has to be rotated in clockwise direction - in order to correct the present skew - (negative values indicate anti-clockwise rotation). - Range: -179.999,180 - - - - - - - The background colour of the region - - - - - - - - - - Noise regions are regions where no real data lies, only - false data created by artifacts on the document or - scanner noise. - - - - - - - - - - To be used if the region type cannot be ascertained. - - - - - - - - - - Regions containing content that is not covered - by the default types (text, graphic, image, - line drawing, chart, table, separator, maths, - map, music, chem, advert, noise, unknown). - - - - - - - - Information on the type of content represented by this region - - - - - - - - - - Determines the effective area on the paper of a printed page. - Its size is equal for all pages of a book - (exceptions: titlepage, multipage pictures). - It contains all living elements (except marginals) - like body type, footnotes, headings, running titles. - It does not contain pagenumber (if not part of running title), - marginals, signature mark, preview words. - - - - - - - - - - Definition of the reading order within the page. - To express a reading order between elements - they have to be included in an OrderedGroup. - Groups may contain further groups. - - - - - - - - - Confidence value (between 0 and 1) - - - - - - Numbered region - - - - Position (order number) of this item within the current hierarchy level. 
- - - - - - - - Indexed group containing ordered elements - - - - - - - Semantic labels / tags - - - - - - - - - - - - - Optional link to a parent region of nested regions. - The parent region doubles as reading order group. - Only the nested regions should be allowed as group members. - - - - - - - Position (order number) of this item within the - current hierarchy level. - - - - - - - - - Is this group a continuation of another group (from - previous column or page, for example)? - - - - - - For generic use - - - - - - - - Indexed group containing unordered elements - - - - - - - - Semantic labels / tags - - - - - - - - - - - - - Optional link to a parent region of nested regions. - The parent region doubles as reading order group. - Only the nested regions should be allowed as group members. - - - - - - - Position (order number) of this item within the - current hierarchy level. - - - - - - - - - Is this group a continuation of another group - (from previous column or page, for example)? - - - - - - For generic use - - - - - - - - - - - Numbered group (contains ordered elements) - - - - - - - - Semantic labels / tags - - - - - - - - - - - - - Optional link to a parent region of nested regions. - The parent region doubles as reading order group. - Only the nested regions should be allowed as group members. - - - - - - - - - Is this group a continuation of another group - (from previous column or page, for example)? - - - - - - For generic use - - - - - - - - Numbered group (contains unordered elements) - - - - - - - - Semantic labels / tags - - - - - - - - - - - - - Optional link to a parent region of nested regions. - The parent region doubles as reading order group. - Only the nested regions should be allowed as group members. - - - - - - - - - Is this group a continuation of another group - (from previous column or page, for example)? 
- - - - - - For generic use - - - - - - - - Border of the actual page (if the scanned image - contains parts not belonging to the page). - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ISO 639.x 2016-07-14 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - iso15924 2016-07-14 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Can be used to express the z-index of overlapping - regions. An element with a greater z-index is always in - front of another element with lower z-index. - - - - - - - - - - - - - - - - - - - - - - Confidence value (between 0 and 1) - - - - - - - - Point list with format "x1,y1 x2,y2 ...", where - "x" / "y" refer to the horizontal / vertical - pixel positions in a coordinate system which always - references the root PageType/@imageFilename, with - "0,0" in the upper left corner of the root image and - "imageWidth,imageHeight" in the lower right. - - - - - - - - - - Container for one-to-one relations between layout - objects (for example: DropCap - paragraph, caption - - image). - - - - - - - - - - - One-to-one relation between to layout object. 
Use 'link' - for loose relations and 'join' for strong relations - (where something is fragmented for instance). - - Examples for 'link': caption - image floating - - paragraph paragraph - paragraph (when a paragraph is - split across columns and the last word of the first - paragraph DOES NOT continue in the second paragraph) - drop-cap - paragraph (when the drop-cap is a whole word) - - Examples for 'join': word - word (separated word at the - end of a line) drop-cap - paragraph (when the drop-cap - is not a whole word) paragraph - paragraph (when a - pragraph is split across columns and the last word of - the first paragraph DOES continue in the second - paragraph) - - - - - - Semantic labels / tags - - - - - - - - - - - - - - - - - - - For generic use - - - - - - - - Text production type - - - - - - - - - - - - - - - Monospace (fixed-pitch, non-proportional) or - proportional font. - - - - - - For instance: Arial, Times New Roman. - Add more information if necessary - (e.g. blackletter, antiqua). - - - - - - - Serif or sans-serif typeface. - - - - - - - - The size of the characters in points. - - - - - - - The x-height or corpus size refers to the distance - between the baseline and the mean line of - lower-case letters in a typeface. - The unit is assumed to be pixels. - - - - - - - The degree of space (in points) between - the characters in a string of text. - - - - - - - - Text colour in RGB encoded format - (red value) + (256 x green value) + (65536 x blue value). - - - - - - Background colour - - - - - - Background colour in RGB encoded format - (red value) + (256 x green value) + (65536 x blue value). - - - - - - - Specifies whether the colour of the text appears - reversed against a background colour. - - - - - - - - - Line style details if "underlined" is TRUE - - - - - - - - - - - - - - - - Alternative region images - (e.g. black-and-white). - - - - - - - - - Semantic labels / tags - - - - - - Roles the region takes - (e.g. 
in context of a parent region). - - - - - - - - - - - - - - - - - - - - - - - - For generic use - - - - - - - Is this region a continuation of another region - (in previous column or page, for example)? - - - - - - - - - - - Confidence value (between 0 and 1) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Examples: - "123.456", "+1234.456", - "-1234.456", "-.456", "-456" - - - - - - - Examples: - "123.456", "+1234.456", "-1.2344e56", - "-.45E-6", "INF", "-INF", "NaN" - - - - - - - Examples: - "123456", "+00000012", "-1", "-456" - - - - - - - Examples: "true", "false", "1", "0" - - - - - - - Examples: - "2001-10-26", "2001-10-26+02:00", - "2001-10-26Z", "2001-10-26+00:00", - "-2001-10-26", "-20000-04-01" - - - - - - - Examples: - "21:32:52", "21:32:52+02:00", "19:32:52Z", - "19:32:52+00:00", "21:32:52.12679" - - - - - - - Examples: - "2001-10-26T21:32:52", "2001-10-26T21:32:52+02:00", - "2001-10-26T19:32:52Z", "2001-10-26T19:32:52+00:00", - "-2001-10-26T21:32:52", "2001-10-26T21:32:52.12679" - - - - - - Generic text string - - - - - - An XSD type that is not listed or a custom type - (use dataTypeDetails attribute). - - - - - - - - - Container for graphemes, grapheme groups and - non-printing characters. - - - - - - - - - - - - Base type for graphemes, grapheme groups and non-printing characters. - - - - - - - - - - Order index of grapheme, group, or non-printing character - within the parent container (graphemes or glyph or grapheme group). - - - - - - - - - - - - - Type of character represented by the - grapheme, group, or non-printing character element. - - - - - - - - - - - - For generic use - - - - - For generic use - - - - - - - Represents a sub-element of a glyph. - Smallest graphical unit that can be - assigned a Unicode code point. - - - - - - - - - - - - - - A glyph component without visual representation - but with Unicode code point. - Non-visual / non-printing / control character. - Part of grapheme container (of glyph) or grapheme sub group. 
- - - - - - - - - - - - - - - - - - - - - Container for user-defined attributes - - - - - - - - - Structured custom data defined by name, type and value. - - - - - - - - - - - - - - - - - - - - Cell position in table starting with row 0 - - - - - Cell position in table starting with column 0 - - - - - Number of rows the cell spans (optional; default is 1) - - - - - Number of columns the cell spans (optional; default is 1) - - - - - - Is the cell a column or row header? - - - - - - - - - - Data for a region that takes on the role - of a table cell within a parent table region. - - - - - - - - - - - - - + + + + + + Page Content - Ground Truth and Storage + + + + + + + + + + + + + + + + The timestamp has to be in UTC (Coordinated + Universal Time) and not local time. + + + + + + + The timestamp has to be in UTC + (Coordinated Universal Time) + and not local time. + + + + + + + + + + + + + External reference of any kind + + + + + + + + Semantic labels / tags + + + + + + + Type of metadata (e.g. author) + + + + + + + + + + + + + + + E.g. imagePhotometricInterpretation + + + + + + E.g. RGB + + + + + + + + + + A semantic label / tag + + + + + + + + Reference to external model / ontology / schema + + + + + + + E.g. an RDF resource identifier + (to be used as subject or object of an RDF triple) + + + + + + + Prefix for all labels (e.g. first part of an URI) + + + + + + + + Semantic label + + + + + The label / tag (e.g. 'person'). + Can be an RDF resource identifier + (e.g. object of an RDF triple). + + + + + + + Additional information on the label + (e.g. 'YYYY-mm-dd' for a date label). + Can be used as predicate of an RDF triple. + + + + + + + + + + + + Alternative document page images + (e.g. black-and-white). + + + + + + + + + + Order of blocks within the page. + + + + + + Unassigned regions are considered to be in the + (virtual) default layer which is to be treated + as below any other layers. 
+ + + + + + + + Default text style + + + + + + + Semantic labels / tags + + + + + + + + + + + + + + + + + + + + + + + + Contains the image file name including the file extension. + + + + + + Specifies the width of the image. + + + + + Specifies the height of the image. + + + + + Specifies the image resolution in width. + + + + + Specifies the image resolution in height. + + + + + + Specifies the unit of the resolution information + referring to a standardised unit of measurement + (pixels per inch, pixels per centimeter or other). + + + + + + + + + + + + + For generic use + + + + + + The angle the rectangle encapsulating the page + (or its Border) has to be rotated in clockwise direction + in order to correct the present skew + (negative values indicate anti-clockwise rotation). + (The rotated image can be further referenced + via “AlternativeImage”.) + Range: -179.999,180 + + + + + + + The type of the page within the document + (e.g. cover page). + + + + + + + The primary language used in the page + (lower-level definitions override the page-level definition). + + + + + + + The secondary language used in the page + (lower-level definitions override the page-level definition). + + + + + + + The primary script used in the page + (lower-level definitions override the page-level definition). + + + + + + + The secondary script used in the page + (lower-level definitions override the page-level definition). + + + + + + + The direction in which text within lines + should be read (order of words and characters), + in addition to “textLineOrder” + (lower-level definitions override the page-level definition). + + + + + + + The order of text lines within a block, + in addition to “readingDirection” + (lower-level definitions override the page-level definition). + + + + + + Confidence value for whole page (between 0 and 1) + + + + + + + Pure text is represented as a text region. This includes + drop capitals, but practically ornate text may be + considered as a graphic. 
+ + + + + + + + + + + + + The angle the rectangle encapsulating the region + has to be rotated in clockwise direction + in order to correct the present skew + (negative values indicate anti-clockwise rotation). + (The rotated image can be further referenced + via “AlternativeImage”.) + Range: -179.999,180 + + + + + + + The nature of the text in the region + + + + + + + The degree of space in points between the lines of + text (line spacing) + + + + + + + The direction in which text within lines + should be read (order of words and characters), + in addition to “textLineOrder”. + + + + + + + The order of text lines within the block, + in addition to “readingDirection”. + + + + + + + The angle the baseline of text within the region + has to be rotated (relative to the rectangle + encapsulating the region) in clockwise direction + in order to correct the present skew, + in addition to “orientation” + (negative values indicate anti-clockwise rotation). + Range: -179.999,180 + + + + + + + Defines whether a region of text is indented or not + + + + + + Text align + + + + + + The primary language used in the region + + + + + + + The secondary language used in the region + + + + + + + The primary script used in the region + + + + + + + The secondary script used in the region + + + + + + + + + + + + Polygon outline of the element as a path of points. + No points may lie outside the outline of its parent, + which in the case of Border is the bounding rectangle + of the root image. Paths are closed by convention, + i.e. the last point logically connects with the first + (and at least 3 points are required to span an area). + Paths must be planar (i.e. must not self-intersect). + + + + + + Confidence value (between 0 and 1) + + + + + + + + + Alternative text line images (e.g. 
+ black-and-white) + + + + + + + + Multiple connected points that mark the baseline + of the glyphs + + + + + + + + + + + + + + Semantic labels / tags + + + + + + + + Overrides primaryLanguage attribute of parent text + region + + + + + + + The primary script used in the text line + + + + + + + The secondary script used in the text line + + + + + + + The direction in which text within the line + should be read (order of words and characters). + + + + + + + Overrides the production attribute of the parent + text region + + + + + + For generic use + + + + + + + Position (order number) of this text line within the + parent text region. + + + + + + + + + + Alternative word images (e.g. + black-and-white) + + + + + + + + + + + + + + + Semantic labels / tags + + + + + + + + Overrides primaryLanguage attribute of parent line + and/or text region + + + + + + + The primary script used in the word + + + + + + + The secondary script used in the word + + + + + + + The direction in which text within the word + should be read (order of characters). + + + + + + + Overrides the production attribute of the parent + text line and/or text region. + + + + + + For generic use + + + + + + + + + + Alternative glyph images (e.g. + black-and-white) + + + + + + + + Container for graphemes, grapheme groups and + non-printing characters + + + + + + + + + + + + Semantic labels / tags + + + + + + + + + + The script used for the glyph + + + + + + + Overrides the production attribute of the parent + word / text line / text region. + + + + + + For generic use + + + + + + + + + + Text in a "simple" form (ASCII or extended ASCII + as mostly used for typing). I.e. no use of + special characters for ligatures (should be + stored as two separate characters) etc. + + + + + + + Correct encoding of the original, always using + the corresponding Unicode code point. I.e. + ligatures have to be represented as one + character etc. + + + + + + + + Used for sort order in case multiple TextEquivs are defined. 
+ The text content with the lowest index should be interpreted + as the main text content. + + + + + + + + + + + OCR confidence value (between 0 and 1) + + + + + + Type of text content (is it free text or a number, for instance). + This is only a descriptive attribute, the text type + is not checked during XML validation. + + + + + + + Refinement for dataType attribute. Can be a regular expression, for instance. + + + + + + + + + + An image is considered to be more intricate and complex + than a graphic. These can be photos or drawings. + + + + + + + + The angle the rectangle encapsulating a region + has to be rotated in clockwise direction + in order to correct the present skew + (negative values indicate anti-clockwise rotation). + Range: -179.999,180 + + + + + + + The colour bit depth required for the region + + + + + + + The background colour of the region + + + + + + + Specifies whether the region also contains + text + + + + + + + + + + A line drawing is a single colour illustration without + solid areas. + + + + + + + + The angle the rectangle encapsulating a region + has to be rotated in clockwise direction + in order to correct the present skew + (negative values indicate anti-clockwise rotation). + Range: -179.999,180 + + + + + + + The pen (foreground) colour of the region + + + + + + + The background colour of the region + + + + + + + Specifies whether the region also contains + text + + + + + + + + + + Regions containing simple graphics, such as a company + logo, should be marked as graphic regions. + + + + + + + + The angle the rectangle encapsulating a region + has to be rotated in clockwise direction + in order to correct the present skew + (negative values indicate anti-clockwise rotation). + Range: -179.999,180 + + + + + + + The type of graphic in the region + + + + + + + An approximation of the number of colours + used in the region + + + + + + + Specifies whether the region also contains + text. 
+ + + + + + + + + + Tabular data in any form is represented with a table + region. Rows and columns may or may not have separator + lines; these lines are not separator regions. + + + + + + + + Table grid (visible or virtual grid lines) + + + + + + + The angle the rectangle encapsulating a region + has to be rotated in clockwise direction + in order to correct the present skew + (negative values indicate anti-clockwise rotation). + Range: -179.999,180 + + + + + + + The number of rows present in the table + + + + + + + The number of columns present in the table + + + + + + + The colour of the lines used in the region + + + + + + + The background colour of the region + + + + + + + Specifies the presence of line separators + + + + + + + Specifies whether the region also contains + text + + + + + + + + + + Matrix of grid points defining the table grid on the page. + + + + + + + One row in the grid point matrix. + Points with x,y coordinates. + (note: for a table with n table rows there should be n+1 grid rows) + + + + + + + + Points with x,y coordinates. + + + + + The grid row index + + + + + + + + + Regions containing charts or graphs of any type, should + be marked as chart regions. + + + + + + + + The angle the rectangle encapsulating a region + has to be rotated in clockwise direction + in order to correct the present skew + (negative values indicate anti-clockwise rotation). + Range: -179.999,180 + + + + + + + The type of chart in the region + + + + + + + An approximation of the number of colours + used in the region + + + + + + + The background colour of the region + + + + + + + Specifies whether the region also contains + text + + + + + + + + + + Separators are lines that lie between columns and + paragraphs and can be used to logically separate + different articles from each other. 
+ + + + + + + + The angle the rectangle encapsulating a region + has to be rotated in clockwise direction + in order to correct the present skew + (negative values indicate anti-clockwise rotation). + Range: -179.999,180 + + + + + + + The colour of the separator + + + + + + + + + + Regions containing equations and mathematical symbols + should be marked as maths regions. + + + + + + + + The angle the rectangle encapsulating a region + has to be rotated in clockwise direction + in order to correct the present skew + (negative values indicate anti-clockwise rotation). + Range: -179.999,180 + + + + + + + The background colour of the region + + + + + + + + + + Regions containing chemical formulas. + + + + + + + + The angle the rectangle encapsulating a + region has to be rotated in clockwise + direction in order to correct the present + skew (negative values indicate + anti-clockwise rotation). Range: + -179.999,180 + + + + + + + The background colour of the region + + + + + + + + + + Regions containing maps. + + + + + + + + The angle the rectangle encapsulating a + region has to be rotated in clockwise + direction in order to correct the present + skew (negative values indicate + anti-clockwise rotation). Range: + -179.999,180 + + + + + + + + + + Regions containing musical notations. + + + + + + + + The angle the rectangle encapsulating a region + has to be rotated in clockwise direction + in order to correct the present skew + (negative values indicate anti-clockwise rotation). + Range: -179.999,180 + + + + + + + The background colour of the region + + + + + + + + + + Regions containing advertisements. + + + + + + + + The angle the rectangle encapsulating a region + has to be rotated in clockwise direction + in order to correct the present skew + (negative values indicate anti-clockwise rotation). 
+ Range: -179.999,180 + + + + + + + The background colour of the region + + + + + + + + + + Noise regions are regions where no real data lies, only + false data created by artifacts on the document or + scanner noise. + + + + + + + + + + To be used if the region type cannot be ascertained. + + + + + + + + + + Regions containing content that is not covered + by the default types (text, graphic, image, + line drawing, chart, table, separator, maths, + map, music, chem, advert, noise, unknown). + + + + + + + + Information on the type of content represented by this region + + + + + + + + + + Determines the effective area on the paper of a printed page. + Its size is equal for all pages of a book + (exceptions: titlepage, multipage pictures). + It contains all living elements (except marginals) + like body type, footnotes, headings, running titles. + It does not contain pagenumber (if not part of running title), + marginals, signature mark, preview words. + + + + + + + + + + Definition of the reading order within the page. + To express a reading order between elements + they have to be included in an OrderedGroup. + Groups may contain further groups. + + + + + + + + + Confidence value (between 0 and 1) + + + + + + Numbered region + + + + Position (order number) of this item within the current hierarchy level. + + + + + + + + Indexed group containing ordered elements + + + + + + + Semantic labels / tags + + + + + + + + + + + + + Optional link to a parent region of nested regions. + The parent region doubles as reading order group. + Only the nested regions should be allowed as group members. + + + + + + + Position (order number) of this item within the + current hierarchy level. + + + + + + + + + Is this group a continuation of another group (from + previous column or page, for example)? 
+ + + + + + For generic use + + + + + + + + Indexed group containing unordered elements + + + + + + + + Semantic labels / tags + + + + + + + + + + + + + Optional link to a parent region of nested regions. + The parent region doubles as reading order group. + Only the nested regions should be allowed as group members. + + + + + + + Position (order number) of this item within the + current hierarchy level. + + + + + + + + + Is this group a continuation of another group + (from previous column or page, for example)? + + + + + + For generic use + + + + + + + + + + + Numbered group (contains ordered elements) + + + + + + + + Semantic labels / tags + + + + + + + + + + + + + Optional link to a parent region of nested regions. + The parent region doubles as reading order group. + Only the nested regions should be allowed as group members. + + + + + + + + + Is this group a continuation of another group + (from previous column or page, for example)? + + + + + + For generic use + + + + + + + + Numbered group (contains unordered elements) + + + + + + + + Semantic labels / tags + + + + + + + + + + + + + Optional link to a parent region of nested regions. + The parent region doubles as reading order group. + Only the nested regions should be allowed as group members. + + + + + + + + + Is this group a continuation of another group + (from previous column or page, for example)? + + + + + + For generic use + + + + + + + + Border of the actual page (if the scanned image + contains parts not belonging to the page). 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ISO 639.x 2016-07-14 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + iso15924 2016-07-14 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Can be used to express the z-index of overlapping + regions. An element with a greater z-index is always in + front of another element with lower z-index. + + + + + + + + + + + + + + + + + + + + + + Confidence value (between 0 and 1) + + + + + + + + Point list with format "x1,y1 x2,y2 ...", where + "x" / "y" refer to the horizontal / vertical + pixel positions in a coordinate system which always + references the root PageType/@imageFilename, with + "0,0" in the upper left corner of the root image and + "imageWidth,imageHeight" in the lower right. + + + + + + + + + + Container for one-to-one relations between layout + objects (for example: DropCap - paragraph, caption - + image). + + + + + + + + + + + One-to-one relation between to layout object. Use 'link' + for loose relations and 'join' for strong relations + (where something is fragmented for instance). 
+ + Examples for 'link': caption - image floating - + paragraph paragraph - paragraph (when a paragraph is + split across columns and the last word of the first + paragraph DOES NOT continue in the second paragraph) + drop-cap - paragraph (when the drop-cap is a whole word) + + Examples for 'join': word - word (separated word at the + end of a line) drop-cap - paragraph (when the drop-cap + is not a whole word) paragraph - paragraph (when a + paragraph is split across columns and the last word of + the first paragraph DOES continue in the second + paragraph) + + + + + + Semantic labels / tags + + + + + + + + + + + + + + + + + + + For generic use + + + + + + + + Text production type + + + + + + + + + + + + + + + Monospace (fixed-pitch, non-proportional) or + proportional font. + + + + + + For instance: Arial, Times New Roman. + Add more information if necessary + (e.g. blackletter, antiqua). + + + + + + + Serif or sans-serif typeface. + + + + + + + + The size of the characters in points. + + + + + + + The x-height or corpus size refers to the distance + between the baseline and the mean line of + lower-case letters in a typeface. + The unit is assumed to be pixels. + + + + + + + The degree of space (in points) between + the characters in a string of text. + + + + + + + + Text colour in RGB encoded format + (red value) + (256 x green value) + (65536 x blue value). + + + + + + Background colour + + + + + + Background colour in RGB encoded format + (red value) + (256 x green value) + (65536 x blue value). + + + + + + + Specifies whether the colour of the text appears + reversed against a background colour. + + + + + + + + + Line style details if "underlined" is TRUE + + + + + + + + + + + + + + + + Alternative region images + (e.g. black-and-white). + + + + + + + + + Semantic labels / tags + + + + + + Roles the region takes + (e.g. in context of a parent region).
+ + + + + + + + + + + + + + + + + + + + + + + + For generic use + + + + + + + Is this region a continuation of another region + (in previous column or page, for example)? + + + + + + + + + + + Confidence value (between 0 and 1) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Examples: + "123.456", "+1234.456", + "-1234.456", "-.456", "-456" + + + + + + + Examples: + "123.456", "+1234.456", "-1.2344e56", + "-.45E-6", "INF", "-INF", "NaN" + + + + + + + Examples: + "123456", "+00000012", "-1", "-456" + + + + + + + Examples: "true", "false", "1", "0" + + + + + + + Examples: + "2001-10-26", "2001-10-26+02:00", + "2001-10-26Z", "2001-10-26+00:00", + "-2001-10-26", "-20000-04-01" + + + + + + + Examples: + "21:32:52", "21:32:52+02:00", "19:32:52Z", + "19:32:52+00:00", "21:32:52.12679" + + + + + + + Examples: + "2001-10-26T21:32:52", "2001-10-26T21:32:52+02:00", + "2001-10-26T19:32:52Z", "2001-10-26T19:32:52+00:00", + "-2001-10-26T21:32:52", "2001-10-26T21:32:52.12679" + + + + + + Generic text string + + + + + + An XSD type that is not listed or a custom type + (use dataTypeDetails attribute). + + + + + + + + + Container for graphemes, grapheme groups and + non-printing characters. + + + + + + + + + + + + Base type for graphemes, grapheme groups and non-printing characters. + + + + + + + + + + Order index of grapheme, group, or non-printing character + within the parent container (graphemes or glyph or grapheme group). + + + + + + + + + + + + + Type of character represented by the + grapheme, group, or non-printing character element. + + + + + + + + + + + + For generic use + + + + + For generic use + + + + + + + Represents a sub-element of a glyph. + Smallest graphical unit that can be + assigned a Unicode code point. + + + + + + + + + + + + + + A glyph component without visual representation + but with Unicode code point. + Non-visual / non-printing / control character. + Part of grapheme container (of glyph) or grapheme sub group. 
+ + + + + + + + + + + + + + + + + + + + + Container for user-defined attributes + + + + + + + + + Structured custom data defined by name, type and value. + + + + + + + + + + + + + + + + + + + + Cell position in table starting with row 0 + + + + + Cell position in table starting with column 0 + + + + + Number of rows the cell spans (optional; default is 1) + + + + + Number of columns the cell spans (optional; default is 1) + + + + + + Is the cell a column or row header? + + + + + + + + + + Data for a region that takes on the role + of a table cell within a parent table region. + + + + + + + + + + + + + diff --git a/tests/resources/xlink.xsd b/tests/resources/xlink.xsd index f55eb6dae..8283fe669 100644 --- a/tests/resources/xlink.xsd +++ b/tests/resources/xlink.xsd @@ -1,75 +1,75 @@ - + - + - - - - - + + + + + - - - - + + + + - - - + + + - - - - - - - + + + + + + + - - - + + + - - - - - + + + + + - - - - - - - + + + + + + + - - - - + + + + - + - + diff --git a/tests/test_merging.py b/tests/test_merging.py index a9a00631e..a32b3a47e 100644 --- a/tests/test_merging.py +++ b/tests/test_merging.py @@ -81,7 +81,7 @@ def test_merging_union(self): model.nn.codec.encode("x").shape, (1, ), "x is known to the loaded model and should be encoded through `new`" ) - + def test_merging_union_with_nfd(self): """ Asserts that union, which only takes into account new the original codec and the new data, works as intended diff --git a/tests/test_train.py b/tests/test_train.py index 3d651d927..59b663435 100644 --- a/tests/test_train.py +++ b/tests/test_train.py @@ -200,7 +200,7 @@ def test_krakentrainer_rec_bl_dict(self): self.assertEqual(module.nn.seg_type, 'baselines') self.assertIsInstance(module.train_set.dataset, kraken.lib.dataset.PolygonGTDataset) trainer = KrakenTrainer(max_steps=1) - + def test_krakentrainer_rec_bl_augment(self): """ Test that augmentation is added if specified. 
@@ -212,14 +212,14 @@ def test_krakentrainer_rec_bl_augment(self): evaluation_data=evaluation_data) module.setup() self.assertEqual(module.train_set.dataset.aug, None) - + module = RecognitionModel({'augment': True}, format_type='xml', training_data=training_data, evaluation_data=evaluation_data) module.setup() self.assertIsInstance(module.train_set.dataset.aug, kraken.lib.dataset.recognition.DefaultAugmenter) - + def test_krakentrainer_rec_box_augment(self): """ Test that augmentation is added if specified. @@ -231,7 +231,7 @@ def test_krakentrainer_rec_box_augment(self): evaluation_data=evaluation_data) module.setup() self.assertEqual(module.train_set.dataset.aug, None) - + module = RecognitionModel({'augment': True}, format_type='path', training_data=training_data,