From 3577eb8d6980ead8332d3e924020b440a16dcb42 Mon Sep 17 00:00:00 2001
From: Benjamin Kiessling <mittagessen@l.unchti.me>
Date: Thu, 11 Jan 2024 01:49:22 +0100
Subject: [PATCH] Some basic rotrain docs

---
 docs/index.rst |   2 +-
 docs/ketos.rst | 148 +++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 119 insertions(+), 31 deletions(-)
diff --git a/docs/index.rst b/docs/index.rst
index 39fc3fc1d..5713d8fdd 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -19,7 +19,7 @@ Features
 
 kraken's main features are:
 
-  - Fully trainable layout analysis and character recognition
+  - Fully trainable :ref:`layout analysis <segtrain>`, :ref:`reading order <rotrain>`, and :ref:`character recognition <predtrain>`
   - `Right-to-Left <https://en.wikipedia.org/wiki/Right-to-left>`_, `BiDi
     <https://en.wikipedia.org/wiki/Bi-directional_text>`_, and Top-to-Bottom
     script support
diff --git a/docs/ketos.rst b/docs/ketos.rst
index 8b32ba916..b1c23ae30 100644
--- a/docs/ketos.rst
+++ b/docs/ketos.rst
@@ -7,9 +7,16 @@ This page describes the training utilities available through the ``ketos``
 command line utility in depth. For a gentle introduction on model training
 please refer to the :ref:`tutorial <training>`. 
 
-Both segmentation and recognition are trainable in kraken. The segmentation
-model finds baselines and regions on a page image. Recognition models convert
-text image lines found by the segmenter into digital text.
+There are currently three trainable components in the kraken processing pipeline:
+* Segmentation: finding lines and regions in images
+* Reading Order: ordering lines found in the previous segmentation step. Reading order models are closely linked to segmentation models and both are usually trained on the same dataset.
+* Recognition: recognition models transform images of lines into text. 
+
+Depending on the use case it is not necessary to manually train new models for
+each material. The default segmentation model works well on quite a variety of
+handwritten and printed documents, a reading order model might not perform
+better than the default heuristic for simple text flows, and there are
+recognition models for some types of material available in the repository.
 
 Best practices
 --------------
@@ -34,7 +41,7 @@ Segmentation model training
 
 * The segmenter is fairly robust when it comes to hyperparameter choice.
 * Start by finetuning from the default model for a fixed number of epochs (50 for reasonably sized datasets) with a cosine schedule.
-* Segmentation models' performance are difficult to evaluate. Pixel accuracy doesn't mean much because there are many more pixels that aren't part of a line or region than just background. Frequency-weighted IoU is good for overall performance, while mean IoU overrepresents rare classes. The best way to evaluate segmentation models is to look at the output on unlabelled data.
+* Segmentation models' performance is difficult to evaluate. Pixel accuracy doesn't mean much because there are many more pixels that aren't part of a line or region than just background. Frequency-weighted IoU is good for overall performance, while mean IoU overrepresents rare classes. The best way to evaluate segmentation models is to look at the output on unlabelled data.
 * If you don't have rare classes you can use a fairly small validation set to make sure everything is converging and just visually validate on unlabelled data.
 
 Training data formats
@@ -44,13 +51,15 @@ The training tools accept a variety of training data formats, usually some kind
 of custom low level format, the XML-based formats that are commony used for
 archival of annotation and transcription data, and in the case of recognizer
 training a precompiled binary format. It is recommended to use the XML formats
-for segmentation training and the binary format for recognition training.
+for segmentation and reading order training and the binary format for
+recognition training.
 
 ALTO
 ~~~~
 
-Kraken parses and produces files according to ALTO 4.2. An example showing the
-attributes necessary for segmentation and recognition training follows:
+Kraken parses and produces files according to ALTO 4.3. An example showing the
+attributes necessary for segmentation, recognition, and reading order training
+follows:
 
 .. literalinclude:: alto.xml
    :language: xml
@@ -67,7 +76,8 @@ PAGE XML
 
 PAGE XML is parsed and produced according to the 2019-07-15 version of the
 schema, although the parser is not strict and works with non-conformant output
-of a variety of tools.
+from a variety of tools. As with ALTO, PAGE XML files can be used to train
+segmentation, reading order, and recognition models.
 
 .. literalinclude:: pagexml.xml
    :language: xml
@@ -79,11 +89,12 @@ Binary Datasets
 .. _binary_datasets:
 
 In addition to training recognition models directly from XML and image files, a
-binary dataset format offering a couple of advantages is supported. Binary
-datasets drastically improve loading performance allowing the saturation of
-most GPUs with minimal computational overhead while also allowing training with
-datasets that are larger than the systems main memory. A minor drawback is a
-~30% increase in dataset size in comparison to the raw images + XML approach.
+binary dataset format offering a couple of advantages is supported for
+recognition training. Binary datasets drastically improve loading performance
+allowing the saturation of most GPUs with minimal computational overhead while
+also allowing training with datasets that are larger than the systems main
+memory. A minor drawback is a ~30% increase in dataset size in comparison to
+the raw images + XML approach.
 
 To realize this speedup the dataset has to be compiled first:
 
@@ -124,6 +135,8 @@ otherwise) while the remaining 10% of the test set is selected by `ketos test`.
 Recognition training
 --------------------
 
+.. _predtrain:
+
 The training utility allows training of :ref:`VGSL <vgsl>` specified models
 both from scratch and from existing models. Here are its most important command line options:
 
@@ -193,23 +206,24 @@ Training will continue until the error does not improve anymore and the best
 model (among intermediate results) will be saved in the current directory; this
 approach is called early stopping.
 
-In some cases, such as color inputs, changing the network architecture might be
-useful:
+In some cases changing the network architecture might be useful. One such
+example would be material that is not well recognized in the grayscale domain,
+as the default architecture definition converts images into grayscale. The
+input definition can be changed quite easily to train on color data (RGB) instead:
 
 .. code-block:: console
 
-        $ ketos train -f page -s '[1,0,0,3 Cr3,3,16 Mp3,3 Lfys64 Lbx128 Lbx256 Do]' syr/*.xml
+        $ ketos train -f page -s '[1,120,0,3 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do0.1,2 Lbx200 Do]]' syr/*.xml
 
 Complete documentation for the network description language can be found on the
 :ref:`VGSL <vgsl>` page.
 
 Sometimes the early stopping default parameters might produce suboptimal
-results such as stopping training too soon. Adjusting the minimum delta an/or
-lag can be useful:
+results such as stopping training too soon. Adjusting the lag can be useful:
 
 .. code-block:: console
 
-        $ ketos train --lag 10 --min-delta 0.001 syr/*.png
+        $ ketos train --lag 10 syr/*.png
 
 To switch optimizers from Adam to SGD or RMSprop just set the option:
 
@@ -516,16 +530,6 @@ recognition. The basic invocation is:
 .. code-block:: console
 
         $ ketos segtrain -f xml training_data/*.xml
-        Training line types:
-          default 2	53980
-          foo     8     134
-        Training region types:
-          graphic	3	135
-          text	4	1128
-          separator	5	5431
-          paragraph	6	10218
-          table	7	16
-        val check  [------------------------------------]  0/0
 
 This takes all text lines and regions encoded in the XML files and trains a
 model to recognize them.
@@ -610,6 +614,8 @@ Finally, we can merge baselines and regions into each other:
         ...
 
 These options are combinable to massage the dataset into any typology you want.
+Tags containing the separator character `:` can be specified by escaping them
+with backslash.
 
 Then there are some options that set metadata fields controlling the
 postprocessing. When computing the bounding polygons the recognized baselines
@@ -633,7 +639,89 @@ be improved:
         $ ketos segtrain --bounding-regions paragraph -f xml training_data/*.xml
         ...
 
-Recognition Testing
+Reading order training
+----------------------
+
+.. _rotrain:
+
+Reading order models work slightly differently from segmentation and reading
+order models. They are closely linked to the typology used in the dataset they
+were trained on as they use type information on lines and regions to make
+ordering decisions. As the same typology was probably used to train a specific
+segmentation model, reading order models are trained separately but bundled
+with their segmentation model in a subsequent step. The general sequence is
+therefore:
+
+.. code-block:: console
+       
+        $ ketos segtrain -o fr_manu_seg.mlmodel -f xml french/*.xml
+        ...
+        $ ketos rotrain -o fr_manu_ro.mlmodel -f xml french/*.xml
+        ...
+        $ ketos roadd -o fr_manu_seg_with_ro.mlmodel -i fr_manu_seg_best.mlmodel  -r fr_manu_ro_best.mlmodel
+
+Only the `fr_manu_seg_with_ro.mlmodel` file will contain the trained reading
+order model.  Segmentation models can exist with or without reading order
+models. If one is added, the neural reading order will be computed *in
+addition* to the one produced by the default heuristic during segmentation and
+serialized in the final XML output (in ALTO/PAGE XML).
+
+.. note::
+
+        Reading order models work purely on the typology and geometric features
+        of the lines and regions. They construct an approximate ordering matrix
+        by feeding feature vectors of two lines (or regions) into the network
+        to decide which of those two lines precedes the other. 
+        
+        These feature vectors are quite simple; just the lines' types, and
+        their start, center, and end points. Therefore they can *not* reliably
+        learn any ordering relying on graphical features of the input page such
+        as: line color, typeface, or writing system.
+
+Reading order models are extremely simple and do not require a lot of memory or
+computational power to train. In fact, the default parameters are extremely
+conservative and it is recommended to increase the batch size for improved
+training speed. Large batch size above 128k are easily possible with
+sufficiently large training datasets:
+
+.. code-block:: console
+
+        $ ketos rotrain -o fr_manu_ro.mlmodel -B 128000 -f french/*.xml
+        Training RO on following baselines types:
+          DefaultLine	1
+          DropCapitalLine	2
+          HeadingLine	3
+          InterlinearLine	4
+        GPU available: False, used: False
+        TPU available: False, using: 0 TPU cores
+        IPU available: False, using: 0 IPUs
+        HPU available: False, using: 0 HPUs
+        ┏━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
+        ┃   ┃ Name        ┃ Type              ┃ Params ┃
+        ┡━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
+        │ 0 │ criterion   │ BCEWithLogitsLoss │      0 │
+        │ 1 │ ro_net      │ MLP               │  1.1 K │
+        │ 2 │ ro_net.fc1  │ Linear            │  1.0 K │
+        │ 3 │ ro_net.relu │ ReLU              │      0 │
+        │ 4 │ ro_net.fc2  │ Linear            │     45 │
+        └───┴─────────────┴───────────────────┴────────┘
+        Trainable params: 1.1 K                                                                                                                                        
+        Non-trainable params: 0                                                                                                                                        
+        Total params: 1.1 K                                                                                                                                            
+        Total estimated model params size (MB): 0                                                                                                                      
+        stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/35 0:00:00 • -:--:-- 0.00it/s val_spearman: 0.912 val_loss: 0.701 early_stopping: 0/300 inf
+
+During validation a metric called Spearman's footrule is computed. To calculate
+Spearman's footrule, the ranks of the lines of text in the ground truth reading
+order and the predicted reading order are compared. The footrule is then
+calculated as the sum of the absolute differences between the ranks of pairs of
+lines. The score increases by 1 for each line between the correct and predicted
+positions of a line.
+
+A lower footrule score indicates a better alignment between the two orders. A
+score of 0 implies perfect alignment of line ranks.
+
+Recognition testing
 -------------------
 
 Picking a particular model from a pool or getting a more detailed look on the