
Merge pull request #583 from stweil/crlf+whitespace
Remove trailing whitespace and CR
mittagessen authored Apr 11, 2024
2 parents 9f4ccf3 + 5696830 commit 621305a
Showing 20 changed files with 5,208 additions and 5,208 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/test.yml
@@ -1,8 +1,8 @@
name: Lint, test, build, and publish

-on:
+on:
push:


jobs:
lint_and_test:
@@ -130,7 +130,7 @@ jobs:
pypi/*
publish-gh-pages:
-name: Update kraken.re github pages
+name: Update kraken.re github pages
needs: lint_and_test
runs-on: ubuntu-latest
if: |
@@ -147,7 +147,7 @@ jobs:
python-version: 3.9
- name: Install sphinx-multiversion
run: python -m pip install sphinx-multiversion sphinx-autoapi
-- name: Create docs
+- name: Create docs
run: sphinx-multiversion docs build/html
- name: Create redirect
run: cp docs/redirect.html build/html/index.html
8 changes: 4 additions & 4 deletions README.rst
@@ -55,15 +55,15 @@ branch as well:

::

-$ git clone https://github.com/mittagessen/kraken.git
+$ git clone https://github.com/mittagessen/kraken.git
$ cd kraken
$ conda env create -f environment.yml

or:

::

-$ git clone https://github.com/mittagessen/kraken.git
+$ git clone https://github.com/mittagessen/kraken.git
$ cd kraken
$ conda env create -f environment_cuda.yml

@@ -75,7 +75,7 @@ in the kraken directory for the current user:

::

-$ kraken get 10.5281/zenodo.10592716
+$ kraken get 10.5281/zenodo.10592716

A list of libre models available in the central repository can be retrieved by
running:
@@ -105,7 +105,7 @@ To segment an image (binarized or not) with the new baseline segmenter:
::

$ kraken -i image.tif lines.json segment -bl


To segment and OCR an image using the default model(s):

10 changes: 5 additions & 5 deletions docs/alto.xml
@@ -13,18 +13,18 @@
<PrintSpace...>
<ComposedBlockType ID="block_I"
HPOS="125"
VPOS="523"
WIDTH="5234"
VPOS="523"
WIDTH="5234"
HEIGHT="4000"
TYPE="region_type"><!-- for textlines part of a semantic region -->
<TextBlock ID="textblock_N">
<TextLine ID="line_0"
HPOS="..."
VPOS="..."
WIDTH="..."
VPOS="..."
WIDTH="..."
HEIGHT="..."
BASELINE="10 20 15 20 400 20"><!-- necessary for segmentation training -->
-<String ID="segment_K"
+<String ID="segment_K"
CONTENT="word_text"><!-- necessary for recognition training. Text is retrieved from <String> and <SP> tags. Lower level glyphs are ignored. -->
...
</String>
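The ``BASELINE`` attribute in the snippet above ("10 20 15 20 400 20") holds the baseline as flat, space-separated x y coordinate pairs. As an editorial aside (not part of this commit), a minimal Python sketch for reading those points out of an ALTO file could look like the following; the element and attribute names come from the snippet, while the v4 namespace constant and the helper itself are assumptions:

.. code-block:: python

   # Sketch only: pull (x, y) baseline points out of an ALTO file.
   from xml.etree import ElementTree as ET

   ALTO_NS = '{http://www.loc.gov/standards/alto/ns-v4#}'  # assumed ALTO v4 namespace

   def parse_baselines(path):
       """Return a list of (line_id, [(x, y), ...]) tuples."""
       tree = ET.parse(path)
       lines = []
       for textline in tree.iter(ALTO_NS + 'TextLine'):
           baseline = textline.get('BASELINE')
           if not baseline:
               continue
           coords = [int(float(v)) for v in baseline.split()]
           # BASELINE is a flat "x1 y1 x2 y2 ..." list; pair the values up.
           lines.append((textline.get('ID'), list(zip(coords[::2], coords[1::2]))))
       return lines

   # The baseline shown above parses to [(10, 20), (15, 20), (400, 20)].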
12 changes: 6 additions & 6 deletions docs/api.rst
@@ -1,10 +1,10 @@
-API Quickstart
+API Quickstart
==============

Kraken provides routines which are usable by third party tools to access all
functionality of the OCR engine. Most functional blocks, binarization,
segmentation, recognition, and serialization are encapsulated in one high
-level method each.
+level method each.

Simple use cases of the API which are mostly useful for debugging purposes are
contained in the `contrib` directory. In general it is recommended to look at
@@ -493,7 +493,7 @@ handling and verbosity options for the CLI.

.. code-block:: python
->>> from kraken.lib.train import RecognitionModel, KrakenTrainer
+>>> from kraken.lib.train import RecognitionModel, KrakenTrainer
>>> ground_truth = glob.glob('training/*.xml')
>>> training_files = ground_truth[:250] # training data is shuffled internally
>>> evaluation_files = ground_truth[250:]
@@ -522,14 +522,14 @@ can be attached to the trainer object:
.. code-block:: python
>>> from pytorch_lightning.callbacks import Callback
->>> from kraken.lib.train import RecognitionModel, KrakenTrainer
+>>> from kraken.lib.train import RecognitionModel, KrakenTrainer
>>> class MyPrintingCallback(Callback):
def on_init_start(self, trainer):
print("Starting to init trainer!")
def on_init_end(self, trainer):
print("trainer is init now")
def on_train_end(self, trainer, pl_module):
print("do something when training ends")
>>> ground_truth = glob.glob('training/*.xml')
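Both api.rst snippets above are cut off by the collapsed diff. Purely for orientation, here is a hedged sketch of how those pieces usually fit together; the ``RecognitionModel`` keyword arguments, the ``callbacks`` parameter, and ``trainer.fit()`` are assumptions inferred from the snippets, not something this commit touches:

.. code-block:: python

   # Hedged sketch: wiring the truncated training example together.
   import glob

   from pytorch_lightning.callbacks import Callback
   from kraken.lib.train import RecognitionModel, KrakenTrainer

   class MyPrintingCallback(Callback):
       def on_train_end(self, trainer, pl_module):
           print("do something when training ends")

   ground_truth = glob.glob('training/*.xml')
   training_files = ground_truth[:250]    # training data is shuffled internally
   evaluation_files = ground_truth[250:]

   # Keyword names below are assumed rather than taken from the diff.
   model = RecognitionModel(format_type='xml',
                            training_data=training_files,
                            evaluation_data=evaluation_files)
   trainer = KrakenTrainer(callbacks=[MyPrintingCallback()])
   trainer.fit(model)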
12 changes: 6 additions & 6 deletions docs/index.rst
@@ -30,7 +30,7 @@ kraken's main features are:
- :ref:`Public repository <repo>` of model files
- :ref:`Variable recognition network architectures <vgsl>`

-Pull requests and code contributions are always welcome.
+Pull requests and code contributions are always welcome.

Installation
============
@@ -86,15 +86,15 @@ The git repository contains some environment files that aid in setting up the la

.. code-block:: console
-$ git clone https://github.com/mittagessen/kraken.git
+$ git clone https://github.com/mittagessen/kraken.git
$ cd kraken
$ conda env create -f environment.yml
or:

.. code-block:: console
-$ git clone https://github.com/mittagessen/kraken.git
+$ git clone https://github.com/mittagessen/kraken.git
$ cd kraken
$ conda env create -f environment_cuda.yml
@@ -109,7 +109,7 @@ in the kraken directory for the current user:

::

-$ kraken get 10.5281/zenodo.10592716
+$ kraken get 10.5281/zenodo.10592716


A list of libre models available in the central repository can be retrieved by
@@ -125,9 +125,9 @@ Model metadata can be extracted using:
$ kraken show 10.5281/zenodo.10592716
name: 10.5281/zenodo.10592716
CATMuS-Print (Large, 2024-01-30) - Diachronic model for French prints and other languages
<p><strong>CATMuS-Print (Large) - Diachronic model for French prints and other West European languages</strong></p>
<p>CATMuS (Consistent Approach to Transcribing ManuScript) Print is a Kraken HTR model trained on data produced by several projects, dealing with different languages (French, Spanish, German, English, Corsican, Catalan, Latin, Italian&hellip;) and different centuries (from the first prints of the 16th c. to digital documents of the 21st century).</p>
<p>Transcriptions follow graphematic principles and try to be as compatible as possible with guidelines previously published for French: no ligature (except those that still exist), no allographetic variants (except the long s), and preservation of the historical use of some letters (u/v, i/j). Abbreviations are not resolved. Inconsistencies might be present, because transcriptions have been done over several years and the norms have slightly evolved.</p>
40 changes: 20 additions & 20 deletions docs/ketos.rst
@@ -5,12 +5,12 @@ Training

This page describes the training utilities available through the ``ketos``
command line utility in depth. For a gentle introduction on model training
-please refer to the :ref:`tutorial <training>`.
+please refer to the :ref:`tutorial <training>`.

There are currently three trainable components in the kraken processing pipeline:
* Segmentation: finding lines and regions in images
* Reading Order: ordering lines found in the previous segmentation step. Reading order models are closely linked to segmentation models and both are usually trained on the same dataset.
-* Recognition: recognition models transform images of lines into text.
+* Recognition: recognition models transform images of lines into text.

Depending on the use case it is not necessary to manually train new models for
each material. The default segmentation model works well on quite a variety of
@@ -246,7 +246,7 @@ would be:
A better configuration for large and complicated datasets such as handwritten texts:

-.. code-block:: console
+.. code-block:: console
$ ketos train --augment --workers 4 -d cuda -f binary --min-epochs 20 -w 0 -s '[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do]' -r 0.0001 dataset_large.arrow
@@ -273,10 +273,10 @@ an exact match. Otherwise an error will be raised:
$ ketos train -i model_5.mlmodel kamil/*.png
Building training set [####################################] 100%
Building validation set [####################################] 100%
-[0.8616] alphabet mismatch {'~', '»', '8', '9', 'ـ'}
+[0.8616] alphabet mismatch {'~', '»', '8', '9', 'ـ'}
Network codec not compatible with training set
-[0.8620] Training data and model codec alphabets mismatch: {'ٓ', '؟', '!', 'ص', '،', 'ذ', 'ة', 'ي', 'و', 'ب', 'ز', 'ح', 'غ', '~', 'ف', ')', 'د', 'خ', 'م', '»', 'ع', 'ى', 'ق', 'ش', 'ا', 'ه', 'ك', 'ج', 'ث', '(', 'ت', 'ظ', 'ض', 'ل', 'ط', '؛', 'ر', 'س', 'ن', 'ء', 'ٔ', '«', 'ـ', 'ٕ'}
+[0.8620] Training data and model codec alphabets mismatch: {'ٓ', '؟', '!', 'ص', '،', 'ذ', 'ة', 'ي', 'و', 'ب', 'ز', 'ح', 'غ', '~', 'ف', ')', 'د', 'خ', 'م', '»', 'ع', 'ى', 'ق', 'ش', 'ا', 'ه', 'ك', 'ج', 'ث', '(', 'ت', 'ظ', 'ض', 'ل', 'ط', '؛', 'ر', 'س', 'ن', 'ء', 'ٔ', '«', 'ـ', 'ٕ'}
There are two modes dealing with mismatching alphabets, ``union`` and ``new``.
``union`` resizes the output layer and codec of the loaded model to include all
characters in the new training set without removing any characters. ``new``
@@ -340,10 +340,10 @@ layers we define a network stub and index for appending:

.. code-block:: console
-$ ketos train -i model_1.mlmodel --append 7 -s '[Lbx256 Do]' syr/*.png
+$ ketos train -i model_1.mlmodel --append 7 -s '[Lbx256 Do]' syr/*.png
Building training set [####################################] 100%
Building validation set [####################################] 100%
-[0.8014] alphabet mismatch {'8', '3', '9', '7', '܇', '݀', '݂', '4', ':', '0'}
+[0.8014] alphabet mismatch {'8', '3', '9', '7', '܇', '݀', '݂', '4', ':', '0'}
Slicing and dicing model ✓
The new model will behave exactly like a new one, except potentially training a
@@ -599,7 +599,7 @@ It is also possible to filter out baselines/regions selectively:
Finally, we can merge baselines and regions into each other:

-.. code-block:: console
+.. code-block:: console
$ ketos segtrain -f xml --merge-baselines default:foo training_data/*.xml
Training line types:
@@ -653,7 +653,7 @@ with their segmentation model in a subsequent step. The general sequence is
therefore:

.. code-block:: console
$ ketos segtrain -o fr_manu_seg.mlmodel -f xml french/*.xml
...
$ ketos rotrain -o fr_manu_ro.mlmodel -f xml french/*.xml
@@ -671,8 +671,8 @@ serialized in the final XML output (in ALTO/PAGE XML).
Reading order models work purely on the typology and geometric features
of the lines and regions. They construct an approximate ordering matrix
by feeding feature vectors of two lines (or regions) into the network
-to decide which of those two lines precedes the other.
+to decide which of those two lines precedes the other.

These feature vectors are quite simple; just the lines' types, and
their start, center, and end points. Therefore they can *not* reliably
learn any ordering relying on graphical features of the input page such
@@ -705,10 +705,10 @@ sufficiently large training datasets:
│ 3 │ ro_net.relu │ ReLU │ 0 │
│ 4 │ ro_net.fc2 │ Linear │ 45 │
└───┴─────────────┴───────────────────┴────────┘
-Trainable params: 1.1 K
-Non-trainable params: 0
-Total params: 1.1 K
-Total estimated model params size (MB): 0
+Trainable params: 1.1 K
+Non-trainable params: 0
+Total params: 1.1 K
+Total estimated model params size (MB): 0
stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/35 0:00:00 • -:--:-- 0.00it/s val_spearman: 0.912 val_loss: 0.701 early_stopping: 0/300 inf
During validation a metric called Spearman's footrule is computed. To calculate
@@ -756,20 +756,20 @@ adding a number of image files as the final argument:
Evaluating $model
Evaluating [####################################] 100%
=== report test_model.mlmodel ===
7012 Characters
6022 Errors
14.12% Accuracy
5226 Insertions
2 Deletions
794 Substitutions
Count Missed %Right
1567 575 63.31% Common
5230 5230 0.00% Arabic
215 215 0.00% Inherited
Errors Correct-Generated
773 { ا } - { }
536 { ل } - { }
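The truncated reading-order hunk above refers to a validation metric called Spearman's footrule (``val_spearman`` in the progress output). As background rather than part of the diff: the footrule is the sum of absolute rank displacements between the predicted and the ground-truth line order, and kraken reports a normalised variant of this idea. A small self-contained sketch of the raw distance:

.. code-block:: python

   # Background sketch: Spearman's footrule distance between two orderings.
   # 0 means the predicted reading order matches the ground truth exactly.

   def spearman_footrule(predicted, truth):
       """Sum of absolute rank displacements; both sequences hold the same line ids."""
       truth_rank = {line: i for i, line in enumerate(truth)}
       return sum(abs(i - truth_rank[line]) for i, line in enumerate(predicted))

   # Swapping two neighbouring lines displaces each of them by one rank.
   print(spearman_footrule(['l0', 'l2', 'l1', 'l3'], ['l0', 'l1', 'l2', 'l3']))  # 2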
28 changes: 14 additions & 14 deletions docs/training.rst
@@ -142,18 +142,18 @@ that can be adjusted:
Training a network will take some time on a modern computer, even with the
default parameters. While the exact time required is unpredictable as training
is a somewhat random process a rough guide is that accuracy seldom improves
-after 50 epochs reached between 8 and 24 hours of training.
+after 50 epochs reached between 8 and 24 hours of training.

When to stop training is a matter of experience; the default setting employs a
fairly reliable approach known as `early stopping
<https://en.wikipedia.org/wiki/Early_stopping>`_ that stops training as soon as
the error rate on the validation set doesn't improve anymore. This will
prevent `overfitting <https://en.wikipedia.org/wiki/Overfitting>`_, i.e.
fitting the model to recognize only the training data properly instead of the
-general patterns contained therein.
+general patterns contained therein.

.. code-block:: console
$ ketos train output_dir/*.png
Building training set [####################################] 100%
Building validation set [####################################] 100%
@@ -164,7 +164,7 @@ general patterns contained therein.
Accuracy report (1) 0.0245 3504 3418
epoch 1/-1 [####################################] 788/788
Accuracy report (2) 0.8445 3504 545
-epoch 2/-1 [####################################] 788/788
+epoch 2/-1 [####################################] 788/788
Accuracy report (3) 0.9541 3504 161
epoch 3/-1 [------------------------------------] 13/788 0d 00:22:09
...
@@ -212,8 +212,8 @@ information by appending one or more ``-v`` to the command:
.. code-block:: console
$ ketos -vv train syr/*.png
-[0.7272] Building ground truth set from 876 line images
-[0.7281] Taking 88 lines from training for evaluation
+[0.7272] Building ground truth set from 876 line images
+[0.7281] Taking 88 lines from training for evaluation
...
[0.8479] Training set 788 lines, validation set 88 lines, alphabet 48 symbols
[0.8481] alphabet mismatch {'\xa0', '0', ':', '݀', '܇', '݂', '5'}
@@ -314,20 +314,20 @@ After all lines have been processed a evaluation report will be printed:
.. code-block:: console
=== report ===
35619 Characters
336 Errors
99.06% Accuracy
157 Insertions
81 Deletions
98 Substitutions
Count Missed %Right
27046 143 99.47% Syriac
7015 52 99.26% Common
1558 60 96.15% Inherited
Errors Correct-Generated
25 { } - { COMBINING DOT BELOW }
25 { COMBINING DOT BELOW } - { }
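The summary figures in the report above are related by simple arithmetic: the error count is the sum of insertions, deletions, and substitutions, and accuracy is the complement of the error rate. A quick sketch reproducing the numbers shown:

.. code-block:: python

   # Sketch: how the summary figures of the report above relate to each other.
   characters = 35619
   insertions, deletions, substitutions = 157, 81, 98

   errors = insertions + deletions + substitutions      # 336
   accuracy = (1 - errors / characters) * 100           # character accuracy

   print(f'{errors} Errors, {accuracy:.2f}% Accuracy')  # 336 Errors, 99.06% Accuracy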
@@ -433,16 +433,16 @@ Retrieving model metadata for a particular model:
$ kraken show arabic-alam-al-kutub
name: arabic-alam-al-kutub.mlmodel
An experimental model for Classical Arabic texts.
Network trained on 889 lines of [0] as a test case for a general Classical
Arabic model. Ground truth was prepared by Sarah Savant
<[email protected]> and Maxim Romanov <[email protected]>.
Vocalization was omitted in the ground truth. Training was stopped at ~35000
iterations with an accuracy of 97%.
[0] Ibn al-Faqīh (d. 365 AH). Kitāb al-buldān. Edited by Yūsuf al-Hādī, 1st
edition. Bayrūt: ʿĀlam al-kutub, 1416 AH/1996 CE.
alphabet: !()-.0123456789:[] «»،؟ءابةتثجحخدذرزسشصضطظعغفقكلمنهوىي ARABIC