
Merge pull request #583 from stweil/crlf+whitespace
Remove trailing whitespace and CR
mittagessen authored Apr 11, 2024
2 parents 9f4ccf3 + 5696830 commit 621305a
Showing 20 changed files with 5,208 additions and 5,208 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/test.yml
@@ -1,8 +1,8 @@
name: Lint, test, build, and publish

-on:
+on:
push:


jobs:
lint_and_test:
@@ -130,7 +130,7 @@ jobs:
pypi/*
publish-gh-pages:
-name: Update kraken.re github pages
+name: Update kraken.re github pages
needs: lint_and_test
runs-on: ubuntu-latest
if: |
@@ -147,7 +147,7 @@ jobs:
python-version: 3.9
- name: Install sphinx-multiversion
run: python -m pip install sphinx-multiversion sphinx-autoapi
-- name: Create docs
+- name: Create docs
run: sphinx-multiversion docs build/html
- name: Create redirect
run: cp docs/redirect.html build/html/index.html
8 changes: 4 additions & 4 deletions README.rst
@@ -55,15 +55,15 @@ branch as well:

::

-$ git clone https://github.com/mittagessen/kraken.git
+$ git clone https://github.com/mittagessen/kraken.git
$ cd kraken
$ conda env create -f environment.yml

or:

::

-$ git clone https://github.com/mittagessen/kraken.git
+$ git clone https://github.com/mittagessen/kraken.git
$ cd kraken
$ conda env create -f environment_cuda.yml

@@ -75,7 +75,7 @@ in the kraken directory for the current user:

::

-$ kraken get 10.5281/zenodo.10592716
+$ kraken get 10.5281/zenodo.10592716

A list of libre models available in the central repository can be retrieved by
running:
@@ -105,7 +105,7 @@ To segment an image (binarized or not) with the new baseline segmenter:
::

$ kraken -i image.tif lines.json segment -bl


To segment and OCR an image using the default model(s):

10 changes: 5 additions & 5 deletions docs/alto.xml
@@ -13,18 +13,18 @@
<PrintSpace...>
<ComposedBlockType ID="block_I"
HPOS="125"
VPOS="523"
WIDTH="5234"
VPOS="523"
WIDTH="5234"
HEIGHT="4000"
TYPE="region_type"><!-- for textlines part of a semantic region -->
<TextBlock ID="textblock_N">
<TextLine ID="line_0"
HPOS="..."
VPOS="..."
WIDTH="..."
VPOS="..."
WIDTH="..."
HEIGHT="..."
BASELINE="10 20 15 20 400 20"><!-- necessary for segmentation training -->
-<String ID="segment_K"
+<String ID="segment_K"
CONTENT="word_text"><!-- necessary for recognition training. Text is retrieved from <String> and <SP> tags. Lower level glyphs are ignored. -->
...
</String>
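The ``BASELINE`` attribute in the snippet above ("10 20 15 20 400 20") holds the baseline as flat, space-separated x y coordinate pairs. As an editorial aside (not part of this commit), a minimal Python sketch for reading those points out of an ALTO file could look like the following; the element and attribute names come from the snippet, while the v4 namespace constant and the helper itself are assumptions:

.. code-block:: python

   # Sketch only: pull (x, y) baseline points out of an ALTO file.
   from xml.etree import ElementTree as ET

   ALTO_NS = '{http://www.loc.gov/standards/alto/ns-v4#}'  # assumed ALTO v4 namespace

   def parse_baselines(path):
       """Return a list of (line_id, [(x, y), ...]) tuples."""
       tree = ET.parse(path)
       lines = []
       for textline in tree.iter(ALTO_NS + 'TextLine'):
           baseline = textline.get('BASELINE')
           if not baseline:
               continue
           coords = [int(float(v)) for v in baseline.split()]
           # BASELINE is a flat "x1 y1 x2 y2 ..." list; pair the values up.
           lines.append((textline.get('ID'), list(zip(coords[::2], coords[1::2]))))
       return lines

   # The baseline shown above parses to [(10, 20), (15, 20), (400, 20)].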
12 changes: 6 additions & 6 deletions docs/api.rst
@@ -1,10 +1,10 @@
-API Quickstart
+API Quickstart
==============

Kraken provides routines which are usable by third party tools to access all
functionality of the OCR engine. Most functional blocks, binarization,
segmentation, recognition, and serialization are encapsulated in one high
-level method each.
+level method each.

Simple use cases of the API which are mostly useful for debugging purposes are
contained in the `contrib` directory. In general it is recommended to look at
@@ -493,7 +493,7 @@ handling and verbosity options for the CLI.

.. code-block:: python
->>> from kraken.lib.train import RecognitionModel, KrakenTrainer
+>>> from kraken.lib.train import RecognitionModel, KrakenTrainer
>>> ground_truth = glob.glob('training/*.xml')
>>> training_files = ground_truth[:250] # training data is shuffled internally
>>> evaluation_files = ground_truth[250:]
@@ -522,14 +522,14 @@ can be attached to the trainer object:
.. code-block:: python
>>> from pytorch_lightning.callbacks import Callback
->>> from kraken.lib.train import RecognitionModel, KrakenTrainer
+>>> from kraken.lib.train import RecognitionModel, KrakenTrainer
>>> class MyPrintingCallback(Callback):
def on_init_start(self, trainer):
print("Starting to init trainer!")
def on_init_end(self, trainer):
print("trainer is init now")
def on_train_end(self, trainer, pl_module):
print("do something when training ends")
>>> ground_truth = glob.glob('training/*.xml')
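Both api.rst snippets above are cut off by the collapsed diff. Purely for orientation, here is a hedged sketch of how those pieces usually fit together; the ``RecognitionModel`` keyword arguments, the ``callbacks`` parameter, and ``trainer.fit()`` are assumptions inferred from the snippets, not something this commit touches:

.. code-block:: python

   # Hedged sketch: wiring the truncated training example together.
   import glob

   from pytorch_lightning.callbacks import Callback
   from kraken.lib.train import RecognitionModel, KrakenTrainer

   class MyPrintingCallback(Callback):
       def on_train_end(self, trainer, pl_module):
           print("do something when training ends")

   ground_truth = glob.glob('training/*.xml')
   training_files = ground_truth[:250]    # training data is shuffled internally
   evaluation_files = ground_truth[250:]

   # Keyword names below are assumed rather than taken from the diff.
   model = RecognitionModel(format_type='xml',
                            training_data=training_files,
                            evaluation_data=evaluation_files)
   trainer = KrakenTrainer(callbacks=[MyPrintingCallback()])
   trainer.fit(model)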
12 changes: 6 additions & 6 deletions docs/index.rst
@@ -30,7 +30,7 @@ kraken's main features are:
- :ref:`Public repository <repo>` of model files
- :ref:`Variable recognition network architectures <vgsl>`

-Pull requests and code contributions are always welcome.
+Pull requests and code contributions are always welcome.

Installation
============
@@ -86,15 +86,15 @@ The git repository contains some environment files that aid in setting up the la

.. code-block:: console
-$ git clone https://github.com/mittagessen/kraken.git
+$ git clone https://github.com/mittagessen/kraken.git
$ cd kraken
$ conda env create -f environment.yml
or:

.. code-block:: console
-$ git clone https://github.com/mittagessen/kraken.git
+$ git clone https://github.com/mittagessen/kraken.git
$ cd kraken
$ conda env create -f environment_cuda.yml
@@ -109,7 +109,7 @@ in the kraken directory for the current user:

::

-$ kraken get 10.5281/zenodo.10592716
+$ kraken get 10.5281/zenodo.10592716


A list of libre models available in the central repository can be retrieved by
@@ -125,9 +125,9 @@ Model metadata can be extracted using:
$ kraken show 10.5281/zenodo.10592716
name: 10.5281/zenodo.10592716
CATMuS-Print (Large, 2024-01-30) - Diachronic model for French prints and other languages
<p><strong>CATMuS-Print (Large) - Diachronic model for French prints and other West European languages</strong></p>
<p>CATMuS (Consistent Approach to Transcribing ManuScript) Print is a Kraken HTR model trained on data produced by several projects, dealing with different languages (French, Spanish, German, English, Corsican, Catalan, Latin, Italian&hellip;) and different centuries (from the first prints of the 16th c. to digital documents of the 21st century).</p>
<p>Transcriptions follow graphematic principles and try to be as compatible as possible with guidelines previously published for French: no ligature (except those that still exist), no allographetic variants (except the long s), and preservation of the historical use of some letters (u/v, i/j). Abbreviations are not resolved. Inconsistencies might be present, because transcriptions have been done over several years and the norms have slightly evolved.</p>
40 changes: 20 additions & 20 deletions docs/ketos.rst
@@ -5,12 +5,12 @@ Training

This page describes the training utilities available through the ``ketos``
command line utility in depth. For a gentle introduction on model training
-please refer to the :ref:`tutorial <training>`.
+please refer to the :ref:`tutorial <training>`.

There are currently three trainable components in the kraken processing pipeline:
* Segmentation: finding lines and regions in images
* Reading Order: ordering lines found in the previous segmentation step. Reading order models are closely linked to segmentation models and both are usually trained on the same dataset.
-* Recognition: recognition models transform images of lines into text.
+* Recognition: recognition models transform images of lines into text.

Depending on the use case it is not necessary to manually train new models for
each material. The default segmentation model works well on quite a variety of
@@ -246,7 +246,7 @@ would be:
A better configuration for large and complicated datasets such as handwritten texts:

-.. code-block:: console
+.. code-block:: console
$ ketos train --augment --workers 4 -d cuda -f binary --min-epochs 20 -w 0 -s '[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do]' -r 0.0001 dataset_large.arrow
@@ -273,10 +273,10 @@ an exact match. Otherwise an error will be raised:
$ ketos train -i model_5.mlmodel kamil/*.png
Building training set [####################################] 100%
Building validation set [####################################] 100%
-[0.8616] alphabet mismatch {'~', '»', '8', '9', 'ـ'}
+[0.8616] alphabet mismatch {'~', '»', '8', '9', 'ـ'}
Network codec not compatible with training set
-[0.8620] Training data and model codec alphabets mismatch: {'ٓ', '؟', '!', 'ص', '،', 'ذ', 'ة', 'ي', 'و', 'ب', 'ز', 'ح', 'غ', '~', 'ف', ')', 'د', 'خ', 'م', '»', 'ع', 'ى', 'ق', 'ش', 'ا', 'ه', 'ك', 'ج', 'ث', '(', 'ت', 'ظ', 'ض', 'ل', 'ط', '؛', 'ر', 'س', 'ن', 'ء', 'ٔ', '«', 'ـ', 'ٕ'}
+[0.8620] Training data and model codec alphabets mismatch: {'ٓ', '؟', '!', 'ص', '،', 'ذ', 'ة', 'ي', 'و', 'ب', 'ز', 'ح', 'غ', '~', 'ف', ')', 'د', 'خ', 'م', '»', 'ع', 'ى', 'ق', 'ش', 'ا', 'ه', 'ك', 'ج', 'ث', '(', 'ت', 'ظ', 'ض', 'ل', 'ط', '؛', 'ر', 'س', 'ن', 'ء', 'ٔ', '«', 'ـ', 'ٕ'}
There are two modes dealing with mismatching alphabets, ``union`` and ``new``.
``union`` resizes the output layer and codec of the loaded model to include all
characters in the new training set without removing any characters. ``new``
@@ -340,10 +340,10 @@ layers we define a network stub and index for appending:

.. code-block:: console
-$ ketos train -i model_1.mlmodel --append 7 -s '[Lbx256 Do]' syr/*.png
+$ ketos train -i model_1.mlmodel --append 7 -s '[Lbx256 Do]' syr/*.png
Building training set [####################################] 100%
Building validation set [####################################] 100%
-[0.8014] alphabet mismatch {'8', '3', '9', '7', '܇', '݀', '݂', '4', ':', '0'}
+[0.8014] alphabet mismatch {'8', '3', '9', '7', '܇', '݀', '݂', '4', ':', '0'}
Slicing and dicing model ✓
The new model will behave exactly like a new one, except potentially training a
@@ -599,7 +599,7 @@ It is also possible to filter out baselines/regions selectively:
Finally, we can merge baselines and regions into each other:

-.. code-block:: console
+.. code-block:: console
$ ketos segtrain -f xml --merge-baselines default:foo training_data/*.xml
Training line types:
@@ -653,7 +653,7 @@ with their segmentation model in a subsequent step. The general sequence is
therefore:

.. code-block:: console
$ ketos segtrain -o fr_manu_seg.mlmodel -f xml french/*.xml
...
$ ketos rotrain -o fr_manu_ro.mlmodel -f xml french/*.xml
@@ -671,8 +671,8 @@ serialized in the final XML output (in ALTO/PAGE XML).
Reading order models work purely on the typology and geometric features
of the lines and regions. They construct an approximate ordering matrix
by feeding feature vectors of two lines (or regions) into the network
-to decide which of those two lines precedes the other.
+to decide which of those two lines precedes the other.

These feature vectors are quite simple; just the lines' types, and
their start, center, and end points. Therefore they can *not* reliably
learn any ordering relying on graphical features of the input page such
@@ -705,10 +705,10 @@ sufficiently large training datasets:
│ 3 │ ro_net.relu │ ReLU │ 0 │
│ 4 │ ro_net.fc2 │ Linear │ 45 │
└───┴─────────────┴───────────────────┴────────┘
-Trainable params: 1.1 K
-Non-trainable params: 0
-Total params: 1.1 K
-Total estimated model params size (MB): 0
+Trainable params: 1.1 K
+Non-trainable params: 0
+Total params: 1.1 K
+Total estimated model params size (MB): 0
stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/35 0:00:00 • -:--:-- 0.00it/s val_spearman: 0.912 val_loss: 0.701 early_stopping: 0/300 inf
During validation a metric called Spearman's footrule is computed. To calculate
@@ -756,20 +756,20 @@ adding a number of image files as the final argument:
Evaluating $model
Evaluating [####################################] 100%
=== report test_model.mlmodel ===
7012 Characters
6022 Errors
14.12% Accuracy
5226 Insertions
2 Deletions
794 Substitutions
Count Missed %Right
1567 575 63.31% Common
5230 5230 0.00% Arabic
215 215 0.00% Inherited
Errors Correct-Generated
773 { ا } - { }
536 { ل } - { }
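The truncated reading-order hunk above refers to a validation metric called Spearman's footrule (``val_spearman`` in the progress output). As background rather than part of the diff: the footrule is the sum of absolute rank displacements between the predicted and the ground-truth line order, and kraken reports a normalised variant of this idea. A small self-contained sketch of the raw distance:

.. code-block:: python

   # Background sketch: Spearman's footrule distance between two orderings.
   # 0 means the predicted reading order matches the ground truth exactly.

   def spearman_footrule(predicted, truth):
       """Sum of absolute rank displacements; both sequences hold the same line ids."""
       truth_rank = {line: i for i, line in enumerate(truth)}
       return sum(abs(i - truth_rank[line]) for i, line in enumerate(predicted))

   # Swapping two neighbouring lines displaces each of them by one rank.
   print(spearman_footrule(['l0', 'l2', 'l1', 'l3'], ['l0', 'l1', 'l2', 'l3']))  # 2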
28 changes: 14 additions & 14 deletions docs/training.rst
@@ -142,18 +142,18 @@ that can be adjusted:
Training a network will take some time on a modern computer, even with the
default parameters. While the exact time required is unpredictable as training
is a somewhat random process a rough guide is that accuracy seldom improves
-after 50 epochs reached between 8 and 24 hours of training.
+after 50 epochs reached between 8 and 24 hours of training.

When to stop training is a matter of experience; the default setting employs a
fairly reliable approach known as `early stopping
<https://en.wikipedia.org/wiki/Early_stopping>`_ that stops training as soon as
the error rate on the validation set doesn't improve anymore. This will
prevent `overfitting <https://en.wikipedia.org/wiki/Overfitting>`_, i.e.
fitting the model to recognize only the training data properly instead of the
-general patterns contained therein.
+general patterns contained therein.

.. code-block:: console
$ ketos train output_dir/*.png
Building training set [####################################] 100%
Building validation set [####################################] 100%
@@ -164,7 +164,7 @@ general patterns contained therein.
Accuracy report (1) 0.0245 3504 3418
epoch 1/-1 [####################################] 788/788
Accuracy report (2) 0.8445 3504 545
-epoch 2/-1 [####################################] 788/788
+epoch 2/-1 [####################################] 788/788
Accuracy report (3) 0.9541 3504 161
epoch 3/-1 [------------------------------------] 13/788 0d 00:22:09
...
@@ -212,8 +212,8 @@ information by appending one or more ``-v`` to the command:
.. code-block:: console
$ ketos -vv train syr/*.png
-[0.7272] Building ground truth set from 876 line images
-[0.7281] Taking 88 lines from training for evaluation
+[0.7272] Building ground truth set from 876 line images
+[0.7281] Taking 88 lines from training for evaluation
...
[0.8479] Training set 788 lines, validation set 88 lines, alphabet 48 symbols
[0.8481] alphabet mismatch {'\xa0', '0', ':', '݀', '܇', '݂', '5'}
@@ -314,20 +314,20 @@ After all lines have been processed a evaluation report will be printed:
.. code-block:: console
=== report ===
35619 Characters
336 Errors
99.06% Accuracy
157 Insertions
81 Deletions
98 Substitutions
Count Missed %Right
27046 143 99.47% Syriac
7015 52 99.26% Common
1558 60 96.15% Inherited
Errors Correct-Generated
25 { } - { COMBINING DOT BELOW }
25 { COMBINING DOT BELOW } - { }
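The summary figures in the report above are related by simple arithmetic: the error count is the sum of insertions, deletions, and substitutions, and accuracy is the complement of the error rate. A quick sketch reproducing the numbers shown:

.. code-block:: python

   # Sketch: how the summary figures of the report above relate to each other.
   characters = 35619
   insertions, deletions, substitutions = 157, 81, 98

   errors = insertions + deletions + substitutions      # 336
   accuracy = (1 - errors / characters) * 100           # character accuracy

   print(f'{errors} Errors, {accuracy:.2f}% Accuracy')  # 336 Errors, 99.06% Accuracy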
@@ -433,16 +433,16 @@ Retrieving model metadata for a particular model:
$ kraken show arabic-alam-al-kutub
name: arabic-alam-al-kutub.mlmodel
An experimental model for Classical Arabic texts.
Network trained on 889 lines of [0] as a test case for a general Classical
Arabic model. Ground truth was prepared by Sarah Savant
<[email protected]> and Maxim Romanov <[email protected]>.
Vocalization was omitted in the ground truth. Training was stopped at ~35000
iterations with an accuracy of 97%.
[0] Ibn al-Faqīh (d. 365 AH). Kitāb al-buldān. Edited by Yūsuf al-Hādī, 1st
edition. Bayrūt: ʿĀlam al-kutub, 1416 AH/1996 CE.
alphabet: !()-.0123456789:[] «»،؟ءابةتثجحخدذرزسشصضطظعغفقكلمنهوىي ARABIC