
Usecase: StAZH

Hervé Déjean edited this page Nov 20, 2019 · 3 revisions

This page describes the first experiments with the StAZH collection.

Contact person: JLM

Summary

We selected the StAZH collection as advised: it is not too difficult, and both an HTR model and an expert are available.

See collection StAZH 3081

The labels are: 'catch-word', 'header', 'heading', 'marginalia', 'page-number'

One problem is consistency with the segmentation. We observed that these elements may correspond to one or two TextRegion elements, or to a single TextLine. This is a technical problem because we currently annotate at a single depth of the structure, e.g. TextRegion. We have some ideas for extending the underlying ML technology in the near future. We'll see...
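
To see how labels are spread over the two depths, one can count them per element kind in a page file. The sketch below is only an illustration: it assumes the labels are stored in a `custom` attribute of the form `structure {type:...;}` and that the files use the 2013-07-15 PAGE namespace. The sample XML and the function name are made up, so check both assumptions against your own exports.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumptions: PAGE namespace and the 'custom' attribute convention for labels.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
STRUCT_RE = re.compile(r"structure\s*\{[^}]*type:([^;}]+)")

# Hypothetical page: one label on a TextRegion, one on a TextLine.
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page>
    <TextRegion id="r1" custom="structure {{type:marginalia;}}">
      <TextLine id="r1l1"/>
    </TextRegion>
    <TextRegion id="r2">
      <TextLine id="r2l1" custom="structure {{type:page-number;}}"/>
      <TextLine id="r2l2"/>
    </TextRegion>
  </Page>
</PcGts>"""


def count_labels_by_depth(page_xml: str) -> Counter:
    """Count (element kind, label) pairs over TextRegion and TextLine elements."""
    root = ET.fromstring(page_xml)
    counts = Counter()
    for kind in ("TextRegion", "TextLine"):
        for elem in root.iter(f"{{{PAGE_NS}}}{kind}"):
            match = STRUCT_RE.search(elem.get("custom", ""))
            if match:
                counts[(kind, match.group(1).strip())] += 1
    return counts


counts = count_labels_by_depth(SAMPLE)
for (kind, label), n in sorted(counts.items()):
    print(kind, label, n)
```

A label that shows up under both TextRegion and TextLine in such a count is one that a single-depth annotation cannot cover completely.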

Dataset

Annotated:

  1. MM_1_001
  2. MM_1_005
  3. MM_1_012
  4. MM_1_017
  5. MM_1_025
  6. MM_1_028
  7. MM_1_032
  8. MM_1_036
  9. MM_1_040
  10. MM_1_044
  11. MM_1_048
  12. MM_2_235
  13. MM_2_231
  14. MM_2_068

First Experiment

CODE in https://github.com/Transkribus/TranskribusDU

NOTE: the transcript (text and segmentation) was produced manually, so this dataset does not reflect what the DU will face in the future.

  1. In Transkribus, or using the TranskribusPyClient commands, 3 collections were created:
     - READDU_JL_TRN: annotated training documents (001, 005, 032, 036, 040)
     - READDU_JL_TST: annotated test documents (012)
     - READDU_JL_PRD: documents to be annotated automatically (012, 017, 068, 231, 235)
  2. Using TranskribusPyClient, we downloaded each collection to disk (preferably without images, since we use images only for qualitative human evaluation).
  3. A model is trained (see usecases/DU_StAZH.py) and stored on disk.
  4. A test is run on the test collection.
Below is one of the first numerical evaluations:
 - loading test graphs
  	C:\Local\meunier\git\TranskribusDU\usecases\StAZH\trnskrbs_3832\col\8251.mpxml
  - 58 nodes,  75 edges)
 1 graphs loaded
  	- computing features on test set
 done
 - predicting on test set
 done
 Line=True class, column=Prediction
               OTHER  [[21  1  2  0  1]
              header   [ 0  6  2  0  0]
             heading   [ 0  0  1  0  0]
          marginalia   [ 0  0  0 15  0]
         page-number   [ 0  0  0  0  9]]
             precision    recall  f1-score   support
      OTHER       1.00      0.84      0.91        25
     header       0.86      0.75      0.80         8
    heading       0.20      1.00      0.33         1
 marginalia       1.00      1.00      1.00        15
 page-number      0.90      1.00      0.95         9
 avg / total      0.95      0.90      0.92        58
  5. An automated annotation is done, resulting in a new transcript to be added to the document in Transkribus. (To be done: the upload Python code is not working yet.)

Second Experiment

Same dataset, a few more test documents, end-to-end experimentation from Transkribus to Transkribus, but still based on manual segmentation and transcription.

Update your PYTHONPATH variable

  export PYTHONPATH=<YOURPATH>/TranskribusDU/TranskribusDU:<YOURPATH>/TranskribusPyClient/src

Training

First, make a TRAINING sandbox collection on Transkribus. Technically, the train and test collections are only read, never modified, so I could actually work from the original collection. Also FYI: since I'm using Cygwin on Windows, I have a python.sh script that deals with file path conversions.

  > ./python.sh TranskribusCommands/do_createCollec.py READDU_JL_TRN
  --> 3820

Adding some annotated documents to it

  #--- do_addDocToCollec.py  <colId>  [ <docId> | <docIdFrom>-<docIdTo> ]+
  > ./python.sh TranskribusCommands/do_addDocToCollec.py  3820  7749 7750
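
The usage line above accepts single document ids and `<docIdFrom>-<docIdTo>` ranges. As an illustration of that syntax only (this is not the tool's own parser, the function name is made up, and treating the range as inclusive is an assumption), the arguments could be expanded like this:

```python
def expand_doc_ids(args):
    """Expand tokens such as '8251' or '8564-8566' into a flat list of ints."""
    doc_ids = []
    for token in args:
        if "-" in token:
            first, last = (int(part) for part in token.split("-", 1))
            if first > last:
                raise ValueError(f"descending range: {token}")
            # Assumption: both ends of the range are included.
            doc_ids.extend(range(first, last + 1))
        else:
            doc_ids.append(int(token))
    return doc_ids


# Arguments taken from the do_copyDocToCollec.py line of the cheat sheet.
print(expand_doc_ids(["8251", "8252", "8564-8566"]))
# -> [8251, 8252, 8564, 8565, 8566]
```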

Downloading the XML on my machine

  > ./python.sh TranskribusCommands/Transkribus_downloader.py 3820 --noimage
  - Done, see in .\trnskrbs_3820

This is what I got on disk:

  > ls trnskrbs_3820
  col/  config.txt*  out/  ref/  run/  xml/
  > ls trnskrbs_3820/col
  7749/  7749.mpxml*  7749_max.ts*  7750/  7750.mpxml*  7750_max.ts*  trp.json*

Training! I train a model named mdl-StAZH_a, which will be stored in folder trn3820, based on the collection in folder trnskrbs_3820.

You can use either a CRF model (--crf) or a neural network model (--ecn) based on TensorFlow.

  #--- DU_StAZH.py <model-name> <model-directory> [--trn <col-dir>]+ [--tst <col-dir>]+ [--run <col-dir>]+  --ecn|--crf
  > ./python.sh usecases/DU_StAZH.py   mdl-StAZH_a   trn3820   --trn trnskrbs_3820  --ecn

Testing

Again, we create a test collection and populate it with annotated documents, in order to compute performance scores for the model.

  > ./python.sh TranskribusCommands/Transkribus_downloader.py 3832
  - Done, see in .\trnskrbs_3832

TESTING!

  > ./python.sh usecases/DU_StAZH.py mdl-StAZH_a trn3820 --tst ./trnskrbs_3832
  --------------------------------------------------
  Trained model 'mdl-StAZH_a' in folder 'trn3820'
  Test  collection(s):['C:\\tmp_READ\\tuto\\trnskrbs_3832\\col']
  --------------------------------------------------
  - loading a crf.Model_SSVM_AD3.Model_SSVM_AD3 model
        - loading pre-computed data from: trn3820\mdl-StAZH_a_model.pkl
                 file found on disk: trn3820\mdl-StAZH_a_model.pkl
                 file is fresh
        - loading pre-computed data from: trn3820\mdl-StAZH_a_transf.pkl
                 file found on disk: trn3820\mdl-StAZH_a_transf.pkl
                 file is fresh
  done
  - classes: ['OTHER', 'catch-word', 'header', 'heading', 'marginalia', 'page-number']
  - loading test graphs
        C:\tmp_READ\tuto\trnskrbs_3832\col\8251.mpxml
        - 58 nodes,  75 edges)
   1 graphs loaded
        - computing features on test set
          #features nodes=521  edges=532
         done
        - predicting on test set
         done
  Line=True class, column=Prediction
               OTHER  [[21  1  2  0  1]
              header   [ 0  6  2  0  0]
             heading   [ 0  0  1  0  0]
          marginalia   [ 0  0  0 15  0]
         page-number   [ 0  0  0  0  9]]
             precision    recall  f1-score   support
        OTHER       1.00      0.84      0.91        25
       header       0.86      0.75      0.80         8
      heading       0.20      1.00      0.33         1
   marginalia       1.00      1.00      1.00        15
  page-number       0.90      1.00      0.95         9
  avg / total       0.95      0.90      0.92        58
  (unweighted) Accuracy score = 0.90
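
Since rows are true classes and columns are predictions, the per-class figures in the report can be recomputed directly from the matrix. The sketch below does this in plain Python for the test run above; the function names are illustrative and not part of TranskribusDU.

```python
LABELS = ["OTHER", "header", "heading", "marginalia", "page-number"]

# Confusion matrix from the test run: rows = true class, columns = prediction.
CONFUSION = [
    [21, 1, 2, 0, 1],
    [0, 6, 2, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 15, 0],
    [0, 0, 0, 0, 9],
]


def per_class_scores(matrix):
    """Return one (precision, recall, f1, support) tuple per class."""
    n = len(matrix)
    scores = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted = sum(matrix[i][k] for i in range(n))  # column sum
        support = sum(matrix[k])                         # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1, support))
    return scores


def accuracy(matrix):
    """Fraction of nodes on the diagonal, i.e. correctly labelled."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total


scores = per_class_scores(CONFUSION)
for label, (p, r, f1, support) in zip(LABELS, scores):
    print(f"{label:>12} {p:.2f} {r:.2f} {f1:.2f} {support}")
print(f"accuracy {accuracy(CONFUSION):.2f}")
```

For example, the low precision of heading (0.20) comes from its column: 5 nodes were predicted as heading, but only 1 of them really is one. The "avg / total" row of the report appears consistent with a support-weighted average of the per-class values.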

APPLYING the model!!

Now create the collection where I'll apply the model

  > ./python.sh TranskribusCommands/do_createCollec.py READDU_JL_PRD
  -->3829

Here, I copy the documents into the new collection, because I'll upload a new transcript produced by the model. (At this stage, I do not want to impact the "real" documents.)

  > ./python.sh TranskribusCommands/Transkribus_downloader.py 3829 --noimage
  ---> - Done, see in .\trnskrbs_3829
  > ./python.sh usecases/DU_StAZH.py   mdl-StAZH_a   trn3820   --run ./trnskrbs_3829
  --> - done

This produced some ..._du.mpxml files:

  > ls trnskrbs_3829/col
  8620/           8620_max.ts*  8621_du.mpxml*  8622.mpxml*     8623/           8623_max.ts*  8624_du.mpxml*
  8620.mpxml*     8621/         8621_max.ts*    8622_du.mpxml*  8623.mpxml*     8624/         8624_max.ts*
  8620_du.mpxml*  8621.mpxml*   8622/           8622_max.ts*    8623_du.mpxml*  8624.mpxml*   trp.json*
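
Before uploading, it can be worth checking that every downloaded document got a prediction. The sketch below assumes only the naming seen in the listing above (`<docId>.mpxml` for the input, `<docId>_du.mpxml` for the output). The helper name is made up, and the example runs on a temporary folder rather than a real collection.

```python
import tempfile
from pathlib import Path


def missing_predictions(col_dir):
    """Return names of input .mpxml files that have no matching _du.mpxml file."""
    col = Path(col_dir)
    inputs = [
        p for p in sorted(col.glob("*.mpxml"))
        if not p.name.endswith("_du.mpxml")
    ]
    return [p.name for p in inputs if not (col / f"{p.stem}_du.mpxml").exists()]


# Demo on a throwaway folder that mimics trnskrbs_<colId>/col.
with tempfile.TemporaryDirectory() as tmp:
    col = Path(tmp) / "col"
    col.mkdir()
    for name in ("8620.mpxml", "8620_du.mpxml", "8621.mpxml"):
        (col / name).touch()
    result = missing_predictions(col)
    print(result)  # -> ['8621.mpxml']
```

On a real download, one would call it as `missing_predictions("trnskrbs_3829/col")` and expect an empty list.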

[Figure: 6943_du_p1.png]

Upload

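Now upload the new transcripts to Transkribus (see the TranskribusDU_transcriptUploader.py command in the cheat sheet at the end of this page).

Actually, this collection was also annotated, so we can compute a score on it:
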
  > ./python.sh tasks/DU_StAZH_a.py mdl-StAZH_a trn3820 --tst trnskrbs_3829
  Line=True class, column=Prediction
               OTHER  [[176  11   6   8   1  13]
          catch-word   [  0   0   0   0   0   0]
              header   [  0   0  38   3   0   2]
             heading   [  0   0   0   2   0   0]
          marginalia   [  0   0   0   0  62   0]
         page-number   [  0   0   0   0   0  48]]
 
             precision    recall  f1-score   support
        OTHER       1.00      0.82      0.90       215
   catch-word       0.00      0.00      0.00         0
       header       0.86      0.88      0.87        43
      heading       0.15      1.00      0.27         2
   marginalia       0.98      1.00      0.99        62
  page-number       0.76      1.00      0.86        48
 
  avg / total       0.95      0.88      0.90       370
 
  (unweighted) Accuracy score = 0.88

Visualisation with Transkribus

Open your document and go to Metadata/Structural. You should see the annotations.

Cheat Sheet

  ./python.sh TranskribusCommands/do_createCollec.py READDU_JL_TRN
  ./python.sh TranskribusCommands/do_addDocToCollec.py  3820 7749 7750
  ./python.sh TranskribusCommands/Transkribus_downloader.py 3820 --noimage
  ./python.sh usecases/DU_StAZH_a.py ./mdl-StAZH_a MyModel --trn trnskrbs_3820
  ./python.sh TranskribusCommands/do_createCollec.py READDU_JL_TST
  ./python.sh TranskribusCommands/do_addDocToCollec.py  3832 8251
  ./python.sh TranskribusCommands/Transkribus_downloader.py 3832
  ./python.sh usecases/StAZH/DU_StAZH.py ./mdl-StAZH_a MyModel --tst trnskrbs_3832
  ./python.sh TranskribusCommands/do_createCollec.py READDU_JL_PRD
  ./python.sh TranskribusCommands/do_copyDocToCollec.py  3829 8251 8252 8564-8566
  ./python.sh TranskribusCommands/Transkribus_downloader.py 3829 --noimage
  ./python.sh usecases/StAZH/DU_StAZH_a.py ./mdl-StAZH_a MyModel --run trnskrbs_3829
  ./python.sh TranskribusCommands/TranskribusDU_transcriptUploader.py ./trnskrbs_3829 3829