separate preprocessing steps and use AlternativeImage in ocropy wrappers (#10)

* separate preprocessing steps and use AlternativeImage in ocropy wrappers:

- move binarization from recognition into extra Processor
  (also allowing region and page level operation)
- move dewarping from recognition into extra Processor
  (operating on the line level; model-independent)
- move deskewing from binarization into extra Processor
  (operating on the region level, only annotating angle in PAGE)
- always dive down the PAGE hierarchy checking whether
  AlternativeImage is referenced: use it if present,
  otherwise create an ad-hoc image for the segment
  (page/region/line) from _relative_ coordinates into
  the next higher-level image by cropping (and rotating);
  also, pass down corrected coordinates:
  - offset coordinates if the image is larger than the segment
    (e.g. from rotation),
  - rotate coordinates if the region was rotated (has @orientation)
  new functions (all to be moved into ocrd core):
  - image_from_page (AlternativeImage, or crop via Border)
  - image_from_region (AlternativeImage, or crop and rotate via Coords
    and orientation)
  - image_from_line (AlternativeImage, or crop via rotated polygon
    mask, and optionally region segmentation)
  - save_image_file (save new AlternativeImage: add to METS and
    reference in PAGE)
- use polygon masks instead of rectangles when cropping lines
  (especially useful after rotation), and try to resegment regions
  to mask components from neighbouring lines (especially useful
  against ascenders and descenders when dewarping or with sensitive
  OCR like ocropy)
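
The coordinate correction described above (offset first, then rotation) can be sketched as follows. This is only an illustration under stated assumptions, not the code from this commit: the names `rotate_point` and `coordinates_in_segment`, the choice of rotation center, and the sign convention of the angle are all hypothetical.

```python
import math

def rotate_point(point, angle_deg, center):
    """Rotate a point around a center by angle_deg (standard rotation matrix)."""
    angle = math.radians(angle_deg)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (center[0] + dx * math.cos(angle) - dy * math.sin(angle),
            center[1] + dx * math.sin(angle) + dy * math.cos(angle))

def coordinates_in_segment(polygon, parent_offset, angle_deg=0.0, center=(0.0, 0.0)):
    """Convert absolute polygon points into coordinates relative to a parent image.

    parent_offset: top-left corner of the parent image in absolute coordinates
    angle_deg:     rotation applied to the parent image (assumed convention)
    center:        rotation center in relative coordinates (assumption)
    """
    # Step 1: shift into the parent's coordinate system.
    shifted = [(x - parent_offset[0], y - parent_offset[1]) for x, y in polygon]
    # Step 2: rotate only if the parent image was rotated.
    if angle_deg:
        shifted = [rotate_point(point, angle_deg, center) for point in shifted]
    return shifted

# A line polygon inside a region whose image starts at (100, 200):
print(coordinates_in_segment([(110, 220), (130, 260)], (100, 200)))
```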

- move common ocropy functions into extra module
  (but with additions/improvements):
  - PIL.Image vs np.ndarray conversions
  - type and plausibility checks for line/region/page level
    (but mix absolute and relative error criteria)
  - local whitelevel estimation (but keeping exact size)
  - deskewing (but expanding image size with rotation)
  - binarization (but using larger whitelevel percentile,
    smaller whitelevel local range and zoom, and
    larger white point threshold)
  - borderclean (remove black components only in the margin)
  - black and white column separator search
  - gradmap for baseline search (but with smaller minimum size
    of boxmap and sticky top/bottom for line components that
    were chopped-off)
  - line seed search (but with horizontal merge to avoid
    splitting lines at large whitespace in the absence of
    true colseps)
  - line segmentation for regions/pages without/with colseps
    (but with larger scale estimate and tighter hscale
     for higher vertical variability of broken fonts)
  - denoising
- ocrd-tool: add default input and output file groups
- update README and setup
- version: 0.0.2 -> 0.0.3

* move optional region segmentation from common.image_from_line to binarize.process

* polygon-based AlternativeImage processing, separate resegmentation, add clip:

- make all common functions for image extraction respect and recreate
  the full polygon coordinates (not just the bounding box):
  - use Numpy arrays for coordinates instead of dicts
  - rename rotate_polygon → rotate_coordinates
  - factor out coordinates_of_segment for shared offset/rotation calc
  - offer extra coordinates_for_segment for the reverse direction
    (to add segmentation on lower levels)
  - factor out image_from_polygon for shared background masking
- when masking a polygon from an image, fill with the background color
  (instead of white)
- when cropping a rectangle from an image, if the rectangle extends
  beyond the image (as happens with bad segmentation when segments
  extend beyond their parents in PAGE), fill with the background color
  (instead of black)
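
A minimal sketch of the cropping behaviour described above, assuming a grayscale image stored as a list of rows. The helper names and the background estimate (most frequent pixel value) are assumptions made for illustration; the commit itself may estimate the background differently.

```python
from collections import Counter

def background_color(image):
    """Estimate the background as the most frequent pixel value (assumption)."""
    counts = Counter(pixel for row in image for pixel in row)
    return counts.most_common(1)[0][0]

def crop_with_fill(image, left, top, right, bottom):
    """Crop [left, right) x [top, bottom) from a row-major image.

    Pixels of the rectangle that lie outside the image are filled with the
    estimated background color instead of black.
    """
    height, width = len(image), len(image[0])
    fill = background_color(image)
    return [[image[y][x] if 0 <= y < height and 0 <= x < width else fill
             for x in range(left, right)]
            for y in range(top, bottom)]

# The rectangle extends one pixel beyond the right and bottom image edges.
image = [[255, 255, 0],
         [255, 0, 255]]
for row in crop_with_fill(image, 1, 0, 4, 3):
    print(row)
```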

- in various processors: start introducing DPI-based zoom parameter
- when deskewing, make sure to also create a rotated AlternativeImage
- when deskewing, ignore detected angles if the drop in variance is too
  small (as happens on tiny regions)
- when binarizing, be robust against NaN results for threshold levels
- when binarizing, do not attempt borderclean (obsolete with clip)
- when binarizing, do not attempt deskewing on page level (yet)

- add new Processor clipping connected components from neighbouring
  segments (operating on the region or line level), which produces
  images with intruding foreground components clipped to white
- move re-segmentation from `image_from_line` or binarization/dewarping
  into extra Processor (operating on the line level), which instead of
  producing images creates shrunken, non-overlapping polygon outlines
- improve line segmentation (compute_line_labels) further:
  - use more robust state transitions from bottom to top line markers:
    project seed by delta from both bottom (up) and top (down), but
    stop short if they are closer to each other already (fill only)
  - horizontally blur bottom line markers just like top line markers
  - skip horizontally blurring the resulting seeds altogether
    (to avoid accidentally joining lines)
  - this obsoletes the large (6*hscale) horizontal blur of the gradient
  - this obsoletes the sticky option for compute_gradmap: do not extend
    the gradient from the bottom/top margins
  - make the old behaviour available with robust=False
  - fix hmerge relabelling
  - when spreading line seeds to the background, first make sure that
    connected components of the foreground remain in their majority label
  - when full_page=True,
    - add remove_hlines again, but with additional height threshold,
      and smaller width threshold default
    - when searching for black column separators,
      - reduce the vertical threshold (because vlines can be discontinuous)
      - keep only connected components that are properly contained in
        the detected region (i.e. avoid damaging neighbours)
  - make checks optional here as well
  - combine scale parameters with additional top-level zoom parameter
    (to be determined from DPI factor against implicit 300)
- improve docstrings
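
The DPI-based zoom mentioned above amounts to a simple ratio against the implicit 300 DPI. The sketch below is an assumption about how such a factor could be derived and applied; `zoom_from_dpi`, `scaled` and the fallback to 1.0 are hypothetical names and choices, not the functions added by this commit.

```python
def zoom_from_dpi(dpi, reference_dpi=300):
    """Zoom factor relative to the implicit 300 DPI.

    Falls back to 1.0 when the resolution is unknown or invalid (assumption).
    """
    if not dpi or dpi <= 0:
        return 1.0
    return dpi / reference_dpi

def scaled(parameter, zoom):
    """Scale a pixel-based parameter (e.g. a size threshold) by the zoom factor."""
    return parameter * zoom

zoom = zoom_from_dpi(600)
print(zoom, scaled(3, zoom))
```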

* forgotten
bertsky authored and finkf committed Jul 16, 2019
1 parent 5d7ceb0 commit fab1e8d
Showing 14 changed files with 2,860 additions and 318 deletions.
86 changes: 79 additions & 7 deletions README.md
@@ -69,23 +69,78 @@ ocrd-cis-align \


### ocrd-cis-ocropy-train
- The ocropy-train tool can be used to train lstm models.
- The tool takes the ground truth from a workspace and safes the snippets from the corresponding page.
- Then the model is trained on all snippets for 1 million randomized iterations or the given number from the parameter file.
+ The ocropy-train tool can be used to train LSTM models.
+ It takes ground truth from the workspace and saves (image+text) snippets from the corresponding pages.
+ Then a model is trained on all snippets for 1 million (or the given number of) randomized iterations from the parameter file.
```sh
ocrd-cis-ocropy-train \
- --input-file-grp 'OCR-D-XML' \
+ --input-file-grp OCR-D-GT-SEG-LINE \
--mets mets.xml \
--parameter file:///path/to/config.json
```

### ocrd-cis-ocropy-clip
The ocropy-clip tool can be used to remove intrusions of neighbouring segments in regions / lines of a workspace.
It runs an (ad-hoc binarization and) connected component analysis on every text region / line of every PAGE in the input file group, as well as on its overlapping neighbours. For each binary object in conflict, it determines whether the object belongs to the neighbour and can therefore be clipped to white. It references the resulting segment image files in the output PAGE (as AlternativeImage).
```sh
ocrd-cis-ocropy-clip \
--input-file-grp OCR-D-SEG-LINE \
--output-file-grp OCR-D-SEG-LINE-CLIP \
--mets mets.xml \
--parameter file:///path/to/config.json
```

### ocrd-cis-ocropy-resegment
The ocropy-resegment tool can be used to remove overlap between lines of a workspace.
It runs an (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group. For each line already annotated, it determines the label of largest extent within the original coordinates (polygon outline) of that line, and annotates the resulting coordinates in the output PAGE.
```sh
ocrd-cis-ocropy-resegment \
--input-file-grp OCR-D-SEG-LINE \
--output-file-grp OCR-D-SEG-LINE-RES \
--mets mets.xml \
--parameter file:///path/to/config.json
```
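
The "label of largest extent" decision can be sketched like this. It assumes the line labels of all foreground pixels inside one annotated polygon are already available as a flat list; `majority_label` is a hypothetical helper, and only the `min_fraction` default of 0.8 is taken from the tool's parameter description.

```python
from collections import Counter

def majority_label(labels_in_polygon, min_fraction=0.8):
    """Return the dominant line label inside a polygon, or None.

    labels_in_polygon: labels of the foreground pixels inside the polygon,
                       with 0 meaning "no line label" (assumption)
    min_fraction:      share of labelled pixels the winner must retain
    """
    counts = Counter(label for label in labels_in_polygon if label != 0)
    total = sum(counts.values())
    if total == 0:
        return None
    label, count = counts.most_common(1)[0]
    return label if count / total >= min_fraction else None

print(majority_label([1] * 9 + [2]))    # label 1 holds 90% of the pixels
print(majority_label([1, 1, 2, 2, 0]))  # no label is dominant enough
```

When no label reaches the threshold, a caller would presumably keep the original polygon unchanged; that fallback is also an assumption.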

### ocrd-cis-ocropy-deskew
The ocropy-deskew tool can be used to deskew regions of a workspace.
It runs the Ocropy thresholding and deskewing estimation on every text region of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
```sh
ocrd-cis-ocropy-deskew \
--input-file-grp OCR-D-SEG-LINE \
--output-file-grp OCR-D-SEG-LINE-DES \
--mets mets.xml \
--parameter file:///path/to/config.json
```

### ocrd-cis-ocropy-binarize
The ocropy-binarize tool can be used to grayscale-normalize and deskew pages / regions / lines of a workspace.
It runs the Ocropy thresholding and deskewing estimation on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.)
```sh
ocrd-cis-ocropy-binarize \
--input-file-grp OCR-D-SEG-LINE-DES \
--output-file-grp OCR-D-SEG-LINE-BIN \
--mets mets.xml \
--parameter file:///path/to/config.json
```

### ocrd-cis-ocropy-dewarp
The ocropy-dewarp tool can be used to dewarp text lines of a workspace.
It runs the Ocropy baseline estimation and dewarping on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).
```sh
ocrd-cis-ocropy-dewarp \
--input-file-grp OCR-D-SEG-LINE-BIN \
--output-file-grp OCR-D-SEG-LINE-DEW \
--mets mets.xml \
--parameter file:///path/to/config.json
```

### ocrd-cis-ocropy-recognize
The ocropy-recognize tool can be used to recognize lines / words / glyphs from pages of a workspace.
- The tool runs the ocropy optical character recognition for each "region" given in the XML file of the workspace.
+ It runs the Ocropy optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.
```sh
ocrd-cis-ocropy-recognize \
- --input-file-grp 'OCR-D-XML' \
- --output-file-grp 'OCR-D-OCROPY' \
+ --input-file-grp OCR-D-SEG-LINE-DEW \
+ --output-file-grp OCR-D-OCR-OCRO \
--mets mets.xml \
--parameter file:///path/to/config.json
```
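
Each example above passes a parameter file via `--parameter file:///path/to/config.json`. As a sketch, such a file could be generated as follows for ocrd-cis-ocropy-binarize. The parameter names and values come from that tool's entry in ocrd-tool.json (with the level set to `line` instead of the default `page`); the helper `write_parameters` and the temporary directory are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Parameters of ocrd-cis-ocropy-binarize as declared in ocrd-tool.json.
binarize_parameters = {
    "method": "ocropy",
    "maxskew": 5.0,
    "noise_maxsize": 2,
    "level-of-operation": "line",
}

def write_parameters(parameters, directory):
    """Write parameters to config.json and return its file:// URI."""
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(parameters, indent=2))
    return path.resolve().as_uri()

with tempfile.TemporaryDirectory() as directory:
    uri = write_parameters(binarize_parameters, directory)
    print("--parameter", uri)
```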
@@ -120,7 +175,24 @@ place them into: /usr/share/tesseract-ocr/4.00/tessdata
Tesserocr v2.4.0 seems broken for tesseract 4.0.0-beta. Install
version v2.3.1 instead: `pip install tesserocr==2.3.1`.

## Workflow configuration

A decent pipeline might look like this:

1. page-level cropping
2. page-level binarization
3. page-level deskewing
4. page-level dewarping
5. region segmentation
6. region-level clipping
7. region-level deskewing
8. line segmentation
9. line-level clipping or resegmentation
10. line-level dewarping
11. line-level recognition
12. line-level alignment

If GT is used, steps 1, 5 and 8 can be omitted. Otherwise, if the segmentation used in steps 5 and 8 does not produce overlapping segments, steps 6 and 9 can be omitted.
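
As a sketch, the line-level part of such a pipeline could be chained by feeding each step's output file group into the next step. The executables, flags and file group names are the ones used in the README examples; the `STEPS` list and `build_commands` helper are hypothetical, and the `--parameter` option is left out for brevity.

```python
import shlex

# (executable, input file group, output file group), as in the README examples
STEPS = [
    ("ocrd-cis-ocropy-deskew", "OCR-D-SEG-LINE", "OCR-D-SEG-LINE-DES"),
    ("ocrd-cis-ocropy-binarize", "OCR-D-SEG-LINE-DES", "OCR-D-SEG-LINE-BIN"),
    ("ocrd-cis-ocropy-dewarp", "OCR-D-SEG-LINE-BIN", "OCR-D-SEG-LINE-DEW"),
    ("ocrd-cis-ocropy-recognize", "OCR-D-SEG-LINE-DEW", "OCR-D-OCR-OCRO"),
]

def build_commands(steps, mets="mets.xml"):
    """Build one argument list per processing step."""
    commands = []
    for executable, input_grp, output_grp in steps:
        commands.append([executable,
                         "--input-file-grp", input_grp,
                         "--output-file-grp", output_grp,
                         "--mets", mets])
    return commands

# Print the commands instead of running them; each list could also be
# passed to subprocess.run once the tools are installed.
for command in build_commands(STEPS):
    print(shlex.join(command))
```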

## OCR-D links

178 changes: 167 additions & 11 deletions ocrd_cis/ocrd-tool.json
@@ -46,6 +46,163 @@
}
}
},
"ocrd-cis-ocropy-binarize": {
"executable": "ocrd-cis-ocropy-binarize",
"categories": [
"Image preprocessing"
],
"steps": [
"preprocessing/optimization/binarization",
"preprocessing/optimization/grayscale_normalization",
"preprocessing/optimization/deskewing"
],
"input_file_grp": [
"OCR-D-IMG",
"OCR-D-SEG-BLOCK",
"OCR-D-SEG-LINE"
],
"output_file_grp": [
"OCR-D-IMG-BIN",
"OCR-D-SEG-BLOCK",
"OCR-D-SEG-LINE"
],
"description": "Binarize and deskew pages / regions / lines with ocropy",
"parameters": {
"method": {
"type": "string",
"enum": ["none", "global", "otsu", "gauss-otsu", "ocropy"],
"description": "binarization method to use (only ocropy will include deskewing)",
"default": "ocropy"
},
"maxskew": {
"type": "number",
"description": "modulus of maximum skewing angle to detect (larger will be slower, 0 will deactivate deskewing)",
"default": 5.0
},
"noise_maxsize": {
"type": "number",
"description": "maximum pixel number for connected components to regard as noise (0 will deactivate denoising)",
"default": 2
},
"level-of-operation": {
"type": "string",
"enum": ["page", "region", "line"],
"description": "PAGE XML hierarchy level granularity to annotate images for",
"default": "page"
}
}
},
"ocrd-cis-ocropy-deskew": {
"executable": "ocrd-cis-ocropy-deskew",
"categories": [
"Image preprocessing"
],
"steps": [
"preprocessing/optimization/deskewing"
],
"input_file_grp": [
"OCR-D-SEG-BLOCK",
"OCR-D-SEG-LINE"
],
"output_file_grp": [
"OCR-D-SEG-BLOCK",
"OCR-D-SEG-LINE"
],
"description": "Deskew regions with ocropy (but only by annotating orientation angle)",
"parameters": {
"maxskew": {
"type": "number",
"description": "modulus of maximum skewing angle to detect (larger will be slower, 0 will deactivate deskewing)",
"default": 5.0
}
}
},
"ocrd-cis-ocropy-clip": {
"executable": "ocrd-cis-ocropy-clip",
"categories": [
"Layout analysis"
],
"steps": [
"layout/segmentation/region",
"layout/segmentation/line"
],
"input_file_grp": [
"OCR-D-SEG-BLOCK",
"OCR-D-SEG-LINE"
],
"output_file_grp": [
"OCR-D-SEG-BLOCK",
"OCR-D-SEG-LINE"
],
"description": "Clip text regions / lines at intersections with neighbours",
"parameters": {
"level-of-operation": {
"type": "string",
"enum": ["region", "line"],
"description": "PAGE XML hierarchy level granularity to annotate images for",
"default": "region"
},
"min_fraction": {
"type": "number",
"format": "float",
"description": "share of foreground pixels that must be retained by the largest label",
"default": 0.7
}
}
},
"ocrd-cis-ocropy-resegment": {
"executable": "ocrd-cis-ocropy-resegment",
"categories": [
"Layout analysis"
],
"steps": [
"layout/segmentation/line"
],
"input_file_grp": [
"OCR-D-SEG-LINE"
],
"output_file_grp": [
"OCR-D-SEG-LINE"
],
"description": "Resegment lines with ocropy (by shrinking annotated polygons)",
"parameters": {
"min_fraction": {
"type": "number",
"format": "float",
"description": "share of foreground pixels that must be retained by the largest label",
"default": 0.8
},
"extend_margins": {
"type": "number",
"format": "integer",
"description": "number of pixels to extend the input polygons horizontally and vertically before intersecting",
"default": 3
}
}
},
"ocrd-cis-ocropy-dewarp": {
"executable": "ocrd-cis-ocropy-dewarp",
"categories": [
"Image preprocessing"
],
"steps": [
"preprocessing/optimization/dewarping"
],
"description": "Dewarp line images with ocropy",
"input_file_grp": [
"OCR-D-SEG-LINE"
],
"output_file_grp": [
"OCR-D-SEG-LINE"
],
"parameters": {
"range": {
"type": "number",
"description": "maximum vertical disposition or maximum margin (will be multiplied by mean centerline deltas to yield pixels)",
"default": 4
}
}
},
"ocrd-cis-ocropy-recognize": {
"executable": "ocrd-cis-ocropy-recognize",
"categories": [
@@ -54,21 +211,20 @@
"steps": [
"recognition/text-recognition"
],
- "description": "Recognize text in lines with ocropy",
+ "description": "Recognize text in (binarized+deskewed+dewarped) lines with ocropy",
"input_file_grp": [
"OCR-D-SEG-LINE",
"OCR-D-SEG-WORD",
"OCR-D-SEG-GLYPH"
],
"output_file_grp": [
"OCR-D-OCR-OCRO"
],
"parameters": {
- "dewarping": {
- "type": "boolean",
- "description": "enable line normalization",
- "default": true
- },
- "binarization": {
- "type": "string",
- "enum": ["none", "global", "otsu", "gauss-otsu", "ocropy"],
- "default": "none"
- },
"textequiv_level": {
"type": "string",
"enum": ["line", "word", "glyph"],
"description": "PAGE XML hierarchy level granularity to add the TextEquiv results to",
"default": "line"
},
"model": {