Restructure general workflow, cli, services, processors, docs, tests #40

Merged on Mar 27, 2018 (68 commits)

Commits
11bff34
Stubs for METS, PAGE, resolver and workspace, pylint, unittests
kba Mar 23, 2018
abeb696
OcrdMetsFile in its own file
kba Mar 23, 2018
f22758b
wip: ResolverCache
kba Mar 23, 2018
c54a433
model.ocrd_page: fix indexing off by one
kba Mar 23, 2018
1fc1ee7
recognize works
kba Mar 23, 2018
01045cf
gitignore libreoffice lock files
kba Mar 23, 2018
dbaa95c
wip: travis
kba Mar 23, 2018
3a05a34
typo: segent -> segment
kba Mar 23, 2018
e3e46ae
ResolverCache working
kba Mar 23, 2018
919cec9
'make test' to run all unit tests
kba Mar 23, 2018
f3cc1ae
add EXIF constants
kba Mar 23, 2018
db9eb66
OcrdMets: cache fileGrps
kba Mar 23, 2018
98ef09a
:memo: Update README
kba Mar 24, 2018
f018657
python3 compat (make PYTHON=python3 test)
kba Mar 24, 2018
18d12ee
create processor class, port exif to new api, extend page, test files…
kba Mar 24, 2018
6d81ea5
rename ocrd.log -> ocrd.utils to contain reusable static code
kba Mar 24, 2018
9641bbd
move to utils, export getLogger, coordinate_string_from_xywh
kba Mar 24, 2018
905aa1e
lazy logging
kba Mar 24, 2018
fef9b5c
pylint: stop complaining about lxml
kba Mar 24, 2018
ff1e92d
page tag constants
kba Mar 24, 2018
62684c1
OcrdPage: methods for listing/creating regions/lines
kba Mar 24, 2018
728564e
workspace: + save_mets method
kba Mar 24, 2018
8ecce18
utils: xywh_from_coordinate_string as opposite of coordinate_string_f…
kba Mar 24, 2018
fb09eba
WIP port segmenting to new api
kba Mar 24, 2018
cf724a8
pylint: stop complaining about tesserocr/cv2
kba Mar 24, 2018
9dbe659
MIMETYPE_PAGE = text/page+xml
kba Mar 24, 2018
cfb536a
processor: helpers for input/output of files
kba Mar 24, 2018
6cbb851
OcrdPage: prefer "X is not None" over "not X"
kba Mar 24, 2018
48b3cee
tests: assets module
kba Mar 24, 2018
b8a942b
mirror module structure in tests
kba Mar 25, 2018
98c099b
segment*/tesseract: use processor shortcuts
kba Mar 25, 2018
bb8832f
xsl namespace
kba Mar 25, 2018
50b1f60
workspace: output files are saved with file:// if no url
kba Mar 25, 2018
751f0c2
OcrdPage: typos
kba Mar 25, 2018
f323355
xml prettify
kba Mar 25, 2018
9244464
remove cruft from ocrd_xml_base
kba Mar 25, 2018
19e0c19
tests: run all with uniitest discover
kba Mar 25, 2018
e4c9fc4
test with pytest
kba Mar 25, 2018
93dfbba
test OcrdPage
kba Mar 25, 2018
3de7a6a
:fire: remove original characterizing/segmenting
kba Mar 25, 2018
468c698
:memo: docstrings in OcrdPage
kba Mar 25, 2018
83de2f7
:fire: remove initializing
kba Mar 25, 2018
02e01c9
start with cli
kba Mar 25, 2018
271d0bf
cli
kba Mar 25, 2018
26b9fff
run_process in ocrd.processor to flexibly create workspace and run pr…
kba Mar 25, 2018
3a9391c
Expose existing processors on web service, extend run-server script
kba Mar 25, 2018
4568bd8
rename binary to 'ocrd', merge run and run_server, update setup.py
kba Mar 25, 2018
3da3e46
CLI: ocrd process is now chainable
kba Mar 25, 2018
4bfd81b
:fire: remove ocrd.webservices
kba Mar 25, 2018
e712196
minimal repository web service
kba Mar 25, 2018
0c80139
optionally symlink instead of copy in resolver
kba Mar 25, 2018
74db476
:memo: docs, move code in in processor/__init__.py to processor/base.py
kba Mar 25, 2018
30a97c9
basic setup for documentation with sphinx
kba Mar 25, 2018
32c2ce7
move image manipulation to workspace.resolve_image_as_pil
kba Mar 25, 2018
b58e6f6
resolver: allow setting workspace directory explicitly (for testing)
kba Mar 26, 2018
fa92fa9
use xmllint --format to optionally canonicalize/pretty print XML
kba Mar 26, 2018
cabb273
canonical ID for mets:file: fileGrp@USE + 4-zero-padded index within grp
kba Mar 26, 2018
bfdf09a
page: helpers to work with TextLine
kba Mar 26, 2018
38c85a7
processor.add_output_file: pass on ID
kba Mar 26, 2018
9f3c82c
WIP recognition with tesseract3
kba Mar 26, 2018
80d33da
test assets
kba Mar 26, 2018
c8232b7
:bug: resolver: use hyper-verbose but uniqe filenames based on url
kba Mar 23, 2018
3693299
:art: remove obsolete pylint exceptions
kba Mar 26, 2018
4e0fa6f
rename tesseract3 -> tesserocr
kba Mar 26, 2018
38d29ab
make 'test-profile' to list most time-consuming lines
kba Mar 26, 2018
f911afa
workspace: remove hard-coded reference to INPUT fileGrp
kba Mar 26, 2018
1a2bd9b
:green_heart: travis add @alex-p's tesseract-ocr PPA
kba Mar 26, 2018
daade7f
properly skip recognize test, travis
kba Mar 24, 2018
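
Several commit messages above describe small helpers: the canonical ``mets:file`` ID (cabb273) and the conversion between a bounding box and a coordinate string (9641bbd, 8ecce18). The following is only a rough sketch of what such helpers might look like, not the code from this PR. The underscore separator, the dict keys ``x``/``y``/``w``/``h`` and the four-corner point order are assumptions made for illustration.

```python
def canonical_file_id(file_grp_use, index):
    """Combine a fileGrp USE value with a 4-zero-padded index.

    The '_' separator is an assumption; the commit message only names
    the two components.
    """
    return '%s_%04d' % (file_grp_use, index)


def coordinate_string_from_xywh(box):
    """Turn an x/y/width/height dict into a 'x,y x,y x,y x,y' string.

    Assumed corner order: top-left, top-right, bottom-right, bottom-left.
    """
    x, y, w, h = box['x'], box['y'], box['w'], box['h']
    corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
    return ' '.join('%d,%d' % corner for corner in corners)


def xywh_from_coordinate_string(points):
    """Inverse of coordinate_string_from_xywh: bounding box of the points."""
    pairs = [tuple(int(v) for v in pair.split(',')) for pair in points.split()]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    return {
        'x': min(xs),
        'y': min(ys),
        'w': max(xs) - min(xs),
        'h': max(ys) - min(ys),
    }


if __name__ == '__main__':
    print(canonical_file_id('OCR-D-IMG', 1))
    box = {'x': 10, 'y': 20, 'w': 30, 'h': 40}
    print(coordinate_string_from_xywh(box))
    print(xywh_from_coordinate_string(coordinate_string_from_xywh(box)))
```

Taking the bounding box of the parsed points means the round trip also works for strings whose points are listed in a different order.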
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -107,3 +107,6 @@ env2/
ocrd.egg-info
/src
spec
.pytest_cache
.~lock*
/profile
13 changes: 13 additions & 0 deletions .pylintrc
@@ -0,0 +1,13 @@
[MASTER]
extension-pkg-whitelist=lxml
ignored-modules=cv2,tesserocr

[MESSAGES CONTROL]
disable =
missing-docstring,
no-self-use,
too-many-arguments,
superfluous-parens,
invalid-name,
line-too-long,
too-few-public-methods,
30 changes: 30 additions & 0 deletions .travis.yml
@@ -0,0 +1,30 @@
language: python
python:
- 2.7
- 3.6
before_install:
- sudo apt-get -qq update
- sudo apt-get install -y autoconf automake libtool
- sudo apt-get install -y libpng12-dev
- sudo apt-get install -y libjpeg62-dev
- sudo apt-get install -y libtiff4-dev
- sudo apt-get install -y zlib1g-dev
- wget http://www.leptonica.org/source/leptonica-1.73.tar.gz -O /tmp/leptonica.tar.gz
- tar -xvf /tmp/leptonica.tar.gz
- pushd leptonica-1.73 && ./configure && make && sudo make install && popd
- wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz -O /tmp/tesseract.tar.gz
- tar -xvf /tmp/tesseract.tar.gz
- cd tesseract-3.04.01
- ./autogen.sh && ./configure
- LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
- sudo make install && sudo ldconfig
- cd ..
- wget https://github.com/tesseract-ocr/tessdata/archive/3.04.00.tar.gz -O /tmp/tessdata.tar.gz
- tar -xvf /tmp/tessdata.tar.gz
- sudo mkdir -p /usr/local/share/tessdata/
- sudo rsync -a tessdata-3.04.00/ /usr/local/share/tessdata
- sudo apt-get install -y libimage-exiftool-perl libxml2-utils
install:
- make deps-pip test-deps-pip
script:
- make test
60 changes: 50 additions & 10 deletions Makefile
@@ -1,14 +1,25 @@
export

SHELL = /bin/bash
PYTHON = python2
PYTHONPATH := .:$(PYTHONPATH)
PIP = pip
LOG_LEVEL = INFO

# BEGIN-EVAL makefile-parser --make-help Makefile

help:
@echo ""
@echo " Targets"
@echo ""
@echo " deps-ubuntu Dependencies for deployment in an ubuntu/debian linux"
@echo " deps-pip Install python deps via pip"
@echo " spec Clone the spec dir for sample files"
@echo " install (Re)install the tool"
@echo " test-run Test the run command"
@echo " deps-ubuntu Dependencies for deployment in an ubuntu/debian linux"
@echo " deps-pip Install python deps via pip"
@echo " spec Clone the spec dir for sample files"
@echo " install (Re)install the tool"
@echo " test-deps-pip Install test python deps via pip"
@echo " test Run all unit tests"
@echo " docs Build documentation"
@echo " docs-clean Clean docs"

# END-EVAL

@@ -20,22 +31,51 @@ deps-ubuntu:
libtesseract-dev \
libleptonica-dev \
libimage-exiftool-perl \
libxml2-utils \
tesseract-ocr-eng \
tesseract-ocr-deu \
tesseract-ocr-deu-frak

# Install python deps via pip
deps-pip:
pip3 install --user -r requirements.txt
$(PIP) install -r requirements.txt

# Clone the spec dir for sample files
spec:
git clone https://github.com/OCR-D/spec
Review comment (Member):

Just a note that the example files in spec need updating: the mets.xml should reflect recent discussions, and ideally we should pick sample images that a) are lightweight and b) already have ground truth in ocr-d.de/daten.

Reply (Member, PR author):

Absolutely, tracking in #41.


# (Re)install the tool
install:
pip3 install --user .
$(PIP) install .

test/assets: spec
mkdir -p test/assets
cp -r spec/io/example test/assets/herold

# Install test python deps via pip
test-deps-pip:
$(PIP) install -r requirements.txt

.PHONY: test
# Run all unit tests
test:
$(PYTHON) -m pytest --log-level=$(LOG_LEVEL) --durations=10 test

.PHONY: docs
# Build documentation
docs:
sphinx-apidoc -f -o docs/api ocrd
cd docs ; $(MAKE) html

# Clean docs
docs-clean:
cd docs ; rm -rf _build api

pyclean:
rm **/*.pyc
rm -rf .pytest_cache

test-profile:
$(PYTHON) -m cProfile -o profile $$(which py.test) test
$(PYTHON) analyze_profile.py

# Test the run command
test-run: spec
run-ocrd spec/io/example/mets.xml
34 changes: 27 additions & 7 deletions README.rst
@@ -10,22 +10,25 @@ To bootstrap the tool, you'll need installed (Ubuntu packages):

* Python (``python``)
* pip (``python-pip``)
* Tesseract (3.04) headers (``libtesseract-dev``)
* Some tesseract (3.04) language models (``tesseract-ocr-{eng,deu,deu-frak,...}``)
* Tesseract headers (``libtesseract-dev``)
* Some tesseract language models (``tesseract-ocr-{eng,deu,deu-frak,...}``)
* Leptonica headers (``libleptonica-dev``)
* exiftool (``libimage-exiftool-perl``)
* libxml2-utils for xmllint (``libxml2-utils``)

To install system-wide:

::

sudo make deps-ubuntu
pip install -r requirements.txt
pip install .

To install to user HOME dir

::

sudo make deps-ubuntu
pip install --user -r requirements.txt
pip install .

@@ -48,21 +51,38 @@ If tesserocr fails to compile with an error:::

This is due to some inconsistencies in the installed tesseract C headers. Replace ``string`` with ``std::string`` in ``$PREFIX/include/tesseract/unicharset.h:265:5:`` and ``$PREFIX/include/tesseract/unichar.h:164:10:`` ff.

If tesserocr fails with an error about ``LSTM``/``CUBE``, you are using th 4.00
headers. Downgrade to 3.04: ``apt install libtesseract-dev=3.04.01-6`` or
whatever ``apt policy libtesseract-dev`` offers. Make sure there are no spurious pkg-config artifacts, e.g. in ``/usr/local/lib/pkgconfig/tesseract.pc``. The same goes for language models
If tesserocr fails with an error about ``LSTM``/``CUBE``, you have a
mismatch between tesseract header/data/pkg-config versions. ``apt policy
libtesseract-dev`` lists the apt-installable versions; keep them consistent.
Make sure there are no spurious pkg-config artifacts, e.g. in
``/usr/local/lib/pkgconfig/tesseract.pc``. The same goes for language models.


Usage
-----

pyocrd installs a binary ``ocrd`` that can be used to invoke the processors
directly (``ocrd process``) or to start (development) web services (``ocrd server``).

Examples:

::

run-ocrd <METS-FILE>
# List available processors
ocrd process

# Region-segment with tesserocr all files in METS INPUT fileGrp
ocrd process -m /path/to/mets.xml segment-region/tesserocr

# Chain multiple processors
ocrd process -m /path/to/mets.xml characterize/exif segment-line/tesserocr recognize/tesserocr

This will run the image characterization, page segmentation and region segmentation.
# Start a processor web service at port 6543
ocrd server process -p 6543
http PUT localhost:6543/characterize url==http://server/path/to/mets.xml

See Also
--------

* `OCR-D Specifications <https://github.com/ocr-d/spec>`_
* `pyocrd wiki <https://github.com/ocr-d/pyocrd/wiki>`_
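
The README's Usage section shows that ``ocrd process`` takes a METS file via ``-m`` followed by one or more processor names. For scripting, the same command line could be assembled from Python. This is a minimal sketch: ``build_process_command`` is a hypothetical helper, not part of pyocrd, and it only builds the argument list without running anything.

```python
import shlex


def build_process_command(mets_path, processors):
    """Build the argv for a chained ``ocrd process`` call.

    Hypothetical helper: it mirrors the README examples and does not
    check that the processor names exist.
    """
    return ['ocrd', 'process', '-m', mets_path] + list(processors)


if __name__ == '__main__':
    argv = build_process_command(
        '/path/to/mets.xml',
        ['characterize/exif', 'segment-line/tesserocr', 'recognize/tesserocr'],
    )
    # Print the shell-quoted command; pass argv to subprocess.run() to execute it.
    print(' '.join(shlex.quote(arg) for arg in argv))
```

Keeping the command as a list avoids shell-quoting problems when the METS path contains spaces.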
5 changes: 5 additions & 0 deletions analyze_profile.py
@@ -0,0 +1,5 @@
import pstats
p = pstats.Stats('profile')
p.strip_dirs()
p.sort_stats('tottime')
p.print_stats(50)
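
``analyze_profile.py`` reads the ``profile`` file written by ``make test-profile``. To try the same ``pstats`` calls without running the test suite first, a self-contained variant can profile a function in-process. This is only a sketch: ``busy_work`` and ``profile_report`` are hypothetical names introduced for the example.

```python
import cProfile
import io
import pstats


def busy_work(n):
    """Toy workload standing in for the test suite (hypothetical)."""
    return sum(i * i for i in range(n))


def profile_report(func, *args, limit=5):
    """Profile func(*args) and return the pstats report as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Same calls as analyze_profile.py, with output captured in a string.
    stats.strip_dirs().sort_stats('tottime').print_stats(limit)
    return stream.getvalue()


if __name__ == '__main__':
    print(profile_report(busy_work, 100000))
```

Sorting by ``tottime`` ranks functions by time spent in their own body, excluding callees, which is what the ``test-profile`` target is meant to surface.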
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SPHINXPROJ = pyocrd
SOURCEDIR = .
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
7 changes: 7 additions & 0 deletions docs/api/modules.rst
@@ -0,0 +1,7 @@
ocrd
====

.. toctree::
:maxdepth: 4

ocrd
30 changes: 30 additions & 0 deletions docs/api/ocrd.cli.rst
@@ -0,0 +1,30 @@
ocrd.cli package
================

Submodules
----------

ocrd.cli.merge\_ocr\_txt module
-------------------------------

.. automodule:: ocrd.cli.merge_ocr_txt
:members:
:undoc-members:
:show-inheritance:

ocrd.cli.run module
-------------------

.. automodule:: ocrd.cli.run
:members:
:undoc-members:
:show-inheritance:


Module contents
---------------

.. automodule:: ocrd.cli
:members:
:undoc-members:
:show-inheritance:
46 changes: 46 additions & 0 deletions docs/api/ocrd.model.rst
@@ -0,0 +1,46 @@
ocrd.model package
==================

Submodules
----------

ocrd.model.ocrd\_file module
----------------------------

.. automodule:: ocrd.model.ocrd_file
:members:
:undoc-members:
:show-inheritance:

ocrd.model.ocrd\_mets module
----------------------------

.. automodule:: ocrd.model.ocrd_mets
:members:
:undoc-members:
:show-inheritance:

ocrd.model.ocrd\_page module
----------------------------

.. automodule:: ocrd.model.ocrd_page
:members:
:undoc-members:
:show-inheritance:

ocrd.model.ocrd\_xml\_base module
---------------------------------

.. automodule:: ocrd.model.ocrd_xml_base
:members:
:undoc-members:
:show-inheritance:


Module contents
---------------

.. automodule:: ocrd.model
:members:
:undoc-members:
:show-inheritance:
22 changes: 22 additions & 0 deletions docs/api/ocrd.processor.characterize.rst
@@ -0,0 +1,22 @@
ocrd.processor.characterize package
===================================

Submodules
----------

ocrd.processor.characterize.exif module
---------------------------------------

.. automodule:: ocrd.processor.characterize.exif
:members:
:undoc-members:
:show-inheritance:


Module contents
---------------

.. automodule:: ocrd.processor.characterize
:members:
:undoc-members:
:show-inheritance:
31 changes: 31 additions & 0 deletions docs/api/ocrd.processor.rst
@@ -0,0 +1,31 @@
ocrd.processor package
======================

Subpackages
-----------

.. toctree::

ocrd.processor.characterize
ocrd.processor.segment_line
ocrd.processor.segment_region

Submodules
----------

ocrd.processor.base module
--------------------------

.. automodule:: ocrd.processor.base
:members:
:undoc-members:
:show-inheritance:


Module contents
---------------

.. automodule:: ocrd.processor
:members:
:undoc-members:
:show-inheritance: