New processor API #1240

Open
wants to merge 245 commits into base: master
Changes from 14 commits
245 commits
833dac7
deprecate Processor.process()
bertsky Jun 10, 2024
3f4c7f9
fix #274: no default -I / -O
bertsky Jun 10, 2024
d2b5df3
workspace.download: fix typo in exception
bertsky Jun 18, 2024
9827c4d
Processor: factor-out show_resource(), delegate to resolve_resource()
bertsky Jun 24, 2024
38fd4aa
Processor: add setup(), run once in get_processor()
bertsky Jun 24, 2024
580988a
ocrd_cli_wrap_processor: fix workspace arg (not a kwarg)
bertsky Jun 24, 2024
224dfc5
Processor: refactor processing API…
bertsky Jun 24, 2024
9714aab
DummyProcessor: re-implement via new process_page_*
bertsky Jun 24, 2024
e5d4736
run_processor: adapt to process→process_workspace
bertsky Jun 24, 2024
809a01b
test DummyProcessor: adapt to new `download` default by setting `down…
bertsky Jun 24, 2024
dfe7f8e
test DummyProcessor: override process_workspace() by delegating to pr…
bertsky Jun 24, 2024
1550668
test builtin ocrd-dummy: adapt to consistent filename
bertsky Jun 24, 2024
75809b1
test processor: adapt to `input_file_grp` required
bertsky Jun 24, 2024
c429da5
test processor: adapt to `self.workspace` only during run_processor
bertsky Jun 24, 2024
295cdb6
Workspace.save_image_file: add kwarg file_path for predetermined loca…
bertsky Jun 26, 2024
e2cbcb9
Processor.process_page_pcgts: add kwargs and allow returning derived …
bertsky Jun 26, 2024
20a6a1c
Workspace.save_image_file: save DPI metadata, too
bertsky Jun 26, 2024
679ad85
Workspace.image_from_*: annotate 'DPI' in result dict and ensure it's…
bertsky Jun 26, 2024
565a3d9
test_workspace: adapt to image_from_* DPI and add assertions
bertsky Jun 26, 2024
46f81aa
autoload ocrd-tool.json and version from dist, executable name from e…
bertsky Jul 6, 2024
4dd83aa
adapt to new Processor init (override metadata/version/executable name)
bertsky Jul 6, 2024
4cafbcc
tests: adapt to new Processor init (override metadata/version/executa…
bertsky Jul 6, 2024
9c9a4c9
generate_processor_help: include process_workspace docstring, too
bertsky Jul 29, 2024
aa0bd68
get_processor: also run setup if instance_caching
bertsky Aug 8, 2024
99d1628
ocrd-tool CLI: pass class in context
bertsky Aug 12, 2024
12231b8
use more specific exception if parameters are invalid
bertsky Aug 12, 2024
d112f8f
run_processor w/ mem_usage: pass as args tuple
bertsky Aug 12, 2024
319ceaa
Processor.process_workspace: add fileGrp assertions
bertsky Aug 12, 2024
80590a9
process_page_pcgts: add (variadic) type checks
bertsky Aug 12, 2024
68ae8ff
run_processor: fix typo
bertsky Aug 12, 2024
2a18883
Processor init: deprecate passing workspace
bertsky Aug 12, 2024
b9338b4
docs: fix relative VERSION path
bertsky Aug 12, 2024
6ca6a40
docs: do/not exclude tests/src
bertsky Aug 12, 2024
bc9ec05
docs: add ocrd_network module
bertsky Aug 12, 2024
54f1d88
docs:regenerated rST
bertsky Aug 12, 2024
67633f5
test_mets_server: fix arg vs kwarg
bertsky Aug 13, 2024
751a1fe
mets_server: ClientSideOcrdMets needs OcrdMets-like kwargs (without d…
bertsky Aug 13, 2024
86d9569
Processor/CLI decorator: :fire: separate kwargs and constructor…
bertsky Aug 13, 2024
1f6f0c8
Processor / ocrd-tool.json: :fire: fileGrp cardinality checks…
bertsky Aug 13, 2024
9b417d6
test_processor: adapt to Processor init changes
bertsky Aug 13, 2024
fbe83c9
adapt to ocrd-tool.json cardinality changes
bertsky Aug 13, 2024
09dd54b
use up-to-date kwargs (avoiding old deprecations)
bertsky Aug 13, 2024
af880e4
hide/test expected deprecation warnings
bertsky Aug 13, 2024
e381a0f
improve output in case of assertion failures
bertsky Aug 13, 2024
874b506
Set VERSION to upcoming 3.0.0a1
kba Aug 14, 2024
5ffe3cb
CircleCI: use version 2.1
bertsky Aug 14, 2024
dd3046e
Merge branch 'master' into new-processor-api
kba Aug 14, 2024
93a742e
test_bashlib: use version verbatim
bertsky Aug 14, 2024
5117684
.
kba Aug 14, 2024
456cc6d
fix make spec
kba Aug 14, 2024
e03a906
Merge branch 'new-processor-api' of https://github.com/bertsky/core i…
kba Aug 14, 2024
7a9fc27
adapt lib.bash to handle prerelease suffixes like a1, b2, rc3
kba Aug 14, 2024
90afb8a
process_page_pcgts must return OcrdProcessResult
kba Aug 14, 2024
70ad191
bashlib ocrd__minversion: compare prerelease suffix alphabetically
kba Aug 15, 2024
678158a
Merge pull request #7 from OCR-D/bashlib-version-yak-shaving
bertsky Aug 15, 2024
228272b
fix ocrd_tool.schema.yml cardinality oneOf syntax, update spec
bertsky Aug 15, 2024
5aba83b
bashlib: fix ocrd__minversion test syntax
bertsky Aug 15, 2024
3d094d6
reimplement OcrdPageResult
kba Aug 15, 2024
f8b6896
update spec (with new ocrd_tool.schema)
bertsky Aug 15, 2024
72eb75b
update spec to v3.25.0, ocrd_tool.schema.yml
kba Aug 15, 2024
75cb20c
process_page_file: fix handling of images
kba Aug 15, 2024
9a1c7ad
process_page_pcgts: remove output_file_id, replace OcrdPageResult.fil…
kba Aug 15, 2024
60ad424
OcrdPageResultImage requires passing alternative_image w/o filename set
kba Aug 15, 2024
50dfdd6
Processor.verify: handle -1 case
kba Aug 15, 2024
53f2634
processor.base: remove obsolete FIXME
kba Aug 15, 2024
d210afa
Processor.process_page_pcgts: update docstring for file_path/alternat…
kba Aug 15, 2024
5718cf9
export OcrdPageResult{Image} from ocrd.processor
kba Aug 15, 2024
f5f3145
Processor.process.page_pcgts: simplify references in docstring
bertsky Aug 15, 2024
db68bb5
Merge branch 'processor-result-object' of https://github.com/OCR-D/co…
kba Aug 15, 2024
7045318
allow "from ocrd_models import OcrdPage
kba Aug 15, 2024
a9dba73
Merge branch 'processor-result-object' into new-processor-api
kba Aug 15, 2024
3220e3f
:memo: v3.0.0a1
kba Aug 15, 2024
e1f5744
Update CHANGELOG.md
kba Aug 16, 2024
80d42f1
ocrd: more convenience imports
bertsky Aug 16, 2024
0e57b4b
ocrd.cli: more fix module import order, export help cmd
bertsky Aug 16, 2024
9cfd70c
fix imports
bertsky Aug 16, 2024
95212b5
fix type assertion
bertsky Aug 16, 2024
4aa288a
ocrd_utils: forgot to export scale_coordinates at toplvl
bertsky Aug 16, 2024
8044e60
fix 9cfd70cffcc
bertsky Aug 16, 2024
21ff810
fix 9cfd70cffcc (revert to wrong import order to avoid circle)
bertsky Aug 16, 2024
4077e8d
s,PcGtsType,OcrdPage,
kba Aug 16, 2024
cd4c96c
add config.OCRD_DOWNLOAD_INPUT
kba Aug 19, 2024
3125255
define self.logger in processor base constructor
kba Aug 19, 2024
0adb9fb
Merge branch 'master' into new-processor-api
kba Aug 19, 2024
dcf7c52
OcrdPage proxy object for PcGtsType, including etree and mappings
kba Aug 19, 2024
cf45d8b
Processor.base: have a (hopefully) thread-safe logger for the base class
kba Aug 19, 2024
785d607
Processor.zip_input_files: warning instead of exception for missing i…
bertsky Aug 20, 2024
b12849d
Processor.zip_input_files: introduce NonUniqueInputFile exception
bertsky Aug 20, 2024
95d3658
Processor.process_workspace: zip_input_files w/o require_first
bertsky Aug 20, 2024
2e2bda6
Merge remote-tracking branch 'origin/download-files-config-var' into …
bertsky Aug 20, 2024
c729841
Processor.zip_input_files: introduce MissingInputFile exception and c…
bertsky Aug 20, 2024
7df81af
OcrdPage: clearer docstring
kba Aug 20, 2024
0ab6942
jsonschema: switch from draft6 to draft2019-09
kba Aug 20, 2024
66c50b3
require jsonschema>4 for draft 2019-09
kba Aug 20, 2024
94e2e60
OcrdToolValidator: set defaults, handle deprecated
kba Aug 20, 2024
2e7bdc2
processor.base: validate/setdefault ocrd-tool.json on first access
kba Aug 20, 2024
346f166
update spec and ocrd_tool.schema.yml
kba Aug 20, 2024
577baa5
processor parameter decorator: no '{}' default (unnecessary)
bertsky Aug 20, 2024
f00ecda
Processor: add error handling…
bertsky Aug 20, 2024
fdd5d16
ocrd_utils.config: add variables to module docstring
bertsky Aug 21, 2024
6d87f9e
improve docstrings, re-generate docs
bertsky Aug 21, 2024
9942bbe
Processor.zip_input_files: more verbose log msg
bertsky Aug 21, 2024
8a584e9
test_processor: test for specific exception
bertsky Aug 21, 2024
8077d45
test_processor: fix missing import
bertsky Aug 21, 2024
6b68f7a
Merge pull request #12 from bertsky/new-processor-api-input-file-errors
bertsky Aug 21, 2024
1b4cd3c
Merge branch 'new-processor-api' into processor-logger
bertsky Aug 21, 2024
7f3bfa2
Merge pull request #10 from OCR-D/processor-logger
bertsky Aug 21, 2024
d4d40e3
Merge pull request #11 from OCR-D/ocrd-page-with-etree
bertsky Aug 21, 2024
111a52e
Merge pull request #13 from OCR-D/validate-ocrd-tool-runtime
bertsky Aug 21, 2024
cf7b193
OcrdPage: fix typeing typo
bertsky Aug 21, 2024
9af8670
dummy_processor: fix typos from logging
bertsky Aug 21, 2024
c6d9736
tests report.is_valid: improve output on failure
bertsky Aug 21, 2024
161cf0c
JsonValidator: fix deprecation warning (by actually checking instance)
bertsky Aug 21, 2024
b2e6485
predefine union types OcrdFileType and OcrdPageType
bertsky Aug 21, 2024
822d731
processor CLI --debug: set all to ABORT (not just MISSING_OUTPUT)
bertsky Aug 21, 2024
3a7a771
:memo: changelog
bertsky Aug 21, 2024
2bdb6c4
:package: v3.0.0a2
kba Aug 22, 2024
00bd6fe
remove make *-workaround, we will not do that for v3+
kba Aug 22, 2024
d777527
Processor.parameter: only validate when set…
bertsky Aug 22, 2024
7998aae
get_processor: ensure passing non-empty parameter, rely on `_setup` t…
bertsky Aug 22, 2024
cc8592b
test_processor: adapt, check required parameters
bertsky Aug 22, 2024
45e556d
improve _setup docstring
bertsky Aug 22, 2024
d4c802b
Processor._setup: raise with full ParameterValidator report
bertsky Aug 22, 2024
b28fefb
get_processor: parameter only as kwarg
bertsky Aug 22, 2024
642938b
tests: adapt for get_processor parameter only as kwarg
bertsky Aug 22, 2024
f5e5c54
Processor.parameter: make the bound dict read-only
bertsky Aug 22, 2024
f2d53a6
Processor.parameter: move ParameterValidator back to setter, convert …
bertsky Aug 22, 2024
7297ca2
Processor.parameter: frozendict instead of mappingproxy, add test
bertsky Aug 22, 2024
6cd4a34
introduce Processor.shutdown to be overridden (called at deinit or pa…
bertsky Aug 22, 2024
407bff8
Processor: introduce `max_instances` class attribute
bertsky Aug 23, 2024
c9fbb2c
get_cached_processor: set lru_cache maxsize from min(cfg,class) at ru…
bertsky Aug 23, 2024
9c212a9
test get_processor instance_caching w/ max_instances
bertsky Aug 23, 2024
a413f04
test get_processor instance_caching w/ clear_cache
bertsky Aug 23, 2024
870523c
:package: v3.0.0a2
kba Aug 22, 2024
20bb6d1
remove make *-workaround, we will not do that for v3+
kba Aug 22, 2024
faa59a8
Processor.metadata_location property to specify where in the package …
kba Aug 23, 2024
5819c81
Processor.verify: always check cardinality (as we now have the defaul…
bertsky Aug 23, 2024
4f88f1d
fix --log-filename (6fc606027a): apply in ocrd_cli_wrap_processor
bertsky Aug 24, 2024
d621f36
fix exception
bertsky Aug 24, 2024
4868fb1
adapt to PIL.Image moved constants
bertsky Aug 24, 2024
da72c0a
ocrd_utils: add parse_json_file_with_comments
bertsky Aug 24, 2024
ca78b94
cli.workspace: pass fileGrp as well, improve description
bertsky Aug 24, 2024
cf41745
OcrdMets.add_agent: does not have positional args
bertsky Aug 24, 2024
cadc6e6
remove misplaced kwargs from run_processor
bertsky Aug 24, 2024
7966057
Processor.metadata: refactor…
bertsky Aug 24, 2024
bba142d
bashlib input-files: adapt, allow passing ocrd-tool.json path and exe…
bertsky Aug 24, 2024
32cdc5a
add to pylint karma
bertsky Aug 24, 2024
a95f269
update pylintrc
bertsky Aug 24, 2024
50c088e
processor.metadata_location: use self.__module__ not __package__
kba Aug 24, 2024
ad8c76e
Merge pull request #17 from OCR-D/new-processor-api-parameter-setup
bertsky Aug 24, 2024
8211237
pylint: try ignoring generateds (again)
bertsky Aug 25, 2024
b53724e
Merge pull request #14 from bertsky/new-processor-api-parameter-setup
bertsky Aug 25, 2024
3e2700c
:memo: update changelog
bertsky Aug 25, 2024
342df58
test_bashlib: allow testing prereleases successfully
bertsky Aug 25, 2024
11ed8c5
Processor.process_page_file / OcrdPageResultImage: allow PageType ins…
bertsky Aug 25, 2024
69571fe
Merge branch 'master' into new-processor-api
kba Aug 26, 2024
77e31f2
:package: v3.0.0b1
kba Aug 26, 2024
d3ee57c
:fire: bad no good terrible hack to fix integration_test
kba Aug 26, 2024
0245f4b
generate_processor_help: avoid repeating docstrings from superclass
bertsky Aug 27, 2024
efe4201
Processor.process_workspace: abort anyway if too many failures (OCRD_…
bertsky Aug 27, 2024
fce7627
adapt tests for OCRD_MAX_MISSING_OUTPUTS
bertsky Aug 27, 2024
a50d0bb
Merge pull request #19 from OCR-D/new-processor-api-fix-editable
bertsky Aug 27, 2024
c08166e
Processor: add per-page timeouts and parallelism…
bertsky Aug 27, 2024
c3a8380
add tests for processor per-page timeout and parallelism
bertsky Aug 27, 2024
b1b7a49
:memo: update changelog
bertsky Aug 27, 2024
9b80ae1
ClientSideOcrdMets: use same logger name prefix as server
bertsky Aug 28, 2024
be6b59d
Processor: fix ignore (negative/zero) cases for max_workers / max_pag…
bertsky Aug 28, 2024
0b5286f
test_mets_server: use tmpdir to avoid side effects between suites
bertsky Aug 28, 2024
61e1042
test processor timeout/parallel: avoid side effects to dummy tool json
bertsky Aug 28, 2024
e395b56
tess: adapt to wording of exceptions
bertsky Aug 28, 2024
a59ba6a
ClientSideOcrdMets: partial revert of 9b80ae17ef
bertsky Aug 28, 2024
554a67d
disableLogging: re-instate root logger, to
bertsky Aug 28, 2024
1114cd9
test-logging: also remove ocrd.log from tempdir
bertsky Aug 28, 2024
ce6d239
Processor: fix 7966057f (deprecated passing of ocrd_tool or version v…
bertsky Aug 28, 2024
df99160
Processor.generate_processor_help: forgot to include --log-filename
bertsky Aug 28, 2024
eb74fab
bashlib: re-add --log-filename, implement as stderr redirect
bertsky Aug 28, 2024
8565a8f
test_processor: add legacy (v2-style) dummy case
bertsky Aug 28, 2024
abe069a
:memo: update changelog
bertsky Aug 28, 2024
11f9264
:memo: update readmes (esp. new config variables)
bertsky Aug 28, 2024
ca88122
:package: v3.0.0b2
kba Aug 30, 2024
837aba7
ocrd_utils.config: add reset_defaults()
bertsky Aug 29, 2024
85e96ff
add test for OcrdEnvConfig.reset_defaults()
bertsky Aug 29, 2024
8911c3b
Processor: improve processing log messages
bertsky Aug 30, 2024
98d97fc
ocrd.cli doc: don't rewrap description lists
bertsky Aug 30, 2024
cb758e8
:package: v3.0.0b3
kba Aug 30, 2024
1ed38a6
Processor.metadata_location: find location package prefix (necessary …
bertsky Aug 30, 2024
7d98c27
Processor: log when max_workers / max_page_seconds are in effect
bertsky Sep 1, 2024
6b23b65
Workspace.reload_mets: fix for METS server case
bertsky Sep 1, 2024
cac05cd
:memo: changelog
kba Sep 2, 2024
0b0d419
:package: v3.0.0b4
kba Sep 2, 2024
a34beb8
OcrdMetsServer.add_file: pass on 'force' kwarg, too
bertsky Sep 2, 2024
dfa715d
test_mets_server: add test for force (overwrite)
bertsky Sep 2, 2024
9a8c41d
test_processor: add test for force (overwrite) w/ METS Server
bertsky Sep 2, 2024
65ab63c
add typing, extend docs
kba Aug 26, 2024
73a395e
Processor.verify: revert 5819c816 (we still have no defaults in json …
bertsky Sep 5, 2024
3382ad9
Processor.process_page_file / OcrdPageResultImage: allow None instead…
bertsky Sep 5, 2024
cad4777
PcGts.Page.id / make_xml_id: replace '/' with '_'
bertsky Sep 13, 2024
10b2abc
ocrd.cli.ocrd-tool resolve-resource: fix (forgot to print result)
bertsky Sep 12, 2024
bd64444
processor CLI: delegate --resolve-resource, too
bertsky Sep 13, 2024
71e9841
METS Server: also export+delegate physical_pages
bertsky Sep 15, 2024
01ccdf1
ocrd.cli.workspace: consistently pass on --mets-server-url and --back…
bertsky Sep 13, 2024
3301f9c
ocrd.cli.workspace server: add 'reload' and 'save'
bertsky Sep 13, 2024
dc2c758
ocrd.cli.bashlib input-files: pass on --mets-server-url, too
bertsky Sep 12, 2024
42af6a3
ocrd.cli.validate tasks: pass on --mets-server-url, too
bertsky Sep 12, 2024
7ea8d57
Processor.process_workspace(): do not show NotImplementedError contex…
bertsky Sep 12, 2024
9751256
Processor.verify: check output fileGrps as well (or OCRD_EXISTING_OUT…
bertsky Sep 12, 2024
f66753a
run_processor: be robust if ocrd_tool is missing steps
bertsky Sep 12, 2024
eb12a80
lib.bash: fix errexit
bertsky Sep 12, 2024
3355ea4
lib.bash input-files: pass on --mets-server-url, --overwrite, and par…
bertsky Sep 12, 2024
f05f840
lib.bash input-files: do not try to validate tasks here (impossible t…
bertsky Sep 12, 2024
b5c1191
Processor / Workspace.add_file: always force if config.OCRD_EXISTING_…
bertsky Sep 12, 2024
cbe465a
test processors: no need for 'force' kwarg anymore
bertsky Sep 13, 2024
3e214ca
tests: make sure ocrd_utils.config gets reset whenever changing it gl…
bertsky Sep 13, 2024
c549c42
OcrdPage: add PageType.get_ReadingOrderGroups()
bertsky Sep 7, 2024
53b880f
update OcrdPage from generateds
bertsky Sep 7, 2024
687b06f
:package: v3.0.0b5
kba Sep 16, 2024
a43098e
:memo: improve b5 changelog
bertsky Sep 16, 2024
d2cb0fb
ocrd.cli.workspace: assert non-server in cmds mutating METS
bertsky Sep 16, 2024
f678dca
OcrdMets.get_physical_pages: cover return_divs w/o for_fileIds for_pa…
bertsky Sep 27, 2024
9064db0
ocrd.cli.workspace: use physical_pages if possible, fix default outpu…
bertsky Sep 27, 2024
9530fcd
Processor.process_page_file: avoid process_page_pcgts() if OCRD_EXIST…
bertsky Sep 27, 2024
31a8474
ocrd_utils.initLogging: also add handler to root logger (to be consis…
bertsky Oct 9, 2024
d7049b1
CLI decorator: only import ocrd_network when needed
bertsky Oct 10, 2024
a9d49c1
Processor w/ OCRD_MAX_PARALLEL_PAGES: ThreadPoolExecutor→ProcessPoolE…
bertsky Oct 10, 2024
588c91d
Processor.process_workspace: apply timeout on process_page_file worke…
bertsky Oct 17, 2024
d126bdc
Processor w/ OCRD_MAX_PARALLEL_PAGES: concurrent.futures→loky
bertsky Oct 17, 2024
afa7f30
Processor w/o OCRD_MAX_PARALLEL_PAGES: dummy instead of executor
bertsky Oct 19, 2024
5821701
ocrd.process.profile logger: account for subprocess CPU time, too
bertsky Oct 19, 2024
53b1854
Processor.process_workspace: improve reporting, raise early if too ma…
bertsky Oct 21, 2024
4d66e37
Processor: refactor process_workspace into overridable subfuncs
bertsky Oct 23, 2024
71d6d49
Processor.process_workspace_handle_page_task: do not handler sigint
bertsky Oct 30, 2024
d2d5290
Processor.process_workspace_handle_tasks: log nr of ignored exception…
bertsky Oct 30, 2024
7932a6a
Merge pull request #23 from bertsky/new-processor-api-process-worker
bertsky Oct 30, 2024
7d1503e
:package: v3.0.0b6
bertsky Oct 30, 2024
08a631c
tests: prevent side effects from ocrd_logging
bertsky Nov 7, 2024
f3e423a
initLogging: do not remove any previous handlers/levels
bertsky Nov 7, 2024
3143518
initLogging: only add root handler instead of multiple redundant hand…
bertsky Nov 7, 2024
27323c6
disableLogging: remove all handlers, reset all levels
bertsky Nov 7, 2024
eb3120d
setOverrideLogLevel: override all currently active loggers' level
bertsky Nov 7, 2024
0186c53
logging: increase default root (not ocrd) level from INFO to WARNING
bertsky Nov 7, 2024
5ba2720
Processor: update max_workers docstring
bertsky Nov 7, 2024
f8f71d8
initLogging: call disableLogging if already initialized and force_reinit
bertsky Nov 11, 2024
5f2f602
Processor: replace weakref with __del__ to trigger shutdown
bertsky Nov 11, 2024
0446b82
Processor parallel pages: log via QueueHandler in subprocess, QueueLi…
bertsky Nov 11, 2024
53c4c18
:package: v3.0.0b7
bertsky Nov 12, 2024
2 changes: 1 addition & 1 deletion src/ocrd/__init__.py
@@ -14,7 +14,7 @@

"""

from ocrd.processor.base import run_processor, run_cli, Processor
from ocrd.processor.base import run_processor, run_cli, Processor, ResourceNotFoundError
from ocrd_models import OcrdMets, OcrdExif, OcrdFile, OcrdAgent
from ocrd.resolver import Resolver
from ocrd_validators import *
6 changes: 4 additions & 2 deletions src/ocrd/decorators/__init__.py
@@ -48,11 +48,11 @@ def ocrd_cli_wrap_processor(
**kwargs
):
if not sys.argv[1:]:
processorClass(workspace=None, show_help=True)
processorClass(None, show_help=True)
sys.exit(1)
if dump_json or dump_module_dir or help or version or show_resource or list_resources:
processorClass(
workspace=None,
None,
dump_json=dump_json,
dump_module_dir=dump_module_dir,
show_help=help,
@@ -71,6 +71,8 @@ def ocrd_cli_wrap_processor(
initLogging()

LOG = getLogger('ocrd.cli_wrap_processor')
assert kwargs['input_file_grp'] is not None
assert kwargs['output_file_grp'] is not None
# LOG.info('kwargs=%s' % kwargs)
if 'parameter' in kwargs:
# Disambiguate parameter file/literal, and resolve file
7 changes: 2 additions & 5 deletions src/ocrd/decorators/ocrd_cli_options.py
@@ -29,11 +29,8 @@ def cli(mets_url):
option('-m', '--mets', help="METS to process", default=DEFAULT_METS_BASENAME),
option('-w', '--working-dir', help="Working Directory"),
option('-U', '--mets-server-url', help="METS server URL. Starts with http:// then TCP, otherwise unix socket path"),
# TODO OCR-D/core#274
# option('-I', '--input-file-grp', required=True),
# option('-O', '--output-file-grp', required=True),
option('-I', '--input-file-grp', default='INPUT'),
option('-O', '--output-file-grp', default='OUTPUT'),
option('-I', '--input-file-grp', default=None),
option('-O', '--output-file-grp', default=None),
option('-g', '--page-id'),
option('--overwrite', is_flag=True, default=False),
option('--profile', is_flag=True, default=False),
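With the `'INPUT'` / `'OUTPUT'` defaults gone, both options arrive as `None` unless the caller passes them, which is what the two new assertions in `ocrd_cli_wrap_processor` rely on. A minimal sketch of the same check with a descriptive error instead of a bare `assert`; the helper name `require_file_grps` is hypothetical and not part of this PR:

```python
def require_file_grps(kwargs):
    """Return (input_file_grp, output_file_grp), or raise if either is unset.

    Hypothetical helper: mirrors the two assertions on kwargs, but names
    the missing option(s) in the error message.
    """
    missing = [name for name in ('input_file_grp', 'output_file_grp')
               if kwargs.get(name) is None]
    if missing:
        raise ValueError("missing required option(s): " + ", ".join(missing))
    return kwargs['input_file_grp'], kwargs['output_file_grp']
```

Unlike `assert`, a raised `ValueError` is not stripped when Python runs with `-O`.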
181 changes: 151 additions & 30 deletions src/ocrd/processor/base.py
@@ -15,11 +15,13 @@
import os
from os import getcwd
from pathlib import Path
from typing import Optional
import sys
import tarfile
import io
from ocrd.workspace import Workspace
from deprecated import deprecated

from ocrd.workspace import Workspace
from ocrd_utils import (
    VERSION as OCRD_VERSION,
    MIMETYPE_PAGE,
@@ -31,9 +33,11 @@
    list_all_resources,
    get_processor_resource_types,
    resource_filename,
    make_file_id,
)
from ocrd_validators import ParameterValidator
from ocrd_models.ocrd_page import MetadataItemType, LabelType, LabelsType
from ocrd_models.ocrd_page import MetadataItemType, LabelType, LabelsType, OcrdPage, to_xml
from ocrd_modelfactory import page_from_file

# XXX imports must remain for backwards-compatibility
from .helpers import run_cli, run_processor, generate_processor_help # pylint: disable=unused-import
@@ -63,15 +67,15 @@ class Processor():

    def __init__(
            self,
            workspace : Workspace,
            # FIXME: deprecate in favor of process_workspace(workspace)
            workspace : Optional[Workspace],
            ocrd_tool=None,
            parameter=None,
            # TODO OCR-D/core#274
            # input_file_grp=None,
            # output_file_grp=None,
            input_file_grp="INPUT",
            output_file_grp="OUTPUT",
            input_file_grp=None,
            output_file_grp=None,
            page_id=None,
            download_files=True,
            # FIXME: deprecate all the following in favor of respective methods
            resolve_resource=None,
            show_resource=None,
            list_resources=False,
@@ -101,6 +105,7 @@ def __init__(
             output_file_grp (string): comma-separated list of METS ``fileGrp``s used for output.
             page_id (string): comma-separated list of METS physical ``page`` IDs to process \
                 (or empty for all pages).
             download_files (boolean): Whether input files will be downloaded prior to processing.
             resolve_resource (string): If not ``None``, then instead of processing, resolve \
                 given resource by name and print its full path to stdout.
             show_resource (string): If not ``None``, then instead of processing, resolve \
@@ -131,27 +136,22 @@ def __init__(
            for res in self.list_all_resources():
                print(res)
            return
        if resolve_resource or show_resource:
            initLogging()
        if resolve_resource:
            try:
                res = self.resolve_resource(resolve_resource)
                print(res)
            except ResourceNotFoundError as e:
                log = getLogger('ocrd.processor.base')
                log.critical(e.message)
                sys.exit(1)
            return
        if show_resource:
            try:
                res_fname = self.resolve_resource(resolve_resource or show_resource)
                self.show_resource(show_resource)
            except ResourceNotFoundError as e:
                log = getLogger('ocrd.processor.base')
                log.critical(e.message)
                sys.exit(1)
            if resolve_resource:
                print(res_fname)
                return
            fpath = Path(res_fname)
            if fpath.is_dir():
                with pushd_popd(fpath):
                    fileobj = io.BytesIO()
                    with tarfile.open(fileobj=fileobj, mode='w:gz') as tarball:
                        tarball.add('.')
                    fileobj.seek(0)
                    copyfileobj(fileobj, sys.stdout.buffer)
            else:
                sys.stdout.buffer.write(fpath.read_bytes())
            return
        if show_help:
            self.show_help(subcommand=subcommand)
@@ -161,20 +161,25 @@ def __init__(
            self.show_version()
            return
        self.workspace = workspace
        # FIXME HACK would be better to use pushd_popd(self.workspace.directory)
        # but there is no way to do that in process here since it's an
        # overridden method. chdir is almost always an anti-pattern.
        if self.workspace:
            # FIXME deprecate setting this and calling process() over using process_workspace()
            # which uses pushd_popd(self.workspace.directory)
            # (because there is no way to do that in process() since it's an
            # overridden method. chdir is almost always an anti-pattern.)
            self.old_pwd = getcwd()
            os.chdir(self.workspace.directory)
        self.input_file_grp = input_file_grp
        self.output_file_grp = output_file_grp
        self.page_id = None if page_id == [] or page_id is None else page_id
        self.download = download_files
        parameterValidator = ParameterValidator(ocrd_tool)
        report = parameterValidator.validate(parameter)
        if not report.is_valid:
            raise Exception("Invalid parameters %s" % report.errors)
        self.parameter = parameter
        # workaround for deprecated#72 (deprecation does not work for subclasses):
        setattr(self, 'process',
                deprecated(version='3.0', reason='process() should be replaced with process_page() and process_workspace()')(getattr(self, 'process')))

    def show_help(self, subcommand=None):
        print(generate_processor_help(self.ocrd_tool, processor_instance=self, subcommand=subcommand))
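The `setattr` workaround at the end of `__init__` wraps the bound `process` of each instance, so an override in a subclass is still reported as deprecated. A minimal sketch of that idea, assuming the standard-library `warnings` module in place of the `deprecated` package used in the diff; `deprecate_bound`, `Base` and `Legacy` are hypothetical names:

```python
import functools
import warnings


def deprecate_bound(method, reason):
    """Wrap an already-bound method so that each call emits a DeprecationWarning."""
    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        warnings.warn(reason, DeprecationWarning, stacklevel=2)
        return method(*args, **kwargs)
    return wrapper


class Base:
    def __init__(self):
        # Look up process() on the instance, so a subclass override is the
        # one that gets wrapped, then shadow it with the wrapper.
        self.process = deprecate_bound(
            self.process, 'process() is deprecated in this sketch')

    def process(self):
        raise NotImplementedError()


class Legacy(Base):
    # Overrides process() without any decorator of its own.
    def process(self):
        return 'legacy result'
```

Decorating only `Base.process` would not cover `Legacy.process`, because the override replaces the decorated function in the class namespace; wrapping per instance sidesteps that.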
@@ -188,20 +193,122 @@ def verify(self):
        """
        return True

    def setup(self) -> None:
        """
        Prepare the processor for actual data processing,
        prior to changing to the workspace directory but
        after parsing parameters.

        (Override this to load models into memory etc.)
        """
        pass
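A minimal sketch of how a subclass might use `setup()`, assuming stand-in classes rather than the real `Processor` and `get_processor`; `ProcessorSketch`, `Binarizer`, `binarize` and `get_processor_sketch` are hypothetical names:

```python
class ProcessorSketch:
    """Hypothetical stand-in for Processor: stores parameters, offers setup()."""

    def __init__(self, parameter=None):
        self.parameter = parameter or {}

    def setup(self):
        """Override to do expensive one-off initialisation."""


class Binarizer(ProcessorSketch):
    def setup(self):
        # One-off work, e.g. loading a model; here just parsing a threshold.
        self.threshold = float(self.parameter.get('threshold', 0.5))
        self.setup_calls = getattr(self, 'setup_calls', 0) + 1

    def binarize(self, pixels):
        # Per-page work reuses what setup() prepared.
        return [1 if p >= self.threshold else 0 for p in pixels]


def get_processor_sketch(cls, parameter=None):
    """Instantiate, then run setup() exactly once before any processing."""
    processor = cls(parameter=parameter)
    processor.setup()
    return processor
```

Keeping the expensive part out of `__init__` means a help or version request can construct the class without paying for model loading.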

    @deprecated(version='3.0', reason='process() should be replaced with process_page() and process_workspace()')
    def process(self) -> None:
        """
        Process the :py:attr:`workspace`
        Process all files of the :py:attr:`workspace`
        from the given :py:attr:`input_file_grp`
        to the given :py:attr:`output_file_grp`
        for the given :py:attr:`page_id`
        for the given :py:attr:`page_id` (or all pages)
        under the given :py:attr:`parameter`.

        (This contains the main functionality and needs to be overridden by subclasses.)
        """
        raise NotImplementedError()

    def process_workspace(self, workspace: Workspace) -> None:
        """
        Process all files of the given ``workspace``,
        from the given :py:attr:`input_file_grp`
        to the given :py:attr:`output_file_grp`
        for the given :py:attr:`page_id` (or all pages)
        under the given :py:attr:`parameter`.

        (This will iterate over pages and files, calling
        :py:meth:`process_page`, handling exceptions.)
        """
        # assert self.input_file_grp is not None
        # assert self.output_file_grp is not None
        # input_file_grps = self.input_file_grp.split(',')
        # for input_file_grp in input_file_grps:
        # assert input_file_grp in workspace.mets.file_groups
        log = getLogger('ocrd.processor.base')
        with pushd_popd(workspace.directory):
            self.workspace = workspace
            try:
                # FIXME: add page parallelization by running multiprocessing.Pool (#322)
                for input_file_tuple in self.zip_input_files(on_error='abort'):
                    # FIXME: add error handling by catching exceptions in various ways (#579)
                    # for example:
                    # - ResourceNotFoundError → use ResourceManager to download (once), then retry
                    # - transient (I/O or OOM) error → maybe sleep, retry
                    # - persistent (data) error → skip / dummy / raise
                    input_files = [None] * len(input_file_tuple)
                    for i, input_file in enumerate(input_file_tuple):
                        if i == 0:
                            log.info("processing page %s", input_file.pageId)
                        elif input_file is None:
                            # file/page not found in this file grp
                            continue
                        input_files[i] = input_file
                        if not self.download:
                            continue
                        try:
                            input_files[i] = self.workspace.download_file(input_file)
                        except ValueError as e:
                            log.error(repr(e))
                            log.warning("skipping file %s for page %s", input_file, input_file.pageId)
                    self.process_page_file(*input_files)
            except NotImplementedError:
                # fall back to deprecated method
                self.process()
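The control flow of `process_workspace` (page loop first, legacy `process()` as fallback) can be reduced to a small sketch. The classes below are hypothetical stand-ins: `page_tuples` replaces `zip_input_files()`, and no workspace, download or logging is involved:

```python
class ProcessorSketch:
    """Hypothetical stand-in for Processor, reduced to the dispatch logic."""

    def __init__(self):
        self.calls = []

    def process(self):
        # Legacy (v2-style) entry point.
        raise NotImplementedError()

    def process_page_file(self, *input_files):
        # New page-level entry point.
        raise NotImplementedError()

    def process_workspace(self, page_tuples):
        try:
            for input_files in page_tuples:
                self.process_page_file(*input_files)
        except NotImplementedError:
            # Fall back to the deprecated whole-workspace method.
            self.process()


class NewStyle(ProcessorSketch):
    def process_page_file(self, *input_files):
        self.calls.append(('page', input_files))


class OldStyle(ProcessorSketch):
    def process(self):
        self.calls.append(('workspace',))
```

Note that in this sketch any `NotImplementedError` escaping the loop triggers the fallback, not only the one raised by an unimplemented page method.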

    def add_metadata(self, pcgts):
    def process_page_file(self, *input_files) -> None:
        """
        Process the given ``input_files`` of the :py:attr:`workspace`,
        representing one physical page (passed as one opened
        :py:class:`~ocrd_models.OcrdFile` per input fileGrp)
        under the given :py:attr:`parameter`, and make sure the
        results get added accordingly.

        (This uses process_page_pcgts, but can be overridden by subclasses
        to handle cases like multiple fileGrps, non-PAGE input etc.)
        """
        log = getLogger('ocrd.processor.base')
        input_pcgts = [None] * len(input_files)
        for i, input_file in enumerate(input_files):
            # FIXME: what about non-PAGE input like image or JSON ???
            log.debug("parsing file %s for page %s", input_file.ID, input_file.pageId)
            try:
                input_pcgts[i] = page_from_file(input_file)
            except ValueError as e:
                log.info("non-PAGE input for page %s: %s", input_file.pageId, e)
        output_pcgts = self.process_page_pcgts(*input_pcgts)
        output_file_id = make_file_id(input_files[0], self.output_file_grp)
        output_pcgts.set_pcGtsId(output_file_id)
        self.add_metadata(output_pcgts)
        # FIXME: what about save_image_file in process_page ???
        # FIXME: what about non-PAGE output like JSON ???
        self.workspace.add_file(file_id=output_file_id,
                                file_grp=self.output_file_grp,
                                page_id=input_files[0].pageId,
                                local_filename=os.path.join(self.output_file_grp, output_file_id + '.xml'),
                                mimetype=MIMETYPE_PAGE,
                                content=to_xml(output_pcgts))
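In this revision, `process_workspace` appears to leave `None` in `input_files` when a page is missing from a secondary fileGrp, while the parsing loop above reads `input_file.ID` unconditionally. A minimal sketch of a guard, using a hypothetical helper `parse_inputs` and a caller-supplied `parse` function standing in for `page_from_file`:

```python
def parse_inputs(input_files, parse):
    """Parse one file per fileGrp, keeping None where nothing usable exists.

    Hypothetical helper: `parse` is expected to raise ValueError for
    input it cannot handle (e.g. non-PAGE files).
    """
    parsed = [None] * len(input_files)
    for i, input_file in enumerate(input_files):
        if input_file is None:
            # No file for this page in this fileGrp.
            continue
        try:
            parsed[i] = parse(input_file)
        except ValueError:
            # Unparseable input stays None; the page method decides what to do.
            pass
    return parsed
```

The result keeps one slot per input fileGrp, so the page-level method can still tell which group a missing input belongs to.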

    def process_page_pcgts(self, *input_pcgts) -> OcrdPage:
        """
        Process the given ``input_pcgts`` of the :py:attr:`workspace`,
        representing one physical page (passed as one parsed
        :py:class:`~ocrd_models.OcrdPage` per input fileGrp)
        under the given :py:attr:`parameter`, and return the
        resulting :py:class:`~ocrd_models.OcrdPage`.

        (This contains the main functionality and must be overridden by subclasses.)
        """
        raise NotImplementedError()
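With this hook, a processor only has to map parsed input pages to one output page; file handling stays in the base class. A minimal sketch of such an override, assuming hypothetical stand-ins (`PageSketch` for `OcrdPage`, `ProcessorSketch` for `Processor`):

```python
from dataclasses import dataclass


@dataclass
class PageSketch:
    """Hypothetical stand-in for OcrdPage: just an ID and some text."""
    page_id: str
    text: str


class ProcessorSketch:
    """Hypothetical stand-in for Processor, reduced to the page-level hook."""

    def process_page_pcgts(self, *input_pcgts):
        raise NotImplementedError()


class UppercaseText(ProcessorSketch):
    def process_page_pcgts(self, *input_pcgts):
        # One parsed page per input fileGrp; only the first is used here.
        source = input_pcgts[0]
        return PageSketch(page_id=source.page_id, text=source.text.upper())
```

The override performs no I/O, which makes it straightforward to unit-test in isolation.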

    def add_metadata(self, pcgts: OcrdPage) -> None:
        """
        Add PAGE-XML :py:class:`~ocrd_models.ocrd_page.MetadataItemType` ``MetadataItem`` describing
        the processing step and runtime parameters to :py:class:`~ocrd_models.ocrd_page.PcGtsType` ``pcgts``.
@@ -233,6 +340,7 @@ def resolve_resource(self, val):
        Args:
            val (string): resource value to resolve
        """
        initLogging()
        executable = self.ocrd_tool['executable']
        log = getLogger('ocrd.processor.base')
        if exists(val):
@@ -250,6 +358,19 @@ def resolve_resource(self, val):
            return ret[0]
        raise ResourceNotFoundError(val, executable)

    def show_resource(self, val):
        res_fname = self.resolve_resource(val)
        fpath = Path(res_fname)
        if fpath.is_dir():
            with pushd_popd(fpath):
                fileobj = io.BytesIO()
                with tarfile.open(fileobj=fileobj, mode='w:gz') as tarball:
                    tarball.add('.')
                fileobj.seek(0)
                copyfileobj(fileobj, sys.stdout.buffer)
        else:
            sys.stdout.buffer.write(fpath.read_bytes())
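The factored-out `show_resource` dumps a file verbatim and a directory as a gzipped tarball. A minimal, testable sketch of the same behaviour, assuming a hypothetical function `dump_resource` that writes to any binary stream and uses `arcname='.'` instead of changing directory with `pushd_popd`:

```python
import io
import tarfile
from pathlib import Path
from shutil import copyfileobj


def dump_resource(path, out):
    """Write a file verbatim, or a directory as a .tar.gz, to binary stream `out`."""
    fpath = Path(path)
    if fpath.is_dir():
        fileobj = io.BytesIO()
        with tarfile.open(fileobj=fileobj, mode='w:gz') as tarball:
            # arcname='.' stores members relative to the directory itself,
            # without changing the process working directory.
            tarball.add(str(fpath), arcname='.')
        fileobj.seek(0)
        copyfileobj(fileobj, out)
    else:
        out.write(fpath.read_bytes())
```

Passing `sys.stdout.buffer` as `out` would reproduce the command-line behaviour, while an `io.BytesIO` makes the function easy to test.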

    def list_all_resources(self):
        """
        List all resources found in the filesystem and matching content-type by filename suffix