-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New processor API #1240
base: master
Are you sure you want to change the base?
New processor API #1240
Conversation
- add method `process_workspace(workspace)` as a replacement for passing `workspace` in the constructor and then calling `process` (implemented by subclasses): implement in the superclass - loop over input files - delegate processing to new method `process_page_file()` if possible - otherwise fall back to old `process()` outside of loop - download input files when needed if `self.download` - add method `process_page_file()` as single-page processing procedure on OcrdFiles: implement in the superclass for the most frequent/default use-case of - (multi-) image/PAGE input files - (single) PAGE output files - delegate to new method `process_page_pcgts()` if available - add PAGE processing metadata - set PAGE PcGtsId - handle `make_file_id` and `workspace.add_file` - add method `process_page_pcgts()` as single-page processing function on OcrdPage: to be implemented only by subclasses - constructor: add kwarg `download_files` controlling `self.download` (see above)
- implement `process_page_pcgts` with behaviour for `copy_files=False` - override superclass `process_page_file` with behaviour for `copy_files=True` - remove old `process` implementation
…load_files=False` in tests (because they are not actually in the filesystem)
A couple of points worth discussing:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a great start.
… used in meta-data of resulting image
BTW, I consider it bad practice that we always load the ocrd-tool.json in the inheriting constructor. It should really be a class attribute. And assuming it is always placed in the same spot of the Python distribution (which is currently the case but not strictly required as our specs only require it in the repository root), one can even load it in the superclass (so nothing would have to be done in the inheriting code). |
…tent with file config and prevent imported libraries from initing logging first), but disable propagation for ocrd loggers (to avoid duplication)
…r itself (rather than future query)
…ny failures already (rate will be too low)
Processor parallel pages: switch from multithreading to multiprocessing
…lers with propagate=false
This is a first attempt at implementing #322 – without the actual error handling or parallel processing (so more or less refactoring and deprecation). It also enables fixing #579 in the process.