Multi-Page Input for the CLI #43
Comments
Oh, the CLI supports multi-page input. You can put more than one One of the issues with the glob-like input I wanted to avoid is the inability to explicitly define inputs and outputs. The current syntax is rather verbose, but I frankly haven't found a better way yet. There should be autogenerated (from docstrings) API documentation on http://kraken.re. The whole shebang is basically calling binarization.nlbin, pageseg.segment, lib.models.load_any (loading the model), and feeding everything into rpred.rpred, which returns an iterator over all the lines.
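The chain of calls described above can be sketched as a small generator that runs the stages once per page. This is only an illustration, not kraken's own API: `ocr_pages` is a hypothetical helper, and the three stages are passed in as plain callables because the exact signatures and return types of `binarization.nlbin`, `pageseg.segment`, and `rpred.rpred` should be checked against the installed version's documentation.

```python
from typing import Callable, Iterable, Iterator, Tuple


def ocr_pages(
    pages: Iterable,
    binarize: Callable,
    segment: Callable,
    recognize: Callable,
) -> Iterator[Tuple[int, object]]:
    """Yield (page_index, line_record) for every recognized line, page by page.

    The callables are placeholders (an assumption of this sketch):
      binarize  -> something like binarization.nlbin
      segment   -> something like pageseg.segment
      recognize -> a wrapper around rpred.rpred with a model from
                   lib.models.load_any already bound to it
    """
    for index, page in enumerate(pages):
        bw = binarize(page)        # page image -> binarized image
        bounds = segment(bw)       # binarized image -> line boundaries
        # recognize returns an iterator over the lines of this page
        for record in recognize(bw, bounds):
            yield index, record
```

Because the result is a single iterator over all pages, the caller can write one consolidated output for the whole document instead of one file per page.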
Tesseract supports multi-page TIFF or a list of files of any image type*. It outputs to a single txt/pdf/hocr. * Any format that Leptonica supports. Here is a related issue:
Related in
I know that tesseract and ocropy can do so; I just haven't seen any non-1-page-per-file documents in the wild (as from libraries etc.).
E.g. https://archive.org/details/siopsecretusplan0000prin provides a multi-page (zipped) ABBYY file. However, it is also possible to use hocr-combine to merge several hocr files into one hocr file afterwards.
Thanks for the responses! I'll try the function calls you mentioned to just integrate it directly. And yes, I saw in the docs that you could pass multiple input documents to the CLI, but the docs I'm dealing with are 30+ pages long, so it's not that practical. I mean, I suppose I could throw all the tiffs into the CLI with a call to subprocess.run, but it feels cumbersome. So, I still think that glob strings would be a useful feature.
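For what it's worth, the subprocess workaround mentioned above could look roughly like the sketch below: expand the glob in Python and hand the CLI one explicit input/output pair per page. `build_kraken_command` is a hypothetical helper, and the `-i input output` pairs followed by the `binarize segment ocr` subcommands are an assumption about the CLI syntax that should be checked against `kraken --help` for the installed version.

```python
import glob
import os


def build_kraken_command(pattern, out_ext=".txt"):
    """Expand a glob pattern into explicit input/output pairs for the CLI."""
    pages = sorted(glob.glob(pattern))  # sorted so the page order is stable
    if not pages:
        raise FileNotFoundError(f"no files match {pattern!r}")
    cmd = ["kraken"]
    for page in pages:
        # Output path: same name as the input, with the extension swapped.
        out = os.path.splitext(page)[0] + out_ext
        cmd += ["-i", page, out]  # assumed CLI syntax: -i <input> <output>
    cmd += ["binarize", "segment", "ocr"]  # assumed subcommand chain
    return cmd


# Usage (not run here):
#   import subprocess
#   subprocess.run(build_kraken_command("scan/page_*.tif"), check=True)
```

This keeps the inputs and outputs explicit, as the CLI wants, while the caller only has to supply the glob string.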
Friendly suggestion: I love that Kraken supports python3 and is fairly lightweight on dependencies, but what is starting to be a deal-breaker is the lack of support for multiple-page input. My workflow (which I suspect is fairly common) is to take a PDF, split it into Group 4 TIFFs, and then OCR the TIFF images into a hocr document (and then on to NLP-land).
Ocropy handles glob characters (? and * wildcards) for multiple pages of input, and can generate a consolidated hocr file for the whole document. As far as I'm concerned (and probably a lot of other people too), these are must-have features.
So, for your consideration, I'd recommend either (1) allowing the CLI to accept glob-like input, or (2) building/documenting an API so it can be used from Python code, without the CLI, on multiple-page documents.
Maybe (2) already exists in some form or fashion, with some selective imports etc. But on cursory review, it's tough to pick out the pieces.
Thanks!