Multi-Page Input for the CLI #43

Closed
parkerhancock opened this issue May 26, 2017 · 6 comments

Comments

@parkerhancock

parkerhancock commented May 26, 2017

Friendly suggestion: I love that Kraken supports Python 3 and is fairly lightweight on dependencies, but the lack of support for multi-page input is starting to be a deal-breaker. My workflow (which I suspect is fairly common) is to take a PDF, split it into Group 4 TIFFs, and then OCR the TIFF images into an hOCR document (and then on to NLP-land).

Ocropy accepts glob characters (? and * wildcards) for multiple pages of input, and can generate a consolidated hOCR file for the whole document. As far as I (and probably a lot of people) am concerned, these are must-have features.

So, for your consideration, I'd recommend either (1) allowing the CLI to accept glob-like input, or (2) building/documenting an API for using it from Python code, without the CLI, on multi-page documents.

Maybe 2 already exists in some form or fashion, with some selective imports/etc. But on cursory review, it's tough to pick out the pieces.

Thanks!

@mittagessen
Owner

Oh, the CLI supports multi-page input. You can pass more than one -i input.tif output.hocr option to recognize multiple pages. It doesn't support serializing multiple input files into a single output file, though (ALTO doesn't even allow it), and I haven't seen any multi-page hOCR files in the wild.

One of the issues with the glob-like input I wanted to avoid is the inability to explicitly define inputs and outputs. The current syntax is rather verbose but I frankly haven't found a better way, yet.

There should be autogenerated (from docstrings) API documentation at http://kraken.re. The whole shebang is basically calling binarization.nlbin, pageseg.segment, lib.models.load_any (loading the model), and feeding everything into rpred.rpred, which returns an iterator over all the lines.
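
For a multi-page document, the call sequence above could be wrapped roughly as follows. This is only a sketch, not kraken's API: `ocr_pages` and its `load`, `binarize`, `segment`, and `recognize` parameters are hypothetical names. The stages are passed in as callables because the exact signatures of `binarization.nlbin`, `pageseg.segment`, and `rpred.rpred` are not given in this thread and should be checked against the API docs.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def ocr_pages(
    paths: Iterable[str],
    load: Callable[[str], Any],
    binarize: Callable[[Any], Any],
    segment: Callable[[Any], Any],
    recognize: Callable[[Any, Any], Iterable[Any]],
) -> Iterator[Tuple[str, List[Any]]]:
    """Yield (path, recognized lines) for each page, in input order.

    All four stage callables are hypothetical hooks supplied by the caller;
    nothing here calls kraken directly.
    """
    for path in paths:
        image = load(path)        # e.g. open the TIFF
        bw = binarize(image)      # binarization step
        bounds = segment(bw)      # page segmentation step
        # Drain the recognizer's iterator so each page's lines are complete
        # before moving on to the next page.
        yield path, list(recognize(bw, bounds))
```

With kraken, one might bind `binarize` to `binarization.nlbin`, `segment` to `pageseg.segment`, and `recognize` to a small wrapper around `rpred.rpred` that closes over the model returned by `lib.models.load_any`. Those bindings are assumptions and untested here.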

@amitdo
Contributor

amitdo commented May 26, 2017

I haven't seen any multi-page hOCR files in the wild.

Tesseract supports multi-page tiff / list of files of any image type*. It outputs to a single txt/pdf/hocr.

* Any format that Leptonica supports.

Here is a related issue:
tesseract-ocr/tesseract#928

@zuphilip

Related in ocr-fileformat-samples: kba/ocr-fileformat-samples#8

@mittagessen
Owner

I know that tesseract and ocropy can do so, I just haven't seen any non-1-page-per-file documents in the wild (as from libraries etc.).

@zuphilip

E.g. https://archive.org/details/siopsecretusplan0000prin provides a multi-page (zipped) ABBYY file.

However, it is also possible to use hocr-combine afterwards to merge several hOCR files into a single hOCR file.
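
To illustrate what such a merge amounts to, here is a rough sketch using only the Python standard library. It is not how hocr-combine is implemented. `merge_hocr` is a hypothetical helper, and it assumes each input is well-formed XHTML whose pages are elements with `class="ocr_page"`, with no HTML-only entities such as `&nbsp;`.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Serialize XHTML elements without an "ns0:" prefix (global ElementTree setting).
ET.register_namespace("", "http://www.w3.org/1999/xhtml")


def _local(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.split("}")[-1]


def merge_hocr(documents: Iterable[str]) -> str:
    """Append the ocr_page elements of later documents to the first one's body."""
    roots = [ET.fromstring(doc) for doc in documents]
    if not roots:
        raise ValueError("no hOCR documents given")
    first = roots[0]
    body = next(el for el in first.iter() if _local(el.tag) == "body")
    for other in roots[1:]:
        # Collect first, then append, so we never mutate a tree while iterating it.
        pages = [el for el in other.iter() if el.get("class") == "ocr_page"]
        for page in pages:
            body.append(page)
    return ET.tostring(first, encoding="unicode")
```

Element ids are copied as-is, so files that each contain a `page_1` would produce duplicate ids in the merged output; a real tool would need to renumber them.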

@parkerhancock
Author

Thanks for the responses! I'll try the function calls you mentioned to just integrate it directly.

And yes, I saw in the docs that you could pass multiple input documents to the CLI, but the docs I'm dealing with are 30+ pages long - so it's not that practical. I mean, I suppose I could throw all the tiffs into the CLI with a call to subprocess.run, but it feels cumbersome. So, I still think that glob strings would be a useful feature.
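
For reference, the subprocess workaround mentioned above might look something like this. It is a sketch under assumptions: `build_kraken_args` and `run_kraken` are hypothetical helper names, the only CLI syntax taken from this thread is the repeated `-i input output` pair, and the trailing subcommands are left to the caller and should be checked against the kraken docs.

```python
import glob
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_kraken_args(
    pattern: str, out_suffix: str, subcommands: Sequence[str]
) -> List[str]:
    """Expand a glob into one explicit '-i input output' pair per page."""
    inputs = sorted(glob.glob(pattern))  # lexicographic: zero-pad page numbers
    if not inputs:
        raise ValueError(f"no files match {pattern!r}")
    args = ["kraken"]
    for name in inputs:
        # Each output sits next to its input, with only the suffix swapped.
        args += ["-i", name, str(Path(name).with_suffix(out_suffix))]
    return args + list(subcommands)


def run_kraken(
    pattern: str, out_suffix: str, subcommands: Sequence[str]
) -> subprocess.CompletedProcess:
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(
        build_kraken_args(pattern, out_suffix, subcommands), check=True
    )
```

Because every input still gets an explicitly named output, this keeps the explicit input/output mapping that the `-i` syntax was designed for, while hiding the verbosity from the caller.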
