Multi-Page Input for the CLI #43

parkerhancock · 2017-05-26T20:57:59Z

Friendly suggestion - I love that Kraken supports Python 3 and is fairly light on dependencies, but the lack of support for multi-page input is starting to be a deal-breaker. My workflow (which I suspect is fairly common) is to take a PDF, split it into Group 4 TIFFs, and then OCR the TIFF images into an hOCR document (and then on to NLP-land).

Ocropy accepts glob characters (? and * wildcards) for multiple pages of input, and can generate a consolidated hOCR file for the whole document. As far as I (and probably a lot of people) am concerned, these are must-have features.

So, for your consideration, I'd recommend either (1) allowing the CLI to accept glob-like input, or (2) building/documenting an API so it can be used from Python code, without the CLI, for multi-page documents.

Maybe (2) already exists in some form or fashion, with some selective imports, etc. But on a cursory review, it's tough to pick out the pieces.

Thanks!

mittagessen · 2017-05-26T21:46:56Z

Oh, the CLI supports multi-page input. You can pass more than one -i input.tif output.hocr option to recognize multiple pages. It doesn't support serializing multiple input files into a single output file, though (ALTO doesn't even allow it), and I haven't seen any multi-page hOCR files in the wild.

One of the issues with glob-like input that I wanted to avoid is the inability to explicitly define inputs and outputs. The current syntax is rather verbose, but I frankly haven't found a better way yet.

There should be autogenerated (from docstrings) API documentation on http://kraken.re. The whole shebang is basically calling binarization.nlbin, pageseg.segment, lib.models.load_any (loading the model), and feeding everything into rpred.rpred, which returns an iterator over all the lines.
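For readers who want to see the shape of that call chain over many pages, here is a minimal sketch. It deliberately takes the four stages as plain callables instead of importing kraken, so nothing below claims to be kraken's actual signatures; `recognize_pages` and its parameter names are hypothetical. The commented wiring at the bottom assumes `rpred.rpred` takes the model, the image, and the segmentation in that order, which should be checked against the API docs for the installed version.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def recognize_pages(
    paths: Iterable[str],
    load_image: Callable[[str], Any],
    binarize: Callable[[Any], Any],
    segment: Callable[[Any], Any],
    recognize: Callable[[Any, Any], Iterable[Any]],
) -> Iterator[Tuple[str, List[Any]]]:
    """Run load -> binarize -> segment -> recognize for each page.

    Yields (path, lines) per page, where `lines` is whatever the
    recognizer's iterator produced for that page, collected in a list.
    """
    for path in paths:
        image = load_image(path)
        bitonal = binarize(image)
        bounds = segment(bitonal)
        # The recognizer is described as returning an iterator over lines,
        # so it is materialized here to keep results grouped by page.
        yield path, list(recognize(bitonal, bounds))


# Possible wiring with kraken (assumption: verify names and argument
# order against the API documentation; the model path is hypothetical):
#
#   from PIL import Image
#   from kraken import binarization, pageseg, rpred
#   from kraken.lib import models
#
#   net = models.load_any("model.pyrnn.gz")
#   pages = recognize_pages(
#       sorted_tiff_paths,
#       Image.open,
#       binarization.nlbin,
#       pageseg.segment,
#       lambda im, bounds: rpred.rpred(net, im, bounds),
#   )
```

Keeping the stages injectable means the model is loaded once outside the loop, and the per-page results can be handed to whatever serializer (one hOCR file per page, or a combined one) the caller prefers.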

amitdo · 2017-05-26T22:14:56Z

> I haven't seen any multi-page hOCR files in the wild.

Tesseract supports multi-page TIFFs or a list of files of any image type*. It outputs to a single txt/pdf/hocr file.

* Any format that Leptonica supports.

Here is a related issue:
tesseract-ocr/tesseract#928

zuphilip · 2017-05-27T06:33:28Z

Related in ocr-fileformat-samples: kba/ocr-fileformat-samples#8

mittagessen · 2017-05-27T10:48:20Z

I know that tesseract and ocropy can do so; I just haven't seen any non-1-page-per-file documents in the wild (e.g. from libraries).

zuphilip · 2017-05-27T11:18:09Z

E.g. https://archive.org/details/siopsecretusplan0000prin provides a multi-page (zipped) ABBYY file.

However, it is also possible to use hocr-combine afterwards to merge several hOCR files into one.

parkerhancock · 2017-05-28T23:42:14Z

Thanks for the responses! I'll try the function calls you mentioned to just integrate it directly.

And yes, I saw in the docs that you can pass multiple input files to the CLI, but the documents I'm dealing with are 30+ pages long, so it's not that practical. I mean, I suppose I could throw all the TIFFs into the CLI with a call to subprocess.run, but it feels cumbersome. So I still think that glob strings would be a useful feature.
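As a stopgap, the subprocess route can be kept fairly small. The sketch below expands a glob pattern into the repeated `-i input output` pairs described earlier in the thread. The helper name `build_kraken_command` is hypothetical, and the subcommand list is left to the caller because the exact processing chain and output-format options depend on the installed kraken version.

```python
import glob
from pathlib import Path
from typing import List, Sequence


def build_kraken_command(
    pattern: str, out_suffix: str, subcommands: Sequence[str]
) -> List[str]:
    """Expand a glob pattern into repeated `-i input output` pairs.

    Each output path is the input path with its suffix swapped for
    `out_suffix`, so inputs and outputs stay explicit in the command.
    """
    cmd = ["kraken"]
    # Plain lexicographic sort: page files should be zero-padded
    # (page-01.tif, page-02.tif, ...) to come out in page order.
    for in_path in sorted(glob.glob(pattern)):
        out_path = str(Path(in_path).with_suffix(out_suffix))
        cmd += ["-i", in_path, out_path]
    return cmd + list(subcommands)


# Example use (assumption: the subcommand chain below matches the
# installed CLI; check `kraken --help` before relying on it):
#
#   import subprocess
#   cmd = build_kraken_command("pages/*.tif", ".hocr",
#                              ["binarize", "segment", "ocr"])
#   subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with odd filenames, and a single invocation for all 30+ pages means the model only has to be loaded once.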

mittagessen closed this as completed Oct 3, 2017
