RFC: Multipage Feature #928
Comments
I can think of two cases for using this.
For the first, current processing is sufficient, maybe with an additional page separator character. For the second, it would be useful to have the option you suggest. |
An implementation could also get pairs of image file name and output base from the file list instead of image files only. That would allow more flexibility. |
savedlist
The current implementation creates an output file. The console displays the file names being processed.
A suggested additional option would create a separate output file for each line in the listfile. The syntax of the command could be different: either not specifying the outputbase, or using the same name as the listfile, or something else, whatever is easy to implement. The output files will be based on the filenames in the listfile: same basename with a changed extension. The listfile in the above two cases can be created e.g. by
This would require a change to the listfile format. Optionally, if the output base is not given separately in the listfile, the output file would be named after the filename in the list. |
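To make the proposal concrete, here is a minimal Python sketch of how such a listfile could be read. The format is an assumption made for illustration only (one entry per line, optionally followed by a tab and an output base); it is not an existing Tesseract format, and `parse_listfile` is a hypothetical helper.

```python
from pathlib import Path


def parse_listfile(text):
    """Return (image, output_base) pairs from the text of a listfile.

    Assumed (hypothetical) format: one entry per line, either
    "image" or "image<TAB>output_base". Blank lines are skipped.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        image, _, base = line.partition("\t")
        image = image.strip()
        base = base.strip()
        if not base:
            # No output base given: reuse the image name without its extension.
            base = str(Path(image).with_suffix(""))
        pairs.append((image, base))
    return pairs
```

With this layout, a line that only names an image keeps today's listfile syntax, so existing listfiles would stay valid.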
@stweil Please compare the time taken for the list option vs. multipage TIFF vs. processing each file separately in a loop, to see which option may be recommended. Here is the result of my single test using Hindi traineddata.
|
Page numbering in this option starts with 0. It should be changed to start with 1 (as with multipage TIFF). If a multipage TIFF is listed as one of the files in the listfile, only its first page is processed. |
CC: @jbreiden |
The sum of the 5 individual tesseract processes is ~2m 46s real time, quicker than batching images in a single process. That's not quite what we were expecting to see. Anyone know why? |
I ran the test just once under WSL on Windows 10 for language Hindi, so other processes running at the same time might have affected the numbers. Hence my request to @stweil to test and compare the features. Do you find that individual files get processed faster? |
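For anyone who wants to repeat the comparison, a small timing harness along these lines may help. It is only a sketch: the file names (`listfile.txt`, `page1.tif`, ...) and output bases are placeholders, and a single run is as noisy as the test above, so several repetitions are advisable.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd (a list of arguments) and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def compare(variants):
    """Map each label to the total time of its commands, run one after another."""
    return {label: sum(time_command(cmd) for cmd in cmds)
            for label, cmds in variants.items()}


# Placeholder file names; adjust them to the local test set.
variants = {
    # One process that reads the image names from a list file.
    "listfile": [["tesseract", "listfile.txt", "out_list", "-l", "hin"]],
    # One process per image, as in a shell loop.
    "loop": [["tesseract", f"page{i}.tif", f"out_{i}", "-l", "hin"]
             for i in range(1, 6)],
}

# Example call (requires tesseract and the image files to be present):
# print(compare(variants))
```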
This might be related to the adaptive learning that Tesseract does. |
Here are my results for a simple hello world image. Summary for default language (identical to
Summary for language with very large
LSTM (--oem 1)
Old engine (--oem 0)
LSTM engine with large traineddata (--oem 1 -l mya)
Old engine with large traineddata (--oem 0 -l mya)
|
Amitdo, thanks for adding me to this Request For Comments. I think it is a very good idea to change the text output format to use the form feed character (U+000C) to mark page boundaries. Hopefully this is very easy. Reasonable people can disagree, but I don't think Tesseract should support an output base parameter with placeholders like page number. There are already a lot of combinations between inputs and outputs: single page input images, multipage input images, lists of images in a file, lists of images on stdin, streaming, various output formats. Combinations are tricky, and they are a big reason why we still haven't restored the "OCR to memory buffer" feature that has been mentioned so many times. If this is mostly about increasing throughput by eliminating initialization time, a common thing to do is to create an "OCR service" where a warmed up Tesseract daemon runs all the time. This type of program would make calls to libtesseract, but is otherwise a separate program, not an additional feature of tesseractmain.cpp. |
tesseract writes the file names to the console; these can be combined with the output.
or
|
@stweil PR, please! |
Yesterday I had a look at the implementation to see where I could add the page separator and found that it is already there: The parameter The parameter So the desired behavior is achieved by I noticed that Tesseract also adds an empty line at the end of each page. Do we need / want that? I'd prefer to get rid of it. |
I think the question is whether adding of page breaks should be the default in text mode, similar to HOCR or PDF. If FF is added after each page then the empty line may not be required. |
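If a form feed does end each page, a consumer can recover the individual pages from one text output with very little code. The following Python sketch assumes exactly that (FF, U+000C, written after every page, including the last); `split_pages` is a hypothetical helper, not part of Tesseract.

```python
FORM_FEED = "\f"  # U+000C


def split_pages(text, first_page=1):
    """Split OCR text on form feeds and return (page_number, page_text) pairs.

    Assumes a form feed follows every page, so the empty remainder after
    the final form feed is dropped. Empty pages in the middle are kept,
    which keeps the page numbers aligned with the input images.
    """
    pages = text.split(FORM_FEED)
    if pages and pages[-1].strip() == "":
        pages.pop()
    return list(enumerate(pages, start=first_page))
```

Numbering starts at 1 by default, matching the multipage TIFF behaviour mentioned earlier in the thread; pass `first_page=0` for zero-based numbering.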
I suggest to remove the Would that be fine for everybody? @theraysmith? |
There was no answer to my previous suggestion. If people agree, I'll prepare a pull request which removes |
My question is: Are you sure that any text editor can handle form feed? |
The ones which I know (more than 10) can handle form feed. So do all printers (which really do a form feed). |
Including Notepad? See the discussion which led to the patch: https://groups.google.com/forum/#!msg/tesseract-dev/VsgJ9R-cTQ0/OMeDjYWoAdQJ |
Notepad cannot be used reasonably with text files which use the common LF line endings – it expects CRLF. So it does not work with text files generated by Tesseract, and FF is only an additional detail. Maybe that's why I did not count Notepad as an editor. As I suggested to keep the |
I agree :-) |
I agree.
|
Thanks! |
Copied from: #911 (comment)