Print separate lines for pages in log output of extract_lines.py #591

Merged
merged 1 commit into from
Apr 20, 2024

stweil
Contributor

@stweil stweil commented Apr 19, 2024

Instead of

Processing PPN1807526488/1807526488_0011.xml .....................................Processing PPN1807526488/1807526488_0005.xml ....Processing PPN1807526488/1807526488_0004.xml ..Processing PPN1807526488/1807526488_0010.xml .................................................Processing PPN1807526488/1807526488_0006.xml Processing PPN1807526488/1807526488_0012.xml ...................................Processing PPN1807526488/1807526488_0013.xml ........................Processing PPN1807526488/1807526488_0007.xml .....

it produces this log output:

Processing PPN1807526488/1807526488_0011.xml .....................................
Processing PPN1807526488/1807526488_0005.xml ....
Processing PPN1807526488/1807526488_0004.xml ..
[...]
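
The change in behaviour can be sketched roughly as follows. This is a minimal illustration, not the actual diff of this pull request: the helper name `log_page` and its parameters are hypothetical, and the real script extracts a line image where the placeholder comment sits. The only point it shows is that a newline is written once all progress dots for a page have been printed.

```python
import io
import sys


def log_page(path, n_lines, out=sys.stdout):
    """Write one log line per page: file name, one dot per extracted line, newline.

    `log_page` is a hypothetical helper used only for illustration.
    """
    out.write(f"Processing {path} ")
    for _ in range(n_lines):
        # ... one line image would be extracted here ...
        out.write(".")
        out.flush()
    # Without this newline the next page's message is appended to the same line.
    out.write("\n")


# Collect the output in a buffer so that it can be inspected, then print it.
buf = io.StringIO()
log_page("PPN1807526488/1807526488_0005.xml", 4, out=buf)
log_page("PPN1807526488/1807526488_0004.xml", 2, out=buf)
print(buf.getvalue(), end="")
```

With the final `out.write("\n")` removed, both messages would end up on a single line, as in the first log excerpt above.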

@stweil
Contributor Author

stweil commented Apr 19, 2024

@mittagessen, thank you for pointing me to this script. It works pretty well and is really fast.

Is it also possible to extract the line images without a black background? I am not sure whether line images which only contain the original image inside of the polygon are good for Tesseract training.

I tried --legacy-polygons, but it looks like that code no longer works (it aborts with an exception).

@mittagessen mittagessen merged commit 1486bbd into mittagessen:main Apr 20, 2024
@mittagessen
Owner

> Is it also possible to extract the line images without a black background? I am not sure whether line images which only contain the original image inside of the polygon are good for Tesseract training.

Hmm, not really. The extract_polygons() function the script calls just masks it out, and you'd need to change that function so that it doesn't apply the mask. But I'm not sure how much use the extracted lines are for Tesseract training anyway, as the baseline projection the line extractor does is obviously not available in Tesseract's bbox data model.
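
To illustrate the difference between a masked polygon crop and a plain bounding-box crop, here is a small pure-Python sketch. It is not kraken's extract_polygons() and does not reproduce its baseline projection; the function names, the `apply_mask` flag and the nested-list image representation are all assumptions made for this example.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test of a point against a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal ray at height y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def crop_line(image, polygon, apply_mask=True, fill=0):
    """Crop the bounding box of `polygon` from `image` (a list of pixel rows).

    With apply_mask=True, pixels whose centre lies outside the polygon are
    replaced by `fill` (black background). With apply_mask=False, the
    original pixels of the whole bounding box are kept.
    """
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    rows = []
    for y in range(y0, y1):
        row = []
        for x in range(x0, x1):
            keep = not apply_mask or point_in_polygon(x + 0.5, y + 0.5, polygon)
            row.append(image[y][x] if keep else fill)
        rows.append(row)
    return rows


# A 4x4 image of uniform grey value 9 and a triangular "line polygon".
image = [[9] * 4 for _ in range(4)]
triangle = [(0, 0), (4, 0), (0, 4)]
masked = crop_line(image, triangle, apply_mask=True)
unmasked = crop_line(image, triangle, apply_mask=False)
```

Here `masked` has a black background outside the triangle, while `unmasked` is the untouched bounding-box crop, which is closer to what a bbox-based data model expects.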

@stweil stweil deleted the extract_lines branch April 20, 2024 14:29