[Bug]: Inconsistent language order in tesseract calls #1112

abwiersma · 2023-06-16T19:53:01Z

What were you trying to do?

I was trying to get consistent hOCR results from a list of language models, but found that even though the list of languages supplied to ocrmypdf was consistent, the list passed on to tesseract was in a random order.

So for example:
ocrmypdf -l lang1+lang2+lang3
would result in a random permutation of the -l parameter being passed on to tesseract, something like:
'tesseract', '-l', 'lang2+lang1+lang3'

This breaks consistent results, because Tesseract treats the first language as the primary one and gives it preference over the secondary languages.
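
To illustrate the likely cause: the linked PR title points at `set()`, and iterating a Python set of strings has no guaranteed order, so it can differ between runs. Below is a minimal sketch of deduplicating languages while keeping the order the user supplied. The helper names `dedupe_languages` and `build_tesseract_language_args` are hypothetical and are not OCRmyPDF's actual code.

```python
# Illustrative sketch only: these helpers are hypothetical, not OCRmyPDF's API.


def dedupe_languages(languages):
    """Remove duplicate language codes while keeping first-seen order."""
    # A dict preserves insertion order (guaranteed since Python 3.7),
    # whereas iterating a set of strings has no guaranteed order and
    # can vary between interpreter runs because of hash randomization.
    return list(dict.fromkeys(languages))


def build_tesseract_language_args(languages):
    """Build the '-l' argument pair in the order the user supplied."""
    return ["-l", "+".join(dedupe_languages(languages))]


requested = "lang1+lang2+lang3".split("+")
print(build_tesseract_language_args(requested))
# prints ['-l', 'lang1+lang2+lang3']
```

With this approach the first language given on the command line stays first in the argument passed to tesseract, even if a language is listed twice.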

Where are you installing from?

PyPI (pip, poetry, pipx, etc.)

What operating system are you working on?

Linux

Relevant log output

No response


abwiersma · 2023-06-16T19:53:24Z

I am writing a PR to fix this issue.

abwiersma · 2023-06-20T09:10:24Z

Fixed with PR

abwiersma added the bug label Jun 16, 2023

abwiersma assigned jbarlow83 Jun 16, 2023

abwiersma closed this as completed Jun 16, 2023

abwiersma reopened this Jun 16, 2023

abwiersma mentioned this issue Jun 16, 2023

Fix randomly ordered languages from set() #1113

Merged

abwiersma closed this as completed Jun 20, 2023
