merge_unicharsets.cpp #1024
Comments
Amit, which kind of unicharsets does it merge?
Does it just append, or does it also eliminate duplicates? Where would a merged unicharset be used?
Good questions, Shree. Unfortunately, I simply don't have the answers...
I just found this file and saw that it isn't in Makefile.am, which means it won't be compiled, so you can't actually use it.
In that case, we should add it to Makefile.am so that we can test it and figure out what it does :-)
It takes two or more unicharset files with this format:
I don't know when you are supposed to use it.
It calls this function to do the merge: tesseract/ccutil/unicharset.cpp Line 439 in 29f3de9
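To make the "append vs. eliminate duplicates" question concrete, here is a minimal standalone sketch of what a de-duplicating merge could look like. It is not Tesseract's code: `CharList` and `MergeCharLists` are hypothetical names, a unicharset is reduced to a plain ordered list of strings, and nothing in this thread confirms that the real merge function behaves this way.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a unicharset: an ordered list of unichar strings.
// A real unicharset also carries per-character properties, ignored here.
using CharList = std::vector<std::string>;

// Merge several lists into one, keeping first-seen order and skipping
// any unichar that is already present in the result.
CharList MergeCharLists(const std::vector<CharList>& inputs) {
  CharList merged;
  std::unordered_set<std::string> seen;
  for (const CharList& input : inputs) {
    for (const std::string& unichar : input) {
      // insert() returns a pair; its bool is true only for a new key.
      if (seen.insert(unichar).second) {
        merged.push_back(unichar);
      }
    }
  }
  return merged;
}
```

With this sketch, merging the lists `a b c` and `x a y` gives `a b c x y`: the second `a` is dropped and the order of first appearance is kept. A plain append would instead keep both copies of `a`, which is the difference the question above is asking about.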
It could be used to create a combined unicharset for a script-level engine, like the new Latin or Devanagari.
Is there a similar merge_language_model program for building a script-level engine? Recently someone asked me:
Where lat is IAST, or the Roman transliteration of Sanskrit, in Latin script + English.
So something like this needs a combination of Devanagari + Gujarati + san_latn (or IAST, or Latin). What would be the best way to do this? Can multiple training_files.txt for different languages be given as input to lstmtraining, or do they all need to be merged into one big file?
Here is the merged unicharset for these languages:
For next time, I suggest not mixing unrelated commits in one PR.
Thanks anyway!
https://github.com/tesseract-ocr/tesseract/blob/master/training/merge_unicharsets.cpp
Should we add it to training/Makefile.am?