Hebrew issues #82
tesseract-ocr/tesseract#648 (comment) theraysmith commented:
In (1) I didn't find mistakes in the letters themselves. In (2) I found two mistakes in the letters themselves. The issues with training Tesseract to recognize nikud are:
(2) The second kind of mistake is in the letters themselves. The letters it wrongly chose are indeed very similar to the correct ones. The two incorrect words in (2) are not real dictionary words.
Another issue with Hebrew - Tesseract's dictionary. Hebrew attaches both prefixes and suffixes to base words. http://hspell.ivrix.org.il/ (AGPL) http://hspell.ivrix.org.il/WHATSNEW
So 468,508 words are produced from 24,495 base words plus their suffix forms. They don't mention the prefix forms. I think that from 24,495 base words plus prefix and suffix forms you will get at least 9 million words. The Hebrew wordlist contains 152,000 words. I believe this list will not cover enough Hebrew words. The result: LSTM + dict might not be better than raw LSTM alone. This is my assumption and it needs to be verified. Hspell's dictionary does not include nikud.
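To make the scale estimate above concrete, here is a back-of-the-envelope calculation. The only figures taken from the comment are 24,495 base words and 468,508 suffixed forms; the average number of prefix combinations per form is an assumption for illustration, not a number from hspell.

```python
# Rough estimate of the Hebrew surface-form explosion described above.
# Only base_words and suffixed_forms come from the hspell figures quoted
# in this thread; prefix_combinations is an illustrative assumption.
base_words = 24495
suffixed_forms = 468508
avg_suffix_forms = suffixed_forms / base_words   # ~19 suffixed forms per base word
prefix_combinations = 20                         # assumed average per suffixed form

estimated_total = suffixed_forms * prefix_combinations
print(f"~{avg_suffix_forms:.1f} suffix forms per base word")
print(f"~{estimated_total:,} estimated surface forms")  # on the order of 9 million
```

Even with a conservative prefix-combination count, the total dwarfs the 152,000-word list, which supports the concern that the dictionary cannot cover Hebrew well.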
@nyh, @dankenigsberg [hspell authors] Sorry to bother you. Could you please read the comment above this one and answer these questions:
Returning to your question.
Both (1) and (2) are not so good because of the issues with nikud. In (1) the OCR'ed text has a lot of nikud mistakes. In (2) the network tries to be 'too smart' and adds nikud signs which do not appear in the ground truth. As a note, this feature could be useful for another, separate application: converting (a kind of translating) text 'without nikud' to its 'with nikud' form. But it will be useful only if it has good accuracy. For training, you would use pairs of text lines: (1) the 'without nikud' input and (2) the desired 'with nikud' output.
To summarize the above: unless you can make the nikud recognition much better, IMO a reasonable solution might be to drop the nikud signs.
heb.wordlist contains these words:
All of them are Yiddish words, not Hebrew. If you omit these words (as you should), only 12 words with nikud will be left in the heb.wordlist file.
Ray, you can also consider having two separate traineddata files for Hebrew, one with nikud and one without, each with a corresponding wordlist, if that gives better accuracy for each. It won't work for mixed texts, though.
The Hebrew word list and training text should not contain the Yiddish digraphs: 05F0 װ HEBREW LIGATURE YIDDISH DOUBLE VAV
Ray,
Here are two examples from Project Ben-Yehuda:
@theraysmith
I have questions. My previous post was missing the images. In my opinion, Tesseract should output exactly the nikud that appear in the image, no more, no less. Is that reasonable? It makes the training complicated, because it means that words can appear either way. I can see the appeal of just discarding all the nikud, but it doesn't seem like the right thing to do. Yiddish: you list a bunch of Yiddish words, and then in a separate post Yiddish-specific characters. Do all those Yiddish words contain one or more of those characters? If not, how can I separate Yiddish from Hebrew? Those Yiddish characters are not in the unicharset for the latest Hebrew model; the unicharset has only 67 characters.
I take it that the unicodes you refer to as nikud are 5b0-5c4, and that the cantillation marks 591-5af are used so rarely that they can be ignored entirely? Fonts: too many to count. (attached)
Something like these samples:
https://fonts.google.com/?subset=hebrew A list of Hebrew fonts from the Open Siddur Project
Both forms exist. A road - kvish כביש - plural kvishim כבישים
I suggest using these unicodes for heb.traineddata (Hebrew, not including the additional Yiddish unicodes):
Hebrew Alef-Bet (alphabet): 05D0-05EA
Numerals: 0-9
Nikud: if you want to support nikud, you should include:
Unique Hebrew punctuation marks: 05BE ־ HEBREW PUNCTUATION MAQAF
Common marks: other common marks - the ones that are already in heb.traineddata (as 'Common').
Links: http://unicode.org/charts/PDF/U0590.pdf
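The proposed character set can be written out as explicit code-point ranges. This is a sketch of the proposal in this thread, not the contents of the shipped heb.traineddata; the nikud range 5b0-5c4 is taken from the earlier comment, and only the maqaf is included among the unique punctuation marks, as above.

```python
# Code-point sets for the proposed Hebrew traineddata (a sketch of this
# thread's proposal, not the actual shipped unicharset).
HEBREW_LETTERS = set(range(0x05D0, 0x05EA + 1))   # Alef-Bet
DIGITS = set(range(ord('0'), ord('9') + 1))       # numerals 0-9
NIKUD = set(range(0x05B0, 0x05C4 + 1))            # nikud points (optional)
HEBREW_PUNCT = {0x05BE}                           # HEBREW PUNCTUATION MAQAF

def in_proposed_set(ch: str, with_nikud: bool = True) -> bool:
    """Check whether a character falls in the proposed recognition set."""
    allowed = HEBREW_LETTERS | DIGITS | HEBREW_PUNCT
    if with_nikud:
        allowed |= NIKUD
    return ord(ch) in allowed

print(in_proposed_set('\u05D0'))                      # True: ALEF
print(in_proposed_set('\u05B0', with_nikud=False))    # False: SHEVA excluded
```

Splitting the set on a `with_nikud` flag mirrors the earlier suggestion of shipping two traineddata variants, with and without nikud.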
Yes. The ideal is that Tesseract will do a good job with these texts:
The question is whether it can achieve high accuracy on the three kinds of texts above.
I think you may want to consider and try several approaches for training: (GT here is Ground Truth)
My comments about the Hebrew wordlist were based on the file in the langdata repo.
Please read my new comments, starting with #82 (comment). Talking about the files in best/heb.traineddata:
best/heb.traineddata has only 6 nikud signs: 5b0 HEBREW POINT SHEVA. 9 nikud signs are missing. Also missing are the 3 unique Hebrew punctuation marks I mentioned earlier.
OK, I have added desired/forbidden characters for heb and yid.
That makes sense.
It gives weights: if the recognition score of the non-dictionary word exceeds that of the possible alternatives, it outputs it; if the vocabulary alternative seems more likely, it goes with that. This is why the training corpus should be representative of the
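The weighting described above can be sketched as a toy decision rule. This is purely illustrative: the function name, scores, and bonus parameter are hypothetical, not Tesseract's actual scoring code.

```python
# Hypothetical sketch of the dictionary-vs-raw weighting described above
# (illustrative only; not Tesseract's real scoring internals).
def choose(raw_word: str, raw_score: float,
           dict_word: str, dict_score: float,
           dict_bonus: float = 0.1) -> str:
    """Prefer the dictionary candidate when its weighted score wins."""
    if dict_score + dict_bonus >= raw_score:
        return dict_word      # vocabulary alternative seems more likely
    return raw_word           # raw recognition wins despite the dictionary

print(choose("abc", 0.90, "abd", 0.85))  # abd: dictionary bonus tips the scale
print(choose("abc", 0.99, "abd", 0.85))  # abc: raw score clearly dominates
```

The point of the sketch is that a dictionary only biases the choice; it does not override a sufficiently confident raw recognition, which is why corpus coverage matters.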
Not on real-world unregistered text - faint, distorted, faded and whatnot. Anyway, children who are learning single letters aren't yet readers.
OK. I had thought I saw that Tesseract de-skews the image, and also converts it to binary black and white.
The problems don't end there.
I saw that you said someone needs to 'train' Tesseract. (I did find out what 'training' is.)
If you're going to undertake that, it would certainly be a great contribution. See what has been said in the messages above.
We are working with some Rashi-based books (in Judeo-Spanish). We have the scanned PDFs from the originals. I tried to use Tesseract to extract some text from the images, but it didn't work that well (of course). I'm not sure if someone is already working on adding support for Rashi fonts, but it would be nice. If any help is needed, just let me know.
Hi all,
Does anyone have valid Hebrew trained data or any other solution? Thanks
See tesseract-ocr/tesseract#1613 I suggest you try the latest version of tesseract. |
@Shreeshrii |
Please use the forum for technical support. |
You should know that ABBYY FineReader does a good but not perfect job with Rashi script. Even after training it beyond the standard recognition pattern, I still can't get it to tell the difference between a final Mem and a Samech, which is understandable but somewhat of a nuisance. It works maybe 10% of the time. A similar problem exists with Taf and Het, but training has improved that to be an issue only about 20% of the time.
What would be the best way to train Tesseract on these fonts? (http://freefonts.co.il/)
@amitdo - is there a way to tell Tesseract whether to recognize or to ignore nikud during OCR with the current official If there is no way to ignore nikud during OCR, is there an easy way to delete it after recognition? And what needs to be done to make nikud recognition optional? Thank you!
Tesseract has an option to blacklist characters. Consult the docs and/or ask in the forum about this option, but note that people have reported it does not work well with the LSTM-based models.
Yes, with a few lines of bash/perl/python script that removes the diacritics from the txt/hocr output. You'll have to write this yourself... Please use the forum to ask questions.
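The kind of post-processing script mentioned above might look like the sketch below. The code-point ranges follow the earlier discussion in this thread (nikud 05B0-05C4, cantillation 0591-05AF); keeping maqaf, paseq, and sof pasuq is a judgment call, since those are punctuation rather than diacritics.

```python
# Strip Hebrew nikud (U+05B0..U+05C4) and cantillation marks (U+0591..U+05AF)
# from recognized text - the post-processing approach suggested above.
# 05BE (maqaf), 05C0 (paseq), and 05C3 (sof pasuq) are punctuation, so keep them.
NIKUD_AND_TEAMIM = set(range(0x0591, 0x05C5)) - {0x05BE, 0x05C0, 0x05C3}

def strip_nikud(text: str) -> str:
    """Return text with nikud and cantillation marks removed."""
    return ''.join(ch for ch in text if ord(ch) not in NIKUD_AND_TEAMIM)

word = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"  # shalom with nikud
print(strip_nikud(word))  # the bare letters: shin, lamed, vav, final mem
```

Run over a Tesseract txt output file, this turns 'with nikud' text into 'without nikud' text without retraining, at the cost of discarding the nikud information entirely.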
@amitdo - I asked here because my question relates directly to this issue and contributes to its resolution. By blacklisting, do you mean forbidden_characters? If so, it seems to have an effect only during training data generation. In that case, it looks like in order to make Tesseract recognize text while ignoring nikud (the feature you requested above), you have to create two separate files: heb.traineddata (like the current one) and heb_nikudless.traineddata (with training data lacking nikud). Am I right? Or is there a way to "blacklist" certain characters during recognition? If so, you could run Tesseract with nikud-aware traineddata but tell it to ignore the blacklisted letters (despite the fact that they are being recognized)...
I'm talking about the parameter. This is my last answer. This place is not a support forum.
I tried to train Tesseract to recognize Rashi script. Here are the results: https://gitlab.com/pninim.org/tessdata_heb_rashi It was my first time training Tesseract, so I might have made mistakes. I have documented the process, so if you see anything that can be improved, please let me know. It looks like the recognition is comparable to that of ABBYY FineReader (at least in my sample test). Any feedback is appreciated!
@AvtechScientific Thank you for making the effort to train Hebrew Rashi script. @amitdo and others can check it further. Please share the test data and results that show how well your new traineddata does at recognizing Rashi script - something along the lines of https://github.com/tesseract-ocr/tessdata_contrib/blob/main/khmLimon.md Thanks!
just a few naive comments from a cursory look at heb.wordlist:
@EastEriq - thank you for your feedback!
(1) / (2)
(3) Indeed; probably, like gershayim, it should be treated as an alphabetic character to avoid word splitting... Do I treat it as punctuation somewhere?
(4) I don't remember seeing books in Rashi script with nikkud, so it is probably not such a widespread practical case. (Regarding taamim, see (1)/(2).)
(5) Do you mean two separate files -
@AvtechScientific thank you for trying it out. I just downloaded the traineddata file and found that there are still some issues with blurrier text. For context, I used this file, and for some comparison this is what I got.
The output:
כזזשך כחעו יזוחי יזלהים ישל. ארץ ממולדחי נחשבחי כנודר קונם - בורח מחיוח נזורדי״ל ליו הפ
I tried setting the DPI higher, but this is as good as I got it. Looking forward to seeing the project progress.
@benyamindsmith thank you for your feedback. Issues like these are expected. What is interesting is to figure out how this compares to FineReader or other OCR software... Do you have it available to digitize the same input?
@Shreeshrii - thank you. PR created. |
@amitdo What is the current state of this issue? I am working on digitizing a number of Talmud and Hebrew-related books. What can I do to move this forward and improve support for diacritics and Rashi script?
This issue was directed to Ray from Google, who trained all the models in the 3 tessdata repos. He has not participated in the project in recent years. The process is documented in the tessdoc repo. It's not easy to train the LSTM engine, and I don't have time to help. You can try asking in our forum.
The status: still unsolved. |
I trained tesseract for Rashi script some time ago and documented the training process: https://gitlab.com/pninim.org/tessdata_heb_rashi There you can also find the link to the Rashi font collection. |
Here, I'm going to raise some issues related to Tesseract's Hebrew support.
Dear participants interested in Arabic support: I suggest raising Arabic issues in a separate 'issue', even if there are similar problems for both Arabic/Persian and Hebrew.
Let's start with the nikud issue.
Hebrew has two writing forms:
Nikud - diacritical signs used in Hebrew writing.
Modern Hebrew is written (mostly) without nikud.
Children's books are written with nikud. Poetry is also usually written with nikud. Hebrew dictionaries also use nikud. The Hebrew Bible uses nikud; it also uses te'amim (cantillation marks).
There are some mixed forms:
1a) Some paragraphs/sentences use nikud, for example when quoting the Bible or a poem.
1b) One or a few words in some paragraphs use nikud. This form is used, for example, for foreign names of people and places (like cities). Without nikud many words are ambiguous. Usually a native Hebrew speaker will use context to resolve this ambiguity. Sometimes ambiguity still remains, and then nikud can be used to resolve it.
The following part is relevant to both (1b) and (2) above.
When adding nikud to a word, it might be in 'full' or 'partial' form. Sometimes adding just one nikud sign is enough to make the word unambiguous.
Ray, If you only use the web for building the langdata, you won't find many good sources for Hebrew with nikud.
Here is an excellent source which has both Hebrew with nikud (mostly poetry) and without nikud (most of the prose):
http://benyehuda.org/
Project Ben-Yehuda, named after Eliezer Ben-Yehuda, is like the famous Project Gutenberg, but just for Hebrew.
Note that some parts are copyrighted. In other parts the copyright has expired according to Israeli law, but they might still be copyrighted in the US. For your use case, building a corpus, I don't think the copyright matters, but IANAL.
Do you use the Hebrew Bible as a source (like the one from Wikisource)?
I'm not sure it is a good idea to use it for modern Hebrew.
More information will follow later.