
Hebrew issues #82

Open
amitdo opened this issue Jul 27, 2017 · 63 comments

Comments

@amitdo

amitdo commented Jul 27, 2017

Here I'm going to raise some issues related to Tesseract's Hebrew support.

Dear participants interested in Arabic support: I suggest raising Arabic issues in a separate issue, even if Arabic/Persian and Hebrew share similar problems.

Let's start with the nikud issue.

Hebrew has two writing forms:

  • Hebrew with nikud
  • Hebrew without nikud

Nikud - Diacritical signs used in Hebrew writing.

Modern Hebrew is written (mostly) without nikud.

Children's books are written with nikud. Poetry is also usually written with nikud, and Hebrew dictionaries use it too. The Hebrew Bible uses nikud; it also uses te'amim (cantillation marks).

There are some mixed forms:

  1. In this form, most of the body text is written without nikud, but in a few places nikud is used.
    1a) Some paragraphs/sentences use nikud, when quoting the bible or a poem for example.
    1b) One or a few words in some paragraphs use nikud. This form is used, for example, for foreign names of people and places (like cities). Without nikud many words are ambiguous. Usually a native Hebrew speaker will use context to resolve the ambiguity; when ambiguity remains, adding nikud can resolve it.
  2. In this form, most (or at least a large percentage) of the words in the text are written with nikud, but for those words the nikud is only partial.

The following part is relevant to both (1b) and (2) above.
When adding nikud to a word, it might be in 'full' or 'partial' form. Sometimes adding just one nikud sign is enough to make the word unambiguous.

Ray, if you only use the web for building the langdata, you won't find many good sources for Hebrew with nikud.

Here is an excellent source which has both Hebrew with nikud (mostly poetry) and without nikud (most of the prose):
http://benyehuda.org/
Project Ben-Yehuda, named after Eliezer Ben-Yehuda, is like the famous Project Gutenberg, but just for Hebrew.
Note that some parts are copyrighted. For some other parts the copyright has expired under Israeli law, but they might still be copyrighted in the US. For your use case, building a corpus, I don't think copyright matters, but IANAL.

Do you use the Hebrew Bible as a source (like the one from Wikisource)?
I'm not sure it is a good idea to use it for modern Hebrew.

More information will follow later.

@amitdo
Author

amitdo commented Jul 27, 2017

tesseract-ocr/tesseract#648 (comment)

theraysmith commented:

Here are some examples of test data with diacritics:

Truth: שֶׁהוּא נָס מִפָּנָיו, אָמַר לוֹ מֹשֶה: כָּל הַיּוֹם הָיִיתִי אוֹמֵר לְךָ בְּשֵׁם הַקֹּדֶשׁ וְלֹא הָיִיתָ
OCR: שָהוּא נס מִפָּנִיו, אָמַר לו משָה: כָּל הַיום הָיִיתי אוּמר לֶך בָּשַם הקדש ולא הָיִיתָ
Confs: 0.84 0.56 0.64 0.93 0.96 0.77 0.88 0.76 0.63 0.64 0.54 0.45 0.91 0.88 0.58
Diff: שָהוּא נָס מִפָּנִיו, אָמַר לוֹ מֹשָה: כָּל הַיּוֹם הָיִיתִי אוּמֵר לֶ ךָ בָּשַם הַקֹּדֶשׁ וְלֹא הָיִיתָ
Recall Errors = 12
Precision Errors = 2

Truth: ותופחים בבטנים ובשירים שארכם כארך סיגריה,
OCR: וְתופחים בַּבַּטָנִיס וּבשירים שאַרכֶּם כַארְף סִיגְרִיה,
Confs: 0.71 0.71 0.91 0.8 0.56 0.56
Diff: וְתופחים בַּבַּטָנִיס וּבשירים שאַרכֶּם כַארְף סִיגְרִיה,
Recall Errors = 6
Precision Errors = 1

In all these cases, tesseract gets a poor result.
In case 1, the diacritics are in the truth text, and Tesseract gets them badly wrong.
In case 2, the diacritics are NOT in the truth text, and Tesseract suggests some anyway.
I don't think that both of these truth texts can be "correct" in the sense that one has the diacritics and the other does not. Which way should it be and why?

@amitdo
Author

amitdo commented Jul 27, 2017

(1) I didn't find mistakes in the letters themselves.

(2) I found two mistakes in the letters themselves.
Samekh [ס] instead of Mem-sofit [ם].

The issues with training Tesseract to recognize nikud are:

  • You need good sources for the training text. I suspect you don't have good sources.
  • For Hebrew without nikud, the number of glyphs the network needs to learn is small: 22 letters + 5 final letter forms + [0-9] + punctuation + a few common marks.
    For Hebrew with nikud, the number of glyphs the network needs to learn is much greater: some hundreds.
  • The signs of nikud are very small.
  • The signs can be confused with noise in the image.
  • I think you have some issues with the Hebrew unicharset; I'll report on that later.

@amitdo
Author

amitdo commented Jul 27, 2017

(2) The second mistake in the letters themselves.
Pe-sofit [ף] instead of Kaf-sofit [ך].

The letters it wrongly chose are indeed very similar to the correct letters.

The two incorrect words in (2) are not true dictionary words.

@amitdo
Author

amitdo commented Jul 27, 2017

Another issue with Hebrew - Tesseract's Dictionary.

Hebrew uses both prefixes and suffixes with base words.

http://hspell.ivrix.org.il/ (AGPL)

http://hspell.ivrix.org.il/WHATSNEW

Vocabulary: 468,508 words (when built with "--enable-fatverb")
based on 24495 base words:
12908 nouns,
3889 adjectives,
5261 verb stems,
and 2437 other words.

So 468,508 words are produced from 24,495 base words plus their suffix forms.

They don't mention the prefix forms. I think that from 24,495 base words + prefix and suffix forms you will get at least 9 million words.

The Hebrew wordlist contains 152,000 words. I believe this list will not cover enough of Hebrew. The result: LSTM + dict might not be better than raw LSTM alone. This is my assumption and it needs to be verified.

Hspell's dictionary does not include nikud.
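The combinatorial effect of prefixes can be sketched in a few lines of Python (the prefix list here is simplified for illustration; hspell's actual morphology rules are more involved):

```python
# Hebrew one-letter prefixes (and a few common combinations) attached to
# a base word. Simplified sketch; real morphology has more rules about
# which prefixes combine with which word forms.
PREFIXES = ["", "ו", "ה", "ב", "כ", "ל", "מ", "ש",
            "וה", "וב", "וכ", "ול", "ומ", "וש", "שה", "כש"]

def prefixed_forms(base):
    """Return the base word with each prefix attached."""
    return [p + base for p in PREFIXES]

forms = prefixed_forms("כביש")  # kvish, "road"
# 16 surface forms from a single base word; with suffix inflection on
# top of this, the combinatorial growth explains the ~9M estimate above.
```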

@amitdo
Author

amitdo commented Jul 27, 2017

@nyh, @dankenigsberg [hspell authors]

Sorry to bother you.

Can you please read the comment above this one, and answer these questions:

  • How many 'words' can hspell recognize (including all the suffix forms)?
  • Zipf's law: how many 'words' do you think are needed for 70% coverage in Hebrew (no nikud)? For 80%? 85%? 90%? 95%?

@amitdo
Author

amitdo commented Jul 27, 2017

Returning to your question.

In all these cases, tesseract gets a poor result.
In case 1, the diacritics are in the truth text, and Tesseract gets them badly wrong.
In case 2, the diacritics are NOT in the truth text, and Tesseract suggests some anyway.
I don't think that both of these truth texts can be "correct" in the sense that one has the diacritics and the other does not. Which way should it be and why?

Both (1) and (2) are not so good because of the issues with nikud.

In (1) the OCR'ed text has a lot of nikud mistakes.
If you omit (or try to completely ignore) the nikud in the OCR'ed text, the text is almost perfect in its 'without nikud' form.
When you omit the nikud, for some words you'll have to add vav [ו] or yud [י] letters instead of the nikud.
The right way to write הַקֹּדֶשׁ in the 'without nikud' form is הקודש
Here you add a vav instead of the omitted holam-haser sign.

In (2) the network tries to be 'too smart' and adds nikud signs which do not appear in the ground truth.
For OCR this 'feature' is not something that you want.

As a note, this feature can be useful for another, separate application: converting (a kind of translation) text in 'without nikud' form to 'with nikud' form. But it will be useful only if it has good accuracy. For training, you'd use pairs of text lines: (1) the 'without nikud' input and (2) the desired 'with nikud' output.
Something like that was done a few years ago with an HMM by two Israeli students.
https://www.cs.bgu.ac.il/~elhadad/hocr/
A funny thing is that they trained an old version of Tesseract to read Hebrew with nikud, and then used the OCR'ed output of a scanned book written with nikud as part of training the HMM 'nikud translator'.

@amitdo
Author

amitdo commented Jul 28, 2017

To summarize the above:
(For OCR) I think (1) is preferable to (2). But neither is good.

So, unless you can make the nikud recognition much better, IMO a reasonable solution might be to drop the nikud signs.

@amitdo
Author

amitdo commented Jul 29, 2017

heb.wordlist contains these words:

אַרטיקל-
קאַװע־שטיבל
בלאַט:
נאַװיגאציע,
אַר־עס־עס
באַניצער
װאָס
קאַטעגאָריע
אינהאַלט
באַהאַלטן
נאָך
אַלע
צוזאַמענארבעט
אַרטיקל
רעדאַקטירן
דאָס
אַריבערשליסונגען
אָקטאָבער
אַנאָנימע
נאָר
באַנוצערס
אַלץ
האָט
[בעאַרבעטן]
זאָל
קאָנטאַקט
אַהער
ראָבאָטן
װי
װען
װעגן
װעט
(װערסיעס)
מעדיעװיקי
װערסיעס
װעלכע
װערסיע

All of them are Yiddish words, not Hebrew.

If you omit these words (you should), only 12 words with nikud will be left in the heb.wordlist file.

@Shreeshrii
Contributor

Ray,
I got better results while segregating Devanagari with Vedic accents from regular Devanagari for training. See tesseract-ocr/tessdata#61 (comment)

You can also consider having two separate traineddata files for Hebrew, one with nikud and one without, each with corresponding wordlists, if that gives better accuracy for each. It won't work for mixed texts though.

@amitdo
Author

amitdo commented Jul 30, 2017

The Hebrew word list and training text should not contain the Yiddish digraphs:

05F0 װ HEBREW LIGATURE YIDDISH DOUBLE VAV
05F1 ױ HEBREW LIGATURE YIDDISH VAV YOD
05F2 ײ HEBREW LIGATURE YIDDISH DOUBLE YOD
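A wordlist filter along these lines could drop the entries containing these ligatures (a sketch; it only catches words that actually contain one of the three ligatures, so other Yiddish words would still need a dictionary-based check):

```python
# Drop wordlist entries containing the Yiddish-only ligatures
# U+05F0 (װ), U+05F1 (ױ), U+05F2 (ײ).
YIDDISH_LIGATURES = set("\u05F0\u05F1\u05F2")

def is_hebrew_word(word):
    """True if the word contains none of the Yiddish ligatures."""
    return not (set(word) & YIDDISH_LIGATURES)

words = ["שלום", "קאַװע־שטיבל", "װאָס"]
hebrew_only = [w for w in words if is_hebrew_word(w)]
# hebrew_only == ["שלום"]
```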

@amitdo
Author

amitdo commented Jul 30, 2017

Ray,
How many fonts do you use for Hebrew training?

@amitdo
Author

amitdo commented Jul 31, 2017

Here are two examples from Project Ben-Yehuda:

Poem:
http://benyehuda.org/bialik/bia060.html

Prose:
http://benyehuda.org/yalag/yalag_086.html

@amitdo
Author

amitdo commented Aug 3, 2017

@theraysmith
If you have further questions about this subject, I'll be happy to answer them.

@theraysmith
Contributor

I have questions:
Prefixes and suffixes. Do they just append to the base word without changing it (like cat->cats), or do they potentially change the word slightly (like lady->ladies)? If they never do the latter, it might be possible to fix the problem by allowing no-space word concatenation, like Chinese, Japanese, and Thai.

My previous post was missing the images:
image
image
The nikuds were on both images, but the ground truth was wrong, as it didn't contain them.

In my opinion, Tesseract should output exactly the nikuds that are in the image, no more, no less. Is that reasonable? It makes the training complicated because it means that words can appear either way. I can see the appeal of just discarding all the nikud, but it doesn't seem the right thing to do.

Yiddish. You list a bunch of Yiddish words, and then in a separate post Yiddish-specific characters. Do all those Yiddish words contain one or more of those characters? If not, how can I separate Yiddish from Hebrew? Those Yiddish characters are not in the unicharset for the latest Hebrew model. The unicharset only has 67 characters.

If you omit these words (you should), only 12 words with nikud will be left in the heb.wordlist file.
You need good sources for the training text. I suspect you don't have good sources.
That gives me a new idea for filtering the wordlists. I think I can solve the problem of the nikuds. I have all of the web at my disposal, so it is just a matter of filtering correctly, provided there aren't changes of font to deal with. (See below.)

I take it that the unicodes you refer to as nikud are 5b0-5c4, and that the cantillation marks 591-5af are used so rarely as to be totally ignored?
The unicharset for the best model that I just pushed only has 5b0, 5b4, 5b6, 5b7, 5b8, 5bc + 27 base letters. I could force others to be included, or just broaden the filter to capture more of the corpus. I notice that 5b9 is the most frequently dropped character. Please suggest which others should be included.

Fonts. Too many to count. (attached)
hebrewfonts.txt
Problem: I have noticed that there is an older style of font, in which the letters are very rounded instead of rather square. Tesseract is very inaccurate on this style, as there are few if any fonts that look like it.
Question: Do any of the attached fonts use this older style? If so which? (I can boost the frequency to get the accuracy up). If not, are there any publicly available?

@amitdo
Author

amitdo commented Aug 4, 2017

I have noticed that there is an older style of font, in which the letters are very rounded instead of rather square.

Something like these samples:
http://fontbit.co.il/search.asp?tag=3&style=13
?
That's the style used for handwriting in Hebrew. It's different from the printed style.

@amitdo
Author

amitdo commented Aug 6, 2017

https://fonts.google.com/?subset=hebrew

A list of Hebrew fonts from the Open Siddur Project
http://opensiddur.org/tools/fonts/

@amitdo
Author

amitdo commented Aug 6, 2017

Prefixes and suffixes. Do they just append to the base word without changing it (like cat->cats), or do they potentially change the word slightly (like lady->ladies)? If they never do the latter, it might be possible to fix the problem by allowing no-space word concatenation, like Chinese, Japanese, and Thai.

Both forms exist.

A road - kvish כביש - plural kvishim כבישים
A meeting - pgisha פגישה - plural pgishot פגישות

@amitdo
Author

amitdo commented Aug 6, 2017

I suggest using these Unicode code points for heb.traineddata (Hebrew, not including the additional Yiddish code points):

Hebrew Alef-Bet (Alphabet)

05D0-05EA
22 letters + 5 final forms = 27

Numerals

0-9
0030-0039

Nikud

If you want to support nikud, you should include:
05B0-05BC, 05C1, 05C2

Unique Hebrew punctuation marks

05BE ‫־‬ HEBREW PUNCTUATION MAQAF
05F3 ‫׳‬ HEBREW PUNCTUATION GERESH
05F4 ‫״‬ HEBREW PUNCTUATION GERSHAYIM

Common marks

Other common marks - the ones that are already in heb.traineddata (as 'Common').

Links

http://unicode.org/charts/PDF/U0590.pdf
https://en.wikipedia.org/wiki/Hebrew_alphabet
https://en.wikipedia.org/wiki/Hebrew_punctuation
https://en.wikipedia.org/wiki/Niqqud
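For illustration, the proposed set can be expressed as a small validation function (a sketch based on the ranges above; the unspecified 'common marks' such as Latin punctuation are omitted here):

```python
# Character set proposed above for heb.traineddata, as Python ranges.
ALLOWED = set(
    [chr(c) for c in range(0x05D0, 0x05EB)]   # 22 letters + 5 final forms
    + [chr(c) for c in range(0x0030, 0x003A)]  # digits 0-9
    + [chr(c) for c in range(0x05B0, 0x05BD)]  # nikud points 05B0-05BC
    + ["\u05C1", "\u05C2"]                     # shin dot, sin dot
    + ["\u05BE", "\u05F3", "\u05F4"]           # maqaf, geresh, gershayim
)

def in_proposed_set(text):
    """True if every non-space character is in the proposed set."""
    return all(ch in ALLOWED or ch.isspace() for ch in text)
```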

@amitdo
Author

amitdo commented Aug 6, 2017

In my opinion, Tesseract should output exactly the nikuds that are in the image, no more, no less. Is that reasonable?

Yes.

The ideal is that Tesseract will do a good job with these texts:

  1. Text without nikud.
  2. Text with minor use of nikud.
  3. Text with nikud in all/most words.

It makes the training complicated because it means that words can appear either way.

The question is whether it can achieve high accuracy on the 3 kinds of texts above.

I can see the appeal of just discarding all the nikud, but it doesn't seem the right thing to do.

I think you may want to consider and try several approaches for training:

(GT here is Ground Truth)

  1. The GT, in all text lines, does not include nikud signs.
    This model will be used for most Hebrew texts, which either do not use nikud at all, or in which nikud appears in only 0.1% to 1% of the words.
  2. The GT has nikud in all/most words.
    This model will be used on Hebrew texts which are very likely to have nikud: poetry and texts aimed at children.
  3. Half of the text lines in the GT have nikud and the other half do not.
  4. Like 3, but any letter+nikud sign(s) combination in the GT is normalized to a form without nikud as a first step in training.
  5. Like 3, but during OCR the user will have an option to blacklist all nikud signs.

@amitdo
Author

amitdo commented Aug 7, 2017

My comments about the Hebrew wordlist were based on the file in the langdata repo.

@amitdo
Author

amitdo commented Aug 7, 2017

@theraysmith,

Please read my new comments, starting with #82 (comment)

Talking about the files in best/heb.traineddata:

  • The heb.lstm-unicharset has some nikud signs, but lacks some others.
  • In heb.lstm-word-dawg there are only 67 words with nikud. Some of them are Yiddish words.

@amitdo
Author

amitdo commented Aug 9, 2017

best/heb.traineddata has only 6 nikud signs:

5b0 HEBREW POINT SHEVA
5b4 HEBREW POINT HIRIQ
5b6 HEBREW POINT SEGOL
5b7 HEBREW POINT PATAH
5b8 HEBREW POINT QAMATS
5bc HEBREW POINT DAGESH OR MAPIQ

9 nikud signs are missing.

Also missing are the 3 unique Hebrew punctuation marks I mentioned earlier.

@theraysmith
Contributor

OK, I have added desired/forbidden characters for heb and yid.
I assume that apart from the 3 unique characters you listed (for each), the list of nikud signs should be the same?

@Doragon-S

That makes sense.
But then how does Tesseract deal with non-dictionary words, like names, or intentionally misspelled words (h8, l33t, etc.)? Can it deal with those?
Once Tesseract has the text in black and white, what is so hard about identifying each letter individually? ('u' is different from 'a' because it isn't connected on top, etc.) Isn't that how children learn the letters? Is it possible to program computers on the differences?

@EastEriq

But then how does Tesseract deal with non-dictionary words, like names, or intentionally misspelled words (h8, l33t, etc.)? Can it deal with those?

It gives weights. If the recognition score of the non-dictionary word exceeds that of the possible alternatives, it outputs it; if the vocabulary alternative seems more likely, it goes for that. This is why the training corpus should be representative of the texts it is aimed at.

Once Tesseract has the text in black and white, what is so hard about identifying each letter individually? ('u' is different from 'a' because it isn't connected on top, etc.) Isn't that how children learn the letters?

Not on real-world unregistered text: faint, distorted, faded and whatnot. Anyway, children who learn single letters aren't readers yet.

@Doragon-S

OK. I thought I saw that Tesseract de-skews the image and also converts it to binary black and white.

@EastEriq

The problems don't end there.

@Doragon-S

Doragon-S commented Sep 29, 2020

I saw that you said that someone needs to 'train' Tesseract (I did find out what 'training' is).
I was confused, though, about why you couldn't just take a database that is already in text form, make a program to save some of its lines as pictures, and have it train Tesseract automatically. If it has both the image and the answer, it should only have to check the answer against Tesseract's response, which is what a human would be doing anyway, right?

@EastEriq

If you're going to undertake that, it would certainly be a great contribution. See what has been said in the messages above.

@chsanch

chsanch commented Feb 8, 2021

We are working with some Rashi-based books (in Judeo-Spanish). We have scanned PDFs of the originals. I tried to use Tesseract to extract text from the images, but it didn't work that well (of course). I'm not sure if someone is already working on adding support for Rashi fonts, but it would be nice. If any help is needed, just let me know.

@matantech

Hi all,
I can't get Tesseract to scan Hebrew at all, and I keep getting the error

actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file ../../ccutil/tessdatamanager.cpp, line 53

Does anyone have a valid Hebrew traineddata file or any other solution?

Thanks

@Shreeshrii
Contributor

Shreeshrii commented Feb 8, 2021

@matantech

@Shreeshrii
I didn't mention I'm using Tesseract for iOS; my bad.
Is there any valid Hebrew traineddata for it?
Thanks!

@amitdo
Author

amitdo commented Feb 9, 2021

@matantech

Please use the forum for technical support.

@MDjavaheri

You should know that ABBYY FineReader does a good but not perfect job with Rashi script. Even after training it above the standard recognition pattern, I still can't get it to tell the difference between a final Mem and a Samech which is understandable but somewhat of a nuisance. It works maybe 10% of the time. A similar problem exists with a Taf and a Het, but training has improved that to only be an issue about 20% of the time.
If you're going to teach Tesseract what to do, keep that in mind.

@amirbachar

What would be the best way to train Tesseract on these fonts? (http://freefonts.co.il/)
We've tried manually creating a TIFF image for each one and then tagging the bounding boxes, but it's a tedious process, and the results are not optimal (perhaps due to inconsistencies in the tagging).
Is there a library that automates the whole training process using the font file? (Bounding boxes also intersect for some fonts.)

@AvtechScientific

Hebrew has two writing forms:

* Hebrew with nikud

* Hebrew without nikud

@amitdo - is there a way to tell Tesseract whether to recognize or ignore nikud during OCR with the current official heb.traineddata?

If there is no way to ignore nikud during OCR, is there an easy way to delete it after recognition? And what needs to be done to make nikud recognition optional?

Thank you!

@amitdo
Author

amitdo commented Aug 31, 2021

Tesseract has an option to blacklist characters. Consult the docs and/or ask in the forum about this option. But note that people have reported it does not work well with the LSTM-based models.

is there an easy way to delete it after recognition?

Yes, with a few lines of bash/perl/python that remove the diacritics from the txt/hocr output. You'll have to write this yourself...
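For example, a minimal Python sketch of such a post-processing step (it strips the nikud and te'amim combining marks from plain-text output; hOCR would need the same substitution applied to its text nodes):

```python
import re

# Remove Hebrew diacritics: cantillation marks (U+0591-U+05AF) and
# nikud points (U+05B0-U+05C7), while keeping the punctuation characters
# in that range: maqaf U+05BE, paseq U+05C0 and sof pasuq U+05C3.
DIACRITICS = re.compile(r"[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]")

def strip_nikud(text):
    return DIACRITICS.sub("", text)

# Note: this is purely mechanical. As discussed earlier in the thread,
# the standard 'without nikud' spelling sometimes adds a vav or yud
# instead (e.g. stripping הַקֹּדֶשׁ yields הקדש, not the standard הקודש).
```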

Please use the forum to ask questions.

@AvtechScientific

AvtechScientific commented Aug 31, 2021

@amitdo - I asked here because my question relates directly to this issue and contributes to its treatment.

By blacklisting, do you mean forbidden_characters? If yes, then it seems to have effect only during training data generation. If so, then to make Tesseract ignore nikud during recognition (the feature you requested above), you would have to create two separate files: heb.traineddata (like the current one) and heb_nikudless.traineddata (trained on data lacking nikud). Am I right? Or is there a way to "blacklist" certain characters during recognition? If so, you could run Tesseract with nikud-aware traineddata but tell it to ignore the blacklisted letters (despite the fact that they are being recognized)...

@amitdo
Author

amitdo commented Aug 31, 2021

I'm talking about the parameter tessedit_char_blacklist, which you can give to Tesseract on the command line, either with -c parameter=value or with a config file that contains the parameter.
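As a sketch, a blacklist value covering the nikud signs listed earlier in this thread could be built like this (whether it actually helps is subject to the LSTM caveat mentioned above):

```python
# Build a tessedit_char_blacklist value containing all nikud signs:
# the points U+05B0-U+05BC plus the shin/sin dots U+05C1, U+05C2.
# Intended use (parameter name is real; file names are examples):
#   tesseract in.png out -l heb -c tessedit_char_blacklist=<value>
nikud_blacklist = "".join(chr(c) for c in range(0x05B0, 0x05BD)) + "\u05C1\u05C2"
```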

This is my last answer. This place is not a support forum.

@AvtechScientific

AvtechScientific commented Jan 24, 2022

I tried to train Tesseract to recognize Rashi script. Here are the results:

https://gitlab.com/pninim.org/tessdata_heb_rashi

It was my first time training Tesseract, so I might have made mistakes. I have documented the process, so if you see anything that can be improved - please let me know. It looks like the recognition is comparable to that of ABBYY FineReader (at least in my sample test). Any feedback is appreciated!

@Shreeshrii
Contributor

@AvtechScientific Thank you for taking the effort to train Hebrew Rashi script.

@amitdo and others can check it further.

Please share the test data and results that show how well your new traineddata does in recognizing Rashi script - something along the lines of https://github.com/tesseract-ocr/tessdata_contrib/blob/main/khmLimon.md Thanks!

@EastEriq
Copy link

Just a few naive comments from a cursory look at heb.wordlist:

  1. There are many words beginning with geresh or gershayim. Otherwise, it is probably sensible and correct to treat geresh and gershayim as alphabetic characters, to avoid word splitting. There are a few inconsistencies though, like sometimes a double geresh instead of gershayim.
  2. There are several words ending with sof pasuq, which should be punctuation.
  3. I'm in doubt whether maqaf should be considered punctuation.
  4. What is your position w.r.t. including nikkud? You have words including even taamim.
  5. Would it be worth making different wordlists for Hebrew versus Aramaic, considering the corpora this would be used for?

@AvtechScientific

AvtechScientific commented Jan 25, 2022

@EastEriq - thank you for your feedback!

(1) / (2) heb.wordlist was generated automatically from Sefaria's MongoDB dump, so there might be quite a lot of inconsistencies... The question is: how problematic is this for real-world recognition? That's why I asked people for feedback - to see all possible kinds of test documents (clean typed docs, scans of modern docs, scans of old books, etc.) and how the current model performs compared to FineReader...

(3) Indeed, probably like gershayim it should be treated as an alphabetic character, to avoid word splitting... Do I treat it as punctuation somewhere?

(4) I don't remember seeing books in Rashi script with nikkud, so it is probably not such a practical, widespread case. (Regarding taamim, see 1/2.)

(5) Do you mean two separate files - heb.wordlist and ara.wordlist?

@benyamindsmith

@AvtechScientific thank you for trying it out. I just downloaded the traineddata file and found that there are still some issues with blurrier text.

For context, I used this file, and for comparison this is what I have:

The text

image

The output

כזזשך כחעו יזוחי יזלהים ישל. ארץ ממולדחי נחשבחי כנודר קונם - בורח מחיוח נזורדי״ל ליו הפ
במורדי אור וישג ה ז טליהם את אונם - זרים רדפוני חנם - והייתי כאורח נטה ללו פה אטשטרדם
חלפחי זם הספד כע״ט מוצל משרפה - נזעשה ידי חומן שר וגדול׳ בישרחזל רועה צאנם - שר נולדחי בייזי

I tried setting the DPI higher but this is as good as I got it.

Looking forward to seeing the project progress.

@AvtechScientific

AvtechScientific commented Jan 29, 2022

@benyamindsmith thank you for your feedback. Issues like these are expected. What would be interesting is to figure out how this compares to FineReader or other OCR software... Do you have it, to digitize the same input?

@AvtechScientific

@Shreeshrii - thank you. PR created.

@ghost

ghost commented Mar 12, 2022

I want to OCR Ladino texts. The material I have is scanned PDFs from Hebrew books (printed in the 18th and 19th centuries). The density and size are mostly incorrect. There are four variations of bet, gimel, zayin and pe, which signify ve, dj/ch, j, and fe.
I have prepared some 300 lines of truth data and ran the training.
I have the following problems:
I used code points FB31, FB32, FB36 and FB44 for these special characters (Hebrew letters with dagesh), but I don't see these codes in the unicharset that Tesseract prepares. As a result, in the output file the dagesh moves to another letter.
I would also like to add the new font to the trained Hebrew data.
sample
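One possible explanation (an assumption on my part, not verified against the training pipeline) is Unicode normalization: the presentation forms U+FB31 etc. decompose canonically into base letter + dagesh and are on the composition-exclusion list, so normalized training text never contains them, and neither would a unicharset built from it:

```python
import unicodedata

# U+FB31 HEBREW LETTER BET WITH DAGESH has a canonical decomposition
# into bet (U+05D1) + dagesh (U+05BC). Because the Hebrew presentation
# forms are Unicode composition exclusions, even NFC leaves them
# decomposed rather than recomposing the single FB31 code point.
decomposed = unicodedata.normalize("NFC", "\uFB31")
assert decomposed == "\u05D1\u05BC"
```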

@Maxwell175

@amitdo What is the current state of this issue? I am working on digitizing a number of Talmud and Hebrew-related books. What can I do to move this forward and improve support for diacritics and Rashi script?

@amitdo
Author

amitdo commented Aug 23, 2023

This issue was directed at Ray from Google, who trained all the models in the 3 tessdata repos. He has not participated in the project in recent years.

The process is documented in the tessdoc repo. It's not easy to train the LSTM engine, and I don't have time to help. You can try to ask in our forum.

@amitdo
Author

amitdo commented Aug 23, 2023

The status: still unsolved.

@AvtechScientific

@amitdo What is the current state of this issue? I am working on digitizing a number of Talmud and Hebrew-related books. What can I do to move this forward and improve support for diacritics and Rashi script?

I trained tesseract for Rashi script some time ago and documented the training process:

https://gitlab.com/pninim.org/tessdata_heb_rashi

There you can also find the link to the Rashi font collection.


14 participants