Fine Tuning Leads to Segmentation Issue #2132
Any idea what might be causing this issue?
Here is a visualisation (using https://github.com/kba/hocrjs) for both results. The layout recognition is clearly different.
What would be the reason behind the different layouts? Why would my fine-tuning have an impact? Also, thank you for your support @stweil.
Based on the recommendations for the tesstutorial in the wiki by @theraysmith, fine-tuning should only be done for a limited number of iterations. He suggested 400 iterations for fine-tuning for 'impact' and 3000 for fine-tuning to add a character, so 60000 is probably too large. Also, please check the --psm being used for training by the ocr-d/train script. Ray has mentioned in the LSTM training notes that the models were trained per line.
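For illustration, a limited fine-tuning run along the lines recommended above could look like the following sketch, assuming the standard `lstmtraining` workflow from the tesstutorial. All file names and paths here are hypothetical placeholders; the command is only assembled as a string so the flags can be inspected without running the trainer:

```shell
# Hypothetical fine-tuning run kept to a small iteration budget, as the
# tesstutorial guidance suggests. All paths below are placeholders.
MAX_ITERATIONS=400   # ~400 for 'impact'-style fine-tuning, ~3000 when adding a character

CMD="lstmtraining \
  --continue_from ara.lstm \
  --traineddata ara/ara.traineddata \
  --model_output output/ara_finetuned \
  --train_listfile ara.training_files.txt \
  --max_iterations ${MAX_ITERATIONS}"

echo "$CMD"
```

The `.lstmf` files listed in the (hypothetical) `ara.training_files.txt` would themselves be generated from single text line images, matching Ray's note that the models are trained per line.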
@stweil Does this mean that layout analysis has changed since tessdata_best was trained?
As shown in the learning curve uploaded above, the training process was successful (even for 60k iterations): the accuracy improved on a text line level. My issue, as explained above and shown in the layout representations, is one of segmentation. When running the trained model on a complete newspaper, the accuracy goes way off. Have a look at the layout representations above. I used --psm 7 for training.
Why do you think so?
@jaddoughman, is this result better? I added
There are some more components which could be taken from the original
No, even after adding the dawg files the issue remains. I can't seem to understand how training a model is in any way connected to the segmentation process. The layout representation should be identical for all models, or am I wrong?
I would have thought so, too, but recently I noticed some cases which are even more strange:
I trained twice, once including the dawg files and once excluding them. The training which included the dawg files was better than the one excluding them; however, both were far worse than the original model. Also, note that training was successful (learning curve attached above): on a text line level, the results are near perfect. However, I need the transcription of the complete newspaper sample. This was part of a 12-month research project, so reaching this issue now is devastating. On a technical level, there needs to be an explanation of why and how training a model would in any way alter the segmentation process.
@jaddoughman Which psm are you using for the complete newspaper sample? If it is the default, i.e. psm 3, then please try the training with --psm 3 (or without specifying the psm) as an experiment and see if the results are better.
I attached @Shreeshrii's fine-tuned Arabic model below. Is it possible, @stweil, to generate its corresponding layout representation? This can help us reach a conclusive decision on our initial assumption concerning the segmentation issue. Fine Tuned Model: ara-amiri-3000.traineddata.zip
Here it is:
Original Model: https://ub-blade-01.bib.uni-mannheim.de/~stweil/tesseract/issues/2132/Sample1.html The above results confirm that our assumption concerning the segmentation is true. Any explanation of the relation between fine-tuning and word detection (segmentation) would be greatly appreciated; understanding the problem can help in finding a workaround.
The layout analysis phase detects:
Word and glyph splitting is part of the OCR phase, not part of the layout analysis phase.
Why is fine-tuning changing the word recognition? How can I fix my issue? Also, if word splitting occurs in the OCR phase, then why do I get different results when running the exact same line as part of a complete newspaper versus as a standalone text line? Meaning: if I OCR a single text line, I get a different result than when OCRing a complete newspaper containing that text line.
Looking again at the code, it seems that word splitting does occur in the layout analysis phase... I think the word splitting can still be changed by the OCR phase. Sorry, I don't have answers to your last questions.
Can the code be altered to include splitting in the OCR phase? I see no reason why word splitting should be altered during training. My training dataset consisted of 4000 text lines that required crowdsourcing to generate, so a lot of time was invested in training the model. Any help would be greatly appreciated. If any of the other developers have an answer, I would be happy to try any alternative fix.
Sorry, I don't know how to help you with this issue.
@jaddoughman I unpacked your traineddata file with combine_tessdata. The lstm_unicharset in it has 303 characters, so it seems to me that you have trained using script/Arabic from tessdata_best rather than ara.traineddata. Also, please share the exact version of tesseract that you are using; your traineddata file reports beta.3.
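The check described above can be scripted. This is a minimal sketch, assuming the traineddata has already been unpacked with `combine_tessdata -u` and that the extracted lstm-unicharset file follows the standard unicharset layout, whose first line holds the entry count; the file names are hypothetical placeholders:

```python
# Sketch: count the characters in an extracted lstm-unicharset to see which
# base model a traineddata file was fine-tuned from. Assumes the file was
# unpacked first, e.g.:
#   combine_tessdata -u ara_finetuned.traineddata ara_finetuned.
# In the unicharset format, the first line is the number of entries.

def unicharset_size(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        return int(f.readline().strip())

# Hypothetical usage:
# unicharset_size("ara_finetuned.lstm-unicharset")
# A count of ~303 would point at script/Arabic rather than ara.traineddata.
```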
I trained using both the script model and the one from tessdata_best. Both altered the segmentation, leading to the same issue. I was using Tesseract 4.0 during training. However, even if another version was used, I also tried your fine-tuned model, which also resulted in altered word detection. Is it possible to alter the code so that the word splitting resides in the layout process and not the OCR one?
If I attach my training dataset, would it be possible for you to fine-tune with it, to verify that the issue isn't related to my training process?
@theraysmith is the only one with enough knowledge of the code to suggest a solution, and according to @jbreiden he is now busy on another project at Google.
The dataset below contains about 4000 text lines. The txt files are in RTL order. I was informed that they needed to be changed to LTR, and I attempted the conversion by inverting the string of every text file. I am sharing the dataset in RTL, since my conversion attempt might itself be the cause of one of my issues. I fine-tuned for 60,000 iterations and saw a great improvement in accuracy on a text line level. I think your training attempt can help us reach a conclusive decision on the origin of the issue. Thank you for your help. Dataset: dataset.zip
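As a side note on the conversion step mentioned above: naively inverting each line is not a safe RTL-to-LTR transformation, because reversal mirrors every code point, including embedded digit and Latin runs. A small self-contained illustration (the sample line is synthetic, not taken from the dataset):

```python
# Demonstrates why inverting a string is not a correct RTL -> LTR conversion:
# slicing reverses everything, so embedded digits come out backwards too.

def naive_invert(line: str) -> str:
    return line[::-1]

line = "صدر العدد 1934 من الجريدة"   # Arabic text containing a year
inverted = naive_invert(line)

# The digits are now mangled: "1934" has become "4391".
assert "1934" in line
assert "4391" in inverted
```

A proper conversion needs bidi-aware reordering that keeps digit and Latin runs in logical order; plain slicing does not.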
According to posts by Ray, training for all languages is done in LTR order, and there is a routine in tesseract to handle the change to RTL later. I do not know Arabic, hence I cannot check whether the conversion is correct; I am relying on text2image to create the correct box files. I have concatenated your text files to create a training_text for fine-tuning, and I will run the training with the Scheherazade font and share the results. From my earlier experience, fine-tuning seems to work best when the training text used is what was used for the initial training. For Arabic we do not have that file available; we only have the 80-line training_text (similar to 3.04).
No concrete proof :-( There have been issues with page segmentation and word dropping for a while; there are probably a number of issues related to them still open. So, something has definitely changed. If it is not seen in eng, deu and other Latin-script based languages, then it may be related to complex script processing / unichar compression / recoding. Were you able to get the unit tests related to unichar compression to work? Maybe they can help in figuring out the issue.
Please see Ray's comments in #648 (comment) - these are from Jan 2017. He has made changes to the processing for Arabic after that. I will try to find those comments and commits and link them here for reference too.
Okay, did your attempt at fine-tuning work with the given dataset? Your attempt is important since my extracted traineddata file reports beta.3.
400 iterations - Scheherazade font - training text made by concatenating text lines from dataset provided in tesseract-ocr/tesseract#2132 (comment)
PlusMinus training using training text based on dataset in tesseract-ocr/tesseract#2132 (comment) at 400, 4000 and 10000 iterations
@jaddoughman Please see https://github.com/Shreeshrii/tessdata_arabic - I have uploaded there various versions of fine-tuned traineddata using the training text based on your dataset. If you know the font used for the newspaper, or a similar font, fine-tuning with that might give better results.
I ran all your trained models on 5 testing samples, but the accuracy decreased on each one. The issue is still caused by word detection, since a fine-tuned model should never perform worse than the original one. This is unfortunate. If any possible explanation arises concerning the connection between training and segmentation, please let me know. Thank you for your help, @Shreeshrii.
Our fine-tuned model is performing better on a text line level, so training does improve the accuracy there. One possible solution I'm exploring is to segment the newspaper samples into text lines and OCR them using our fine-tuned model. The obstacle is that I would need a segmentation algorithm to automate this process. I created a tool that parses Tesseract's hOCR files into bboxes and generates the corresponding text line images; however, the segmentation was far from perfect. Do you recommend any other way to automatically segment the newspaper samples into text lines, or to extract words? I just need the segmented text lines, which can then be transcribed using our fine-tuned model. SAMPLE NEWSPAPER: Sample1.tif.zip
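For context, the hOCR-based cropping described above boils down to extracting the bbox of every ocr_line element and cutting it out of the page image. A minimal standard-library sketch of the extraction step follows; a regex is used for brevity (a robust tool should use a real HTML parser), and the sample fragment is synthetic:

```python
# Sketch of the hOCR parsing step described above: collect the bounding box of
# every ocr_line so each line can later be cropped from the page image
# (e.g. with PIL's Image.crop). A regex suffices for this illustration;
# production code should parse the HTML properly.
import re

LINE_RE = re.compile(
    r"class=['\"]ocr_line['\"][^>]*title=['\"][^'\"]*bbox (\d+) (\d+) (\d+) (\d+)"
)

def line_bboxes(hocr: str):
    """Return (x0, y0, x1, y1) tuples for each ocr_line in an hOCR document."""
    return [tuple(map(int, m.groups())) for m in LINE_RE.finditer(hocr)]

# Synthetic hOCR fragment for illustration:
sample = "<span class='ocr_line' id='line_1_1' title='bbox 105 66 823 113; baseline 0 -3'>"
print(line_bboxes(sample))   # [(105, 66, 823, 113)]
```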
As an experiment, try to create the hOCR files using different language traineddata and see if the boxing is better. Also try with --oem 0, i.e. base tesseract instead of LSTM tesseract, and with older versions of tesseract (3.05, 3.04). It would be good to know whether segmentation is different in all these cases and whether any are better for your use case. You can also use leptonica directly for segmentation; please look at the sample programs provided with it - I recall one which had good results for Arabic.
Please also see tesseract-ocr/tesstrain#7
I uploaded below the text line images generated by the Arabic, Arabic fine-tuned, and English models (using their respective hOCR files). The English and Arabic text line images differ, probably due to writing orientations (RTL and LTR), but ara and ara_finetuned gave the same results. This is what I predicted, but it doesn't lead me anywhere, since we already knew that fine-tuning doesn't change anything on a text line level; it is the recognition of words that differs. ENGLISH MODEL: Sample1_eng.zip
My reasoning for the experiment was that if another model gives you better segmentation, you can use it for splitting into line images and then use your fine-tuned model to OCR them.
Also see #657
I tried all variations of different language models and OEMs; no major difference was found. I think the most reasonable solution would be using Leptonica. However, isn't Tesseract powered by Leptonica? If so, is it possible for it to generate different results than the hOCR files generated by Tesseract?
I just used arabic_lines. The complete newspaper did not work well with it, so I cropped a section with two columns. Results for that are attached.
Thank you for your help. However, isn't Tesseract using the arabic_lines code to segment the input image? If not, what is the code you are using?
No. Tesseract has its own layout analysis code, which may be using other Leptonica functions.
Will you be fixing the issue of fine-tuning leading to altered word detection in the coming Tesseract 4.1 update? I believe this is a major obstacle, especially in Arabic, since the pre-trained models are performing very badly. Even after you trained using a separate training dataset, the word detection was altered and the accuracy decreased substantially. If you have any immediate fix or can guide me in a direction that fixes this issue, let me know. I have 185,000 images similar to the ones attached, and my trained model is suffering from the bug discussed above. Thank you for your help.
The official traineddata has been trained by Ray Smith at Google; as far as I know, there are no new updates planned. I try to follow the guidelines given by Ray in the tesstutorial or in comments on issues when experimenting with training. Regarding layout analysis, there are other similar open issues; I am not sure if there are any plans to address those for 4.1.0. You can try posting in the tesseract-ocr Google group to see if someone has had better luck with improving the Arabic traineddata.
Do you know which font is used in the images that you want to recognize? Or can you suggest a similar font?
The font family can easily be found, since the images are from a well-known newspaper which uses consistent font families throughout its archive. However, the bigger issue is the altered word detection post-training. You attempted to train on a certain font family, and the results were worse than the pre-trained model. My question is: how is fine-tuning a model decreasing the accuracy? Also, how is fine-tuning altering the detection of the words themselves? The OCR process, as you know, has 4 main steps:
Word detection occurs prior to the classification of the letters themselves, and the layout analysis attached above shows altered and incorrect word detection for the trained model. These issues should be addressed in your updated version of Tesseract. OCRing the 185K archive is part of a research paper, and the months invested in training Tesseract shouldn't go to waste. I have a lot of samples if you wish to experiment.
Google/Ray have not shared the training text used for LSTM training for Arabic, so we only have the 80 lines from the langdata repo. Fine-tuning works best, AFAIK, when the original training text is used with minimal changes; trying a different text leads to worse results, as you have pointed out.
@jaddoughman: As far as I understand, the Cognitive Services Arabic OCR API is part of Microsoft Computer Vision, which is an alternative to Cloud Vision, not to tesseract. These kinds of services are neither free nor open source.
Your assumption is wrong. As Shree pointed out, you should not train too many lines with the same font; it will lead to overfitting.
Environment
Tesseract Version:
tesseract 4.0.0
leptonica-1.77.0
libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE
Platform:
Ubuntu 16.04
Current Behavior:
I wanted to OCR a large dataset of Arabic newspapers with difficult delimiters and spacing. After running your original pre-trained model, I managed to recall about 80% of the required data. I opted to fine-tune your existing ara.traineddata file, using text lines as my training and test datasets, and used the "OCR-d Train" tool on GitHub to generate the necessary .box files.
Throughout the fine-tuning process, the Eval error percentages decreased tremendously, which suggests that the model was trained successfully. I re-evaluated using my own method and confirmed the successful training process.
However, the test dataset was made up of text lines, so both your evaluation and mine were done on a text line level. The issue occurred when I ran the fine-tuned model on a complete newspaper sample (set in the same fonts as the text lines): the accuracy decreased significantly compared to your original pre-trained model. This makes no sense: my fine-tuned model has better accuracy than yours on a text line level, but on a complete newspaper composed of the same fonts, your pre-trained model performs better than my successfully fine-tuned one.
The issue seems to be connected to your segmentation algorithm. This is a major problem, since it means that your training tool only works on a text line level and cannot be applied to any other form of dynamic text extraction. You will find below a sample newspaper, my fine-tuned model, and the learning curve from the training process.
Sample Newspaper:
Sample Newspaper.zip
Fine Tuned Model:
ara_finetuned.traineddata.zip
Learning Curve:
Learning Curve (60k Iterations).pdf